Factors

In R, factors represent categorical variables and are created from character or numeric vectors. They are essential for tasks such as:

  • Computing statistics (e.g. mean, variance, t-test…) by category
  • Creating plots (e.g. distributions, boxplots…) by category

For instance, you may want to analyze values according to group membership. Below, we create a sorted vector containing the treatments received by 60 patients.

# ttmt is, for the moment, a simple vector of character strings.
ttmt <- sort(sample(c("Placebo", "hydroxychloroquine", "dexamethasone"), size=60, replace = TRUE))
class(ttmt)
print(head(ttmt))

By default, this vector carries no information about categories: R treats it as a plain vector of character strings.

The as.factor() function

The as.factor() function converts a vector into a factor (i.e. a categorical variable).

# ttmt is transformed (cast) into a factor
ttmt <- as.factor(ttmt)
class(ttmt)
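
To see what the conversion changes internally, here is a minimal sketch on a small hand-written vector (the names x and f and their values are illustrative, not part of the tutorial data). A factor stores one integer code per element, plus the vector of level labels those codes refer to.

```r
# Illustrative vector (hypothetical values, not the tutorial data)
x <- c("low", "high", "low", "medium")
f <- as.factor(x)

# A factor is stored as integer codes...
as.integer(f)
# ...that index into the vector of level labels
levels(f)
# Converting back to character recovers the original strings
as.character(f)
```

This is why functions such as table() or tapply() can group values efficiently: they work on the integer codes and use the labels only for display.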

The levels() function

The levels() function returns the names of the categories (also called levels or modalities) of a factor.

levels(ttmt)

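You can rename the categories by assigning a new vector to levels(), as in the following example. The new names are matched to the existing levels by position. Check, using print() or levels(), that the levels have changed.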

# We can change the category names.
levels(ttmt) <- c("dex", "hcq", "cont")
levels(ttmt)
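
Assigning a whole vector to levels() relies on the current order of the levels. For character data that order comes from sorting, which can depend on the locale (for instance, whether "Placebo" sorts before or after the lowercase names). A more defensive sketch, shown here on a small illustrative factor f (a hypothetical name, not the tutorial's ttmt), renames each level by matching its current label:

```r
# Illustrative factor (hypothetical, not the tutorial data)
f <- factor(c("Placebo", "dexamethasone", "hydroxychloroquine", "Placebo"))

# Rename each level by looking up its current label,
# so the result does not depend on the order of levels(f)
levels(f)[levels(f) == "Placebo"] <- "cont"
levels(f)[levels(f) == "dexamethasone"] <- "dex"
levels(f)[levels(f) == "hydroxychloroquine"] <- "hcq"

as.character(f)
```

It is still worth printing levels() before and after renaming to confirm that each label ended up where you expected.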

The table() function

The table() function returns the size of each category. For instance, let's say we have information about the patients' genders.

set.seed(123)
gender <- sample(c("Male", "Female"), size = 60, replace = TRUE)
gender <- as.factor(gender)
head(gender)

The table() function can be used to count the number of men and women.

table(gender)

Now that we have two variables (ttmt and gender), we can easily cross them and create a contingency table giving the number of men and women exposed to each treatment.

table(ttmt, gender)
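
A contingency table can also be summarized with totals or proportions. The sketch below uses two small hand-made factors (trt and sex are illustrative names and values, not the tutorial data) together with the base R functions addmargins() and prop.table().

```r
# Illustrative factors (hypothetical values, not the tutorial data)
trt <- factor(c("a", "a", "b", "b", "b", "a"))
sex <- factor(c("F", "M", "F", "F", "M", "M"))

tab <- table(trt, sex)
tab
# Add row and column totals
addmargins(tab)
# Proportions within each row (each treatment sums to 1)
prop.table(tab, margin = 1)
```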

Exercise with the table() function

  • Given the age factor (see below), count the number of individuals in the different levels through the factors ttmt, gender and age.

NB: The first argument of table() (see help(table)) is ‘...’, which means you can pass as many factors as you like as arguments.

set.seed(456)
age <- sample(c("<10", "10-30", "30-50", ">50"), size = 60, replace = TRUE)
table(ttmt,  gender, age)

Ordinal variables

An ordinal variable is similar to a categorical variable but has a defined order (e.g., education level: “high school,” “BS,” “MS,” “PhD”; or rating: “like,” “neutral,” “dislike”).

Creating ordinal variables

For example, age can naturally be converted to an ordinal variable using the factor() function, which, unlike as.factor(), accepts levels and ordered arguments.

age <- factor(age, levels = c("<10", "10-30", "30-50", ">50"), ordered = TRUE)
age
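
Once a factor is ordered, comparisons follow the declared level order rather than alphabetical order. A minimal sketch, using a short illustrative vector (age_demo is a hypothetical name, not the tutorial's age object):

```r
# Illustrative ordered factor (hypothetical values)
age_demo <- factor(c("10-30", ">50", "<10", "30-50"),
                   levels = c("<10", "10-30", "30-50", ">50"),
                   ordered = TRUE)

# Comparisons use the order given in `levels`
age_demo < "30-50"
# Ordered factors also support max(), min() and sort()
max(age_demo)
sort(age_demo)
```

With an unordered factor, the same comparison would return NA with a warning, since no order is defined between its levels.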

Applying operations based on categories

The tapply() function

Factors enable operations on vectors by category. For example, tapply() applies a function such as mean() to each category. Below, it computes the mean infection level in each of the three treatment groups.

# infection is a vector containing the infection level of each patient
infection <- round(c(rnorm(20, mean = 3, sd=1), 
                      rnorm(20, mean = 10, sd=2), 
                      rnorm(20, mean = 10, sd=1)), 
                    2)
tapply(infection, ttmt, mean)

This mean can also be calculated as a function of several categorical variables.

tapply(infection, list(ttmt, gender), mean)
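
To make the mechanics of tapply() easier to follow, here is a small deterministic sketch (values and group are illustrative names and data, not the tutorial's). It shows that tapply() amounts to splitting the vector by the factor and applying the function to each piece.

```r
# Illustrative data (hypothetical values)
values <- c(1, 3, 10, 12, 20)
group  <- factor(c("a", "a", "b", "b", "c"))

# tapply() splits `values` by `group`, then applies mean() to each piece
tapply(values, group, mean)

# The same result, written out step by step
sapply(split(values, group), mean)
```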

Factors are also extremely useful for graphics. Below, for example, we create one boxplot per treatment group using base R functions. The ~ (tilde) can be read as ‘as a function of’ or ‘according to’.

boxplot(infection ~ ttmt)

Creating factors with the cut() function

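The cut() function splits the range of a numeric vector into intervals and encodes each value by the interval it falls into, effectively converting a numeric variable into a factor. When breaks is a single number, as below, the range is divided into that many intervals of equal width.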

infection_class <- cut(infection, breaks = 4)
is(infection_class)
print(levels(infection_class))
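
Instead of a number of intervals, cut() also accepts explicit break points and custom labels. A minimal sketch, assuming illustrative values and thresholds (score, score_class and the cut-offs 3 and 6 are hypothetical, not taken from the tutorial):

```r
# Illustrative numeric vector (hypothetical values)
score <- c(0.5, 2.1, 4.9, 7.3, 9.8)

# Intervals are (0,3], (3,6] and (6,10]; each one gets a readable label
score_class <- cut(score,
                   breaks = c(0, 3, 6, 10),
                   labels = c("low", "medium", "high"))
score_class
table(score_class)
```

By default the intervals are closed on the right, so a value exactly equal to a break point belongs to the interval that ends there.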

Now that we’ve created infection level classes, we could create the corresponding contingency table:

cont_table <- table(infection_class, ttmt)
cont_table

We’ll look at the use of this factor object in more detail later.

Exercises

Exercise 1

  • Consider the chromosome object below.
chromosome <- c(11, 2, 7, 7, 8, 10, 2, 20, 22, 1, 3, 10, 10, 11, 20)
  • Transform chromosome with the following code.
chromosome <- as.factor(chromosome)

Exercise 2

  • Given the codons object below, complete the code to print the number of “ATG” codons. Store this value in a variable named att_nb.
set.seed(123)
codons <- sample(c("ATT", "ATG", "TTG", "TCG", "GCG", "TGA"), size=1000, replace = TRUE)
codons <- as.factor(codons)
tb_codon <- table(codons)
att_nb <- tb_codon['ATG']
  • Propose an instruction to transform the codon categories “ATG” into “START” and “TGA” into “STOP”.
set.seed(123)
codons <- sample(c("ATT", "ATG", "TTG", "TCG", "GCG", "TGA"), size=1000, replace = TRUE)
codons <- as.factor(codons)
# Look at the order 
levels(codons)
# Change the names of the levels
levels(codons) <- c("START", "ATT", "GCG", "TCG", "STOP", "TTG")

End of the section

Thank you for following this tutorial.
