Factors
In R, factors represent categorical variables created from character strings or numerical vectors. They are essential for tasks like:
- Computing statistics (e.g., mean, variance, t-test…) by category
- Creating diagrams (e.g., distributions, boxplots…) by category
- …
For instance, you may want to analyze values according to group membership. Below, we create a sorted vector containing the treatments received by 60 patients.
# ttmt is, for the moment, a simple vector of character strings.
ttmt <- sort(sample(c("Placebo", "hydroxychloroquine", "dexamethasone"), size=60, replace = TRUE))
class(ttmt)
print(head(ttmt))
By default, this vector contains no information about any categories.
The as.factor() function
The as.factor() function converts a vector into a factor (i.e., a categorical variable).
# ttmt is transformed (cast) into a factor
ttmt <- as.factor(ttmt)
class(ttmt)
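To make the conversion more concrete, here is a minimal sketch on a small toy vector (`x` and `f` are hypothetical names, not part of the tutorial data). It illustrates that a factor stores one integer code per element plus a set of level labels.

```r
# Hypothetical toy vector, used only for illustration
x <- c("b", "a", "c", "a")
f <- as.factor(x)

class(f)        # the class is now "factor"
levels(f)       # the distinct categories, in sorted order
as.integer(f)   # integer code of each element (its position in levels(f))
nlevels(f)      # number of categories
```

The integer codes are what functions such as table() and tapply() rely on to group values.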
The levels() function
The levels() function lets you extract the names of the categories (also called levels or modalities).
levels(ttmt)
You can change the category names as shown in the following example. Check, using print(), that the levels have changed.
# We can change the category names. New names are assigned by position,
# in the order returned by levels(ttmt).
levels(ttmt) <- c("dex", "hcq", "cont")
levels(ttmt)
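Because `levels<-` assigns the new names by position, the result depends on the order returned by levels(), which can vary with the sort order of your locale. As a hedged alternative, the sketch below renames each level by matching its current name. The vector `trt` is a hypothetical stand-in for ttmt.

```r
# Hypothetical stand-in for the treatment vector, for illustration only
trt <- factor(c("Placebo", "dexamethasone", "Placebo", "hydroxychloroquine"))

# Rename each level by matching its current name, not its position
levels(trt)[levels(trt) == "Placebo"] <- "cont"
levels(trt)[levels(trt) == "dexamethasone"] <- "dex"
levels(trt)[levels(trt) == "hydroxychloroquine"] <- "hcq"

as.character(trt)
```

This is more verbose than a single positional assignment, but each new name stays attached to the intended category whatever the level order is.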
The table() function
The table() function returns the size of each category. For instance, let's say we have information about the patients' genders.
set.seed(123)
gender <- sample(c("Male", "Female"), size = 60, replace = TRUE)
gender <- as.factor(gender)
head(gender)
The table() function can be used to count the number of men and women.
table(gender)
Now that we have two variables (ttmt and gender), we can also easily cross them and create a contingency table to get the number of men and women exposed to each treatment.
table(ttmt, gender)
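A contingency table can be indexed by level names and summarized further. The sketch below uses hypothetical toy factors (`trt`, `sex`) instead of the tutorial data, and two base R helpers, addmargins() and prop.table(), which are not otherwise covered in this section.

```r
# Hypothetical toy data, for illustration only
trt <- factor(c("a", "a", "b", "b", "b"))
sex <- factor(c("F", "M", "F", "F", "M"))

tab <- table(trt, sex)
tab["b", "F"]          # a single cell, selected by level names
addmargins(tab)        # adds row and column totals
prop.table(tab, 1)     # proportions within each row (i.e., each treatment)
```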
Exercise with the table() function
- Given the age variable (see below), count the number of individuals in each combination of levels of ttmt, gender and age.
NB: The first argument to table() (cf. help(table)) is ‘...’, which means you can pass as many factors as you like as arguments.
set.seed(456)
age <- sample(c("<10", "10-30", "30-50", ">50"), size = 60, replace = TRUE)
table(ttmt, gender, age)
Ordinal variables
An ordinal variable is similar to a categorical variable but has a defined order (e.g., education level: “high school,” “BS,” “MS,” “PhD”; or rating: “like,” “neutral,” “dislike”).
Creating ordinal variables
For example, age could naturally be converted to an ordinal variable using the factor() function (which, unlike as.factor(), accepts the levels and ordered arguments).
age <- factor(age, levels = c("<10", "10-30", "30-50", ">50"), ordered = TRUE)
age
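Once a factor is ordered, its levels can be compared and sorted. The following sketch uses a hypothetical four-element vector `a` with the same levels as age to illustrate this.

```r
# Hypothetical toy vector, for illustration only
a <- factor(c("10-30", "<10", ">50", "30-50"),
            levels = c("<10", "10-30", "30-50", ">50"),
            ordered = TRUE)

is.ordered(a)     # checks that the factor is ordered
a < "30-50"       # element-wise comparison against a level
max(a)            # the highest level present in the data
sort(a)           # sorted by level order, not alphabetically
```

Such comparisons raise a warning and return NA on an unordered factor, which is one practical reason to declare the order explicitly.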
Applying operations based on categories
The tapply() function
Factors enable operations on vectors by category. For example, tapply() applies a function such as mean() to each category. Below, we calculate the mean infection level and then visualize the infection distributions across the three patient treatment groups.
# infection is a vector containing the infection level of each patient
infection <- round(c(rnorm(20, mean = 3, sd = 1),
                     rnorm(20, mean = 10, sd = 2),
                     rnorm(20, mean = 10, sd = 1)),
                   2)
tapply(infection, ttmt, mean)
This mean can also be calculated as a function of several categorical variables.
tapply(infection, list(ttmt, gender), mean)
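Since infection is drawn at random, the values above change from run to run. As a deterministic sketch of what tapply() does, the example below uses hypothetical toy vectors (`values`, `group`) whose per-group means are easy to verify by hand.

```r
# Hypothetical toy data, for illustration only
values <- c(1, 3, 10, 12, 14)
group  <- factor(c("a", "a", "b", "b", "b"))

# One result per level: mean(c(1, 3)) for "a", mean(c(10, 12, 14)) for "b"
group_means <- tapply(values, group, mean)
group_means

# Any function returning a single value can be used, e.g. the group sizes
tapply(values, group, length)
```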
Factors are also extremely useful for graphics. Below, for example, we create a boxplot per treatment group using base R functions. The ~ (tilde) can be read as ‘as a function of’ or ‘according to’.
boxplot(infection ~ ttmt)
Creating factors with the cut() function
The cut() function splits the range of a numeric vector into intervals and encodes each value by the interval it falls into, effectively converting a numeric variable into a categorical one.
infection_class <- cut(infection, breaks = 4)
is(infection_class)
print(levels(infection_class))
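With breaks = 4, cut() chooses four equal-width intervals itself. As a complementary sketch, boundaries and labels can also be given explicitly. The vector `x`, the cut points and the label names below are hypothetical choices made for illustration.

```r
# Hypothetical toy values, for illustration only
x <- c(1.5, 4.0, 7.2, 9.9, 12.3)

# Explicit boundaries; intervals are right-closed by default:
# (0,5], (5,10], (10,15]
cls <- cut(x, breaks = c(0, 5, 10, 15), labels = c("low", "medium", "high"))

as.character(cls)
table(cls)
```

Explicit breaks keep the classes stable when the data change, whereas a number of breaks adapts the intervals to the observed range.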
Now that we’ve created infection level classes, we could create the corresponding contingency table:
cont_table <- table(infection_class, ttmt)
cont_table
We’ll look at the use of this factor object in more detail later.
Exercises
Exercise 1
- Given the object below.
chromosome <- c(11, 2, 7, 7, 8, 10, 2, 20, 22, 1, 3, 10, 10, 11, 20)
- Transform chromosome with the following code.
chromosome <- as.factor(chromosome)
Exercise 2
- Given the codons object below. Complete the code to print the number of “ATG” codons. Store this value in a variable named att_nb.
set.seed(123)
codons <- sample(c("ATT", "ATG", "TTG", "TCG", "GCG", "TGA"), size=1000, replace = TRUE)
codons <- as.factor(codons)
tb_codon <- table(codons)
att_nb <- tb_codon['ATG']
- Propose an instruction to transform the codon categories “ATG” into “START” and “TGA” into “STOP”.
# Look at the order
levels(codons)
# Change the names of the levels
levels(codons) <- c("START", "ATT", "GCG", "TCG", "STOP", "TTG")
End of the section
Thank you for following this tutorial.