The ggplot2 library (session 4)

Facets

Principle

The ggplot2 library offers an extremely powerful tool for dividing a plot into panels (facets) based on the levels of specified categorical variables. Facets allow for data exploration based on a factor or a given group of factors. For the following example, we will load a table containing the results of a fictitious ELISA test, where measurements were taken on two different days for experiments conducted by four different operators.

url <- "https://zenodo.org/record/8210893/files/elisa_artificial.txt"
elisa <- read.table(url, sep="\t", header=TRUE, row.names=1)
head(elisa)

This is an artificial dataset from an ELISA test using 96-well plates. Eight ELISA plates (12 columns / 8 rows) were used, as can be verified here.

table(elisa$rows, elisa$columns)

These eight plates were produced by 4 experimenters on two different days.

table(elisa$user, elisa$day)

The facet_wrap() Function

With the ggplot2 syntax, it becomes very easy to produce histograms corresponding to the intensity of colorations obtained in each well (value) for a given experimenter (user). In the example below, note the use of the facet_wrap() function:

This function creates a one-dimensional arrangement of facets, which can optionally be displayed across multiple rows using the nrow and ncol arguments.
The facets argument passed to facet_wrap() is given here as a formula. In our example, facets = ~ user translates to ‘create graphical panels based on the value of the user variable.’

library(ggplot2)  # needed only if ggplot2 is not already loaded in this session
p <- ggplot(data = elisa, 
            mapping = aes(x=value))
p + geom_histogram() + 
    facet_wrap(facets = ~ user, ncol=2)

For exploratory purposes, we can similarly analyze the distributions of the values obtained based on the operator (user) and the day (day).

p <- ggplot(data = elisa, 
            mapping = aes(x=value))
p + geom_histogram() + 
    facet_wrap(facets = ~ user + day, ncol=2)
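
As a minimal sketch, the formula notation used above can be compared with the vars() notation that facet_wrap() also accepts. The toy data frame below is hypothetical and only mimics the user, day and value columns of elisa; it is not the real dataset.

```r
library(ggplot2)

# Hypothetical toy data standing in for the elisa table (not the real dataset)
set.seed(1)
toy <- data.frame(
  user  = rep(c("u1", "u2"), each = 40),
  day   = rep(rep(c("Monday", "Friday"), each = 20), times = 2),
  value = runif(80)
)

# Formula notation, as used in this tutorial
p_formula <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(facets = ~ user + day, ncol = 2)

# vars() notation, an alternative way to name the faceting variables
p_vars <- ggplot(toy, aes(x = value)) +
  geom_histogram(bins = 10) +
  facet_wrap(facets = vars(user, day), ncol = 2)

# Count the panels produced by each specification
n_panels_formula <- nrow(ggplot_build(p_formula)$layout$layout)
n_panels_vars    <- nrow(ggplot_build(p_vars)$layout$layout)
```

Both plots should contain one panel per user/day combination present in the data, arranged over two columns.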

Display the density distributions of value in facets based on the experimenter and the day.

p <- ggplot(data = elisa, 
            mapping = aes(x=value, fill=user))
p + geom_density(color=NA) + 
    facet_wrap(facets = ~ user + day, ncol=2)

The facet_grid() Function

Since each user performed an ELISA experiment on Monday and Friday, we can choose a two-dimensional faceted representation with facet_grid() (a grid/matrix of facets). Note that the facets argument is set to user ~ day, indicating that user will be displayed in rows and day in columns.

p <- ggplot(data = elisa, 
            mapping = aes(x=value, fill=user))
p + geom_histogram() + 
    facet_grid(facets = user ~ day) +
    scale_fill_manual(values=c("#1B9E77", "#D95F02", "#7570B3", "#E7298A"))
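
The row/column logic of facet_grid() can also be sketched with its rows and cols arguments, which name the two faceting dimensions explicitly. The toy data below is hypothetical (three users, two days) and is only meant to show how the grid is laid out.

```r
library(ggplot2)

# Hypothetical toy data: 3 users x 2 days, 20 values per combination
set.seed(2)
toy <- expand.grid(user = c("u1", "u2", "u3"),
                   day  = c("Friday", "Monday"),
                   rep  = 1:20)
toy$value <- rnorm(nrow(toy))

# user defines the rows of the grid, day defines its columns
p_grid <- ggplot(toy, aes(x = value, fill = user)) +
  geom_histogram(bins = 10) +
  facet_grid(rows = vars(user), cols = vars(day))

# Inspect the computed layout: one ROW index per user, one COL index per day
grid_layout <- ggplot_build(p_grid)$layout$layout
n_rows <- max(grid_layout$ROW)
n_cols <- max(grid_layout$COL)
```

Unlike facet_wrap(), the grid always contains every row/column combination, so an empty panel is drawn when a combination has no data.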

Display horizontal boxplots (using coord_flip()) corresponding to the distributions of the variable value for each user, creating a facet based on the day.

p <- ggplot(data = elisa, 
            mapping = aes(x=user, y=value, fill=user))
p + geom_boxplot() + coord_flip() +
    facet_wrap(facets = ~ day, ncol=2)

Application Example: Heatmap

Based on the numerical data loaded into R, we may want to create a color-coded image (heatmap) of the ELISA plates produced by different users.

The coordinates x (elisa$columns) and y (elisa$rows) of the wells in the plate are available.
The fill aesthetic will be mapped to value.

We can use geom_raster() to represent an ELISA plate and partition the plot based on user and day.

p <- ggplot(data = elisa, 
            mapping = aes(x=columns, y=rows, fill=value))
p + geom_raster() + 
  facet_grid(facets = user ~ day)

For geom_raster(), which represents the continuous numeric variable value, one of the following functions can be used to control the colors:

scale_fill_gradient(): This function is used to specify the fill colors in a gradual manner in a plot, using a single start color and a single end color.
scale_fill_gradient2(): This function is similar to scale_fill_gradient(), but it takes an additional color that serves as the midpoint of the scale, creating a diverging (low / mid / high) gradient.
scale_fill_gradientn(): This function is used to specify a fill gradient with multiple custom colors. You can define the colors you want to use in the color scale based on your data and preferences.
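
To make the difference between these three functions concrete, here is a small sketch on a hypothetical 4 x 3 grid (not an ELISA plate). The color choices are arbitrary and only serve as illustration.

```r
library(ggplot2)

# Hypothetical toy grid with values ranging from -1 to 1
toy <- expand.grid(columns = 1:4, rows = letters[1:3])
toy$value <- seq(-1, 1, length.out = nrow(toy))

base <- ggplot(toy, aes(x = columns, y = rows, fill = value)) +
  geom_raster()

# Two colors: one start color, one end color
p_two <- base + scale_fill_gradient(low = "white", high = "darkblue")

# Three colors: diverging scale centered on a midpoint (here 0)
p_div <- base + scale_fill_gradient2(low = "blue", mid = "white",
                                     high = "red", midpoint = 0)

# Any number of colors, spread evenly over the range of values
p_many <- base + scale_fill_gradientn(colours = c("blue", "cyan",
                                                  "yellow", "red"))

# One fill color is computed for each cell of the grid
fills_two <- ggplot_build(p_two)$data[[1]]$fill
```

A diverging scale such as p_div is mostly useful when the data has a meaningful center (for example zero); for intensities that only increase, a sequential scale is usually easier to read.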

p <- ggplot(data = elisa, 
            mapping = aes(x=columns, y=rows, fill=value))
p + geom_raster() + 
  facet_grid(facets = user ~ day) +
  scale_fill_gradientn(colours = c("#0000BF", "#0000FF", 
                                   "#0080FF", "#00FFFF", 
                                   "#40FFBF", "#80FF80", 
                                   "#BFFF40", "#FFFF00", 
                                   "#FF8000", "#FF0000", 
                                   "#BF0000"))

Ordering Rows/Columns

You may have noticed that the rows are not ideally ordered. We would prefer the order: ‘cont’, ‘a’, ‘b’, ‘c’, …

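In ggplot2, the order of a discrete axis follows the order of the factor levels. To change it, redefine the factor with its levels listed in the desired order (levels argument of factor()). As we saw earlier, the ordered=TRUE argument additionally declares the variable as ordinal.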
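
The effect of the levels argument can be checked in base R before plotting. The short vector below is a hypothetical example of row labels, not the content of elisa$rows.

```r
# Hypothetical row labels, in arbitrary order
row_labels <- c("a", "b", "cont", "c", "a", "cont")

# Without explicit levels, factor() sorts the levels alphabetically
default_factor <- factor(row_labels)
levels(default_factor)

# With explicit levels, the order is the one we choose
custom_factor <- factor(row_labels, levels = c("cont", "a", "b", "c"))
levels(custom_factor)
```

Note that on a discrete y axis the first level is drawn at the bottom. If 'cont' should appear at the top of the plate, one option is to pass the levels in reverse order with rev().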

Modify the following code so that the rows are ordered.

___
p <- ggplot(data = elisa, 
            mapping = aes(x=columns, y=rows, fill=value))
p + geom_raster() + 
  facet_grid(facets = user ~ day) +
  scale_fill_gradientn(colours = c("#0000BF", "#0000FF", 
                                   "#0080FF", "#00FFFF", 
                                   "#40FFBF", "#80FF80", 
                                   "#BFFF40", "#FFFF00", 
                                   "#FF8000", "#FF0000", 
                                   "#BF0000"))

elisa$rows <- factor(x = elisa$rows, ordered = TRUE, levels=c('cont', letters[1:7]))
p <- ggplot(data = elisa, 
            mapping = aes(x=columns, 
                          y=rows, 
                          fill=value))
p <- p + geom_raster() + 
          facet_grid(facets = user ~ day) +
          scale_fill_gradientn(colours = c("#0000BF", "#0000FF", 
                                           "#0080FF", "#00FFFF", 
                                           "#40FFBF", "#80FF80", 
                                           "#BFFF40", "#FFFF00", 
                                           "#FF8000", "#FF0000", 
                                           "#BF0000"))
print(p)

Predefined Graphic Themes

Introduction to Themes

There are many ways to adjust the overall visual appearance of a plot. As a first step, you can apply a predefined theme, which affects various parameters of the plot (fonts, character sizes, axis styles, background color, contrast, etc.). ggplot2 includes around ten built-in themes. The names of these global configuration functions usually start with ‘theme_’.

apropos("^theme_")

 [1] "theme_bw"       "theme_classic"  "theme_dark"     "theme_get"     
 [5] "theme_gray"     "theme_grey"     "theme_light"    "theme_linedraw"
 [9] "theme_minimal"  "theme_replace"  "theme_set"      "theme_test"    
[13] "theme_update"   "theme_void"

For example:

theme_gray(): The signature theme of ggplot2 with a gray background and white grid lines, designed to highlight the data while facilitating comparisons.

theme_bw(): The classic dark-on-light ggplot2 theme, with a white background and thin grey grid lines. It may be better suited for presentations displayed using a projector.

theme_linedraw(): A theme with only black lines of varying widths on a white background, reminiscent of a line drawing. The goal is similar to that of theme_bw().

theme_void(): A totally empty theme.

theme_minimal(): A completely stripped-down theme.

theme_dark(), whose name speaks for itself…

theme_classic(), theme_test(), theme_dark(), theme_light()…

Other predefined themes are available in the ggthemes library.

For example, theme_excel() for those nostalgic for the Microsoft spreadsheet… The help page states: “Theme to replicate the ugly monstrosity that was the old gray-background Excel chart. Please never use this.” :). Note that you can also enjoy the hideous Excel palette (scale_colour_excel()). A must… :)

Or theme_wsj() to imitate a Wall Street Journal chart…

Exercise

Try adding each of the following themes in turn to the plot p: theme_bw(), theme_classic(), theme_dark(), theme_gray(), theme_grey(), theme_light(), theme_minimal(), theme_void(), ggthemes::theme_wsj(), ggthemes::theme_excel(), ggthemes::theme_excel_new(), ggthemes::theme_economist()…

p <- p + theme_bw()
print(p)

p <- p + theme_bw()
print(p)
p <- p + theme_classic()
print(p)
p <- p + theme_dark()
print(p)
p <- p + theme_light()
print(p)
p <- p + theme_minimal()
print(p)
p <- p + ggthemes::theme_wsj()
print(p)
p <- p + ggthemes::theme_excel()
print(p)
p <- p + ggthemes::theme_excel_new()
print(p)
p <- p + ggthemes::theme_economist()
print(p)
#...

Fine-Tuning Graphs

The theme() and element_*() Functions

Beyond applying predefined themes (like theme_minimal() or theme_bw()), you can customize every aspect of a graph according to your needs.

The theme() function offers maximum flexibility for customizing the appearance of your graphs. For example, you can specify the font, font size, and text color for axis titles and legends, or change the background of the graph to fit a dark or light theme.
The element_*() functions (notably element_text(), element_line(), element_rect(), element_blank()…) are used in combination with theme() to control specific elements of the graph.
The element_text() function is used to define the font, font size, and color of a text element.
The element_line() function customizes graph lines, such as line thickness or line type.
The element_rect() function customizes box/rectangle-type elements.
You can use element_blank() to completely remove certain elements of the graph if needed.
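
As a minimal sketch of how theme() and the element_*() functions fit together, the example below applies one function of each kind to a hypothetical toy plot. The data and the styling choices are arbitrary, and the linewidth argument assumes a recent ggplot2 version (3.4.0 or later).

```r
library(ggplot2)

# Hypothetical toy data, unrelated to the ELISA dataset
toy <- data.frame(x = 1:10, y = (1:10)^2)

p_theme <- ggplot(toy, aes(x = x, y = y)) +
  geom_line() +
  ggtitle("Toy example") +
  theme_minimal() +
  theme(
    # Text element: font face and size of the title
    plot.title       = element_text(face = "bold", size = 14),
    # Line element: color and thickness of the axis lines
    axis.line        = element_line(color = "grey30", linewidth = 0.5),
    # Rectangle element: fill of the panel, without border
    panel.background = element_rect(fill = "grey95", color = NA),
    # Blank element: remove the minor grid lines entirely
    panel.grid.minor = element_blank()
  )
```

Each argument of theme() expects the element type that matches what it describes: text for titles and labels, lines for axes, ticks and grids, rectangles for backgrounds and borders.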

Place your cursor between the parentheses and press the Tab key on your keyboard to view all the arguments of theme(). This reveals all the modifiable elements of the graph (and there are many…).

theme()

# rect
# text
# title
# aspect.ratio
# axis.title
# axis.title.x
# axis.title.x.top
# axis.title.x.bottom
# axis.title.y
# axis.title.y.left
# axis.title.y.right
# axis.text
# axis.text.x
# axis.text.x.top
# axis.text.x.bottom
# axis.text.y
# axis.text.y.left
# axis.text.y.right
# axis.ticks
# axis.ticks.x
# axis.ticks.x.top
# axis.ticks.x.bottom
# axis.ticks.y
# axis.ticks.y.left
# axis.ticks.y.right
# axis.ticks.length
# axis.ticks.length.x
# axis.ticks.length.x.top
# axis.ticks.length.x.bottom
# axis.ticks.length.y
# axis.ticks.length.y.left
# axis.ticks.length.y.right
# axis.line
# axis.line.x
# axis.line.x.top
# axis.line.x.bottom
# axis.line.y
# axis.line.y.left
# axis.line.y.right
# legend.background
# legend.margin
# legend.spacing
# legend.spacing.x
# legend.spacing.y
# legend.key
# legend.key.size
# legend.key.height
# legend.key.width
# legend.text
# legend.text.align
# legend.title
# legend.title.align
# legend.position
# legend.direction
# legend.justification
# legend.box
# legend.box.just
# legend.box.margin
# legend.box.background
# legend.box.spacing
# panel.background
# panel.border
# panel.spacing
# panel.spacing.x
# panel.spacing.y
# panel.grid
# panel.grid.major
# panel.grid.minor
# panel.grid.major.x
# panel.grid.major.y
# panel.grid.minor.x
# panel.grid.minor.y
# panel.ontop
# plot.background
# plot.title
# plot.title.position
# plot.subtitle
# plot.caption
# plot.caption.position
# plot.tag
# plot.tag.position
# plot.margin
# strip.background
# strip.background.x
# strip.background.y
# strip.clip
# strip.placement
# strip.text
# strip.text.x
# strip.text.x.bottom
# strip.text.x.top
# strip.text.y
# strip.text.y.left
# strip.text.y.right
# strip.switch.pad.grid
# strip.switch.pad.wrap

Examples

Below, we customize various elements of a graph (with varying levels of aesthetic appeal…). You will notice that it is quite intuitive to determine whether to use element_text(), element_line(), or element_rect() depending on the context. Note that the argument names are largely consistent across these three functions (color, for example), which makes them easy to use. One difference to keep in mind: size sets the size of text, whereas the thickness of lines and borders is set with linewidth.

p <- p + theme_minimal()
p <- p + theme(strip.background = element_rect(color="red", fill="orange"),  
               strip.text       = element_text(color="white", face="bold"),
               axis.text.x      = element_text(color="blue", size=7, angle=45, family = "Helvetica", face="bold"),
               axis.text.y      = element_text(color="darkviolet", size=10, family = "Times", face="bold"),
               axis.ticks.x     = element_line(color="brown", linewidth=1),
               axis.ticks.y     = element_line(color="darkturquoise", linewidth=1),
               plot.background  = element_rect(fill="paleturquoise")
              )
#...

Exercises

In the following plot:

Change the font of the graph title (family="Times").
Adjust the angle of the x-axis text (angle=45).
Modify the background color (fill).
Remove the minor grid lines (using element_blank()).
Add a border line to the boxes containing the legends.

p <- p + ggtitle("Flipper Lengths vs Bill Lengths") +
         theme(plot.title = ___, 
               axis.text.x = ___,
               plot.background = ___, 
               panel.grid.minor = ___,  
               legend.background = element_blank(),
               legend.box.background = ___
           )
print(p)

p <- p + ggtitle("Flipper Lengths vs Bill Lengths") +
         theme(plot.title = element_text(family="Times"), 
               axis.text.x = element_text(angle=45),
               plot.background = element_rect(fill="#EEDDAA"), 
               panel.grid.minor = element_blank(),  
               legend.background = element_blank(),
               # Border thickness of a rectangle is set with linewidth (size is deprecated here)
               legend.box.background = element_rect(color = "black", linewidth=1)
           )
print(p)

Exercises

The Dataset

Here, our dataset contains several pieces of information related to nearly all known transcripts in the human genome (one per row). This data was produced in tsv format using the pygtftk software (v1.6.3) from a GTF file downloaded from Ensembl (genome version GRCh38, release 92).

Since the file is somewhat large, we will download it and place it in your user folder so that it does not need to be downloaded again later.

options(timeout=10000)
dir_path <- file.path(fs::path_home(), ".rtrainer")
dir.create(dir_path, showWarnings = FALSE)
## The URL pointing to the dataset
url <- "https://zenodo.org/record/8211383/files/Homo_sapiens.GRCh38.110.subset_2.tsv.gz"
# Download
file_path <- file.path(dir_path, "Homo_sapiens.GRCh38.110.subset_2.tsv.gz")
if(!file.exists(file_path)) download.file(url=url, destfile = file_path, quiet = TRUE)

We will load the file into R using the read.table() function. At the same time, we will assign the transcript_id column to the row names (row.names=6).

tx_info <- read.table(file=file_path, header=TRUE, sep="\t", row.names=6)
dim(tx_info)

Here is our dataset:

head(tx_info)

Number of Transcripts per Chromosome

Create a diagram with geom_bar() showing the number of different transcripts per chromosome (seqid).
Use + coord_flip() to rotate the diagram.
Order the chromosomes as follows: 1, 2, 3 .. 22, X, Y, MT.

___
p <- ggplot(data=tx_info, ___) +
  ___

tx_info$seqid <- factor(tx_info$seqid, 
                        levels = c(as.character(1:22), "X", "Y", "MT"), 
                        ordered = TRUE)
p <- ggplot(data=tx_info, 
            mapping=aes(x=seqid)) + 
  geom_bar() + coord_flip() +
  theme_bw()
print(p)

Transcript Sizes

In the data.frame tx_info, create a new column tx_genomic_size_log10 containing the variable tx_genomic_size converted to base 10 logarithm (log10()). Using histograms and facets, explore the variable tx_genomic_size_log10 (transcript size including introns in base 10 logarithm).
Use geom_histogram() and facet_grid(gene_biotype~., scales="free_y"). The argument scales="free_y" allows each row of facets to have its own y-axis scale.
Appropriately configure (theme()) the size and orientation of the textual elements.

# Create the log10-transformed column used by the plot
tx_info$tx_genomic_size_log10 <- log10(tx_info$tx_genomic_size)
p <- ggplot(data=tx_info, 
            mapping=aes(x=tx_genomic_size_log10)) + 
     geom_histogram(bins=50) +
     facet_grid(gene_biotype~., scales="free_y") + 
     labs(x="Genomic Size of Transcripts (log10)") +
     theme_minimal() +
     theme(panel.grid.minor = element_blank(), 
           strip.text.y = element_text(angle=0, size=5), 
           axis.text.y = element_text(size=5))
print(p)
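
The effect of the free y scale can be illustrated on a small scale. The sketch below uses hypothetical data with one abundant and one rare class, and inspects the layout computed by ggplot2 to see whether the rows share a y scale. The SCALE_Y column is an internal detail of the layout table, so treat this check as illustrative only.

```r
library(ggplot2)

# Hypothetical toy data: one abundant class and one rare class
set.seed(3)
toy <- data.frame(
  biotype = c(rep("abundant", 500), rep("rare", 20)),
  size    = c(rnorm(500, mean = 4), rnorm(20, mean = 3))
)

base <- ggplot(toy, aes(x = size)) + geom_histogram(bins = 20)

# Shared y scale: the rare class is squashed against the axis
p_fixed <- base + facet_grid(biotype ~ .)

# Free y scale: each row gets its own y axis
p_free <- base + facet_grid(biotype ~ ., scales = "free_y")

# Number of distinct y scales used by each layout
n_scales_fixed <- length(unique(ggplot_build(p_fixed)$layout$layout$SCALE_Y))
n_scales_free  <- length(unique(ggplot_build(p_free)$layout$layout$SCALE_Y))
```

With free scales the shape of each distribution becomes visible, but bar heights can no longer be compared directly between panels.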

Number of Exons

If processed pseudogenes no longer have introns, we should only find one exon…

Transform the nb_exons column into its base 10 logarithm (log10()) and place the result in the nb_exons_log10 column. What can you say about the number of exons (nb_exons_log10) for transcripts based on “gene_biotype”? Use a boxplot or violin plot to present this information.

biotypes <- unique(tx_info$gene_biotype)
palette <- setNames(rainbow(length(biotypes)), biotypes)
tx_info$nb_exons_log10 <- log10(tx_info$nb_exons)
p <- ggplot(data=tx_info, 
            mapping=aes(x=gene_biotype, 
                        y=nb_exons_log10,
                        fill=gene_biotype)) + 
     geom_boxplot() +
     theme_minimal() + 
     labs(y="Number of exons (log10)") +
     coord_flip() +
     scale_fill_manual(values=palette) +
     theme(legend.position = "none")
     
print(p)

Chromosomal Distribution of Gene Types

Create a bar chart (geom_bar) showing the number of transcripts for each gene_biotype class on each chromosome. Use geom_bar() with the position argument set to stack, dodge, or fill. Depending on this argument, do you get the same impression about the distribution of gene biotypes across chromosomes? What are the advantages and disadvantages of each representation? What can you say about the number and types of genes present on the Y chromosome?

# position="fill" rescales each bar to 1, so the axis shows proportions rather than counts
p <- ggplot(data=tx_info, 
            mapping=aes(x=seqid, 
                        fill=gene_biotype)) + 
     geom_bar(position="fill") +
     theme_minimal() + 
     labs(y="Proportion of transcripts", 
          x="Chromosome") +
     coord_flip() +
     theme(legend.position = "bottom")
     
print(p)
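
To see what the three position settings do, it can help to try them on a very small table. The data below is hypothetical (two chromosomes, two biotypes) and only mimics the seqid and gene_biotype columns of tx_info.

```r
library(ggplot2)

# Hypothetical toy data: 6 transcripts on chromosome 1, 3 on chromosome Y
toy <- data.frame(
  seqid        = c(rep("1", 6), rep("Y", 3)),
  gene_biotype = c(rep("protein_coding", 4), rep("lncRNA", 2),
                   "protein_coding", rep("lncRNA", 2))
)

base <- ggplot(toy, aes(x = seqid, fill = gene_biotype))

p_stack <- base + geom_bar(position = "stack")  # counts, stacked on top of each other
p_dodge <- base + geom_bar(position = "dodge")  # counts, side by side
p_fill  <- base + geom_bar(position = "fill")   # proportions, each bar rescaled to 1

# Highest bar top in each representation
top_stack <- max(ggplot_build(p_stack)$data[[1]]$ymax)
top_dodge <- max(ggplot_build(p_dodge)$data[[1]]$ymax)
top_fill  <- max(ggplot_build(p_fill)$data[[1]]$ymax)
```

Stacked bars show totals per chromosome, dodged bars make individual counts easier to compare, and filled bars show composition while hiding the fact that some chromosomes carry far fewer transcripts than others.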

End of the section

Thank you for following this tutorial.