
The ggplot2 Library

Limitations of Basic Graphics in R

R provides numerous graphical functions, some of which we have encountered: hist(), plot(), and barplot(). However, these basic graphical functions suffer from several limitations:

  • Their syntax/arguments are not standardized.
  • Modifying certain graphical parameters can be complex and unintuitive.
  • Although it is possible to overlay plots (e.g., points(), abline() on a diagram created with plot()), the number of combinations is limited.
  • Moreover, when performing descriptive statistics, we often want to partition the graphical window (i.e., create facets) based on the levels of a categorical (i.e., qualitative) or ordinal variable. Creating such graphics quickly becomes extremely complicated with the basic tools.

The ggplot2 Library

To rethink the design of graphics in R, Hadley Wickham developed the ggplot2 library, which has quickly become popular and essential in the world of bioinformatics and data analysis. This library is now considered one of R’s true strengths compared to languages like Python or Julia.

One of the unique aspects of the ggplot2 library is that its development is based on a model proposed by Leland Wilkinson in his book The Grammar of Graphics. This book offers insights into the components of graphics. According to Leland Wilkinson:

  • A graphic represents a correspondence between data and aesthetic attributes (color, shape, size…) of geometric objects (points, lines, bars…).
  • The graphic can also include data transformations (e.g., logarithmic scale) and is drawn in a specific coordinate system (e.g., Cartesian or polar).
  • Facets can be used to generate the same graphic for different subsets of the dataset.

It is the combination of these independent components that forms a graphic. In ggplot2, these different components are modeled as layers that can be added to one another using the addition operator. The success of ggplot2 has led many developers to produce extensions, creating an extremely rich suite of graphical functions today.
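As a minimal sketch of this layered approach, the code below builds one plot by adding components one at a time. It uses the diamonds dataset that ships with ggplot2; the particular components chosen here (a log scale on y, faceting by cut, an alpha of 0.2) are illustrative assumptions, not choices made elsewhere in this course.

```r
library(ggplot2)

# Start from the data and the aesthetic mapping (no geometry yet).
p_layers <- ggplot(data = diamonds, aes(x = carat, y = price))

# Each "+" adds one independent component to the plot.
p_layers <- p_layers +
  geom_point(alpha = 0.2) +  # geometric object: points
  scale_y_log10() +          # data transformation: logarithmic y scale
  coord_cartesian() +        # coordinate system (Cartesian is the default)
  facet_wrap(~ cut)          # facets: one panel per level of cut

p_layers
```

Removing or swapping any single line changes only that component, which is what makes the grammar composable.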

You can find a ggplot2 cheatsheet as a complement to this course at this address.

The Dataset

To start, we will use the “diamonds” dataset, which ships with the ggplot2 package. It describes nearly 54,000 diamonds (53,940 rows). The following variables are available:

  • carat, cut, color, clarity, depth, table, price, x, y, z

Let’s load the dataset.

## Load the ggplot2 library
library(ggplot2)
## Next, load the diamonds dataset
data(diamonds)

The first lines of this dataset are the following:

head(diamonds)

The types of the columns are the following.

str(diamonds)

Note that the diamonds object is not a plain data.frame. It is a tibble, a subclass of data.frame that behaves almost identically.

class(diamonds)
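Because a tibble inherits from data.frame, functions written for data frames generally accept it. The short check below is one way to confirm this; the variable names (is_df, is_tbl, n_rows, n_cols) are only illustrative.

```r
library(ggplot2)

# A tibble carries "data.frame" in its class vector,
# so data.frame tools still apply to it.
is_df  <- inherits(diamonds, "data.frame")
is_tbl <- inherits(diamonds, "tbl_df")

# Dimensions of the dataset: rows are diamonds, columns are variables.
n_rows <- nrow(diamonds)
n_cols <- ncol(diamonds)

c(is_df = is_df, is_tbl = is_tbl)
c(rows = n_rows, cols = n_cols)
```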

Scatterplots

There are more than 50 types of plots (geometries) available in ggplot2, and even more if we consider ggplot2 extensions.

We will start with a simple scatterplot. To create our scatterplot, we will use the geom_point() geometry. This requires:

  • (i) Data in data.frame or tibble format.
  • (ii) Associating this data with a ggplot object (ggplot(data=diamonds)).
  • (iii) Defining which columns of the data frame to use and associating them with the aesthetics of the selected geometric object.

For a scatterplot (geom_point()), two aesthetics are mandatory: \(x\) (x-axis) and \(y\) (y-axis). The \(x\) and \(y\) variables are typically numeric. Here, we will associate the carat and price columns, both of which are numeric, with the x and y aesthetics using the aes() function.

Then we apply the geometry using + geom_point(), which adds a layer to the diagram.

# We create the plot by passing 
# the data.frame (diamonds) and the result 
# of the aes() function to ggplot. 
# The aes() function associates column names 
# with aesthetic elements.
p <- ggplot(data=diamonds, aes(x=carat, y=price)) + geom_point()
p

You can modify the aesthetics of the geometric shape (here, points) by passing arguments to the geometry function (here geom_point()).

For the geom_point() geometry, the following aesthetics are accepted:

Aesthetic          Purpose                                               Value
x                  Values for the x-axis                                 Numeric or factor
y                  Values for the y-axis                                 Numeric or factor
alpha              Opacity of points                                     Numeric between 0 and 1
colour or color    Color of points or borders (if points have borders)   A color
fill               Filling, if the point has a fill                      A color
shape              Shapes of points                                      An integer
size               Sizes of points                                       Numeric
stroke             Border sizes                                          Numeric

For example, you can change the color and size of the points as follows:

p <- ggplot(data=diamonds, aes(x=carat, y=price)) + 
       geom_point(size=2, colour="red")
p

Point Shapes (Aesthetic shape)

Point shapes are selected with integer codes from 0 to 25. Shapes 21 to 25 have both a border (colour) and an interior (fill).
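The sketch below draws the 26 standard shape codes (0 to 25) as a small reference chart. The grid layout (seven shapes per row), the helper data frame shapes, and the red fill are illustrative assumptions. The fill is only visible on shapes 21 to 25, which are the ones that have an interior.

```r
library(ggplot2)

# One row per shape code, laid out on a grid of 7 columns.
shapes <- data.frame(
  code = 0:25,
  col  = (0:25) %% 7,
  row  = -((0:25) %/% 7)
)

p_shapes <- ggplot(shapes, aes(x = col, y = row)) +
  # scale_shape_identity() uses the numeric code directly as the shape.
  geom_point(aes(shape = code), size = 5, fill = "red") +
  # Print each code just below its shape.
  geom_text(aes(label = code), nudge_y = -0.35) +
  scale_shape_identity() +
  theme_void()

p_shapes
```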

Other point shapes are possible, for example, by using the ggstar library.

More generally, any image can be used as a point (using ggimage).

Exercise

  • Using point shape 22 (shape=22), create a scatterplot (geom_point()) and color the interior and borders of the points with colors of your choice. Adjust the opacity to better visualize overlapping points.
colors()
p <- ggplot(data=diamonds, aes(x=carat, y=price))
p + geom_point(size=2, shape=22, fill="violet", color="darkviolet", alpha=0.3)

Point Colors (Aesthetics colour and fill)

Colors play an essential role in data visualization, and R provides a wide range of options to manipulate and use colors in your plots and analyses.

R has a convenient function called colors() that displays the names of the colors available in R. This function returns a vector containing the predefined color names. Here’s an example to display some of these colors:

head(colors(), 20)

In addition to color names, it is common to represent colors using the hexadecimal format. In this format, each color is written as six hexadecimal digits: two for the amount of red (R), two for green (G), and two for blue (B). Each channel has an intensity between 00 (minimum) and FF (maximum, 255 in decimal). For example, green is represented by #00FF00, red by #FF0000, black by #000000, and white by #FFFFFF.

In R, you can specify colors in hexadecimal format by simply using the # prefix followed by the appropriate hexadecimal combination. For example:

p <- ggplot(data=diamonds, aes(x=carat, y=price)) + 
      geom_point(size=2, shape=22, fill="#7F00FF", color="#9400D3", alpha=0.3)
p
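To see how a hexadecimal code maps to its red, green, and blue channels, the base R functions col2rgb() and rgb() (from grDevices, attached by default) convert in both directions. This is a small sketch; the variable names are illustrative.

```r
# Decompose a hexadecimal color into its decimal channels (0 to 255).
channels <- col2rgb("#7F00FF")
channels

# Rebuild the hexadecimal code from decimal channel values.
hex_code <- rgb(127, 0, 255, maxColorValue = 255)
hex_code

# Named colors can be decomposed in the same way.
named <- col2rgb("darkviolet")
named
```

Here 7F in hexadecimal is 127 in decimal and FF is 255, so #7F00FF has a red channel of 127, no green, and a blue channel of 255.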

All ‘Aesthetics’ can be linked to variables

The control of aesthetics in ggplot2 provides remarkable flexibility by allowing you to customize plots with fixed values as we have seen (e.g., color, size, opacity) but also with variables. This unique feature enables dynamic and customized graphics based on your data. You can modify the color, size, shape (…) of points based on the values of a column. To do this, you associate additional variables with aesthetics using the aes() function.

For example, we can use a variable like cut to control the color of the points and another variable like depth to adjust their size:

p <- ggplot(data = diamonds, aes(x = carat, 
                                 y = price, 
                                 color = cut, 
                                 size = depth)) +
         geom_point(alpha=0.5)

p
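The distinction between setting an aesthetic to a fixed value and mapping it to a column is easy to get wrong, so a brief side-by-side sketch may help. The object names (p_mapped, p_set, p_pitfall) are illustrative.

```r
library(ggplot2)

# Mapping: colour depends on the cut column, so a legend is drawn.
p_mapped <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = cut), alpha = 0.5)

# Setting: every point is drawn in the same fixed colour, with no legend.
p_set <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(colour = "red", alpha = 0.5)

# Pitfall: inside aes(), "red" is treated as data (a constant category),
# so ggplot2 picks a colour from its default palette instead of red.
p_pitfall <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(colour = "red"), alpha = 0.5)

p_mapped
```

In short, values that come from the data go inside aes(), and constants go outside it.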

If you want to customize the colors, you can add a layer to the plot using the functions scale_color_manual() or scale_fill_manual() to manually define the colors for the color and fill aesthetics.

  • Complete the following code to associate custom colors with the plot \(p\):
col_pal <- c("Ideal" = "#A22200", 
             "Premium" = "#0871A4", 
             "Very Good" = "#00B850", 
             "Good" = "#226666", 
             "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = carat, 
                                 y = price, 
                                 color = cut, 
                                 size = depth)) +
  geom_point(alpha=1, shape=16) +
  scale_color_manual(values = ___)

p

Solution:

col_pal <- c("Ideal" = "#A22200", 
             "Premium" = "#0871A4", 
             "Very Good" = "#00B850", 
             "Good" = "#226666", 
             "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = carat, 
                                 y = price, 
                                 color = cut, 
                                 size = depth)) +
  geom_point(alpha=1, shape=16) +
  scale_color_manual(values = col_pal)

p

ggplot2 offers great flexibility in controlling the aesthetics of your graphs. You can use fixed values to create static graphs, but using variables allows even more dynamic, data-driven customization.

End of the section

Thank you for following this tutorial.

The ggplot2 library (session 1)