The ggplot2 library (session 3)

Simple Bar Charts

The geom_bar() function in ggplot2 creates bar charts, which are particularly suited to representing categorical data. By default it counts the rows in each category: mapping the categorical variable to x in aes() gives vertical bars, and mapping it to y gives horizontal bars.

In this example, we will create a bar chart showing the frequency of each diamond cut (cut) in the diamonds dataset. geom_bar() counts the number of occurrences of each cut: ‘Fair’, ‘Good’, ‘Very Good’, etc. The position = "dodge" argument places bars side by side, which makes frequencies easier to compare when several bars share the same category; here each cut has a single bar, so it does not change the layout. Note below the use of labs(), which allows control over the title and the axis labels.

## Load the ggplot2 library
library(ggplot2)
## Next, load the diamonds dataset
data(diamonds)

p <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
     geom_bar(position = "dodge") +
     labs(title = "Number of Diamond Cuts",
          x = "Cut",
          y = "Count")

print(p)
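
As a quick check of what geom_bar() computes, the sketch below compares the bar heights with a plain frequency table. It assumes ggplot2's layer_data() helper to pull the computed values out of the plot; the names bar_heights, cut_counts and p_horizontal are introduced here only for illustration.

```r
library(ggplot2)
data(diamonds)

# Same chart as above, without the labels
p <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
     geom_bar()

# layer_data() returns the values computed for the first layer,
# including one 'count' per bar
bar_heights <- layer_data(p)$count

# Base R frequency table of the same variable
cut_counts <- as.vector(table(diamonds$cut))

# TRUE when the bar heights are the category frequencies
print(all(bar_heights == cut_counts))

# Mapping the variable to y instead of x gives horizontal bars
p_horizontal <- ggplot(data = diamonds, aes(y = cut, fill = cut)) +
     geom_bar()
print(p_horizontal)
```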

Modify the following code to associate a unique color with each bar.

p <- ggplot(data = diamonds, aes(x = cut)) +
     geom_bar(position = "dodge") +
     labs(title = "Number of Diamonds by Cut",
          x = "Cut",
          y = "Count")
print(p)

## One possible solution: map fill to cut and give each level a color by name
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
     geom_bar(position = "dodge") +
     labs(title = "Number of Diamonds by Cut",
          x = "Cut",
          y = "Count",
          fill = "Cut") +
     scale_fill_manual(values = col_palette)
print(p)
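
The palette above is a named vector, and its entries are not listed in the same order as the levels of cut. The sketch below is one way to check that scale_fill_manual() matches colors to levels by name rather than by position. It assumes layer_data() to read back the colors actually drawn; bars and used_colors are illustrative names.

```r
library(ggplot2)
data(diamonds)

# Named palette, listed in a different order from the factor levels
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = cut, fill = cut)) +
     geom_bar() +
     scale_fill_manual(values = col_palette)

# Colors actually used, one row per bar, ordered along the x-axis
bars <- layer_data(p)
bars <- bars[order(as.numeric(bars$x)), ]
used_colors <- data.frame(cut = levels(diamonds$cut), fill = bars$fill)
print(used_colors)
```

With an unnamed vector, the colors would instead be assigned in level order (Fair first, Ideal last).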

Example with Two Categorical Variables

In this more complex example, we will create a bar chart with cut on the x-axis and clarity as the fill color, to explore the frequency of the different diamond clarities (clarity) for each type of cut (cut). The y-axis still shows the counts; the idea is that the bars are colored (fill) according to the clarity variable.

We will use the position = "stack" argument to place the counts of the different clarities on top of one another.

p <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "stack") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Count")

print(p)

Note that clarity is an ordinal categorical variable (an ordered factor), so ggplot2 automatically chooses a discrete color gradient to represent it.

head(diamonds$clarity)
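
To confirm how the variable is stored, a minimal sketch using base R's is.ordered() and levels() is given here; it only inspects the data and draws nothing.

```r
library(ggplot2)
data(diamonds)

# TRUE when the factor carries an ordering of its levels
print(is.ordered(diamonds$clarity))

# Levels listed from the first (lowest) to the last (highest)
print(levels(diamonds$clarity))

# cut is stored the same way
print(is.ordered(diamonds$cut))
```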

As in the previous examples, you can change the colors with scale_fill_manual(). At this stage, however, we can introduce a new function, scale_fill_brewer(), which draws its palettes from the RColorBrewer package (the package must be installed). To use it, choose the name of one of the palettes displayed by RColorBrewer::display.brewer.all() and pass it to the function (e.g., scale_fill_brewer(palette='Purples')).

library('RColorBrewer')
display.brewer.all()

There are three types of palettes in RColorBrewer: sequential, diverging, and qualitative.

Sequential palettes are suitable for ordered data that progresses from the lowest to the highest value.
Diverging palettes emphasize critical middle values and extremes at both ends of the data range.
Qualitative palettes are best suited for representing nominal or categorical data.
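
The three families can also be listed from R. The sketch below relies on the brewer.pal.info table and the brewer.pal() function exported by RColorBrewer; palettes_by_type and purples8 are names chosen for this illustration.

```r
library(RColorBrewer)

# brewer.pal.info has one row per palette; 'category' is seq, div or qual
palette_info <- brewer.pal.info
palettes_by_type <- split(rownames(palette_info), palette_info$category)
print(palettes_by_type)

# Hex codes for 8 colors of a sequential palette (clarity has 8 levels)
purples8 <- brewer.pal(n = 8, name = "Purples")
print(purples8)
```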

Test different sequential palettes from RColorBrewer (e.g. Blues, BuGn, BuPu, GnBu, Greens, Greys, Oranges, OrRd, PuBu, PuBuGn, PuRd, Purples, RdPu, Reds, YlGn, YlGnBu, YlOrBr, YlOrRd) for the representation.

p <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "stack", color = "black") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Count") +
     scale_fill_brewer(palette = '___')
print(p)

## For example, with the 'Purples' palette
p <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "stack", color = "black") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Count") +
     scale_fill_brewer(palette = 'Purples')
print(p)

Bar Positioning

In ggplot2, the ‘position’ argument of geom_bar() determines how the bars are positioned and can drastically change how the chart is read:

‘Stack’ stacks the bars on top of one another to represent cumulative values, useful for showing totals together with the contribution of each subcategory.
‘Dodge’ places the bars side by side without overlap, ideal for directly comparing values between subcategories.
‘Dodge2’ is a variant of ‘dodge’ that adds a small gap between the bars within each group; it also handles elements of varying widths, such as box plots.
‘Fill’ stacks the bars and rescales every stack to the same height (from 0 to 1). Useful for visualizing proportions/percentages.
‘Identity’ leaves each bar at its own position, so bars that share an x value are drawn in front of each other. Be cautious, as this is rarely what you want.
‘Jitter’ adds a bit of random noise to the position of each bar. In the context of bar charts, this argument has little value. Again, the bars are placed in front of each other.

In the plot below, change the position successively to ‘stack’, ‘dodge’, and ‘fill’.

p <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "___", color = "black") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Count") +
     scale_fill_brewer(palette = 'Oranges')
print(p)

## position = "stack"
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "stack", color = "black") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Count") +
     scale_fill_brewer(palette = 'Purples')

## position = "dodge"
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "dodge", color = "black") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Count") +
     scale_fill_brewer(palette = 'Purples')

## position = "fill"
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
     geom_bar(position = "fill", color = "black") +
     labs(title = "Number of Clarity Types by Cut",
          x = "Cut",
          y = "Proportion") +
     scale_fill_brewer(palette = 'Purples')
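
The effect of the three positions can also be read from the numbers behind the bars. The sketch below is only an illustration: bar_tops() is a hypothetical helper written for this tutorial, and it assumes layer_data() exposes the upper edge of each bar segment in a ymax column.

```r
library(ggplot2)
data(diamonds)

# Hypothetical helper: highest bar top at each cut, for a given position
bar_tops <- function(position) {
  p <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
       geom_bar(position = position)
  d <- layer_data(p)
  # Dodged bars are shifted around their cut, so round x back to the cut index
  tops <- tapply(d$ymax, round(as.numeric(d$x)), max)
  setNames(as.vector(tops), levels(diamonds$cut))
}

tops_stack <- bar_tops("stack")  # total number of diamonds per cut
tops_dodge <- bar_tops("dodge")  # largest single clarity count per cut
tops_fill  <- bar_tops("fill")   # every stack rescaled to a height of 1

print(tops_stack)
print(tops_dodge)
print(tops_fill)
```

Stacking keeps the totals readable, dodging keeps the individual counts readable, and filling discards the totals in favor of proportions.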

Consolidation Exercises

Iris Dataset

Create a plot like the one shown below that compares the sepal lengths (Sepal.Length) for each flower species (Species) in the iris dataset (data(iris)). The plot, stored in a variable p, should display the distributions as boxes (geom_boxplot()) with jittered points (geom_jitter()) to visualize each individual observation. The axes should be renamed “Species” and “Sepal Lengths.” Each species should be associated with a unique box color (‘fill’).
Control colors using scale_fill_brewer(palette='Dark2').

library(ggplot2)
set.seed(456)
data(iris)
# Compare the distributions with box plots and jittered points
p <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot(alpha = 0.7) +
  geom_jitter(height = 0) +
  scale_fill_brewer(palette='Dark2') +
  labs(title = "Comparison of Sepal Lengths by Species",
       x = "Species",
       y = "Sepal Lengths",
       fill = "Species") 

print(p)
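
Because geom_boxplot() draws outliers as points and geom_jitter() draws every observation, outlying values appear twice in the plot above. The variation below is a sketch of one common way around this, not the expected answer to the exercise: it hides the box plot's own outlier points with outlier.shape = NA and narrows the horizontal jitter. The name p_no_dup and the width of 0.2 are arbitrary choices.

```r
library(ggplot2)
set.seed(456)
data(iris)

p_no_dup <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  # Hide the outlier points drawn by the box plot itself
  geom_boxplot(alpha = 0.7, outlier.shape = NA) +
  # Every observation is still shown once, spread horizontally only
  geom_jitter(height = 0, width = 0.2) +
  scale_fill_brewer(palette = "Dark2") +
  labs(x = "Species", y = "Sepal Lengths", fill = "Species")
print(p_no_dup)
```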

ToothGrowth Dataset

Create a violin plot like the one shown below to compare the distribution of tooth lengths (len) based on the method of vitamin C administration (supp) in the ToothGrowth dataset.
The categories ‘OJ’ and ‘VC’ should be replaced with ‘Orange juice’ and ‘Ascorbic acid’.
Add “rugs” (small lines) to the plot (geom_rug()) to visualize the data distribution along the y-axis.
The axes should be renamed “Supplements” and “Tooth Lengths”. Each violin should have its own unique color (‘fill’).
Control colors with scale_fill_brewer(palette='Accent') and scale_color_brewer(palette='Accent').

data(ToothGrowth)
levels(ToothGrowth$supp) <- c('Orange juice', 'Ascorbic acid')
p <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp, color=supp)) +
  geom_violin(alpha = 0.7, color="black") +
  geom_rug() +
  scale_fill_brewer(palette='Accent') +
  scale_color_brewer(palette='Accent') +
  labs(title = "Comparison of Tooth Length by Supplement",
       x = "Supplements",
       y = "Tooth Lengths",
       fill = "Supplements",
       color = "Supplements") 
print(p)
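
Assigning to levels() renames the categories by position, so it relies on the levels being in the order OJ, VC. The sketch below shows an alternative that states each code next to its label, using base R's factor() on a copy of the data; tg is a name introduced only for this illustration.

```r
data(ToothGrowth)

# Work on a copy so the original dataset keeps its codes
tg <- ToothGrowth

# Current level order; the positional assignment relies on it being OJ, VC
print(levels(tg$supp))

# Rename by stating each code next to its label
tg$supp <- factor(tg$supp,
                  levels = c("OJ", "VC"),
                  labels = c("Orange juice", "Ascorbic acid"))

# Counts per supplement are unchanged by the renaming
print(table(tg$supp))
```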

Palmerpenguins Dataset

For this example, we will use the penguins dataset from the palmerpenguins package. This dataset contains information about penguins from the Palmer Archipelago in Antarctica.

Using the penguins dataset, generate a plot p identical to the diagram below. In this example, we use flipper_length_mm (flipper length) on the x-axis, bill_length_mm (bill length) on the y-axis, and body_mass_g (body mass) for the size of the points, with species (species) for the color of the points. You must also ensure the axis labels for the x and y axes match the given names.

library(palmerpenguins)
data("penguins")
penguins <- na.omit(penguins)

# Create a custom color palette
palette_couleurs <- c("dodgerblue", "darkorange", "forestgreen")

# Create the colored bubble plot
p <- ggplot(penguins, aes(x = flipper_length_mm, y = bill_length_mm, size = body_mass_g, color = species)) +
    geom_point(alpha = 0.7) +
    scale_color_manual(values = palette_couleurs) +
    labs(x = "Flipper Length (mm)", 
         y = "Bill Length (mm)", 
         color = "Species", 
         size = "Body Mass (g)") 
print(p)

End of the section

Thank you for following this tutorial.