Histograms

Simple Histograms

Analyzing distributions is a crucial step in data exploration and understanding the characteristics of a dataset. With ggplot2, we can easily create plots such as histograms, density plots, boxplots, or violin plots to analyze distributions.

To create a histogram for the “price” variable in the “diamonds” dataset, we use the geom_histogram() function. For a histogram, the x-axis represents intervals (bins), and the y-axis represents the count of observations within these intervals. Therefore, only one variable needs to be provided to the aes() function.

library(ggplot2)
data(diamonds)

ggplot(data = diamonds, aes(x = price)) + geom_histogram()
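
To make the link between intervals and counts concrete, the sketch below inspects the values that ggplot2 computes for the bars. It relies on layer_data(), a ggplot2 helper that returns the computed data of a layer. The object names p_hist and bins_df are illustrative and not part of the tutorial.

```r
library(ggplot2)
data(diamonds)

# Build the default histogram (no binwidth given, so ggplot2 picks a default number of bins)
p_hist <- ggplot(data = diamonds, aes(x = price)) + geom_histogram()

# layer_data() returns one row per bar: its limits (xmin, xmax) and its height (count)
bins_df <- layer_data(p_hist)
head(bins_df[, c("xmin", "xmax", "count")])

# Every diamond falls into exactly one interval, so the counts add up to the number of rows
sum(bins_df$count) == nrow(diamonds)
```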

We can slightly improve this histogram by changing the labels (e.g., the title and axis names) and the colors of the bars.

ggplot(data = diamonds, aes(x = price)) + 
  geom_histogram(color = "black", 
                 fill = "lightblue") +
  labs(title = "Histogram of Price",
       x = "Price",
       y = "Count")
  • The key parameter of a histogram is the interval size. In geom_histogram(), it is set with either the binwidth argument (interval size, in the units of the x variable) or the bins argument (number of intervals). Create histograms with interval sizes of $250, $500, $1,000, and $10,000.
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000)
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 250, color = "black", fill = "lightblue")
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, color = "black", fill = "lightblue")
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000, color = "black", fill = "lightblue")
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 10000, color = "black", fill = "lightblue")
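
As a rough check of how binwidth changes the plot, the following sketch counts the bars produced for each interval size from the exercise. It again uses layer_data(); the names widths and n_bars are chosen here for illustration only.

```r
library(ggplot2)
data(diamonds)

# Interval sizes from the exercise (in dollars)
widths <- c(250, 500, 1000, 10000)

# For each binwidth, build the histogram and count how many bars it contains
n_bars <- sapply(widths, function(w) {
  p <- ggplot(data = diamonds, aes(x = price)) + geom_histogram(binwidth = w)
  nrow(layer_data(p))
})

# Named vector: binwidth -> number of bars
setNames(n_bars, widths)
```

The wider the interval, the fewer bars are needed to cover the price range, and the coarser the picture of the distribution becomes.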

Overlapping Histograms

With geom_point(), the geometry is a point, and a variable can be mapped to the color of the points. The same idea applies to a histogram, except that the geometry is a bar: the fill aesthetic sets the interior color of the bars, and color sets their border. If we map cut to fill, we obtain one histogram per cut quality, each with its own color.

  • Complete the following code (“___”) so that the fill color of the bars is mapped to the cut column (the quality of the cut) of the diamonds dataset.
col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = price, ___)) +
            geom_histogram(binwidth = 1000, 
                           color = "white") +
            scale_fill_manual(values=col_palette)
print(p)
col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = price, 
                                 fill=cut)) +
            geom_histogram(binwidth = 1000, 
                           color = "white") +
            scale_fill_manual(values=col_palette)
print(p)

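This plot shows a limitation of histograms for comparing distributions: by default, the bars of the different groups are stacked on top of each other, so the shape of each individual distribution is hard to read. In this case, another way to analyze the price distributions is to use density plots. Later, we will also see that it is possible to create histogram panels (facets).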
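
One possible variation, given here only as a sketch, is to draw the histograms truly on top of each other rather than stacked. It uses the position = "identity" and alpha arguments of geom_histogram(); the value alpha = 0.4 and the names p_stack and p_overlap are arbitrary choices for this example.

```r
library(ggplot2)
data(diamonds)

# Default behaviour: the bars of the five cuts are stacked on top of each other
p_stack <- ggplot(data = diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 1000, color = "white")

# position = "identity" draws every histogram from the baseline (y = 0);
# alpha makes the bars semi-transparent so that overlapping groups stay visible
p_overlap <- ggplot(data = diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 1000, color = "white",
                 position = "identity", alpha = 0.4)

print(p_overlap)
```

Even with transparency, five overlapping groups remain hard to read, which is why density curves or facets are often a better choice here.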

Density Profiles

The density plot is another way to visualize the distribution of values in a numeric variable. It can be created using the geom_density() function, which computes a kernel density estimate: a smoothed curve that approximates the shape of the histogram.

ggplot(data = diamonds, aes(x = price)) + geom_density()
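
To see how the density curve relates to the histogram, one option is to overlay the two. The sketch below assumes a ggplot2 version that provides after_stat() and the linewidth argument; the binwidth of 500 is an arbitrary choice and p_both is an illustrative name.

```r
library(ggplot2)
data(diamonds)

# aes(y = after_stat(density)) rescales the bars so that their total area is 1,
# which puts the histogram on the same scale as the density curve
p_both <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 500,
                 color = "black", fill = "lightblue") +
  geom_density(linewidth = 1)

print(p_both)
```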
  • geom_density() displays a smoothed estimate of the distribution. The adjust parameter is a multiplier applied to the smoothing bandwidth, so it can significantly influence the result: large values can over-smooth the curve (under-fitting), while small values make it follow the data very closely (over-fitting). Recreate the previous density plot using the following values for adjust: 1, 3/4, 1/2, 1/4, and 1/8.
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = ___)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 3/4)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1/2)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1/4)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1/8)
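
The effect of adjust can also be checked numerically. The sketch below uses base R's density() and bw.nrd0() directly, on the assumption that geom_density() relies on the same estimator with its default bandwidth rule; the names bw_ref, adjust_values and bw_used are illustrative.

```r
data(diamonds, package = "ggplot2")

# Reference bandwidth chosen by R's default rule of thumb (bw.nrd0)
bw_ref <- bw.nrd0(diamonds$price)

# Values of adjust from the exercise
adjust_values <- c(1, 3/4, 1/2, 1/4, 1/8)

# density() reports the bandwidth it actually used: adjust * reference bandwidth
bw_used <- sapply(adjust_values, function(a) density(diamonds$price, adjust = a)$bw)

data.frame(adjust = adjust_values, bandwidth = bw_used)
```

A smaller bandwidth means each observation influences a narrower range of prices, which is why the curve becomes more irregular as adjust decreases.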
  • Complete the code to produce a plot where the densities for the different cuts (cut) are represented with a specific border color for each. Apply your own colors.
color_palette <- c("Ideal" = ___, 
                   "Premium" = ___, 
                   "Very Good" = ___, 
                   "Good" = ___, 
                   "Fair" = ___)

p <- ggplot(data = diamonds, aes(x = price, ___ )) + 
  geom_density(adjust = 1/2, linewidth=0.5) +
  scale_color_manual(values=color_palette)

print(p)
color_palette <- c("Ideal" = "#A22200", 
                   "Premium" = "#0871A4", 
                   "Very Good" = "#00B850", 
                   "Good" = "#226666", 
                   "Fair" = "#FF8900")

p <- ggplot(data = diamonds, aes(x = price, color=cut )) + 
  geom_density(adjust = 1/2, linewidth=0.5) +
  scale_color_manual(values=color_palette)
  
print(p)

Boxplots and Violin Plots

Boxplots and violin plots can also be used to represent the distribution of a variable. Below are some examples.

As always, to create our plot we will first:

  • (i) Pass the data (a data.frame or tibble) to ggplot() (data=diamonds).
  • (ii) Define the variables/columns of the data.frame to analyze and pass them to the aes() function.

For a boxplot or violin plot, the accepted aesthetics are generally the same as for geom_point(). A key difference is that the x variable should be categorical, typically a factor (its levels name the boxes on the x-axis), and the y variable must be numeric (the values observed in each category).

ggplot(data=diamonds, aes(x=cut, y=price)) + geom_boxplot()
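
As an illustration of the roles of x and y, the sketch below checks that cut is a factor and looks at the summary statistics ggplot2 computes for each box. It uses layer_data() to retrieve them; p_box, box_df and medians are names chosen for this example.

```r
library(ggplot2)
data(diamonds)

# cut is stored as a factor, so each of its levels gets its own box
is.factor(diamonds$cut)
levels(diamonds$cut)

p_box <- ggplot(data = diamonds, aes(x = cut, y = price)) + geom_boxplot()

# One row per box: whisker ends (ymin, ymax), quartiles (lower, upper) and median (middle)
box_df <- layer_data(p_box)
box_df[, c("ymin", "lower", "middle", "upper", "ymax")]

# The middle line of each box is the median price of that cut
medians <- tapply(diamonds$price, diamonds$cut, median)
all.equal(as.numeric(box_df$middle), as.numeric(medians))
```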
  • Since geom_boxplot() and geom_violin() share the same aesthetic elements, complete the following code. This code creates a violin plot where the colors of the violins correspond to the cut type (cut).
col_palette <- c("Ideal" = "#A222A0", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=___)) + 
  geom_violin(color=___) +
  scale_fill_manual(values=___)

print(p)
col_palette <- c("Ideal" = "#A222A0", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")
p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut)) + 
  geom_violin(color="black") +
  scale_fill_manual(values=col_palette)

print(p)

Overlaying Graphical Elements

Example with the Violin Plot

The underlying ggplot model allows for relatively easy layering of graphical elements.

Consider the previous plot, p, created with the geom_violin() geometry.

library(ggplot2)
data(diamonds)
col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut))  +
  geom_violin(color="black") +
  scale_fill_manual(values=col_palette)

p

You might want to visualize the points that were used to construct the violins. This is straightforward: just add a geom_point() layer using the addition operator (+). The aesthetics are inherited, so x=cut and y=price are automatically passed to geom_point(). One issue is that, within each category, all the points are drawn on the same vertical line and overlap heavily.

col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut)) +
  geom_violin(color="black") +
  scale_fill_manual(values=col_palette) +
  geom_point()

p

This can be resolved by using geom_jitter() instead of geom_point(), which adds a bit of random noise along the x-axis and the y-axis. Since we only want some “jittering” on the x-axis, we set height=0 (i.e., no jittering on the y-axis).

col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut)) +
  geom_violin(color="black") +
  scale_fill_manual(values=col_palette) +
  geom_jitter(height=0)

p
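
To confirm what height=0 does, the sketch below compares the jittered positions with the original data. The width = 0.2 value is an arbitrary choice for this example (the tutorial keeps the default), and p_jit and jit_df are illustrative names.

```r
library(ggplot2)
data(diamonds)

p_jit <- ggplot(data = diamonds, aes(x = cut, y = price)) +
  # width controls the horizontal spread; height = 0 leaves the prices untouched
  geom_jitter(height = 0, width = 0.2)

jit_df <- layer_data(p_jit)

# The y values are exactly the original prices (no vertical noise)...
all.equal(sort(as.numeric(jit_df$y)), sort(as.numeric(diamonds$price)))

# ...while the x values are spread around the integer position of each category
range(as.numeric(jit_df$x))
```

Keeping the y values intact matters here because the vertical position encodes the price; only the horizontal position, which carries no numeric meaning within a category, is perturbed.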

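This is not ideal here because there are so many points in each class that they hide the violins. Since layers are drawn in the order in which they are added, we can instead reverse the layers so that the violins are drawn on top of the points.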

col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut)) + 
  geom_jitter(height=0) +
  geom_violin(color="black") +
  scale_fill_manual(values=col_palette)

p
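
The drawing order can be checked by listing the layers stored in each plot object. This is only a sketch: layer_geoms() is a hypothetical helper written for this example, and it assumes that the layers of a plot are accessible through plot$layers.

```r
library(ggplot2)
data(diamonds)

# Violins first, points second: the points are drawn on top
p_a <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(color = "black") +
  geom_jitter(height = 0)

# Points first, violins second: the violins are drawn on top
p_b <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_jitter(height = 0) +
  geom_violin(color = "black")

# Hypothetical helper: name of the geom used by each layer, in drawing order
layer_geoms <- function(plot) {
  unname(sapply(plot$layers, function(l) class(l$geom)[1]))
}

layer_geoms(p_a)
layer_geoms(p_b)
```

Note that geom_jitter() appears as a point geom in this listing, since it draws ordinary points whose positions have been jittered.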
  • Complete the previous code so that the points have the same color as the violins.
col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut, ___)) +
    geom_jitter(height=0) +
    geom_violin(color="black") +
    scale_fill_manual(values=col_palette) +
    ___
p
col_palette <- c("Ideal" = "#A22200", 
                 "Premium" = "#0871A4", 
                 "Very Good" = "#00B850", 
                 "Good" = "#226666", 
                 "Fair" = "#FF8900")

p <- ggplot(data=diamonds, aes(x=cut, y=price, fill=cut, color=cut))  +
     geom_jitter(height=0) +
     geom_violin(color="black") +
     scale_fill_manual(values=col_palette) +
     scale_color_manual(values=col_palette)

p

End of the section

Thank you for following this tutorial.

The ggplot2 library (session 2)