Histograms
Simple Histograms
Analyzing distributions is a crucial step in data exploration and understanding the characteristics of a dataset. With ggplot2, we can easily create plots such as histograms, density plots, boxplots, or violin plots to analyze distributions.
To create a histogram for the “price” variable in the “diamonds” dataset, we use the geom_histogram() function. For a histogram, the x-axis represents intervals (bins), and the y-axis represents the count of observations within these intervals. Therefore, only one variable needs to be provided to the aes() function.
library(ggplot2)
data(diamonds)
ggplot(data = diamonds, aes(x = price)) + geom_histogram()
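To see what geom_histogram() computes behind the scenes, the plot can be built and its layer data inspected. The lines below are a minimal sketch using ggplot2's layer_data() helper; the object names p_hist and bins_df are only illustrative, and bins = 30 simply spells out the default.

```r
library(ggplot2)

# Same histogram as above, stored in an object. bins = 30 is the default;
# writing it explicitly avoids the "Pick better value" message.
p_hist <- ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(bins = 30)

# layer_data() returns the data frame computed for the layer:
# one row per bin, with its limits (xmin, xmax) and its count
bins_df <- layer_data(p_hist)
head(bins_df[, c("xmin", "xmax", "count")])

# Each diamond is counted in exactly one bin
sum(bins_df$count) == nrow(diamonds)
```

Only price is supplied by the user; the counts on the y-axis are derived from it.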
We can slightly improve this histogram by changing the labels (e.g., title and axis names) and the colors of the bars.
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(color = "black",
                 fill = "lightblue") +
  labs(title = "Histogram of Price",
       x = "Price",
       y = "Count")
- The key parameter of a histogram is the interval size. In the geom_histogram() function, it is controlled by either the binwidth argument (interval size) or the bins argument (number of intervals). Create histograms with interval sizes of $250, $500, $1000, and $10000.
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000)
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 250, color = "black", fill = "lightblue")
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500, color = "black", fill = "lightblue")
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 1000, color = "black", fill = "lightblue")
ggplot(data = diamonds, aes(x = price)) +
  geom_histogram(binwidth = 10000, color = "black", fill = "lightblue")
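The effect of binwidth can also be checked numerically. The sketch below counts the bars produced for each interval size; count_bins() is a hypothetical helper written for this illustration, not a ggplot2 function.

```r
library(ggplot2)

# Hypothetical helper: number of bins computed for a given bin width
count_bins <- function(width) {
  p <- ggplot(diamonds, aes(x = price)) +
    geom_histogram(binwidth = width)
  nrow(layer_data(p))
}

widths <- c(250, 500, 1000, 10000)
n_bins <- sapply(widths, count_bins)
data.frame(binwidth = widths, n_bins = n_bins)

# Alternative: fix the number of bins and let ggplot2 derive the width
p_bins <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(bins = 20, color = "black", fill = "lightblue")
```

Narrow bins show more detail but also more noise, while a $10000 width collapses the whole price range into a handful of bars.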
Overlapping Histograms
With geom_point(), the geometry is a point, and a variable can be mapped to the color of the points. The same concept applies to a histogram, except that the geometry is a bar: if we map cut to the fill color, we obtain histograms with different colors.
- Complete the following code (“___”) so that the fill color of the bars (fill) is associated with the cut column (the quality of the cut) of the diamonds dataset.
# Exercise: replace ___ with the missing aesthetic mapping
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = price, ___)) +
  geom_histogram(binwidth = 1000,
                 color = "white") +
  scale_fill_manual(values = col_palette)
print(p)
# Solution: map the cut column to the fill aesthetic
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = price,
                                 fill = cut)) +
  geom_histogram(binwidth = 1000,
                 color = "white") +
  scale_fill_manual(values = col_palette)
print(p)
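Here we see the limitation of the histogram for comparing distributions: by default, geom_histogram() stacks the bars of the different groups, so only the lowest group starts at zero and the individual shapes are hard to read. In this case, another solution to analyze price distributions is to use density analysis. Later, we will also see that it is possible to create histogram panels (facets).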
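geom_histogram() stacks the groups on top of each other by default (position = "stack"). One possible way to get bars that truly overlap is to draw every group from the baseline and make the bars semi-transparent. This is only a sketch: the alpha value of 0.4 and the object name p_overlap are arbitrary choices for the illustration.

```r
library(ggplot2)

p_overlap <- ggplot(data = diamonds, aes(x = price, fill = cut)) +
  geom_histogram(binwidth = 1000,
                 position = "identity",  # every group starts at y = 0
                 alpha = 0.4,            # transparency so hidden bars show through
                 color = "white")
print(p_overlap)
```

With five groups the result is still crowded, which is why densities or facets are often easier to read.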
Density Profiles
The density plot is another way to visualize the distribution of values in a numeric variable. It can be created using the geom_density() function, which displays a smoothed distribution model (a kernel density estimate) that aims to closely fit the histogram data.
ggplot(data = diamonds, aes(x = price)) + geom_density()
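To compare the density curve with the histogram it summarizes, both can be drawn on the same scale. The sketch below rescales the histogram with after_stat(density) so that the total area of its bars is 1, then overlays the curve; the bin width of 500, the red color, and the name p_both are illustrative choices.

```r
library(ggplot2)

p_both <- ggplot(data = diamonds, aes(x = price)) +
  # Histogram on the density scale instead of raw counts
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 500, color = "black", fill = "lightblue") +
  # Kernel density estimate drawn on top
  geom_density(color = "red", linewidth = 1)
print(p_both)
```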
- The geom_density() function displays a distribution model. The adjust parameter controls how closely the curve follows the data, and it can significantly influence the representation. Depending on the value of adjust, the model may tend to be over- or under-fitted: values below 1 give a more detailed, wigglier curve, and values above 1 a smoother one. Recreate the previous density plot using the following values for adjust: 1, 3/4, 1/2, 1/4, and 1/8.
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = ___)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 3/4)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1/2)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1/4)
ggplot(data = diamonds, aes(x = price)) + geom_density(adjust = 1/8)
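The adjust argument acts as a multiplier on the smoothing bandwidth. This can be illustrated with base R's density() function, used here as a stand-in on the assumption that it behaves like the estimate drawn by geom_density(); the variable names are illustrative.

```r
library(ggplot2)  # only needed for the diamonds dataset

# Bandwidth chosen by the default rule, then with adjust = 1/2
bw_default <- density(diamonds$price)$bw
bw_half    <- density(diamonds$price, adjust = 1/2)$bw

c(default = bw_default, half = bw_half)

# adjust = 1/2 halves the bandwidth, so the curve follows the data more closely
all.equal(bw_half, bw_default / 2)
```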
- Complete the code to produce a plot where the densities for the different cuts (cut) are represented with a specific border color for each. Apply your own colors.
# Exercise: replace each ___ with a color or the missing aesthetic mapping
color_palette <- c("Ideal" = ___,
                   "Premium" = ___,
                   "Very Good" = ___,
                   "Good" = ___,
                   "Fair" = ___)
p <- ggplot(data = diamonds, aes(x = price, ___ )) +
  geom_density(adjust = 1/2, linewidth = 0.5) +
  scale_color_manual(values = color_palette)
print(p)
# Solution: map the cut column to the color (border) aesthetic
color_palette <- c("Ideal" = "#A22200",
                   "Premium" = "#0871A4",
                   "Very Good" = "#00B850",
                   "Good" = "#226666",
                   "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = price, color = cut)) +
  geom_density(adjust = 1/2, linewidth = 0.5) +
  scale_color_manual(values = color_palette)
print(p)
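Each density curve has an area of 1, so the plot compares the shapes of the distributions but not the number of diamonds in each cut. If group sizes matter, one option is to map y to the computed variable count (the density multiplied by the number of points in the group). A minimal sketch, with p_count as an illustrative name:

```r
library(ggplot2)

p_count <- ggplot(data = diamonds, aes(x = price, color = cut)) +
  # after_stat(count) scales each curve by the size of its group
  geom_density(aes(y = after_stat(count)),
               adjust = 1/2, linewidth = 0.5)
print(p_count)
```

Cuts with many diamonds now produce taller curves than cuts with few.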
Boxplots and Violin Plots
Boxplots and violin plots can also be used to represent the distributions associated with a dataset. Below are some examples.
As always, to create our plot we will first:
- (i) Associate data (data.frame or tibble) with an object (data=diamonds).
- (ii) Define the variables/columns of the data.frame to analyze and pass them to the aes() function.
For a boxplot or violin plot, the accepted aesthetics are generally the same as for geom_point(). However, a key difference is that the x variable should be a factor (its levels give the names of the boxes on the x-axis) and the y variable must be numeric (the values observed in each category).
ggplot(data=diamonds, aes(x=cut, y=price)) + geom_boxplot()
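The statistics behind each box can be read from the built plot. The sketch below extracts the lower hinge, middle line, and upper hinge for every cut and compares the middle line with the group medians computed directly; p_box and box_stats are illustrative names.

```r
library(ggplot2)

p_box <- ggplot(data = diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# One row per box, in the order of the factor levels of cut
box_stats <- layer_data(p_box)[, c("lower", "middle", "upper")]
rownames(box_stats) <- levels(diamonds$cut)
box_stats

# The middle line of each box is the median price within that cut
tapply(diamonds$price, diamonds$cut, median)
```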
- Since geom_boxplot() and geom_violin() share the same aesthetic elements, complete the following code. This code creates a violin plot where the colors of the violins correspond to the cut type (cut).
# Exercise: replace each ___ with the missing value
col_palette <- c("Ideal" = "#A222A0",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = ___)) +
  geom_violin(color = ___) +
  scale_fill_manual(values = ___)
print(p)
# Solution: fill by cut, black borders, custom palette
col_palette <- c("Ideal" = "#A222A0",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette)
print(p)
Overlaying Graphical Elements
Example with the Violin Plot
The underlying ggplot model allows for relatively easy layering of graphical elements.
Consider the previous plot, p, created with the geom_violin() geometry.
library(ggplot2)
data(diamonds)
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette)
p
You might want to visualize the points that were used to construct the violin model. Nothing could be easier: just add a geom_point() layer using the addition operator (+). The aesthetics are inherited, so x=cut and y=price are naturally passed to geom_point(). One issue is that the points are aligned vertically…
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette) +
  geom_point()
p
This can be resolved by using geom_jitter() (instead of geom_point()), which adds a bit of random noise along the x-axis and y-axis. As we only want some “jittering” on the x-axis, we will indicate height=0 (i.e., no jittering on the y-axis).
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette) +
  geom_jitter(height = 0)
p
This is not ideal here because we have many points in each class, and they hide the violins. Instead, we can reverse the layers so that the points are drawn first and the violins on top.
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_jitter(height = 0) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette)
p
- Complete the previous code so that the points have the same color as the violins.
# Exercise: replace each ___ with the missing aesthetic and scale
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut, ___)) +
  geom_jitter(height = 0) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette) +
  ___
p
# Solution: map cut to color as well and reuse the palette for the color scale
col_palette <- c("Ideal" = "#A22200",
                 "Premium" = "#0871A4",
                 "Very Good" = "#00B850",
                 "Good" = "#226666",
                 "Fair" = "#FF8900")
p <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut, color = cut)) +
  geom_jitter(height = 0) +
  geom_violin(color = "black") +
  scale_fill_manual(values = col_palette) +
  scale_color_manual(values = col_palette)
p
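With more than fifty thousand diamonds, another way to keep the plot readable is to give the point layer its own, smaller dataset. The sketch below is one possible approach: the violins still use all rows, while geom_jitter() receives a random subsample through its data argument. The sample size of 2000, the seed, the width of 0.2, and the names diamonds_sample and p_sample are arbitrary choices for the illustration.

```r
library(ggplot2)

set.seed(42)  # make the random subsample reproducible
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

p_sample <- ggplot(data = diamonds, aes(x = cut, y = price, fill = cut)) +
  geom_violin(color = "black") +        # violins computed on all diamonds
  geom_jitter(data = diamonds_sample,   # points drawn from the subsample only
              height = 0, width = 0.2,  # horizontal jitter only
              size = 0.5, alpha = 0.3)  # small, semi-transparent points
print(p_sample)
```

Because a layer-specific dataset replaces only the data, the x, y, and fill mappings are still inherited from ggplot().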
End of the section
Thank you for following this tutorial.