Manipulating data.frames with the dplyr library

Overview of Pipes (|>)

Pipes allow function calls to be chained together, which makes code more readable. A pipeline can be read as a sequence of steps in which the result of each function is passed, implicitly, as the first argument of the next function. The native pipe |> is available in R 4.1.0 and later:

rnorm(100, mean=10) |> 
  mean() |> 
  round(digits=2) |> 
  as.character()
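To make the "first argument" rule concrete, the sketch below compares the piped form with the equivalent nested call. The variable names (x, piped, nested) and the fixed seed are illustrative assumptions, added only so that both forms operate on the same random draws.

```r
set.seed(42)  # fixed seed (an arbitrary choice) so both forms see the same data
x <- rnorm(100, mean = 10)

# Piped form: each result becomes the first argument of the next call
piped <- x |>
  mean() |>
  round(digits = 2) |>
  as.character()

# Equivalent nested form, read from the inside out
nested <- as.character(round(mean(x), digits = 2))

# Both forms give the same character string
identical(piped, nested)
```

The piped version reads top to bottom in the order the operations happen, whereas the nested version has to be read from the innermost call outwards.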

The dplyr library

The dplyr library is part of the tidyverse, a collection of R packages. It is designed for fast, readable, and concise data manipulation, and it provides a small set of functions that make common data transformations intuitive. The main dplyr functions take a data.frame (or a tibble, which is very similar) as input and return a data.frame. Functions such as select(), mutate(), and filter() typically receive the names of the columns to operate on, without quotes.

We will illustrate the use of dplyr with the iris dataset, which provides four flower measurements (sepal and petal length and width) for 3 species of iris.

library(dplyr)
data(iris) 
head(iris)

The select() function

Selects a set of columns. This is an alternative to indexing columns with square brackets.

iris |> 
  select(Sepal.Length, Sepal.Width, Species) |> 
  head()

select() can also keep the columns whose names match a regular expression (here, names that end with an “h”).

iris |> 
  select(matches("h$")) |> 
  head()
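Besides matches(), select() accepts other selection helpers. The following is a minimal sketch using starts_with() and negative selection; the object names petal_cols and no_species are illustrative and not part of the original tutorial.

```r
library(dplyr)

# Keep the columns whose names start with "Petal"
petal_cols <- iris |>
  select(starts_with("Petal"))

# Keep every column except Species
no_species <- iris |>
  select(-Species)

names(petal_cols)
names(no_species)
```

Helpers such as starts_with() test for a literal prefix, so they are often simpler than a regular expression when the pattern is a fixed string.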

The filter() function

Selects the rows that satisfy a condition. This is an alternative to indexing rows with square brackets.

iris |> 
  filter(Sepal.Length > 6 & Sepal.Width > 3.5)
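For comparison, the sketch below places the filter() call next to its base R counterpart using logical row indexing. The object names (with_filter, with_index, with_commas) are illustrative assumptions.

```r
library(dplyr)

# dplyr: column names are used directly, without the iris$ prefix
with_filter <- iris |>
  filter(Sepal.Length > 6 & Sepal.Width > 3.5)

# Base R equivalent: logical indexing on rows
with_index <- iris[iris$Sepal.Length > 6 & iris$Sepal.Width > 3.5, ]

# Several conditions separated by commas are combined with a logical AND
with_commas <- iris |>
  filter(Sepal.Length > 6, Sepal.Width > 3.5)

# Same number of rows in all three results
c(nrow(with_filter), nrow(with_index), nrow(with_commas))
```

Note that the base R version keeps the original row names, while filter() returns rows renumbered from 1.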

The mutate() function

This creates new columns that are functions of existing variables/columns:

iris |> 
  select(matches("al")) |> 
  mutate(Sepal.diff=Sepal.Length - Sepal.Width,
         Petal.diff=Petal.Length - Petal.Width) |>
  head()
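A column created inside mutate() can be reused by later expressions in the same call. The sketch below illustrates this; the column names Sepal.ratio and Sepal.ratio.rounded and the object name ratios are hypothetical, chosen only for this example.

```r
library(dplyr)

ratios <- iris |>
  mutate(Sepal.ratio = Sepal.Length / Sepal.Width,
         # Sepal.ratio was defined on the previous line and is already usable
         Sepal.ratio.rounded = round(Sepal.ratio, 1)) |>
  select(Species, Sepal.ratio, Sepal.ratio.rounded)

head(ratios)
```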

The summarise() function

It can be viewed as an alternative to apply() (with MARGIN=2). It returns a data.frame after applying an operation that reduces all rows of each user-defined column to a single value:

iris |> summarise(mean.Sepal.Length=mean(Sepal.Length), 
                  mean.Sepal.Width=mean(Sepal.Width))
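To make the comparison with apply() explicit, here is a minimal sketch that computes the mean of every numeric column both ways. It assumes dplyr 1.0.0 or later, which introduced across(); the object names col_means and base_means are illustrative.

```r
library(dplyr)

# dplyr: apply mean() to every numeric column in a single summarise() call
col_means <- iris |>
  summarise(across(where(is.numeric), mean))

# Base R counterpart: apply() over columns (MARGIN = 2) of the numeric part
base_means <- apply(iris[, 1:4], MARGIN = 2, FUN = mean)

col_means   # a one-row data.frame
base_means  # a named numeric vector
```

The two results hold the same values; the difference is the container, a one-row data.frame for summarise() and a named vector for apply().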

The group_by() function

All these functions combine naturally with the group_by() function, which groups rows by one or more categorical variables. This is particularly useful for computing a statistic by category.

iris |> 
  group_by(Species) |> 
  summarise(mean.Sepal.Length=mean(Sepal.Length), 
            mean.Sepal.Width=mean(Sepal.Width))
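Several statistics can be computed per group in one summarise() call. The sketch below adds the group size with n() alongside a mean and a standard deviation; the result name by_species and the output column names are illustrative assumptions.

```r
library(dplyr)

by_species <- iris |>
  group_by(Species) |>
  summarise(n = n(),                                  # number of rows in each group
            mean.Petal.Length = mean(Petal.Length),   # per-species mean
            sd.Petal.Length = sd(Petal.Length))       # per-species standard deviation

by_species
```

The result has one row per species, with one column for the grouping variable and one for each summary.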

Note

The dplyr library is a vast subject, and this tutorial covers only a few basics. We encourage users to consult the Data Transformation section of the “R for Data Science” book for more information.

End of the section

Thank you for following this tutorial.