Overview of Pipes (|>)
Pipes allow function calls to be chained together, making the code more readable. A pipeline can be read as a sequence of steps in which the result of the preceding function is passed implicitly as the first argument of the following function:
rnorm(100, mean=10) |>
mean() |>
round(digits=2) |>
as.character()
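To make the "first argument" rule concrete, the sketch below writes the same computation once as nested calls and once as a pipeline, then checks that both give the same result. The variable names nested and piped are only illustrative, and the seed is fixed so that both forms see the same random draws. The native pipe requires R 4.1 or later.

```r
# Nested form: read from the inside out.
set.seed(42)
nested <- as.character(round(mean(rnorm(100, mean = 10)), digits = 2))

# Piped form: read from top to bottom, one step per line.
set.seed(42)
piped <- rnorm(100, mean = 10) |>
  mean() |>
  round(digits = 2) |>
  as.character()

# Both forms perform exactly the same function calls.
identical(nested, piped)
```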
The dplyr library
The dplyr library is part of the tidyverse collection of packages. It is designed for fast, readable, and concise data manipulation, and it provides a small set of functions that make common data transformations intuitive. The main dplyr functions take a data.frame (or a tibble, which is very similar) as input and return a data.frame. Functions such as select(), mutate(), and filter() typically receive the names of the columns to operate on, written without quotes.
We will illustrate the use of dplyr with the iris dataset, which provides flower measurements for 3 species of iris.
library(dplyr)
data(iris)
head(iris)
The select() function
Selects a set of columns. This is an alternative to indexing columns with square brackets.
iris |>
select(Sepal.Length, Sepal.Width, Species) |>
head()
select() can also pick the columns whose names match a regular expression (here, names that end with an “h”).
iris |>
select(matches("h$")) |>
head()
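For comparison, here is a minimal sketch of the same two selections written with base R indexing. The object names (by_name_dplyr, by_name_base, and so on) are illustrative only. The use of grepl() with ignore.case = TRUE mirrors the default behaviour of matches(), which ignores case.

```r
library(dplyr)
data(iris)

# Selection by column name: dplyr versus base R indexing.
by_name_dplyr <- iris |> select(Sepal.Length, Sepal.Width, Species)
by_name_base <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

# Selection by regular expression on the column names.
by_regex_dplyr <- iris |> select(matches("h$"))
by_regex_base <- iris[, grepl("h$", colnames(iris), ignore.case = TRUE)]

# Compare the column names obtained with each approach.
identical(names(by_name_dplyr), names(by_name_base))
identical(names(by_regex_dplyr), names(by_regex_base))
```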
The filter() function
Selects a set of rows. This is an alternative to indexing rows with square brackets.
iris |>
filter(Sepal.Length > 6 & Sepal.Width > 3.5)
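As a rough comparison, the sketch below applies the same row filter with base R indexing, and also with the comma-separated form of filter(), in which several conditions are combined with a logical AND. The object names are illustrative only.

```r
library(dplyr)
data(iris)

# dplyr: columns are referred to directly by name.
kept_dplyr <- iris |> filter(Sepal.Length > 6 & Sepal.Width > 3.5)

# Conditions separated by commas are combined with a logical AND.
kept_comma <- iris |> filter(Sepal.Length > 6, Sepal.Width > 3.5)

# Base R: the data.frame name must be repeated for each column.
kept_base <- iris[iris$Sepal.Length > 6 & iris$Sepal.Width > 3.5, ]

# The three approaches keep the same number of rows.
c(nrow(kept_dplyr), nrow(kept_comma), nrow(kept_base))
```

One practical difference is that filter() drops rows for which the condition evaluates to NA, whereas base R indexing returns a row of NA values for them. The iris dataset has no missing values, so the difference does not show up here.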
The mutate() function
This function creates new columns that are functions of existing columns:
iris |>
select(matches("al")) |>
mutate(Sepal.diff=Sepal.Length - Sepal.Width,
Petal.diff=Petal.Length - Petal.Width) |>
head()
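A column created inside mutate() can be reused by later expressions in the same call. The sketch below illustrates this; the column name Sepal.diff.sq and the object name res are hypothetical and chosen only for this example.

```r
library(dplyr)
data(iris)

res <- iris |>
  select(matches("al")) |>
  mutate(Sepal.diff = Sepal.Length - Sepal.Width,
         # Sepal.diff was defined on the previous line and is already usable.
         Sepal.diff.sq = Sepal.diff^2) |>
  head()

res
```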
The summarise() function
It can be viewed as an alternative to apply() with MARGIN=2. It applies an operation over all rows of user-defined columns and returns the results as a data.frame:
iris |> summarise(mean.Sepal.Length=mean(Sepal.Length),
mean.Sepal.Width=mean(Sepal.Width))
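To make the parallel with apply() explicit, this sketch computes the same two means both ways. The object names with_summarise and with_apply are illustrative. Note that apply() returns a named numeric vector rather than a data.frame.

```r
library(dplyr)
data(iris)

# dplyr: one output column per summary, returned as a data.frame.
with_summarise <- iris |>
  summarise(mean.Sepal.Length = mean(Sepal.Length),
            mean.Sepal.Width = mean(Sepal.Width))

# Base R: MARGIN = 2 applies the function to each column.
with_apply <- apply(iris[, c("Sepal.Length", "Sepal.Width")],
                    MARGIN = 2, FUN = mean)

with_summarise
with_apply
```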
The group_by() function
All these functions combine naturally with the group_by() function, which groups rows by a set of categories. This is particularly useful for computing a statistic per category.
iris |>
group_by(Species) |>
summarise(mean.Sepal.Length=mean(Sepal.Length),
mean.Sepal.Width=mean(Sepal.Width))
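As a quick sanity check on grouped summaries, the sketch below adds a per-group row count with n() and compares the grouped mean with the base R tapply() equivalent. The object names by_species and base_means are illustrative only.

```r
library(dplyr)
data(iris)

# One output row per species, with the group size and a grouped mean.
by_species <- iris |>
  group_by(Species) |>
  summarise(n = n(),
            mean.Sepal.Length = mean(Sepal.Length))

# Base R equivalent of the grouped mean, returned as a named array.
base_means <- tapply(iris$Sepal.Length, iris$Species, mean)

by_species
all.equal(by_species$mean.Sepal.Length, as.vector(base_means))
```

The result of summarise() on grouped data is a tibble, which can be converted to a plain data.frame with as.data.frame() if needed.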
Note
The dplyr library is a vast subject, and this tutorial covers only a few basics. We encourage users to consult the Data Transformation section of the “R for Data Science” book for more information.
End of the section
Thank you for following this tutorial.