Creation
The data.frame() Function
The data.frame object is central to R. It lets you store various kinds of
information about the objects under study (patients, biological samples,
experiments…). In short, a data.frame can be viewed as a matrix in which
each column can have its own type (character, numeric, logical, etc.),
while all values within a given column share the same type.
You can create a data.frame using the data.frame() function. In the example
below, the note column is numeric, the diploma column is a factor, the age
column contains integers, and the name column contains character strings.
set.seed(123)
df_1 <- data.frame(
note = round(runif(n = 6, min = 0, max = 20), 2),
diploma = as.factor(sample(c("B.Sc", "M.Sc"), size = 6, replace = TRUE )),
age = sample(18:25, size=6, replace=TRUE),
name = c("Miescher", "Watson", "Crick", "Chargaff", "Levene", "Kossel")
)
df_1
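To check which type each column actually received, you can inspect the object after creating it. The sketch below uses a small hypothetical data.frame (df_demo and its column names are illustrative, not part of the tutorial data) and sets stringsAsFactors explicitly, since its default differs between R versions.

```r
# Hypothetical example data, not taken from the tutorial
df_demo <- data.frame(
  score = c(12.5, 8, 17.25),
  level = factor(c("B.Sc", "M.Sc", "B.Sc")),
  age   = c(21L, 24L, 19L),
  name  = c("Ann", "Bob", "Cleo"),
  stringsAsFactors = FALSE  # keep character columns as character
)

# Class of each column, returned as a named character vector
col_classes <- sapply(df_demo, class)
print(col_classes)

# Compact overview: dimensions, column types, first values
str(df_demo)
```

The same two calls, sapply(df_1, class) and str(df_1), should work on the df_1 object created above.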
Exercise
- Create a data.frame named my_df containing the vectors and factors below (the columns will be named “col_A”, “col_B”, “col_C”, and “col_D”, respectively).
set.seed(123)
c_a <- round(rnorm(10), 2)
c_b <- 1:10
c_c <- letters[1:10]
c_d <- factor(sort(rep(c("M", "F"), 5)), ordered=TRUE)
set.seed(123)
c_a <- round(rnorm(10), 2)
c_b <- 1:10
c_c <- letters[1:10]
c_d <- factor(sort(rep(c("M", "F"), 5)), ordered=TRUE)
my_df <- data.frame(col_A=c_a, col_B=c_b, col_C=c_c, col_D=c_d)
The as.data.frame() and as.matrix() Functions
You can convert a matrix object into a data.frame object using the
as.data.frame() function. This lets you pass an object in the correct
format to a function that requires a data.frame instead of a matrix.
set.seed(1)
mat <- matrix(data=round(rnorm(20), 2),
ncol = 5)
colnames(mat) <- LETTERS[1:ncol(mat)]
rownames(mat) <- letters[1:nrow(mat)]
df_2 <- as.data.frame(mat)
print(df_2)
Similarly, you can convert a data.frame into a matrix using as.matrix().
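One point worth checking when converting with as.matrix(): a matrix holds a single type, so a data.frame with mixed columns is coerced as a whole. The following is a minimal sketch with two hypothetical data.frames (df_num and df_mix are illustrative names).

```r
# All-numeric data.frame: the result is a numeric matrix
df_num <- data.frame(x = c(1.5, 2.5), y = c(10, 20))
m_num <- as.matrix(df_num)

# Mixed data.frame: every value is coerced to character
df_mix <- data.frame(x = c(1.5, 2.5), label = c("a", "b"),
                     stringsAsFactors = FALSE)
m_mix <- as.matrix(df_mix)

print(is.numeric(m_num))    # TRUE
print(is.character(m_mix))  # TRUE
```

If a numeric matrix is needed, it is therefore safer to select the numeric columns before converting.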
- Since data.frame objects may contain elements other than numeric values, the hist() function does not accept them by default. Convert the following data.frame into a matrix (object mat) and pass this matrix to the hist() function to draw the corresponding histogram.
set.seed(123)
df <- data.frame(A=rnorm(10000, mean=0, sd=1),
B=rnorm(10000, mean=5, sd=2),
C=rnorm(10000, mean=10, sd=1)
)
# The following statement raises an error
# hist(df)
# Error in hist.default(df) : 'x' must be numeric
# So the data.frame must be converted first
set.seed(123)
df <- data.frame(A=rnorm(10000, mean=0, sd=1),
B=rnorm(10000, mean=5, sd=2),
C=rnorm(10000, mean=10, sd=1)
)
mat <- as.matrix(df)
hist(mat, breaks=100)
Datasets
You can also use one of the many datasets provided with R, many of which
are data.frame objects.
- List the available demonstration datasets by calling the data() function without any arguments.
data()
Below, we load the ChickWeight dataset, which records the body weight of
chicks over time under different diets. This dataset is provided by the
datasets package. That package is attached by default in a standard R
session, so the library(datasets) call below is shown only to make the
dependency explicit. The dataset is explicitly loaded with
data(ChickWeight).
# Load/attach the datasets package
library(datasets)
# Load the ChickWeight dataset
data(ChickWeight)
# Here are its first rows
head(ChickWeight)
Row and Column Names
As with matrices, the column names of a data.frame can be obtained with
colnames() (and the row names with rownames()).
colnames(ChickWeight)
head(rownames(ChickWeight))
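The same two functions can also be used on the left-hand side of an assignment to rename columns or rows. A minimal sketch on a hypothetical data.frame (df_names is an illustrative name, so ChickWeight is left untouched):

```r
# Hypothetical example data
df_names <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Default row names are returned as character strings
print(rownames(df_names))

# Rename the second column only
colnames(df_names)[2] <- "letter"

# Replace all row names at once (they must be unique)
rownames(df_names) <- c("r1", "r2", "r3")

print(df_names)
```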
Indexing
A data.frame can be indexed by rows and columns, just like a matrix. As
with a matrix, you can use positional indexing, logical indexing, and
name-based indexing. However, the data.frame also has an additional
indexing operator, the dollar sign $.
Type ChickWeight$ and press the <TAB> key to list the available columns,
then select the Diet column. Do the same for the weight column.
# Place your cursor after 'ChickWeight', type '$' and
# press the tab key (<TAB>)
ChickWeight
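The indexing styles described above can be sketched side by side. The example uses a small hypothetical data.frame (df_idx) whose column names merely mimic those of ChickWeight; the values are made up for illustration.

```r
# Hypothetical example data with made-up values
df_idx <- data.frame(
  weight = c(42, 51, 59, 64),
  Time   = c(0, 2, 4, 6),
  Diet   = factor(c(1, 1, 2, 2))
)

# Positional indexing: row 2, column 1
by_position <- df_idx[2, 1]

# Name-based indexing: a whole column, returned as a vector
by_name <- df_idx[, "weight"]

# Logical indexing: keep the rows where weight exceeds 50
by_logical <- df_idx[df_idx$weight > 50, ]

# Dollar operator: extract one column by name
by_dollar <- df_idx$Diet

print(by_position)
print(by_logical)
```

Note that the logical filter is written before the comma because it selects rows; leaving the column position empty keeps all columns.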
Adding Columns
You can easily add a column to a data.frame, most simply with the $ operator.
ChickWeight$genotype <- sample(c("wild_type", "transgenic"), size=nrow(ChickWeight), replace=TRUE)
head(ChickWeight)
Add the following columns to the ChickWeight table:
- ‘genotype’, containing NA values.
- ‘source’, containing “anonymous”.
- ‘location’, containing random values chosen from ‘box_1’ or ‘box_2’.
ChickWeight
ChickWeight$genotype <- NA
ChickWeight$source <- "anonymous"
ChickWeight$location <- sample(c('box_1', 'box_2'), nrow(ChickWeight), replace = TRUE)
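Besides $, base R offers a few other ways to add a column. The sketch below shows them on a hypothetical data.frame (df_add is an illustrative name); it also shows that a single value assigned with $ is recycled to every row.

```r
# Hypothetical example data
df_add <- data.frame(id = 1:4)

# A single value is recycled to fill the whole column
df_add$source <- "anonymous"

# Double brackets take the column name as a string,
# which is handy when the name is stored in a variable
df_add[["batch"]] <- c("a", "b", "a", "b")

# cbind() returns a new data.frame with the extra column appended
df_add <- cbind(df_add, score = c(0.1, 0.2, 0.3, 0.4))

print(df_add)
```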
‘wide’ vs ‘long’
For many applications, especially plotting with the ggplot2 library (which
is indispensable and is discussed further later), the representation of the
data in a data.frame or a matrix needs to be changed.
Let’s take the following matrix. It is initially in the so-called “wide” format, in which the values are spread across several columns. Nothing in the object itself marks it as wide; the difference will become clear once you see the long format.
mat
The melt() function from the reshape2 library transforms (or “melts”) it
into the long format. The row names now appear in the first column, the
column names in the second column, and the values in the third.
library(reshape2)
m_melted <- melt(mat)
head(m_melted)
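If reshape2 is not installed, a similar long layout can be obtained with base R alone. The following is a sketch of that alternative, not the tutorial's method: it goes through as.table() on a small hypothetical matrix (mat_small), and the responseName value "value" is chosen here only to mirror the column naming described above.

```r
# Hypothetical 2 x 3 matrix in wide format
mat_small <- matrix(1:6, nrow = 2,
                    dimnames = list(c("a", "b"), c("A", "B", "C")))

# One row per cell: row name, column name, value
long_df <- as.data.frame(as.table(mat_small),
                         responseName = "value",
                         stringsAsFactors = FALSE)

print(long_df)
```

Each of the six cells of the matrix becomes one row, so the long table has nrow(mat_small) * ncol(mat_small) rows.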
Applying Functions
As with matrices, you can apply functions to the rows and columns of a
data.frame using the apply() function (see the chapter on matrices).
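As a brief illustration, the sketch below applies mean() to the columns and sum() to the rows of a small hypothetical data.frame (df_app). Because apply() first converts the data.frame to a matrix, the example keeps only numeric columns; with mixed columns, every value would be coerced to character.

```r
# Hypothetical all-numeric example data
df_app <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))

# MARGIN = 2 applies the function to each column
col_means <- apply(df_app, 2, mean)

# MARGIN = 1 applies the function to each row
row_sums <- apply(df_app, 1, sum)

print(col_means)
print(row_sums)
```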
End of the section
Thank you for following this tutorial.