Creation

The data.frame() Function

The data.frame object is central in R. It stores various kinds of information about the objects under study (patients, biological samples, experiments…). In short, a data.frame can be viewed as a matrix in which each column has its own single mode (character, numeric, logical, etc.), and that mode may differ from one column to the next.

You can create a data.frame using the data.frame() function. In the example below, the note column is numeric, the diploma column is a factor, the age column is numeric (stored as integers), and the name column contains character strings.

set.seed(123)
df_1 <- data.frame(
  note = round(runif(n = 6, min = 0, max = 20), 2),
  diploma = as.factor(sample(c("B.Sc", "M.Sc"), size = 6, replace = TRUE )),
  age = sample(18:25, size=6, replace=TRUE),
  name = c("Miescher", "Watson", "Crick", "Chargaff", "Levene", "Kossel")
)
df_1
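
To check the type of each column, one option is to combine str() with sapply() and class(). The sketch below rebuilds the same table under the illustrative name df_check (not part of the original example) and sets stringsAsFactors = FALSE explicitly, on the assumption that you want the name column to stay a character vector whatever the R version.

```r
set.seed(123)
# Same columns as df_1; df_check is a name used only for this illustration
df_check <- data.frame(
  note = round(runif(n = 6, min = 0, max = 20), 2),
  diploma = as.factor(sample(c("B.Sc", "M.Sc"), size = 6, replace = TRUE)),
  age = sample(18:25, size = 6, replace = TRUE),
  name = c("Miescher", "Watson", "Crick", "Chargaff", "Levene", "Kossel"),
  stringsAsFactors = FALSE
)

# Compact overview: one line per column, with its type
str(df_check)

# Class of each column, returned as a named character vector
col_classes <- sapply(df_check, class)
col_classes
```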

Exercise

  • Create a data.frame named my_df containing the vectors and factors below (the columns will be named “col_A”, “col_B”, “col_C”, and “col_D”, respectively).
set.seed(123)
c_a <- round(rnorm(10), 2)
c_b <- 1:10
c_c <- letters[1:10]
c_d <- factor(sort(rep(c("M", "F"), 5)), ordered=TRUE)
# Solution: assemble the vectors and the factor into a data.frame
my_df <- data.frame(col_A=c_a, col_B=c_b, col_C=c_c, col_D=c_d)

The as.data.frame() and as.matrix() Functions

You can convert a matrix object into a data.frame object using the as.data.frame() function. This allows you to present an object in the correct format for a function that requires a data.frame instead of a matrix.

set.seed(1)
mat <- matrix(data=round(rnorm(20), 2), 
            ncol = 5)
colnames(mat) <- LETTERS[1:ncol(mat)]
rownames(mat) <- letters[1:nrow(mat)]
df_2 <- as.data.frame(mat)
print(df_2)

Similarly, you can convert a data.frame into a matrix using as.matrix().
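
One point to keep in mind is that a matrix holds a single mode, so as.matrix() has to coerce mixed columns to a common type. The sketch below uses two small made-up data.frames (df_num and df_mix are illustrative names) to show the difference.

```r
# Numeric-only data.frame: the resulting matrix stays numeric
df_num <- data.frame(x = c(1.5, 2.5), y = c(3, 4))
m_num <- as.matrix(df_num)
is.numeric(m_num)

# Mixed data.frame: every value is coerced to character
df_mix <- data.frame(x = c(1.5, 2.5), label = c("a", "b"),
                     stringsAsFactors = FALSE)
m_mix <- as.matrix(df_mix)
is.character(m_mix)
```

This is why a conversion to matrix is mostly useful when all columns are numeric.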

  • Since data.frame objects may contain non-numeric columns, the hist() function does not accept them by default. Convert the following data.frame into a matrix (object mat) and pass this matrix to the hist() function to draw the corresponding histogram.
set.seed(123)
df <- data.frame(A=rnorm(10000, mean=0, sd=1),
                 B=rnorm(10000, mean=5, sd=2),
                 C=rnorm(10000, mean=10, sd=1)
                 )

# The following statement raises an error:
# hist(df)
# Error in hist.default(df) : 'x' must be numeric

# The data.frame must therefore be converted to a matrix first

mat <- as.matrix(df)
hist(mat, breaks=100)

Datasets

You can also use one of the many datasets provided by the R software, many of which are data.frame objects.

  • List the available demonstration datasets using the data() function without any arguments.
data()

Below, we load the ChickWeight dataset, which records the body weight of chicks over time (columns weight, Time, Chick and Diet) in an experiment comparing the effect of different diets on early growth. This dataset is provided by the datasets package. That package is normally attached when R starts, so library(datasets) is usually redundant, but it makes the dependency explicit. The dataset is then loaded explicitly with data(ChickWeight).

# Load/attach the datasets package
library(datasets)
# Load the ChickWeight dataset
data(ChickWeight)
# Display its first rows
head(ChickWeight)

Row and Column Names

As with matrices, the column names of a data.frame can be obtained with colnames() (and row names with rownames()).

colnames(ChickWeight)
head(rownames(ChickWeight))
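
The same two functions can also be used on the left-hand side of an assignment to rename columns and rows. A minimal sketch on a small made-up table (df_names and its values are illustrative only):

```r
df_names <- data.frame(a = 1:3, b = c(2.5, 3.5, 4.5))

# Replace all column names at once
colnames(df_names) <- c("id", "score")

# Replace the default row names ("1", "2", "3")
rownames(df_names) <- c("s1", "s2", "s3")

# Rename a single column by matching its current name
colnames(df_names)[colnames(df_names) == "score"] <- "note"
df_names
```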

Indexing

A data.frame can be indexed by querying rows and columns, just like a matrix. As with a matrix, you can use position indexing, logical indexing, and name-based indexing. However, the data.frame also has an additional indexing operator, the dollar sign $.
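
The four indexing styles mentioned above can be sketched on a small made-up table (df_idx, its row names and its values are illustrative only):

```r
df_idx <- data.frame(
  name = c("Watson", "Crick", "Levene"),
  age = c(24, 21, 19),
  stringsAsFactors = FALSE
)
rownames(df_idx) <- c("r1", "r2", "r3")

# Position indexing: row 2, column 1
by_position <- df_idx[2, 1]

# Logical indexing: keep the rows where age is above 20
by_logical <- df_idx[df_idx$age > 20, ]

# Name-based indexing: row "r3", column "age"
by_name <- df_idx["r3", "age"]

# Dollar operator: extract a whole column as a vector
by_dollar <- df_idx$age
```

Note that selecting a single cell or a single column returns a vector, whereas selecting whole rows returns a data.frame.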

Type ChickWeight$ and press the Tab key to list the available columns, then display the contents of the Diet column. Do the same for the weight column.

# Place your cursor after 'ChickWeight', type '$' and
# press the tab key (<TAB>)
ChickWeight

Adding Columns

You can easily add a column to a data.frame, especially by using the $ operator.

ChickWeight$genotype <- sample(c("wild_type", "transgenic"), size=nrow(ChickWeight), replace=TRUE)
head(ChickWeight)                        

Add to the ChickWeight table a column named:

  • ‘genotype’ containing NA values.
  • ‘source’ containing “anonymous”.
  • ‘location’ containing random values chosen from ‘box_1’ or ‘box_2’.
ChickWeight
ChickWeight$genotype <- NA
ChickWeight$source <- "anonymous"
ChickWeight$location <- sample(c('box_1', 'box_2'), nrow(ChickWeight), replace = TRUE)
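
The $ operator is not the only way to add a column. As a hedged sketch on a small made-up table (df_add and its column names are illustrative), the [[ ]] operator and cbind() give the same kind of result; a single value is recycled to fill every row.

```r
df_add <- data.frame(id = 1:4)

# [[ ]] works like $, but the column name can be stored in a variable
new_col <- "status"
df_add[[new_col]] <- "unknown"   # single value recycled over the 4 rows

# A column of missing values
df_add$batch <- NA

# cbind() returns a new data.frame with the extra column appended
df_add <- cbind(df_add, score = c(0.1, 0.2, 0.3, 0.4))
df_add
```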

‘wide’ vs ‘long’

For many applications, especially graphics with the ggplot2 library (an indispensable library that we will discuss later), it is often necessary to change how the data are represented in a data.frame or a matrix.

Let’s take the following matrix. It is initially in the so-called “wide” format: each row is one item and each column is one variable. Nothing in the object itself marks it as “wide”; the distinction becomes clear once you compare it with the long format.

mat

The melt() function from the reshape2 library transforms (or “melts”) it into the long format, with one row per value. The row names now appear in the first column (Var1), the column names in the second column (Var2) and the values in the third (value).

library(reshape2)
m_melted <- melt(mat)
head(m_melted)
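
If reshape2 is not installed, a similar long table can be obtained with base R alone. This is only an approximation of melt(), not what the tutorial itself uses: it goes through as.table(), and the object names m_wide and m_long as well as the column names chosen below are illustrative.

```r
set.seed(1)
# A small 4 x 5 matrix in wide format, with row and column names
m_wide <- matrix(round(rnorm(20), 2), ncol = 5)
colnames(m_wide) <- LETTERS[1:ncol(m_wide)]
rownames(m_wide) <- letters[1:nrow(m_wide)]

# as.table() + as.data.frame() yields one row per cell of the matrix
m_long <- as.data.frame(as.table(m_wide))
colnames(m_long) <- c("row", "column", "value")
head(m_long)
```

The result has one row per cell (20 rows here), which is the layout ggplot2 generally expects.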

Applying Functions

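As with matrices, you can apply functions to the rows and columns of a data.frame using the apply() function (see the chapter on matrices). Because apply() first converts the data.frame to a matrix, it is best suited to data.frames whose columns are all numeric.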
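
A minimal sketch on a small made-up numeric table (df_vals is an illustrative name). The second argument of apply() selects the margin: 1 for rows, 2 for columns. sapply() is shown as a column-wise alternative that treats the data.frame as a list of columns.

```r
df_vals <- data.frame(A = c(1, 2, 3), B = c(10, 20, 30))

# Mean of each column (margin 2)
col_means <- apply(df_vals, 2, mean)

# Sum of each row (margin 1)
row_sums <- apply(df_vals, 1, sum)

# Column-wise alternative: iterate over the columns directly
col_means_s <- sapply(df_vals, mean)
```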

End of the section

Thank you for following this tutorial.

The ‘data.frame’ object.