Definitions
Matrices
In R, matrices (matrix objects) are two-dimensional arrays. They can optionally carry row and column names. All elements of a matrix must share the same mode (numeric, character, logical…).
As an example, a matrix can be created as follows.
x <- round(runif(25), 2)
mat <- matrix(data=x,
ncol = 5,
byrow = TRUE)
print(mat)
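The same-mode rule and the effect of byrow can be checked on a tiny example. This is a minimal sketch; the names mixed, by_col and by_row are illustrative and not part of the tutorial.

```r
# Mixing numbers and strings: every element is coerced to character.
mixed <- matrix(c(1, 2, "a", "b"), ncol = 2)
mode(mixed)   # "character"

# byrow controls the filling order (the default is column by column).
by_col <- matrix(1:6, ncol = 3)
by_row <- matrix(1:6, ncol = 3, byrow = TRUE)
by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
```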
You can also create a matrix by grouping vectors of the same size using the functions cbind() (column bind) or rbind() (row bind).
mat <- cbind(0:5, 20:25, 30:35)
mat
mat <- rbind(0:5, 20:25, 30:35)
mat
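To see how the two functions differ, we can compare the dimensions of their results. The sketch below reuses the vectors from the example above; the object names by_columns, by_rows and named are illustrative.

```r
# cbind() stacks the vectors side by side, rbind() on top of each other.
by_columns <- cbind(0:5, 20:25, 30:35)
by_rows <- rbind(0:5, 20:25, 30:35)
dim(by_columns)  # 6 3
dim(by_rows)     # 3 6

# Naming the arguments sets the column names (row names with rbind()).
named <- cbind(a = 0:5, b = 20:25)
colnames(named)  # "a" "b"
```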
Functions for matrix objects
Row and column names
Column/row names can be changed as follows:
set.seed(1)
mat <- matrix(data=round(rnorm(20), 2),
ncol = 4,
byrow = TRUE)
colnames(mat) <- LETTERS[1:4]
rownames(mat) <- letters[1:5]
print(mat)
Matrix dimensions
We can find out the number of rows, columns and dimensions of the matrix with the nrow(), ncol() and dim() functions respectively.
nrow(mat)
ncol(mat)
dim(mat)
Exercise
- Given the following matrix, use the paste0() function to create row names of the form gene_1, gene_2, gene_3… and column names of the form sample_1, sample_2, sample_3…. Associate these row and column names with the matrix mat.
set.seed(123)
mat <- matrix(data=round(runif(200, 0, 100), 0),
ncol = 10,
byrow = TRUE)
rown <- paste0("gene_", 1:nrow(mat))
rownames(mat) <- rown
coln <- paste0("sample_", 1:ncol(mat))
colnames(mat) <- coln
The transposition function
Transposing a matrix (\(mat^{T}\)) swaps its rows and columns. This is useful, for instance, in machine learning, where features (e.g., genes) often appear in columns while samples are rows. Use the t() function to perform the transposition.
mat
t(mat)
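A small matrix makes the effect of t() easy to verify. This is a minimal sketch; the matrix m and its gene/sample names are made up for illustration.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("gene_1", "gene_2"),
                            c("sample_1", "sample_2", "sample_3")))
tm <- t(m)

dim(m)        # 2 3
dim(tm)       # 3 2
rownames(tm)  # "sample_1" "sample_2" "sample_3"

# A cell keeps its value; only its coordinates are swapped.
m["gene_2", "sample_3"] == tm["sample_3", "gene_2"]  # TRUE

# Transposing twice gives back the original matrix.
identical(t(tm), m)  # TRUE
```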
The diag() function
You can easily manipulate the matrix object with various specific functions. For example, getting and modifying the diagonal values can be performed by the diag() function.
Let’s imagine that a matrix represents the adjacency matrix of a graph which, for each protein A to H, indicates with a 1 whether it interacts with another protein (0 otherwise). Proteins will be the nodes (vertices) of the graph and interactions will be its edges.
Let’s create such a matrix (we will see the graph representation shortly).
mat <- matrix(c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 1, 1, 0, 1, 0, 0,
0, 1, 1, 1, 1, 0, 0, 1, 0, 1,
0, 1, 0, 0, 1, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 0, 0),
ncol = 8,
byrow = TRUE)
cr_names <- LETTERS[1:8]
colnames(mat) <- rownames(mat) <- cr_names
print(mat)
Using the igraph library, the graph can be created with the graph_from_adjacency_matrix() function. Here, we declare the graph as undirected (mode="undirected") because protein-protein interactions have no particular source and target (i.e., we don’t know whether one activates or represses the other, they just interact…).
library(igraph)
my_graph <- igraph::graph_from_adjacency_matrix(mat, mode="undirected")
plot(my_graph)
From the diagram, and by extracting the values from the matrix diagonal, we can see that B interacts with itself, as does D. These proteins may form homodimers. To list all the proteins that can form homodimers, we can simply extract the matrix diagonal.
diag(mat)
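If we want the names of the self-interacting proteins rather than a vector of 0s and 1s, the diagonal can be used to filter the row names. The sketch below uses a small hypothetical adjacency matrix (adj) so that it runs on its own; the same expression applied to mat should return the proteins flagged on its diagonal.

```r
# Hypothetical 3-protein adjacency matrix; only B interacts with itself.
adj <- matrix(c(0, 1, 1,
                1, 1, 0,
                1, 0, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

diag(adj)   # 0 1 0

# Keep the row names whose diagonal value is 1.
homodimers <- rownames(adj)[diag(adj) == 1]
homodimers  # "B"
```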
If we do not want to focus on these homodimeric interactions we may simply set the diagonal values to 0.
diag(mat) <- 0
all(diag(mat) == 0) # TRUE
print(mat)
- By creating a graph with the graph_from_adjacency_matrix() function from the igraph library, check graphically that homodimeric interactions are no longer present in the graph.
mat
library(igraph)
my_graph <- igraph::graph_from_adjacency_matrix(mat, mode="undirected")
plot(my_graph)
The upper.tri() and lower.tri() functions
The functions upper.tri() and lower.tri() return a logical matrix indicating whether each cell of the matrix belongs to the upper or lower triangle, respectively.
upper.tri(mat)
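By default, both functions exclude the diagonal; the diag argument includes it. A minimal sketch on a 3 x 3 matrix (m is an illustrative name):

```r
m <- matrix(1:9, ncol = 3)

upper.tri(m)                    # TRUE strictly above the diagonal
sum(upper.tri(m))               # 3 cells
sum(upper.tri(m, diag = TRUE))  # 6 cells (diagonal included)

# The lower triangle is the mirror image of the upper one.
all(lower.tri(m) == t(upper.tri(m)))  # TRUE
```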
Indexing
Indexing by a matrix
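A test can be applied to all cells of the matrix; it returns a logical matrix of the same dimensions. For example, we can test whether the value is 1 (since this matrix only contains 0s and 1s, testing for values greater than 0.5 is equivalent).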
mat > 0.5
We can build more complex tests with logical operators. For instance, we could test whether a cell value is equal to 1 and belongs to the lower triangle.
mat > 0.5 & lower.tri(mat)
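The logical matrix produced by such a test can itself be placed inside the indexing operator to extract the matching cells. The sketch below uses a small hypothetical adjacency matrix (adj); keep and pos are illustrative names.

```r
adj <- matrix(c(0, 1, 1,
                1, 0, 0,
                1, 0, 0),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

# Logical matrix: TRUE for interactions located in the lower triangle.
keep <- adj == 1 & lower.tri(adj)

adj[keep]   # values of the selected cells: 1 1
sum(keep)   # 2, i.e. each undirected edge is counted once

# Row and column positions of the selected cells.
pos <- which(keep, arr.ind = TRUE)
pos
```

Counting only one triangle avoids counting each interaction twice in a symmetric matrix.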
Two-dimensional indexing
Since a matrix contains rows and columns, we’ll (most of the time) use two-dimensional indexing. Two pieces of information are passed to the indexing operator in the form [rows, columns], where rows and columns are vectors selecting the rows and columns to extract (by position, by name, or with a logical test). If rows is not defined (e.g. [, columns]), all rows are extracted. The same principle applies to columns.
Given the matrix declared below:
- Extract the value of the cell at position 1,1 and store the result in a variable a.
- Extract the values of the cells at position 1,1 and 1,2 and store the result in a variable b.
- Extract cell values from row 1 and store the result in a variable c.
- Extract cell values from rows 1 and 3 and store the result in a variable d.
- Extract cell values from column 1 and store the result in a variable e.
- Extract cell values from columns 1 and 3 and store the result in a variable f.
set.seed(123)
mat <- matrix(data=sample(1:20, size=40, replace = TRUE),
ncol = 4,
byrow = TRUE)
colnames(mat) <- LETTERS[1:4]
rownames(mat) <- letters[1:10]
a <- mat[1, 1]
b <- mat[1, c(1, 2)]
c <- mat[1, ]
d <- mat[c(1,3), ]
e <- mat[, 1]
f <- mat[ ,c(1, 3)]
Given the matrix declared below:
- Extract the values from the cells in columns 1 and 3 for rows 1 and 3, and store the result in a variable g.
- Extract all rows where the values in column 1 are greater than 11, and store the result in a variable h.
- Extract the cell where the row name is “a” and the column name is “B”, and store the result in a variable i.
- Extract the cells where the row names are “a”, “b”, and “c”, and the column name is “B”, and store the result in a variable j.
- Extract all columns where the values in row 1 are greater than 10, and store the result in a variable k.
set.seed(123)
mat <- matrix(data=sample(1:20, size=40, replace = TRUE),
ncol = 4,
byrow = TRUE)
colnames(mat) <- LETTERS[1:4]
rownames(mat) <- letters[1:10]
g <- mat[c(1, 3) ,c(1, 3)]
h <- mat[mat[, 1] > 11, ]
i <- mat["a", "B"]
j <- mat[c("a", "b", "c"), "B"]
k <- mat[, mat[1,] > 10]
Implicit Conversion by the Indexing Function
The indexing function can cause a type conversion that is not always desired (but is often very practical). For example, below, if we select a column from the matrix, we end up with a vector, which seems quite natural (the same is observed if we select a row).
mat <- matrix(1:20, ncol=4)
is(mat[, 1])
We can prevent this default behavior by setting the drop argument of the indexing function to FALSE. It is set to TRUE by default.
mat <- matrix(1:20, ncol=4)
is(mat[, 1, drop=FALSE])
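The difference is easiest to see on the dimensions of the result. A minimal sketch (m, v and k are illustrative names):

```r
m <- matrix(1:20, ncol = 4)

v <- m[, 1]                # default: drop = TRUE
k <- m[, 1, drop = FALSE]  # keep the matrix structure

dim(v)        # NULL: a plain vector has no dimensions
dim(k)        # 5 1: still a matrix, with a single column
is.matrix(v)  # FALSE
is.matrix(k)  # TRUE
```

Keeping the matrix structure matters when later code calls functions such as nrow(), ncol() or apply() on the result.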
The apply Function
Using the apply() function, we can apply functions that take a vector as their first argument (e.g. mean(), median(), var(), sd()…) to the rows or columns of a matrix.
The syntax and arguments of the apply function are as follows: apply(X, MARGIN, FUN, ...).
- X is a matrix or a data.frame
- MARGIN indicates whether the function should be applied to:
  - the rows (MARGIN=1)
  - or the columns (MARGIN=2)
- FUN is the function to be applied
- ... additional arguments passed to FUN
If we write apply(X=mat, MARGIN=2, FUN=median), each column (MARGIN=2) of mat (X=mat) will be passed successively to the median() function. This will return a vector of size ncol(mat) containing the median values of each column.
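The call described above can be tried on a small matrix whose medians are easy to check by hand. This is a minimal sketch; the matrix m is made up for illustration.

```r
m <- matrix(c(1, 2, 3,
              4, 5, 6,
              7, 8, 90),
            ncol = 3, byrow = TRUE)

# MARGIN = 2: one median per column.
col_medians <- apply(X = m, MARGIN = 2, FUN = median)
col_medians  # 4 5 6

# MARGIN = 1: one median per row.
row_medians <- apply(X = m, MARGIN = 1, FUN = median)
row_medians  # 2 5 8

length(col_medians) == ncol(m)  # TRUE
```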
Given the matrix \(mat\), use apply() to:
- Calculate the mean (mean()) of each row and store the result in the variable a.
- Calculate the variance (var()) of each row and store the result in the variable b.
- Calculate the standard deviation (sd()) of each row and store the result in the variable c.
- Calculate the interquartile range (IQR()) of each row and store the result in the variable d.
set.seed(123)
mat <- matrix(data=sample(1:20, size=40, replace = TRUE),
ncol = 4,
byrow = TRUE)
colnames(mat) <- LETTERS[1:4]
rownames(mat) <- letters[1:10]
a <- apply(mat, 1, mean)
b <- apply(mat, 1, var)
c <- apply(mat, 1, sd)
d <- apply(mat, 1, IQR)
When the function being called needs additional arguments, they can be passed to apply() after the FUN argument:
# E.g. apply a trimmed mean to the rows:
# trim = 0.2 drops 20% of the values
# at each end of every row before averaging.
apply(mat, 1, mean, trim = 0.2)
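Note that mean() removes floor(n * trim) values at each end, so with only four values per row, as in the matrix above, trim = 0.2 removes nothing. The sketch below uses a hypothetical matrix m with five values per row, where one value is trimmed at each end.

```r
m <- matrix(c( 1,  2,  3,  4, 100,
              10, 20, 30, 40,  50),
            ncol = 5, byrow = TRUE)

# Plain mean: the outlier (100) pulls the first row upwards.
plain <- apply(m, 1, mean)
plain    # 22 30

# Trimmed mean: the smallest and largest value of each row are dropped.
trimmed <- apply(m, 1, mean, trim = 0.2)
trimmed  # 3 30
```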
- Check the help for the quantile() function. Calculate the values of the \(1^{st}\) and \(3^{rd}\) quartiles for each column. Store the results in q_25 and q_75 respectively.
set.seed(123)
mat <- matrix(data=sample(1:20, size=40, replace = TRUE),
ncol = 4,
byrow = TRUE)
q_25 <- apply(mat, MARGIN = 2, quantile, 0.25)
q_75 <- apply(mat, MARGIN = 2, quantile, 0.75)
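When the applied function returns several values, apply() returns a matrix with one column per processed column. As a possible alternative to two separate calls, both quartiles can be requested at once through the probs argument of quantile(). A minimal sketch on an illustrative matrix m:

```r
m <- matrix(1:20, ncol = 4)

# Each column yields two values, so the result is a 2 x 4 matrix.
q <- apply(m, MARGIN = 2, quantile, probs = c(0.25, 0.75))

dim(q)       # 2 4
rownames(q)  # "25%" "75%"
q["25%", ]   # 2 7 12 17
q["75%", ]   # 4 9 14 19
```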
Mathematical Operations
We will often work with numeric matrices on which we can perform mathematical operations. As with vectors, these operations are generally greatly simplified because they implicitly apply to all elements of the matrix. Thus, we can write the following instructions:
mat
mat + 10
mat / 2
abs(mat)^0.5
mat + mat ^ 2
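One consequence worth checking is that the * operator also works cell by cell; the matrix product in the linear-algebra sense uses the separate %*% operator. A minimal sketch on an illustrative 2 x 2 matrix m:

```r
m <- matrix(1:4, ncol = 2)
m
#      [,1] [,2]
# [1,]    1    3
# [2,]    2    4

# Element-wise product: each cell is multiplied by itself.
elementwise <- m * m
elementwise  # cells: 1 4 (first column), 9 16 (second column)

# Matrix product: rows of the left operand times columns of the right one.
product <- m %*% m
product      # cells: 7 10 (first column), 15 22 (second column)
```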
End of the section
Thank you for following this tutorial.