Objectives
It is important to be able to interact with the file system in order to import/read files or export/write to files. The following section introduces some elements for manipulating the file structure.
The getwd() Function
The getwd() (get working directory) function displays the current working directory. This directory is where R is working at a given time and from which it will read or write by default.
getwd()
NB: The starting working directory is generally the user's home directory. Here, this tutorial is launched inside a Shiny application, so R is running in a temporary folder.
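As a minimal sketch, the value returned by getwd() can be stored and inspected like any other character string. The variable name wd is chosen here for illustration only.

```r
# Store the current working directory in a variable (name chosen for illustration)
wd <- getwd()

# getwd() returns a single character string
class(wd)
length(wd)

# The working directory is an existing directory on disk
dir.exists(wd)
```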
Creating and listing directory content
The dir.create() function is used to create directories. Here we will create a directory called “rtrainer_projects”.
dir.create("rtrainer_projects", showWarnings = FALSE)
NB: The showWarnings = FALSE argument suppresses the warning that would otherwise be issued if the folder already exists (e.g. when re-running this tutorial).
You can list the files in a directory using the dir() function. The value returned by dir() is a character vector.
dir()
NB: Since we are working in a Shiny application, you might find additional files or directories in the current working directory. Ignore them.
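The listing can also be restricted or enriched through the pattern and full.names arguments of dir(). The sketch below works in a sub-folder of R's temporary directory so that the tutorial folders are left untouched; the folder and file names (dir_demo, a.txt, b.csv) are made up for illustration.

```r
# Work inside a fresh sub-folder of R's temporary directory (illustrative names)
demo_dir <- file.path(tempdir(), "dir_demo")
dir.create(demo_dir, showWarnings = FALSE)

# Create two empty files so that there is something to list
file.create(file.path(demo_dir, c("a.txt", "b.csv")))

# List everything, then only the files whose names end in ".txt"
all_files <- dir(demo_dir)
txt_files <- dir(demo_dir, pattern = "\\.txt$")

# full.names = TRUE returns paths that include the directory
full_paths <- dir(demo_dir, full.names = TRUE)

all_files
txt_files
```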
The file.path() Function
Creating Paths
You can use the file.path() function to create a path to an object (folder or file). Since the format of paths differs between operating systems (Linux, Windows, macOS…), building paths with file.path() rather than pasting strings together by hand ensures that the generated path is valid on the platform where R is installed. Using file.path() therefore lets you write code that is compatible across operating systems.
fp <- file.path("rtrainer_projects", "project_1", "input")
fp
You can then create the corresponding folders using dir.create().
dir.create(fp, showWarnings=FALSE, recursive=TRUE)
NB: Here the recursive=TRUE argument tells dir.create() to create “project_1” as well as its subfolder “input”.
- Using dir() or list.files(), check the content of “rtrainer_projects” and “rtrainer_projects/project_1”.
dir("rtrainer_projects")
dir(file.path('rtrainer_projects', 'project_1'))
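A path built with file.path() can be taken apart again with basename() and dirname(). The following is a small sketch; the folder names project_2 and output are hypothetical, and the folders are created under R's temporary directory so that the current working directory is not modified.

```r
# Build a path piece by piece (folder names are illustrative)
p <- file.path("rtrainer_projects", "project_2", "output")
p

# Split the path again: last component and parent directory
basename(p)
dirname(p)

# Create the nested folders under R's temporary directory, so that the
# current working directory is left untouched
target <- file.path(tempdir(), p)
dir.create(target, showWarnings = FALSE, recursive = TRUE)
dir.exists(target)
```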
Changing Directories
The setwd() (set working directory) function allows you to navigate the file structure and move from one directory to another (thus setting a new working directory). This new directory becomes the default location where R searches for documents (files and folders). You simply pass the name of the target directory to setwd(). After using setwd(), you can verify your position in the file structure with getwd().
We are currently here:
getwd()
Let’s move into the rtrainer_projects/project_1/input directory.
fp
setwd(fp)
getwd()
NB: In RStudio you can also change the working directory through the menu Session > Set Working Directory > Choose Directory.
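setwd() invisibly returns the directory that was current before the change, which makes it easy to come back afterwards. The sketch below uses R's temporary directory as an arbitrary destination; old_wd is an illustrative variable name.

```r
# Remember where we are, move to R's temporary directory, then come back
old_wd <- setwd(tempdir())   # setwd() invisibly returns the previous directory
getwd()                      # now inside the temporary directory

setwd(old_wd)                # restore the original working directory
getwd()
```

Because the last call restores the original location, running this snippet does not change where the rest of the tutorial operates.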
The read.table() Function
Introduction to read.table()
This function is essential because it reads data tables stored in text format, either locally or remotely (http, ftp). In these text-format tables, the columns are delimited by a separator character. For example:
- Files with the “.csv” extension (comma separated values) where the columns are separated by commas.
- Files with the “.tsv” or “.tab” extension (tab separated values) where the columns are separated by tabs (see the next section).
All these “.csv”, “.tsv” and “.tab” files are considered flat files. This means they contain only characters and no formatting information (e.g., bold, italic, underline, etc., as in a Word document).
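To make the role of the separator concrete, here is a minimal sketch that parses the same small table written once with commas and once with tabs. In the strings, \t stands for a tab and \n for a line break. The table content and variable names are invented for illustration, and the text argument of read.table() is used so that no file is needed.

```r
# The same small table written with two different column separators
csv_text <- "Gene,A,B\ngene_1,1,2\ngene_2,3,4\n"
tsv_text <- "Gene\tA\tB\ngene_1\t1\t2\ngene_2\t3\t4\n"

# The 'text' argument lets read.table() parse a string instead of a file
from_csv <- read.table(text = csv_text, sep = ",",  header = TRUE)
from_tsv <- read.table(text = tsv_text, sep = "\t", header = TRUE)

# Compare the two results
identical(from_csv, from_tsv)
```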
What is a Tab-Delimited File?
Since tab-delimited files (often with the “.tsv” extension) are very popular (they can be generated from Excel, for example), it is important to take a closer look at this format.
The column separator in a tab-delimited file is the ‘\t’ character (tab). If you have not studied computer science, you might never have heard of it. However, you have likely encountered it without realizing it… It is written as ‘\t’, but the computer displays it as a wide space (though it is not actually a space).
- Run the following code (cat() is an alternative to print()).
cat("1\t2\t3")
Another important character to know is the ‘\n’ character (newline). It appears in all files (tab-delimited or not…) unless they contain only a single line. This character is interpreted by software as a line break.
- Run the following code (cat() is an alternative to print()).
cat("1\t2\t3\n4\t5\t6\n8\t9\t10")
Consider the following matrix (in tab-delimited format):
cat("Gene\tA\tB\tC\ngene_1\t1\t2\t3\ngene_2\t4\t5\t6\n")
We can therefore create a tab-delimited file with the following code (cat() can print to a file…):
cat("Gene\tA\tB\tC\ngene_1\t1\t2\t3\ngene_2\t4\t5\t6\n",
file="file_1.tab")
dir()
NB: It is extremely rare to write an entire matrix character by character as shown here. However, this exercise is presented for educational purposes.
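In practice, a table held in an R object is usually written with write.table() instead. The following is a sketch under assumptions: the data frame mat simply copies the values of the table above, and the file name file_2.tab is hypothetical. The file is written to R's temporary directory.

```r
# A small data frame equivalent to the table above (values copied from it)
mat <- data.frame(A = c(1, 4), B = c(2, 5), C = c(3, 6),
                  row.names = c("gene_1", "gene_2"))

# Write it as a tab-delimited file in R's temporary directory
out_file <- file.path(tempdir(), "file_2.tab")
write.table(mat, file = out_file, sep = "\t", quote = FALSE, col.names = NA)

# Show the raw content of the file, line by line
lines_written <- readLines(out_file)
lines_written
```

With col.names = NA, the header line starts with an empty field above the row names, so every line has the same number of columns.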
Arguments of the read.table() Function
The main arguments of read.table() are:
- file: the name of the file.
- header: whether the first line contains the column names.
- skip: the number of lines to skip before reading starts.
- sep: the column separator (e.g. "\t", a tab, which is the most common separator).
- row.names: the column containing the row names (e.g. 1).
- quote: the quoting characters (preferably set to "").
- comment.char: "#" by default. Text following this character is not read. Use it for comment lines.
- Using the read.table() function, read the content of the file file_1.tab.
- Remember to set file, header, and row.names (the other arguments can keep their default values).
- Store the result in the object df and print the content of the variable.
df <- read.table(file="file_1.tab", header=TRUE, row.names=1)
print(df)
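The effect of skip and comment.char is easier to see on a table that contains a title line and a comment line. Here is a minimal sketch; the table content and the name df_demo are invented for illustration, and the text argument is used so that no file is needed.

```r
# A small tab-delimited table preceded by a title line and containing a comment line
txt <- "My table title\nGene\tA\tB\n# a comment line\ngene_1\t1\t2\ngene_2\t3\t4\n"

df_demo <- read.table(text = txt,
                      skip = 1,            # ignore the title line
                      header = TRUE,       # first remaining line holds column names
                      sep = "\t",
                      row.names = 1,       # first column holds row names
                      quote = "",
                      comment.char = "#")  # lines starting with '#' are dropped
df_demo
```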
Reading a Remote File
R can be given a path in the file system as well as a URL (i.e., an internet link).
For example, here we read a table (Sultan dataset) from the recount database. This table is accessible at the URL given below.
url <- "https://bowtie-bio.sourceforge.net/recount/countTables/sultan_count_table.txt"
sultan <- read.table(file = url,
                     sep = "\t",
                     row.names = 1,
                     header = TRUE,
                     quote = "",
                     skip = 0,
                     comment.char = "#")
Exercise
Retrieve the dataset modencodefly from the recount database. This dataset represents gene expression values (rows) across samples (columns). The measurements were performed using RNA sequencing.
- Delete all rows whose sum is lower than or equal to 2.
- Add the value 1 to the entire matrix (pseudo-count to allow logarithmic transformation).
- Transform the values using the base 10 logarithm (log10()).
- Create a scatter plot (plot()) where the values from the sample SRX008027 are represented on the x-axis and those from SRX008018 on the y-axis.
- Check the help for the densCols() function and represent the point density using a color gradient.
url <- "https://bowtie-bio.sourceforge.net/recount/pooled/modencodefly_pooledreps_count_table.txt"
modencodefly <- read.table(file = url,
                           sep = "\t",
                           row.names = 1,
                           header = TRUE,
                           quote = "",
                           skip = 0,
                           comment.char = "#")
# Keep only the rows whose sum is strictly greater than 2
modencodefly <- modencodefly[rowSums(modencodefly) > 2, ]
modencodefly <- modencodefly + 1
modencodefly <- log10(modencodefly)
cols <- densCols(modencodefly$SRX008027,
                 modencodefly$SRX008018,
                 colramp = colorRampPalette(rainbow(5)))
plot(modencodefly$SRX008027,
     modencodefly$SRX008018,
     pch = 16, col = cols,
     xlab = "SRX008027", ylab = "SRX008018")
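The filtering and transformation steps can also be checked offline on simulated data. This is only a sketch: the count table below is randomly generated (it is not the modencodefly dataset), and the names counts, kept, logged, sample_1 and sample_2 are illustrative.

```r
set.seed(1)  # make the simulated counts reproducible

# Simulated count table: 200 genes (rows) x 2 samples (columns); names are illustrative
counts <- data.frame(sample_1 = rpois(200, lambda = 3),
                     sample_2 = rpois(200, lambda = 3))

# Keep only the rows whose sum is strictly greater than 2
kept <- counts[rowSums(counts) > 2, ]

# Add a pseudo-count of 1, then apply the base 10 logarithm
logged <- log10(kept + 1)

# Sanity checks: every remaining row has a sum above 2,
# and no value is negative after the transformation
all(rowSums(kept) > 2)
all(as.matrix(logged) >= 0)
```

The pseudo-count matters because log10(0) is -Inf. Adding 1 maps a count of 0 to log10(1) = 0.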
End of the section
Thank you for following this tutorial.