Input, Output, and the File System

Objectives

Interacting with the file system is necessary in order to import (read) files or export (write) files. This section introduces some functions for navigating and manipulating the file structure.

The getwd() Function

The getwd() (get working directory) function displays the current working directory. This directory is where R is working at a given time and from which it will read or write by default.

getwd()

NB: The starting working directory is generally the user's home directory. This tutorial, however, is launched inside a Shiny application, so R is working in a temporary folder.
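As a small illustration, the working directory can be compared with two well-known locations. This is only a sketch: path.expand("~") and tempdir() are base R functions, and the paths they return depend on your system.

```r
# Compare the working directory with two well-known locations.
wd   <- getwd()            # current working directory
home <- path.expand("~")   # the user's home directory
tmp  <- tempdir()          # temporary directory of the current R session

cat("Working directory:", wd, "\n")
cat("Home directory:   ", home, "\n")
cat("Session temp dir: ", tmp, "\n")
```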

Creating and listing directory content

The dir.create() function is used to create directories. Here we will create a directory called “rtrainer_projects”.

dir.create("rtrainer_projects", showWarnings = FALSE)

NB: The showWarnings = FALSE argument suppresses the warning that would otherwise be raised if the folder already exists (e.g. when re-running this tutorial).

You can list the files in a directory using the dir() function. The value returned by dir() is a character vector.

dir()

NB: Since we are working in a Shiny application, you might find additional files/directories in the current working directory. Ignore them.
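Because dir() returns a character vector, its result can be stored and handled like any other vector. The sketch below works in a scratch folder inside tempdir(); the folder name "dir_demo" and the file names are made up for the illustration.

```r
# Work in a scratch directory so the example does not touch real files.
scratch <- file.path(tempdir(), "dir_demo")   # hypothetical folder name
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("a.txt", "b.csv", "c.txt")))

content <- dir(scratch)    # equivalent to list.files(scratch)
class(content)             # the result is a character vector
length(content)            # number of entries found

# The pattern argument takes a regular expression to filter the names.
txt_files <- dir(scratch, pattern = "\\.txt$")
txt_files
```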

The file.path() Function

Creating Paths

You can use the file.path() function to build a path to an object (folder or file) from its components. Since the format of paths differs between operating systems (Linux, Windows, macOS…), building paths with file.path() rather than pasting strings together by hand helps you write code that works across different operating systems.

fp <- file.path("rtrainer_projects", "project_1", "input")
fp
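A path built with file.path() can also be taken apart again with basename() and dirname(). In this minimal sketch, "data.tab" is a hypothetical file name used only for the illustration; no file is created.

```r
# Build a path from its components ("data.tab" is a made-up file name).
p <- file.path("rtrainer_projects", "project_1", "input", "data.tab")
p

file_name <- basename(p)   # last component of the path
folder    <- dirname(p)    # everything before the last component
file_name
folder

# Separator that file.path() inserts between components by default.
.Platform$file.sep
```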

You can then create the corresponding folders using dir.create().

dir.create(fp, showWarnings=FALSE, recursive=TRUE)

NB: Here, recursive=TRUE tells dir.create() to create “project_1” as well as its subfolder “input”.

Using dir() or list.files(), check the content of “rtrainer_projects” and “rtrainer_projects/project_1”:

dir("rtrainer_projects")
dir(file.path('rtrainer_projects', 'project_1'))
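To see what recursive = TRUE changes, the sketch below tries to create a nested folder with and without it. It runs inside tempdir() so nothing is left behind in your project; the folder name "rtrainer_demo" is an assumption made for the example.

```r
# Scratch location (hypothetical name), removed first so the demo is repeatable.
root   <- file.path(tempdir(), "rtrainer_demo")
unlink(root, recursive = TRUE)
nested <- file.path(root, "project_1", "input")

# Without recursive = TRUE the parent folders are missing, so creation fails.
ok_flat <- suppressWarnings(dir.create(nested))
ok_flat

# With recursive = TRUE every missing level is created.
ok_rec <- dir.create(nested, recursive = TRUE)
ok_rec

dir.exists(nested)                    # check that the folder now exists
list.dirs(root, full.names = FALSE)   # all sub-directories under root
```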

Changing Directories

The setwd() (set working directory) function allows you to navigate the file structure and move from one directory to another (thus setting a new working directory). This new directory will become the default location where R searches for documents (files and folders). You simply pass the name of the target directory to setwd(). After using setwd(), you can verify your position in the file structure with getwd().

We are currently here:

getwd()

Let’s move into the rtrainer_projects/project_1/input directory.

fp
setwd(fp)
getwd()

NB: In RStudio you can also use the menu Session > Set Working Directory > Choose Directory to change the working directory.
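setwd() invisibly returns the previous working directory, which makes it easy to come back to where you started. A minimal sketch, using tempdir() as an arbitrary target directory:

```r
target <- tempdir()     # arbitrary target directory for the demo

old <- setwd(target)    # move, and remember where we came from
here <- getwd()         # we are now inside the target directory
here

setwd(old)              # go back to the original working directory
getwd()
```

Restoring the original directory this way avoids leaving later code in an unexpected location.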

The read.table() Function

Introduction to read.table()

This function is essential: it reads data tables stored in text format, either locally or remotely (http, ftp). These text-format tables use a column separator, for example:

Files with the “.csv” extension (comma separated values) where the columns are separated by commas.
Files with the “.tsv” or “.tab” extension (tab separated values) where the columns are separated by tabs (see the next section).

All these “.csv”, “.tsv” and “.tab” files are considered flat files. This means they contain only characters and no formatting information (e.g., bold, italic, underline, etc., as in a Word document).
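As a quick illustration, the same small table can be written with commas or with tabs (the tab character is written "\t" in R). The sketch below uses the text argument of read.table() to read from a string instead of a file; the table content is made up.

```r
# The same toy table, once comma-separated and once tab-separated.
csv_text <- "Gene,A,B\ngene_1,1,2\ngene_2,3,4\n"
tsv_text <- "Gene\tA\tB\ngene_1\t1\t2\ngene_2\t3\t4\n"

# Only the sep argument changes between the two calls.
from_csv <- read.table(text = csv_text, sep = ",",  header = TRUE)
from_tsv <- read.table(text = tsv_text, sep = "\t", header = TRUE)

from_csv
identical(from_csv, from_tsv)   # both calls yield the same data frame
```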

What is a Tab-Delimited File?

Since tab-delimited files (often with the “.tsv” extension) are very popular (they can be generated from Excel, for example), it is important to take a closer look at this format.

The column separator in a tab-delimited file is the ‘\t’ character (tab). If you have not studied computer science, you might never have heard of it. However, you have likely encountered it without realizing it… It is written as ‘\t’, but the computer displays it as a large space (though it is not actually a space).

Run the following code (cat() is an alternative to print()).

cat("1\t2\t3")

Another important character to know is the ‘\n’ character (newline). It appears in all files (tab-delimited or not…) unless they contain only a single line. This character is interpreted by software as a line break.

Run the following code.

cat("1\t2\t3\n4\t5\t6\n8\t9\t10")

Consider the following matrix (in tab-delimited format):

cat("Gene\tA\tB\tC\ngene_1\t1\t2\t3\ngene_2\t4\t5\t6\n")

We can therefore create a tab-delimited file with the following code (cat() can also print to a file…):

cat("Gene\tA\tB\tC\ngene_1\t1\t2\t3\ngene_2\t4\t5\t6\n", 
    file="file_1.tab")
dir()

NB: It is extremely rare to write an entire matrix character by character as shown here. However, this exercise is presented for educational purposes.
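In practice, a table held in a data frame is more commonly exported with write.table(). The following is a sketch under assumptions: the data frame content mirrors the matrix above, and "file_2.tab" is a hypothetical file name written to tempdir().

```r
# A small data frame with the same values as the matrix above.
m <- data.frame(A = c(1, 4), B = c(2, 5), C = c(3, 6),
                row.names = c("gene_1", "gene_2"))

out <- file.path(tempdir(), "file_2.tab")   # hypothetical output file

# sep = "\t"      : tab-delimited output
# quote = FALSE   : do not surround strings with quotes
# col.names = NA  : add an empty header field above the row names
write.table(m, file = out, sep = "\t", quote = FALSE, col.names = NA)

# Display the raw content of the file that was written.
cat(readLines(out), sep = "\n")
```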

The Arguments of the read.table() Function

The main arguments of read.table() are:

file: the name of the file.
header: whether the first line contains the column names.
skip: the number of lines to skip before reading begins.
sep: the column separator (e.g. "\t", a tab, which is the most common separator).
row.names: the column containing the row names (e.g. 1).
quote: the field delimiter (preferably set to "").
comment.char: "#" by default. Text preceded by this character is not read. Use it for comment lines.

Using the read.table() function, read the content of the file file_1.tab.
- Remember to set file, header, and row.names (the other arguments can keep their default values).
- Store the result in the object df and print the content of the variable.

df <- read.table(file="file_1.tab", header=TRUE, row.names=1)
print(df)
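The skip and comment.char arguments are easier to grasp on a small example. The sketch below reads from a string through the text argument; the title line and the comment line are made up for the illustration.

```r
# A toy table with a free-text first line and a comment line.
raw <- paste(
  "This first line is a free-text title",      # removed by skip = 1
  "Gene\tA\tB\tC",                             # header line
  "# a comment line, ignored by comment.char",
  "gene_1\t1\t2\t3",
  "gene_2\t4\t5\t6",
  sep = "\n")

df2 <- read.table(text = raw,
                  skip = 1,            # skip the title line
                  header = TRUE,       # first remaining line = column names
                  sep = "\t",          # tab-delimited
                  row.names = 1,       # first column = row names
                  quote = "",          # no quoting character
                  comment.char = "#")  # ignore lines starting with '#'
print(df2)
```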

Reading a Remote File

R can be given a path in the file system as well as a URL (i.e., an internet link).

For example, here we read a table (the Sultan dataset) from the recount database, using the URL given in the code below.

url <- "https://bowtie-bio.sourceforge.net/recount/countTables/sultan_count_table.txt"
sultan <- read.table(file = url,
                    sep="\t",
                    row.names = 1, 
                    header = TRUE,
                    quote = "",
                    skip=0,
                    comment.char = "#")

Exercise

Retrieve the dataset modencodefly from the recount database. This dataset represents gene expression values (rows) across samples (columns). The measurements were performed using RNA sequencing.
Delete all rows whose sum is lower than or equal to 2.
Add the value 1 to the entire matrix (pseudo-count to allow logarithmic transformation).
Transform the values using base 10 logarithm (log10()).
Create a scatter plot (plot()) where the values from the sample SRX008027 are represented on the x-axis and those from SRX008018 on the y-axis.
Check the help for the densCols() function and represent the point density using a color gradient.

url <- "https://bowtie-bio.sourceforge.net/recount/pooled/modencodefly_pooledreps_count_table.txt"
modencodefly <- read.table(file = url,
                    sep="\t",
                    row.names = 1, 
                    header = TRUE,
                    quote = "",
                    skip=0,
                    comment.char = "#")
# Keep only the rows whose sum is strictly greater than 2
modencodefly <- modencodefly[rowSums(modencodefly) > 2, ]
modencodefly <- modencodefly + 1
modencodefly <- log10(modencodefly)
cols <- densCols(modencodefly$SRX008027, 
                 modencodefly$SRX008018, 
                 colramp = colorRampPalette(rainbow(5)))
plot(modencodefly$SRX008027, 
     modencodefly$SRX008018, 
     pch=16, col=cols, 
     xlab="SRX008027", ylab="SRX008018")
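The filtering and transformation steps of the exercise can be checked by hand on a small toy table before applying them to the real dataset. The counts and the sample names S1 and S2 below are made up for the illustration.

```r
# Toy count table: 4 genes (rows) x 2 samples (columns), made-up values.
toy <- data.frame(S1 = c(0, 1, 10, 99),
                  S2 = c(0, 1, 5, 0),
                  row.names = paste0("gene_", 1:4))

# Row sums are 0, 2, 15 and 99: rows with a sum <= 2 are removed.
keep <- rowSums(toy) > 2
toy  <- toy[keep, ]

# Pseudo-count of 1 (so that zeros can be log-transformed), then log10.
toy_log <- log10(toy + 1)
toy_log
```

Without the pseudo-count, log10(0) would return -Inf for the zero counts.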

End of the section

Thank you for following this tutorial.