Chapter 1 Session 1

In this first session we will learn about:

  • R and RStudio
  • data types and data structures
  • vectorisation
  • how to read in and write out data

1.1 R and RStudio

R is a free and open source statistical programming language, great for performing data analysis. RStudio is a free and open source R integrated development environment (IDE) which makes it easier for you to write code. It does this by providing you with auto-completion (of variable names, function names, file paths etc.), helping with formatting and keeping track of your variables.

You can think of R as the engine in a car and RStudio as the body & controls. R is doing all the calculations/computations but RStudio makes it easier for you to use R.

When you first open RStudio, there will be three panels - see Figure 1.1 (don’t worry if your RStudio does not have the same colours/appearance - different versions of RStudio look slightly different and the text colour I use is not the default one.)

Three panels in RStudio.

Figure 1.1: Three panels in RStudio.

  • Left panel: this panel features two tabs, ‘Console’, which is where you can type in commands for R to run and ‘Terminal’, which we won’t worry about in this course.
  • Top right panel:
    • Environment - in this tab you can see all variables you have created.
    • History - R keeps track of all commands you have run and you can review them in this tab.
    • Connections - this tab helps you connect to data sources but we will not be using it in this course.
  • Bottom right:
    • Files - you can explore your file directory here and we will use it to set our working directory later.
    • Plots - plots that you create will either appear here or be saved to a file.
    • Help - help files for R functions can be viewed in this tab. Help files tell you about what a function does and how to use it.
    • Packages - basic R includes many useful functions. You can add even more functions by downloading packages. A package is a collection of functions, generally with a certain data analysis theme. For example, the package ‘limma’, which we will use later, includes functions for analysing RNA-sequencing data.
    • Viewer - this tab lets you view local web content but we won’t be using it in this course.

1.2 R scripts

To open a R script, go to File > New File > R Script.

Opening a new R Script.

Figure 1.2: Opening a new R Script.

This will open a fourth panel on the top left.

Four panels, including a R Script.

Figure 1.3: Four panels, including a R Script.

A R Script is just a text document. You can type and run commands using a R Script and down in the Console - the difference is that you can’t save the commands you run in the Console but you can save the R Script with all your commands. Note that to run a command in the Console press Enter key but to run a command in a R Script you must press Cmd/Ctrl + Enter keys.

Lastly, you can make ‘comments’ in your R Script. Comments are notes to yourself that are not interpreted by R and start with #:

# this is a note to myself
1+3-2
## [1] 2

1.3 Help

There are two ways to access help files in RStudio. You can type in the name of the function you want help with, in the top right of help tab (indicated with a red arrow):

Accessing help files via 'help' tab.

Figure 1.4: Accessing help files via ‘help’ tab.

The other option is to run the command ? followed by the name of the function:

?sum

Help files are very useful but can be difficult to interpret at first due to the technical language used. We won’t get too much practice reading help files during this course but I would encourage you to try to use them when figuring out how to use a new function.

1.4 Working directory

Every file on your computer is located in a specific location. This location can be referred to by a path. In Mac, paths look something like this: /Users/Lucy/Documents/. In Windows, paths look something like this: C:\Users\Lucy\Documents\.

When you open a R session, it launches from a specific location. You can find out where this is using the command getwd(). This location called the ‘working directory’. R will, by default, look in this directory when reading in data and write out files/plots to this directory. It is often useful to have your data and R Scripts in the same directory and set this as your working directory.

You can set your working directory to be anywhere you like and we will now do this:

  1. Make a folder for this course, somewhere sensible on your computer that you will be able to easily find.
  2. Go back to your RStudio window, go to the bottom right panel, click on the ‘Files’ tab and then click on the three dots on the top right hand corner (Figure 1.5).
Setting the working directory - 1.

Figure 1.5: Setting the working directory - 1.

  1. This will open up a new window (Figure 1.6) which lets you explore the files and folders on your computer. Find the new folder you created, click on it then click ‘Open’.
Setting the working directory - 2.

Figure 1.6: Setting the working directory - 2.

  1. The files tab will now show the contents of your new folder (which should be empty). At the top of the files tab, click on More > Set As Working Directory (Figure 1.7).
Setting the working directory - 3.

Figure 1.7: Setting the working directory - 3.

Please set your working directory to be this folder at the start of EVERY session.

1.5 Maths

R performs maths and follows standard order of operations. In order of highest to lowest precedence, here is how mathematical operations are denoted in R:

  • () - parentheses
  • ** or ^ - exponents
  • / - divide
  • * - multiply
  • + - add
  • - - subtract

Another useful function is modulus (%%), which gives the remainder after dividing:

8%%3
## [1] 2

1.6 Comparisons

You can also compare numbers in R:

1 == 1  # equal to.
## [1] TRUE
1 != 1 # not equal to
## [1] FALSE
2 > 1   # greater than.
## [1] TRUE
2 < 1   # less than.
## [1] FALSE
1 <= 2  # greater or equal to.
## [1] TRUE
1 >= 2  # less than or equal to.
## [1] FALSE

The ! sign by itself means “not” and it reverses the logical. For example, “not” TRUE is FALSE.

!TRUE
## [1] FALSE

You can also compare words. What do you think will happen below?

"cat" > "dog"
## [1] FALSE

R will use alphabetical order to determine which word is ‘greater’:

"a" < "b"
## [1] TRUE

1.7 Variables

A variable in R is just a name which refers to an ‘thing’ (more technically an ‘object’ in R).

For example, I can do some maths:

2 + 6 * 7**2
## [1] 296

R outputs the result. However, if I want to ‘save’ this result to use later, I need to ‘assign’ the output to a variable. This can be thought of as giving it a name, so that we can refer to it later.

You can this with <- (shortcut = alt + -) in R. (You can also use =, however stylistically <- is preferred.)

Here R performs the calculation on the right of <- and then saves the result as a variable called my_num.

my_num <- 2 + 6 * 7**2

Now if I run the command my_num, I see the number I stored earlier:

my_num
## [1] 296

You can also ‘overwrite’ variables:

my_num <- 3 * 4

The above code evaluates 3 * 4 and assigns the output to the variable my_num - effectively ‘overwritting’ the previous value assigned to it.

my_num
## [1] 12

R is particular about variable names. Variable names cannot:

  • start with a number
  • contain any spaces

If we try to create a variable that starts with a number, R will return an error:

2myvar <- 2 + 6 * 7**2
## Error: <text>:1:2: unexpected symbol
## 1: 2myvar
##      ^

1.7.1 Errors and warnings

This is a good time to talk about ‘errors’ and ‘warnings’ in R.

An error is R telling you that it couldn’t do what you told it to do. Do not be disheartened at seeing an error message - it happens to everyone, including experienced programmers, all the time. DO read the error message, it is often very useful and tells you what you need to do to fix the error.

A warning is R telling you that is has done what you told it to do, however the result may not be what you want. Sometimes it is okay to ignore a warning, sometimes it is not!

1.8 Data structures

A data structure can be thought of as a ‘container’ for data. There are a number of different data structures in R and each have different specifications about how data is stored.

Three commonly used data structures are summarised below:

Data structures. Source: http://libguides.wellesley.edu.

Figure 1.8: Data structures. Source: http://libguides.wellesley.edu.

Another useful data structure is a ‘list’, which we will talk about in Session 3.

1.8.1 Vector

You can think of a vector like a row or column in excel. You can only store one type of data in a vector - e.g. all numbers or all text.

You can create vectors with the c() function (‘c’ for ‘combine’):

my_vect1 <- c(1,2,3)
my_vect2 <- c("a", "b", "c")

1.8.2 Matrix

A matrix is like an excel spreadsheet. It is two dimensional meaning you have columns and rows of data. You can only store one type of data in a matrix - e.g. all numbers or all text.

1.8.3 Dataframe

Dataframes are also two dimensional (has both rows and columns) however, you can store different types of data in a dataframe. The only restriction is that all the data within one column must be of the same type.

A dataframe is composed of vectors, with each vector being one column. You can create a dataframe using the data.frame() function:

my_df <- data.frame(
  cats = c("Hello Kitty", "Garfield"), 
  weight = c(4.5,7)
)
my_df
##          cats weight
## 1 Hello Kitty    4.5
## 2    Garfield    7.0

There are a few important things to note:

  1. name and age become the names of the columns.
  2. the , at the end of cats = c("Hello Kitty", "Garfield") is important and should not be missed.
  3. each column in a dataframe is essentially a vector. Do not forget the c() when inputing the values within each column.

You can access a column in a dataframe with the shortcut $. Notice that the names of all columns of the dataframe appears after typing in my_df$:

Accessing columns in a dataframe using the '$' symbol.

Figure 1.9: Accessing columns in a dataframe using the ‘$’ symbol.

Challenge 1.1

Create a dataframe called my_df2 with 2 columns. The first column named “dogs” should be a vector with the values: “spot”, “snoopy”. The second column named “weight” should be a vector with the following values: 3.5, 4.8.

Write this dataframe out to a .tsv file named ‘Ses1_dogs.tsv’.

1.9 Reading in data

Let’s read in some data and start exploring it.

You should have received the data files via email before the course. Please download these files and make sure they are located in your working directory. Recall, we set our working directory above. You can find the location of your working directory with the function getwd().

If you are following along outside of the WEHI course, you can download the data files from Github - instructions for downloading data from GitHub can be found in the Preface.

The file we want to read in is named ‘Ses1_genes.tsv’.

I have put all my data files in a directory called ‘data’ - thus the path to the file (relative to my working directory) is ‘data/Ses1_genes.tsv’. Depending on where you have put your data (in your working directory or in another file in your working directory), the path to your file ‘Ses1_genes.tsv’ may be different.

Read in your data by typing in the path to your ‘Ses1_genes.tsv’ file (relative to your working directory), within the brackets ( ):

read.delim("data/Ses1_genes.tsv")
##     SYMBOL GeneLength Count
## 1  Gm10568       1634     0
## 2  Gm19860        799     4
## 3  Gm19938       3259     0
## 4   Lypla1       2433   768
## 5      Rp1       9747     0
## 6    Sox17       3130     1
## 7    Tcea1       2847   810
## 8   Mrpl15       4203   431
## 9     Xkr4       3634     1
## 10   Rgs20       2241   452

Note that read.delim() also lets you specify what kind of file you are reading in - by this I mean how each value is separated. Two common formats are ‘csv’ (comma separated values) and ‘tsv’ (tab separated values).

Example of csv file:

Name, Age
Andy, 10
Bob, 8

Example of tsv file:

Name  Age
Andy  10
Bob   8

Notice in the above two examples the values are separated by different ‘characters’.

You can specify what ‘character’ separates each value by using the sep input in read.delim(). E.g. if your file was a csv, you can read it in using:

read.delim("file.csv", sep = ",")

We don’t need to specify the sep for our file, which is a tsv, as the default separater read.delim() is tab. We only need to specify the separater character when we are reading in a file NOT separated by tabs.

Above, we have read in our data as a dataframe and printed it. However, we can’t refer to this dataframe again and manipulate it because we haven’t assigned it to a variable.

Let’s assign our dataframe to a variable called genes:

genes <- read.delim("data/Ses1_genes.tsv")

Notice how genes now appears in our ‘Environment’ tab:

Variable in 'Environment' tab.

Figure 1.10: Variable in ‘Environment’ tab.

All variables that we create will be shown in this tab, so it is a useful way to keep track of variables that we have created. Notice how R also tells us that there are 10 observations (rows) and 3 variables (columns) in the genes dataframe.

1.9.1 Summary

A useful function for investigating your data is summary(). Running this function on our genes dataframe provides us with summary statistics on all the numeric columns. For the columns that don’t contain numbers, the output simply states the length of that column.

summary(genes)
##      SYMBOL    GeneLength       Count       
##  Gm10568:1   Min.   : 799   Min.   :  0.00  
##  Gm19860:1   1st Qu.:2289   1st Qu.:  0.25  
##  Gm19938:1   Median :2988   Median :  2.50  
##  Lypla1 :1   Mean   :3393   Mean   :246.70  
##  Mrpl15 :1   3rd Qu.:3540   3rd Qu.:446.75  
##  Rgs20  :1   Max.   :9747   Max.   :810.00  
##  (Other):4

1.9.2 Structure

A useful function for understanding how our data is stored in R is str() (structure).

str(genes)
## 'data.frame':    10 obs. of  3 variables:
##  $ SYMBOL    : Factor w/ 10 levels "Gm10568","Gm19860",..: 1 2 3 4 7 8 9 5 10 6
##  $ GeneLength: int  1634 799 3259 2433 9747 3130 2847 4203 3634 2241
##  $ Count     : int  0 4 0 768 0 1 810 431 1 452

The output tells us that genes is a dataframe. It also tells us what data type each column is.

1.10 Data types

Every bit of data in R has a ‘data type label’. The label doesn’t change the data in any way - it just tells R what kind of data it is and thus what it can and can’t do with the data.

For example, it makes sense to perform mathematical functions on numbers but not on words. It makes sense to look for a certain term, like a gene name, in words but not in numbers.

For example, you can’t add words so the code belwo doesn’t work. Note that quotes (" " or ' ') ALWAYS surround values that are of the ‘character’ data type:

"one" + "two"
## Error in "one" + "two": non-numeric argument to binary operator

This does work because they are numbers:

1 + 2
## [1] 3

There are five basic types of data in R:

  • logical - either TRUE or FALSE. This is useful for data that only has two possible values, like if patient has a disease or not.
  • integer - number without decimal point e.g. 3.
  • double - number with decimal point e.g. 3.14.
  • complex - complex number with a real and imaginary part e.g. 1 + 3i
  • character - Anything with character(s) within it. Quotes (double or single) signify this data type e.g. "pi". Also known as a ‘string’.

In our genes dataframe above, we can see that both the GeneLength and Count columns are integers (‘int’). But what is the SYMBOL column? It is a ‘Factor’.

1.10.1 Factors

Factor is a data type label used for categorical variables (e.g. small, medium and large OR red, blue and green). Internally, R stores factors as numbers (integers), with each number corresponding to a category.

For example, if your data was:

red, blue, green, red, green

Internally, R would store the data as:

1, 2, 3, 1, 3

Each number corresponds to a category. This information is also stored. In R, the categorical values each number corresponds to, is called ‘levels’.

The levels for the above data would be:

1 = red
2 = blue
3 = green

Factors can be difficult to work with, so we don’t want our gene symbols to be ‘labelled’ as factors. We can do this by specifying this when reading our data in:

genes <- read.delim("data/Ses1_genes.tsv", stringsAsFactors = FALSE)

By default, R will label all word (‘character’) data as a ‘factor’. Setting stringsAsFactors to be FALSE, tells R that you DON’T want it to do this.

We can check the structure again:

str(genes)
## 'data.frame':    10 obs. of  3 variables:
##  $ SYMBOL    : chr  "Gm10568" "Gm19860" "Gm19938" "Lypla1" ...
##  $ GeneLength: int  1634 799 3259 2433 9747 3130 2847 4203 3634 2241
##  $ Count     : int  0 4 0 768 0 1 810 431 1 452

Notice that now the column SYMBOL is now a character (‘chr’).

Another way to change the labels of data in R is with the following functions:

  • as.logical()
  • as.integer()
  • as.double()
  • as.complex()
  • as.character()

For example, this will turn a integer vector into a character vector:

as.character(c(1,2,3))
## [1] "1" "2" "3"

1.10.2 Type coercion

In some circumstances, R will change the data type label of your data. This is called ‘type coercion’. A common scenario in which this will happen is with vector (and dataframe column) labeling.

All elements within a vector (and within a column of a dataframe) must have the same data type label. Thus, if you create a vector like the one below, the whole vector will be labelled as one data type. Which data type do you think it will be?

my_vect3 <- c(1, 2, "a")

We can find out with str():

str(my_vect3)
##  chr [1:3] "1" "2" "a"

It has labelled the whole vector as characters. This is because the letter "a" cannot be “expressed” as a number but numbers can be “expressed” as characters - thus R will always pick the data type label that does not result in any loss of information. Recall that data type labels NEVER change the data - only the label changes.

There is an order to which data label R will label a vector as. Of all the data types that exist in the vector, the data type of the element that is furthermost right, will be the “final” data type label for the vector.

logical > integer > double > complex > character

This order makes sense:

  • the logicals TRUE and FALSE can be represented by numbers with TRUE = 1 and FALSE = 0.
    • when logicals are labelled as the character type, TRUE simply becomes "TRUE"
  • an integer can easily be represented by a double - 3 becomes 3.0
  • as we saw above, numbers can easily be represented as a character

The last thing to note is that you may see the type num:

str(c(1,2,3))
##  num [1:3] 1 2 3

num stands for numeric and is just the ‘number’ data types grouped together. R does this for two reasons. First, generally you don’t need to know exactly what ‘type’ of number class your data is. Second, R will often convert between number types depending the type of calculations/function performed.

Challenge 1.3

  1. What data type will the following vectors be?
# 1
c(TRUE, 3, 4)
# 2 
c("hello", TRUE, 4)
# 3
c(4, "5")
  1. Create a vector called my_vect3 that contains the following numbers: 1,1,2,3,5,8.
    • Convert this vector into the character data type. How can you tell that the values are now characters?

1.11 Vectorisation

R is quite efficient at doing calculations or performing a functions on a vector of data.

Let’s use the Count column from our dataframe genes. We can access just one column using the $ symbol:

genes$Count
##  [1]   0   4   0 768   0   1 810 431   1 452

What do you think will happen if we do this?

genes$Count + 10
##  [1]  10  14  10 778  10  11 820 441  11 462

R will perform this calculation on all numbers in the Count column and return a vector.

We can add this new vector to our dataframe, as a new column.

genes$Count_2 <- genes$Count + 10

There are a few things happening in the above code:

  1. The code on the right side of <- is evaluated first. It returns a vector as we saw above.
  2. This vector is assigned to a column in genes called Count_2. Since this column does not yet exist in the genes dataframe, a new column called Count_2 is created first. (If there was already a column name Count_2, this command will have overwritten that column with the new vector created in the right side of <-)

Note that you could also REPLACE a column using the same notation. For example, if, in the command above, the left side was genes$Count, the old column called Count would be REPLACED with the new vector of numbers created on the right side.

Challenge 1.2

Create a new column called Prop_Count that contains each count value as a proportion of the total count value of all 10 genes in the dataframe. E.g. if Count was 10 and total count of all 10 genes is 100, that row in Prop_Count should be 0.1.

Hint use the sum() function.

1.12 Writing out data

The last thing we will do this session is learn to write out data using the function write.table().

There are a few things we must tell write.table(), for it to be able to write out the data the way we want:

  • x - the name of this input is not very informative, but first you must tell the function what you want to write out. In our case we want to write out our dataframe genes.
  • file - the name of the file that we want to write to.
  • sep - how each value in our output file is separated. Common file formats are ‘csv’ and ‘tsv’ (discussed above). In R, a tab is represented by "\t".
  • row.names - this is either TRUE or FALSE, and let’s you specify whether you want to write out row names. If your dataframe does not have row names, put FALSE.
  • col.names - this is also either TRUE or FALSE, and let’s you specify whether you want to write out column names. If your dataframe has column names, put TRUE.

We can write out our genes dataframe into a .tsv file using the command below:

write.table(x = genes, file = "Ses1_Genes_output.tsv", sep = "\t",
            row.names = FALSE, col.names = TRUE)

1.13 Homework

Read in the data file “Ses1_homework.tsv” using read.delim(). This file is similar to the “Ses1_genes.tsv” file but with a different 10 genes.

Create a new column called Prop_GeneLength that contains the Count value as a proportion of GeneLength. E.g. if the count value was 50 and the gene length was 2400, the Prop_GeneLength value for that row would be 0.02083333.

Finally, write out this new dataframe as a file called “Ses1_homework_output.tsv”, as a tsv (tab separated values) file.

1.14 Answers

Challenge 1.1

my_df2 <- data.frame(
  dogs = c("Garfield", "Hello Kitty"),
  weight = c(3.5, 4.8)
)

Challenge 1.2

str(c(TRUE, 3, 4))
##  num [1:3] 1 3 4
str(c("hello", TRUE, 4))
##  chr [1:3] "hello" "TRUE" "4"
str(c(4, "5"))
##  chr [1:2] "4" "5"
my_vect3 <- c(1,1,2,3,5,8)
# convert to character
my_vect3 <- as.character(my_vect3)
# quotes around each value show that they are of the character type
my_vect3 
## [1] "1" "1" "2" "3" "5" "8"

Challenge 1.3

The sum() function can be used to calculate the total Count of all 10 rows by giving the genes$Count vector as the input to sum().

genes$Prop_Count <- genes$Count / sum(genes$Count) 
genes
##     SYMBOL GeneLength Count Count_2   Prop_Count
## 1  Gm10568       1634     0      10 0.0000000000
## 2  Gm19860        799     4      14 0.0016214025
## 3  Gm19938       3259     0      10 0.0000000000
## 4   Lypla1       2433   768     778 0.3113092825
## 5      Rp1       9747     0      10 0.0000000000
## 6    Sox17       3130     1      11 0.0004053506
## 7    Tcea1       2847   810     820 0.3283340089
## 8   Mrpl15       4203   431     441 0.1747061208
## 9     Xkr4       3634     1      11 0.0004053506
## 10   Rgs20       2241   452     462 0.1832184840