Open R Session

Author : Vincent D. Warmerdam

Blog: http://koaning.github.io

This document contains a notebook created at an open RStudio session in Amsterdam. It is free for all to use and refer to. It is meant as a sprint through R functionality to help you get started with free software, so that you aren’t dependent on software you need to pay for (SPSS/Excel/SAS).

Libraries used.

In this document I will assume that you have installed the dplyr and ggplot2 packages and that you have activated them. You can check this (and activate them) by running:

library(dplyr)
library(ggplot2)
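If library() complains that a package does not exist, it still needs to be installed. A minimal sketch for checking this, assuming the two packages named above are the only ones needed (the names needed, installed and missing_pkgs are made up for this example):

```r
# Packages this notebook relies on.
needed = c("dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
installed = sapply(needed, requireNamespace, quietly = TRUE)

# Names of the packages that still have to be installed.
missing_pkgs = needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Anything listed as missing can be installed with install.packages(), for example install.packages("dplyr").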

Variables

We can declare variables just like Excel cells.

a = 1 
b = 3 
a + b 
## [1] 4

We can also define a variable in terms of other variables, just like a formula in an Excel cell.

c = a + b 
c
## [1] 4

We can make it as complex as we want, just like in Excel. We’ll shortly see that this might not be the best way of doing things though.

one = 1 
two = 2 
three = 3 
four = 4 
five = 5 
total = one + two + three + four + five 
total
## [1] 15

Functions

You will have used a function before in Excel even though you might not have been aware of this.

max(one,5)
## [1] 5

Notice that you can also use variables that you’ve defined just like cells. Remember that four and five are now variables?

max(four, five)
## [1] 5

There are many other functions that Excel has that R also has. Some examples:

sqrt(2)
## [1] 1.414214
sum(1,2,3,4,5)
## [1] 15
log(2)
## [1] 0.6931472

R even has some variables predefined for you that might be useful.

pi
## [1] 3.141593

And if you want to, you can of course use a function within a function.

log(sqrt(2))
## [1] 0.3465736

Custom Functions

This is where Excel doesn’t excel (pun intended). In Excel you have a fixed list of useful functions; in R you can make your own.

Simple Example

Suppose that you are given a radius of a circle and you want to know the area of the circle.

circle_area = function(radius){
  resultaat = radius*radius*pi
  return(resultaat)
}

add = function(n1, n2){
  return(n1 + n2)
}

add(1,circle_area(2))
## [1] 13.56637
circle_area(1)
## [1] 3.141593

Just like that, you can create ANY function! You may not be able to appreciate the power that this gives you, but you may soon.
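Custom functions can also call other custom functions. A small sketch of that idea, where cylinder_volume is a made-up helper for this example and circle_area is repeated so the snippet runs on its own:

```r
# Area of a circle, as defined above.
circle_area = function(radius){
  return(radius * radius * pi)
}

# Hypothetical helper: volume of a cylinder = area of the base * height.
cylinder_volume = function(radius, height){
  return(circle_area(radius) * height)
}

cylinder_volume(1, 2)
## [1] 6.283185

# Arguments can also be passed by name, which is easier to read.
cylinder_volume(radius = 2, height = 0.5)
## [1] 6.283185
```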

Assignment

Try to create a function that calculates the amount of money in a savings account. Write a function that takes into account a starting amount, an interest rate and a number of years. When calling the function money(start=100, interest=0.03, years=1) it should give back 103.

Pipes

Let’s get a little bit philosophical about languages now. By typing code we are giving instructions to the computer in a way that the computer understands. But this is not by definition the way we understand language. Note that in a computer language, just like in a real language, we can usually explain things in two ways.

For example, take this bit of code:

sqrt(2)
## [1] 1.414214

We are reading it as: “take the square root of two”. Another way to describe this in human terms is “take the number two and calculate the square root of it”. R is a nice language, because it allows both human thoughts to be translated into a computer program.

2 %>% sqrt 
## [1] 1.414214

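Notice the similarity. This is just notation, but it makes code a bit easier to read. A slightly longer example (the result is NaN because log(sqrt(log(2))) is negative, and the square root of a negative number is not defined):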

sqrt(log(sqrt(log(2))))
## Warning in sqrt(log(sqrt(log(2)))): NaNs produced
## [1] NaN
2 %>% log %>% sqrt %>% log %>% sqrt
## Warning in sqrt(.): NaNs produced
## [1] NaN

With pipes it feels more as if I am reading a chain of commands that I give the computer from left to right, without having to write a very ugly nested function call.
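Pipes also work with functions that take extra arguments: the value on the left becomes the first argument of the function on the right. A short sketch, assuming dplyr is loaded for the %>% operator (the names piped and nested are made up for this example):

```r
library(dplyr)  # provides the %>% pipe

# Read from left to right: take 16, take the square root, take the log with base 2.
piped = 16 %>% sqrt %>% log(base = 2)
piped
## [1] 2

# The same calculation without pipes has to be read from the inside out.
nested = log(sqrt(16), base = 2)
nested
## [1] 2
```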

Types

So we have variables, which are basically names of objects. We can apply functions to these variables. But not every function will work on every variable. Just like in Excel, this would produce an error:

3 + 'three' 

This is because the computer doesn’t know how to add 3 (a number) to 'three' (a sequence of characters, also known as a ‘string’). Mistakes like this cause a lot of errors, so it is something you will need to be aware of: certain functions only work on certain types of variables.
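If you are unsure what type a value has, class() will tell you. The sketch below also uses tryCatch() to capture the error message instead of stopping; the variable name msg is made up, and the message text shown is what R prints in an English locale:

```r
class(3)
## [1] "numeric"
class('three')
## [1] "character"

# tryCatch() runs the expression and, on an error, hands the error to our function.
msg = tryCatch(3 + 'three', error = function(e) conditionMessage(e))
msg
## [1] "non-numeric argument to binary operator"
```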

Some examples of functions that work on characters:

paste("hello", "world")
## [1] "hello world"
a = 12
substr("mattie",a , 2)  # start (12) lies past the end of "mattie", so the result is empty
## [1] ""
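The second and third arguments of substr are the start and stop positions within the string. A few more examples of functions that work on characters, using the made-up variable name word:

```r
word = "mattie"   # example string

nchar(word)            # number of characters
## [1] 6
substr(word, 1, 3)     # characters 1 up to and including 3
## [1] "mat"
toupper(word)          # convert to upper case
## [1] "MATTIE"
paste(word, "says", "hello", sep = "-")   # glue strings together with a separator
## [1] "mattie-says-hello"
```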

You can also change the type of a variable (this is known as casting):

as.numeric("104.5")
## [1] 104.5
as.character(1)
## [1] "1"
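Casting is what makes a number stored as text usable in a calculation. It only works when the text actually looks like a number; a small sketch (the name not_a_number is made up for this example):

```r
# "3" is text, but after casting it can be added to a number.
3 + as.numeric("3")
## [1] 6

# Text that does not look like a number becomes NA, R's marker for a missing value.
# R normally prints a warning here; suppressWarnings() hides it for this example.
not_a_number = suppressWarnings(as.numeric("three"))
is.na(not_a_number)
## [1] TRUE
```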

Arrays

An array is another type of object we can use in R (R itself calls it a vector). It is basically a list of values, just like a row or a column in Excel. Arrays can be created easily.

c(1,2,3,4,5,6,7,8,9,0)
##  [1] 1 2 3 4 5 6 7 8 9 0
1:100
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17
##  [18]  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34
##  [35]  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51
##  [52]  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68
##  [69]  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85
##  [86]  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100
seq(1, 10, 0.1)
##  [1]  1.0  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9  2.0  2.1  2.2  2.3
## [15]  2.4  2.5  2.6  2.7  2.8  2.9  3.0  3.1  3.2  3.3  3.4  3.5  3.6  3.7
## [29]  3.8  3.9  4.0  4.1  4.2  4.3  4.4  4.5  4.6  4.7  4.8  4.9  5.0  5.1
## [43]  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9  6.0  6.1  6.2  6.3  6.4  6.5
## [57]  6.6  6.7  6.8  6.9  7.0  7.1  7.2  7.3  7.4  7.5  7.6  7.7  7.8  7.9
## [71]  8.0  8.1  8.2  8.3  8.4  8.5  8.6  8.7  8.8  8.9  9.0  9.1  9.2  9.3
## [85]  9.4  9.5  9.6  9.7  9.8  9.9 10.0
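Once you have an array you will often want to know how long it is, or pick out a few elements. A short sketch using square brackets, with steps as a made-up name for the last array shown:

```r
steps = seq(1, 10, 0.1)   # from 1 to 10 in steps of 0.1

length(steps)             # how many elements
## [1] 91
steps[1]                  # first element (R starts counting at 1)
## [1] 1
steps[2:4]                # elements 2 up to and including 4
## [1] 1.1 1.2 1.3
steps[length(steps)]      # last element
## [1] 10
```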

These arrays can also be stored in variables just like anything else. They can also have functions applied to them.

a = seq(1, 10, 0.1)
sqrt(a)
##  [1] 1.000000 1.048809 1.095445 1.140175 1.183216 1.224745 1.264911
##  [8] 1.303840 1.341641 1.378405 1.414214 1.449138 1.483240 1.516575
## [15] 1.549193 1.581139 1.612452 1.643168 1.673320 1.702939 1.732051
## [22] 1.760682 1.788854 1.816590 1.843909 1.870829 1.897367 1.923538
## [29] 1.949359 1.974842 2.000000 2.024846 2.049390 2.073644 2.097618
## [36] 2.121320 2.144761 2.167948 2.190890 2.213594 2.236068 2.258318
## [43] 2.280351 2.302173 2.323790 2.345208 2.366432 2.387467 2.408319
## [50] 2.428992 2.449490 2.469818 2.489980 2.509980 2.529822 2.549510
## [57] 2.569047 2.588436 2.607681 2.626785 2.645751 2.664583 2.683282
## [64] 2.701851 2.720294 2.738613 2.756810 2.774887 2.792848 2.810694
## [71] 2.828427 2.846050 2.863564 2.880972 2.898275 2.915476 2.932576
## [78] 2.949576 2.966479 2.983287 3.000000 3.016621 3.033150 3.049590
## [85] 3.065942 3.082207 3.098387 3.114482 3.130495 3.146427 3.162278
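Some functions, like sqrt, work on every element separately and give back an array of the same length. Others, like sum, collapse the whole array into a single number. A small sketch with a made-up array called squares:

```r
squares = c(1, 4, 9, 16)   # hypothetical example array

sqrt(squares)              # applied to each element
## [1] 1 2 3 4
squares * 2                # arithmetic is also done element by element
## [1]  2  8 18 32
squares + squares          # two arrays are added position by position
## [1]  2  8 18 32
sum(squares)               # collapses the array into one number
## [1] 30
mean(squares)
## [1] 7.5
```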

Notice that our variable a has now been overwritten. You can confirm this by looking at the environment tab of RStudio. This variable is no longer a number, but a list of numbers. Simple arrays that only have numbers in them can also be plotted very easily using the plot function.

# same as plot(sqrt(a), type='l'); type='l' draws a line instead of points
a %>% sqrt %>% plot(type='l')

We can also create an array with random numbers and use a histogram to draw them.

1000 %>% rnorm %>% hist

hist(rnorm(1000))

We can also perform some operations before we actually plot something. Also, note that functions such as length work on arrays too.

a = sqrt(seq(1, 500, 0.1))
b = rnorm(length(a))
plot(a + b)

Again, because of pipes, we could also do this:

a = seq(1, 500, 0.1) %>% sqrt
b = a %>% length %>% rnorm 
(a + b) %>% plot

For loops

Arrays are useful, but sometimes we don’t want to handle every element by hand. We want to automate! That’s the whole point of using a computer. For this, we could use a for loop.

arr = c(1,2,3,4,5,6)
for(i in arr){
  print(i*2)
}
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 12
for(i in seq(1,2,0.1)){
  for(j in arr){
    print(c(i,j))
  }
}
## [1] 1 1
## [1] 1 2
## [1] 1 3
## [1] 1 4
## [1] 1 5
## [1] 1 6
## [1] 1.1 1.0
## [1] 1.1 2.0
## [1] 1.1 3.0
## [1] 1.1 4.0
## [1] 1.1 5.0
## [1] 1.1 6.0
## [1] 1.2 1.0
## [1] 1.2 2.0
## [1] 1.2 3.0
## [1] 1.2 4.0
## [1] 1.2 5.0
## [1] 1.2 6.0
## [1] 1.3 1.0
## [1] 1.3 2.0
## [1] 1.3 3.0
## [1] 1.3 4.0
## [1] 1.3 5.0
## [1] 1.3 6.0
## [1] 1.4 1.0
## [1] 1.4 2.0
## [1] 1.4 3.0
## [1] 1.4 4.0
## [1] 1.4 5.0
## [1] 1.4 6.0
## [1] 1.5 1.0
## [1] 1.5 2.0
## [1] 1.5 3.0
## [1] 1.5 4.0
## [1] 1.5 5.0
## [1] 1.5 6.0
## [1] 1.6 1.0
## [1] 1.6 2.0
## [1] 1.6 3.0
## [1] 1.6 4.0
## [1] 1.6 5.0
## [1] 1.6 6.0
## [1] 1.7 1.0
## [1] 1.7 2.0
## [1] 1.7 3.0
## [1] 1.7 4.0
## [1] 1.7 5.0
## [1] 1.7 6.0
## [1] 1.8 1.0
## [1] 1.8 2.0
## [1] 1.8 3.0
## [1] 1.8 4.0
## [1] 1.8 5.0
## [1] 1.8 6.0
## [1] 1.9 1.0
## [1] 1.9 2.0
## [1] 1.9 3.0
## [1] 1.9 4.0
## [1] 1.9 5.0
## [1] 1.9 6.0
## [1] 2 1
## [1] 2 2
## [1] 2 3
## [1] 2 4
## [1] 2 5
## [1] 2 6

In these for-loops I am only printing values from the array. In the next one I am building a new array from the old one.

old_arr = c(1,2,3,4,5,6)
new_arr = c()
for(i in old_arr){
  new_arr = c(new_arr, i*i)
}
new_arr
## [1]  1  4  9 16 25 36
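For simple arithmetic like this, R can also do the work without a loop, because operations on arrays are applied to every element. A sketch of two loop-free alternatives that give the same result:

```r
old_arr = c(1,2,3,4,5,6)

# Multiplying an array by itself multiplies element by element.
new_arr = old_arr * old_arr
new_arr
## [1]  1  4  9 16 25 36

# sapply() applies a function to every element; handy when the
# operation is more involved than simple arithmetic.
sapply(old_arr, function(i) i * i)
## [1]  1  4  9 16 25 36
```

The for loop remains useful when each step depends on the result of the previous one.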

If statement

You might be able to imagine how useful for loops are. They become especially useful when combined with an if statement.

Try to imagine what this next bit of code does.

numbers = rnorm(100)
largest_num = -Inf  # start below every possible number
for(i in numbers){
  if(i > largest_num){
    largest_num = i
  }
}

Notice that this is something we could create a function for.

my_max = function(numbers){
  largest_num = -Inf  # start below every possible number, so negative inputs work too
  for(i in numbers){
    if(i > largest_num){
      largest_num = i
    }
  }
  return(largest_num)
}
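One detail worth checking in a loop like this is the starting value. The sketch below uses a made-up helper, largest_from, that takes the starting value as a parameter, and compares the outcome with R's built-in max on numbers that are all negative:

```r
# Hypothetical helper: the same loop, with the starting value as a parameter.
largest_from = function(numbers, start){
  largest_num = start
  for(i in numbers){
    if(i > largest_num){
      largest_num = i
    }
  }
  return(largest_num)
}

negatives = c(-3, -1, -2)

# Starting at 0: no element is larger than 0, so the answer is wrong.
largest_from(negatives, start = 0)
## [1] 0

# Starting at -Inf: every number is larger, so the first element always wins at first.
largest_from(negatives, start = -Inf)
## [1] -1

max(negatives)
## [1] -1
```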

Assignments

  1. Create a function that grabs the minimum from a list of numbers.
  2. Create a function that counts how often the number 2 occurs in a list of numbers.
  3. Create a function that counts how often a number a occurs in a list of numbers.
  4. Create a histogram of random numbers that are larger than 0.5.

Part 2 : DataFrames

An Excel file usually consists of rows and columns. Thus far we’ve only considered one-dimensional arrays. The power of R lies in something called the dataframe object, which is a variable that represents a table containing rows and columns. For many data analysis tasks, this object is the perfect abstraction.

R comes with some datasets out of the box that you can play with. Let’s play with one called ChickWeight. There are a few functions that are useful for a dataframe, as well as some selection functionality. Can you guess what the following commands do?

ChickWeight %>% head
##   weight Time Chick Diet
## 1     42    0     1    1
## 2     51    2     1    1
## 3     59    4     1    1
## 4     64    6     1    1
## 5     76    8     1    1
## 6     93   10     1    1
ChickWeight$weight %>% head
## [1] 42 51 59 64 76 93
ChickWeight %>% select(weight) %>% head
##   weight
## 1     42
## 2     51
## 3     59
## 4     64
## 5     76
## 6     93
ChickWeight$Diet %>% unique
## [1] 1 2 3 4
## Levels: 1 2 3 4
unique(ChickWeight$Diet) 
## [1] 1 2 3 4
## Levels: 1 2 3 4
ChickWeight %>% colnames
## [1] "weight" "Time"   "Chick"  "Diet"
ChickWeight[1:5,2:3]
##   Time Chick
## 1    0     1
## 2    2     1
## 3    4     1
## 4    6     1
## 5    8     1
ChickWeight[3] %>% head
##   Chick
## 1     1
## 2     1
## 3     1
## 4     1
## 5     1
## 6     1
ChickWeight[1,1]
## [1] 42
agg = ChickWeight %>% filter(Time == 21)
ggplot() + geom_point(data=ChickWeight, aes(Time, weight))

p = ggplot() + geom_histogram(data=agg, aes(weight, fill=Diet)) 
p + facet_grid(Diet ~ .)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
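The filter step used for the histogram can be chained with group_by and summarise to get one row of numbers per diet. A sketch, assuming dplyr is loaded; the names summary_tbl, n_chicks and mean_weight are made up for this example:

```r
library(dplyr)

# Number of chicks and their average weight per diet at Time 21.
summary_tbl = ChickWeight %>%
  filter(Time == 21) %>%          # keep only the rows measured at Time 21
  group_by(Diet) %>%              # treat each diet as a separate group
  summarise(n_chicks = n(),       # count the rows in each group
            mean_weight = mean(weight))

summary_tbl
```

The result is again a dataframe, so it can be passed on to ggplot just like agg.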

Remember that you can always prefix a function with a question mark to get the R help to help you out. For example: ?sqrt.

Plotting

ggplot2 is the best plotting library I’ve had the pleasure of using. Here are some examples of things you can do with it. For each plot, try to imagine what the plot looks like before plotting it.

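# local data file; adjust the path to wherever your copy lives
ns = read.csv("~/Development/notebooks/data/ns-storingen.csv")
pltr = ns %>% 
  filter(reason == "aanrijding met een persoon") %>%   # Dutch for "collision with a person"
  group_by(month = substr(startdate,6,7) %>% as.numeric) %>%   # characters 6 and 7 of startdate are used as the month
  summarise(n = n())   # number of rows per month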

ggplot() + geom_line(data=pltr, aes(month, n))

ggplot() + geom_bar(data=pltr, aes(month, n), stat="identity")
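The file used above is a local one, so here is a comparable sketch on the built-in ChickWeight data that anyone can run. It assumes dplyr and ggplot2 are loaded; growth, mean_weight and p are made-up names:

```r
library(dplyr)
library(ggplot2)

# Average weight for every combination of diet and time.
growth = ChickWeight %>%
  group_by(Diet, Time) %>%
  summarise(mean_weight = mean(weight))

# One line per diet, because colour is mapped to the Diet column.
p = ggplot() + geom_line(data = growth, aes(Time, mean_weight, colour = Diet))
p
```

Try to imagine what the plot looks like before running it: how many lines will there be?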