Tutorials

Learn how to use R.

Help


  • Official docs: Intro to R

  • ?function # gives you a help file for that function e.g.

?mean

Create objects


We store information in objects. Values are assigned to objects using the <- operator.

# This line will not be read by R.
# "#" tells R that what follows is a comment.
foo <- 3
bar <- c("Johan", "latulippe")
# list all objects in current session 
ls()
## [1] "args" "bar"  "file" "foo"
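
As a small illustrative sketch (the object names below are made up for this example and are not part of the tutorial data), assigned objects can be inspected, overwritten, and removed like this:

```r
# assign a number and a character vector (example names only)
height_cm <- 172
pets <- c("cat", "dog", "fish")

class(height_cm)   # type of the object: "numeric"
class(pets)        # "character"
length(pets)       # number of elements: 3

# objects can be overwritten by assigning again
height_cm <- height_cm + 1
height_cm          # 173

# rm() deletes an object from the session
rm(pets)
"pets" %in% ls()   # FALSE
```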

Create a Data Frame


A data frame is a table with rows (observations) and columns (variables).

# note how we put strings inside '' or "", 
# we don't do this with numbers unless we want numbers to be treated as text 
my_dat <- data.frame(x = 1:4, y = (1:4)^2, name = c("bob","julie","gil","jane"), 
    letter = LETTERS[3:6],numbers = c( 2, 33, 65, 1),
    numastext = c("1","2","4","22"))
my_dat
##   x  y  name letter numbers numastext
## 1 1  1   bob      C       2         1
## 2 2  4 julie      D      33         2
## 3 3  9   gil      E      65         4
## 4 4 16  jane      F       1        22
my_dat$numbers * 2
## [1]   4  66 130   2
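# my_dat$numastext * 2  # fails: numastext is not numeric
# if numastext is a factor (as in the str() output later in this tutorial),
# as.numeric() alone returns the level codes, so convert to character first
as.numeric( as.character( my_dat$numastext ) ) * 2
## [1]  2  4  8 44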
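
The difference between text that looks like numbers and a factor that looks like numbers is easy to miss. The sketch below uses a stand-alone factor `f` (a name chosen for this example) to show what as.numeric() returns in each case:

```r
# a factor whose labels look like numbers (illustrative values)
f <- factor(c("1", "2", "4", "22"))

levels(f)                     # labels sorted as text: "1" "2" "22" "4"
as.numeric(f)                 # internal level codes: 1 2 4 3
as.numeric(as.character(f))   # the intended numbers: 1 2 4 22

# a plain character vector converts directly
as.numeric(c("1", "2", "4", "22"))   # 1 2 4 22
```

Converting through as.character() is safe in both cases, which is why it is a reasonable default when you are unsure how a column was stored.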

Adding Variables


# let's add a variable (a new column)
my_dat$z <- my_dat$x * my_dat$y
# and another variable
my_dat$small_letter <- tolower( my_dat$letter ) 
my_dat
##   x  y  name letter numbers numastext  z small_letter
## 1 1  1   bob      C       2         1  1            c
## 2 2  4 julie      D      33         2  8            d
## 3 3  9   gil      E      65         4 27            e
## 4 4 16  jane      F       1        22 64            f
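
As an alternative sketch, base R's transform() can add several variables in one call. The data frame `d` below is a small stand-in, not the tutorial's my_dat:

```r
# a small stand-in data frame
d <- data.frame(x = 1:4, y = (1:4)^2)

# transform() returns a copy with the new columns added
d2 <- transform(d, z = x * y, ratio = y / x)
d2$z       # 1 8 27 64
d2$ratio   # 1 2 3 4

# assigning NULL removes a column again
d2$ratio <- NULL
names(d2)  # "x" "y" "z"
```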

Load and Save


read.csv() reads data stored as comma-separated text files, either local or remote.

terror <- read.csv("http://latul.be/mba_565/data/syria.csv") # remote data

# The following is specific to your system. The line below would not
# work on your system.
# terror <- read.csv("/home/johan/foo_bar.csv") # local data

  • List the available variables

  • Select rows 17 to 22 for three selected variables

All the values of the same variable are stacked in a single column. The output below shows six rows from a data frame with n rows and k variables. Think of each row as an observation and each column as a variable.

names(terror)
## [1] "date"        "city"        "perpetrator" "target"      "attack"     
## [6] "weapon"      "fatalities"
terror[17:22, c("city", "perpetrator", "fatalities")]
##      city perpetrator fatalities
## 17 Kobani        ISIS          8
## 18 Kobani        ISIS          8
## 19 Kobani        ISIS          8
## 20 Kobani        ISIS          5
## 21  other   Al-Nusrah          1
## 22 Raqqah        ISIS          2

Saving data

Save data locally:

# The following is specific to your system. The line below would not
# work on your system. Choose a location that agrees with your system.
write.csv(my_dat, file = "/home/johan/foobar.csv")
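
A sketch of a full write-and-read round trip that should run on any system, assuming only base R. It uses tempfile() instead of a fixed location, and the data frame `d` is example data:

```r
# small example data (stand-in for my_dat)
d <- data.frame(id = 1:3, city = c("Aleppo", "Homs", "Raqqah"),
                stringsAsFactors = FALSE)

# tempfile() gives a writable path on any system
csv_path <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing the row numbers as an extra first column
write.csv(d, file = csv_path, row.names = FALSE)

# read the file back and compare it with what we saved
d_back <- read.csv(csv_path, stringsAsFactors = FALSE)
names(d_back)          # "id" "city"
all.equal(d, d_back)
```

Without row.names = FALSE, the row numbers are written as an unnamed first column, which read.csv() brings back as a column called X.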

If you wish to save all the R objects in your session to your computer, you can use the save.image() command. For example:

# The file will be saved in your working directory.
save.image("my_session.rdata")
# Then later
load("my_session.rdata")
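
When only one or two objects matter, saving the whole session may be more than needed. The following sketch uses save() and saveRDS() from base R on an example object named `scores`, with temporary files so that it runs anywhere:

```r
scores <- c(a = 5, b = 2, c = 9)      # example object
rdata_path <- tempfile(fileext = ".rdata")

save(scores, file = rdata_path)       # save only this object
rm(scores)                            # remove it from the session
load(rdata_path)                      # restores it under its original name
scores

# saveRDS()/readRDS() store one object and let you choose the name on reading
rds_path <- tempfile(fileext = ".rds")
saveRDS(scores, rds_path)
scores_copy <- readRDS(rds_path)
identical(scores, scores_copy)
```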

Absolute and Relative Path


Absolute path: a path that starts from the root of your computer's file system.

Relative path: a path that starts from the working directory of your R session.

But where am I?

A file saved with write.csv() or save.image() using a relative path (a bare file name, for example) is written to the working directory, whose absolute path is returned by the function getwd().

getwd()
## [1] "/home/johan/docs/viu/mba_565/tutorials"

The above outputs the absolute path of your R working directory. Note that you can change your working directory with:

setwd('/home/johan/')
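
To make the link between relative and absolute paths concrete, here is a minimal sketch using file.path() and basename() from base R. The folder name "data" and the file name are hypothetical:

```r
wd <- getwd()                         # absolute path of the working directory

# file.path() joins path pieces with "/" (R accepts "/" on Windows too)
rel_path <- file.path("data", "foo.csv")   # relative path; "data" is a hypothetical folder
abs_path <- file.path(wd, rel_path)        # the absolute path R would resolve it to

rel_path                # "data/foo.csv"
basename(abs_path)      # "foo.csv"
file.exists(abs_path)   # worth checking before trying to read a file
```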

Windows Path

A typical path in Windows looks like:

“C:\Documents and Settings\Windows\My Documents\foo.r”

Note to Windows users: inside an R string the backslash is the escape character, so every backslash in a path must itself be escaped, i.e. written as '\\'. Spaces do not need to be escaped. Forward slashes also work in R on Windows. Thus, in R, Windows users would type

save.image( "C:\\Documents and Settings\\Windows\\My Documents\\foo.r" )

To copy the full path for an individual file, hold down the Shift key as you right-click the file, and then choose Copy As Path.
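
The sketch below illustrates how R stores a Windows-style path written with doubled backslashes, and that the forward-slash form names the same location. The folder "example_user" is a made-up name:

```r
# "example_user" is a made-up folder name
p1 <- "C:\\Users\\example_user\\My Documents\\foo.r"   # doubled backslashes
p2 <- "C:/Users/example_user/My Documents/foo.r"       # forward slashes

cat(p1, "\n")   # cat() shows the string as R stores it, with single backslashes
nchar("\\")     # 1: the two typed characters are one backslash

# replacing each backslash with "/" gives the forward-slash form
gsub("\\", "/", p1, fixed = TRUE) == p2   # TRUE
```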

Mac Path

A typical path on Mac looks like:

'/Users/Mac/Desktop/foo.r'

Select the file or folder in the OS X Finder, then hit Command+i to summon Get Info. The absolute path is displayed alongside “Where”.

Descriptive Functions


# Remember that the hash, '#', is telling R not to read the line
n <- 5
head(terror, n) # you can omit the n, default n = 6: head(terror)
##         date   city perpetrator                       target
## 1 2014-12-29  other        ISIS                     Military
## 2 2014-12-28 Kobani        ISIS Terrorists/Non-state Militia
## 3 2014-12-28 Kobani        ISIS  Private Citizens & Property
## 4 2014-12-27  other        ISIS                     Military
## 5 2014-12-24 Raqqah        ISIS                     Military
##                        attack                    weapon fatalities
## 1           Bombing/Explosion Explosives/Bombs/Dynamite          9
## 2           Bombing/Explosion Explosives/Bombs/Dynamite          4
## 3           Bombing/Explosion Explosives/Bombs/Dynamite          8
## 4                     Unknown                   Unknown          0
## 5 Hostage Taking (Kidnapping) Explosives/Bombs/Dynamite          1
tail(terror, n) # you can omit the n, default n = 6: tail(terror)
##           date     city perpetrator               target            attack
## 244 2012-03-17 Damascus   Al-Nusrah Government (General) Bombing/Explosion
## 245 2012-03-17 Damascus   Al-Nusrah Government (General) Bombing/Explosion
## 246 2012-02-10   Aleppo   Al-Nusrah               Police Bombing/Explosion
## 247 2012-02-10   Aleppo   Al-Nusrah             Military Bombing/Explosion
## 248 2012-01-06 Damascus   Al-Nusrah               Police Bombing/Explosion
##                        weapon fatalities
## 244 Explosives/Bombs/Dynamite         14
## 245 Explosives/Bombs/Dynamite         14
## 246 Explosives/Bombs/Dynamite         15
## 247 Explosives/Bombs/Dynamite         15
## 248 Explosives/Bombs/Dynamite         26
names(terror) # the variable names
## [1] "date"        "city"        "perpetrator" "target"      "attack"     
## [6] "weapon"      "fatalities"
nrow(terror) # number of observations
## [1] 248
ncol(terror) # number of variables
## [1] 7
# Using dim, we get the number of observations(rows) 
# and variables(columns) in dataframe.
dim(terror)
## [1] 248   7
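
Numeric variables are often summarised next, and missing values (NA) need some care. A minimal sketch on a small made-up data frame `d`, using only base R:

```r
# made-up data with one missing value
d <- data.frame(city = c("Aleppo", "Homs", "Homs", "Raqqah"),
                fatalities = c(4, NA, 10, 1), stringsAsFactors = FALSE)

dim(d)                            # 4 2
summary(d$fatalities)             # five-number summary, mean, and NA count
mean(d$fatalities)                # NA: one missing value makes the mean missing
mean(d$fatalities, na.rm = TRUE)  # 5: missing values removed first
sum(is.na(d$fatalities))          # 1 missing value
```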

Functions

# levels of a factor
levels( terror$weapon )
## [1] ""                          "Chemical"                 
## [3] "Explosives/Bombs/Dynamite" "Firearms"                 
## [5] "Incendiary"                "Melee"                    
## [7] "Other"                     "Sabotage Equipment"       
## [9] "Unknown"
str( terror )
## 'data.frame':    248 obs. of  7 variables:
##  $ date       : Factor w/ 163 levels "2012-01-06","2012-02-10",..: 163 162 162 161 160 159 159 158 157 157 ...
##  $ city       : Factor w/ 10 levels "Aleppo","Damascus",..: 8 7 7 8 9 10 10 10 8 8 ...
##  $ perpetrator: Factor w/ 2 levels "Al-Nusrah","ISIS": 2 2 2 2 2 2 2 2 1 1 ...
##  $ target     : Factor w/ 14 levels "","Business",..: 7 13 11 7 7 11 7 11 7 7 ...
##  $ attack     : Factor w/ 9 levels "","Armed Assault",..: 4 4 4 9 7 7 9 7 4 7 ...
##  $ weapon     : Factor w/ 9 levels "","Chemical",..: 3 3 3 9 3 6 9 6 3 3 ...
##  $ fatalities : int  9 4 8 0 1 1 22 4 NA 90 ...
# relabel and new variable
weapon_strength <- terror$weapon
levels(weapon_strength) <- 1:9
terror$weapon_strength <- as.numeric(weapon_strength)
# telling R that we use the variables from "terror"
# then no need to specify dataframe, 
# but only variable name
attach(terror)
table(weapon) # as opposed to table( terror$weapon ), shown next
## weapon
##                                            Chemical 
##                         1                         1 
## Explosives/Bombs/Dynamite                  Firearms 
##                       155                        24 
##                Incendiary                     Melee 
##                         2                        11 
##                     Other        Sabotage Equipment 
##                         1                         1 
##                   Unknown 
##                        52
table( terror$weapon )
## 
##                                            Chemical 
##                         1                         1 
## Explosives/Bombs/Dynamite                  Firearms 
##                       155                        24 
##                Incendiary                     Melee 
##                         2                        11 
##                     Other        Sabotage Equipment 
##                         1                         1 
##                   Unknown 
##                        52
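
attach() is convenient, but attached variables can be hidden by objects of the same name in your session. One possible alternative is with(), sketched here on a made-up data frame `d`:

```r
# made-up data
d <- data.frame(weapon = c("Firearms", "Melee", "Firearms", "Unknown"),
                fatalities = c(3, 1, 7, 0), stringsAsFactors = FALSE)

# with() evaluates an expression using the columns of d, without attach()
counts <- with(d, table(weapon))
counts[["Firearms"]]             # 2

# same counts as spelling out the data frame
all(counts == table(d$weapon))   # TRUE

with(d, mean(fatalities))        # 2.75
```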

Using, accessing data


Individual rows, columns, and cells in a data frame can be accessed through many methods of indexing.

We most commonly use object[row, column] notation.

# display the second variable (column) of third observation (row)
terror[3, 2]
## [1] Kobani
## 10 Levels: Aleppo Damascus Daraa Hamah Homs Idlib Kobani other ... Unknown
# omitting row value implies all rows; here all rows in column 3
terror[, 3]
##   [1] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
##   [8] ISIS      Al-Nusrah Al-Nusrah Al-Nusrah ISIS      ISIS      ISIS     
##  [15] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      Al-Nusrah
##  [22] ISIS      ISIS      Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
##  [29] Al-Nusrah Al-Nusrah ISIS      ISIS      ISIS      ISIS      Al-Nusrah
##  [36] Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah ISIS      ISIS      ISIS     
##  [43] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
##  [50] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
##  [57] ISIS      ISIS      ISIS      Al-Nusrah ISIS      ISIS      ISIS     
##  [64] Al-Nusrah Al-Nusrah ISIS      ISIS      ISIS      ISIS      ISIS     
##  [71] Al-Nusrah ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
##  [78] ISIS      ISIS      ISIS      Al-Nusrah ISIS      Al-Nusrah ISIS     
##  [85] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
##  [92] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
##  [99] Al-Nusrah ISIS      ISIS      ISIS      Al-Nusrah ISIS      ISIS     
## [106] ISIS      ISIS      Al-Nusrah ISIS      Al-Nusrah ISIS      ISIS     
## [113] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
## [120] ISIS      Al-Nusrah ISIS      Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [127] Al-Nusrah Al-Nusrah Al-Nusrah ISIS      Al-Nusrah ISIS      Al-Nusrah
## [134] Al-Nusrah ISIS      ISIS      Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [141] Al-Nusrah ISIS      ISIS      ISIS      ISIS      ISIS      Al-Nusrah
## [148] ISIS      Al-Nusrah ISIS      ISIS      ISIS      Al-Nusrah ISIS     
## [155] ISIS      ISIS      ISIS      ISIS      ISIS      ISIS      ISIS     
## [162] ISIS      ISIS      ISIS      ISIS      ISIS      Al-Nusrah Al-Nusrah
## [169] ISIS      ISIS      ISIS      ISIS      ISIS      Al-Nusrah Al-Nusrah
## [176] Al-Nusrah Al-Nusrah ISIS      ISIS      ISIS      ISIS      Al-Nusrah
## [183] Al-Nusrah Al-Nusrah ISIS      ISIS      Al-Nusrah ISIS      ISIS     
## [190] Al-Nusrah Al-Nusrah ISIS      ISIS      ISIS      ISIS      ISIS     
## [197] Al-Nusrah Al-Nusrah Al-Nusrah ISIS      Al-Nusrah Al-Nusrah Al-Nusrah
## [204] ISIS      Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah ISIS     
## [211] Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [218] Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [225] Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [232] Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [239] Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah Al-Nusrah
## [246] Al-Nusrah Al-Nusrah Al-Nusrah
## Levels: Al-Nusrah ISIS
# omitting column values implies all columns; here all columns (variables) in row (observation) 2 
terror[2, ]
##         date   city perpetrator                       target
## 2 2014-12-28 Kobani        ISIS Terrorists/Non-state Militia
##              attack                    weapon fatalities weapon_strength
## 2 Bombing/Explosion Explosives/Bombs/Dynamite          4               3
# can also use ranges - rows 2 and 3, columns 2 and 3 
terror[2:3, 2:3]
##     city perpetrator
## 2 Kobani        ISIS
## 3 Kobani        ISIS
# can also use vectors of indices - rows 2 and 4, columns 2, 4, and 6 
terror[c(2, 4), c(2, 4, 6)]
##     city                       target                    weapon
## 2 Kobani Terrorists/Non-state Militia Explosives/Bombs/Dynamite
## 4  other                     Military                   Unknown

We can also access variables directly by using their names, either with object[, "variable"] notation or object$variable notation.

# get rows 5 to 10 of variable weapon using three methods 
terror[5:10, "weapon"]
## [1] Explosives/Bombs/Dynamite Melee                    
## [3] Unknown                   Melee                    
## [5] Explosives/Bombs/Dynamite Explosives/Bombs/Dynamite
## 9 Levels:  Chemical Explosives/Bombs/Dynamite Firearms ... Unknown
# same result with
terror$weapon[5:10] # "$" only works on variables (columns) of a data.frame
## [1] Explosives/Bombs/Dynamite Melee                    
## [3] Unknown                   Melee                    
## [5] Explosives/Bombs/Dynamite Explosives/Bombs/Dynamite
## 9 Levels:  Chemical Explosives/Bombs/Dynamite Firearms ... Unknown
# or ...
terror[5:10, 6]
## [1] Explosives/Bombs/Dynamite Melee                    
## [3] Unknown                   Melee                    
## [5] Explosives/Bombs/Dynamite Explosives/Bombs/Dynamite
## 9 Levels:  Chemical Explosives/Bombs/Dynamite Firearms ... Unknown
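
A few further indexing forms are sketched below on a small made-up data frame `d`. They rely only on base R behaviour; the data are illustrative:

```r
# made-up data
d <- data.frame(city = c("Kobani", "other", "Raqqah"),
                fatalities = c(8, 1, 2), stringsAsFactors = FALSE)

d[2, "city"]                            # a single cell: "other"
d[["fatalities"]]                       # a column as a vector, like d$fatalities
class(d[, "fatalities"])                # "numeric": one column is simplified to a vector
class(d[, "fatalities", drop = FALSE])  # "data.frame": drop = FALSE keeps the table shape
d[-1, ]                                 # negative index: all rows except the first
nrow(d[-1, ])                           # 2
```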

Relational Operators


Operator   Description
<          Less than
<=         Less than or equal to
==         Equal to
!=         Not equal to
a | b      a or b
a & b      a and b

my_dat$x == 2
## [1] FALSE  TRUE FALSE FALSE
name_list <- c("bob", "gil", "rex")
my_dat$name != "bob"
## [1] FALSE  TRUE  TRUE  TRUE
1 == 1 | 1 == 2 # TRUE if statement a or statement b is TRUE
## [1] TRUE
1 == 1 & 1 == 2 # TRUE only if statements a and b are both TRUE
## [1] FALSE
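
To test membership in a set of values such as name_list, base R provides the %in% operator. The sketch below uses a stand-alone vector `name` holding the same names as my_dat, plus a small vector `x` for combining conditions:

```r
name <- c("bob", "julie", "gil", "jane")    # same names as in my_dat
name_list <- c("bob", "gil", "rex")

name %in% name_list            # TRUE FALSE TRUE FALSE
name[name %in% name_list]      # "bob" "gil"
!(name %in% name_list)         # FALSE TRUE FALSE TRUE

# combining conditions element by element
x <- 1:4
x > 1 & x < 4                  # FALSE TRUE TRUE FALSE
x == 1 | x == 4                # TRUE FALSE FALSE TRUE
```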

Conditional index

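# TRUE wherever a value is less than 3 (a logical matrix with the shape of my_dat)
index <- my_dat < 3
# replacing numbers with text converts the affected columns to character
my_dat[ index ] <- "small digit"
# look at our data.frame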
str( my_dat )
## 'data.frame':    4 obs. of  8 variables:
##  $ x           : chr  "small digit" "small digit" "3" "4"
##  $ y           : chr  "small digit" "4" "9" "16"
##  $ name        : Factor w/ 4 levels "bob","gil","jane",..: 1 4 2 3
##  $ letter      : Factor w/ 4 levels "C","D","E","F": 1 2 3 4
##  $ numbers     : chr  "small digit" "33" "65" "small digit"
##  $ numastext   : Factor w/ 4 levels "1","2","22","4": 1 2 4 3
##  $ z           : chr  "small digit" "8" "27" "64"
##  $ small_letter: chr  "c" "d" "e" "f"
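
Because overwriting numbers with text changes the column type, a gentler pattern is often preferable. The sketch below, on a stand-in data frame `d`, either records the condition in a new column or replaces values with another number so the column stays numeric:

```r
# stand-in data, not the tutorial's my_dat
d <- data.frame(x = 1:4, numbers = c(2, 33, 65, 1))

# flag small values in a new column instead of overwriting the numbers
d$size <- ifelse(d$numbers < 3, "small", "large")
d$size                 # "small" "large" "large" "small"

# replace values in one numeric column with another number
d$numbers[d$numbers < 3] <- 0
d$numbers              # 0 33 65 0
is.numeric(d$numbers)  # TRUE: the column keeps its type
```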

Subset


Conditional grouping: subsetting

If we want conditional summaries, for example only for those attacks with a high level of fatalities (fatalities >= 50), we first subset the data using subset, then summarize.

R permits nested function calls, where the result of one function is passed directly as an argument to another function. Here, subset returns a dataset containing the observations where fatalities >= 50. This data subset is then passed to summary to obtain distributions of the variables in the subset.

summary( subset(terror, fatalities >= 50) )
##          date         city      perpetrator
##  2014-04-29:2   other   :9   Al-Nusrah:12  
##  2014-12-14:2   Homs    :2   ISIS     : 4  
##  2012-05-10:1   Aleppo  :1                 
##  2012-11-05:1   Damascus:1                 
##  2013-02-06:1   Idlib   :1                 
##  2013-05-22:1   Raqqah  :1                 
##  (Other)   :8   (Other) :1                 
##                           target                                   attack 
##  Military                    :10   Bombing/Explosion                  :6  
##  Private Citizens & Property : 2   Hostage Taking (Kidnapping)        :5  
##  Terrorists/Non-state Militia: 2   Armed Assault                      :4  
##  Business                    : 1   Hostage Taking (Barricade Incident):1  
##  Government (General)        : 1                                      :0  
##                              : 0   Assassination                      :0  
##  (Other)                     : 0   (Other)                            :0  
##                        weapon     fatalities     weapon_strength
##  Explosives/Bombs/Dynamite:14   Min.   : 50.00   Min.   :3.000  
##  Firearms                 : 2   1st Qu.: 56.25   1st Qu.:3.000  
##                           : 0   Median : 70.50   Median :3.000  
##  Chemical                 : 0   Mean   :117.44   Mean   :3.125  
##  Incendiary               : 0   3rd Qu.:111.75   3rd Qu.:3.000  
##  Melee                    : 0   Max.   :517.00   Max.   :4.000  
##  (Other)                  : 0
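
subset() and plain logical indexing treat missing values differently, which matters for a variable like fatalities that contains NA. A minimal sketch on made-up data:

```r
# made-up data with one missing value
d <- data.frame(city = c("Homs", "Aleppo", "other", "Idlib"),
                fatalities = c(90, NA, 12, 50), stringsAsFactors = FALSE)

high <- subset(d, fatalities >= 50)     # rows where the condition is NA are dropped
nrow(high)                              # 2

# plain logical indexing keeps an all-NA row for the missing value
nrow(d[d$fatalities >= 50, ])           # 3

# which() ignores NA, so this matches subset()
nrow(d[which(d$fatalities >= 50), ])    # 2

# subset() can also pick columns with the select argument
subset(d, fatalities >= 50, select = city)
```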

Group wise operation


Split the data

attach(terror)
damascus <- fatalities[city == "Damascus"] # split (data for Damascus only)
aleppo <- fatalities[city == "Aleppo"] # split (data for Aleppo only)

Apply operation over the split data

mean_damascus <- mean(damascus, na.rm = TRUE) # apply 
mean_aleppo <- mean(aleppo, na.rm = TRUE) # apply

Group your results

data.frame("city" = c("Damascus","Aleppo"), "mean" = c(mean_damascus, mean_aleppo) ) # combine
##       city     mean
## 1 Damascus 12.66667
## 2   Aleppo 13.16667

Split-Apply-Combine (one liner)

We can separate the data in other ways, such as by groups. Let's look at the mean for each city.

Here, we are asking the by function to apply the mean function to each city.

# get mean by city
# first argument is the data to which we apply the function
# second argument is the grouping (here we group by cities)
# last is the function to apply
tmp <- by(terror$fatalities, terror$city, function(x) mean(x, na.rm = TRUE) )
barplot(tmp)

barplot(table(city), main="Example: number of incidents per city")

or

barplot(table(perpetrator,city),beside=TRUE,legend=TRUE,
    main="Number of incidents per city conditional on perpetrator")
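
Besides by(), base R offers tapply() and aggregate() for the same split-apply-combine pattern. The following sketch uses a small made-up data frame `d` rather than the terror data:

```r
# made-up data with one missing value
d <- data.frame(city = c("Damascus", "Aleppo", "Damascus", "Aleppo", "Aleppo"),
                fatalities = c(10, 4, 14, NA, 6), stringsAsFactors = FALSE)

# tapply(values, groups, function, extra arguments passed to the function)
means <- tapply(d$fatalities, d$city, mean, na.rm = TRUE)
means[["Aleppo"]]     # 5
means[["Damascus"]]   # 12

# aggregate() returns the same summary as a data frame
# (with the formula interface, rows containing NA are dropped by default)
agg <- aggregate(fatalities ~ city, data = d, FUN = mean)
agg
```

tapply() returns a named array, which is handy for barplot(), while aggregate() returns a data frame that is easier to merge or save.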

Frequencies


Proportions and Percentages

  • Proportion: compares the number of cases in a category with the total number of cases.

\[ \text{proportion} = \frac{f}{N} = \frac{\text{number of attacks on a city}}{\text{total number of attacks}} \]

So for Damascus, \(P = 207/832 = 0.249\).

Proportions are always between 0 and 1.

  • Percentage: multiply the proportion by 100 to obtain a percentage.

Create a frequency table

freq <- c(5,2,9) # a vector of frequencies
names(freq) <- c("a","b","c") # naming each element
barplot(freq) # visualize
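
Proportions and percentages follow directly from a frequency vector by applying f / N. A short sketch, reusing the frequencies above and adding a made-up vector of raw observations `obs`:

```r
freq <- c(a = 5, b = 2, c = 9)   # same frequencies as above

prop <- freq / sum(freq)         # proportion = f / N
prop                             # a = 0.3125, b = 0.125, c = 0.5625
sum(prop)                        # 1: proportions always add up to 1
100 * prop                       # percentages: 31.25 12.5 56.25

# starting from raw observations, table() counts and prop.table() divides by N
obs <- c("a", "b", "a", "c")     # made-up observations
prop.table(table(obs))           # a = 0.5, b = 0.25, c = 0.25
```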

Suppose we would like to split our dataset into smaller datasets, for example one for each of two cities, Damascus and Aleppo, and one for attacks with few fatalities. Here we use the subset function to create 3 subsets, and we store each subset in a new object.

damascus <- subset( terror, city == "Damascus" )
aleppo <- subset( terror, city == "Aleppo" )
# Attacks with fewer than 3 fatalities.
low_fatalities <- subset( terror, fatalities < 3 )

Keep only three variables

t_small <- terror[, c("target","weapon","fatalities")]
head(t_small)
t_small <- terror[, -c(1,3,5)] # use "-" to remove variable 1, 3, and 5
head(t_small)
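
Selecting by position breaks if the column order changes, so selecting or dropping by name can be more robust. A sketch on a one-row made-up data frame `d`, using setdiff() and the select argument of subset():

```r
# made-up one-row data frame with five variables
d <- data.frame(date = "2014-12-29", city = "other", target = "Military",
                weapon = "Unknown", fatalities = 9, stringsAsFactors = FALSE)

keep <- c("target", "weapon", "fatalities")
names(d[, keep])                                  # "target" "weapon" "fatalities"

# drop columns by name: "-" needs positions, so use setdiff() on the names
names(d[, setdiff(names(d), c("date", "city"))])  # same three names

# subset() accepts unquoted names, with "-" to exclude
names(subset(d, select = -c(date, city)))         # same three names
```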

Histogram (bar graph)

Example: number of incidents per city

terror <- read.csv( 'http://latul.be/mba_565/data/syria.csv' )

barplot(table( terror$city ))

Comparison Graph

Number of incidents per city conditional on perpetrator

Note that the first argument in table(), ‘perpetrator’, takes only two values, ‘Al-Nusrah’ or ‘ISIS’.

barplot(table( terror$perpetrator, terror$city ),
    beside=TRUE, legend=TRUE)
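
It can help to look at the two-way table itself before plotting it. The sketch below builds one from two short made-up vectors (the values are illustrative, not taken from the dataset):

```r
# made-up observations
perpetrator <- c("ISIS", "ISIS", "Al-Nusrah", "ISIS", "Al-Nusrah")
city        <- c("Kobani", "Raqqah", "Aleppo", "Kobani", "Kobani")

tab <- table(perpetrator, city)  # rows: perpetrator, columns: city
tab["ISIS", "Kobani"]            # 2
dim(tab)                         # 2 3
colSums(tab)                     # incidents per city: Aleppo 1, Kobani 3, Raqqah 1

# share of each perpetrator within a city (each column sums to 1)
prop.table(tab, margin = 2)
```

In barplot(), each column of the table becomes one group of bars, and with beside=TRUE each row becomes one bar within the group.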