Tutorials

Learn how to use R.

Objectives and learning outcomes


  • Learn a powerful tool: R

  • Conduct and interpret simple analyses

  • NOT a course in statistics

  • IS primarily introductory and practical

  • Median salary of a data scientist in the US: $144,000

  • Demand for data scientists on indeed.com increased by 600% in the last 4 years

Grading


  • Problem sets (50%)

  • Seven days to complete problems.

  • Problem solutions should be submitted as a single PDF

  • Solution set must be submitted by email or in print before the Monday class

  • Final Project (40%)

R Software


  • Free from cran.r-project.org/

  • Fast

  • Flexible

  • Handles large data efficiently

  • Necessary for data scientists

  • Lots of free documentation

  • Open Source

Who uses R? Why?


  • Google - to calculate the ROI on advertising campaigns.

  • Ford - to improve the design of its vehicles.

  • Twitter - to monitor user experience and for data visualization.

  • US National Weather Service - to predict certain weather patterns.

  • New York Times - to create graphics and interactive data journalism applications.

  • Facebook - for behavior analysis related to status updates and profile pictures.

  • Uber - for statistical analysis.

  • Airbnb - to scale data science.

Quantitative approach to Business Analytics


  • Rationale
    – Compare variables found in the business world
    – Measure important features of the business world
    – Create measures that we can track over time

  • Quantity versus quality
    – Quantitative analysis is often used to supplement qualitative analyses
    – Same fundamental approach to knowledge
    – Anything comparable can be quantified
    – Quantity makes analysis reproducible

Stages of data analytics


Green hotels example
  1. The problem to be studied is reduced to a testable hypothesis. Example: which Vancouver hotels are greenest.
  2. An appropriate set of instruments, measures, or statistics is developed. Example: counting the number of green terms used in hotel reviews.
  3. The data are collected. Example: hotel reviews from TripAdvisor.
  4. The data are analyzed for their bearing on the initial hypotheses. Example: number of green keywords conditional on hotel.
  5. Results are interpreted and communicated. Example: A paper or report is written.
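
Steps 2 to 4 can be sketched in a few lines of base R. This is only an illustration: the three reviews, the hotel names, and the keyword list `green_terms` are made up for the example and are not the course data, and the function name `count_green` is hypothetical.

```r
# Hypothetical toy reviews (the real data are TripAdvisor hotel reviews)
reviews <- data.frame(
  hotel = c("Hotel A", "Hotel A", "Hotel B"),
  text  = c("Great recycling program and organic breakfast",
            "Nice view, friendly staff",
            "Organic soap in the room"),
  stringsAsFactors = FALSE
)

# Assumed list of green keywords (step 2: the measure)
green_terms <- c("recycling", "organic", "sustainable")

# Count how many words in one review are green keywords
count_green <- function(txt) {
  words <- strsplit(tolower(txt), "[^a-z]+")[[1]]
  sum(words %in% green_terms)
}

# Step 4: number of green keywords conditional on hotel
reviews$green <- vapply(reviews$text, count_green, integer(1), USE.NAMES = FALSE)
green_by_hotel <- tapply(reviews$green, reviews$hotel, sum)
green_by_hotel
```

A real analysis would also have to account for review length, since longer reviews contain more words of every kind.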

Explore the data


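# Set the working directory to the folder holding the data (this path is machine-specific)
setwd('/home/johan/docs/viu/pete/tripadvisor')
# Load the saved character vector of word pairs, then print it
word_pairs <- readRDS('word_pairs.RDS')
word_pairs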
##  [1] "front:desk"            "stanley:park"         
##  [3] "walking:distance"      "granville:island"     
##  [5] "highly:recommend"      "robson:street"        
##  [7] "within:walking"        "downtown:vancouver"   
##  [9] "great:location"        "english:bay"          
## [11] "one:night"             "canada:place"         
## [13] "two:nights"            "three:nights"         
## [15] "four:seasons"          "next:door"            
## [17] "hot:tub"               "pan:pacific"          
## [19] "minute:walk"           "well:appointed"       
## [21] "definitely:stay"       "free:wifi"            
## [23] "centrally:located"     "will:definitely"      
## [25] "desk:staff"            "five:star"            
## [27] "cruise:ship"           "customer:service"     
## [29] "even:though"           "air:conditioning"     
## [31] "gave:us"               "continental:breakfast"
## [33] "flat:screen"           "pacific:rim"          
## [35] "blue:horizon"          "convention:center"    
## [37] "rogers:arena"          "come:back"            
## [39] "top:notch"             "blocks:away"          
## [41] "coal:harbour"          "reasonably:priced"    
## [43] "coffee:maker"          "four:nights"          
## [45] "next:time"             "good:value"           
## [47] "go:back"               "friendly:staff"       
## [49] "make:sure"             "sutton:place"         
## [51] "pleasantly:surprised"  "living:room"          
## [53] "false:creek"           "well:equipped"        
## [55] "easy:walking"          "conveniently:located" 
## [57] "bedroom:suite"         "robson:st"            
## [59] "room:service"          "bc:place"
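
Each element printed above is a pair of words joined by a colon. As a small sketch of how such a vector might be explored, the code below retypes the first four pairs from the output by hand (so it runs without the RDS file) and splits them into their two words. The variable names `parts`, `first_word`, and `second_word` are illustrative.

```r
# First four pairs copied from the printed output above
word_pairs <- c("front:desk", "stanley:park", "walking:distance", "granville:island")

length(word_pairs)   # how many pairs we have
head(word_pairs, 2)  # look at the first two

# Split each pair at the colon
parts <- strsplit(word_pairs, ":", fixed = TRUE)
first_word  <- vapply(parts, `[`, character(1), 1)
second_word <- vapply(parts, `[`, character(1), 2)

# Which pairs contain the word "walking"?
word_pairs[grepl("walking", word_pairs, fixed = TRUE)]
```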

Find measures


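library(ggplot2)

# heat_data is assumed to be in long format: one row per hotel/word pair,
# with columns Var1 (hotel), Var2 (word) and value (rate per 100,000 words).
# RtoWrange() and WtoGrange() are colour palette helpers defined elsewhere.
ggplot(data=heat_data, aes(x=Var1, y=Var2, fill=value)) + geom_tile() +
     scale_fill_gradient2(low=RtoWrange(100), mid=WtoGrange(100), 
     high="green3") +
     ggtitle("Per 100,000 words \n 10 greenest hotels") +
     labs(x="Hotel",y="Words") +
     coord_fixed(ratio=1/3) +
     theme(axis.text.x  = element_text(angle=45, vjust=1, hjust=1, size=10))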
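
The plotting call above expects `heat_data` with columns `Var1`, `Var2`, and `value`, but the slides do not show how it was built. One possible way, sketched here with base R only, is to turn a hotel-by-word count matrix into a rate per 100,000 words and then reshape it to long format. The counts, word totals, and hotel names below are invented for illustration.

```r
# Hypothetical keyword counts: rows are hotels, columns are green words
counts <- matrix(c(3, 0, 1,
                   2, 4, 0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("Hotel A", "Hotel B"),
                                 c("green", "recycle", "organic")))

# Hypothetical total number of words in each hotel's reviews
total_words <- c("Hotel A" = 50000, "Hotel B" = 200000)

# Rate per 100,000 words; each row is divided by that hotel's total
rate <- counts / total_words * 1e5

# Long format: one row per hotel/word combination
heat_data <- as.data.frame(as.table(rate))
names(heat_data) <- c("Var1", "Var2", "value")
heat_data
```

Scaling by total words makes hotels with many reviews comparable to hotels with few.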

Data types


  • Nominal: Cases grouped into one and only one category; no relative numerical information even when numerical (e.g. student ID number)

  • Ordinal: Numbers assigned that rank units but give no indication of the magnitude of differences (e.g. preferences)

  • Interval: Numbers indicate exact difference between units and use constant units of measurement (e.g. time, money, age)
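
In R, these three data types are commonly stored as factors, ordered factors, and numeric vectors. The sketch below uses made-up values to show the difference; the variable names are illustrative.

```r
# Nominal: categories only, no order (IDs look numeric but carry no quantity)
student_id <- factor(c("1001", "1002", "1003"))

# Ordinal: ranked categories, but the gaps between ranks are not measured
preference <- factor(c("low", "high", "medium"),
                     levels = c("low", "medium", "high"),
                     ordered = TRUE)

# Interval: differences between values are meaningful
age <- c(21, 35, 28)

preference[1] < preference[2]  # comparisons work for ordered factors
diff(age)                      # exact differences between consecutive ages
```

Comparing two elements of `student_id` with `<` would return `NA` with a warning, because an unordered factor has no ranking.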

Install R


First (mandatory): Install R.
Fox @McMaster may be able to help.

Second (optional): Install RStudio, a GUI shell for R.
Paulson may help: @Paulson

Third: Visit Try R to learn how to write basic R code.

Last: Ask questions. Seek help at Stack Overflow, a reliable forum of questions and answers about R.

Packages we will use


dplyr: Simplifies how you can think about common data manipulation tasks.

install.packages('dplyr')

ggplot2: Powerful graphics language for creating elegant and complex plots.

install.packages('ggplot2')

XML: Reading and creating XML (and HTML) documents, both local and accessible via HTTP or FTP.

install.packages('XML')

twitteR: Provides an interface to the Twitter web API.

install.packages('twitteR')

quanteda: Managing and analyzing text

install.packages('quanteda')
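
Running `install.packages()` again for a package you already have simply reinstalls it. As a convenience, the sketch below checks which of the course packages are still missing before installing anything. The helper name `missing_packages` is made up for this example.

```r
course_pkgs <- c("dplyr", "ggplot2", "XML", "twitteR", "quanteda")

# Return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(course_pkgs)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them
}
```

Once a package is installed, load it in each new session with `library()`, for example `library(dplyr)`.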

Run some basic R commands


1+2
## [1] 3
7 - 12
## [1] -5
pi * 4^4
## [1] 804.2477
log(1)
## [1] 0
x <- -10:10      # assignment alone prints nothing
(x <- -10:10)    # wrapping it in parentheses also prints the result
##  [1] -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3   4   5   6
## [18]   7   8   9  10
y <- abs(x)                                 # absolute value of every element of x
plot(x, y, type = 'l', main = 'Victory!')   # line plot of y against x (a V shape)
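
The commands above work because R operates on whole vectors at once. A few more lines in the same spirit, reusing `x` and `y` as defined above:

```r
x <- -10:10
y <- abs(x)

length(x)     # number of elements in x
sum(y)        # adds up all the absolute values
x[y > 8]      # keeps only the elements whose absolute value exceeds 8
x^2 == y^2    # element-wise comparison, one result per element
```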