Problems

Hone your R skills by doing problems.

Please attempt all questions.

Scrape the data table from the Wikipedia page "List of monorail systems" and create a data frame. Clean up the data, e.g. rename variables.


Clean the data

Make reasonable column names and convert the columns into the correct data types. Name the resulting data frame ‘rail’. Show me part of the results with head(rail).

library("XML")
library("RCurl")
url <- "https://en.wikipedia.org/wiki/List_of_monorail_systems"
rail_page <- getURL(url)
rail <- readHTMLTable(rail_page, which = 2, stringsAsFactors = FALSE)
# I shorten two of the variable names.
names(rail)[c(4, 6)] <- c('year', 'length')
# I don't like upper case in my variable names.
names(rail) <- tolower(names(rail))
# Remove the "[#]" wiki footnote markers with gsub. In an R string, "\\[" is an
# escaped literal bracket, and "[0-9]+" is a regex that matches one or more digits.
# stringsAsFactors = FALSE keeps the columns as character; on R versions before
# 4.0 they would otherwise become factors and as.numeric() would return level codes.
rail <- data.frame(gsub("\\[[0-9]+\\]", "", as.matrix(rail)),
                   stringsAsFactors = FALSE)
# The stations variable should be numeric.
rail$stations <- as.numeric(rail$stations)
head(rail)
    location    country                               name year stations
1 Gold Coast  Australia          Sea World Monorail System 1986       10
2  Lichtaart    Belgium             Bobbejaanland Monorail 1961       10
3  São Paulo     Brazil          Line 15 (São Paulo Metro) 2014        8
4   Montreal     Canada La Ronde (amusement park) Minirail 1967        8
5  Chongqing      China      Chongqing Monorail Line 2 & 3 2005       14
6   Shanghai      China              Shanghai Maglev Train 2004        8
             length
1     2 km (1.2 mi)
2 1.85 km (1.15 mi)
3   2.9 km (1.8 mi)
4   2.1 km (1.3 mi)
5     98 km (61 mi)
6 30.5 km (19.0 mi)
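
The length column is still text such as "2 km (1.2 mi)". The following is a minimal sketch of how it could be turned into a numeric number of kilometres, alongside the footnote cleanup. The vectors raw_name and raw_length are made-up strings for illustration, not scraped data.

```r
# Illustrative strings that mimic the scraped columns (not real scraped data).
raw_name   <- c("Line 15[3]", "Some Monorail[12]", "Plain name")
raw_length <- c("2 km (1.2 mi)", "1.85 km (1.15 mi)", "98 km (61 mi)")

# "[0-9]+" matches one or more digits, so "[12]" is removed as well as "[3]".
clean_name <- gsub("\\[[0-9]+\\]", "", raw_name)

# Drop everything from " km" onwards, then convert the remainder to numeric.
length_km <- as.numeric(sub(" km.*$", "", raw_length))

print(clean_name)
print(length_km)
```

The same two calls could be applied to rail$name and rail$length, assuming every length entry starts with a number followed by " km".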

Create a new column.

Add a column to the data frame that records the relative frequency of stations per country, i.e. each line's share of its country's total number of stations.

# Tough one with many solutions: a loop? aggregate? Here I use ave().
rail <- within(rail, stations_per_cntry <- ave(stations, country, FUN=sum))
rail$stations_relfreq <- rail$stations/rail$stations_per_cntry
head(rail)
    location    country                               name year stations
1 Gold Coast  Australia          Sea World Monorail System 1986       10
2  Lichtaart    Belgium             Bobbejaanland Monorail 1961       10
3  São Paulo     Brazil          Line 15 (São Paulo Metro) 2014        8
4   Montreal     Canada La Ronde (amusement park) Minirail 1967        8
5  Chongqing      China      Chongqing Monorail Line 2 & 3 2005       14
6   Shanghai      China              Shanghai Maglev Train 2004        8
             length stations_per_cntry stations_relfreq
1     2 km (1.2 mi)                 10        1.0000000
2 1.85 km (1.15 mi)                 10        1.0000000
3   2.9 km (1.8 mi)                  8        1.0000000
4   2.1 km (1.3 mi)                  8        1.0000000
5     98 km (61 mi)                 49        0.2857143
6 30.5 km (19.0 mi)                 49        0.1632653
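
One of the other possible solutions, sketched here with aggregate() and merge() on a small made-up data frame (the object toy and its values are hypothetical). It also checks that the shares sum to 1 within each country.

```r
# Toy data frame (hypothetical values) with the same two columns used above.
toy <- data.frame(
  country  = c("A", "B", "B", "C", "C", "C"),
  stations = c(10, 14, 8, 2, 3, 5),
  stringsAsFactors = FALSE
)

# Total stations per country, one row per country.
totals <- aggregate(stations ~ country, data = toy, FUN = sum)
names(totals)[2] <- "stations_per_cntry"

# Attach the country total to every line and compute each line's share.
toy <- merge(toy, totals, by = "country")
toy$stations_relfreq <- toy$stations / toy$stations_per_cntry

# Within each country the shares should sum to 1.
share_check <- tapply(toy$stations_relfreq, toy$country, sum)
print(toy)
print(share_check)
```

Compared with ave(), this route needs an extra merge step, but it leaves a separate per-country table (totals) that can be inspected on its own.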

Stations and outliers.

Create a plot of the total number of stations per year. Do you think any year is an outlier? Remove the outlier(s) and recreate the plot.

tmp_table <- aggregate(stations ~ year, data = rail, sum)
plot(tmp_table)

  • There is no obvious outlier.
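
Had there been an outlier, one common rule of thumb is to flag values more than 1.5 times the interquartile range away from the box of a boxplot. A minimal sketch of that rule, using hypothetical yearly totals (toy_totals is made up and only illustrates the mechanics):

```r
# Hypothetical yearly totals, only to illustrate the mechanics.
toy_totals <- data.frame(
  year     = c(1990, 1995, 2000, 2005, 2010, 2015),
  stations = c(8, 10, 9, 60, 11, 7)
)

# boxplot.stats() flags values beyond 1.5 * IQR from the hinges.
flagged <- boxplot.stats(toy_totals$stations)$out

# Keep only the years whose total is not flagged.
trimmed <- toy_totals[!toy_totals$stations %in% flagged, ]

print(flagged)
print(trimmed)
```

Passing trimmed to plot() would then redraw the figure without the flagged year.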

Load the American presidents' inaugural addresses from the quanteda package.

president <- quanteda::data_corpus_inaugural

Use keywords in context (KWIC) to return the context around ‘cotton’.

Report your results. Do you notice any pattern?

library(quanteda)
president <- quanteda::data_corpus_inaugural
kwic(president, 'cotton')
                                                              
  [1889-Harrison, 1272]       was no reason why the | cotton | - producing States should not
  [1889-Harrison, 1292] States in the production of | cotton | fabrics. There was this
  [1889-Harrison, 1428] wealth and contentment. The | cotton | plantation will not be less
 [1953-Eisenhower, 875]    and turn lathes and pick | cotton | and heal the sick and

What are the five most frequent words in the speeches?

Are they stopwords? Punctuation?

pres_dfm <- dfm(president, groups = "President")
topfeatures(pres_dfm, 5)
  the    of     ,   and     . 
10082  7103  7026  5310  4945 

The five most frequent features are all stopwords or punctuation marks.
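
To see why removing stopwords and punctuation changes the picture, here is a minimal base R sketch on a made-up sentence. The sentence and the three-word stopword list are illustrative assumptions; a real analysis would use the corpus and a fuller list such as stopwords("english").

```r
# A made-up sentence standing in for a speech (not from the corpus).
speech <- "The people of the nation, and the people of the states."

# Lower-case, strip punctuation, and split on whitespace.
words <- strsplit(gsub("[[:punct:]]", "", tolower(speech)), "\\s+")[[1]]

# A tiny illustrative stopword list; a real analysis would use a fuller one.
stop_list <- c("the", "of", "and")
content_words <- words[!words %in% stop_list]

# Frequency table of the remaining words, most frequent first.
word_counts <- sort(table(content_words), decreasing = TRUE)
print(word_counts)
```

Before filtering, "the" dominates the counts; after filtering, only content words remain.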

Does any president use the word ‘cesspools’ in one of his speeches?

Which one? What is the context?

The word ‘cesspools’ does not appear!

kwic(president, 'cesspools')
NULL

Which president is the most concerned with foreign policy? Tell me why.

pres_dfm<-dfm(president, 
    groups="President", 
    remove = c(stopwords("english"),"will","us"), 
    remove_punct = TRUE)
words<-c('foreign', 'trade')
rowSums(pres_dfm[,words])
     Adams   Buchanan       Bush     Carter  Cleveland    Clinton 
        14          7          2          1          5          2 
  Coolidge Eisenhower   Garfield      Grant    Harding   Harrison 
         2          1          0          4          5         10 
     Hayes     Hoover    Jackson  Jefferson    Johnson    Kennedy 
         3          1          5          4          1          0 
   Lincoln    Madison   McKinley     Monroe      Nixon      Obama 
         4          3         12         14          0          0 
    Pierce       Polk     Reagan  Roosevelt       Taft     Taylor 
         5         13          0          5         15          2 
    Truman      Trump  Van Buren Washington     Wilson 
         4          4          6          0          1 

According to my imperfect measure (the combined count of ‘foreign’ and ‘trade’), Taft, with 15 mentions, seems to be the most concerned with foreign policy.
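
One reason the measure is imperfect: raw counts favour anyone with more text, and grouping by surname pools several addresses (and presidents who share a surname) into one row. A possible refinement, sketched here with made-up numbers, is to express the counts per 1,000 words. The vectors keyword_hits and total_words are hypothetical and do not come from the corpus.

```r
# Hypothetical keyword counts and total word counts for three speakers.
keyword_hits <- c(A = 15, B = 14, C = 4)
total_words  <- c(A = 5000, B = 2000, C = 1500)

# Mentions per 1,000 words, so long and short speeches are comparable.
rate_per_1000 <- keyword_hits / total_words * 1000
print(round(rate_per_1000, 2))

# The speaker with the highest rate.
top_speaker <- names(which.max(rate_per_1000))
print(top_speaker)
```

In this toy example speaker A has the most raw mentions, yet speaker B has the highest rate once speech length is taken into account.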

Compare the word cloud from Trump’s speech with the one from Obama’s two speeches.

textplot_wordcloud(pres_dfm["Obama", ], 
                    max.words=100, 
                    random.order = FALSE )

textplot_wordcloud(pres_dfm["Trump", ], 
                    max.words=100, 
                    random.order = FALSE )