Problems

Hone your R skills by doing problems.

1 Fetch the last 100 posts from your favorite politician.

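library(twitteR)
# Authenticate first with setup_twitter_oauth(); on success it prints:
# [1] "Using direct authentication"
kim <- userTimeline("KimJongNumberUn", n=100)  # last 100 posts
k <- twListToDF(kim)  # convert the list of status objects to a data frame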

1.1 Remove any word that you deem unnecessary. How do you do that?

library(quanteda)
# Custom stop words: URL fragments and filler terms that are common in tweets.
mystops <- c('co','https','http','t','will','&amp','amp','thank','t.co')
k_dfm <- dfm(k$text,
             remove_punct = TRUE,
             remove_twitter = TRUE,
             remove = c(stopwords('english'), mystops))

1.2 Create a word cloud from the cleaned-up word list.

textplot_wordcloud(k_dfm, 
                    max.words=100, 
                    random.order = FALSE )

1.3 Fetch the last 100 posts from a second politician.

Devise a method to determine which of the two cares more about the economy. Hint: create or download a dictionary of economic terms.

muga <- userTimeline("RGMugabe", n=100)
m <- twListToDF(muga)

econ <- read.csv( 'http://latul.be/mba_565/data/econ_words.csv', 
                stringsAsFactors = FALSE )
                
poli_corpus <- corpus( c('kim' = k$text, 'mugabe' = m$text) )
# docnames are 'kim1', 'kim2', ..., 'mugabe1', ...; strip the digits to recover the author.
docvars( poli_corpus ) <- data.frame(
    name = gsub("\\d+", "", docnames(poli_corpus))
)

my_dict <- dictionary( list("econ"=rbind(econ,'leader')) )
( p_dfm <- dfm(poli_corpus, groups = 'name', dictionary = my_dict) )
Document-feature matrix of: 2 documents, 1 feature (50% sparse).
2 x 1 sparse Matrix of class "dfmSparse"
        features
docs     econ.econ_words
  kim                  5
  mugabe               0
  • Neither seems too concerned with the economy: only 5 dictionary matches for kim and none for mugabe.
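
Raw counts can mislead when one account writes longer posts than the other. A minimal base-R sketch of a normalized comparison follows. The toy posts and the three-word term list are hypothetical stand-ins, not the course data; the function divides dictionary hits by the total number of words for each author.

```r
# Hypothetical toy data standing in for k$text and m$text.
posts <- list(
  kim    = c("The economy is strong", "Trade and jobs matter"),
  mugabe = c("A great day for the nation", "We celebrate our history")
)
# Hypothetical term list standing in for econ_words.csv.
econ_terms <- c("economy", "trade", "jobs")

# Share of an author's words that appear in the term list.
econ_rate <- function(texts, terms) {
  words <- tolower(unlist(strsplit(texts, "[^[:alpha:]]+")))
  words <- words[words != ""]   # drop empty tokens
  sum(words %in% terms) / length(words)
}

rates <- sapply(posts, econ_rate, terms = econ_terms)
print(round(rates, 3))
```

With the real data, the same idea would apply to k$text and m$text and the downloaded term list; a rate per word is easier to compare than a raw count when the two timelines differ in length.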

2 Load the inaugural speech data from quanteda::data_corpus_inaugural.

I suggest you download the positive.txt and negative.txt lists from the course homepage.

positive <- read.csv( 'http://latul.be/mba_565/data/positive.txt', 
                stringsAsFactors = FALSE,
                comment.char = '#' ) 
negative <- read.csv( 'http://latul.be/mba_565/data/negative.txt',   
                stringsAsFactors = FALSE,
                comment.char = '#' )

2.1 Is the distribution of positive and negative terms constant across presidents (not speeches)?

Use dfm_weight(x, 'relFreq'), where x is your dfm, to get comparable numbers. Try to plot the values across presidents.

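# read.csv() returns data frames; [[1]] extracts the word column so that
# the dictionary keys are exactly 'positive' and 'negative'.
my_dict <- dictionary( list(positive=positive[[1]], 
    negative=negative[[1]]))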
pres <- quanteda::data_corpus_inaugural
pres_dfm<-dfm(pres, 
    groups="President", 
    remove = c(stopwords("english"),"will","us"), 
    remove_punct = TRUE,
    dictionary = my_dict)
words <- c('positive','negative')
relfreq <- dfm_weight(pres_dfm[,words], 'relFreq')

# barplot() expects a matrix as its argument.
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
barplot(t(as.matrix(relfreq)), horiz=TRUE, 
    legend = c('positive','negative'))
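
The 'relFreq' weighting divides each count by its row total, so every president's row sums to 1 no matter how much text he produced. A small base-R sketch of that arithmetic, using a hypothetical count matrix rather than the real inaugural data:

```r
# Hypothetical counts of dictionary hits per president (not real data).
counts <- matrix(c(30, 10,
                   45,  5),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("positive", "negative")))

# Divide each row by its row sum.
rel <- prop.table(counts, margin = 1)
print(rel)
```

Because the dfm contains only the two dictionary categories, the shares are relative to dictionary hits alone, not to all the words in a speech.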

2.2 Create your own dictionary with at least two categories.

Create a barplot showing how frequently each category appears across presidents.