Tutorials

Learn how to use R.

Where to find Data?


  • Data from a text file: read.csv()
  • Data from the web using the ‘XML’ and ‘RCurl’ packages: getURL() and readHTMLTable()
  • Data from large texts using package ‘quanteda’: corpus() and dfm()
  • Data from the Twitter API: searchTwitter(), favorites(), and userTimeline()

Connect to Twitter API


Setup account in R

We create four R objects from the consumer key and access token (and their secrets) generated by Twitter.

Do not copy and paste the values shown here

You need to use the values associated with YOUR Twitter account.

consumer_key    <- "IGWVbItQSpNI4WzjYzk4VvgKS" 
consumer_secret <- "2dkYOJn2NNuJbLyNS68CLdHDuU30Raw82zDnRZFwpWL6fietP0"
access_token    <- "198230874-W4G81SX3ApmTz551IQi5M55oKmOHz7LyTzUWzeIr"
access_secret   <- "MGRwEJ9LndHrQnDxmuFoV01lGRElx7YGSIKfRuDzIFsKwXEwBw"

Connect to the Twitter API.

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"

If R prompts you to cache the OAuth credentials in a local file, enter “2” for “No”.
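Since the credentials are personal, one option is to keep them out of the script altogether. The following is a minimal sketch, assuming the four values are stored in environment variables; the variable names and the helper get_credential() are hypothetical, and the dummy values are set here only so that the sketch runs on its own.

```r
# Hypothetical helper: read one credential from an environment variable
# and stop early with a clear message if it is missing.
get_credential <- function(var) {
  value <- Sys.getenv(var, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  value
}

# Dummy values for illustration only. In practice they would be set
# outside the script (for example in ~/.Renviron), never in the code.
Sys.setenv(
  TWITTER_CONSUMER_KEY    = "dummy-key",
  TWITTER_CONSUMER_SECRET = "dummy-secret",
  TWITTER_ACCESS_TOKEN    = "dummy-token",
  TWITTER_ACCESS_SECRET   = "dummy-access-secret"
)

consumer_key    <- get_credential("TWITTER_CONSUMER_KEY")
consumer_secret <- get_credential("TWITTER_CONSUMER_SECRET")
access_token    <- get_credential("TWITTER_ACCESS_TOKEN")
access_secret   <- get_credential("TWITTER_ACCESS_SECRET")

# The four objects can then be passed to setup_twitter_oauth() as before.
```

With this layout the script can be shared without exposing anyone's keys.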

We first load the twitteR package. Remember that the userTimeline() function returns the tweets from one specific user. Each call below downloads up to 150 tweets and returns them as a list of tweet objects.

library( 'twitteR' )
# Here you should run setup_twitter_oauth()
# with four arguments
theresa <- userTimeline("theresa_may", n = 150)
justin  <- userTimeline("justintrudeau", n = 150)
donald  <- userTimeline("realdonaldtrump", n = 150)

Corpus from Tweets


Put the scraped tweets into a data frame using twListToDF() from package ‘twitteR’. The text of each tweet is then available in the text column of that data frame.

t <- twListToDF( theresa )
j <- twListToDF( justin )
d <- twListToDF( donald )

Create a corpus from the three sets of tweets. Because the tweets are combined into one named vector, each document is named after its politician followed by digits (theresa1, theresa2, ...). We get those names using docnames(), then use gsub() with the digit pattern to remove the digits and keep only the name.

library( 'quanteda' )
poli_corpus <- corpus( c('theresa' = t$text, 
    'justin' = j$text, 'donald' = d$text) )
# Strip the digits from the document names to recover the politician's name
docvars( poli_corpus ) <- data.frame( 
    name = gsub("\\d+", "", docnames(poli_corpus))
)
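The naming step can be checked in base R without quanteda or Twitter access. This is a small sketch with made-up tweets; it anchors the pattern at the end of the name (`\\d+$`), which is a slight variation on the code above that would leave digits elsewhere in a name alone.

```r
# Made-up tweets, two per politician, combined into one named vector.
tweets <- c(theresa = c("first tweet", "second tweet"),
            justin  = c("premier tweet", "deuxieme tweet"))

# c() appends an index to each name when a named element has several values.
doc_names <- names(tweets)   # "theresa1" "theresa2" "justin1" "justin2"

# Remove the trailing digits to recover the politician's name.
politician <- gsub("\\d+$", "", doc_names)

table(politician)
```

Each politician appears once per tweet in `politician`, which is what the docvars() call relies on.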

Dictionary


Create a dictionary

We use the quanteda function dictionary() to create a dictionary that will be used by dfm() later. The asterisk “*” is a wildcard matching any character any number of times, e.g. ‘tax*’ will match ‘tax’ as well as ‘taxation’.

politics <- c('tax*', 'govern*', 'vot*', 'polic*', 'labo*', 'wage*', 'employ*',
                'constitution', 'job*', 'busine*','g20','nato','nafta')
# The function dictionary() from the quanteda package takes a named list
# of character vectors: one list element per dictionary key.
my_dict <- dictionary( list(politics = politics) )
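To see what the wildcard matches, the patterns can be tried in base R. This is only an illustration of the matching rule, not quanteda's own implementation: glob2rx() from the utils package converts a wildcard pattern into a regular expression, and the word list is made up.

```r
words <- c("tax", "taxation", "syntax", "football", "ballot")

# 'tax*' matches words that START with 'tax'.
starts_with_tax <- grepl(glob2rx("tax*"), words)
words[starts_with_tax]    # "tax" "taxation"

# '*ball' matches words that END with 'ball'.
ends_with_ball <- grepl(glob2rx("*ball"), words)
words[ends_with_ball]     # "football"
```

Note that ‘syntax’ is not matched by ‘tax*’ because the wildcard sits only at the end of the pattern.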

Create a document-feature matrix using quanteda and the dictionary. Notice how we use the groups option. The option tells the function dfm() to treat all tweets from a politician as one document, so the matrix has one row per politician and one column per dictionary key.

( p_dfm <- dfm(poli_corpus, groups = 'name', dictionary = my_dict) )
## Document-feature matrix of: 3 documents, 1 feature (0% sparse).
## 3 x 1 sparse Matrix of class "dfmSparse"
##          features
## docs      politics
##   donald        46
##   justin        28
##   theresa       38
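The effect of grouping can be sketched in base R: per-tweet counts are summed within each politician. The counts below are invented, and rowsum() stands in for what the groups option does inside dfm().

```r
# Invented per-tweet counts of 'politics' terms (one row per tweet).
counts <- matrix(c(2, 0, 1, 3), ncol = 1,
                 dimnames = list(c("donald1", "donald2", "justin1", "justin2"),
                                 "politics"))

# Recover the grouping variable from the row names.
name <- gsub("\\d+$", "", rownames(counts))

# Sum the rows that share the same name: one row per politician.
grouped <- rowsum(counts, group = name)
grouped
```

The result has two rows, donald (2) and justin (4), in the same shape as the matrix printed above.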

Create another dictionary

We use the quanteda function dictionary() again, this time with two keys: politics and sports.

sports <- c('*ball','cricket','foot*','soccer','rugby')
my_dict <- dictionary( list(
                            politics = politics, 
                            sports = sports
                            ))

Create a document-feature matrix using quanteda.

p_dfm <- dfm(poli_corpus, groups = 'name', dictionary = my_dict)

Thesaurus


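Replace the option dictionary with thesaurus in function dfm(). With a thesaurus, words that match an entry are replaced by the entry's key, while all other words are kept as features.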

unwanted_words = c(stopwords(), stopwords('french'),'http',
                 'https', 'd','amp','t.co')
p_dfm <- dfm(poli_corpus, 
            groups = 'name', 
            thesaurus = my_dict,
            stem = TRUE,
            remove_punct = TRUE,
            remove_twitter = TRUE,
            remove = unwanted_words
            )
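The thesaurus idea can be sketched on a handful of tokens in base R. This is an illustration of the replacement rule only, with made-up tokens and a shortened pattern list; it is not how dfm() is implemented.

```r
tokens   <- c("great", "jobs", "and", "lower", "taxes")
politics <- c("tax*", "job*")

# Turn the wildcard patterns into one regular expression.
pattern <- paste(glob2rx(politics), collapse = "|")

# Matching tokens become the key in upper case; the others are kept as they are.
replaced <- ifelse(grepl(pattern, tokens), "POLITICS", tokens)
replaced   # "great" "POLITICS" "and" "lower" "POLITICS"
```

A dictionary would instead keep only the POLITICS count and drop ‘great’, ‘and’, and ‘lower’.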

We can now extract the relative frequency of terms. The dfm_weight() function with 'relFreq' gives the share of each feature within a document, and multiplying by 100 turns it into a percentage. We then extract the two features ‘POLITICS’ and ‘great’. POLITICS is in all caps because the function dfm() puts thesaurus keys in all caps so that they are distinguishable from the actual words.

dfm_weight(p_dfm, 'relFreq')[, c('POLITICS','great')]*100
## Document-feature matrix of: 3 documents, 2 features (16.7% sparse).
## 3 x 2 sparse Matrix of class "dfmSparse"
##          features
## docs      POLITICS    great
##   donald  2.744425 2.115495
##   justin  2.393162 0       
##   theresa 2.220923 1.052016
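Relative frequency is simply each count divided by the row total. Here is a base R sketch with invented counts, using prop.table() in place of dfm_weight(); the ‘other’ column stands for all remaining features.

```r
# Invented counts: rows are documents, columns are features.
counts <- matrix(c(3, 1, 6,
                   2, 0, 8),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("donald", "justin"),
                                 c("POLITICS", "great", "other")))

# Divide each row by its total, then express the result as a percentage.
rel <- prop.table(counts, margin = 1) * 100

rel[, c("POLITICS", "great")]
```

Each row of `rel` sums to 100, so the values can be compared across politicians who tweeted different numbers of words.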

Get dictionaries online

We load two word lists from the web, one of positive terms and one of negative terms, each containing thousands of keywords.

positive <- read.csv( 'http://latul.be/mba_565/data/positive.txt', 
                stringsAsFactors = FALSE ) 
negative <- read.csv( 'http://latul.be/mba_565/data/negative.txt',   
                stringsAsFactors = FALSE )
# We unlist the objects because they are in a dataframe
my_dict <- dictionary( list(     
                    positive = unlist(positive),
                    negative = unlist(negative),
                    politics = politics,
                    sports= sports
            ))
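One detail worth checking when reading word lists with read.csv() is the header row: by default the first line is used as a column name rather than as data. The sketch below uses a small temporary file with made-up words; whether the course files contain a header line is not known here, so this is a check to run, not a correction.

```r
# Write a tiny made-up word list, one word per line, without a header.
word_file <- tempfile(fileext = ".txt")
writeLines(c("good", "great", "happy"), word_file)

# Default: the first line ('good') becomes the column name.
with_header <- read.csv(word_file, stringsAsFactors = FALSE)

# header = FALSE keeps every line as data.
without_header <- read.csv(word_file, header = FALSE, stringsAsFactors = FALSE)

words_default <- unlist(with_header, use.names = FALSE)
words_all     <- unlist(without_header, use.names = FALSE)

length(words_default)   # 2
length(words_all)       # 3
```

Comparing the number of words read with the number of lines in the file shows which setting fits a given list.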

Alternative to the dictionary

Create a document-feature matrix using quanteda, this time with the dictionary that has four keys: positive, negative, politics, and sports.

s_dfm <- dfm(poli_corpus, groups = 'name', 
                dictionary = my_dict,
                stem = TRUE,
                remove_punct = TRUE,
                remove_twitter = TRUE,
                remove = unwanted_words
            )

Sentiments


Sentiment Analysis

We first create a document-feature matrix for Donald's tweets only. Because the groups option is not set, each tweet remains its own document.

d_dfm <- dfm( corpus_subset( poli_corpus, name == 'donald' ),
                dictionary = my_dict,
                remove_punct = TRUE,
                remove_twitter = TRUE,
                remove = unwanted_words
            )

Second, we create an emotion measure, used to classify each tweet as positive, negative, or neutral. We take the number of positive terms minus the number of negative terms in a tweet and check the sign of the result: a negative value marks a negative tweet, a positive value a positive tweet, and zero a neutral tweet.

measure <- as.vector( d_dfm[,'positive'] - d_dfm[,'negative'] )

Sentiment Analysis–Dataframe

We create a sentiment data frame from our measure. The function rep( ‘x’, n ) repeats x n times. The last line gives the share of each class as a percentage.

emo <- rep( 'neutral', length(measure) )  
emo[ measure < 0 ] <- 'negative'  
emo[ measure > 0 ] <- 'positive' 
emo_df <- data.frame( measure, emo )
table( emo_df$emo ) / nrow(emo_df) * 100
## 
## negative  neutral positive 
## 20.00000 31.72414 48.27586
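The whole classification can be tried on a few invented counts. This sketch uses sign() and factor() as a compact alternative to the rep() and indexing approach above; the numbers are made up and stand for the positive and negative columns of the matrix.

```r
# Invented counts of positive and negative terms in five tweets.
positive <- c(2, 0, 1, 0, 3)
negative <- c(1, 0, 2, 0, 0)

# Emotion measure: positive minus negative terms per tweet.
measure <- positive - negative   # 1 0 -1 0 3

# sign() gives -1, 0, or 1, which maps directly to the three classes.
emo <- factor(sign(measure),
              levels = c(-1, 0, 1),
              labels = c("negative", "neutral", "positive"))

# Share of each class, in percent.
shares <- prop.table(table(emo)) * 100
shares
```

With these five tweets the shares are 20% negative, 40% neutral, and 40% positive. Using a factor with fixed levels keeps a class in the table even when no tweet falls into it.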