Tutorials

Learn how to use R.

Large Text Analytics
  • Install package ‘quanteda’
  • Use library( ‘quanteda’ )
  • Download text
  • Create a feature matrix
  • Text Analysis

  • Descriptive statistics
  • Sentiment analysis
  • Classify documents

Install


Install package Quanteda
  • quanteda is used for managing and analyzing text.
  • readr helps load text files into R.
install.packages( "Rcpp" )
install.packages( "stringi" )
install.packages( "tibble" )
install.packages( "BH" )
install.packages( "readr" )
install.packages( "quanteda" )

Or, if the above fails on Windows, try this:

install.packages( "quanteda_0.9.4.zip", repos=NULL, type="source" )


Library Quanteda

Don’t forget to tell R to use the quanteda library.

library( "quanteda" )
Definitions
  • corpus: a collection of documents, e.g. books or articles

  • token: a single unit of text, usually a word; the token count is the total number of tokens

  • stem: a word with its suffix removed, e.g. 'end' is the stem of 'ending'

  • stopword: a common word designated for exclusion from the analysis, e.g. 'the' or 'and'
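
A minimal sketch tying these terms together, assuming a quanteda version that provides tokens_remove() and tokens_wordstem():

toks <- tokens( "The prices of corn were rising.", remove_punct = TRUE )
toks                                # tokens: "The" "prices" "of" "corn" "were" "rising"
tokens_remove( toks, stopwords() )  # stopwords removed, leaving "prices" "corn" "rising"
tokens_wordstem( toks )             # stems: "The" "price" "of" "corn" "were" "rise"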

Corpus


Library of texts

A corpus gathers the texts and all their associated information (metadata, or document variables).

txt1 <- readr::read_file('http://latul.be/mba_565/data/keynes.txt')
txt2 <- readr::read_file('http://latul.be/mba_565/data/ricardo.txt')
txt3 <- readr::read_file('http://latul.be/mba_565/data/smith.txt')
econ <- c( 'keynes' = txt1, 'ricardo' = txt2, 'smith' = txt3)
econ_cs <- 
    corpus(econ,
           docvars = data.frame(Year = c(1919,1817,1776),
                                Origin = c("British","British","Scott")
                    )
           )
summary(econ_cs)
## Corpus consisting of 3 documents.
## 
##     Text Types Tokens Sentences Year  Origin
##   keynes  7978  85783      2575 1919 British
##  ricardo  6168 140960      3578 1817 British
##    smith 11453 439719     11511 1776   Scott
## 
## Source:  /home/johan/docs/viu/mba_565/tutorials/* on x86_64 by johan
## Created: Sun Nov  5 19:03:36 2017
## Notes:
Keywords in Context (KWIC)

The function kwic() takes a corpus as input and shows what comes immediately before and after a keyword.

econ_kwic <- kwic( econ_cs, "china", window = 6 )
options(width=200) # Just so we see the output on one line.
head( econ_kwic )
##                                                                                                    
##   [keynes, 18130]                  26] operating in Russia, | China | , Turkey, Austria, Hungary   
##   [keynes, 19119]       privileges she may have acquired in | China | .[ 30] There are             
##   [keynes, 53489] utility undertakings operating in Russia, | China | , Turkey, Austria, Hungary   
##  [ricardo, 70161]                       , might, if laid in | China | , on the exportation of the  
##  [ricardo, 70182]         the expenses of the Government of | China | . Taxes on luxuries have some
##     [smith, 8836]          some of the eastern provinces of | China | , though the great extent of
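
If your quanteda version provides textplot_xray(), the same kwic hits can also be drawn as a lexical dispersion plot, showing where the keyword occurs within each text (a quick sketch):

textplot_xray( kwic( econ_cs, "china" ) )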

Subset


Have a look at only a subset of the corpus. Note that you can use any of the document variables given by docvars().

# The second argument is a logical expression, e.g. using '==', '>', '!=', ...
corpus_subset(econ_cs, Year>1800)
## Corpus consisting of 2 documents and 2 docvars.
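
The document variables we attached earlier can be inspected with docvars(), and any of them can be used in the condition. For example, subsetting on Origin:

docvars( econ_cs )                                       # shows Year and Origin for each document
summary( corpus_subset( econ_cs, Origin == "British" ) ) # keeps keynes and ricardo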

Token


Token

Tokens are the elements that we extract from the full text. The default unit is the word.

txt <- c(
            text1 = "Today, we learn about token. This is a sentence.",
            text2 = "This is the first sentence. These are fruits: apple, peach, and melon."
            )
tokens(txt)
## tokens from 2 documents.
## text1 :
##  [1] "Today"    ","        "we"       "learn"    "about"    "token"    "."        "This"     "is"       "a"        "sentence" "."       
## 
## text2 :
##  [1] "This"     "is"       "the"      "first"    "sentence" "."        "These"    "are"      "fruits"   ":"        "apple"    ","        "peach"    ","        "and"      "melon"    "."
tokens(txt, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
## [1] "Today"    "we"       "learn"    "about"    "token"    "This"     "is"       "a"        "sentence"
## 
## text2 :
##  [1] "This"     "is"       "the"      "first"    "sentence" "These"    "are"      "fruits"   "apple"    "peach"    "and"      "melon"
tokens(txt, what = 'sentence')
## tokens from 2 documents.
## text1 :
## [1] "Today, we learn about token." "This is a sentence."         
## 
## text2 :
## [1] "This is the first sentence."                "These are fruits: apple, peach, and melon."
econ_token <- tokens(tolower(econ_cs['keynes']), remove_punct = TRUE)
# try
keynes_sentence <- tokens(econ_cs['keynes'], what = 'sentence')

Document Feature Matrix (DFM)


You can skip the token step and go straight to the document-feature matrix (DFM). The DFM gives us the frequency of each word in each document.

DFM–Easier than tokens

With dfm() we create a list of features (tokens), converted to lower case and with punctuation removed.

econ_dfm <- dfm( econ_cs, remove_punct = TRUE )
head(econ_dfm)
## Document-feature matrix of: 3 documents, 15,718 features (49.4% sparse).
## (showing first 3 documents and first 6 features)
##          features
## docs        the project gutenberg ebook economic consequences
##   keynes   6016      89        30    10      140           21
##   ricardo  9769      87        30    11        0            4
##   smith   32367     123        30    11        0            8

Let's get a list of the most-used words.

topfeatures( econ_dfm, 10 )
##   the    of    to   and    in     a    it    is    be which 
## 48152 35577 17207 15389 14013  9916  7601  7127  6823  6734
DFM

With the remove option, we can get rid of some uninteresting words.

# We first create a vector of words to be removed.
unwanted_words <- c('the', 'it', 'is', 'of')
econ_dfm <- dfm( econ_cs, 
                remove_punct = TRUE,
                remove = unwanted_words 
            )
topfeatures( econ_dfm, 10 )
##    to   and    in     a    be which  that   for    by    as 
## 17207 15389 14013  9916  6823  6734  5938  4968  4800  4655
Stopword

The quanteda package includes a list of the most commonly removed words. The function stopwords() returns this extensive list of stopwords.

stopwords()
##   [1] "i"          "me"         "my"         "myself"     "we"         "our"        "ours"       "ourselves"  "you"        "your"       "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"       "herself"    "it"         "its"        "itself"     "they"       "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"       "these"      "those"      "am"         "is"         "are"        "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"         "does"       "did"        "doing"      "would"      "should"     "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"       "you've"     "we've"      "they've"    "i'd"        "you'd"      "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"    "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"     "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"      "cannot"     "couldn't"   "mustn't"    "let's"      "that's"     "who's"      "what's"     "here's"     "there's"    "when's"    
## [106] "where's"    "why's"      "how's"      "a"          "an"         "the"        "and"        "but"        "if"         "or"         "because"    "as"         "until"      "while"      "of"        
## [121] "at"         "by"         "for"        "with"       "about"      "against"    "between"    "into"       "through"    "during"     "before"     "after"      "above"      "below"      "to"        
## [136] "from"       "up"         "down"       "in"         "out"        "on"         "off"        "over"       "under"      "again"      "further"    "then"       "once"       "here"       "there"     
## [151] "when"       "where"      "why"        "how"        "all"        "any"        "both"       "each"       "few"        "more"       "most"       "other"      "some"       "such"       "no"        
## [166] "nor"        "not"        "only"       "own"        "same"       "so"         "than"       "too"        "very"

The quanteda package also includes stopwords in different languages.

head(stopwords("arabic"))
## [1] "فى" "في" "كل" "لم" "لن" "له"

We can add to the list by extending the stopwords() character vector.

unwanted_words <- c(stopwords(), "gutenberg", "ebook", "upon","will","may",
                    "can", "must","much","therefore","project"
                    )
econ_dfm <- dfm( econ_cs, 
                remove_punct = TRUE,
                remove = unwanted_words 
            )
topfeatures( econ_dfm, 10 )
##    price    great    value     part  country   labour  produce      one quantity    money 
##     2309     1809     1785     1662     1651     1631     1558     1480     1331     1308
Stem

Use the stem = TRUE option to reduce words to their stem, e.g. 'ending' becomes 'end'.

econ_dfm <- dfm( econ_cs, stem = TRUE,
                    remove_punct = TRUE,
                    remove = unwanted_words
                )
topfeatures( econ_dfm, 10 )
##   price countri    part  labour  produc   great    valu     tax  employ     one 
##    2523    2462    2066    2052    1971    1908    1847    1614    1575    1523
Subset

Have a look at word frequencies by book.

# Create a character vector of words to look up
econ_words <- c("price","corn","trade")
econ_dfm[,econ_words]
## Document-feature matrix of: 3 documents, 3 features (11.1% sparse).
## 3 x 3 sparse Matrix of class "dfmSparse"
##          features
## docs      price corn trade
##   keynes     39    0    60
##   ricardo  1150  562   191
##   smith    1334  436  1123
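
The same selection can also be done with dfm_select(); the argument names have changed across quanteda versions, so this is only a sketch with the word vector passed positionally:

# Keep only the chosen words in the DFM
dfm_select( econ_dfm, econ_words )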
Groups

We can group documents that share a value in one of the document variables (here, Origin).

group_dfm <- dfm(econ_cs, 
                    groups = "Origin", 
                    remove = unwanted_words, 
                    remove_punct = TRUE)

# Note the use of dfm_sort() to order features by frequency
dfm_sort(group_dfm)[,1:10]
## Document-feature matrix of: 2 documents, 10 features (0% sparse).
## 2 x 10 sparse Matrix of class "dfmSparse"
##          features
## docs      price great value part country labour produce  one quantity money
##   British  1052   227   991  261     415    629     615  429      534   547
##   Scott    1257  1582   794 1401    1236   1002     943 1051      797   761

Word clouds


Word clouds are a great visual tool when exploring large texts.

Cloud

Quick visualization with a word cloud.

textplot_wordcloud(econ_dfm, 
                    max.words=100,
                    random.order = FALSE,
                    colors = c('red','blue','green') 
                )

Cloud from a subset of the books.

textplot_wordcloud(econ_dfm["keynes"], 
                    max.words=100, 
                    random.order = FALSE,
                    colors = c('red','blue','green') 
                )

Have a look at the 657 colours you can choose.

colors()
textplot_wordcloud(econ_dfm, 
                    max.words=100, 
                    random.order = FALSE,
                    colors = sample( colors(), 3) )
Top Features

Draw a bar plot for word frequencies.

feat <- topfeatures( econ_dfm, n=30 ) 
barplot( feat, ylab = 'count', main = 'Smith--Ricardo--Keynes', las = 2 )
Ngrams–word groups

An n-gram is a sequence of n consecutive tokens; with ngrams = 2, dfm() counts pairs of adjacent words (bigrams) instead of single words. A small token-level example follows the plot below.
bigram_dfm <- dfm(econ_cs, 
                    remove_punct = TRUE,
                    remove = unwanted_words, 
                    stem = TRUE, 
                    ngrams = 2)
feat <- topfeatures( bigram_dfm, n=30 ) 
barplot( feat, ylab = 'count', main = 'Bi-gram', las = 2, cex.names = 3/4 )
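
To see what bigram tokens actually look like, here is a tiny example (assuming a quanteda version that provides tokens_ngrams()):

# Adjacent word pairs are joined with "_" by default
tokens_ngrams( tokens( "we learn about text analysis" ), n = 2 )
# gives "we_learn" "learn_about" "about_text" "text_analysis"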

Just for fun, have a look at Google's great n-gram tool: Google Ngram