Tutorials

Learn how to use R.

Install


Install package XML
  • [XML] Reading and creating XML (and HTML) documents, both local and accessible via HTTP or FTP
install.packages( "XML" )
install.packages( "RCurl" )

Or, if the above fails on Windows, download the binary .zip files to your working directory and install them from there:

install.packages( "RCurl_1.95-4.8.zip", repos=NULL, type="win.binary" )
install.packages( "XML_3.98-1.4.zip", repos=NULL, type="win.binary" )
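If you are not sure whether the packages are already installed, you can check before installing. A minimal sketch using base R only; the helper name missing_packages is our own and not part of R:

```r
# hypothetical helper: which of these packages are not installed yet?
missing_packages <- function( pkgs ) {
  # system.file() returns "" when a package cannot be found
  installed <- vapply( pkgs, function( p ) nzchar( system.file( package = p ) ), logical( 1 ) )
  pkgs[ !installed ]
}

to_install <- missing_packages( c( "XML", "RCurl" ) )
to_install
# if to_install is not empty, run: install.packages( to_install )
```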

This link for R and Windows:

Troubleshoot

Knowing the details of your R installation helps when troubleshooting. Type the following to get your installation info:

sessionInfo()
# or simply
version
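When asking for help, it is often enough to report a few key facts. A short sketch using base R only; the variable names r_version and os_type are our own:

```r
r_version <- R.version.string     # the R version as a single string
os_type   <- .Platform$OS.type    # "windows" or "unix"
r_version
os_type
# version of one installed package (stats ships with every R installation)
packageVersion( "stats" )
```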

Windows users who need to compile packages from source should install Rtools.

Good Practice

You should always keep your code for a task in a single text file, e.g. my_work.r. You can then open this file with any text editor or with RStudio.

On the first lines of your text file, load the libraries you need

  # load the XML and RCurl libraries
  library( "XML" )
  library( "RCurl" )

To run the whole source file in R:

  # run every line of my_work.r (replace PATH with the folder that holds the file)
  source( 'PATH/my_work.r' )
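
To see what source() does, here is a small sketch that writes a two-line script to a temporary file and then runs it. The file and the names x and x_mean are made up for illustration and stand in for my_work.r:

```r
# write a two-line script to a temporary file
script_path <- tempfile( fileext = ".r" )
writeLines( c( "x <- 1:10", "x_mean <- mean( x )" ), script_path )

# source() runs every line of the file, as if typed at the console
source( script_path )

# the objects created by the script are now available
x_mean
```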

Scrape a Wiki Table


Source Code

First we store the link in an R object

fifa_url <- "https://en.wikipedia.org/wiki/FIFA_World_Cup"

Then you download the source code of that page and store it into an R object

fifa_page <- getURL( fifa_url )
fifa_page

Source Code–Windows users

First we store the link in an R object

fifa_url <- "https://en.wikipedia.org/wiki/FIFA_World_Cup"

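Then you download the source code of that page and store it in an R object. On Windows, getURL() can fail on the SSL certificate check; ssl.verifypeer = FALSE turns that check off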

fifa_page <- getURL( fifa_url, ssl.verifypeer = FALSE )
fifa_page

Source Code–Windows troubleshooting

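If getURL() still fails, try the httr package instead. First we store the link in an R object

# try
install.packages( 'httr' )
library( "httr" )
fifa_url <- "https://en.wikipedia.org/wiki/FIFA_World_Cup"

Then you download the source code of that page and store it in an R object

# GET() fetches the page; content() returns its source code as text
fifa_page <- content( GET( fifa_url ), as = "text", encoding = "UTF-8" )
fifa_page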

Get the table

We can now extract the table from the source code. The argument which = 4 selects the fourth table on the page

fifa <- readHTMLTable( fifa_page, which = 4, stringsAsFactors = FALSE ) 

The object fifa is an n by k data frame (n rows by k columns). We use dim() to get its actual size.

d <- dim( fifa )
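
To see what dim() returns, here is a sketch on a small data frame. The names toy and d_toy are illustrative and the values are typed in by hand, not taken from the scraped table:

```r
# a small data frame with 3 rows and 2 columns
toy <- data.frame( year = c( 1930, 1934, 1938 ), matches = c( 18, 17, 18 ) )

# dim() returns two numbers: rows first, then columns
d_toy <- dim( toy )
d_toy[1]   # number of rows, same as nrow( toy )
d_toy[2]   # number of columns, same as ncol( toy )
```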

Cleaning data


Gsub

When working with strings you may need to substitute some values. Have a look at the total number of fans at the gate.

fifa[, 7]

This is not numeric. readHTMLTable() returned it as a string (character), and the commas prevent as.numeric() from converting it directly. We remove them with gsub()

gsub( ",", "", fifa[, 7])
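
Note that gsub() still returns strings, so the result has to be wrapped in as.numeric() to get numbers. A minimal sketch on made-up values (the vector attendance is illustrative, not the scraped column):

```r
# made-up attendance figures, stored as strings with thousands separators
attendance <- c( "1,200", "45,000", "1,045,246" )

# converting directly fails: every value becomes NA (with a warning)
suppressWarnings( as.numeric( attendance ) )

# remove the commas first, then convert
attendance_num <- as.numeric( gsub( ",", "", attendance ) )
attendance_num
is.numeric( attendance_num )
```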

Clean data frame

We remove the first column (variable) and the last row (observation). The first column is the row number, and the last row is the total. We do not need these.

# minus sign tells R to remove that row or column
fifa <- fifa[-d[1], -1]
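
The same negative indexing can be tried on a small data frame first. This is a sketch with made-up values; toy, n_rows and toy_clean are illustrative names:

```r
# first column is a row number, last row is a total: both unwanted
toy <- data.frame(
  rank  = c( "1", "2", "Total" ),
  year  = c( "1930", "1934", "" ),
  total = c( "100", "200", "300" ),
  stringsAsFactors = FALSE
)

n_rows <- dim( toy )[1]

# drop the last row and the first column in one step
toy_clean <- toy[ -n_rows, -1 ]
toy_clean
```

If only one column would remain after the removal, add drop = FALSE inside the brackets so that the result stays a data frame.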

Clean data frame–all at once

We now clean the whole data frame. We create a single object called fifa_df.

fifa_df <- data.frame( 
    year      = as.numeric(fifa[,1]),
    host      = gsub(" ", "", fifa[,2]),
    total     = as.numeric( gsub(",", "", fifa[,4]) ),
    matches   = as.numeric( fifa[,5] ),
    avg       = as.numeric( gsub(",", "", fifa[,6]) ),
    top       = as.numeric( gsub(",", "", fifa[,7]) ),
    site      = fifa[,8]
)
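
Inside data.frame(), columns are named with = and not with <-. The sketch below shows the difference on made-up values; good, bad and yr are illustrative names:

```r
# with = the column gets the name we chose
good <- data.frame( year = as.numeric( c( "1930", "1934" ) ) )
names( good )

# with <- R creates a variable called yr in the workspace instead,
# and the column name is built from the whole expression
bad <- data.frame( yr <- as.numeric( c( "1930", "1934" ) ) )
names( bad )
exists( "yr" )
```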

Clean data frame–one by one

You can create the variables one by one if you prefer. This way you can check your work as you progress. The only drawback is that you create many variables within the R environment.

year      <- as.numeric(fifa[,1])
host      <- gsub(" ", "", fifa[,2])
total     <- as.numeric( gsub(",", "", fifa[,4]) )
matches   <- as.numeric( fifa[,5] )
avg       <- as.numeric( gsub(",", "", fifa[,6]) ) 
top       <- as.numeric( gsub(",", "", fifa[,7]) ) 
site      <- fifa[,8]
fifa_df <- data.frame( year, host, total, matches, avg, top, site )
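
Whichever way the data frame is built, it is worth checking that each column has the expected type and that the conversion did not produce missing values. A sketch on a small made-up data frame (toy_df, col_types and na_count are illustrative names, and the totals are invented):

```r
toy_df <- data.frame(
  year  = as.numeric( c( "1930", "1934" ) ),
  host  = c( "Uruguay", "Italy" ),
  total = as.numeric( gsub( ",", "", c( "1,000", "2,500" ) ) ),
  stringsAsFactors = FALSE
)

# type of each column
col_types <- sapply( toy_df, class )
col_types

# number of missing values per column; anything above 0 needs a closer look
na_count <- colSums( is.na( toy_df ) )
na_count
```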

Visual

We can create a simple line plot to see the number of people at the gate over the years:

# we attach() fifa_df. This way we refer to variables directly
attach( fifa_df )
plot( x = year, y = total, type = 'l' )
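
If you prefer not to attach() the data frame, with() gives the same direct access to the variables for a single call only. A sketch on made-up values; toy_df is an illustrative name:

```r
toy_df <- data.frame( year = c( 1930, 1934, 1938 ), total = c( 10, 20, 15 ) )

# with() looks up year and total inside toy_df for this one call
with( toy_df, plot( x = year, y = total, type = 'l' ) )

# nothing was attached, so the search path is unchanged
"toy_df" %in% search()
```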

Visual

We can also draw a bubble plot of the number of people at the gate over the years. Here we add one variable to the plot, the number of matches, shown as the size of each circle.

symbols( x = year, y = total, circles = matches, inches = 1/3, bg = "steelblue2")