Tutorials

Learn how to use R.

Linear Regression


nanaimo <- read.csv( 
    'http://latul.be/mba_565/data/nanaimo.csv')
trinity <- c('price', 'area', 'landarea')
pairs( nanaimo[, trinity] )

Subset the data

Only keep houses with a reported floor area, located on lots smaller than an acre.

tmp <- subset(nanaimo, area > 0 & landarea < 1)
pairs( tmp[, trinity] )

Think about your final project

At this point, you should have raw data for your project. If not, collect it this weekend.

  • Are there any variables you should inspect?
  • How do your variables interact?
  • Which of your variables are discrete? Continuous?
  • What would you like to measure?

Try to include an analytical component in your project. This will greatly help in your ABP.

  • Could you model your data in order to conduct inference?
  • Is there a theory that would explain why one of your variables depends on another?
  • Could you use a linear model?

Analytical Steps


  • Load the data: read.csv(), attach()
  • Have a look: summary(), pairs(), plot(), density(), ggplot(), geom_smooth()
  • Get rid of potential outliers: subset()
  • Fit a model: lm( y ~ x1 + x2 + x3 )
  • Inspect the results: summary()
  • Extract information: model$coef, model$fitted, plot( model )
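The steps above can be sketched end-to-end; the data frame below is a made-up stand-in for the Nanaimo data, so the numbers are illustrative only:

```r
# Toy stand-in for the course data set (made-up numbers)
houses <- data.frame(
    price    = c(250, 310, 295, 410, 380, 990) * 1000,
    area     = c(1100, 1500, 1400, 2100, 1900, 5200),
    landarea = c(0.15, 0.20, 0.18, 0.40, 0.35, 3.00) )

summary( houses )                          # have a look
pairs( houses )
tmp   <- subset( houses, landarea < 1 )    # drop the acreage outlier
model <- lm( price ~ area, data = tmp )    # fit a straight line
summary( model )                           # inspect the results
model$coef                                 # intercept and slope
```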

Discrete Variable

In the following call, plot() creates a boxplot giving us a glimpse of the distribution of price across the various housing types. Notice the use of the tilde ‘~’, which is interpreted as price explained by type.

plot( price ~ type, las = 2, data = nanaimo )
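The per-group medians drawn by the boxplot can also be computed directly with tapply(); here on a made-up data frame whose type and price columns stand in for the Nanaimo ones:

```r
# Made-up data: two housing types, prices in dollars
toy <- data.frame(
    type  = c('house', 'house', 'condo', 'condo', 'condo'),
    price = c(400, 500, 250, 300, 350) * 1000 )

tapply( toy$price, toy$type, median )   # median price per housing type
```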

The boxplot displays the minimum, lower quartile, median, upper quartile, maximum, and outliers.
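These statistics can also be computed directly with fivenum() and quantile(); the vector below is made up for illustration:

```r
x <- c(2, 4, 4, 5, 7, 9, 30)   # made-up prices, in $10,000s
fivenum( x )    # min, lower hinge, median, upper hinge, max
quantile( x )   # 0%, 25%, 50%, 75%, 100%
```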

Linear Model


Modelling

A model has a deterministic component and a random component.

The validity of a model only arises from its ability to dominate other models.
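One concrete way to let models compete is to compare their AIC, where lower is better; a sketch on simulated data (all names and numbers below are made up):

```r
set.seed(1)
d   <- data.frame( x = runif(50, 0, 10) )
d$y <- 3 + 2 * d$x + rnorm(50)           # simulated straight-line data
m1  <- lm( y ~ 1, data = d )             # intercept-only model
m2  <- lm( y ~ x, data = d )             # straight-line model
AIC( m1, m2 )                            # the line dominates the constant model
```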

Fitting a model

We are assuming that the dependent variable \(y\) depends on the independent variable \(x\).

\[ y_{i}=\alpha+\beta_{1} x_{i}+error_{i} \] The model consists of a straight line with intercept \(\alpha\) and slope \(\beta_{1}\).

The model of interest is \[ price=\alpha+\beta_1 area+error \] which can be fitted in R using the linear model function, lm().

model <- lm( price ~ area, data = subset(nanaimo, area>0) )

Notice the use of the tilde “~”, which is interpreted as “explained by”. Thus, the above would read as price explained by area.

Plotting a linear model


  • Prediction Line \(\hat{y}_{i}=\hat{\alpha}+\hat{\beta}x_{i}\), e.g. the predicted value for a house with 1500 sq.ft. is given by \[ \$314{,}547.50=\hat{\alpha}+\hat{\beta}\times1500. \]

Vector multiplication

\[\mathbf{X} \mathbf{\beta}= \left[\begin{array} {rr} 1 & 1500 \end{array}\right] \left[\begin{array} {r} \hat{\alpha} \\ \hat{\beta} \end{array}\right] \]

model$coef %*% c(1, 1500)
##          [,1]
## [1,] 314547.5
# or
tmp <- data.frame( area = 1500 )
predict( model, tmp, interval = 'confidence' )
##        fit      lwr      upr
## 1 314547.5 296484.3 332610.6
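predict() can also return a prediction interval, which is wider than the confidence interval because it accounts for the error term of an individual house. A sketch on simulated data (the model m and its numbers below are made up, not the Nanaimo fit):

```r
set.seed(1)
# Simulated housing data: price roughly linear in area, plus noise
d       <- data.frame( area = seq(800, 3000, by = 100) )
d$price <- 50000 + 180 * d$area + rnorm( nrow(d), sd = 20000 )
m       <- lm( price ~ area, data = d )
new     <- data.frame( area = 1500 )
predict( m, new, interval = 'confidence' )   # where the average price sits
predict( m, new, interval = 'prediction' )   # where one house's price sits
```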

The %*% operator multiplies the two vectors as matrices. Note the difference between element-wise multiplication and vector (matrix) multiplication.

x <- c(1,2,3)
y <- c(2,3,4)
x*y
## [1]  2  6 12
x %*% y
##      [,1]
## [1,]   20
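Equivalently, the inner product is just the sum of the element-wise products:

```r
x <- c(1, 2, 3)
y <- c(2, 3, 4)
sum( x * y )      # 20: add up the element-wise products
drop( x %*% y )   # drop() turns the 1x1 matrix into a plain number
```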

Graphing the regression line

tmp <- subset( nanaimo, area > 0 )
plot( price ~ area, data = tmp,
    col = ifelse( model$fitted > tmp$price, 'green', 'red' ) )
abline( model )

Fitted Model

Which houses would you buy? The red ones or the green ones?

Outliers


We can use the base R plot() function to track outliers. Calling plot() on a fitted model produces four diagnostic plots:

  • Residuals vs Fitted: no obvious pattern should be present; that is, the errors should appear random.
  • Normal Q-Q: are the residuals normally distributed?
  • Scale-Location: checks for heteroskedasticity. Is the variance of the error terms equal along the range of predictors?
  • Residuals vs Leverage: checks for influential outliers.

plot( model )
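Beyond the diagnostic plots, residuals() lets you rank the worst-fitting observations by hand; a sketch on a made-up data set with one planted outlier:

```r
# Toy fit so the snippet stands alone; row 10 is a deliberate outlier
d <- data.frame( x = 1:10, y = c(2, 4, 6, 8, 10, 12, 14, 16, 18, 60) )
m <- lm( y ~ x, data = d )
which.max( abs( residuals( m ) ) )                      # row of the largest residual
sort( abs( residuals( m ) ), decreasing = TRUE )[1:3]   # three biggest misses
```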