# Tutorials

Learn how to use R.

### Linear Regression

```r
nanaimo <- read.csv(
  'http://latul.be/mba_565/data/nanaimo.csv')
trinity <- c('price', 'area', 'landarea')
pairs( nanaimo[, trinity] )
```

#### Subset the data

Only keep houses located on lots smaller than an acre.

```r
tmp <- subset(nanaimo, area > 0 & landarea < 1)
pairs( tmp[, trinity] )
```

At this point, you should have the raw data for your project. If not, collect it this weekend.

• Are there any variables you should inspect?
• How do your variables interact?
• Which of your variables are discrete? Continuous?
• What would you like to measure?

Try to include an analytical component in your project. This will greatly help with your ABP.

• Could you model your data in order to conduct inference?
• Is there a theory that would explain why one of your variables depends on another?
• Could you use a linear model?

### Analytical Steps

• Load data: read.csv(), attach()
• Have a look: summary(), pairs(), plot(), density(), ggplot(), geom_smooth()
• Get rid of potential outliers: subset()
• Model: lm( y ~ x1 + x2 + x3 )
• Results: summary()
• Information: model$coef, model$fitted, plot( model )
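The steps above can be sketched end to end on the Nanaimo data used throughout this tutorial (column names price, area, and landarea as introduced earlier):

```r
# Load and inspect
nanaimo <- read.csv('http://latul.be/mba_565/data/nanaimo.csv')
summary(nanaimo)

# Remove potential outliers, then fit and inspect a linear model
tmp   <- subset(nanaimo, area > 0 & landarea < 1)
model <- lm(price ~ area + landarea, data = tmp)
summary(model)   # coefficients, R-squared, p-values
model$coef       # estimated intercept and slopes
plot(model)      # diagnostic plots
```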

#### Discrete Variable

In the following call, plot() creates a boxplot, giving us a glimpse of the distribution of price across the various housing types. Notice the use of the tilde ‘~’, which is interpreted as price explained by type.

```r
plot( price ~ type, las = 2, data = nanaimo )
```

The boxplot displays the minimum, lower quartile, median, upper quartile, maximum, and any outliers.
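The same five numbers can be computed directly (assuming nanaimo is loaded as above); note that fivenum() returns hinges, which can differ slightly from the quartiles reported by quantile():

```r
fivenum( nanaimo$price )   # minimum, lower hinge, median, upper hinge, maximum
quantile( nanaimo$price )  # 0%, 25%, 50%, 75%, 100% quantiles
tapply( nanaimo$price, nanaimo$type, median )  # median price per housing type
```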

### Linear Model

#### Modelling

A model has a deterministic component and a random component.

The validity of a model arises only from its ability to outperform competing models.
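A quick way to see the two components is to simulate them: below, the deterministic part is a straight line and the random part is normal noise. All numbers here are made up purely for illustration.

```r
set.seed(42)
x <- runif(100, 500, 3000)               # e.g. house areas in sq.ft. (illustrative)
deterministic <- 50000 + 120 * x         # alpha + beta * x
random        <- rnorm(100, sd = 30000)  # the error term
y <- deterministic + random
plot( y ~ x )
abline( 50000, 120 )                     # the true line the data scatter around
```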

#### Fitting a model

We are assuming that the dependent variable $$y$$ depends on the independent variable $$x$$.

$y_{i}=\alpha+\beta_1 x_{i}+error_{i}$ The model consists of a straight line with intercept $$\alpha$$ and slope $$\beta_1$$.

The model of interest is $price=\alpha+\beta_1 area+error$, which can be fitted in R using the linear model function, lm().

```r
model <- lm( price ~ area, data = subset(nanaimo, area > 0) )
```

Notice the use of the tilde “~”, which is interpreted as “explained by”. Thus, the above reads as price explained by area.
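The same formula notation extends to several predictors joined with +, and it handles continuous and discrete variables alike (both columns below exist in the Nanaimo data):

```r
lm( price ~ area + landarea, data = subset(nanaimo, area > 0) )  # two continuous predictors
lm( price ~ area + type,     data = subset(nanaimo, area > 0) )  # continuous plus discrete
```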

### Plotting a linear model

• Prediction Line $$\hat{y}_{i}=\hat{\alpha}+\hat{\beta}x_{i}$$, e.g. the predicted value for a house with 1500 sq.ft. is given by $\$321{,}590=\hat{\alpha}+\hat{\beta}\times1500.$

#### Vector multiplication

$\mathbf{X} \boldsymbol{\beta}= \left[\begin{array}{rr} 1 & 1500 \end{array}\right] \left[\begin{array}{r} \hat{\alpha} \\ \hat{\beta} \end{array}\right]$

```r
model$coef %*% c(1, 1500)
##          [,1]
## [1,] 314547.5

# or
tmp <- data.frame( area = 1500 )
predict( model, tmp, interval = 'confidence' )
##        fit      lwr      upr
## 1 314547.5 296484.3 332610.6
```

The operator above, %*%, multiplies the two vectors. See the difference between regular (element-wise) multiplication and vector multiplication.

```r
x <- c(1,2,3)
y <- c(2,3,4)
x*y
## [1]  2  6 12
x %*% y
##      [,1]
## [1,]   20
```

#### Graphing the regression line

```r
tmp <- subset( nanaimo, area > 0 )
plot( price ~ area, data = tmp,
      col = ifelse( model$fitted > tmp$price, 'green', 'red') )
abline( model$coef )
```

#### Fitted Model

Which houses would you buy? The red ones or the green ones?

### Outliers

We can use the base R plot() function on a fitted model to track outliers.

• **Residuals vs Fitted** No obvious pattern should be present; that is, the errors should appear random.
• **Normal Q-Q** Are the residuals normally distributed?
• **Scale-Location** Checks for heteroskedasticity. Is the variance of the error terms equal along the range of predictors?
• **Residuals vs Leverage** Just check for outliers.

```r
plot( model )
```
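By default, plot() on an lm object cycles through these diagnostic plots one at a time; the which argument selects a single plot directly:

```r
plot( model, which = 1 )  # Residuals vs Fitted only
plot( model, which = 2 )  # Normal Q-Q only
```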