Learn how to use R.

```
nanaimo <- read.csv(
'http://latul.be/mba_565/data/nanaimo.csv')
trinity <- c('price', 'area', 'landarea')
pairs( nanaimo[, trinity] )
```

Keep only houses with a positive floor area that sit on lots smaller than an acre.

```
tmp <- subset(nanaimo, area > 0 & landarea < 1)
pairs( tmp[, trinity] )
```

At this point, you should have raw data for your project. If not, collect it this weekend.

- Are there any variables you should inspect?
- How do your variables interact?
- Which of your variables are discrete? Continuous?
- What would you like to measure?

Try to include an analytical component in your project. This will greatly help with your ABP.

- Could you model your data in order to conduct inference?
- Is there any theory that would explain why one of your variables depends on another?
- Could you use a linear model?

- Load data: `read.csv()`, `attach()`
- Have a look: `summary()`, `pairs()`, `plot()`, `density()`, `ggplot()`, `geom_smooth()`
- Get rid of potential outliers: `subset()`
- Model: `lm( y ~ x + z )`
- Results: `summary()`
- Information: `model$coef`, `model$fitted`, `plot( model )`
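The steps above can be sketched end to end. This is a minimal sketch on simulated data; the variable names and numbers are illustrative, not the course data set.

```r
# Simulate a small housing-style data set
set.seed(1)
n     <- 100
area  <- runif(n, 800, 3000)                       # square feet
price <- 50000 + 150 * area + rnorm(n, sd = 20000)
houses <- data.frame(price, area)

summary(houses)                                    # have a look

houses <- subset(houses, area > 0)                 # drop potential outliers

model <- lm(price ~ area, data = houses)           # model
summary(model)                                     # results

model$coef                                         # information
head(model$fitted)
```

The same sequence, with `read.csv()` in place of the simulation, is the workflow used on the Nanaimo data below.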

In the call below, `plot()` creates a boxplot giving us a glimpse of the distribution of price across the various housing types. Notice the use of the tilde `~`, which is interpreted as "price explained by type".

`plot( price ~ type, las = 2, data = nanaimo )`

The boxplot displays the minimum, lower quartile, median, upper quartile, maximum, and outliers.
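These five statistics can also be computed without a plot. A small sketch on simulated prices (note that `boxplot.stats()` applies the 1.5 × IQR whisker rule, so its whisker ends need not be the raw minimum and maximum):

```r
set.seed(42)
price <- rlnorm(200, meanlog = 12.5, sdlog = 0.4)  # simulated prices

fivenum(price)            # minimum, lower hinge, median, upper hinge, maximum
boxplot.stats(price)$out  # points flagged as outliers by the 1.5 * IQR rule
```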

A model has a deterministic component and a random component.

The validity of a model arises only from its ability to outperform competing models.

We are assuming that the dependent variable \(y\) depends on the independent variable \(x\).

\[ y_{i}=\alpha+\beta_1 x_{i}+error_{i} \] The model consists of a straight line with intercept \(\alpha\) and slope \(\beta_1\).
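As a quick sanity check on simulated data (not the Nanaimo set): generate points from a known line and confirm that a fitted model recovers \(\alpha\) and \(\beta_1\).

```r
set.seed(123)
x <- runif(200, 0, 10)
y <- 2 + 3 * x + rnorm(200, sd = 1)  # true alpha = 2, true beta_1 = 3
fit <- lm(y ~ x)
coef(fit)  # estimates should be close to (2, 3)
```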

The model of interest is \[ price=\alpha+\beta_1 area+error \] which can be fitted in R using the linear model function, `lm()`.

`model <- lm( price ~ area, data = subset(nanaimo, area>0) )`

Notice the use of tilda “~”, which is interpreted as “explained by”. Thus, the above would read as price explained by area.

**Prediction Line** \(\hat{y}_{i}=\hat{\alpha}+\hat{\beta}x_{i}\), e.g. the predicted value for a house with 1500 sq. ft. is given by \[ \$314,548=\hat{\alpha}+\hat{\beta}\times1500. \]

\[\mathbf{X} \mathbf{\beta}= \left[\begin{array} {rr} 1 & 1500 \end{array}\right] \left[\begin{array} {r} \hat{\alpha} \\ \hat{\beta} \end{array}\right] \]

`model$coef %*% c(1, 1500)`

```
## [,1]
## [1,] 314547.5
```

```
# or
tmp <- data.frame( area = 1500 )
predict( model, tmp, interval = 'confidence' )
```

```
## fit lwr upr
## 1 314547.5 296484.3 332610.6
```

The `%*%` operator above performs matrix multiplication. Compare regular (element-wise) multiplication with matrix multiplication:

```
x <- c(1,2,3)
y <- c(2,3,4)
x*y
```

`## [1] 2 6 12`

`x %*% y`

```
## [,1]
## [1,] 20
```
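In other words, for two vectors `%*%` is the inner (dot) product: a 1 × 1 matrix whose single entry equals the sum of the element-wise products.

```r
x <- c(1, 2, 3)
y <- c(2, 3, 4)
# drop() turns the 1 x 1 matrix into a plain number
drop(x %*% y) == sum(x * y)  # 2 + 6 + 12 = 20 in both cases
```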

```
plot( price ~ area, col = ifelse( model$fitted > price, 'green','red'),
data = subset( nanaimo, area > 0) )
abline( model$coef )
```

Which houses would you buy? The red ones or the green ones?

We can use the base R `plot()` function on a fitted model to track outliers. It produces four diagnostic plots:

- **Residuals vs Fitted**: no obvious pattern should be present; that is, the errors should appear random.
- **Normal Q-Q**: are the residuals normally distributed?
- **Scale-Location**: checks for heteroskedasticity. Is the variance of the error terms equal along the range of predictors?
- **Residuals vs Leverage**: just check for outliers.

`plot( model )`