title: "Week 6 Lab, MAST20005/MAST90058"
author: "School of Mathematics and Statistics, University of Melbourne"
date: "Semester 2, 2022"
institute: "University of Melbourne"
fontsize: 11pt
classoption: a4paper
output:
  html_document:
    toc_float:
      collapsed: false
**Goals:** (i) Fit and interpret the simple regression model in R; (ii)
Check model assumptions; (iii) Explore more general estimation methods beyond
the least squares method.
**Data:** In this lab, we will use the `crime` data from Agresti and Finlay
(1997), describing crime rates in the United States. The variables are state
ID (`sid`), state name (`state`), violent crimes per 100,000 people (`crime`),
murders per 1,000,000 people (`murder`), the percentage of the population
living in metropolitan areas (`pctmetro`), the percentage of the population
that are white (`pctwhite`), the percentage of the population with a high
school education or above (`pcths`), the percentage of the population living
under the poverty line (`poverty`), and the percentage of the population that
are single parents (`single`). Data can be obtained from the shared folder in
the computer labs or from the LMS. The file is in
Stata binary format (`.dta`), so you will need to install the `foreign` package
to import it into R.
```{r echo=FALSE}
set.seed(1938)
```
# Explore pairwise associations
R can import and export data in different formats. For example, the
`crime.dta` data file is produced by Stata, a commercial statistical software
package. The R package `foreign` defines useful functions to manipulate data
formats different from `.csv` or `.txt`. If you are working on your own computer
(rather than the lab machines), you might need to install it first:
```{r eval=FALSE}
install.packages("foreign")
```
Now we can load the data and compute some basic summary statistics:
```{r results="hide"}
library(foreign)
cdata <- read.dta("crime.dta")
summary(cdata)
```
Let's explore pairwise associations between variables by constructing a scatter
plot matrix. We are interested in finding variables related to crime.
```{r fig.height=6.5}
plot(cdata[, -c(1, 2)])
```
We excluded the first two columns since they do not contain numerical
variables. Often, transforming the variables with the `log()` function
helps us to see linear associations between them.
```{r eval=FALSE}
plot(log(cdata[, -c(1, 2)]))
```
The strength of linear association between two variables is estimated by the
sample correlation coefficient:
\[ \hat\rho = \frac{\sum_i (x_i - \bar{x}) (y_i - \bar{y})}
{\sqrt{\sum_i (x_i - \bar{x})^2
\sum_i (y_i - \bar{y})^2}}. \]
Recall that $\hat\rho = \pm 1$ indicates perfect linear association, while
$\hat\rho \approx 0$ indicates a lack of linear association.
A matrix containing pairwise correlation coefficients can be obtained using the
following command:
```{r eval=FALSE}
cor(cdata[, -c(1, 2)])
```
This is a bit hard to read. Let's multiply by 100 and round to the nearest
integer (e.g.\ 0.54 becomes 54, etc.):
```{r}
round(100 * cor(cdata[, -c(1, 2)]))
```
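As a quick check of the formula, we can compute $\hat\rho$ directly for one
pair of variables and compare it with `cor()` (a minimal sketch; the variables
`x0` and `y0` are just local shorthand):

```{r}
# Compute the sample correlation between single and crime "by hand"
x0 <- cdata$single
y0 <- cdata$crime
sum((x0 - mean(x0)) * (y0 - mean(y0))) /
    sqrt(sum((x0 - mean(x0))^2) * sum((y0 - mean(y0))^2))
cor(x0, y0)  # should match
```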
Are there any variables showing linear association with crime?
What about other types of association?
# Fit a linear model
Let's fit a simple linear regression model and carry out some diagnostics on
the fitted model.
The following command fits the model:
\[ \text{crime}_i = \alpha + \beta \times \text{single}_i + \varepsilon_i, \quad
   \varepsilon_i \sim \mathrm{N}(0, \sigma^2). \]
```{r}
m1 <- lm(crime ~ single, data = cdata)  # fit model (compute estimates)
summary(m1)  # show results
```
Next, we want to report estimates for $\alpha$ and $\beta$, and give confidence
intervals, and then interpret the value of $\hat\beta$ in the context of the data.
From the previous output, the estimates are $\hat\alpha = -1362.53$ and
$\hat\beta = 174.42$. Confidence intervals can be computed using:
```{r}
confint(m1)
```
States with a larger proportion of single parents tend to have higher violent
crime rates. In particular, we expect about 174 additional violent crimes per
100,000 people for each 1 percentage point increase in the proportion of single
parents. Since the 95\% confidence interval for $\beta$ is clearly above 0, we
have good evidence for this positive association.
Next, we show examples of how to obtain interval estimates for the response variable.
Suppose we want a 95\% confidence interval for the mean crime rate when the
percent of population that are single parents is 11.5. In other words, compute
a confidence interval for $\mu (11.5) = \alpha + \beta \times 11.5$.
```{r}
newdata <- data.frame(single = 11.5)
predict(m1, newdata, interval = "confidence")
```
Then obtain a 95\% prediction interval for $Y$ for the same scenario.
```{r}
predict(m1, newdata, interval = "prediction")
```
(Why is the prediction interval wider than the confidence interval?)
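One way to see the difference is to compare the interval widths directly
(a small sketch reusing `newdata` from above; `ci` and `pr` are just local names):

```{r}
ci <- predict(m1, newdata, interval = "confidence")
pr <- predict(m1, newdata, interval = "prediction")
ci[, "upr"] - ci[, "lwr"]  # width of the confidence interval
pr[, "upr"] - pr[, "lwr"]  # width of the prediction interval
```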
Next, we want to estimate the error standard deviation, $\sigma$.
From the previous output we obtain $\hat\sigma = 242.5$ (the "Residual standard error").
Note that R computes $\hat\sigma = \sqrt{d^2 / (n - 2)}$, where $d^2 = \sum_i
r^2_i$ is the sum of squared residuals, with $r_i = y_i - (\hat\alpha
+ \hat\beta x_i)$. This estimate is different from the MLE of $\sigma$, which
is computed as $\hat\sigma_\text{MLE} = \sqrt{d^2 / n}$. If we needed the
latter (which we rarely do in practice), we could calculate it in R as follows:
```{r}
n <- nrow(cdata)            # sample size
RSS <- sum(m1$residuals^2)  # extract, square and sum the residuals
sqrt(RSS / n)               # MLE
```
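For comparison, the $n - 2$ version should reproduce the residual standard
error reported by `summary(m1)`:

```{r}
sqrt(RSS / (n - 2))  # should match the 'Residual standard error' above
```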
Let us have a closer look at the residuals, $r_i = \text{crime}_i
- (\hat{\alpha} + \hat{\beta} \, \text{single}_i)$. These represent the
left-over, unexplained variation under the model. Model diagnostic
plots can be obtained as follows:
```{r fig.width=8, fig.height=4}
par(mfrow = c(1, 2))  # set up a 1*2 array for drawing plots
plot(m1, 1:2)         # this will draw 2 plots, filling up the 1*2 array
```
The first plot shows the residuals against fitted values. The red line is a
smooth approximation of the data, helping us to detect anomalous trends in the
residuals. Ideally, the residuals should be scattered randomly around the zero
line. The second plot is a QQ plot to check how closely the residuals follow a
normal distribution. While the central part of the distribution looks normal,
a few of the extreme observations deviate a little from the normality
assumption.
Overall the model looks reasonably appropriate for the data except for a few
points that might be incompatible with the model assumptions. The most extreme
data points are numbered in the plots, for convenience. In this case, it is
worth looking closely at observations 9, 25 and 51, to see if there is any
reason to suspect they might not be representative of the other states.
```{r}
cdata[c(9, 25, 51), ]
```
# Comparisons with the null model
Consider the model
\[ \text{crime}_i = \alpha + \varepsilon_i, \quad
   \varepsilon_i \sim \mathrm{N}(0, \sigma^2) \]
and give an estimate of $\sigma$. This model is sometimes called the
*null model* since it assumes that there is no linear relationship between
the predictor(s) and `crime`, by assuming that the mean of
`crime` is constant. An estimate of $\sigma$ in this case is just the
sample standard deviation $s_y = \sqrt{\sum_i (y_i - \bar{y})^2 / (n - 1)}$.
```{r}
y <- cdata$crime
sd(y)  # sample standard deviation of crime
```
The estimated standard deviation for this model is much larger than that for
the simple linear regression model we fitted earlier, suggesting that the
earlier model is superior to the null model in terms of explaining the
variability in `crime`.
By the way, you can also fit the null model in R using `lm()` as follows:
```{r}
m0 <- lm(crime ~ 1, data = cdata)
summary(m0)
```
Can you spot the estimates of $\alpha$ and $\sigma$ in the output?
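If you want to extract them programmatically rather than read them off the
output, the standard accessor functions can be used:

```{r}
coef(m0)   # estimate of alpha (the sample mean of crime)
sigma(m0)  # estimate of sigma (the sample standard deviation)
```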
The improvement in terms of explained variability of the simple linear
regression model, compared to the null model, is typically measured by the
*coefficient of determination*, also known as "R-squared":
\[ R^2 = 1 - \frac{\text{residual sum of squares}}
                  {\text{total sum of squares}}
       = 1 - \frac{\sum_i (y_i - \hat\alpha - \hat\beta x_i)^2}
                  {\sum_i (y_i - \bar{y})^2}. \]
```{r}
TSS <- sum((y - mean(y))^2)
1 - RSS / TSS
```
This number is also given in the summary output of `lm()`, labelled as
"Multiple R-squared" (can you spot it in the earlier output for the
model `m1`?). If $R^2 = 1$, the linear relationship between
the predictor and the response explains all the variability in the response
(i.e.\ the points are exactly on a straight line). In this case $R^2 =
0.7037$, meaning that about 70\% of the variability in `crime` is
explained by the linear relationship between `crime` and `single`.
The coefficient of determination can be used to compare models with the same
number of predictors. For example, let us fit another model:
```{r results="hide"}
m2 <- lm(crime ~ poverty, data = cdata)
summary(m2)
```
Clearly the linear model including `single` as a predictor does a better
job of explaining the variability in `crime` than the model with
`poverty`.
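To make this concrete, we can extract the two $R^2$ values from the model
summaries and compare them:

```{r}
summary(m1)$r.squared  # model with single
summary(m2)$r.squared  # model with poverty
```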
# Robust regression
There are a number of estimation methods that may be used to fit the regression
model and the choice of the method ultimately depends on the goal of the
analysis. In this section we explore the use of the weighted least squares (WLS)
method to compute estimates for $\alpha$ and $\beta$. WLS finds $\alpha$ and
$\beta$ by solving the weighted least squares problem,
\[ \min_{\alpha, \beta} \sum_i w_i (y_i - \alpha - \beta x_i)^2, \]
where $w_i$ are user-defined weights. When the goal is to reduce the influence
of potential outliers we can use the Huber weights, which have the form:
\[ w_i =
   \begin{cases}
   1 & \text{if } \lvert e_i \rvert \leqslant k \\
   k / \lvert e_i \rvert & \text{if } \lvert e_i \rvert > k
   \end{cases} \]
where $k > 0$ is a tuning constant^[A *tuning constant* or *tuning parameter*
is a parameter in a model that usually needs to be set to a value before we can
estimate the other parameters. If so, then it is not a parameter that we
estimate. Hence, we can think of it as one of the (many) assumptions we need
to make in order to do inference.] and $e_i = y_i - \alpha - \beta x_i$. In
Huber weighting, observations with small residuals get a weight of 1 and the
larger the residual, the smaller the weight.
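To see how the weighting scheme behaves, here is a minimal sketch of the Huber
weight function (the name `huber_weight` is ours, not from any package):

```{r}
# Huber weights: 1 for small residuals, decreasing for large ones
huber_weight <- function(e, k) {
    ifelse(abs(e) <= k, 1, k / abs(e))
}

# Example: weights for a few residuals, with k = 1.345
huber_weight(c(-3, -1.5, 0, 0.5, 2), k = 1.345)
```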
In the following example, we use the `rlm()` function from the `MASS` package
to carry out WLS estimation with Huber weights (there are several other
weighting functions that can be used). The default value of the tuning
constant is $k = 1.345 \hat\sigma$. Theory suggests that this value leads to
a considerable increase in robustness but with only a small increase in the
variance of the estimator.
```{r}
library(MASS)
m3 <- rlm(crime ~ single, data = cdata)
summary(m3)
```
The following gives a table with Huber weights (in increasing order) and the
corresponding residuals:
```{r}
hweights <- data.frame(state = cdata$state,
                       resid = m3$resid,
                       weight = m3$w)
hweights2 <- hweights[order(m3$w), ]
hweights2[1:15, ]
```
We can see that as the absolute residual goes down, the weight goes up. In
other words, cases with large residuals tend to be down-weighted. This
output shows us that the observation for Mississippi will be down-weighted the
most. Florida will also be substantially down-weighted. All observations not
shown above have a weight of 1. In ordinary least squares regression, all
cases have a weight of 1.
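As a sanity check, we can approximately reproduce these weights from the
scaled residuals using `huber_weight()` from above (this assumes the default
tuning constant $k = 1.345$ applied to the residuals divided by the scale
estimate `m3$s`):

```{r}
w_check <- huber_weight(m3$resid / m3$s, k = 1.345)
head(cbind(rlm = m3$w, manual = w_check))  # should be (nearly) identical
```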
# Exercises
1. The following questions refer to the simple linear regression model.
a) Draw a plot of the data and an overlaid model fit.
```{r}
# Insert answer here...
```
b) Is there a data point that's very different to the others? Which one is it?
```{r}
# Insert answer here...
```
c) Do you think this point has a strong impact on the model fit, in the
sense of having a larger influence on the parameter estimates than the
other observations?
```{r}
# Insert answer here...
```
d) Check whether this is a concern by excluding this data point and
re-fitting the model. How do the estimates compare to the original ones?
```{r}
# Insert answer here...
```
2. The President of the United States asks you to find out how the murder rate
relates to the proportion of single parents.
a) Fit a simple linear regression model that relates these two variables.
```{r}
# Insert answer here...
```
b) Draw a standard set of diagnostic plots. What do you see?
```{r}
# Insert answer here...
```
c) Remove one of the data points that seems to be problematic and re-fit
the model. (Note: this now means the model fit is representative of all of
the remaining states but not the one that has been omitted.) How does your
model fit compare to the original?
```{r}
# Insert answer here...
```
d) The President would like you to predict the murder rate a state would have
if 12\% of its population were single parents. Calculate an appropriate
90\% prediction interval.
*Hint:* You might need to run `help(predict.lm)` to find out how to change
the confidence level of the interval.
```{r}
# Insert answer here...
```
e) Repeat this for a state with only 8\% single parents. What do you notice?
```{r}
# Insert answer here...
```
f) Since the measurements are all positive, a standard way to fix the
problem you have just observed is to first take the logarithm of all of the
data values and work with those instead. Do this: re-fit the model and
re-compute the prediction interval. (Remember to transform your interval
back to the original scale at the end.)
```{r}
# Insert answer here...
```
g) What model is this procedure equivalent to? Can you write its model
equation? How do you interpret the 'slope' parameter in this model?
```{r}
# Insert answer here...
```
h) Draw a plot of the data and a 90\% confidence band using this latest model.
```{r}
# Insert answer here...
```
3. Do question 7 from the tutorial problems.
```{r}
# Insert answer here...
```