Lecture 2: Regression analysis – A quick review of classical linear regression.
Can Yang
Department of Mathematics
The Hong Kong University of Science and Technology
Most of the materials here are from Chapter 3 of An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
Other related materials are listed in the References.
Spring, 2020
Outline
Simple Linear Regression
Multiple Linear Regression
Comparison of Linear Regression with K-Nearest Neighbors
Outline
Simple Linear Regression
Multiple Linear Regression
Comparison of Linear Regression with K-Nearest Neighbors
A simple example of linear regression: Advertising data
The Advertising data set consists of the sales of a particular product in 200 different markets, along with advertising budgets for the product in each of those markets for three different media: TV, radio, and newspaper.
Suppose that in our role as statistical consultants we are asked to suggest, on the basis of this data, a marketing plan for next year that will result in high product sales. What information would be useful in order to provide such a recommendation?
Advertising data
Figure 1: The Advertising data set. The plot displays sales, in thousands of units, as a function of TV, radio, and newspaper budgets, in thousands of dollars, for 200 different markets. In each plot we show the simple least squares fit of sales to that variable. In other words, each blue line represents a simple model that can be used to predict sales using TV, radio, and newspaper, respectively.
Here are a few important questions that we might seek to address:
1. Is there a relationship between advertising budget and sales?
2. How strong is the relationship between advertising budget and sales?
3. Which media contribute to sales?
4. How accurately can we estimate the effect of each medium on sales?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?
It turns out that linear regression can be used to answer each of these questions. We will first discuss all of these questions in a general context, and then return to them later.
Simple linear regression
Simple linear regression is a straightforward approach for predicting a quantitative response Y on the basis of a single predictor variable X. It assumes that there is approximately a linear relationship between X and Y. Mathematically, we can write this linear relationship as
Y ≈β0 +β1X, (1)
e.g., sales ≈ β0 + β1 × TV.
Here β0 and β1 are two unknown constants that represent the intercept and slope terms in the linear model. Together, β0 and β1 are known as the model coefficients or parameters.
Once we have used our training data to produce estimates β̂0 and β̂1 for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing
yˆ≈βˆ0 +βˆ1x, (2)
where ŷ indicates a prediction of Y on the basis of X = x. Here we use a hat symbol, ˆ, to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.
Estimating the Coefficients
In practice, β0 and β1 are unknown. Let (x1, y1), (x2, y2), …, (xn, yn) represent n observation pairs, each of which consists of a measurement of X and a measurement of Y.
The most common approach involves minimizing the least squares
criterion. Alternative approaches will be discussed later.
Let yˆi = βˆ0 + βˆ1xi be the prediction for Y based on the i-th value of X.
Then ei = yi − ŷi represents the i-th residual, i.e., the difference between the i-th observed response value and the i-th response value that is predicted by our linear model. We define the residual sum of squares (RSS) as
RSS = e1² + e2² + ··· + en² = ∑_{i=1}^n (yi − β̂0 − β̂1 xi)².  (3)
The least squares coefficient estimates are given as
β̂1 = [∑_{i=1}^n (xi − x̄)(yi − ȳ)] / [∑_{i=1}^n (xi − x̄)²],   β̂0 = ȳ − β̂1 x̄,  (4)
where ȳ ≡ (1/n)∑_{i=1}^n yi and x̄ ≡ (1/n)∑_{i=1}^n xi.
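As a minimal sketch (not from the original slides), the closed-form estimates (4) can be checked directly in R on simulated data; the sample size, coefficients, and variable names below are illustrative assumptions.

# Simple linear regression via the closed-form estimates (4) on simulated data
set.seed(1)
n <- 200
x <- runif(n, 0, 300)                 # an illustrative TV-like budget
y <- 7 + 0.05 * x + rnorm(n, sd = 3)  # true intercept 7, true slope 0.05
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)
c(beta0_hat, beta1_hat)               # should closely match coef(lm(y ~ x))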
Figure 2: For the Advertising data, the least squares fit for the regression of sales onto TV is shown. The fit (βˆ0 = 7.03, βˆ1 = 0.0475) is found by minimizing the sum of squared errors. Each grey line segment represents an error, and the fit makes a compromise by averaging their squares. In this case a linear fit captures the essence of the relationship, although it is somewhat deficient in the left of the plot.
Figure 3: Contour and three-dimensional plots of the RSS on the Advertising data, using sales as the response and TV as the predictor. The red dots correspond to the least squares estimates βˆ0 and βˆ1.
Assessing the Accuracy of the Coefficient Estimates
Assume the true relationship between X and Y :
Y = f(X) + ε,  f(X) = β0 + β1X.  (5)
Figure 4: A simulated data set. Left: The red line represents the true relationship, f (X ) = 2 + 3X , which is known as the population regression line. The blue line is the least squares line; it is the least squares estimate for
f (X ) based on the observed data, shown in black. Right: The population regression line is again shown in red, and the least squares line in dark blue. In light blue, ten least squares lines are shown, each computed on the basis of a separate random set of observations. Each least squares line is different, but on average, the least squares lines are quite close to the population regression line.
At first glance, the difference between the population regression line and the least squares line may seem subtle and confusing. We only have one data set, and so what does it mean that two different lines describe the relationship between the predictor and the response?
Fundamentally, the concept of these two lines is a natural extension of the standard statistical approach of using information from a sample to estimate characteristics of a large population.
For example, suppose that we are interested in knowing the population mean μ of some random variable Y . Unfortunately, μ is unknown, but we do have access to n observations from Y ,which we can write as y1,…,yn, and which we can use to estimate μ.
A reasonable estimate is μ̂ = ȳ, where ȳ = (1/n)∑_{i=1}^n yi is the sample mean. The sample mean and the population mean are different, but in general the sample mean will provide a good estimate of the population mean. In the same way, the unknown coefficients β0 and β1 in linear regression define the population regression line. We seek to estimate these unknown coefficients using least squares.
We continue the analogy with the estimation of the population mean μ of a random variable Y. A natural question is as follows: how accurate is the sample mean μ̂ as an estimate of μ?
In general, we answer this question by computing the standard error of μˆ:
Var(μ̂) = SE(μ̂)² = σ²/n,  (6)
where σ is the standard deviation of each of the realizations yi of Y .
Similarly, we can compute the standard errors associated with βˆ0
and βˆ1:
SE(β̂0)² = σ²[1/n + x̄²/∑_i(xi − x̄)²],   SE(β̂1)² = σ²/∑_i(xi − x̄)²,  (7)
where σ² can be estimated by σ̂² = RSS/(n − 2).
Accordingly, we have 95% confidence interval for β1 as
[β̂1 − 2·SE(β̂1), β̂1 + 2·SE(β̂1)].  (8)
Similarly for β0.
Hypothesis testing
The most common hypothesis test involves testing the null hypothesis of H0 : β1 = 0, (9)
i.e., there is no relationship between X and Y .
We compute a t-statistic,
t = (β̂1 − 0)/SE(β̂1),  (10)
which will have a t-distribution with n − 2 degrees of freedom.
Consequently, it is a simple matter to compute the probability of
observing any value equal to |t| or larger, assuming β1 = 0. We call this probability the p-value. We reject the null hypothesis if the p-value is small enough, e.g., <0.05 or <0.01.
             Coefficient   Std. error   t-statistic   p-value
Intercept    7.0325        0.4578       15.36         <0.0001
TV           0.0475        0.0027       17.67         <0.0001
Table 1: For the Advertising data, coefficients of the least squares model for the regression of number of units sold on TV advertising budget. An increase of $1,000 in the TV advertising budget is associated with an increase in sales by around 50 units.
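The standard error, t-statistic, p-value, and 95% interval of Eqs. (7), (8), and (10) can be computed by hand and cross-checked against R's summary table. This sketch continues the simulated (x, y, n) example above; it is not the Advertising data itself.

fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)
sigma2_hat <- rss / (n - 2)                            # estimate of sigma^2
se_beta1 <- sqrt(sigma2_hat / sum((x - mean(x))^2))    # SE(beta1_hat), Eq. (7)
t_stat <- coef(fit)[2] / se_beta1                      # t-statistic, Eq. (10)
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)
ci_95 <- coef(fit)[2] + c(-2, 2) * se_beta1            # approximate 95% CI, Eq. (8)
# Compare with summary(fit) and confint(fit)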
Assessing the accuracy of the model
The residual standard error (RSE):
RSE = √(RSS/(n − 2)) = √((1/(n − 2)) ∑_{i=1}^n (yi − ŷi)²).  (11)
It provides an absolute measure of lack of fit of the linear model.
The R² statistic provides an alternative measure of fit. It takes the form of a proportion, the proportion of variance explained, and so it always takes on a value between 0 and 1, and is independent of the scale of Y.
R² = (TSS − RSS)/TSS = 1 − RSS/TSS,  (12)
where TSS = ∑_i(yi − ȳ)² is the total sum of squares. The correlation between X and Y is defined as
r = Cor(X, Y) = [∑_i(xi − x̄)(yi − ȳ)] / [√(∑_i(xi − x̄)²) √(∑_i(yi − ȳ)²)].  (13)
In fact, it can be shown that R2 = r2 in the simple linear regression setting.
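Continuing the same simulated example, a two-line check (a sketch, not part of the slides) confirms that R² from Eq. (12) equals the squared correlation of Eq. (13) in simple linear regression.

rsq <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)   # Eq. (12)
all.equal(rsq, cor(x, y)^2)                               # TRUE up to rounding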
Outline
Simple Linear Regression
Multiple Linear Regression
Comparison of Linear Regression with K-Nearest Neighbors
Model
Instead of fitting a separate simple linear regression model for each predictor, a better approach is to extend the simple linear regression model so that it can directly accommodate multiple predictors. In general, suppose that we have p distinct predictors. Then the multiple linear regression model takes the form
Y =β0 +β1X1 +β2X2 +···+βpXp +ε, (14)
where Xj represents the jth predictor and βj quantifies the association between that variable and the response. We interpret βj as the average effect on Y of a one unit increase in Xj , holding all other predictors fixed.
In the advertising example, (14) becomes
sales=β0 +β1 ×TV+β2 ×radio+β3 ×newspaper+ε. (15)
Estimating the regression coefficients
Let β̂ = [β̂0, β̂1, ..., β̂p] be the least squares estimate:
β̂ = arg min_{β=[β0,β1,...,βp]} ∑_{i=1}^n (yi − β0 − β1xi1 − ··· − βpxip)².  (16)
The least squares solution is
βˆ = (XT X)−1XT y. (17)
where y = (y1, y2, ..., yn)ᵀ and X is the n × (p + 1) design matrix whose i-th row is (1, xi1, xi2, ..., xip).  (18)
Figure 5: Geometric interpretation of least squares (Figure 3.2 from The Elements of Statistical Learning by Hastie et al.).
Figure 6: In a three-dimensional setting, with two predictors and one response, the least squares regression line becomes a plane. The plane is chosen to minimize the sum of the squared vertical distances between each observation (shown in red) and the plane.
A simple R demo code for linear regression
# A linear regression function
# input:  n by p design matrix X; n-vector y
# output: coefficient estimates beta;
#         residual variance estimate sig2;
#         standard errors of beta;
#         t-statistics of beta;
#         p-values of beta;
#         Rsquare: 1 - RSS/TSS
#         (RSS: residual sum of squares; TSS: total sum of squares)
linReg <- function(X, y){
  n <- nrow(X)
  p <- ncol(X)
  X <- cbind(1, X)                        # add the intercept column
  invK <- solve(t(X) %*% X)               # (X^T X)^{-1}
  beta <- invK %*% (t(X) %*% y)           # least squares estimates
  residual <- y - X %*% beta
  Rsq <- 1 - sum(residual^2) / sum((y - mean(y))^2)
  sig2 <- sum(residual^2) / (n - p - 1)   # residual variance estimate
  Sig_beta <- sig2 * invK                 # covariance matrix of beta
  se <- sqrt(diag(Sig_beta))              # standard errors
  t <- beta / se                          # t-statistics
  pval <- pt(abs(t), n - p - 1, lower.tail = FALSE) * 2   # two-sided p-values
  return(list(beta = beta, sig2 = sig2, se = se, t = t, pval = pval, Rsq = Rsq))
}
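A short usage sketch (not in the original slides) on simulated data; the sample size and true coefficients are arbitrary choices, and lm() serves only as a cross-check.

# Usage example for linReg on simulated data
set.seed(123)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(1, 2, 0, -1)                  # intercept and three slopes
y <- cbind(1, X) %*% beta_true + rnorm(n)
out <- linReg(X, y)
out$beta                                     # close to beta_true
summary(lm(y ~ X))                           # built-in fit for comparison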
Elements of likelihood inference
Definition: Assuming a statistical model parameterized by a fixed and unknown parameter θ, the likelihood is the probability of the observed data x considered as a function of θ: l(θ) = p(x|θ).
Example: Suppose x is a sample from N(θ, 1). Then the likelihood of θ is l(θ) = (1/√(2π)) exp(−(1/2)(x − θ)²).
Example: Let x1, x2, . . . , xn be an independent and identically
distributed (i.i.d) sample from N(θ,σ2) with known σ2. The total likelihood function is
L(θ) = ∏_i li(θ) = ∏_{i=1}^n (1/√(2πσ²)) exp(−(xi − θ)²/(2σ²)),
and the total log-likelihood function is
log L(θ) = ∑_{i=1}^n log li(θ) = −(n/2) log(2πσ²) − (1/(2σ²)) ∑_{i=1}^n (xi − θ)².
The maximum likelihood estimate (MLE) is
θ̂ = arg max_θ log L(θ).
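As an illustrative sketch (an assumed example, not from the slides), the MLE of a normal mean can be found numerically in R and coincides with the sample mean.

# Numerical MLE for the mean of N(theta, sigma^2) with known sigma = 1.5
set.seed(1)
xs <- rnorm(50, mean = 2, sd = 1.5)
negloglik <- function(theta) -sum(dnorm(xs, mean = theta, sd = 1.5, log = TRUE))
opt <- optimize(negloglik, interval = c(-10, 10))
c(mle = opt$minimum, sample_mean = mean(xs))   # the two agree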
Variance and Fisher information
Let s(θ) = ∂ log l(θ)/∂θ = (1/p(x|θ)) ∂p(x|θ)/∂θ.
The score function has expectation 0,
E[s(θ)] = ∫ s(θ) p(x|θ) dx = ∫ ∂p(x|θ)/∂θ dx = (∂/∂θ) ∫ p(x|θ) dx = (∂/∂θ) 1 = 0.
The Fisher information Iθ is defined to be the variance of the score function s(θ):
Iθ = V(s(θ)) = E(s²(θ)) − E²(s(θ)) = E(s²(θ)) = ∫ s²(θ) p(x|θ) dx,
since E[s(θ)] = 0. Therefore we have
E[s(θ)] = 0,  V[s(θ)] = Iθ.  (19)
The main result for maximum likelihood estimation is that, for an i.i.d. sample of size n, the MLE θ̂ has an approximately normal distribution with mean θ and variance (nIθ)⁻¹:
θ̂ ∼ N(θ, (nIθ)⁻¹).
Proof: Variance and Fisher information (I)
The expectation of the derivative of the score function is
E[∂s(θ)/∂θ] = E[∂² log p(x|θ)/∂θ²] = −Iθ.  (20)
This is because
∂s(θ)/∂θ = ∂² log p(x|θ)/∂θ² = (∂²p(x|θ)/∂θ²)(1/p(x|θ)) − [(∂p(x|θ)/∂θ)(1/p(x|θ))]² = (∂²p(x|θ)/∂θ²)(1/p(x|θ)) − s²(θ).
Taking expectations,
E[∂s(θ)/∂θ] = E[(∂²p(x|θ)/∂θ²)(1/p(x|θ))] − E[s²(θ)] = 0 − Iθ = −Iθ,
where the first term vanishes because ∫ ∂²p(x|θ)/∂θ² dx = (∂²/∂θ²) ∫ p(x|θ) dx = 0.
Proof: Variance and Fisher information (II)
Now suppose that x = (x1, x2, . . . , xn) is an independent and identically distributed (i.i.d) sample from p(x|θ).
The total log-likelihood function is
log L(θ) = ∑_i log li(θ) = ∑_i log p(xi|θ).
The total score function is
S(θ) = ∑_i si(θ).
Similarly, we have
−∂S(θ)/∂θ = −∑_i ∂si(θ)/∂θ.
The MLE θ̂ satisfies S(θ̂) = 0. A first-order Taylor approximation gives
0 = S(θ̂) ≈ S(θ) + (∂S(θ)/∂θ)(θ̂ − θ).
Proof: Variance and Fisher information (III)
The above equation gives
θ̂ = θ + [−∂S(θ)/∂θ]⁻¹ S(θ) = θ + [−(1/n) ∂S(θ)/∂θ]⁻¹ S(θ)/n.
Eq. (19), i.e., E[s(θ)] = 0 and V[s(θ)] = Iθ, together with the central limit theorem, implies that
S(θ)/n = (∑_i si(θ))/n ∼ N(0, Iθ/n).
The law of large numbers and Eq. (20) imply that
−(1/n) ∂S(θ)/∂θ = −(1/n) ∑_i ∂si(θ)/∂θ → Iθ  as n → ∞.
Putting things together,
θ̂ ∼ N(θ, (nIθ)⁻¹).
In practice, nIθ is replaced by the observed Fisher information I(θ̂), where I(θ) = −∂² log L(θ)/∂θ² is evaluated at θ = θ̂.
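Continuing the normal-mean sketch above, the observed Fisher information can be approximated by a finite-difference second derivative of the log-likelihood; for that model it is n/σ², so the resulting variance approximation matches σ²/n. This is an added illustration, not part of the original slides.

loglik <- function(theta) sum(dnorm(xs, mean = theta, sd = 1.5, log = TRUE))
theta_hat <- mean(xs)                 # the MLE for this model
h <- 1e-4                             # finite-difference step
obs_info <- -(loglik(theta_hat + h) - 2 * loglik(theta_hat) + loglik(theta_hat - h)) / h^2
c(approx_var = 1 / obs_info, theory = 1.5^2 / length(xs))   # both ~ sigma^2 / n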
MLE in linear regression
Assuming y|X ∼ N(Xβ, σe²I), we can write the log-likelihood and its maximizer as
β̂ = arg max_β log L(β) = arg max_β log p(y|X; β) = arg max_β [ −(n/2) log(2πσe²) − (1/(2σe²))(y − Xβ)ᵀ(y − Xβ) ].
Clearly, this is equivalent to least squares. The residual variance can be estimated as
σ̂e² = ∥y − Xβ̂∥²/(n − p − 1).
By the likelihood theory, we have
Var(β̂) = I⁻¹(β̂) = [ −∂² log L(β)/∂β∂βᵀ |_{β=β̂} ]⁻¹ = σ̂e²(XᵀX)⁻¹.
An alternative way to obtain the variance of β̂
Consider the least squares solution
βˆ = (XT X)−1XT y. We can directly calculate Var(βˆ) as follows:
Var(β̂) = Var[(XᵀX)⁻¹Xᵀy]
= Var[(XᵀX)⁻¹Xᵀ(Xβ + e)]
= Var[β + (XᵀX)⁻¹Xᵀe]
= Var[(XᵀX)⁻¹Xᵀe]
= E[(XᵀX)⁻¹Xᵀe eᵀX(XᵀX)⁻¹]    (∗)
= (XᵀX)⁻¹Xᵀ E[eeᵀ] X(XᵀX)⁻¹
= (XᵀX)⁻¹Xᵀ (σe²I) X(XᵀX)⁻¹
= σe²(XᵀX)⁻¹,
where in the equality marked (∗) we used the fact that Var(z) = E[zzᵀ] − E[z]E[z]ᵀ for a random vector z, together with E[(XᵀX)⁻¹Xᵀe] = 0.
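A Monte Carlo sketch (with an arbitrary simulated design and noise level) that checks Var(β̂) = σe²(XᵀX)⁻¹ by repeatedly redrawing the errors and refitting.

set.seed(1)
n <- 50; sigma_e <- 1
X <- cbind(1, matrix(rnorm(n * 2), n, 2))       # design with intercept and 2 predictors
beta <- c(1, -2, 0.5)
fits <- replicate(5000, {
  y <- X %*% beta + rnorm(n, sd = sigma_e)
  drop(solve(t(X) %*% X, t(X) %*% y))           # least squares estimate
})
round(cov(t(fits)), 3)                          # empirical covariance of beta_hat
round(sigma_e^2 * solve(t(X) %*% X), 3)         # theoretical covariance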
Some important Questions
1. Is at least one of the predictors X1, X2, . . . , Xp useful in predicting the response?
2. Do all the predictors help to explain Y , or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?
One: Is There a Relationship Between the Response and Predictors?
In the multiple regression setting with p predictors, we need to ask whether all of the regression coefficients are zero, i.e. whether
β1 = β2 = · · · = βp = 0. We test the null hypothesis,
H0 :β1 =β2 =···=βp =0 (21) versus the alternative Ha: at least one βj is non-zero.
This hypothesis test is performed by computing the F-statistic,
F = [(TSS − RSS)/p] / [RSS/(n − p − 1)],  (22)
where TSS = ∑_i(yi − ȳ)² and RSS = ∑_i(yi − ŷi)².
Sometimes we want to test that a particular subset of q of the coefficients
are zero. This corresponds to a null hypothesis
H0 :βp−q+1 =βp−q+2 =···=βp =0, (23)
Then the appropriate F-statistic is
F = [(RSS0 − RSS)/q] / [RSS/(n − p − 1)],  (24)
where RSS0 is the residual sum of squares for the model that uses all the variables except the last q.
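Both F-tests can be sketched on simulated data (variable names and coefficients are illustrative): the overall test of Eq. (22) computed by hand, and the partial test of Eq. (24) cross-checked with anova().

set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)                    # x3 is a null predictor
full <- lm(y ~ x1 + x2 + x3)
tss <- sum((y - mean(y))^2); rss <- sum(residuals(full)^2)
p <- 3
F_overall <- ((tss - rss) / p) / (rss / (n - p - 1))      # Eq. (22)
reduced <- lm(y ~ x1 + x2)                                # drop the last q = 1 variable
rss0 <- sum(residuals(reduced)^2)
F_partial <- ((rss0 - rss) / 1) / (rss / (n - p - 1))     # Eq. (24)
anova(reduced, full)                                      # reports the same partial F-test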
Two: Deciding on Important Variables
If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones!
We could look at the individual p-values as in Table 2, but as discussed, if p is large we are likely to make some false discoveries.
             Coefficient   Std. error   t-statistic   p-value
Intercept    2.939         0.3119       9.42          <0.0001
TV           0.046         0.0014       32.81         <0.0001
radio        0.189         0.0086       21.89         <0.0001
newspaper    -0.001        0.0059       -0.18         0.8599
Table 2: For the Advertising data, least squares coefficient estimates of the multiple linear regression of number of units sold on radio, TV, and newspaper advertising budgets.
It is possible that all of the predictors are associated with the response, but it is more often the case that the response is only related to a subset of the predictors.
The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection.
Ideally, we would like to perform variable selection by trying out a lot of different models, each containing a different subset of the predictors. For instance, if p = 2, then we can consider four models:
(1) a model containing no variables,
(2) a model containing X1 only,
(3) a model containing X2 only,
(4) a model containing both X1 and X2.
We can then select the best model out of all of the models that we have considered. How do we determine which model is best? Various statistics can be used to judge the quality of a model. These include Mallow’s Cp, Akaike information criterion (AIC), Bayesian information criterion (BIC) and adjusted R2.
Unfortunately, there are a total of 2^p models that contain subsets of p variables. This means that even for moderate p, trying out every possible subset of the predictors is infeasible. For instance, we saw that if p = 2, then there are 2² = 4 models to consider. But if p = 30, then we must consider 2³⁰ = 1,073,741,824 models! This is not practical.
There are three classical approaches for this task:
Forward selection
Backward selection
Mixed selection
Many other methods appeared in recent years, such as Lasso.
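As one possible illustration of these classical approaches (a sketch on simulated data, not the slides' own example), base R's step() performs forward, backward, and mixed selection using AIC.

set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)               # only x1 truly matters
null_fit <- lm(y ~ 1, data = dat)
full_fit <- lm(y ~ x1 + x2 + x3, data = dat)
forward  <- step(null_fit, scope = formula(full_fit), direction = "forward")
backward <- step(full_fit, direction = "backward")
mixed    <- step(null_fit, scope = formula(full_fit), direction = "both")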
Three: Model Fit
Recall that in simple regression, R2 is the square of the correlation of the response and the variable. In multiple linear regression, it turns out that it equals Cor(Y,Yˆ)2, the square of the correlation between the response and the fitted linear model.
In fact one property of the fitted linear model is that it maximizes this correlation among all possible linear models.
An R2 value close to 1 indicates that the model explains a large portion of the variance in the response variable.
Take Advertising data as an example. The model that uses all three advertising media to predict sales has an R2 of 0.8972. On the other hand, the model that uses only TV and radio to predict sales has an R2 value of 0.89719 (see Table 2).
Note that R2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. This is due to the fact that adding another variable to the least squares equations must allow us to fit the training data (though not necessarily the testing data) more accurately.
Visualization for Model Checking
Figure 7: For the Advertising data, a linear regression fit to sales using TV and radio as predictors. From the pattern of the residuals, we can see that there is a pronounced non-linear relationship in the data. The positive residuals (those visible above the surface), tend to lie along the 45-degree line, where TV and Radio budgets are split evenly. The negative residuals (most not visible), tend to lie away from this line, where budgets are more lopsided.
Four: Prediction
Once we have fit the multiple regression model, it is straightforward to apply βˆ in order to predict the response Y on the basis of a set of values for the predictors X1,X2,...,Xp. However, there are three sorts of uncertainty associated with this prediction.
1. The least squares estimate β̂ is only an estimate of β (the true population value). The inaccuracy in the coefficient estimates is related to the reducible error.
2. Of course, in practice assuming a linear model is almost always an approximation of reality, so there is an additional source of potentially reducible error which we call model bias.
3. Even if we knew f (X )-that is, even if we knew the true values for β0,β1,...,βp-the response value cannot be predicted perfectly because of the random error ε which is referred to as the irreducible error.
4. How much will Y vary from Yˆ? We use prediction intervals to answer this question. Prediction intervals are always wider than confidence intervals, because they incorporate both the error in the estimate for f (X ) (the reducible error) and the uncertainty as to how much an individual point will differ from the population regression plane (the irreducible error).
Extensions of the Linear Model
For linear models, two of the most important assumptions state that the relationship between the predictors and the response is additive and linear.
The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors, i.e.,
Y = ∑_j fj(Xj) + ε.
The “additive” assumption is also known as the “separable” condition.
The linear assumption states that the change in the response Y due to a one-unit change in Xj is constant, regardless of the value of Xj .
For example, Y = 2 + 3X1 − 4sin(X1) + X2 is additive but non-linear in X1.
Removing the Additive Assumption
Consider the standard linear regression model with two variables, Y = β0 + β1X1 + β2X2 + ε.
One way of extending this model to allow for interaction effects is to include a third predictor, which is constructed by computing the product of X1 and X2. This results in the model
Y =β0 +β1X1 +β2X2 +β3X1X2 +ε. (25)
When β3 ≠ 0, the effect of changes in a predictor X1 on the
response Y depends on X2, and vice versa.
This is known as a synergy effect or an interaction effect.
The parameters β1, β2, β3 can be estimated using least squares.
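A brief sketch (simulated data, illustrative coefficients) of fitting the interaction model (25) in R; the formula y ~ x1 * x2 expands to the main effects plus the product term x1:x2.

set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + 1.5 * x1 * x2 + rnorm(n)
summary(lm(y ~ x1 * x2))       # same as lm(y ~ x1 + x2 + x1:x2)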
Non-linear Relationships
Consider the model with a quadratic term,
Y = β0 + β1X + β2X² + ε.
Although Y is non-linear in X, the model f(X; β) = β0 + β1X + β2X² is linear in the parameters [β0, β1, β2].
Figure 8: The Auto data set. For a number of cars, mpg and horsepower are shown. The linear regression fit is shown in orange. The linear regression fit for a model that includes horsepower2 is shown as a blue curve. The linear regression fit for a model that includes all polynomials of horsepower up to fifth-degree is shown in green.
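The fits in Figure 8 can be reproduced along these lines, assuming the Auto data set from the ISLR package is available (otherwise substitute any data frame with similar columns); polynomial regression is still linear in the coefficients.

# library(ISLR)                                      # assumed source of the Auto data
fit1 <- lm(mpg ~ horsepower, data = Auto)            # linear
fit2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)   # quadratic
fit5 <- lm(mpg ~ poly(horsepower, 5), data = Auto)   # fifth-degree polynomial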
Potential problems
When we fit a linear regression model to a particular data set, many problems may occur. Most common among these are the following:
1. Non-linearity of the response-predictor relationships.
2. Correlation of error terms.
3. Non-constant variance of error terms.
4. Outliers.
5. High-leverage points.
6. Collinearity.
In practice, identifying and overcoming these problems is as much an art as a science. Here we provide only a brief summary of some key points.
1. Non-linearity of the Data
Residual plots are a useful graphical tool for identifying non-linearity. We may plot the residuals versus the predicted (or fitted) values yˆi . We may add nonlinear terms such as log(X), √X and X2 if necessary.
Figure 9: Plots of residuals versus predicted (or fitted) values for the Auto data set. In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend. Left: A linear regression of mpg on horsepower. A strong pattern in the residuals indicates non-linearity in the data. Right: A linear regression of mpg on horsepower and horsepower2. There is little pattern in the residuals.
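Residual-versus-fitted plots like those in Figure 9 can be drawn with a few lines of base R, continuing the Auto sketch above (fit1 and fit2 are the linear and quadratic fits).

plot(fitted(fit1), residuals(fit1),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residual plot for linear fit")
lines(lowess(fitted(fit1), residuals(fit1)), col = "red")   # smooth trend line
plot(fitted(fit2), residuals(fit2),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residual plot for quadratic fit")
lines(lowess(fitted(fit2), residuals(fit2)), col = "red")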
2. Correlation of Error Terms
An important assumption of the linear regression model is that the error terms, ε1,...,εn, are uncorrelated. The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms.
If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter.
In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant.
In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.
As an extreme example, suppose we accidentally doubled our data, leading to observations and error terms identical in pairs. The standard error calculations would then proceed as if we had a sample of size 2n, and the resulting confidence intervals would be too narrow.
Figure 10: Plots of residuals from simulated time series data sets generated with differing levels of correlation ρ between error terms for adjacent time points (ρ = 0.0, 0.5, and 0.9 in the three panels).
3. Non-constant Variance of Error Terms
An important assumption of the linear regression model is Var(εi ) = σ2. Non-constant variances in the errors, known as heteroscedasticity, can be seen from the presence of a funnel shape in the residual plot.
Figure 11: Residual plots. In each plot, the red line is a smooth fit to the residuals, intended to make it easier to identify a trend. The blue lines track the outer quantiles of the residuals, and emphasize patterns. Left: The funnel shape indicates heteroscedasticity. Right: The response has been log-transformed, and there is now no evidence of heteroscedasticity.
When faced with this problem, one possible solution is to transform the response Y using a concave function such as log Y or √Y. Such a transformation results in a greater amount of shrinkage of the larger responses, leading to a reduction in heteroscedasticity.
Weighted least squares: Sometimes we have a good idea of the variance of each response. For example, the ith response could be an average of ni raw observations. If each of these raw observations is uncorrelated with variance σ², then their average has variance σi² = σ²/ni. In this case a simple remedy is to fit our model by weighted least squares, with weights inversely proportional to these variances, i.e., wi = ni in this case.
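A sketch of the weighted least squares remedy on simulated data in which the i-th response is an average of ni raw observations, so Var(εi) = σ²/ni and the natural weights are wi = ni; the data-generating choices are illustrative.

set.seed(1)
n  <- 100
x  <- runif(n, 1, 10)
ni <- sample(1:20, n, replace = TRUE)             # number of raw observations averaged
y  <- 2 + 3 * x + rnorm(n, sd = 1 / sqrt(ni))     # Var(eps_i) = sigma^2 / n_i with sigma = 1
fit_ols <- lm(y ~ x)                              # ignores the unequal variances
fit_wls <- lm(y ~ x, weights = ni)                # weighted least squares, w_i = n_i
summary(fit_wls)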
4. Outliers
An outlier is a point for which yi is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection.
Figure 12: Left: The least squares regression line is shown in red, and the regression line after removing the outlier is shown in blue. Center: The residual plot clearly identifies the outlier. Right: The outlier has a studentized residual of 6; typically we expect values between -3 and 3. The studentized residual can be computed by dividing each residual ei by its estimated standard error. Many other methods in robust linear regression are available to handle outliers.
5. High leverage Points
We just saw that outliers are observations for which the response yi is unusual given the predictor xi . In contrast, observations with high leverage have an unusual value for xi .
Figure 13: Left: Observation 41 is a high leverage point, while 20 is not. The red line is the fit to all the data, and the blue line is the fit with observation 41 removed. Center: The red observation is not unusual in terms of its X1 value or its X2 value, but still falls outside the bulk of the data, and hence has high leverage. Right: Observation 41 has a high leverage and a high residual.
Leverage statistics
Suppose we have a linear regression model y = Xβ + e. The least squares estimate is β̂ = (XᵀX)⁻¹Xᵀy, and H = X(XᵀX)⁻¹Xᵀ is the hat matrix. It has this name because it is used to compute
yˆ = Xβˆ = Hy.
The diagonal values of H are denoted by h1, . . . , hn.
hi is defined as the leverage statistic of the i-th observation, measuring the sensitivity
hi = ∂ŷi/∂yi.
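Leverage statistics and studentized residuals are available for any lm fit; as a sketch, the commands below reuse the fit_ols object from the weighted least squares example, and the cutoffs shown are common rules of thumb rather than part of the slides.

h <- hatvalues(fit_ols)          # diagonal entries h_1, ..., h_n of the hat matrix H
r <- rstudent(fit_ols)           # studentized residuals
which(h > 2 * mean(h))           # rule of thumb for flagging high-leverage points
which(abs(r) > 3)                # candidate outliers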
6. Collinearity: Credit Data
Figure 14: The Credit data set contains information about balance, age, cards, education, income, limit,and rating for a number of potential customers.
6. Collinearity
Collinearity refers to the situation in which two or more predictor variables are closely related to one another.
Figure 15: Contour plots for the RSS values as a function of the parameters β for various regressions involving the Credit data set. In each plot, the black dots represent the coefficient values corresponding to the minimum RSS. Left: A contour plot of RSS for the regression of balance onto age and limit.The minimum value is well defined. Right: A contour plot of RSS for the regression of balance onto rating and limit. Because of the collinearity, there are many pairs (βLimit , βRating ) with a similar value for RSS.
Since collinearity reduces the accuracy of the estimates of the regression coefficients, it causes the standard error for β̂j to grow. Recall that the t-statistic for each predictor is calculated by dividing β̂j by its standard error. Consequently, collinearity results in a decline in the t-statistic.
As a result, in the presence of collinearity, we may fail to reject H0: βj = 0. This means that the power of the hypothesis test, i.e., the probability of correctly detecting a non-zero coefficient, is reduced by collinearity.
                      Coefficient   Std. error   t-statistic   p-value
Model 1   Intercept   -173.411      43.828       -3.957        <0.0001
          age         -2.292        0.672        -3.407        0.0007
          limit       0.173         0.005        34.496        <0.0001
Model 2   Intercept   -377.537      45.254       -8.343        <0.0001
          rating      2.202         0.952        2.312         0.0213
          limit       0.025         0.064        0.384         0.7012
Table 3: The results for two multiple regression models involving the Credit data set are shown. Model 1 is a regression of balance on age and limit, and Model 2 a regression of balance on rating and limit. The standard error of βlimit increases 12-fold in the second regression, due to collinearity.
A simple way to detect collinearity is to look at the correlation matrix of the predictors. An element of this matrix that is large in absolute value indicates a pair of highly correlated variables, and therefore a collinearity problem in the data.
Unfortunately, not all collinearity problems can be detected by inspection of the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation. We call this situation multicollinearity.
Instead of inspecting the correlation matrix, a better way to assess multicollinearity is to compute the variance inflation factor (VIF).
VIF(β̂j) = 1 / (1 − R²_{Xj|X−j}),  (26)
where R²_{Xj|X−j} is the R² from a regression of Xj onto all of the other predictors. If R²_{Xj|X−j} is close to one, then collinearity is present, and so the VIF will be large.
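A sketch of computing the VIF in Eq. (26) by hand on a simulated collinear design (the 0.9 correlation is an arbitrary choice); the vif() function in the car package gives the same values.

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)       # strongly correlated with x1
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)
vif_manual <- sapply(names(X), function(j) {
  r2 <- summary(lm(reformulate(setdiff(names(X), j), response = j), data = X))$r.squared
  1 / (1 - r2)                                    # Eq. (26)
})
vif_manual                                        # large values for x1 and x2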
Outline
Simple Linear Regression
Multiple Linear Regression
Comparison of Linear Regression with K-Nearest Neighbors
Linear Regression vs. K-Nearest Neighbors
Linear regression is an example of a parametric approach because it
assumes a linear functional form for f (X ).
In contrast, non-parametric methods do not explicitly assume a parametric form for f(X), and thereby provide an alternative and more flexible approach for performing regression.
Here we consider one of the simplest and best-known non-parametric methods, K-nearest neighbors regression (KNN regression).
Given a value for K and a prediction point x0, KNN regression first identifies the K training observations that are closest to x0, represented by N0. It then estimates f(x0) using the average of all the training responses in N0. In other words,
f̂(x0) = (1/K) ∑_{xi ∈ N0} yi.  (27)
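A from-scratch sketch of Eq. (27) using Euclidean distance (the simulated data and K = 9 are illustrative; dedicated packages such as FNN provide faster implementations).

knn_reg <- function(x0, X, y, K) {
  d <- sqrt(rowSums((X - matrix(x0, nrow(X), ncol(X), byrow = TRUE))^2))
  neighbors <- order(d)[1:K]       # indices of the K closest training points
  mean(y[neighbors])               # average their responses, Eq. (27)
}
set.seed(1)
X <- matrix(runif(100 * 2, -1, 1), 100, 2)
y <- X[, 1]^2 + X[, 2] + rnorm(100, sd = 0.1)
knn_reg(c(0, 0), X, y, K = 9)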
Figure 16: Plots of fˆ(X) using KNN regression on a two-dimensional data set with 64 observations (orange dots). Left: K = 1 results in a rough step function fit. Right: K = 9 produces a much smoother fit.
Figure 17: Plots of fˆ(X) using KNN regression on a one-dimensional data set with 100 observations. The true relationship is given by the black solid line. Left: The blue curve corresponds to K = 1 and interpolates (i.e. passes directly through) the training data. Right: The blue curve corresponds to
K = 9, and represents a smoother fit.
Figure 18: The same data set shown in Figure 17 is investigated further. Left: The blue dashed line is the least squares fit to the data. Since f (X ) is in fact linear (displayed as the black line), the least squares regression line provides a very good estimate of f (X ). Right: The dashed horizontal line represents the least squares test set MSE, while the green solid line corresponds to the MSE for KNN as a function of 1/K (on the log scale). Linear regression achieves a lower test MSE than does KNN regression, since f (X ) is in fact linear. For KNN regression, the best results occur with a very large value of K, corresponding to a small value of 1/K.
Figure 19: Top Left: In a setting with a slightly non-linear relationship between X and Y (solid black line), the KNN fits with K = 1 (blue) and K = 9 (red) are displayed. Top Right: For the slightly non-linear data, the test set MSE for least squares regression (horizontal black) and KNN with various values of 1/K (green) are displayed. Bottom Left and Bottom Right: As in the top panel, but with a strongly non-linear relationship between X and Y .
Figure 20: Test MSE for linear regression (black dashed lines) and KNN (green curves) as the number of variables p increases (p = 1, 2, 3, 4, 10, and 20 in the six panels). The true function is non-linear in the first variable, as in the lower panel in Figure 19, and does not depend on the additional variables. The performance of linear regression deteriorates slowly in the presence of these additional noise variables, whereas KNN's performance degrades much more quickly as p increases.
References
Friedman J.
An overview of predictive learning and function approximation.
Springer, 1994.
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani
An Introduction to Statistical Learning with Applications in R,
Springer, 2013.
Hastie, T., Tibshirani, R., Friedman, J.
The Elements of Statistical Learning, 2nd edition.
Springer, 2009.
C. Bishop
Pattern Recognition and Machine Learning
Springer, 2006.
Box, G. E. and Jenkins, G. M. (2008).
Time series analysis: forecasting and control (Fourth Edition). John Wiley & Sons.