Regression analysis: Basics
Outline
➢ Regressions and linear models.
➢ Estimation of linear regressions.
➢ Sampling distributions of regressions.
➢ Determining the explanatory power of a regression.
➢ Troubleshooting.
Reading
SDA chapter 12/13
The concept of dependence
➢ Data: one response/dependent variable 𝑌 and p predictor variables
𝑋1,…,𝑋𝑝 all measured on each of T observations. The predictors are also
named independent variables, regressor variables, or explanatory variables.
➢ In regression analysis, we estimate the relationship between a random variable 𝑌 and one or more variables 𝑋𝑖 .
➢ The goals of regression modeling include the investigation of how Y is related to 𝑋1, …, 𝑋𝑝, and prediction of future Y values when the corresponding values
of 𝑋1, …, 𝑋𝑝 are already available.
Regression model
➢ The regression equation can be written as Y = E[Y | X_1, …, X_p] + ε.
The deterministic function y = f(z), where f(z) = E[Y | X_1 = z_1, …, X_p = z_p], is called the regression function.
➢ The following properties of regression equations hold.
(1) The conditional mean of the error is zero: E[ε | X_1, …, X_p] = 0.
(2) The unconditional mean of the error is zero: E[ε] = 0.
(3) The errors are uncorrelated with the variables X_1, …, X_p: E[εX] = 0.
Example: Weekly interest rate vs. AAA yield
Weekly interest rates from February 16, 1977, to December 31, 1993, from the Federal Reserve Bank of Chicago. Below is a plot of changes in the 10-year Treasury constant maturity rate and changes in the Moody’s seasoned corporate AAA bond yield.
❑ The dependence looks linear.
❑ There is only one predictor.
❑ Data: WeekInt.txt
❑ R code: WeeklyInterestRate.R
Simple linear regression
Simple linear regression is a linear model with a constant and one predictor:
𝑌𝑡 = 𝛽0 + 𝛽1𝑋𝑡1 + 𝜖𝑡, t = 1,…, T .
where the parameter β_0 is the unknown intercept and β_1 = ∂E[Y_t | X_t1]/∂X_t1. Therefore, β_1 is the unknown slope: the change in the expected value of Y_t when X_t1 changes by one unit.
We observe the dependent variable 𝑌𝑡 and the independent variable 𝑋𝑡, but not the error 𝜖𝑡.
We can formulate the regression problem in the matrix form that is standard in regression analysis:
𝒀 = 𝑿𝜷 + 𝝐
where 𝜷 = (β_0, β_1)′ is the vector of regression coefficients and 𝝐 is the vector of noises.
Q: What is 𝑿?
A: 𝑿 is the T × 2 design matrix whose rows are (1, X_t1), t = 1, …, T; that is, a column of ones followed by the column of observations X_11, X_21, …, X_T1.
Estimation of linear regressions
➢ We discuss how to estimate the linear regression parameters. We consider two main estimation techniques: least squares and maximum likelihood methods.
➢ The assumptions of the linear regression model:
(1) linearity of the conditional expectation;
(2) independent and identically distributed noise: ε_1, …, ε_T are independent with mean zero and constant variance Var(ε_t) = σ_ε² for all t.
Least-Squares estimation
The LS estimates are the values of b_0 and b_1 (also denoted β̂_0 and β̂_1) that minimize the loss function
∑_{t=1}^T ε̂_t² = ∑_{t=1}^T (Y_t − b_0 − b_1 X_t)²
where ε̂_t = Y_t − Ŷ_t = Y_t − b_0 − b_1 X_t is the difference between the observed Y-value and the fitted value; these differences are called residuals.
The objective is to pick values of b_0 and b_1 that make the model fit the data as closely as possible, where "close" means a small variance of the unexplained residuals ε̂_t.
Example of OLS
First order conditions
First order conditions for minimizing a differentiable function f(b): ∂f(b)/∂b = 0.
The first order conditions are that the derivatives of the loss function with respect to 𝑏0 and 𝑏1 should be zero.
∂/∂b_0 ∑(Y_t − b_0 − b_1 X_t)² = −2 ∑(Y_t − b_0 − b_1 X_t) · 1 = 0
∂/∂b_1 ∑(Y_t − b_0 − b_1 X_t)² = −2 ∑(Y_t − b_0 − b_1 X_t) · X_t = 0
These lead to the OLS estimates:
b_1 = β̂_1 = [ (1/T) ∑_{t=1}^T X_t Y_t − X̄Ȳ ] / [ (1/T) ∑_{t=1}^T X_t² − X̄² ] = cov(X, Y) / var(X)
b_0 = β̂_0 = Ȳ − β̂_1 X̄
The slope estimator is the sample covariance of Y and X divided by the sample variance of the regressor X. In the simpler case where the means of Y and X are zero, we can disregard the constant, which gives
b_1 = β̂_1 = ∑_{t=1}^T X_t Y_t / ∑_{t=1}^T X_t²
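To make the formulas above concrete, here is a minimal R sketch (all data simulated, all variable names illustrative) that computes the closed-form estimates b_1 = cov(X, Y)/var(X) and b_0 = Ȳ − b_1 X̄ by hand and checks them against lm():

# Verify the closed-form OLS estimates on simulated data
set.seed(1)
T <- 200
x <- rnorm(T)
y <- 1.5 + 0.6 * x + rnorm(T, sd = 0.3)   # true beta0 = 1.5, beta1 = 0.6
b1 <- cov(x, y) / var(x)                  # slope: sample covariance over sample variance
b0 <- mean(y) - b1 * mean(x)              # intercept: Ybar - b1 * Xbar
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))                           # should agree with the hand-computed values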
Example: Weekly interest rate vs. AAA yield
From the output we see that the least-squares estimates of the intercept and slope are −0.000109 and 0.616, respectively. The residual standard error is 0.066; this is what we call σ̂_ε or s, the estimate of σ_ε (its square, σ̂_ε², estimates σ_ε²).
Properties of estimators
Unbiasedness and Consistency
The bias of an estimator θ̂ is defined as Bias(θ̂) = E(θ̂) − θ.
θ̂ is unbiased if E(θ̂) = θ.
An estimator θ̂_n is said to be consistent if
lim_{n→∞} Pr(|θ̂_n − θ| < ε) = 1  for all ε > 0.
Asymptotic Normality
It provides a method to perform inference.
The primary tool in econometrics for inference is the central limit theorem.
Efficiency
Relative efficiency: if θ̂_n and θ̃_n are consistent estimators of θ, then θ̂_n is said to be relatively efficient to θ̃_n if AVar(θ̂_n) < AVar(θ̃_n).
Properties of β̂_1 (or b_1)
Consider the simpler case (zero means, no intercept) of the slope estimator from the previous regression model:
β̂_1 = ∑_{t=1}^T X_t Y_t / ∑_{t=1}^T X_t²
Substituting the true model Y_t = β_1 X_t + ε_t:
β̂_1 = ∑_{t=1}^T X_t (β_1 X_t + ε_t) / ∑_{t=1}^T X_t²
     = β_1 + (X_1 ε_1 + X_2 ε_2 + ... + X_T ε_T) / ∑_{t=1}^T X_t²
Assuming fixed regressors and given the assumptions E(ε_t) = 0 and Cov(X_t, ε_t) = 0, we obtain unbiasedness of the slope:
E(β̂_1) = β_1.
Importance of error variance and variation of 𝑥
Variance of β̂_1
It is useful to have a formula for the variance of an estimator to show how the estimator's precision depends on various aspects of the data such as the sample size and the values of the predictor variables.
If we assume IID noise, E(ε_t²) = σ_ε² and Cov(ε_i, ε_j) = 0 for i ≠ j, then
Var(β̂_1) = [1 / (∑_{t=1}^T X_t²)²] · Var(X_1 ε_1 + X_2 ε_2 + ... + X_T ε_T)
          = [1 / (∑_{t=1}^T X_t²)²] · (X_1² σ_ε² + X_2² σ_ε² + ... + X_T² σ_ε²)
          = σ_ε² ∑_{t=1}^T X_t² / (∑_{t=1}^T X_t²)² = σ_ε² / ∑_{t=1}^T X_t²
Variance of β̂_1
The variance of the estimated coefficient β̂_1 depends on the variance of the error ε, the sample size T, and the sample variance of X_t:
Var(β̂_1) = (σ_ε²/T) / [∑_{t=1}^T X_t²/T] = σ_ε² / (T · V̂ar(X_t))
❑ The variance is increasing in σ_ε². More variability in the noise means more variable estimators.
❑ The variance is decreasing in both T and V̂ar(X_t). Increasing V̂ar(X_t) means that the X values are spread farther apart, which makes the slope of the line easier to estimate.
❑ A significant practical question is whether one should use daily or weekly data, or perhaps even monthly or quarterly data. Does it matter which sampling frequency we use? Yes: the highest possible sampling frequency gives the most precise estimate of the slope, provided the data are stationary.
The error variance σ_ε² is not observable; it is usually estimated by the residual variance
σ̂_ε² = ∑_{t=1}^T ε̂_t² / (T − 2)
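As a quick check of these variance formulas, the following hedged R sketch (simulated data, illustrative names) computes σ̂_ε² = ∑ ε̂_t²/(T − 2) and the slope's standard error, here in the centered form σ̂_ε / √∑(X_t − X̄)² that applies when an intercept is included (the slide's formula is the mean-zero-X special case), and compares them with summary(lm()):

# Residual variance and standard error of the slope, by hand vs. summary(lm)
set.seed(2)
T <- 150
x <- rnorm(T, sd = 2)
y <- 0.5 + 1.2 * x + rnorm(T, sd = 0.7)
fit  <- lm(y ~ x)
ehat <- resid(fit)
sigma2_hat <- sum(ehat^2) / (T - 2)                      # estimate of the error variance
se_slope   <- sqrt(sigma2_hat / sum((x - mean(x))^2))    # SE of the slope estimate
c(sigma_hat = sqrt(sigma2_hat), se_slope = se_slope)
summary(fit)$sigma                                       # residual standard error from lm
coef(summary(fit))["x", "Std. Error"]                    # should match se_slope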
Maximum likelihood estimation
Maximum-likelihood estimation (MLE) is to estimate the unknown parameters of a statistical model so that the particular parametric values make the observed data the most probable given the model.
➢ The assumptions for the linear regression model with normally distributed noises are:
(1) The noises are zero-mean, normally distributed independent variables
𝝐 ∼ 𝑁 (0, 𝜎𝜖2𝑰 ), where 𝜎𝜖2 is the common variance of the noises and 𝑰 is the identity matrix.
(2) 𝑋 is distributed independently of the errors 𝜖.
➢ The regression equation can then be written E(𝒀 | 𝑿) = 𝑿𝜷. As the errors are independent draws from the same normal distribution, we can compute the log-likelihood function as follows:
log L = −(T/2) log(2π) − (T/2) log(σ_ε²) − ∑_{t=1}^T (Y_t − β_0 − β_1 X_t)² / (2σ_ε²)
MLE
➢ The Maximum Likelihood (ML) principle requires maximization of the log- likelihood function. Maximizing the log-likelihood function entails solving the equations:
∂log L/∂β_0 = 0,  ∂log L/∂β_1 = 0,  ∂log L/∂σ_ε² = 0
➢ These equations can be written explicitly as follows:
∑_{t=1}^T (Y_t − β_0 − β_1 X_t) = 0
∑_{t=1}^T X_t (Y_t − β_0 − β_1 X_t) = 0
T σ_ε² − ∑_{t=1}^T (Y_t − β_0 − β_1 X_t)² = 0
MLE
A little algebra shows that solving the first two equations yields:
β̂_1 = [ (1/T) ∑_{t=1}^T X_t Y_t − X̄Ȳ ] / [ (1/T) ∑_{t=1}^T X_t² − X̄² ],   β̂_0 = Ȳ − β̂_1 X̄
where
X̄ = (1/T) ∑_{t=1}^T X_t,   Ȳ = (1/T) ∑_{t=1}^T Y_t
Substituting these expressions into the third equation,
∂log L/∂σ_ε² = 0,
yields the variance of the residuals:
σ̂_ε² = (1/T) ∑_{t=1}^T (Y_t − β̂_0 − β̂_1 X_t)²
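To connect the MLE formulas with software output, here is a hedged R sketch (simulated data, illustrative names) that maximizes the Gaussian log-likelihood numerically with optim() and checks that β̂_0 and β̂_1 agree with lm(), while the ML variance estimate uses the divisor T rather than T − 2:

# Numerical MLE for simple linear regression vs. lm()
set.seed(3)
T <- 100
x <- rnorm(T); y <- 2 + 0.8 * x + rnorm(T, sd = 0.5)
negloglik <- function(par) {
  b0 <- par[1]; b1 <- par[2]; sig2 <- exp(par[3])   # log-parameterize the variance
  r  <- y - b0 - b1 * x
  0.5 * T * log(2 * pi * sig2) + sum(r^2) / (2 * sig2)
}
mle <- optim(c(0, 0, 0), negloglik)        # Nelder-Mead by default
c(beta0 = mle$par[1], beta1 = mle$par[2], sigma2 = exp(mle$par[3]))
coef(lm(y ~ x))                            # beta0, beta1 should agree closely
sum(resid(lm(y ~ x))^2) / T                # the MLE variance uses divisor T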
Example: Multiple linear regression with interest rates
We continue the analysis of the weekly interest-rate data, but now with changes in the 30-year Treasury rate (cm30_dif) and changes in the Federal funds rate (ff_dif) as additional predictors. Thus p = 3. Here is a scatterplot matrix of the four time series. There is a strong linear relationship between all pairs of aaa_dif, cm10_dif, and cm30_dif, but ff_dif is not strongly related to the other series.
❑ Data: WeekInt.txt
❑ 2_MultipleLinearRegression.R
Multiple linear regression models
➢ If there are more than one predictor, we can write the following multiple linear regression equation:
Y_t = β_0 + β_1 X_t1 + ... + β_p X_tp + ε_t,  t = 1, ..., T,
which, with condition (1), implies that
E[Y_t | X_t1, ..., X_tp] = β_0 + β_1 X_t1 + ... + β_p X_tp
The parameter β_0 is the intercept. The regression coefficients β_1, ..., β_p are the slopes: β_j = ∂E[Y_t | X_t1, ..., X_tp]/∂X_tj. Therefore, β_j is the change in the expected value of Y_t when X_tj changes by one unit, keeping the other predictors constant.
➢ The assumptions of the linear regression model:
(1) linearity of the conditional expectation;
(2) independent and identically distributed noise: ε_1, ..., ε_T are independent with mean zero and constant variance Var(ε_t) = σ_ε² for all t.
Generalization to multiple independent variables: Ordinary least squares method
Multiple regression: Y_t = β_0 + β_1 X_t1 + ... + β_p X_tp + ε_t,  t = 1, ..., T.
Matrix form: E(𝒀 | 𝑿) = 𝑿𝜷,  𝜷 = (β_0, β_1, ..., β_p)′
➢ In the general case of a multivariate regression, the OLS method requires minimization of the sum of the squared errors. Consider the vector of errors 𝝐 = (ε_1, ..., ε_T)′.
The sum of squared residuals (SSR = ε_1² + ... + ε_T²) can be written as SSR = 𝝐′𝝐. As 𝝐 = 𝒀 − 𝑿𝜷, we can also write SSR = (𝒀 − 𝑿𝜷)′(𝒀 − 𝑿𝜷).
➢ The OLS method requires that we minimize the SSR. To do so, we set the first derivatives of the SSR to zero:
∂(𝒀 − 𝑿𝜷)′(𝒀 − 𝑿𝜷)/∂𝜷 = 𝟎, i.e. 𝑿′(𝒀 − 𝑿𝜷) = 𝟎
➢ This is a system of p + 1 equations. Solving this system, we obtain the estimators:
𝜷̂ = (𝑿′𝑿)⁻¹𝑿′𝒀
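A minimal R sketch of the matrix formula 𝜷̂ = (𝑿′𝑿)⁻¹𝑿′𝒀 (simulated data, illustrative names), compared with lm():

# OLS via the matrix formula beta-hat = (X'X)^{-1} X'Y
set.seed(4)
T <- 100
x1 <- rnorm(T); x2 <- rnorm(T)
y  <- 1 + 0.5 * x1 - 0.3 * x2 + rnorm(T, sd = 0.4)
X <- cbind(1, x1, x2)                       # design matrix with a column of ones
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # solve the normal equations (X'X) b = X'y
drop(beta_hat)
coef(lm(y ~ x1 + x2))                       # should agree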
Generalization to multiple independent variables: Maximum likelihood estimation
Multiple regression: Y_t = β_0 + β_1 X_t1 + ... + β_p X_tp + ε_t,  t = 1, ..., T.
Matrix form: E(𝒀 | 𝑿) = 𝑿𝜷,  𝜷 = (β_0, β_1, ..., β_p)′
➢ We make the same set of assumptions as in the case of a single regressor. Using the above notation, the log-likelihood function has the form:
log L = −(T/2) log(2π) − (T/2) log(σ_ε²) − (1/(2σ_ε²)) (𝒀 − 𝑿𝜷)′(𝒀 − 𝑿𝜷)
➢ The maximum likelihood conditions are written as
∂log L/∂𝜷 = 𝟎,  ∂log L/∂σ_ε² = 0
➢ Solving the above system of equations gives the same form for the estimators as in the univariate case:
𝜷̂ = (𝑿′𝑿)⁻¹𝑿′𝒀,   σ̂_ε² = (1/T)(𝒀 − 𝑿𝜷̂)′(𝒀 − 𝑿𝜷̂)
Generalization to multiple independent variables
➢ Under the assumption that the errors are normally distributed, it can be demonstrated that the regression coefficients are jointly normally distributed as follows:
𝜷^ ∼ 𝑁[𝜷, 𝜎𝜖2(𝑿′𝑿)−1]
This expression is important because it allows us to compute confidence intervals for the regression parameters.
➢ The ML variance estimator is not unbiased. It can be demonstrated that to obtain an unbiased estimator we have to apply a correction that takes into account the number of variables, replacing T by T − p − 1:
σ̂_ε² = (1/(T − p − 1)) (𝒀 − 𝑿𝜷̂)′(𝒀 − 𝑿𝜷̂)
➢ Remarks: The MLE method requires that we know the functional form of the distribution. If the distribution is known but not normal, we can still apply the MLE method but the estimators will be different.
Example: Multiple linear regression with interest rates
Call:
lm(formula = aaa_dif ~ cm10_dif + cm30_dif + ff_dif)
Standard errors, t-values and p-values
Each of the above coefficients has three other statistics associated with it.
➢ Standard error (SE), which is the estimated standard deviation of the least- squares estimator and tells us the precision of the estimator.
➢ t-value, which is the t-statistic for testing that the coefficient is 0. The t-value is the ratio of the estimate to its standard error. For example, for cm10_dif, the t- value is 7.86 = 0.355/0.0451.
➢ p-value (Pr > |t| in the lm output) for testing the null hypothesis that the coefficient is 0 versus the alternative that it is not 0. If a p-value for a slope
parameter is small, as it is here for 𝛽1 , then this is evidence that the corresponding coefficient is not 0, which means that the predictor has a linear
relationship with the response.
The p-value is large (0.44) for 𝛽3 so we would not reject the null hypothesis that 𝛽3
is zero. This result, however, should not be interpreted as stating that aaa_dif and ff_dif are unrelated, but only that ff_dif is not useful for predicting aaa_dif when cm10_dif and cm30_dif are included in the regression model. Since the Federal Funds rate is a short-term (overnight) rate, it is not surprising that ff_dif is less useful than changes in the 10- and 30-year Treasury rates for predicting aaa_dif.
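To see how the three statistics fit together, the following hedged R sketch (simulated data, illustrative names) recomputes the t-value as estimate/SE and the two-sided p-value from the t distribution, reproducing the columns of the lm() coefficient table:

# Recompute t-values and p-values from the coefficient table
set.seed(5)
T <- 120
x1 <- rnorm(T); x2 <- rnorm(T)
y  <- 1 + 0.4 * x1 + rnorm(T)             # x2 has no true effect
fit <- lm(y ~ x1 + x2)
tab <- coef(summary(fit))                  # Estimate, Std. Error, t value, Pr(>|t|)
t_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
p_by_hand <- 2 * pt(-abs(t_by_hand), df = fit$df.residual)
cbind(tab[, "t value"], t_by_hand, tab[, "Pr(>|t|)"], p_by_hand)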
OLS & MLE
➢ We now establish the relationship between the MLE principle and the ordinary least squares (OLS) method.
➢ OLS is a general method to approximate a relationship between two or more variables. If we use the OLS method, the assumptions of linear regressions can be weakened. In particular, we need not assume that the errors are normally distributed but only assume that they are uncorrelated and have finite variance. The errors can therefore be regarded as a white noise sequence.
➢ The OLS estimators are the same estimators obtained with the MLE method; they have an optimality property. In fact, the Gauss-Markov theorem states that the above OLS estimators are the best linear unbiased estimators (BLUE).
➢ “Best” means that no other linear unbiased estimator has a lower variance. It should be noted explicitly that OLS and MLE are conceptually different methodologies: MLE seeks the optimal parameters of the distribution of the error terms, while OLS seeks to minimize the variance of error terms. The fact that the two estimators coincide was an important discovery.
Determining the Explanatory Power of a Regression
➢ The above computations to estimate regression parameters were carried out under the assumption that the data were generated by a linear regression function with uncorrelated and normally distributed noise. In general, we do not know if this is indeed the case.
➢ Though we can always estimate a linear regression model on any data sample by applying the estimators discussed above, we must now ask the question: When is a linear regression applicable and how can one establish the goodness (i.e., explanatory power) of a linear regression?
➢ Quite obviously, a linear regression model is applicable if the relationship between the variables is approximately linear. But:
➢ How can we check if this is indeed the case?
➢ What happens if we fit a linear model to variables that have non-linear relationships, or if distributions are not normal?
➢ A number of tools have been devised to help answer these questions.
Coefficient of determination
➢ Intuitively, a measure of the quality of approximation offered by a linear regression is given by the variance of the residuals. Squared residuals are used because a property of the estimated relationship is that the sum of the residuals is zero. If residuals are large, the regression model has little explanatory power. However, the size of the average residual in itself is meaningless as it has to be compared with the range of the variables.
A widely used measure of the quality and usefulness of a regression model is given by the coefficient of determination denoted by 𝑅2 or R-squared.
➢ Total sum of squares (TSS): S_Y² = ∑_{t=1}^T (Y_t − Ȳ)²
➢ Explained sum of squares (ESS): S_R² = ∑_{t=1}^T (Ŷ_t − Ȳ)²
➢ Residual sum of squares (RSS): S_ε² = ∑_{t=1}^T (Y_t − Ŷ_t)² = ∑_{t=1}^T ε̂_t²
TSS = ESS + RSS
Coefficient of determination
➢ We can therefore define the coefficient of determination R² as
R² = S_R² / S_Y²
so that
1 − R² = S_ε² / S_Y²
R² is the portion of the total fluctuation of the dependent variable Y explained by the regression relation.
➢ 𝑅2 is a number between 0 and 1:
𝑅2 = 0 means that the regression has no explanatory power.
𝑅2 = 1 means that the regression has perfect explanatory power.
➢ A suitable transformation of the coefficient of determination R² has a known sampling distribution (an F distribution under the null hypothesis of no explanatory power). This fact allows one to assess the significance of a regression.
Adjusted 𝑅2
➢ The quantity R² as a measure of the usefulness of a regression model suffers from the problem that a regression might fit the data very well in sample but have no explanatory power out of sample. This occurs if the number of regressors is too high (overparametrization). Therefore an adjusted R² is sometimes used.
➢ The adjusted R² is defined as R² corrected by a penalty that takes into account the number p of regressors in the model:
R²_adjusted = 1 − (1 − R²)(T − 1)/(T − p − 1) = 1 − [S_ε²/(T − p − 1)] / [S_Y²/(T − 1)]
The adjusted R² adjusts for the number of regressors in the model. It increases only if a new term improves the model more than would be expected by chance.
It is always smaller than R², and it can even be negative.
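A minimal R sketch (simulated data, illustrative names) that computes R² and adjusted R² directly from the sums of squares and checks them against summary(lm()):

# R-squared and adjusted R-squared from the sums of squares
set.seed(6)
T <- 80; p <- 2
x1 <- rnorm(T); x2 <- rnorm(T)
y  <- 2 + 0.7 * x1 + rnorm(T)
fit <- lm(y ~ x1 + x2)
TSS <- sum((y - mean(y))^2)
RSS <- sum(resid(fit)^2)
R2  <- 1 - RSS / TSS
R2a <- 1 - (1 - R2) * (T - 1) / (T - p - 1)
c(R2 = R2, R2_adj = R2a)
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)   # should match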
Model selection: Mean Sums of Squares (MS) and F-Tests
➢ The ratio of a sum of squares to its degrees of freedom is the mean sum of squares:
mean sum of squares = sum of squares / degrees of freedom
➢ Suppose we have two models, I and II, and the predictor variables in model I are a subset of those in model II, so that model I is a submodel of II. A common null hypothesis is that the data are generated by model I.
➢ To test this hypothesis, we use the excess regression sum of squares of model II relative to model I:
SS(II | I) = regression SS for model II − regression SS for model I
MS(II | I) = SS(II | I) / df_{II|I},  where df_{II|I} = p_{II} − p_{I}
The F-statistic for testing the null hypothesis is
F = MS(II | I) / σ̂_ε²
where σ̂_ε² is the mean residual sum of squares for model II. Under the null hypothesis, the F-statistic has an F-distribution with df_{II|I} and T − p_{II} − 1 degrees of freedom. The null hypothesis is rejected if the F-statistic exceeds the α-upper quantile of this F-distribution.
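In R, this nested-model F-test is what anova() reports when it is given the two fitted models. A hedged sketch on simulated data (illustrative names):

# F-test of a submodel (model I) against a larger model (model II)
set.seed(7)
T <- 200
x1 <- rnorm(T); x2 <- rnorm(T); x3 <- rnorm(T)
y  <- 1 + 0.5 * x1 + rnorm(T)        # x2, x3 have no true effect
fitI  <- lm(y ~ x1)                  # submodel
fitII <- lm(y ~ x1 + x2 + x3)        # full model
anova(fitI, fitII)                   # F-statistic and p-value for H0: model I is adequate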
Example: Weekly interest rates-Testing the one-predictor versus three-predictor model
The null hypothesis is that, in the three-predictor model, the slopes of cm30_dif and ff_dif are zero. The small p-value leads us to reject the null: the smaller model with only cm10_dif is rejected in favor of the larger model.
2_MultipleLinearRegression.R
Example: Weekly interest rates-Testing the two-predictor versus three-predictor model
The null hypothesis is that, in the three-predictor model, the slope of ff_dif is zero. We do not reject the null and conclude that the smaller model (with two predictors) is better.
2_MultipleLinearRegression.R
Model selection criteria
When there are many potential predictor variables, often we wish to find a subset of them that provides a parsimonious regression model.
❑ For linear regression models, AIC is
AIC = T log(σ̂_ε²) + 2(1 + p)
where 1 + p is the number of parameters in a model with p predictor variables. The first term, T log(σ̂_ε²), is −2 times the log-likelihood evaluated at the MLE (up to an additive constant), assuming that the noise is Gaussian.
(Recall log L = −(T/2) log(2π) − (T/2) log(σ_ε²) − (1/(2σ_ε²))(𝒀 − 𝑿𝜷)′(𝒀 − 𝑿𝜷).)
❑ BIC replaces 2(1 + p) in AIC by log(T)(1 + p).
❑ Adjusted R-squared.
❑ Cp: suppose there are M predictor variables. Let σ̂_{ε,M}² be the estimate of σ_ε² using all of them, and let SSE(p) be the sum of squares for residual error for a model with some subset of only p ≤ M of the predictors. Then
C_p = SSE(p)/σ̂_{ε,M}² − T + 2(p + 1)
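A hedged R sketch (simulated data, illustrative names) comparing two candidate models by AIC and BIC; note that R's AIC() and BIC() include additive constants omitted in the slide formula, which cancel when models are compared on the same data:

# Compare candidate models by AIC and BIC
set.seed(8)
T <- 150
x1 <- rnorm(T); x2 <- rnorm(T); x3 <- rnorm(T)
y  <- 1 + 0.6 * x1 + rnorm(T)
fitSmall <- lm(y ~ x1)
fitBig   <- lm(y ~ x1 + x2 + x3)
c(AIC_small = AIC(fitSmall), AIC_big = AIC(fitBig))
c(BIC_small = BIC(fitSmall), BIC_big = BIC(fitBig))
# The model with the smaller AIC / BIC is preferred.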
Example: Weekly interest rates - Model selection by AIC and BIC
“Best” means smallest for BIC and 𝐶𝑝 and largest for adjusted 𝑅2. There are three plots, one for each of BIC, 𝐶𝑝, and adjusted 𝑅2. All three criteria are optimized by
two predictor variables.
2_WeeklyInterestRatesModelSelection.R
Pitfalls of Regressions
It is important to understand when regressions are correctly applicable and when they are not. There are several situations where it would be inappropriate to use regressions. In particular, we analyze the following cases which represent possible pitfalls of regressions:
➢ Multicollinearity
➢ Nonlinearity
➢ Nonnormality
➢ High leverage and outliers
➢ Autocorrelation
➢ Heteroscedasticity
Collinearity
Predictor variables exhibit collinearity when one of the predictors can be predicted well from the others.
Consequences of collinearity:
Coefficients in a multiple regression model can be surprising, taking on an unanticipated sign or being unexpectedly large or small.
The stronger the correlation between two predictors, the more the variances of their coefficients increase when both are included in the model. This can lead to smaller t-statistics and correspondingly larger p-values.
Collinearity
Housing Prices based on Living Area and Bedrooms.
Simple regression models:
An increase of $113.12 in price for each additional square foot of space.
An increase of $48,218 in price for each additional bedroom.
Multiple regression model:
The coefficient on Bedrooms seems counterintuitive.
Collinearity and variance inflation
➢ Note that var(β̂_j) is σ_ε² multiplied by the (j+1, j+1)-th element of the main diagonal of (𝑿′𝑿)⁻¹. After some algebra, the variance of the coefficient of the j-th predictor can be expressed as
var(β̂_j) = [σ_ε² / (T · V̂ar(X_j))] · 1/(1 − R_j²)
where R_j² measures how well X_j can be predicted from the other Xs, i.e. by regressing X_j on the p − 1 other predictors.
➢ Define the variance inflation factor (VIF) of X_j as
VIF_j = 1/(1 − R_j²) = TSS(X_j) / ESS(X_j | X_−j)
where ESS(X_j | X_−j) denotes the error (residual) sum of squares from regressing X_j on the other predictors.
➢ The VIF of a variable tells us how much the variance of β̂_j of the j-th predictor variable X_j is increased by having the other predictor variables X_1, …, X_{j−1}, X_{j+1}, …, X_p in the model.
➢ For example, if a variable has a VIF of 4, then the variance of its β̂ is four times larger than it would be if the other predictors were either deleted or were not correlated with it. The standard error is increased by a factor of 2.
➢ It is important to keep in mind that VIF_j tells us nothing about the relationship between the response and the j-th predictor. Rather, it tells us only how correlated the j-th predictor is with the other predictors.
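The following hedged R sketch (simulated, deliberately collinear predictors with illustrative names) computes VIF_j = 1/(1 − R_j²) from the auxiliary regression of each predictor on the others; if the car package is installed, vif() should give the same numbers:

# Variance inflation factors: regress each predictor on the others
set.seed(9)
T  <- 200
x1 <- rnorm(T)
x2 <- 0.9 * x1 + rnorm(T, sd = 0.3)   # strongly collinear with x1
x3 <- rnorm(T)
y  <- 1 + x1 + x2 + x3 + rnorm(T)
vif_by_hand <- c(
  x1 = 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared),
  x2 = 1 / (1 - summary(lm(x2 ~ x1 + x3))$r.squared),
  x3 = 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
)
vif_by_hand
# If the car package is installed, this should agree:
# library(car); vif(lm(y ~ x1 + x2 + x3))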
Partial residual plot
Partial residual plot is a graphical technique that attempts to show the relationship between a given independent variable 𝑋𝑗 and the response variable Y given that
other independent variables are also in the model. The partial residual for the 𝑗th predictor variable is
Y_t − (β̂_0 + ∑_{j′≠j} X_{j′t} β̂_{j′}) = Ŷ_t + ε̂_t − (β̂_0 + ∑_{j′≠j} X_{j′t} β̂_{j′}) = X_{jt} β̂_j + ε̂_t
(The scatter plot of 𝑋𝑗 and Y does not take into account the effect of the other independent variables).
Example: Partial residual plots for the weekly interest-rate example
❑ Partial residual plots for the weekly interest-rate example are shown in Figures 12.7(a) and (b). For comparison, scatterplots of cm10_dif and cm30_dif versus aaa_dif with the corresponding one- variable fitted lines are shown in panels (c) and (d).
❑ The main conclusion from examining the plots is that the slopes in (a) and (b) are nonzero, though they are shallower than the slopes in (c) and (d).
Due to collinearity, the effect of cm10_dif on aaa_dif when cm30_dif is in the model [panel (a)] is less than when cm30_dif is not in the model [panel (c)] 2_WeeklyInterestRatesPartialResidual.R
Detecting nonlinearity
➢ Data were simulated to illustrate some of the techniques for diagnosing problems. In the example there are two predictor variables, 𝑋1 and 𝑋2. The
assumed model is multiple linear regression, 𝑌𝑖 = 𝛽0 + 𝛽1𝑋𝑖,1 + 𝛽2𝑋𝑖,2 + 𝜖𝑖.
➢ The scatterplot suggests non-linearity. One possibility is to take appropriate transformation of data, e.g. log transformation. Alternatively, one can adopt a non-linear model to reflect the empirical features of data.
❑ R: 2_DetectingNonlinearity.R
Detecting non-normality
R: 2_DetectingNonlinearity.R
Example: Residual plots for weekly interest changes
The normal plot in panel (a) shows heavy tails. A t-distribution was fit to the residuals, and the estimated degrees of freedom was 2.99, again indicating heavy tails. Panel (b) shows a QQ plot of the residuals and the quantiles of the fitted t-distribution with a 45° reference line. There is an excellent agreement between the data and the t-distribution.
❑ Data: 2_WeekInt.txt
❑ R: 2_ResidualPlotsForWeeklyInterestChanges.R
High-leverage points and residual outliers- Simulated data example
A high-leverage point is not necessarily a problem, only a potential problem.
In panel (a), Y is linearly related to X and the extreme X-value is, in fact, helpful as it increases the precision of the estimated slope.
In panel (b), the value of Y for the high-leverage point has been misrecorded as 5.254 rather than 50.254. This data point is called a residual outlier and has an extreme influence on the estimated slope.
In panel (c), X has been misrecorded for the high-leverage point as 5.5 instead of 50. Thus, this point is no longer high-leverage, but now it is a residual outlier. Its effect now is to bias the estimated intercept.
High-leverage points
Three important tools are often used for diagnosing problems of high-leverage points:
❑ leverages;
❑ externally studentized residuals; and
❑ Cook’s D, which quantifies the overall influence of each observation on the fitted values.
Cook’s D measures influence, and any case with a large Cook’s D is called a high-influence case. Leverage and rstudent alone do not measure influence. Let Ŷ_j(−i) be the j-th fitted value using estimates of the β̂s obtained with the i-th observation deleted. Then Cook’s D for the i-th observation is
D_i = ∑_{j=1}^n {Ŷ_j − Ŷ_j(−i)}² / [(p + 1) σ̂_ε²]
One way to use Cook’s D is to plot the values of Cook’s D against case number and look for unusually large values.
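A hedged R sketch (simulated data with one planted bad point, illustrative names) showing the three diagnostics: hatvalues() for leverage, rstudent() for externally studentized residuals, and cooks.distance() for Cook's D:

# Leverage, studentized residuals, and Cook's D for a fitted lm
set.seed(10)
T <- 50
x <- rnorm(T)
y <- 1 + 2 * x + rnorm(T, sd = 0.5)
x[T] <- 8; y[T] <- -5             # plant a high-leverage residual outlier
fit  <- lm(y ~ x)
lev  <- hatvalues(fit)            # leverages
rstu <- rstudent(fit)             # externally studentized residuals
cd   <- cooks.distance(fit)       # Cook's D
which.max(cd)                     # the planted point should stand out
plot(cd, type = "h", ylab = "Cook's D", xlab = "case number")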
Example: Diagnostics
Residual plots and other diagnostics are shown in Figure 13.13 for a regression of Y on X. Describe any problems that you see and possible remedies.
Example: Diagnostics
There is a high-leverage point with a residual outlier. In panel (f) we see that this point is #58. One should investigate whether this observation is correct. Regardless of whether this observation is correct, it is so far detached from the other data that it is sensible to remove it and fit a model to the other data.
Outliers
Robust regression estimation method
The OLS estimator is sensitive to outliers.
The least absolute deviation (LAD) estimator is less sensitive to outliers:
β̂_LAD = argmin_{b_0, b_1} ∑_{t=1}^T |y_t − b_0 − b_1 x_t|
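LAD regression is median (0.5-quantile) regression, so one common way to fit it in R is rq() from the quantreg package. A hedged sketch on simulated data with a few planted outliers (illustrative names; assumes quantreg is installed):

# Compare OLS with LAD (median regression) in the presence of outliers
library(quantreg)                 # assumes the quantreg package is installed
set.seed(11)
T <- 100
x <- rnorm(T)
y <- 1 + 0.5 * x + rnorm(T, sd = 0.3)
y[1:5] <- y[1:5] + 15             # a few large outliers
coef(lm(y ~ x))                   # OLS: pulled toward the outliers
coef(rq(y ~ x, tau = 0.5))        # LAD: much less sensitive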
Autocorrelation of Residuals
Autocorrelation: 𝐶𝑜𝑣(𝜖𝑡, 𝜖𝑡−𝑠) ≠ 0
➢ Suppose that residuals are correlated. This means that in general E[ε_i ε_j] = σ_ij ≠ 0. Thus the variance-covariance matrix of the residuals, with entries σ_ij, will not be a diagonal matrix as in the case of uncorrelated residuals, but will exhibit nonzero off-diagonal terms. We assume that we can write this matrix as σ²Ω, where Ω is a positive definite symmetric matrix and σ² is a parameter to be estimated.
How do we detect the autocorrelation of residuals?
We perform a linear regression between the variables and estimate regression parameters using the OLS method. After estimating the regression parameters, we can compute the sequence of residuals. At this point, we can apply tests such as the Durbin-Watson test or the Ljung-Box test to gauge the autocorrelation of residuals.
Null hypothesis: There is no autocorrelation.
To implement a test, estimate the autocorrelations of the residuals as 𝜌^𝑠 = 𝐶𝑜𝑟𝑟(^𝜖𝑡, ^𝜖𝑡−𝑠), for 𝑠 = 1,⋯, 𝐿 .
The test statistics: √T ρ̂_s ~ N(0, 1) (approximately), or Q_L = T ∑_{s=1}^L ρ̂_s² →_d χ²_L.
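A hedged R sketch (simulated AR(1) errors, illustrative names): after fitting the regression by OLS, the Ljung-Box test can be applied to the residuals with Box.test(); the Durbin-Watson test is available as dwtest() in the lmtest package if that package is installed:

# Testing residual autocorrelation after an OLS fit
set.seed(12)
T <- 300
x <- rnorm(T)
e <- as.numeric(arima.sim(model = list(ar = 0.6), n = T))  # autocorrelated errors
y <- 1 + 0.5 * x + e
fit <- lm(y ~ x)
res <- resid(fit)
Box.test(res, lag = 10, type = "Ljung-Box")   # H0: no autocorrelation
acf(res)                                      # visual check
# If the lmtest package is installed:
# library(lmtest); dwtest(fit)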
Generalized Least Squares
➢ If residuals are correlated, the regression parameters can still be estimated without bias, but the estimate will not be optimal in the sense that there are other estimators with a lower variance of the sampling distribution. An optimal linear unbiased estimator has been derived: the Aitken generalized least squares (GLS) estimator, given by
𝜷̂ = (𝑿′Ω⁻¹𝑿)⁻¹𝑿′Ω⁻¹𝒀,  where Ω is the residual correlation matrix.
➢ The GLS estimators vary with the sampling distribution. It can be demonstrated that the variance of the GLS estimator is given by the following “sandwich” formula:
V(𝜷̂) = E[(𝜷̂ − 𝜷)(𝜷̂ − 𝜷)′] = σ_ε² (𝑿′Ω⁻¹𝑿)⁻¹
where Ω is the residual correlation matrix.
➢ Unfortunately, V(𝜷̂) cannot be estimated without first knowing the regression coefficients. For this reason, in the presence of correlation of residuals, it is common practice to use models that explicitly capture autocorrelations and produce uncorrelated residuals.
Example: Residual plots for weekly interest rates without differencing
❑ Data: 2_WeekInt.txt
❑ R: 2_ResidualPlotsForWeeklyInterestRatesWithoutDifferencing.R
Heteroskedasticity
Heteroskedasticity: Var(u_t | X_t) = σ_t²
If the residuals are heteroskedastic, the least squares estimator is still consistent and reasonably efficient; however, the standard expression for the standard errors is no longer correct.
White’s test of heteroskedasticity. The null hypothesis is homoskedasticity, and the alternative is the kind of heteroskedasticity which can be explained by the levels, squares, and cross products of the regressors.
To implement White’s test, let 𝑤𝑡 be the levels, squares, and cross products of the regressors, the test is then to run a regression of squared residuals on 𝑤𝑡
û_t² = w_t′γ + v_t
The null is that all the slope coefficients (not the intercept) in γ are zero. The test statistic is T·R² ~ χ²_p, where p = dim(w_t) − 1.
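A hedged R sketch of White's test on simulated data (illustrative names): regress the squared OLS residuals on the regressors, their squares, and their cross products, and compare T·R² with a χ² critical value. (bptest() in the lmtest package can perform an equivalent test if that package is available.)

# White's test for heteroskedasticity via the auxiliary regression
set.seed(13)
T  <- 250
x1 <- rnorm(T); x2 <- rnorm(T)
u  <- rnorm(T, sd = 0.5 + 0.8 * abs(x1))   # error variance depends on x1
y  <- 1 + x1 + x2 + u
fit   <- lm(y ~ x1 + x2)
uhat2 <- resid(fit)^2
aux  <- lm(uhat2 ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1 * x2))  # levels, squares, cross products
stat <- T * summary(aux)$r.squared          # T * R^2 of the auxiliary regression
p    <- length(coef(aux)) - 1               # number of slope coefficients
c(statistic = stat, p_value = 1 - pchisq(stat, df = p))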
Heteroskedasticity and estimated Std
Example: Diagnostics
Residual plots and other diagnostics are shown in Figure 13.12 for a regression of Y on X. Describe any problems that you see and possible remedies.
Example: Diagnostics
Panel (a) shows clearly that the effect of X on Y is nonlinear. Because of the strong bias caused by the nonlinearity, the residuals are biased estimates of the noise and examination of the remaining plots is not useful. The remedy to this problem is to fit a nonlinear effect. A model that is quadratic in X seems like a good choice. After this model has been fit, the other diagnostic plots should be examined.
R lab
This section uses the data set USMacroG in R’s AER package. This data set contains quarterly time series on 12 U.S. macroeconomic variables for the period 1950-2000. We will use the variables:
consumption = real consumption expenditures,
dpi = real disposable personal income,
cpi = consumer price index (inflation rate),
government = real government expenditures, and
unemp = unemployment rate.
Our goal is to predict changes in consumption from changes in the other variables.
❑ Data: Rlab2_1_USMacroG.txt
❑ R: Rlab2_1.R
R lab
Run the following R code to load the data, difference the data (since we wish to work with changes in these variables), and create a scatterplot matrix.
library(AER)
data("USMacroG")
MacroDiff = as.data.frame(apply(USMacroG, 2, diff))  # difference each series
attach(MacroDiff)   # so the differenced variables can be used by name in the lm() calls below
pairs(cbind(consumption, dpi, cpi, government, unemp))
Problem 1. Describe any interesting features, such as outliers, seen in the scatterplot matrix. Keep in mind that the goal is to predict changes in consumption. Which variables seem best suited for that purpose? Do you think there will be collinearity problems?
R lab
Next, run the code below to fit a multiple linear regression model to consumption using the other four variables as predictors.
fitLm1 = lm(consumption ~ dpi + cpi + government + unemp)
summary(fitLm1)
confint(fitLm1)
Problem 2 From the summary, which variables seem useful for predicting changes in consumption?
R lab
Next, print an AOV table.
anova(fitLm1)
Problem 3 For the purpose of variable selection, does the AOV table provide any useful information not already in the summary?
Upon examination of the p-values, we might be tempted to drop several variables from the regression model, but we will not do that since variables should be removed from a model one at a time. The reason is that, due to correlation between the predictors, when one is removed then the significance of the others changes. To remove variables sequentially, we will use the function stepAIC in the MASS package.
library(MASS)
fitLm2 = stepAIC(fitLm1)
summary(fitLm2)
R lab
Problem 4 Which variables are removed from the model, and in what order?
Now compare the initial and final models by AIC.
AIC(fitLm1)
AIC(fitLm2)
AIC(fitLm1) - AIC(fitLm2)
Problem 5 How much of an improvement in AIC was achieved by removing variables? Was the improvement huge? If so, can you suggest why? If not, why not?
The function vif in the car package will compute variance inflation factors. A similar function with the same name is in the faraway package. Run
library(car)
vif(fitLm1)
vif(fitLm2)
Problem 6 Was there much collinearity in the original four-variable model? Was the collinearity reduced much by dropping two variables?
R lab
Partial residual plots, which are also called component plus residual or cr plots, can be constructed using the function cr.plot in the car package. Run
par(mfrow=c(2,2))
sp = 0.8
cr.plot(fitLm1, dpi, span = sp, col = "black")
cr.plot(fitLm1, cpi, span = sp, col = "black")
cr.plot(fitLm1, government, span = sp, col = "black")
cr.plot(fitLm1, unemp, span = sp, col = "black")
Besides dashed least-squares lines, the partial residual plots have solid lowess smooths through them unless this feature is turned off by specifying smooth=F. Lowess is an earlier version of loess. The smoothness of the lowess curves is determined by the parameter span, with larger values of span giving smoother plots. The default is span = 0.5. In the code above, span is 0.8 but can be changed for all four plots by changing the variable sp. A substantial deviation of the lowess curve from the least-squares line is an indication that the effect of the predictor is nonlinear. The default color of the cr.plot figure is red, but this can be changed as in the code above.
Problem 7 What conclusions can you draw from the partial residual plots?
R lab – Results & Discussions
Problem 1. Describe any interesting features, such as outliers, seen in the scatterplot matrix. Keep in mind that the goal is to predict changes in consumption. Which variables seem best suited for that purpose? Do you think there will be collinearity problems?
No outliers are seen in the scatterplots. Changes in consumption show a positive relationship with changes in dpi and a negative relationship with changes in unemp, so these two variables should be most useful for predicting changes in consumption. The correlations between the predictors (changes in the variables other than consumption) are weak and collinearity will not be a serious problem.
R lab – Results & Discussions
Problem 2 From the summary, which variables seem useful for predicting changes in consumption?
Changes in dpi and unemp are highly significant and so are useful for prediction. Changes in cpi and government have large p-values and do not seem useful.
R lab – Results & Discussions
Problem 3 For the purpose of variable selection, does the AOV table provide any useful information not already in the summary?
The AOV table contains sums of squares and mean squares that are not in the summary. These are, however, not needed for variable selection.
R lab – Results & Discussions
Problem 4 Which variables are removed from the model, and in what order?
First changes in government is removed and then changes in cpi.
R lab – Results & Discussions
Problem 5 How much of an improvement in AIC was achieved by removing variables? Was the improvement huge? If so, can you suggest why? If not,
why not?
AIC decreased by 2.83 which is not a huge improvement. Dropping variables decreases the log-likelihood (which increases AIC) and decreases the number of variables (which decreases AIC). The decrease due to dropping variables is limited; it is twice the number of deleted variables. In this case, the maximum possible decrease in AIC from dropping variables is 4 and is achieved only if dropping the variables does not change the log-likelihood, so we should not have expected a huge decrease. Of course, when there are many variables then a huge decrease in AIC is possible if a very large number of variables can be dropped.
> AIC(fitLm1)
[1] 1807.064
> AIC(fitLm2)
[1] 1804.237
> AIC(fitLm1)-AIC(fitLm2)
[1] 2.827648
R lab – Results & Discussions
Problem 6 Was there much collinearity in the original four-variable model? Was the collinearity reduced much by dropping two variables?
There was little collinearity in the original model, since all four VIFs are
near their lower bound of 1. Since there was little collinearity to begin with, it could not be much reduced.
R lab – Results & Discussions
Problem 7 What conclusions can you draw from the partial residual plots?
The least-squares lines for government and cpi are nearly horizontal, which agrees with the
earlier result that these variables can be dropped. The lowess curves are close to the
least-squares lines, at least relative to the random variation in the partial residuals, and
this indicates that the effects of dpi and unemp on consumption are linear.