Multiple Regression
In practice there often exists more than one variable that influences the dependent variable. This chapter discusses the regression model with multiple explanatory variables. We use matrices to describe and analyse this model. We present the method of least squares, its statistical properties, and the idea of partial regression. The F-test is the central tool for testing linear hypotheses, with a test for predictive accuracy as a special case. Particular attention is paid to the question whether additional variables should be included in the model or not.
3.1 Least squares in matrix form
Uses Appendix A.2–A.4, A.6, A.7.
3.1.1 Introduction
More than one explanatory variable
In the foregoing chapter we considered the simple regression model where the dependent variable is related to one explanatory variable. In practice the situation is often more involved in the sense that there exists more than one variable that influences the dependent variable.
As an illustration we consider again the salaries of 474 employees at a US bank (see Example 2.2 (p. 77) on bank wages). In Chapter 2 the variations in salaries (measured in logarithms) were explained by variations in education of the employees. As can be observed from the scatter diagram in Exhibit 2.5(a) (p. 85) and the regression results in Exhibit 2.6 (p. 86), around half of the variability (as measured by the variance) can be explained in this way. Of course, the salary of an employee is not only determined by the number of years of education because many other variables also play a role. Apart from salary and education, the following data are available for each employee: begin or starting salary (the salary that the individual earned at his or her first position at this bank), gender (with value zero for females and one for males), ethnic minority (with value zero for non-minorities and value one for minorities), and job category (category 1 consists of administrative jobs, category 2 of custodial jobs, and category 3 of management jobs). The begin salary can be seen as an indication of the qualities of the employee that, apart from education, are determined by previous experience, personal characteristics, and so on. The other variables may also affect the earned salary.
Simple regression may be misleading
Of course, the effect of each variable could be estimated by a simple regression of salaries on each explanatory variable separately. For the explanatory variables education, begin salary, and gender, the scatter diagrams with regression lines are shown in Exhibit 3.1 (a–c). However, these results may be misleading, as the explanatory variables are mutually related. For example, the gender effect on salaries (c) is partly caused by the gender effect on education (e). Similar relations between the explanatory variables are shown in (d) and (f). This mutual dependence is taken into account by formulating a multiple regression model that contains more than one explanatory variable.
Exhibit 3.1 Scatter diagrams of Bank Wage data
Scatter diagrams with regression lines for several bivariate relations between the variables LOGSAL (logarithm of yearly salary in dollars), EDUC (finished years of education), LOGSALBEGIN (logarithm of yearly salary when employee entered the firm) and GENDER (0 for females, 1 for males), for 474 employees of a US bank. Panels: (a) LOGSAL vs. EDUC; (b) LOGSAL vs. LOGSALBEGIN; (c) LOGSAL vs. GENDER; (e) EDUC vs. GENDER; (d) and (f) LOGSALBEGIN vs. EDUC and LOGSALBEGIN vs. GENDER.
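The way mutually related explanatory variables can distort a simple regression may be illustrated with a small simulation. The sketch below does not use the bank wage data; it generates hypothetical data in which two regressors (called x2 and x3 here) are positively related, and all coefficient values and the sample size are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: x3 is positively related to x2.
x2 = rng.normal(size=n)
x3 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 0.5 * x2 + 1.0 * x3 + rng.normal(scale=0.5, size=n)

# Simple regression of y on a constant and x2 only.
X_simple = np.column_stack([np.ones(n), x2])
b_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression of y on a constant, x2 and x3.
X_multi = np.column_stack([np.ones(n), x2, x3])
b_multi, *_ = np.linalg.lstsq(X_multi, y, rcond=None)

print("slope on x2, simple regression:  ", b_simple[1])
print("slope on x2, multiple regression:", b_multi[1])
```

In this setup the simple regression slope picks up part of the effect of x3 (it should be close to 0.5 + 1.0 × 0.8 = 1.3), whereas the multiple regression slope should be close to the value 0.5 used to generate the data.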
3.1.2 Least squares
Uses Appendix A.7.
Regression model in matrix form
The linear model with several explanatory variables is given by the equation
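$$y_i = \beta_1 + \beta_2 x_{2i} + \beta_3 x_{3i} + \cdots + \beta_k x_{ki} + \varepsilon_i \quad (i = 1, \ldots, n). \tag{3.1}$$
From now on we follow the convention that the constant term is denoted by $\beta_1$ rather than $\alpha$. The first explanatory variable $x_1$ is defined by $x_{1i} = 1$ for every $i = 1, \ldots, n$, and for simplicity of notation we write $\beta_1$ instead of $\beta_1 x_{1i}$. For purposes of analysis it is convenient to express the model (3.1) in matrix form. Let
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_{21} & \cdots & x_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & x_{2n} & \cdots & x_{kn} \end{pmatrix}, \quad \beta = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}. \tag{3.2}$$
Note that in the $n \times k$ matrix $X = (x_{ji})$ the first index $j$ ($j = 1, \ldots, k$) refers to the variable number (in columns) and the second index $i$ ($i = 1, \ldots, n$) refers to the observation number (in rows). The notation in (3.2) is common in econometrics (whereas in books on linear algebra the indices $i$ and $j$ are often reversed). In our notation, we can rewrite (3.1) as
$$y = X\beta + \varepsilon. \tag{3.3}$$
Here $\beta$ is a $k \times 1$ vector of unknown parameters and $\varepsilon$ is an $n \times 1$ vector of unobserved disturbances.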
Residuals and the least squares criterion
If $b$ is a $k \times 1$ vector of estimates of $\beta$, then the estimated model may be written as
$$y = Xb + e. \tag{3.4}$$
Here $e$ denotes the $n \times 1$ vector of residuals, which can be computed from the data and the vector of estimates $b$ by means of
$$e = y - Xb. \tag{3.5}$$
We denote transposition of matrices by primes ($'$); for instance, the transpose of the residual vector $e$ is the $1 \times n$ matrix $e' = (e_1, \ldots, e_n)$. To determine the least squares estimator, we write the sum of squares of the residuals (a function of $b$) as
$$S(b) = \sum e_i^2 = e'e = (y - Xb)'(y - Xb) = y'y - y'Xb - b'X'y + b'X'Xb. \tag{3.6}$$
Derivation of least squares estimator
The minimum of $S(b)$ is obtained by setting the derivatives of $S(b)$ equal to zero. Note that the function $S(b)$ has scalar values, whereas $b$ is a column vector with $k$ components. So we have $k$ first order derivatives and we will follow the convention to arrange them in a column vector. The second and third terms of the last expression in (3.6) are equal (a $1 \times 1$ matrix is always symmetric) and may be replaced by $-2b'X'y$. This is a linear expression in the elements of $b$ and so the vector of derivatives equals $-2X'y$. The last term of (3.6) is a quadratic form in the elements of $b$. The vector of first order derivatives of this term $b'X'Xb$ can be written as $2X'Xb$. The proof of this result is left as an exercise (see Exercise 3.1). To get the idea we consider the case $k = 2$ and we denote the elements of $X'X$ by $c_{ij}$, $i, j = 1, 2$, with $c_{12} = c_{21}$. Then $b'X'Xb = c_{11}b_1^2 + c_{22}b_2^2 + 2c_{12}b_1b_2$. The derivative with respect to $b_1$ is $2c_{11}b_1 + 2c_{12}b_2$ and the derivative with respect to $b_2$ is $2c_{12}b_1 + 2c_{22}b_2$. When we arrange these two partial derivatives in a $2 \times 1$ vector, this can be written as $2X'Xb$. See Appendix A (especially Examples A.10 and A.11 in Section A.7) for further computational details and illustrations.
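The two derivative results can be checked numerically. The following sketch compares the analytical vector of derivatives, $-2X'y + 2X'Xb$, with central finite differences of $S(b)$ at an arbitrary point. The data are simulated and purely hypothetical, and the step size h is an assumption of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)
b = rng.normal(size=k)   # arbitrary point at which to evaluate the derivatives

def S(b):
    """Sum of squared residuals (y - Xb)'(y - Xb)."""
    e = y - X @ b
    return e @ e

# Analytical vector of derivatives: -2X'y + 2X'Xb.
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ b

# Central finite differences, one coordinate of b at a time.
h = 1e-6
grad_numeric = np.array([
    (S(b + h * np.eye(k)[j]) - S(b - h * np.eye(k)[j])) / (2 * h)
    for j in range(k)
])

print("max abs difference:", np.max(np.abs(grad_analytic - grad_numeric)))
```

Because $S(b)$ is quadratic in $b$, the central difference agrees with the analytical derivative up to rounding error.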
The least squares estimator
Combining the above results, we obtain
$$\frac{\partial S}{\partial b} = -2X'y + 2X'Xb. \tag{3.7}$$
The least squares estimator is obtained by minimizing $S(b)$. Therefore we set these derivatives equal to zero, which gives the normal equations
$$X'Xb = X'y. \tag{3.8}$$
Solving this for $b$, we obtain
$$b = (X'X)^{-1}X'y \tag{3.9}$$
provided that the inverse of $X'X$ exists, which means that the matrix $X$ should have rank $k$. As $X$ is an $n \times k$ matrix, this requires in particular that $n \geq k$, that is, the number of parameters is smaller than or equal to the number of observations. In practice we will almost always require that $k$ is considerably smaller than $n$.
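The rank condition can be made concrete with a small numerical sketch. Below, one hypothetical data matrix has full column rank, while in a second one the third column is an exact linear combination of the first two, so that $X'X$ is singular and (3.9) cannot be applied. The variable names and the simulated data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)

# Full column rank: constant, x2, x3.
X_ok = np.column_stack([np.ones(n), x2, x3])

# Rank deficient: the third column is an exact linear combination of the first two.
X_bad = np.column_stack([np.ones(n), x2, 2.0 + 3.0 * x2])

for name, X in [("X_ok", X_ok), ("X_bad", X_bad)]:
    k = X.shape[1]
    rank = np.linalg.matrix_rank(X)
    print(name, "rank:", rank, "of k =", k, "| cond(X'X):", np.linalg.cond(X.T @ X))
```

A very large condition number of $X'X$ signals that the inverse either does not exist or cannot be computed reliably.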
Proof of minimum
From now on, if we write $b$, we always mean the expression in (3.9). This is the classical formula for the least squares estimator in matrix notation. If the matrix $X$ has rank $k$, it follows that the Hessian matrix
$$\frac{\partial^2 S}{\partial b \, \partial b'} = 2X'X \tag{3.10}$$
is a positive definite matrix (see Exercise 3.2). This implies that (3.9) is indeed the minimum of (3.6). In (3.10) we take the derivatives of a vector ($\partial S / \partial b$) with respect to another vector ($b'$) and we follow the convention to arrange these derivatives in a matrix (see Exercise 3.2). An alternative proof that $b$ minimizes the sum of squares (3.6) that makes no use of first and second order derivatives is given in Exercise 3.3.
Summary of computations
The least squares estimates can be computed as follows.
Least squares estimation
Step 1: Choice of variables. Choose the variable to be explained ($y$) and the explanatory variables ($x_1, \ldots, x_k$, where $x_1$ is often the constant that always takes the value 1).
Step 2: Collect data. Collect $n$ observations of $y$ and of the related values of $x_1, \ldots, x_k$ and store the data of $y$ in an $n \times 1$ vector and the data on the explanatory variables in the $n \times k$ matrix $X$.
Step 3: Compute the estimates. Compute the least squares estimates by the OLS formula (3.9) by using a regression package.
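The three steps can be sketched in a few lines of NumPy. This is a minimal illustration on simulated, hypothetical data (the parameter values and noise level are assumptions), not a substitute for a regression package. The normal equations (3.8) are solved directly rather than by forming the inverse in (3.9), and the result is cross-checked against NumPy's least squares routine.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

# Steps 1 and 2: simulated (hypothetical) data; x1 is the constant term.
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x2, x3])        # n x k matrix, k = 3
beta = np.array([1.0, 2.0, -0.5])                # parameters used in the simulation
y = X @ beta + rng.normal(scale=0.3, size=n)     # n x 1 vector

# Step 3: solve the normal equations X'X b = X'y.
b = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least squares routine.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

e = y - X @ b                                    # residuals
print("b        :", b)
print("lstsq    :", b_lstsq)
print("max |X'e|:", np.max(np.abs(X.T @ e)))
```

Solving the linear system is generally preferred numerically to computing $(X'X)^{-1}$ explicitly, although both correspond to the same formula.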
Exercises: T: 3.1, 3.2.
3.1.3 Geometric interpretation
Uses Sections 1.2.2, 1.2.3; Appendix A.6.
Least squares seen as projection
The least squares method can be given a geometric interpretation, which we discuss now. Using the expression (3.9) for $b$, the residuals may be written as
$$e = y - Xb = y - X(X'X)^{-1}X'y = My \tag{3.11}$$
where
$$M = I - X(X'X)^{-1}X'. \tag{3.12}$$
The matrix $M$ is symmetric ($M' = M$) and idempotent ($M^2 = M$). Since it also has the property $MX = 0$, it follows from (3.11) that
$$X'e = 0. \tag{3.13}$$
We may write the explained component $\hat{y}$ of $y$ as
$$\hat{y} = Xb = Hy \tag{3.14}$$
where
$$H = X(X'X)^{-1}X' \tag{3.15}$$
is called the 'hat matrix', since it transforms $y$ into $\hat{y}$ (pronounced: 'y-hat'). Clearly, there holds $H' = H$, $H^2 = H$, $H + M = I$ and $HM = 0$. So
$$y = Hy + My = \hat{y} + e$$
where, because of (3.11) and (3.13), $\hat{y}'e = 0$, so that the vectors $\hat{y}$ and $e$ are orthogonal to each other. Therefore, the least squares method can be given the following interpretation. The sum of squares $e'e$ is the square of the length of the residual vector $e = y - Xb$. The length of this vector is minimized by choosing $Xb$ as the orthogonal projection of $y$ onto the space spanned by the columns of $X$. This is illustrated in Exhibit 3.2. The projection is characterized by the property that $e = y - Xb$ is orthogonal to all columns of $X$, so that $0 = X'e = X'(y - Xb)$. This gives the normal equations (3.8).
Geometry of least squares
Let $S(X)$ be the space spanned by the columns of $X$ (that is, the set of all $n \times 1$ vectors that can be written as $Xa$ for some $k \times 1$ vector $a$) and let $S^\perp(X)$ be the space orthogonal to $S(X)$ (that is, the set of all $n \times 1$ vectors $z$ with the property that $X'z = 0$). The matrix $H$ projects onto $S(X)$ and the matrix $M$ projects onto $S^\perp(X)$. In $y = \hat{y} + e$, the vector $y$ is decomposed into two orthogonal components, with $\hat{y} \in S(X)$ according to (3.14) and $e \in S^\perp(X)$ according to (3.13). The essence of this decomposition is given in Exhibit 3.3, which can be seen as a two-dimensional version of the three-dimensional picture in Exhibit 3.2.
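The projection properties of $H$ and $M$ are easy to verify numerically. The sketch below builds both matrices for a small, hypothetical data matrix (simulated, with assumed dimensions) and checks symmetry, idempotency, and the orthogonality of the two components of $y$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y = rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T          # projects onto S(X)
M = np.eye(n) - H              # projects onto the orthogonal complement of S(X)

y_hat = H @ y                  # explained component
e = M @ y                      # residuals

checks = {
    "H symmetric": np.allclose(H, H.T),
    "H idempotent": np.allclose(H @ H, H),
    "M idempotent": np.allclose(M @ M, M),
    "HM = 0": np.allclose(H @ M, 0),
    "MX = 0": np.allclose(M @ X, 0),
    "X'e = 0": np.allclose(X.T @ e, 0),
    "y_hat'e = 0": np.isclose(y_hat @ e, 0),
    "y = y_hat + e": np.allclose(y, y_hat + e),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Forming the $n \times n$ matrices explicitly is done here only to mirror the algebra; for large $n$ one would normally compute fitted values and residuals without building $H$ or $M$.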
Exhibit 3.2 Least squares
Three-dimensional geometric impression of least squares: the vector of observations on the dependent variable $y$ is projected onto the plane of the independent variables $X$ to obtain the linear combination $Xb$ of the independent variables that is as close as possible to $y$.
Geometric interpretation as a tool in analysis
This geometric interpretation can be helpful to understand some of the algebraic properties of least squares. As an example we consider the effect of applying linear transformations on the set of explanatory variables. Suppose that the $n \times k$ matrix $X$ is replaced by $X_* = XA$ where $A$ is a $k \times k$ invertible matrix. Then the least squares fit ($\hat{y}$), the residuals ($e$), and the projection matrices ($H$ and $M$) remain unaffected by this transformation. This is immediately evident from the geometric pictures in Exhibits 3.2 and 3.3, as $S(X_*) = S(X)$.
Exhibit 3.3 Least squares
Two-dimensional geometric impression of least squares where the $k$-dimensional plane $S(X)$ is represented by the horizontal line: the vector of observations on the dependent variable $y$ is projected onto the space of the independent variables $S(X)$ to obtain the linear combination $Xb$ of the independent variables that is as close as possible to $y$.
The properties can also be checked algebraically, by working out the expressions for $\hat{y}$, $e$, $H$, and $M$ in terms of $X_*$. The least squares estimates change after the transformation, as $b_* = (X_*'X_*)^{-1}X_*'y = A^{-1}b$. For example, suppose that the variable $x_k$ is measured in dollars and $x_{*k}$ is the same variable measured in thousands of dollars. Then $x_{*ki} = x_{ki}/1000$ for $i = 1, \ldots, n$, and $X_* = XA$ where $A$ is the diagonal matrix $\mathrm{diag}(1, \ldots, 1, 0.001)$. The least squares estimates of $\beta_j$ for $j \neq k$ remain unaffected, that is, $b_{*j} = b_j$ for $j \neq k$, and $b_{*k} = 1000 b_k$. This also makes perfect sense, as one unit increase in $x_{*k}$ corresponds to an increase of a thousand units in $x_k$.
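The rescaling example can be reproduced numerically. In the sketch below the last regressor is a hypothetical income variable measured in dollars; all data are simulated and the helper function ols is introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
income_dollars = rng.uniform(20_000, 80_000, size=n)   # hypothetical regressor x_k
X = np.column_stack([np.ones(n), rng.normal(size=n), income_dollars])
y = 1.0 + 0.5 * X[:, 1] + 0.0001 * income_dollars + rng.normal(size=n)

A = np.diag([1.0, 1.0, 0.001])    # last column now measured in thousands of dollars
X_star = X @ A

def ols(X, y):
    """Return coefficients, fitted values and residuals."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b
    return b, fitted, y - fitted

b, fitted, e = ols(X, y)
b_star, fitted_star, e_star = ols(X_star, y)

print("b      :", b)
print("b_star :", b_star)
print("same fit:", np.allclose(fitted, fitted_star))
print("b_star = A^{-1} b:", np.allclose(b_star, np.linalg.solve(A, b)))
```

The fitted values and residuals coincide for both parameterizations, while the coefficient of the rescaled variable is multiplied by 1000.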
Exercises: T: 3.3.
3.1.4 Statistical properties
Uses Sections 1.2.2, 1.3.2.
Seven assumptions on the multiple regression model
To analyse the statistical properties of least squares estimation, it is convenient to use as conceptual background again the simulation experiment described in Section 2.2.1 (p. 87–8). We first restate the seven assumptions of Section 2.2.3 (p. 92) for the multiple regression model (3.3) and use the matrix notation introduced in Section 3.1.2.
• Assumption 1: fixed regressors. All elements of the $n \times k$ matrix $X$ containing the observations on the explanatory variables are non-stochastic. It is assumed that $n \geq k$ and that the matrix $X$ has rank $k$.
• Assumption 2: random disturbances, zero mean. The $n \times 1$ vector $\varepsilon$ consists of random disturbances with zero mean so that $E[\varepsilon] = 0$, that is, $E[\varepsilon_i] = 0$ ($i = 1, \ldots, n$).
• Assumption 3: homoskedasticity. The covariance matrix of the disturbances $E[\varepsilon\varepsilon']$ exists and all its diagonal elements are equal to $\sigma^2$, that is, $E[\varepsilon_i^2] = \sigma^2$ ($i = 1, \ldots, n$).
• Assumption 4: no correlation. The off-diagonal elements of the covariance matrix of the disturbances $E[\varepsilon\varepsilon']$ are all equal to zero, that is, $E[\varepsilon_i\varepsilon_j] = 0$ for all $i \neq j$.
• Assumption 5: constant parameters. The elements of the $k \times 1$ vector $\beta$ and the scalar $\sigma$ are fixed unknown numbers with $\sigma > 0$.
• Assumption 6: linear model. The data on the explained variable $y$ have been generated by the data generating process (DGP)
$$y = X\beta + \varepsilon. \tag{3.16}$$
• Assumption 7: normality. The disturbances are jointly normally distributed.
Assumptions 3 and 4 can be summarized in matrix notation as
$$E[\varepsilon\varepsilon'] = \sigma^2 I, \tag{3.17}$$
where $I$ denotes the $n \times n$ identity matrix. If in addition Assumption 7 is satisfied, then $\varepsilon$ follows the multivariate normal distribution
$$\varepsilon \sim N(0, \sigma^2 I).$$
Assumptions 4 and 7 imply that the disturbances $\varepsilon_i$, $i = 1, \ldots, n$ are mutually independent.
Least squares is unbiased
The expected value of $b$ is obtained by using Assumptions 1, 2, 5, and 6. Assumption 6 implies that the least squares estimator $b = (X'X)^{-1}X'y$ can be written as
$$b = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\varepsilon.$$
Taking expectations is a linear operation, that is, if $z_1$ and $z_2$ are two random variables and $A_1$ and $A_2$ are two non-random matrices of appropriate dimensions so that $z = A_1z_1 + A_2z_2$ is well defined, then $E[z] = A_1E[z_1] + A_2E[z_2]$. From Assumptions 1, 2, and 5 we obtain
$$E[b] = E[\beta + (X'X)^{-1}X'\varepsilon] = \beta + (X'X)^{-1}X'E[\varepsilon] = \beta. \tag{3.18}$$
So $b$ is unbiased.
The covariance matrix of b
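Using the result (3.18), we obtain that under Assumptions 1–6 the covariance matrix of $b$ is given by
$$\begin{aligned} \mathrm{var}(b) &= E[(b - \beta)(b - \beta)'] = E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}] \\ &= (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1} = (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\ &= \sigma^2(X'X)^{-1}. \end{aligned} \tag{3.19}$$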
The diagonal elements of this matrix are the variances of the estimators of the individual parameters, and the off-diagonal elements are the covariances between these estimators.
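Unbiasedness and the covariance formula can be illustrated by a simulation experiment in the spirit of the conceptual background mentioned above. The sketch below keeps a hypothetical matrix $X$ fixed, draws many disturbance vectors, and compares the sample mean and sample covariance of the resulting estimates with $\beta$ and $\sigma^2(X'X)^{-1}$. The parameter values, sample size, and number of replications are assumptions, and the disturbances are drawn from a normal distribution for convenience.

```python
import numpy as np

rng = np.random.default_rng(6)
n, sigma = 40, 0.5
beta = np.array([1.0, 2.0, -1.0])

# Assumption 1: X is fixed across replications.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20_000
# Each row of eps is one draw of the n disturbances (zero mean, variance sigma^2, uncorrelated).
eps = rng.normal(scale=sigma, size=(reps, n))
Y = X @ beta + eps                      # reps x n, one simulated sample per row
B = Y @ X @ XtX_inv                     # reps x k, row r equals ((X'X)^{-1} X' y_r)'

print("mean of b      :", B.mean(axis=0))
print("true beta      :", beta)
print("sample cov of b:\n", np.cov(B, rowvar=False))
print("sigma^2 (X'X)^{-1}:\n", sigma**2 * XtX_inv)
```

The agreement is only approximate, since the simulated moments are themselves subject to sampling variation that shrinks as the number of replications grows.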
Least squares is best linear unbiased
The Gauss–Markov theorem, proved in Section 2.2.5 (p. 97–8) for the simple regression model, also holds for the more general model (3.16). It states that, among all linear unbiased estimators, $b$ has minimal variance, that is, $b$ is the best linear unbiased estimator (BLUE) in the sense that, if $\hat{\beta} = Ay$ with $A$ a $k \times n$ non-stochastic matrix and $E[\hat{\beta}] = \beta$, then $\mathrm{var}(\hat{\beta}) - \mathrm{var}(b)$ is a positive semidefinite matrix. This means that for every $k \times 1$ vector $c$ of constants there holds $c'(\mathrm{var}(\hat{\beta}) - \mathrm{var}(b))c \geq 0$, or, equivalently, $\mathrm{var}(c'b) \leq \mathrm{var}(c'\hat{\beta})$. Choosing for $c$ the $j$th unit vector, this means in particular that for the $j$th component $\mathrm{var}(b_j) \leq \mathrm{var}(\hat{\beta}_j)$ so that the least squares estimators are efficient. This result holds true under Assumptions 1–6; the assumption of normality is not needed.
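A numerical sketch may help to see what the theorem asserts. Below, least squares is compared with one particular competing linear unbiased estimator, constructed from an arbitrary diagonal weight matrix W. This competitor is an assumption chosen for illustration and is not taken from the text; any matrix $A$ with $AX = I$ would serve. The difference of the two covariance matrices should have no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k, sigma = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])

# Least squares weights: b = A_ols y.
A_ols = np.linalg.solve(X.T @ X, X.T)

# A competing linear estimator A_alt y, built from arbitrary positive weights W.
# It satisfies A_alt X = I, so it is unbiased as well.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
A_alt = np.linalg.solve(X.T @ W @ X, X.T @ W)

var_ols = sigma**2 * A_ols @ A_ols.T     # equals sigma^2 (X'X)^{-1}
var_alt = sigma**2 * A_alt @ A_alt.T     # var(Ay) = sigma^2 AA' under Assumptions 1-6

diff = var_alt - var_ols
eigenvalues = np.linalg.eigvalsh(diff)   # diff is symmetric

print("A_alt X = I:", np.allclose(A_alt @ X, np.eye(k)))
print("eigenvalues of var(alt) - var(ols):", eigenvalues)
```

Non-negative eigenvalues (up to rounding) correspond to positive semidefiniteness, so each coefficient of the competing estimator has a variance at least as large as that of least squares.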
Proof of Gauss–Markov theorem
To prove the result, first note that the condition that $E[\hat{\beta}] = E[Ay] = AE[y] = AX\beta = \beta$ for all $\beta$ implies that $AX = I$, the $k \times k$ identity matrix. Now define $D = A - (X'X)^{-1}X'$, then $DX = AX - (X'X)^{-1}X'X = I - I = 0$ so that
$$\mathrm{var}(\hat{\beta}) = \mathrm{var}(Ay) = \mathrm{var}(A\varepsilon) = \sigma^2 AA' = \sigma^2 DD' + \sigma^2 (X'X)^{-1},$$
where the last equality follows by writing $A = D + (X'X)^{-1}X'$ and working out $AA'$: the cross terms $DX(X'X)^{-1}$ and $(X'X)^{-1}X'D'$ vanish because $DX = 0$. This shows that $\mathrm{var}(\hat{\beta}) - \mathrm{var}(b) = \sigma^2 DD'$, which is positive semidefinite, and zero if and only if $D = 0$, that is, $A = (X'X)^{-1}X'$. So $\hat{\beta} = b$ gives the minimal variance.
Exercises: T: 3.4.
3.1.5 Estimating the disturbance variance
Derivation of unbiased estimator
Next we consider the estimation of the unknown variance $\sigma^2$. As in the previous chapter we make use of the sum of squared residuals $e'e$. Intuition could suggest to estimate $\sigma^2 = E[\varepsilon_i^2]$ by the sample average $\frac{1}{n}\sum e_i^2 = \frac{1}{n}e'e$, but this estimator is not unbiased. It follows from (3.11) and (3.16) and the fact that $MX = 0$ that $e = My = M(X\beta + \varepsilon) = M\varepsilon$. So
$$E[e] = 0, \tag{3.20}$$
$$\mathrm{var}(e) = E[ee'] = E[M\varepsilon\varepsilon'M] = ME[\varepsilon\varepsilon']M = \sigma^2 M^2 = \sigma^2 M. \tag{3.21}$$
To evaluate $E[e'e]$ it is convenient to use the trace of a square matrix, which is defined as the sum of the diagonal elements of this matrix. Because the trace and the expectation operator can be interchanged, we find, using the property that $\mathrm{tr}(AB) = \mathrm{tr}(BA)$, that
$$E[e'e] = E[\mathrm{tr}(ee')] = \mathrm{tr}(E[ee']) = \sigma^2 \mathrm{tr}(M).$$
Using the property that $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B)$ we can simplify this as
$$\mathrm{tr}(M) = \mathrm{tr}(I_n - X(X'X)^{-1}X') = n - \mathrm{tr}(X(X'X)^{-1}X') = n - \mathrm{tr}(X'X(X'X)^{-1}) = n - \mathrm{tr}(I_k) = n - k,$$
where the subscripts denote the order of the identity matrices.
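Both the trace identity and its consequence for $E[e'e]$ can be checked by simulation. The sketch below uses a hypothetical fixed matrix $X$ and normally distributed disturbances; the dimensions, the value of sigma, and the number of replications are assumptions. It compares the average of $e'e$ divided by $n$ and by $n - k$ with $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k, sigma = 25, 4, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

print("tr(M):", np.trace(M), "| n - k:", n - k)

# Monte Carlo: e = M eps, so e'e can be simulated without choosing beta.
reps = 50_000
eps = rng.normal(scale=sigma, size=(reps, n))
E = eps @ M                               # row r is (M eps_r)' because M is symmetric
sse = np.sum(E**2, axis=1)                # e'e for each replication

print("mean of e'e / n      :", np.mean(sse) / n)
print("mean of e'e / (n - k):", np.mean(sse) / (n - k))
print("sigma^2              :", sigma**2)
```

Dividing by $n$ systematically underestimates $\sigma^2$ in this experiment, whereas dividing by $n - k$ gives an average close to $\sigma^2$.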
The least squares estimator $s^2$ and standard errors
This shows that $E[e'e] = (n - k)\sigma^2$ so that
$$s^2 = \frac{e'e}{n - k} \tag{3.22}$$
is an unbiased estimator of $\sigma^2$. The square root $s$ of (3.22) is called the standard error of the regression. If in the expression (3.19) we replace $\sigma^2$ by $s^2$ and if we denote the $j$th diagonal element of $(X'X)^{-1}$ by $a_{jj}$, then $s\sqrt{a_{jj}}$ is called the standard error of the estimated coefficient $b_j$. This is an estimate of the standard deviation $\sigma\sqrt{a_{jj}}$ of $b_j$.
Intuition for the factor $1/(n - k)$
The result in (3.22) can also be given a more intuitive interpretation. Suppose we would try to explain $y$ by a matrix $X$ with $k = n$ columns and rank $k$. Then we would obtain $e = 0$, a perfect fit, but we would not have obtained any information on $\sigma^2$. Of course this is an extreme case. In practice we confine ourselves to the case $k < n$. The very fact that we choose $b$ in such a way that the sum of squared residuals is minimized is the cause of the fact that the squared residuals are smaller (on average) than the squared disturbances. Let us consider a diagonal element of (3.21),
$$\mathrm{var}(e_i) = \sigma^2(1 - h_i), \tag{3.23}$$
where $h_i$ is the $i$th diagonal element of the matrix $H = I - M$ in (3.15). As $H$ is positive se