ECON61001: Econometric Methods¹
Lecture Notes

Instructor: Alastair R. Hall²

January 18, 2021

Copyright 2020 Alastair R. Hall. Not to be reproduced without the permission of the author.

¹ Postgraduate course offered at the University of Manchester.
² Professor of Econometrics, Economics, School of Social Sciences, University of Manchester, Manchester M13 9PL, UK.
Contents

1 Introduction
  1.1 The linear regression model
  1.2 Why use linear algebra?
  1.3 Overview of course

2 The linear regression model
  2.1 Classical Assumptions
  2.2 Ordinary Least Squares Estimation
  2.3 OLS, partial correlation and the Frisch-Waugh-Lovell Theorem
  2.4 The sampling distribution of the OLS estimator
    2.4.1 Background
    2.4.2 Sampling distribution of β̂_T
  2.5 OLS estimator of σ_0²
  2.6 Confidence intervals for β_{0,i}
  2.7 Prediction intervals
  2.8 Hypothesis testing
    2.8.1 Classical hypothesis testing framework
    2.8.2 Testing hypotheses about individual coefficients
    2.8.3 Testing whether β_0 satisfies a set of linear restrictions
  2.9 Restricted Least Squares
  2.10 Variable selection: R² and adjusted R²
  2.11 Stochastic regressors
  2.12 OLS as linear projection
  2.13 Appendix: Background for distributional results

3 Large Sample Statistical Theory
  3.1 Background
  3.2 Large sample analysis of the OLS inference framework: cross-section data
  3.3 Large sample analysis of the OLS inference framework: time series data
    3.3.1 Statistical background
    3.3.2 Properties of OLS
  3.4 Appendix: more on the long run variance and weak dependence
  3.5 Appendix: The large sample behaviour of OLS estimators with trending data

4 Inference in the Linear Model with Non-Spherical Errors
  4.1 Introduction
  4.2 OLS and GLS: generic analysis
    4.2.1 OLS
    4.2.2 Generalized Least Squares
  4.3 Heteroscedasticity in cross-sectional data
    4.3.1 OLS-based inference
    4.3.2 Generalized Least Squares
    4.3.3 Testing for heteroscedasticity
    4.3.4 Empirical illustration
  4.4 Time series and non-spherical errors
    4.4.1 Conditional heteroscedasticity in time series models
    4.4.2 OLS
    4.4.3 GLS
    4.4.4 Testing for serial correlation
    4.4.5 Empirical illustration

5 Instrumental Variables estimation
  5.1 Endogenous regressor models
  5.2 OLS as a Method of Moments estimator
  5.3 Instrumental Variables estimation
  5.4 Large sample properties and inference
    5.4.1 Cross-sectional data
    5.4.2 Time series data
  5.5 IV and 2SLS
  5.6 Examples of instruments
  5.7 Weak instruments
  5.8 Empirical example

6 Maximum Likelihood
  6.1 Intuition and definition
  6.2 Binary response models
    6.2.1 The linear probability model
    6.2.2 Probit model
    6.2.3 Logit model
    6.2.4 Empirical example
  6.3 The linear regression model
    6.3.1 MLE under normality
    6.3.2 MLE under Student's t distribution
  6.4 Appendix: The large sample distribution of the MLE and two important identities
Chapter 1

Introduction
1.1 The linear regression model
Econometrics involves the use of statistical methods to analyze economic issues. Examples include: how do individuals choose the amount of their income to spend on consumer goods and the amount to save? How do firms choose the level of output to produce and the number of workers to employ? How is the interest rate set by the Central Bank?
The answers to these questions start with the development of an economic theory that postulates an explanation for the phenomenon of interest. This theory is most often expressed via an economic model, which is a set of mathematical equations involving economic variables and certain constants, known as parameters, that reflect aspects of the economic environment such as taste preferences of consumers or available technology for firms. While the interpretation of these parameters is known, their specific value is not. Therefore, in order to assess whether the postulated model provides useful insights, it is necessary to find appropriate values for these parameters based on observed economic data using statistical methods.
Here are some examples of econometric models; in each case we use “error” to denote unobservable factors that are assumed not to have a systematic effect on the variable in question.
• The Capital Asset Pricing Model (CAPM) implies that the return on an asset, R, is determined by the model
R − Rf = β0(Rm − Rf) + error
where Rf is the return on the risk-free asset and Rm is the market return.
• A simplified version of the model used to assess the returns to education is:
ln(w) = β0,1 + β0,2·ed + β0,3·exp + β0,4·exp² + error
where w denotes weekly wage, ed is the number of years of education and exp denotes the number of years of experience. An individual’s wage, education and experience are all observed, but the regression coefficients β0,i are unknown.
• The Cobb-Douglas production function implies:
ln(Q) = β0,1 + β0,2 ∗ ln(L) + β0,3 ∗ ln(K) + error
where Q is output, K is the level of capital stock and L is the amount of labour employed.
• A simple time series model for forecasting the change in inflation is: ∆inf = β0,1 + β0,2 ∗ ∆inf(−1) + error
where ∆inf denotes the change in inflation this time period and ∆inf(−1) denotes the change in inflation last period.
In all these models, the beta terms denote parameters. As noted above, the parameters tell us something about the economic phenomena in question. For example, in the CAPM example β0 is a measure of the volatility, or systematic risk, of an asset in comparison to the market as a whole. If β0 equals one then the asset exhibits the same volatility as the market, if it is less than one then the asset is less volatile than the market, and if it is greater than one then it is more volatile than the market. For most assets β0 is positive. While the interpretation is known, the specific value is unknown for any specific asset.
All these examples have a common structure: one variable is explained as a weighted linear combination of a set of variables plus an error term which captures unobservable factors that affect the left-hand side variable in question. Crucially, the weights are treated as unknown parameters. While not all econometric models take this form, many do and this specification is going to be the main focus of our discussion.
It is useful to have a generic notation, and so we write our equation of interest as:
yt = xt,1β0,1 + xt,2β0,2 + . . . + xt,kβ0,k + ut    (1.1)
in which yt denotes an observable scalar economic variable referred to here as the dependent variable and {xt,j; j = 1, 2, . . ., k} a set of scalar observable economic variables, referred to as the explanatory variables, {β0,j ; j = 1, 2, . . . , k} are unknown parameters and ut captures an unobservable error. Here t indexes the observation and it is assumed the sample consists of observations t = 1, 2, . . . , T . In many cases, the model includes an intercept term and by convention this is included in the specification (1.1) by setting xt,1 = 1 so that the value of the intercept term is β0,1.
The specification in (1.1) is referred to as a linear regression model because the right-hand side is linear in the regression parameters. Notice that both the dependent and explanatory variables may involve nonlinear transformations
of underlying economic variables. For example, in the returns to education example above, the dependent variable is the logarithm of the wage and one of the regressors is the square of experience. However, if the dependent variable or explanatory variables do involve nonlinear transformations then this affects the interpretation of the regression coefficients, as the following examples illustrate. In each case, x is a continuous variable and we omit the error terms for ease of presentation.
Example 1.1. Suppose y = β_{0,1} + β_{0,2}x, then
\[
\beta_{0,2} = \frac{\partial y}{\partial x}
\]
and so β_{0,2} is the marginal response in y to a change in x.

Example 1.2. Suppose ln(y) = β_{0,1} + β_{0,2}x, then
\[
\beta_{0,2} = \frac{1}{y}\,\frac{\partial y}{\partial x}
\]
and so β_{0,2} is the marginal response in y to a change in x expressed as a fraction of y. In other words, 100β_{0,2} is the percentage change in y resulting from a change in x. (In this case, 100β_{0,2} is known as the semi-elasticity of y with respect to x.)

Example 1.3. Suppose ln(y) = β_{0,1} + β_{0,2}ln(x), then
\[
\beta_{0,2} = \frac{x}{y}\,\frac{\partial y}{\partial x}
\]
and so β_{0,2} is the marginal response in y to a change in x expressed as a fraction of the ratio of y to x. In other words, β_{0,2} is the percentage change in y resulting from a one percent change in x. (In this case, β_{0,2} is known as the elasticity of y with respect to x.)

Example 1.4. Suppose y = β_{0,1} + β_{0,2}x + β_{0,3}x^2, then
\[
\frac{\partial y}{\partial x} = \beta_{0,2} + 2\beta_{0,3}x
\]
and so β_{0,2} is the marginal response in y to a small change in x at x = 0, and β_{0,3} is half the change in ∂y/∂x in response to a change in x.
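To make the elasticity interpretation in Example 1.3 concrete, the following short numerical sketch (not part of the original notes; the parameter values are purely hypothetical) confirms that under a log-log specification a one percent change in x produces approximately a β_{0,2} percent change in y.

```python
import numpy as np

# Hypothetical log-log relationship: ln(y) = beta1 + beta2*ln(x), so beta2 is the elasticity.
beta1, beta2 = 1.0, 0.6

def y_of(x):
    return np.exp(beta1 + beta2 * np.log(x))

x0 = 50.0
x1 = x0 * 1.01                      # a 1% increase in x
pct_change_y = 100 * (y_of(x1) - y_of(x0)) / y_of(x0)

print(f"percentage change in y: {pct_change_y:.3f}%")   # approximately beta2 = 0.6 percent
```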
We now present an equivalent version of the model in (1.1) using vector and matrix notation. To this end, define x_t to be the k × 1 vector with jth element x_{t,j} and β_0 to be the k × 1 vector with jth element β_{0,j}, that is,
\[
x_t = \begin{pmatrix} x_{t,1} \\ x_{t,2} \\ \vdots \\ x_{t,k} \end{pmatrix}, \qquad
\beta_0 = \begin{pmatrix} \beta_{0,1} \\ \beta_{0,2} \\ \vdots \\ \beta_{0,k} \end{pmatrix}.
\]
Thus, we can equivalently write (1.1) as
\[
y_t = x_t'\beta_0 + u_t, \qquad t = 1, 2, \ldots, T. \qquad (1.2)
\]
Equation (1.2) implies that the following T equations hold:
\[
\begin{aligned}
y_1 &= x_1'\beta_0 + u_1 \\
y_2 &= x_2'\beta_0 + u_2 \\
y_3 &= x_3'\beta_0 + u_3 \\
&\;\;\vdots \\
y_T &= x_T'\beta_0 + u_T
\end{aligned} \qquad (1.3)
\]
It turns out to be analytically convenient to write this system of T individual equations as a single equation using matrix algebra. To see how this can be done, we must first introduce some additional notation. Let y denote the T × 1 vector with tth element y_t, X denote the T × k matrix with tth row x_t', and u denote the T × 1 vector with tth element u_t, that is,
\[
y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_T \end{pmatrix}, \qquad
X = \begin{pmatrix} x_1' \\ x_2' \\ x_3' \\ \vdots \\ x_T' \end{pmatrix}, \qquad
u = \begin{pmatrix} u_1 \\ u_2 \\ u_3 \\ \vdots \\ u_T \end{pmatrix}.
\]
Then (1.3) can be written equivalently as:
\[
y = X\beta_0 + u.
\]
This is a much more compact version of the model and it is the one with which we work in this course.

1.2 Why use linear algebra?

It may reasonably be asked at this stage why we should adopt the linear algebra version of the regression model instead of the original specification in (1.1). The answer is that by working with the vector/matrix version of the model we will be able to take advantage of linear algebra to uncover important mathematical patterns from which we can deduce convenient formulae for statistics of interest that can be used not only to calculate these statistics but also to deduce their statistical properties.

To illustrate this point, we consider the so-called normal equations associated with Ordinary Least Squares estimation. Later we will derive the normal equations and the OLS estimator formally, but for now it suffices to know that the normal equations implicitly characterize the values of the OLS estimator of
the regression parameters given a sample of size T for the dependent and explanatory variables. The values of the OLS estimators can be found by solving the normal equations.
First consider the case in which k = 1, so that there is only one explanatory variable as in the CAPM example.¹ In this case, the normal equation is:
\[
\sum_{t=1}^{T} x_{t,1} y_t = \sum_{t=1}^{T} x_{t,1}^2\,\hat{\beta}_1, \qquad (1.4)
\]
where \hat{\beta}_1 denotes the OLS estimator of β_{0,1}. This equation can be solved straightforwardly for \hat{\beta}_1: dividing both sides of (1.4) by \sum_{t=1}^{T} x_{t,1}^2, we obtain
\[
\hat{\beta}_1 = \frac{\sum_{t=1}^{T} x_{t,1} y_t}{\sum_{t=1}^{T} x_{t,1}^2}. \qquad (1.5)
\]

¹ Recall there is no intercept in this model.
Now consider the case of the simple linear regression model in which k = 2 and x_{t,1} = 1 for all t. In this case the normal equations are:
\[
\sum_{t=1}^{T} y_t = \hat{\beta}_1 T + \sum_{t=1}^{T} x_{t,2}\,\hat{\beta}_2, \qquad (1.6)
\]
\[
\sum_{t=1}^{T} x_{t,2} y_t = \hat{\beta}_1 \sum_{t=1}^{T} x_{t,2} + \sum_{t=1}^{T} x_{t,2}^2\,\hat{\beta}_2. \qquad (1.7)
\]
Solving for the OLS estimators takes a little more work than before, but is still fairly straightforward. Dividing both sides of (1.6) by T, we obtain \bar{y} = \hat{\beta}_1 + \bar{x}_2\hat{\beta}_2, which can be rearranged to yield:
\[
\hat{\beta}_1 = \bar{y} - \bar{x}_2\hat{\beta}_2. \qquad (1.8)
\]
Substituting (1.8) into (1.7), we obtain
\[
\sum_{t=1}^{T} x_{t,2} y_t = (\bar{y} - \bar{x}_2\hat{\beta}_2)\sum_{t=1}^{T} x_{t,2} + \sum_{t=1}^{T} x_{t,2}^2\,\hat{\beta}_2. \qquad (1.9)
\]
Rearranging (1.9) so that those terms involving \hat{\beta}_2 are on one side of this equation and those not involving \hat{\beta}_2 on the other side, we obtain
\[
\sum_{t=1}^{T} x_{t,2} y_t - \bar{y}\sum_{t=1}^{T} x_{t,2} = \Big(\sum_{t=1}^{T} x_{t,2}^2 - \bar{x}_2\sum_{t=1}^{T} x_{t,2}\Big)\hat{\beta}_2, \qquad (1.10)
\]
and so
\[
\hat{\beta}_2 = \frac{\sum_{t=1}^{T} x_{t,2} y_t - \bar{y}\sum_{t=1}^{T} x_{t,2}}{\sum_{t=1}^{T} x_{t,2}^2 - \bar{x}_2\sum_{t=1}^{T} x_{t,2}}. \qquad (1.11)
\]
Since for any a_t, b_t we have²
\[
\sum_{t=1}^{T}(a_t - \bar{a})(b_t - \bar{b}) = \sum_{t=1}^{T} a_t b_t - T\bar{a}\bar{b}, \qquad (1.12)
\]
we can equivalently express \hat{\beta}_2 as:
\[
\hat{\beta}_2 = \frac{\sum_{t=1}^{T}(x_{t,2} - \bar{x}_2)(y_t - \bar{y})}{\sum_{t=1}^{T}(x_{t,2} - \bar{x}_2)^2}. \qquad (1.13)
\]
Now consider the case where k = 4 and x_{t,1} = 1. The normal equations are:
\[
\sum_{t=1}^{T} y_t = \hat{\beta}_1 T + \sum_{t=1}^{T} x_{t,2}\hat{\beta}_2 + \sum_{t=1}^{T} x_{t,3}\hat{\beta}_3 + \sum_{t=1}^{T} x_{t,4}\hat{\beta}_4,
\]
\[
\sum_{t=1}^{T} x_{t,2} y_t = \hat{\beta}_1\sum_{t=1}^{T} x_{t,2} + \sum_{t=1}^{T} x_{t,2}^2\hat{\beta}_2 + \sum_{t=1}^{T} x_{t,2}x_{t,3}\hat{\beta}_3 + \sum_{t=1}^{T} x_{t,2}x_{t,4}\hat{\beta}_4,
\]
\[
\sum_{t=1}^{T} x_{t,3} y_t = \hat{\beta}_1\sum_{t=1}^{T} x_{t,3} + \sum_{t=1}^{T} x_{t,2}x_{t,3}\hat{\beta}_2 + \sum_{t=1}^{T} x_{t,3}^2\hat{\beta}_3 + \sum_{t=1}^{T} x_{t,3}x_{t,4}\hat{\beta}_4,
\]
\[
\sum_{t=1}^{T} x_{t,4} y_t = \hat{\beta}_1\sum_{t=1}^{T} x_{t,4} + \sum_{t=1}^{T} x_{t,2}x_{t,4}\hat{\beta}_2 + \sum_{t=1}^{T} x_{t,3}x_{t,4}\hat{\beta}_3 + \sum_{t=1}^{T} x_{t,4}^2\hat{\beta}_4.
\]
In principle, we can solve these equations using a similar approach as above to obtain four separate equations for the OLS coefficient estimators as functions of \{y_t, x_{2,t}, x_{3,t}, x_{4,t}\}_{t=1}^{T}, but it would take some time (try it!), and the situation would only become worse as the number of regressors is increased.
However, linear algebra comes to the rescue. We can write the normal equations as:
\[
\sum_{t=1}^{T} x_t y_t = \Big(\sum_{t=1}^{T} x_t x_t'\Big)\hat{\beta},
\]
where \hat{\beta} = [\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4]', or equivalently as
\[
X'y = (X'X)\hat{\beta}.
\]
Under conditions to be discussed in the next section, we can use linear algebraic arguments to deduce that
\[
\hat{\beta} = (X'X)^{-1}X'y, \qquad (1.14)
\]
where (X'X)^{-1} is the inverse of X'X. Thus by using matrix notation we have an expression for the OLS estimators that has the analytical simplicity of the single regressor case. In fact, as we shall see, the generic formula in (1.14) is valid for any choice of k. (It contains the formulae for the k = 1 and k = 2 cases as special cases.)

² To see this, multiply out the term inside the sum to give: \sum_{t=1}^{T}(a_t - \bar{a})(b_t - \bar{b}) = \sum_{t=1}^{T}(a_t b_t - \bar{a}b_t - a_t\bar{b} + \bar{a}\bar{b}). Using the fact that the sum is a linear operator – that is, the sum of a linear combination is the linear combination of the sums – \sum_{t=1}^{T}(a_t b_t - \bar{a}b_t - a_t\bar{b} + \bar{a}\bar{b}) = \sum_{t=1}^{T} a_t b_t - \bar{a}\sum_{t=1}^{T} b_t - \bar{b}\sum_{t=1}^{T} a_t + T\bar{a}\bar{b}. Finally note that \sum_{t=1}^{T} a_t = T\bar{a}, with a similar result for b.
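As an informal check of (1.14), the following Python sketch (illustrative only, using simulated data that is not part of the original notes) computes β̂ = (X'X)^{-1}X'y for a model with an intercept and one regressor and confirms that it coincides with the scalar formulas (1.8) and (1.13).

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100
x2 = rng.normal(size=T)
u = rng.normal(size=T)
y = 1.0 + 2.0 * x2 + u                      # simulated data: intercept plus one regressor

X = np.column_stack([np.ones(T), x2])       # T x 2 regressor matrix

# Matrix formula (1.14): beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Scalar formulas (1.13) and (1.8) for the simple regression model
b2 = np.sum((x2 - x2.mean()) * (y - y.mean())) / np.sum((x2 - x2.mean()) ** 2)
b1 = y.mean() - x2.mean() * b2

print(np.allclose(beta_hat, [b1, b2]))      # True: the two routes agree
```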
1.3 Overview of course
The main aim of ECON61001 is to provide a firm grounding in popular methods of estimation and inference in econometric models. The estimation methods covered include Ordinary Least Squares, Generalized Least Squares, Weighted Least Squares, Instrumental Variables, Maximum Likelihood and Generalized Method of Moments. The majority of the course focuses on the linear regression model, but we also explore a class of nonlinear models known as binary choice models. As part of the discussion of these topics, the course provides you with both an understanding of important statistical concepts and also experience in using linear algebra in econometric analysis. Orme (2009) provides a good source for those wishing to read more about the linear algebra concepts and results referred to in these lecture notes and is available via the course BB page. The course outline is as follows.
• The Classical Linear Regression model
– Specification and Assumptions in matrix notation
– Definition of Ordinary Least Squares estimation
– Statistical properties of OLS: mean, variance and the Gauss-Markov Theorem
– Inference based on OLS: (i) about regression parameters; (ii) prediction.
– OLS as linear projection and its associated decomposition of the error sum of squares (including discussion of projection matrices)
– Restricted Least Squares (RLS)

• Large sample analysis of OLS

– Stochastic regressors and why large sample analysis is used in econometrics
– Framework for large sample analysis: convergence in probability, convergence in distribution, Weak Law of Large Numbers, Central Limit Theorem, Slutsky's Theorem.
– Inference framework
• Inference in regression models estimated from cross-section data
– Assumptions
– Heteroscedasticity: consequences for OLS and testing for
– Robust inference based on OLS in the presence of heteroscedasticity
• Inference in regression models estimated from time series data
– Assumptions
– Serial correlation: consequences for OLS and testing for
– Robust inference based on OLS in the presence of serial correlation
• Generalized Least Squares (GLS)
– GLS versus OLS
– GLS in models with heteroscedasticity; Weighted Least Squares (WLS)
– GLS in time series regression models with serial correlation
• Instrumental Variables (IV) estimation
– Problem of endogeneity in econometric analysis using OLS
– Definition of Instrumental Variables (IV) and Two Stage Least Squares
– Assumptions for IV estimation
– Large sample properties of IV estimators
• Maximum Likelihood and Binary Choice Models
– Maximum Likelihood estimation: definition, properties and inference
– Binary choice models: linear probability, Probit and Logit models

• Generalized Method of Moments
– Definition and properties of estimators
– As a unifying framework for estimation and inference
Chapter 2
The linear regression model
In this chapter, we introduce the linear regression model, derive the OLS estimator of the regression parameters, and present methods both for performing inference about these parameters and for predicting values of the endogenous variable given a value for the regressors.
2.1 Classical Assumptions
Before we formally derive the OLS estimator, we first complete the model by listing a set of assumptions about how the data are generated. These assump- tions are known as the Classical Assumptions and when imposed yield what is referred to as the Classical linear regression model. To present these assump- tions, we need the following definitions.
Let v be an n × 1 random vector, that is, an n × 1 vector of random variables, with ith element v_i.
Definition 2.1. E[v] denotes the n × 1 vector with ith element E[vi] and is referred to as the population mean of v.
Definition 2.2. Var[v] = E{[v − E[v]][v − E[v]]′} and so is the n × n matrix with (i, j)th element Cov[v_i, v_j]. Var[v] is referred to as the variance-covariance matrix of v.
Notice that the main diagonal consists of the variances that is, the (i, i)th element of V ar[v] is V ar[vi]; also that V ar[v] is symmetric by construction as Cov[vi,vj] = Cov[vj,vi].
Definition 2.3. If v has a multivariate normal distribution then its probability density function is given by:¹
\[
f(v; \mu, \Omega) = \frac{1}{(2\pi)^{n/2}\det(\Omega)^{1/2}}\exp\{-(v - \mu)'\Omega^{-1}(v - \mu)/2\},
\]
where E[v] = μ and Var[v] = Ω. This is written as v ∼ N(μ, Ω).

¹ exp{a} = e^a.
With this background, we can now present the Classical Assumptions, denoted by CA1-CA6 below.
Assumption CA1. The true model for y is: y = Xβ0 + u.
Assumption CA1 states that y is generated by the same regression model as we estimate. Thus we are estimating a correctly specified model. The import of this assumption comes from the assumptions below regarding u (and X).
Assumption CA2. X is fixed in repeated samples.
To understand the implication of this assumption, recall that within the Classical statistical paradigm the probability of an event is defined as the relative frequency of that event in an infinite number of repetitions of the underlying statistical experiment. In our case, the statistical experiment involves drawing the sample of y, X. Assumption CA2 implies that in each of these repeated experiments the data matrix X is the same. The error, u, is a random variable and so its value is different in general in each sample, meaning – from Assumption CA1 – that y is random and so its value is also different in general in each sample. It should be noted that this is a totally implausible assumption for economic data but is adopted here because it simplifies our linear algebra analysis.² Later we will relax this assumption and allow X to be random, and then assess what impact it has on the analysis developed under Assumption CA2.
Assumption CA3. X is rank k.
Since X is T × k, Assumption CA3 states that X is of full column rank and so, in other words, that the columns of X form a linearly independent set. As a result, no one column of X can be written as a linear combination of the other columns. To understand why this is crucial, consider the case where k = 3. Let X•,j denote the jth column of X, and suppose that the third column of X can be written as a linear combination of the other two that is, there exist constants c1 and c2 such that
X•,3 = c1X•,1 + c2X•,2.
Substituting for X_{•,3}, it can be recognized that Xβ_0 = Xβ* where β* = (β_{0,1} + c_1β_{0,3}, β_{0,2} + c_2β_{0,3}, 0)′. Thus for any given choice of u, we obtain the same value for y using either β_0 or β* as the vector of regression coefficients. As a result, there is not a unique true value of the regression parameters, and so it will be impossible to determine how each explanatory variable affects y. This
2For example, consider the returns to education model in Section 1. To estimate this model, we need to sample T individuals from the target population and record their (log) wage, y, and the number of years of education and experience to give X. If we take repeated random samples then we would not expect to obtain the same X in each sample. Assumption CA2 has its origins in controlled experiments in say, agriculture, where y might be crop yield and X would contain information such as amounts of light, water and fertilizer whose values can be controlled in an experimental setting.
type of exact linear relationship between the explanatory variables is known as exact multicollinearity, and is in truth unlikely. More empirically relevant is the case of near multicollinearity in which (at least) one variable is “almost” a linear combination of other explanatory variables, but we do not explore this issue in the course; see Greene (2012)[p.91, 129-131] for further discussion.
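The following sketch (simulated data, purely illustrative and not part of the original notes) shows how an exact linear relationship among the columns of X violates Assumption CA3: the rank of X falls below k and X'X becomes singular, so (X'X)^{-1} – and hence a unique OLS estimator – does not exist.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
x1 = np.ones(T)
x2 = rng.normal(size=T)
x3 = 2.0 * x1 + 3.0 * x2            # exact linear combination of the other columns

X_collinear = np.column_stack([x1, x2, x3])
X_fullrank = np.column_stack([x1, x2, rng.normal(size=T)])

print(np.linalg.matrix_rank(X_collinear))          # 2 < k = 3: Assumption CA3 fails
print(np.linalg.matrix_rank(X_fullrank))           # 3: full column rank
print(np.linalg.det(X_collinear.T @ X_collinear))  # (numerically) zero: X'X is singular
```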
Assumption CA4. E[u] = 0.
Using Definition 2.1, it can be seen that this assumption implies the mean
of the error is zero in each equation.
Assumption CA5. V ar[u] = σ02IT where σ02 is an unknown positive constant.
Using Definition 2.2, it can be seen that Assumption CA5 places two restric- tions on the second moments of the errors. First, it implies V ar[ut] = σ02 and so the variance of the error is the same for each t; this is known as homoscedas- ticity. Second, it implies Cov[ut, us] = 0 for t ̸= s and so the errors are pairwise uncorrelated, meaning the errors from different observations are uncorrelated. The constant σ02 is treated as a parameter of the model.
Assumption CA6. u has a normal distribution.
This assumption states that u has a multivariate normal distribution.³ Note that Assumptions CA5 and CA6 imply u_t and u_s are independent (for t ≠ s). We note that sometimes Assumption CA6 is not included in the list of “Classical Assumptions” because it is not needed to derive the first two moments of the OLS estimator; in such cases normality is only introduced at the point at which the sampling distribution of OLS is derived en route to developing inference procedures for β_0. Here we include this assumption from the outset and we highlight below the role of each assumption in the derivation of the statistical properties of OLS.
To complete this section, we explore the implications of Assumptions CA1- CA6 for y. To this end, we note the following important property of normal random vectors.
Lemma 2.1. Let v be an n × 1 random vector and let v be distributed as N(μ, Ω). Further, let w = c + Cv where c is an m × 1 vector of constants and C is an m × n matrix of constants such that CΩC′ is nonsingular. Then: w ∼ N(c + Cμ, CΩC′).
Lemma 2.1 says in essence that linear combinations of normal random vectors are themselves normally distributed. It then follows from Assumptions CA1-CA6 that y ∼ N(Xβ_0, σ_0²I_T). Thus Xβ_0 represents the mean of y, and so the regression parameters tell us how changes in X affect the mean of y.
³ The normal distribution is also known as the Gaussian distribution after Carl Friedrich Gauss (1777-1855), the German mathematician, who first described the distribution in statistics.
2.2 Ordinary Least Squares Estimation
In scalar notation, OLS estimators of the regression coefficients are the values of β that minimize Q_T(β) = \sum_{t=1}^{T}(y_t - x_t'β)^2. As noted in Section 1.2, it is far easier to characterize what these values of β are if we express the problem in vector/matrix notation. To this end, we introduce the T × 1 vector of functions u(β) = y − Xβ; here β is the argument of u(·) and can take any value in the parameter space B (the set of possible values for the regression parameters).⁴ So, for a given sample, the value of u(β) changes as β changes. Note that it follows from CA1 that u(β_0) = u.

We can then write the OLS minimand equivalently as⁵
\[
Q_T(\beta) = u(\beta)'u(\beta), \qquad (2.1)
\]
and the OLS estimator of β_0 as⁶
\[
\hat{\beta}_T = \arg\min_{\beta \in B} Q_T(\beta). \qquad (2.2)
\]
Defining ∂Q_T(β)/∂β to be the k × 1 vector with ith element ∂Q_T(β)/∂β_i and ∂²Q_T(β)/∂β∂β′ to be the k × k matrix with (i, j)th element ∂²Q_T(β)/∂β_i∂β_j, the first order conditions (FOC) of this minimization are:
\[
\left.\frac{\partial Q_T(\beta)}{\partial \beta}\right|_{\beta=\hat{\beta}_T} = 0; \qquad (2.3)
\]
and the second order conditions are that:⁷
\[
\left.\frac{\partial^2 Q_T(\beta)}{\partial \beta\,\partial \beta'}\right|_{\beta=\hat{\beta}_T} \text{ is a positive definite (p.d.) matrix.} \qquad (2.4)
\]
To obtain the derivatives, we appeal to the following linear algebra results.⁸
Lemma 2.2. Let θ be a p×1 vector, and define h(θ) = b′θ and g(θ) = θ′Bθ where b is a p × 1 vector of constants (that is, they do not depend on θ) and B is a p × p matrix of constants. Then we have: (i) ∂h(θ)/∂θ = b; (ii) ∂g(θ)/∂θ = (B + B′)θ; (iii) ∂2g(θ)/∂θ∂θ′ = B + B′.
Using Lemma 2.2 (i)-(ii) and noting that X'X is symmetric by construction, it follows that
\[
\frac{\partial Q_T(\beta)}{\partial \beta} = -2X'y + 2X'X\beta. \qquad (2.5)
\]

⁴ More formally, u(·) : S × B → R^T where S denotes the sample space for y, X.
⁵ Note: Q_T : S × B → [0, ∞).
⁶ arg min stands for the value of the argument that minimizes the function.
⁷ Recall that an n × n symmetric matrix M is positive definite if for any n × 1 non-null vector z, z'Mz > 0. For further discussion see Greene (2012)[p.1045] or Orme (2009)[Sections 7.1-7.2].
⁸ See Tutorial 1 for derivations.
So the first order conditions can be written as:
\[
X'X\hat{\beta}_T - X'y = 0. \qquad (2.6)
\]
From Assumption CA3 it follows that X'X is nonsingular (why?), and so if we multiply both sides of (2.6) by (X'X)^{-1} then we obtain
\[
\hat{\beta}_T - (X'X)^{-1}X'y = 0,
\]
which yields the well-known formula for the OLS estimator of β_0,
\[
\hat{\beta}_T = (X'X)^{-1}X'y. \qquad (2.7)
\]
The first order conditions of OLS estimation, (2.6), are sometimes referred to as the normal equations.⁹ Using Lemma 2.2 (iii), the Hessian matrix is
\[
\frac{\partial^2 Q_T(\beta)}{\partial \beta\,\partial \beta'} = 2X'X. \qquad (2.8)
\]
This matrix does not depend on β, and we now show it is positive definite under our assumptions. Recall that X'X is positive definite if z'X'Xz > 0 for any z ≠ 0. Defining c = Xz, we have z'X'Xz = c'c = \sum_{t=1}^{T} c_t^2, where c_t is the tth element of c. Therefore, z'X'Xz ≥ 0 by construction and can only be zero if c = 0. So we now consider c. By definition, c = \sum_{j=1}^{k} X_{•,j}z_j, where z_j is the jth element of z, and so c is a linear combination of the columns of X. So for c to be zero, it must be the case that at least one column of X can be written as a linear combination of the others; but this is ruled out by Assumption CA3. Thus c must be non-zero, and so X'X is positive definite.
OLS estimation produces two T × 1 vectors that are important for our subsequent analysis: ŷ = Xβ̂_T, the predicted value of y, and e = y − ŷ, the prediction error or, more commonly, the vector of OLS residuals. These two vectors are linearly unrelated because we can re-write the first order conditions in (2.6) as X'e = 0 and so
\[
\hat{y}'e = \hat{\beta}_T'X'e = 0. \qquad (2.9)
\]
Since
\[
y = \hat{y} + e, \qquad (2.10)
\]
it follows that OLS estimation effects a decomposition of y into two linearly unrelated components. This decomposition mimics our model, which splits y into two parts: a part that can be explained by X, E[y] = Xβ_0, and the part that cannot be explained by X, the error u = y − Xβ_0. The property in (2.9) also implies that OLS effects a similar decomposition of the variation of y about its mean in regression models that include an intercept term. While we postpone a presentation of the linear algebraic arguments behind this variance decomposition to Section 2.12, it is pedagogically useful to briefly describe its nature here. The variation in y about its mean is captured here by a quantity known (in this context) as the total sum of squares (TSS), defined as
\[
TSS = \sum_{t=1}^{T}(y_t - \bar{y})^2. \qquad (2.11)
\]
In Section 2.12, it is established that the variation in y can be decomposed as follows:
\[
TSS = ESS + RSS, \qquad (2.12)
\]
where ESS = \sum_{t=1}^{T}(\hat{y}_t - \bar{y})^2 and RSS = \sum_{t=1}^{T} e_t^2. ESS stands for the Explained Sum of Squares because it equals the variation of the predicted values of y about the mean of y. RSS stands for the Residual Sum of Squares because it equals the variation of the residuals. Therefore, it follows from (2.12) that OLS estimation decomposes the total variation of y about its mean into two parts: ESS, the variation in y that can be explained by regression on X, and RSS, the remainder.

Since the aim of our regression model is to explain y with X, it is clearly desirable to have some measure of how well the model actually does. One such measure is the multiple correlation coefficient, R², defined as:
\[
R^2 = \frac{ESS}{TSS}, \qquad (2.13)
\]
which, from our discussion above, is the proportion of the variation in y explained by linear regression on X. Notice that by construction R² lies between zero and one, with zero implying the regression cannot explain y at all (as ESS = 0) and one implying X perfectly explains y (as RSS = 0). We return to the properties of R² in Section 2.10.

⁹ Recall that we used the normal equations in Section 1.2 to motivate our use of matrix algebra.
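The decomposition in (2.12) and the definition of R² in (2.13) are easy to verify numerically. The sketch below (simulated data, illustrative only and not part of the original notes) computes ŷ, e, TSS, ESS and RSS for a model that includes an intercept and checks that TSS = ESS + RSS.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])   # intercept plus two regressors
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat                  # predicted values
e = y - y_hat                         # OLS residuals

TSS = np.sum((y - y.mean()) ** 2)
ESS = np.sum((y_hat - y.mean()) ** 2)
RSS = np.sum(e ** 2)

print(np.isclose(TSS, ESS + RSS))     # True when an intercept is included
print("R-squared:", ESS / TSS)
```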
We conclude this section by showing that (2.7) contains the formula for the OLS estimators in the simple linear regression model (see equations (1.8) and (1.13) above) as a special case.
Example 2.1. The simple linear regression model is obtained by setting k = 2 and x_{t,1} = 1 in (1.3). Define x_1 = ι_T, a T × 1 vector of ones, and x_2 to be the T × 1 vector of observations on the other explanatory variable, so that X = [ι_T, x_2]. It follows that
\[
X'y = \begin{pmatrix} \iota_T' \\ x_2' \end{pmatrix} y
= \begin{pmatrix} 1, & 1, & \ldots, & 1 \\ x_{2,1}, & x_{2,2}, & \ldots, & x_{2,T} \end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{pmatrix}
= \begin{pmatrix} \sum_{t=1}^{T} y_t \\ \sum_{t=1}^{T} x_{2,t} y_t \end{pmatrix},
\]
and
\[
X'X = \begin{pmatrix} \iota_T' \\ x_2' \end{pmatrix}[\iota_T, x_2]
= \begin{pmatrix} \iota_T'\iota_T & \iota_T'x_2 \\ x_2'\iota_T & x_2'x_2 \end{pmatrix}
= \begin{pmatrix} T & T\bar{x}_2 \\ T\bar{x}_2 & \sum_{t=1}^{T} x_{2,t}^2 \end{pmatrix},
\]
where \bar{x}_2 = T^{-1}\sum_{t=1}^{T} x_{2,t}. Noting that the determinant of X'X can be written using (1.12) as
\[
\det(X'X) = T\sum_{t=1}^{T} x_{2,t}^2 - (T\bar{x}_2)^2 = T\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2,
\]
the matrix inversion formula for a 2 × 2 matrix yields
\[
(X'X)^{-1} = \frac{1}{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2}
\begin{pmatrix} T^{-1}\sum_{t=1}^{T} x_{2,t}^2 & -\bar{x}_2 \\ -\bar{x}_2 & 1 \end{pmatrix}.
\]
Therefore, it follows from (2.7) that
\[
\begin{pmatrix} \hat{\beta}_{T,1} \\ \hat{\beta}_{T,2} \end{pmatrix}
= \frac{1}{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2}
\begin{pmatrix} T^{-1}\sum_{t=1}^{T} x_{2,t}^2 & -\bar{x}_2 \\ -\bar{x}_2 & 1 \end{pmatrix}
\begin{pmatrix} \sum_{t=1}^{T} y_t \\ \sum_{t=1}^{T} x_{2,t} y_t \end{pmatrix}.
\]
Multiplying out and using (1.12), it can be seen that
\[
\hat{\beta}_{T,2} = \frac{\sum_{t=1}^{T} x_{2,t} y_t - T\bar{x}_2\bar{y}}{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2}
= \frac{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)(y_t - \bar{y})}{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2},
\]
which is the formula in (1.13). For the estimated constant, we have
\[
\hat{\beta}_{T,1} = \frac{\bar{y}\sum_{t=1}^{T} x_{2,t}^2 - \bar{x}_2\sum_{t=1}^{T} x_{2,t} y_t}{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2}.
\]
By adding and subtracting T\bar{x}_2^2\bar{y} in the numerator of the previous equation and using (1.12), we obtain
\[
\hat{\beta}_{T,1} = \frac{\bar{y}\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2 - \bar{x}_2\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)(y_t - \bar{y})}{\sum_{t=1}^{T}(x_{2,t} - \bar{x}_2)^2}
= \bar{y} - \bar{x}_2\hat{\beta}_{T,2},
\]
which matches (1.8).
2.3 OLS, partial correlation and the Frisch-Waugh-Lovell Theorem

In this section, we provide an alternative method for calculating the OLS estimator that provides insights into exactly what effect is being captured by the
estimated coefficients. The result rests on the so-called Frisch-Waugh-Lovell (FWL) Theorem.
To present the FWL Theorem, it is necessary to partition the regressor matrix and coefficient vector conformably as follows,
\[
X = (X_1, X_2), \quad \text{and} \quad \beta_0 = (\beta_1', \beta_2')',
\]
where X_1, X_2 are respectively T × k_1 and T × k_2 matrices, and β_1, β_2 are respectively k_1 × 1 and k_2 × 1 vectors, with (for coherence) k = k_1 + k_2. Thus we can write our regression model as
\[
y = X\beta_0 + u = X_1\beta_1 + X_2\beta_2 + u. \qquad (2.14)
\]
Similarly we partition the vector of OLS estimators of β_0 as \hat{\beta}_T = (\hat{\beta}_{T,1}', \hat{\beta}_{T,2}')', where \hat{\beta}_{T,j} is k_j × 1 for j = 1, 2.
Now consider the alternative strategy for estimation of β_2, in which x_{2,l} denotes the lth column of X_2.
Step 1: Regress y on X1 via OLS and denote the associated vector of OLS
residuals by w.
Step 2: For each l = 1,2,…,k2, regress x2,l on X1 via OLS and denote the
associated vector of OLS residuals by dl.
Step 3: Regress w on D, where D = (d_1, d_2, …, d_{k_2}), via OLS and denote the resulting vector of coefficient estimators by b̂, that is, b̂ = (D'D)^{-1}D'w.

We return to the interpretation of the regressions in Steps 1–3 below, following the statement and proof of the FWL theorem.
Theorem 2.1 (FWL). Let β̂_{T,2} be the OLS estimator of β_2 based on (2.14) and b̂ be the vector defined in Step 3 above; then we have β̂_{T,2} = b̂.
Frisch and Waugh (1933) establish this result for the case where X1 is a time trend; Lovell (1963) shows the result holds for any choice of X1. It should be noted that the analogous result holds for βˆ1 if the roles of X1 and X2 are reversed in Steps 1-3 above. To prove Theorem 2.1, we need to introduce a very useful result regarding the inversion of partitioned matrices and also delve more deeply into the mathematical structure of the formula for OLS residuals. First, we present the partitioned matrix inversion result.
Lemma 2.3. Let A be the partitioned matrix
\[
A = \begin{pmatrix} A_{1,1} & A_{1,2} \\ A_{2,1} & A_{2,2} \end{pmatrix},
\]
where A_{i,j} is an m_i × m_j matrix. If A, A_{1,1} and A_{2,2} are nonsingular matrices then it follows that:
\[
A^{-1} = B = \begin{pmatrix} B_{1,1} & B_{1,2} \\ B_{2,1} & B_{2,2} \end{pmatrix},
\]
where
\[
\begin{aligned}
B_{1,1} &= (A_{1,1} - A_{1,2}A_{2,2}^{-1}A_{2,1})^{-1},\\
B_{1,2} &= -A_{1,1}^{-1}A_{1,2}(A_{2,2} - A_{2,1}A_{1,1}^{-1}A_{1,2})^{-1} = -(A_{1,1} - A_{1,2}A_{2,2}^{-1}A_{2,1})^{-1}A_{1,2}A_{2,2}^{-1},\\
B_{2,1} &= -A_{2,2}^{-1}A_{2,1}(A_{1,1} - A_{1,2}A_{2,2}^{-1}A_{2,1})^{-1} = -(A_{2,2} - A_{2,1}A_{1,1}^{-1}A_{1,2})^{-1}A_{2,1}A_{1,1}^{-1},\\
B_{2,2} &= (A_{2,2} - A_{2,1}A_{1,1}^{-1}A_{1,2})^{-1}.
\end{aligned}
\]
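Lemma 2.3 can also be checked numerically. The following sketch (illustrative only; the matrix is randomly generated and not taken from the notes) builds the four blocks of B from the formulas in the lemma and compares the result with a direct inversion of A.

```python
import numpy as np

rng = np.random.default_rng(3)
m1, m2 = 2, 3
M = rng.normal(size=(m1 + m2, m1 + m2))
A = M @ M.T + np.eye(m1 + m2)         # symmetric positive definite, hence A, A11, A22 nonsingular

A11, A12 = A[:m1, :m1], A[:m1, m1:]
A21, A22 = A[m1:, :m1], A[m1:, m1:]

inv = np.linalg.inv
B11 = inv(A11 - A12 @ inv(A22) @ A21)
B22 = inv(A22 - A21 @ inv(A11) @ A12)
B12 = -inv(A11) @ A12 @ B22
B21 = -inv(A22) @ A21 @ B11

B = np.block([[B11, B12], [B21, B22]])
print(np.allclose(B, inv(A)))         # True: the partitioned formula matches the direct inverse
```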
The proof of this result is left as an exercise on Tutorial 1. The second result we need in order to prove the FWL involves the vector of OLS residuals. Using the definition of e and (2.7), we have
e = y − XβˆT = y − X(X′X)−1X′y = (IT −P)y, (2.15)
where P = X(X′X)−1X′. At this point, it is useful to highlight three important properties of the matrix IT −P: (i) it is symmetric that is, IT −P = (IT −P)′; (ii) it equals its square that is, IT −P = (IT −P)2; (iii) (IT −P)X = 0. (Check these for yourselves.10)
Proof of Theorem 2.1: Using the partitions of X and β̂_T, equation (2.7) can be written as:
\[
\begin{pmatrix} \hat{\beta}_{T,1} \\ \hat{\beta}_{T,2} \end{pmatrix}
= \begin{pmatrix} X_1'X_1 & X_1'X_2 \\ X_2'X_1 & X_2'X_2 \end{pmatrix}^{-1}
\begin{pmatrix} X_1'y \\ X_2'y \end{pmatrix}.
\]
Using the partitioned matrix inversion formula in Lemma 2.3, we have:
\[
\hat{\beta}_{T,2} = -(X_2'M_1X_2)^{-1}X_2'X_1(X_1'X_1)^{-1}X_1'y + (X_2'M_1X_2)^{-1}X_2'y, \qquad (2.16)
\]
where to condense notation we set M_1 = I_T − X_1(X_1'X_1)^{-1}X_1'. Noticing that both terms on the right-hand side of (2.16) begin with (X_2'M_1X_2)^{-1}X_2' and end with y, we can collect terms to obtain:
\[
\begin{aligned}
\hat{\beta}_{T,2} &= (X_2'M_1X_2)^{-1}X_2'\{I_T - X_1(X_1'X_1)^{-1}X_1'\}y \\
&= (X_2'M_1X_2)^{-1}X_2'M_1y. \qquad (2.17)
\end{aligned}
\]
Now consider M_1y. Notice that M_1 can be obtained from I_T − P upon replacing X by X_1, and so M_1y is the OLS residual vector from the regression of y on X_1, that is, M_1y = w. Now consider M_1X_2: this is a T × k_2 matrix whose lth column is M_1x_{2,l}. By a similar argument as for M_1y, M_1x_{2,l} is the OLS residual vector from the regression of x_{2,l} on X_1, that is, M_1x_{2,l} = d_l. Therefore, we have M_1X_2 = D. A further implication of the common structure of M_1 and I_T − P is that the former shares the latter's properties highlighted in (i)-(iii) under (2.15). As a result we can re-write (2.17) as
\[
\begin{aligned}
\hat{\beta}_{T,2} &= (X_2'M_1'M_1X_2)^{-1}X_2'M_1'M_1y \\
&= (\{M_1X_2\}'\{M_1X_2\})^{-1}\{M_1X_2\}'\{M_1y\} \\
&= (D'D)^{-1}D'w = \hat{b}. \quad ⋄
\end{aligned}
\]

¹⁰ For further discussion of P and I_T − P and their properties see Section 2.12.
The FWL theorem can be used to provide an interpretation of the OLS estimator of any individual coefficient. For ease of presentation, we consider the case where X_1 consists of the first k − 1 columns of X, that is, X_1 = [x_1, x_2, . . . , x_{k−1}], and X_2 is the vector of observations on the kth explanatory variable, that is, X_2 = x_k. In this case k_2 = 1 and so D = d_1 = M_1x_k. It follows that w and D represent the parts of y and of x_k that cannot be linearly explained by X_1. Thus the model in Step 3 captures the relationship between y and x_k once they have both been purged of any linear dependence they have on X_1. If b̂ = 0 then any relationship between y and x_k can be accounted for by their joint dependence on X_1. However, if b̂ is non-zero then this implies y and x_k are linearly related to an extent beyond that which can be explained by their joint dependence on X_1. Thus, β̂_{T,k} reflects the partial effect, that is, the unique contribution – relative to the other regressors in the model – of x_k to the explanation of y. The operations in Steps 1 and 2 are often referred to as “partialling out” the effect of X_1. The regression in Step 3 is said to capture the relationship between y and x_k controlling for X_1.

In simple linear regression, the OLS slope coefficient is proportional to the correlation between the dependent variable and regressor. In the more general k-regressor model, it follows from the FWL that (in general) the individual OLS coefficients reflect the partial correlation between the variables in question rather than the correlation per se.
Definition 2.4. Suppose that there is a sample t = 1, 2, …, T on scalars a_t and b_t and a vector c_t, where c_t includes the intercept. Let {e(a|c)_t} and {e(b|c)_t} denote the OLS residuals from the regressions of a_t on c_t and of b_t on c_t respectively. The sample partial correlation between a and b given c, ρ̂_{a,b|c}, is the sample correlation between e(a|c)_t and e(b|c)_t.
Continuing with our example from above, it follows from the FWL that βˆk is proportional to the sample partial correlation between y and xk given X1.
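The FWL Theorem is straightforward to illustrate numerically. The sketch below (simulated data, illustrative only and not part of the original notes) compares the coefficient on x_2 from the full regression with the coefficient obtained via the three-step “partialling out” route described above.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 300
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])    # includes the intercept
x2 = rng.normal(size=T) + 0.5 * X1[:, 1]                  # correlated with a column of X1
X = np.column_stack([X1, x2])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=T)

# Full regression: coefficient on x2 from OLS of y on (X1, x2)
beta_full = np.linalg.solve(X.T @ X, X.T @ y)

# FWL route: partial out X1 from both y and x2, then regress residual on residual
M1 = np.eye(T) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
w, d = M1 @ y, M1 @ x2
b_fwl = (d @ w) / (d @ d)

print(np.isclose(beta_full[-1], b_fwl))   # True: both give the same estimate of the x2 coefficient
```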
The FWL theorem also has implications for the specification of regression models in cases where it is the relationship of y to only a subset of the explanatory variables that is of most interest. To elaborate, we focus on the situation in which it is the relationship between y and a single explanatory
variable, h say, that is of most interest. If we estimate the simple regression model
y = α+γh+error
then a concern is that important variables have been omitted and so γ captures not only the effects of h on y but also (at least partially) the effects of the omitted variables on y. As a result, inference is based on a model
y = α+γh+Zδ+error,
where Z contains other variables thought to help explain y. In this context, Z are often referred to as control variables because their inclusion is solely to ensure a more accurate estimate of γ, the coefficient on h, rather than because δ is of interest. The following example illustrates the importance of control variables in evaluations of the impact of policy changes.
Example 2.2. In the mid-1980s there were two changes to federal highway regulations in the US: in January 1986, a law was passed that made it mandatory to wear a seat belt, and from May 1987, states were allowed to raise the highway speed limit from 55mph to 65mph. McCarthy (1994) investigates whether these changes had an impact on highway safety using monthly observations for California spanning January 1981 through December 1989.¹¹ The dependent variable, y_t, is defined as the percentage of highway accidents that resulted in one or more fatality in period (year.month) t. Summary statistics for y_t reveal that this percentage is low: the sample mean is 0.886, the sample standard deviation is 0.010, and the minimum and maximum values of y_t in the sample are respectively 0.702 and 1.217. The law changes are modeled via two dummy variables: belt_t takes the value one for t ≥ 1986.1; mph_t takes the value one for t ≥ 1987.5.
If we regress yt on just an intercept and the two law-change dummies, the results are:
yˆt = 0.914 − 0.102 ∗ beltt + 0.057 ∗ mpht . (2.18)
Both coefficients on the dummies have the anticipated sign. To interpret the magnitudes, note that the mean number of total road accidents per month is approximately 45,000 during the sample period. So these estimated coefficients suggest non-trivial impacts of the law changes. However, this regression model implies that the expected value of y only takes three different values throughout the sample, which seems implausible. In particular, there may be variations across months, reflecting weather variations amongst other factors. There may also be systematic changes over time due, for example, to increases in the number of cars on the road or growing awareness of road safety. These features can be captured by introducing eleven dummy variables indicating the month and also a linear time trend. The resulting estimated model is (where for ease of presentation we include only the estimates of the law-change dummies):
ŷt = controls − 0.014 ∗ beltt + 0.078 ∗ mpht. (2.19)

¹¹ Here we consider a simplified version of his model.
Interestingly this reduces the effect of the seat belt law change but increases the effect of the speed limit increase. Finally, we introduce two further control variables: the state unemployment rate which reflects economic activity; the number of weekend days (Friday-Sunday) in the month as driving patterns may be different at weekends as opposed to weekdays. With these last two controls also included, the estimated regression model is:
yˆt = controls − 0.030 ∗ beltt + 0.067 ∗ mpht. (2.20)
The multiple correlation coefficient from this model is R² = 0.72 and so the model explains approximately 70% of the variation in y_t. These results suggest that, ceteris paribus, the introduction of the seat belt law led to a reduction in the percentage of accidents involving a fatality by 0.03 and the increase in the speed limit increased the percentage of accidents involving a fatality by approximately 0.07. Taken at face value, the results imply these law changes had non-trivial impacts. In subsequent sections, we introduce methods to assess whether these estimated impacts are statistically significant.
However, in some cases, the inclusion of control variables can be counterproductive. For example, consider the case in which it is desired to investigate the impact of climate variables on the economy using a cross-sectional data set on countries for a particular year. For simplicity assume climate is captured by a single variable and consider inference based on the simple linear regression model
yi = α + γCi + error, (2.21)
where yi is some measure of economic activity in country i and Ci is the climate variable for country i. Examples of variables used in the literature are: for yi, national income or agricultural output; for Ci, precipitation or temperature; see Dell, Jones, and Olken (2014). Analogous reasoning to the previous example might lead to the introduction of control variables, zi say, so that inference is based on
yi = α+γCi +zi′δ+error. (2.22)
However, common choices for controls – such as institutional measures or popu- lation – may themselves be dependent on climate. In this case, the “partialling out” eliminates any linear association between y and Ci that can be attributed to their both being correlated with the controls. As a result, the OLS estimate of γ in (2.22) may not be a more accurate measure of the effect of Ci on yi than the OLS estimator of γ based on (2.21). This scenario is often referred to as the “over-controlling” problem.
2.4 The sampling distribution of the OLS estimator
2.4.1 Background
Within the Classical statistics paradigm, the desired finite sample properties of an estimator, θ̂_T, of an unknown p × 1 parameter vector, θ_0, are: (i) it should be unbiased, that is, E[θ̂_T] = θ_0; (ii) it should be efficient, that is, θ̂_T should have minimum variance within the class of unbiased estimators of θ_0.
If p = 1 – and so θ0 is a scalar – the requirement that θˆT has minimum variance is straightforward to interpret because it means V ar[θˆT ] ≤ V ar[θ ̃T ], where θ ̃T is any other unbiased estimator of θ0. However, it is less obvious how to interpret this requirement when p > 1. As we now discuss, the interpretation rests on considering estimation of linear combinations of θ0.
So we now suppose that we are interested in estimation of the linear combination of the parameters, c'θ_0, where c is a p × 1 vector of specified constants. Let θ̂_T and θ̃_T be two unbiased estimators of θ_0, and consider the two associated estimators of c'θ_0, c'θ̂_T and c'θ̃_T. Since c is a constant vector, it follows that c'θ̂_T and c'θ̃_T are unbiased for c'θ_0.¹²

Since c'θ_0 is a scalar, c'θ̂_T is a more efficient estimator of c'θ_0 than c'θ̃_T if
\[
Var[c'\hat{\theta}_T] < Var[c'\tilde{\theta}_T]. \qquad (2.23)
\]
Our ranking of the relative efficiency of θ̂_T and θ̃_T comes from comparing Var[c'θ̂_T] and Var[c'θ̃_T] for all possible values of c. θ̂_T is said to be more efficient than θ̃_T if (2.23) holds for at least one choice of c and Var[c'θ̃_T] is never less than Var[c'θ̂_T] no matter the choice of c.

This idea is expressed formally in the following definition.

Definition 2.5. Let θ̂_T and θ̃_T be two unbiased estimators of the p × 1 parameter vector, θ_0. Then θ̂_T is efficient relative to θ̃_T if
\[
Var[c'\hat{\theta}_T] \leq Var[c'\tilde{\theta}_T] \text{ for all } c \text{ s.t. } c'c < \infty, \qquad (2.24)
\]
and the inequality is strict for at least one choice of c.
Definition 2.6. θˆT is the minimum variance unbiased estimator if (2.24) holds
for any other unbiased estimator, θ ̃T .
We can derive a more primitive condition for this to be true.13
Proposition 2.1. Let θˆT and θ ̃T be unbiased estimators of the p × 1 parameter vector, θ0 . Then θˆT is efficient relative to θ ̃T if V ar[θ ̃T ] − V ar[θˆT ] is a positive semi-definite (p.s.d.) matrix.
¹² We have E[c'θ̂_T] = c'E[θ̂_T] = c'θ_0 because θ̂_T is an unbiased estimator for θ_0, with a similar argument for c'θ̃_T.
¹³ Recall that an n × n symmetric matrix M is positive semi-definite if for any n × 1 non-null vector z, z'Mz ≥ 0.
Proof of Proposition 2.1: Recall that Var[c'θ̂_T] = E[(c'θ̂_T − E[c'θ̂_T])²]. Noting that
\[
c'\hat{\theta}_T - E[c'\hat{\theta}_T] = c'\hat{\theta}_T - c'E[\hat{\theta}_T] = c'(\hat{\theta}_T - E[\hat{\theta}_T])
\]
and so
\[
(c'\hat{\theta}_T - E[c'\hat{\theta}_T])^2 = \{c'(\hat{\theta}_T - E[\hat{\theta}_T])\}^2 = c'(\hat{\theta}_T - E[\hat{\theta}_T])(\hat{\theta}_T - E[\hat{\theta}_T])'c,
\]
we have
\[
\begin{aligned}
Var[c'\hat{\theta}_T] &= E\big[c'(\hat{\theta}_T - E[\hat{\theta}_T])(\hat{\theta}_T - E[\hat{\theta}_T])'c\big] \\
&= c'E\big[(\hat{\theta}_T - E[\hat{\theta}_T])(\hat{\theta}_T - E[\hat{\theta}_T])'\big]c \\
&= c'Var[\hat{\theta}_T]c.
\end{aligned}
\]
Similar logic yields Var[c'θ̃_T] = c'Var[θ̃_T]c. Therefore the condition in (2.24) is equivalent to
\[
c'\{Var[\tilde{\theta}_T] - Var[\hat{\theta}_T]\}c \geq 0.
\]
For this condition to hold for any c, Var[θ̃_T] − Var[θ̂_T] must be positive semi-definite. ⋄

2.4.2 Sampling distribution of β̂_T
We derive the sampling distribution in stages to highlight the roles of different assumptions. First, we derive the mean and variance of βˆT and show βˆT satisfies the unbiasedness and efficiency properties described above. We then derive the sampling distribution of βˆT .
The key to all these derivations is an alternative representation for the OLS estimator that involves X and u. Using Assumption CA1 in (2.7), we obtain:
\[
\hat{\beta}_T = (X'X)^{-1}X'(X\beta_0 + u) = \beta_0 + (X'X)^{-1}X'u. \qquad (2.25)
\]
First consider E[β̂_T]. It follows from (2.25) that
\[
E[\hat{\beta}_T] = E\big[\beta_0 + (X'X)^{-1}X'u\big].
\]
From Assumption CA2, this expectation can be written as:¹⁴
\[
E[\hat{\beta}_T] = \beta_0 + (X'X)^{-1}X'E[u],
\]
and so, using Assumption CA4, we have:
\[
E[\hat{\beta}_T] = \beta_0. \qquad (2.26)
\]

¹⁴ Recall that if v is an n × 1 random vector, c is an m × 1 vector of constants and C is an m × n matrix of constants then E[c + Cv] = c + CE[v].
Therefore, the OLS estimator is unbiased and satisfies the first of our desirable requirements in Section 2.4.1. Notice that our derivation of unbiasedness only involves Assumption CA4 and not Assumptions CA5 and CA6. Therefore, the OLS estimator is unbiased provided only that E[u] = 0 irrespective of other aspects of the error distribution.
We now consider Var[β̂_T]. Recall from Definition 2.2 that
\[
Var[\hat{\beta}_T] = E\Big[\big(\hat{\beta}_T - E[\hat{\beta}_T]\big)\big(\hat{\beta}_T - E[\hat{\beta}_T]\big)'\Big].
\]
Using E[β̂_T] = β_0 and (2.25), it follows that:
\[
Var[\hat{\beta}_T] = E\big[(X'X)^{-1}X'uu'X(X'X)^{-1}\big]. \qquad (2.27)
\]
Using Assumption CA2, it follows from (2.27) that:¹⁵
\[
Var[\hat{\beta}_T] = (X'X)^{-1}X'E[uu']X(X'X)^{-1}. \qquad (2.28)
\]
Using Assumption CA4, it follows from Definition 2.2 that E[uu'] = Var[u]. It then follows from Assumption CA5 and (2.28) that:¹⁶
\[
Var[\hat{\beta}_T] = (X'X)^{-1}X'\sigma_0^2 I_T X(X'X)^{-1} = \sigma_0^2(X'X)^{-1}X'X(X'X)^{-1} = \sigma_0^2(X'X)^{-1}. \qquad (2.29)
\]
Notice that our derivation of the variance depends only on Assumptions CA4 and CA5 but not Assumption CA6. So the formula in (2.29) holds for any error distribution with mean zero and variance-covariance matrix σ02IT .
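A small Monte Carlo experiment (illustrative only; all numbers below are hypothetical and not from the notes) can be used to verify (2.26) and (2.29): holding X fixed across replications, the simulated mean of β̂_T is close to β_0 and its simulated variance-covariance matrix is close to σ_0²(X'X)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(5)
T, k, sigma0 = 60, 3, 1.5
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # X held fixed across replications (CA2)
beta0 = np.array([1.0, 0.5, -0.2])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 20000
draws = np.empty((reps, k))
for r in range(reps):
    u = sigma0 * rng.normal(size=T)          # errors satisfying CA4-CA6
    y = X @ beta0 + u
    draws[r] = np.linalg.solve(X.T @ X, X.T @ y)

print(draws.mean(axis=0))                    # close to beta0 (unbiasedness, (2.26))
print(np.cov(draws, rowvar=False))           # close to sigma0^2 * (X'X)^{-1}  (equation (2.29))
print(sigma0 ** 2 * XtX_inv)
```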
We now establish that the OLS estimator is minimum variance within a certain class of estimators. This result is known as the Gauss-Markov Theorem.

Theorem 2.2 (Gauss-Markov Theorem). Under Assumptions CA1-CA5, OLS is the Best Linear (in y) Unbiased Estimator (BLUE) of β_0 in the sense that Var[β̃_T] − Var[β̂_T] is positive semi-definite for any other estimator β̃_T that is both linear (in y) and unbiased for β_0.

We discuss the implications of the result before providing a proof. The Gauss-Markov Theorem states that OLS is the minimum variance estimator within the class of linear unbiased estimators. Linear in y means that the estimator can be written as Dy for some k × T matrix of constants D (for OLS, D = (X'X)^{-1}X'). Notice that the conditions of the theorem do not include Assumption CA6 and so this property holds for any error distribution with a mean of zero and variance-covariance matrix equal to σ_0²I_T.

¹⁵ Recall that if V is an n × n random matrix and C is an m × n matrix of constants then E[CVC'] = CE[V]C'.
¹⁶ Recall: (i) if a matrix is multiplied by a scalar then every element of the matrix is multiplied by the scalar, and so if A and B are conformable matrices and c is a scalar then cAB = AcB = ABc; (ii) I_T X = X by the definition of the identity matrix.
Proof of Theorem 2.2: To establish this result, we need to show that V ar[β ̃T ] − V ar[βˆT ] is a positive semi-definite matrix for any β ̃T satisfying the conditions of the theorem. To this end, we introduce a representation for β ̃T that will turn out to provide a very convenient form for V ar[β ̃T ].
Since β ̃T is linear in y, we have β ̃T = Dy for some choice of D. Whatever the choice of D we can always find a k × T matrix of constants C such that D = (X′X)−1X′ + C. This representation for β ̃T is convenient because the unbiasedness of the estimator puts a constraint on C as we shall see below. Using Assumption CA1, we have
\[
\tilde{\beta}_T = D(X\beta_0 + u) = (X'X)^{-1}X'X\beta_0 + CX\beta_0 + Du = \beta_0 + CX\beta_0 + Du. \qquad (2.30)
\]
Taking the expectation of both sides of the previous equation and as D, X, C and β_0 are constants, we obtain
\[
E[\tilde{\beta}_T] = \beta_0 + CX\beta_0 + DE[u],
\]
which from Assumption CA4 yields
\[
E[\tilde{\beta}_T] = \beta_0 + CX\beta_0.
\]
Since we know by assumption that E[β̃_T] = β_0, it must follow that CX = 0 (as unbiasedness must hold for any value of β_0).

We now turn to Var[β̃_T]. From (2.30), E[β̃_T] = β_0 and CX = 0, it follows that β̃_T − E[β̃_T] = Du, and so
\[
Var[\tilde{\beta}_T] = E[Duu'D'].
\]
Since D is a constant, we can use similar arguments as in our derivation of Var[β̂_T] above to deduce that Var[β̃_T] = σ_0²DD'. Multiplying out, we have:
\[
\begin{aligned}
DD' &= \{(X'X)^{-1}X' + C\}\{(X'X)^{-1}X' + C\}' \\
&= \{(X'X)^{-1}X' + C\}\{X(X'X)^{-1} + C'\} \\
&= (X'X)^{-1}X'X(X'X)^{-1} + (X'X)^{-1}X'C' + CX(X'X)^{-1} + CC'.
\end{aligned}
\]
Since (X'X)^{-1}X'X = I_k and CX = 0 (which implies X'C' = (CX)' = 0'), we have
\[
DD' = (X'X)^{-1} + CC', \qquad (2.31)
\]
and so
\[
Var[\tilde{\beta}_T] = \sigma_0^2\{(X'X)^{-1} + CC'\}.
\]
Using (2.29) and (2.31), it follows that
\[
Var[\tilde{\beta}_T] - Var[\hat{\beta}_T] = \sigma_0^2 CC'.
\]
Since σ_0² > 0 by definition, we conclude the proof by establishing that CC' is p.s.d. To this end, let z be any non-null k × 1 vector, and consider the quadratic form z'CC'z. If we set f = C'z and let f_t be the tth element of f then z'CC'z = f'f = \sum_{t=1}^{T} f_t^2 \geq 0 by construction. So CC' is p.s.d., which establishes the desired result. ⋄
So far, we have derived the first two moments of β̂_T. We now complete the characterization of its sampling distribution. Using Assumptions CA2 and CA6, it follows from (2.25) that β̂_T is a linear function of a normal random vector and so, using Lemma 2.1, we have the following result.

Theorem 2.3. If Assumptions CA1-CA6 hold then: β̂_T ∼ N(β_0, σ_0²(X'X)^{-1}).

Theorem 2.3 implies that the marginal sampling distribution of an individual coefficient, β̂_{T,i} say, is N(β_{0,i}, σ_0²m_{i,i}), where m_{i,i} is the ith main diagonal element of (X'X)^{-1}, and so
\[
\tau_{T,i} = \frac{\hat{\beta}_{T,i} - \beta_{0,i}}{\sigma_0\sqrt{m_{i,i}}} \sim N(0, 1). \qquad (2.32)
\]
In principle, we could use this result to create a confidence interval for β0,i but for the resulting interval to be feasible, σ02 would need to be known. In general, this will not be the case and σ02 is treated as a parameter of the model to be estimated from the sample data – σ02 is often referred to as a “nuisance parameter” because it is not of interest in its own right but has to be estimated in order for us to able to make inference about the parameters we are interested in (β0 here). In the next section, we consider the OLS estimator of σ02 and its properties.
2.5 OLS estimator of σ_0²

The OLS estimator of σ_0² is:
\[
\hat{\sigma}_T^2 = \frac{e'e}{T - k}, \qquad (2.33)
\]
where, as a reminder, e is the vector of OLS residuals introduced in (2.10) in Section 2.2. Since the residuals are, in effect, the sample analogs to the errors, it is unsurprising that the estimator of the error variance is based on the residual sum of squares. However, the choice of denominator in (2.33) is less obvious: it is justified by the fact that E[σˆT2 ] = σ02 as we now demonstrate.
In order to do so, we need to introduce the following scalar quantity associ- ated with any square matrix.
Definition 2.7. Let N be an l × l matrix with typical element n_{i,j}. The trace of N, denoted tr(N), is defined to be: tr(N) = \sum_{i=1}^{l} n_{i,i}.¹⁷

¹⁷ See Orme (2009)[p.23], Greene (2012)[p.1039-40].
Thus the trace of the matrix is the sum of the main diagonal elements. It has three properties relevant to our discussion here. The first two are obvious by inspection: if l = 1 (and so N is a scalar) then tr(N) = N; tr(N) is a linear function of the elements of N. The third property is stated in the following lemma.18
Lemma 2.4. Let A and B be respectively p × q and q × p matrices. Then tr(AB) = tr(BA).
To demonstrate σˆT2 is an unbiased estimator, we consider E[e′e]. Using the definition of e and (2.7), we have
e = y−XβˆT = y−X(X′X)−1X′y = (IT −P)y,
where P = X(X′X)−1X′. Recalling the properties of IT −P highlighted in the
previous section,19 we have
e = (IT −P)y = (IT −P)(Xβ0 +u) = (IT −P)u.
This, in turn, along with the first two properties of IT − P implies that:
e′e = u′(IT − P)′(IT − P)u = u′(IT − P)u. (2.34)
Since e′e is a scalar, we can rewrite the previous equation as e′e = tr{u′(IT −P)u}.
At this stage, we can apply Lemma 2.4 – with A = u' and B = (I_T − P)u – to
e′e = tr{(IT −P)uu′}. Taking expectations of both sides, we have
E[e′e] = E[tr{(IT −P)uu′}].
Since the trace is a linear function of the elements of the matrix in question, it
follows that E [ tr( · ) ] = tr ( E[ · ] ) and so,
E[e′e] = tr{E[(IT −P)uu′]} = tr{(IT −P)E[uu′]},
where the last equality follows from Assumption CA2. Using Assumption CA5,
we obtain
E[e′e] = σ02tr(IT −P). (2.35)
Since the trace is a linear function, we have tr(IT − P ) = tr(IT ) − tr(P ). By definition of the identity matrix, it follows that tr(IT ) = T ; using Lemma 2.4,
we have
\[
tr(P) = tr\{(X'X)^{-1}X'X\} = tr(I_k) = k.
\]

¹⁸ See Tutorial 2 for the proof.
¹⁹ These are: (i) I_T − P = (I_T − P)'; (ii) I_T − P = (I_T − P)²; (iii) (I_T − P)X = 0.
So using these results in (2.35), we obtain E[e′e] = σ02(T − k) from which it follows that E[σˆT2 ] = σ02.
While the above demonstrates that T − k is the appropriate divisor of the residual sum of squares, it does not deliver an intuition for why this is the case. Since e contains T elements, it might be wondered why division by T does not yield an unbiased estimator. The answer is that the first order conditions in (2.6) can be written as X′e = 0 and so place k restrictions on e. Therefore, if we are given X and T −k values for e then there is no uncertainty about the remaining k values of the residuals because their values are automatically determined by the restrictions in the first order conditions. Thus only T − k elements of the residuals are “free”, and it is for this reason that T − k is appropriate divisor of RSS to produce an unbiased estimator of σ02. For this reason also T − k is referred to as the degrees of freedom in this context.
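The key facts used in this section – X'e = 0, tr(P) = k and the division by T − k in (2.33) – can be confirmed numerically. The sketch below (simulated data, illustrative only and not part of the original notes) does so.

```python
import numpy as np

rng = np.random.default_rng(6)
T, k = 120, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 0.3, -0.7, 0.2]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat                          # OLS residuals
P = X @ np.linalg.solve(X.T @ X, X.T)

print(np.allclose(X.T @ e, 0))                # normal equations: X'e = 0 (k restrictions on e)
print(np.isclose(np.trace(P), k))             # tr(P) = k, so tr(I_T - P) = T - k
sigma2_hat = (e @ e) / (T - k)                # unbiased estimator in (2.33)
print(sigma2_hat)
```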
2.6 Confidence intervals for β0,i
Let us now return to the problem of constructing a confidence interval for β_{0,i} based on β̂_{T,i}. Rather than work with τ_{T,i} in (2.32), we derive the feasible confidence interval from the statistic
\[
\hat{\tau}_{T,i} = \frac{\hat{\beta}_{T,i} - \beta_{0,i}}{\hat{\sigma}_T\sqrt{m_{i,i}}}, \qquad (2.36)
\]
where σ̂_T = \sqrt{\hat{\sigma}_T^2}. The quantity σ̂_T\sqrt{m_{i,i}} is known as the standard error of β̂_{T,i}, and is the estimated standard deviation of the estimator.

Notice that τ̂_{T,i} can be formed from τ_{T,i} by replacing the unknown σ_0 by σ̂_T. The key question is whether this substitution has affected the distribution, and the answer is yes. To see why, note that τ̂_{T,i} = τ_{T,i} × (σ_0/σ̂_T). Since σ̂_T² is random, σ_0/σ̂_T is a random variable: so τ̂_{T,i} is the product of a standard normal random variable (τ_{T,i}) and a random variable σ_0/σ̂_T. Such a product does not have a standard normal distribution.²⁰ In the appendix to this chapter (Section 2.13), we provide a derivation of the distribution of σ̂_T². Here we focus on the statistic of interest τ̂_{T,i}, the distribution of which is given in the following theorem.

Theorem 2.4. If Assumptions CA1-CA6 hold then τ̂_{T,i} has a Student's t-distribution with T − k degrees of freedom.

This result can be used to construct a confidence interval for β_{0,i}. From Theorem 2.4 it follows that
\[
P\big(|\hat{\tau}_{T,i}| < \tau_{T-k}(1 - \alpha/2)\big) = 1 - \alpha, \qquad (2.37)
\]

²⁰ The key difference between the situation here and the result in Lemma 2.1 is that in this lemma c and C are constants, whereas here the normal random variable is multiplied by a random quantity.
where τT−k(1 − α/2) is the 100(1 − α/2)th percentile of Student's t distribution with T − k df. The form of the confidence interval comes from an alternative representation of the event |τˆT,i| < τT−k(1 − α/2). From the definition of τˆT,i, this event can be written
|βˆT,i − β0,i| / s.e.(βˆT,i) < τT−k(1−α/2),
where (to condense our notation) we set s.e.(βˆT,i) = σˆT √mi,i. Multiplying both sides of the previous inequality by the standard error, this event is equivalent to |βˆT,i − β0,i| < τT−k(1−α/2) s.e.(βˆT,i), which is the same as
−τT−k(1−α/2)s.e.(βˆT,i) < βˆT,i − β0,i < τT−k(1−α/2)s.e.(βˆT,i).
Subtracting βˆT,i from each part of these inequalities and multiplying the result by minus one, we obtain the following event as being equivalent to | τˆT ,i | < τT−k(1−α/2),
βˆT,i − τT−k(1 − α/2)s.e.(βˆT,i) < β0,i < βˆT,i + τT−k(1 − α/2)s.e.(βˆT,i). Therefore from (2.37), we have
P( βˆT,i − τT−k(1−α/2) s.e.(βˆT,i) < β0,i < βˆT,i + τT−k(1−α/2) s.e.(βˆT,i) ) = 1−α,   (2.38)
and so
( βˆT,i − τT−k(1−α/2) s.e.(βˆT,i),  βˆT,i + τT−k(1−α/2) s.e.(βˆT,i) )   (2.39)
represents a 100(1−α)% confidence interval for β0,i. Notice that the boundaries of this interval are random. The “confidence level” represents 100 times the probability (i.e. the relative frequency in an infinite number of repeated samples of y, X) that an interval calculated in this way contains the true parameter value. To illustrate this confidence interval, we return to the traffic example considered in Example 2.2.
Example 2.3. Let βˆbelt and βˆmph denote the OLS estimators of the coefficients on beltt and mpht in (2.20), and let βbelt,0 and βmph,0 denote the coefficients on these two variables in the analogous population regression model. To calculate a confidence interval in (2.39) for these population parameters, we need the standard errors and appropriate percentile, τT−k(1 − α/2). The standard errors are: se(βˆbelt) = 0.023 and se(βˆmph) = 0.021. The degrees of freedom are T − k = 108 − 17 = 91 and so for the 95% confidence interval, the appropriate percentile is τ91(0.975) = 1.986. This yields a 95% confidence interval for βbelt,0 of (−0.076, 0.017) which contains not only negative values (like the point estimate) but also zero and positive values. This suggests that the data are consistent with the introduction of the seat belt law having either a positive or a negative effect or no effect on fatalities. In contrast, a 95% confidence interval for βmph,0 is (0.026, 0.110) which contains only positive values, and so the data are consistent with the view that relaxing the maximum speed limit led to more fatalities in traffic accidents.
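As a computational aside, the interval in (2.39) is straightforward to construct from the OLS output. The following MATLAB sketch uses simulated data purely so that it is self-contained; with real data y and X would be the observed sample (for instance, the data underlying Example 2.3), and the numbers below are illustrative rather than those of the traffic example.

% Illustrative computation of the 100(1-alpha)% confidence intervals in (2.39).
rng(1); T = 108; k = 4; alpha = 0.05;      % values chosen for illustration
X = [ones(T,1), randn(T,k-1)];             % in practice: the observed regressors
y = X*[1; 0.5; -0.2; 0.1] + 0.5*randn(T,1);% in practice: the observed y
betahat = (X'*X)\(X'*y);                   % OLS estimates
e = y - X*betahat;
sigma2hat = (e'*e)/(T-k);                  % unbiased estimator of sigma_0^2
se = sqrt(sigma2hat*diag(inv(X'*X)));      % standard errors, sigma_hat*sqrt(m_ii)
crit = tinv(1-alpha/2, T-k);               % tau_{T-k}(1-alpha/2)
CI = [betahat - crit*se, betahat + crit*se]% row i is the interval for beta_{0,i}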
2.7 Prediction intervals
Consider the following scenario. We have a sample of size T on y, X, and we are now told xT+1. Using this information, we wish to predict yT+1. To begin we assume the same model continues to hold for the (T + 1)th observation so that
yT+1 = x′T+1β0 + uT+1,   (2.40)
where uT+1 ∼ N(0, σ02) and uT+1 is independent of u. From (2.40), E[yT+1] = x′T+1β0 and so the natural predictor of yT+1 given our information is: ypT+1 = x′T+1βˆT. The associated prediction error is:
epT+1 = yT+1 − ypT+1 = x′T+1β0 + uT+1 − x′T+1βˆT = uT+1 − x′T+1(βˆT − β0).
Under our Assumptions here, it can be shown that21
epT+1 ∼ N( 0, σ02(1 + x′T+1(X′X)−1xT+1) ).
This leads to the following 100(1 − α)% prediction interval for yT +1 ,
ypT+1 ± τT−k(1−α/2) σˆT √(1 + x′T+1(X′X)−1xT+1).
We now illustrate this interval using the traffic example from Example 2.2.
Example 2.4. Recall that the sample runs from 1980.1 through 1989.12. Suppose it is desired to use the model in (2.20) to predict the percentage of highway accidents that resulted in fatalities in 1990.1 under the assumption that the state unemployment rate that month is 5%. The predicted value is calculated as
ypT+1 = βˆ1 + βˆtrend ∗(T +1) + βˆwkend ∗wkend + βˆunem ∗unem + βˆbelt ∗beltt + βˆmph ∗mpht
= 0.914 − 0.002 ∗ (109) + 0.001 ∗ (12) − 0.015 ∗ (5) − 0.102 + 0.057 = 0.754,
and the 95% prediction interval is: (0.629, 0.879).
21See Tutorial 2.
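A sketch of the corresponding calculation in MATLAB is given below; the data and the value of xT+1 are simulated placeholders rather than the traffic data, so the output will not reproduce the numbers in Example 2.4.

% Illustrative computation of the prediction interval for y_{T+1}.
rng(1); T = 100; k = 3; alpha = 0.05;
X = [ones(T,1), randn(T,k-1)];
y = X*[1; 0.5; -0.2] + randn(T,1);
xnew = [1; 0.2; -0.4];                          % x_{T+1}, assumed known
betahat = (X'*X)\(X'*y);
e = y - X*betahat;
sigmahat = sqrt((e'*e)/(T-k));
yp = xnew'*betahat;                             % point prediction x'_{T+1} betahat_T
spread = tinv(1-alpha/2,T-k)*sigmahat*sqrt(1 + xnew'*((X'*X)\xnew));
PI = [yp - spread, yp + spread]                 % 100(1-alpha)% prediction interval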
2.8 Hypothesis testing
Sometimes specific values for β0 are associated with certain types of behaviour in the underlying economic model. It is therefore of interest to assess whether the sample provides evidence that these restrictions hold in the population. To illustrate, we return to the three examples in Section 1.1 used to motivate our interest in the linear regression model.
Example 2.5. The CAPM implies that the return on an asset in period t, Rt, can be modeled as
Rt − Rf,t = β0(Rm,t − Rf,t) + ut
where Rf,t is the return on the risk-free asset and Rm,t is the return on the market index. Understanding how various individual assets move relative to the market index is important for portfolio management. For example, if β0 < 0 then the asset is inversely related to the market index; if β0 = 1 then the asset moves in line with the market index; if β0 > 1 then the asset is more risky than the market index.
Example 2.6. Suppose it is desired to test whether the returns to education are the same for men and women. To this end, we might estimate the model,22
ln(wi) = β0,1 + β0,2 ∗edi + β0,3 ∗Di + β0,4 ∗(Di ∗edi) + ui
where wi and edi denote respectively the hourly wage and number of years of education of the ith individual in the sample, and Di is a dummy variable that takes the value one if the ith individual in the sample is female and is zero otherwise. The wage equation is the same for men and women if β0,3 = 0, β0,4 = 0. The returns to education for men and women are the same if β0,4 = 0; the returns to education would be greater (less) for men than for women if β0,4 < 0 (β0,4 > 0).
Example 2.7. Suppose that aggregate production is determined by a Cobb- Douglas production function, and so the following regression model is estimated,
ln(Qt) = β0,1 + β0,2 ∗ ln(Lt) + β0,3 ∗ ln(Kt) + ut
where Qt is aggregate production in period t, Lt is the aggregate labour force, and Kt is the aggregate capital stock. It can be recalled that β0,2 (β0,3) is the elasticity of production with respect to labour (capital), and that β0,2 + β0,3 captures the "returns to scale". Specifically we have
β0,2 + β0,3 < 1 ⇔ diminishing returns to scale,
β0,2 + β0,3 = 1 ⇔ constant returns to scale,
β0,2 + β0,3 > 1 ⇔ increasing returns to scale.
We can assess whether our data are consistent with these types of restrictions using hypothesis tests. In the next subsection, we provide a review of the classical hypothesis testing framework. In the section after that, these ideas are applied to develop hypothesis tests in our regression model context.
22The other regressors have been omitted purely for ease of presentation.
2.8.1 Classical hypothesis testing framework
The classical hypothesis testing framework has its origins in the work of Neyman and Pearson in the 1930’s. They laid out its foundations and established many key results, and in view of the seminal nature of their contribution, this body of statistical theory is often referred to as the “Neyman-Pearson” framework for hypothesis testing.
For simplicity, we assume our hypothesis involves some aspect of the distribution of a univariate random variable V. Let θ be a p × 1 parameter vector that indexes this distribution, and Θ denote the parameter space with Θ ⊂ Rp. It is assumed that if the value of θ is known then it is also known whether or not the hypothesis is true. Thus we can divide the parameter space into two mutually exclusive and exhaustive parts:
Θ0 = {θ : such that the hypothesis is true},
Θ1 = {θ : such that the hypothesis is false}.
Using this partition, we can state the object as being to test the null hypothesis, H0 : θ ∈ Θ0
against the alternative hypothesis,
H1 : θ ∈ Θ1.
To facilitate the choice between H0 and H1, a sample of T observations on V is collected. Our inference is actually based on some function of this sample, known as a test statistic; we denote this statistic by ST . In a test procedure, the sample space of ST is divided into two mutually exclusive and exhaustive regions, R0 and R1, such that23
ST ∈ R0 ⇒ H0 is accepted,
ST ∈ R1 ⇒ H0 is rejected in favour of H1.
The set R0 is known as the acceptance region of the test and R1 is known as the rejection region or the critical region of the test.
The choice of the acceptance (and rejection) regions is determined by consideration of the possible outcomes of this decision making process. While the decision may be correct, it may also be incorrect. There are two types of error that can be made:
• a Type I error, in which H0 is rejected when it is true;
23Strictly, this type of procedure is known as a nonrandomized test. Since the majority of tests in practice and all the ones we discuss are of this form, we do not pursue the distinction between them and randomized tests.
• a Type II error, in which H0 is not rejected when it is false.
For what follows, it is useful to have a notation for the probability of making these errors. Let Pθ ( · ) denote the probability of the event in parentheses if the parameter vector takes the value θ.
We then define α(θ) = Pθ(R1) that is, α(θ) describes the probability of a type I error for values of θ that satisfy H0. The quantity supθ∈Θ0 α(θ) is known as the size of the test, and represents the maximal probability of a type I error.
Similarly define β(θ) = Pθ(R0) = 1−Pθ(R1) that is, β(θ) describes the probabil- ity of a type II error for values of θ that satisfy H1. The quantity π( · ) = 1−β( · ) is known as the power function of the test, and π(θ∗), for θ∗ ∈ Θ1, as the power of the test against the alternative θ = θ∗. The power of the test against the al- ternative θ = θ∗, π(θ∗), is the probability of correctly rejecting H0 when θ = θ∗.
Ideally, the probabilities of making either type of error should be zero. Unfortunately, for a fixed sample size, this is not possible. Instead, the approach taken within classical hypothesis testing is to choose the critical region in order to control the size of the test. To this end, a quantity α is specified such that α(θ) ≤ α for all θ ∈ Θ0; thus α is an upper bound on the probability of a Type I error and is known as the significance level of the test. (In general, the size and significance level coincide but this need not be the case.)24 Typically, α is chosen to be a number such as 0.10, 0.05 or 0.01. Notice that this approach to choosing the critical region reflects an asymmetry in the treatment of the hypotheses: H0 is only rejected if the observed value of ST has a small probability of occurring under H0. In other words, we only give up on H0 if there is strong evidence against it. Such an approach was justified by Neyman and Pearson by arguing that, in most cases, a researcher has some theory to justify H0 and wants to assess if the data are consistent with this theory. This approach is now widespread, but it should be noted that it remains controversial. Concerns about this approach are reflected in the modern terminology surrounding hypothesis tests: while older texts talk in terms of "accepting H0", such a decision is now expressed as "failing to reject H0".
We now illustrate these concepts via a couple of simple examples.
Example 2.8. Suppose that {vt}Tt=1 is an independently and identically distributed sequence of random variables with a N(θ, σ2) distribution,25 and we wish to test H0 : θ = 0 versus H1 : θ ≠ 0. Letting v̄T = T−1 Σ_{t=1}^{T} vt, it is
24As a result of their common equivalence, the terms are often used interchangeably in the literature although strictly they have different definitions. A test for which α(θ) = α for all θ ∈ Θ0 is said to be a similar test.
25This specification can also be written as: vt ∼ IN(θ, σ2) for t = 1, 2, …, T where the "IN" stands for "independently, normally distributed".
straightforward to show that:
v ̄T ∼ N(θ,σ2/T). (2.41)
For simplicity, we assume σ2 is known. Since θ is the population mean, it is natural to base inference on the sample mean. Our test statistic is the ratio of the sample mean to its standard deviation,
τT = v̄T / √(σ2/T).
Given the structure of the null and alternative hypotheses, we reject the null if τT is sufficiently far away from zero for it to be "implausible" that the true population mean is zero. So the decision rule is going to be of the form: reject H0 if |τT| > c for some constant c. This c is chosen to control the probability of a Type I error. Here Θ0 = {0} and so supθ∈Θ0 α(θ) = P(|τT| > c | θ = 0), and so we pick c to achieve the desired level for this probability. Given (2.41), if c is set equal to z1−α/2, the 100(1 − α/2)th percentile of the standard normal distribution, then the probability of a Type I error is α.26 Therefore, supposing we adopt a 5% significance level, the decision rule is:
• reject H0 : θ = 0 in favour of H1 : θ ̸= 0 at the 5% significance level if |τT | > 1.96.
The acceptance and rejection regions are represented graphically in Figure 2.1. The bell-shape curve is the probability density function (pdf) of τT under H0 that is, the pdf of the standard normal distribution. The boundaries between the acceptance and rejection regions – R0 and R1 respectively – are indicated by the vertical lines at ±1.96. A Type I error occurs when H0 is falsely rejected. Thus the probability of a Type I error is the probability that τT ∈ R1 given that τT ∼ N(0,1), its distribution under H0. This probability is the sum of the two shaded areas on the plot – which is 0.05 in this case.
The power of the test is given by
π(θ) = P(|τT| > 1.96|θ,θ∈Θ1).
To evaluate this probability, we need the distribution of τT if θ ̸= 0. From (2.41),
it follows that
τT = (v̄T − θ)/√(σ2/T) + θ/√(σ2/T) ∼ N(μ, 1)
where μ = θ/√(σ2/T). Notice that the power depends on θ through its effect on μ. Figure 2.2 shows the power calculation for the case where θ = σ/√T
26 Note that a Type I error occurs if either τT < −z1−α/2 or τT > z1−α/2, and so the probability of a type I error is the sum of the probabilities of these two (mutually exclusive) events. Under H0, we have: P(τT > z1−α/2) = α/2 by definition; P(τT < −z1−α/2) = α/2 because from the symmetry of the normal distribution −z1−α/2 = zα/2 and again by definition P(τT < zα/2) = α/2.
Figure 2.1: Acceptance and rejection regions for two-sided test in Example 2.8
and so μ = 1. The bell-shape curve is the pdf of N(1,1) that is, the pdf of the distribution of τT when θ = σ/√T. Notice that the curve is symmetric about one and so it is located on the x-axis to the right of the curve in Figure 2.1. The vertical lines at ±1.96 delineate the boundaries between R0 and R1. The power of the test for θ = σ/√T is the probability that τT ∈ R1 given that τT ∼ N(1,1), which is the sum of the two shaded areas on the plot (and equals 0.17 in this case).
Using the case of μ = 1 (θ = σ/√T) as a reference point, we can use Figure 2.2 to understand how the power changes with θ. Notice that changes in the value of θ only change μ, the mean of the distribution; other aspects of the distribution stay the same. Thus the value of θ does not affect the shape of the curve, only its location on the x-axis. Thus as the value of θ increases, the location of the pdf moves to the right and as a result the shaded area (the power) increases. Similarly as the value of θ decreases towards zero, the location of the pdf moves to the left and the shaded area (the power) decreases. This behaviour is evident in the plot of the power curve in Figure 2.3.27 It can be seen that π(θ) is an increasing function of |θ|, and it is symmetric about θ = 0 (the value of θ under H0).28 Notice the power is low for values of θ close to zero, and the power is essentially one for values of θ sufficiently far away from zero. This implies the test is good at revealing that H0 is incorrect when the true value of θ
27This figure is plotted for σ2 = 4 and T = 25. So for the purposes of comparing Figures 2.2 and 2.3, θ = 0.4 implies μ = 1. Note, as is common practice, we include the probability of rejection at the parameter value(s) under H0, here θ = 0.
28Note that π(0) = 0.05 because the probability of rejection at θ = 0 is just the size of the test.
Figure 2.2: Power of two-sided test in Example 2.8 when θ = σ/√T
is sufficiently far away from the value specified under H0 but not so good if θ is close to the value specified under H0. The symmetry of the power curve means the test is equally powerful when θ = θ ̄ or θ = −θ ̄, as would be expected given the form of the decision rule.29
29This symmetry follows from the symmetry of both the boundaries between R0 and R1 about zero and also the symmetry of the N (μ, 1) distribution about μ.
Figure 2.3: Power curve for the test in Example 2.8
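The power curve in Figure 2.3 can be reproduced numerically from the N(μ, 1) distribution of τT derived above. The following MATLAB sketch uses σ2 = 4 and T = 25, as in footnote 27; it is an illustration rather than the code used to produce the figure.

% Power of the two-sided 5% test in Example 2.8 as a function of theta,
% using tau_T ~ N(mu,1) with mu = theta/sqrt(sigma2/T).
sigma2 = 4; T = 25; c = 1.96;
theta = -3:0.05:3;
mu = theta./sqrt(sigma2/T);
power = 1 - normcdf(c - mu) + normcdf(-c - mu);  % P(tau_T > c) + P(tau_T < -c)
plot(theta, power); xlabel('\theta'); ylabel('\pi(\theta)');
% power at theta = 0.4 (mu = 1) is approximately 0.17, as in Figure 2.2;
% power at theta = 0 equals the size of the test, 0.05.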
Example 2.9. Consider the same statistical model as in the previous example but suppose now that we wish to test H0 : θ ≤ 0 versus H1 : θ > 0. We again base inference on τT but now reject only if τT is sufficiently large that it is "implausible" that the population mean is equal to or less than zero. Recall that we construct the critical region to control the probability of a Type I error. Here Θ0 = (−∞, 0], and we need supθ∈Θ0 α(θ) = α. This is achieved here by constructing the critical region based on the distribution of τT for the value of θ in Θ0 that is closest to Θ1, that is, θ = 0. Supposing again that we wish to set α = 0.05, the decision rule is:
• reject H0 : θ ≤ 0 in favour of H1 : θ > 0 at the 5% significance level if τT > 1.64.
The acceptance and rejection regions are illustrated graphically in Figure 2.4. The bell-shape curve is the probability density function (pdf) of τT under H0 with θ = 0 that is, the pdf of the standard normal distribution. The boundary between the acceptance and rejection regions – R0 and R1 respectively – is indicated by the vertical line at 1.64. A Type I error occurs when H0 is falsely rejected. When θ = 0, the probability of a Type I error is the probability that τT ∈ R1 given that τT ∼ N (0, 1). This probability is the shaded area on the plot.
Figure 2.4: Acceptance and rejection regions for one-sided test in Example 2.9
Figure 2.5: Power curve for the test in Example 2.9
Verify for yourselves that the size of the test associated with this decision rule is 0.05. The power of the test is given by
π(θ) = P(τT > 1.64|θ,θ∈Θ1).
Figure 2.5 contains a plot of the power curve. It can be seen that π(θ) is an increasing function of θ. As in Example 2.8, the power function reveals that the test is good at revealing that H0 is incorrect when the true value of θ is sufficiently far away from the value specified under H0 but not so good if θ is close to the value specified under H0.
For obvious reasons, the test in Example 2.8 is said to be two-sided and the test in Example 2.9 is said to be one-sided. The class of one-sided tests would also include H0 : θ ≥ 0 versus H1 : θ < 0 (what would be the decision rule in this case?) Note that in one-sided tests the equality always appears in the null hypothesis: this is so we can control the probability of the Type I error in a similar fashion to the approach in Example 2.9.
The outcome of the tests above is either "fail to reject at the 5% significance level" or "reject at the 5% significance level". While this informs about the decision at the stated significance level it does not necessarily tell another researcher what decision would be made at a different significance level. For example, if H0 is rejected at the 5% level then we know automatically that it would also be rejected at any significance level larger than 5% (check this for yourselves using the examples above). However, it is impossible to know what decision would have been made at the 1% significance level. This problem is circumvented by reporting the p-value of the test.30
Definition 2.8. The p-value of a test is δ where 100δ% is the significance level for which the observed test statistic lies on the boundary of the acceptance and rejection regions.
So for the test in Example 2.8, the p-value is δ defined by |τT | = z1−δ/2; and for Example 2.9, the p-value is δ defined by τT = z1−δ.
In most cases, there are a number of different test statistics upon which to base our inference. Given that the chosen significance level controls the probability of a type I error, it is natural to choose the statistic that yields the smallest probability of a type II error - or equivalently, given the relationship between β(θ) and the power function, the test with the highest power. While appealing, this may or may not decide the issue because the power depends on θ. To illustrate, suppose we wish to compare two statistics S1 and S2 with power functions π1(θ) and π2(θ) respectively. It may turn out that π1(θ) ≤ π2(θ) for all θ ∈ Θ1, in which case S2 is said to be uniformly more powerful than S1. However, it may also turn out that neither dominates because π1(θ∗) < π2(θ∗) but π1(θ†) > π2(θ†) for θ∗ ≠ θ†, θ∗, θ† ∈ Θ1. In the latter case, there is no obvious way to choose between the tests unless we have reason to prefer the test to have power against either of these specific values. Ideally, inference is based
30The p-value is sometimes referred to as the "observed significance level" of the test.
on the uniformly most powerful (UMP) test, that is, the statistic whose power is at least as large as that of any other test (of the same significance level) at any point in the parameter space. However, aside from certain exceptional cases, a UMP test does not exist.
Another consideration in choosing a test is to require it to have the property that π(θ) > α for all θ ∈ Θ1 that is, the rejection region has a higher probability of occurring under H1 than it does under H0. A test with this property is said to be unbiased.31 We note parenthetically that while it may not be possible to find a UMP test, it may be possible to find a UMP test within the class of unbiased tests; however, this is not an issue we pursue in this course.
2.8.2 Testing hypotheses about individual coefficients
In this section, we develop methods for testing hypotheses about an individual coefficient, β0,i say.
First consider testing H0 : β0,i = β∗,i versus H1 : β0,i ≠ β∗,i. The natural test statistic is
|τˆT,i(β∗,i)| = |βˆT,i − β∗,i| / (σˆT √mi,i)
because it follows from Theorem 2.4 that under H0, τˆT,i(β∗,i) has a Student’s t distribution with T − k degrees of freedom. Given the two-sided nature of the hypotheses, this leads to the following decision rule:
Decision rule: reject H0 : β0,i = β∗,i in favour of H1 : β0,i ̸= β∗,i at 100α% significance level if
|τˆT,i(β∗,i)| > τT−k(1−α/2).
For completeness, we verify that the significance level is 100×P(Type I error). A Type I error would occur if H0 is true and either τˆT,i(β∗,i) < −τT−k(1−α/2) or τˆT,i(β∗,i) > τT−k(1−α/2). Since under H0 τˆT,i(β∗,i) ∼ Student's t with T−k df, the probability of each of these events is α/2 and so the probability that either one or the other occurs is α. It can also be verified that we fail to reject H0 at the 100α% significance level for all values of β∗,i that are contained in the 100(1 − α)% confidence interval in (2.39). The p-value of the test is δ where |τˆT,i(β∗,i)| = τT−k(1−δ/2).
We can graphically illustrate both the decision rule and also the p-value for this test. Suppose that T − k is sufficiently large that the Student's t distribution with T − k df can be approximated by the standard normal distribution, and so we can set τT−k(1 − α/2) equal to z1−α/2. In Figure 2.6 we present the acceptance and rejection regions – R0 and R1 respectively – associated with a 5% significance level test; the vertical lines at ±1.96 delineate the boundaries between R0 and R1, and the shaded areas are each equal to 0.025. Suppose that for our sample, the observed value of τˆT,i = 1.6; this value is indicated by "X" in Figure 2.6. As can be seen, in this case we fail to reject H0 at the
31From Figures 2.3 and 2.5 it can be seen that the tests in Examples 2.8 and 2.9 are unbiased.
Figure 2.6: Acceptance and rejection regions for two-sided test with 5% signifi- cance level; “X” indicates observed value of τˆT,i = 1.6
5% significance level. The p-value of the test is the probability of a Type I error associated with a decision rule in which the observed value of the test statistic actually lies on the boundary of the acceptance and rejection regions. For our example here, this means a decision rule that involves rejecting H0 if |τˆT,i(β∗,i)| > 1.6. This decision rule is illustrated in Figure 2.7, with the vertical lines at ±1.6 indicating the boundaries between R0 and R1 for this test. The p-value is the probability of Type I error associated with this decision rule and so is the shaded area in Figure 2.7; it equals 0.1096.32
We now consider how the test statistic behaves if H0 is false. We can write
τˆT,i(β∗,i) = (βˆT,i − β∗,i)/(σˆT √mi,i) = (βˆT,i − β0,i)/(σˆT √mi,i) + (β0,i − β∗,i)/(σˆT √mi,i)
and so τˆT,i(β∗,i) is the sum of τˆT,i, a Student's t random variable with T − k degrees of freedom, and a non-zero random variable. It can be shown that this structure means τˆT,i(β∗,i) has a non-central Student's t-distribution with T − k degrees of freedom and non-centrality parameter ν = (β0,i − β∗,i)/(σ0 √mi,i).33 (If ν = 0 then the non-central Student's t-distribution is equal to the Student's t-distribution, and so because of this, the Student's t-distribution is sometimes referred to as the central Student's t-distribution, with the adjective "central" reflecting that the non-centrality parameter is zero.) It can be shown that P(reject H0 | H1 is true) is a strictly increasing function of |ν|
32Check this for yourselves in MATLAB, using the command: 2*normcdf(-1.6,0,1). 33See the Appendix in Section 2.13 for a formal definition of this distribution.
Figure 2.7: P-value of the two-sided test when observed value of test statistic is 1.6
and P(reject H0 | H1 is true) > α, meaning the test is unbiased. Notice that the power of the test does not depend on |β0,i − β∗,i| per se but on |β0,i − β∗,i|/(σ0 √mi,i). So the smaller the variance of βˆT,i, the more power the test has for a given value of |β0,i − β∗,i| – a further example of the benefits of basing inferences on efficient estimators.
In any regression model, there is one specific choice of β∗,i that is always of interest, namely β∗,i = 0 because if true this implies yt does not depend on xt,i. The test statistic for the two-sided test of H0 : β0,i = 0 is just the ratio of βˆT,i to its standard error, and is often referred to as the t−stat associated with βˆT,i. Most regression packages report the standard errors and t-stat’s along with the estimates. To illustrate this test, we return to the traffic example considered in Example 2.2.
Example 2.10. Suppose that we wish to test whether or not the dummy variable beltt helps to explain the percentage of accidents resulting in fatalities. Using the notation introduced in Example 2.4, this amounts to testing H0 : βbelt,0 = 0 versus H1 : βbelt,0 ̸= 0. The decision rule is to reject H0 at the 100α% significance level if:
|τˆbelt| = |βˆbelt / se(βˆbelt)| > τT−k(1−α/2).
From our previous discussion of this example we have βˆbelt = −0.030 and se(βˆbelt) = 0.023, and so τˆbelt = −1.304 and |τˆbelt| = 1.304. For this example, T − k = 91, and so
for a 5% test, the critical value is τ91(0.975) = 1.986. Therefore, we fail to reject H0 at the 5% significance level. Notice that this finding tallies with the finding reported in Example 2.3 that the 95% confidence interval for βbelt,0 contains zero. What about the outcome at other significance levels? The p-value for this test is δ such that τ91(1 − δ/2) = 1.304 which gives δ = 0.195. This means that H0 is rejected in any test with a significance level greater than 19.5%, implying we would fail to reject H0 at all conventional significance levels. It is left as an exercise for the reader to test whether or not the dummy variable for the speed law change helps to explain the percentage of accidents resulting in fatalities.
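The calculations in Example 2.10 can be reproduced with a few lines of MATLAB; the point estimate and standard error are taken from the example, and tinv and tcdf denote the Student's t quantile and distribution functions.

% Two-sided t-test of H0: beta_belt,0 = 0 using the estimates in Example 2.10.
betahat_belt = -0.030; se_belt = 0.023; df = 91; alpha = 0.05;
tstat = betahat_belt/se_belt;                 % approximately -1.304
crit  = tinv(1 - alpha/2, df);                % tau_91(0.975), approximately 1.986
reject = abs(tstat) > crit;                   % false: fail to reject at the 5% level
pvalue = 2*(1 - tcdf(abs(tstat), df));        % approximately 0.195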
One-sided tests can also be relevant in our regression context. In our traffic example, the more policy relevant question is arguably whether or not there is evidence that the seat belt law led to a reduction in traffic fatalities. This can be assessed using a one-sided test with H0 : βbelt,0 ≥ 0 versus H1 : βbelt,0 < 0. Notice that under the null hypothesis the seat belt law had either no effect or a positive effect - recall that the restriction with the equality always goes in the null hypothesis - and under the alternative the seat belt law led to a reduction in fatalities.
To present the appropriate decision rule, we abstract to the general case in which we wish to test H0 : β0,i ≥ β∗,i vs HA : β0,i < β∗,i. Our decision rule is as follows.
Decision rule: reject H0 : β0,i ≥ β∗,i in favour of HA : β0,i < β∗,i at the 100α% significance level if
τˆT,i(β∗,i) < τT−k(α).
Verify for yourselves that the probability of a Type I error with this decision rule is α. The p-value of this test is δ where τˆT,i(β∗,i) = τT−k(δ). We return to the traffic example to illustrate this type of test.
Example 2.11. Suppose that we test H0 : βbelt,0 ≥ 0 versus H1 : βbelt,0 < 0 at the 5% significance level. The decision rule is to reject H0 if τˆbelt < τ91(0.05) = −1.662. From Example 2.10, we have τˆbelt = −1.304. Therefore, we fail to reject H0. For this test, the p-value is δ such that −1.304 = τ91(δ) which gives δ = 0.098. Therefore, while we fail to reject H0 at the 5% significance level, we would reject at the 10% significance level and so the estimation results provide some, albeit marginal, evidence to support the view that making seat belts mandatory led to a reduction in the percentage of traffic accidents that involve at least one fatality.
It is left as an exercise for the reader to provide a graphical illustration of the p-value for the test in Example 2.11 along the lines of Figure 2.7 above. Suppose now that we wish to test H0 : β0,i ≤ β∗,i vs HA : β0,i > β∗,i: what would the decision rule be?34
34See Tutorial 3.
2.8.3 Testing whether β0 satisfies a set of linear restrictions
In this section, we present methods for testing whether β0 satisfies a set of nr linear restrictions. Such restrictions can be represented as: Rβ0 = r where R and r are respectively nr × k and nr × 1 matrices of specified constants. Both our examples above fall into this class. First consider the model in Example 2.6. The wage equation is the same for men and women if β0,3 = 0 and β0,4 = 0. Here nr = 2 and k = 4, and these restrictions can be written as Rβ0 = r where
R = [0 0 1 0; 0 0 0 1],   r = (0, 0)′,
where the semi-colon indicates the start of the second row of R.
For the model in Example 2.7, the production function exhibits constant returns to scale if β0,2 +β0,3 = 1. Here nr = 1 and k = 3, and this restriction can be written as Rβ0 = r where
R = [0,1,1], r = 1.
In this section, we consider tests of H0 : Rβ0 = r versus H1 : Rβ0 ̸= r. For what follows it is important that Rβ0 = r represents a set of nr unique restrictions. This is guaranteed under the following assumption.
Assumption 2.1. rank(R) = nr .
Notice this assumption inevitably implies the number of restrictions can not exceed the number of unknown regression parameters. (Why?)
It is natural to base inference upon the corresponding linear combination of the OLS estimators, RβˆT − r. To motivate the form of the statistic, we note that from Theorem 2.3 and Lemma 2.1, it follows that
R(βˆT − β0) ∼ N(0, σ02R(X′X)−1R′).
Since R(βˆT − β0) = RβˆT − Rβ0 and under H0, Rβ0 = r, it follows that under
H0
RβˆT − r ∼ N(0, σ02R(X′X)−1R′). (2.42)
In general, RβˆT − r is a vector and so we need to construct from it some scalar measure of how far RβˆT − r is from zero. To this end, we start by considering the quadratic form
G = (RβˆT − r)′[R(X′X)−1R′]−1(RβˆT − r).
Notice that Assumptions CA3 and 2.1 imply G > 0.35 To use G as the basis for a hypothesis test, we need to know its sampling distribution under H0. From (2.42), it follows that G is a quadratic form in a normal random vector and a matrix that is proportional to the inverse of the variance-covariance matrix of the random vector. The distribution of such statistics can be deduced from the following result.
35Note that if a matrix is positive definite then it must be nonsingular and its inverse is also positive definite.
Lemma 2.5. If v is an n × 1 random vector and v ∼ N(μ, Ω) where Ω is a positive definite matrix then (v − μ)′Ω−1(v − μ) ∼ χ2n.
Therefore, it follows from (2.42) and Lemma 2.5 that under H0:
G/σ02 ∼ χ2nr.   (2.43)
However, the statistic G/σ02 is infeasible because it depends on the nuisance parameter σ02. As before, we substitute σˆT2 for σ02 to yield the following statistic where, for historical reasons, we also divide the statistic by nr, the number of restrictions:
F = (RβˆT − r)′[R(X′X)−1R′]−1(RβˆT − r) / (nr σˆT2).   (2.44)
The substitution of σˆT2 for σ02 affects the distribution of the test statistic:
Theorem 2.5. If Assumptions CA1-CA6 and 2.1 hold then under H0 : Rβ0 = r, F ∼ Fnr,T−k, the F distribution with (nr, T − k) df.
This leads to the following decision rule:
Decision rule: reject H0 : Rβ0 = r in favour of H1 : Rβ0 ̸= r at the 100α% significance level if:
F >Fnr,T−k(1−α)
where Fnr ,T −k (1 − α) is the 100(1 − α)th percentile of the Fnr ,T −k distribution.
Figure 2.8 represents this decision rule graphically for the case in which H0 : Rβ0 = r is tested using a 5% significance level, Rβ0 = r involves nr = 3 restrictions and T − k = 150. In this case F has an F3,150 distribution under H0, and the pdf of this F-distribution is plotted in Figure 2.8; the vertical line at F3,150(0.95) = 2.66 represents the boundary between the acceptance and rejection regions – R0 and R1 respectively. The probability of a Type I error is given by the shaded area and equals 0.05 by construction.
The p-value of this test is δ where F = Fnr,T −k (1 − δ). To illustrate the p- value graphically, suppose that for our sample the observed value of F equals 2 – note that this means we fail to reject H0 at the 5% significance level. The p-value of the test is the probability of a Type I error associated with a decision rule in which the observed value of the test statistic actually lies on the boundary of the acceptance and rejection regions. For our example here, this means a decision rule that involves rejecting H0 if F > 2. This decision rule is illustrated in Figure 2.9, with the vertical line at 2 indicating the boundary between R0 and R1 for this test. The p-value is the probability of Type I error associated with this decision rule and so is the shaded area in Figure 2.9; it equals 0.1164.36
36You can verify this in MATLAB using the command: 1-fcdf(2,3,150).
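To make the mechanics of the F-test concrete, the following MATLAB sketch computes F in (2.44) and its p-value. The data, the coefficient values and the restrictions are simulated placeholders (chosen so that nr = 3 and T − k = 150, matching Figure 2.8); with real data, y, X, R and r would be replaced by the relevant sample quantities.

% Illustrative computation of the F-statistic in (2.44) and its p-value.
rng(1); T = 154; k = 4;                      % so T - k = 150
X = [ones(T,1), randn(T,k-1)];
y = X*[1; 0; 0; 0] + randn(T,1);             % restrictions hold in this design
R = [zeros(3,1), eye(3)]; r = zeros(3,1);    % nr = 3 restrictions: beta_{0,2..4} = 0
nr = size(R,1);
betahat = (X'*X)\(X'*y);
e = y - X*betahat; sigma2hat = (e'*e)/(T-k);
d = R*betahat - r;
F = (d'*((R*((X'*X)\R'))\d))/(nr*sigma2hat); % quadratic form divided by nr*sigma2hat
crit = finv(0.95, nr, T-k);                  % F_{3,150}(0.95), approximately 2.66
pvalue = 1 - fcdf(F, nr, T-k);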
Figure 2.8: Acceptance and rejection regions for the F-test with a 5% signifi- cance level
Figure 2.9: P-value of F test when observed value of F statistic is 2.
Figure 2.10: The pdf of the F3,150 distribution (blue line), and the non-central F3,150 distributions with νF = 1 (green), 2 (red), 3 (magenta).
Under H1, it can be shown that F has a non-central F-distribution with degrees of freedom (nr , T − k) and non-centrality parameter37
νF = (Rβ0 − r)′[R(X′X)−1R′]−1(Rβ0 − r) / σ02.
Figure 2.10 compares the pdf of the (central) F3,150 distribution with that of the non-central F3,150 for νF = 1, 2, 3. It can be seen that increases in νF lead to a transfer of mass from the left of the distribution to the (right-hand) tail. As a result, the power of the test is increasing in νF, and the test is unbiased.38
To illustrate the F-test, we return to our running example.
Example 2.12. Suppose that we wish to test whether the two traffic laws cancelled each other out that is, βbelt,0 + βmph,0 = 0. To present this hypothesis in our matrix/vector notation, recall that k = 17 and define β0 = (β0,c′, β0,belt, β0,mph)′ where β0,c is the 15 × 1 vector of coefficients on the controls. The null hypothesis is H0 : Rβ0 = r where R = [01×15, 1, 1] and r = 0; there is only one restriction and so nr = 1.39 In this case, the decision rule
37See the Appendix in Section 2.13 for a formal definition of this distribution.
38For our example here, the power of the test is 0.1138 for νF = 1, 0.1881 for νF = 2 and 0.2682 for νF = 3. The MATLAB command for calculating the probability that a random variable with non-central Fdf1,df2 distribution takes a value less than b is ncfcdf(b,df1,df2,ncp) where ncp is the non-centrality parameter.
39 In other words, R is the 1 × 17 vector whose first 15 elements are all zero and remaining two are both 1, and r = 0.
is to reject H0 at the 100α% significance level if F > F1,91(1 − α). For a 5% significance level test, F1,91(1 − α) = F1,91(0.95) = 3.946. The F-statistic is 3.126, and so we fail to reject H0 at the 5% significance level. However, the p-value is 0.080 and so the null would be rejected at the 10% significance level. Thus there is marginal evidence to support the belief that the effects of the two traffic laws cancelled each other out.
The F-statistic can be equivalently calculated as follows:
F = ((RSSR − RSSU)/RSSU) × ((T − k)/nr),   (2.45)
where RSSU is the unrestricted RSS from the regression of y on X, and RSSR is the restricted RSS obtained from the regression of y on X subject to the restrictions Rβ = r. The latter estimation is performed via so-called Restricted Least Squares (RLS) estimation which is the topic of the next section. To conclude this section, we discuss one particular type of F-test that is relevant in all regression models.
In Section 2.8.2, we consider tests for whether individual regressors actually help to explain y. In any regression, we may also want to assess if the set of regressors (apart from the intercept) can collectively help to explain y. Here the null hypothesis is that the regressors do not collectively help to explain y that is, β0,2 = 0, β0,3 = 0, .., β0,k = 0.40 The alternative is that at least one of the regressors helps to explain y that is, β0,j ≠ 0 for at least one j, j = 2, 3, …, k. This null and alternative can be expressed within our framework here by setting R = [0k−1, Ik−1] and r = 0k−1, where 0k−1 denotes the null vector of dimension (k − 1) × 1. This version of the F-test is routinely reported as part of the ANOVA table in the regression output from many computer programs. Noting that under this H0, RSSR = TSS (check this for yourselves), it follows from (2.45) that the F-statistic is
F = (R2/(1 − R2)) × ((T − k)/(k − 1)).   (2.46)
This formulation gives an appealing intuition to the test. Notice that if y is linearly unrelated to X in the population, then the population R2 between y and X is zero.41 From (2.46), it can be recognized that F is an increasing function of the sample R2. So we reject the null hypothesis that the regressors do not collectively explain y if the sample R2 is sufficiently large to make it implausible that the population R2 is zero. We conclude this section by illustrating this version of the F-test for our running example.
Example 2.13. There are seventeen regressors in equation (2.20): the intercept plus the fourteen controls (eleven month dummies, the time trend, unemployment and the number of weekends), beltt and mpht. So in this case, the null
40Recall the intercept is included by setting xt,1 = 1 for all t.
41The population R2 is the multiple correlation coefficient from the regression of y on X using the entire population (or equivalently an infinitely large sample).
hypothesis that these explanatory variables collectively do not help to explain yt involves sixteen restrictions. Recall from Example 2.2 that the estimated model has R2 = 0.72. So using the formula in (2.46), the F-statistic is 14.625. Under the null hypothesis this statistic has an F16,91 distribution and so the p-value of the test is the value of δ satisfying 14.625 = F16,91(1 − δ). The solution to this equation yields a value of δ that is effectively equal to zero, meaning H0 is rejected at all conventional significance levels. Therefore, the estimation results provide evidence to support the view that these explanatory variables collectively contribute to the explanation of the percentage of accidents involving fatalities in California during this period.
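The arithmetic in Example 2.13 can be checked directly from (2.46); the two MATLAB lines below use the values quoted in the example.

% Verifying the F-statistic and p-value in Example 2.13 via (2.46).
R2 = 0.72; T = 108; k = 17;
F = (R2/(1 - R2))*((T - k)/(k - 1))   % equals 14.625
p = 1 - fcdf(F, k-1, T-k)             % effectively zero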
2.9 Restricted Least Squares
In Section 2.2, the OLS estimator is defined as the value of β that minimizes QT(β) = u(β)′u(β). Suppose now that our economic model implies β0 satisfies a set of linear restrictions of the form Rβ0 = r where R and r are respectively an nr × k matrix and an nr × 1 vector of specified constants. In this section, we explore how this information can be exploited in least squares estimation and also the potential gains from doing so.
Recall that B denotes the parameter space for the regression coefficients. Now define the set: BR = {β; Rβ = r, β ∈ B}. It can be seen that BR consists of all possible values for the regression coefficient vector that satisfy the restrictions. Notice that it must follow that BR ⊆ B. The Restricted Least Squares (RLS) estimator of β0 is defined to be
βˆR,T = argminβ∈BR QT (β).
Thus both OLS and RLS minimize the error sum of squares u(β)′u(β) but they differ in the sets over which the minimization is taken. OLS minimizes over all possible values of β; RLS minimizes over all possible values of β that also satisfy the restrictions. As a result, in general, the two estimators are different. Notice also that the RLS estimator satisfies the restrictions by construction, RβˆR,T = r.
To obtain a formula for the RLS estimator, it is convenient to use Lagrange's method for constrained optimization. To this end, let the Lagrangean function be denoted by L(β, λ) where λ is an nr × 1 vector of Lagrange multipliers. It turns out to be most convenient to define the Lagrangean for our problem as follows:
L(β, λ) = QT (β) + 2λ′(Rβ − r). (2.47) Recall that the RLS estimator of β0 and the estimated Lagrange multiplier, λˆT ,
satisfy the first-order conditions:
∂L(β,λ)/∂β |β=βˆR,T, λ=λˆT = 0,
∂L(β,λ)/∂λ |β=βˆR,T, λ=λˆT = 0.
From (2.47) and Lemma 2.2(i) it follows that:
∂L(β, λ)/∂β = ∂QT(β)/∂β + 2R′λ,    ∂L(β, λ)/∂λ = 2(Rβ − r).
Using (2.5), the first-order conditions imply that the estimators are the solutions to the following equations:
−X′y + X′XβˆR,T + R′λˆT = 0, (2.48) RβˆR,T −r = 0. (2.49)
Notice that the second of these equations ensures βˆR,T satisfies the restrictions. To obtain the formula for βˆR,T, it is necessary to manipulate (2.48)-(2.49). Assuming X is full column rank, X′X is nonsingular, and (2.48) can be multiplied by R(X′X)−1 to give:
−R(X′X)−1X′y + RβˆR,T + R(X′X)−1R′λˆT = 0. (2.50)
Recalling the formula for the OLS estimator and that RβˆR,T = r, equation
(2.50) can be re-written as:
−RβˆT + r + R(X′X)−1R′λˆT = 0. (2.51)
Assuming that rank(R) = nr, the nr × nr matrix R(X′X)−1R′ is nonsingular, and we can solve (2.51) for λˆT to obtain:
λˆT = {R(X′X)−1R′}−1(RβˆT − r). (2.52) Substituting for λˆT in (2.48) and pre-multiplying the resulting equation by
(X′X)−1, we obtain after re-arrangement:
βˆR,T = βˆT − (X′X)−1R′{R(X′X)−1R′}−1(RβˆT − r). (2.53)
This formula gives an appealing interpretation to the RLS estimator: it equals the OLS estimator modified by a factor that reflects how far the OLS estimator is from satisfying the restrictions.
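A short MATLAB sketch of the mapping in (2.53) is given below. The data are simulated and the single restriction β0,2 + β0,3 = 1 (in the spirit of Example 2.7) is used purely as an illustration.

% RLS estimator via (2.53), illustrated with the restriction beta_{0,2} + beta_{0,3} = 1.
rng(1); T = 200; k = 3;
X = [ones(T,1), randn(T,k-1)];
y = X*[1; 0.6; 0.4] + 0.5*randn(T,1);          % true coefficients satisfy the restriction
R = [0 1 1]; r = 1;
betahat = (X'*X)\(X'*y);                       % unrestricted OLS
A = (X'*X)\R';                                 % (X'X)^{-1} R'
betaR = betahat - A*((R*A)\(R*betahat - r));   % RLS estimator in (2.53)
R*betaR                                        % equals r = 1 by construction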
The following theorem presents the sampling distribution of the RLS estimator.
Theorem 2.6. If Assumptions CA1-CA6 and 2.1 hold, and Rβ0 = r then the sampling distribution of the RLS estimator in (2.53) is N(β0, σ02D) where
D = (X′X)−1 − (X′X)−1R′{R(X′X)−1R′}−1R(X′X)−1.
The proof of this result is left to Tutorial 3. Here we focus on comparing the properties of the OLS and RLS estimators. First, consider the case where the restrictions are valid. In this case, it follows from Theorems 2.3 and 2.6 that
both estimators are unbiased but the RLS estimator is at least as efficient as the OLS estimator. The latter property holds because
Var[βˆT] − Var[βˆR,T] = σ02(X′X)−1R′{R(X′X)−1R′}−1R(X′X)−1
which is a positive semi-definite matrix by construction. Now consider the case where the restrictions are invalid: in this case OLS is still unbiased but it can be shown that the RLS estimator is in general biased.42 In summary there is an intuitively reasonable trade-off: if the linear restrictions are correct then their imposition leads to more efficient estimators, but the imposition of invalid restrictions leads to biased estimators.
As with OLS, the sampling distribution cannot be used for inference about β0 because it depends on σ02 and so to develop feasible inference an estimator of this nuisance parameter is needed. The RLS estimator of σ02 is
σˆR2 = (y − XβˆR,T)′(y − XβˆR,T) / (T − k + nr),
where the value in the denominator is chosen so that the estimator is unbiased for σ02. Notice that the denominator is larger than in its OLS counterpart and this is because the degrees of freedom increase due to the imposition of the restrictions.43
Confidence intervals and hypothesis tests can be constructed based on the RLS estimators along the same lines to those for OLS.44 However, the sampling distributions of the RLS-based test statistics reflect the change in the degrees of freedom. To illustrate, consider a two-sided test of H0 : β0,l = β∗,l. Given our discussion of OLS-based inference for this type of test, the natural RLS-based test statistic is:
τˆR,l = (βˆR,l − β∗,l) / (σˆR √Dl,l),
where Dl,l is the (l, l) element of D and for ease of notation here we have dropped the T in the subscript of the RLS estimator.
Theorem 2.7. If Assumptions CA1-CA6 and 2.1 hold, and Rβ0 = r then τˆR,l
has a Student’s t distribution with T − k + nr degrees of freedom.
RLS-based tests for whether the parameters satisfy linear restrictions – other than those imposed in the estimation – can also be constructed but we do not pursue these further here.
42See Tutorial 3.
43Essentially the values of nr elements of βˆR,T are implicitly fixed by the restrictions and the remaining k − nr are determined by the first-order conditions of the minimization so that T − (k − nr) of the observations are "free"; see the discussion at the end of Section 2.5.
44See Sections 2.6 and 2.8.2.
2.10 Variable selection: R2 and adjusted R2
In many situations in practice, we are unsure exactly which explanatory variables to include in the regression model. It would be tempting to select the set of explanatory variables which leads to the highest value of R2 but this is a flawed variable selection strategy. The reason is that R2 can never go down when an additional variable is included irrespective of whether or not that variable belongs in the model. To see why, consider the following two possible models for y: Model 1 in which the explanatory variables are the intercept and x2,t, and Model 2 in which the explanatory variables are the intercept, x2,t and x3,t. Let RSS1 be the RSS from OLS estimation of Model 1, and RSS2 be the RSS from OLS estimation of Model 2. Notice that Model 1 is a special case of Model 2 because if we set the coefficient on xt,3, β3 say, equal to zero in Model 2 then this yields Model 1. Therefore, we can think of OLS estimation of Model 1 as OLS estimation of Model 2 subject to the constraint that β3 = 0. Thus, we have
RSS1 = minβ∈Br Σ_{t=1}^{T} (yt − β1 − β2xt,2 − β3xt,3)2,   (2.54)
where Br = {β = (β1, β2, 0)′; β ∈ B} where B is the parameter space for Model 2;45 that is, Br is the set of all vectors in the parameter space for Model 2 whose third element is zero. Whereas, from (2.2), we have
RSS2 = minβ∈B Σ_{t=1}^{T} (yt − β1 − β2xt,2 − β3xt,3)2.   (2.55)
Since Br ⊂ B, it follows that the minimization in (2.54) is over a smaller set of possible values for β than the minimization in (2.55) and so the right-hand side of (2.54) must be at least as large as the right-hand side of (2.55), meaning RSS1 ≥ RSS2. As a result, R2 alone can not be used as guide to the selection of explanatory variables between models with different numbers of regressors.46
An alternative is to use the so-called adjusted R2 which can either increase, decrease or stay the same when an additional explanatory variable is included in the model. The adjusted R2 is commonly denoted R ̄2 and is defined as:
R̄2 = 1 − ((T − 1)/(T − k))(1 − R2).   (2.56)
Like R2, the adjusted R2 is routinely reported as part of the regression output in many computer packages. To motivate the formula for R̄2, it is useful to write R2 as
R2 = 1 − RSS/TSS
45Therefore a typical element of B is β = (β1,β2,β3)′. Here we are assuming the set of possible values for β1 and β2 is the same in both models.
46However, if we have to choose between two linear regression models for yt that have the same number of regressors then it is reasonable to select the model that has the highest R2 as that is the model of the two which explains the most variation in the dependent variable.
where RSS = Σ_{t=1}^{T} e2t and TSS = Σ_{t=1}^{T} (yt − ȳ)2. Dividing the numerator and denominator of the ratio by T leaves the ratio unaffected, and so R2 can be written equivalently as
R2 = 1 − (T−1RSS)/(T−1TSS),   (2.57)
Given the sum of squares format of the terms in the ratio, R2 can be thought of as one minus the ratio of an estimator of the variance of the errors (T−1RSS) to an estimator of the variance of the dependent variable (T−1TSS). However, both these variance estimators are biased: as discussed in Section 2.5, the unbiased estimator of σ02 is σˆT2 = RSS/(T − k); and from standard arguments the unbiased estimator of the variance of y is σˆy2 = TSS/(T − 1). If T−1RSS and T−1TSS are replaced on the right-hand side of (2.57) by respectively σˆT2 and σˆy2 then the result is the formula for R̄2 that is,
R̄2 = 1 − ((T − 1)/(T − k)) (RSS/TSS).
This modification is sufficient to ensure that the criterion can either go up, stay the same or go down when an additional regressor is included in the model. To see this is the case, assume the original model has k regressors, and let R ̄2k and RSSk denote the R ̄2 and RSS from this model. The model with an additional regressor thus has k + 1 regressors and its R ̄2 and RSS are denoted by R ̄2k+1 and RSSk+1. Notice that since the dependent variable is the same in both cases, the TSS is the same for each model. Therefore, we have
R̄2k = 1 − ((T − 1)/(T − k)) (RSSk/TSS)
and
R̄2k+1 = 1 − ((T − 1)/(T − k − 1)) (RSSk+1/TSS).
From these equations it follows that
R̄2k > R̄2k+1, R̄2k = R̄2k+1 or R̄2k < R̄2k+1 according as RSSk − RSSk+1 is less than, equal to or greater than RSSk/(T − k).
Hence, the direction of change in R ̄2 depends on the size of the drop in RSS when the extra regressor is included.
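The following MATLAB sketch illustrates this property: it computes R2 and R̄2 for a model with k regressors and for the same model augmented with one additional regressor that does not belong in the population model. The data are simulated, so whether R̄2 rises or falls in any particular draw depends on the fit of the extra regressor.

% R2 and adjusted R2 before and after adding an irrelevant regressor.
rng(1); T = 60;
X = [ones(T,1), randn(T,2)];                 % k = 3 regressors
y = X*[1; 0.5; -0.5] + randn(T,1);
xextra = randn(T,1);                         % regressor with zero population coefficient
R2fun    = @(y,X) 1 - sum((y - X*((X'*X)\(X'*y))).^2)/sum((y - mean(y)).^2);
Rbar2fun = @(y,X) 1 - ((T-1)/(T-size(X,2)))*(1 - R2fun(y,X));
[R2fun(y,X),    R2fun(y,[X xextra])]         % R2 cannot fall
[Rbar2fun(y,X), Rbar2fun(y,[X xextra])]      % adjusted R2 may rise or fall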
Given this property, a possible model selection strategy is to choose regressors to maximize R̄2. While this strategy has a certain intuitive appeal, we now develop an alternative perspective on this approach from which it does not seem so attractive. If we choose regressors to maximize R̄2 then this amounts to including any regressor that increases R̄2. To explore what this implies, we return to the thought experiment in the previous paragraph, only this time adding in slightly more structure. Let R̄2k denote R̄2 for the regression of y on X = [x1, …, xk], and R̄2k+1 denote R̄2 for the regression of y on [X, xk+1].
The condition R̄2k+1 > R̄2k can be written as:
1 − [RSSk+1/(T − k − 1)] / [TSS/(T − 1)] > 1 − [RSSk/(T − k)] / [TSS/(T − 1)].
Since TSS is the same in both models, this condition is equivalent to:
RSSk+1/(T − k − 1) < RSSk/(T − k),
which, in turn, implies
(T − k)RSSk+1 < (T − k − 1)RSSk.   (2.58)
Adding and subtracting RSSk+1 to the left hand side of the previous equation and rearranging, we obtain that (2.58) is equivalent to:
RSSk+1 < (T − k − 1)(RSSk − RSSk+1),
or:
1 < ((RSSk − RSSk+1)/RSSk+1) × ((T − k − 1)/1).
From (2.45), it can be recognized that the right-hand side of this inequality is the F-statistic for testing H0 : β0,k+1 = 0. The latter statistic is equal to the square of the t-statistic for testing the same null hypothesis.47 Therefore, R̄2k+1 > R̄2k if and only if the t-statistic for testing H0 : β0,k+1 = 0 is greater than one in absolute value – or in other words, the inclusion of xk+1 in the model increases R̄2 if and only if we reject H0 : β0,k+1 = 0 at approximately a 30% significance level (assuming T − k > 30 as an illustration). The latter is, of course, a very high significance level relative to common testing practice. So a strategy of selecting variables to maximize R̄2 potentially leads to the inclusion of regressors that would be judged to have a statistically insignificant effect on the dependent variable using conventional significance levels.
To illustrate the behaviour of R ̄2 we consider the three models for traffic fatalities discussed in Example 2.2.
Example 2.14. In Example 2.2, the initial model, equation (2.18), involves the results from regressing yt (the percentage of traffic accidents involving at least one fatality) on dummy variables for the two law changes. We then consider the impact on the coefficients on these dummies of including control variables: equation (2.19) includes month dummies and a linear time trend as controls; equation (2.20) includes the state unemployment rate and the number of weekends in the month as additional controls. The R2s from these three models are: 0.135, 0.693 and 0.717. This shows the inclusion of the controls greatly improves the explanatory fit. The R̄2s for the three models are respectively: 0.118, 0.687 and 0.668. So the inclusion of the unemployment rate and number of weekends
47See Tutorial 3 Question 5.
in the month leads to a sufficiently modest increase in R2 that R ̄2 falls slightly. Therefore, if we base our model selection on R ̄2 then the second model, equation (2.19), is the preferred specification of the three. However, in our analysis of the impact of the law changes on traffic fatalities, we have based inference on the more general model, equation (2.20), to protect against potential bias in the estimation of βbelt,0 and βmph,0 caused by the omission of the controls.
2.11 Stochastic regressors
Our analysis to date has rested on the assumption that the regressors are fixed in repeated samples. As we noted at the outset, this assumption is unreasonable for econometric modeling but was made for pedagogical convenience. In this section, we relax it and consider the analogous regression model with stochastic regressors. However, as will emerge, our previous analysis actually provides a stepping stone to the derivation of the analogous results in this more general context.
This new framework necessitates a new set of assumptions. The first of these is the same as Assumption CA1 but nevertheless we repeat it here for completeness.
Assumption SR1. The true model is: y = Xβ0 + u.
Assumption SR1 states that y is generated by the same regression model as the one we estimate.
Assumption SR2. X is stochastic.
This replaces Assumption CA2. The import of this assumption is that X is resampled in each of the repeated samples that underlie our interpretation of probability.
Assumption SR3. X is rank k with probability 1.
Assumption SR3 plays the same role as Assumption CA3. The difference is that as the regressors are stochastic here then we must attach a probability to any statement about properties of X. This assumption implies that (X′X)−1 exists with probability one.
Assumption SR4. E[u|X] = 0.
Assumption SR4 states that the conditional expectation of u given X is zero. It is important to understand what this assumption actually implies. To this end, it is useful to introduce the condition E[u_t|x_t] = 0. This states that u_t has a zero mean conditional on x_t and in turn implies u_t is uncorrelated with x_t, that is, the error in the tth equation is uncorrelated with all the explanatory variables in the tth equation.48 In contrast, Assumption SR4 implies that E[u_t|X] = 0 and so u_t is uncorrelated with {x_s; s = 1, 2, …, T}, that is, u_t is uncorrelated with all the explanatory variables in all equations. With cross-section data, it is often assumed that individuals/firms are independent and so this condition combined with E[u_t|x_t] = 0 would yield Assumption SR4. With time series data, t defines a temporal ordering and the distinction is more important. If E[u_t|x_t] = 0 then x_t is said to be contemporaneously exogenous; if E[u_t|X] = 0 holds then x_t is said to be strictly exogenous. In this context, strict exogeneity implies that u_t is not only uncorrelated with the current value of the explanatory variables (as would be the case under contemporaneous exogeneity) but also with all future and past values of the explanatory variables as well.49 One circumstance in which Assumption SR4 holds is when u and X are independent.
48This can be shown formally using the Law of Iterated Expectations discussed below.
49See Section 3.3 for further discussion.
Assumption SR5. V ar[u|X] = σ02IT .
Assumption SR5 implies that conditional on X the errors are homoscedastic
and ut is uncorrelated with us for any t ̸= s.
Assumption SR6. The conditional distribution of u given X is normal that
is, u|X ∼ Normal.
Conditional on X, y is a linear combination of normal random variables and so it follows from Assumption SR1 that the conditional distribution of y given X is normal, that is, y|X ∼ N(Xβ_0, σ_0²I_T).
We now explore the sampling distribution of β̂_T under Assumptions SR1-SR6, beginning with its first two moments. Our analysis relies on the Law of Iterated Expectations. The following lemma adapts this law to our setting here.
Lemma 2.6. The Law of Iterated Expectations (LIE): Let w = f(u, X) be a (scalar/vector/matrix) function of u and X, such that E[w] exists; then E[w] = E_X[E_{u|X}[w]], where E_{u|X}[·] denotes the expectation operator relative to the conditional distribution of u given X and E_X[·] denotes the expectation operator with respect to the marginal distribution of X.
Using the LIE, we have that if E[(X'X)^{-1}X'u] exists then
E[β̂_T] = β_0 + E_X[ E_{u|X}[(X'X)^{-1}X'u] ] = β_0 + E_X[ (X'X)^{-1}X'E_{u|X}[u] ],
and so from Assumption SR4
E[β̂_T] = β_0.   (2.59)
So OLS is unbiased. Notice that the E_{u|X}[·] part of our analysis is essentially the same as our analysis for the fixed regressors case because conditioning on X amounts to treating it as a constant.
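Equation (2.59) can be illustrated informally by simulation. In the sketch below (illustrative only; the design, with normal errors generated independently of a freshly drawn X in every replication, is an assumption made so that E[u|X] = 0 holds), the average of the OLS estimates across replications is close to β_0 even though the regressors are stochastic.

```python
import numpy as np

rng = np.random.default_rng(1)
T, beta0, reps = 40, np.array([1.0, 2.0, -0.5]), 5000
estimates = np.empty((reps, 3))

for r in range(reps):
    # X is redrawn each replication (stochastic regressors); u is drawn independently of X
    X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
    u = rng.standard_normal(T)
    y = X @ beta0 + u
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ y)

print("mean of OLS estimates:", estimates.mean(axis=0))   # should be close to beta0
print("true beta0:           ", beta0)
```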
The variance can be similarly calculated, following analogous steps to the corresponding analysis in the fixed regressor case. Assuming that E[(X'X)^{-1}X'uu'X(X'X)^{-1}] exists, we have
Var[β̂_T] = E[(X'X)^{-1}X'uu'X(X'X)^{-1}]
          = E_X[ E_{u|X}[(X'X)^{-1}X'uu'X(X'X)^{-1}] ]
          = E_X[ (X'X)^{-1}X'E_{u|X}[uu']X(X'X)^{-1} ].
From Assumption SR5, it follows that
Var[β̂_T] = σ_0²E[(X'X)^{-1}].   (2.60)
However, there is no such easy extension of the sampling distribution for βˆT . Using the conditioning argument, we can appeal to the arguments behind Theorem 2.3 to establish that conditional on X the distribution of the OLS estimator is:
β̂_T | X ∼ N(β_0, σ_0²(X'X)^{-1}),
but the unconditional distribution of βˆT in general depends on the distribution of X. This dependency would appear to suggest that our previous inference procedures based on τˆT,i and F are invalid but this is not in fact the case. The crucial difference is that these two statistics involve normalized versions of either βˆT ,i or RβˆT − r. Using analogous arguments to the fixed regressor case, we can conclude that conditional on X, we have that:
• under H0 : β0,i = β∗,i, τˆT,i(β∗,i) ∼ Student’s t with T − k df;
• under H0 : Rβ0 = r, F ∼ Fnr,T−k
Since both these distributions are independent of X, it follows that these con- ditional distributions are also the unconditional distributions of these statistics under the stated null hypotheses.50 Therefore, the confidence intervals for β0,i in Section 2.6, the decision rules in Sections 2.8.2 and 2.8.3 remain valid un- der Assumptions SR1-SR6.51 A similar argument justifies the validity of the prediction interval in Section 2.7 under the assumptions in this section.
In this section, we have seen that the inference framework developed in the Classical Linear Regression model remains valid if the regressors are stochastic regressors provided (u,X) satisfies certain conditions. For the distributional results above to hold, it is crucial that the conditional distribution of u given X is normal. If the error distribution is non-normal then in general none of the inference procedures described above are valid. Instead, we base our inference framework upon large sample statistical results. Large sample or asymptotic analysis is the topic of the next chapter. Before moving on, we explore an alternative interpretation of OLS estimation based on the concept of linear projection.
50A statistic with this property is known as a pivot.
51Here we implicitly assume that E[(X'X)^{-1}X'u] and E[(X'X)^{-1}X'uu'X(X'X)^{-1}] exist.
2.12 OLS as linear projection
The preceding sections of this chapter provide a compelling argument for OLS-based inference in linear regression models: β̂_T is the best linear unbiased estimator, these estimates can be used to form easily calculated confidence intervals for the unknown regression coefficients or prediction intervals for y, and the t- and F-statistics provide convenient methods for testing hypotheses about the unknown regression parameters. In this section, we explore an additional justification for the use of OLS drawn from the theory of forecasting.
Suppose we wish to forecast a variable y_t given a value for x_t. Let ỹ_t = c(x_t) denote a forecast of y_t based on x_t where c(·) is some potentially nonlinear function. Since there are many possible choices of c(·) that can be used,52 it is necessary to introduce some loss function via which to rank the different choices. A natural choice of loss function is the quadratic loss or mean square error,
MSE(ỹ_t) = E[(y_t − ỹ_t)²].
It can be shown that the choice of c(·) that minimizes the MSE is c_o(x_t) = E[y_t|x_t]. To implement this choice, it is necessary to know the form of this conditional expectation, and we assume here that the functional form of c_o(·) is not known.
In the absence of information about c_o(x_t), one solution is to restrict attention to the class of linear predictors, y_t^{lp}, that is, y_t^{lp} = α'x_t for some α, a vector of constants. Once we restrict attention to this class of linear forecasts, the question then becomes what is the optimal choice of α. Given the quadratic loss, the optimal choice of α is the one associated with the linear projection of y_t onto x_t. The linear projection has the special property that its prediction error, (y_t − α'x_t), is uncorrelated with x_t, that is,
E[(y_t − α'x_t)x_t'] = 0.   (2.61)
52For example, a linear function or a quadratic or more general polynomial function.
To see why the linear projection yields the smallest MSE in the class of all linear forecasts, we consider the MSE of any other linear forecast, g'x_t say. Simple manipulation shows that the mean square error of g'x_t can be written as
MSE(g'x_t) = E[(y_t − g'x_t)²]
           = E[(y_t − α'x_t + α'x_t − g'x_t)²]
           = E[(y_t − α'x_t)²] + 2E[(y_t − α'x_t)(α'x_t − g'x_t)] + E[(α'x_t − g'x_t)²].   (2.62)
Since α'x_t and g'x_t are scalars and α and g are constants, the middle term on the right-hand side of (2.62) can be written as
E[(y_t − α'x_t)(α'x_t − g'x_t)] = E[(y_t − α'x_t)(x_t'α − x_t'g)] = E[(y_t − α'x_t)x_t'(α − g)] = E[(y_t − α'x_t)x_t'](α − g),
and so from (2.61) it follows that
E[(y_t − α'x_t)(α'x_t − g'x_t)] = 0.
Substituting this result into (2.62), we obtain
MSE(g'x_t) = E[(y_t − α'x_t)²] + E[(α'x_t − g'x_t)²] = MSE(α'x_t) + E[(α'x_t − g'x_t)²],
from which it follows that
MSE(g'x_t) ≥ MSE(α'x_t),
because E[(α'x_t − g'x_t)²] ≥ 0 by construction.
Having established that the linear projection yields the smallest quadratic loss in this class, the next issue is to uncover the choice of weights α associated with it. These can be deduced from the defining property of a linear projection. Rewriting (2.61), we have
E[y_t x_t'] − α'E[x_t x_t'] = 0,
and so assuming E[x_t x_t'] is nonsingular, it follows that
α = {E[x_t x_t']}^{-1} E[x_t y_t].
The MSE associated with the linear projection is
MSE(α'x_t) = E[y_t²] − E[y_t x_t']{E[x_t x_t']}^{-1}E[x_t y_t].
The vector α and the MSE(α'x_t) are the population analogs to β̂_T and RSS/T. This association becomes more apparent if we write the OLS estimator and RSS/T as functions of standardized sums as follows:
β̂_T = ( T^{-1}∑_{t=1}^T x_t x_t' )^{-1} T^{-1}∑_{t=1}^T x_t y_t,
RSS/T = T^{-1}∑_{t=1}^T y_t² − ( T^{-1}∑_{t=1}^T y_t x_t' )( T^{-1}∑_{t=1}^T x_t x_t' )^{-1}( T^{-1}∑_{t=1}^T x_t y_t ).
Furthermore, it can be recognized that the first order conditions of OLS can be written as T^{-1}∑_{t=1}^T x_t e_t = 0, which is the sample analog to (2.61).
Therefore, the use of OLS can be justified as yielding the sample analogs to the weights in the linear projection. Notice that the arguments used to justify this interpretation of OLS are very general in the sense that we have not appealed to any assumptions about how X and y are generated beyond (2.61). However, the price of this generality is that this interpretation tells us nothing about how x_t affects y_t, which is most often of interest. In order to use our OLS results to draw inferences about the effect of x_t on y_t, it is necessary to build a statistical model that puts more structure on the relationship between y_t and x_t. For example, Assumptions CA1-6 or SR1-6 imply that E[y_t|X] = x_t'β_0 and so β_0 indexes the marginal response in the conditional expectation of y_t with respect to changes in the explanatory variables.53
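The correspondence between the linear projection weights and OLS can be checked directly. The following Python sketch (illustrative code; the simulated data are invented for the example) computes the sample analog of α = {E[x_t x_t']}^{-1}E[x_t y_t], verifies that it coincides with the OLS coefficient vector, and confirms the sample analog of (2.61).

```python
import numpy as np

rng = np.random.default_rng(2)
T = 200
x = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])
y = x @ np.array([0.5, 1.0, -1.0]) + rng.standard_normal(T)

# Sample analogs of E[x_t x_t'] and E[x_t y_t]
Sxx = x.T @ x / T
Sxy = x.T @ y / T
alpha_hat = np.linalg.solve(Sxx, Sxy)          # sample analog of the projection weights
beta_ols = np.linalg.lstsq(x, y, rcond=None)[0]

e = y - x @ alpha_hat
print(np.allclose(alpha_hat, beta_ols))        # True: same estimator
print(np.allclose(x.T @ e / T, 0))             # True: sample analog of E[(y_t - alpha'x_t)x_t'] = 0
```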
To conclude this section, we return to the decomposition of y, and of the variation of y about its mean, that is produced by OLS estimation. Recall from Section 2.2 that we have
y = ŷ + e,   (2.63)
where ŷ'e = 0, and
TSS = ESS + RSS,   (2.64)
where, as a reminder, TSS = ∑_{t=1}^T (y_t − ȳ)² is the total sum of squares, ESS = ∑_{t=1}^T (ŷ_t − ȳ)² is the explained sum of squares and RSS = ∑_{t=1}^T e_t² is the residual sum of squares. Both these decompositions have their roots in the interpretation of OLS as a linear projection. To explore this issue in more detail, we write the components in terms of the matrix P = X(X'X)^{-1}X': ŷ = Py and e = (I_T − P)y.54 Using this notation, we can write (2.63) as
y = Py + (IT − P)y.
Both P and IT − P are examples of orthogonal projection matrices and so have
the following properties.
Definition 2.9. An m × m matrix M is an orthogonal projection matrix if it satisfies the following two properties: (i) symmetry, that is, M = M'; (ii) idempotency, that is, MM = M.
If a matrix has the property in (ii) then it is said to be idempotent.55 The matrices P and I − P have one further property by construction:56 they are orthogonal because
P(I_T − P) = P − PP = P − P = 0.   (2.65)
It is this structure that leads to the decomposition of the variation in (2.64).
To demonstrate this is the case, we introduce the matrix A = IT −ιT(ι′TιT)−1ι′T,
where (as in Example 2.1) ιT is a T × 1 vector of ones. The matrix A is an orthogonal projection matrix (check this for yourselves) and also has the special
53See the discussion in Section 1.1.
54 The matrix IT − P is sometimes referred to as the “residual maker”.
55For further discussion of idempotency see Greene (2012)[p.1042], and of orthogonal projec-
tion matrices see Orme (2009)[Section 7.4]. Note that Orme drops the adjective “orthogonal” as is sometimes done in discussion of P and I − P in econometrics texts.
56In fact, these two matrices have a fundamental connection: P is known as the projection matrix onto the column space of X and I − P is the projection matrix onto the orthogonal complement of the column space of X; see Orme (2009)[Section 7.4.1] for further discussion, but this material is beyond the scope of this course.
property that if we pre-multiply a T × 1 vector by A then this produces the same vector in mean deviation form, that is,
Ay = y − ȳι_T,
where ȳ = T^{-1}∑_{t=1}^T y_t. To see this, note that ι_T'y = ∑_{t=1}^T y_t and ι_T'ι_T = T, and so
Ay = (I_T − ι_T(ι_T'ι_T)^{-1}ι_T')y = y − ι_T T^{-1}∑_{t=1}^T y_t = y − ȳι_T.
Using A, we can write
TSS = y'A'Ay = y'Ay,
where the last equality uses the fact that A is an orthogonal projection matrix. We now show that TSS can be decomposed into two parts. As noted in Section 2.2, this decomposition only holds in models that include an intercept term, and so we now set: X = [ι_T, X_2]. An immediate consequence of the inclusion of the intercept is that the first order conditions imply the residuals sum to zero because (2.6) can be re-written as
X'(y − Xβ̂_T) = X'e = 0,
and then substituting for X yields
[ ι_T'e ; X_2'e ] = 0,
the first entry of which is ∑_{t=1}^T e_t = 0. So trivially the mean of the residuals, ē, is also zero, which we note, parenthetically, implies the mean of y equals the mean of the predicted values for y, ȳ̂ say, because e = y − ŷ.57 Using (2.63) and ē = 0, we have
Ay = A(ŷ + e) = Aŷ + e,   (2.66)
and so substituting for y from (2.10) and for Ay from (2.66), we have
TSS = y'Ay = (ŷ + e)'(Aŷ + e) = ŷ'Aŷ + e'e + e'Aŷ + ŷ'e.   (2.67)
The latter equation can be simplified because e'A = e', meaning that e'Aŷ = e'ŷ = ŷ'e = y'P(I_T − P)y, which equals zero from (2.65). Thus both cross-product terms in (2.67) are zero, and so we have
TSS = ŷ'Aŷ + e'e.   (2.68)
Using the properties of A and ȳ = ȳ̂, it follows that ESS = ŷ'Aŷ and so (2.68) is equivalent to (2.64).
57This is the linear algebraic version of the statement that the OLS regression line passes through the sample means.
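The projection-matrix algebra above is easily verified numerically. The sketch below (illustrative only; the data are simulated) forms P = X(X'X)^{-1}X' and A = I_T − ι_T(ι_T'ι_T)^{-1}ι_T', checks that both are symmetric and idempotent, and confirms TSS = ŷ'Aŷ + e'e for a model that includes an intercept.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 30
X = np.column_stack([np.ones(T), rng.standard_normal((T, 2))])  # intercept included
y = X @ np.array([1.0, 0.7, -0.4]) + rng.standard_normal(T)

P = X @ np.linalg.inv(X.T @ X) @ X.T          # projection onto the column space of X
iota = np.ones((T, 1))
A = np.eye(T) - iota @ iota.T / T             # puts a vector in mean-deviation form

for M in (P, A):
    assert np.allclose(M, M.T) and np.allclose(M @ M, M)   # symmetric and idempotent

y_hat, e = P @ y, (np.eye(T) - P) @ y
TSS = np.sum((y - y.mean()) ** 2)
print(np.allclose(TSS, y_hat @ A @ y_hat + e @ e))          # TSS = ESS + RSS
print(np.allclose(e.mean(), 0))                             # residuals sum to zero
```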
2.13 Appendix: Background for distributional results
In this appendix, we provide justification for the various distributional results that have been stated throughout this chapter. As in Sections 2.1-2.10, we impose Assumptions CA1-CA6. Recall that these assumptions imply X is a matrix of constants and u ∼ N (0, σ02IT ). As a result all our statistics are functions of linear combinations of N (0, σ02IT ) random variables.
We begin by presenting a sequence of definitions and lemmas to which we appeal during the analysis.
Definition 2.10. If z_i for i = 1, 2, …, n_z is a sequence of independent and identically distributed standard normal random variables then ∑_{i=1}^{n_z} z_i² has a chi-squared distribution with n_z degrees of freedom, denoted by χ²_{n_z}.
Definition 2.11. If z ∼ N(0,1), B ∼ χ²_b and z and B are independent then S = z/√(B/b) has a Student's t-distribution with b degrees of freedom.
Definition 2.12. If A ∼ χ2a, B ∼ χ2b and A and B are independent then S = (A/a)/(B/b) has a F-distribution with (a, b) degrees of freedom, denoted by S ∼ Fa,b.
Lemma 2.7. If z ∼ N(0, In) and M is a (non-null) n×n orthogonal projection matrix58 with tr(M) = ν then z′Mz ∼ χ2ν.
Lemma 2.8. Let z ∼ N(0, I_n) and A, B be respectively m_a × n and m_b × n matrices of constants, and define v = Az, w = Bz. If AB' = 0 then v and w are independent.
Distribution of σ̂_T²:
Consider the statistic (T − k)σ̂_T²/σ_0². From (2.33) and (2.34), it follows that
(T − k)σ̂_T²/σ_0² = u'(I_T − P)u/σ_0² = z'(I_T − P)z,   (2.69)
where z = (1/σ_0)u. Assumption CA6 and Lemma 2.1 imply z ∼ N(0, I_T). Recall from the previous section that I_T − P is an orthogonal projection matrix and from Section 2.5 that tr(I_T − P) = T − k. Therefore, we can use (2.69) and Lemma 2.7 to deduce the following result.
Theorem 2.8. If Assumptions CA1 – CA6 hold then:
(T − k)σ̂_T²/σ_0² ∼ χ²_{T−k}.
58See Definition 2.9 in Section 2.12.
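Theorem 2.8 can be illustrated by a small Monte Carlo experiment. The sketch below (illustrative code; the fixed design and parameter values are invented) simulates (T − k)σ̂_T²/σ_0² under normal errors and compares its distribution with χ²_{T−k}.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
T, k, sigma0 = 25, 3, 1.5
X = np.column_stack([np.ones(T), rng.standard_normal((T, k - 1))])   # fixed regressors
beta0 = np.array([1.0, 0.5, -0.5])
reps = 10000

stat = np.empty(reps)
for r in range(reps):
    u = sigma0 * rng.standard_normal(T)
    y = X @ beta0 + u
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    stat[r] = (T - k) * (e @ e / (T - k)) / sigma0 ** 2

# Compare simulated and theoretical moments of the chi-squared(T - k) distribution
print(stat.mean(), T - k)                                   # chi2 mean = T - k
print(stat.var(), 2 * (T - k))                              # chi2 variance = 2(T - k)
print(stats.kstest(stat, "chi2", args=(T - k,)).pvalue)     # KS test against chi2_{T-k}
```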
Distribution of τ̂_{T,i}:
To characterize the distribution of τ̂_{T,i}, we note that under H_0: β_{0,i} = β_{∗,i} we can write
τ̂_{T,i}(β_{∗,i}) = (β̂_{T,i} − β_{0,i})/(σ̂_T √m_{i,i}) = τ_{T,i}σ_0/σ̂_T = τ_{T,i}/√( e'e/{(T − k)σ_0²} ),
where τ_{T,i} is defined in (2.32), and so from Theorem 2.8 and (2.32), it can be recognized that τ̂_{T,i} is the ratio of a standard normal random variable to the square root of a chi-squared random variable divided by its degrees of freedom. We now show that the numerator and denominator are independent. To this end, note that using (2.25), the numerator can be written as
τ_{T,i} = (1/√m_{i,i}) a_i'(X'X)^{-1}X'((1/σ_0)u) = Az,
where z = (1/σ_0)u ∼ N(0, I_T), A = (1/√m_{i,i}) a_i'(X'X)^{-1}X' and a_i is the unit vector with ith element equal to one. Now consider the denominator. Since I_T − P is an orthogonal projection matrix, it follows from (2.69) that
e'e/σ_0² = z'B'Bz,
where B = I_T − P. Therefore τ_{T,i} is independent of e'e/σ_0² if Az and Bz are independent. From Lemma 2.8, we know that Az and Bz are independent if AB' = 0. Since A = A_1X' for A_1 = (1/√m_{i,i}) a_i'(X'X)^{-1} and I_T − P is symmetric, it follows that AB' = A_1X'(I_T − P) = 0, and so the condition for independence is satisfied. Combining these results, we have shown that under H_0: β_{0,i} = β_{∗,i}, τ̂_{T,i} is the ratio of a standard normal random variable to the square root of a chi-squared random variable divided by its degrees of freedom, and that the numerator and denominator of this ratio are independent. It then follows from Definition 2.11 that τ̂_{T,i} has a Student's t distribution with T − k degrees of freedom as stated in Theorem 2.4.
Distribution of F-statistic:
Recall from (2.44) that the F-statistic for testing H_0: Rβ_0 = r is given by
F = (Rβ̂_T − r)'[R(X'X)^{-1}R']^{-1}(Rβ̂_T − r)/(n_r σ̂_T²).
Under H_0, we have Rβ_0 = r and so
Rβ̂_T − r = R(β̂_T − β_0) = R(X'X)^{-1}X'u.
Using this formula and substituting for σ̂_T², we can therefore write
F = (u'Mu/n_r)/(u'(I_T − P)u/(T − k)) = (z'Mz/n_r)/(z'(I_T − P)z/(T − k)),
where M = X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'. It can be verified that (check this for yourselves): (i) M is an orthogonal projection matrix with tr(M) = n_r and so, via Lemma 2.7, z'Mz ∼ χ²_{n_r}; (ii) M(I_T − P) = 0 so that z'Mz and z'(I_T − P)z are independent via Lemma 2.8 and a similar argument to the one used in the derivation of the distribution of τ̂_{T,i}. As a result, it follows from Definition 2.12 that F ∼ F_{n_r,T−k} as stated in Theorem 2.5.
Non-central t- and F- distributions:
In Sections 2.8.2 and 2.8.3, it is stated that both the t- test of H0 : β0,i = β∗,i and the F-test of H0 : Rβ0 = r are unbiased. This property holds because under their alternative hypotheses τˆT ,i(β∗,i ) and F have respectively non-central t- and non-central F- distributions. These two distributions are defined below.
Definition 2.13. If z ∼ N(0,1), B ∼ χ²_b, z and B are independent, and δ is some constant then S = (z + δ)/√(B/b) has a non-central t-distribution with b degrees of freedom and non-centrality parameter δ.
In order to define the non-central F distribution, it is necessary to first define the non-central chi-squared distribution.
Definition 2.14. If z_i for i = 1, 2, …, n_z is a sequence of independent and identically distributed standard normal random variables and {δ_i}_{i=1}^{n_z} are constants then ∑_{i=1}^{n_z}(z_i + δ_i)² has a chi-squared distribution with n_z degrees of freedom and non-centrality parameter ν = ∑_{i=1}^{n_z} δ_i², denoted by χ²_{n_z}(ν).
Definition 2.15. If A ∼ χ²_a(δ), B ∼ χ²_b and A and B are independent then S = (A/a)/(B/b) has a non-central F-distribution with (a, b) degrees of freedom and non-centrality parameter δ, denoted by S ∼ F_{a,b}(δ).
If Definition 2.15 is modified so that B is a random variable with a non- central chi-squared distribution then S has a doubly non-central F-distribution. As a result, the distribution in Definition 2.15 is sometimes referred to as a singly non-central F distribution but the adjective “singly” is often omitted because this is the most common scenario encountered in practice.
Chapter 3
Large Sample Statistical Theory
3.1 Background
Large sample or asymptotic theory involves analyzing the behaviour of statistics as the sample size goes to infinity. To begin our discussion it is useful to recall how the OLS estimator depends on T . From (2.25), we have that
β̂_T = β_0 + ( ∑_{t=1}^T x_t x_t' )^{-1} ∑_{t=1}^T x_t u_t.   (3.1)
Therefore, βˆT is a function of two summations that run from one to T, and so, as T increases, more terms are included in these sums. Since xt and ut are random, so too is βˆT; thus {βˆT;T = k,k+1,…} is a stochastic sequence indexed by T.
We establish below that our statistics of interest (βˆT , t-stats etc) “converge” as T → ∞ to either constants or random variables with known, and familiar, distributions. It will emerge that these large sample results hold under quite general conditions on the explanatory variables and errors; importantly these conditions do not involve the specification of the distribution of the errors. This is clearly desirable because – unlike our finite sample analysis – it means the large sample theory is not sensitive to a particular assumption about the distribution of the errors. The price we pay for this robustness is that the large sample theory is only strictly valid in the limit as the sample size goes to infinity, and so is only ever an approximation to the finite sample behaviour of our statistics.
We first need to define what is meant by “convergence” in this context. There are, in fact, many ways to define convergence of a stochastic sequence. Our large sample analysis rests on what are known as “convergence in probability” and “convergence in distribution”. To define these concepts, we abstract from our
OLS setting here and consider the stochastic sequence {VT ; T = 1, 2, . . .} and random variable (rv) V .
Definition 3.1. V_T is said to converge in probability to V if P(|V_T − V| < ε) → 1 for any ε > 0 as T → ∞, and this is denoted by V_T →p V.
This definition holds for rv V . If V is a degenerate rv (that is, a constant) then we have two important items of terminology.
Definition 3.2. If VT →p c where c is a constant then c is referred to as the probability limit of VT and this is written as plimVT = c.
Definition 3.3. If θˆT is an estimator of the unknown parameter θ0 and θˆT →p θ0 then θˆT is said to be a consistent estimator of θ0.
In large sample analysis, consistency is going to play the role that unbiasedness plays in our finite sample analysis in the sense that it is the minimal property that we want our estimators to exhibit. In fact, it is possible to relate the two concepts as follows. It can be shown that sufficient conditions for θ̂_T →p θ_0 are that E[θ̂_T] − θ_0 → 0 and Var[θ̂_T] → 0. If these two conditions hold then both the bias and the variance of the estimator tend to zero as T → ∞, and thus the sampling distribution of θ̂_T is collapsing to a distribution with a point mass of one at θ_0.1 Notice that not all unbiased estimators are consistent, and an estimator that is biased in finite samples may be consistent if the bias tends to zero as T → ∞.
We have stated convergence in probability in terms of scalars, but we will need to apply the concept to random vectors and matrices. It is possible to extend the definition to these more general cases, but instead the following lemma provides a necessary and sufficient condition for convergence based on the behaviour of a typical element of the vector/matrix in question.
Lemma 3.1. Let {MT ; T = 1, 2, . . .} be a sequence of random matrices and M be random matrix where MT and M are p × q with i − jth elements MT ,i,j and Mi,j. Then MT →p M if and only if MT,i,j →p Mi,j for i = 1,2,…p, j = 1,2,…q
Lemma 3.1 states that a necessary and sufficient condition for matrix MT to converge in probability to matrix M is that every element of MT converges in probability to the corresponding element of M.
Often our analysis does not involve VT per se but some function of VT . The following result allows us to deduce the large sample behaviour of a function of VT from that of VT .
Lemma 3.2. Slutsky’s theorem: Let {VT } be a sequence of random vectors (or matrices) which converge in probability to the random vector (or matrix) V and let f(.) be a real- valued vector of continuous functions then f(VT ) →p f(V ).
This is a very powerful result as will become apparent in our analysis below.2
1If E[θ̂_T] − θ_0 → 0 and Var[θ̂_T] → 0 then θ̂_T is said to converge in quadratic mean to θ_0, written as θ̂_T →q.m. θ_0.
2Note f(E[V]) ≠ E[f(V)] in general.
The other form of convergence we need is as follows.
Definition 3.4. The sequence of random variables {VT } with corresponding distribution functions {FT ( · )} converges in distribution to the random variable V with distribution function F(·) if and only if there exists T(ε) for every ε such that |FT (c)−F(c)| < ε for T > T(ε) at all points of continuity {c} of F(.).
Convergence in distribution is denoted by VT →d V , with the distribution of V being known as the limiting distribution of VT .3
In essence, convergence in distribution states that P (VT < a) → P (V < a) as T → ∞. Notice that the definition of convergence in distribution only requires convergence at all the points at which F (.) is continuous. This obviously rules out any points at which the limiting distribution function is discontinuous. Such points of discontinuity occur in the distribution function if it contains steps, with the points of discontinuity being those at which the steps occur. Convergence is not required at points of discontinuity because to do so would rule out cases where a limiting distribution very clearly exists.
Note: If VT →p V then VT →d V but the reverse implication does not hold unless V = c, a constant. The following simple example illustrates the difference between the two forms of convergence.
Example 3.1. Let z_1 and z_2 be two independent standard normal random variables. Now define the following three stochastic sequences
a_T = z_1 + 1/T,   b_T = z_1 − 1/T,   c_T = z_2 − 1/T.
Since T^{-1} → 0 as T → ∞, we have a_T →p z_1, b_T →p z_1 and c_T →p z_2. Therefore, a_T − b_T →p 0 and a_T − c_T →p z_1 − z_2. So a_T and b_T converge in probability to the same limit but a_T and c_T do not. We also have a_T →d z_1, b_T →d z_1 and c_T →d z_2. Since both z_1 and z_2 have N(0,1) distributions, it follows that a_T, b_T and c_T all have the same limiting distribution.
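A few lines of code make the distinction in Example 3.1 concrete. In the sketch below (illustrative only), z_1 and z_2 are drawn once and a_T, b_T and c_T are tracked as T grows: a_T − b_T collapses to zero while a_T − c_T stays at the fixed value z_1 − z_2, even though all three sequences share the same N(0,1) limiting distribution.

```python
import numpy as np

rng = np.random.default_rng(5)
z1, z2 = rng.standard_normal(2)

for T in [10, 100, 10_000, 1_000_000]:
    aT, bT, cT = z1 + 1 / T, z1 - 1 / T, z2 - 1 / T
    print(f"T={T:>9}:  aT-bT = {aT - bT:.2e},  aT-cT = {aT - cT:.4f}")
# aT - bT shrinks to 0; aT - cT stays at z1 - z2 no matter how large T is
```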
Convergence in probability and convergence in distribution provide the con- ceptual frameworks for describing what is meant by the idea that a statistic behaves like a particular random variable in the limit, but do not in themselves provide the tools for establishing what form this limiting behaviour takes. For this, we turn to two fundamental results in statistics: the Weak Law of Large Numbers (WLLN) and the Central Limit Theorem (CLT).
We state these limit theorems in terms of a sequence of random vectors {vt, t = 1,2,...T} with E[vt] = μt.
3It should be noted that the terminology “convergence in law” is also sometimes used for this concept.
Lemma 3.3. Weak Law of Large Numbers: Subject to certain conditions, it follows that
T^{-1}∑_{t=1}^T (v_t − μ_t) →p 0.
Notice that the WLLN only implies that the average of v_t − μ_t converges in probability to zero; in other words, that the difference between T^{-1}∑_{t=1}^T v_t and T^{-1}∑_{t=1}^T μ_t converges in probability to zero. This does not necessarily imply that lim_{T→∞} T^{-1}∑_{t=1}^T μ_t = μ̄, some finite constant, and so similarly does not imply T^{-1}∑_{t=1}^T v_t converges to a constant. However, in the cases we consider below μ_t = μ and in this case the WLLN implies T^{-1}∑_{t=1}^T v_t converges in probability to μ, that is, the sample mean converges to the population mean.
Lemma 3.4. Central Limit Theorem: Subject to certain conditions,
T^{-1/2}∑_{t=1}^T (v_t − μ_t) →d N(0, Ω),
where Ω = lim_{T→∞} Ω_T is a finite positive definite matrix of constants, and
Ω_T = Var[ T^{-1/2}∑_{t=1}^T (v_t − μ_t) ].
Before proceeding, it is important to emphasize the form of this variance. To this end, we introduce the following vector extension of the scalar covariance function.
Definition 3.5. Let v_t be a n_v × 1 random vector with E[v_t] = μ_t^{(v)} and w_t be a n_w × 1 random vector with E[w_t] = μ_t^{(w)}. The covariance between v_t and w_t is the n_v × n_w matrix, Cov[v_t, w_t] = E[(v_t − μ_t^{(v)})(w_t − μ_t^{(w)})'].
So the i-jth element of Cov[v_t, w_t] is Cov[v_{t,i}, w_{t,j}] where v_{t,i} and w_{t,j} are respectively the ith element of v_t and the jth element of w_t. Notice that if v_t = w_t then Cov[v_t, w_t] = Var[v_t], but in general Cov[v_t, w_t] is not symmetric (it may not even be square); however, Cov[v_t, w_t] = {Cov[w_t, v_t]}'.
Returning to the form of Ω, it can be shown that
Ω_T = T^{-1}∑_{t=1}^T ∑_{s=1}^T Cov[v_t, v_s].   (3.2)
If v_t is uncorrelated with v_s (and so Cov[v_t, v_s] = 0) for all s ≠ t then Ω_T = T^{-1}∑_{t=1}^T Var[v_t]. Such a condition might apply in microeconometric settings if we have, say, a sample of individuals who can be assumed independent of one another. However, with time series data for which t indicates temporal ordering, it may not be the case that Cov[v_t, v_s] = 0 for t ≠ s.
Comparing the WLLN and CLT, it can be seen that the results differ in how the statistic is scaled. The WLLN states that d_T = T^{-1}∑_{t=1}^T (v_t − μ_t) converges in probability to zero; the CLT states that T^{1/2}d_T converges in distribution to a normally distributed random vector. We illustrate these behaviours in the following example.
Example 3.2. We examine the behaviour of the sample mean based on a random sample of size T from three different distributions: (i) the standard normal distribution; (ii) the normalized uniform distribution on (0, 1); (iii) the normalized Bernoulli distribution with probability 0.5. Here normalization means that the appropriate transformation is applied to deliver a distribution with mean zero and variance one to match the standard normal.4 For all three, the WLLN implies that the sample mean converges to zero, the population mean. Figures 3.1, 3.2 and 3.3 plot the sampling distributions of the sample means from the distributions in (i), (ii) and (iii) respectively for T = 10, 20, 30, 50, 100, 250, 500, 1000. It can be seen that, in line with the prediction of the WLLN, the sampling distribution of the sample mean is collapsing onto zero in each case.5 If we increase the sample size further this collapse would continue and in the limit the distribution would be a spike at zero.
Figures 3.4, 3.5 and 3.6 plot the sampling distributions of the sample means scaled by T^{1/2} from the distributions in (i), (ii) and (iii) respectively for the same sample sizes. The CLT predicts that this statistic converges in distribution to a standard normal distribution.6 As can be seen, in each case the scaling stops the distribution collapsing as in Figures 3.1-3.3, and the sampling distribution is now converging to a standard normal distribution in each case.
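Figures 3.1-3.6 can be reproduced with a short simulation. The sketch below (illustrative code; it reports standard deviations rather than plotting densities) uses the normalized uniform distribution and shows the sample mean collapsing at rate T^{-1/2} (the WLLN) while the spread of T^{1/2} times the sample mean stabilizes near one (the CLT).

```python
import numpy as np

rng = np.random.default_rng(6)
reps = 20000

for T in [10, 50, 250, 1000]:
    # normalized uniform: u ~ U(0,1) has mean 0.5 and variance 1/12, so rescale to mean 0, variance 1
    draws = (rng.uniform(size=(reps, T)) - 0.5) * np.sqrt(12)
    means = draws.mean(axis=1)
    print(f"T={T:>5}: sd of sample mean = {means.std():.3f}, "
          f"sd of sqrt(T) x sample mean = {(np.sqrt(T) * means).std():.3f}")
# The first column shrinks like 1/sqrt(T) (WLLN); the second stays near 1 (CLT).
```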
In addition to the limit theorems and Slutsky’s Theorem, there are two further results which will be very useful in our large sample analysis of OLS (and other estimators). These are presented in the following two lemmas.
Lemma 3.5. If b_T = M_T m_T where M_T is a q × r random matrix and m_T is a r × 1 random vector and: (i) M_T →p M, a matrix of finite constants; (ii) m_T →d N(0, Ω), where Ω is a finite p.d. matrix of constants; (iii) rank(MΩM') = q. Then we have: b_T →d N(0, MΩM').
Lemma 3.6. If a_T = m_T'Ω̂_T^{-1}m_T where m_T is a r × 1 random vector and: (i) m_T →d N(0, Ω), where Ω is a finite p.d. matrix of constants; (ii) Ω̂_T →p Ω. Then we have a_T →d χ²_r.
Notice that Lemmas 3.5 and 3.6 can be viewed as the large sample analog of Lemmas 2.1 and 2.5 respectively.
4If u ∼ uniform(0, 1) then E[u] = 0.5, V ar[u] = 1/12. If b ∼ Bernoulli(0.5) then E[b] = 0.5 and V ar[b] = 0.25.
5Note that in Figure 3.2 for low values of T the distribution of the sample mean is discrete, and the wiggles in the plot come from MATLAB plotting a continuous density to a discrete distribution.
6Of course, in case (i), the CLT is an exact result for any T; see Tutorial 4.
[Figure 3.1: Sampling distribution of the sample mean based on sample size T from the standard normal distribution; T = 10, 20, 30, 50, 100, 250, 500, 1000.]
[Figure 3.2: Sampling distribution of the sample mean based on sample size T from the normalized uniform distribution; same sample sizes.]
[Figure 3.3: Sampling distribution of the sample mean based on sample size T from the normalized Bernoulli distribution; same sample sizes.]
[Figure 3.4: Sampling distribution of T^{1/2} times the sample mean based on sample size T from the standard normal distribution; same sample sizes.]
[Figure 3.5: Sampling distribution of T^{1/2} times the sample mean based on sample size T from the normalized uniform distribution; same sample sizes.]
[Figure 3.6: Sampling distribution of T^{1/2} times the sample mean based on sample size T from the normalized Bernoulli distribution; same sample sizes.]
3.2 Large sample analysis of the OLS inference framework: cross-section data
We now apply the results from the previous sub-section to derive the large sample analysis of the OLS estimator and its associated statistics in the case of certain types of cross-section data.
Cross-section data can involve a variety of sampling units. Often in microe- conometrics, the sampling unit is an individual, a household or a firm. In other cases, the sampling unit may be defined geographically such as towns, counties (in the UK), states in the US or countries. To reflect common practice in the microeconometric literature, we adopt here the notation that uses i to index the observation and N as the sample size, and so write our model as
yi =x′iβ0+ui, i=1,2,...,N, (3.3)
where x_i = (1, x_{2,i}')' and the elements of x_{2,i} are assumed to be stochastic random variables. The nature of the sampling unit and the way in which the sample is drawn determine the nature of the relationship between sample members. In our analysis below, we assume that the sample is obtained by random sampling from a homogeneous population: this means that the random variables modeling the outcomes in the sample are independently and identically distributed (i.i.d.). As the name suggests, within this framework, the v_i = (u_i, x_i')' have the same distribution for each i, and v_i and v_j are independent for i ≠ j. This is the simplest framework, and while appropriate in many circumstances, it is not always valid. If the sample is obtained by random sampling from a heterogeneous population then it is more appropriate to assume the v_i are independently but not identically distributed (i.n.i.d.), where "independence" follows from random sampling but the heterogeneity of the population means we can no longer treat the observations as identically distributed.
In other cases, the “independence” assumption may not be valid. For exam- ple, suppose it is desired to learn about the returns to education and a sample of workers is obtained by randomly sampling firms and then including all workers from the selected firms. In this case, observations for workers from the same firm are unlikely to be independent as the wage-education relationship is likely to be subject to unobserved firm effects. This type of scenario is known as clus- ter sampling. The independence may also not hold if the sampling units involve geographical location. For example, if crop yields are recorded for a collection of contiguous states or counties then the outcomes are likely correlated due to common weather effects or soil composition. This type of dependence is known as spatial correlation.
The results developed below under the i.i.d. assumption also hold if the data are i.n.i.d. (although some details of the argument would be different) but do not hold in the cases of either cluster sampling or spatial correlation. We only cover the i.i.d. case but for those interested the other three scenarios are discussed to some extent in Greene (2012); see Sections D.2.6 (for i.n.i.d. data), 14.8.4 (cluster sampling) and 11.7 (spatial correlation).
We now introduce the assumptions upon which our analysis of OLS with cross-section data rests.
Assumption CS1. For all i, the data generation process for yi is: yi = x′iβ0 + ui.
So we maintain our assumption that we estimate a correctly specified model. The remaining assumptions involve properties of the error and regressors. For notational convenience, we present these in terms of x_i but as x_i' = (1, x_{2,i}'), these assumptions could be equivalently stated in terms of x_{2,i} as the inclusion of the constant adds no content to the restrictions in question.
Assumption CS2. {(ui, x′i), i = 1, 2, . . .N} forms an independent and iden- tically distributed sequence.
This assumption is discussed above. Notice that the “identically distributed” part of this assumption implies all unconditional moments of (ui, x′i) are inde- pendent of i.
Assumption CS3. E[x_i x_i'] = Q, where Q is a finite, positive definite matrix.
This assumption implies that no single element of x_i can be written as a linear combination of the others for every i with probability one.7
Assumption CS4. E[u_i|x_i] = 0.
This assumption implies ui is uncorrelated with the explanatory variables and also that E[ui] = 0. Notice that taken together Assumptions CS2 and CS4 imply Assumption SR4.
Assumption CS5. V ar[ui|xi] = σ02, positive, finite constant.
This assumption implies the errors are conditionally and unconditionally
homoscedastic. Assumptions CS2 and CS4 imply that E[uiuj|xi, xj] = E[ui|xi]E[uj|xj] = 0,
and so ui and uj are uncorrelated conditional on xi, xj (and also uncondition- ally). So taken together Assumptions CS2, CS4 and CS5 imply Assumption SR5.
We start by considering the large sample properties of the OLS estimator of β_0. To begin, we re-express the formula in (3.1) in terms of the notation we have adopted for our discussion of OLS in cross-sectional data. Accordingly, we now denote the OLS estimator by β̂_N where
β̂_N = ( ∑_{i=1}^N x_i x_i' )^{-1} ∑_{i=1}^N x_i y_i.   (3.4)
Multiplying and dividing the right-hand side of (3.4) by N, we obtain:
β̂_N − β_0 = ( N^{-1}∑_{i=1}^N x_i x_i' )^{-1} N^{-1}∑_{i=1}^N x_i u_i.   (3.5)
Inspection of (3.5) reveals that β̂_N − β_0 is a continuous function of N^{-1}∑_{i=1}^N x_i x_i' and N^{-1}∑_{i=1}^N x_i u_i.8 So we proceed by first deducing the limiting behaviour of the two standardized sums and then applying Slutsky's Theorem in order to deduce the limiting behaviour of the OLS estimator.
7To see this, suppose that x_{i,l} can be written as a linear combination of the remaining elements of x_i for every i and so there exists a non-null constant vector c such that P(c'x_i = 0) = 1. We now show this is incompatible with Assumption CS3. To this end, note that x_i x_i' is p.s.d. by construction (why?) and so Q must be either p.s.d. or p.d. If P(c'x_i = 0) = 1 then it must follow that P(c'x_i x_i'c = 0) = 1 and so E[c'x_i x_i'c] = c'E[x_i x_i']c = c'Qc = 0 (where the second identity holds because c is a constant). However, as c is non-null, c'Qc = 0 can only occur if Q is p.s.d. Therefore Q p.d. implies P(c'x_i = 0) < 1.
Using the WLLN and Assumptions CS2 and CS3, we have
N^{-1}∑_{i=1}^N x_i x_i' →p E[x_i x_i'] = Q,   (3.6)
N^{-1}∑_{i=1}^N x_i u_i →p E[x_i u_i].
To evaluate E[x_i u_i], we can use the Law of Iterated Expectations (LIE) and Assumption CS4 to deduce that E[x_i u_i] = E[x_i E[u_i|x_i]] = 0 and so
N^{-1}∑_{i=1}^N x_i u_i →p 0.   (3.7)
In order to apply Slutsky's Theorem, it is necessary not only that the function in question is continuous but also that the function of the limit exists. This could potentially be an issue here as β̂_N − β_0 depends on the inverse of N^{-1}∑_{i=1}^N x_i x_i', and if the latter matrix converged in probability to a singular matrix then the inverse of the limit would not be defined. However, Assumption CS3 implies that Q is nonsingular and hence Q^{-1} exists. Therefore, we can use Slutsky's Theorem in conjunction with (3.6)-(3.7) to deduce
β̂_N − β_0 →p Q^{-1} × 0 = 0,
which gives the following result.
Theorem 3.1. If Assumptions CS1 - CS4 hold then βˆN is a consistent estima- tor for β0.
8The elements of A^{-1} are continuous functions of the elements of A.
While Theorem 3.1 tells us that the OLS estimator converges in probability to the true parameter value, it does not provide us with any distributional result that could be used as a basis for inference. Such a result is obtained by considering the large sample behaviour of N1/2(βˆN − β0). This choice of scaling is determined by the CLT because from (3.1) we have
N^{1/2}(β̂_N − β_0) = ( N^{-1}∑_{i=1}^N x_i x_i' )^{-1} N^{-1/2}∑_{i=1}^N x_i u_i.   (3.8)
Therefore N^{1/2}(β̂_N − β_0) is the product of a random matrix, ( N^{-1}∑_{i=1}^N x_i x_i' )^{-1}, and a random vector, N^{-1/2}∑_{i=1}^N x_i u_i. We have already shown that the random matrix converges to a constant and we show below that the CLT implies the limiting distribution of the random vector is normal. This structure means that we can use Lemma 3.5 to deduce that N^{1/2}(β̂_N − β_0) also has a limiting distribution that is normal.
Since we have shown above that E[x_i u_i] = 0 for all i, it follows from the CLT that
N^{-1/2}∑_{i=1}^N x_i u_i →d N(0, Ω),
where Ω = lim_{N→∞} Ω_N and Ω_N = Var[ N^{-1/2}∑_{i=1}^N x_i u_i ]. Using (3.2), it follows that to evaluate Ω_N we must find Cov[x_i u_i, x_j u_j]. Since E[x_i u_i] = 0, it follows that Cov[x_i u_i, x_j u_j] = E[u_i u_j x_i x_j']. From Assumption CS2, {x_i u_i} is an i.i.d. sequence and so for i ≠ j, E[u_i u_j x_i x_j'] = E[u_i x_i]E[u_j x_j'] which equals zero because E[u_i x_i] = 0. So Cov[x_i u_i, x_j u_j] = 0 for i ≠ j. Therefore, it follows that
Ω_N = N^{-1}∑_{i=1}^N Var[x_i u_i].   (3.9)
Since E[x_i u_i] = 0, it follows that Var[x_i u_i] = E[u_i² x_i x_i']. Using the LIE, we have E[u_i² x_i x_i'] = E[ E[u_i²|x_i] x_i x_i' ]. By Assumptions CS4 and CS5, we have E[u_i²|x_i] = σ_0² and combining this result with Assumption CS3, we can deduce that Var[x_i u_i] = σ_0²Q. Substituting back into (3.9), we obtain Ω_N = σ_0²Q and hence Ω = σ_0²Q.
Therefore, we have
N^{-1/2}∑_{i=1}^N x_i u_i →d N(0, σ_0²Q).   (3.10)
Combining (3.6) and (3.10), we can use Lemma 3.5 to deduce the following result.9
9Note Q is symmetric by construction and for any square matrix M: (M^{-1})' = (M')^{-1}.
Theorem 3.2. If Assumptions CS1 - CS5 hold then: N^{1/2}(β̂_N − β_0) →d N(0, σ_0²Q^{-1}).
This result implies
τ̃_{N,l} = N^{1/2}(β̂_{N,l} − β_{0,l}) / √( σ_0² q_{l,l}^{(-1)} ) →d N(0, 1),   (3.11)
where q_{l,l}^{(-1)} is the l-lth element of Q^{-1}. To use this result as a basis for inference, it is necessary to have a consistent estimator of σ_0²Q^{-1}. The natural estimator for σ_0² is σ̂_N² (= e'e/(N − k) in our notation here). We previously showed that σ̂_N² is an unbiased estimator. The following result establishes its large sample properties.10
Theorem 3.3. If Assumptions CS1 - CS5 hold then σ̂_N² is a consistent estimator for σ_0².
The natural estimator of Q is Q̂ = N^{-1}∑_{i=1}^N x_i x_i' = N^{-1}X'X which from (3.6) is consistent for Q. Since σ_0²Q^{-1} is a continuous function of σ_0² and Q, it follows from Theorem 3.3, equation (3.6) and Slutsky's Theorem that
σ̂_N² Q̂^{-1} →p σ_0² Q^{-1}.   (3.12)
Thus σ̂_N² N m_{l,l} is a consistent estimator for σ_0² q_{l,l}^{(-1)}.11 If we replace σ_0² q_{l,l}^{(-1)} by σ̂_N² N m_{l,l} in the formula for τ̃_{N,l} then the resulting statistic is τ̂_{N,l} given by (2.36) with an appropriate conversion of notation. The large sample behaviour of τ̂_{N,l} can be deduced by noting that
τ̂_{N,l} = √( σ_0² q_{l,l}^{(-1)} / (σ̂_N² N m_{l,l}) ) τ̃_{N,l},
and so τ̂_{N,l} is the product of a random scalar that is converging in probability to one from (3.12), and a random variable that is converging to a standard normal distribution from (3.11). Therefore, via Lemma 3.5, we obtain the following result.
Theorem 3.4. If Assumptions CS1 - CS5 hold then: τˆN,l →d N(0, 1).
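Theorems 3.2 and 3.4 can be illustrated by simulation with deliberately non-normal errors, so that the finite sample arguments of Chapter 2 do not apply. In the sketch below (illustrative code; the recentred exponential errors and the design are invented for the example), the simulated t-statistics have mean near zero, standard deviation near one and a rejection rate near 5% as N grows.

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, reps = np.array([1.0, 2.0]), 5000

for N in [25, 100, 400]:
    tstats = np.empty(reps)
    for r in range(reps):
        X = np.column_stack([np.ones(N), rng.standard_normal(N)])
        u = rng.exponential(1.0, N) - 1.0          # mean-zero, skewed (non-normal) errors
        y = X @ beta0 + u
        beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
        e = y - X @ beta_hat
        se = np.sqrt((e @ e / (N - 2)) * np.linalg.inv(X.T @ X)[1, 1])
        tstats[r] = (beta_hat[1] - beta0[1]) / se
    print(f"N={N:>4}: mean={tstats.mean():.3f}, sd={tstats.std():.3f}, "
          f"rejection rate at 5% = {np.mean(np.abs(tstats) > 1.96):.3f}")
```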
Using similar steps as in Section 2.6, Theorem 3.4 can be used to justify the
following approximate 100(1 − α)% confidence interval for β0,l,
β̂_{N,l} ± z_{1−α/2} σ̂_N √(m_{l,l}),   (3.13)
10The proof is in Tutorial 4.
11Recall that m_{l,l} is the (l, l) element of (X'X)^{-1}. Since N(X'X)^{-1} = (N^{-1}X'X)^{-1} and N^{-1}X'X →p Q, p.d., it follows via Slutsky's Theorem that N(X'X)^{-1} = (N^{-1}X'X)^{-1} →p Q^{-1} and so N m_{l,l} →p q_{l,l}^{(-1)}.
where z_{1−α/2} is the 100(1−α/2)th percentile of the standard normal distribution. Here the adjective "approximate" refers to the fact that the probability statement associated with the interval is based on the limiting distribution. Notice that the only difference between the intervals in (3.13) and (2.39) is in the distribution from which the percentile is taken, and for N − k > 120 this difference is minimal.
We now consider a large sample basis for the hypothesis tests considered earlier. Consider first testing H0 : β0,l = β∗,l versus H1 : β0,l ̸= β∗,l. Following directly from our discussion of confidence intervals, our limiting distribution theory justifies the following decision rule:
Decision rule: reject H0 : β0,l = β∗,l in favour of H1 : β0,l ̸= β∗,l at the approximate 100α% significance level if
|τˆN,l(β∗,l)| > z1−α/2.
Again the adjective "approximate" refers to the fact that the probability statement associated with the decision rule (that is, the associated probability of a Type I error) is based on the limiting distribution. One-sided tests can be similarly constructed, but that is left as an exercise for the reader.
Now suppose we wish to test H_0: Rβ_0 = r vs H_1: Rβ_0 ≠ r. In Section 2.8.3, inference is based on F but to begin here, we focus on W_N = n_r F, that is,
W_N = N(Rβ̂_N − r)'[R(N^{-1}X'X)^{-1}R']^{-1}(Rβ̂_N − r)/σ̂_N².   (3.14)
From Theorem 3.2 and Lemma 3.5, it follows that
N^{1/2}R(β̂_N − β_0) →d N(0, σ_0²RQ^{-1}R'),
and so under H_0,
N^{1/2}(Rβ̂_N − r) →d N(0, σ_0²RQ^{-1}R').   (3.15)
From (3.12) and Slutsky's Theorem it follows that
[σ̂_N²R(N^{-1}X'X)^{-1}R']^{-1} →p [σ_0²RQ^{-1}R']^{-1}.   (3.16)
Using Lemma 3.6 in conjunction with (3.15) and (3.16), we obtain the following result.
Theorem 3.5. If Assumptions CS1 – CS5 and 2.1 hold then under H_0: Rβ_0 = r, we have: W_N →d χ²_{n_r}.
The limiting distribution of F can be obtained as a corollary to this theorem.12
Corollary 3.1. Under the conditions of Theorem 3.5, F →d F_{n_r,∞}.
12You can check this for yourself in MATLAB using the functions finv and chi2inv: for example calculate finv(.95,4,Inf)*4-chi2inv(.95,4).
So we have two equivalent decision rules for testing H0 : Rβ0 = r versus
H1 : Rβ0 ̸=r.
Decision rule: reject H_0: Rβ_0 = r in favour of H_1: Rβ_0 ≠ r at the approximate 100α% significance level if:
W_N > c_{n_r}(1 − α),
where c_{n_r}(1 − α) is the 100(1 − α)th percentile of the χ²_{n_r} distribution.
or
Decision rule: reject H0 : Rβ0 = r in favour of H1 : Rβ0 ̸= r at the approximate 100α% significance level if:
F >Fnr,∞(1−α),
where Fnr,∞(1 − α) is the 100(1 − α)th percentile of the Fnr,∞ distribution.
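Footnote 12 suggests verifying the link between the two critical values in MATLAB; an equivalent check in Python using scipy is sketched below (an illustrative translation, not part of the notes). A very large second degrees-of-freedom argument stands in for the infinite denominator degrees of freedom.

```python
from scipy import stats

alpha, nr = 0.05, 4
big = 1_000_000                                # stands in for an infinite denominator df
f_crit = stats.f.ppf(1 - alpha, nr, big)       # analogue of finv(.95, 4, Inf)
chi2_crit = stats.chi2.ppf(1 - alpha, nr)      # analogue of chi2inv(.95, 4)
print(nr * f_crit - chi2_crit)                 # essentially zero: the decision rules agree
```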
Our large sample framework can also be used to develop tests of whether β_0 satisfies a set of nonlinear restrictions. The key to this extension is the so-called Delta method which allows the linearization of the restrictions. To elaborate, suppose g(·) is a n_g × 1 vector of continuously differentiable functions, and let G(β̄) = ∂g(β)/∂β'|_{β=β̄}. Suppose further that we wish to test H_0: g(β_0) = 0 versus H_A: g(β_0) ≠ 0.
It is natural to base our inference about this null on g(βˆN ). Taking a first order Taylor Series expansion of g(βˆN ) around g(β0), we obtain
g(βˆN ) = g(β0) + G(β0)(βˆN − β0) + “remainder”, from which we can deduce
N^{1/2}[g(β̂_N) − g(β_0)] = G(β_0)N^{1/2}(β̂_N − β_0) + ξ_N,
where ξN = N 1/2 ∗ “remainder”. The key to the Delta method is that, under our assumptions, ξN is asymptotically negligible in the sense that ξN →p 0. This means that for the purposes of our large sample analysis under H0, we can replace N1/2g(βˆN ) by G(β0)N1/2(βˆN − β0) which is just a linear combination ofN1/2(βˆN −β0). Inordertoutilizethissubstitutionbelow,weneedone further assumption, in which for completeness we repeat our definition of g( · ) above.
Assumption 3.1. (i) g( · ) is a ng × 1 vector of continuously differentiable functions; (ii) rank{G(β0)} = ng.
Assumption 3.1(ii) plays the same role as Assumption 2.1. The common structure of the two assumptions arises because the Delta Method has allowed us to replace the nonlinear restrictions by a set of linear restrictions in which G(β0) plays the role of R in our earlier discussion.
Theorem 3.6. If Assumptions CS1 – CS5 and 3.1 hold then under H_0: g(β_0) = 0:
N^{1/2}g(β̂_N) →d N(0, σ_0²G(β_0)Q^{-1}{G(β_0)}').
These arguments lead us to the following test statistic
W_N^{(g)} = Ng(β̂_N)'[G(β̂_N)(N^{-1}X'X)^{-1}G(β̂_N)']^{-1}g(β̂_N)/σ̂_N²,   (3.17)
which can be recognized to have the same structure as W_N in (3.14) only with g(β̂_N) replacing Rβ̂_N − r and G(β̂_N) replacing R. From Theorem 3.1, Slutsky's Theorem and Assumption 3.1, it follows that G(β̂_N) →p G(β_0). We can then use the same arguments as those used to establish Theorem 3.5 above to deduce the following result.
Theorem 3.7. If Assumptions CS1 – CS5 and 3.1 hold then under H_0: g(β_0) = 0, we have: W_N^{(g)} →d χ²_{n_g}.
A decision rule for testing H_0: g(β_0) = 0 versus H_1: g(β_0) ≠ 0 based on W_N^{(g)} has the same form as the one for testing H_0: Rβ_0 = r based on W_N above.
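To make the Delta method concrete, the sketch below (illustrative code; the nonlinear restriction g(β) = β_2β_3 − 1 and the data generating process are invented for the example) computes W_N^{(g)} from (3.17) and compares it with the χ²_1 critical value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
N = 500
X = np.column_stack([np.ones(N), rng.standard_normal((N, 2))])
beta0 = np.array([1.0, 2.0, 0.5])           # note beta0[1] * beta0[2] = 1, so H0 is true
y = X @ beta0 + rng.standard_normal(N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
sigma2_hat = e @ e / (N - X.shape[1])

g = np.array([beta_hat[1] * beta_hat[2] - 1.0])            # g(beta_hat)
G = np.array([[0.0, beta_hat[2], beta_hat[1]]])            # G(beta_hat) = dg/dbeta'
V = G @ np.linalg.inv(X.T @ X / N) @ G.T                   # G (N^{-1}X'X)^{-1} G'
W = float(N * g @ np.linalg.solve(V, g) / sigma2_hat)      # statistic (3.17)

print(W, stats.chi2.ppf(0.95, 1))      # reject H0 at the 5% level if W exceeds the percentile
```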
3.3 Large sample analysis of the OLS inference framework: time series data
The key characteristic of all time series analysis is the variables are observed over time. There are two basic types of time series variable: a flow variable, which is measured over an interval of time, for example monthly consumption expenditures; and a stock variable, which is measured at a moment in time, such as price or quantity of shares owned. For time series, the index t denotes the moment in time at which the variable is observed. It is assumed here that the time series in question is observed at regularly spaced intervals. The frequency at which the time series is observed is known as the sampling frequency. In macroeconometrics, the sampling frequency is usually monthly, quarterly or annually. With financial data, the sampling frequency can also be higher with data available weekly, daily or even finer time intervals. 13
Although we work with the same generic linear regression model notation as before, parts of our analysis depend on the types of variables included in xt. It
13The assumption of data observed at regularly spaced intervals may not be strictly true. For example, suppose we observe daily data on the price of a financial asset. The asset is not traded at weekends and on holidays, and so there is no price for such days. In practice, this is handled by introducing dummy variables to account for "day of the week" effects.
is, therefore, useful to distinguish four different scenarios.
Static regression models: here xt contains variables that are observed contem-
poraneously with yt. For example, consider
yt = β0,1 + β0,2ht + ut
where yt and ht are variables observed in the same period. For example, the simple Keynesian consumption function takes this form, with yt being aggregate consumption in period t, and ht being aggregate income in period t. ⋄
This type of model is referred to as a "static" model because y_t depends only on variable(s) in the same period. While sometimes appropriate, it is often too restrictive for empirical analysis of economic time series. Intuition suggests many economic time series depend not only on contemporaneously observed variables but also on variables observed in the past. In the consumption function example, consumption likely depends not only on current income but also on past values of consumption and income. The inclusion of lagged values as explanatory variables leads to the class of dynamic regression models. This class can be broken down into three sub-classes as follows.
Finite distributed lag (FDL) models: here x_t contains variables observed at time t and before. For example, consider
y_t = α_0 + δ_{0,0}h_t + δ_{0,1}h_{t−1} + u_t = x_t'β_0 + u_t,
where x_t' = (1, h_t, h_{t−1}) and β_0 = (α_0, δ_{0,0}, δ_{0,1})'. Notice this implies a "dynamic" relationship between "y" and "h" in the sense that y_t depends on both h_t and h_{t−1}. Pursuing the consumption function example, this model implies aggregate consumption in period t depends on aggregate income both in period t and the period before. The specification naturally extends to models with more than one lag, for example,
y_t = α_0 + ∑_{i=0}^{l} δ_{0,i}h_{t−i} + u_t = x_t'β_0 + u_t,   (3.18)
where x_t' = (1, h_t, h_{t−1}, …, h_{t−l}) and β_0 = (α_0, δ_{0,0}, δ_{0,1}, …, δ_{0,l})', which is known as a FDL model of order l. Notice that within the FDL model in (3.18), all the coefficients δ_{0,0}, δ_{0,1}, …, δ_{0,l} tell us something about the relationship between y and h. The coefficient δ_{0,0} captures the impact of a unit change in h_t (i.e. ∆h_t = 1) on y_t, in other words, the immediate (or contemporaneous) impact of a change in h on y; as a result, δ_{0,0} is usually referred to as the impact propensity or impact multiplier. For i > 0, the coefficient δ_{0,i} captures the impact of a unit change in h_t on y_{t+i}. The total effect of a unit change in h_t on the path of y is the sum of all these impacts, namely δ_{0,0} + δ_{0,1} + … + δ_{0,l}, which is known as the long-run propensity or long-run multiplier.
Note that a FDL can also involve more than one explanatory variable: yt = α0 + δ0,0ht + δ0,1ht−1 + η0,0wt + η0,1wt−1 + ut,
where wt is an observable variable. ⋄
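As a small illustration of the impact and long-run propensities, the sketch below (illustrative code; the FDL(2) data generating process is invented) estimates the model by OLS on the effective sample and reports δ̂_{0,0} as the impact propensity and δ̂_{0,0} + δ̂_{0,1} + δ̂_{0,2} as the long-run propensity.

```python
import numpy as np

rng = np.random.default_rng(9)
Tstar = 500
h = rng.standard_normal(Tstar)
delta = np.array([0.8, 0.4, 0.2])                      # true delta_{0,0}, delta_{0,1}, delta_{0,2}
u = rng.standard_normal(Tstar)
y = 1.0 + delta[0] * h + np.r_[0, delta[1] * h[:-1]] + np.r_[0, 0, delta[2] * h[:-2]] + u

# Effective sample: the first two observations are used only for conditioning
t = np.arange(2, Tstar)
X = np.column_stack([np.ones(len(t)), h[t], h[t - 1], h[t - 2]])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y[t])

impact = beta_hat[1]                    # immediate effect of a unit change in h_t
long_run = beta_hat[1:].sum()           # sum of the estimated delta coefficients
print(impact, long_run)                 # close to 0.8 and 1.4
```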
Autoregressive (AR) time series models: here xt contains only lagged values of
yt. For example, the Autoregressive model of order one,
yt = β0,1 +β0,2yt−1 +ut = x′tβ0 +ut,
where x′t = (1, yt−1) and β0 = (β0,1, β0,2)′. Here yt is explained purely in terms of its own past values. In our consumption function example, such a specification would imply consumption in period t depends only on consumption in period t − 1 and not on income. More generally, an AR model of order p is:
y_t = β_{0,1} + ∑_{i=1}^{p} β_{0,i+1}y_{t−i} + u_t.
AR models are part of the class of linear time series models known as Au- toregressive Moving Average (ARMA) models. Following the very influential text by Box and Jenkins (1976), these models are widely applied in time series analysis. ⋄
General dynamic regression models: here xt contains current and past values of other variables, and lagged values of yt. For example,
yt = β0,1 + β0,2yt−1 + β0,3ht + β0,4ht−1 + ut = x′tβ0 + ut,
where x′t = (1, yt−1, ht, ht−1) and β0 = (β0,1, β0,2, β0,3, β0,4)′. In our consump- tion function example, this specification implies consumption in period t de- pends on income in period t as well as both lagged consumption and income. Notice that this class of models is the most general as it contains the other three as special cases. It is, therefore, no surprise that this type of model is widely employed in empirical analysis of economic time series.14 ⋄.
Throughout our discussion of time series, we assume the right hand side variables consist of: an intercept; ht, a vector of (non-degenerate) random vari- ables; lagged values of ht; lagged values of yt.15 Thus, we assume that yt is
14In the time series literature, this type of model is referred to as ARMAX; however, this terminology is less common in applied econometrics and so we do not use it in the course.
15If a rv is “degenerate” then it is a constant and so a non-degenerate rv has a non-zero probability of taking more than one value and so is stochastic.
generated by the general dynamic regression model,
y_t = γ_0 + ∑_{i=1}^{p_y} θ_{i,0}y_{t−i} + ∑_{j=0}^{p_h} h_{t−j}'δ_{j,0} + u_t = x_t'β_0 + u_t,   (3.19)
where
x_t' = [1, y_{t−1}, y_{t−2}, …, y_{t−p_y}, h_t', h_{t−1}', …, h_{t−p_h}'],
β_0' = [γ_0, θ_{1,0}, θ_{2,0}, …, θ_{p_y,0}, δ_{0,0}', δ_{1,0}', …, δ_{p_h,0}'],
and ut is, once again, an unobservable error term. Notice that the number of lags of y and h on the right hand side of (3.19) are py and ph respectively, and these two numbers need not be the same.
The presence of lagged variables on the right-hand side of the regression model means there is a difference between the number of observations on the variables and the number of observations used in the estimation. For example, if p_y = p_h = 1 in (3.19) and we start with T⋆ observations on y then the estimation is only based on T⋆ − 1 observations because t = 2 is the first observation for which we have both left hand and right hand side variables. Thus the effective sample size is T⋆ − 1 with y_1 and h_1 being used as conditioning observations. More generally, if p = max[p_y, p_h] is the longest lag to appear on the right hand side then the effective sample size is T⋆ − p with the first p observations being used as needed for conditioning. For ease of presentation below, we use T to denote the effective sample size and assume this sample runs from t = 1, 2, …, T with y_0, y_{−1}, …, y_{−p_y+1} and h_0, h_{−1}, …, h_{−p_h+1} available for conditioning. This way, our summations in the formula for the OLS estimator run from t = 1 to t = T as before.
In Section 3.3.2, we present a set of conditions under which the large sample inference framework in Section 3.2 is valid with times series data. Before we can do so, it is necessary to discuss briefly some key concepts that underpin the statistical analysis of time series.
3.3.1 Statistical background
The sampling framework for time series is different from the one used to under- pin our analysis of cross-section data. Our analysis here rests on a branch of statistics known as stochastic process theory. We do not need to delve into this theory too deeply, but do need to highlight certain key features to help motivate the assumptions that are imposed below.
In order to introduce the necessary framework, it is pedagogically useful to recap the sampling framework behind our analysis of cross-section data. Recall that we assume the data is generated by randomly sampling from some target population. Crucially, each observation provides information about the under- lying population, and if the sample size goes to infinity then we learn all there is
to know about the underlying probability distribution of the random variables in question.
The sampling framework for time series is fundamentally different. To explain why, we must first explain what is meant by a realization of a time series. To this end, let vt denote the random vector modeling the outcome of the variables of interest in period t. Although we only observe vt for the sample t = 1, 2, …, T, the stochastic process framework views this time series as evolving over time before we start observing it and continuing to evolve over time after we stop observing it, that is
…, v−3, v−2, v−1, v0, v1, v2, v3, …, vT, vT+1, vT+2, …,

where only the stretch v1, v2, …, vT is observed as the sample.
When we want to refer to the entire process we write: {vt}∞t=−∞ – it is this doubly infinite sequence that is known as a realization of vt. The sample space of a stochastic process is the set of all possible realizations of the time series. Unlike in the random sampling framework used for cross-section data, in time series there is only one draw from the population leading to a particular realization of the series. Thus, as the sample size grows, we see more of one realization of the process – or, in other words, we learn more about a single draw from the underlying population.
With such a sampling framework, it is not immediately obvious that we learn more about the underlying probability distribution of the data even as the sample size increases. In fact, this is only guaranteed if we place two types of restrictions on the time series. First, the distribution of vt must not vary “too much” with t. Second, every realization must visit all possible values for vt. Here these two restrictions are imposed by assuming respectively stationarity and weak dependence. We now discuss each in turn.
It is useful to distinguish two forms of stationarity, identified by the adjectives strong and weak.
Definition 3.6. The time series {vt}∞t=−∞ is said to be strongly stationary if the joint probability distribution function, F(.), of any subset of {vt} satisfies:
F(vt1,vt2,…,vtn) = F(vt1+c,vt2+c,…,vtn+c) for any integer n and integer constant c.
Strong stationarity implies the joint distribution of a collection of elements of the series depends only on the differences in time between them and not on their location in time. To illustrate, we consider a couple of simple cases. If we set n = 1 then the definition implies the marginal distribution of vt is the same as the marginal distribution of vs for any t ̸= s. If we set n = 2 then the definition implies the joint distribution of (vt,vs) is the same as that of (vt+c,vs+c) for any t ̸= s. It is important to note what this does and what it does not imply: it does imply that (v1, v2) and (v11, v12) have the same distribution as each other, and that (v1, v10) and (v11, v20) have the same distribution as each other; but it
does not imply (v1, v2) has the same distribution as (v1, v10). Notice that, since it restricts the probability distribution, it also implies any moments have the same property that is, for example E[f(v1, v2)] = E[f(v11, v12)] for any function f(·).
Weak stationarity is defined as follows.
Definition 3.7. The time series {vt}∞t=−∞ is said to be weakly stationary if for all t, s we have: (i) E[vt] = μ; (ii) V ar[vt] = Σ; (iii) Cov[vt, vs] = Σt−s.
The restrictions implied by parts (i) and (ii) are immediate from our earlier discussion: part (i) states that the mean of vt is independent of t; part (ii) states that the variance-covariance matrix of vt is independent of t. However, part (iii) requires further discussion. Part (iii) states that the covariance of vt and vs depends on the ordering of the two terms in the covariance and the distance between the two observations in time and not on their specific location in time.16 To explore the implications of this condition for the individual ele- ments of the vector time series, let vt,l be the lth element of vt. First consider the implications for the diagonal elements. The (l, l) element of Cov[vt, vs] is Cov[vt,l,vs,l] and so part (iii) implies that Cov[vt,l,vs,l] depends only on t − s, and so restricts the inter-temporal properties of individual elements of vt,l. The (l, j)th element of Cov[vt, vs] is Cov[vt,l, vs,j] and so part (iii) implies that Cov[vt,l, vs,j] depends on t − s, and so restricts the inter-temporal rela- tionship between different elements of vt. It is common to refer to Cov[vt,l, vs,l] as the autocovariance of vt,l at lag t − s, and Cov[vt, vs] as the autocovariance matrix of vt at lag t − s. Using this terminology, the implications of weak sta- tionarity can be summarized as being that the mean, variance-covariance and autocovariances are independent of time – or more succinctly that the first and second moments of vt are independent of time.
As anticipated by the nomenclature, strong stationarity implies weak sta- tionarity but the reverse is not true. It is reasonable to wonder why two concepts of stationarity exist. For linear models, as we have seen, the statistics of inter- est are functions of the first two moments of the data, and so it is sufficient to impose weak stationarity. In nonlinear models, it is more natural to restrict the underlying probability distribution of the data, and so impose strong stationar- ity. Finally, we note that the terminology adopted here is not universal: strong stationarity is sometimes referred to as strict stationarity, and weak stationarity as covariance stationarity.
Weak dependence places a restriction on the relationship between the current value of the time series and its distant past, that is, it places a restriction on the association between vt and vt−s as s tends to infinity. This feature of a time series is known as its memory. There are a number of ways to restrict the memory. It is beyond the scope of this course to develop a detailed understanding of this issue, and so we present an intuitive explanation. If we consider a weakly
16For example, Cov[v4,v1] = Cov[v10,v7] = Σ3, and Cov[v1,v4] = Cov[v7,v10] = Σ−3. In general, if vt is a vector then Σ3 ̸= Σ−3 but it can be shown that Σt−s = Σ′s−t (so that for example, Σ3 = Σ′−3). However if vt is a scalar then Σt−s = Σs−t as the transpose of a scalar is equal to the scalar itself.
stationary process, vt, then weak dependence implies that Cov[vt, vt−s] → 0 as s → ∞ and at a sufficiently fast rate. Here “sufficiently fast” means fast enough for us to be able to deduce the WLLN and CLT. For convenience, we reproduce these limit theorems here specialized to our time series setting. Note that these theorems hold under either strong or weak stationarity. In our discussion of linear models in this chapter, we invoke these results for weakly stationary processes, but in our discussion in subsequent chapters we appeal to the versions for strongly stationary time series.
Lemma 3.7. Weak Law of Large Numbers: Let vt be a (weakly or strongly) stationary and weakly dependent time series with E[vt] = μ then, subject to certain other conditions, it follows that

T−1 Σ_{t=1}^{T} vt →p μ.

Lemma 3.8. Central Limit Theorem: Let vt be a (weakly or strongly) stationary and weakly dependent time series with E[vt] = μ and Cov[vt, vt−j] = Γj then, subject to certain other conditions, it follows that

T−1/2 Σ_{t=1}^{T} (vt − μ) →d N(0, Ω),

where Ω = Σ_{i=−∞}^{∞} Γi = Γ0 + Σ_{i=1}^{∞} {Γi + Γ′i} is a finite, positive definite matrix.
Note that in both limit theorems, we include “subject to certain other con- ditions”. These conditions involve the existence of certain moments and more precise definitions of weak dependence, but these need not concern us here. It is implicitly assumed below that any stationary, weakly dependent series satisfies these additional conditions.
The form of the variance in the CLT can be motivated as follows. By definition we have Ω = lim_{T→∞} ΩT where, using (3.2),

ΩT = Var[ T−1/2 Σ_{t=1}^{T} (vt − μ) ] = T−1 Σ_{t=1}^{T} Σ_{s=1}^{T} Cov[vt, vs].    (3.20)

Since vt is stationary here, Cov[vt, vs] = Γt−s and so ΩT = T−1 Σ_{t=1}^{T} Σ_{s=1}^{T} Γt−s. After some manipulation of the latter identity, it can be shown that

ΩT = Γ0 + Σ_{l=1}^{T−1} ((T − l)/T) (Γl + Γ−l).    (3.21)

As T increases (T − l)/T → 1 for any fixed l and under the conditions of Lemma 3.8, it can be shown that ΩT converges to the matrix Ω given in the lemma.17
17In an appendix to this chapter, we demonstrate this result for the scalar case and use it as a vehicle to provide a set of sufficient conditions for weak dependence. The discussion in this appendix is not part of the course material but is included for the interest of readers who may wish to delve more into the statistical background of time series analysis.
The two formulae for Ω are equivalent because Γ′l = Γ−l. Within the context of time series data, the quantity Ω is known as the long run variance of vt. Notice that Lemma 3.8 includes the statement that under the conditions of the lemma, Ω is finite.
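As an illustration of the long run variance (not part of the original notes), the following sketch estimates Ω for a simulated scalar AR(1) series by truncating the sum of sample autocovariances. The truncation lag L, all variable names and parameter values are our own choices, and numpy is assumed to be available; for an AR(1) with coefficient ρ and innovation variance σ², the analytic long run variance works out to σ²/(1 − ρ)².

```python
import numpy as np

rng = np.random.default_rng(1)
T, rho, sigma = 200_000, 0.5, 1.0

# simulate a stationary AR(1): v_t = rho*v_{t-1} + e_t
e = rng.normal(0.0, sigma, T)
v = np.empty(T)
v[0] = e[0] / np.sqrt(1 - rho**2)          # draw v_0 from the stationary distribution
for t in range(1, T):
    v[t] = rho * v[t - 1] + e[t]

def autocov(x, lag):
    """Sample autocovariance at a given lag (demeaned, divided by T)."""
    x = x - x.mean()
    return (x[lag:] * x[:len(x) - lag]).sum() / len(x)

# truncated long run variance: Omega_hat = gamma_0 + 2 * sum_{l=1}^{L} gamma_l
L = 50
omega_hat = autocov(v, 0) + 2 * sum(autocov(v, l) for l in range(1, L + 1))

print(omega_hat)                            # close to the analytic value below
print(sigma**2 / (1 - rho)**2)              # equals 4 for rho = 0.5, sigma = 1
```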
The assumptions of stationarity and weak dependence are imposed because they provide a pedagogically convenient framework for developing our large sample analysis of regression models with time series data. However, these conditions are sufficient and not necessary for the inference framework developed. The framework can be justified under more general conditions at the expense of more complicated mathematical arguments. Nevertheless, the inference framework is not valid for all economic time series data. For example, many macroeconomic series are trending over time and whether or not the OLS-based inference framework below is valid depends on our assumptions about the source of the trend: if the trend is deterministic then the inference framework is still valid, but if the trend is stochastic – a so-called unit root process – then the OLS-based inference framework below is not valid. While it is beyond the scope of this course to elaborate in depth upon the previous statement, we illustrate its validity in an appendix to this chapter using two simple examples.18
3.3.2 Properties of OLS
Before presenting the large sample distribution of the OLS, we briefly return to our discussion of the finite sample properties of the OLS estimator in the stochastic regressor model. It can be recalled from Section 2.11 that our analysis revolved around assumptions involving the conditional moments of u given X. In our discussion of Assumption SR4 (E[u|X] = 0), we note that whether or not this assumption can hold with time series depends on whether or not xt is strictly exogenous. We now assess the nature of this restriction in more detail. To do so, it is necessary to introduce a more general version of the Law of Iterated Expectations.19
Lemma 3.9. The Law of Iterated Expectations part II (LIE-II): Let v, G, H be random scalars/vectors/matrices then E[v | G] = E [ E[v | G, H] | G].
This result tells us that the conditional expectation of v given G can be calculated in two steps. First we can take conditional expectations of v relative to G and any other random variable/vector/matrix H and then on the second step take the conditional expectation of E[v|G, H] with respect to G.
The LIE-II is useful here because it allows us to deduce the implications of E[u|X] = 0 – or equivalently E[ut|X] = 0 for all t – for E[ut|xs]. To this end, we partition {xt}Tt=1 into two sets: G = xs and
H = Hs = {x1,x2,…,xs−1,xs+1,…xT}.
18These types of models are discussed in ECON60522. 19See Lemma 2.6 in Section 2.11.
Note that G ∪ H = {xt}Tt=1. Taking advantage of this partition we have via the LIE-II that
E[ut|xs] = E[E[ut|Hs,xs]|xs] = E[E[ut|X]|xs] = 0,
because E[ut|X] = 0. Using E[ut|xs] = 0, we can deduce that Cov[xs, ut] = 0
because E[ut] = 0 (why?) and so
Cov[ut,xs] = E[utxs] = E[xs E[ut |xs]] = 0, for all t,s.
Since Cov[ut,xs] = 0 for any choice of t and s, it follows that if E[u|X] = 0 then ut is uncorrelated with {xt; t = 1,2,…,T}, and so xt is said to be strictly exogenous.
Whether or not the assumption can be maintained in practice depends on the model in question. One case in which it must fail is where xt contains lagged values of y. To illustrate, consider the AR(1) model,
yt = β0,1 + β0,2yt−1 + ut, (3.22)
in which case xt = (1, yt−1)′. To demonstrate that xt is not strictly exogenous in this model, it suffices to show that if E[xtut] = 0 – as it should be if xt is strictly exogenous – then E[xt+1ut] ̸= 0. To this end, we focus on the (2,1) element of E[xt+1ut] namely, E[ytut]. Using (3.22), it follows that
E[ytut] = E [ (β0,1 + β0,2yt−1 + ut)ut ]
= β0,1E[ut] + β0,2E[utyt−1] + E[u2t] = E[u2t] ̸= 0,
because E[xtut] = 0 implies both E[ut] = 0 and E[utyt−1] = 0.
Although xt is not strictly exogenous in this model, it can be contemporaneously exogenous under certain conditions on ut. To show this, it is convenient to derive an alternative representation for yt as follows. Since (3.22) holds for all t, we can use repeated back substitution to show that:

yt = β0,1 + β0,2(β0,1 + β0,2 yt−2 + ut−1) + ut,
   = β0,1 + β0,1β0,2 + β0,2² yt−2 + β0,2 ut−1 + ut,
   = β0,1 + β0,1β0,2 + β0,2²(β0,1 + β0,2 yt−3 + ut−2) + β0,2 ut−1 + ut,
   ⋮
   = β0,1 Σ_{j=0}^{m−1} β0,2^j + β0,2^m yt−m + Σ_{j=0}^{m−1} β0,2^j ut−j.    (3.23)

A necessary condition for an AR(1) process to be weakly stationary is |β0,2| < 1 and if this holds then by letting m → ∞ in (3.23) it can be shown that yt also has the following representation20

yt = μy + Σ_{j=0}^{∞} β0,2^j ut−j.    (3.24)
Notice (3.24) implies that yt is a function of ut (the contemporaneous value of the error) and {ut−j; j = 1, 2, . . .} (lagged values of the error) but not {ut+j; j = 1, 2, . . .} (future values of the error). So if we assume that {ut} is an i.i.d. sequence with mean zero then it follows that ut is independent of yt−1 (as yt−1 is a function of {ut−j; j = 1, 2, . . .}) - and so E[ut|yt−1] = E[ut] = 0: thus E[ut|xt] = 0 and xt is contemporaneously exogenous.21
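The combination of contemporaneous exogeneity and failure of strict exogeneity can be checked by simulation. The following sketch (our own illustration, assuming numpy; all names and values are hypothetical) generates the AR(1) model (3.22) with i.i.d. errors and confirms that ut is essentially uncorrelated with the regressor yt−1 but clearly correlated with yt, which appears in xt+1.

```python
import numpy as np

rng = np.random.default_rng(2)
T, b1, b2 = 100_000, 0.5, 0.8

u = rng.standard_normal(T)
y = np.empty(T)
y[0] = b1 / (1 - b2)                       # start at the unconditional mean
for t in range(1, T):
    y[t] = b1 + b2 * y[t - 1] + u[t]

# contemporaneous exogeneity: u_t uncorrelated with y_{t-1} (the regressor)
print(np.corrcoef(u[1:], y[:-1])[0, 1])    # approximately 0

# failure of strict exogeneity: u_t correlated with y_t, which enters x_{t+1}
print(np.corrcoef(u[1:], y[1:])[0, 1])     # clearly non-zero
```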
Even in models without lagged values of yt as explanatory variables, it is very difficult to argue that xt is strictly exogenous in many time series econometric models. As a result, we must turn to large sample analysis to justify our OLS-based inference procedures. To this end, we now present a set of regularity conditions under which the inference procedures described in Section 3.2 are valid when the data are time series.
Assumption TS1. yt is generated via the general dynamic linear regression model in (3.19).
This is the familiar assumption that we estimate a correctly specified model. Assumption TS2. (yt,h′t) is a weakly stationary, weakly dependent time se-
ries.
This assumption is imposed to facilitate the application of the WLLN and CLT, as discussed in Section 3.3.1.
Assumption TS3. E[xtx′t] = Q, a finite, positive definite matrix.
This assumption plays a similar role to Assumption CS3 and ensures T−1X′X
has a finite nonsingular probability limit.
Assumption TS4. E[ut |xt] = 0 for all t = 1,2,...,T.
As discussed in Section 2.13 and above, if this assumption holds then xt is said to be contemporaneously exogenous.
Assumption TS5. V ar[ut|xt] = σ02 where σ02 is a finite positive constant.
20This representation for yt is known as a moving average (MA) model of order infinity; see Tutorial 5 for further discussion of the properties of AR and MA models. In this model, E[yt] = μy = β0,1/(1 − β0,2) - why?
21Parenthetically, we note that in time series analysis of ARMA models the error process is often assumed to be “white noise”, that is, it has mean zero, is homoscedastic and is serially uncorrelated. This is a weaker assumption than the error process being i.i.d. with mean zero and constant variance. However, if the errors also have a normal distribution then the two sets of conditions are equivalent. See Tutorial 5 for some examples of white noise processes and further analysis of the properties of an AR(1) process.
This assumption plays the same role as Assumption CS5 and states that the errors are conditionally (and unconditionally) homoscedastic.
Assumption TS6. For all t ̸= s, E[utus | xt, xs] = 0.
This assumption combined with Assumption TS4 implies Cov[ut, us|xt, xs] = 0 that is, that ut and us are conditionally uncorrelated given xt and xs. This, in turn, implies Cov[ut, us] = 0 for any t ̸= s in which case the errors are said to be serially uncorrelated or equivalently to exhibit no autocorrelation. Assumption TS6 also implies that Cov[xtut, xsus] = 0 for all t ̸= s and it is this implication that is relevant to the form of the limiting distribution of the OLS estimator presented below.
In truth, it is more natural to think in terms of the unconditional autocovariances of the errors here, and it is hard to motivate Assumption TS6 in the way it is written; nevertheless, it delivers the required result for the OLS estimator below and so we stick with it for now. Below we return to this assumption and a more intuitive condition that implies the stated property of the errors.
Under these assumptions, we can use analogous arguments to those used to prove Theorem 3.1 to establish the consistency of the OLS estimator in our time series regression model.
Theorem 3.8. If Assumptions TS1 - TS4 hold then βˆT is a consistent estimator for β0.
Recall that the large sample distribution of the OLS estimator is derived
from
T1/2(βˆT − β0) = ( T−1 Σ_{t=1}^{T} xtx′t )−1 T−1/2 Σ_{t=1}^{T} xtut,    (3.25)
using the WLLN and the CLT. In the context here, we must appeal to the versions of these limit theorems in Lemmas 3.7 and 3.8. For the CLT, we need to derive the variance-covariance matrix of xtut (Γ0 in the notation of Lemma 3.8) and the autocovariance matrices of xtut (Γi for i ̸= 0 in the notation of the lemma). Since Assumption TS4 implies E[xtut] = 0, we can use the LIE and Assumption TS5 to deduce that:
Γ0 = E[u2t xtx′t] = E[ E[u2t |xt] xtx′t ] = E[σ02 xtx′t] = σ02Q.
A similar argument - only appealing to Assumption TS6 instead of Assumption
TS5 - yields:
Γi = E[utut−i xtx′t−i] = E[ E[utut−i|xt, xt−i] xtx′t−i ] = E[0 × xtx′t−i] = 0.
Therefore, Ω = Γ0 = σ02Q. It then follows by an analogous argument to the proof of Theorem 3.2 that we have the following result.
Theorem 3.9. If Assumptions TS1 - TS6 hold then: T1/2(βˆT − β0) →d N(0, σ02Q−1).
To perform inference based on this result, we need a consistent estimator. In our time series notation, the OLS estimator of σ02 is
σˆT2 =e′e/(T−k).
The following result establishes the consistency of this estimator; the proof
follows analogous arguments to those used in the proof with cross-section data. Theorem 3.10. If Assumptions TS1 - TS6 hold then σˆT2 is a consistent esti-
mator for σ02.
Although the conditions are different, it can be recognized that Theorems 3.8, 3.9 and 3.10 deliver qualitatively the same results as those obtained in our earlier analysis for cross-section data in Section 3.2. In fact, all the inference procedures described in Section 3.2 continue to be valid in time series regression models under Assumptions TS1-TS6. For completeness and ease of reference in our subsequent discussion, these inference procedures and the results on which they rest are repeated here.
First consider inferences about β0,l. Define
τˆT,l = (βˆT,l − β0,l) / (σˆT √ml,l),

where ml,l is the (l, l) element of (X′X)−1 = (Σ_{t=1}^{T} xtx′t)−1. The large sample
behaviour of τˆT,l is as follows.
Theorem 3.11. If Assumptions TS1 - TS6 hold then: τˆT,l →d N(0, 1).
This result justifies the following approximate 100(1−α)% confidence interval for β0,l,

βˆT,l ± z1−α/2 σˆT √ml,l,    (3.26)

where z1−α/2 is the 100(1−α/2)th percentile of the standard normal distribution. Consider now testing H0 : β0,l = β∗,l against H1 : β0,l ̸= β∗,l. Theorem 3.11 justifies the following decision rule:

Decision rule: reject H0 : β0,l = β∗,l in favour of H1 : β0,l ̸= β∗,l at the approximate 100α% significance level if |τˆT,l(β∗,l)| > z1−α/2,

where

τˆT,l(β∗,l) = (βˆT,l − β∗,l) / (σˆT √ml,l).    (3.27)

To test H0 : Rβ0 = r vs H1 : Rβ0 ̸= r, we can use the test statistic

WT = T(RβˆT − r)′ [R(T−1X′X)−1R′]−1 (RβˆT − r)/σˆT2.
Its large sample behaviour is given by the following result.
Theorem 3.12. If Assumptions TS1 – TS6 and 2.1 hold then under H0 : Rβ0 =
r, we have: WT →d χ2nr.
This leads to the following decision rule for testing H0 : Rβ0 = r versus
H1 : Rβ0 ̸=r.
Decision rule: reject H0 : Rβ0 = r in favour of H1 : Rβ0 ̸= r at the approximate 100α% significance level if:

WT > cnr(1 − α),

where cnr(1 − α) is the 100(1 − α)th percentile of the χ2nr distribution.

Finally we consider testing whether β0 satisfies a set of nonlinear restrictions. As in Section 3.2, we denote these restrictions by g(β0) = 0. Inference can be based on the following test statistic,

WT(g) = T g(βˆT)′ [ G(βˆT)(T−1X′X)−1G(βˆT)′ ]−1 g(βˆT)/σˆT2.    (3.28)

If the restrictions are true then this statistic has the following limiting distribution.
Theorem 3.13. If Assumptions TS1 – TS6 and 3.1 hold then under H0 : g(β0) = 0, we have: WT(g) →d χ2ng.

A decision rule for testing H0 : g(β0) = 0 versus H1 : g(β0) ̸= 0 based on WT(g) has the same form as the one for testing H0 : Rβ0 = r based on WT above.
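To show how these inference procedures are computed in practice, the following sketch (an illustration of ours, not part of the notes; it assumes numpy and scipy are available, and the simulated model and parameter values are hypothetical) estimates a simple dynamic regression by OLS, forms the approximate confidence interval (3.26), and computes the Wald statistic for two linear restrictions.

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(3)
T = 2_000

# simulate y_t = 1 + 0.5*y_{t-1} + 0.3*h_t + u_t with i.i.d. errors
h = rng.standard_normal(T + 1)
u = rng.standard_normal(T + 1)
y = np.zeros(T + 1)
for t in range(1, T + 1):
    y[t] = 1.0 + 0.5 * y[t - 1] + 0.3 * h[t] + u[t]

X = np.column_stack([np.ones(T), y[:-1], h[1:]])   # intercept, y_{t-1}, h_t
Y = y[1:]
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
sigma2_hat = e @ e / (T - X.shape[1])
XtX_inv = np.linalg.inv(X.T @ X)

# approximate 95% confidence interval (3.26) for the coefficient on y_{t-1}
l = 1
se_l = np.sqrt(sigma2_hat * XtX_inv[l, l])
z = norm.ppf(0.975)
print(beta_hat[l] - z * se_l, beta_hat[l] + z * se_l)

# Wald test of H0: coefficients on y_{t-1} and h_t equal (0.5, 0.3), n_r = 2;
# the factors of T in the definition of W_T cancel, leaving the form below
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
r = np.array([0.5, 0.3])
diff = R @ beta_hat - r
W = diff @ np.linalg.solve(R @ XtX_inv @ R.T, diff) / sigma2_hat
print(W, chi2.ppf(0.95, df=2))   # reject H0 if W exceeds the critical value
```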
To conclude this section, we return to Assumption TS6 and introduce an alternative condition that would also imply the errors are serially uncorrelated. To motivate this condition, we must reconsider how we interpret our regression model. So far, we have treated the regression model as a statement about the conditional mean of yt given some information set. In cross-section data, this information set contains information about the ith sampling unit, be it an individual, a household or a firm. For time series, the relevant information set is not only ht but the history of both y and h that is,
It = {ht,yt−1,ht−1,yt−2,ht−2,…,y1,h1 }.
Therefore, if our general linear dynamic regression model is viewed as specifying
the conditional expectation of yt then we are really stating that we believe:
E[yt | It] = x′tβ0, (3.29)
that is, x′tβ0 is the expectation of yt conditional on information available at time t. If equation (3.29) holds then the model is said to be dynamically complete. It is convenient to introduce this property as a specific assumption.
Assumption TS7. E[yt | It] = x′tβ0.
Since xt ∈ It by construction, it follows that Assumption TS7 implies Assumption TS4. Dynamic completeness also implies that the errors are serially uncorrelated and that Cov[xtut, xsus] = 0 for all t ̸= s. To see why, note that if t > s then us = ys − x′sβ0 ∈ It and so we have:

Cov[ut, us] = E[utus] = E[ E[utus|It] ] = E[ E[ut|It]us ] = 0

and

Cov[xtut, xsus] = E[utusxtx′s] = E[ E[utusxtx′s | It] ] = E[ E[ut|It]usxtx′s ] = 0.
Therefore, we can equivalently state Theorem 3.8 with Assumption TS7 replacing Assumption TS4, and Theorem 3.9 with Assumption TS7 replacing Assumptions TS4 and TS6.
While this may seem like mathematical semantics as both deliver the same result here, the difference between them becomes important in the discussion of the impact of serial correlation in the errors on the inference procedures discussed above.
3.4 Appendix: more on the long run variance and weak dependence
In this section, we provide a formal justification for the formula for the long run variance in Lemma 3.8. To this end, we place a set of conditions on {vt} that are sufficient for the desired result. The role of these conditions in the derivation provides insight into why a restriction on the memory is needed for the development of our large sample framework for time series.22
We restrict attention to the case where {vt}∞t=−∞ is a scalar (univariate) weakly stationary time series; let E[vt] = μ and Cov[vt, vt−j] = γj. In this case, it follows from (3.20) that ΩT is:
ΩT = T−1 Σ_{t=1}^{T} Σ_{s=1}^{T} γ|t−s|.    (3.30)
Note that, since {γ|t−s|} can take any (finite) values in principle, the quantity on the right-hand side of (3.30) is not guaranteed to converge to a finite constant. The following theorem provides a set of sufficient conditions under which it does converge.
22Although the conditions here are sufficient not necessary, the analysis, nevertheless, gives an intuitive understanding of why the memory needs to be restricted.
Theorem 3.14. Let {vt, t = −∞, …, −1, 0, 1, …, ∞} be a weakly stationary process with mean E[vt] = μ and autocovariances Cov[vt, vt−j] = γj for j = 0, 1, …, and E[vt2] < ∞. If Σ_{l=0}^{∞} |γl| < ∞ then

lim_{T→∞} ΩT = Ω = γ0 + 2 Σ_{l=1}^{∞} γl < ∞.
Note that in the case where vt is a scalar time series then Γi = γi = Γ′i and so the formula for Ω in Lemma 3.8 reduces to the formula given in Theorem 3.14. Before proving Theorem 3.14, we comment on the nature of the restrictions imposed. The theorem states that the long run variance exists and has the form given for all weakly stationary time series whose autocovariances satisfy the restriction Σ_{l=0}^{∞} |γl| < ∞. If this condition is satisfied then the autocovariances are said to be absolutely summable. An implication of the condition is that |γl| → 0 as l → ∞. So, the assumption of absolutely summable autocovariances is restricting attention to time series for which the dependence between vt and vt−l (as measured by γl) dies out as l → ∞. Thus, the assumption of absolutely summable autocovariances is placing a restriction on the memory of the series.
Proof of Theorem 3.14: We split the proof into two parts. First, we show that limT→∞ΩT is finite, and second, we show it equals the expression in the theorem.
To establish that the variance is finite in the limit, our proof strategy is to establish that ΩT is bounded by a finite quantity under the conditions stated in the theorem. From (3.30), it follows that (verify this for yourselves)
ΩT = γ0 + 2 Σ_{l=1}^{T−1} ((T − l)/T) γl.    (3.31)

Noting that 0 < (T − l)/T < 1 for l = 1, 2, …, T − 1, it can be seen that

| γ0 + 2 Σ_{l=1}^{T−1} ((T − l)/T) γl | ≤ |γ0| + 2 Σ_{l=1}^{T−1} ((T − l)/T) |γl| ≤ 2 Σ_{l=0}^{T−1} |γl|.    (3.32)

Combining (3.31)-(3.32), it follows that

ΩT ≤ 2 Σ_{l=0}^{T−1} |γl|.    (3.33)

Taking the limit on both sides, we obtain

lim_{T→∞} ΩT ≤ 2 lim_{T→∞} Σ_{l=0}^{T−1} |γl| = 2 Σ_{l=0}^{∞} |γl| < ∞,    (3.34)

because the autocovariances are absolutely summable.
We now turn to establishing the form of the limit. Since (T − l)/T = 1 − l/T , it follows from (3.31) that
ΩT = γ0 + 2 Σ_{l=1}^{T−1} γl − 2 Σ_{l=1}^{T−1} (l/T) γl.    (3.35)

Now consider the second term on the right hand side of (3.35). We have

Σ_{l=1}^{T−1} (l/T) γl = T−1 Σ_{l=1}^{T−1} l γl.    (3.36)

This function can be bounded as follows:

| T−1 Σ_{l=1}^{T−1} l γl | ≤ T−1 Σ_{l=1}^{T−1} l |γl|.

Since |γl| ≥ 0, the absolute summability of the autocovariances implies lim_{n→∞} Σ_{l=1}^{n} |γl| < c < ∞ and so23

lim_{T→∞} T−1 Σ_{l=1}^{T−1} l |γl| = 0,    (3.37)

which implies lim_{T→∞} T−1 Σ_{l=1}^{T−1} l γl = 0 via (3.36). Therefore, by taking the limit on both sides of (3.35) and using (3.37), we obtain

lim_{T→∞} ΩT = lim_{T→∞} ( γ0 + 2 Σ_{l=1}^{T−1} γl ) − 2 lim_{T→∞} Σ_{l=1}^{T−1} (l/T) γl
            = lim_{T→∞} ( γ0 + 2 Σ_{l=1}^{T−1} γl ) = γ0 + 2 Σ_{l=1}^{∞} γl.  ⋄

23Here we take advantage of the following useful result: if lim_{n→∞} Σ_{j=1}^{n} aj = a < ∞ then lim_{n→∞} n−1 Σ_{j=1}^{n} j aj = 0.
3.5 Appendix: The large sample behaviour of OLS estimators with trending data
In this section, we explore the behaviour of OLS estimators in simplified versions of two models that have been proposed to capture the nonstationary behaviour of macroeconomic series. These are: (i) the linear deterministic trend model; (ii) the stochastic trend or unit root model. We consider each in turn.
(i) the linear deterministic trend model:
Consider the case yt is generated via the model
yt = β0,1 +β0,2t+ut = x′tβ0 +ut (3.38)
where x′t = (1,t), β0 = (β0,1,β0,2)′. If this model is used to capture the time series properties of macroeconomic variables then {ut} is typically a weakly stationary, weakly dependent time series. In other cases - such as our running traffic example - {ut} might be assumed to be i.i.d. but other weakly stationary, weakly dependent time series variables are included as regressors. To simplify the development of the analytical arguments below, we restrict attention to the case where there are no other explanatory variables in the model and {ut} is a sequence of i.i.d. random variables with mean zero and variance σ02. However, we discuss the extension of our analysis to the model with additional regressors included below.
It is straightforward to show that this model implies yt is a non-stationary series: taking expectations, we have E[yt] = β0,1 + β0,2t and so, provided β0,2 ̸= 0, E[yt] depends on t.
We now consider the large sample distribution of the OLS estimator in this model. Using standard arguments and the definition of xt, βˆT, the OLS estimator of β0, satisfies

βˆT − β0 = (X′X)−1X′u    (3.39)
        = [ T, Σ_{t=1}^{T} t ; Σ_{t=1}^{T} t, Σ_{t=1}^{T} t² ]−1 [ Σ_{t=1}^{T} ut ; Σ_{t=1}^{T} t ut ],    (3.40)

where the matrices are written row by row. To develop a large sample theory in this case, it is necessary to scale the elements of βˆT − β0 by different functions of T. To this end, we introduce the matrix CT = diag(T1/2, T3/2) and consider

CT(βˆT − β0) = [ T1/2(βˆT,1 − β0,1) ; T3/2(βˆT,2 − β0,2) ].

Using (3.39), it follows that

CT(βˆT − β0) = CT(X′X)−1X′u
            = CT(X′X)−1CT C_T^{−1}X′u
            = (C_T^{−1}X′X C_T^{−1})−1 C_T^{−1}X′u.    (3.41)

Given that Σ_{t=1}^{T} t = T(T + 1)/2 and Σ_{t=1}^{T} t² = T(T + 1)(2T + 1)/6, it follows that

C_T^{−1}X′X C_T^{−1} = [ 1, T(T+1)/(2T²) ; T(T+1)/(2T²), T(T+1)(2T+1)/(6T³) ],

and so24

C_T^{−1}X′X C_T^{−1} → [ 1, 1/2 ; 1/2, 1/3 ] = Q, say, as T → ∞.    (3.42)

Note that Q is a nonsingular matrix. Now consider

C_T^{−1}X′u = [ T−1/2 Σ_{t=1}^{T} ut ; T−3/2 Σ_{t=1}^{T} t ut ].

It can be shown that

C_T^{−1}X′u →d N(0, σ02Q).    (3.43)
A formal proof is beyond the scope of this course but we can provide some intuition as to why such a result holds by considering the marginal distributions of the two elements of C_T^{−1}X′u. First consider T−1/2 Σ_{t=1}^{T} ut. Since {ut} ∼ IN(0, σ02), we can apply the CLT for i.i.d. random variables to deduce that T−1/2 Σ_{t=1}^{T} ut →d N(0, σ02). Now consider T−3/2 Σ_{t=1}^{T} t ut, and note that we can write

T−1/2 Σ_{t=1}^{T} (t/T) ut = T−1/2 Σ_{t=1}^{T} at,

where at = t ut/T. Given that {ut} ∼ IN(0, σ02), it follows that E[at] = 0, Var[at] = t²σ02/T² and {at} is an independent sequence: therefore, {at} is an independently but not identically distributed (i.n.i.d.) sequence of random variables with mean 0 and

Var[ T−1/2 Σ_{t=1}^{T} at ] = σ02 Σ_{t=1}^{T} t²/T³ = σ02 T(T + 1)(2T + 1)/(6T³) → σ02/3, as T → ∞.

Under these conditions, we can invoke the CLT for i.n.i.d. sequences to deduce that

T−1/2 Σ_{t=1}^{T} (t/T) ut →d N(0, σ02/3).

Finally, using (3.41)-(3.43) and Lemma 3.5, we have that CT(βˆT − β0) →d N(0, σ02Q−1).25

24Recall Tutorial 5 Question 1.
25Note that this result implies βˆT →p β0 and so βˆT is consistent for β0.

A similar proof strategy can be used to show that the results
in Theorems 3.11 and 3.12 hold in linear regression models where the linear time trend and potentially seasonal dummies are included as regressors and the other regressors are stationary, weakly dependent variables. Thus, the inference framework described in Section 3.3.2 remains valid in this more general setting.26
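The different scaling rates for the intercept and trend coefficient can be seen in a small simulation. The sketch below (our own illustration, assuming numpy; the parameter values are hypothetical) re-estimates the linear trend model (3.38) over many replications and checks that the spread of T3/2(βˆT,2 − β0,2) stabilises as T grows, consistent with the (2,2) element of σ02Q−1 implied by the Q in (3.42).

```python
import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, 0.2])                  # (beta_{0,1}, beta_{0,2})
sigma0 = 1.0

def ols_trend(T):
    """OLS in the deterministic trend model y_t = beta_1 + beta_2*t + u_t."""
    t = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), t])
    y = X @ beta + rng.normal(0.0, sigma0, T)
    return np.linalg.solve(X.T @ X, X.T @ y)

for T in (100, 400, 1600):
    reps = np.array([ols_trend(T) for _ in range(2000)])
    scaled_slope = T**1.5 * (reps[:, 1] - beta[1])
    # with Q = [[1, 1/2], [1/2, 1/3]], the (2,2) element of sigma0^2*Q^{-1} is 12,
    # so the standard deviation should settle near sqrt(12) ~= 3.46
    print(T, scaled_slope.std())
```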
(ii) a unit root model:
Consider the AR(1) model
yt = β0yt−1 + ut. (3.44)
where β0 = 1 and {ut} ∼ IN(0,σ02). Recall that if |β0| < 1 then yt is a weakly stationary process.27 Here we consider the case where β0 = 1 and so yt is a so-called unit root process.28 It is immediately apparent that β0 = 1 is outside the range of parameter values for which yt is weakly stationary. Below we demonstrate both that (3.44) with β0 = 1 yields a non-stationary process for yt and also that the limiting distribution of the OLS estimator is non-standard.
To facilitate the presentation, we set y0 = 0: this serves to simplify the analysis without affecting the qualitative point being made. First, we demonstrate that yt is a non-stationary process. Back substitution using (3.44) (with β0 = 1 and y0 = 0) yields

yt = Σ_{i=1}^{t} ui,    (3.45)

which, since ui ∼ IN(0, σ02), implies yt ∼ N(0, tσ02). Therefore, Var[yt] depends on t and so yt is not a stationary process. We now turn to the large sample behaviour of the OLS estimator. It follows from the algebra of OLS that if β0 = 1 then

βˆT − 1 = Σ_{t=1}^{T} yt−1ut / Σ_{t=1}^{T} yt−1².    (3.46)

To deduce the large sample behaviour of βˆT, we consider the large sample behaviour of the numerator and denominator on the right hand side of (3.46). Consider first Σ_{t=1}^{T} yt−1ut. Since

yt² = (yt−1 + ut)² = yt−1² + 2yt−1ut + ut²,

it follows that

yt−1ut = 0.5{yt² − yt−1² − ut²},
26This (potentially) includes the model employed in the analysis of traffic fatalities in our running example, but there may be concerns about the quality of the large sample approxi- mation in this case given T = 108.
27For example, see Tutorial 4 Question 4.
28This particular example of a unit root process is known as a random walk process. For the process to possess a stochastic trend, equation (3.44) must also include a non-zero intercept on the right-hand side but we omit this term to simplify the presentation.
and so (using y0 = 0)

Σ_{t=1}^{T} yt−1ut = 0.5yT² − 0.5 Σ_{t=1}^{T} ut².    (3.47)

Now, as noted above, yt ∼ N(0, tσ02), and so yT/T1/2 ∼ N(0, σ02); therefore, we have

yT²/(Tσ02) →d χ21.    (3.48)

Furthermore, since T−1 Σ_{t=1}^{T} ut² →p σ02 via the WLLN, it thus follows from (3.47) that under our assumptions,

T−1 Σ_{t=1}^{T} yt−1ut →d 0.5σ02{χ21 − 1} = ΞN, say.    (3.49)

The denominator on the right-hand side of (3.46) is more complicated to analyze, and so for this term we jump to the bottom line: it can be shown that T−2 Σ_{t=1}^{T} yt−1² →d ΞD, a well defined random variable. Given these results, it is necessary to scale the deviation of the OLS estimator from the true value of one by T in order to obtain a limiting distribution because then we have:

T(βˆT − 1) = T−1 Σ_{t=1}^{T} yt−1ut / ( T−2 Σ_{t=1}^{T} yt−1² ),    (3.50)
and both numerator and denominator converge to well-defined random variables. As a result, it can be shown that the limiting distribution of T (βˆT − 1) is the distribution of a random variable Ξ = ΞN /ΞD . This distribution is often referred to as a Dickey-Fuller distribution as it was first characterized in Dickey and Fuller (1979). For our purposes here, we do not need to provide the precise form of this distribution. What matters is that Ξ does not have a normal distribution - in fact its distribution is skewed to the left. While the limiting distribution theory is non-standard, note that the OLS estimator is still consistent because if T (βˆT − 1) converges in distribution (to a well-defined distribution) then βˆT →p 1 = β0.
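A quick way to see the non-normal, left-skewed shape of this limiting distribution is to simulate it. The sketch below (our own illustration, assuming numpy; the replication settings are arbitrary choices) generates random walks with y0 = 0 and tabulates T(βˆT − 1) across replications.

```python
import numpy as np

rng = np.random.default_rng(5)
T, n_rep = 500, 5_000
stats = np.empty(n_rep)

for r in range(n_rep):
    u = rng.standard_normal(T)
    y = np.concatenate(([0.0], np.cumsum(u)))   # random walk with y_0 = 0
    num = np.sum(y[:-1] * u)                    # sum_t y_{t-1} u_t, numerator in (3.50)
    den = np.sum(y[:-1] ** 2)                   # sum_t y_{t-1}^2
    stats[r] = T * num / den                    # T*(beta_hat - 1), see (3.46) and (3.50)

# the simulated distribution sits below zero and is skewed to the left,
# unlike a normal distribution centred at zero
print(np.mean(stats), np.percentile(stats, [5, 50, 95]))
```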
In practice, if an AR model is used to capture the time series behaviour of the level of a macroeconomic series then typically an intercept and a linear time trend are included and the errors are assumed to be a weakly stationary process. In such cases, the large sample analysis of the OLS under the assumption that β0 = 1 requires statistical concepts beyond the scope of this course. However, the limiting distribution theory is similarly non-standard for the most part.
Chapter 4
Inference in the Linear Model with Non-Spherical Errors
4.1 Introduction
In the previous chapter, we develop a powerful framework for inference in the linear regression model based on the OLS estimator with either cross-section or time series data. While the sampling framework is different in these two settings, we maintained in both that (i) the errors are conditionally homoscedastic (Assumptions CS5 and TS5); and (ii) that the errors are conditionally pairwise uncorrelated (Assumptions CS2 and TS6). These can be strong assumptions in the types of econometric model used in practice. In this chapter, we explore inference methods in models where one or other of these assumptions is violated.
If the errors are not homoscedastic then the error variance is not the same across all observations, a phenomenon known as heteroscedasticity. If the errors are not pairwise uncorrelated then the errors from different observations are linearly related in some way. Although pairwise correlation can occur in cross-section data,1 we address its consequences solely within the framework of time series data where the phenomenon is known as serial correlation in the errors.
Both heteroscedasticity and pairwise correlation are statements about the variance-covariance matrix of the errors. This association is reflected in the following terminology commonly employed in discussion of the linear model: if the errors are homoscedastic and pairwise uncorrelated then they are said to have a spherical distribution; and if the errors are heteroscedastic and/or pairwise correlated then the errors are said to have a non-spherical distribution. The origins of this terminology are illustrated in the following example.
1See the discussion at the beginning of Section 3.2.
Example 4.1. Consider the case where T = 2 and u ∼ N(02, Σ) where Σ is the 2 × 2 symmetric matrix with typical element σi,j . (So V ar[u1] = σ1,1, V ar[u2] = σ2,2 and Cov[u1, u2] = σ1,2 = σ2,1.) We now consider three choices for Σ.
Case A: Σ = I2 In this case the errors are homoscedastic and pairwise uncorrelated (i.e. Cov[ui, uj] = 0). Figure 4.1 plots the joint probability density of u. It can be recognized that the pdf has a three dimensional bell shape. To reveal more about the joint distribution, Figure 4.2 provides probability contour plots: these are obtained by slicing through the bell horizontally at different heights - the size of the contour being inversely related to the height at which the slice is taken. It can be recognized that these contours are circular, meaning that the distribution is symmetric about the horizontal and vertical axes, and if (u1, u2) = (a, b) lies on a particular contour then so does the point (b, a).
Case B: σ1,1 = 2, σ2,2 = 1, σ1,2 = 0 In this case the errors are heteroscedastic and pairwise uncorrelated. Figure 4.3 plots the joint probability density of u, and Figure 4.4 plots the probability contours. The pdf is still bell shaped but the bell has been elongated along the horizontal axis. The contour plot is now elliptical meaning it is symmetric about the horizontal and vertical axes but larger values of u1 relative to u2 appear on each contour.
Case C: σ1,1 = 1, σ2,2 = 1, σ1,2 = 0.5 In this case the errors are homoscedastic and pairwise correlated. Figure 4.5 plots the joint probability density of u, and Figure 4.6 plots the probability contours. The pdf is still bell shaped but the bell has been elongated along the 45-degree line. More of each contour lies in the North-East and South-West quadrants than in the North-West and South-East quadrants of the plot. This reflects that positive values of u1 are more likely to occur with positive values of u2 and negative values of u1 are more likely to occur with negative values of u2.
Of the three cases only the case with homoscedasticity and zero pairwise correlation produces perfect circles. Since a circle is a 1-sphere, these errors are said to have a spherical distribution.2 In the other cases, the contours are ellipses and so the distribution is said to be non-spherical.
Below we are able to perform part of our analysis at a generic level in which only mild restrictions are placed on the second moments of the errors. These results cover the cases of either heteroscedasticity or serial correlation as both result in non-spherical errors. However, this generic analysis can only take us so far and so we then specialize our discussion to the cross-section and time series data contexts mentioned above.
2An n-sphere is defined as the set Sn of all points d ∈ Rn+1 such that ∥d∥ = r for some r.
Figure 4.1: Probability density function with spherical errors
Figure 4.2: Probability contours with spherical errors
Figure 4.3: Probability density function with non-spherical errors: heteroscedasticity
Figure 4.4: Probability contours with non-spherical errors: heteroscedasticity
Figure 4.5: Probability density function with non-spherical errors: pairwise correlation
Figure 4.6: Probability contours with non-spherical errors: pairwise correlation
We begin our discussion of inference in the linear model with non-spherical errors in the simplified setting in which the regressors are assumed fixed in re- peated samples. In Section 4.2, it is shown that within this framework the OLS estimator is still unbiased but no longer BLUE. Furthermore, the formula for the variance of the estimator (derived in Section 2.4.2) is no longer valid, and so neither are the inference procedures in Sections 2.6, 2.7, 2.8.2 and 2.8.3. While OLS is not BLUE, we can exploit the argument behind the Gauss-Markov The- orem to deduce the form of the best linear unbiased estimator when the errors have a non-spherical distribution. The resulting estimator is known as Gener- alized Least Squares (GLS). In Section 4.2, we derive the GLS estimator and establish its sampling distribution. Implementation of GLS requires knowledge of the variance-covariance matrix of the errors which in general is not available. This leads us to a discussion of the so-called Feasible GLS estimator.
In Section 4.3, we explore methods for inference in linear models with heteroscedasticity estimated from cross-section data. Two approaches are considered: one based on OLS using modified formulae for the variance of the estimator that take account of the heteroscedasticity; the second is based on GLS estimation. We also consider methods for testing for heteroscedasticity.
In Section 4.4, we explore methods for inference in linear time series re- gression models with heteroscedasticity and/or serial correlation in the errors. The properties of OLS depend on the assumptions maintained. In some cases, the OLS estimator is still consistent and both the consequences for OLS and the remedies are qualitatively the same as those for heteroscedasticity in cross- section data. In other cases, OLS is inconsistent because the serial correlation is indicative of model misspecification.
4.2 OLS and GLS: generic analysis

4.2.1 OLS
To consider the consequences of non-spherical errors, it is convenient to return to the fixed regressor model in Section 2.1. We therefore impose Assumptions CA1-CA4 and CA6 but now replace Assumption CA5 by the following condition.
Assumption CA5-NS. V ar[u] = Σ where Σ is a T ×T positive definite matrix with typical element σi,j.
Here the suffix “NS” stands for “non-spherical”. Note that setting Σ = σ02IT returns us to the case of spherical errors. Thus the only difference in the frameworks here and Section 2.4.2 is in the specification of the variance- covariance matrix of the errors. The two following examples illustrate how heteroscedasticity and serial correlation fit within the framework of Assumption CA5-NS.
Example 4.2. If the errors are heteroscedastic with V ar[ut] = σt2 but the errors are pairwise uncorrelated (Cov[ut,us] = 0 for t ̸= s) then Σ is the diagonal
matrix,

Σ = diag(σ1², σ2², …, σT²),

that is, the T × T diagonal matrix whose t-th diagonal element is σt² and whose off-diagonal elements are zero.
Example 4.3. Now suppose that {yt; t = 1,2,...T} is a time series and the
errors follow the AR(1) model,3
ut = ρ0ut−1 + εt
where ρ0 is a parameter and εt is an unobservable random variable that has mean zero, is homoscedastic (with V ar[εt] = σε2) and is serially uncorrelated (Cov[εt, εs] = 0 for all t ̸= s).4 Then it can be shown that5
Σ = σε²/(1 − ρ0²) × [ 1, ρ0, ρ0², …, ρ0^{T−1} ; ρ0, 1, ρ0, …, ρ0^{T−2} ; ⋮ ; ρ0^{T−1}, ρ0^{T−2}, …, ρ0, 1 ],

that is, the (t, s) element of Σ is σε² ρ0^{|t−s|}/(1 − ρ0²).
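As an aside (not part of the notes), the Σ in Example 4.3 is easy to construct and check numerically. The sketch below builds the matrix from its (t, s) element and compares one entry with a Monte Carlo estimate from simulated AR(1) errors; all names and parameter values are illustrative and numpy is assumed to be available.

```python
import numpy as np

rho0, sigma_eps, T = 0.6, 1.0, 5

# (t, s) element of Sigma is sigma_eps^2 * rho0^{|t-s|} / (1 - rho0^2)
idx = np.arange(T)
Sigma = sigma_eps**2 * rho0 ** np.abs(idx[:, None] - idx[None, :]) / (1 - rho0**2)
print(Sigma.round(3))

# Monte Carlo check of Cov[u_1, u_2] = sigma_eps^2 * rho0 / (1 - rho0^2)
rng = np.random.default_rng(6)
n_rep, burn = 50_000, 100
u = np.zeros((n_rep, burn + T))
eps = rng.normal(0.0, sigma_eps, (n_rep, burn + T))
for t in range(1, burn + T):
    u[:, t] = rho0 * u[:, t - 1] + eps[:, t]
u = u[:, burn:]                                # approximately stationary draws of (u_1, ..., u_T)
print(np.cov(u[:, 0], u[:, 1])[0, 1], Sigma[0, 1])
```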
We now derive the sampling distribution of the OLS estimator βˆT under this set of assumptions. It is straightforward to see that the OLS estimator is unbiased: a review of the discussion in Section 2.4.2 reveals that the proof of E[βˆT ] = β0 relies only on Assumptions CA1-CA4, all of which are imposed here.6
Now consider V ar[βˆT ]. We can repeat the same steps as before to deduce (2.28), reproduced here for convenience:
Var[βˆT] = (X′X)−1X′E[uu′]X(X′X)−1. (4.1)
As before, we use Assumption CA4 to deduce that E[uu′] = Var[u]; how- ever from this point the analysis is different. Assumption CA5-NS states that V ar[u] = Σ and so (4.1) yields:
Var[βˆT] = (X′X)−1X′ΣX(X′X)−1. (4.2)
This formula only collapses to the expression in (2.29) if Σ takes the form σ02IT or in other words if the errors are homoscedastic and pairwise uncorrelated.
Since Assumption CA6 is imposed, the sampling distribution of the OLS estimator is still normal, and so we have the following result.
3See Section 3.3.
4In the time series literature, a process with the listed properties for εt is said to be white noise.
5This assumes |ρ0| < 1, which is the condition for an AR(1) process to be weakly stationary.
6Recall that Assumption CA3 is needed for the existence of (X′X)−1 on which this analysis is premised.
Theorem 4.1. If Assumptions CA1-CA4, CA5-NS and CA6 hold then: βˆT ∼ N(β0, (X′X)−1X′ΣX(X′X)−1).
A comparison of the sampling distributions in Theorems 2.3 and 4.1 indicates that in both cases they are normal with mean β0: the only - and crucial - difference is in the form of the variance. This difference creates a very important practical problem. Many basic computer programs for regression are coded up using the formulae for confidence intervals and test statistics that derive from the sampling distribution in Theorem 2.3 - in other words under the assumption of spherical errors. However, if the errors are in fact non-spherical then the code is based on formulae derived from an incorrect expression for the variance. This means that the resulting inferences are all invalid in the sense that they do not have the anticipated statistical properties. The same problem persists in models with stochastic regressors, although we postpone a demonstration until later because the arguments depend on the specific assumptions about the nature of the data.
One solution to this problem is to use computer routines based on appro- priately modified formula for V ar[βˆT ] when the errors are non-spherical. We discuss this approach in Sections 4.3 and 4.4. Before that, we turn to another important issue. If the errors are spherical then OLS is the best linear unbiased estimator: does OLS still have this property if the errors are non-spherical? The answer is no and is established in the next sub-section by characterizing a more efficient estimator, known as the Generalized Least Squares (GLS) estimator. However, as part of our discussion of GLS, it emerges that the implementation of GLS requires some strong assumptions about the form of Σ that have made this approach less popular, leading researchers back to the use of OLS in models with non-spherical errors.
4.2.2 Generalized Least Squares
It is convenient to work within the fixed regressor framework in order to derive the GLS estimator. We therefore impose Assumptions CA1-CA4 and CA5-NS: meaning we wish to estimate β0 in the model,
y = Xβ0 +u (4.3)
where V ar[u] = Σ.
The intuition behind the GLS estimator is as follows. Suppose there exists a
nonsingular matrix S such that SΣS′ = IT and that we pre-multiply both sides of (4.3) by S to yield the transformed regression model
Sy = SXβ0 + Su. (4.4) It can be recognized that (4.4) is also a linear regression model, because we can
re-write the transformed model as
ÿ = Ẍβ0 + ü,    (4.5)
where ÿ = Sy, Ẍ = SX and ü = Su. Notice that because S is nonsingular (4.3) and (4.5) are equivalent representations of the model and so no information is lost by applying this transformation. Further, it can be verified (check for yourselves) that if X and u satisfy Assumptions CA2 - CA5-NS then Ẍ and ü satisfy Assumptions CA2-CA5. Therefore, from the Gauss-Markov Theorem, the best linear unbiased estimator of β0 is the OLS estimator applied to the transformed regression model in (4.5).
Thus, the problem of how to estimate β0 if the errors are non-spherical is solved - provided such a matrix S exists. Since Σ is symmetric (by construction) and positive definite (by Assumption CA5-NS), the existence of S is guaranteed by the following lemma.
Lemma 4.1. If Σ is a positive definite symmetric matrix then there exists a nonsingular matrix D such that Σ = DD′ .
Using Lemma 4.1, it follows that
D−1Σ(D′)−1 = D−1DD′(D′)−1 = IT ,
and so, recalling that (D′)−1 = (D−1)′, if we set S = D−1 then SΣS′ = IT as required. As is evident below, for our purposes here, it only matters that such a matrix D exists and not what form it takes;7 however, for those interested, further details are given in Orme (2009)[Theorem 13, p.65].
The Generalized Least Squares estimator of β0, βˆGLS, is defined to be the OLS estimator of β0 based on (4.5). So, using (2.7),

βˆGLS = (Ẍ′Ẍ)−1Ẍ′ÿ = ({SX}′{SX})−1{SX}′{Sy} = (X′S′SX)−1X′S′Sy.    (4.6)

Using Lemma 4.1, it follows that

Σ−1 = (DD′)−1 = (D′)−1D−1 = S′S,

and so (4.6) becomes

βˆGLS = (X′Σ−1X)−1X′Σ−1y.    (4.7)

Since the GLS estimator is an OLS estimator applied to a (transformed) regression model that satisfies the Classical assumptions, we can use our analysis from Section 2.4.2 to deduce that E[βˆGLS] = β0 and Var[βˆGLS] = (X′Σ−1X)−1. Furthermore, we can invoke the Gauss-Markov Theorem to deduce that GLS is the best linear unbiased estimator of β0. Finally, if we impose Assumption CA6 then the sampling distribution of the GLS estimator is normal so that

βˆGLS ∼ N(β0, (X′Σ−1X)−1).    (4.8)

7In fact, D is not unique, that is, there is more than one choice of D that satisfies the decomposition in Lemma 4.1 for any given Σ.
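The transformation argument behind GLS is easy to mimic numerically. The following sketch (our own illustration, assuming numpy; the design, parameter values and the heteroscedastic choice of Σ are hypothetical) builds S = D−1 from a Cholesky factor Σ = DD′ and verifies that OLS on the transformed model (4.5) reproduces the closed-form GLS estimator (4.7).

```python
import numpy as np

rng = np.random.default_rng(7)
T, k = 200, 3

X = np.column_stack([np.ones(T), rng.standard_normal((T, k - 1))])
beta0 = np.array([1.0, 2.0, -0.5])
Sigma = np.diag(np.linspace(0.5, 4.0, T))       # a known non-spherical Var[u]

D = np.linalg.cholesky(Sigma)                   # Sigma = D D'
S = np.linalg.inv(D)                            # so S Sigma S' = I_T

u = D @ rng.standard_normal(T)                  # draw u with Var[u] = Sigma
y = X @ beta0 + u

# GLS as OLS on the transformed model (4.5): S y = S X beta_0 + S u
Xt, yt = S @ X, S @ y
beta_gls_transform = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

# closed form (4.7): (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y
Sigma_inv = np.linalg.inv(Sigma)
beta_gls_direct = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)

print(np.allclose(beta_gls_transform, beta_gls_direct))   # True
```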
We now return to the question of the efficiency of the OLS estimator of β0, βˆT . Since βˆGLS is BLUE in this context and βˆGLS ̸= βˆT , it follows that OLS is no longer the most efficient linear unbiased estimator in the linear regression model with non-spherical errors. This comparison of properties of OLS and GLS indicates that GLS is to be preferred when the errors are non-spherical. However, there is a caveat: to construct the GLS estimator, it is necessary to know Σ. There are some cases where the structure of the model implies a specific Σ but these situations are the exception rather than the rule; in cases where Σ is unknown then GLS is known as an infeasible estimator because it can not be calculated.
This obviously raises the issue of how to proceed if Σ is unknown. An obvious solution is to estimate Σ from the data. However to do so, we face a number of problems. First, Σ is the variance-covariance matrix of an unobservable random vector, u; this can be addressed by using the OLS residuals as the basis for our estimation of Σ. Second, in the general case, there are T (T + 1)/2 distinct elements of Σ, and so it is only possible to estimate Σ in a statistically meaningful way if we place some structure on its elements that is,
σi,j = hi,j(zt,α) (4.9)
where hi,j is some specified function, zt is a vector of observable variables, and α is a p × 1 vector of parameters. The vector zt often contains some or all of the elements of xt, but may contain other variables not included on the right-hand side of the regression model.
Whatever model is chosen and assuming - as is most likely - that p > 0, we end up with a specification for Σ that is known up to the vector of parameters α, and so we can write Σ = Σ(α). The unknown parameter vector α can then be estimated from the sample {yt,x′t,zt′; t = 1,2,…,T}. Let αˆ denote the resulting estimator of α, and define Σˆ = Σ(αˆ) as the associated estimator of Σ. If we replace Σ by Σˆ in the formula for the GLS estimator then we obtain what is known as the Feasible Generalized Least Squares (FGLS) estimator,
βˆFGLS = (X′Σˆ−1X)−1X′Σˆ−1y. (4.10)
While this approach is intuitively reasonable, it must be noted that the FGLS is not guaranteed to inherit the finite sample properties of GLS. To illustrate, consider E[βˆFGLS]. Substituting for y from Assumption CA1, we obtain:
βˆFGLS = β0 + (X′Σˆ−1X)−1X′Σˆ−1u, and so

E[βˆFGLS] = β0 + E[ (X′Σˆ−1X)−1X′Σˆ−1u ].    (4.11)

Unlike in the corresponding analysis of the GLS estimator, the expectation of the terms of the right hand side of (4.11) can not be deduced from Assumptions CA1-CA4 because Σˆ is a random matrix. In fact, since Σˆ depends on y and hence u, (X′Σˆ−1X)−1X′Σˆ−1u is a nonlinear function of u. So we can not
establish at this level of generality that the FGLS estimator is unbiased. Thus, to make more definitive statements about the finite sample properties of the FGLS estimator, it is necessary to work on a case by case basis. With certain combinations of models for Σ and methods for estimating α, it can be established that E[βˆFGLS] = β0.
Although the FGLS may not inherit the finite sample properties of the GLS estimator, it does inherit the asymptotic properties of the GLS estimator within either our cross-section data or time series data frameworks subject to certain conditions. While it is beyond the scope of this course to provide a complete accounting of these conditions, one such condition does need to be highlighted, namely the assumption that FGLS estimation is based on the correct model for Σ. If this assumption is false then FGLS-based inferences are also invalid. In contrast, it emerges below that it is possible to consistently estimate the variance of the OLS estimator under conditions that do not include knowledge of the specific form of Σ. It is for this reason that inferences are often based on OLS instead of (F)GLS estimators in models with non-spherical errors. We explore asymptotic inferences based on OLS and FGLS in the next two sections.
4.3 Heteroscedasticity in cross-sectional data
In this section, we discuss estimation and inference in linear regression models with heteroscedasticity estimated from cross-sectional data. In line with our earlier discussion, we revert to using i to index the observation and N to de- note the sample size. Section 4.3.1 covers OLS-based inference, Section 4.3.2 covers FGLS/GLS-based inference, Section 4.3.3 describes various tests for het- eroscedasticity and Section 4.3.4 provides a simple empirical illustration.
All the results are based on asymptotic analysis. To this end, we impose Assumptions CS1-CS4 and an assumption about the conditional variance of ui, the exact nature of which depends on the context and so is specified as needed. Recall that Assumption CS2 states that the distribution of (ui, x′i) is the same for all i and that (ui, x′i) is independent of (uj, x′j) for any i ̸= j. Note that whatever form it takes, our assumption about Var[ui | xi] states that the conditional variance of ui given xi depends on xi, and the “identically distributed” part of Assumption CS2 implies the unconditional variance of ui is constant over i. Thus, our analysis is restricted to conditional heteroscedasticity and does not cover unconditional heteroscedasticity (i.e. the case where Var[ui] varies with i).
To motivate our discussion, consider the following simple example.
Example 4.4. Suppose a researcher is interested in studying the savings be- haviour of households and so uses a cross-section sample to estimate the follow- ing model,
yi = x′iβ0 +ui = β0,1 +β0,2mi +ui,
where yi is the level of savings of household i, and mi is household income.
Assuming this model to represent the true household savings function so that
E[ui|xi] = 0, the error ui represents unobserved factors that have no systematic effect on savings; or equivalently ui can be viewed as “shocks” to households savings plans. In this setting, homoscedasticity is questionable and most likely implausible. Households with higher income can accommodate larger shocks to their savings plans than households with lower incomes. So V ar[ui | xi] arguably depends on mi.
4.3.1 OLS-based inference
In his seminal article, White (1980) proposes a framework for inference based on OLS estimators using a heteroscedasticity-robust estimator of the variance of the OLS estimator. To describe this approach and the statistical results behind it, we impose Assumptions CS1-CS4 along with the following condition on the error variance.
Assumption CS5-H. Var[ui|xi] = h(xi), where h(·) > 0 with h(xi) ̸= h(xj) for at least one combination of i and j where i,j = 1,2,…,N.
Notice that this assumption leaves the specific form of h( · ) unspecified.
We now consider the large sample behaviour of the OLS estimator under our assumptions here. Referring back to our analysis in Section 3.2, it can be seen that the consistency of the OLS estimator is established in Theorem 3.1 under Assumptions CS1-CS4. Since these four assumptions are imposed here, we can appeal directly to Theorem 3.1 to establish that the OLS estimator is
also consistent within our framework here.
However, the large sample distribution is sensitive to our assumptions about
the error variance. To see why and how, recall from (3.8) that we can write
N1/2(βˆN − β0) = ( N−1 Σ_{i=1}^{N} xix′i )−1 N−1/2 Σ_{i=1}^{N} xiui.    (4.12)
As in Section 3.2, we start by examining the large sample behaviour of the two terms on the right-hand side of (4.12) individually. Since we impose Assump- tions CS2 and CS3, we can appeal to the same argument as in Section 3.2 to deduce that
( N−1 Σ_{i=1}^{N} xix′i )−1 →p Q−1,    (4.13)
where, as before, Q = E[xix′i].
As in Section 3.2, we use the CLT in Lemma 3.4 to deduce that the large sample behaviour of N^{-1/2}\sum_{i=1}^{N} x_i u_i is given by

N^{-1/2}\sum_{i=1}^{N} x_i u_i \to_d N(0, \Omega),

where \Omega = \lim_{N\to\infty} \Omega_N and \Omega_N = Var[N^{-1/2}\sum_{i=1}^{N} x_i u_i]. To specialize the formula for \Omega to our model here, we can invoke many of the arguments used in Section 3.2. Specifically, Assumption CS2 implies {x_i u_i; i = 1, 2, ..., N} are i.i.d. random vectors and so Cov[x_i u_i, x_j u_j] = 0 for any i ≠ j. Thus, we have \Omega_N = Var[N^{-1/2}\sum_{i=1}^{N} x_i u_i] = Var[x_i u_i]. Since E[x_i u_i] = 0, we have
Var[x_i u_i] = E[u_i^2 x_i x_i']. It is at this point that our assumptions about the form of the error variance become important. Using the LIE and Assumption CS5-H, we obtain:

Var[x_i u_i] = E[u_i^2 x_i x_i'] = E[ E[u_i^2|x_i] x_i x_i' ] = E[h(x_i) x_i x_i'] = \Omega_h, say.

Therefore, under our assumptions, we have:

N^{-1/2}\sum_{i=1}^{N} x_i u_i \to_d N(0, \Omega_h).   (4.14)
Using (4.13)-(4.14) and Lemma 3.5, we obtain the limiting distribution of the
OLS estimator.
Theorem 4.2. If Assumptions CS1-CS4 and CS5-H hold then:8
N^{1/2}(\hat{\beta}_N - \beta_0) \to_d N(0, V_h), where V_h = Q^{-1}\Omega_h Q^{-1}.
Comparing the limiting distributions of N1/2(βˆN − β0) under homoscedas- ticity (Theorem 3.2) and heteroscedasticity (Theorem 4.2), it can be seen that both are normal and mean zero but they differ in the form of the variance. (Notice that if Ωh = σ02Q then Vh reduces to σ02Q−1.)
To use this result as a basis for inference about β0, we need to be able to construct a consistent estimator of Vh and so also of Ωh. At first glance, this might seem impossible without knowledge of h(xi), however White (1980) shows that this initial reaction is mistaken. To see how we can estimate Ωh consistently, note that Ωh = E[u2i xix′i], and so from the WLLN it follows that
N^{-1}\sum_{i=1}^{N} u_i^2 x_i x_i' \to_p E[u_i^2 x_i x_i'].   (4.15)
While the errors u_i are unobserved, the OLS residuals e_i are their sample analogues and, using the consistency of the OLS estimator, it can be shown that

N^{-1}\sum_{i=1}^{N} e_i^2 x_i x_i' - N^{-1}\sum_{i=1}^{N} u_i^2 x_i x_i' \to_p 0.   (4.16)

Combining (4.15) and (4.16), it follows that

N^{-1}\sum_{i=1}^{N} e_i^2 x_i x_i' \to_p E[u_i^2 x_i x_i'] = \Omega_h.   (4.17)
8Strictly, the variance is Q−1Ωh{Q−1}′ but this reduces to the formula in the text because Q is symmetric.
Therefore, if we set

\hat{V}_h = \hat{Q}^{-1}\hat{\Omega}_h\hat{Q}^{-1},   (4.18)

where \hat{Q} = N^{-1}X'X and \hat{\Omega}_h = N^{-1}\sum_{i=1}^{N} e_i^2 x_i x_i', then it follows via (4.13), (4.17) and Slutsky's Theorem that

\hat{V}_h \to_p V_h.   (4.19)
The square roots of the diagonal elements of the matrix

\hat{V}_h/N = (X'X)^{-1}\left( \sum_{i=1}^{N} e_i^2 x_i x_i' \right)(X'X)^{-1}

are commonly known as the "White standard errors" and can be obtained in many regression packages.9
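To make the computation concrete, here is a minimal numpy sketch of \hat{V}_h and the White standard errors, assuming the data are already held in arrays y and X; the simulated savings-style data and all variable names are illustrative only, not part of the notes. (Many packages report the same quantities, e.g. statsmodels OLS fitted with cov_type="HC0".)

import numpy as np

def white_robust_variance(y, X):
    # OLS estimate, Vhat_h = Qhat^{-1} Omegahat_h Qhat^{-1} as in (4.18),
    # and the White standard errors sqrt(diag(Vhat_h / N))
    N, k = X.shape
    XtX = X.T @ X
    beta_hat = np.linalg.solve(XtX, X.T @ y)
    e = y - X @ beta_hat
    Q_hat = XtX / N
    Omega_hat = (X * (e ** 2)[:, None]).T @ X / N   # N^{-1} sum e_i^2 x_i x_i'
    Q_inv = np.linalg.inv(Q_hat)
    V_hat = Q_inv @ Omega_hat @ Q_inv
    return beta_hat, V_hat, np.sqrt(np.diag(V_hat) / N)

# illustrative data with conditional variance increasing in the regressor
rng = np.random.default_rng(0)
N = 500
m = rng.uniform(1.0, 10.0, N)
X = np.column_stack([np.ones(N), m])
y = 100.0 + 0.15 * m + rng.normal(0.0, np.sqrt(m))   # Var[u_i|m_i] = m_i
beta_hat, V_hat, white_se = white_robust_variance(y, X)
print(beta_hat, white_se)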
We can then perform all the large sample inference discussed in Section 3.2 provided we modify the statistics appropriately. To illustrate, we consider two types of inference: a confidence interval for β_{0,l} and a test of H_0: Rβ_0 = r vs H_1: Rβ_0 ≠ r. An approximate 100(1 − α)% confidence interval for β_{0,l} is given by

\hat{\beta}_{N,l} \pm z_{1-\alpha/2}\sqrt{\hat{V}_{h,l,l}/N},   (4.20)

where \hat{V}_{h,l,l} is the (l, l)th element of \hat{V}_h and z_{1-\alpha/2} is the 100(1 − α/2)th percentile of the standard normal distribution. For the hypothesis test, we can base inference on the following modified version of the Wald statistic in (3.14):

W_N^{(h)} = N(R\hat{\beta}_N - r)'\left[ R\hat{V}_h R' \right]^{-1}(R\hat{\beta}_N - r).   (4.21)
The decision rule is then as follows:
Decision rule: reject H_0: Rβ_0 = r in favour of H_A: Rβ_0 ≠ r at the approximate 100α% significance level if:

W_N^{(h)} > c_q(1 − α),

where c_q(1 − α) is the 100(1 − α)th percentile of the χ^2_q distribution.
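Continuing the hypothetical sketch above, the Wald statistic (4.21) and its asymptotic p-value follow directly once \hat{V}_h is available; R, r, beta_hat, V_hat and N below refer to the objects created in the previous snippet.

import numpy as np
from scipy import stats

def robust_wald(beta_hat, V_hat, R, r, N):
    # W = N (R b - r)' [R Vhat_h R']^{-1} (R b - r), compared with chi^2_q
    diff = R @ beta_hat - r
    W = N * diff @ np.linalg.inv(R @ V_hat @ R.T) @ diff
    q = R.shape[0]
    return W, 1.0 - stats.chi2.cdf(W, df=q)

# e.g. H0: beta_{0,2} = 0 in the two-regressor example above
R = np.array([[0.0, 1.0]])
r = np.array([0.0])
print(robust_wald(beta_hat, V_hat, R, r, N))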
Comparing with the analogous inference procedures in Section 3.2, notice that the only difference is in the estimator of the variance of βˆN – everything else is the same.
There are two chief advantages to basing inference on the OLS-based procedures described in this sub-section. First, we do not need to know the form
9These quantities are sometimes called the "Eicker-White" or "Eicker-Huber-White" standard errors because, after the publication of White's 1980 article, it was recognized that Eicker (1967) and Huber (1967) introduced this approach into the statistics literature, although their work does not appear to have impacted on econometrics at that time.
of the heteroscedasticity in order to perform valid inferences (in large samples). Second, if the errors are actually homoscedastic then our inferences are still valid because in that case \hat{V}_h converges to σ_0^2 Q^{-1} – check this for yourselves using (4.15)-(4.16). It is for this reason that the OLS-based inference procedures described in this section are often referred to as being "robust against heteroscedasticity". Therefore, a case can be made for always reporting inferences in cross-section data based on heteroscedasticity-robust versions of the OLS statistics. However, two considerations argue against that practice: first, OLS is inefficient in the presence of heteroscedasticity; second, if the errors are really homoscedastic then there is evidence that in "small-" to "moderate-" sized samples, asymptotic theory may provide a less accurate approximation to the statistical properties of the OLS statistics when heteroscedasticity-robust variance estimators are used than if the standard variance estimator is used. As a result, it is quite common for researchers to report inferences based on both variance estimators so that the reader can assess for him/herself whether inferences are sensitive to this choice.
4.3.2 Generalized Least Squares
The GLS estimator is derived in Section 4.2.2 in the context of the fixed re- gressor model. This derivation readily extends to our framework here: the only difference is that the moments of u are expressed conditional on X. We first verify that under our assumptions here the conditional moments of u match our assumptions about their unconditional counterparts in the fixed regressor model. Assumption CS4 states that E[ui|xi] = 0 and Assumption CS2 states that ui is independent of xj for all i ̸= j, and these two conditions taken to- gether imply E[ui|xi] = E[ui|X]. Thus, we have E[u|X] = 0. Via a similar logic, Assumptions CS2 and CS5-H imply V ar[u|X] = Σ where
\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N^2 \end{pmatrix},   (4.22)
and σi2 = h(xi). The formula for the GLS estimator can then be obtained by substituting (4.22) into (4.7).
Recall S – the matrix used to transform the regression model as part of our derivation of GLS – is given by S = D−1 where D satisfies DD′ = Σ. Given the form of Σ in (4.22), it can be recognized that
D = \begin{pmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_N \end{pmatrix},   (4.23)

and so

S = \begin{pmatrix} 1/\sigma_1 & 0 & \cdots & 0 \\ 0 & 1/\sigma_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & 1/\sigma_N \end{pmatrix}.   (4.24)

Therefore the transformed regression model is

\frac{y_i}{\sigma_i} = \frac{1}{\sigma_i}\beta_{0,1} + \frac{x_{i,2}}{\sigma_i}\beta_{0,2} + \ldots + \frac{x_{i,k}}{\sigma_i}\beta_{0,k} + \frac{1}{\sigma_i}u_i.   (4.25)
Thus the effect of the transformation is to divide yi, xi and ui by the standard
deviation of the error. For what follows, it is convenient to re-write (4.25) as

\ddot{y}_i = \ddot{x}_i'\beta_0 + \ddot{u}_i,

where \ddot{y}_i = y_i/\sigma_i, \ddot{x}_i = (1/\sigma_i)x_i and \ddot{u}_i = u_i/\sigma_i. We return to the nature of this transformation after presenting the statistical properties of the estimator and discussing certain issues that arise in implementation.
It can be verified (check this yourselves) that if x_i, u_i satisfy Assumptions CS2-CS4 and CS5-H then \ddot{x}_i, \ddot{u}_i satisfy Assumptions CS2-CS5.10 Further, by the nature of the transformation, Var[\ddot{u}_i|\ddot{x}_i] = 1. Therefore, since the GLS estimator is the OLS estimator in the transformed regression model, we can appeal directly to Theorems 3.1 and 3.2 to deduce the following large sample properties of the GLS estimator.
Theorem 4.3. If Assumptions CS1 – CS4 hold then \hat{\beta}_{GLS} is a consistent estimator for β_0.
Theorem 4.4. If Assumptions CS1 – CS4 and CS5-H hold then:

N^{1/2}(\hat{\beta}_{GLS} - \beta_0) \to_d N(0, V_{GLS}),

where V_{GLS} = \{E[\sigma_i^{-2} x_i x_i']\}^{-1}.
Thus the GLS estimator is consistent and has a limiting normal distribution. The variance takes the form it does because in the OLS formula “σ02Q−1” (from Theorem 3.2) we replace “σ02” by 1 (as Var[u ̈i|x ̈i] = 1) and “Q” by E[x ̈ix ̈′i]. Sometimes, this variance is written using matrix notation as,
VGLS = plim(N−1X′Σ−1X)−1.
To implement GLS, we need to know σi2. Unfortunately, the economic and/or statistical model rarely – if ever – provides this information. As discussed in Section 4.2.2, the way forward is to assume that σi2 is a function of certain observable variables denoted by zi in equation (4.9). Our discussion of the
10By which we mean that Assumptions CS2-CS5 hold if we replace x_i by \ddot{x}_i and u_i by \ddot{u}_i in the original statements of these assumptions.
implementation of GLS focuses primarily on the leading case where zi = xi; the extension to the case where zi contains variables not included in xi is discussed briefly at the end.
The best possible scenario is that σi2 is known up to some scale parameter
that is,
\sigma_i^2 = \sigma_0^2\, v(x_i),   (4.26)
where v_i = v(x_i) is some known function of x_i. To motivate this scenario, consider again the model in Example 4.4 in which savings is modelled as a function of income, m_i. As noted above, it seems intuitively reasonable that the error variance is positively related to income. Since m_i > 0, this behaviour might be captured by setting v(x_i) = m_i in (4.26).
To see how to proceed in this case, note that if we set σi2 = σ02v(xi) then the GLS estimator is invariant to σ02 (check this for yourselves). Thus we can obtain the GLS estimator by applying OLS to the model,
\frac{y_i}{\sqrt{v_i}} = \frac{1}{\sqrt{v_i}}\beta_{0,1} + \frac{x_{i,2}}{\sqrt{v_i}}\beta_{0,2} + \ldots + \frac{x_{i,k}}{\sqrt{v_i}}\beta_{0,k} + \frac{1}{\sqrt{v_i}}u_i,   (4.27)
where we have set vi = v(xi) for ease of presentation. Using (4.26) in VGLS, it can be recognized that
V_{GLS} = \{E[\sigma_i^{-2} x_i x_i']\}^{-1} = \sigma_0^2\{E[\ddot{x}_i\ddot{x}_i']\}^{-1},

where – with a slight abuse of notation – we set \ddot{x}_i = x_i/\sqrt{v_i}. This matrix can be consistently estimated by \hat{\sigma}_{GLS}^2(N^{-1}\ddot{X}'\ddot{X})^{-1}, where \ddot{X} is the N × k matrix with ith row equal to \ddot{x}_i' (= v_i^{-1/2}x_i') and \hat{\sigma}_{GLS}^2 is the OLS estimator of the error
variance in the transformed regression model (4.27). In this case, GLS-based inferences can be performed by applying the analogous OLS-based procedure to the transformed regression model (4.27).
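As a concrete illustration of this leading case, the sketch below applies GLS to simulated savings-style data under the working assumption v(x_i) = m_i: every variable is divided by \sqrt{m_i} and OLS is applied to the transformed model (4.27). The data-generating step and all variable names are illustrative assumptions, not part of the notes.

import numpy as np

rng = np.random.default_rng(1)
N = 400
m = rng.uniform(1.0, 10.0, N)                                # household income
y = 100.0 + 0.15 * m + rng.normal(0.0, np.sqrt(2.0 * m))     # Var[u_i|m_i] = 2 * m_i
X = np.column_stack([np.ones(N), m])

# GLS with v(x_i) = m_i: divide y_i and x_i by sqrt(v_i), then run OLS as in (4.27)
v = m
yt = y / np.sqrt(v)
Xt = X / np.sqrt(v)[:, None]
beta_gls = np.linalg.solve(Xt.T @ Xt, Xt.T @ yt)

# conventional OLS inference in the transformed model gives the GLS standard errors
resid = yt - Xt @ beta_gls
sigma2_gls = resid @ resid / (N - X.shape[1])
se_gls = np.sqrt(np.diag(sigma2_gls * np.linalg.inv(Xt.T @ Xt)))
print(beta_gls, se_gls)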
More generally, if the heteroscedasticity is believed to be driven by a more complicated function of x_i then we proceed in the way described in Section 4.2.2. We assume
σi2 = h(xi,α) (4.28)
where h(·) is a specified function and α is a vector of unknown parameters. To illustrate, consider again the simple model for savings in Example 4.4: one possible way to capture the dependence of the error variance on income, m_i, is via the following equation,
\sigma_i^2 = \alpha_1 + \alpha_2 m_i + \alpha_3 m_i^2.   (4.29)
Whatever the choice of h(x_i, α), it is necessary to estimate α in order to implement FGLS. Using Assumption CS4, it follows from (4.28) that E[u_i^2|x_i] = h(x_i, α), and so we have

u_i^2 = h(x_i, \alpha) + a_i,   (4.30)
where E[a_i|x_i] = 0 by construction. Equation (4.30) is a (nonlinear) regression model and could form the basis for estimation of α if u_i^2 were observable. As in our discussion of robust standard errors for OLS, we circumvent this problem by using the OLS residuals and estimating α from
e2i = h(xi, α) + “error”. (4.31)
In practice, h( · ) is likely to be a nonlinear function and so estimation would involve a technique known as Nonlinear Least Squares which is beyond the scope of our discussion. Therefore, we restrict attention to the choice of h( · ) in (4.29) because this model is linear in the unknown parameter vector α = (α1, α2, α3)′. Let αˆ be the estimator of α obtained by regressing e2i on an intercept, mi and m2i , then our estimator of σi2 is:11
\hat{\sigma}_i^2 = \hat{\alpha}_1 + \hat{\alpha}_2 m_i + \hat{\alpha}_3 m_i^2.
Setting \hat{\Sigma} equal to the diagonal matrix with (i, i)th element \hat{\sigma}_i^2, the FGLS estimator is given by (4.10).
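The FGLS recipe just described can be sketched in a few lines: run OLS, regress the squared residuals on an intercept, m_i and m_i^2 to obtain \hat{\alpha} as in (4.29)-(4.31), form \hat{\sigma}_i^2, and re-estimate with weights 1/\hat{\sigma}_i. The simulated data below are purely illustrative, and the positivity check reflects the caveat in the footnote.

import numpy as np

rng = np.random.default_rng(2)
N = 400
m = rng.uniform(1.0, 10.0, N)
y = 100.0 + 0.15 * m + rng.normal(0.0, np.sqrt(1.0 + 0.5 * m + 0.1 * m ** 2))
X = np.column_stack([np.ones(N), m])

# Step 1: OLS and residuals
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols

# Step 2: regress e_i^2 on (1, m_i, m_i^2) to estimate alpha, cf. (4.29) and (4.31)
Z = np.column_stack([np.ones(N), m, m ** 2])
alpha_hat = np.linalg.solve(Z.T @ Z, Z.T @ (e ** 2))
sigma2_hat = Z @ alpha_hat
if np.any(sigma2_hat <= 0):
    raise ValueError("fitted variances must be positive for FGLS")

# Step 3: FGLS = OLS on the data weighted by 1/sigma_hat_i
w = 1.0 / np.sqrt(sigma2_hat)
Xw, yw = X * w[:, None], y * w
beta_fgls = np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
se_fgls = np.sqrt(np.diag(np.linalg.inv(Xw.T @ Xw)))   # from (X' Sigma_hat^{-1} X)^{-1}
print(beta_fgls, se_fgls)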
Under certain conditions (which need not concern us here), it can be shown that the FGLS estimator has the same limiting distribution as the GLS estimator
and so:
N^{1/2}(\hat{\beta}_{FGLS} - \beta_0) \to_d N(0, V_{GLS}).
Furthermore, \hat{V}_{GLS} = (N^{-1}X'\hat{\Sigma}^{-1}X)^{-1} is a consistent estimator of V_{GLS}. Therefore, we can develop a large sample framework for inference about β_0 based on the FGLS estimator. To illustrate, we consider two types of inference: a confidence interval for β_{0,l} and a test of H_0: Rβ_0 = r vs H_1: Rβ_0 ≠ r where R is n_r × k. An approximate 100(1 − α)% confidence interval for β_{0,l} is given
by

\hat{\beta}_{FGLS,l} \pm z_{1-\alpha/2}\sqrt{\hat{V}_{GLS,l,l}/N},   (4.32)
where VˆGLS,l,l is the (l, l)th element of VˆGLS and z1−α/2 is the 100(1 − α/2)th percentile of the standard normal distribution. For the hypothesis test, we can base inference on the following Wald statistic,
W_N^{(GLS)} = N(R\hat{\beta}_{FGLS} - r)'\left[ R\hat{V}_{GLS} R' \right]^{-1}(R\hat{\beta}_{FGLS} - r).   (4.33)
The decision rule is then as follows:
Decision rule: reject H_0: Rβ_0 = r in favour of H_A: Rβ_0 ≠ r at the approximate 100α% significance level if:

W_N^{(GLS)} > c_{n_r}(1 − α),

where c_{n_r}(1 − α) is the 100(1 − α)th percentile of the χ^2_{n_r} distribution.

11Note that we are implicitly assuming here that the predicted value of \sigma_i^2 from this regression is positive, which is not guaranteed to be the case. This motivates choices of nonlinear h(·) that automatically impose this constraint.
So far, we have concentrated on the case where the heteroscedasticity is driven by a function of x_i. Suppose now that σ_i^2 = h(z_i, α) where z_i is a random vector of observable variables that includes at least one element that does not appear in x_i. In this case, all the GLS procedures above go through provided that we modify the conditions and methods appropriately. Specifically, (x_i', z_i', u_i) is assumed to be an i.i.d. sequence with E[u_i|x_i, z_i] = 0 and Var[u_i|x_i, z_i] = h(z_i, α), and the estimation of α is based on (4.31) with h(z_i, α) substituted for h(x_i, α).
In comparison to OLS-based inference with heteroscedasticity-robust statis- tics, FGLS has one major advantage and one major disadvantage. The advan- tage is that in large samples (for which FGLS has inherited the properties of GLS) inferences are based on a more efficient estimator. The disadvantage is that to implement FGLS it is necessary to specify the form of the heteroscedas- ticity. To understand more about the pros and cons of FGLS versus OLS, we conclude this sub-section by returning to the nature of the transformation involved in GLS estimation.
To understand the impact of the transformation, it is useful to introduce another variation of Least Squares known as Weighted Least Squares (WLS) estimation. To this end, we introduce a set of finite positive weights {w_i; i = 1, 2, ..., N}. The WLS estimator of β_0, \hat{\beta}_{WLS}, is defined to be the OLS estimator based on the transformed regression model,

w_i y_i = (w_i x_i)'\beta_0 + w_i u_i,

and so its formula is

\hat{\beta}_{WLS} = (X'W_2X)^{-1}X'W_2y,   (4.34)

where W_2 is the diagonal matrix with (i, i)th element w_i^2. It can be shown that, under suitable restrictions on {w_i} and our assumptions here, \hat{\beta}_{WLS} is consistent for β_0 and12

N^{1/2}(\hat{\beta}_{WLS} - \beta_0) \to_d N(0, V_{WLS}),   (4.35)

where

V_{WLS} = Q_w^{-1}\Omega_w Q_w^{-1},

Q_w = \mathrm{plim}_{N\to\infty} N^{-1}X'W_2X, and \Omega_w = \mathrm{plim}_{N\to\infty} N^{-1}X'W_2\Sigma W_2X. Notice that the above results hold for any choice of weights satisfying the conditions stated above.

Using the WLS framework, it can be recognized that GLS is WLS with weights w_i = 1/\sigma_i. Thus – assuming the heteroscedasticity is correctly specified – the effect of the transformation is to put more weight in the calculation of the estimator on those observations associated with smaller values of the error variance and less weight on those observations with higher values of the error
12See Tutorial 6.
variance. This makes intuitive sense because y_i is more closely related to x_i in observations with smaller values of the error variance and so these observations are more informative about β_0. This contrasts with OLS which puts the same weight on each observation in the calculation of the estimator.13 It is this re-weighting that is the source of the efficiency gains over OLS.
We can also use this framework to uncover the properties of GLS based on an incorrect form for the heteroscedasticity. To this end, suppose it is assumed that \sigma_i^2 = m(x_i) but the true form of the heteroscedasticity is given by \sigma_i^2 = h(x_i). GLS based on \sigma_i^2 = m(x_i) amounts to WLS with weights w_i = 1/\sqrt{m(x_i)}. Thus the estimator is still consistent and has a normal distribution in large samples. However, two things have changed. First, since we have assumed an incorrect form for the heteroscedasticity, our estimator is no longer efficient. Second, if we base our inferences on statistics derived from the distributional result in Theorem 4.4 then these inferences are invalid because the true variance of our estimator is given by V_{WLS} (with w_i = 1/\sqrt{m(x_i)}) and not V_{GLS}.14
For some researchers, the fact that GLS is not guaranteed to be more efficient than OLS because we can never be sure we have the correct form of the heteroscedasticity is enough reason not to use (F)GLS. But as pointed out by Wooldridge (2006)[p.289]:
This is a valid theoretical criticism, but it misses an important prac- tical point. Namely, in cases of strong heteroscedasticity, it is of- ten better to use a wrong form of the heteroscedasticity and apply weighted least squares than to ignore the problem in estimation en- tirely.
In such cases, inference can be performed using WLS with a heteroscedasticity-robust estimator of V_{WLS}. Using similar arguments to Section 4.3.1, it can be shown that \hat{V}_{WLS} = \hat{Q}_w^{-1}\hat{\Omega}_w\hat{Q}_w^{-1}, where \hat{Q}_w = N^{-1}X'W_2X, \hat{\Omega}_w = N^{-1}X'\hat{M}_wX and \hat{M}_w is an N × N diagonal matrix with (i, i)th element \hat{u}_i^2 w_i^4, where \hat{u}_i = y_i − x_i'\hat{\beta}_{WLS}. Wooldridge (2006)[p.289] continues:
Although it is difficult to characterize when such WLS procedures will be more efficient than OLS, one always has the option of doing OLS and WLS and computing robust standard errors in both cases. At least in some cases the robust WLS standard errors will be notably smaller on the key explanatory variables.
This seems a sensible prescription. The form of the heteroscedasticity is rarely of interest; rather, it is a complication in our aim of performing inference about the parameters of interest, β_0. In the absence of certain information on the true form of the error variance, it is best to be aware of how sensitive our conclusions are to the way we have accounted for the potential presence of heteroscedasticity.
13Note that OLS is WLS with wi = 1.
14In general, VWLS only equals VGLS if m(xi) = h(xi).
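A hedged sketch of this prescription: estimate by OLS and by WLS under a working variance model and report heteroscedasticity-robust standard errors for both, using the sandwich form \hat{Q}_w^{-1}\hat{\Omega}_w\hat{Q}_w^{-1} given above (OLS is the special case w_i = 1, as in footnote 13). The weights and simulated data are illustrative assumptions only.

import numpy as np

def wls_robust(y, X, w):
    # WLS estimate (4.34) with the heteroscedasticity-robust sandwich variance
    N = X.shape[0]
    W2 = w ** 2
    XtW2X = (X * W2[:, None]).T @ X
    beta_wls = np.linalg.solve(XtW2X, (X * W2[:, None]).T @ y)
    u_hat = y - X @ beta_wls
    Q_w = XtW2X / N
    Omega_w = (X * ((u_hat ** 2) * w ** 4)[:, None]).T @ X / N   # uses diag of Mhat_w
    Q_inv = np.linalg.inv(Q_w)
    V_wls = Q_inv @ Omega_w @ Q_inv
    return beta_wls, np.sqrt(np.diag(V_wls) / N)

# compare OLS (w_i = 1) with WLS under a working model Var[u_i|x_i] proportional to m_i
rng = np.random.default_rng(3)
N = 400
m = rng.uniform(1.0, 10.0, N)
y = 100.0 + 0.15 * m + rng.normal(0.0, m)            # true standard deviation is m_i
X = np.column_stack([np.ones(N), m])
print(wls_robust(y, X, np.ones(N)))                  # OLS with robust standard errors
print(wls_robust(y, X, 1.0 / np.sqrt(m)))            # WLS with robust standard errors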
In weighing up the evidence from these different methods, it may be useful to know whether there is evidence of heteroscedasticity in the data. In the next sub-section, we discuss a popular test for heteroscedasticity.
4.3.3 Testing for heteroscedasticity
Breusch and Pagan (1979) propose a test of homoscedasticity against the alternative of heteroscedasticity of the following form

\sigma_i^2 = h(\delta + z_i'\alpha)

where h(·) is a (twice continuously differentiable) function that is independent of i and satisfies h(·) > 0, z_i is a (p × 1) vector of observable variables, and (δ, α') is a vector of p + 1 unknown parameters in which δ is a scalar and α is p × 1. An advantage of this specification is that if we set α = 0 then σ_i^2 = h(δ), which is independent of i, whereas if any element of α is non-zero then σ_i^2 varies with i. Therefore, we can express the null and alternative in terms of restrictions on the parameter vector α as follows:
H0 : α=0
H_A: α_i ≠ 0 for at least one i = 1, 2, ..., p.
Popular examples of h(δ + z_i'α) are: \exp(δ + z_i'α) and (δ + z_i'α)^2. The test statistic is:

BP_N = N\,\frac{f'Z(Z'Z)^{-1}Z'f}{f'f},   (4.36)
where f is the N × 1 vector with ith element e_i^2 − \hat{\sigma}_N^2, {e_i; i = 1, 2, ..., N} are the OLS residuals and \hat{\sigma}_N^2 is the OLS estimator of σ_0^2. We present the limiting distribution of the test under the null before commenting on the form of the statistic. To present the distribution, it is necessary to first modify some of our conditions to reflect the introduction of the variables z_i into the model.
Assumption BP. (i) {(x_i', z_i', u_i); i = 1, 2, ..., N} are a sequence of independently and identically distributed variables; (ii) E[u_i|x_i, z_i] = 0; (iii) Var[u_i|x_i, z_i] = σ_0^2.
The limiting distribution of the BP-statistic under H0 is given by the following result.
Theorem 4.5. If Assumptions CS1, CS3 and BP hold then: BP_N \to_d χ^2_p.
There are several important features of this test that need to be noted.
First, the form of the test statistic depends on zi but not on h(·). This structure arises because of the test principle used to derive the test by Breusch and Pagan (1979). While we are not in a position to go into this derivation in
detail here,15 an intuitive explanation is that in the construction of the test h( ·) is linearized around α = 0 using a first order Taylor series expansion to give16
σi2 = h(δ + zi′α) ≈ h(δ) + h′(δ)zi′α
where h'(δ) = ∂h(κ)/∂κ evaluated at κ = δ.17 Due to the linearization, the choice of h(·) ends up having no impact on the test statistic. This lack of dependence on h(·) has its advantages and disadvantages: on the plus side the test is valid for any choice of h(·); on the minus side, if significant then the test provides no guidance on a suitable choice of h(·).
Second, it can be shown that BP_N = NR^2 where R^2 is the multiple correlation coefficient from the regression of e_i^2 on an intercept and z_i. This means that the test can be calculated from two OLS regressions: y_i on x_i, and then e_i^2 on an intercept and z_i. At the time Breusch and Pagan were writing this was considered a considerable advantage due to the state of available computer software for econometric modeling, but it is arguably less so today as the test is now included in many software packages; a short computational sketch is given below, after the discussion of the choice of z_i.
Third, the NR2 interpretation is useful for deriving the intuition behind the test. If the errors are homoscedastic then a regression of u2i on (1,zi′) would have a population R2 equal to zero. Therefore, we anticipate a regression of e2i on (1,zi′) to yield an R2 that is insignificantly different from zero. However, if the error variance depends (linearly) on zi then a regression of u2i on (1, zi′) would have a population R2 greater than zero, and so we expect a regression of e2i on (1, zi′) to yield an R2 that is significantly different from zero.
Fourth, the choice of z_i is crucial in the construction of the test. In practice, this choice has to be guided by the context as in Example 4.4. One popular choice of z_i is due to White (1980). To describe this choice, it is convenient to partition the regressor vector as x_i = (1, x̃_i'). White (1980) suggests setting z_i equal to a vector including {x̃_{i,l}; l = 1, 2, ..., k−1} and the terms {x̃_{i,l}x̃_{i,j}; l = 1, 2, ..., j, j = 1, 2, ..., k−1}.18 So for example, if k = 2 – so that x̃_i is a scalar – then z_i = (x̃_i, x̃_i^2)'; if k = 3 then z_i = (x̃_{i,1}, x̃_{i,2}, x̃_{i,1}^2, x̃_{i,2}^2, x̃_{i,1}x̃_{i,2})'. The motivation for this choice is as follows. Recall we have discussed two estimators of the variance of the OLS estimator. The first is \hat{\sigma}_N^2\hat{Q}^{-1} which is consistent if the errors are homoscedastic, and the second is \hat{Q}^{-1}\hat{\Omega}_h\hat{Q}^{-1} which is consistent if the errors are either homoscedastic or heteroscedastic.19 Therefore, if the errors are homoscedastic then
Qˆ−1ΩˆhQˆ−1 − σˆN2 Qˆ−1 →p 0. (4.37)
15Breusch and Pagan (1979) derive the test by applying the Lagrange Multiplier (LM) principle within the framework of Maximum Likelihood estimation under the assumption that the errors have a normal distribution. These techniques are discussed in Chapter 6. As a result, BP is sometimes referred to as the LM test for heteroscedasticity.
16See Section 3.2.
17Here κ represents the argument of h.
18 In this description, it is assumed that the listed choice includes no duplicates. For example,
if x ̃i,l is a dummy variable then x ̃i,l = x ̃2i,l for all i. In such cases, x ̃2i,l would be omitted from zi.
19See equations (3.12) and (4.18) respectively above.
It can be verified that (4.37) holds if and only if

\hat{\Omega}_h − \hat{\sigma}_N^2\hat{Q} = N^{-1}\sum_{i=1}^{N}(e_i^2 − \hat{\sigma}_N^2)x_i x_i' \to_p 0.   (4.38)

Therefore, White (1980) suggests testing for heteroscedasticity by examining whether N^{-1}\sum_{i=1}^{N}(e_i^2 − \hat{\sigma}_N^2)x_i x_i' is insignificantly different from zero. The resulting procedure is referred to as White's "direct test for heteroscedasticity"; however, it has been recognized that this is just the Breusch-Pagan test with the choice of z_i described above.20 An obvious advantage of this choice of z_i is that the test is based on the part of the OLS framework that is affected by heteroscedasticity, that is, our estimator of the variance of the estimator: if we fail to reject then we can use the standard OLS inference framework based on the conventional variance estimator in equation (3.12); if we reject then we must use the heteroscedasticity-robust variance estimator – equation (4.18) – to construct the statistics on which our inferences are based. A disadvantage is that if k is relatively large then the dimension p – which is k(k + 1)/2 − 1 – can become large. In this case, it may be difficult to detect heteroscedasticity that depends on only a few elements of z_i. One solution is to choose z_i so that it contains a relatively small number of functions of {x̃_{i,l}; l = 1, 2, ..., k−1} and {x̃_{i,l}x̃_{i,j}; l = 1, 2, ..., j, j = 1, 2, ..., k−1}; however, we do not pursue these here; see Wooldridge (2002)[p.126-8] for further discussion.
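Both forms of the test are easy to compute from two OLS regressions, as mentioned above. The sketch below implements the NR^2 version of the Breusch-Pagan statistic and a helper that assembles the z_i vector used in White's version from the non-constant regressors, their squares and cross-products (duplicate columns would need to be dropped, as noted in footnote 18). The data and all function names are hypothetical illustrations.

import numpy as np
from itertools import combinations
from scipy import stats

def breusch_pagan(y, X, Z):
    # BP = N * R^2 from regressing the squared OLS residuals on an intercept and z_i
    N = X.shape[0]
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    Z1 = np.column_stack([np.ones(N), Z])
    f = e ** 2
    fitted = Z1 @ np.linalg.solve(Z1.T @ Z1, Z1.T @ f)
    R2 = 1.0 - np.sum((f - fitted) ** 2) / np.sum((f - f.mean()) ** 2)
    bp = N * R2
    return bp, 1.0 - stats.chi2.cdf(bp, df=Z.shape[1])

def white_test_regressors(X_tilde):
    # z_i built from levels, squares and cross-products of the non-constant regressors
    cols = [X_tilde, X_tilde ** 2]
    for j, l in combinations(range(X_tilde.shape[1]), 2):
        cols.append((X_tilde[:, j] * X_tilde[:, l])[:, None])
    return np.column_stack(cols)

# illustration with k = 3 (two non-constant regressors)
rng = np.random.default_rng(4)
N = 300
X_tilde = rng.uniform(1.0, 5.0, size=(N, 2))
X = np.column_stack([np.ones(N), X_tilde])
y = 1.0 + X_tilde @ np.array([0.5, -0.3]) + rng.normal(0.0, X_tilde[:, 0])
print(breusch_pagan(y, X, X_tilde[:, [0]]))                 # BP test with z_i = x_tilde_{i,1}
print(breusch_pagan(y, X, white_test_regressors(X_tilde)))  # White's choice of z_i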
4.3.4 Empirical illustration
In this sub-section, we illustrate the methods described above with the model in Example 4.4, using a sample of 100 families.21 The OLS estimated regression
line is:

y_i = 124.84 + 0.147 m_i
     (655.39)  (0.058)
     [522.91]  [0.061]

The number in parentheses is the associated standard error calculated using the standard OLS formula defined in Section 2.6 just below equation (2.36); the number in brackets is the heteroscedasticity-robust standard error calculated as N^{-1/2} times the square root of the appropriate diagonal element of \hat{V}_h in (4.18). The R^2 for this equation is 0.0621 and so indicates income only explains approximately 6% of the variation in savings.

We also estimate the model using WLS with w_i = 1/\sqrt{m_i}. The results are as follows:

y_i = −124.95 + 0.172 m_i
     (480.86)   (0.057)
     [266.59]   [0.050]
20Note that since both matrices in (4.38) are symmetric, the test is only based on the lower (or upper) triangular elements to avoid duplication.
21The data is contained in the file SAVING.RAW used in Wooldridge (2006). We replicate part of the analysis in his Example 8.6 (p.287).
The numbers in parentheses are the standard errors calculated under the assumption that Var[u_i|x_i] = σ_0^2 m_i so that this WLS estimation is actually GLS; the numbers in brackets are the standard errors based on a heteroscedasticity-robust estimator.
Comparing the two estimators of β_{0,2}, it can be seen that both are positive with the WLS being slightly larger: the OLS results suggest that for every $1 of additional income savings increase by 14.7 cents ceteris paribus; the WLS results suggest that for every $1 of additional income savings increase by 17.2 cents. Notice that the standard errors are smaller for the WLS estimation than for the OLS estimation, suggesting the weighting has led to a more precise estimation. We now consider whether the results indicate a significant relationship between income and savings. To this end, we test H_0: β_{0,2} = 0 versus H_A: β_{0,2} ≠ 0. First consider the OLS-based inferences: using the results reported above, we have:
• OLS results:
– Using the conventional OLS standard errors, the test statistic is:
\hat{\tau}_{N,2}(0) = 0.1466/0.0575 = 2.5479
and so the p-value of the test is 0.0124;
– Using the White standard errors, the test statistic is:
\hat{\tau}_{N,2}(0) = 0.1466/0.0607 = 2.4145
and so the p-value of the test is 0.0176.
• WLS results:
– Using standard errors based on the assumption that Var[u_i|x_i] = σ_0^2 m_i, the test statistic is:
\hat{\tau}_{N,2}(0) = 0.1718/0.0568 = 3.0232
and so the p-value of the test is 0.0032;
– Using the heteroscedasticity robust standard errors, the test statistic
is:
\hat{\tau}_{N,2}(0) = 0.1718/0.0500 = 3.4340
and so the p-value of the test is 0.0009.
Therefore the OLS results allow us to conclude that income significantly affects savings at the 5% significance level, but the WLS estimation allows us to con- clude this relationship is significant at the 1% significance level and so provides stronger evidence of a relationship. Although there is a slight difference, the re- sults are qualitatively similar suggesting heteroscedasticity may not be a factor here and this is confirmed by the Breusch-Pagan test using zi = (1,mi): the test statistic is 0.9235 with a p-value of 0.3366.
4.4 Time series and non-spherical errors
The errors in models for economic time series can have a non-spherical distribution due to heteroscedasticity and/or serial correlation. If the non-sphericality is due only to conditional heteroscedasticity then the issues surrounding OLS- or GLS-based inferences are essentially the same as those described in the previous section for models estimated from cross-section data. If the source is serial correlation then the treatment depends on the framework adopted. In Section 3.3.2, we provide two sets of regularity conditions under which the standard large sample OLS-based inference framework is valid with time series data. Both these sets of regularity conditions imply the errors are serially uncorrelated but they differ in how this property is imposed. In the first set of regularity conditions – Assumptions TS1-TS6 – it is imposed directly via Assumption TS6. In the second set of regularity conditions – Assumptions TS1-TS3, TS5 and TS7 – the lack of serial correlation in the errors is a consequence of the assumption that the model is dynamically complete (Assumption TS7). While both sets of conditions yield the same large sample distribution for the OLS estimator, the framework adopted impacts on how we view the consequences of serial correlation in the errors.
Within the context of the first set of conditions, serial correlation represents a violation of Assumption TS6 alone, and so in this case the OLS estimator is still consistent and the impact of serial correlation is that it induces a non- spherical error distribution. In such cases, there is scope for OLS-based inference provided the variance-covariance estimator is appropriately modified and also potentially for GLS estimation.
In contrast, within the context of the second set of regularity conditions, serial correlation is a consequence of a violation of the assumption that the model is dynamically complete. In this case, the model is misspecified and OLS estimators are inconsistent; the only solution is to revisit the model specification.
Below we expand on this summary and our discussion is organized as follows. Section 4.4.1 briefly describes the class of ARCH/GARCH processes that are commonly used to capture conditional variation in economic time series. Section 4.4.2 discusses the impact of serial correlation on the OLS-based inferences. Section 4.4.3 considers GLS estimation. Section 4.4.4 discusses testing for serial correlation, and Section 4.4.5 presents an empirical illustration of the techniques described in this section.
4.4.1 Conditional heteroscedasticity in time series models
Following the seminal paper by Engle (1982), it is customary to allow for the possibility that economic time series exhibit conditional heteroscedasticity. Engle (1982) introduces the class of Autoregressive Conditional Heteroscedasticity
(ARCH) processes. The simplest example of such a process is:

y_t = \sigma_t\varepsilon_t,
\sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2,
\varepsilon_t \sim IN(0,1).
Since there are no regressors in this model, we define the information set at time t to be the history of the series that is, It = {yt−1, yt−2, . . .}. Given this specification, the first two conditional moments of yt are
E[y_t|I_t] = 0
and
Var[y_t|I_t] = \sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2.
Thus the conditional mean is zero and the conditional variance has an autoregressive structure in the sense that \sigma_t^2 depends on y_{t-1}^2. Therefore, there is a sense in which an ARCH process can be viewed as an AR model for the conditional variance. The specification above is known as an ARCH process of order one – denoted ARCH(1) – because the conditional variance depends on y_{t-1}^2. This
framework easily extends to higher order processes: if the conditional variance equation above is replaced by
\sigma_t^2 = \alpha_0 + \sum_{i=1}^{p}\alpha_i y_{t-i}^2,
then y_t is said to follow an ARCH process of order p, denoted by ARCH(p). As with AR models, the choice of p is usually data driven, and it is often found that a relatively large value of p is needed. This is problematic if it is desired to estimate the conditional variance because to do so involves estimation of p + 1 parameters. As an alternative, Bollerslev (1986) introduces the class of Generalized ARCH or GARCH models which – as the name suggests – contains ARCH models as a special case. To illustrate, consider the popular GARCH(1,1) model:
y_t = \sigma_t\varepsilon_t,
\sigma_t^2 = \alpha_0 + \alpha_1 y_{t-1}^2 + \delta_1\sigma_{t-1}^2,
\varepsilon_t \sim IN(0,1).
The key innovation of GARCH models is that σt2 depends on its own lagged values as well as those of y. As a result, GARCH models are able to capture the conditional variance of economic time series with fewer parameters than an ARCH model.
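To see the behaviour such models generate, the following sketch simulates a GARCH(1,1) series using the recursion above; the parameter values are arbitrary illustrative choices (with α_1 + δ_1 < 1 so that the unconditional variance exists), and plotting y typically displays the volatility clustering discussed next.

import numpy as np

def simulate_garch11(T, alpha0=0.1, alpha1=0.1, delta1=0.8, seed=0):
    # y_t = sigma_t * eps_t,  sigma_t^2 = alpha0 + alpha1 * y_{t-1}^2 + delta1 * sigma_{t-1}^2
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(T)
    y = np.zeros(T)
    sigma2 = np.zeros(T)
    sigma2[0] = alpha0 / (1.0 - alpha1 - delta1)     # start at the unconditional variance
    y[0] = np.sqrt(sigma2[0]) * eps[0]
    for t in range(1, T):
        sigma2[t] = alpha0 + alpha1 * y[t - 1] ** 2 + delta1 * sigma2[t - 1]
        y[t] = np.sqrt(sigma2[t]) * eps[t]
    return y, sigma2

y, sigma2 = simulate_garch11(1000)
print(y[:5], sigma2[:5])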
ARCH/GARCH models are widely applied in empirical finance because they provide a way to capture so-called “volatility clustering”. The volatility of a stock refers to how widely the price of the stock varies within a particular period of time. High volatility in a given period tends to be followed by high
volatility in the next period and so periods of high volatility tend to be clustered together. An intuitive explanation of this phenomenon is that asset prices become more volatile when important economic news is released, and it may take several periods for the market to fully process this new information.22
To allow for the possibility of conditional variation, we replace Assumption TS5 by the following condition in our analysis here.
Assumption TS5-H. V ar[yt | It] = σt2.
Models with time varying conditional variance involve aspects of behaviour beyond just the first two moments of the time series. Therefore, we replace Assumption TS2 by the following assumption.
Assumption TS2-SS. (yt, h′t) is a strongly stationary, weakly dependent time series.
Recall that under this assumption the (unconditional) expectation of all functions of the data are invariant to t. As with Assumption TS2, Assumption TS2-SS also implicitly imposes the necessary conditions under which we can deduce the WLLN and CLT for the relevant functions of {yt, ht} below.
4.4.2 OLS
We divide our discussion into two parts reflecting the two sets of regularity conditions used to underpin the large sample analysis of OLS in Section 3.3.2.
(i) Assumption TS4 holds but Assumption TS6 is violated:
To motivate this discussion, we begin by considering an example that illustrates that this pattern of behaviour can occur in economic models of interest.
Example 4.5. Exchange rates can be quoted as either spot or future rates. The spot rate is the exchange rate between the two currencies assuming the transaction takes place immediately; the future rate is the exchange rate between the two currencies assuming that the transaction takes place at some specified time in the future. As such, future exchange rates can be regarded as predictions based on information today about what the spot rate will be at the specified time in the future. The simple efficient-markets hypothesis implies that future exchange rates are optimal predictors of the future spot rates in the sense that they represent the conditional expectation of the future spot rate given all information available today. To express this idea mathematically, we introduce the following notation: f_{t,k} is the log of the k-period forward exchange rate at time t, s_{t+k} is the log of the spot exchange rate in period t+k and Ω_t denotes the information set available at time t. Following Hansen and Hodrick (1980), the simple efficient-markets hypothesis implies that E[s_{t+k}|Ω_t] = f_{t,k}.
22For a review of ARCH/GARCH models in finance see Bollerslev, Chou, and Kroner (1992). While, inevitably, some of the discussion is dated, this review provides a good introduction to these models and their uses in finance.
Now consider the following regression model:

s_{t+k} = \beta_{0,1} + \beta_{0,2} f_{t,k} + u_t^{(k)},   (4.39)

where u_t^{(k)} is an error term. This type of model is often referred to as a predictive regression model because it captures the relationship between the quantity being predicted, s_{t+k} here, and its predictor, f_{t,k} here. As a result, u_t^{(k)} = s_{t+k} − β_{0,1} − β_{0,2} f_{t,k} is often referred to as the prediction error in this context. From this model it follows that:

E[s_{t+k}|\Omega_t] = \beta_{0,1} + \beta_{0,2} E[f_{t,k}|\Omega_t] + E[u_t^{(k)}|\Omega_t].

Since f_{t,k} ∈ Ω_t, it follows that

E[s_{t+k}|\Omega_t] = \beta_{0,1} + \beta_{0,2} f_{t,k} + E[u_t^{(k)}|\Omega_t],

and so within this model the simple efficient markets hypothesis equates to E[u_t^{(k)}|Ω_t] = 0 and the parameter restriction (β_{0,1}, β_{0,2}) = (0, 1).

One way to assess whether exchange rates exhibit the behaviour implied by the simple efficient markets hypothesis is to estimate (4.39) via OLS and then test the parameter restrictions above. To this end, it is necessary to consider the properties of the error u_t^{(k)}. Using the Law of Iterated Expectations, Hansen and Hodrick (1980) show that under the simple efficient markets hypothesis

E[f_{t,k} u_t^{(k)}] = E[f_{t,k}(s_{t+k} − f_{t,k})] = 0,

and so the regressor in our model is uncorrelated with the error, meaning that the OLS estimator, β̂_T, is consistent for β_0 = (0, 1)'. Now consider what can be said about the serial correlation of the errors. To this end consider

Cov[f_{t,k} u_t^{(k)}, f_{t-n,k} u_{t-n}^{(k)}] = E[u_t^{(k)} u_{t-n}^{(k)} f_{t,k} f_{t-n,k}].

Using the Law of Iterated Expectations, we have

E[u_t^{(k)} u_{t-n}^{(k)} f_{t,k} f_{t-n,k}] = E\left[ E[u_t^{(k)} u_{t-n}^{(k)} f_{t,k} f_{t-n,k} | \Omega_t] \right].   (4.40)

However, it is only if n ≥ k that {s_{t-n+k}, f_{t-n,k}, f_{t,k}} ∈ Ω_t (and so u_{t-n}^{(k)} ∈ Ω_t) and we can use E[u_t^{(k)}|Ω_t] = 0 in (4.40) to deduce that

E[u_t^{(k)} u_{t-n}^{(k)} f_{t,k} f_{t-n,k}] = E\left[ E[u_t^{(k)}|\Omega_t]\, u_{t-n}^{(k)} f_{t,k} f_{t-n,k} \right] = 0.

Therefore, in this example, the simple efficient markets hypothesis implies that f_{t,k} u_t^{(k)} (and u_t^{(k)}) are only serially uncorrelated at lags k, k + 1, k + 2, . . .
We now consider how to perform OLS-based inference in this setting. Based on (3.25), we can deduce the following result under our conditions here.
Theorem 4.6. If Assumptions TS1, TS2-SS, TS3, TS4, and TS5-H hold then:

T^{1/2}(\hat{\beta}_T - \beta_0) \to_d N(0, V_{sc}),

where V_{sc} = Q^{-1}\Omega Q^{-1},

\Omega = \Gamma_0 + \sum_{i=1}^{\infty}(\Gamma_i + \Gamma_i'),

and \Gamma_i = Cov[x_t u_t, x_{t-i} u_{t-i}].
Therefore, to perform inference based on the OLS estimator, it is necessary to construct a consistent estimator of Ω. There has been a considerable literature in econometrics on the construction of consistent estimators of the long run variance. Here we describe the most popular method: heteroscedasticity and autocorrelation consistent (HAC) covariance matrix estimation.
Given this structure, it is natural to estimate Ω by truncating this infinite sum and estimating \Gamma_j by \hat{\Gamma}_j = T^{-1}\sum_{t=j+1}^{T} x_t x_{t-j}'\hat{u}_t\hat{u}_{t-j}, where \hat{u}_t = u_t(\hat{\beta}_T). This leads to the estimator

\hat{\Omega}_{TR} = \hat{\Gamma}_0 + \sum_{i=1}^{l_T}(\hat{\Gamma}_i + \hat{\Gamma}_i'),   (4.41)

where "TR" stands for truncated. White and Domowitz (1984) first proposed this type of estimator and showed its consistency in certain least squares settings provided l_T → ∞ as T → ∞ at a certain rate. This would appear to solve the problem, but does not. While \hat{\Omega}_{TR} converges in probability to a positive definite matrix, it may be indefinite in finite samples – or in other words, if k = 1 then the estimated variance is not guaranteed to be positive. The source of the trouble is not the truncation but the weights given to the sample autocovariances in (4.41). The solution is to construct an estimator in which the contribution of the {\hat{\Gamma}_j} are weighted to downgrade their role sufficiently in finite samples to ensure positive semi-definiteness but have the weights tend to one as T → ∞ to ensure consistency. This is the intuition behind the class of HAC estimators which take the form

\hat{\Omega}_{HAC} = \hat{\Gamma}_0 + \sum_{i=1}^{T-1}\omega_{i,T}(\hat{\Gamma}_i + \hat{\Gamma}_i'),   (4.42)

where \omega_{i,T} is known as the kernel (or weight). The kernel must be carefully chosen to ensure the twin properties of consistency and positive semi-definiteness. A number of choices have been suggested, but for brevity we focus here on the choice proposed by Newey and West (1987) in which

\omega_{i,T} = 1 − a_i for a_i ≤ 1, and \omega_{i,T} = 0 for a_i > 1,
where a_i = i/(b_T + 1). This choice yields what is sometimes referred to as the "Bartlett" or "Newey-West" kernel (or HAC estimator).23 The parameter b_T is known as the bandwidth, and must be non-negative. Notice that this parameter controls the number of autocovariances included in the HAC estimator with this choice of kernel.24 In general, for \hat{\Omega}_{HAC} to be consistent, we require b_T → ∞ as T → ∞, that is, we must include an increasing number of autocovariances as the sample size increases, but the rate at which we do so is important. Andrews (1991) presents theoretical arguments to justify setting b_T proportional to T^{1/3} for the Bartlett kernel; Newey and West (1994) suggest setting b_T proportional to T^{2/9}. However, this type of condition provides little practical guidance because it only restricts the optimal bandwidth for the Bartlett weights to be of the form cT^{1/3} (or cT^{2/9}, depending on who you choose to believe) for any choice of finite c > 0. Both Andrews (1991) and Newey and West (1994) provide methods for choosing c. In practice, T^{1/3} or T^{2/9} makes little difference, and researchers tend to assess the robustness of their results to different choices of c.
Therefore, if we set \hat{V}_{sc} = \hat{Q}^{-1}\hat{\Omega}_{HAC}\hat{Q}^{-1} then it can be shown that \hat{V}_{sc} \to_p V_{sc}. The square roots of the diagonal elements of \hat{V}_{sc}/T are sometimes referred to as the "Newey-West" standard errors and can be obtained in many regression packages. We can then perform all the large sample inference discussed in Section 3.2 provided we replace \hat{\sigma}_T^2\hat{Q}^{-1} by \hat{V}_{sc}. Since the modification to the inference methods is qualitatively the same as in the heteroscedasticity case, we do not provide illustrations here.
Finally, note that if the errors are believed to be conditionally heteroscedas- tic (Assumption TS5-H) but satisfy Γi = 0 for all i ̸= 0 then inferences based on Vˆsc remain valid for non-zero bT but it suffices to set bT = 0; in the latter case, the Newey-West standard errors reduce to White’s standard errors.25 ⋄
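For completeness, here is a minimal numpy sketch of the HAC variance calculation with the Bartlett/Newey-West weights described above, applied to the OLS moment terms x_t\hat{u}_t; the bandwidth rule and the simulated data are illustrative assumptions only. (statsmodels reports comparable standard errors via cov_type="HAC" with a chosen maxlags.)

import numpy as np

def newey_west_variance(X, e, b_T):
    # Vhat_sc = Qhat^{-1} Omegahat_HAC Qhat^{-1} with Bartlett weights 1 - i/(b_T + 1)
    T = X.shape[0]
    Q_inv = np.linalg.inv(X.T @ X / T)
    g = X * e[:, None]                               # rows are x_t * uhat_t
    Omega = g.T @ g / T                              # Gammahat_0
    for i in range(1, int(b_T) + 1):
        w = 1.0 - i / (b_T + 1.0)
        Gamma_i = g[i:].T @ g[:-i] / T               # Gammahat_i
        Omega += w * (Gamma_i + Gamma_i.T)
    return Q_inv @ Omega @ Q_inv

# illustrative model with autocorrelated regressor and errors (but E[x_t u_t] = 0)
rng = np.random.default_rng(6)
T = 500
x = np.zeros(T)
u = np.zeros(T)
for t in range(1, T):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()
    u[t] = 0.5 * u[t - 1] + rng.standard_normal()
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(T), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat
b_T = int(np.floor(1.2 * T ** (1.0 / 3.0)))          # bandwidth proportional to T^{1/3}
V_sc = newey_west_variance(X, e, b_T)
print(beta_hat, np.sqrt(np.diag(V_sc) / T))           # Newey-West standard errors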
(ii) Assumption TS7 is violated:
In this case, the model is misspecified and so – in the absence of further informa- tion – there is no reason to suppose Assumption TS4 holds, meaning the OLS estimator is inconsistent. The solution here is not to modify our OLS inference framework but to modify the model to address the misspecification. In principle, this misspecification can arise in three ways: the model is linear and we have not included the correct variables on the right-hand side; we have included the correct variables on the right-hand side but the model is nonlinear; the model is nonlinear and we have also not included the correct variables on the right-hand side. Here, we focus on the first of these; you explore certain nonlinear mod- els for time series in ECON60522. Staying within the linear regression model framework, the misspecification is addressed by the inclusion of additional lags of the current variables and/or other variables in the model until the evidence
23“Bartlett” is in reference to its proponent in an earlier application in the time series literature.
24As a result, the kernel is said to be “truncated” and the resulting covariance matrix a member of the class of “truncated” HAC estimators.
25However, see Tutorial 7 Question 4.
of serial correlation disappears. Once the model appears correctly specified, the OLS-based inference frameworks described in Sections 3.3.2 or 4.4.2 can be used depending on whether the errors are considered to be potentially conditionally heteroscedastic. ⋄
Which approach should we take? The answer depends on the setting. The first approach is only really appropriate in settings where we are specifically interested in the impact of xt on yt and the underlying model implies that Assumption TS4 holds even though xtut is serially correlated.26 Outside of this type of setting, practitioners typically follow the second approach and so re-assess the model specification. Even in cases where Assumption TS4 is ar- guably valid, inferences based on OLS estimators with HAC estimators can be unreliable in the sample sizes typically encountered with macroeconomic data.
4.4.3 GLS
To implement GLS, it is necessary to specify a model for the conditional variance and serial correlation. Since we have covered GLS and heteroscedasticity in Section 4.3.2, we focus here on the case where the errors are homoscedastic (Assumption TS5 holds) but serially correlated. Historically, a common approach to GLS in this setting was to assume the errors followed an AR process. To illustrate, suppose that the model is:
y_t = x_t'\beta_0 + u_t,   (4.43)

where

u_t = \rho_0 u_{t-1} + \varepsilon_t,   (4.44)
and ε_t is an unobservable random variable that is independently and identically distributed with mean zero and variance σ_ε^2. We assume E[x_t u_t] = 0 so that the OLS estimator based on (4.43), β̂_T, is consistent for β_0. It can be shown that GLS estimation of (4.43) is OLS based on the transformed regression model,

\ddot{y}_t = \ddot{x}_t'\beta_0 + \ddot{u}_t,   (4.45)

where \ddot{y}_t = y_t − ρ_0 y_{t-1}, \ddot{x}_t = x_t − ρ_0 x_{t-1} and \ddot{u}_t = u_t − ρ_0 u_{t-1}. (This transformation is known as the "quasi-difference" of the variable in question.) It can be recognized that \ddot{u}_t = ε_t and so – as it is designed to do – the transformation behind GLS produces an error term that is mean zero, homoscedastic and serially uncorrelated. However, it should be noted that the conditions for GLS to be consistent are
E[\ddot{x}_t\ddot{u}_t] = E[(x_t − \rho_0 x_{t-1})(u_t − \rho_0 u_{t-1})] = 0,
26This rules out cases where xt contains lagged dependent variables, and for analysis of this case see Tutorial 7 Question 3.
and so requires not only E[xtut] = 0 but (in general) also E[xt+1ut] = 0 and E[xt−1ut] = 0. Thus, if the model for the errors is correct then the conditions for the consistency of GLS are stronger than those for the consistency of OLS.
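For illustration, a feasible version of this transformation can be sketched in the spirit of the classical Cochrane-Orcutt procedure (which the notes do not cover explicitly): estimate ρ_0 from the OLS residuals, quasi-difference the data, and apply OLS to the transformed model (4.45). Everything below, including dropping the first observation, is an illustrative assumption rather than the notes' prescription.

import numpy as np

rng = np.random.default_rng(7)
T = 500
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.6 * u[t - 1] + rng.standard_normal()
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(T), x])

# Step 1: OLS, then estimate rho from the first order autocorrelation of the residuals
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_ols
rho_hat = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Step 2: quasi-difference and apply OLS to the transformed model (4.45)
y_qd = y[1:] - rho_hat * y[:-1]
X_qd = X[1:] - rho_hat * X[:-1]
beta_gls = np.linalg.solve(X_qd.T @ X_qd, X_qd.T @ y_qd)
print(rho_hat, beta_gls)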
There is in fact a connection between GLS and the strategy of including additional lagged variables on the right-hand side. Equation (4.45) can be re-written as:

y_t = \rho_0 y_{t-1} + x_t'\beta_0 − x_{t-1}'(\rho_0\beta_0) + \varepsilon_t.   (4.46)

This is a special case of the linear regression model

y_t = \delta_0 y_{t-1} + x_t'\delta_1 + x_{t-1}'\delta_2 + \varepsilon_t,   (4.47)

in which the parameters satisfy the following restrictions

\delta_2 + \delta_0\delta_1 = 0.   (4.48)
The restrictions in (4.48) are known as the common factor or COMFAC restrictions. While once popular, GLS is no longer commonly used in time series analysis for two main reasons. First, the method rests on a particular model for the errors, but as discussed in the previous sub-section serial correlation in the errors can arise for multiple reasons. While it is true that the adoption of an AR model for the errors leads us to estimation of a dynamic regression model with additional lags of the variables included, it also imposes the COMFAC restrictions on the parameters of the model. Second, even if we are prepared to assume an AR model for the errors, it is necessary to place stronger assumptions on the relationship between x and u for the GLS estimator to be consistent than are
necessary for the OLS estimator to exhibit the same property.
4.4.4 Testing for serial correlation
Durbin and Watson (1950) propose what has become the most well known test for serial correlation in regression. The so-called Durbin-Watson statistic is routinely reported in many regression packages. This test is derived within the framework of a regression model with an AR(1) model for the errors that is, equations (4.43)-(4.44) in the previous sub-section. Under the null hypothesis that ρ0 = 0, the errors are serially uncorrelated. The test statistic is:
DW = \frac{\sum_{t=2}^{T}(e_t − e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}.
Durbin and Watson (1950) derive the exact finite sample distribution of DW conditional on X under the Classical regression model assumptions (Assump- tions CA1-CA6). Unfortunately this distribution depends on X, meaning crit- ical values need to be re-calculated for every choice of X. To circumvent this problem, Durbin and Watson (1950) provided bounds on the critical values and these are provided in a number of econometrics texts.27 Multiplying out the
27The test is versus a one-sided alternative, either ρ < 0 or ρ > 0, with the latter being the most relevant with economic data.
numerator, it can be shown that DW ≈ 2(1 − ρˆ) where ρˆ is the sample cor- relation between et and et−1 (known as the first order sample autocorrelation of the residuals). In moderate sized samples the approximation is often good, and using this relationship it can be shown that the DW statistic is approxi- mately distributed N (2, 4/T ) under the null hypothesis of serially uncorrelated errors.28 This distribution can then be used to perform a large sample approx- imate version of the Durbin-Watson test without the need to refer to the tables of bounds.
For our purposes here, the Durbin-Watson test has one other major weak- ness: it is invalid in models where the regressors include lagged values of the de- pendent variables. Working simultaneously but independently, Breusch (1978) and Godfrey (1978) propose the same solution to testing for serial correlation in the errors when the regressors include lagged dependent variables. Like the Breusch-Pagan test for heteroscedasticity, their test is based on an application of the Lagrange Multiplier test principle from maximum likelihood theory. As a result of this genesis, the method is sometimes referred to as the “Breusch- Godfrey” test or, perhaps more often, the LM test for serial correlation.29
In practice there is no reason to limit attention to the AR(1) as the model for serial correlation in the errors under the alternative hypothesis, and Breusch and Godfrey consider the case where the regression model is given by (4.43) and the errors are generated by an AR(p) process,
ut = ρ0,1ut−1 + ρ0,2ut−2 + . . . + ρ0,put−p + εt (4.49)
and εt is an unobservable random variable that is independently and identically distributed with mean zero and variance σε2. The null and alternative hypotheses are:
H_0: ρ_{0,i} = 0, for all i = 1, 2, ..., p,
HA : ρ0,i ̸= 0, for at least one i = 1,2,…,p.
As with the Durbin-Watson test, the null hypothesis is that the errors are serially uncorrelated. The LM statistic is:
LM_p = \frac{e'H_p\{H_p'(I_T − P)H_p\}^{-1}H_p'e}{\hat{\sigma}_T^2},   (4.50)
where e is the vector of OLS residuals, P = X(X'X)^{-1}X', and H_p is a T × p matrix with (t, i) element equal to e_{t−i} for t > i and equal to zero for t ≤ i.
The limiting distribution of the LM statistic under H0 is given by the fol- lowing theorem.
Theorem 4.7. If Assumptions TS1-TS3, TS5 and TS7 hold then: LMp →d χ2p.
28See Harvey (1990)[p.201].
29Durbin (1970) proposes a correction to the DW test that makes it valid when lagged dependent variables are included as regressors. This statistic is commonly referred to as “Durbin’s h-test”. We do not discuss it here as the Breusch-Godfrey test is more general.
It can be shown that LMp = TR2 where R2 is the multiple correlation coefficient from the regression of et on xt, et−1, et−2, … ,et−p. So as with the Breusch-Pagan test for heteroscedasticity, the LM test for serial correlation can be calculated via two OLS regressions: yt on xt, and then et on xt, et−1, et−2, . . . ,et−p. The T R2 interpretation is useful for deriving the intuition behind the test. If ut is uncorrelated with any of its lagged values up to lag p then a regression of ut on xt, ut−1, ut−2, … ,ut−p would have a population R2 equal to zero. Therefore, we anticipate a regression of et on xt, et−1, et−2, . . . ,et−p to yield an R2 that is insignificantly different from zero. However, if ut is correlated with any of its lagged values up to lag p then a regression of ut on xt, ut−1, ut−2, . . . ,ut−p would have a population R2 greater than zero, and so we expect a regression of et on xt, et−1, et−2, . . . ,et−p to yield an R2 that is significantly different from zero. This interpretation also emphasizes that the test detects serial correlation in the errors: while ut being generated by an AR(p) model is one possible source of this serial correlation, it is not the only possible source as discussed above. Therefore, a significant statistic should be taken as evidence of serial correlation in the errors rather than as evidence that the errors follow the AR(p) model specified under the alternative of the test. See Tutorial 7 Questions 1 and 2 for further analysis of this issue.
To implement the LM test, it is necessary to choose a value for p. There are no set rules for this but one natural choice is to make p a multiple of the sampling frequency. So, for example, if the data are quarterly then p is set as 4 or 8 or 12.
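The TR^2 form again means the test can be computed from two OLS regressions, as in the following sketch for a hypothetical data set; lagged residuals are set to zero where unavailable, matching the definition of H_p, and p = 4 is just one possible choice.

import numpy as np
from scipy import stats

def breusch_godfrey(y, X, p):
    # LM_p = T * R^2 from regressing e_t on x_t and p lags of e_t (zeros where unavailable)
    T = X.shape[0]
    e = y - X @ np.linalg.solve(X.T @ X, X.T @ y)
    H = np.zeros((T, p))
    for i in range(1, p + 1):
        H[i:, i - 1] = e[:-i]
    XH = np.column_stack([X, H])
    fitted = XH @ np.linalg.solve(XH.T @ XH, XH.T @ e)
    R2 = 1.0 - np.sum((e - fitted) ** 2) / np.sum((e - e.mean()) ** 2)
    lm = T * R2
    return lm, 1.0 - stats.chi2.cdf(lm, df=p)

rng = np.random.default_rng(8)
T = 300
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.4 * u[t - 1] + rng.standard_normal()
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones(T), x])
print(breusch_godfrey(y, X, p=4))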
4.4.5 Empirical illustration
To illustrate the methods described in this section, we return to the model for traffic fatalities in Example 2.2. Recall this example involved monthly time series for the period January 1981 through December 1989. Since the controls include the monthly dummy variables and a linear time trend, the series, y_t, is not stationary and so this setting does not fall within the framework in the previous section. However, as noted in Section 3.5, our LS-based inferences can be extended to models including these types of regressors. In Example 2.11, we investigated whether there is evidence that the introduction of the seat belt law led to a reduction in the percentage of traffic accidents involving fatalities, and found marginal support in favour of this view, rejecting at the 10% but not the 5% significance level. However, that inference procedure is only valid if the errors are both conditionally homoscedastic and serially uncorrelated. Here we focus on the validity of the second of these assumptions. Applying the Breusch-Godfrey LM test described in Section 4.4.4 with p = 1, the resulting test statistic is 8.506 with a p-value of 0.0035, meaning we can reject the null hypothesis that the first order autocorrelation of the errors is zero at all conventional significance levels. Therefore, the sample evidence suggests the errors are serially correlated and so our previous inferences about the impact of the changes in highway regulations are invalid. We now explore the three possible responses to a significant Breusch-Godfrey test.
First, we use the heteroscedasticity autocorrelation robust versions of our OLS-based test statistics as described in Section 4.4.2. Using the HAC estimator, the Newey-West standard error of β̂_belt is 0.030. This is larger than the analogous conventional OLS standard error reported in Example 2.3,30 and this impacts on our inference. The test statistic is now equal to minus one (to 3dp) and so the p-value for the test of H_0: β_{belt,0} ≥ 0 versus H_1: β_{belt,0} < 0 is ≈ 0.16, meaning that the test does not reject at the 10% significance level.
Second, we consider GLS estimation based on the assumption that the errors follow an AR(1) process. The GLS estimated equation is:
y_t = controls + 0.0651 MPH_t − 0.025 BELT_t,
                 (0.027)         (0.030)
and the estimate of ρ is 0.288 with a standard error of 0.103. Consider again the test of H0 : βbelt,0 ≥ 0 versus H1 : βbelt,0 < 0. Using the GLS results, the t-statistic is −0.833 with a p-value of 0.202, and so we again fail to reject H0 at the 10% significance level.
With this GLS estimation, we have taken the evidence that the first order autocorrelation of the errors is non-zero to mean the errors follow an AR(1). This is a big leap of faith. As argued above, there are many reasons why the errors may appear serially correlated. One possible explanation is that the original model is not dynamically complete. To this end, we re-estimate the model via OLS but this time including the dependent variable lagged one period as an explanatory variable.31 The estimated regression model is:
y_t = controls + 0.044 MPH_t − 0.020 BELT_t + 0.313 y_{t−1}
                 (0.021)       (0.023)        (0.103)
where the numbers in parentheses are the associated standard errors calculated using the standard OLS formula defined in Section 2.6. Applying the Breusch-Godfrey test with p = 1 in this model, we obtain a test statistic of 0.315 with a p-value of 0.57, and so there is no evidence of serial correlation. Using conventional OLS-based standard errors, the t-statistic for testing H_0: β_{belt,0} ≥ 0 versus H_1: β_{belt,0} < 0 is −0.870 with a p-value of 0.192. So once again, we fail to reject H_0 at the 10% significance level. Collectively, these results provide little evidence based on this sample that the introduction of the seat belt law led to a reduction in the percentage of traffic accidents which involved fatalities.
30See Section 2.6.
31In truth, this is a very limited model specification search and in practice a more compre- hensive search would be undertaken. Although, as reported below, the inclusion of the lagged dependent variable removes evidence of serial correlation, it is hard to find a compelling ex- planation for why yt depends positively on its value in the previous month and so perhaps there is a more fundamental misspecification here.
Chapter 5
Instrumental Variables estimation
In previous chapters, we considered frameworks for inference about the parameters of regression models estimated from either cross-section or time series data using OLS. In each case, the inference procedures in question are justified by large sample analysis under certain conditions on how the data are generated. While the settings varied, one condition assumed throughout is that the regressors are uncorrelated with the errors and so our inferences are based on a consistent estimator of β0.1 While this assumption is arguably valid in many econometric models, it is not valid in all models of interest. If the regressors are correlated with the error then OLS is an inconsistent estimator and as a result all subsequent inferences based on the estimator are misleading. In such circumstances, we abandon OLS and turn to an alternative method of estimation of β0 known as Instrumental Variables (IV). In this chapter, we introduce IV estimation and explore IV-based inference in linear models estimated from cross-section and time series data.
An outline of the chapter is as follows. Section 5.1 explores the properties of the OLS estimator in models where the regressors are correlated with the errors and discusses why this correlation may occur in econometric models. Section 5.2 demonstrates that OLS can be viewed as a Method of Moments estimator, and this interpretation is used to motivate the form of IV estimators in Section 5.3. In Section 5.4, we present a large sample analysis of the IV estimator and associated methods for inference about β0. Originally, IV was introduced as a method for estimation of the parameters of an equation that is part of a system of linear simultaneous equations. In practice, IV estimation is often viewed in this way and in Section 5.5 we explore this perspective on IV, which leads to a discussion of the Two Stage Least Squares (2SLS) method of estimation. As emerges below, the implementation of IV rests crucially on the
1This property follows from Assumption CS4 in our discussion of cross-section data and Assumption TS4 in our discussion of time series data.
existence of a vector of observable variables known as instruments that possess certain properties. In Section 5.6, we explore choices of instruments that have been used in three well-known empirical studies. All the inferences in this chapter are based on asymptotic analysis and it has been recognized that for certain choices of instruments this limiting distribution theory can be a poor approximation even in large samples. This type of situation is discussed in Section 5.7 in which the concept of a so-called weak instrument is introduced. Section 5.8 concludes the chapter with an empirical illustration of IV estimation.
5.1 Endogenous regressor models
For completeness, we begin by establishing that the OLS estimator is incon- sistent if the regressors are correlated with the errors. For concreteness, we consider the case of cross-sectional data but the analysis is qualitatively the same for time series data. Therefore, we consider the linear regression model
yi = x′iβ0 + ui
and impose Assumptions CS1-CS3 as in our earlier analyses but now assume that E[xiui] = μ ̸= 0.2 Following the same steps as those used in Section 3.2 to derive the probability limit of βˆT but under the conditions here, we have
βˆN − β0 = ( N−1 Σ_{i=1}^{N} xix′i )−1 N−1 Σ_{i=1}^{N} xiui,    (5.1)

where, using the WLLN,

N−1 Σ_{i=1}^{N} xix′i →p E[xix′i] = Q, a p.d. matrix,    (5.2)

N−1 Σ_{i=1}^{N} xiui →p E[xiui] = μ ̸= 0.    (5.3)

Using Slutsky's Theorem, it follows from (5.1)-(5.3) that

βˆN →p β0 + Q−1μ.    (5.4)
Since Q−1 is nonsingular and μ ̸= 0 by assumption, it follows that Q−1μ ̸= 0 and so, from (5.4) that βˆN does not converge in probability to β0.
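The nature of this inconsistency is easily illustrated by simulation. The following R sketch uses arbitrarily chosen parameter values: a common unobserved component makes the single regressor and the error correlated, so the OLS slope converges to β0 + Q−1μ rather than to β0.

set.seed(1)
N <- 10000
beta0 <- 1
a <- rnorm(N)            # common component shared by the regressor and the error
x <- a + rnorm(N)        # so Q = E[x^2] = 2
u <- a + rnorm(N)        # and mu = E[x*u] = 1
y <- beta0 * x + u
coef(lm(y ~ x - 1))      # close to beta0 + mu/Q = 1.5, not to beta0 = 1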
Notice that the condition E[xiui] ̸= 0 does not imply necessarily that all the regressors are correlated with the error but only that at least one of the regressors is correlated with the error. In fact, in many models where E[xiui] = 0 is violated, some of the regressors are correlated with errors and some are not, see Examples 5.1-5.3 below. If an explanatory variable in the regression model
2This condition would hold if E[ui|xi] = m(xi) and E[m(xi)xi] = μ ̸= 0.
is correlated with the errors then it is commonly referred to as an endogenous regressor. There are three main reasons why endogenous regressors occur in econometric models: reverse causality, omitted variables and measurement error. We elaborate on each in turn using an example.
Example 5.1. Economic development and institutions
Acemoglu, Johnson, and Robinson (2001) (AJR hereafter) analyze whether the differences between income per capita across countries can be explained by differ- ences in institutions and property rights. Their analysis is based on the following equation:
yi = μ+αri +wi′γ+ui, (5.5)
where yi is log GDP per capita in country i, ri is the average protection against expropriation, and wi contains certain control variables that can be assumed uncorrelated with the error ui. In this model, ri captures the differences in institutions and property rights. This variable measures the risk of expropriation of private foreign investment by government and is recorded as an integer score on a scale of zero to ten, with a higher score meaning less risk.
It is certainly plausible that differences in institutions and property rights affect economic performance: better institutions and more secure property rights encourage more investment in human and physical capital which in turn leads to greater income. However, assessment of the sign and significance of this effect - α in (5.5) - is complicated by “reverse causality” that is growth in income arguably leads to the development of better institutions.
Example 5.2. Returns to education
Angrist and Krueger (1991) investigate the returns to education using a model
ln[wi] = θ1 + θe edi + h′iγ + ui,
where wi equals the weekly wage of individual i, edi is the number of years of education of individual i, and hi is a vector containing certain control variables that can be assumed uncorrelated with the error ui. Here the coefficient of interest is θe which is the semi-elasticity of the wage with respect to education.3 In this example, edi is likely an endogenous regressor due to "omitted variables". In particular, both education and the wage rate likely depend on the "innate ability" of the individual concerned, that is, aspects of the individual's character that cause them to succeed or otherwise. Since this "innate ability" cannot be observed, it cannot be included as a control variable in the regression and so it represents one of the constituents of the unobserved error term. Since both edi and ui depend on innate ability, it follows that edi is likely correlated with ui.
Example 5.3. Monetary policy reaction function
Clarida, Gali, and Gertler (2000) consider the following model based on the
3The estimation is based on US Census data from 1970 and a separate equation is estimated for each decade of birth cohorts.
Taylor rule to explain how the US Federal Reserve Bank (the “Fed”) sets the short term interest rate
rt = c + βπE[πt+1|Ωt] + βyE[yt+1|Ωt] + β1rt−1 + β2rt−2 + ut, (5.6)
where rt is the Federal Funds rate in period t, πt+1 is inflation in period t+1, Ωt is the Fed's information set at time t, yt+1 is the output gap in period t+1, and ut is an unobserved error. The underlying economic model implies that E[ut | Ωt] = 0. All variables are quarterly so t denotes "year.quarter". The parameters βπ and βy indicate how much weight the Fed puts, respectively, on expected inflation and the expected output gap in setting short term interest rates. The parameters β1 and β2 indicate how much importance the Fed attaches to interest rate smoothing, that is, to avoiding sudden jumps in the interest rate. The Fed's expectations of inflation and the output gap are not observed and so Clarida, Gali, and Gertler (2000) replace these expectations with their actual values and estimate the model,
rt = c + βππt+1 + βyyt+1 + β1rt−1 + β2rt−2 + wt,    (5.7)

where wt is the error. However, this substitution can be viewed as using explanatory variables that are subject to measurement error because
πt+1 = E[πt+1|Ωt] + vπ,t+1, yt+1 = E[yt+1|Ωt] + vy,t+1
where vπ,t+1 represents the difference between the actual value of πt+1 and the Fed's forecast, with an analogous definition for vy,t+1. For (5.6) and (5.7) both to hold, it must follow that
wt = ut − βπvπ,t+1 − βyvy,t+1.
Clearly πt+1 and wt both depend on vπ,t+1 and so πt+1 and wt are correlated, as are yt+1 and wt by a similar logic. Therefore, the measurement error causes πt+1 and yt+1 to be endogenous regressors.
5.2 OLS as a Method of Moments estimator
To motivate the form of the IV estimator, it is useful to begin by reinterpreting OLS as a Method of Moments (MoM) estimator. To this end, we first briefly review the MoM estimation principle.
Karl Pearson (1893, 1894, 1895) introduced the MoM in a series of articles published in the late nineteenth century. These articles demonstrated how to estimate the parameters of a probability distribution using information about the population moments of the distribution in question. The term population moment was originally used in statistics to denote the expectation of the polynomial powers of a random variable. So if V is a discrete random variable with probability mass function P(V = v) defined on a sample space V then its rth population moment is given by
E[V^r] = Σ_{v∈V} v^r P(V = v) = νr,
where the summation is over all values in V and r is a positive integer. If V is a continuous random variable with probability density function p(v) then its rth
moment is given by
E[V^r] = ∫_{−∞}^{∞} v^r p(v) dv = νr.
From these definitions it can be seen that the population mean is just the first population moment and the population variance is ν2 − ν1^2.
The principle behind MoM is best introduced via an example. Suppose we are interested in estimating the population mean, μ0, and population variance, σ0^2, of a normal random variable, V, and have a random sample on V denoted {vi}_{i=1}^{N}. By definition, μ0 and σ0^2 satisfy the following population moment conditions
E[V] − μ0 = 0,    E[V^2] − (σ0^2 + μ0^2) = 0.
MoM estimators of (μ0, σ02) are the values (μˆN , σˆN2 ) which satisfy the analogous sample moment conditions that is,
N−1 Σ_{i=1}^{N} vi − μˆN = 0,

N−1 Σ_{i=1}^{N} vi^2 − (σˆN^2 + μˆN^2) = 0,

and so, with some rearrangement, it follows that

μˆN = N−1 Σ_{i=1}^{N} vi,

σˆN^2 = N−1 Σ_{i=1}^{N} (vi − μˆN)^2.
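As a quick numerical check of these formulae, the MoM estimates can be computed directly in R; the simulated sample below is purely illustrative.

set.seed(2)
v <- rnorm(500, mean = 3, sd = 2)     # artificial sample
mu_hat <- mean(v)                     # solves the first sample moment condition
sigma2_hat <- mean(v^2) - mu_hat^2    # identical to mean((v - mu_hat)^2)
c(mu_hat, sigma2_hat)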
It was subsequently realized that this approach extends to more general forms of population moment condition. Suppose that the random vector V has probability distribution indexed by the p × 1 parameter vector θ0 and the underlying statistical model implies the population moment condition
E[f(V,θ0)] = 0,
where f ( ·, ·) is a p × 1 vector of nonlinear functions. Given this information about θ0, we can apply the MoM principle to estimate θ0: specifically, the MoM estimator θˆN of θ0 is defined as the solution to the analogous sample moment conditions that is,
N−1 Σ_{i=1}^{N} f(vi, θˆN) = 0.
With this background, we revisit OLS estimation. While OLS is defined via a minimization in Section 2.2, βˆN is actually obtained by solving the normal equations in (2.6). Rearranging the normal equations, it can be seen that the OLS estimator satisfies:
X′u(βˆN) = 0,
where u(β) is the N × 1 vector with ith element ui(β) = yi − x′iβ. It can then be recognized that the same estimator would be obtained if we applied the MoM estimation principle based on the population moment condition
E[xiui(β0)] = 0. (5.8)
Recalling that ui(β0) = ui, it can be recognized that the population moment condition in (5.8) is the very condition that turns out to be crucial for establish- ing that βˆN is consistent. The advantage of the MoM interpretation of OLS is that this information is explicitly made the basis of the estimation. As a result, we have the following intuitively appealing connection between the information on which estimation is based and the properties of the resulting estimator: if the estimation is based on valid information - E[xiui(β0)] = 0 - then the estimator is consistent, but if the estimation is based on invalid information - E[xiui(β0)] ̸= 0 - then the estimator is inconsistent.
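This moment interpretation is easy to verify numerically: in any sample, the OLS residuals are exactly orthogonal to the regressors. A small R illustration with simulated data follows.

set.seed(3)
N <- 200
X <- cbind(1, rnorm(N), rnorm(N))           # regressor matrix including an intercept
y <- X %*% c(1, 2, -1) + rnorm(N)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)   # OLS via the normal equations
u_hat <- y - X %*% beta_hat
t(X) %*% u_hat                              # the sample moment X'u(beta_hat): zero up to rounding error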
5.3 Instrumental Variables estimation
Using the framework in the previous section, it can be recognized that if the regressors are endogenous then the population moment condition upon which OLS is based is actually invalid that is, OLS is based on the assumption that E[xiui(β0)] = 0 when in fact this population moment condition does not hold. Instrumental Variables estimation is based on finding an alternative set of pop- ulation moment conditions upon which to base the estimation. To this end, let zi denote a q × 1 vector of observable variables and assume the following population moment condition holds,
E[ziui(β0)] = 0. (5.9)
Instrumental variables estimation is based on the information about β0 in the population moment condition (5.9); zi is known as an instrument vector.
For this approach to estimation to be “successful”, zi must satisfy certain properties: not only must (5.9) hold but it must also represent unique informa- tion about β0 that is,
E[ziui(β)] ̸= 0 for all β ̸= β0. (5.10)
If (5.10) holds then the population moment condition in (5.9) is said to identify β0. While this condition is easily stated, it is not easy to see when it may hold. However, a simple manipulation leads to a more revealing condition. Since
ui(β) = yi − x′iβ = ui(β0) + x′i(β0 − β),
it follows that
E[ziui(β)] = E[ziui(β0)] + E[zix′i](β0 − β). (5.11) Using (5.9) in (5.11), it follows that
E[ziui(β)] = E[zix′i](β0 − β). (5.12)
Therefore β0 is identified if E[zix′i](β0 − β) ̸= 0 for all β ̸= β0. Equation (5.12) can be recognized as a system of linear equations in β0 − β and so this property is guaranteed if the rank of E[zix′i] is k; for example see Orme (Section 5.3).
Collecting the above discussion together, zi must satisfy two properties:

• E[ziui(β0)] = 0, known as the orthogonality condition;
• rank{E[zix′i]} = k, known as the relevance condition.
The orthogonality condition is so named because it implies that zi and ui are uncorrelated or in other words "statistically orthogonal". The relevance condition is so named because it implies that zi is (sufficiently) related to xi. This may not be immediately apparent from the condition in its current form but is more apparent if we consider a simple example. Suppose that k = q = 2 with xi = [1, xi,2] and zi = [1, zi,2]; then the relevance condition is satisfied if and only if xi,2 and zi,2 are correlated4 - or in other words, if we regress xi,2 on zi,2 then the population R2 is non-zero and so zi,2 can be said to be a "relevant" explanatory variable in the regression model for xi,2. (Also see the discussion in Section 5.5.)
To these two conditions, it is useful to add a third that makes explicit some- thing that is implicitly assumed above namely, that each population moment condition contains some unique information. This is the case if no one instru- ment is a linear combination of the others with probability one, and is implied by the following condition.
• E[zizi′] is nonsingular, we refer to this as the uniqueness condition.
If this condition is violated - and so E[zizi′] is singular - then there exists c such that E[(c′zi)2] = 0 which implies c′zi = 0 with probability one. To illustrate the consequences of c′zi = 0, suppose c′ = (c1, c2, . . . cq−1, −1) then E[zi,lui(β0)] = 0 for l = 1, 2, . . ., q − 1 implies automatically that E[zi,qui(β0)] = 0 because
Σ_{l=1}^{q−1} cl E[zi,lui(β0)] = E[zi,qui(β0)].
Therefore any information in E[zi,qui(β0)] = 0 about β0 is already contained in E[zi,lui(β0)] = 0 for l = 1, 2, . . ., q − 1.
We now turn to the calculation of the IV estimator. It is convenient to split our discussion into two parts depending on the relative values of q and
4See Tutorial 8 Question 1.
k. If q = k then (5.9) represents k pieces of information about the k unknown parameters β0 and so β0 is said to be just-identified by (5.9). If q > k then (5.9) represents q > k pieces of information about the k unknown parameters β0 and so β0 is said to be over-identified by (5.9). We examine these cases in turn.
If β0 is just-identified then the sample moment conditions represent k equations in k unknowns and so the IV estimator, βˆIV, solves

N−1 Σ_{i=1}^{N} zi ui(βˆIV) = 0.    (5.13)

To provide an explicit solution for βˆIV, it is more convenient to switch to matrix notation. Define Z to be the N × q matrix with ith row zi′. The IV estimator can be equivalently characterized via

Z′u(βˆIV) = 0,    (5.14)

and so

Z′y − Z′XβˆIV = 0.    (5.15)

Assuming Z′X is nonsingular,5 it follows from (5.15) that

βˆIV = (Z′X)−1Z′y.    (5.16)
If β0 is over-identified then the MoM approach to estimation cannot be applied because the sample moment conditions represent a system of q > k linear equations in k unknowns to which, in general, there is no solution. In this case, βˆIV is defined as the value of β that is closest to solving the sample moment conditions. To make this approach to estimation operational, it is necessary to define a measure of how far the sample moment function is from zero. Here, this measure is defined as
QIV(β) = u(β)′Z(Z′Z)−1Z′u(β),    (5.17)

where we have assumed Z′Z is nonsingular.6 Notice that by construction (Z′Z)−1 is positive definite and so in turn QIV(β) satisfies:

• QIV(β) ≥ 0 for all β;
• QIV(β) = 0 if and only if Z′u(β) = 0.

Therefore, the IV estimator is defined as

βˆIV = argmin_{β∈B} QIV(β).    (5.18)

In Tutorial 8 Question 2, you show that

βˆIV = {X′Z(Z′Z)−1Z′X}−1 X′Z(Z′Z)−1Z′y.    (5.19)
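A minimal R sketch of the two formulae, assuming the user has already constructed the instrument matrix Z, the regressor matrix X and the outcome vector y as conformable numeric matrices/vectors:

# just-identified case (q = k): beta_IV = (Z'X)^{-1} Z'y, as in (5.16)
iv_just <- function(y, X, Z) solve(t(Z) %*% X, t(Z) %*% y)

# over-identified case (q > k): beta_IV from (5.19)
iv_over <- function(y, X, Z) {
  PZ <- Z %*% solve(t(Z) %*% Z) %*% t(Z)     # projection onto the column space of Z
  solve(t(X) %*% PZ %*% X, t(X) %*% PZ %*% y)
}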
We now consider the large sample properties of the IV estimator and various associated inference techniques. Our analysis is split into two parts. Section 5.4.1 covers the case of cross-section data and Section 5.4.2 covers the case of time series data.
5A reasonable assumption if the relevance condition holds.
6A reasonable assumption if the uniqueness condition holds.
5.4 Large sample properties and inference

5.4.1 Cross-sectional data
To perform our large sample analysis of the IV estimator in cross-section data, we impose the following conditions.
Assumption CS1-IV. yi = x′iβ0 + ui.
Assumption CS2-IV. {(ui, x′i, zi′), i = 1, 2, . . . , N} forms an independent and identically distributed sequence.7
Assumption CS3-IV. (i) E[zizi′] = Qzz, finite, p.d.; (ii) E[zix′i] = Qzx,
rank{Qzx} = k.
Assumption CS4-IV. E[ui|zi] = 0.

Assumption CS5-IV. Var[ui|zi] = h(zi) > 0.
Notice that these conditions imply the orthogonality, relevance and uniqueness conditions: Assumption CS4-IV implies E[ziui] = 0 (via the LIE), and Assumption CS3-IV(i)-(ii) impose the uniqueness and relevance conditions respectively. Also note that Assumption CS5-IV allows for conditional heteroscedasticity but subsumes the case of conditional homoscedasticity as the special case in which h(zi) = σ0^2 for all zi.
To derive the large sample properties of the IV estimator, we substitute for y in (5.19) to obtain

βˆIV = β0 + {X′Z(Z′Z)−1Z′X}−1X′Z(Z′Z)−1Z′u.    (5.20)

To evaluate the probability limit of βˆIV, we standardize the sums in (5.20) as follows:

βˆIV = β0 + {(N−1X′Z)(N−1Z′Z)−1(N−1Z′X)}−1 (N−1X′Z)(N−1Z′Z)−1(N−1Z′u).    (5.21)

Using the WLLN, we have

N−1Z′X = N−1 Σ_{i=1}^{N} zix′i →p Qzx,    (5.22)

N−1Z′Z = N−1 Σ_{i=1}^{N} zizi′ →p Qzz,    (5.23)

N−1Z′u = N−1 Σ_{i=1}^{N} ziui →p E[ziui] = 0.    (5.24)

Furthermore, Assumption CS3-IV(i) implies Qzz is nonsingular and Assumption CS3-IV(ii) states that Qzx and Qxz = Qzx′ are rank k so that QxzQzz−1Qzx
7There may be some redundancy in the definition of the data vector as zi and xi may have variables in common. However, we overlook this here to avoid overburdensome notation.
is nonsingular and its inverse exists. From these results and using Slutsky’s Theorem, we have
plim βˆIV = β0 + {plim(N−1X′Z) plim(N−1Z′Z)−1 plim(N−1Z′X)}−1 × plim(N−1X′Z) plim(N−1Z′Z)−1 plim(N−1Z′u)    (5.25)

          = β0 + M × 0 = β0,    (5.26)

where M = (QxzQzz−1Qzx)−1QxzQzz−1. Therefore, βˆIV is consistent for β0.

To develop the large sample distribution, we work with

N1/2(βˆIV − β0) = {(N−1X′Z)(N−1Z′Z)−1(N−1Z′X)}−1(N−1X′Z)(N−1Z′Z)−1(N−1/2Z′u) = MN N−1/2Z′u,    (5.27)

where

MN = {(N−1X′Z)(N−1Z′Z)−1(N−1Z′X)}−1(N−1X′Z)(N−1Z′Z)−1.

From (5.22)-(5.23), it follows that MN →p M. Using the CLT, we can deduce

N−1/2Z′u = N−1/2 Σ_{i=1}^{N} ziui →d N(0, Ω),    (5.28)

where Ω = lim_{N→∞} ΩN and ΩN = Var[ N−1/2 Σ_{i=1}^{N} ziui ]. To deduce Ω,
we apply similar arguments to those employed in our analysis of OLS in Sec- tion 4.3.1. Assumption CS2-IV implies {zi ui ; i = 1, 2, . . . N } are i.i.d. and so Cov[ziui,zjuj] = 0 (i ̸= j). Therefore, we obtain ΩN = Var[ziui]. Since E[ziui] = 0, it follows that
Var[ziui] = E[ui^2 zizi′] = E{ E[ui^2 | zi] zizi′ } = E[h(zi)zizi′] = Ωh, say.
From the above arguments, it can be recognized that N1/2(βˆIV −β0) = MN mN where MN converges in probability to the matrix of constants M and mN con- verges in distribution to a random vector with distribution N(0, Ωh). Using Lemma 3.5, it follows that
N1/2(βˆIV −β0) →d N(0,VIV), (5.29)
where VIV = MΩhM′.
To use this result as a basis for inference, we need a consistent estimator of
VIV. Such an estimator can be constructed as follows. Define VˆIV = Mˆ Ωˆh Mˆ′, where Mˆ = (QˆxzQˆzz−1Qˆzx)−1QˆxzQˆzz−1 for Qˆzz = N−1Z′Z, Qˆxz = N−1X′Z, and Ωˆh = N−1 Σ_{i=1}^{N} ei^2 zizi′, where ei denotes the ith IV residual. From (5.22)-(5.23), it follows that Qˆzz = N−1Z′Z →p Qzz and Qˆxz = N−1X′Z →p Qxz, and using similar arguments to Section 4.3.1, Ωˆh →p Ωh. Therefore, using Slutsky's Theorem, we have VˆIV →p VIV.
We can then perform all the types of inference discussed in Section 3.2 based on the IV estimator. For example, an approximate 100(1 − α)% confidence interval for β0,l is given by,
βˆIV,l ± z1−α/2 √(VˆIV,l,l / N).
In Tutorial 8 Question 3, you consider a suitable test for H0 : Rβ0 = r. Since VˆIV is consistent for VIV whether ui is conditionally homoscedastic or heteroscedastic, the inference procedures described above are "heteroscedasticity robust".
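Given data, the robust variance estimator VˆIV and the associated confidence intervals can be assembled directly from the formulae above. The R sketch below assumes y, X and Z are numeric matrices/vectors supplied by the user and that beta_iv has already been computed, for example with the iv_over() function sketched earlier.

iv_robust_vcov <- function(y, X, Z, beta_iv) {
  N <- nrow(X)
  e <- as.vector(y - X %*% beta_iv)        # IV residuals
  Qzz <- crossprod(Z) / N                  # N^{-1} Z'Z
  Qxz <- crossprod(X, Z) / N               # N^{-1} X'Z
  Omega_h <- crossprod(Z * e) / N          # N^{-1} sum of e_i^2 z_i z_i'
  M <- solve(Qxz %*% solve(Qzz) %*% t(Qxz)) %*% Qxz %*% solve(Qzz)
  (M %*% Omega_h %*% t(M)) / N             # estimated variance matrix of beta_iv
}

# e.g. an approximate 95% confidence interval for the l-th coefficient:
# beta_iv[l] + c(-1, 1) * 1.96 * sqrt(diag(iv_robust_vcov(y, X, Z, beta_iv))[l])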
5.4.2 Time series data
We now consider the large sample properties of the IV estimator and associated inference procedures in time series data. We consider the situation in which the regression model
yt = x′tβ0 + ut,    (5.30)

is estimated based on the information in the population moment condition,

E[ztut(β0)] = 0,    (5.31)

where, as before, ut(β) = yt − x′tβ. Our assumptions are as follows.
Assumption TS1-IV. yt is generated via the general dynamic linear regression model in (5.30).
This is the familiar assumption that we estimate a correctly specified model.

Assumption TS2-IV. (yt, x′t, zt′) is a strongly stationary, weakly dependent time series.8
This assumption is imposed to facilitate the application of the WLLN and CLT,
as discussed in Section 3.3.1.
Assumption TS3-IV. (i) E[ztzt′] = Qzz, a finite, positive definite matrix; (ii)
E[ztx′t] = Qzx, rank{Qzx} = k.
Assumption TS4-IV. E[ut|zt] = 0 for all t = 1, 2, . . . , T.

Assumption TS5-IV. Var[ut|zt] = σt^2 where σt^2 > 0 for all t.
Notice that these conditions imply the orthogonality, relevance and uniqueness conditions: Assumption TS4-IV implies E[ztut] = 0 (via the LIE), and Assumption TS3-IV(i)-(ii) impose the uniqueness and relevance conditions respectively. Assumption TS5-IV allows for the errors to exhibit conditional
8There may be some redundancy in the definition of the data vector as xt may contain lagged values of yt and zt may involve current or lagged values of xt or lagged values of yt. However, we overlook this here to avoid over burdensome notation.
heteroscedasticity. As with Assumption TS2 in our earlier analysis, Assumption TS2-IV implicitly contains the additional regularity conditions under which we can apply the WLLN and CLT to the appropriate sums in our formula for the IV estimator. As we have placed no restrictions on the autocovariance structure of ztut, the CLT implies that

T−1/2Z′u = T−1/2 Σ_{t=1}^{T} ztut →d N(0, Ωsc),    (5.32)

where

Ωsc = Γ0 + Σ_{i=1}^{∞} (Γi + Γi′),

and Γi = Cov[ztut, zt−iut−i]. Therefore, using analogous arguments to the cross-section case (only this time appealing to the WLLN and CLT for strongly stationary, weakly dependent time series), it can be shown that

T1/2(βˆIV − β0) →d N(0, VIV),

where VIV = M Ωsc M′ and M = (QxzQzz−1Qzx)−1QxzQzz−1, with Ωsc and Γi as defined above.
To perform inference based on this result, we need a consistent estimator of
VIV. As with our OLS-based inferences before, we can use the HAC estimator, ΩˆHAC, defined in (4.42), only this time with

Γˆj = T−1 Σ_{t=j+1}^{T} zt zt−j′ uˆt uˆt−j,

where uˆt = ut(βˆIV). Setting VˆIV = Mˆ ΩˆHAC Mˆ′, it follows via Slutsky's Theorem that VˆIV →p VIV. So, for instance, an approximate 100(1 − α)% confidence interval for β0,l is given by,9

βˆIV,l ± z1−α/2 √(VˆIV,l,l / T),
and an approximate 100α% significance level test of H0 : Rβ0 = r versus HA : Rβ0 ̸= r is to reject H0 if WT(IV) > cnr(1 − α), where

WT(IV) = T(RβˆIV − r)′(RVˆIV R′)−1(RβˆIV − r),

cnr(1 − α) is the 100(1 − α)th percentile of the χ2nr distribution, and nr is the number of restrictions.10 The use of the HAC estimator ensures that these inference procedures are robust to heteroscedasticity and serial correlation (or autocorrelation).
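For time series applications, Ωsc can be estimated by applying a HAC estimator to the series ztuˆt. The sketch below uses Bartlett (Newey-West style) weights with a user-chosen truncation lag L; this is one common choice and is intended only as an illustration of the construction, with Z and u_hat assumed to be a T × q matrix and a length-T residual vector respectively.

iv_hac_omega <- function(Z, u_hat, L) {
  Tn <- nrow(Z)
  s <- Z * u_hat                    # row t is z_t * u_hat_t
  Omega <- crossprod(s) / Tn        # Gamma_hat_0
  for (j in 1:L) {
    Gj <- t(s[(j + 1):Tn, , drop = FALSE]) %*% s[1:(Tn - j), , drop = FALSE] / Tn  # Gamma_hat_j
    w <- 1 - j / (L + 1)            # Bartlett kernel weight
    Omega <- Omega + w * (Gj + t(Gj))
  }
  Omega
}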
9Here, as above, z1−α/2 is the 100(1 − α/2)th percentile of the standard normal distribution.
10As in all such tests, it is assumed that rank(R) = nr.
5.5 IV and 2SLS
Originally, IV was introduced as a method for estimation of the parameters of an equation that is part of a system of linear simultaneous equations. In practice, IV estimation is often viewed in this way and in this section we explore this perspective on the estimator. This leads us to an alternative method for calculation of IV estimators known as Two Stage Least Squares (2SLS). The arguments in this section are generic, applying equally to both cross-section and time series data. However, for the sake of exposition, we assume cross-section data below.
Consider the following simple linear simultaneous equations model (LSEM),

y1,i = z′1,iγ0 + α0 y2,i + u1,i,    (5.33)
y2,i = z′1,iδ1,0 + z′2,iδ2,0 + u2,i,    (5.34)
in which y1,i and y2,i are scalar random variables, z1,i and z2,i are respectively q1 ×1 and q2 ×1 vectors of explanatory variables, and u1,i and u2,i represent the unobserved error terms. Let zi = (z1′ ,i, z2′ ,i)′ and set q = q1 + q2. It is assumed that {(zi′,u1,i,u2,i)}Ni=1 is an i.i.d. random sequence with E[uj,i|zi] = 0 for j = 1, 2 and
Var[ (u1,i, u2,i)′ | zi ] =
    [ σ1^2   σ1,2 ]
    [ σ1,2   σ2^2 ],    with σ1,2 ̸= 0.
Under these conditions, it follows from the LIE that
E[zj,iul,i] = 0 for j, l = 1, 2. (5.35)
Here our focus is on estimation of (5.33) which we therefore refer to as our “equation of interest”. However, before discussing the estimation of equation of interest, it is useful to highlight briefly certain features of the LSEM. From (5.35), it follows that zi is uncorrelated with both u1,i and u2,i. For this reason, zi is referred to as being exogenous in the terminology of LSEM’s. The values of y1,i and y2,i are simultaneously determined as the solutions to (5.33)-(5.34) given values for zi and (u1,i,u2,i) and so are referred to as being endogenous. Notice that σ1,2 ̸= 0 implies that the error terms are correlated with each other, and in turn this creates a correlation between y1,i and y2,i conditional on zi. Thus our equation of interest is a linear regression model in which some regressors, z1,i, are uncorrelated with the error, u1,i, but one right-hand side variable, y2,i is correlated with the error.11 As discussed above, this means that OLS estimation of (5.33) yields an inconsistent estimator of β0 = (γ0′ , α0)′ .
Using the MoM perspective on estimation, we need at least k = (q1 + 1) population moment conditions to estimate the k × 1 unknown parameters in β0 .
11 Within the LSEM framework, equations that have endogenous variables on the right-hand side are referred to as structural equation and equations that have only exogenous variables on the right-hand side are referred to as reduced form equations.
Setting xi = (z′1,i, y2,i)′ and u1,i(β) = y1,i − x′iβ, one natural set of population moment conditions is E[z1,iu1,i(β0)] = 0 because z1,i are (exogenous) explanatory variables in our equation of interest. However, these population moment conditions only provide q1 pieces of information and so are insufficient by themselves as k > q1. Additional instruments can be found by using the equation for y2,i. From (5.34), it can be seen that y2,i depends on both z1,i and z2,i. Therefore, we can use E[z2,iu1,i(β0)] = 0 as additional moment conditions giving a q × 1 vector of moment conditions
E[ziu1,i(β0)] = 0, (5.36)
where q = q1 + q2. Notice that for this approach to instrument selection to work it is critical that δ2,0 ̸= 0 – in other words that there are exogenous regressors in the equation for y2,i that do not appear as explanatory variables in the equation of interest. In LSEM’s, this condition is often referred to as an exclusion restriction – because z2,i is excluded from the right-hand side of the equation of interest. If this condition fails – that is, δ2,0 = 0 – then the relevance condition must fail.
The LSEM structure can be exploited to develop an alternative method of calculating the IV estimator. To explain this method, it is useful to define μy2|z = E[y2,i | zi]. Notice that from (5.34) it follows that
μy2|z = z′1,iδ1,0 + z′2,iδ2,0,    (5.37)

and

y2,i = μy2|z + u2,i.    (5.38)
Suppose that μy2|z is known: in this case we could substitute for y2,i from (5.38) in (5.33) to obtain
y1,i = z1′ ,iγ0 + α0μy2|z + wi, (5.39)
where wi = u1,i + α0u2,i. Recalling from (5.37) that μy2|z is a linear function of zi, it follows from (5.35) that both z1,i and μy2|z are uncorrelated with wi and so OLS estimation of (5.39) would yield a consistent estimator of β0. Of course, this approach is infeasible because μy2|z is unobserved. 2SLS estimation of β0 involves replacing μy2|z by its predicted value based on an estimated version of (5.34). This leads to the following two stage estimation procedure.
• Stage 1: Estimate (5.34) via OLS to obtain the predicted value of y2,i, denoted yˆ2,i.

• Stage 2: Estimate the model

y1,i = z′1,iγ0 + α0 yˆ2,i + "error",
via OLS to obtain βˆ2SLS which is known as the 2SLS estimator of β0.
It can be shown that βˆ2SLS equals the IV estimator based on the population moment condition in (5.36).
This LSEM framework can also be used to revisit the condition for instrument relevance. Since z1,i is included as an explanatory variable in the equation of interest and also in the instrument vector, the key to the relevance condition holding is whether or not y2,i is sufficiently related to z2,i controlling for z1,i. This can be assessed from (5.34) which in this context is known as the first stage regression, reflecting its role in 2SLS estimation. It can be shown that the instrument relevance condition is satisfied if δ2,0 ̸= 0. We can therefore test instrument relevance by estimating the first stage regression via OLS and using the WN statistic in (3.14) or the F-statistic in (2.44) to test H0 : δ2,0 = 0 vs HA : δ2,0 ̸= 0. Notice that under the null hypothesis the instrument relevance condition is not satisfied, and the relevance condition is satisfied if at least one element of δ2,0 is non-zero.
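The two stages and the first-stage relevance test can be sketched in R as follows; the data frame dat and the variable names y1, y2, z1 and z2 are placeholders for whatever data set is to hand.

# Stage 1: regress the endogenous right-hand side variable on all exogenous variables
stage1 <- lm(y2 ~ z1 + z2, data = dat)
dat$y2_hat <- fitted(stage1)

# instrument relevance: F-test of H0: the coefficient on z2 is zero in the first stage
stage1_restricted <- lm(y2 ~ z1, data = dat)
anova(stage1_restricted, stage1)

# Stage 2: replace y2 by its first-stage prediction
stage2 <- lm(y1 ~ z1 + y2_hat, data = dat)
coef(stage2)    # 2SLS point estimates; the standard errors from this fit are not valid

In practice one would use a routine that computes the correct 2SLS standard errors directly (see the empirical example in Section 5.8), rather than reading them off the second-stage OLS output.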
5.6 Examples of instruments
So far we have assumed the existence of suitable instrument vectors but in prac- tice an appropriate choice of z has to be found. To illustrate this choice, we return to the three endogenous regressor models in Examples 5.1-5.3.
Example 5.1 continued
AJR analyze whether the differences between income per capita across countries can be explained by differences in institutions and property rights. Their analysis is based on the following equation:
yi =μ+αri+wi′γ+ui,
where yi is log GDP per capita in country i, ri is the average protection against expropriation, and wi contains certain control variables that can be assumed uncorrelated with the error ui. In this model, ri captures the differences in in- stitutions and property rights. As discussed earlier, the concern is that ri is an endogenous regressor due to reverse causality. To implement IV, we need instruments that are both correlated with ri but arguably uncorrelated with ui. A key innovation of AJR’s study is the use of historical data on colonial settler mortality rates as an instrument. Their argument for this choice is as follows. When Europeans colonized countries, they tended to pursue one of two types of regime: “extractive” regimes, in which the purpose was purely to extract wealth from the country being colonized; and “neo-Europe” regimes, in which coloniz- ers looked to settle in the country as well. In “neo-Europe” regimes, the settlers introduced institutions and property rights similar to those from where they had come, but they tended not to do so in extractive regimes. One determinant of the colonizing regime pursued was the feasibility of European settlement, espe- cially the prevalence of disease. AJR document that settler mortality rates were well known back in Europe and cite evidence that they affected decisions of Eu- ropeans about whether or not to emigrate to particular countries. So settler mortality rates can be argued to have affected the colonizing regime and so, in
turn, to have affected whether or not European type institutions were introduced historically. AJR provide evidence that the quality of these "early" institutions persists to the current time, and thereby connect historical settler mortality rates to the quality of current institutions in countries colonized by Europeans. At the same time, AJR argue that there is no evidence of any direct relationship between historical settler mortality rates and income (so this instrument satisfies the exclusion restriction). Furthermore, AJR argue it does not seem plausible that settler mortality rates are correlated with ui due to the considerable temporal difference between the two. For completeness, we note that AJR's approach has received some criticism: see Albouy (2012) and the response in Acemoglu, Johnson, and Robinson (2012).
Example 5.2 continued
Angrist and Krueger (1991) investigate the returns to education using a model
ln[wi] = θ1 +θeedi +h′iγ+ui
where wi equals the weekly wage of individual i, edi is the number of years of education of individual i, and hi is a vector containing certain control variables that can be assumed uncorrelated with the error ui. Here the concern is that edi is correlated with ui due to the omitted (unobservable) variable "innate ability". Angrist and Krueger (1991) argue that quarter of birth provides a suitable instrument. Their reasoning exploits the historical school attendance laws in the US: students must attend school once they are seven years old but can actually only enter school at certain dates on the calendar (e.g. September or January); however they are free to leave as soon as they turn sixteen. This means that pupils born earlier in the year are older when they start school and so have the opportunity to leave with less formal schooling than students born later in the year. At the same time, there is no reason to suppose quarter of birth is correlated with ui and Angrist and Krueger (1991) document evidence that quarter of birth does not have a direct effect on wages and so does not need to be included in the controls.
Example 5.3 continued
Clarida, Gali, and Gertler (2000) use the following model to estimate the pa- rameters of the Fed’s monetary policy reaction function,
rt = c + βππt+1 + βyyt+1 + β1rt−1 + β2rt−2 + wt.
Within this model both πt+1 and yt+1 are endogenous regressors due to mea- surement error. Within the framework of this model, it can be shown that any variables in the information set Ωt are valid instruments because they are or- thogonal to both ut, the error in the Taylor rule, (by assumption), and to the measurement errors (vπ,t+1 and vy,t+1) via a rational expectations argument.12
12For example, using rational expectations, E[vπ,t+1|Ωt] = 0 and so using the LIE we have E[ztvπ,t+1] = 0 for any zt ∈ Ωt.
5.7 Weak instruments
Recall that large sample theory is used as an approximation to the finite sampling distribution of the test statistics. As remarked previously, it is reasonable to be concerned about how good this approximation actually is in any situation. In the context of IV estimation, it has been realized that an important determinant of the quality of this approximation is the strength of the relationship between xi and zi. A scenario of particular concern is cases where the relevance condition is technically satisfied but close to failure: for example, using the LSEM example in the previous section, scenarios where the relevance condition is technically satisfied because δ2,0 ̸= 0 but is close to failure because δ2,0 ≈ 0. In such cases, our asymptotic distribution theory can be a poor approximation to finite sample behavior in what are conventionally viewed as large samples in practice.13 If this is the case then β0 is said to be weakly identified by the population moment condition E[ziui(β0)] = 0 - or equivalently that the instruments are weak.
This issue turns out to be not just an academic consideration. For example, Angrist and Krueger's (1991) study in Example 5.2 has been criticized because the instruments are weak. In their setting, the instruments consist of the controls and dummy variables indicating quarter of birth, both individually and interacted with year of birth (one of the controls). In this case, the condition for identification can be expressed equivalently as D = R2ed,z − R2ed,controls > 0, where R2a,b is the multiple correlation coefficient from the regression of a on b. While US schooling attendance laws offer a reason why quarter of birth should be related to years of education, Bound, Jaeger, and Baker (1995) observe that D is only 0.0001 or 0.0002 in Angrist and Krueger's (1991) data. They further provide evidence that with values of D in this range, the finite sample behaviour of the IV estimator is not well approximated by the first order asymptotic distribution even in the samples of 250,000+ observations used by Angrist and Krueger (1991).
Such concerns have spawned a considerable literature on how to perform inference in the presence of weak identification.14 One important point to emerge from this literature is that there is a difference between rejecting the null hypothesis that the instrument relevance condition is not satisfied and the statement that the instruments are not weak. To illustrate this difference, consider the LSEM in Section 5.5 and assume q2 = 5. In this case, the critical value for a 5% significance level test of instrument relevance (H0 : δ2,0 = 0) based on the F-statistic is in the range 2 - 3.5 depending on the second degrees of freedom (N − q). In contrast, Stock, Wright, and Yogo (2002) suggest that the F-statistic has to be greater than 10 to avoid the problems associated with weak instruments.
Finally, it should be noted that if the relevance condition fails (δ2,0 = 0 in our LSEM example above) then the inference procedures described are not valid
13However, note that if δ2,0 ̸= 0 then there is a sample size at which asymptotic distribution theory is a good approximation - the issue is that it may be so large that the theory provides no guide in the sample sizes encountered in practice.
14For a review and references see Hall (2015).
even in the limit.15
5.8 Empirical example
In this section, we illustrate some of the issues discussed in this chapter using a version of the model estimated in the AJR study in Example 5.1 above. Recall that their model takes the form
yi = μ0 +α0ri +wi′γ0 +ui
where yi is log GDP per capita in country i, ri is the average protection against expropriation, and wi contains certain control variables. The sample consists of data on 59 countries for the year 1995, with ri being calculated using data from 1985-1995. Note that, given the way r is calculated, α0 > 0 is consistent with the hypothesis that better institutions and more secure property rights lead to increased income. In our estimation, wi consists of a single variable, life expectancy at birth in 1995. In Table 5.1, we report results from estimation of the model using both OLS and IV. For the IV estimation, the instrument vector zi is:
zi =[1,wi,z1,i,z2,i,z3,i,z4,i]′,
where z1,i is log settler mortality of country i, z2,i is the absolute latitude of country i, z3,i is the mean temperature of country i, and z4,i is the proportion of land area within 100km of the seacoast. Comparing the results, two features stand out. First, while both estimators of α0 have the anticipated sign, the IV estimate is larger. Second, the standard errors for IV are much larger. This can be explained using the 2SLS formulation of IV: on the second stage, we replace ri by its prediction from the first stage regression, rˆi, and this prediction only exhibits part of the variation of ri. As such, rˆi provides a weaker signal about the role of ri in explaining yi than ri does itself, and this translates into 2SLS having higher standard errors than OLS. However, if we believe ri is an endogenous regressor then we would still prefer to base inference on the IV estimator as it is consistent whereas the OLS estimator is not.
To investigate whether the instruments satisfy the relevance condition, we estimate the first-stage regression model:
ri = δ0 + δ1∗wi + δ2∗z1,i + δ3∗z2,i + δ4∗z3,i + δ5∗z4,i + "error".
Recall that the null hypothesis is that the instruments are not relevant, which in this case equates to the parameter restrictions: δ2 = 0, δ3 = 0, δ4 = 0, δ5 = 0. The alternative hypothesis is that the instruments are relevant and so δj ̸= 0 for at least one of j = 2, 3, 4, 5. The F-statistic for this test is F = 2.27 with a p-value of 0.0740. Thus, there is only marginal evidence in support of instrument relevance. Furthermore, the F-statistic is sufficiently low that the instruments may be weak, calling into question the validity of our inferences above.
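In practice, estimates and diagnostics of this kind are typically obtained with an IV routine such as ivreg() in the AER package. The sketch below is illustrative only: the data frame ajr and the variable names (loggdp, risk, lifeexp, logmort, latitude, temp, coast) are placeholders rather than the actual study files.

library(AER)   # provides ivreg()

iv_fit <- ivreg(loggdp ~ risk + lifeexp | lifeexp + logmort + latitude + temp + coast, data = ajr)
summary(iv_fit, diagnostics = TRUE)   # includes a weak-instruments (first-stage F) diagnostic

ols_fit <- lm(loggdp ~ risk + lifeexp, data = ajr)
cbind(OLS = coef(ols_fit), IV = coef(iv_fit))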
15See Tutorial 8, Question 5.
Table 5.1: Estimation results for AJR model
         αˆ       95% CI for α0      γˆ       95% CI for γ0
OLS    0.287     (0.186, 0.387)    0.0496    (0.036, 0.063)
2SLS   0.744     (0.335, 1.153)    0.016     (-0.018, 0.051)
Chapter 6
Maximum Likelihood
In this chapter, we introduce a different estimation principle known as Maximum Likelihood (ML). This approach can be applied to estimate the parameters of any statistical model whose specification involves the probability distribution of the data. In this course, we apply ML estimation to binary response models and regression models.
Binary response models are used to model the probability of an event occur- ring as a function of a vector of explanatory variables. While these models can be estimated by regression techniques, this approach is unsatisfactory because it does not fully take account of the structure of the phenomenon being modeled. A more coherent estimation strategy requires the use of ML estimation. It is this application of ML that is the primary focus in this course. However, ML can also be used to estimate linear regression models. As emerges, OLS can also be interpreted as a ML estimator and this provides us with further insights into the properties of OLS.
6.1 Intuition and definition
For the purposes of introducing the ML estimator, it is useful to emphasize the difference between random variables and their sample outcomes. For this section, we adopt the following notation: we use capital letters (such as V) to denote random variables or vectors and small letters (such as v) to denote their sample outcomes. In the subsequent sections, we revert to the more familiar notation in econometrics - that we have used in the course to date - in which letters may denote random variables or their sample outcome depending on the context.
It is easier to motivate the ML estimation principle for the case of discrete random variables and so we begin with this case and then discuss the extension to continuous random variables.
Accordingly, suppose that {Vi; i = 1, 2, . . . , N} is a sequence of discrete random variables with some joint probability distribution function,

P(V1 = v1, V2 = v2, . . . , VN = vN; θ0),
where θ0 is a p × 1 vector of parameters that index the probability distri- bution function. We assume throughout that the functional form of P ( · ) is known. Given θ0, a researcher can use the joint probability function to deter- mine the probability of observing any given sample outcome (v1,v2,…,vN) for (V1 , V2 , . . . , VN ) before the statistical experiment is performed. Let this proba- bility be denoted more compactly as p(v1,v2,…,vN; θ0).1
In estimation, the situation is different. We know the functional form of p(·) and we have observed a sample but do not know the true value of the parameter vector. The basic idea of ML is to estimate θ0 by the parameter value that maximizes the probability of observing the particular sample we actually obtained. To express this principle mathematically, we need to introduce the likelihood function associated with the sample which is denoted as LFN (θ) and defined as:
LFN(θ) = LF(θ;v1,v2,…vN) =d p(v1,v2,…,vN; θ).
Notice that the only difference between the likelihood function and the joint probability distribution function is in the arguments: the joint probability func- tion has argument (v1,v2,…,vN) and conditions on θ0; the likelihood function has argument θ and conditions on the sample outcome. As the name suggests, the ML estimator (MLE) of θ0 is the value of θ that maximizes the likelihood function that is,
θˆML = argmax_{θ∈Θ} LFN(θ),    (6.1)
where Θ denotes the parameter space.
Clearly to implement ML, we require knowledge of the joint probability
distribution function. Sometimes our model implies an explicit form for the joint distribution function and so p(v1, v2, . . . , vN; θ) is readily available. In other cases, our model implies the distribution for Vi and a sampling scheme from which we can deduce the joint p(v1, v2, . . . , vN; θ). For example, suppose that the statistical model implies that {Vi} is independently and identically distributed, with P(Vi = vi) = p(vi; θ0). In this case, we have that2

LFN(θ) = Π_{i=1}^{N} p(vi; θ).    (6.2)
However, this representation is not particularly convenient for the optimiza- tion in (6.1). Under fairly mild conditions, the MLE is the solution to the first order conditions and the calculation of these conditions requires differentiation of a product of functions of θ which is analytically inconvenient. Recalling that the natural logarithm of z is a monotonic transformation of z, a more convenient characterization of the MLE is as follows:
θˆN = argmax_{θ∈Θ} LLFN(θ),    (6.3)
1That is, we set P(V1 = v1,V2 = v2,…,VN = vN; θ0) = p(v1,v2,…,vN; θ0).
2Recall the if two random variables, A and B say, are independent then P(A = a,B = b) = P(A = a)P(B = b).
where
LLFN(θ)=d ln[LFN(θ)]. (6.4)
Unsurprisingly, LLFN (θ) is known as the log likelihood function (LLF). The advantage of this alternative characterization is that if the LF takes the form in (6.2) then the LLF is:
LLFN(θ) = Σ_{i=1}^{N} ln[p(vi; θ)],    (6.5)
and so the first order conditions involve the derivative of a sum which is more analytically convenient. From here on, we work exclusively with the definition of the ML in (6.3).
Under fairly mild conditions, the MLE can be characterized as the solution to the first order conditions associated with the maximization in (6.3) that is,
∂LLFN(θ)/∂θ |_{θ=θˆN} = 0.    (6.6)
This set of equations is known as the score equations; and ∂LLFN(θ)/∂θ is known as the score function. Sometimes, it is possible to solve (6.6) to obtain a closed form solution for θˆN (in terms of data alone) but other times the score equations only implicitly characterize θˆN and the solution is found by using computer-based numerical optimization routines.
To illustrate ML estimation of the parameters of discrete distributions, we consider the Bernoulli distribution. Recall that a random variable with a Bernoulli distribution has only two possible outcomes denoted here by {0, 1}, and its probability distribution is indexed by a single parameter, denoted here by θ0, which gives the probability that the outcome is one. A random variable with a Bernoulli distribution can be used to model the simplest of statistical experiments that is, tossing a coin.
Example 6.1. Let {Vi}Ni=1 be a sequence of i.i.d. Bernoulli random variables with P(Vi = 1) = θ0. We assume here that θ0 ∈ (0,1) and that our sample size is large enough for both outcomes to occur. Since the outcomes satisfy vi ∈ {0, 1}, the probability distribution function of Vi can be written very compactly as:
P(Vi = vi) = θ0^{vi} (1 − θ0)^{1−vi},
where we use P(Vi = 0) = 1 − θ0 which follows from the form of the sample space for Vi. Therefore, the likelihood function is
LFN(θ) = Π_{i=1}^{N} θ^{vi} (1 − θ)^{1−vi},
and the log likelihood function is
LLFN(θ) = Σ_{i=1}^{N} { vi ln[θ] + (1 − vi) ln[1 − θ] }.    (6.7)
From (6.7) it follows that
∂LLFN(θ)/∂θ = Σ_{i=1}^{N} { vi/θ − (1 − vi)/(1 − θ) } = ( Σ_{i=1}^{N} vi )/θ − ( Σ_{i=1}^{N} (1 − vi) )/(1 − θ).

Defining Σ_{i=1}^{N} vi = N1, it follows that the score equation is:

N1/θˆN − (N − N1)/(1 − θˆN) = 0.    (6.8)
Assuming that 0 < θˆN < 1, the score equation implies that θˆN satisfies:

(1 − θˆN)N1 − (N − N1)θˆN = 0,
which can be solved to give θˆN = N1/N. Thus the ML estimator of the probability that the outcome is one is just the relative frequency of this outcome in the sample.
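This closed-form result is easy to confirm numerically. The R sketch below simulates an illustrative Bernoulli sample and compares the relative frequency with a direct numerical maximization of the log likelihood in (6.7).

set.seed(4)
v <- rbinom(200, size = 1, prob = 0.3)    # artificial Bernoulli sample
llf <- function(theta) sum(v * log(theta) + (1 - v) * log(1 - theta))
theta_closed <- mean(v)                   # N1 / N
theta_numeric <- optimize(llf, interval = c(0.001, 0.999), maximum = TRUE)$maximum
c(theta_closed, theta_numeric)            # agree up to numerical tolerance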
The above motivation works for discrete random variables but does not apply for continuous random variables.3 To extend this approach to continuous random variables, the joint probability distribution function is replaced by the joint probability density function. To elaborate, suppose now that {Vi; i = 1, 2, . . . , N} is a sequence of continuous random variables with joint probability density function (pdf) f(w1, w2, . . . , wN; θ).4 In this case the likelihood function is defined as:
LFN(θ) =d f(v1,v2,...,vN; θ),
where the sample outcomes are (as before) V1 = v1,V2 = v2,...,VN = vN.
Example 6.2. Suppose {Vi}Ni=1 is a sequence of i.i.d. normal random variables with mean μ0 and variance σ02. In this case the pdf of Vi is
f(w; θ0) = (2πσ0^2)^{−1/2} exp{ −(w − μ0)^2 / (2σ0^2) },

where θ0 = (μ0, σ0^2)′. Given the independence part of the i.i.d. assumption, the joint pdf of {Vi}_{i=1}^{N} is the product of the marginal pdfs and so the likelihood function is

LFN(θ) = Π_{i=1}^{N} (2πσ^2)^{−1/2} exp{ −(vi − μ)^2 / (2σ^2) }.
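Although in this example the MLE can be derived analytically, it can also be obtained by numerical maximization of the log likelihood, which is how most ML estimators are computed in practice. The R sketch below is illustrative only; it parameterizes the variance as exp(θ2) so that the numerical search is unconstrained.

set.seed(5)
v <- rnorm(300, mean = 1, sd = 2)                  # artificial sample
negllf <- function(theta) {                        # theta = (mu, log sigma^2)
  mu <- theta[1]; sigma2 <- exp(theta[2])
  -sum(dnorm(v, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
fit <- optim(c(0, 0), negllf, method = "BFGS")
c(mu_hat = fit$par[1], sigma2_hat = exp(fit$par[2]))   # close to mean(v) and mean((v - mean(v))^2)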
3Recall that a continuous random variable has a zero probability of taking a specific out- come.
4Recall that P(a1 < V1 ≤ b1, . . . , aN < VN ≤ bN) = ∫_{a1}^{b1} · · · ∫_{aN}^{bN} f(w1, . . . , wN; θ) dwN · · · dw1.
yi∗ = x′iβ0 + ui, with yi∗ > 0 ⇒ yi = 1 and yi∗ ≤ 0 ⇒ yi = 0.
It then follows (using the properties of the standard normal distribution) that8
P(yi = 1|xi) = P(yi∗ > 0|xi) = P(x′iβ0 + ui > 0|xi) = P(ui > −x′iβ0|xi) = 1 − Φ(−x′iβ0) = Φ(x′iβ0),
which is identical to (6.21).
We now turn to estimation of β0. The probit model specifies the conditional
probability distribution function for yi given xi in (6.21). Therefore, assuming the marginal distribution of xi does not involve β0, we can estimate β0 by ML. To do so, we need to construct the conditional log likelihood function. Using Assumption CS-BR, the conditional log likelihood function takes the form:

CLLFN(β) = Σ_{i=1}^{N} li(β),    (6.24)

where

li(β) = ln[Φ(x′iβ)] if yi = 1,  and  li(β) = ln[1 − Φ(x′iβ)] if yi = 0,

or more compactly,

li(β) = yi ln[Φ(x′iβ)] + (1 − yi) ln[1 − Φ(x′iβ)].
It is not possible to solve the score equations to obtain an explicit solution for the MLE βˆML as a function of {(yi, x′i)}_{i=1}^{N}, and so the estimator must be calculated using computer-based numerical optimization.
6.2.3 Logit model
The Logit model has the same qualitative structure as the Probit model but is based on the logistic distribution. Therefore, the conditional probability of the event occurring is:
P(yi =1|xi) = Λ(x′iβ0), (6.25)
8Recall that the standard normal distribution is symmetric about 0 and so Φ(v) = 1 − Φ(−v).
where Λ(z) is the cumulative distribution function of the logistic distribution that is,
Λ(z) = exp(z) / (1 + exp(z)).
Many of the features highlighted in our discussion of the probit model are present in the logit model but the details are different, reflecting the different underlying distribution. In particular, the predicted probabilities must lie in the valid range for a probability because Λ(z) is a cdf, and the change in the probability of the event in response to a change in xi,l is data dependent. Specifically, we have (see the sketch after this list):9

• if xi,l is continuous then

∂P(yi = 1|xi)/∂xi,l = Λ(x′iβ0)[1 − Λ(x′iβ0)]β0,l;

• if xi,l is a dummy variable then, holding the values of the other elements of xi constant,

∆P(yi = 1|xi) = Λ(x′iβ0, xi,l = 1) − Λ(x′iβ0, xi,l = 0).
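A minimal R sketch of these two calculations, with the coefficient vector beta and the evaluation point x chosen purely for illustration:

Lambda <- function(z) exp(z) / (1 + exp(z))    # logistic cdf

beta <- c(0.5, -1.0, 0.8)      # (intercept, coefficient on x2, coefficient on the dummy x3)
x <- c(1, 0.3, 0)              # evaluation point

# marginal effect of the continuous regressor x2
me_continuous <- Lambda(sum(x * beta)) * (1 - Lambda(sum(x * beta))) * beta[2]

# effect of switching the dummy x3 from 0 to 1, holding x2 fixed
x1 <- x; x1[3] <- 1
x0 <- x; x0[3] <- 0
me_dummy <- Lambda(sum(x1 * beta)) - Lambda(sum(x0 * beta))

c(me_continuous, me_dummy)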
Like the probit model, the logit model can be estimated via ML. In Tutorial 9 Question 2, you derive the form of the (conditional) log likelihood function.
6.2.4 Empirical example
We illustrate the binary response model via an application in which the event to be explained is whether or not an individual is arrested in 1986.10 This in- formation is captured by the dummy variable arr86 which takes the value one if the individual is arrested. The probability of this event is to be explained in terms of the following variables: pncv, the proportion of prior arrests that led to conviction; avgsen, the average sentence served from prior convictions (in months); tottime, the months spent in prison since age 18 prior to 1986; ptime86, months spent in prison in 1986; qemp86, the number of quarters (0 to 4) that the man was legally employed in 1986. The population is a group of young men in California born in 1960 or 1961 who have at least one arrest prior to 1986.11
(i) The linear probability model
The linear probability model (LPM) is implemented by regressing arr86 on an intercept and the explanatory variables listed above. Recall that the errors in the LPM cannot be homoscedastic and so we report heteroscedasticity robust standard errors. The output is as follows:
9See Tutorial 9 Question 3
10The data are contained in the wooldridge R package and can be loaded with library(wooldridge); data("crime1"). The data set is provided by Wooldridge.
11All calculations are performed in R.
------------------------------------------------------
             Estimate Std. Error t value  Pr(>|t|)
(Intercept)  0.440615   0.018546  23.758  < 2.2e-16 ***
pcnv        -0.162445   0.019220  -8.452  < 2.2e-14 ***
avgsen       0.006113   0.006210   0.984     0.325
tottime     -0.002262   0.004573  -0.494     0.621
ptime86     -0.021966   0.002919  -7.526   7.06e-14 ***
qemp86      -0.042829   0.005466  -7.835   6.66e-15 ***
------------------------------------------------------
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 0.4373 on 2719 degrees of freedom
Multiple R-squared: 0.04735, Adjusted R-squared: 0.0456
F-statistic: 27.03 on 5 and 2719 DF, p-value: < 2.2e-16
Recall that in the LPM, we have (in generic notation) that

P(yi = 1|xi) = x′iβ0.
Therefore, the results suggest that, for example:
• The intercept represents the probability of arrest for an individual who has not been convicted in 1986, has served no prison time since he was 18 and is unemployed in 1986. For this sample this probability is 0.441.
• if the man is incarcerated in 1986 then this reduces the probability of arrest because he is already in prison. For each month the man is in prison, this probability drops by 0.0219.
• Each quarter that the man is employed in 1986 reduces the probability of arrest by 0.042.
Suppose it is desired to test whether past incarceration affects the probability of arrest. These effects are captured by the variables avgsen and tottime. So writing the index as,
x′iβ0 = β0,1 + pcnv ∗ β0,2 + avgsen ∗ β0,3 + tottime ∗ β0,4 + ptime86 ∗ β0,5 + qemp86 ∗ β0,6,
incarceration has no effect on the probability of arrest if β0,3 = 0 and β0,4 = 0, but does have an effect if either or both of these coefficients is non-zero. We can test H0 : β0,3 = 0, β0,4 = 0 versus HA : β0,i ̸= 0 for at least one of i = 3, 4 using the F-test in (2.44). The test statistic is F = 1.02 with a p-value of 0.3618 and so we fail to reject the null hypothesis that past incarceration does not affect the probability of arrest at all conventional significance levels.
(ii) Probit model:
The estimation is implemented using the command
Call: glm(formula = arr ~ pcnv + avgsen + tottime + ptime86 + qemp86, family = binomial(link = "probit"), data = crime1)
The output is as follows:
------------------------------------------------------
             Estimate Std. Error z value  Pr(>|z|)
(Intercept) -0.101999   0.051462  -1.982    0.0475 *
pcnv        -0.540475   0.069303  -7.799  6.25e-15 ***
avgsen       0.018923   0.020459   0.925    0.3550
tottime     -0.006569   0.016175  -0.406    0.6847
ptime86     -0.078238   0.017055  -4.587  4.49e-06 ***
qemp86      -0.131658   0.016600  -7.931  2.17e-15 ***
------------------------------------------------------
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3216.4 on 2724 degrees of freedom
Residual deviance: 3079.2 on 2719 degrees of freedom
AIC: 3091.2
Number of Fisher Scoring iterations: 5
As can be seen the layout of the output is similar to that obtained for the regression model but there are some differences. So we now discuss the features of this output.
• Recall that the maximum likelihood estimates (MLE) are the values that maximize the log of the likelihood function. In the probit model, there is no closed form solution for the MLE, that is the first order conditions only implicitly characterize the MLE and cannot be manipulated to provide an explicit solution in terms of functions of the data. This means that the solution is found by a numerical optimization routine. In such routines, the solution is found by making an initial guess (pre-programmed into the procedure) and then repeatedly updating this guess until the values that maximize the function are found. The "Number of Fisher Scoring iterations" gives the number of such updates, which is five in this case.
• Just as in our linear regression model, it is of interest to test whether the event is related to the explanatory variables, that is if
$$P(y_i = 1 | x_i) = \Phi(\beta_{0,1} + \tilde{x}_i'\gamma_0),$$
then to test whether $\gamma_0 = 0$. More formally, the null hypothesis is $H_0: \gamma_0 = 0$ versus the alternative $H_1: \gamma_0 \neq 0$. The test statistic is given by LR = Null deviance − Residual deviance = 3216.4 − 3079.2 = 137.2. In this example $H_0$ involves the five restrictions that the coefficients on pcnv, avgsen, tottime, ptime86 and qemp86 are all zero. The p-value for the test is given by 1-pchisq(137.2,5); in this example the p-value is 0.0000 and so we reject $H_0$ at all conventional significance levels. Therefore, this evidence suggests that the probability of arrest is related collectively to these variables.
• The layout of the parameter estimates table is very similar to that of the regression output. Reported are:
– the estimates, $\hat\beta_i$;
– the standard errors of these estimates, s.e.($\hat\beta_i$);
– the test statistic for testing $H_0: \beta_i = 0$ vs $H_1: \beta_i \neq 0$, $z = \hat\beta_i/\mathrm{s.e.}(\hat\beta_i)$;
– the p-value based on z for testing $H_0: \beta_i = 0$ vs $H_1: \beta_i \neq 0$, P > |z|.
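To make the role of the numerical optimization concrete, the probit conditional log likelihood can also be maximized directly. The following is a minimal sketch, assuming the wooldridge package and the same construction of arr86 as before; glm() itself uses Fisher scoring, whereas optim() with BFGS is used here purely for illustration.

library(wooldridge)
data("crime1")
y <- as.numeric(crime1$narr86 > 0)   # assumed construction of arr86
X <- model.matrix(~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1)

# minus the probit conditional log likelihood:
# -sum_i { y_i*ln Phi(x_i'beta) + (1 - y_i)*ln[1 - Phi(x_i'beta)] }
negCLLF <- function(beta) {
  xb <- drop(X %*% beta)
  -sum(y * pnorm(xb, log.p = TRUE) +
       (1 - y) * pnorm(xb, lower.tail = FALSE, log.p = TRUE))
}

fit <- optim(rep(0, ncol(X)), negCLLF, method = "BFGS", hessian = TRUE)
cbind(estimate = fit$par,
      std.error = sqrt(diag(solve(fit$hessian))))

The resulting estimates and standard errors should be close to those reported in the glm() output above.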
In probit and logit models, the relationship between the probability of the event and the explanatory variables is non-linear. Therefore, we cannot assess the marginal responses by looking at the coefficients alone as we did in the LPM. Most computer packages calculate these marginal responses. For our example, the results can be obtained by implementing the command:
Call: probitmfx(formula = arr ~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1, atmean = TRUE)
The output is as follows:
------------------------------------------------------
Marginal Effects:
             dF/dx  Std. Err.       z     P>|z|
pcnv    -0.1775607  0.0226318 -7.8456 4.308e-15 ***
avgsen   0.0062166  0.0067212  0.9249    0.3550
tottime -0.0021580  0.0053141 -0.4061    0.6847
ptime86 -0.0257034  0.0055866 -4.6009 4.207e-06 ***
qemp86  -0.0432534  0.0054381 -7.9537 1.810e-15 ***
------------------------------------------------------
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Notice that the output has the same layout as the parameter estimates table and the columns have the analogous interpretations. The additional argument in the probitmfx function atmean = TRUE indicates that the marginal effects are evaluated at the sample mean of each explanatory variable. Comparing the marginal responses to the parameter estimates in the output for the LPM, we can see the signs agree but there are some differences in the magnitudes.
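To see where these numbers come from, the at-the-mean marginal effects can be computed by hand from the probit coefficients, since the marginal response for regressor $j$ evaluated at the sample means is $\phi(\bar{x}'\hat\beta)\hat\beta_j$. A minimal sketch, assuming probit is the glm fit reported above:

bhat <- coef(probit)
xbar <- c(1, colMeans(crime1[, c("pcnv", "avgsen", "tottime", "ptime86", "qemp86")]))
dnorm(sum(xbar * bhat)) * bhat[-1]   # phi(xbar' bhat) times each slope coefficient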
As in the LPM, we test whether past incarceration affects the probability of arrest in this model. As above, we test $H_0: \beta_{0,3} = 0,\ \beta_{0,4} = 0$ versus $H_A: \beta_{0,i} \neq 0$ for at least one of $i = 3, 4$. Within the ML framework, we can test this hypothesis using the Wald, LR or LM test. R uses the LR test.12 The test statistic is 2.36 with a p-value of 0.3071, and so we fail to reject $H_0$ at all conventional significance levels, indicating past incarceration has no effect on the probability of arrest.
(iii) Logit model
The output is as follows:13
------------------------------------------------------
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.15986    0.08422  -1.898   0.0577 .
pcnv        -0.90080    0.11990  -7.513 5.78e-14 ***
avgsen       0.03099    0.03439   0.901   0.3676
tottime     -0.01044    0.02746  -0.380   0.7039
ptime86     -0.12678    0.03081  -4.114 3.88e-05 ***
qemp86      -0.21586    0.02773  -7.784 7.02e-15 ***
------------------------------------------------------
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3216.4 on 2724 degrees of freedom
Residual deviance: 3082.5 on 2719 degrees of freedom
AIC: 3094.5
Number of Fisher Scoring iterations: 4
12 This is calculated using the commands: library(car); linearHypothesis(probit, c("avgsen", "tottime")).
13 The command is the same as for the probit estimation, only with "probit" replaced by "logit".
The marginal effects are:14
------------------------------------------------------
Marginal Effects:
             dF/dx  Std. Err.       z     P>|z|
pcnv    -0.1755626  0.0230195 -7.6267 2.409e-14 ***
avgsen   0.0060394  0.0067026  0.9010    0.3676
tottime -0.0020340  0.0053524 -0.3800    0.7039
ptime86 -0.0247087  0.0059708 -4.1383 3.499e-05 ***
qemp86  -0.0420697  0.0053547 -7.8566 3.946e-15 ***
------------------------------------------------------
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
It is interesting to compare these results to those from the probit. It can be seen that the parameter estimates are different but the marginal responses are very similar. This can be explained in terms of the latent variable model representation of the binary response model. The logit model is derived by assuming that the error of the underlying latent regression model has a logistic distribution, and the probit is derived by assuming this distribution is standard normal. The logistic and standard normal distributions have fixed values for their variances, but these values are different. This difference means, in effect, that the parameter estimates from the two models have a different scaling. Notice that this scaling does not affect the sign.
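This scaling can be seen directly in the estimates reported above: the ratios of the logit to the probit slope coefficients are roughly constant at about 1.6. A small sketch using the reported values:

# ratios of logit to probit slope estimates from the output above
probit_b <- c(pcnv = -0.540475, avgsen = 0.018923, tottime = -0.006569,
              ptime86 = -0.078238, qemp86 = -0.131658)
logit_b  <- c(pcnv = -0.90080, avgsen = 0.03099, tottime = -0.01044,
              ptime86 = -0.12678, qemp86 = -0.21586)
round(logit_b / probit_b, 2)   # roughly constant, reflecting the different error variances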
For completeness, we also test whether past incarceration affects the probability of arrest using this model as well. Here the LR test statistic is 2.44 with a p-value of 0.2952, which confirms the findings from the other two models.15
14 This is implemented using the command: logitmfx(formula = arr ~ pcnv + avgsen + tottime + ptime86 + qemp86, data = crime1, atmean = TRUE).
15 This is implemented using the commands: library(car); linearHypothesis(logit, c("avgsen", "tottime")).
6.3 The linear regression model
In this section, we revisit the linear regression model and derive the MLE under two different assumptions about the conditional distribution of the errors given the explanatory variables. The first assumption is the case where the errors have a conditional normal distribution, and the second is where the errors have a conditional Student's t distribution. The form of the estimators provides further insights into the properties of the OLS estimator discussed previously.
In both of the models considered, we maintain Assumptions CS1-CS5 and so work with the case where $\{(y_i, x_i'),\ i = 1, 2, \ldots, N\}$ are independently and identically distributed. Note that in both cases $u_i$ satisfies $E[u_i|x_i] = 0$ and $Var[u_i|x_i] = \sigma_0^2$, and so the difference between the two cases comes from other aspects of the distribution. In particular, while both distributions are "bell-shaped", the Student's t distribution has fatter tails than the normal, and so there is an increased chance of large values of $|u_i|$ under the Student's t distribution than under the normal.
Throughout the analysis, we assume that the parameters θ0 = (β0′ , σ02)′ do not appear in the marginal distribution of xi. We therefore work with the conditional log likelihood function. Since the data are i.i.d., the conditional log likelihood function is given by
$$CLLF_N(\theta) = \sum_{i=1}^{N} l_i(\theta), \qquad (6.26)$$
where $l_i(\theta) = \ln[f(y_i|x_i,\theta)]$ and $f(y_i|x_i,\theta)$ is the pdf of the conditional distribution of $y_i$ given $x_i$.
6.3.1 MLE under normality
If $u_i|x_i \sim N(0,\sigma_0^2)$ and Assumption CS1 holds then $y_i|x_i \sim N(x_i'\beta_0, \sigma_0^2)$ and so
$$f(y_i|x_i,\theta) = (2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right\}. \qquad (6.27)$$
Substituting from (6.27) into (6.26), we obtain
$$CLLF_N(\theta) = \sum_{i=1}^{N}\ln\left[(2\pi\sigma^2)^{-1/2}\exp\left\{-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right\}\right] = -\frac{N}{2}\ln[2\pi] - \frac{N}{2}\ln[\sigma^2] - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - x_i'\beta)^2. \qquad (6.28)$$
Let $\hat\theta_{nl}$ denote the MLE of $\theta_0$, where the "nl" subscript is to emphasize that the estimator is the MLE under the assumption that the errors have a normal distribution. The MLE is the solution to the score equations:
$$\frac{\partial CLLF_N(\hat\beta_{nl}, \hat\sigma^2_{nl})}{\partial \beta} = 0, \qquad \frac{\partial CLLF_N(\hat\beta_{nl}, \hat\sigma^2_{nl})}{\partial \sigma^2} = 0.$$
From (6.28), it follows that
$$\frac{\partial CLLF_N(\theta)}{\partial \beta} = \frac{1}{\sigma^2}\sum_{i=1}^{N}(y_i - x_i'\beta)x_i, \qquad (6.29)$$
$$\frac{\partial CLLF_N(\theta)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{N}(y_i - x_i'\beta)^2. \qquad (6.30)$$
Therefore, the MLEs can be obtained by solving:
$$\sum_{i=1}^{N} y_i x_i - \sum_{i=1}^{N} x_i x_i'\hat\beta_{nl} = 0, \qquad (6.31)$$
$$\hat\sigma^2_{nl} - N^{-1}\sum_{i=1}^{N}(y_i - x_i'\hat\beta_{nl})^2 = 0. \qquad (6.32)$$
Simple manipulation of (6.31) yields
$$\hat\beta_{nl} = \left(\sum_{i=1}^{N} x_i x_i'\right)^{-1}\sum_{i=1}^{N} x_i y_i = (X'X)^{-1}X'y, \qquad (6.33)$$
and (6.32) yields
$$\hat\sigma^2_{nl} = N^{-1}\sum_{i=1}^{N}(y_i - x_i'\hat\beta_{nl})^2 = e'e/N, \qquad (6.34)$$
where $e = y - X\hat\beta_{nl}$ ($y$ and $X$ are defined as in Section 2.1).
It can be recognized that $\hat\beta_{nl} = \hat\beta_N$, the OLS estimator of $\beta_0$ in (2.7), but $\hat\sigma^2_{nl} \neq \hat\sigma^2_N$, the OLS estimator of $\sigma_0^2$ in (2.33). This pattern of relationships tells us something about OLS and something about MLE. About OLS, it implies that if the errors have mean zero, a homoscedastic normal distribution and are pairwise uncorrelated, then $\hat\beta_N$ is asymptotically the most efficient estimator in the class of CUAN estimators of $\beta_0$. Notice this optimality is relative to any other estimator, whether linear or nonlinear. This contrasts with the Gauss-Markov Theorem, which states that if the errors have mean zero, are homoscedastic and pairwise uncorrelated then OLS is the best linear unbiased estimator. Thus, the equivalence of OLS to MLE in this setting provides an additional reason for basing inference about $\beta_0$ on the OLS estimator. About MLE, the relationship between the OLS and ML estimators demonstrates that the MLE need not be unbiased: using similar arguments to Section 2.5, it can be shown that $E[\hat\sigma^2_{nl}] = \{(N-k)/N\}\sigma_0^2$.16
16 Although as $N \to \infty$ the bias disappears because $(N-k)/N \to 1$.
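The divisor is the only difference between the two variance estimators. A minimal sketch on simulated data (the data-generating process here is an assumption made purely for illustration):

# Under normal errors the MLE of beta equals OLS, while the MLE of sigma^2 divides
# the residual sum of squares by N rather than by N - k.
set.seed(1)
N <- 200
X <- cbind(1, rnorm(N))
y <- drop(X %*% c(1, 2)) + rnorm(N)

ols <- lm(y ~ X - 1)
e <- residuals(ols)
c(sigma2_mle = sum(e^2) / N,               # e'e/N: biased downwards in finite samples
  sigma2_ols = sum(e^2) / (N - ncol(X)))   # e'e/(N - k): unbiased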
6.3.2 MLE under Student’s t distribution
This section is non-examinable.
If Assumptions CS4 and CS5 hold and $u_i|x_i \sim$ Student's t with $\nu$ degrees of freedom then it follows that
$$f(y_i|x_i,\theta) = c\,\sigma^{-1}\left[1 + \frac{(y_i - x_i'\beta)^2}{(\nu-2)\sigma^2}\right]^{-(\nu+1)/2}, \qquad (6.35)$$
where $c$ is a constant that ensures the pdf integrates to one, but need not concern us here. Assuming again that the marginal pdf for $x_i$ does not involve $\theta_0 = (\beta_0', \sigma_0^2)'$, we can work with the conditional LLF. From Assumption CS2 and (6.35), it follows that
$$CLLF_N(\theta) = \sum_{i=1}^{N}\ln\left\{c\,\sigma^{-1}\left[1 + \frac{(y_i - x_i'\beta)^2}{(\nu-2)\sigma^2}\right]^{-(\nu+1)/2}\right\} \qquad (6.36)$$
$$= -\frac{N}{2}\ln[\sigma^2] - \left(\frac{\nu+1}{2}\right)\sum_{i=1}^{N}\ln\left[1 + \frac{(y_i - x_i'\beta)^2}{(\nu-2)\sigma^2}\right], \qquad (6.37)$$
where we have omitted the term involving c for ease of presentation as it plays no role in our subsequent analysis.
The MLE, $\hat\theta_{St}$, is the solution to the score equations:
$$\left.\frac{\partial CLLF_N(\theta)}{\partial \beta}\right|_{\theta=\hat\theta_{St}} = 0, \qquad (6.38)$$
$$\left.\frac{\partial CLLF_N(\theta)}{\partial \sigma^2}\right|_{\theta=\hat\theta_{St}} = 0, \qquad (6.39)$$
where the subscript "St" is short for Student's t. For this model, we have:
$$\frac{\partial CLLF_N(\beta,\sigma^2)}{\partial \beta} = \frac{\nu+1}{\sigma^2(\nu-2)}\sum_{i=1}^{N}\left[1 + \frac{(y_i - x_i'\beta)^2}{(\nu-2)\sigma^2}\right]^{-1} x_i(y_i - x_i'\beta), \qquad (6.40)$$
$$\frac{\partial CLLF_N(\beta,\sigma^2)}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{\nu+1}{2\sigma^4(\nu-2)}\sum_{i=1}^{N}\left[1 + \frac{(y_i - x_i'\beta)^2}{(\nu-2)\sigma^2}\right]^{-1}(y_i - x_i'\beta)^2. \qquad (6.41)$$
We begin by considering the score equation with respect to $\beta$ in (6.38). From (6.40), it follows that (6.38) can be written as
$$\sum_{i=1}^{N}\hat{w}_i\,[x_i y_i - x_i x_i'\hat\beta_{St}] = 0, \qquad (6.42)$$
where $\hat{w}_i = [1 + \hat{u}_i^2/\{\hat\sigma^2_{St}(\nu-2)\}]^{-1}$ and $\hat{u}_i = y_i - x_i'\hat\beta_{St}$. With rearrangement, (6.42) becomes
$$\sum_{i=1}^{N}\hat{w}_i x_i y_i = \sum_{i=1}^{N}\hat{w}_i x_i x_i'\hat\beta_{St}. \qquad (6.43)$$
From (6.43), it follows that
$$\hat\beta_{St} = \left[\sum_{i=1}^{N}\hat{w}_i x_i x_i'\right]^{-1}\sum_{i=1}^{N}\hat{w}_i x_i y_i = (X'\hat{W}X)^{-1}X'\hat{W}y, \qquad (6.44)$$
where $\hat{W}$ is a diagonal matrix with $i^{th}$ diagonal element $\hat{w}_i$. Now consider the score equation with respect to $\sigma^2$ in (6.39). Using (6.41), equation (6.39) can be written as
$$-\frac{N}{\hat\sigma^2_{St}} + \frac{\nu+1}{\hat\sigma^4_{St}(\nu-2)}\sum_{i=1}^{N}\hat{w}_i\hat{u}_i^2 = 0. \qquad (6.45)$$
With rearrangement, (6.45) becomes
$$\hat\sigma^2_{St} = N^{-1}\left(\frac{\nu+1}{\nu-2}\right)\sum_{i=1}^{N}\hat{w}_i\hat{u}_i^2 = N^{-1}\left(\frac{\nu+1}{\nu-2}\right)\hat{u}'\hat{W}\hat{u}.$$
It can be seen that $\hat\beta_{St} \neq \hat\beta_{nl}$ and $\hat\sigma^2_{St} \neq \hat\sigma^2_{nl}$, and so, as would be expected, the form of the MLE is sensitive to the assumed distribution of the errors. Notice also that the MLEs under the Student's t distribution are not equal to their OLS counterparts. Therefore, the OLS estimators are not asymptotically efficient in this case. Notice that $\hat\beta_{St}$ is a nonlinear function of $y_i$ and so there is no contradiction with the Gauss-Markov Theorem. It is interesting to compare the MLEs of $\beta_0$ under the normal and Student's t distributions. Notice that $\hat\beta_{St}$ has a similar structure to the WLS estimator in (4.34) and we can use that analogy to understand the difference between the two MLEs. It can be recognized that if $\hat{w}_i$ is replaced by 1 in the formula for $\hat\beta_{St}$ then the result would be $\hat\beta_{nl}$. The difference between $\hat\beta_{nl}$ and $\hat\beta_{St}$ stems from the weights applied in the sums that appear in their formulae. With $\hat\beta_{nl}$, all observations receive equal weight. With $\hat\beta_{St}$, the observations are weighted in a way that is inversely related to the magnitude of the errors, $|\hat{u}_i|$. The latter structure makes sense because the Student's t distribution has thicker tails than the normal, and so the MLE places less weight on the observations in the tails, for which the relationship between $y_i$ and $x_i$ is less strong.
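A minimal sketch of how the score equations could be solved in practice is given below: starting from OLS, the weights, $\beta$ and $\sigma^2$ are updated repeatedly using (6.42), (6.44) and the expression for $\hat\sigma^2_{St}$ until the values stabilize. The simulated data and the fixed value of $\nu$ are assumptions made purely for illustration.

set.seed(2)
N <- 500; nu <- 5
X <- cbind(1, rnorm(N))
y <- drop(X %*% c(1, 2)) + rt(N, df = nu) * sqrt((nu - 2) / nu)   # t errors with variance 1

start  <- lm(y ~ X - 1)
beta   <- coef(start)
sigma2 <- mean(residuals(start)^2)                         # OLS starting values
for (iter in 1:200) {
  u <- drop(y - X %*% beta)
  w <- 1 / (1 + u^2 / ((nu - 2) * sigma2))                 # weights w_i as in (6.42)
  beta <- solve(crossprod(X, w * X), crossprod(X, w * y))  # (X'WX)^{-1} X'Wy, cf. (6.44)
  u <- drop(y - X %*% beta)
  sigma2 <- (nu + 1) / (nu - 2) * mean(w * u^2)            # cf. the expression for sigma^2_St
}
cbind(ols = coef(start), student_t_mle = drop(beta))       # compare the two estimates of beta_0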
6.4 Appendix: The large sample distribution of the MLE and two important identities
In this appendix, we present an overview of the arguments behind the large sample distribution theory of the MLE. In so doing, we must appeal to two important identities associated with ML estimation which are also derived. Throughout this section, it is taken for granted that the MLE is consistent. The proof of consistency of the MLE is complicated by the fact that in general the score equations only implicitly characterize the MLE and can not be solved to obtain an explicit solution for the MLE as a function of the data. As a result, the proof strategy is different to that taken to establish the consistency of the OLS, GLS or IV estimators in Chapters 3, 4 and 5. The necessary argument involves additional statistical concepts that are beyond the scope of this course. However, as emerges below, given consistency, the arguments used to establish the large sample distribution of the MLE are very similar to those used to establish the analogous result for OLS, GLS and IV.
Consider the case where $\{V_i\}$ is a sequence of independently and identically distributed random variables with probability density function $f(v;\theta_0)$. It is pedagogically convenient to present the aforementioned two identities first and then refer to them as needed in the derivation of the large sample distribution of the MLE. To this end, define $l_i(\theta) = \ln[f(V_i;\theta)]$,
$$s_i(\theta_0) = \left.\frac{\partial \ln[f(V_i;\theta)]}{\partial \theta}\right|_{\theta=\theta_0}, \quad \text{and} \quad H_i(\theta_0) = \left.\frac{\partial^2 \ln[f(V_i;\theta)]}{\partial \theta \partial \theta'}\right|_{\theta=\theta_0}.$$
The two identities are presented in the following theorem.
Theorem 6.1. If $\{V_i\}$ is a sequence of independently and identically distributed random variables with probability density function $f(v;\theta_0)$ then: (i) $E[s_i(\theta_0)] = 0$; (ii) $E[s_i(\theta_0)s_i(\theta_0)'] = -E[H_i(\theta_0)]$.
Both identities involve the score function of the $i^{th}$ observation, $s_i(\theta_0)$. Theorem 6.1(i) states that the expected value of the score function is zero. Theorem 6.1(ii) states that the expected value of the outer product of the score function is equal to
the negative of the expectation of the Hessian matrix of $l_i(\theta)$ evaluated at $\theta_0$. Notice that as $\{V_i\}$ is a sequence of independently and identically distributed r.v.'s, $-E[H_i(\theta_0)] = I_{\theta,N}$, the information matrix. Therefore, Theorem 6.1(ii) can be restated in our context here as saying that the expected value of the outer product of the score function equals the information matrix; Theorem 6.1(ii) is referred to as the information matrix identity.
Proof of Theorem 6.1:
Recall that by definition $\int_{-\infty}^{\infty} f(v;\theta)\,dv = 1$. Assuming it is permissible to reverse the order of differentiation and integration, it follows from $\int_{-\infty}^{\infty} f(v;\theta)\,dv = 1$ that
$$\frac{\partial}{\partial\theta}\int_{-\infty}^{\infty} f(v;\theta)\,dv = \frac{\partial}{\partial\theta}1,$$
and so
$$\int_{-\infty}^{\infty}\frac{\partial f(v;\theta)}{\partial\theta}\,dv = 0
\;\Rightarrow\; \int_{-\infty}^{\infty}\left[\frac{1}{f(v;\theta)}\frac{\partial f(v;\theta)}{\partial\theta}\right]f(v;\theta)\,dv = 0
\;\Rightarrow\; \int_{-\infty}^{\infty}\left[\frac{\partial \ln[f(v;\theta)]}{\partial\theta}\right]f(v;\theta)\,dv = 0
\;\Rightarrow\; E\left[\frac{\partial \ln[f(v;\theta)]}{\partial\theta}\right] = 0.$$
This gives part (i). From part (i) it follows that
$$\frac{\partial}{\partial\theta'}\int_{-\infty}^{\infty}\left[\frac{\partial \ln[f(v;\theta)]}{\partial\theta}\right]f(v;\theta)\,dv = 0,$$
and so
$$\int_{-\infty}^{\infty}\left[\frac{\partial^2 \ln[f(v;\theta)]}{\partial\theta\partial\theta'}f(v;\theta) + \frac{\partial \ln[f(v;\theta)]}{\partial\theta}\left(\frac{\partial f(v;\theta)}{\partial\theta}\right)'\right]dv = 0. \qquad (6.46)$$
If we use the same trick as in the proof of part (i) on the second term in (6.46), i.e. multiply and divide by $f(v;\theta)$, then we obtain
$$\int_{-\infty}^{\infty}\left[\frac{\partial^2 \ln[f(v;\theta)]}{\partial\theta\partial\theta'} + \frac{\partial \ln[f(v;\theta)]}{\partial\theta}\left(\frac{\partial \ln[f(v;\theta)]}{\partial\theta}\right)'\right]f(v;\theta)\,dv = 0. \qquad (6.47)$$
Equation (6.47) implies
$$E\left[\left(\frac{\partial \ln[f(V_i;\theta)]}{\partial\theta}\right)\left(\frac{\partial \ln[f(V_i;\theta)]}{\partial\theta}\right)'\right] = -E\left[\frac{\partial^2 \ln[f(V_i;\theta)]}{\partial\theta\partial\theta'}\right],$$
which gives us part (ii). ⋄
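As a quick numerical illustration of Theorem 6.1, both identities can be checked by simulation for a simple parametric model; the exponential density used below is an assumed example, chosen only because its score and Hessian are easy to write down.

# f(v; theta) = theta*exp(-theta*v):  s_i(theta) = 1/theta - V_i,  H_i(theta) = -1/theta^2
set.seed(3)
theta0 <- 2
v <- rexp(1e6, rate = theta0)
s <- 1 / theta0 - v
H <- -1 / theta0^2
c(mean_score = mean(s),          # close to 0, as in Theorem 6.1(i)
  outer_product = mean(s^2),     # close to -E[H_i] = 1/theta0^2, as in Theorem 6.1(ii)
  minus_E_hessian = -H)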
We now turn to the large sample distribution of the MLE. Specializing the result
stated in Section 6.1 to our setting here, we have the following result.
Theorem 6.2. Let $\{V_i\}$ be a sequence of independently and identically distributed random variables with probability density function $f(v;\theta_0)$ and $\hat\theta_N$ be the MLE, that is, $\hat\theta_N = \mathrm{argmax}_\theta \sum_{i=1}^{N} l_i(\theta)$ where $l_i(\theta) = \ln[f(v_i;\theta)]$. Assume that $\hat\theta_N \stackrel{p}{\rightarrow} \theta_0$ and certain other regularity conditions hold.17 Then we have: $N^{1/2}(\hat\theta_N - \theta_0) \stackrel{d}{\rightarrow} N(0, V_\theta)$ where $V_\theta = -\{E[H_i(\theta_0)]\}^{-1}$ and $H_i(\theta_0)$ is defined above.
(Heuristic) Proof:
First note that the score equations can be written as
$$\left.\frac{\partial LLF_N(\theta)}{\partial\theta}\right|_{\theta=\hat\theta_N} = \sum_{i=1}^{N} s_i(\hat\theta_N) = 0, \qquad (6.48)$$
because $LLF_N(\theta) = \sum_{i=1}^{N} l_i(\theta)$. Taking a first-order Taylor expansion of (6.48), we obtain18
$$N^{-1/2}\sum_{i=1}^{N} s_i(\theta_0) + \left\{N^{-1}\sum_{i=1}^{N} H_i(\theta_0)\right\} N^{1/2}(\hat\theta_N - \theta_0) + \xi_N = 0,$$
where ξN represents a remainder term. Under the (unstated) regularity conditions of the theorem, it can be shown that ξN →p 0, and so ξN is asymptotically negligible.
This means that the large sample behaviour of $N^{1/2}(\hat\theta_N - \theta_0)$ can be determined from
$$N^{-1/2}\sum_{i=1}^{N} s_i(\theta_0) + \left\{N^{-1}\sum_{i=1}^{N} H_i(\theta_0)\right\} N^{1/2}(\hat\theta_N - \theta_0) \stackrel{a}{=} 0,$$
where "$\stackrel{a}{=}$" stands for "asymptotically equal to" and is included to remind us that the identity is not exact as we have omitted $\xi_N$. Rearranging, we obtain
$$N^{1/2}(\hat\theta_N - \theta_0) \stackrel{a}{=} -\left\{N^{-1}\sum_{i=1}^{N} H_i(\theta_0)\right\}^{-1} N^{-1/2}\sum_{i=1}^{N} s_i(\theta_0). \qquad (6.49)$$
Notice we can write $N^{1/2}(\hat\theta_N - \theta_0) \stackrel{a}{=} -M_N m_N$, where $M_N = \left\{N^{-1}\sum_{i=1}^{N} H_i(\theta_0)\right\}^{-1}$, a $p \times p$ random matrix, and $m_N = N^{-1/2}\sum_{i=1}^{N} s_i(\theta_0)$, a $p \times 1$ random vector. Furthermore, we can use the WLLN and CLT to deduce the large sample behaviour of these two terms. Since $\{V_i\}$ is i.i.d., then so too are $\{s_i(\theta_0)\}$ and $\{H_i(\theta_0)\}$. We can therefore apply the WLLN to deduce that
$$N^{-1}\sum_{i=1}^{N} H_i(\theta_0) \stackrel{p}{\rightarrow} E[H_i(\theta_0)]. \qquad (6.50)$$
One of the suppressed regularity conditions in the theorem is the assumption that $E[H_i(\theta_0)]$ is nonsingular, and so via Slutsky's Theorem we can deduce that $M_N \stackrel{p}{\rightarrow} \{E[H_i(\theta_0)]\}^{-1}$.19
17 These regularity conditions are relatively mild and are satisfied by the models considered in this chapter.
18 Recall the discussion of the Delta method in Section 4.2.
19 Note that the second order condition for maximization of the LLF is that $\sum_{i=1}^{N} H_i(\hat\theta_N)$ is negative definite, which implies this matrix is nonsingular.
Since Theorem 6.1(i) implies that $E[s_i(\theta_0)] = 0$, which in turn implies $Var[s_i(\theta_0)] = E[s_i(\theta_0)\{s_i(\theta_0)\}']$, we can apply the CLT to deduce that
$$N^{-1/2}\sum_{i=1}^{N} s_i(\theta_0) \stackrel{d}{\rightarrow} N\left(0,\; E[s_i(\theta_0)\{s_i(\theta_0)\}']\right). \qquad (6.51)$$
It then follows from Lemma 3.5 that
$$N^{1/2}(\hat\theta_N - \theta_0) \stackrel{d}{\rightarrow} N\left(0,\; \{E[H_i(\theta_0)]\}^{-1} E[s_i(\theta_0)\{s_i(\theta_0)\}'] \{E[H_i(\theta_0)]\}^{-1\prime}\right). \qquad (6.52)$$
Using the information matrix identity - Theorem 6.1(ii) - it follows that
$$\{E[H_i(\theta_0)]\}^{-1} E[s_i(\theta_0)\{s_i(\theta_0)\}'] \{E[H_i(\theta_0)]\}^{-1\prime} = -\{E[H_i(\theta_0)]\}^{-1}. \qquad (6.53)$$
Using (6.53) in (6.52), we obtain the desired result. ⋄
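Continuing the exponential example used after Theorem 6.1, the conclusion of Theorem 6.2 can also be checked by simulation: for that model the MLE is $\hat\theta_N = 1/\bar{V}$ and $V_\theta = \theta_0^2$.

# Monte Carlo check that N^{1/2}(theta_hat - theta0) has variance close to
# V_theta = -{E[H_i(theta0)]}^{-1} = theta0^2 in the exponential model
set.seed(4)
theta0 <- 2; N <- 400; R <- 5000
z <- replicate(R, {
  v <- rexp(N, rate = theta0)
  sqrt(N) * (1 / mean(v) - theta0)
})
c(simulated_variance = var(z), theoretical_variance = theta0^2)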
Notice that the structure of the arguments used to deduce the large sample distribution of the MLE is essentially the same as that used to derive the corresponding results for the OLS, GLS and IV estimators.
Bibliography
Acemoglu, D., Johnson, S., and Robinson, J. A. (2001). ‘The Colonial Origins of Com- parative Development: An Empirical Investigation’, American Economic Review, 91: 1369–1401.
(2012). ‘The Colonial Origins of Comparative Development: An Empirical Investigation: Reply’, American Economic Review, 102: 3077–3110.
Albouy, D. Y. (2012). ‘The Colonial Origins of Comparative Development: An Em- pirical Investigation: Comment’, American Economic Review, 102: 3059–3076.
Andrews, D. W. K. (1991). ‘Heteroscedasticity and autocorrelation consistent covari- ance matrix estimation’, Econometrica, 59: 817–858.
Angrist, J. D., and Krueger, A. B. (1991). ‘Does compulsory school attendance affect schooling and earnings?’, Quarterly Journal of Economics, 87: 979–1014.
Bollerslev, T. (1986). ‘Generalized Autoregressive Conditional Heteroscedasticity’, Journal of Econometrics, 31: 307–327.
Bollerslev, T., Chou, R. Y., and Kroner, K. F. (1992). ‘ARCH modeling in finance: a review of the theory and empirical evidence’, Journal of Econometrics, 52: 5–29.
Bound, J., Jaeger, D. A., and Baker, R. M. (1995). ‘Problems with instrumental vari- ables estimation when the correlation between the instruments and the endogenous explanatory variable is weak’, Journal of the American Statistical Association, 90: 443–450.
Box, G. E. P., and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Prentice Hall, Englewood Cliffs, NJ, U. S. A.
Breusch, T. S. (1978). ‘Testing for autocorrelation in dynamic linear models’, Aus- tralian Economic Papers, 17: 334–355.
Breusch, T. S., and Pagan, A. R. (1979). ‘A simple test for heteroscedasticity and random coefficient variation’, Econometrica, 47: 987–1007.
Clarida, R., Gali, J., and Gertler, M. (2000). ‘Monetary policy rules and macroeco- nomic stability: evidence and some theory’, Quarterly Journal of Economics, 96: 147–180.
Dell, M., Jones, B. F., and Olken, B. A. (2014). ‘What do we learn from the weather? The new climate-economy literature’, Journal of Economic Literature, 52: 740–798.
Dickey, D. A., and Fuller, W. A. (1979). ‘Distribution of the estimators for autoregressive time series with a unit root’, Journal of the American Statistical Association, 74: 427–431.
Durbin, J. (1970). ‘Testing for serial correlation in least squares regression when some of the regressors are lagged dependent variables’, Econometrica, 38: 410–421.
Durbin, J., and Watson, G. S. (1950). ‘Testing for serial correlation in least squares regressions I’, Biometrika, 37: 409–428.
Eicker, F. (1967). ‘Limit theorems for regressions with unequal and dependent er- rors’, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 59–82, Berkeley, CA USA. Berkeley: University of Cali- fornia Press.
Engle, R. F. (1982). ‘Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation’, Econometrica, 50: 987–1007.
Frisch, R., and Waugh, F. V. (1933). ‘Partial Time Regressions as Compared with Individual Trends’, Econometrica, 1: 387–401.
Godfrey, L. G. (1978). ‘Testing against general autoregressive and moving average error models when the regressors include lagged dependent variables’, Econometrica, 46: 1293–1302.
Greene, W. H. (2012). Econometric Analysis. Pearson, London, UK, seventh edn.
Hall, A. R. (2015). ‘Econometricians have their moments: GMM at 32’, Economic Record, 91, S1: 1–24.
Hansen, L. P., and Hodrick, R. J. (1980). ‘Forward exchange rates as optimal predictors
of future spot rates’, Journal of Political Economy, 88: 829–853.
Harvey, A. C. (1990). The Econometric Analysis of Time Series. MIT Press, Cam-
bridge, MA, USA, second edn.
Huber, P. J. (1967). ‘The behaviour of Maximum Likelihood estimators under nonstandard conditions’, in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 221–233, Berkeley, CA USA. Berkeley: University of California Press.
Lovell, M. (1963). ‘Seasonal Adjustment of Economic Time Series and Multiple Re- gression Analysis’, Journal of the American Statistical Association, 58: 993–1010.
McCarthy, P. S. (1994). ‘Relaxed speed limits and highway safety: new evidence from California’, Economics Letters, 46: 173–179.
Newey, W. K., and West, K. D. (1987). ‘A simple positive semi–definite heteroscedas- ticity and autocorrelation consistent covariance matrix’, Econometrica, 55: 703–708.
(1994). ‘Automatic lag selection in covariance matrix estimation’, Review of Economic Studies, 61: 631–653.
Orme, C. (2009). Lecture Notes in Linear Algebra. University of Manchester Teaching Materials, Manchester, UK.
Pearson, K. (1893). ‘Asymmetrical frequency curves’, Nature, 48: 615–616.
(1894). ‘Contributions to the mathematical theory of evolution’, Philosophical
transactions of the Royal Society of London (A), 185: 71–110.
(1895). ‘Contributions to the mathematical theory of evolution, II: skew vari-
ation’, Philosophical transactions of the Royal Society of London (A), 186: 343–414.
Stock, J. H., Wright, J. H., and Yogo, M. (2002). ‘A survey of weak instruments and weak identification in generalized method of moments’, Journal of Business and Economic Statistics, 20: 518–529.
White, H. (1980). ‘A heteroscedasticity-consistent covariance matrix estimator and a direct test for heteroscedasticity’, Econometrica, 48: 817–838.
White, H., and Domowitz, I. (1984). ‘Nonlinear regression with dependent observa- tions’, Econometrica, 52: 143–161.
Wooldridge, J. M. (2002). Econometric Analysis of Cross Section and Panel Data. MIT Press, Cambridge, MA, USA.
(2006). Introductory Econometrics: A Modern Approach. Thomson- Southwestern, Mason, OH, USA, third edn.