ECON6300/7320/8300 Advanced Microeconometrics
Review of Multiple Regression and M-estimation
Christiern Rose, University of Queensland
Lecture 2
Features of microeconometrics (1)
Data pertain to firms, individuals, households, etc.
Focus on “outcomes”, and relationships linking outcomes to actions of individuals
earnings = f(hours worked, years of education, gender, experience, institutions)
Heterogeneity of economic subjects’ preferences, constraints, goals etc. explicitly acknowledged (no “representative agent” assumption)
Noisy data, large samples
Economic factors supplemented by social, spatial, temporal interdependence
Features of microeconometrics (2)
Sources of data:
Surveys (Govt/private); cross section or longitudinal (panel)
Census
Administrative data (by-product: tax related, health related, program related)
Natural experiments
Designed experiments
Randomized trials with controls
Type of data impacts method and model used in analysis
Features of microeconometrics (3)
Measures of “outcomes”
Continuous (e.g. earnings)
Discrete (binary or multinomial choice, as in discrete choice models) or integer-valued (number of doctor visits)
Partially observed/censored (hours of work)
Proportions or intervals
Type of measure may affect the choice of model used. Many types of regression models exist.
This Lecture
1. Understand the role of regression
2. Review of the basic regression analysis results in matrix notation (main reference: W. Greene)
3. Review features of regression analysis
4. Review of the scope and limitations of regression model; consider causal parameters and treatment effects
5. Compare causal (“structural”) and non-causal (“reduced form”) regression models
6. Move on to the topic of m-Estimation
General set-up and notation
Data: (y : (N × 1), X : (N × K))
Regression model in matrix notation: y = Xβ + u
Joint population distribution of the data: f(y, X; θ0), where both f and θ0 are unknown
Three approaches:
1. Fully parametric: assume f is given, θ0 is finite dimensional but unknown
2. Semi-parametric: assume that θ0 is finite dimensional but unknown; we specify some moment functions of y, e.g. E[y|X] or Var[y|X], but do not want to make assumptions about the distribution f(.)
3. Nonparametric: assume that θ0 is infinite dimensional, and we want to estimate the relation between y and X without making a parametric assumption about f (.)
Definition and notation
θ0 : vector of mean and variance parameters in the relationships to be estimated
θ̂ : the estimator of θ0 based on a sample of observations from the population of interest.
In general θ̂ ≠ θ0; the sampling error (θ̂ − θ0) has a statistical distribution
Ideally the distribution of θ̂ is centered on θ0 (unbiased estimator) with high precision (efficiency property), and a known distribution, to support statistical inference (probability statements and hypothesis testing).
Consistency means θ̂ →p θ0 (convergence in probability).
General approach to estimation and inference
Model specification and identification
Which specification/restrictions are reasonable?
Can the parameter θ0 be recovered given infinite data?
Correct model specification or correct specification of key components of the model given the data we have available is necessary for consistency
Qualification: All models are necessarily misspecified as they are simplifications
True specifications vs Pseudo-true specifications
Under additional assumptions the estimators are asymptotically normally distributed,
i.e. the sampling distribution is well approximated by the multivariate normal in large samples:
θ̂ ∼a N[θ0, V[θ̂]]
where V[θ̂] denotes the (asymptotic) variance-covariance matrix of the estimator (VCE).
Efficient estimators are consistent and have the smallest variance, and hence the smallest VCE V[θ̂], within a given class of estimators.
In many (most) cases the large sample (normal) distribution of θ̂ is the best we can do. Hence inference on θ is based on distributions derived from the normal
Test statistics based on (asymptotic) normal results include t-test, F-test, chi-square test
Standard errors of the parameter estimates are obtained from V[θ̂].
Different assumptions about the data generating process (d.g.p.), such as heteroskedasticity, can lead to different VCE.
OLS regression
Linear regression estimated by least squares can be regarded as semi-parametric
Goal: to estimate the linear conditional mean function
E[y|x] = x′β = β1x1 + β2x2 + · · · + βKxK, (1)
where usually an intercept is included so x1 = 1.
E[y|x] is of direct interest if goal is prediction based on x′β
Econometrics is interested in marginal effects (e.g. of a price change on quantity transacted): ∂E[y|x]/∂xj = βj.
The linear regression has two components, conditional mean and the error
yi = E[yi|xi] + ui (2)
yi = x′iβ + ui, i = 1, …, N. (3)
OLS (1)
The objective function is the sum of squared errors, QN(β) = (y − Xβ)′(y − Xβ) ≡ ∑_{i=1}^N (yi − x′iβ)², which is minimized w.r.t. β
Solving the FOC (first-order conditions) using calculus methods yields the OLS solution: X′(y − Xβ̂) = 0
Matrix notation provides a very compact way to represent estimator and variance matrix formulas that involve sums of products and cross-products.
y : N × 1 column vector with ith entry yi; X : N × K regressor matrix with ith row x′i.
Convention is that all vectors are column vectors, with transposes used when row vectors are desired.
OLS (2)
The OLS estimator can be written in matrix or mixed matrix-scalar notation:

β̂ = (X′X)^{−1} X′y = (∑_{i=1}^N xi x′i)^{−1} ∑_{i=1}^N xi yi

Written out in full, ∑_{i=1}^N xi x′i is the K × K matrix whose (j, k) entry is ∑_{i=1}^N xji xki, and ∑_{i=1}^N xi yi is the K × 1 vector whose jth entry is ∑_{i=1}^N xji yi.
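A minimal numerical sketch of this formula in Python (numpy only; the data here are simulated purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    N, K = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])  # intercept + regressors
    beta = np.array([1.0, 2.0, -0.5])
    y = X @ beta + rng.normal(size=N)  # u ~ N(0, 1)

    # beta_hat = (X'X)^{-1} X'y; solve() is preferred to forming the inverse explicitly
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)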
Properties of OLS estimator
Properties of any estimator depend on assumptions about the data generating process (d.g.p.).
For the linear regression model this reduces to assumptions about the regression error ui .
As a starting point in regression analysis it is typical to assume:
1. E[ui|xi] = 0 (exogeneity).
2. E[ui²|xi] = σ² (conditional homoskedasticity).
3. E[ui uj|xi, xj] = 0, i ≠ j (conditional uncorrelatedness).
4. ui ∼ i.i.d. N[0, σ²] (not essential for estimation but often added for simplicity)
Properties of OLS estimator (2)
Assumption 1 is essential for consistent estimation of β, and implies that the conditional mean given in (1) is correctly specified.
It also implies linearity and no omitted variables. Linearity in variables can be relaxed.
Assumptions (2)-(3) determine the form of the VCE of β̂.
Assumptions 1-3 lead to β̂ being asymptotically normally distributed with default estimator of the VCE

V̂default[β̂] = s²(X′X)^{−1}, (4)

where ûi = yi − x′iβ̂ and s² = (N − K)^{−1} ∑_i ûi².
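Continuing the numpy sketch from the OLS slide, the default VCE (4) can be computed as:

    u_hat = y - X @ beta_hat                   # residuals
    s2 = u_hat @ u_hat / (N - K)               # s^2 = (N - K)^{-1} sum_i u_i^2
    V_default = s2 * np.linalg.inv(X.T @ X)    # default VCE under homoskedasticity
    se_default = np.sqrt(np.diag(V_default))   # default standard errors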
Properties of OLS estimator (3)
Under assumptions 1-3 (with or without 4), β̂ has an asymptotic normal distribution (assuming no perfect collinearity).
β̂ converges in probability to β and s² to σ²
Under assumptions 1-4 the t-statistics (β̂j − βj)/se(β̂j) are exactly t-distributed.
Assumption 4 is not always assumed. If not it is common to continue to use the t-distribution for hypothesis testing (as opposed to the standard normal), hoping that it provides a better finite sample approximation.
If assumptions 2-3 are relaxed, OLS is no longer efficient.
Heteroskedasticity-robust standard errors
If assumption 1 holds, but 2 or 3 do not, we have heteroskedastic or dependent errors.
Then the variance estimated using the standard formula is wrong
A heteroskedasticity-robust estimator of the correct formula for the VCE of the OLS estimator is

V̂robust[β̂] = (X′X)^{−1} [N/(N − K)] ∑_{i=1}^N ûi² xi x′i (X′X)^{−1}. (5)

Note that we now have a correction factor N/(N − K).
For cross-section data the above “robust estimator” is widely used as the default variance matrix estimate in most applied work
In Stata a robust estimate of the VCE is obtained using the vce(robust) option of the regress command. Related better options are vce(hc2) and vce(hc3).
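A sketch of (5), continuing the same numpy example; the N/(N − K) correction corresponds to the HC1 estimator reported by Stata's vce(robust):

    XtX_inv = np.linalg.inv(X.T @ X)
    meat = (X * u_hat[:, None] ** 2).T @ X               # sum_i u_i^2 x_i x_i'
    V_robust = XtX_inv @ (N / (N - K) * meat) @ XtX_inv  # sandwich form of (5)
    se_robust = np.sqrt(np.diag(V_robust))               # robust standard errors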
Objectives of econometric model
1. Data description and summary of associations between variables (including data mining)
2. Conditional prediction and policy analysis, prospective and retrospective
– Simulation of counter-factual scenarios to address “what if” type questions
– Analysis of interventions, both actual and hypothetical
3. Estimation of causal (“structural”, “key”) parameters
– Inference about structural parameters and interdependence between endogenous variables
4. Empirical confirmation or refutation of hypotheses regarding microeconomic behavior.
When assumptions fail
“All models are lies but they get us closer to the truth.”
A specified/assumed model is a “pseudo-true” model, our approximation to the unknown d.g.p.
Goal: Get the best estimates of the assumed model (usually an approximation)
Use diagnostic checks to see if the approximation can be improved
Common failures
Omitted variable bias (OVB) is unavoidable. So how to interpret OLS estimates?
Suppose the correct regression is y = Xβ + Zγ + u but Z is incorrectly omitted.
Consequences – Modeling objectives 2-4 are affected but not necessarily 1!
β̂ = (X′X)^{−1}X′y is biased since E[β̂|X, Z] = β + (X′X)^{−1}X′Zγ, where the second term measures the bias
β̂ suffers from confounding (i.e., its value depends on Zγ) and β is not identified
However, Xβ̂ may still be useful for conditional prediction
Some common misspecifications
Potentially a very long list. The most important are:
1. Omitted variables (unobserved factors that affect economic behavior – e.g., business confidence)
2. Misspecified functional forms (departures from linearity)
3. Ignoring endogenous regressors
4. Ignoring measurement errors in regressors
5. Ignoring violations of “classical” assumptions (heteroskedasticity, serial and cross-section dependence)
Regression diagnostics and tests
It is usual to apply diagnostic checks of model specification. A standard modeling cycle has four steps:
specification → estimation → diagnostics → re-estimation
Diagnostic checks involve testing a restricted model against a less restricted model
Ex. 1: fewer regressors vs. more regressors (e.g. F-tests)
Ex. 2: homoskedastic errors vs. heteroskedastic errors (e.g. tests of homoskedasticity)
Ex. 3: nonlinear regression vs. linear regression (tests of nonlinearity)
Ex. 4: serially independent errors vs. dependent errors (tests of serial correlation)
Regression is almost always followed by postregression analysis involving diagnostics
Structural vs reduced form models
Very highly structured models, derived from detailed specification of: underlying economic behavior; institutional set-up, constraints and administrative information; statistical and functional form assumptions; and agents’ optimizing behavior
Reduced form studies which aim to uncover correlations and associations among variables
Hybrid models that have some elements of structural models but do not necessarily assume optimizing behavior.
An example of Mincerian earnings regression
ln E = β0 + β1 yreduc + β2 age + β3 occ + x′γ + ε
1. Does this regression equation (with perhaps a small number of regressors) provide a good fit to the sample data? Is the fit improved by adding age² to the regression? [Data description]
2. Is the regression equation a good predictor of earnings at different ages and occupations? [Conditional prediction]
3. What does the regression say about the rate of return to an extra year of education? [Structural or causal parameter]
4. Can the regression be used to explain the sources of earnings differential between male and female workers? [Counterfactual scenario]
These seemingly different objectives are connected, but may imply differences in emphasis on various aspects of modeling
Regression decomposition – an example of counterfactual analysis
Consider the problem of explaining male-female earnings differential
Y_i^g = β_0^g + ∑_{k=1}^K x_ki β_k^g + ε_i^g, g = M, F

∆ = Ȳ^F − Ȳ^M = (β_0^F − β_0^M) + ∑_{k=1}^K (β_k^F − β_k^M) x̄_k^F + ∑_{k=1}^K (x̄_k^F − x̄_k^M) β_k^M + R
This is counterfactual analysis as it answers the question: what if certain differentials were equalized?
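A sketch of the decomposition in Python on simulated male/female samples (hypothetical coefficients, for illustration only; in this two-term split the remainder R is zero by construction, because OLS residuals average to zero when an intercept is included):

    import numpy as np

    rng = np.random.default_rng(1)
    ols = lambda X, y: np.linalg.solve(X.T @ X, X.T @ y)

    N = 500  # intercept + one characteristic per group
    X_M = np.column_stack([np.ones(N), rng.normal(1.0, 1.0, size=N)])
    X_F = np.column_stack([np.ones(N), rng.normal(0.8, 1.0, size=N)])
    y_M = X_M @ np.array([2.0, 0.5]) + rng.normal(size=N)
    y_F = X_F @ np.array([1.8, 0.4]) + rng.normal(size=N)

    b_M, b_F = ols(X_M, y_M), ols(X_F, y_F)
    xbar_M, xbar_F = X_M.mean(axis=0), X_F.mean(axis=0)

    coeff_part = (b_F - b_M) @ xbar_F     # differences in coefficients (incl. intercepts)
    endow_part = (xbar_F - xbar_M) @ b_M  # differences in average characteristics
    delta = y_F.mean() - y_M.mean()       # equals coeff_part + endow_part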
m-Estimation
We consider the very extensive topic of m-estimation. Almost all estimation methods used in this class are special cases of m-estimation.
Examples: Least squares (LS); generalized least squares (GLS); generalized method of moments (GMM); maximum likelihood (ML); quantile regression (QR)
Objective: Introduce key and useful asymptotic properties of m-estimators
Basic set-up and notation
Definitions
We define an m-estimator θ̂ of the q × 1 parameter vector θ as an estimator that maximizes an objective function that is a sum or average of N sub-functions

QN(θ) = (1/N) ∑_{i=1}^N q(yi, xi, θ), (6)

where q(·) is a scalar function, yi is the dependent variable, xi is a regressor vector (of exogenous variables) and we assume conditional independence over i.
Common properties of q(·): continuity and differentiability w.r.t. θ
m-estimation typically involves minimizing or maximizing a specified objective function defined in terms of data and unknown population parameters.
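A generic sketch of (6) in Python using scipy; q here is minus the squared error of a nonlinear regression (the NLS example from the table below), but any differentiable q(yi, xi, θ) could be substituted:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(2)
    N = 500
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    y = np.exp(X @ np.array([0.5, 1.0])) + rng.normal(size=N)  # exponential conditional mean

    # Q_N(theta) = (1/N) sum_i q(y_i, x_i, theta) with q_i = -(y_i - exp(x_i' theta))^2;
    # maximizing Q_N is the same as minimizing the mean squared error below
    def minus_QN(theta):
        return np.mean((y - np.exp(X @ theta)) ** 2)

    theta_hat = minimize(minus_QN, x0=np.zeros(2), method="BFGS").x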
m-estimation
Definition
The estimator θ̂ that solves the first-order conditions ∂QN(θ)/∂θ|θ̂ = 0, or equivalently

(1/N) ∑_{i=1}^N ∂q(yi, xi, θ)/∂θ|θ̂ = 0, (7)

is an m-estimator. This is a system of q estimating equations in q unknowns that does not necessarily have a closed-form solution for θ̂ in terms of the data (yi, xi, i = 1, …, N).
The term m-estimator is interpreted as an abbreviation for maximum-likelihood-like estimator.
Many econometricians define an m-estimator as optimizing over a sum of terms, as in (6).
Other authors define an m-estimator as solutions of equations such as (7).
Examples: MLE, GMM, OLS, NLS
Property | Algebraic formula

Objective function: QN(θ) = N^{−1} ∑_{i=1}^N q(yi, xi, θ) is maximized w.r.t. θ

Examples:
MLE: qi = ln f(yi|xi, θ) is the log-density
NLS: qi = −(yi − g(xi, θ))² is minus the squared error
MM: qi = (yi − g(xi, θ)) x′i xi (yi − g(xi, θ))

First-order conditions: ∂QN(θ)/∂θ = N^{−1} ∑_{i=1}^N ∂q(yi, xi, θ)/∂θ|θ̂ = 0.
Example
Univariate distribution: yi (i = 1, …, N) is a 1/0 binary variable generated by a Bernoulli trial with parameter π, which is the target parameter of interest

Method | Objective function | First-order condition
OLS | QN = (1/N) ∑_{i=1}^N (yi − π)² (minimized) | (1/N) ∑_{i=1}^N (yi − π) = 0
ML | QN = (1/N) ∑_{i=1}^N ln f(yi; π), where f(yi; π) = π^{yi}(1 − π)^{1−yi} | (1/N) ∑_{i=1}^N (yi − π) = 0
MM | QN = (1/N) ∑_{i=1}^N (yi − π)[π(1 − π)]^{−1}(yi − π) (minimized) | (1/N) ∑_{i=1}^N (yi − π) = 0

All three first-order conditions are solved by π̂ = ȳ, the sample mean.
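A quick numerical check in Python that the first-order conditions deliver the sample mean (simulated Bernoulli data):

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(3)
    y = rng.binomial(1, 0.3, size=1000)  # Bernoulli(pi = 0.3) sample

    # OLS: minimize the mean squared error in pi
    ols = minimize_scalar(lambda p: np.mean((y - p) ** 2),
                          bounds=(0.01, 0.99), method="bounded")
    # ML: maximize the average log-likelihood (minimize its negative)
    ml = minimize_scalar(lambda p: -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)),
                         bounds=(0.01, 0.99), method="bounded")
    # MM: the estimating equation (1/N) sum_i (y_i - pi) = 0 solves to the sample mean
    mm = y.mean()

    print(ols.x, ml.x, mm)  # all (numerically) equal to y.mean()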
Variance estimation for m-estimators
For all m-estimators we can obtain the expression for the stochastic error of the estimator.
We can then derive the expression for the asymptotic variance of the estimator.
Two approaches are possible
1. Derive the variance expression assuming that the errors are i.i.d. (restrictive)
2. Derive the variance expression assuming that the errors are heteroskedastic or serially correlated (less restrictive).
The second approach yields a robust variance estimator relative to the i.i.d. case
An example for least squares is given below.
Standard vs robust variance estimation
Standard version
Assume ui are i.i.d.: V[u|X] = σ²I_N
β̂ = (X′X)^{−1}X′y = (X′X)^{−1}X′(Xβ + u) = β + (X′X)^{−1}X′u
β̂ − β = (X′X)^{−1}X′u
V[β̂|X] = E[(X′X)^{−1}X′uu′X(X′X)^{−1}|X] = σ²(X′X)^{−1}
σ̂² = û′û/(N − K)
V̂[β̂] = σ̂²(X′X)^{−1}

Robust version (two-step)
Assume ui are not i.i.d.: V[u|X] = Ω ≠ σ²I_N
β̂ = (X′X)^{−1}X′y
V[β̂|X] = E[(X′X)^{−1}X′uu′X(X′X)^{−1}|X] = (X′X)^{−1}(X′ΩX)(X′X)^{−1}
X′Ω̂X = ∑_{i=1}^N ûi² xi x′i
V̂[β̂] = (X′X)^{−1}(∑_{i=1}^N ûi² xi x′i)(X′X)^{−1}
LS Properties
Given the i.i.d. assumption and exogeneity of regressors, the LS estimator is unbiased (and consistent) and efficient (Gauss-Markov theorem).
The linear predictor E[y|X = Xf] = Xf β is also the optimal predictor (unbiased and efficient)
The i.i.d. assumption is violated if errors are heteroskedastic or serially correlated, in which case
V[u] = Ω ≠ σ²I_N
Two possible structures for N = 5

Heteroskedastic, uncorrelated errors (diagonal Ω):

Ω_{N×N} = [ σ1²  0    0    0    0
            0    σ2²  0    0    0
            0    0    σ3²  0    0
            0    0    0    σ4²  0
            0    0    0    0    σ5² ]

Heteroskedastic errors with first-order serial correlation (banded Ω):

Ω_{N×N} = [ σ1²  σ12  0    0    0
            σ12  σ2²  σ23  0    0
            0    σ23  σ3²  σ34  0
            0    0    σ34  σ4²  σ45
            0    0    0    σ45  σ5² ]
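The two structures are easy to build in numpy (the variances and covariances below are hypothetical placeholders):

    import numpy as np

    sig2 = np.array([1.0, 1.5, 0.8, 1.2, 2.0])  # sigma_i^2 on the diagonal
    Omega_het = np.diag(sig2)                    # heteroskedastic, uncorrelated

    cov = np.array([0.3, 0.2, 0.4, 0.5])         # sigma_{i,i+1} just off the diagonal
    Omega_band = np.diag(sig2) + np.diag(cov, 1) + np.diag(cov, -1)  # adds serial correlation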
Properties of OLS vs. GLS
With V[u] = Ω ≠ σ²I_N, OLS is consistent but the GLS estimator is more efficient.
Two alternatives are: (i) use feasible two-step GLS, or (ii) use the robustified estimator of V[β̂], which requires fewer assumptions.
The robustified variance estimator is the “sandwich estimator” which can be computed in two steps.
The idea behind robust variance estimator can be extended to other M-estimators.
Generalized Least Squares Estimator
GLS
β̂_GLS = (X′Ω^{−1}X)^{−1}X′Ω^{−1}y
       = (X′Ω^{−1}X)^{−1}X′Ω^{−1}(Xβ + u)
       = β + (X′Ω^{−1}X)^{−1}X′Ω^{−1}u
β̂_GLS − β = (X′Ω^{−1}X)^{−1}X′Ω^{−1}u
V[β̂_GLS|X, Ω] = E[(β̂_GLS − β)(β̂_GLS − β)′|X, Ω] = (X′Ω^{−1}X)^{−1}

FGLS
β̂ = (X′X)^{−1}X′y is consistent
Assume Ω = Ω(θ); θ can be consistently estimated given β̂
Ω̂ = Ω(θ̂)
β̂_FGLS = (X′Ω̂^{−1}X)^{−1}X′Ω̂^{−1}y
V̂[β̂_FGLS] = (X′Ω̂^{−1}X)^{−1}
β̂_FGLS →p β̂_GLS because Ω̂ →p Ω
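A two-step FGLS sketch in Python for the heteroskedastic case, assuming (purely for illustration) the skedastic function Var[ui|xi] = exp(x′iθ):

    import numpy as np

    rng = np.random.default_rng(4)
    N = 500
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    sd = np.exp(0.5 * (X @ np.array([0.2, 0.8])))  # so Var[u|x] = exp(x'theta)
    y = X @ np.array([1.0, 2.0]) + sd * rng.normal(size=N)

    # Step 1: consistent OLS, then estimate theta by regressing ln(u_hat^2) on x
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    u = y - X @ b_ols
    theta_hat = np.linalg.solve(X.T @ X, X.T @ np.log(u ** 2))

    # Step 2: GLS with Omega_hat = diag(exp(x'theta_hat)); weights are Omega_hat^{-1}
    w = np.exp(-(X @ theta_hat))
    b_fgls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))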
Why m-estimation?
Large sample optimality of m-estimators: consistency and asymptotic normality
Property | Algebraic formula
Consistency | Is plim QN(θ) maximized at θ = θ0?
Consistency (informal) | Does E[∂q(yi, xi, θ)/∂θ|θ0] = 0?
Limit distribution | √N(θ̂ − θ0) →d N[0, A0^{−1} B0 A0^{−1}], where
  A0 = plim N^{−1} ∑_{i=1}^N ∂²qi(θ)/∂θ∂θ′|θ0
  B0 = plim N^{−1} ∑_{i=1}^N ∂qi/∂θ × ∂qi/∂θ′|θ0
Asymptotic distribution | θ̂ ∼a N[θ0, N^{−1} Â^{−1} B̂ Â^{−1}], where
  Â = N^{−1} ∑_{i=1}^N ∂²qi(θ)/∂θ∂θ′|θ̂
  B̂ = N^{−1} ∑_{i=1}^N ∂qi/∂θ × ∂qi/∂θ′|θ̂
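A numerical sketch of the sandwich formula for the Bernoulli ML example above, where qi = yi ln π + (1 − yi) ln(1 − π) and the derivatives are available in closed form:

    import numpy as np

    rng = np.random.default_rng(5)
    y = rng.binomial(1, 0.3, size=1000)
    N, p = len(y), y.mean()  # ML estimate of pi is the sample mean

    score = y / p - (1 - y) / (1 - p)            # dq_i/dpi at p
    hess = -y / p ** 2 - (1 - y) / (1 - p) ** 2  # d^2 q_i/dpi^2 at p

    A_hat = hess.mean()               # A = N^{-1} sum_i second derivatives
    B_hat = np.mean(score ** 2)       # B = N^{-1} sum_i squared scores
    V_hat = B_hat / (N * A_hat ** 2)  # N^{-1} A^{-1} B A^{-1}, scalar case
    # For a correctly specified likelihood A_hat is approximately -B_hat,
    # so V_hat is approximately p*(1 - p)/N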