Final Exam Format
Two hours (10 minutes reading time).
Closed book, approved calculators permitted.
Four questions, each worth 20-30 marks.
A mixture of analytical and practical (approx 50:50), sometimes both within the same question.
Analytical component will be in the same spirit as the assignments.
Practical component will involve describing how one implements specific methods and interpreting STATA output.
Knowledge of STATA code/syntax will not be examined.
ECON6300/7320/8300 Advanced Microeconometrics: Nonparametric Methods
Christiern Rose, University of Queensland
Outline:
Motivation
Nonparametric Density Estimation
Nonparametric Regression
Semiparametric Regression
Motivation:
The underlying decision problem is to choose

m(X) := argmin_{g ∈ G} E[(Y − g(X))²] = E[Y|X]

where G is the space of functions defined on the support of X.
If we know m(X) up to a finite-dimensional parameter θ ∈ Θ ⊂ R^K, we may substantially restrict G.
If we knew m(X) = g(X, θ), we may consider the semi-parametric model

G_θ := {g(X, θ); θ ∈ Θ} ⊂ G

A widely used example: g(X, θ) = Xθ.
Motivation:
The prior information m(X) ∈ G_θ improves efficiency, if it is correct.
But if we use m(X) ∈ G_θ when it is wrong, our analysis may be invalid.
Often, economic theory is silent about the functional form of m(X).
We wish to have a method that is robust to functional form assumptions.
For this reason, we study nonparametric analysis.
We do density estimation first and then regression.
Density Estimation:
Let Y1, …, YN be a random sample from an unknown distribution.
We wish to learn the distribution of Yi from the random sample.
We may assume Y1, …, YN ~ iid f(·|θ) with unknown θ ∈ Θ.
For example, the normal model has θ = (μ, σ²) ∈ Θ = R × R++.
In most cases, we do not know the correct parametric family, so our model is likely misspecified and may be very different from the true model.
We want our analysis to be robust to any misspecification.
Density Estimation: Histogram
We start with a histogram.
Let y_i be the realisation of the random variable Y_i in the sample.
To draw a histogram, we choose an increasing sequence of equidistant points {a_0, a_1, …, a_J} such that a_0 < min{y_i}_{i=1}^N and a_J > max{y_i}_{i=1}^N.
The histogram, {H_1, …, H_J}, counts the number of observations in each sub-interval (a_{j−1}, a_j], i.e.,

H_j := Σ_{i=1}^N 1(y_i ∈ (a_{j−1}, a_j])

By normalising {H_j}_{j=1}^J, we may have a naive density estimate (see the sketch below).
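A minimal sketch of these histogram counts, assuming Python with NumPy; the simulated sample, the number of bins J and the bin edges are illustrative choices rather than part of the slides.

```python
# Sketch: histogram counts H_j over equidistant bins (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)                               # hypothetical sample, N = 200
N = len(y)

J = 10                                                 # number of sub-intervals
a = np.linspace(y.min() - 0.1, y.max() + 0.1, J + 1)   # a_0 < min(y_i), a_J > max(y_i)

# H_j = number of observations falling in (a_{j-1}, a_j]
H = np.array([np.sum((y > a[j - 1]) & (y <= a[j])) for j in range(1, J + 1)])

# Normalise so the step function integrates to one: a naive density estimate
width = a[1] - a[0]
f_hat = H / (N * width)
```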
Density Estimation: Histogram
Density Estimation: Histogram
The density estimate based on {H_j}_{j=1}^J provides useful information on the data distribution.
small J → uninformative, but large J → noisy (too bumpy)
J has to be appropriately chosen (depending on N).
A good choice of J should increase as N grows
The histogram estimate is a step function, but the true density may not be a step function.
This density estimate cannot be used for any problem where a derivative of density plays a critical role, e.g., auction models.
We wish to have a ‘smooth’ density estimate.
Kernel Density Estimation
A kernel K(·) is a density with mean zero.
The density φ(·) of N(0, 1) can be used as a kernel (Gaussian kernel). The density of N(y, h²) can be written as

f_{y,h}(z) = (1/h) φ((z − y)/h)

Similarly, when a kernel is shifted by y_i and rescaled by h, we have

K_{y_i,h}(z) := (1/h) K((z − y_i)/h)

A kernel density estimate (KDE) at a point y is

f̂(y) := (1/N) Σ_{i=1}^N (1/h) K((y − y_i)/h)

When we have a bell-shaped kernel, the KDE puts high weight on data near the point y and small weight on data far from y. A sketch implementation is given below.
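A minimal sketch of this KDE formula, assuming Python with NumPy, a Gaussian kernel and an arbitrarily chosen bandwidth h (all illustrative):

```python
# Sketch: f_hat(y) = (1/(N*h)) * sum_i K((y - y_i)/h)
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde(points, data, h):
    # evaluate the KDE at each entry of `points`
    u = (points[:, None] - data[None, :]) / h
    return gaussian_kernel(u).mean(axis=1) / h

rng = np.random.default_rng(0)
y = rng.normal(size=500)          # hypothetical sample
grid = np.linspace(-3, 3, 121)
f_hat = kde(grid, y, h=0.3)       # h chosen for illustration only
```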
Kernel Density Estimation
We choose a kernel K (·) and bandwidth (smoothing parameter) h.
Some examples of kernels (a code sketch follows the list):
Uniform kernel: K(u) = (1/2) · 1(|u| < 1)
Triangular kernel: K(u) = (1 − |u|) · 1(|u| < 1)
Epanechnikov kernel: K(u) = (3/4)(1 − u²) · 1(|u| < 1)
Gaussian kernel: K(u) = (1/√(2π)) exp(−u²/2), u ∈ R
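The four kernels above, written as NumPy functions (a sketch; the indicator 1(|u| < 1) is implemented as a boolean factor):

```python
import numpy as np

def uniform(u):
    return 0.5 * (np.abs(u) < 1)

def triangular(u):
    return (1 - np.abs(u)) * (np.abs(u) < 1)

def epanechnikov(u):
    return 0.75 * (1 - u ** 2) * (np.abs(u) < 1)

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

# Each is a density with mean zero, so each is a valid kernel.
```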
Kernel Density Estimation
Kernel Density Estimation
Different kernels weight the sample points differently.
The uniform kernel puts equal weight on all points within the support [y − h, y + h].
The triangular kernel puts weight that decreases linearly with the distance from y on [y − h, y + h].
The Epanechnikov kernel is bell-shaped on the support [y − h, y + h].
The Gaussian kernel is bell-shaped on the entire real line.
A smooth kernel like Epanechnikov or Gaussian is often preferred.
Kernel Density Estimation
We generate a random sample of size 100 drawn from N(0, 25²) and fit the data by KDEs with different kernels (all with h = 12.5).
Kernel Density Estimation
Choice of h is much more important than choice of K(·).
Large h → uninformative, but small h → noisy (too bumpy).
Kernel Density Estimation
The trade-off is common to all kernels.
A good choice of h decreases as N increases
Kernel Density Estimation
The bias of the kernel density estimator is

Bias(y) := E[f̂(y)] − f(y) = (1/2) h² f″(y) ∫ z² K(z) dz

which depends on the bandwidth, the curvature of the true density and the kernel.
The bias is increasing in h, and disappears asymptotically if h → 0 as N → ∞.
The variance is

V[f̂(y)] ≈ (1/(Nh)) f(y) ∫ K(z)² dz

which depends on the sample size, the bandwidth, the true density and the kernel.
The variance is decreasing in h, and disappears asymptotically if Nh → ∞ as N → ∞. That is, h must go to 0 at a slower rate than 1/N.
Kernel Density Estimation
Provided that both the bias and variance terms disappear asymptotically (i.e. for appropriate h), the kernel estimator is pointwise consistent,

f̂(y) →p f(y)

at a point y.
To obtain uniform consistency we need

sup_y |f̂(y) − f(y)| →p 0

which can be shown to occur if Nh/ln N → ∞. This requires a larger h than pointwise consistency.
The limit distribution is

√(Nh) ( f̂(y) − f(y) − Bias(y) ) →d N( 0, f(y) ∫ K(z)² dz )
Kernel Density Estimation
Optimal choice of bandwidth?
We frequently represent our loss using the mean squared error (MSE). For a parametric model,

MSE(θ̂) := E[(θ̂ − θ)²] = V(θ̂) + Bias(θ̂)²

The nonparametric estimate f̂(y) is a function of y, and so is MSE(f̂(y)).
Thus, we use the overall MSE after integrating y out. The mean integrated squared error (MISE) is given as

MISE[f̂] := E[ ∫ (f̂(y) − f(y))² dy ] = ∫ V[f̂(y)] dy + ∫ Bias[f̂(y)]² dy

The optimal bandwidth h* minimises the MISE by balancing the variance and bias. But h* is infeasible because f(y) is unknown.
Kernel Density Estimation
Let s be the sample standard deviation.
Silverman (1986) proposed the rule-of-thumb

h* := 1.059 s N^(−1/5)

which is optimal when the reference distribution is N(μ, σ²).
Kernel Density Estimation
Silverman’s rule may result in an inaccurate estimate if the true distribution is not close to the normal, in which case
∗
−1/5 qˆ.75 − qˆ.25 min s, 1.349
h
is often used, where qˆ.75 and qˆ.25 are sample percentiles
:= 1.3643δN
and δ is a constant depending on the kernel.
Values of δ are: 1.3510 (Uniform), 1.7188 (Epanechnikov), 2.0362 (Quartic/biweight), 2.3122 (Triweight), 0.7764 (Normal).
The optimal bandwidth converges slowly to zero.
There is no commonly accepted way to choose h and prior beliefs on smoothness are often exploited in practice.
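A sketch of the two rule-of-thumb bandwidths above in Python with NumPy, using the Gaussian-kernel constant δ = 0.7764; the simulated sample is illustrative only.

```python
# Sketch: Silverman's rule and the robust rule-of-thumb bandwidth.
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=400)                       # hypothetical sample
N = len(y)

s = y.std(ddof=1)                              # sample standard deviation
q75, q25 = np.percentile(y, [75, 25])          # sample 75th and 25th percentiles

h_silverman = 1.059 * s * N ** (-1 / 5)

delta = 0.7764                                 # kernel-dependent constant (Gaussian)
h_robust = 1.3643 * delta * N ** (-1 / 5) * min(s, (q75 - q25) / 1.349)
```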
Kernel Density Estimation: joint density
Suppose we observe a random sample (y_i, x_i)_{i=1}^N drawn from an unknown bivariate distribution.
Then, a natural extension of the univariate KDE would be

f̂(x, y) = (1/N) Σ_{i=1}^N (1/h) K((x − x_i)/h) (1/b) K((y − y_i)/b)

Similarly, we may obtain a higher-dimensional KDE.
However, the required N rapidly increases ⇒ curse of dimensionality.
A density with dimensionality greater than 3 or 4 should be parametrised for any realistic sample size.
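A minimal sketch of the bivariate product-kernel KDE above, assuming Gaussian kernels and illustrative bandwidths h (for x) and b (for y):

```python
# Sketch: f_hat(x0, y0) = (1/N) * sum_i (1/h) K((x0 - x_i)/h) * (1/b) K((y0 - y_i)/b)
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde2(x0, y0, x, y, h, b):
    return np.mean(gaussian((x0 - x) / h) / h * gaussian((y0 - y) / b) / b)

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 1 + 0.5 * x + rng.normal(size=300)         # hypothetical bivariate sample
f_at_point = kde2(0.0, 1.0, x, y, h=0.4, b=0.4)
```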
Kernel Density Estimation: conditional density
The conditional density can be estimated by

f̂(y|x) = f̂(x, y) / f̂(x)
        = [ (1/N) Σ_{i=1}^N (1/h) K((x − x_i)/h) (1/b) K((y − y_i)/b) ] / [ (1/N) Σ_{i=1}^N (1/h) K((x − x_i)/h) ]
        = Σ_{i=1}^N ω_{i,h}(x) (1/b) K((y − y_i)/b)

where

ω_{i,h}(x) := K((x − x_i)/h) / Σ_{j=1}^N K((x − x_j)/h)
Kernel Density Estimation: conditional expectation
The conditional expectation can be nonparametrically estimated by

Ê[Y|X = x] := ∫ y f̂(y|x) dy
            = ∫ y Σ_{i=1}^N ω_{i,h}(x) (1/b) K((y − y_i)/b) dy
            = Σ_{i=1}^N ω_{i,h}(x) ∫ y (1/b) K((y − y_i)/b) dy
            = Σ_{i=1}^N ω_{i,h}(x) y_i

since (1/b) K((y − y_i)/b) is a density with mean y_i.
Nonparametric Regressions
Now, recall that we want to estimate the regression function

m(X) := argmin_{g ∈ G} E[(Y − g(X))²] = E[Y|X]

where G is the space of functions defined on the support of X.
We have just learned that a nonparametric regression estimate is given by

m̂(x) = Σ_{i=1}^N ω_{i,h}(x) y_i

where

ω_{i,h}(x) = K((x − x_i)/h) / Σ_{j=1}^N K((x − x_j)/h)

This estimator is called the Nadaraya-Watson (N-W) estimator. A sketch implementation follows.
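A minimal sketch of the Nadaraya-Watson estimator, assuming Python with NumPy, a Gaussian kernel and simulated data (all illustrative):

```python
# Sketch: m_hat(x0) = sum_i omega_i(x0) * y_i with kernel weights omega_i.
import numpy as np

def gaussian(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x0, x, y, h):
    w = gaussian((x0 - x) / h)     # unnormalised kernel weights
    w = w / w.sum()                # omega_i(x0): weights summing to one
    return np.sum(w * y)

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=300)
y = np.sin(x) + 0.3 * rng.normal(size=300)     # hypothetical data with m(x) = sin(x)
grid = np.linspace(-2, 2, 81)
m_hat = np.array([nadaraya_watson(x0, x, y, h=0.3) for x0 in grid])
```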
Nonparametric Regressions
The N-W estimator is a weighted average of the y_i because Σ_{i=1}^N ω_{i,h}(x) = 1.
In particular, it puts large (small) weights on observations (y_i, x_i) with x_i close to (far from) the given point x.
N-W is a special case of the locally weighted average (LOWESS) estimator.
In the class of LOWESS estimators, there are many options for choosing the weights ω_{i,h}(x).
For example, a LOWESS estimate at x is the average of the y_i whose associated x_i are among the h × 100% of observations closest to the given x, where h ∈ (0, 1) (see the sketch after this list). Note: the Stata default is h = 0.8.
Alternatively, a LOWESS estimate at x is the average of the y_i whose associated x_i are among the h nearest to the given x, where h ∈ N.
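A minimal sketch of the first LOWESS variant above (the simple local average over the h × 100% of observations closest to x, not Stata's full lowess smoother); the span and the simulated data are illustrative.

```python
# Sketch: average of the y_i whose x_i are among the span*100% closest to x0.
import numpy as np

def lowess_mean(x0, x, y, span=0.8):
    k = max(1, int(round(span * len(x))))      # number of observations used
    idx = np.argsort(np.abs(x - x0))[:k]       # indices of the k closest x_i
    return y[idx].mean()

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=300)
y = np.sin(x) + 0.3 * rng.normal(size=300)     # hypothetical data
grid = np.linspace(-2, 2, 81)
m_hat = np.array([lowess_mean(x0, x, y, span=0.8) for x0 in grid])
```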
Nonparametric Regressions
Nonparametric Regressions
Just like KDE, h ↑ ⇒ Variance ↓ but Bias ↑. There is a trade-off. The optimal bandwidth h* should balance the variance and bias, but there is no commonly accepted rule for choosing the bandwidth.
Moreover, x_i should be low-dimensional: curse of dimensionality.
When the regressor is high-dimensional, we may consider a semi-parametric specification such as

m(x, z; β) = x′β + λ(z)

which has both a parametric part x′β and a nonparametric part λ(z).
Series Estimation
Series estimation is widely used for nonparametric analysis.
Let φ1(·), φ2(·), … be a sequence of (basis) functions such that

Σ_{j=1}^∞ θ_j φ_j(·)

approximates any function on the same domain for some θ1, θ2, ….
Then, if φ1(·), φ2(·), … are densities, by restricting the θ_j to be positive and to sum to one, we have a fully flexible density specification with infinite-dimensional θ.
If φ1(·), φ2(·), … are not densities, by normalising

exp( Σ_{j=1}^∞ θ_j φ_j(·) )

we may approximate any density.
Series Estimation
In practice, we cannot have infinitely many components. So, we often use a finite number K of components.
K determines the smoothness of the estimate, just as h does for the KDE.
If K is too small, the density estimate would not be informative.
If K is too large, the density estimate would be too noisy (bumpy).
The optimal number of components K should increase as the sample size increases.
Series Estimation
How to choose K?
Frequentist:
often use BIC or AIC, some data-driven method such as cross-validation, or prior information informally.
There is no universally accepted rule for choosing K.
In any case, computation of parameter estimates is for each given fixed K, but inference is nonparametric (slow convergence).
Bayesian:
regard K as a latent variable (parameter), put a prior on K with full support on N, and obtain the posterior of K as well as the other parameters using an MCMC method.
There is no need to choose an arbitrary K, since its posterior distribution is determined.
Series Estimation
There are many functions that can be used as basis functions {φj (·)}
Examples include
Polynomials: Legendre polynomials, Bernstein polynomials,
etc.
Splines: piecewise linear splines, B-splines, etc.
Densities: Beta densities (Bernstein polynomials), normal
densities, Gamma densities, etc.
Series Estimation: BPD
For example, the Bernstein polynomial density (BPD) of order K is given as

f(y | θ_1, …, θ_K) := Σ_{j=1}^K θ_j Beta(y; j, K − j + 1)

where the θ_j are all positive and sum to 1, and Beta(a, b) denotes the Beta density with parameters a and b, i.e., its mean is a/(a + b).
As K → ∞, the BPD approximates any absolutely continuous density on [0, 1]; see Petrone (1999) for a Bayesian nonparametric method using the BPD. A sketch evaluation follows.
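A minimal sketch evaluating a BPD of order K on a grid, assuming Python; the order K and the (uniform) weights θ_j are illustrative.

```python
# Sketch: f(y | theta) = sum_j theta_j * Beta(y; j, K - j + 1) on [0, 1].
import numpy as np
from math import gamma

def beta_pdf(y, a, b):
    return gamma(a + b) / (gamma(a) * gamma(b)) * y ** (a - 1) * (1 - y) ** (b - 1)

def bpd(y, theta):
    K = len(theta)
    return sum(theta[j - 1] * beta_pdf(y, j, K - j + 1) for j in range(1, K + 1))

K = 5
theta = np.full(K, 1 / K)              # illustrative weights: positive, sum to one
grid = np.linspace(0.01, 0.99, 99)
f_vals = bpd(grid, theta)
```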
Series Estimation: BPD
The basis functions are plotted below for (a) K = 3, (b) K = 4 and (c) K = 6.
[Figure: Beta-density basis functions on [0, 1] for K = 3, 4 and 6]
The BPD is a histogram smoothing; see Petrone (1999).
Series Estimation: normal mixture
Another example: normal densities can be used as basis functions,

f(y | {π_j, μ_j, σ_j}) = Σ_{j=1}^∞ π_j (1/σ_j) φ((y − μ_j)/σ_j)

where φ(·) is the PDF of N(0, 1).
The normal mixture approximates any absolutely continuous density. A sketch evaluation follows.
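A minimal sketch of a normal mixture density, truncating the infinite sum above at three illustrative components:

```python
# Sketch: f(y) = sum_j pi_j * (1/sigma_j) * phi((y - mu_j)/sigma_j)
import numpy as np

def normal_mixture_pdf(y, pi, mu, sigma):
    y = np.asarray(y, dtype=float)[..., None]
    phi = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(pi * phi / sigma, axis=-1)

pi = np.array([0.3, 0.5, 0.2])         # illustrative mixing weights (sum to one)
mu = np.array([-1.0, 0.0, 2.0])
sigma = np.array([0.5, 1.0, 0.8])
grid = np.linspace(-4, 5, 181)
f_vals = normal_mixture_pdf(grid, pi, mu, sigma)
```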