
Final Exam Format
- Two hours (plus 10 minutes reading time).
- Closed book; approved calculators permitted.
- Four questions, each worth 20-30 marks.
- A mixture of analytical and practical components (approximately 50:50), sometimes both within the same question.
- The analytical component will be in the same spirit as the assignments.
- The practical component will involve describing how one implements specific methods and interpreting STATA output.
- Knowledge of STATA code/syntax will not be examined.

ECON6300/7320/8300 Advanced Microeconometrics
Nonparametric Methods
Christiern Rose, University of Queensland

Outline:
- Motivation
- Nonparametric Density Estimation
- Nonparametric Regression
- Semiparametric Regression

Motivation:
- The underlying decision problem is to choose

  m(X) := argmin_{g ∈ G} E[(Y − g(X))²] = E[Y | X],

  where G is the space of functions defined on the support of X.
- If we know m(X) up to a finite-dimensional parameter θ ∈ Θ ⊂ R^K, we may substantially restrict G.
- If we knew m(X) = g(X, θ), we could consider the semiparametric model

  G_θ := {g(X, θ) : θ ∈ Θ} ⊂ G.

- A widely used example: g(X, θ) = Xθ.

Motivation:
- The prior information m(X) ∈ G_θ improves efficiency if it is correct.
- But if we impose m(X) ∈ G_θ when it is wrong, our analysis may be invalid.
- Often, economic theory is silent about the functional form of m(X).
- We would like methods that are robust to functional form assumptions.
- For this reason, we study nonparametric analysis.
- We cover density estimation first and then regression.

Density Estimation:
- Let Y_1, …, Y_N be a random sample from an unknown distribution.
- We wish to learn the distribution of Y_i from the random sample.
- We may assume Y_1, …, Y_N are iid draws from f(·|θ) with unknown θ ∈ Θ.
- For example, the normal model has θ = (μ, σ²) ∈ Θ = R × R_{++}.
- In most cases we do not know the correct parametric family, so our model is likely misspecified and may be very different from the true distribution.
- We want our analysis to be robust to such misspecification.

Density Estimation: Histogram
- We start with a histogram.
- Let y_i be the realisation of the random variable Y_i in the sample.
- To draw a histogram, we choose an increasing sequence of equidistant points {a_0, a_1, …, a_J} such that a_0 < min{y_i}_{i=1}^N and a_J > max{y_i}_{i=1}^N.
- The histogram, {H_1, …, H_J}, counts the number of observations in each sub-interval (a_{j−1}, a_j], i.e.,

  H_j := Σ_{i=1}^N 1(y_i ∈ (a_{j−1}, a_j]).

- By normalising {H_j}_{j=1}^J we obtain a naive density estimate, as in the sketch below.
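A minimal Python sketch of this normalised histogram; the helper name, the sample, the number of bins J and the small endpoint padding are illustrative choices, not part of the slides. Dividing each count by N times the bin width makes the step function integrate to one.

```python
import numpy as np

def histogram_density(y, J):
    """Naive density estimate: a histogram with J equal-width bins,
    normalised so that the step function integrates to one."""
    a = np.linspace(y.min() - 1e-9, y.max() + 1e-9, J + 1)  # grid with a_0 < min(y), a_J > max(y)
    H, _ = np.histogram(y, bins=a)                          # counts H_1, ..., H_J
    width = a[1] - a[0]
    return a, H / (len(y) * width)                          # divide by N * bin width

# Example: 500 standard normal draws, J = 20 bins
rng = np.random.default_rng(0)
edges, f_hat = histogram_density(rng.normal(size=500), J=20)
```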


Density Estimation: Histogram
- The density estimate based on {H_j}_{j=1}^J provides useful information on the data distribution.
- Small J → uninformative, but large J → noisy (too bumpy).
- J has to be chosen appropriately (depending on N).
- A good choice of J should increase as N grows.
- The histogram estimate is a step function, but the true density may not be a step function.
- This density estimate cannot be used for problems where a derivative of the density plays a critical role, e.g., auction models.
- We therefore want a 'smooth' density estimate.

Kernel Density Estimation
- A kernel K(·) is a density with mean zero.
- The density φ(·) of N(0, 1) can be used as a kernel (the Gaussian kernel). The density of N(y, h²) can be written as

  f_{y,h}(z) = (1/h) φ((z − y)/h).

- Similarly, when a kernel is shifted by y_i and rescaled by h, we have

  K_{y_i,h}(z) := (1/h) K((z − y_i)/h).

- The kernel density estimate (KDE) at a point y is:

  f̂(y) := (1/N) Σ_{i=1}^N (1/h) K((y − y_i)/h).

- When the kernel is bell-shaped, the KDE puts high weight on data near the point y and small weight on data far from y; see the sketch below.
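A minimal Python sketch of this estimator, assuming a Gaussian kernel; the function names are illustrative, and the sample and bandwidth mirror the N(0, 25²), h = 12.5 example used later in the slides.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K(.)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def kde(grid, data, h, kernel=gaussian_kernel):
    """f_hat(y) = (1/N) * sum_i (1/h) * K((y - y_i)/h), evaluated at each grid point."""
    u = (np.asarray(grid)[:, None] - np.asarray(data)[None, :]) / h  # (grid, N) array
    return kernel(u).mean(axis=1) / h

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=25.0, size=100)
f_hat = kde(np.linspace(-80, 80, 201), data, h=12.5)
```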

Kernel Density Estimation
- We choose a kernel K(·) and a bandwidth (smoothing parameter) h.
- Some examples of kernels:
  Uniform kernel: K(u) = (1/2) 1(|u| < 1)
  Triangular kernel: K(u) = (1 − |u|) 1(|u| < 1)
  Epanechnikov kernel: K(u) = (3/4)(1 − u²) 1(|u| < 1)
  Gaussian kernel: K(u) = (1/√(2π)) exp(−u²/2) 1(u ∈ R)

Kernel Density Estimation
- Different kernels weight the sample points differently.
- The uniform kernel puts equal weight on all points within the support [y − h, y + h].
- The triangular kernel puts weight decreasing linearly in the distance from y on [y − h, y + h].
- The Epanechnikov kernel is bell-shaped on the support [y − h, y + h].
- The Gaussian kernel is bell-shaped on the entire real line.
- A smooth kernel like the Epanechnikov or Gaussian is often preferred.

Kernel Density Estimation
- We generate a random sample of size 100 drawn from N(0, 25²) and fit the data by KDEs with different kernels (all with h = 12.5).

Kernel Density Estimation
- The choice of h is much more important than the choice of K(·).
- Large h → uninformative, but small h → noisy (too bumpy).

Kernel Density Estimation
- The trade-off is common to all kernels.
- A good choice of h decreases as N increases.

Kernel Density Estimation
- The bias of the kernel density estimator is

  Bias(y) := E[f̂(y)] − f(y) = (1/2) h² f''(y) ∫ z² K(z) dz,

  which depends on the bandwidth, the curvature of the true density and the kernel.
- The bias is increasing in h and disappears asymptotically if h → 0 as N → ∞.
- The variance is

  V[f̂(y)] ≈ (1/(Nh)) f(y) ∫ K(z)² dz,

  which depends on the sample size, the bandwidth, the true density and the kernel.
- The variance is decreasing in h and disappears asymptotically if Nh → ∞ as N → ∞, i.e. if h → 0 more slowly than 1/N.

Kernel Density Estimation
- Provided that both the bias and variance terms disappear asymptotically (i.e. for an appropriate h), the kernel estimator is pointwise consistent:

  f̂(y) →p f(y) at a point y.

- To obtain uniform consistency we need

  sup_y |f̂(y) − f(y)| →p 0,

  which can be shown to occur if Nh/ln N → ∞. This requires a larger h than pointwise consistency.
- The limit distribution is:

  √(Nh) (f̂(y) − f(y) − Bias(y)) →d N(0, f(y) ∫ K(z)² dz).

Kernel Density Estimation
- How should the bandwidth be chosen?
- We frequently represent our loss using the mean squared error (MSE). For a parametric model,

  MSE(θ̂) := E[(θ̂ − θ)²] = V(θ̂) + Bias(θ̂)².

- The nonparametric estimate f̂(y) is a function of y, and so is MSE(f̂(y)).
- Thus we use the overall MSE after integrating y out. The mean integrated squared error (MISE) is given as

  MISE[f̂] := E[∫ (f̂(y) − f(y))² dy] = ∫ V[f̂(y)] dy + ∫ Bias[f̂(y)]² dy.

- The optimal bandwidth h* minimises the MISE by balancing the variance and the bias. But h* is infeasible because f(y) is unknown.

Kernel Density Estimation
- Let s be the sample standard deviation.
- Silverman (1986) proposed the rule of thumb

  ĥ* := 1.059 s N^{−1/5},

  which is optimal when the reference distribution is N(μ, σ²).

Kernel Density Estimation
- Silverman's rule may result in an inaccurate estimate if the true distribution is not close to the normal, in which case

  ĥ* := 1.3643 δ N^{−1/5} min{ s, (q̂_.75 − q̂_.25)/1.349 }

  is often used, where q̂_.75 and q̂_.25 are the 75th and 25th sample percentiles and δ is a constant depending on the kernel.
- Values of δ are: 1.3510 (Uniform), 1.7188 (Epanechnikov), 2.0362 (Quartic/biweight), 2.3122 (Triweight), 0.7764 (Normal).
- The optimal bandwidth converges slowly to zero.
- There is no commonly accepted way to choose h, and prior beliefs about smoothness are often exploited in practice; a sketch of the two rules of thumb follows below.
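A minimal Python sketch of the two bandwidth rules of thumb above, assuming a Gaussian kernel (so δ = 0.7764 and 1.3643 × δ ≈ 1.059); the function names and the simulated data are illustrative.

```python
import numpy as np

def silverman_bandwidth(y):
    """Rule of thumb: h = 1.059 * s * N^(-1/5), optimal under a normal reference distribution."""
    y = np.asarray(y)
    return 1.059 * y.std(ddof=1) * len(y) ** (-1 / 5)

def robust_bandwidth(y, delta=0.7764):
    """Robust rule: h = 1.3643 * delta * N^(-1/5) * min(s, IQR/1.349).
    delta is kernel-specific; 0.7764 corresponds to the Gaussian kernel."""
    y = np.asarray(y)
    s = y.std(ddof=1)
    q75, q25 = np.percentile(y, [75, 25])
    return 1.3643 * delta * len(y) ** (-1 / 5) * min(s, (q75 - q25) / 1.349)

rng = np.random.default_rng(2)
y = rng.normal(0.0, 25.0, size=100)
h1, h2 = silverman_bandwidth(y), robust_bandwidth(y)  # similar values for roughly normal data
```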
Kernel Density Estimation: joint density
- Suppose we observe a random sample (y_i, x_i)_{i=1}^N drawn from an unknown bivariate distribution.
- Then a natural extension of the univariate KDE would be

  f̂(x, y) = (1/N) Σ_{i=1}^N [ (1/h) K((x − x_i)/h) ] [ (1/b) K((y − y_i)/b) ].

- Similarly, we may obtain a higher-dimensional KDE.
- However, the required N rapidly increases ⇒ curse of dimensionality.
- A density with dimensionality greater than 3 or 4 should be parametrised for any realistic sample size.

Kernel Density Estimation: conditional density
- The conditional density can be estimated by

  f̂(y|x) = f̂(x, y) / f̂(x)
          = [ (1/N) Σ_{i=1}^N (1/h) K((x − x_i)/h) (1/b) K((y − y_i)/b) ] / [ (1/N) Σ_{i=1}^N (1/h) K((x − x_i)/h) ]
          = Σ_{i=1}^N ω_{i,h}(x) (1/b) K((y − y_i)/b),

  where

  ω_{i,h}(x) := K((x − x_i)/h) / Σ_{i=1}^N K((x − x_i)/h).

Kernel Density Estimation: conditional expectation
- The conditional expectation can be nonparametrically estimated by

  Ê[Y | X = x] := ∫ y f̂(y|x) dy
                = ∫ y Σ_{i=1}^N ω_{i,h}(x) (1/b) K((y − y_i)/b) dy
                = Σ_{i=1}^N ω_{i,h}(x) ∫ y (1/b) K((y − y_i)/b) dy   (the integrand is a density with mean y_i)
                = Σ_{i=1}^N ω_{i,h}(x) y_i.

Nonparametric Regressions
- Now recall that we want to estimate the regression function

  m(X) := argmin_{g ∈ G} E[(Y − g(X))²] = E[Y | X],

  where G is the space of functions defined on the support of X.
- We have just learned that a nonparametric regression estimate is given by

  m̂(x) = Σ_{i=1}^N ω_{i,h}(x) y_i, where ω_{i,h}(x) = K((x − x_i)/h) / Σ_{i=1}^N K((x − x_i)/h).

- This estimator is called the Nadaraya-Watson estimator (a code sketch follows after these slides).

Nonparametric Regressions
- The N-W estimator is a weighted average of the y_i because Σ_{i=1}^N ω_{i,h}(x) = 1.
- In particular, it puts large (small) weight on observations (y_i, x_i) with x_i close to (far from) the given point x.
- N-W is a special case of the locally weighted average (LOWESS) class of estimators.
- Within the class of LOWESS estimators, there are many options for choosing the weights ω_{i,h}(x).
- For example, one LOWESS estimate at x is the average of the y_i whose associated x_i are among the h × 100% closest to the given x, where h ∈ (0, 1). Note: the Stata default is h = 0.8.
- Alternatively, a LOWESS estimate at x is the average of the y_i whose associated x_i are among the h nearest to the given x, where h ∈ N.

Nonparametric Regressions
- Just like the KDE, h ↑ ⇒ variance ↓ but bias ↑: there is a trade-off. The optimal bandwidth h* should balance the variance and the bias, but there is no commonly accepted rule for choosing the bandwidth.
- Moreover, x_i should be low-dimensional: curse of dimensionality.
- When the regressor is high-dimensional, we may consider a semiparametric specification such as

  m(x, z; β) = x'β + λ(z),

  which has both a parametric part x'β and a nonparametric part λ(z).
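A minimal Python sketch of the Nadaraya-Watson estimator from the preceding slides, assuming a Gaussian kernel; the simulated regression and the bandwidth are illustrative.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x_grid, x, y, h, kernel=gaussian_kernel):
    """m_hat(x) = sum_i w_i(x) * y_i with w_i(x) = K((x - x_i)/h) / sum_j K((x - x_j)/h)."""
    k = kernel((np.asarray(x_grid)[:, None] - np.asarray(x)[None, :]) / h)  # (grid, N)
    weights = k / k.sum(axis=1, keepdims=True)                              # each row sums to one
    return weights @ np.asarray(y)

# Example: noisy sine curve
rng = np.random.default_rng(3)
x = rng.uniform(0, 2 * np.pi, size=200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
m_hat = nadaraya_watson(np.linspace(0, 2 * np.pi, 100), x, y, h=0.4)
```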
Series Estimation
- Series estimation is widely used for nonparametric analysis.
- Let φ_1(·), φ_2(·), … be a sequence of (basis) functions such that

  Σ_{j=1}^∞ θ_j φ_j(·)

  approximates any function on the same domain for some θ_1, θ_2, ….
- If φ_1(·), φ_2(·), … are densities, then by restricting the θ_j to be positive and to sum to one we have a fully flexible density specification with an infinite-dimensional θ.
- If φ_1(·), φ_2(·), … are not densities, then by normalising

  exp( Σ_{j=1}^∞ θ_j φ_j(·) )

  we may approximate any density.

Series Estimation
- In practice we cannot use infinitely many components, so we often use a finite number K of components.
- K determines the smoothness of the estimate, like h for the KDE.
- If K is too small, the density estimate will be uninformative. If K is too large, the density estimate will be too noisy (bumpy).
- The optimal number of components K should increase as the sample size increases.

Series Estimation
- How to choose K?
- Frequentist:
  - Often use BIC or AIC, some data-driven method such as cross-validation, or even prior information used informally.
  - There is no universally accepted rule for choosing K.
  - In any case, parameter estimates are computed for each given fixed K, but inference is nonparametric (slow convergence).
- Bayesian:
  - Regard K as a latent variable (parameter), put a prior on K with full support N, and obtain the posterior of K as well as of the other parameters using an MCMC method.
  - One does not choose an arbitrary K, since its posterior distribution is determined.

Series Estimation
- There are many functions that can be used as basis functions {φ_j(·)}.
- Examples include:
  - Polynomials: Legendre polynomials, Bernstein polynomials, etc.
  - Splines: piecewise linear splines, B-splines, etc.
  - Densities: Beta densities (Bernstein polynomials), normal densities, Gamma densities, etc.

Series Estimation: BPD
- For example, the Bernstein polynomial density (BPD) of order K is given as

  f(y | θ_1, …, θ_K) := Σ_{j=1}^K θ_j Beta(j, K − j + 1),

  where the θ_j are all positive and sum to 1, and Beta(a, b) denotes the Beta density with parameters a and b, i.e., its mean is a/(a + b). A code sketch appears at the end of this section.
- As K → ∞, the BPD approximates any absolutely continuous density on [0, 1]; see Petrone (1999) for a Bayesian nonparametric method using the BPD.

Series Estimation: BPD
- The basis functions are plotted below for (a) k = 3, (b) k = 4 and (c) k = 6. [Figure]
- The BPD is a histogram smoothing; see Petrone (1999).

Series Estimation: normal mixture
- Another example: normal densities can be used as basis functions,

  f(y | {π_j, μ_j, σ_j}) = Σ_{j=1}^∞ π_j (1/σ_j) φ((y − μ_j)/σ_j),

  where φ(·) is the PDF of N(0, 1).
- The normal mixture approximates any absolutely continuous density.
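A minimal Python sketch of the two series density specifications above: a Bernstein polynomial density of order K and a finite truncation of the normal mixture. The function names, the weight vectors and the grid are illustrative choices, not from the slides.

```python
import numpy as np
from scipy.stats import beta, norm

def bernstein_density(y, theta):
    """BPD of order K = len(theta): f(y) = sum_j theta_j * Beta(j, K - j + 1) density at y.
    theta must be non-negative and sum to one."""
    K = len(theta)
    y = np.asarray(y)
    return sum(theta[j - 1] * beta.pdf(y, j, K - j + 1) for j in range(1, K + 1))

def normal_mixture_density(y, pi, mu, sigma):
    """Finite truncation of the mixture f(y) = sum_j pi_j * (1/sigma_j) * phi((y - mu_j)/sigma_j)."""
    y = np.asarray(y)
    return sum(p * norm.pdf(y, loc=m, scale=s) for p, m, s in zip(pi, mu, sigma))

grid = np.linspace(0.01, 0.99, 99)
f_bpd = bernstein_density(grid, theta=[0.2, 0.5, 0.3])  # K = 3 components
f_mix = normal_mixture_density(grid, pi=[0.6, 0.4], mu=[0.3, 0.7], sigma=[0.10, 0.05])
```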