Non-Parametrics
Chris Hansman
Empirical Finance: Methods and Applications Imperial College Business School
March 1-2, 2021
1/46
Non-Parametrics
1. Kernel Density Estimation
2. Non-Parametric Regression
2/46
Kernel Density Estimation
1. Parametric vs. non-parametric approaches
2. Histograms and the uniform kernel
3. Different bandwidths
4. Different kernels
3/46
Estimating Densities
Suppose we see n=100 draws from a continuous random variable X: x1,x2 ··· ,xn
We are often interested in the distribution of X:
CDF: F_X(u) = P(X \le u)
PDF: f_X(u) = \frac{dF_X(u)}{du}
How do we uncover the distribution of X from the data?
4/46
Scatter Plot of x1,x2 ··· ,xn
5/46
Estimating Densities
How do we uncover the distribution of X from the data? x1,x2 ··· ,xn
Parametric Approach
One strategy is to assume we know the form of the distribution
e.g. Normal or χ2
But we don’t know the particular parameters:
Use the data to estimate the unknown parameters
For example: we know X ∼ N(μ,σ2), but we don’t know μ or σ2
Estimate: \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i
Estimate: \hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2
Plot N(\hat{\mu}, \hat{\sigma}^2)
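A minimal R sketch of this parametric approach (the simulated draws and the parameter values are illustrative stand-ins for the observed data):

# Simulate stand-in data, estimate mu and sigma, and plot the fitted normal density
set.seed(1)
x <- rnorm(100, mean = -0.75, sd = 9.24)          # placeholder for the observed x_1, ..., x_n

mu_hat    <- mean(x)                              # sample mean
sigma_hat <- sd(x)                                # sample sd (divides by n - 1)

u <- seq(min(x) - 5, max(x) + 5, length.out = 200)
plot(u, dnorm(u, mean = mu_hat, sd = sigma_hat), type = "l",
     xlab = "u", ylab = "density", main = "Fitted normal density")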
6/46
Normal Density with Estimated μˆ = −0.75, σˆ = 9.24
7/46
Downsides of the Parametric Approach
In practice, we often don’t know the underlying distribution
e.g. the assumption of normality may provide a very bad fit
Non-parametric Approach
No assumptions about underlying distribution
Recover directly from the data
Simplest form: histogram
8/46
Histogram Built from x1, x2 · · · , xn
9/46
Histograms
Histograms (appropriately scaled) provide a non-parametric estimate of the density
But a few downsides:
Doesn’t provide a smooth, continuous distribution
Lots of holes in the distribution when the bin size is small
Uninformative when bins are big
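In R, a density-scaled histogram can be drawn with hist(); a sketch using the simulated x from above (the bin size of 5 is just one illustrative choice):

# Histogram as a non-parametric density estimate: freq = FALSE scales the bars
# so that the total area equals 1
hist(x, breaks = seq(min(x) - 5, max(x) + 5, by = 5), freq = FALSE,
     xlab = "x", main = "Histogram, bin size = 5")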
10/46
Histogram Built from x1,x2 ··· ,xn: Bin Size=1
11/46
Histogram Built from x1,x2 ··· ,xn: Bin Size=5
12/46
Histogram Built from x1,x2 ··· ,xn: Bin Size=20
13/46
Kernel Density Estimation: Uniform Kernel
To uncover smoother non-parametric densities we use a technique called kernel density estimation
Many different versions (“choices of kernel”), but let’s start with one very similar to a histogram
Suppose we are interested in estimating \hat{f}(u) for any u
First, let’s count how many x_i are “near” u
We’ll define “near” as within 1/2 of u in either direction:
number of x_i near u = \sum_{i=1}^{n} 1\{|u - x_i| \le 1/2\}
[Figure: the interval from u - 1/2 to u + 1/2 around the point u]
14/46
Kernel Density Estimation: Uniform Kernel
To turn this count into a density, just scale by n:
\hat{f}(u) = \frac{1}{n}\sum_{i=1}^{n} 1\{|u - x_i| \le 1/2\}
The number of x_i near u (per unit of x), scaled by the number of observations n
A density
Note that \int_{-\infty}^{\infty} \hat{f}(u)\,du = 1
15/46
Kernel Density Estimation: Uniform Kernel
16/46
Kernel Density Estimation: Uniform Kernel
Naturally, can adjust the definition of “near” depending on the context
For example, define “near” as within 1 of u in either direction:
number of x_i near u = \sum_{i=1}^{n} 1\{|u - x_i| \le 1\}
Doubling “near” ⇒ divide by 2 to keep things comparable:
\frac{1}{2}\sum_{i=1}^{n} 1\{|u - x_i| \le 1\}
Number of x_i near u per unit of x. To get a density:
\hat{f}(u) = \frac{1}{n}\sum_{i=1}^{n} \frac{1\{|u - x_i| \le 1\}}{2}
17/46
Kernel Density Estimation: Uniform Kernel
We call the function:
K(z) = \frac{1}{2}\,1\{|z| \le 1\}
the uniform (or box, or rectangular) kernel
Note that above we evaluate:
K(u - x_i) = \frac{1}{2}\,1\{|u - x_i| \le 1\}
We can write the density in terms of the kernel:
\hat{f}(u) = \frac{1}{n}\sum_{i=1}^{n} K(u - x_i)
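A hand-rolled version of this estimator in R (a sketch; the simulated data are an illustrative stand-in):

# Uniform (box) kernel and the corresponding estimate f_hat(u) = (1/n) * sum_i K(u - x_i)
x <- rnorm(100, -0.75, 9.24)                      # stand-in data
K_unif <- function(z) 0.5 * (abs(z) <= 1)         # K(z) = 1{|z| <= 1} / 2
f_hat  <- function(u, x) mean(K_unif(u - x))      # average kernel weight at u

u_grid <- seq(min(x), max(x), length.out = 200)
plot(u_grid, sapply(u_grid, f_hat, x = x), type = "l",
     xlab = "u", ylab = "density estimate")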
18/46
What Defines a Kernel?
Typically, a kernel is a function K(·) that satisfies two properties:
1. K(·) integrates to 1:
\int_{-\infty}^{\infty} K(z)\,dz = 1
2. Symmetry: K(-z) = K(z)
You can think of it as a weighting function
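Both properties are easy to check numerically in R; for example, for the Gaussian kernel (dnorm), a sketch:

# Property 1: the kernel integrates to 1
integrate(dnorm, lower = -Inf, upper = Inf)   # ~1 up to numerical error
# Property 2: symmetry, checked at an arbitrary point
all.equal(dnorm(-1.3), dnorm(1.3))            # TRUE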
19/46
Kernel Density Estimation: Different Bandwidths
K(u - x_i) = \frac{1}{2}\,1\{|u - x_i| \le 1\}
By adjusting the definition of “near” u, we get smoother densities
For example, define “near” as within 3:
number of x_i within 3 of u = \sum_{i=1}^{n} 1\left\{\frac{|u - x_i|}{3} \le 1\right\}
Average number of x_i near u (per unit):
\frac{\text{number of } x_i \text{ near } u}{\text{unit}} = \frac{1}{6}\sum_{i=1}^{n} 1\left\{\frac{|u - x_i|}{3} \le 1\right\} = \frac{1}{3}\sum_{i=1}^{n} K\!\left(\frac{u - x_i}{3}\right)
Then we can estimate the density as:
\hat{f}(u) = \frac{1}{n}\cdot\frac{1}{3}\sum_{i=1}^{n} K\!\left(\frac{u - x_i}{3}\right)
20/46
Uniform Kernel Density Estimation: Bandwidth=3
21/46
Kernel Density Estimation: Different Bandwidths
K(u - x_i) = \frac{1}{2}\,1\{|u - x_i| \le 1\}
In general, we can estimate our density as:
\hat{f}_h(u) = \frac{1}{n}\cdot\frac{1}{h}\sum_{i=1}^{n} K\!\left(\frac{u - x_i}{h}\right) = \frac{1}{n}\sum_{i=1}^{n} K_h(u - x_i), \qquad \text{where } K_h(z) = \frac{1}{h}K\!\left(\frac{z}{h}\right)
We call h the bandwidth
Larger bandwidth ⇒ smoother
Note that for any choice of h:
\int_{-\infty}^{\infty} \hat{f}_h(u)\,du = 1
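A generic bandwidth-h version in R, reusing K_unif and x from the earlier sketch (base R's density() implements the same idea, though its bw argument is the kernel's standard deviation rather than the half-width used here):

# Kernel density estimate with kernel K and bandwidth h:
# f_hat_h(u) = (1 / (n * h)) * sum_i K((u - x_i) / h)
f_hat_h <- function(u, x, h, K = K_unif) mean(K((u - x) / h)) / h

u_grid <- seq(min(x), max(x), length.out = 200)
plot(u_grid, sapply(u_grid, f_hat_h, x = x, h = 3), type = "l",
     xlab = "u", ylab = "density estimate")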
22/46
Kernel Density Estimation: Bandwidth=6
23/46
Kernel Density Estimation: Different Kernels
The uniform kernel is one of the simplest:
\frac{1}{n}\sum_{i=1}^{n} K(u - x_i) = \frac{1}{n}\sum_{i=1}^{n} \frac{1\{|u - x_i| \le 1\}}{2}
Many other choices of kernel that do a better job
In fact, can choose any function K(z) such that:
\int_{-\infty}^{\infty} K(z)\,dz = 1
Common choice: Gaussian
K(z) = \phi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}
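In R, dnorm() is exactly this φ(z), so swapping kernels in the sketch above is a one-line change:

# Gaussian-kernel density estimate: reuse f_hat_h from above with K = dnorm
plot(u_grid, sapply(u_grid, f_hat_h, x = x, h = 3, K = dnorm), type = "l",
     xlab = "u", ylab = "density estimate")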
24/46
Kernel Density Estimation: Different Kernels
For any choice of K_h: K_h(u - x_i) gives a weight to observation x_i
Uniform (h = 1):
Weight = 1/2 if x_i is within 1 of u
0 otherwise
Gaussian
Weight is positive for all xi
But declines depending on distance from u
By taking the average of these weights (across all xi ), we get an estimate of the density at any point u
\hat{f}_h(u) = \frac{1}{n}\sum_{i=1}^{n} K_h(u - x_i)
25/46
Different Kernels
26/46
Different Kernels
27/46
Kernel Density Estimation: Epanechnikov
A frequently used kernel is the Epanechnikov:
K(z) = \begin{cases} \frac{3}{4\sqrt{5}}\left(1 - \frac{z^2}{5}\right) & \text{if } z^2 < 5 \\ 0 & \text{otherwise} \end{cases}
Optimal under certain assumptions; the default in many software packages
But the difference between this and, e.g. Gaussian is not huge
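A sketch of this kernel in R, written directly from the formula above (the kader package mentioned later provides a packaged version):

# Epanechnikov kernel with support |z| < sqrt(5), as on the slide
K_epan <- function(z) ifelse(z^2 < 5, 3 / (4 * sqrt(5)) * (1 - z^2 / 5), 0)
# Plugs straight into the generic estimator from the earlier sketch:
# f_hat_h(u, x, h, K = K_epan)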
28/46
Different Kernels
29/46
Kernel Density Estimation: Bandwidth Choice
Choice of bandwidth (h) often matters a lot more
Many different approaches to choose bandwidth optimally
Most software will have a decent bandwidth choice built in as a default
One rule of thumb (which works well when the underlying data is normal) is:
h = 1.06\,\hat{\sigma}\, n^{-1/5}
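In R (a sketch; the default bandwidth used by density() is a closely related rule of thumb with a slightly different constant and a robustness adjustment):

# Rule-of-thumb bandwidth: h = 1.06 * sigma_hat * n^(-1/5)
h_rot <- 1.06 * sd(x) * length(x)^(-1/5)
h_rot
# Can be passed to the estimator above, or to density(x, bw = h_rot)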
30/46
Bandwidth (h) too Big
31/46
Bandwidth (h) too Small
32/46
Bandwidth: h = 1.06\,\hat{\sigma}\, n^{-1/5}
33/46
Kernel Density Estimation
Write a kernel density estimator with:
Epanechnikov kernel
Rule-of-thumb bandwidth
What is \hat{f}(0)?
Hints:
The Epanechnikov function can be accessed by installing the “kader” package
You can implement it with kader:::epanechnikov(x)
34/46
Non-Parametric Regression
1. Nearest-Neighbors
2. Nadaraya-Watson
3. Local Polynomial Regression
35/46
Non-Parametric Regression
Given y and x, we previously wrote:
y =E[y|x]+ε
Showed that the OLS estimator provides the best linear approximation to the conditional mean function
In many settings, E[y|x] = h(x) is a non-linear function
Known functional form: can often use OLS to estimate parameters
For example: β0,β1,β2 in:
E[y|x] = β0 +β1x +β2x2
Or other estimation methods if the model is non-additive
But we often do not know the functional form
36/46
Non-Parametric Regression
h(u) = E[y \mid x = u]
Given data: (y1,x1),(y2,x2),···(yn,xn)
Goal: estimate hˆ(u) without knowing the functional form
Simple approach: Nearest Neighbors (local averaging)
For any u, define K_u as the set of K individuals with x_i nearest to u
The K observations (y_i, x_i) with the smallest values of |u - x_i|
For any point u, define:
\hat{h}(u) = \frac{1}{K}\sum_{i \in K_u} y_i
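A minimal R sketch of this local-averaging idea (the simulated (x, y) data and the choice K = 3 are illustrative):

# K-nearest-neighbors estimate of h(u) = E[y | x = u]
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)                 # some non-linear conditional mean

knn_reg <- function(u, x, y, K = 3) {
  nearest <- order(abs(x - u))[1:K]                # indices of the K closest x_i to u
  mean(y[nearest])                                 # average their y_i
}

u_grid <- seq(0, 10, length.out = 200)
plot(x, y, col = "grey")
lines(u_grid, sapply(u_grid, knn_reg, x = x, y = y, K = 3), col = "red")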
37/46
Nearest Neighbors Regression (K=3)
[Figure, built up in steps: observations x_1, ..., x_5 on the horizontal axis, with a point u between x_3 and x_4; the three nearest neighbors of u are x_2, x_3, x_4, so the estimate is \hat{h}(u) = \frac{y_2 + y_3 + y_4}{3}]
38/46
Non-Parametric Regression: Nearest Neighbors
\hat{h}(u) = \frac{1}{K}\sum_{i \in K_u} y_i
Downsides of nearest neighbors:
Problems in extremes: suppose h(u) is an increasing function:
For small u all of the neighbors will be above u
But a slightly larger u has almost exactly the same neighbors
Awkward flattening in the extremes
Big jumps as large values of yi enter Ku
39/46
Non-Parametric Regression: Nadaraya-Watson
Instead of averaging the y_i for x_i close to u, take a weighted average of all y_i
\hat{h}(u) = \sum_{i=1}^{n} \omega_i(u)\, y_i
with \sum_{i=1}^{n} \omega_i(u) = 1 for any u
If we choose \omega_i(u) = \frac{1}{n} for all i, u, we get:
\hat{h}(u) = \sum_{i=1}^{n} \frac{1}{n}\, y_i = \bar{y}
Not very informative
Choose ωi that give higher weight for observations with xi close to u
40/46
Non-Parametric Regression: Nadaraya-Watson
\hat{h}(u) = \sum_{i=1}^{n} \omega_i(u)\, y_i
Choose \omega_i that give higher weight to observations with x_i close to u:
What about:
\omega_i(u) = K\!\left(\frac{u - x_i}{h}\right)
where K(\cdot) is a kernel (e.g. Gaussian)
Gives higher weight to observations with x_i close to u
But these weights do not necessarily sum to 1...
Solution:
\omega_i(u) = \frac{K\!\left(\frac{u - x_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{u - x_j}{h}\right)}
41/46
Non-Parametric Regression: Nadaraya-Watson
\hat{h}(u) = \sum_{i=1}^{n} \omega_i(u)\, y_i
Want to choose \omega_i that give higher weight to observations with x_i close to u
Hence, the Nadaraya-Watson estimator at the point u is:
\hat{h}(u) = \sum_{i=1}^{n} \frac{K\!\left(\frac{u - x_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{u - x_j}{h}\right)}\, y_i
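A Nadaraya-Watson sketch in R with a Gaussian kernel, continuing with the (x, y) data and u_grid from the nearest-neighbors example (base R's ksmooth() with kernel = "normal" implements the same estimator, with its own bandwidth convention); the bandwidth h = 0.5 is just an illustrative choice:

# Nadaraya-Watson: kernel-weighted average of the y_i around u
nw <- function(u, x, y, h) {
  w <- dnorm((u - x) / h)       # unnormalized weights K((u - x_i) / h)
  sum(w * y) / sum(w)           # normalizing makes the weights sum to 1
}
lines(u_grid, sapply(u_grid, nw, x = x, y = y, h = 0.5), col = "blue")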
42/46
Non-Parametric Regression: Local Polynomial Regression
h(x) = E[y|x]
Consider a Taylor expansion at a point \tilde{x} close to u:
h(\tilde{x}) \approx h(u) + h^{(1)}(u)(\tilde{x} - u) + \frac{h^{(2)}(u)}{2!}(\tilde{x} - u)^2 + \cdots + \frac{h^{(p)}(u)}{p!}(\tilde{x} - u)^p
= \beta_0 + \beta_1(\tilde{x} - u) + \beta_2(\tilde{x} - u)^2 + \cdots + \beta_p(\tilde{x} - u)^p
where \beta_0 = h(u), \beta_1 = h^{(1)}(u), \cdots
Key idea: estimate βˆ0 at any u.
43/46
Non-Parametric Regression: Local Polynomial Regression
Key idea: estimate \hat{\beta}_0 at any u
If all x_i are close to u, then:
y_i \approx \beta_0 + \beta_1(x_i - u) + \beta_2(x_i - u)^2 + \cdots + \beta_p(x_i - u)^p
Can just run a regression (for any given point u):
\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1(x_i - u) - \cdots - \beta_p(x_i - u)^p \right)^2
Estimate the unknown \beta's (and specifically \beta_0)
This is just a regression of y_i on (x_i - u), (x_i - u)^2, \cdots, (x_i - u)^p
\hat{\beta}_0 = \hat{h}(u)
A different regression for each u
44/46
Non-Parametric Regression: Local Polynomial Regression
\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1(x_i - u) - \cdots - \beta_p(x_i - u)^p \right)^2
Of course, as x_i gets far from u, the approximation is bad
Solution: give more weight locally:
\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1(x_i - u) - \cdots - \beta_p(x_i - u)^p \right)^2 K\!\left(\frac{x_i - u}{h}\right)
where K(\cdot) is some kernel
If p = 0 this is just Nadaraya-Watson
If p = 1 this is local linear regression
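A local linear (p = 1) sketch in R via weighted least squares at each point u, again continuing with the same (x, y) data; loess() and KernSmooth::locpoly() offer polished implementations of the same idea, and the bandwidth h = 0.8 is only illustrative:

# Local linear regression: weighted regression of y on (x - u); the intercept is h_hat(u)
loc_lin <- function(u, x, y, h) {
  w   <- dnorm((x - u) / h)                 # kernel weights K((x_i - u) / h)
  fit <- lm(y ~ I(x - u), weights = w)      # minimizes the weighted sum of squares
  unname(coef(fit)[1])                      # beta_0_hat = h_hat(u)
}
lines(u_grid, sapply(u_grid, loc_lin, x = x, y = y, h = 0.8), col = "darkgreen")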
45/46
Non-Parametrics
1. Kernel Density Estimation
2. Non-Parametric Regression
46/46