ANLY-601
Advanced Pattern Recognition
Spring 2018
L15 — Nonparametric Density Models: Kernel Density Estimates
Nonparametric Density Estimates
Parametric methods choose a functional form for the density and fit its parameters from a sample — e.g. the multivariate Gaussian

$$p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\,\exp\left(-\tfrac{1}{2}\,(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

Nonparametric methods do not assume a parametric functional form; they are unstructured. The histogram is an example.
We'll look at
– Parzen windows, or kernel estimates
– the k-nearest-neighbor estimate
Parzen Windows
Consider a region L(x) about the point x (not necessarily a data point).

[Figure: a small region L(x) surrounding the point x]
The region L(x) contains volume V and probability mass

$$M_{L(x)} = \int_{L(x)} p(x')\, d^n x' \approx p(x)\, V$$
The approximation is more accurate as the
region shrinks (to below the length scale over
which p(x) varies appreciably)
We can approximate the density at x by

$$\hat p(x) = \frac{k(x)}{N\,V}$$

where N is the total number of data points, k(x) is the number of points in L(x), and V is the volume enclosed by L. This is the Parzen window estimate.
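As a concrete illustration, here is a minimal sketch of the count-based estimate in 1-D (plain NumPy; the interval width h, the function names, and the test data are illustrative assumptions, not from the notes):

```python
import numpy as np

def parzen_count_estimate(x, data, h):
    """Parzen window estimate p_hat(x) = k(x) / (N * V) in 1-D.

    The region L(x) is the interval [x - h/2, x + h/2], so its
    volume V is just the width h; k counts the samples inside.
    """
    N = len(data)
    k = np.sum(np.abs(data - x) <= h / 2)  # number of points in L(x)
    V = h
    return k / (N * V)

# Example: estimate a standard normal density at x = 0
rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
print(parzen_count_estimate(0.0, data, h=0.5))  # near 1/sqrt(2*pi) ~ 0.399
```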
Kernel Estimates
The Parzen window can be viewed in a slightly different light. Consider data in 2-D. Let the function k(x) have value 1/V throughout the region L(x) (of 2-D area V), and value zero outside.

[Figure: a flat-topped kernel k(x) of height 1/V over a base of area V, with the data points in the plane]
The Parzen estimate can be re-written

$$\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} k(x - x_i)$$

where x_i, i = 1…N are the data points. The kernel k(x − x_i) takes value 1/V for all data points x_i that fall within the base area V (centered on x), and zero outside. The summation is thus k(x)/V, where k(x) is again the number of points falling in the region.
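The same number can be computed as a sum of uniform kernels centered on the data; a sketch under the same 1-D assumptions as above, agreeing with the count form at every x (up to points lying exactly on the region boundary):

```python
import numpy as np

def uniform_kernel(u, h):
    """Kernel of height 1/V = 1/h on |u| <= h/2, zero outside."""
    return np.where(np.abs(u) <= h / 2, 1.0 / h, 0.0)

def parzen_kernel_estimate(x, data, h):
    """p_hat(x) = (1/N) * sum_i k(x - x_i); the sum is k(x)/V,
    so this equals the count form k(x) / (N * V)."""
    return np.mean(uniform_kernel(x - data, h))

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
print(parzen_kernel_estimate(0.0, data, h=0.5))  # matches the count form
```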
Kernel Estimates
Since the function k(y) is symmetric in y, we can put another interpretation on the kernel density estimate

$$\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} k(x - x_i)$$

Place the center of a kernel at each data point x_i, sum up the results, and use that as a picture of the density.
This is how kernel estimates are usually interpreted. There are lots of possible kernels. They satisfy

$$\int k(y)\, d^n y = 1, \qquad k(y) = k(-y)$$
Kernel Estimates
One possible kernel is the Dirac delta function — an infinitely narrow, infinitely tall spike that
– is symmetric: $\delta(y - x) = \delta(x - y)$
– integrates to 1:
$$\int_R \delta(y - x)\, d^n x = \begin{cases} 1, & y \in R \\ 0, & \text{otherwise} \end{cases}$$
– has the sifting property:
$$\int_R \delta(y - x)\, f(x)\, d^n x = \begin{cases} f(y), & y \in R \\ 0, & \text{otherwise} \end{cases}$$
The corresponding density estimate is a set of spikes, one at each data point.
Kernel Estimates
The delta kernel density estimate

$$\hat p_s(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$$

is a set of spikes, one at each data point.

[Figure: delta spikes at the data points along the x axis]

This needs to be smoothed to give a reasonable picture of the density p(x). To smooth it, we convolve this spike density with a smoothing kernel function k(x).
Kernel Estimates
Take the spike density

$$\hat p_s(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$$

and convolve (smear) it with the kernel k(x) to get the smoothed version

$$\hat p(x) = k(x) * \hat p_s(x) = \int k(x - y)\, \frac{1}{N} \sum_{i=1}^{N} \delta(y - x_i)\, dy = \frac{1}{N} \sum_{i=1}^{N} k(x - x_i)$$

[Figure: the spike density and its smoothed version, plotted on the interval −8 to 10]
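A minimal sketch of this smoothing step (the Gaussian kernel, bandwidth value, and grid are assumptions not fixed by the notes); evaluating the convolution at a point collapses, via the sifting property, to the kernel sum:

```python
import numpy as np

def gaussian_kernel(u, h):
    """Gaussian smoothing kernel of width (bandwidth) h."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2.0 * np.pi))

def kde(x_grid, data, h):
    """The spike density convolved with k; by the sifting property
    this is just p_hat(x) = (1/N) * sum_i k(x - x_i)."""
    return gaussian_kernel(x_grid[:, None] - data[None, :], h).mean(axis=1)

rng = np.random.default_rng(0)
data = rng.standard_normal(200)
x_grid = np.linspace(-8.0, 10.0, 901)
p_hat = kde(x_grid, data, h=0.5)
print(p_hat.sum() * (x_grid[1] - x_grid[0]))  # integrates to ~1
```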
Kernel Estimates
The width of the kernel function determines how much smoothing
results – here’s a progression from narrow to wide kernels:
[Figure: five density estimates of the same data on −10 to 10, with kernel width increasing from narrow (spiky) to wide (oversmoothed)]
If the kernel is narrower
than the distance between
the data points, the density
estimate will be too bumpy.
If the kernel is too wide, we
wipe out detail in the density.
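The same effect can be seen numerically rather than in plots; a small sketch (the roughness score, a sum of absolute second differences, is an ad hoc choice) comparing narrow, moderate, and wide kernels:

```python
import numpy as np

def kde(x_grid, data, h):
    """Gaussian-kernel estimate evaluated on a grid of points."""
    u = (x_grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * u**2).mean(axis=1) / (h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.standard_normal(50)
x_grid = np.linspace(-10, 10, 1000)

for h in (0.05, 0.5, 5.0):  # narrow, moderate, wide kernel widths
    p = kde(x_grid, data, h)
    roughness = np.abs(np.diff(p, 2)).sum()  # crude bumpiness score
    print(f"h={h}: peak={p.max():.3f}, roughness={roughness:.4f}")
# Narrow h: spiky, high roughness. Wide h: smooth, detail wiped out.
```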
Family of Kernels
Here's a convenient family of kernels

$$k(x) = \frac{c(m,n)}{|r^2 A|^{1/2}}\, \exp\left(-\tfrac{1}{2}\left[x^T (r^2 A)^{-1} x\right]^m\right), \qquad c(m,n) = \frac{m\,\Gamma(n/2)}{\pi^{n/2}\; 2^{\,n/2m}\; \Gamma(n/2m)}$$

where
– m determines the shape (m = 1 gives the normal curve)
– r determines the spatial extent of the kernel
– the matrix A determines the directional variation of the kernel
– r²A is the covariance
This family satisfies

$$\int k(x)\, d^n x = 1, \qquad \int x\, x^T\, k(x)\, d^n x = r^2 A$$
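A small sketch of this family in 1-D (n = 1, A = 1; the normalization constant follows the formula above, which is itself a reconstruction): it checks numerically that the kernel integrates to 1 for several m, and that m = 1 reproduces the normal curve:

```python
import numpy as np
from math import gamma, pi, sqrt

def family_kernel_1d(x, m, r):
    """The kernel family in 1-D (n = 1, A = 1):
    k(x) = c(m, 1) / r * exp(-0.5 * (x^2 / r^2)^m)."""
    c = m * gamma(0.5) / (sqrt(pi) * 2 ** (1 / (2 * m)) * gamma(1 / (2 * m)))
    return (c / r) * np.exp(-0.5 * (x**2 / r**2) ** m)

x = np.linspace(-30.0, 30.0, 60001)
dx = x[1] - x[0]
for m in (0.5, 1.0, 5.0):
    k = family_kernel_1d(x, m, r=1.0)
    print(f"m={m}: integral = {k.sum() * dx:.4f}")  # ~1 for every m

# m = 1 reproduces the standard normal density exactly
k1 = family_kernel_1d(x, m=1.0, r=1.0)
normal = np.exp(-0.5 * x**2) / sqrt(2 * pi)
print(np.max(np.abs(k1 - normal)))  # ~0
```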
Kernels
Here's what they look like (1-D) for different values of m:

[Figure: kernel profiles on −3 to 3 for m = 0.5, m = 1 (normal), and m = 5]

As m → ∞ the kernel becomes uniform.
Histograms and Kernel Density Estimates
Histograms are a crude type of kernel density estimate, obtained as follows (a sketch comparing the two appears after this list):
– Take a rectangular kernel of width equal to the bin width Δ
$$k(x - x') = \begin{cases} 1/\Delta, & |x - x'| \le \Delta/2 \\ 0, & \text{otherwise} \end{cases}$$
– Form the kernel density estimate
$$\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} k(x - x_i)$$
– Sample the estimate at the discrete points $x_n = n\Delta$, $n = \ldots, -2, -1, 0, 1, 2, 3, \ldots$

[Figure: histogram bins with edges x_n, x_(n+1)]
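As a check of this correspondence, a minimal sketch (NumPy only; the bin width and data are illustrative) showing that the density-normalized histogram equals the rectangular-kernel estimate sampled at the bin centers:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(1000)
N, delta = len(data), 0.5                      # delta = bin width

edges = np.arange(-4.0, 4.0 + delta, delta)    # bin edges x_n = n * delta
centers = 0.5 * (edges[:-1] + edges[1:])

# Histogram as a density: bin count / (N * delta)
hist = np.histogram(data, bins=edges)[0] / (N * delta)

# Rectangular-kernel estimate sampled at the bin centers
u = centers[:, None] - data[None, :]
kde_at_centers = (np.abs(u) <= delta / 2).mean(axis=1) / delta

print(np.max(np.abs(hist - kde_at_centers)))   # ~0 (up to boundary ties)
```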
Bias and Variance of Kernel Estimates
The kernel density estimate

$$\hat p(x) = \frac{1}{N} \sum_{i=1}^{N} k(x - x_i)$$

is an estimator dependent on the sample data points x_i. Like all such statistical estimators, we can ask about its bias and variance.
Start with the delta-function kernel estimate

$$\hat p_s(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$$

and take its expectation over all sets of x_i.
Bias and Variance of Kernel Estimates
Start with the delta-function kernel estimate

$$\hat p_s(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$$

and take its expectation over all sets of x_i:

$$E[\hat p_s(x)] = \int \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)\, p(x_1) \cdots p(x_N)\, d^n x_1 \cdots d^n x_N = \frac{1}{N} \sum_{i=1}^{N} \int \delta(x - x_i)\, p(x_i)\, d^n x_i = p(x)$$

so it's unbiased!
Bias of Kernel Density Estimate
For a symmetric but otherwise arbitrary kernel, the expectation is

$$E[\hat p_D(x)] = \int \frac{1}{N} \sum_{i=1}^{N} k(x - x_i)\, p(x_1) \cdots p(x_N)\, d^n x_1 \cdots d^n x_N = \frac{1}{N} \sum_{i=1}^{N} \int k(x - x_i)\, p(x_i)\, d^n x_i = \int k(x - y)\, p(y)\, d^n y$$

that is,

$$E[\hat p_D(x)] = (k * p)(x)$$

and this is, in general, biased.
What does the bias look like?
Bias of Kernel Density Estimate
Convolution with the kernel smoothes the parent distribution p(x). Here's an example of convolving a density with a Gaussian kernel:

[Figure: a density and its Gaussian-smoothed version]

Smoothing reduces the “contrast” in the curve; the smoothed density underestimates the true density where the latter is high, and overestimates the true density where the latter is low.
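One way to see the bias concretely is a small Monte Carlo sketch (assumed setup: standard normal data and a Gaussian kernel, for which k ∗ p is available in closed form): the average of many independent kernel estimates lands on k ∗ p, below the true peak height.

```python
import numpy as np

rng = np.random.default_rng(0)
h, N, trials = 1.0, 100, 2000

def kde_at_zero(data, h):
    """Gaussian-kernel estimate evaluated at x = 0."""
    return np.mean(np.exp(-0.5 * (data / h) ** 2)) / (h * np.sqrt(2 * np.pi))

# Average the estimator over many independent samples from p = N(0, 1)
avg = np.mean([kde_at_zero(rng.standard_normal(N), h) for _ in range(trials)])

p0 = 1 / np.sqrt(2 * np.pi)                # true density at the peak
kp0 = 1 / np.sqrt(2 * np.pi * (1 + h**2))  # (k * p)(0): N(0,1) smoothed is N(0, 1+h^2)
print(avg, kp0, p0)  # avg ~ kp0, well below p0: the peak is underestimated
```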
Bias of Kernel Density Estimates
• The expected kernel density estimate is smoother
than the real density for all finite-width kernels.
• The extent of the smoothing, and hence of the bias,
increases as the kernel width increases.
• Only delta function kernels give an unbiased
estimate of p(x).
Variance of Kernel Estimate
The variance involves the second moment

$$E[\hat p_D(x)^2] = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} E[k(x - x_i)\, k(x - x_j)]$$

$$= \frac{1}{N} \int k^2(x - y)\, p(y)\, d^n y + \frac{N(N-1)}{N^2} \left( \int k(x - y)\, p(y)\, d^n y \right)^2 = \frac{1}{N}\, (k^2 * p) + \left(1 - \frac{1}{N}\right) (k * p)^2$$

so that

$$\mathrm{var}[\hat p_D(x)] = E[\hat p_D(x)^2] - \left(E[\hat p_D(x)]\right)^2 = \frac{1}{N} \left[ (k^2 * p) - (k * p)^2 \right]$$

Note that this decreases as 1/N.
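The 1/N scaling can be checked with the same Monte Carlo setup as in the bias sketch above (again an assumed Gaussian-kernel, normal-data setup):

```python
import numpy as np

rng = np.random.default_rng(0)
h, trials = 0.5, 4000

def kde_at_zero(data, h):
    """Gaussian-kernel estimate evaluated at x = 0."""
    return np.mean(np.exp(-0.5 * (data / h) ** 2)) / (h * np.sqrt(2 * np.pi))

for N in (100, 400, 1600):
    estimates = [kde_at_zero(rng.standard_normal(N), h) for _ in range(trials)]
    print(N, np.var(estimates))  # quadrupling N cuts the variance ~4x
```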
Bias and Variance
Intuitively, the bias decreases as the kernel gets narrower.
The variance is harder to interpret, beyond its decrease with increasing dataset size. Fukunaga (1) gives approximate forms for the bias and variance for the family of kernels introduced above.
So we are able to control the bias and variance by adjusting the
kernel width r.
One prescription is to maximize the likelihood of a validation set
{yi} under a kernel model with kernel centers at the training
dataset points {xi}.
Why not maximize the likelihood of {xi}? (Doing so would favor r → 0, collapsing the estimate onto delta spikes at the training points.)
$$\hat p\left(\{y_i\}_{i=1}^{Q}\right) = \prod_{i=1}^{Q} \hat p(y_i) = \prod_{i=1}^{Q}\; \frac{1}{N} \sum_{j=1}^{N} k(y_i - x_j;\, r)$$
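A minimal sketch of this prescription (the Gaussian kernel and the grid search over r are assumptions; the notes do not fix either): pick the r that maximizes the validation-set log-likelihood.

```python
import numpy as np

def validation_log_likelihood(val, train, r):
    """log of prod_i (1/N) sum_j k(y_i - x_j; r), Gaussian kernel of width r."""
    u = (val[:, None] - train[None, :]) / r
    p = np.exp(-0.5 * u**2).mean(axis=1) / (r * np.sqrt(2 * np.pi))
    return np.log(p + 1e-300).sum()  # epsilon guards against log(0) for tiny r

rng = np.random.default_rng(0)
train = rng.standard_normal(200)   # kernel centers {x_j}
val = rng.standard_normal(100)     # held-out validation set {y_i}

radii = np.logspace(-2, 1, 30)     # candidate kernel widths r
scores = [validation_log_likelihood(val, train, r) for r in radii]
print(f"best r = {radii[int(np.argmax(scores))]:.3f}")
# Scoring the training set itself would instead favor r -> 0 (delta spikes).
```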