
ANLY-601
Advanced Pattern Recognition

Spring 2018

L15: Nonparametric Density Models – Kernel Density Estimates

Nonparametric Density Estimates

Parametric forms involve a chosen functional form for the density, whose parameters are fit by estimation from a sample, e.g. the Gaussian.

Non-parametric methods do not use a chosen parametric functional form, but rather are unstructured. The histogram is an example.

We’ll look at

– Parzen windows or kernel estimate

– k-nearest neighbor estimate

   




For reference, the Gaussian density mentioned above is

$$p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\,\exp\!\left[-\frac{1}{2}\,(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right]$$
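As a concrete contrast with what follows, here is a minimal numpy sketch of the parametric route: estimate the Gaussian's μ and Σ from a sample and evaluate the fitted density (the function names are illustrative, not from the notes).

```python
import numpy as np

def fit_gaussian(X):
    """Parametric fit: estimate mean and covariance from a sample X of shape (N, n)."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the fitted multivariate Gaussian density at a single point x."""
    n = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm
```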

Parzen

Consider a region L(x) about the point x (not necessarily a data point).

[Figure: a small region L(x) drawn around the point x]

The region L(x) contains volume V, and probability mass

$$M_{L(x)} = \int_{L(x)} p(x')\, d^{n}x' \;\approx\; p(x)\, V .$$

The approximation is more accurate as the region shrinks (to below the length scale over which p(x) varies appreciably).
We can approximate the density at x by

$$\hat p(x) = \frac{k(x)}{N\,V}$$

where N is the total number of data points, k(x) is the number of points in L(x), and V is the volume enclosed by L. This is the Parzen window estimate.
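A minimal sketch of this counting estimate, assuming L(x) is a hypercube of side h centered on x (h and the function name are my choices):

```python
import numpy as np

def parzen_window_estimate(x, data, h):
    """Parzen window estimate p_hat(x) = k(x) / (N * V) with a hypercube region L(x).

    x    : query point, shape (n,)
    data : sample, shape (N, n)
    h    : side length of the hypercube
    """
    N, n = data.shape
    V = h ** n                                    # volume of the hypercube
    inside = np.all(np.abs(data - x) <= h / 2, axis=1)
    k = inside.sum()                              # number of points in L(x)
    return k / (N * V)
```

The hypercube is only for convenience; any region of known volume V works the same way.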

Kernel Estimates

The Parzen window can be viewed in a slightly different light. Consider data in 2-D. Let the function k(x) have value 1/V throughout the region L(x) (of 2-D area V), and value zero outside.

[Figure: a box kernel of height 1/V over a base of area V in the plane of the data points]

The Parzen estimate can be re-written

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} k(x - x_i)$$

where x_i, i = 1 … N are the data points. The kernel k(x − x_i) takes value 1/V for all data points x_i that fall within the base area V (centered on x), and zero outside. The summation is thus k(x)/V.
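The same estimate written as the kernel sum above; with the uniform (box) kernel, this sketch reproduces the counting sketch from the previous slide exactly (names again my own):

```python
import numpy as np

def box_kernel(u, h):
    """Uniform kernel: value 1/V inside a hypercube of side h (volume V = h^n), zero outside."""
    n = u.shape[-1]
    return np.all(np.abs(u) <= h / 2, axis=-1) / h ** n

def kde(x, data, h, kernel=box_kernel):
    """Kernel density estimate p_hat(x) = (1/N) * sum_i kernel(x - x_i)."""
    return kernel(x - data, h).mean()
```

Swapping in any smooth, symmetric, normalized kernel changes only box_kernel; the rest of the estimate is untouched.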

Kernel Estimates

Since the function k(y) is symmetric in y, we can put another interpretation on the kernel density estimate

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} k(x - x_i):$$

place the center of a kernel at each data point x_i, sum up the results, and use that as a picture of the density.

This is how kernel estimates are usually interpreted. There are many possible kernels. They satisfy

$$\int k(y)\, d^{n}y = 1, \qquad k(y) = k(-y).$$

Kernel Estimates

One possible kernel is the Dirac delta function: an infinitely narrow, infinitely tall spike that

is symmetric:
$$\delta(x - y) = \delta(y - x),$$

integrates to 1:
$$\int_{R} \delta(x - y)\, d^{n}x = \begin{cases} 1, & y \in R \\ 0, & \text{otherwise,} \end{cases}$$

and has the sifting property:
$$\int_{R} \delta(x - y)\, f(x)\, d^{n}x = \begin{cases} f(y), & y \in R \\ 0, & \text{otherwise.} \end{cases}$$

The corresponding density estimate is a set of spikes, one at each data point.

Kernel Estimates

The delta kernel density estimate

$$\hat p_s(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$$

is a set of spikes, one at each data point. This needs to be smoothed to give a reasonable picture of the density p(x). To smooth it, we convolve this spike density with a smoothing kernel function k(x).

Kernel Estimates

Take the spike density

$$\hat p_s(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$$

and convolve (smear) it with the kernel k(x) to get the smoothed version

$$\hat p(x) = k(x) * \hat p_s(x) = \int k(x - y)\, \frac{1}{N}\sum_{i=1}^{N} \delta(y - x_i)\, dy = \frac{1}{N}\sum_{i=1}^{N} k(x - x_i).$$

[Figure: a spike density on the interval -8 to 10 and its smoothed version]
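Numerically, the smearing is an ordinary discrete convolution. A 1-D sketch (toy sample, grid spacing, and kernel width are arbitrary choices) checking that convolving the binned spike density with a Gaussian kernel reproduces the direct kernel sum:

```python
import numpy as np

data = np.array([-2.0, 0.5, 1.0, 3.0])          # toy sample
grid = np.linspace(-8, 8, 1601)                  # evaluation grid
dx = grid[1] - grid[0]                           # grid spacing (0.01)
r = 0.5                                          # kernel width

# Spike density: mass 1/N at the grid bin nearest each data point.
spikes = np.zeros_like(grid)
for xi in data:
    spikes[np.argmin(np.abs(grid - xi))] += 1 / (len(data) * dx)

# Gaussian smoothing kernel sampled on a symmetric grid with the same spacing.
ker_grid = np.linspace(-4, 4, 801)
kernel = np.exp(-ker_grid**2 / (2 * r**2)) / np.sqrt(2 * np.pi * r**2)

# Smear the spikes with the kernel; multiplying by dx approximates the integral.
smoothed = np.convolve(spikes, kernel, mode="same") * dx

# Direct kernel-sum form (1/N) sum_i k(x - x_i), for comparison.
direct = np.exp(-(grid[:, None] - data)**2 / (2 * r**2)).mean(axis=1) \
         / np.sqrt(2 * np.pi * r**2)

assert np.allclose(smoothed, direct, atol=1e-6)
```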

Kernel Estimates

The width of the kernel function determines how much smoothing

results – here’s a progression from narrow to wide kernels:

[Figure: five density estimates of the same sample, kernels progressing from narrow to wide, plotted on the interval -10 to 10]

If the kernel is narrower

than the distance between

the data points, the density

estimate will be too bumpy.

If the kernel is too wide, we

wipe out detail in the density.

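One way to see this numerically is to count the modes of a Gaussian-kernel estimate as the width varies; a sketch, with a two-cluster toy sample and widths chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
grid = np.linspace(-10, 10, 2000)

def gaussian_kde(grid, data, r):
    """1-D Gaussian-kernel estimate evaluated on a grid."""
    z = (grid[:, None] - data) / r
    return np.exp(-0.5 * z**2).mean(axis=1) / (r * np.sqrt(2 * np.pi))

for r in [0.05, 0.5, 5.0]:
    p = gaussian_kde(grid, data, r)
    modes = np.sum((p[1:-1] > p[:-2]) & (p[1:-1] > p[2:]))  # count local maxima
    print(f"r = {r}: {modes} modes")   # narrow -> many spurious bumps, wide -> detail lost
```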

Family of Kernels

Here’s a convenient family of kernels:

$$k(x) = \frac{\Gamma\!\left(\frac{n}{2}+1\right)}{2^{\,n/2m}\,\pi^{n/2}\,\Gamma\!\left(\frac{n}{2m}+1\right)\, r^{n}\,|A|^{1/2}}\;\exp\!\left[-\frac{1}{2}\left(\frac{x^{T}A^{-1}x}{r^{2}}\right)^{m}\right]$$

where

– m determines the shape (m = 1 gives the normal curve)
– r determines the spatial extent of the kernel
– the matrix A determines the directional variation of the kernel
– Σ_k = r² A is the covariance

This family satisfies

$$\int k(x)\, d^{n}x = 1, \qquad \int x\, x^{T}\, k(x)\, d^{n}x = r^{2}A = \Sigma_k .$$
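A sketch of a 1-D member of this family (n = 1, A = 1), using the normalization as reconstructed above, so treat the constant as illustrative; the numerical check confirms each member integrates to 1:

```python
import numpy as np
from math import gamma, pi

def family_kernel(x, m, r):
    """1-D member of the family (n = 1, A = 1): m = 1 is the Gaussian, m -> inf tends to uniform."""
    n = 1
    c = gamma(n / 2 + 1) / (2 ** (n / (2 * m)) * pi ** (n / 2) * gamma(n / (2 * m) + 1) * r ** n)
    return c * np.exp(-0.5 * (x ** 2 / r ** 2) ** m)

# Numerical check that each member integrates to 1 (Riemann sum on a wide grid).
x = np.linspace(-20, 20, 400001)
for m in [0.5, 1.0, 5.0]:
    total = family_kernel(x, m, r=1.0).sum() * (x[1] - x[0])
    print(f"m = {m}: integral = {total:.4f}")
```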

Kernels

Here’s what they look like (1-D) for different values of m:

[Figure: kernels on the interval -3 to 3 for m = 0.5, m = 1 (normal), and m = 5]

As m → ∞ the kernel becomes uniform.

Histograms and Kernel Density Estimates

Histograms are a crude type of kernel density estimate obtained as follows:

– Take a rectangular kernel of width Δ

$$k(x') = \begin{cases} 1/\Delta, & |x'| \le \Delta/2 \\ 0, & \text{otherwise} \end{cases}$$

– Form the kernel density estimate

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} k(x - x_i)$$

– Sample the estimate at the discrete points $x_n = n\Delta$, $n = \ldots, -2, -1, 0, 1, 2, 3, \ldots$
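A sketch of the correspondence (sample, range, and bin width Δ = 0.5 are arbitrary): a normalized histogram agrees with the rectangular-kernel estimate sampled at the bin centers, up to ties at bin edges, which occur with probability zero for continuous data.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0, 1, 500)
delta = 0.5                                   # bin width = kernel width
edges = np.arange(-6, 6 + delta, delta)
centers = (edges[:-1] + edges[1:]) / 2

# Ordinary histogram, normalized to a density.
hist, _ = np.histogram(data, bins=edges, density=True)

# Rectangular-kernel KDE, p_hat(x) = (1/N) sum_i k(x - x_i), sampled at bin centers.
kde = np.array([np.mean(np.abs(data - c) <= delta / 2) / delta for c in centers])

assert np.allclose(hist, kde)
```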

Bias and Variance of Kernel Estimates

The kernel density estimate

$$\hat p(x) = \frac{1}{N}\sum_{i=1}^{N} k(x - x_i)$$

is an estimator dependent on the sample data points x_i. Like all such statistical estimators, we can ask about its bias and variance.

Start with the delta-function kernel estimate

$$\hat p_s(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$$

and take its expectation over all sets of x_i.

Bias and Variance of Kernel Estimates

Start with the delta-function kernel estimate

$$\hat p_s(x) = \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)$$

and take its expectation over all sets of x_i:

$$
\begin{aligned}
E[\hat p_s(x)] &= \int \frac{1}{N}\sum_{i=1}^{N} \delta(x - x_i)\; p(x_1)\, p(x_2)\cdots p(x_N)\; d^{n}x_1 \cdots d^{n}x_N \\
&= \frac{1}{N}\sum_{i=1}^{N} \int \delta(x - x_i)\, p(x_i)\; d^{n}x_i \;=\; \frac{1}{N}\sum_{i=1}^{N} p(x) \;=\; p(x)
\end{aligned}
$$

so it's unbiased!

Bias of Kernel Density Estimate

For a symmetric but otherwise arbitrary kernel, the expectation is

$$
\begin{aligned}
E[\hat p(x)] &= \int \frac{1}{N}\sum_{i=1}^{N} k(x - x_i)\; p(x_1)\, p(x_2)\cdots p(x_N)\; d^{n}x_1 \cdots d^{n}x_N \\
&= \frac{1}{N}\sum_{i=1}^{N}\int k(x - y)\, p(y)\; d^{n}y \;=\; \int k(x - y)\, p(y)\; d^{n}y
\end{aligned}
$$

that is,

$$E[\hat p(x)] = (k * p)(x),$$

and this is, in general, biased.

What does the bias look like?

Bias of Kernel Density Estimate

Convolution with the kernel smooths the parent distribution p(x). Here's an example of convolving a density with a Gaussian kernel:

[Figure: a density before and after convolution with a Gaussian kernel]

Smoothing reduces the “contrast” in the curve; the smoothed density underestimates the true density where the latter is high, and overestimates the true density where the latter is low.

Bias of Kernel Density Estimates

• The expected kernel density estimate is smoother

than the real density for all finite-width kernels.

• The extent of the smoothing, and hence of the bias,

increases as the kernel width increases.

• Only delta function kernels give an unbiased

estimate of p(x).


Variance of Kernel Estimate

The variance involves the second moment

$$
\begin{aligned}
E[\hat p(x)^2] &= \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} E\big[\,k(x - x_i)\, k(x - x_j)\,\big] \\
&= \frac{1}{N}\int k(x - y)^2\, p(y)\; d^{n}y \;+\; \frac{N-1}{N}\left[\int k(x - y)\, p(y)\; d^{n}y\right]^{2} \\
&= \frac{1}{N}\,(k^2 * p)(x) \;+\; \frac{N-1}{N}\,\big[(k * p)(x)\big]^{2}
\end{aligned}
$$

(the i = j terms give the first piece, the N(N-1) terms with i ≠ j give the second). Subtracting the squared mean gives

$$\operatorname{var}[\hat p(x)] = \frac{1}{N}\Big[(k^2 * p)(x) - \big[(k * p)(x)\big]^{2}\Big].$$

Note that this decreases as 1/N.
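A quick Monte Carlo check of the 1/N behavior (parent density, kernel width, and evaluation point are arbitrary choices): the empirical variance of p̂(x₀) should drop roughly fourfold each time N quadruples.

```python
import numpy as np

rng = np.random.default_rng(2)
r, x0, trials = 0.3, 0.0, 2000

def kde_at(x, data, r):
    """Gaussian-kernel estimate p_hat(x) at a single point."""
    z = (x - data) / r
    return np.exp(-0.5 * z ** 2).mean() / (r * np.sqrt(2 * np.pi))

# Empirical variance of the estimator over repeated samples, for growing N.
for N in [100, 400, 1600]:
    estimates = [kde_at(x0, rng.normal(0, 1, N), r) for _ in range(trials)]
    print(f"N = {N:5d}: var = {np.var(estimates):.2e}")
```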

Bias and Variance
Intuitively, the bias decreases as the kernel gets narrower.

The variance is difficult to understand intuitively, apart from its decrease with increasing dataset size. Fukunaga (1) gives approximate forms for the bias and variance for the family of kernels introduced above.

So we are able to control the bias and variance by adjusting the kernel width r.

One prescription is to maximize the likelihood of a validation set {y_i} under a kernel model with kernel centers at the training dataset points {x_i}:

$$\hat p(\{y_i\}) = \prod_{i=1}^{Q} \hat p(y_i) = \prod_{i=1}^{Q}\; \frac{1}{N}\sum_{j=1}^{N} k(y_i - x_j;\, r)$$

Why not maximize the likelihood of {x_i}? As r → 0 the kernel centers sit exactly on the training points, so the training-set likelihood grows without bound; maximizing it would always drive r to zero.
19