
ANLY-601
Advanced Pattern Recognition

Spring 2018

L13 – Clustering and Mixture Models

Clustering

Unsupervised classification.

The central problem is defining a cluster.

Parametric approach: use a cluster criterion, either a "distance" or "distortion function" (which does not have to be a mathematical metric) for closeness, or a parametric model of the distribution.

Model-based clustering produces representative values of the data in each cluster, for example the centroid. Representative values are substitutes for the data in lossy data compression (e.g. scalar and vector quantization). Hence clustering and lossy coding are related.


Model-Based Clustering

General algorithm properties

Data – $x_1, x_2, \ldots, x_N$, each to be assigned to one of $L$ clusters or classes.

The assignment of the $i$th data point to a cluster is denoted $\omega_i = k_i \in \{1, \ldots, L\}$.

The classification of the entire dataset is denoted $\Omega = (k_1, k_2, \ldots, k_N)$.

There is a cluster criterion function $J(\Omega)$ and a best clustering

$$\Omega^* = \arg\min_{\Omega} J(\Omega) .$$


Model-Based Clustering

General Properties (cont’d)

There is a set of parameters associated with the $k$th cluster, denoted $\theta_k$. The parameters may include the mean of the data in the cluster, the covariance of the data in the cluster, and other summary statistics.

The union of all these parameters over all clusters is denoted $\Theta$.

The clustering criterion function is generally a function of the parameters,

$$J = J(\Omega, \Theta) ,$$

and the best clustering model simultaneously minimizes $J$ over the cluster assignment and the parameters,

$$(\Omega^*, \Theta^*) = \arg\min_{\Omega, \Theta} J(\Omega, \Theta) .$$


Example K-means

Classic algorithm. (MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 3, 1967.)

The criterion function is the average squared distance between the data and the cluster "means":

$$J = \sum_{i=1}^{L} \frac{N_i}{N} \; \frac{1}{N_i} \sum_{j=1}^{N_i} \left\| x_j^{(i)} - m_i \right\|^2 ,$$

where $x_j^{(i)}$ is the $j$th data point assigned to cluster $i$, and $N_i$ is the number of data points assigned to cluster $i$.

The $m_i$ are the "means" associated with the clusters – collectively they comprise the parameters $\Theta$ associated with the clusters. (We have yet to show that these are statistical means, hence the quotation marks.)

Note that $N_i/N$ is the fraction of data points in the $i$th cluster. It's an estimator of the cluster prior $P_i$.
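As a concrete illustration, the sketch below (Python with NumPy assumed; the names `X`, `assign`, and `means` are hypothetical) evaluates this criterion for a given clustering.

    import numpy as np

    def kmeans_criterion(X, assign, means):
        # J = sum_i (N_i/N) (1/N_i) sum_j ||x_j^(i) - m_i||^2, i.e. the mean
        # squared distance of each point to the mean of its assigned cluster.
        N = X.shape[0]
        J = 0.0
        for i, m in enumerate(means):
            members = X[assign == i]          # data points assigned to cluster i
            J += np.sum((members - m) ** 2) / N
        return J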


K-Means Iterative Optimization

• Freeze the means $m_i$, and find the assignment $\Omega$ that minimizes $J$. This assigns $x_j$ to the cluster whose mean $m_r$ is the closest in Euclidean distance to $x_j$:

$$\| x_j - m_r \| \le \| x_j - m_k \| \quad \forall\, k \ne r .$$

• Freeze the cluster assignments, and move each mean to the centroid of its cluster:

$$m_r = \frac{1}{N_r} \sum_{i=1}^{N_r} x_i^{(r)} .$$

Since the cluster assignments depend on the $m_k$, and the means depend on the cluster assignment $\Omega$, we must iterate these two steps.
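A minimal sketch of these two alternating steps, assuming NumPy and hypothetical names (`X` for the data array, `L` for the number of clusters), is given below; it illustrates the update rules above rather than a production implementation.

    import numpy as np

    def kmeans(X, L, n_iter=100, rng=np.random.default_rng(0)):
        # Initialize the means at L randomly chosen data points.
        means = X[rng.choice(len(X), size=L, replace=False)].copy()
        for _ in range(n_iter):
            # Step 1: freeze the means; assign each point to the nearest mean
            # (smallest squared Euclidean distance).
            d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)  # shape (N, L)
            assign = d2.argmin(axis=1)
            # Step 2: freeze the assignments; move each mean to the centroid
            # of the points assigned to it.
            new_means = np.array([X[assign == r].mean(axis=0) if np.any(assign == r)
                                  else means[r] for r in range(L)])
            if np.allclose(new_means, means):  # J no longer changes
                break
            means = new_means
        return means, assign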


LBG

The k-means algorithm arose in the statistics community. It was independently discovered in the EE community, where it's known as Lloyd's algorithm, or the Linde-Buzo-Gray (LBG) algorithm, used to design vector quantizers for lossy coding.

The algorithm can be generalized by substituting for the squared Euclidean distance any distortion function that is bounded below by zero,

$$d(x_i, \theta_r) \ge 0 .$$

The criterion function is then

$$J = \sum_{i=1}^{L} \frac{N_i}{N} \; \frac{1}{N_i} \sum_{j=1}^{N_i} d\!\left( x_j^{(i)}, \theta_i \right) .$$


LBG Optimization

As before, the optimization is iterative and proceeds in two steps.

• Freeze the cluster parameters $\theta_k$ and assign each data point $x_i$ to the cluster with the lowest distortion:

$$\omega_i = r \quad \text{where} \quad d(x_i, \theta_r) \le d(x_i, \theta_j) \;\; \forall\, j \ne r .$$

• Freeze the cluster assignments and adjust the parameters to minimize the distortion within each cluster:

$$\theta_r = \arg\min_{\theta} \frac{1}{N_r} \sum_{i=1}^{N_r} d\!\left( x_i^{(r)}, \theta \right) .$$

The resulting values of the parameters $\theta_r$ are called the generalized centroids; they depend on the distortion function (and may not be the actual centroids).
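The role of the distortion function is easiest to see in code. Here is a minimal sketch (Python/NumPy assumed), using the L1 distance purely as an example distortion; for L1 the generalized centroid is the coordinate-wise median rather than the mean.

    import numpy as np

    def l1_distortion(x, theta):
        # Example distortion d(x, theta) >= 0: the L1 (city-block) distance.
        return np.abs(x - theta).sum(axis=-1)

    def lbg(X, L, n_iter=100, rng=np.random.default_rng(0)):
        theta = X[rng.choice(len(X), size=L, replace=False)].copy()
        for _ in range(n_iter):
            # Step 1: freeze the parameters; assign each point to the cluster
            # with the lowest distortion.
            d = np.stack([l1_distortion(X, t) for t in theta], axis=1)  # shape (N, L)
            assign = d.argmin(axis=1)
            # Step 2: freeze the assignments; the generalized centroid minimizes
            # the within-cluster distortion. For L1 distortion it is the
            # coordinate-wise median, not the mean.
            theta = np.array([np.median(X[assign == r], axis=0) if np.any(assign == r)
                              else theta[r] for r in range(L)])
        return theta, assign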


LBG Convergence

Each of the two optimization operations either lowers J, or
leaves it unchanged. Since J is bounded below, the
algorithm converges to a (local) minimum of J.

For a finite number of data points, J stops changing after a finite number of steps. For $L$ clusters and $N$ data points, there are $L^N$ different cluster assignments $\Omega$. Each such assignment, together with its optimal set of parameters $\Theta^*(\Omega)$, produces a particular value of $J$. Since each step lowers $J$ or leaves it unchanged, the algorithm must arrive at a local minimum of $J$ in $L^N$ steps or fewer.

Note – the convergence means $J$ comes to rest at a local minimum. It's conceivable that $\Omega$ may continue to change (a finite number of times).


K-means Boundaries

For Euclidean distance distortion, the boundaries between clusters are piecewise linear: they are bisectors perpendicular to the line segment between pairs of means. A point $x$ on the boundary between clusters $r$ and $s$ must satisfy

$$\| x - m_r \|^2 = \| x - m_s \|^2 ;$$

expanding the quadratics gives

$$2\, x^T (m_r - m_s) - (m_r - m_s)^T (m_r + m_s) = 0 ,$$

which is linear in $x$. (You can prove the bisector property yourself.)

The set of boundaries is called a Voronoi tessellation.
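A quick numerical check of the boundary equation (a sketch assuming NumPy; the means are arbitrary example values): the midpoint of the two means, and any point offset from it perpendicular to $m_r - m_s$, satisfies the linear condition.

    import numpy as np

    m_r = np.array([0.0, 0.0])
    m_s = np.array([2.0, 1.0])

    def boundary_residual(x, m_r, m_s):
        # 2 x^T (m_r - m_s) - (m_r - m_s)^T (m_r + m_s); zero on the boundary.
        diff = m_r - m_s
        return 2 * x @ diff - diff @ (m_r + m_s)

    midpoint = 0.5 * (m_r + m_s)
    perp = np.array([-(m_s - m_r)[1], (m_s - m_r)[0]])   # direction perpendicular to m_s - m_r

    print(boundary_residual(midpoint, m_r, m_s))               # 0: midpoint lies on the boundary
    print(boundary_residual(midpoint + 3.7 * perp, m_r, m_s))  # 0: the boundary is the perpendicular bisector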


K-means Partition and Mean Location

The algorithm tends to concentrate the means where $p(x)$ is high.

[Figure: 2-D Gaussian data (blue dots) and 30 means (red circles) placed by k-means, iterating 65 times over a dataset of 1000 points.]


Relation to Gaussian Mixture Models

Take a Gaussian mixture model with spherical components, all with the same variance. Don't fit the component variance; instead, regard it as a 'knob'. The model is

$$p(x) = \sum_{i=1}^{L} P_i \, p(x \mid i), \qquad p(x \mid i) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{\| x - m_i \|^2}{2\sigma^2} \right) ,$$

and the posterior cluster probabilities for data point $x$ are

$$p(i \mid x) = \frac{P_i \, p(x \mid i)}{\sum_j P_j \, p(x \mid j)} = \frac{1}{1 + \sum_{j \ne i} \dfrac{P_j \, p(x \mid j)}{P_i \, p(x \mid i)}} .$$
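A small sketch of these posteriors (Python/NumPy assumed; `means`, `priors`, and `sigma2` are hypothetical example values, with `sigma2` playing the role of the σ² 'knob'):

    import numpy as np

    def spherical_gmm_posteriors(x, means, priors, sigma2):
        # p(i|x) for a mixture of spherical Gaussians with common variance sigma2.
        # The normalizing constant (2*pi*sigma2)^(n/2) cancels in the posterior.
        d2 = np.sum((means - x) ** 2, axis=1)            # ||x - m_i||^2 for each component
        unnorm = priors * np.exp(-d2 / (2.0 * sigma2))   # P_i exp(-||x - m_i||^2 / (2 sigma^2))
        return unnorm / unnorm.sum()

    means = np.array([[0.0, 0.0], [3.0, 0.0]])
    priors = np.array([0.5, 0.5])
    print(spherical_gmm_posteriors(np.array([1.0, 0.0]), means, priors, sigma2=1.0))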


K-means and Gaussian Mixtures

We have the cluster posteriors

$$p(i \mid x) = \frac{P_i \, p(x \mid i)}{\sum_j P_j \, p(x \mid j)} = \frac{1}{1 + \sum_{j \ne i} \dfrac{P_j \, p(x \mid j)}{P_i \, p(x \mid i)}}$$

with

$$\frac{P_j \, p(x \mid j)}{P_i \, p(x \mid i)} = \frac{P_j}{P_i} \exp\!\left( -\frac{1}{2\sigma^2} \left( \| x - m_j \|^2 - \| x - m_i \|^2 \right) \right) .$$

K-means and Gaussian Mixtures

Take the limit $\sigma^2 \to 0$. It's quick to show that, provided none of the priors $P_i$ are zero,

$$\lim_{\sigma^2 \to 0} p(i \mid x) = \begin{cases} 1 & \text{if } \| x - m_i \| < \| x - m_k \| \;\; \forall\, k \ne i \\ 0 & \text{otherwise.} \end{cases}$$

So the cluster posteriors become hard cluster assignments (0 or 1), with the same cluster assignment as in k-means!
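This limit is easy to see numerically. Assuming the `spherical_gmm_posteriors` sketch above (hypothetical example values), shrinking σ² drives the posterior for the nearest mean toward 1 and all others toward 0:

    import numpy as np

    means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0]])
    priors = np.array([0.2, 0.3, 0.5])
    x = np.array([1.0, 0.5])   # closest to means[0]

    for sigma2 in [4.0, 1.0, 0.25, 0.01]:
        d2 = np.sum((means - x) ** 2, axis=1)
        p = priors * np.exp(-d2 / (2.0 * sigma2))
        p /= p.sum()
        print(sigma2, np.round(p, 4))
    # As sigma2 -> 0 the posterior approaches the hard k-means assignment (1, 0, 0),
    # regardless of the (nonzero) priors.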


K-means and Gaussian Mixtures

The optimal position of the mean for the $i$th cluster is (recall the EM algorithm for mixture model fitting)

$$m_i = \frac{\sum_{l=1}^{N} p(i \mid x_l)\, x_l}{\sum_{l=1}^{N} p(i \mid x_l)} ;$$

with our limiting ($\sigma^2 \to 0$) values of the posteriors, this becomes

$$m_i = \frac{1}{N_i} \sum_{l=1}^{N_i} x_l^{(i)} .$$

So the M-step of the EM algorithm becomes the cluster mean adjustment step of the k-means algorithm.

Thus, the EM algorithm for fitting spherical Gaussian mixture models reduces to the k-means clustering algorithm in the limit that the component variances are kept the same and taken to zero.
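A sketch of the M-step mean update (Python/NumPy assumed; `posts` is a hypothetical N x L array of posteriors p(i|x_l)), showing that with hard 0/1 posteriors it reduces to the per-cluster average used by k-means:

    import numpy as np

    def m_step_means(X, posts):
        # m_i = sum_l p(i|x_l) x_l / sum_l p(i|x_l)  -- the EM mean update.
        return (posts.T @ X) / posts.sum(axis=0)[:, None]

    # With hard (0/1) posteriors this is just the centroid of each cluster,
    # i.e. the k-means mean adjustment step.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 9.0]])
    hard = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
    print(m_step_means(X, hard))   # [[ 0.5  0. ] [10.5  9.5]]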
