ANLY-601
Advanced Pattern Recognition
Spring 2018
L13 – Clustering and Mixture Models
Clustering
Unsupervised classification
Central problem is defining a cluster
Parametric: use a cluster criterion, either a "distance" or "distortion" function for closeness (which need not be a mathematical metric), or a parametric model of the distribution.
Model-based clustering produces representative values of the data in each cluster, for example the centroid. Representative values are substitutes for the data in lossy data compression (e.g. scalar and vector quantization); hence clustering and lossy coding are related.
Model-Based Clustering
General algorithm properties
Data – $x_1, x_2, \dots, x_N$, each to be assigned to one of $L$ clusters or classes.
Assignment of the $i$th data point to a cluster is denoted $k_i \in \{1, \dots, L\}$.
Classification of the entire dataset is denoted $\kappa = (k_1, k_2, \dots, k_N)$.
There's a cluster criterion function $J(\kappa)$, and a best clustering
$$\kappa^* = \arg\min_{\kappa} J(\kappa)$$
Model-Based Clustering
General Properties (cont’d)
There is a set of parameters associated with the $k$th cluster, denoted $\theta_k$. The parameters may include the mean of the data in the cluster, the covariance of the data in the cluster, and other summary statistics.
The union of these parameters over all clusters is $\Theta$.
The clustering criterion function is generally a function of the parameters,
$$J = J(\kappa, \Theta)$$
and the best clustering model simultaneously minimizes $J$ over the cluster assignment and the parameters,
$$(\kappa^*, \Theta^*) = \arg\min_{\kappa,\, \Theta} J(\kappa, \Theta)$$
Example K-means
Classic algorithm. (MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 3, 1967.)
Criterion function is the average squared distance between the data and the cluster "means",
$$J = \sum_{i=1}^{L} \frac{N_i}{N} \cdot \frac{1}{N_i} \sum_{j=1}^{N_i} \left\| x_j^{(i)} - m_i \right\|^2$$
where $x_j^{(i)}$ is the $j$th data point assigned to cluster $i$, and $N_i$ is the number of data points assigned to cluster $i$.
The $m_i$ are the "means" associated with the clusters – collectively they comprise the parameters $\Theta$ associated with the clusters. (We have yet to show that these are statistical means, hence the quotation marks.)
Note that $N_i/N$ is the fraction of data points in the $i$th cluster. It's an estimator of the cluster prior $P_i$.
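For concreteness, here is a small Python helper (my own illustration, not from the slides) that evaluates this criterion for a given assignment and set of means. The $N_i/N$ and $1/N_i$ factors collapse into a single average over all points:

```python
import numpy as np

def kmeans_criterion(X, assign, means):
    """J = sum_i (N_i/N) * (1/N_i) * sum_j ||x_j^(i) - m_i||^2,
    i.e. the mean squared distance of every point to its assigned 'mean'."""
    diffs = X - means[assign]               # x_j - m_{assign(j)} for every point
    return (diffs ** 2).sum(axis=1).mean()  # average over all N points
```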
K-Means Iterative Optimization
• Freeze the means $m_i$, and find the assignment that minimizes $J$. This assigns $x_j$ to the cluster whose mean $m_r$ is the closest in Euclidean distance to $x_j$:
$$\|x_j - m_r\| \le \|x_j - m_k\| \quad \forall\, k \ne r$$
• Freeze the cluster assignments, and move the means to the centroid of each cluster:
$$m_r = \frac{1}{N_r} \sum_{i=1}^{N_r} x_i^{(r)}$$
Since the cluster assignments depend on the $m_k$, and the means depend on the cluster assignment $\kappa$, we must iterate these two steps.
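A minimal Python sketch of these two alternating steps (NumPy, the random-sample initialization, and the stopping test are my own illustrative choices, not prescribed by the lecture):

```python
import numpy as np

def kmeans(X, L, n_iter=100, seed=0):
    """Plain two-step k-means: assign each point to the nearest mean,
    then move each mean to the centroid of its assigned points."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=L, replace=False)]  # initialize from the data
    for _ in range(n_iter):
        # Step 1: freeze the means, assign each x_j to its nearest mean
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        # Step 2: freeze the assignments, move each mean to its cluster centroid
        new_means = np.array([
            X[assign == r].mean(axis=0) if np.any(assign == r) else means[r]
            for r in range(L)
        ])
        if np.allclose(new_means, means):   # assignments and means have stabilized
            break
        means = new_means
    return means, assign
```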
LBG
The k-means algorithm arose in the statistics community. It was independently discovered in the EE community, where it's known as the Linde, Buzo, Gray (LBG) algorithm, used to design vector quantizers for lossy coding.
The algorithm can be generalized by substituting for the squared Euclidean distance any distortion function that is bounded below by zero,
$$d(x_i, \theta_r) \ge 0$$
The criterion function is then
$$J = \sum_{i=1}^{L} \frac{N_i}{N} \cdot \frac{1}{N_i} \sum_{j=1}^{N_i} d\!\left(x_j^{(i)}, \theta_i\right)$$
LBG Optimization
As before, the optimization is iterative and proceeds in two steps.
• Freeze the cluster parameters $\theta_k$ and assign each datapoint $x_i$ to the cluster with lowest distortion: assign $x_i$ to cluster $r$ where
$$d(x_i, \theta_r) \le d(x_i, \theta_j) \quad \forall\, j \ne r$$
• Freeze the cluster assignments and adjust the parameters to minimize the distortion in each cluster,
$$\theta_r = \arg\min_{\theta} \frac{1}{N_r} \sum_{i=1}^{N_r} d\!\left(x_i^{(r)}, \theta\right)$$
The resulting values of the parameters $\theta_r$ are called the generalized centroids; they depend on the distortion function (and may not be actual centroids).
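As one concrete illustration (my own choice of distortion, not from the slides), here is a Python sketch of the two LBG steps using the L1 distortion $d(x, \theta) = \sum_k |x_k - \theta_k|$. For this distortion the generalized centroid works out to the coordinate-wise median rather than the mean:

```python
import numpy as np

def lbg_l1(X, L, n_iter=100, seed=0):
    """LBG iteration with d(x, theta) = sum_k |x_k - theta_k| (L1 distortion).
    The generalized centroid minimizing total L1 distortion is the coordinate-wise median."""
    rng = np.random.default_rng(seed)
    theta = X[rng.choice(len(X), size=L, replace=False)]
    for _ in range(n_iter):
        # Step 1: freeze the parameters, send each point to its lowest-distortion cluster
        dist = np.abs(X[:, None, :] - theta[None, :, :]).sum(axis=2)
        assign = dist.argmin(axis=1)
        # Step 2: freeze the assignments, replace theta_r by its generalized centroid
        new_theta = np.array([
            np.median(X[assign == r], axis=0) if np.any(assign == r) else theta[r]
            for r in range(L)
        ])
        if np.allclose(new_theta, theta):
            break
        theta = new_theta
    return theta, assign
```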
LBG Convergence
Each of the two optimization operations either lowers $J$ or leaves it unchanged. Since $J$ is bounded below, the algorithm converges to a (local) minimum of $J$.
For a finite number of data points, $J$ stops changing after a finite number of steps. For $L$ clusters and $N$ datapoints, there are $L^N$ different cluster assignments $\kappa$. Each such assignment, together with its optimal set of parameters $\Theta^*(\kappa)$, produces a particular value of $J$. Since each step lowers $J$ or leaves it unchanged, the algorithm must arrive at a local minimum of $J$ in $L^N$ steps or fewer.
Note – the convergence means $J$ comes to rest at a local minimum. It's conceivable that $(\kappa, \Theta)$ may continue to change (a finite number of times).
K-means Boundaries
For Euclidean distance distortion, boundaries between clusters are piecewise linear: perpendicular bisectors of the line segments between pairs of means. A point $x$ on the boundary between clusters $r$ and $s$ must satisfy
$$\|x - m_r\|^2 = \|x - m_s\|^2$$
Expanding the quadratics gives
$$2\, x^T (m_r - m_s) - (m_r + m_s)^T (m_r - m_s) = 0$$
which is linear in $x$. (You can prove the bisector property yourself.)
The set of boundaries is called a Voronoi tessellation.
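One way to see the bisector property (a short worked step, not on the slide): the linear condition regroups as
$$(m_r - m_s)^T \left( x - \frac{m_r + m_s}{2} \right) = 0$$
so the boundary passes through the midpoint of the segment joining $m_r$ and $m_s$ and is perpendicular to the direction $m_r - m_s$.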
K-means Partition and Mean Location
The algorithm tends to concentrate the means where
p(x) is high.
[Figure: 2-D Gaussian data (blue dots) and 30 means (red circles) placed by k-means, iterating 65 times over a dataset of 1000 points.]
Relation to Gaussian Mixture Models
Take a Gaussian mixture model with spherical components, all with the same variance. Don't fit the component variance; instead regard it as a 'knob'. The model is
$$p(x) = \sum_{i=1}^{L} P_i\, p(x \mid i), \qquad p(x \mid i) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\!\left( -\frac{\|x - m_i\|^2}{2\sigma^2} \right)$$
and the posterior cluster probabilities for data point $x$ are
$$p(i \mid x) = \frac{P_i\, p(x \mid i)}{\sum_j P_j\, p(x \mid j)} = \frac{1}{1 + \sum_{j \ne i} \frac{P_j\, p(x \mid j)}{P_i\, p(x \mid i)}}$$
K-means and Gaussian Mixtures
We have the cluster posteriors
$$p(i \mid x) = \frac{P_i\, p(x \mid i)}{\sum_j P_j\, p(x \mid j)} = \frac{1}{1 + \sum_{j \ne i} \frac{P_j\, p(x \mid j)}{P_i\, p(x \mid i)}}$$
with
$$\frac{P_j\, p(x \mid j)}{P_i\, p(x \mid i)} = \frac{P_j}{P_i} \exp\!\left( -\frac{1}{2\sigma^2} \left( \|x - m_j\|^2 - \|x - m_i\|^2 \right) \right)$$
K-means and Gaussian Mixtures
Take the limit $\sigma^2 \to 0$. It's quick to show that, provided none of the priors $P_i$ are zero,
$$\lim_{\sigma^2 \to 0} p(i \mid x) = \begin{cases} 1 & \text{if } \|x - m_i\| < \|x - m_k\| \ \ \forall\, k \ne i \\ 0 & \text{otherwise} \end{cases}$$
So the cluster posteriors become hard cluster assignments (0 or 1), with the same cluster assignment as in k-means!
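A small Python check of this hardening behaviour (the mixture parameters and test point below are my own illustrative numbers, not from the lecture; the shared normalizing constant is dropped since it cancels in the posterior):

```python
import numpy as np

def posteriors(x, means, priors, sigma2):
    """Posterior p(i|x) for a spherical Gaussian mixture with shared variance sigma2.
    Computed in log space so small sigma2 doesn't underflow."""
    d2 = ((means - x) ** 2).sum(axis=1)            # ||x - m_i||^2 for every component
    log_w = np.log(priors) - d2 / (2.0 * sigma2)   # log P_i + log p(x|i), up to a shared constant
    log_w -= log_w.max()                           # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

means = np.array([[0.0, 0.0], [2.0, 0.0]])   # two component means
priors = np.array([0.5, 0.5])                # equal priors
x = np.array([0.6, 0.1])                     # closer to the first mean
for sigma2 in [1.0, 0.1, 0.01]:
    print(sigma2, posteriors(x, means, priors, sigma2))
# as sigma2 -> 0 the posterior hardens toward (1, 0): the nearest-mean assignment
```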
K-means and Gaussian Mixtures
The optimal position of the mean for the $i$th cluster is (recall the EM algorithm for mixture model fitting)
$$m_i = \frac{\sum_{l=1}^{N} p(i \mid x_l)\, x_l}{\sum_{l=1}^{N} p(i \mid x_l)}$$
With our limiting values of the posteriors, this becomes
$$\lim_{\sigma^2 \to 0} m_i = \frac{1}{N_i} \sum_{l=1}^{N_i} x_l^{(i)}$$
So the M-step of the EM algorithm becomes the cluster mean adjustment step of the k-means algorithm.
Thus, the EM algorithm for fitting spherical Gaussian mixture models reduces to the k-means clustering algorithm in the limit that the component variances are kept the same and taken to zero.
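A short worked step (not spelled out on the slide) making the substitution explicit: in the limit, $p(i \mid x_l) = 1$ exactly for the $N_i$ points assigned to cluster $i$ and $0$ otherwise, so
$$m_i = \frac{\sum_{l=1}^{N} p(i \mid x_l)\, x_l}{\sum_{l=1}^{N} p(i \mid x_l)} \;\longrightarrow\; \frac{\sum_{l:\, k_l = i} x_l}{N_i} = \frac{1}{N_i} \sum_{l=1}^{N_i} x_l^{(i)}$$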