COMS 4771 Clustering
Nakul Verma
Supervised Learning
Data: labeled pairs $(x_1, y_1), \ldots, (x_n, y_n)$
Assumption: there is a (relatively simple) function $f^*$ such that $f^*(x_i) = y_i$ for most $i$
Learning task: given $n$ examples from the data, find an approximation $\hat{f} \approx f^*$
Supervised learning
Goal: the learned classifier gives mostly correct predictions on unseen examples
Training phase: labeled training data ($n$ examples from the data) → Learning Algorithm → ‘classifier’
Testing phase: unlabeled test data (unseen / future data) → ‘classifier’ → prediction
Unsupervised Learning
Data: unlabeled examples $x_1, \ldots, x_n$
Assumption: there is an underlying structure in the data
Learning task: discover the structure given n examples from the data
Goal: come up with a summary of the data using the discovered structure
• Partition the data into meaningful groups → clustering
• Find a low-dimensional representation that retains important information and suppresses irrelevant/noisy information → dimensionality reduction
Let’s take a closer look using an example…
Unsupervised learning
Example: Handwritten digits revisited
Handwritten digit data, but with no labels
What can we do?
• Suppose we know that there are 10 groupings; can we find the groups?
• What if we don’t know there are 10 groups?
• How can we discover/explore other structure in such data?
A 2D visualization of the digits dataset
Handwritten digits visualization
Grouping The Data, aka Clustering
Data: $x_1, \ldots, x_n \in \mathbb{R}^d$
Given: a known target number of groups $k$. Output: a partition of the data into $k$ groups.
This is called the clustering problem,
also known as unsupervised classification, or quantization
k-means
Given: data $x_1, \ldots, x_n \in \mathbb{R}^d$, and intended number of groupings $k$
Idea: find a set of representatives $c_1, \ldots, c_k$ such that each data point is close to some representative
Optimization:
\[
\min_{c_1, \ldots, c_k} \; \sum_{i=1}^{n} \, \min_{1 \le j \le k} \lVert x_i - c_j \rVert^2
\]
Unfortunately this is NP-hard, even for $d = 2$ and $k = 2$.
How do we solve the $d = 1$ or $k = 1$ case?
How do we optimize this?
Algorithm to approximate k-means
Given: data $x_1, \ldots, x_n$, and intended number of groupings $k$
Alternating optimization algorithm:
• Initialize cluster centers (say, randomly)
• Repeat till no more changes occur:
  • Assign each data point to its closest center, assuming the centers are fixed (this creates a partition)
  • Find the optimal centers, assuming the data partition is fixed
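To make the alternating updates concrete, here is a minimal NumPy sketch of the algorithm above; the function name, the convergence check, and the random-data-point initialization are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Approximate k-means via alternating updates (assignments <-> centers)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize cluster centers, say randomly: pick k distinct data points.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Assignment step (centers fixed): send each point to its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # n x k
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # no more changes occur: stop
        assignments = new_assignments
        # Center step (partition fixed): the optimal center of a group is its mean.
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    return centers, assignments
```

For an $n \times d$ data matrix X, something like `centers, labels = kmeans(X, k=3)` recovers $k$ centers and a hard assignment of each point.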
k-means
Some properties of this alternating updates algorithm:
• The approximation can be arbitrarily bad, compared to the best cluster assignment!
• Performance quality is heavily dependent on the initialization!
k-means:
• How to select k?
Is the right choice $k = 2$ or $k = 3$?
Solution: encode clustering for all values of k!
(hierarchical clustering)
Example: Clustering Without Committing to k
$k = 3$ (coarser resolution) vs. $k = 6$ (finer resolution)
Hierarchical Clustering
Two approaches:
Top Down (divisive):
• Partition data into two groups (say, by k-means, with k=2)
• Recurse on each part
• Stop when the data cannot be partitioned any further (i.e., only single points are left)
Bottom Up (agglomerative):
• Start with each data sample as its own cluster (so the initial number of clusters is n)
• Repeatedly merge “closest” pair of clusters
• Stop when only one cluster is left
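As a rough illustration of the bottom-up approach, the following sketch uses SciPy's agglomerative clustering routines; the toy data, the "ward" linkage choice, and the cuts at k = 3 and k = 6 are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data in R^2 (a stand-in for, e.g., digit features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

# Bottom up (agglomerative): start with every point as its own cluster and
# repeatedly merge the "closest" pair of clusters until one cluster is left.
Z = linkage(X, method="ward")  # Z encodes the full merge tree (dendrogram)

# The hierarchy encodes a clustering for every k; cut it at any resolution:
labels_coarse = fcluster(Z, t=3, criterion="maxclust")  # coarser: k = 3
labels_fine = fcluster(Z, t=6, criterion="maxclust")    # finer:   k = 6
```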
Clustering via Probabilistic Mixture Modeling
An alternative way to cluster data:
Given: data $x_1, \ldots, x_n$, and the intended number of clusters $k$. Assume a joint probability distribution $P(X, C)$ over the joint space $\mathcal{X} \times [k]$.
Parameters: $\theta$ (looks familiar?)
Modeling assumption: data $(x_1, c_1), \ldots, (x_n, c_n)$ drawn i.i.d. from $P(X, C)$,
BUT we only get to see partial information: $x_1, x_2, \ldots, x_n$ ($c_1, \ldots, c_n$ are hidden!)
The joint factorizes as $P(X, C) = P(C) \, P(X \mid C)$: a discrete distribution over the clusters, $P[C = i] = \pi_i$ (mixing weights $\pi_1, \ldots, \pi_k$), times some multivariate distribution for $P(X \mid C = i)$, e.g. a Gaussian.
Gaussian Mixture Modeling (GMM)
Mixing weights $\pi_j$ and mixture components $P(X \mid C = j)$; example in $\mathbb{R}^2 \times [3]$.
(This is called a mixture model.)
Gaussian Mixture Model: given data $x_1, \ldots, x_n$ and $k$, assume a joint probability distribution over the joint space $\mathbb{R}^d \times [k]$, with mixing weights $\pi_1, \ldots, \pi_k$ and Gaussian mixture components.
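Written out, the marginal density of an observation under this model has the standard mixture form (with $N(\cdot \mid \mu_j, \Sigma_j)$ denoting the multivariate Gaussian density):
\[
P_\theta(x) \;=\; \sum_{j=1}^{k} \pi_j \, N(x \mid \mu_j, \Sigma_j),
\qquad \pi_j \ge 0, \quad \sum_{j=1}^{k} \pi_j = 1,
\]
with parameters $\theta = (\pi_1, \ldots, \pi_k, \mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k)$.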
GMM: Parameter Learning
So… how to learn the parameters θ ?
MLE approach:
Given data $x_1, \ldots, x_n$ drawn i.i.d., maximize the log-likelihood
\[
\log L(\theta) \;=\; \sum_{i=1}^{n} \log \sum_{j=1}^{k} \pi_j \, N(x_i \mid \mu_j, \Sigma_j).
\]
Ummm… now what?
The log of a sum cannot really be simplified further!
GMM: Maximum Likelihood
MLE for mixture modeling (like GMMs) is NOT a convex optimization problem. In fact, the maximum likelihood estimate for GMMs is degenerate!
Example: $\mathcal{X} = \mathbb{R}$, $k = 2$ (fit two Gaussians in 1d). Center one Gaussian on a single data point: as its $\sigma \to 0$, the likelihood $\to \infty$!
Which pair of Gaussians gives a higher likelihood?
Aside: why doesn’t this occur when fitting one Gaussian?
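To see the degeneracy concretely, center one component at a data point, $\mu_1 = x_1$. Its contribution to the likelihood of $x_1$ is
\[
\pi_1 \, N(x_1 \mid \mu_1, \sigma^2) \;=\; \frac{\pi_1}{\sqrt{2\pi\sigma^2}} \;\longrightarrow\; \infty \quad \text{as } \sigma \to 0,
\]
while the second, broad component keeps the likelihood of every other point bounded away from zero. With a single Gaussian, shrinking $\sigma$ also drives the density of all the other data points to zero, so the overall likelihood stays bounded.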
GMM: (local) Maximum Likelihood
So, can we make any progress?
Observation: even though a global MLE maximizer is not appropriate, several local maximizers are desirable!
Example: start from a non-maximized likelihood and do a few steps of gradient ascent; it reaches a desirable local maximum!
A better algorithm for finding good parameters: Expectation Maximization (EM)
Expectation Maximization (EM) Algorithm
Similar in spirit to the alternating-updates algorithm for k-means
Idea:
• Initialize the parameters arbitrarily
• Given the current setting of parameters find the best (soft) assignment of data samples to the clusters (Expectation-step)
• Update all the parameters with respect to the current (soft) assignment that maximizes the likelihood (Maximization-step)
• Repeat until no more progress is made.
EM for GMM
Initialize the parameters $\theta = (\pi_1, \ldots, \pi_k, \mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k)$ arbitrarily.
Expectation-step: for each $i$ and $j$, compute the (soft) assignment of data $x_i$ to cluster $j$:
\[
\gamma_{ij} \;=\; \frac{\pi_j \, N(x_i \mid \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \pi_l \, N(x_i \mid \mu_l, \Sigma_l)}
\]
Maximization-step: maximize the log-likelihood of the parameters given these soft assignments. With $n_j := \sum_{i=1}^{n} \gamma_{ij}$ (the effective number of points assigned to cluster $j$):
\[
\pi_j = \frac{n_j}{n}, \qquad
\mu_j = \frac{1}{n_j} \sum_{i=1}^{n} \gamma_{ij} \, x_i, \qquad
\Sigma_j = \frac{1}{n_j} \sum_{i=1}^{n} \gamma_{ij} \, (x_i - \mu_j)(x_i - \mu_j)^\top
\]
Why do these updates maximize the likelihood?
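A minimal NumPy/SciPy sketch of these E and M updates, assuming full covariances, a fixed number of iterations, and a small ridge added to each covariance for numerical stability (these choices are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iters=50, seed=0):
    """EM for a Gaussian mixture: alternate soft assignments (E) and parameter updates (M)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Arbitrary initialization: uniform weights, random data points as means, identity covariances.
    pis = np.full(k, 1.0 / k)
    mus = X[rng.choice(n, size=k, replace=False)].astype(float)
    sigmas = np.stack([np.eye(d) for _ in range(k)])
    for _ in range(n_iters):
        # E-step: gamma[i, j] = pi_j N(x_i; mu_j, Sigma_j) / sum_l pi_l N(x_i; mu_l, Sigma_l)
        dens = np.column_stack([
            pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=sigmas[j])
            for j in range(k)
        ])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from the soft assignments.
        nj = gamma.sum(axis=0)                    # effective number of points per cluster
        pis = nj / n
        mus = (gamma.T @ X) / nj[:, None]
        for j in range(k):
            diff = X - mus[j]
            sigmas[j] = (gamma[:, j, None] * diff).T @ diff / nj[j] + 1e-6 * np.eye(d)
    return pis, mus, sigmas, gamma
```

For data X of shape (n, d), `pis, mus, sigmas, gamma = em_gmm(X, k=3)` returns the mixing weights, means, covariances, and final soft assignments; a hard cluster for $x_i$ can be read off as `gamma[i].argmax()`.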
EM for GMM in Action
• Arbitrary θ assignment
• E step: soft assignment of data
• M step: maximize parameter estimates
• After two rounds
• After five rounds
• After twenty rounds
What We Learned…
• Unsupervised Learning problems: Clustering and Dimensionality Reduction
• k-means
• Hierarchical Clustering
• Gaussian Mixture Models
• EM algorithm
Questions?
Next time…
Dimension reduction!