Clustering. Unsupervised Learning
Maria-Florina Balcan
04/06/2015
Reading:
• Chapter 14.3: Hastie, Tibshirani, Friedman.
Additional resources:
• Center Based Clustering: A Foundational Perspective.
Awasthi, Balcan. Handbook of Clustering Analysis. 2015.
Logistics
• Project:
• Midway Review due today.
• Final Report, May 8.
• Poster Presentation, May 11.
• Communicate with your mentor TA!
• Exam #2 on April 29th.
Clustering, Informal Goals
Goal: Automatically partition unlabeled data into groups of similar datapoints.
Question: When and why would we want to do this? Useful for:
• Automatically organizing data.
• Understanding hidden structure in data.
• Preprocessing for further analysis.
• Representing high-dimensional data in a low-dimensional space (e.g., for visualization purposes).
Applications (Clustering comes up everywhere…)
• Cluster news articles, web pages, or search results by topic.
• Cluster protein sequences by function, or genes according to expression profile.
• Cluster users of social networks by interest (community detection).
(Figures: a Facebook network and a Twitter network.)
• Cluster customers according to purchase history.
• Cluster galaxies or nearby stars (e.g., Sloan Digital Sky Survey).
And many, many more applications…
Today:
Clustering
• Objective based clustering
• Hierarchical clustering
• Mention overlapping clusters
[March 4th: EM-style algorithm for clustering for mixture of Gaussians (specific probabilistic model).]
Objective Based Clustering
Input: A set S of n points, also a distance/dissimilarity measure specifying the distance d(x,y) between pairs (x,y).
E.g., # keywords in common, edit distance, wavelets coef., etc.
Goal: output a partition of the data.
– k-means: find center points c₁, c₂, …, c_k to minimize ∑ᵢ₌₁ⁿ min_{j∈{1,…,k}} d²(xᵢ, cⱼ).
– k-median: find center points c₁, c₂, …, c_k to minimize ∑ᵢ₌₁ⁿ min_{j∈{1,…,k}} d(xᵢ, cⱼ).
– k-center: find a partition to minimize the maximum radius.
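To make the three objectives concrete, here is a minimal pure-Python sketch (illustrative code, not from the lecture; the function names are mine) that evaluates each cost for a given set of centers:

```python
import math

def _dist(x, c):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))

def kmeans_cost(points, centers):
    """sum_i min_j d^2(x_i, c_j)"""
    return sum(min(_dist(x, c) ** 2 for c in centers) for x in points)

def kmedian_cost(points, centers):
    """sum_i min_j d(x_i, c_j)"""
    return sum(min(_dist(x, c) for c in centers) for x in points)

def kcenter_cost(points, centers):
    """max_i min_j d(x_i, c_j): the maximum cluster radius"""
    return max(min(_dist(x, c) for c in centers) for x in points)
```

Note that the three objectives can prefer very different centers on the same data: k-means penalizes large distances quadratically, k-center only cares about the single worst point.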
Euclidean k-means Clustering
Input: A set of n datapoints x₁, x₂, …, x_n in Rᵈ; target number of clusters k.
Output: k representatives c₁, c₂, …, c_k ∈ Rᵈ.
Objective: choose c₁, c₂, …, c_k ∈ Rᵈ to minimize ∑ᵢ₌₁ⁿ min_{j∈{1,…,k}} ‖xᵢ − cⱼ‖².
Natural assignment: each point assigned to its closest center, leads to a Voronoi partition.
Computational complexity:
NP-hard even for k = 2 [Dasgupta ’08] or d = 2 [Mahajan–Nimbhorkar–Varadarajan ’09].
There are a couple of easy cases…
An Easy Case for k-means: k=1
Input: A set of n datapoints 𝐱𝟏, 𝐱𝟐, … , 𝐱𝒏 in Rd
Output: c ∈ Rᵈ to minimize ∑ᵢ₌₁ⁿ ‖xᵢ − c‖².
Solution: The optimal choice is 𝛍 = (1/n) ∑ᵢ₌₁ⁿ xᵢ.
Idea: a bias/variance-like decomposition:
(1/n) ∑ᵢ₌₁ⁿ ‖xᵢ − c‖² = ‖𝛍 − c‖² + (1/n) ∑ᵢ₌₁ⁿ ‖xᵢ − 𝛍‖²
(avg k-means cost w.r.t. c)  (avg k-means cost w.r.t. 𝛍)
So, the optimal choice for c is 𝛍.
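The decomposition can be checked numerically, here in one dimension (an illustrative sketch, not lecture code):

```python
import random

def avg_cost(points, c):
    """(1/n) * sum_i (x_i - c)^2 for 1-D points."""
    return sum((x - c) ** 2 for x in points) / len(points)

random.seed(0)
points = [random.gauss(0.0, 1.0) for _ in range(100)]
mu = sum(points) / len(points)
c = 2.5  # any candidate center

lhs = avg_cost(points, c)
rhs = (mu - c) ** 2 + avg_cost(points, mu)
assert abs(lhs - rhs) < 1e-9  # the decomposition holds exactly
```

Since the second term on the right does not depend on c, the cost is minimized by making the first term zero, i.e. c = mu.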
Another Easy Case for k-means: d=1
Input: A set of n datapoints x₁, x₂, …, x_n in R (d = 1); target number of clusters k.
Output: c₁, …, c_k ∈ R to minimize ∑ᵢ₌₁ⁿ min_{j∈{1,…,k}} (xᵢ − cⱼ)².
Extra-credit homework question
Hint: dynamic programming in time O(n2k).
Common Heuristic in Practice: Lloyd's Method
[Least squares quantization in PCM, Lloyd, IEEE Transactions on Information Theory, 1982]
Input: A set of n datapoints x₁, x₂, …, x_n in Rᵈ.
Initialize centers c₁, c₂, …, c_k ∈ Rᵈ and clusters C₁, C₂, …, C_k in any way.
Repeat until there is no further change in the cost:
• For each j: Cⱼ ← {x ∈ S whose closest center is cⱼ}
• For each j: cⱼ ← mean of Cⱼ
Holding c₁, c₂, …, c_k fixed, pick the optimal C₁, C₂, …, C_k.
Holding C₁, C₂, …, C_k fixed, pick the optimal c₁, c₂, …, c_k.
Note: it always converges, because
• the cost always drops, and
• there is only a finite number of Voronoi partitions (so only a finite number of values the cost can take).
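The two alternating steps can be sketched in plain Python (an illustrative implementation, not the lecture's code; keeping an empty cluster's old center is one common convention):

```python
def lloyd(points, centers, max_rounds=100):
    """Lloyd's method: alternate nearest-center assignment and mean updates
    until the clustering stops changing. `centers` is the initialization."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_rounds):
        # Step 1: assign each point to its closest center (Voronoi partition)
        new_assignment = [
            min(range(len(centers)),
                key=lambda j: sum((xi - ci) ** 2
                                  for xi, ci in zip(x, centers[j])))
            for x in points
        ]
        if new_assignment == assignment:  # no change => converged
            break
        assignment = new_assignment
        # Step 2: move each center to the mean of its cluster
        for j in range(len(centers)):
            cluster = [x for x, a in zip(points, assignment) if a == j]
            if cluster:  # an empty cluster keeps its old center
                d = len(cluster[0])
                centers[j] = tuple(sum(x[i] for x in cluster) / len(cluster)
                                   for i in range(d))
    return centers, assignment
```

Both steps can only decrease the cost, which is why the loop terminates.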
Initialization for Lloyd's Method
• Initialization is crucial (it affects how fast the method converges and the quality of the solution output).
• Techniques commonly used in practice:
• Random centers from the datapoints (repeat a few times)
• Furthest traversal
• k-means++ (works well and has provable guarantees)
Lloyd’s method: Random Initialization
Lloyd’s method: Random Initialization Example: Given a set of datapoints
Lloyd’s method: Random Initialization
Select initial centers at random
Lloyd’s method: Random Initialization
Assign each point to its nearest center
Lloyd’s method: Random Initialization
Recompute optimal centers given a fixed clustering
Lloyd’s method: Random Initialization
Assign each point to its nearest center
Lloyd’s method: Random Initialization
Recompute optimal centers given a fixed clustering
Lloyd’s method: Random Initialization
Assign each point to its nearest center
Lloyd’s method: Random Initialization
Recompute optimal centers given a fixed clustering
Get a good quality solution in this example.
Lloyd’s method: Performance
It always converges, but it may converge at a local optimum that is different from the global optimum, and in fact could be arbitrarily worse in terms of its score.
Lloyd’s method: Performance
Local optimum: every point is assigned to its nearest center and every center is the mean value of its points.
Lloyd’s method: Performance
.It is arbitrarily worse than optimum solution….
Lloyd’s method: Performance
This bad performance, can happen even with well separated Gaussian clusters.
Lloyd’s method: Performance
This bad performance, can happen even with well separated Gaussian clusters.
Some Gaussian are combined…..
Lloyd’s method: Performance • If we do random initialization, as k increases, it becomes
more likely we won’t have perfectly picked one center per Gaussian in our initialization (so Lloyd’s method will output a bad solution).
• For k equal-sized Gaussians, Pr[each initial center is in a different Gaussian] ≈ 𝑘! ≈ 1
• Becomes unlikely as k gets large.
𝑘𝑘 𝑒𝑘
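A quick sanity check of the k!/kᵏ estimate (illustrative code, not from the lecture): choosing k random datapoints from k equal-sized Gaussians amounts to k uniform draws of a Gaussian index, so we can estimate the probability by simulation.

```python
import math
import random

def prob_all_distinct(k, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr[k uniform draws from k equal-sized
    groups land in k different groups]."""
    rng = random.Random(seed)
    hits = sum(
        len({rng.randrange(k) for _ in range(k)}) == k
        for _ in range(trials)
    )
    return hits / trials

k = 5
print(prob_all_distinct(k), math.factorial(k) / k ** k)  # estimate vs exact k!/k^k
```

For k = 5 the exact value k!/kᵏ = 120/3125 ≈ 0.038 already shows how quickly the probability decays.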
Another Initialization Idea: Furthest Point Heuristic
• Choose c₁ arbitrarily (or at random).
• For j = 2, …, k:
• Pick cⱼ to be the datapoint among x₁, x₂, …, x_n that is farthest from the previously chosen c₁, c₂, …, c_{j−1}.
Fixes the Gaussian problem, but it can be thrown off by outliers…
The furthest point heuristic does well on the previous example, but it is sensitive to outliers.
Example with k = 3 and datapoints (−2,0), (0,1), (3,0), (0,−1): the outlier pulls one of the three centers to itself.
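A sketch of the heuristic (illustrative, names are mine). On the four points above with k = 3, the farthest-first traversal spends a center on the point (3,0):

```python
import math

def furthest_point_init(points, k):
    """Furthest traversal: start from an arbitrary point, then repeatedly
    add the point farthest from all centers chosen so far."""
    centers = [points[0]]  # arbitrary first choice
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda x: min(math.dist(x, c) for c in centers)))
    return centers

pts = [(-2, 0), (0, 1), (3, 0), (0, -1)]
print(furthest_point_init(pts, 3))  # (3,0) is always among the chosen centers
```

Each new center maximizes the minimum distance to the existing ones, so an extreme outlier is guaranteed to be picked early.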
K-means++ Initialization: D2 sampling [AV07]
• Interpolate between random and furthest point initialization.
• Let D(x) be the distance between a point x and its nearest center chosen so far. Choose the next center with probability proportional to D²(x).
• Choose c₁ at random.
• For j = 2, …, k:
• Pick cⱼ among x₁, x₂, …, x_n according to the distribution
Pr(cⱼ = xᵢ) ∝ min_{j′<j} ‖xᵢ − c_{j′}‖² = D²(xᵢ)
Theorem: K-means++ always attains an O(log k) approximation to optimal k-means solution in expectation.
Running Lloyd’s can only further improve the cost.
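D² sampling can be sketched in a few lines (illustrative code, not the lecture's; `kmeanspp_init` is my name):

```python
import math
import random

def kmeanspp_init(points, k, rng=random):
    """k-means++ (D^2 sampling): first center uniform at random; each later
    center is a datapoint drawn with probability proportional to D(x)^2,
    the squared distance to its nearest already-chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(math.dist(x, c) ** 2 for c in centers) for x in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers
```

Points far from every existing center get large weight, so the sampling behaves like a randomized, outlier-robust version of furthest traversal.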
K-means++ Idea: D² sampling
• Interpolate between random and furthest point initialization.
• Let D(x) be the distance between a point x and its nearest center. Choose the next center proportional to Dᵅ(x).
• α = 0: random sampling.
• α = ∞: furthest point (side note: this actually works well for k-center).
• α = 2: k-means++.
• Side note: α = 1 works well for k-median.
K-means++ Fix
Example (k = 3): datapoints (−2,0), (0,1), (3,0), (0,−1).
K-means++ / Lloyd's Running Time
• K-means++ initialization: O(nd) time and one pass over the data to select each next center, so O(nkd) time in total.
• Lloyd's method:
• Each round takes time O(nkd).
• Exponential number of rounds in the worst case [AV07]; expected polynomial time in the smoothed analysis model!
K-means++ / Lloyd's Summary
• K-means++ always attains an O(log k) approximation to the optimal k-means solution in expectation.
• Running Lloyd's can only further improve the cost.
• Exponential number of rounds in the worst case [AV07].
• Expected polynomial time in the smoothed analysis model!
• Does well in practice.
What value of k???
• Heuristic: find a large gap between the (k−1)-means cost and the k-means cost.
• Hold-out validation/cross-validation on auxiliary task (e.g., supervised learning task).
• Try hierarchical clustering.
Hierarchical Clustering
(Example hierarchy: all topics → {sports, fashion}; sports → {soccer, tennis}; fashion → {Gucci, Lacoste}.)
• A hierarchy might be more natural.
• Different users might care about different levels of granularity, or even different prunings.
Hierarchical Clustering
Top-down (divisive):
• Partition data into 2 groups (e.g., 2-means).
• Recursively cluster each group.
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the "closest" two clusters.
• Different definitions of "closest" give different algorithms.
(Figure: the topics hierarchy — all topics; sports {soccer, tennis}; fashion {Gucci, Lacoste}.)
Bottom-Up (agglomerative)
Have a distance measure on pairs of objects: d(x, y) = distance between x and y.
E.g., # keywords in common, edit distance, etc.
• Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
• Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)
• Average linkage: dist(A, B) = avg_{x∈A, x′∈B} dist(x, x′)
• Ward's method (defined later)
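The generic agglomerative loop with a pluggable linkage can be sketched as follows (illustrative code, not from the lecture; this naive version recomputes linkage distances each round, matching the O(N³) analysis later):

```python
import math

def agglomerative(points, k, linkage=min):
    """Naive agglomerative clustering: start with singletons, repeatedly
    merge the pair of clusters with smallest linkage distance.
    `linkage` aggregates the pairwise distances between two clusters:
    min -> single linkage, max -> complete linkage."""
    clusters = [[p] for p in points]

    def dist(A, B):
        ds = [math.dist(x, y) for x in A for y in B]
        return linkage(ds)

    while len(clusters) > k:
        # find the closest pair of clusters under the chosen linkage
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters.pop(j)
    return clusters
```

Average linkage is obtained by passing `linkage=lambda ds: sum(ds) / len(ds)`. On the six 1-D points used in the dendrogram examples (−3, −2, 0, 2.1, 3.2, 6), single linkage with k = 2 leaves the far point 6 on its own, while complete linkage splits the points into two groups of three.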
Single Linkage
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the "closest" two clusters.
Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
(Figure: a dendrogram over six 1-D points A–F at positions −3, −2, 0, 2.1, 3.2, 6; single linkage merges AB, then DE, then ABC, then ABCDE, then ABCDEF.)
Single Linkage
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the "closest" two clusters.
Single linkage: dist(A, B) = min_{x∈A, x′∈B} dist(x, x′)
One way to think of it: at any moment, we see the connected components of the graph in which any two points at distance < r are connected. Watch as r grows (only n − 1 values of r are relevant, because we only merge when r reaches the distance between two points in different clusters).
(Figure: the same six points A–F at −3, −2, 0, 2.1, 3.2, 6.)
Complete Linkage
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the "closest" two clusters.
Complete linkage: dist(A, B) = max_{x∈A, x′∈B} dist(x, x′)
One way to think of it: keep the maximum cluster diameter as small as possible at any level.
(Figure: dendrogram on the same points A–F; complete linkage merges AB, then DE, then ABC, then DEF, then ABCDEF.)
Ward's Method
Bottom-up (agglomerative):
• Start with every point in its own cluster.
• Repeatedly merge the "closest" two clusters.
Ward's method: dist(C, C′) = (|C| · |C′|) / (|C| + |C′|) · ‖mean(C) − mean(C′)‖²
Merge the two clusters such that the increase in k-means cost is as small as possible.
Works well in practice.
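Ward's distance is exactly the increase in k-means cost caused by merging the two clusters; a small numerical check (illustrative code, names mine):

```python
def mean(C):
    d = len(C[0])
    return [sum(x[i] for x in C) / len(C) for i in range(d)]

def cluster_cost(C):
    """Sum of squared distances of the cluster's points to its mean."""
    m = mean(C)
    return sum(sum((xi - mi) ** 2 for xi, mi in zip(x, m)) for x in C)

def ward_dist(C1, C2):
    """(|C||C'| / (|C| + |C'|)) * ||mean(C) - mean(C')||^2"""
    m1, m2 = mean(C1), mean(C2)
    gap2 = sum((a - b) ** 2 for a, b in zip(m1, m2))
    return len(C1) * len(C2) / (len(C1) + len(C2)) * gap2

C1 = [(0.0, 0.0), (1.0, 0.0)]
C2 = [(4.0, 1.0), (6.0, 1.0), (5.0, 3.0)]
increase = cluster_cost(C1 + C2) - cluster_cost(C1) - cluster_cost(C2)
assert abs(ward_dist(C1, C2) - increase) < 1e-9
```

This identity is why greedily merging by Ward's distance is exactly the greedy rule "increase the k-means cost as little as possible."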
Running time
• Each algorithm starts with N clusters, and performs N-1 merges.
• For each algorithm, computing 𝑑𝑖𝑠𝑡(𝐶, 𝐶′) can be done in time
𝑂( 𝐶 ⋅ 𝐶′ ). (e.g., examining 𝑑𝑖𝑠𝑡(𝑥, 𝑥′) for all 𝑥 ∈ 𝐶, 𝑥′ ∈ 𝐶′)
• Time to compute all pairwise distances and take smallest is 𝑂(𝑁2).
• Overall time is 𝑂(𝑁3).
In fact, can run all these algorithms in time 𝑂(𝑁2 log 𝑁).
See: Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press. 2008. http://www-nlp.stanford.edu/IR-book/
Hierarchical Clustering Experiments [BLG, JMLR’15]
Ward’s method does the best among classic techniques.
What You Should Know
• Partitional Clustering. k-means and k-means ++
• Lloyd’s method
• Initialization techniques (random, furthest
traversal, k-means++)
• Hierarchical Clustering.
• Single linkage, complete linkage, Ward's method
Additional Slides
Smoothed analysis model
• Imagine a worst-case input.
• But then add small Gaussian perturbation to each data point.
• Theorem [Arthur–Manthey–Röglin 2009]:
– If we add Gaussian perturbation with variance σ², then E[number of rounds until Lloyd's converges] is polynomial in n and 1/σ.
– The actual bound is O(n³⁴ k³⁴ d⁸ / σ⁶).
• Might still find a local optimum that is far from the global optimum.
Overlapping Clusters: Communities
(Figure: one person, e.g. Christos Papadimitriou, belongs to several overlapping communities: colleagues at Berkeley, TCS, databases, systems, algorithmic game theory.)
Overlapping Clusters: Communities
• Social networks
• Professional networks
• Product purchasing networks, citation networks, biological networks, etc.
Overlapping Clusters: Communities
(Figure: overlapping product categories, e.g. a "Baby's Favorite Songs" CD of lullabies sits under both Kids and Electronics.)