Lecture 26: Clustering CS 189 (CDSS offering)
Today’s lecture
Recap: Unsupervised learning is focused on finding patterns in data without the associated labels
Generally speaking, rather than having data (x1, y1), …, (xN, yN), our data is instead of the form x1, …, xN
Previously: Dimensionality Reduction
Today: Clustering!
Clustering: High-Level Idea
Objective: Assign each data point to a cluster
Applications
Data Exploration: Categorizing Consumer Behavior
Outlier Detection using Clustering
What makes good clusters?
Well-Constructed Clusters should have:
1. High intra-cluster similarity
2. Low inter-cluster similarity
It seems that we have to formalize the notion of similarity…
We can define similarity with a distance metric!
Want points to be similar/dissimilar = want distance to be minimized/maximized
Distance Metric as Similarity
Properties of a Distance Metric
1. Equivalence: j = k if and only if distance(j, k) = 0
2. Symmetry: distance(j, k) = distance(k, j)
3. Triangle Inequality: distance(i, j) + distance(j, k) ≥ distance(i, k)
Euclidean Distance (l2): distance(x, y) = √( ∑_{i=1}^{d} (xi − yi)^2 )
Manhattan Distance (l1): distance(x, y) = ∑_{i=1}^{d} |xi − yi|
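A quick NumPy sketch of the two distances above (the function names and example points are my own, not from the lecture):

```python
import numpy as np

def euclidean_distance(x, y):
    # l2 distance: square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan_distance(x, y):
    # l1 distance: sum of absolute coordinate differences
    return np.sum(np.abs(x - y))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(euclidean_distance(x, y))  # 5.0
print(manhattan_distance(x, y))  # 7.0
```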
K-Means Clustering: Intro
In K-Means clustering, each cluster is represented by a centroid, and each data point is assigned to one cluster
Goal: Find the best centroids that minimize the distance between the centroid and the data points assigned to its cluster
Objective Function: min ∑_{k} ∑_{x ∈ Sk} ∥x − ck∥^2
where {c1, c2, …, ck} are the cluster centroids and {S1, S2, …, Sk} are the corresponding sets of points assigned to each cluster
K-Means Clustering: Algorithm
1. Initialize cluster centroids {c1, c2, . . . , ck} randomly
2. Repeat until convergence:
A. For every data point x(i), assign it a cluster label z(i) = argmin_k ∥x(i) − ck∥^2
B. For every cluster centroid ck, set ck = ( ∑_{i=1}^{n} 1{z(i) = k} x(i) ) / ( ∑_{i=1}^{n} 1{z(i) = k} ), i.e., the mean of the points currently assigned to cluster k
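A minimal NumPy sketch of this loop (Lloyd's algorithm); the function name, the initialization from random data points, and the convergence check are my own choices, not taken from the lecture:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Cluster the rows of X (shape (n, d)) into k clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2A: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        z = dists.argmin(axis=1)
        # Step 2B: move every centroid to the mean of its assigned points
        new_centroids = np.array([
            X[z == j].mean(axis=0) if np.any(z == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids stopped moving
        centroids = new_centroids
    return centroids, z
```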
K-Means: Example
Image Credits: CS 229n Notes
K-Means Clustering vs. K-Nearest Neighbors
K-Nearest Neighbors (K=15)
Goal: Predict the label of a test point
Data: (x1, y1), …, (xn, yn)
When given a test point… compares the point to its k nearest data points

K-Means Clustering (K=3)
Goal: Find patterns in the data
Data: x1, x2, …, xn
Training Algorithm: Move each centroid to be the mean of its assigned points
When given a test point… compares the point to the k centroids
Tuning the K-Means Algorithm
How do we know how many clusters (k) we should have?
Elbow Heuristic: Plot the K-Means loss for various values of k. Pick the k beyond which increasing k yields only a marginal decrease in the loss
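As an illustration, a minimal sketch of the elbow heuristic using scikit-learn; the toy dataset and the range of k are placeholders, not from the lecture:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy data: three blobs (a stand-in for whatever dataset is being clustered)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in [(0, 0), (4, 4), (0, 5)]])

ks = range(1, 11)
losses = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
          for k in ks]

plt.plot(list(ks), losses, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("K-Means loss (sum of squared distances)")
plt.show()
```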
Tuning the K-Means Algorithm
K-Means seems to be heavily dependent on centroid initialization
What happens if a centroid is initialized so far away from all the data points that it gets assigned no points?
Because of this random initialization, K-Means is likely to converge to a local minimum rather than the global minimum
To get best results, it is good to run K-Means multiple times and pick the
clustering that yields the lowest loss
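A small sketch of this restart strategy with scikit-learn; the toy data and the number of restarts are assumptions (scikit-learn's KMeans can also perform these restarts internally via its n_init parameter):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # placeholder data

best_model, best_loss = None, np.inf
for seed in range(10):
    # One run of K-Means from a fresh random initialization
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    if model.inertia_ < best_loss:  # inertia_ is the K-Means loss
        best_model, best_loss = model, model.inertia_
```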
K-Means++: an initialization scheme to improve cluster quality
1. Randomly select the 1st centroid c1 from the data points
2. For every data point x(i), compute d(i) = distance(x(i), c1)
3. Select a data point x(i) to be the next centroid with probability (d(i))^2 / ∑_{j=1}^{n} (d(j))^2
4. Repeat this process (recomputing d(i) as the distance to the nearest centroid chosen so far) until k centroids have been sampled
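A minimal NumPy sketch of this initialization (squared Euclidean distance assumed; the function name is mine, not from the lecture):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random from the data
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: d(i) = distance from each point to its nearest chosen centroid
        dists = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                      axis=-1), axis=1)
        # Step 3: sample the next centroid with probability proportional to d(i)^2
        probs = dists ** 2 / np.sum(dists ** 2)
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```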
K-Means++: Example
Image Credits: CS 229n Notes
Briefly: Mixture of Gaussian Models (MoG)
Mixture of Gaussians Models are another method of clustering
Unlike K-Means, MoG Models are probabilistic in nature
We believe that every data point x(i) was drawn from one of k Gaussians with a
unique mean and covariance N(μk, Σk)
However, we have a prior on the probability that the Gaussian will be picked!
For each Gaussian N(μk, Σk), the probability that it will be used to model x(i) is P(z(i) = k) = φk, where z(i) ∼ Multinomial(φ)
The Challenge: We do not know what φ is!
Briefly: Mixture of Gaussian Models (MoG)
In order to classify x(i) under a Gaussian Distribution, we need to know:
1. P(z(i))
2. P(x(i) | z(i)), which is one of the k Gaussians N(μ_{z(i)}, Σ_{z(i)})
In other words, we are modeling P(x(i), z(i)) = P(x(i) | z(i)) * P(z(i))
• Optimization Objective: max ∑_{i=1}^{n} log( ∑_{z(i)=1}^{k} P(x(i) | z(i)) * P(z(i)) )
We use an optimization technique called Expectation-Maximization (EM) to optimize MoG models (Not in Scope)
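For context only (EM itself is not in scope), a hedged sketch of MoG clustering using scikit-learn's GaussianMixture, which runs EM under the hood; the toy data is a placeholder, not the lecture's example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two Gaussian blobs with very different spreads
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(4, 4), scale=1.5, size=(100, 2)),
])

mog = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)
labels = mog.predict(X)                   # hard cluster assignments (like K-Means)
responsibilities = mog.predict_proba(X)   # soft assignments P(z | x)
print(mog.weights_)                       # learned prior over clusters (phi)
print(mog.means_)                         # learned per-cluster means
```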
Benefits of MoG Models
MoG allows us to learn separate covariances per cluster. In K-Means, every
cluster has the same properties
MoG makes explicit assumptions in the form of statistical distributions, which makes the model better at modeling uncertainty and thus more robust
• Example: https://tinyurl.com/MoG189