
CSE 3521: Unsupervised Learning
[Many slides are adapted from UC Berkeley CS-188 and the previous CSE 5521 course at OSU.]


An overview of unsupervised learning
Clustering
Generative models

K-means clustering
Agglomerative (hierarchical) clustering

Unsupervised learning
Data type: unlabeled data instances (features only, no labels are given)

Discover the structure (e.g., clusters, groups, or classes) of the data instances
Estimate the distribution/density of the data instances
Learn to generate and sample data instances

Unsupervised learning
Clustering:
Group emails or search results
Find categories of customers
Detect anomalies

Distribution/Density estimation
Can sample from the distribution to generate new data

Unsupervised learning
Generative models
After seeing many data instances, can the model generate a new one?
Example: Google bedroom image search (real images)

Unsupervised learning
Generative models
After seeing many data instances, can the model generate a new one?
Example: Machine generated images [Radford et al., ICLR 2016]

Unsupervised learning
Generative models (variants)
Image “translation”
[Huang et al., ECCV 2018]
Conditional image generation
[Wang et al., CVPR 2018]
[Park et al., CVPR 2019]

Unsupervised learning
Generative models (how to do it?)
One way: probabilistic graphical models (Bayesian networks)
season  flu, hayfever  muscle-pain, congestion
burglar, earthquake  alarm  phone call
Pixel 1  Pixel 2  Pixel n  …  Pixel I  …
Note that, a graph node can contain “multiple” random variables

Sampling of (s, f, h, c, m)
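A minimal sketch of how one could draw samples (s, f, h, c, m) from such a network by ancestral sampling: sample each variable after its parents, following the arrows. All probability values below are made-up illustrative numbers, not values from the slides (Python):

import random

# Made-up conditional probabilities for the toy network
# season -> {flu, hayfever} -> {muscle-pain, congestion}
P_season = {"winter": 0.5, "summer": 0.5}
P_flu_given_season = {"winter": 0.3, "summer": 0.1}    # P(flu = 1 | season)
P_hay_given_season = {"winter": 0.05, "summer": 0.4}   # P(hayfever = 1 | season)

def bernoulli(p):
    return 1 if random.random() < p else 0

def sample_once():
    # Ancestral sampling: parents first, then children given their sampled parents.
    s = random.choices(list(P_season), weights=list(P_season.values()))[0]
    f = bernoulli(P_flu_given_season[s])
    h = bernoulli(P_hay_given_season[s])
    m = bernoulli(0.8 if f else 0.05)          # muscle-pain depends on flu
    c = bernoulli(0.9 if (f or h) else 0.1)    # congestion depends on flu or hayfever
    return (s, f, h, c, m)

print([sample_once() for _ in range(3)])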

Clustering

Clustering
Clustering systems:
Unsupervised learning
Detect patterns in unlabeled data
group emails or search results
find categories of customers
detect anomalous program executions
Useful when you don’t know what you are looking for
Have data, but no labels
May get gibberish

Clustering
Basic idea: group together similar instances
Example: 2D point patterns

What does “similar” mean?
One option: small (squared) Euclidean distance (i.e., L2 distance)

Need other distance metrics
Feature normalization
Other algorithms
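As a concrete illustration of the squared L2 distance and why feature normalization matters, here is a small sketch (the two-feature data below is made up):

import numpy as np

def squared_l2(a, b):
    # Squared Euclidean (L2) distance between two feature vectors
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d @ d)

# Without normalization, the large-scale feature dominates the distance.
X = np.array([[1.0, 2000.0],    # made-up instances: (feature 1, feature 2)
              [2.0, 2100.0],
              [1.0,  800.0]])
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # z-score each feature

print(squared_l2(X[0], X[1]), squared_l2(X_norm[0], X_norm[1]))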

K-means: a simple and popular clustering algorithm
Find K centers and group data instances into the K centers
Example: K = 2
Assign each data instance to the closest (most similar, smallest distance) center

Example: K-means for segmentation

[Figure: original image and its K-means color segmentation results]
Treat each “pixel” as a 3-dimensional data instance
Group pixels by their colors
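A sketch of this segmentation idea using scikit-learn’s KMeans; the file name "image.jpg" is a placeholder and K = 4 is an arbitrary choice:

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

# "image.jpg" is a placeholder path; any RGB image works.
img = np.asarray(Image.open("image.jpg").convert("RGB"), dtype=float) / 255.0
h, w, _ = img.shape

# Treat each pixel as a 3-dimensional data instance (its R, G, B values).
pixels = img.reshape(-1, 3)

# Group pixels by color, then repaint each pixel with its cluster center's color.
K = 4
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(pixels)
segmented = km.cluster_centers_[km.labels_].reshape(h, w, 3)

Image.fromarray((segmented * 255).astype(np.uint8)).save("segmented.jpg")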

Find K centers and group data instances into the K centers simultaneously
Chicken-and-egg problem: need to have centers before assignments, but where are the centers?

Given data instances x_1, …, x_N, formulate clustering as an optimization problem (total distance to the centers):

minimize over {y_n} and {c_k}:  Σ_{k=1..K} Σ_{n: y_n = k} ||x_n − c_k||²

y_n ∈ {1, …, K}: the “pseudo label” of each instance (its group assignment)
c_k: the “location” of each center
The inner sum is over all data instances assigned to group k (each group k is around center k)

Goal: find the centers (K of them) and the group assignments (N data instances) simultaneously
Difficulty: the group assignments (pseudo labels) are discrete values
Solution: “alternating” optimization
Given the centers, the best assignment for each data point x_n is independent of the others: y_n = argmin_k ||x_n − c_k||²
Given the assignments, finding the best center c_k for each group k is independent of the others: c_k = average of all data instances assigned to group k
K-means: an iterative clustering algorithm
Pick K random points as the initial cluster centers (means): c_1, …, c_K
Alternate (the error is monotonically decreasing):
Assign each example to the mean that is closest to it
Set each mean to the average of its assigned points
Stop when no points’ assignments change

An iterative clustering algorithm (pseudo-code)
Pick random cluster centers: c_1, …, c_K
For t = 1, 2, …: [or, stop if assignments don’t change]
for n = 1, …, N: [update cluster assignments]
y_n = argmin_k dist(x_n, c_k)
for k = 1, …, K: [update cluster centers]
c_k = average of all x_n with y_n = k
Iteration 1
Data instances: x1 = [-1,0], x2 = [0,0], x3 = [2,2]
Random cluster means: c1 = [-1,0], c2 = [0,0]

y1 = argmin_k dist(x1, ck) = 1
y2 = argmin_k dist(x2, ck) = 2
y3 = argmin_k dist(x3, ck) = 2

c1 = (1/1) * [-1,0] = [-1,0]
c2 = (1/2) * ([0,0] + [2,2]) = [1,1]

Iteration 2
Updated means: c1 = [-1,0], c2 = [1,1]

y1 = argmin_k dist(x1, ck) = 1
y2 = argmin_k dist(x2, ck) = 1
y3 = argmin_k dist(x3, ck) = 2

c1 = (1/2) * ([-1,0] + [0,0]) = [-0.5,0]
c2 = (1/1) * [2,2] = [2,2]

Iteration 3
Updated means: c1 = [-0.5,0], c2 = [2,2]
The assignments y_i won’t change, so the algorithm stops

K-means optimization: a second glance
Consider the total distance to the centers: Σ_{k} Σ_{n: y_n = k} ||x_n − c_k||²

Two stages in each iteration:
Update assignments: fix means, change assignments
Update means: fix assignments, change means
Will it converge?
Yes! If you can argue that each update can only decrease the error
Converge to a local minimum!

Phase I: update assignments
For each point, re-assign it to the closest mean: y_n = argmin_k ||x_n − c_k||²

This can only decrease the total distance

Phase II: update means
Move each mean to the average of its assigned points: c_k = (1 / |{n: y_n = k}|) Σ_{n: y_n = k} x_n

This also can only decrease the total distance… (Why?)

The center with the smallest total squared L2 distance to a set of points is their mean
Proof: take the partial derivative of the objective w.r.t. c_k and set it to 0 to solve for c_k
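The derivative argument, written out (notation follows the objective defined earlier):

\frac{\partial}{\partial c_k} \sum_{n:\, y_n = k} \|x_n - c_k\|_2^2
  = \sum_{n:\, y_n = k} 2\,(c_k - x_n) = 0
\;\;\Longrightarrow\;\;
c_k = \frac{1}{|\{n : y_n = k\}|} \sum_{n:\, y_n = k} x_n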

K-means: animation

Initialization
K-means is non-deterministic
Requires initial means

It does matter what you pick!

What can go wrong?

Various schemes for preventing this kind of thing:
Variance-based split / merge, initialization heuristics

K-means getting stuck
A local optimum:

Ideally, we want the purple center to take half of the blue points, but the initial center locations lead to a sub-optimal solution

K-means questions
Will K-means converge?
To a global optimum?

Will it always find the true patterns in the data?
If the patterns are very clear?

Will it find something interesting?

Do people ever use it?

How many clusters (K) to pick?
Try different values of K and see how the error drops (the “elbow” method; see the sketch below)
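A sketch of this procedure on made-up data, using scikit-learn’s KMeans and its inertia_ attribute (the total within-cluster squared distance):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Made-up data: three well-separated 2-D blobs.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

errors = []
for K in range(1, 9):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    errors.append(km.inertia_)   # total squared distance of points to their centers

plt.plot(range(1, 9), errors, marker="o")
plt.xlabel("K")
plt.ylabel("total within-cluster error")
plt.show()   # look for the "elbow" where the error stops dropping quickly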

K-means questions
What distance to use?
Squared Euclidean distance: the best center is the mean
L1 distance: the best center is the median
Mahalanobis distance: dist(x, x’)² = (x − x’)ᵀ M (x − x’), where M is positive semi-definite (e.g., the inverse covariance matrix)

We can also perform a feature transform before K-means

K-medoids: chooses data points as centers (medoids) and can be used with arbitrary distances
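A small sketch of the Mahalanobis distance with M taken to be the inverse covariance of the data (a common choice); the data below is synthetic:

import numpy as np

def mahalanobis_sq(x, xp, M):
    # Squared Mahalanobis distance (x - x')^T M (x - x'); M must be positive semi-definite.
    d = np.asarray(x, dtype=float) - np.asarray(xp, dtype=float)
    return float(d @ M @ d)

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 0.5]], size=500)
M = np.linalg.inv(np.cov(X, rowvar=False))   # inverse covariance as the metric

print(mahalanobis_sq(X[0], X[1], M))
print(mahalanobis_sq(X[0], X[1], np.eye(2)))  # M = I recovers squared Euclidean distance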

An overview of unsupervised learning
Clustering
Generative models

K-means clustering
Agglomerative (hierarchical) clustering

Agglomerative clustering
Agglomerative clustering:
First merge very similar instances
Incrementally build larger clusters out of smaller clusters

Algorithm:
Maintain a set of clusters
Initially, each instance is in its own cluster
Repeat: pick the two closest clusters and merge them into a new cluster
Stop when there is only one cluster left

Produces not one clustering, but a family of clusterings represented by a dendrogram

Agglomerative clustering
How should we define the “closeness” between two clusters of multiple instances?

Many options
Closest pair (single-link clustering)
Farthest pair (complete-link clustering)
Average of all pairs
Ward’s method (min variance, like k-means)

Different choices create different clustering behaviors
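A sketch of agglomerative clustering with SciPy, comparing the linkage choices listed above on made-up 1-D data:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2, size=12)).reshape(-1, 1)   # made-up 1-D instances

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method)                     # full sequence of merges
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)

# The dendrogram shows the whole family of clusterings at once.
dendrogram(linkage(X, method="single"))
plt.show()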

Agglomerative clustering

[Figure: complete-link vs. single-link dendrograms on the same 1-D data; y-axis shows the merge distance from 0.2 to 2]
Each discrete value along the x-axis is treated as one data instance.

(One) bad case for “hard assignments”?
K-means and agglomerative clustering do hard assignments
However, clusters may overlap
Some clusters may be “wider” than others
Distances can be deceiving!

