Announcements
• Reminder: ps2 due tonight at midnight (Boston time)
• Self-Grading form for ps1 out tomorrow 9/25 (1 week to turn in)
• Self-Grading form for ps2 out Monday 9/28 (1 week to turn in)
Unsupervised Learning II
Agglomerative Clustering
[Figures (slide credit: Andrew Ng): K-means recap — cluster centroids, with each point assigned to the nearest centroid (e.g., distance to blue: 2.4 vs. distance to red: 4.4).]
Iteratively combine the two closest points
Agglomerative Clustering Example (bottom-up clustering)
Image source: https://en.wikipedia.org/wiki/Hierarchical_clustering
When do we stop combining?
• Select based on prior knowledge or task performance (e.g., you know there are two categories of data); see the sketch below
• Choose a distance (cost) threshold below which clusters stop being combined
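Below is a minimal Octave/MATLAB sketch of bottom-up agglomerative clustering using single linkage (the distance between two clusters is the distance between their closest members); the function name and the stop-at-K-clusters interface are my own illustrative choices, not from the slides.

    function labels = agglomerative_cluster(X, K)
      % Minimal single-linkage agglomerative clustering (illustrative sketch).
      % X: m x n data matrix (one point per row); K: number of clusters to stop at.
      % Returns labels: m x 1 vector of cluster indices in 1..K.
      m = size(X, 1);
      clusters = num2cell(1:m);                       % start: every point is its own cluster
      sq = sum(X .^ 2, 2);
      D  = sqrt(max(sq + sq' - 2 * (X * X'), 0));     % pairwise Euclidean distances

      while numel(clusters) > K
        best = inf; bi = 1; bj = 2;
        for i = 1:numel(clusters)                     % find the two closest clusters
          for j = (i + 1):numel(clusters)
            d = min(min(D(clusters{i}, clusters{j}))); % single linkage
            if d < best
              best = d; bi = i; bj = j;
            end
          end
        end
        clusters{bi} = [clusters{bi}, clusters{bj}];  % combine the two closest clusters
        clusters(bj) = [];
      end

      labels = zeros(m, 1);
      for k = 1:numel(clusters)
        labels(clusters{k}) = k;
      end
    end

Calling agglomerative_cluster(X, 2) implements the "two known categories" stopping rule; stopping by a cost threshold instead would mean exiting the while-loop once best exceeds that threshold. Replacing min with max or mean linkage gives complete or average linkage.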
Unsupervised Learning II
Continuous Latent Variables
Today
• Applications of clustering: vector quantization, data compression
• Continuous latent variables: principal component analysis
Unsupervised Learning II
Applications of Clustering
Application of Clustering: Vector Quantization
• Compress an image using clustering
• Each {R, G, B} pixel value is an input vector 𝑥(𝑖) (256 × 256 × 256 possible values)
• Cluster into K clusters (using K-means)
• Replace each vector by its cluster's index 𝑐(𝑖) (K possible values)
• For display, show the cluster mean 𝜇_𝑐(𝑖), e.g. 𝑥(𝑖) = (138, 80, 79) → 𝜇_𝑐(𝑖)
Vector quantization: color values
Example: R, G, B vectors, one per pixel:
  𝑥(1) = (138, 80, 79), 𝑥(2) = (155, 64, 65), 𝑥(3) = (156, 76, 76), …
Slide credit: Josef Sivic
Vector quantization: color values
After clustering (clusters k = 1, 2, 3, …), each R, G, B vector is replaced by the index of the cluster it belongs to, e.g. "replace with 3" for a pixel assigned to cluster k = 3.
Slide credit: Josef Sivic
Vector quantization: general case
Vector quantization maps each N-dimensional input vector to a single cluster index (N-dim → 1-dim).
Slide credit: Josef Sivic
K-Means for Image Compression
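As a concrete sketch of the pipeline above (the file name 'image.png', the choice K = 16, the 20 iterations, and the hand-rolled K-means loop are my own assumptions, not from the slides):

    % Vector-quantize an RGB image with K-means (illustrative sketch)
    A = double(imread('image.png')) / 255;   % h x w x 3 array with values in [0, 1]
    [h, w, ~] = size(A);
    X = reshape(A, h * w, 3);                % each row is one {R, G, B} vector x(i)
    K = 16;                                  % number of clusters = possible values after compression

    mu = X(randperm(h * w, K), :);           % initialize centroids at K random pixels
    for iter = 1:20                          % simple K-means
      d = sum(X .^ 2, 2) + sum(mu .^ 2, 2)' - 2 * X * mu';   % squared distances to centroids
      [~, c] = min(d, [], 2);                % c(i) = index of the closest centroid
      for k = 1:K
        if any(c == k)
          mu(k, :) = mean(X(c == k, :), 1);  % move centroid k to the mean of its pixels
        end
      end
    end

    % store only the index c(i) per pixel; for display, show the centroid mu(c(i), :)
    A_compressed = reshape(mu(c, :), h, w, 3);
    imshow(A_compressed);                    % or image(A_compressed)

Storing one index in 1..K per pixel plus the K centroid colors is what gives the compression compared with storing a full {R, G, B} triple per pixel.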
Unsupervised Learning II
Continuous Latent Variables
Data Compression
Reduce data from 2D to 1D. Vector quantization?
[Figure: the same quantity measured twice, in inches and in cm; the 2D points lie nearly on a line.]
Andrew Ng
Data Compression: hidden dimension
Reduce data from 2D to 1D by projecting onto the single direction (the hidden dimension) along which the data actually varies.
[Figure: the same (inches, cm) data projected onto a line.]
Andrew Ng
Data Compression
Reduce data from 3D to 2D
Andrew Ng
Data Visualization

Country     GDP          Per capita GDP   Human Devel.  Life        Poverty Index   Mean household income
            (trillions   (thousands of    Index         expectancy  (Gini, as       (thousands of US$)
            of US$)      intl. $)                                   percentage)
Canada      1.577        39.17            0.908         80.7        32.6            67.293             …
China       5.878        7.54             0.687         73          46.9            10.22              …
India       1.632        3.41             0.547         64.7        36.8            0.735              …
Russia      1.48         19.84            0.755         65.5        39.9            0.72               …
Singapore   0.223        56.69            0.866         80          42.5            67.1               …
USA         14.527       46.86            0.91          78.3        40.8            84.3               …
…           …            …                …             …           …               …

[resources from en.wikipedia.org]
Andrew Ng
Data Visualization

Country     z1     z2
Canada      1.6    1.2
China       1.7    0.3
India       1.6    0.2
Russia      1.4    0.5
Singapore   0.5    1.7
USA         2      1.5
…           …      …

Andrew Ng
Data Visualization
Andrew Ng
Unsupervised Learning II
Principal Component Analysis
How to choose the lower-dimensional subspace?
Minimize the "error" introduced by the projection.
Choose subspace with minimal "information loss"

Reduce from 2 dimensions to 1 dimension: find a direction (a vector 𝑢(1)) onto which to project the data, so as to minimize the projection error.

Reduce from 3 dimensions to 2 dimensions: find two vectors 𝑢(1), 𝑢(2) ∈ ℝ³ defining a plane onto which to project the data.

Reduce from n dimensions to K dimensions: find K vectors 𝑢(1), 𝑢(2), …, 𝑢(K) onto which to project the data, so as to minimize the projection error.
Principal Component Analysis

Find orthonormal basis vectors U = [𝑢(1) … 𝑢(K)], where K ≪ n.

New (projected) features:  𝑧 = 𝑈ᵀ𝑥,  with 𝑧_k = 𝑢(k)ᵀ𝑥

Reconstructed data point:  x̂ = ∑_{k=1}^{K} 𝑧_k 𝑢(k)
Cost function: reconstruction error

  𝐽 = (1/m) ∑_{i=1}^{m} ‖𝑥(i) − x̂(i)‖²

Want:  min_U 𝐽
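In the same Octave-style notation as the later slides, the projection, reconstruction, and cost 𝐽 can be computed for a whole zero-mean data matrix X with one example per row; this is my own sketch, with U assumed to hold the K basis vectors as columns:

    % X: m x n zero-mean data (one example per row), U: n x K orthonormal basis
    m     = size(X, 1);
    Z     = X * U;                                % row i of Z is z(i)' = (U' * x(i))'
    X_hat = Z * U';                               % row i is the reconstruction x_hat(i)'
    J     = (1 / m) * sum(sum((X - X_hat) .^ 2)); % average squared reconstruction error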
PCA Solution
• The solution turns out to be the first K eigenvectors of the data covariance matrix, i.e. those with the K largest eigenvalues (see Bishop 12.1 for details)
• Closed form: use the Singular Value Decomposition (SVD) of the covariance matrix
• Other PCA formulations
  – can be derived by maximizing the variance of the projected data
  – a probabilistic formulation of PCA is also possible, as is the closely related factor analysis (see Bishop 8.1.4)
PCA Algorithm
Normalize features (ensure every feature has zero mean) and optionally scale features.
Compute the "covariance matrix" Σ = (1/m) ∑_{i=1}^{m} 𝑥(i) 𝑥(i)ᵀ:
  Sigma = (1/m) * X' * X;
Compute its "eigenvectors":
  [U,S,V] = svd(Sigma);
Keep the first K eigenvectors and project to get the new features z:
  Ureduce = U(:,1:K);
  z = Ureduce' * x;
PCA Algorithm
Data preprocessing
Training set: 𝑥(1), 𝑥(2), …, 𝑥(m)
Preprocessing (feature scaling / mean normalization):
  𝜇_j = (1/m) ∑_{i=1}^{m} 𝑥_j(i)
  Replace each 𝑥_j(i) with 𝑥_j(i) − 𝜇_j.
If different features are on different scales (e.g., size of a house vs. number of bedrooms), scale the features to have a comparable range of values:
  𝑥_j(i) ← (𝑥_j(i) − 𝜇_j) / 𝑠_j
Andrew Ng
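Putting the last two slides together, a runnable Octave/MATLAB sketch of the whole pipeline might look as follows; the variable names, the choice K = 2, and the guard against zero-variance features are my own:

    % X: m x n training data, one example x(i)' per row
    [m, n] = size(X);

    % preprocessing: mean normalization, plus optional feature scaling
    mu_j   = mean(X);                        % 1 x n vector of feature means
    X_norm = X - mu_j;                       % replace x_j(i) with x_j(i) - mu_j
    s_j    = std(X_norm);
    X_norm = X_norm ./ max(s_j, eps);        % optional: bring features to comparable scale

    % covariance matrix and its eigenvectors via SVD
    Sigma     = (1 / m) * (X_norm' * X_norm);
    [U, S, V] = svd(Sigma);

    % keep the first K eigenvectors and project to get the new features z
    K       = 2;                             % e.g. for visualization; see "Choosing K" below
    Ureduce = U(:, 1:K);
    Z       = X_norm * Ureduce;              % row i of Z is z(i)'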
PCA is not linear regression
• There is no "output" in PCA; all dimensions are treated equally.
[Figure panels: regression 𝑥1 → 𝑥2, regression 𝑥2 → 𝑥1, and PCA. Regression minimizes the error in the predicted output only, whereas PCA minimizes the perpendicular projection error; a small sketch follows.]
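To make the contrast concrete, here is a small Octave sketch on synthetic data (entirely my own construction): regression of 𝑥2 on 𝑥1 minimizes vertical errors only, while PCA finds the direction that minimizes perpendicular projection error.

    % synthetic 2D data roughly along a line
    x1 = randn(200, 1);
    x2 = 0.5 * x1 + 0.1 * randn(200, 1);
    X  = [x1, x2] - mean([x1, x2]);          % zero-mean both coordinates

    % regression x1 -> x2: slope that minimizes vertical (x2) errors
    w = (X(:, 1)' * X(:, 1)) \ (X(:, 1)' * X(:, 2));

    % PCA: first eigenvector of the covariance matrix, treats x1 and x2 symmetrically
    [U, S, V] = svd((1 / size(X, 1)) * (X' * X));
    u1 = U(:, 1);                            % direction minimizing perpendicular error

The slope w and the direction u1 generally disagree, because the two methods minimize different errors.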
Choosing K (the number of principal components)
Average squared projection error:  (1/m) ∑_{i=1}^{m} ‖𝑥(i) − x̂(i)‖²
Total variation in the data:  (1/m) ∑_{i=1}^{m} ‖𝑥(i)‖²
Typically, choose K to be the smallest value so that
  [average squared projection error] / [total variation] ≤ 0.01   (1%)
i.e. "99% of the variance is retained"
Andrew Ng
Choosing K (the number of principal components)
[U,S,V] = svd(Sigma)
Pick the smallest value of K for which
  ∑_{i=1}^{K} S_ii / ∑_{i=1}^{n} S_ii ≥ 0.99
(99% of variance retained)
Andrew Ng
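Since S is diagonal, with the variance captured by each principal component on its diagonal, this rule can be checked directly from svd's output; a small sketch:

    [U, S, V] = svd(Sigma);
    s        = diag(S);                      % variance captured by each principal component
    retained = cumsum(s) / sum(s);           % fraction of variance retained by the first k components
    K        = find(retained >= 0.99, 1);    % smallest K with at least 99% of variance retained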
Good use of PCA
– Compression
  – Reduce the memory/disk space needed to store data
  – Speed up a learning algorithm
– Visualization
Andrew Ng
Bad use of PCA: to prevent overfitting
Use 𝑧(i) instead of 𝑥(i) to reduce the number of features to K < n.
Thus, fewer features, so it is less likely to overfit.
This might work OK, but it isn't a good way to address overfitting. Use regularization instead.
Andrew Ng
When does PCA fail?
(a) Tracking a person on a Ferris wheel (black dots). All dynamics can be described by the phase of the wheel θ, a non-linear combination of the naïve basis.
(b) Non-Gaussian distributed data and non- orthogonal axes cause PCA to fail. The axes with the largest variance do not correspond to the appropriate answer.
Shlens, A Tutorial on Principal Component Analysis
Next Class
Neural Networks I: Feed-forward Nets:
artificial neuron, MLP, sigmoid units; neuroscience inspiration; output vs. hidden layers; linear vs. nonlinear networks; feed-forward neural networks
Reading: Bishop 5.1-5.3