Unsupervised Learning II
Continuous Latent Variables
Today
• Applications of clustering: vector quantization, data compression
• Continuous latent variables: principal component analysis
Unsupervised Learning II
Applications of Clustering
Application of Clustering: Vector Quantization
• Compress an image using clustering
• Each {R, G, B} pixel value is an input vector 𝑥(𝑖) (256 × 256 × 256 possible values)
• Cluster the pixel vectors into K clusters (using k-means)
• Replace each vector by its cluster’s index 𝑐(𝑖) (K possible values)
• For display, show the cluster mean 𝜇_𝑐(𝑖)
Example: 𝑥(𝑖) = (138, 80, 79) → 𝜇_𝑐(𝑖)
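A minimal Octave sketch of this compression scheme (the file name 'photo.png', K = 16, the fixed 10 iterations, and the random initialization are assumptions; empty clusters are not handled):

% Vector quantization of an image with k-means
img = double(imread('photo.png'));
X = reshape(img, [], 3);                    % one row per pixel: [R G B]
K = 16;
mu = X(randperm(size(X, 1), K), :);         % initialize centroids at K random pixels

for iter = 1:10
    d = zeros(size(X, 1), K);
    for k = 1:K
        d(:, k) = sum((X - mu(k, :)).^2, 2);   % squared distance of every pixel to centroid k
    end
    [~, c] = min(d, [], 2);                    % c(i) = index of the nearest centroid
    for k = 1:K
        mu(k, :) = mean(X(c == k, :), 1);      % move centroid k to the mean of its pixels
    end
end

img_vq = reshape(mu(c, :), size(img));      % for display: each pixel replaced by its cluster mean

Storing only the cluster index per pixel plus the K cluster means takes far less space than the original RGB triples, which is the point of the compression.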
Vector quantization: color values
Example R, G, B vectors (one column per pixel):
R: 138 155 156 …
G:  80  64  76 …
B:  79  65  76 …
Each pixel vector is assigned to a cluster (k = 1, 2, 3, …) and replaced by its cluster index (e.g., “replace with ‘3’”).
Slide credit: Josef Sivic
Vector quantization: general case
Vector quantization maps each N-dimensional vector to a 1-dimensional code: the index of its cluster.
Slide credit: Josef Sivic
K-Means for Image Compression
Unsupervised Learning II
Continuous Latent Variables
Data Compression
Reduce data from 2D to 1D
Vector quantization?
[Figure: 2D data, one axis in inches and the other in cm]
Andrew Ng
Data Compression: hidden dimension
Reduce data from 2D to 1D
[Figure: the same 2D data (inches vs. cm); the points lie close to a line, suggesting a single hidden dimension]
Andrew Ng
Data Compression
Reduce data from 3D to 2D
Andrew Ng
Data Visualization
Country      GDP (trillions   Per capita GDP           Human Development   Life         Poverty Index (Gini   Mean household income
             of US$)          (thousands of intl. $)   Index               expectancy   as percentage)        (thousands of US$)      …
Canada       1.577            39.17                    0.908               80.7         32.6                  67.293                  …
China        5.878            7.54                     0.687               73           46.9                  10.22                   …
India        1.632            3.41                     0.547               64.7         36.8                  0.735                   …
Russia       1.48             19.84                    0.755               65.5         39.9                  0.72                    …
Singapore    0.223            56.69                    0.866               80           42.5                  67.1                    …
USA          14.527           46.86                    0.91                78.3         40.8                  84.3                    …
…            …                …                        …                   …            …                     …                       …
[resources from en.wikipedia.org]
Andrew Ng
Data Visualization
Country      z1     z2
Canada       1.6    1.2
China        1.7    0.3
India        1.6    0.2
Russia       1.4    0.5
Singapore    0.5    1.7
USA          2      1.5
…            …      …
Andrew Ng
Data Visualization
Andrew Ng
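As a small illustration, the reduced two-feature table above can be plotted directly; a minimal Octave sketch (the z1/z2 values come from the table, the plotting style is an assumption):

countries = {'Canada', 'China', 'India', 'Russia', 'Singapore', 'USA'};
Z = [1.6 1.2; 1.7 0.3; 1.6 0.2; 1.4 0.5; 0.5 1.7; 2.0 1.5];   % z1, z2 per country
plot(Z(:, 1), Z(:, 2), 'x');
text(Z(:, 1) + 0.03, Z(:, 2), countries);    % label each point with its country
xlabel('z_1'); ylabel('z_2');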
Unsupervised Learning II
Principal Component Analysis
How to choose lower-dim subspace?
Minimize “error”
Choose subspace with minimal “information loss”
Reduce from 2 dimensions to 1 dimension: find a direction (a vector 𝑢(1)) onto which to project the data, so as to minimize the projection error.
Choose subspace with minimal “information loss”
[Figure: 3D data projected onto the plane spanned by 𝑢(1) ∈ 𝑅3 and 𝑢(2) ∈ 𝑅3]
Reduce from 2 dimensions to 1 dimension: find a direction (a vector 𝑢(1)) onto which to project the data, so as to minimize the projection error.
Reduce from n dimensions to K dimensions: find K vectors 𝑢(1), 𝑢(2), …, 𝑢(𝐾) onto which to project the data, so as to minimize the projection error.
Principal Components Analysis
Find orthonormal basis vectors $U = [u^{(1)} \; \cdots \; u^{(K)}]$, where $K \ll n$.
Projected coordinates: $z = U^T x$, i.e., $z_k = u^{(k)T} x$.
Principal Components Analysis
Find orthonormal basis vectors $U = [u^{(1)} \; \cdots \; u^{(K)}]$, where $K \ll n$.
Projected coordinates: $z = U^T x$, i.e., $z_k = u^{(k)T} x$.
Reconstructed data point: $\hat{x} = \sum_{k=1}^{K} z_k u^{(k)}$
Cost function: reconstruction error
$$J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \hat{x}^{(i)} \right\|^2$$
Want: $\min_U J$
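A minimal Octave sketch of these formulas, assuming X is an m × n matrix of mean-normalized examples stored as rows and U is an n × n orthonormal matrix of candidate directions:

K = 2;                               % number of retained directions (assumption)
Ureduce = U(:, 1:K);                 % u(1), ..., u(K)
Z = X * Ureduce;                     % row i holds z(i) = Ureduce' * x(i)
Xhat = Z * Ureduce';                 % reconstructed points: row i = sum_k z_k u(k)
J = mean(sum((X - Xhat).^2, 2));     % average squared reconstruction error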
PCA Solution
• The solution turns out to be the K eigenvectors of the data covariance matrix with the largest eigenvalues (see Bishop 12.1 for details)
• Closed-form solution: compute the Singular Value Decomposition (SVD) of the covariance matrix
• Other PCA formulations
– can be derived by maximizing the variance of the projected data
– a probabilistic formulation of PCA is possible, as is the closely related factor analysis; see Bishop 8.1.4
PCA Algorithm
Normalize features (ensure every feature has zero mean) and optionally scale features
Compute the “covariance matrix” $\Sigma = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T}$:
Sigma = (1/m) * X' * X;
Compute its “eigenvectors”:
[U,S,V] = svd(Sigma);
Keep the first K eigenvectors and project to get the new features z:
Ureduce = U(:,1:K); z = Ureduce'*x;
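The same steps applied to an entire training set X (m × n, one example per row); a minimal sketch, assuming X has already been mean-normalized and K has been chosen:

m = size(X, 1);
Sigma = (1 / m) * (X' * X);      % n x n covariance matrix
[U, S, V] = svd(Sigma);          % columns of U are the principal directions
Ureduce = U(:, 1:K);             % keep the first K directions
Z = X * Ureduce;                 % m x K matrix whose i-th row is z(i) = Ureduce' * x(i)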
PCA Algorithm
Data preprocessing
Training set: $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$
Preprocessing (feature scaling/mean normalization):
$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}$
Replace each $x_j^{(i)}$ with $x_j^{(i)} - \mu_j$.
If different features are on different scales (e.g., size of house, number of bedrooms), scale features to have a comparable range of values: $x_j^{(i)} \leftarrow \frac{x_j^{(i)} - \mu_j}{s_j}$.
Andrew Ng
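A minimal sketch of this preprocessing step for an m × n data matrix X (using the per-feature standard deviation as the scale s_j is an assumption; any comparable scale works):

mu = mean(X);                    % 1 x n vector of feature means
s = std(X);                      % 1 x n vector of feature scales
Xnorm = (X - mu) ./ s;           % mean-normalize and scale every feature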
PCA is not linear regression
• There is no “output” in PCA; all dimensions are treated equally. Linear regression minimizes the vertical distances to the fitted line (errors in the predicted variable only), while PCA minimizes the orthogonal projection distances.
[Figure panels: Regression 𝑥1 → 𝑥2, Regression 𝑥2 → 𝑥1, PCA]
Choosing K (number of principal components)
Average squared projection error: $\frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \hat{x}^{(i)}\|^2$
Total variation in the data: $\frac{1}{m} \sum_{i=1}^{m} \|x^{(i)}\|^2$
Typically, choose K to be the smallest value so that
$$\frac{\frac{1}{m} \sum_{i=1}^{m} \|x^{(i)} - \hat{x}^{(i)}\|^2}{\frac{1}{m} \sum_{i=1}^{m} \|x^{(i)}\|^2} \le 0.01 \quad (1\%)$$
“99% of variance is retained”
Andrew Ng
Choosing K (number of principal components)
[U,S,V] = svd(Sigma)
Pick the smallest value of K for which
$$\frac{\sum_{i=1}^{K} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$$
(99% of variance retained)
Andrew Ng
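A minimal sketch of this selection rule, assuming Sigma has already been computed from the normalized data:

[U, S, V] = svd(Sigma);
variance_retained = cumsum(diag(S)) / sum(diag(S));
K = find(variance_retained >= 0.99, 1);    % smallest K with >= 99% of variance retained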
Good use of PCA
– Compression
  – Reduce memory/disk needed to store data
  – Speed up learning algorithm
– Visualization
Andrew Ng
Bad use of PCA: To prevent overfitting
Use 𝑧(𝑖) instead of 𝑥(𝑖) to reduce the number of features from n to K < n.
Thus, fewer features, so less likely to overfit.
This might work OK, but isn’t a good way to address overfitting. Use regularization instead.
Andrew Ng
When does PCA fail?
(a) Tracking a person on a Ferris wheel (black dots). All dynamics can be described by the phase of the wheel θ, a non-linear combination of the naïve basis.
(b) Non-Gaussian distributed data and non-orthogonal axes cause PCA to fail. The axes with the largest variance do not correspond to the appropriate answer.
Shlens, A Tutorial on Principal Component Analysis