
Announcements
Reminder: ps2 due tonight at midnight (Boston)
• Self-Grading form for ps1 out tomorrow 9/25 (1 week to turn in)
• Self-Grading form for ps2 out Monday 9/28 (1 week to turn in)

Unsupervised Learning II
Agglomerative Clustering
2

[Figures: clustering recap showing data points and cluster centroids; each point is assigned to the nearest centroid (e.g., distance to blue: 2.4, distance to red: 4.4). Slide credit: Andrew Ng]

Iteratively combine the two closest points
[Sequence of figures: at each step, the two closest points/clusters are merged]

Agglomerative Clustering Example (bottom-up clustering)
Image source: https://en.wikipedia.org/wiki/Hierarchical_clustering

When do we stop combining?
• Select K based on prior knowledge or task performance (e.g., you know there are two categories of data)
• Choose a distance/cost threshold at which to stop combining (see the sketch below)
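A minimal Octave/MATLAB sketch of this bottom-up procedure (illustrative, not the course code; it merges the two clusters whose centroids are closest and stops once K clusters remain; a distance threshold could be used as the stopping rule instead):

function labels = agglomerative_cluster(X, K)
  % X: m x n data matrix (one point per row); K: number of clusters to keep
  m = size(X, 1);
  clusters = num2cell(1:m);            % start with every point in its own cluster
  while numel(clusters) > K
    best = inf; bi = 0; bj = 0;
    for i = 1:numel(clusters)          % find the two closest clusters
      for j = i+1:numel(clusters)
        d = norm(mean(X(clusters{i}, :), 1) - mean(X(clusters{j}, :), 1));
        if d < best, best = d; bi = i; bj = j; end
      end
    end
    clusters{bi} = [clusters{bi}, clusters{bj}];   % merge the closest pair
    clusters(bj) = [];
  end
  labels = zeros(m, 1);                % cluster index for each point
  for c = 1:numel(clusters)
    labels(clusters{c}) = c;
  end
end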

Unsupervised Learning II
Continuous Latent Variables
20

Today
• Applications of clustering: vector quantization, data compression
• Continuous latent variables: principal component analysis
21

Unsupervised Learning II
Applications of Clustering
22

Application of Clustering: Vector Quantization
• Compress an image using clustering
• Each {R, G, B} pixel value is an input vector x^{(i)} (256 × 256 × 256 possible values)
• Cluster the pixel vectors into K clusters (using k-means)
• Replace each vector by its cluster's index c^{(i)} (K possible values)
• For display, show the cluster mean \mu_{c^{(i)}}

Example: x^{(i)} = (138, 80, 79) → \mu_{c^{(i)}}

23

Vector quantization: color values
Example: R, G, B vectors

138 155 156 …
 80  64  76 …
 79  65  76 …
Slide credit: Josef Sivic
24

Vector quantization: color values
138 155 156 …
 80  64  76 …        → replace with ‘3’
 79  65  76 …

[Figure: pixel vectors in R, G, B space grouped into clusters k=1, k=2, k=3 (vector quantization)]
Slide credit: Josef Sivic
25

Vector quantization: general case

Slide credit: Josef Sivic
26

Vector quantization: general case

Map each N-dimensional vector to a single (1-dimensional) cluster index
Vector quantization
Slide credit: Josef Sivic

K-Means for Image Compression
28
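A minimal Octave/MATLAB sketch of this compression idea (illustrative: a hand-rolled k-means loop on an H x W x 3 RGB image stored in a variable `img` with values 0–255; the variable names and the fixed iteration count are assumptions, not from the slides):

X = double(reshape(img, [], 3));       % one row per pixel: its {R, G, B} vector x(i)
K = 16;                                % number of clusters (codebook size)
mu = X(randperm(size(X, 1), K), :);    % initialize centroids from random pixels
for iter = 1:10
  d = zeros(size(X, 1), K);
  for k = 1:K                          % assignment step: squared distance to each centroid
    d(:, k) = sum((X - mu(k, :)).^2, 2);
  end
  [~, c] = min(d, [], 2);              % c(i) = index of the nearest centroid
  for k = 1:K                          % update step: recompute each centroid
    if any(c == k)
      mu(k, :) = mean(X(c == k, :), 1);
    end
  end
end
% Compressed representation: c (log2(K) bits per pixel) plus the K centroids.
% For display, replace each pixel by its cluster mean mu_c(i):
img_vq = uint8(reshape(mu(c, :), size(img)));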

Unsupervised Learning II
Continuous Latent Variables
29

Data Compression
Reduce data from 2D to 1D
Vector quantization?
[Figure: 2D data with axes labeled (inches) and (cm)]
Andrew Ng

Data Compression: hidden dimension
Reduce data from 2D to 1D
[Figure: the same 2D data with axes labeled (inches) and (cm)]
Andrew Ng

Data Compression
Reduce data from 3D to 2D
Andrew Ng

Data Visualization
Country     GDP          Per capita GDP   Human Dev.   Life         Poverty Index   Mean household income
            (trillions   (thousands of    Index        expectancy   (Gini, %)       (thousands of US$)
            of US$)      intl. $)
Canada      1.577        39.17            0.908        80.7         32.6            67.293
China       5.878        7.54             0.687        73           46.9            10.22
India       1.632        3.41             0.547        64.7         36.8            0.735
Russia      1.48         19.84            0.755        65.5         39.9            0.72
Singapore   0.223        56.69            0.866        80           42.5            67.1
USA         14.527       46.86            0.91         78.3         40.8            84.3
…           …            …                …            …            …               …
[resources from en.wikipedia.org]
Andrew Ng

Data Visualization
Country     z1     z2
Canada      1.6    1.2
China       1.7    0.3
India       1.6    0.2
Russia      1.4    0.5
Singapore   0.5    1.7
USA         2      1.5
…           …      …
Andrew Ng

Data Visualization
Andrew Ng

Unsupervised Learning II
Principal Component Analysis
36

How to choose lower-dim subspace?
37

Minimize “error”
38

Choose subspace with minimal “information loss”
𝑢(1)
Reduce from 2 dimensions to 1 dimension: find a direction (a vector 𝑢(1)) onto which to project the data so as to minimize the projection error.

Choose subspace with minimal “information loss”
[Figure: 2D→1D projection onto 𝑢(1); 3D→2D projection onto the plane spanned by 𝑢(1) ∈ 𝑅3 and 𝑢(2) ∈ 𝑅3]
Reduce from 2 dimensions to 1 dimension: find a direction (a vector 𝑢(1)) onto which to project the data so as to minimize the projection error.
Reduce from n dimensions to K dimensions: find K vectors 𝑢(1), 𝑢(2), …, 𝑢(𝐾) onto which to project the data so as to minimize the projection error.

Principal Components Analysis
[Figure: a data point x and the direction u^{(1)}]
Find orthonormal basis vectors U = [u^{(1)} … u^{(K)}], where K ≪ n
Projection: z = U^T x,  with  z_k = u^{(k)T} x

Principal Components Analysis
[Figure: a data point x, the direction u^{(1)}, and the reconstruction \hat{x}]
Find orthonormal basis vectors U = [u^{(1)} … u^{(K)}], where K ≪ n
Projection: z = U^T x,  with  z_k = u^{(k)T} x
Reconstructed data point: \hat{x} = \sum_{k=1}^{K} z_k u^{(k)}
Cost function (reconstruction error): J = \frac{1}{m} \sum_{i=1}^{m} \left\| x^{(i)} - \hat{x}^{(i)} \right\|^2
Want: \min_U J
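As a quick numeric check with made-up values (K = 1, u^{(1)} = (0.6, 0.8), x = (3, 1)):

z_1 = u^{(1)T} x = 0.6 \cdot 3 + 0.8 \cdot 1 = 2.6
\hat{x} = z_1 u^{(1)} = 2.6 \cdot (0.6, 0.8) = (1.56, 2.08)
\| x - \hat{x} \|^2 = (3 - 1.56)^2 + (1 - 2.08)^2 = 3.24

Averaging this squared error over all m training points gives J, which PCA minimizes over the choice of U.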

PCA Solution
• The solution turns out to be the first K eigenvectors of the data covariance matrix, i.e. the eigenvectors with the K largest eigenvalues (see Bishop 12.1 for details)
• Closed-form solution: apply Singular Value Decomposition (SVD) to the covariance matrix (see the numeric check below)
• Other PCA formulations
– PCA can also be derived by maximizing the variance of the projected data
– a probabilistic formulation of PCA is possible, as is the related factor analysis; see Bishop 8.1.4
43
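A quick numeric check of this claim (illustrative; for a symmetric positive semi-definite Σ, the singular vectors coincide with the eigenvectors up to sign and ordering):

Sigma = [2 1; 1 2];        % example symmetric covariance matrix
[V, D] = eig(Sigma);       % eigenvectors (columns of V) and eigenvalues (diag of D)
[U, S, ~] = svd(Sigma);    % U spans the same directions as V (possibly reordered and
                           % with flipped signs), with S sorted in decreasing order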

PCA Algorithm
Normalize features (ensure every feature has zero mean) and optionally scale features
Compute the “covariance matrix” Σ:
Sigma = (1/m) * X' * X;    % X: m x n matrix of mean-normalized examples
Compute its “eigenvectors”:
[U,S,V] = svd(Sigma);
Keep the first K eigenvectors and project to get the new features z:
Ureduce = U(:,1:K);  z = Ureduce' * x;
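Putting the steps together, a minimal Octave/MATLAB sketch (assumes a data matrix X with one example per row; the reconstruction line is added here for illustration and is not part of the slide):

[m, n] = size(X);                   % m examples, n features (one example per row)
mu = mean(X);                       % per-feature mean
X_norm = X - mu;                    % mean normalization (optionally also scale features)
Sigma = (1/m) * (X_norm' * X_norm); % n x n covariance matrix
[U, S, V] = svd(Sigma);             % columns of U are the principal directions
K = 2;                              % number of components to keep (example value)
Ureduce = U(:, 1:K);                % n x K
Z = X_norm * Ureduce;               % projected data, m x K (row i is z for example i)
X_approx = Z * Ureduce' + mu;       % reconstruction x-hat in the original space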

PCA Algorithm
Data preprocessing
Training set: x^{(1)}, x^{(2)}, …, x^{(m)}
Preprocessing (feature scaling / mean normalization):
\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)}
Replace each x_j^{(i)} with x_j^{(i)} - \mu_j.
If different features are on different scales (e.g., size of house, number of bedrooms), scale features to have a comparable range of values, e.g. replace x_j^{(i)} with (x_j^{(i)} - \mu_j) / s_j, where s_j is the standard deviation (or range) of feature j.
Andrew Ng

PCA is not linear regression
• There is no “output” in PCA; all dimensions are treated equally. Regression minimizes the error in a predicted output variable, while PCA minimizes the orthogonal projection error.
[Figure panels: Regression x1 → x2, Regression x2 → x1, PCA]

Choosing K (number of principal components)
Average squared projection error: \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \hat{x}^{(i)} \|^2
Total variation in the data: \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2
Typically, choose K to be the smallest value so that
\frac{ \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} - \hat{x}^{(i)} \|^2 }{ \frac{1}{m} \sum_{i=1}^{m} \| x^{(i)} \|^2 } \le 0.01   (1%)
i.e. “99% of variance is retained”
Andrew Ng

Choosing K (number of principal components)
[U,S,V] = svd(Sigma)
Pick the smallest value of K for which
\frac{ \sum_{k=1}^{K} S_{kk} }{ \sum_{k=1}^{n} S_{kk} } \ge 0.99
(99% of variance retained; see the sketch below)
Andrew Ng
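A sketch of this selection rule, reusing S from the SVD in the earlier pipeline (illustrative):

s = diag(S);                               % eigenvalues of Sigma (variance along each direction)
variance_retained = cumsum(s) / sum(s);    % fraction retained for K = 1, 2, ..., n
K = find(variance_retained >= 0.99, 1);    % smallest K with at least 99% of variance retained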

Good use of PCA
– Compression
  – Reduce memory/disk needed to store data
  – Speed up learning algorithm
– Visualization
Andrew Ng

Bad use of PCA: To prevent overfitting
Use z^{(i)} instead of x^{(i)} to reduce the number of features to K < n.
Thus, fewer features, so the model is less likely to overfit.
This might work OK, but it isn’t a good way to address overfitting. Use regularization instead.
Andrew Ng

When does PCA fail?
(a) Tracking a person on a Ferris wheel (black dots). All dynamics can be described by the phase of the wheel θ, a non-linear combination of the naïve basis.
(b) Non-Gaussian distributed data and non-orthogonal axes cause PCA to fail: the axes with the largest variance do not correspond to the appropriate answer.
Shlens, A Tutorial on Principal Component Analysis

Next Class
Neural Networks I: Feed-forward Nets:
artificial neuron, MLP, sigmoid units; neuroscience inspiration; output vs hidden layers; linear vs nonlinear networks; feed-forward neural networks
Reading: Bishop 5.1-5.3