DATA7703 Machine Learning for Data Scientists
Lecture 4 – Dimensionality Reduction
PCA & LDA
Anders Eriksson Aug 25, 2020
Lecture Schedule

Week  Date    Topic
 1    Aug-4   Introduction – Basic Concepts of Machine Learning
 2    Aug-11  Classification – Sorting Fish
 3    Aug-18  Regression – Predicting House Prices
 4    Aug-25  PCA & LDA – Analysing Faces  (Assignment 1 due Thu 27/8 11:59pm; Projects announced)
 5    Sep-1   Clustering and Similarity
 6    Sep-8   Support Vector Machines & Kernel Methods
 7    Sep-15  Ensemble methods: Random forest and Boosting
 8    Sep-22  The Perceptron
 –    –       Mid-Semester Break
 9    Oct-6   Convolutional Neural Networks
10    Oct-13  Adversarial Examples
11    Oct-20  Interpreting ML algorithms
12    Oct-27  Course Summary

Other assessment during the semester: Assignment 2, Mid-term exam, Assignment 3, Assignment 4.
Last Week
Regression
Regression – Model

• I want to list my house for sale: how much is my house worth?
• How we assume the world works: price ($) as a function of house size.
[Figure: price ($) vs. square feet (sq.ft.), with a model fitted through the data points]
Regression Models

• Linear model
• Quadratic model
• Higher-order polynomial model
[Figure: three fits of price ($) vs. square feet (sq.ft.): linear, quadratic, and higher-order polynomial]
Generic Basis Expansion

Model:
  y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + … + w_D h_D(x_i) + ε_i = Σ_{j=0}^{D} w_j h_j(x_i) + ε_i

where w_j is the j-th regression coefficient (weight) and h_j(x) is the j-th feature:
• feature 1 = h_0(x) … often 1 (constant)
• feature 2 = h_1(x) … e.g., x
• feature 3 = h_2(x) … e.g., x² or sin(2πx/12)
• …
• feature D+1 = h_D(x) … e.g., x^p
(A small feature-matrix sketch follows below.)
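To make the basis-expansion model concrete, here is a minimal Python sketch; the synthetic house-price data and the particular basis functions (1, x, x²) are my own illustrative assumptions, not the lecture's code.

```python
# Minimal sketch: build a basis-expanded feature matrix H and fit the weights
# by least squares. Data and basis functions are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(500, 3000, size=50)               # e.g. house size in sq.ft.
y = 100 * x + 5e4 + rng.normal(0, 2e4, size=50)   # synthetic prices

# h_0(x) = 1, h_1(x) = x, h_2(x) = x^2  (any basis functions could be used)
H = np.column_stack([np.ones_like(x), x, x**2])

# Least-squares estimate of w in y ≈ H w
w, *_ = np.linalg.lstsq(H, y, rcond=None)
print(w)
```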
Predictions Just Based on House Size

• Add more inputs:
  f(x) = w_0 + w_1·(sq.ft.) + w_2·(#bath)
[Figure: price ($) plotted against x[1] = square feet (sq.ft.) and x[2] = #bathrooms]
“Cost” of using a given line

Residual sum of squares (RSS):
  RSS(w_0, w_1) = Σ_i (y_i − [w_0 + w_1 x_i])²
[Figure: price ($) vs. square feet (sq.ft.), with residuals from a fitted line]
Closed-form solution

• Normal equation: w = (HᵀH)⁻¹ Hᵀ y
• The inverse exists if HᵀH has full rank.
• Complexity of the inverse: O(n^2.373)
• (More efficient methods exist, such as gradient descent.)
(A small worked sketch follows below.)
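As a concrete illustration of the normal equation and the RSS above, here is a minimal Python sketch; the synthetic data and the variable names (H, w_hat) are assumptions, not the lecture's code.

```python
# Minimal sketch: solve the normal equation w = (H^T H)^{-1} H^T y and
# evaluate the residual sum of squares on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
n = 100
sqft = rng.uniform(500, 3000, size=n)
baths = rng.integers(1, 4, size=n)
y = 5e4 + 120 * sqft + 1e4 * baths + rng.normal(0, 2e4, size=n)

H = np.column_stack([np.ones(n), sqft, baths])   # features: 1, sq.ft., #bath

# Normal equation; np.linalg.solve is preferred over forming the inverse.
w_hat = np.linalg.solve(H.T @ H, H.T @ y)

rss = np.sum((y - H @ w_hat) ** 2)               # residual sum of squares
print(w_hat, rss)
```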
Dimensionality Reduction
PCA & LDA
PCA Example – Analysing Faces
Dimensionality Reduction
• The Curse of Dimensionality
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
The Curse of Dimensionality
Curse of Dimensionality

In previous examples we have seen that more features make better classifiers.

  More features ⇒ more information ⇒ easier to separate the data ⇒ better classification.

Right?
Curse of Dimensionality: Complexity

• Complexity (running time) increases with the dimension d.
• A lot of methods have at least O(nd²) complexity, where n is the number of samples; for example, if we need to estimate the covariance matrix.
• So as d becomes large, O(nd²) complexity may be too costly.
Curse of Dimensionality: Overfitting

• If d is large, n, the number of samples, may be too small for accurate parameter estimation.
• For example, the covariance matrix has d² parameters.
• For accurate estimation, n should be much bigger than d².
• Otherwise the model is too complicated for the data, and we get overfitting.
Curse of Dimensionality: Overfitting

• Paradox: if n < d², we are better off assuming that the features are uncorrelated, even if we know this assumption is wrong.
• In this case, the covariance matrix has only d parameters (its diagonal).
• We are likely to avoid overfitting because we fit a model with fewer parameters.
[Figure: full covariance matrix (model with more parameters) vs. diagonal covariance matrix (model with fewer parameters)]
(A small sketch illustrating this follows below.)
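To illustrate the paradox, here is a small Python sketch (my own, with assumed dimensions): with n < d² samples the full sample covariance is poorly estimated, here even singular, while the diagonal estimate has only d parameters.

```python
# Minimal sketch: full vs. diagonal covariance estimation with few samples.
# Dimensions and data are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 40                      # n < d (and certainly n < d^2)
X = rng.normal(size=(n, d))        # true covariance is the identity

full_cov = np.cov(X, rowvar=False)          # d x d, rank at most n-1 here
diag_cov = np.diag(X.var(axis=0, ddof=1))   # only d estimated parameters

print(np.linalg.matrix_rank(full_cov))      # < d: the full estimate is singular
print(np.allclose(np.diag(diag_cov), np.diag(full_cov)))  # same diagonal
```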
Curse of Dimensionality: Number of Samples

• Suppose we want to use the nearest neighbor approach with k = 1 (1NN).
• Suppose we start with only one feature.
• This feature is not discriminative, i.e. it does not separate the classes well.
• We decide to use 2 features. For the 1NN method to work well, we need a lot of samples, i.e. the samples have to be dense.
• To maintain the same density as in 1D (9 samples per unit length), how many samples do we need?
[Figure: 9 samples of two classes spread along the unit interval [0, 1]]
Curse of Dimensionality: Number of Samples

• We need 9² = 81 samples to maintain the same density as in 1D.
[Figure: a 9 × 9 grid of samples covering the unit square [0, 1] × [0, 1]]
Curse of Dimensionality: Number of Samples

• Of course, when we go from 1 feature to 2, no one gives us more samples; we still have 9.
• This is way too sparse for 1NN to work well.
[Figure: the same 9 samples scattered over the unit square]
Curse of Dimensionality: Number of Samples

• Things go from bad to worse if we decide to use 3 features.
• If 9 samples were dense enough in 1D, in 3D we need 9³ = 729 samples!
[Figure: sparse samples in the unit cube]
Curse of Dimensionality: Number of Samples

• In general, if n samples are dense enough in 1D, then in d dimensions we need n^d samples!
• And n^d grows really, really fast as a function of d.
• Common pitfall:
  • If we can’t solve a problem with a few features, adding more features seems like a good idea.
  • However, the number of samples usually stays the same.
  • The method with more features is then likely to perform worse, rather than better as expected.
(A quick numeric check follows below.)
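A quick numeric check of the n^d growth, using the n = 9 from the running example (the range of d is my own choice):

```python
# Minimal sketch: how fast n^d grows for the running example n = 9.
for d in (1, 2, 3, 5, 10, 20):
    print(d, 9 ** d)
# d = 1: 9, d = 2: 81, d = 3: 729, ..., d = 20: about 1.2e19 samples
```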
Curse of Dimensionality: Number of Samples

• For a fixed number of samples, as we add features, the classification error decreases at first and then increases again.
[Figure: classification error vs. # features, with a minimum at the optimal # features]
• Thus, for each fixed sample size n, there is an optimal number of features to use.
The Curse of Dimensionality

• We should try to avoid creating a lot of features.
• Often there is no choice: the problem starts with many features.
• Example: face detection.
  • One sample point is a k by m array of pixels.
  • Feature extraction is not trivial; usually every pixel is taken as a feature.
  • A typical dimension is 20 by 20 = 400.
  • Suppose 10 samples are dense enough for 1 dimension. Then we need “only” 10^400 samples!
[Figure: a face image flattened into a vector of pixel values]
The Curse of Dimensionality

• Face detection: the dimension of one sample point is km.
• The fact that we set the problem up with km dimensions (features) does not mean it is really a km-dimensional problem.
• The space of all k by m images has km dimensions.
• The space of all k by m faces must be much smaller, since faces form a tiny fraction of all possible images.
• Most likely we are not setting the problem up with the right features.
• If we used better features, we would likely need far fewer than km dimensions.
Dimensionality Reduction

• High dimensionality is challenging and redundant.
• It is natural to try to reduce dimensionality.
• Reduce dimensionality by feature combination: combine the old features x to create new features y.
• For example, map x ∈ R^d to y = f(x) ∈ R^k with k < d.
• Ideally, the new vector y should retain from x all information important for classification.
Dimensionality Reduction

• The best f(x) is most likely a non-linear function.
• Linear functions are easier to find, though.
• For now, assume that f(x) is a linear mapping.
• Thus it can be represented by a matrix W:  y = Wx, with W of size k × d (see the short sketch below).
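A minimal sketch of such a linear mapping, assuming a random placeholder W just to show the shapes (PCA and LDA will choose W in specific, principled ways):

```python
# Minimal sketch: linear dimensionality reduction y = W x maps d-dimensional
# inputs to k-dimensional outputs. W here is random, purely for illustration.
import numpy as np

rng = np.random.default_rng(3)
d, k, n = 100, 5, 20
X = rng.normal(size=(n, d))    # n samples in d dimensions
W = rng.normal(size=(k, d))    # a k x d projection matrix (placeholder)

Y = X @ W.T                    # each row is y_i = W x_i
print(Y.shape)                 # (20, 5)
```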
Feature Combination

We will look at two methods for feature combination:
• Principal Component Analysis (PCA)
• Fisher Linear Discriminant (LDA)
Principal Component Analysis
Unsupervised Learning
Principal Component Analysis (PCA)

• Main idea: seek the most accurate data representation in a lower-dimensional space.
• Example in 2D: project the data onto a 1D subspace (a line) which minimizes the projection error.
[Figure: two candidate lines to project to; one gives large projection errors (bad), the other small projection errors (good)]
• Notice that the good line to use for projection lies in the direction of largest variance.
PCA

• After the data is projected onto the best line, we need to transform the coordinate system to get the 1D representation y.
• Note that the new data y has the same variance as the old data x in the direction of the projection line.
• PCA preserves the largest variances in the data. We will prove this statement; for now it is just an intuition of what PCA will do.
PCA: Approximation of Elliptical Cloud in 3D

[Figure: best 2D approximation and best 1D approximation of an elliptical point cloud in 3D]
PCA

• What is the direction of largest variance in the data?
• Recall that if x has a multivariate distribution N(μ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ.
• This is a hint that we should be looking at the covariance matrix of the data (note that PCA can be applied to distributions other than Gaussian). A small numeric check follows below.
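A small numeric check of this fact, using an assumed 2D Gaussian (my own illustration, not from the slides):

```python
# Minimal sketch: the top eigenvector of the sample covariance matrix points
# along the direction of largest variance. Gaussian parameters are assumed.
import numpy as np

rng = np.random.default_rng(4)
mean = np.zeros(2)
cov = np.array([[3.0, 1.5],
                [1.5, 1.0]])
X = rng.multivariate_normal(mean, cov, size=2000)

S = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(S)        # eigenvalues in ascending order
top = vecs[:, -1]                     # eigenvector with the largest eigenvalue
print(top, vals[-1])                  # close to the leading eigenvector of cov
```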
PCA: Linear Algebra for Derivation

• Let V be a d-dimensional linear space, and W be a k-dimensional linear subspace of V.
• We can always find a set of d-dimensional vectors {e_1, e_2, ..., e_k} which forms an orthonormal basis for W.
• Example: let V = R² and W be the line x − 2y = 0. Then an orthonormal basis for W is
  e = (2/√5, 1/√5).
PCA: Linear Algebra for Derivation

• Recall that a subspace W contains the zero vector, i.e. it goes through the origin.
[Figure: a line through the origin is a subspace of R²; a shifted line is not a subspace of R²]
• For the derivation, it will be convenient to project to a subspace W: thus we need to shift everything.
PCA Derivation: Shift by the Mean Vector

• Before PCA, subtract the sample mean from the data: x_i → x_i − μ, where μ = (1/n) Σ_i x_i.
• The new data has zero mean: E(X − E(X)) = E(X) − E(X) = 0.
• All we did is change the coordinate system.
• Another way to look at it: the first step of getting y is to subtract the mean of x.
PCA: Derivation

• We want to find the most accurate representation of the data D = {x_1, x_2, …, x_n} in some subspace W which has dimension k < d.
• Let {e_1, e_2, ..., e_k} be the orthonormal basis for W. Any vector in W can be written as Σ_{i=1}^{k} α_i e_i.
• Thus x_1 will be represented by some vector in W:  x_1 ≈ Σ_{i=1}^{k} α_{1i} e_i.
• Error of this representation:  ||x_1 − Σ_{i=1}^{k} α_{1i} e_i||².
PCA: Derivation

• To find the total error, we need to sum over all x_j.
• Any x_j can be written as x_j = Σ_{i=1}^{k} α_{ji} e_i.
• Thus the total error for the representation of all data D is
    J(e_1, ..., e_k, α_11, ..., α_nk) = Σ_{j=1}^{n} || x_j − Σ_{i=1}^{k} α_{ji} e_i ||²
  where the e_i and α_ji are the unknowns, the outer sum runs over all data points, and each term is the error at one point.
PCA: Derivation
• To minimize J, we need to take partial derivatives and also enforce the constraint that {e_1, e_2, ..., e_k} are orthogonal.
• Let us simplify J first.
PCA: Derivation

• Minimizing J over the α_ji (for fixed e_i) gives α_ji = e_iᵀ x_j, i.e. the coefficients are simply the projections of x_j onto the basis vectors.
• Minimizing J over the e_i, subject to the orthonormality constraints, shows that the optimal e_1, ..., e_k are the eigenvectors of the scatter (covariance) matrix S = Σ_j x_j x_jᵀ corresponding to its k largest eigenvalues.
• The remaining error equals the sum of the discarded eigenvalues, so projecting onto the top-k eigenvectors preserves the largest variances in the data, as claimed earlier.
PCA as Data Approximation

• Using only the first k principal components, each sample can be approximated as x ≈ μ + Σ_{i=1}^{k} y_i e_i, where y_i = e_iᵀ(x − μ).
• The larger k is, the smaller the approximation error.
PCA: Last Step
Recipe for Dimension Reduction with PCA

Given data x_1, ..., x_n ∈ R^d and a target dimension k:
1. Compute the sample mean μ and centre the data: x_i → x_i − μ.
2. Compute the scatter (covariance) matrix of the centred data.
3. Compute its eigenvectors and eigenvalues; keep the k eigenvectors e_1, ..., e_k with the largest eigenvalues.
4. Project: y_i = Eᵀ(x_i − μ), where E = [e_1 ... e_k].
(A code sketch of this recipe follows below.)
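A minimal Python sketch of this recipe (my own implementation; the lecture's example uses Matlab, and the synthetic data here is an assumption):

```python
# Minimal sketch of the PCA recipe: centre, covariance, eigenvectors, project.
import numpy as np

def pca(X, k):
    """Project the rows of X (n x d) onto the top-k principal components."""
    mu = X.mean(axis=0)                      # 1. sample mean
    Xc = X - mu                              #    centre the data
    S = np.cov(Xc, rowvar=False)             # 2. covariance matrix (d x d)
    vals, vecs = np.linalg.eigh(S)           # 3. eigen-decomposition (ascending)
    E = vecs[:, ::-1][:, :k]                 #    top-k eigenvectors (d x k)
    Y = Xc @ E                               # 4. y_i = E^T (x_i - mu)
    return Y, E, mu

rng = np.random.default_rng(5)
X = rng.multivariate_normal(np.zeros(3),
                            [[4, 2, 0], [2, 3, 0], [0, 0, 0.1]], size=500)
Y, E, mu = pca(X, k=2)
print(Y.shape)   # (500, 2)
```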
PCA Example Using Matlab
PCA Example - Eigenfaces

• An image is a point in a high-dimensional space: an N × M image is a point in R^{NM}.
• Consider the space of all possible N × M images.
• Faces only make up a small subset of those.
PCA Example - Eigenfaces

• This set of faces is most likely some complicated non-linear high-dimensional structure.
• Assume that the set of faces is a (linear) “subspace” of the set of images, and suppose it is K-dimensional.
• We can find the best subspace using PCA.
• This is like fitting a hyperplane to the set of faces, spanned by the vectors v1, v2, ..., vK.
PCA Example - Eigenfaces

Training images:
• 300 by 250 pixels
• Find a basis for the low-dimensional subspace using PCA.
PCA Example - Eigenfaces

• These basis vectors are called eigenfaces.
• Even though they appear meaningless to us, our theory states that they capture the most information about the faces in the training set.
[Figure: a face expressed as a weighted sum of eigenfaces, with weights -8029, -1183, 2900, -2088, 1751, -4336, 1445, -669, 4238, -4221, 6193, 10549]
PCA Example - Eigenfaces

• The more eigenfaces you have, the better the reconstruction; but you can have a high-quality reconstruction even with a small number of eigenfaces.
[Figure: the original image (dim = 75,000) alongside reconstructions labelled 82, 70, 50, 30, 20 and 10 (number of eigenfaces used)]
(A reconstruction sketch follows below.)
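A minimal sketch of eigenface-style reconstruction, assuming a hypothetical `faces` array of flattened training images; a small random stand-in is used so the snippet runs, and the sizes are not the lecture's:

```python
# Minimal sketch: reconstruct an image from its first k eigenface coefficients.
import numpy as np

rng = np.random.default_rng(6)
n, d = 40, 600                      # stand-in sizes; real eigenfaces use d = 75000
faces = rng.normal(size=(n, d))     # hypothetical flattened training images

mu = faces.mean(axis=0)
Xc = faces - mu
# Eigenfaces = principal directions of the centred data, computed via SVD
# (avoids forming the d x d covariance matrix when d >> n).
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenfaces = Vt                     # rows are eigenfaces, ordered by variance

def reconstruct(x, k):
    """Approximate image x using the first k eigenfaces."""
    coeffs = eigenfaces[:k] @ (x - mu)
    return mu + eigenfaces[:k].T @ coeffs

x_hat = reconstruct(faces[0], k=10)
print(np.linalg.norm(faces[0] - x_hat))   # error shrinks as k grows
```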
PCA Example - Eigenfaces
http://immersivemath.com/ila/ch10_eigen/ch10.html
Drawbacks of PCA

• PCA is unsupervised: it ignores class labels, so the directions of largest variance are not necessarily the directions that best separate the classes.
• PCA is linear: if the data lie on a complicated non-linear structure (as the set of faces probably does), a linear subspace may be a poor fit.
Linear Discriminant Analysis
Supervised Learning

Data Representation vs. Data Classification

• PCA finds the directions that are best for representing the data (largest variance, smallest reconstruction error).
• For classification, we instead want the directions that best discriminate between the classes; this requires using the class labels.
Fisher Linear Discriminant

• Main idea: project the data onto a line (direction w) so that samples from the two classes are as well separated as possible.
• A good projection makes the projected class means far apart while keeping the scatter within each projected class small.
• Fisher criterion: maximize
    J(w) = (wᵀ S_B w) / (wᵀ S_W w),
  where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix.
(A small sketch of the two-class case follows below.)
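A minimal sketch of the standard two-class Fisher discriminant, w ∝ S_W⁻¹(m1 − m2), on assumed synthetic data (my own illustration, not the lecture's code):

```python
# Minimal sketch: two-class Fisher linear discriminant on synthetic data.
import numpy as np

rng = np.random.default_rng(7)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=200)
X2 = rng.multivariate_normal([2, 2], [[1.0, 0.5], [0.5, 1.0]], size=200)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the (unnormalised) class scatter matrices.
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m1 - m2)     # Fisher direction (up to scale)
w /= np.linalg.norm(w)

y1, y2 = X1 @ w, X2 @ w               # 1D projections of the two classes
print(abs(y1.mean() - y2.mean()) / np.sqrt(y1.var() + y2.var()))
```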
Fisher Linear Discriminant Derivation

• Maximizing J(w) leads to the generalized eigenvalue problem S_B w = λ S_W w.
• For two classes, S_B has rank one, and the solution can be written in closed form (up to scale) as
    w = S_W⁻¹ (m_1 − m_2),
  where m_1 and m_2 are the class means.
Fisher Linear Discriminant Example
Multiple Discriminant Analysis (MDA)

• We can generalize FLD to multiple classes.
• In the case of c classes, we can reduce the dimensionality to 1, 2, 3, ..., c − 1 dimensions.
• Project sample x_i to a linear subspace: y_i = Vᵀ x_i.
• V is called the projection matrix. (A short sketch follows below.)
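A minimal sketch of the MDA projection, taking V as the top c − 1 eigenvectors of S_W⁻¹ S_B on assumed three-class toy data (my own illustration):

```python
# Minimal sketch: MDA projection matrix from the within/between-class scatter.
import numpy as np

rng = np.random.default_rng(8)
means = [np.array([0, 0, 0]), np.array([3, 0, 0]), np.array([0, 3, 0])]
X = np.vstack([rng.normal(m, 1.0, size=(100, 3)) for m in means])
y = np.repeat([0, 1, 2], 100)

overall_mean = X.mean(axis=0)
d, c = X.shape[1], 3
S_W = np.zeros((d, d))
S_B = np.zeros((d, d))
for k in range(c):
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    S_W += (Xk - mk).T @ (Xk - mk)
    diff = (mk - overall_mean).reshape(-1, 1)
    S_B += len(Xk) * diff @ diff.T

# Top c-1 eigenvectors of S_W^{-1} S_B form the columns of V.
vals, vecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
order = np.argsort(vals.real)[::-1]
V = vecs[:, order[:c - 1]].real

Y = X @ V                  # y_i = V^T x_i for each sample
print(Y.shape)             # (300, 2)
```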
FLD and MDA Drawbacks
This week
• Assignment 1 due Thu Aug 27 11:59pm
• Projects announced
Next week
• Clustering and Similarity