
DATA7703 Machine Learning for Data Scientists
Lecture 4 – Dimensionality Reduction
PCA & LDA
Anders Eriksson Aug 25, 2020

Lecture Schedule

Week  Date    Topic                                               Assessment
1     Aug-4   Introduction – Basic Concepts of Machine Learning
2     Aug-11  Classification – Sorting Fish
3     Aug-18  Regression – Predicting House Prices
4     Aug-25  PCA & LDA – Analysing Faces                         Assignment 1 due Thu 27/8 11:59pm; Projects announced
5     Sep-1   Clustering and Similarity
6     Sep-8   Support Vector Machines & Kernel Methods
7     Sep-15  Ensemble methods: Random forest and Boosting
8     Sep-22  The Perceptron
              Mid-Semester Break
9     Oct-6   Convolutional Neural Networks
10    Oct-13  Adversarial Examples
11    Oct-20  Interpreting ML algorithms
12    Oct-27  Course Summary

Other assessment milestones on the schedule: Assignment 2 due, Mid-term exam, Assignment 3 due, Assignment 4 due.

Last Week
Regression

Regression – Model
• I want to list my house for sale. $? How much is my house worth?
• The model is how we assume the world works.
[Figure: price ($) vs. square feet (sq.ft.)]

Regression Models
[Figures: price ($) vs. square feet (sq.ft.) for three fits – Linear Model, Quadratic Model, Higher Order Polynomial Model]

Generic Basis Expansion
Model:
  y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \dots + w_D h_D(x_i) + \varepsilon_i = \sum_{j=0}^{D} w_j h_j(x_i) + \varepsilon_i
• w_j is the j-th regression coefficient or weight; h_j(x) is the j-th feature
• feature 1 = h_0(x) … often 1 (constant)
• feature 2 = h_1(x) … e.g., x
• feature 3 = h_2(x) … e.g., x^2 or sin(2πx/12)
• …
• feature D+1 = h_D(x) … e.g., x^p
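As a small illustration of the basis expansion model, the sketch below builds a feature matrix H whose columns are h_0(x) = 1, h_1(x) = x and h_2(x) = x^2. The data values and weights are invented purely for illustration.

```python
import numpy as np

# Toy inputs (e.g., house sizes in sq.ft.) -- values are made up for illustration.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0])

# Basis functions h_0(x)=1, h_1(x)=x, h_2(x)=x^2.
H = np.column_stack([np.ones_like(x), x, x**2])

# A candidate weight vector w gives predictions y_hat = H @ w.
w = np.array([50000.0, 100.0, 0.01])   # arbitrary example weights
y_hat = H @ w
print(H.shape, y_hat)
```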

Predictions Just Based on House Size
• Add more inputs:
  f(x) = w_0 + w_1 (sq.ft.) + w_2 (#bath)
[Figure: price ($) vs. x[1] = square feet (sq.ft.) and x[2] = #bath]

“Cost” of using a given line
• Residual sum of squares (RSS):
  RSS(w_0, w_1) = \sum_i (y_i - [w_0 + w_1 x_i])^2
[Figure: price ($) vs. square feet (sq.ft.) with residuals to the fitted line]
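A minimal sketch of evaluating the RSS for a candidate line; the data points and the candidate weights (w0, w1) are made up for illustration.

```python
import numpy as np

# Toy data: house sizes (sq.ft.) and prices ($) -- made-up numbers.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0])
y = np.array([300000.0, 380000.0, 460000.0, 540000.0])

def rss(w0, w1, x, y):
    """Residual sum of squares for the line y_hat = w0 + w1 * x."""
    residuals = y - (w0 + w1 * x)
    return np.sum(residuals**2)

print(rss(50000.0, 150.0, x, y))   # cost of one candidate line
print(rss(60000.0, 190.0, x, y))   # cost of another candidate line
```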

Closed-form solution
• Normal Equation: \hat{w} = (H^T H)^{-1} H^T y
• The inverse exists if H^T H has full rank
• Complexity of the inverse: O(n^2.373)
• (More efficient methods exist, such as gradient descent)
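A minimal sketch of the normal-equation solution on toy data in the same spirit as the earlier examples (all numbers are illustrative). It uses np.linalg.solve rather than forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

# Toy data (sq.ft. and price $) -- made-up numbers for illustration.
x = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
y = np.array([300000.0, 380000.0, 460000.0, 540000.0, 620000.0])

# Feature matrix H with basis h_0(x)=1, h_1(x)=x.
H = np.column_stack([np.ones_like(x), x])

# Normal equation: w = (H^T H)^{-1} H^T y, solved as a linear system.
w = np.linalg.solve(H.T @ H, H.T @ y)
print("fitted weights:", w)
print("predictions:", H @ w)
```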

Dimensionality Reduction
PCA & LDA

PCA Example – Analysing Faces

Dimensionality Reduction
• The Curse of Dimensionality
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)

The Curse of Dimensionality

Curse of Dimensionality
In previous examples we have seen that more features make better classifiers.
• More features → more information
• More features → easier to separate the data
• More features → better classification
Right?

Curse of Dimensionality: Complexity
• Complexity (running time) increases with dimension d
• A lot of methods have at least O(nd^2) complexity, where n is the number of samples
  • For example, if we need to estimate the covariance matrix
• So as d becomes large, O(nd^2) complexity may be too costly
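To make the O(nd^2) term concrete, the sketch below estimates a sample covariance matrix, whose computation touches every pair of the d dimensions for each of the n samples; the sizes are arbitrary.

```python
import numpy as np

n, d = 1000, 50                      # arbitrary sample count and dimension
X = np.random.randn(n, d)            # n samples in d dimensions

# Sample covariance: roughly O(n * d^2) work, and the result has d x d entries.
Xc = X - X.mean(axis=0)
cov = (Xc.T @ Xc) / (n - 1)
print(cov.shape)                     # (d, d)
```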

Curse of Dimensionality: Overfitting
• If d is large, n, the number of samples, may be too small for accurate parameter estimation
• For example, the covariance matrix has d^2 parameters
• For accurate estimation, n should be much bigger than d^2
• Otherwise the model is too complicated for the data: overfitting
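A small sketch contrasting the number of parameters in a full covariance matrix with a diagonal "uncorrelated features" model. The d(d+1)/2 count of distinct entries is my own addition (the matrix is symmetric); the dimension is arbitrary and the code only counts parameters.

```python
d = 100                                   # arbitrary dimension
full_entries = d * d                      # entries of a full covariance matrix
distinct_full = d * (d + 1) // 2          # distinct entries, since the matrix is symmetric
diagonal_params = d                       # variances only, features assumed uncorrelated

print(full_entries, distinct_full, diagonal_params)
# With n samples we would like n >> d^2 for the full model,
# but only n >> d for the diagonal one.
```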


Curse of Dimensionality: Overfitting
• Paradox: if n < d^2 we are better off assuming that the features are uncorrelated, even if we know this assumption is wrong
• In this case, the covariance matrix has only d parameters
• We are likely to avoid overfitting because we fit a model with fewer parameters
[Figure: a model with more parameters vs. a model with fewer parameters]

Curse of Dimensionality: Number of Samples
• Suppose we want to use the nearest neighbor approach with k = 1 (1NN)
• Suppose we start with only one feature
• This feature is not discriminative, i.e. it does not separate the classes well
• We decide to use 2 features. For the 1NN method to work well, we need a lot of samples, i.e. the samples have to be dense
• To maintain the same density as in 1D (9 samples per unit length), how many samples do we need?
[Figure: 9 samples spread along the unit interval [0, 1]]

Curse of Dimensionality: Number of Samples
• We need 9^2 = 81 samples to maintain the same density as in 1D
[Figure: the unit square filled with a 9-by-9 grid of samples]

Curse of Dimensionality: Number of Samples
• Of course, when we go from 1 feature to 2, no one gives us more samples; we still have 9
• This is way too sparse for 1NN to work well
[Figure: the same 9 samples scattered over the unit square]

Curse of Dimensionality: Number of Samples
• Things go from bad to worse if we decide to use 3 features
• If 9 was dense enough in 1D, in 3D we need 9^3 = 729 samples!
[Figure: the unit cube]

Curse of Dimensionality: Number of Samples
• In general, if n samples are dense enough in 1D
• Then in d dimensions we need n^d samples!
• And n^d grows really, really fast as a function of d
• Common pitfall:
  • If we can't solve a problem with a few features, adding more features seems like a good idea
  • However the number of samples usually stays the same
  • The method with more features is likely to perform worse, not better as expected

Curse of Dimensionality: Number of Samples
• For a fixed number of samples, as we add features, the classification error first decreases and then increases again
• Thus for each fixed sample size n, there is an optimal number of features to use
[Figure: classification error vs. # features, minimised at the optimal # features]

The Curse of Dimensionality
• We should try to avoid creating lots of features
• Often there is no choice: the problem starts with many features
• Example: Face Detection
  • One sample point is a k by m array of pixels
  • Feature extraction is not trivial; usually every pixel is taken as a feature
  • Typical dimension is 20 by 20 = 400
  • Suppose 10 samples are dense enough for 1 dimension. Need only 10^400 samples!
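To put numbers on the n^d growth, here is a tiny sketch using the values from the slides (9 samples per unit length, and the 20 x 20 = 400-pixel face example).

```python
# Samples needed to keep the 1-D density of 9 samples per unit length
# as the number of features d grows.
for d in (1, 2, 3, 5, 10):
    print(f"d = {d:2d}: need 9^{d} = {9**d} samples")

# The face-detection example from the slides: 20 x 20 = 400 features,
# assuming 10 samples are dense enough in one dimension.
d = 20 * 20
print(f"face example: 10^{d} samples (a number with {len(str(10**d))} digits)")
```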
The Curse of Dimensionality
• Face detection: the dimension of one sample point is km
• The fact that we set up the problem with km dimensions (features) does not mean it is really a km-dimensional problem
• The space of all k by m images has km dimensions
• The space of all k by m faces must be much smaller, since faces form a tiny fraction of all possible images
• Most likely we are not setting the problem up with the right features
• If we used better features, we would likely need much fewer than km dimensions

Dimensionality Reduction
• High dimensionality is challenging and redundant
• It is natural to try to reduce dimensionality
• Reduce dimensionality by feature combination: combine old features x to create new features y
• Ideally, the new vector y should retain from x all information important for classification

Dimensionality Reduction
• The best f(x) is most likely a non-linear function
• Linear functions are easier to find though
• For now, assume that f(x) is a linear mapping
• Thus it can be represented by a matrix W

Feature Combination
We will look at 2 methods for feature combination:
• Principal Component Analysis (PCA)
• Fisher Linear Discriminant (LDA)

Principal Component Analysis
Unsupervised Learning

Principal Component Analysis (PCA)
• Main idea: seek the most accurate data representation in a lower dimensional space
• Example in 2-D: project the data onto the 1-D subspace (a line) which minimizes the projection error
• Notice that the good line to use for projection lies in the direction of largest variance
[Figures: two candidate lines to project to – large projection errors (bad) vs. small projection errors (good)]

PCA
• After the data is projected on the best line, we need to transform the coordinate system to get the 1D representation for vector y
• Note that the new data y has the same variance as the old data x in the direction of the green line
• PCA preserves the largest variances in the data. We will prove this statement; for now it is just an intuition of what PCA will do

PCA: Approximation of Elliptical Cloud in 3D
[Figures: best 2D approximation and best 1D approximation of an elliptical cloud]

PCA
• What is the direction of largest variance in the data?
• Recall that if x has multivariate distribution N(μ, Σ), the direction of largest variance is given by the eigenvector corresponding to the largest eigenvalue of Σ.
• This is a hint that we should be looking at the covariance matrix of the data (note that PCA can be applied to distributions other than Gaussian)

PCA: Linear Algebra for Derivation
• Let V be a d dimensional linear space, and W be a k dimensional linear subspace of V
• We can always find a set of d dimensional vectors {e_1, e_2, ..., e_k} which forms an orthonormal basis for W
  • ⟨e_i, e_j⟩ = 0 if i is not equal to j, and ⟨e_i, e_i⟩ = 1
• Thus any vector in W can be written as \sum_{i=1}^{k} \alpha_i e_i
• Example: let V = R^2 and W be the line x − 2y = 0. Then the orthonormal basis for W is
  e = (2/\sqrt{5}, 1/\sqrt{5})
[Figure: the line W in R^2]
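A small numeric check of this example (my own illustration, not from the slides): the unit vector e spans the line x − 2y = 0, and any point on the line is recovered exactly from its single coordinate α = ⟨e, x⟩.

```python
import numpy as np

# Orthonormal basis for W = {(x, y) : x - 2y = 0} in R^2.
e = np.array([2.0, 1.0]) / np.sqrt(5.0)
print(np.dot(e, e))            # 1.0 -> unit length

x = np.array([4.0, 2.0])       # a point on the line (4 - 2*2 = 0)
alpha = np.dot(e, x)           # coordinate of x in the basis {e}
print(alpha * e)               # [4. 2.] -> x is recovered exactly

z = np.array([3.0, 0.0])       # a point NOT on the line
print(np.dot(e, z) * e)        # its orthogonal projection onto W
```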

PCA: Linear Algebra for Derivation
• Recall that a subspace W contains the zero vector, i.e. it goes through the origin
• For the derivation, it will be convenient to project to a subspace W: thus we need to shift everything
[Figure: a line through the origin is a subspace of R^2; a line not through the origin is not a subspace of R^2]

PCA Derivation: Shift by the Mean Vector
• Before PCA, subtract sample mean from the data:
• The new data has zero mean: E(X-E(X)) = E(X)-E(X) = 0
• All we did was change the coordinate system
• Another way to look at it:
• first step of getting y is to subtract the mean of x
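A two-line illustration of the centering step (my own sketch; the data is random).

```python
import numpy as np

X = np.random.randn(100, 3) + np.array([5.0, -2.0, 0.5])   # data with a non-zero mean
Xc = X - X.mean(axis=0)                                     # subtract the sample mean
print(Xc.mean(axis=0))                                      # ~[0, 0, 0]: centered data
```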

PCA: Derivation
• We want to find the most accurate representation of the data D = {x_1, x_2, ..., x_n} in some subspace W which has dimension k < d
• Let {e_1, e_2, ..., e_k} be the orthonormal basis for W. Any vector in W can be written as \sum_{i=1}^{k} \alpha_i e_i
• Thus x_1 will be represented by some vector in W: x_1 = \sum_{i=1}^{k} \alpha_{1i} e_i
• Error of this representation: ||x_1 - \sum_{i=1}^{k} \alpha_{1i} e_i||^2

PCA: Derivation
• To find the total error, we need to sum over all the x_j's
• Any x_j can be written as: x_j = \sum_{i=1}^{k} \alpha_{ji} e_i
• Thus the total error for the representation of all data D is (a sum over all data points of the error at one point; the e's and α's are the unknowns):
  J(e_1, ..., e_k, \alpha_{11}, ..., \alpha_{nk}) = \sum_{j=1}^{n} ||x_j - \sum_{i=1}^{k} \alpha_{ji} e_i||^2

PCA: Derivation
• To minimize J, we need to take partial derivatives and also enforce the constraint that {e_1, e_2, ..., e_k} are orthogonal
• Let us simplify J first

[Slides: PCA: Derivation (continued), PCA as Data Approximation, PCA: Last Step, Recipe for Dimension Reduction with PCA, PCA Example Using Matlab]

PCA Example – Eigenfaces
• An image is a point in a high dimensional space
• An N x M image is a point in R^{NM}
• Faces only make up a small subset of the space of all possible N x M images

PCA Example – Eigenfaces
• This set of faces is most likely some complicated non-linear high-dimensional structure
• Assume that the set of faces is a (linear) "subspace" of the set of images
• Suppose it is K dimensional
• We can find the best subspace using PCA
• This is like fitting a hyperplane to the set of faces, spanned by the vectors v_1, v_2, ..., v_K

PCA Example – Eigenfaces
• Training images: 300 by 250 pixels
• Find a basis for the low dimensional subspace using PCA

PCA Example – Eigenfaces
• These basis vectors are called Eigenfaces
• Even though they appear meaningless to us, our theory states that they capture the most information about the faces in the training set
[Figure: eigenfaces with example reconstruction coefficients -8029, -1183, 2900, -2088, 1751, -4336, 1445, -669, 4238, -4221, 6193, 10549]

PCA Example – Eigenfaces
• The more Eigenfaces you have the better the reconstruction, but you can have a high quality reconstruction even with a small number of Eigenfaces
[Figure: original image (dim = 75000) and reconstructions from 82, 70, 50, 30, 20 and 10 Eigenfaces]

PCA Example – Eigenfaces
http://immersivemath.com/ila/ch10_eigen/ch10.html

[Slide: Drawbacks of PCA]
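The "Recipe for Dimension Reduction with PCA" slide is listed above by title only, so here is a generic sketch of the standard PCA recipe it refers to (center the data, form the covariance matrix, take the top-k eigenvectors, project). The data, the helper name pca, and k are my own illustrative choices, not necessarily identical to the slide's recipe.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k directions of largest variance."""
    mu = X.mean(axis=0)                       # 1. subtract the sample mean
    Xc = X - mu
    cov = (Xc.T @ Xc) / (len(X) - 1)          # 2. sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # 3. eigendecomposition (ascending eigenvalues)
    order = np.argsort(eigvals)[::-1][:k]     # 4. indices of the top-k eigenvalues
    E = eigvecs[:, order]                     #    columns e_1, ..., e_k
    Y = Xc @ E                                # 5. new k-dimensional features
    return Y, E, mu

# Toy data: 200 points in R^3 that mostly vary along one direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) @ np.array([[3.0, 2.0, 1.0]]) + 0.1 * rng.normal(size=(200, 3))
Y, E, mu = pca(X, k=1)
print(Y.shape)                                # (200, 1)
print(E[:, 0])                                # ~ the dominant direction (up to sign)
```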
Linear Discriminant Analysis
Supervised Learning

Data Representation vs. Data Classification

[Slides: Fisher Linear Discriminant, Fisher Linear Discriminant Derivation, Fisher Linear Discriminant Example]

Multiple Discriminant Analysis (MDA)
▪ Can generalize FLD to multiple classes
▪ In the case of c classes, can reduce dimensionality to 1, 2, 3, ..., c−1 dimensions
▪ Project sample x_i to a linear subspace: y_i = V^t x_i
▪ V is called the projection matrix

[Slides: Multiple Discriminant Analysis (MDA) continued, FDA and MDA Drawbacks]

This week
• Assignment 1 due Thu Aug 27 11:59pm
• Projects announced
Next week
• Clustering and Similarity