ECE 657A: Data and Knowledge Modelling and Analysis
Lecture 3: Parameter Estimation, Dimensionality Reduction,

Feature Extraction/Selection

Mark Crowley

January 18, 2016

Opening Data Example : Guess the Dataset

??

Today’s Class

Announcements

Parameter Estimation

Dimensionality Reduction

Feature Extraction

Today’s Class

Announcements

Project Description

Lecture 2 slides

Part IV : Parameter Estimation

Lecture 3 slides : Feature Extraction

Linear Methods: PCA, LDA
Nonlinear Methods

Announcements

Volunteer note taker needed

Project Description is out. What do you think?

Back to Lecture 2: Parameter Estimation!

Data Reduction Overview

Outline of Dimensionality Reduction Methods

Data Reduction Overview
Feature Extraction
Feature Selection
Motivation: Data Reduction

Principal Component Analysis
Overview of PCA
Implementing PCA

Linear Discriminant Analysis
Separation Measures
Fisher Linear Discriminant
Independent Component Analysis

Nonlinear Methods For Dimensionality Reduction
Global Methods – Multidimensional Scaling
Isomap
Locally Linear Embedding (LLE)

Data Reduction Overview

Feature Extraction vs. Feature Selection

Given a dataset X : N × D matrix.
We can extract or transform new features F to describe the
variation, distances, proximity, etc., in the data, such that |F| < D.
We can select from the existing features the ones that are most representative, F, to describe X, with |F| < D.

Feature Extraction Methods

Feature Extraction combines the original features into a new set of features by transformation or projection.
The projection is chosen according to some objective function or concept.
The objective should find the most significant set of new features, used as a lower-dimensional subspace of the original set of features.
Interpretation: a transformation can produce new features that have an interpretable meaning for the data, but many projections produce features that are difficult to relate to the original features.

Methods of Feature Extraction

Linear methods:
Unsupervised (no class label information, e.g. PCA, ICA)
Supervised (class labels known, e.g. LDA)
Nonlinear methods:
Global (preserve global properties, e.g. MDS, Isomap, Kernel PCA)
Local (preserve properties within a local neighborhood, e.g. LLE)

Feature Selection

Feature Selection attempts to find a subset of the original features which satisfies some criteria.
Ideally the selected subset includes the significant features and eliminates irrelevant and redundant features.
The selected features, being a subset of the original, are interpretable and keep their original meaning.
Approaches include:
Feature Ranking
Subset Selection, which includes the Wrapper approach and the Filter approach
Both use search strategies such as:
Exhaustive
Sequential (Forward or Backward)

Dimensionality Reduction

Can be seen as a preprocessing step that may be required for:
Visualization of the data.
Finding the intrinsic dimensionality (features).
Many applications have a large number of features that may be redundant, irrelevant or sparse (e.g. text documents where the features are words).
If the number of samples is much smaller than the number of dimensions, learning at the desired resolution becomes difficult (the curse of dimensionality).
Reducing the cost of training and learning algorithms.
Reducing storage and future measurement costs.

Motivation: The Curse of Dimensionality

A term introduced by Richard Bellman to illustrate the explosion in the number of samples required as the number of dimensions grows.
The number of samples required to fully represent a function taking 10 values in one dimension is 10.
For 20 dimensions with the same number of values (10 data points in each dimension) we would need 10^20 samples.
The number of samples grows exponentially in the number of dimensions.
This poses a problem for non-parametric algorithms, such as most classification algorithms, that require many training samples.

Motivation: The Curse of Dimensionality

Learn a 1-D function from data: how many data points are needed? N points.
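The exponential growth quoted above (10 samples in one dimension vs. 10^20 in twenty) can be checked directly. A minimal sketch; N and the dimensions printed are just illustrative values:

```python
# Samples needed to place N distinct values along each of d axes is N**d,
# matching the 10 vs. 10**20 comparison on the slide.
N = 10
for d in (1, 2, 5, 10, 20):
    print(f"d = {d:2d}: {N**d:.3e} samples")
```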
Motivation: The Curse of Dimensionality

What if the data is better explained by a 2-D function? N^2 points.

Motivation: Peaking Phenomenon

For a fixed number of samples, increasing the number of features may degrade the performance (e.g. genetics data: very high dimensional, small sample size).
The classic solution is to assume a fixed ordering of the features and only choose the most important ones.
A newer approach: use classification algorithms to reduce the feature space (see http://ift.tt/2iN7cA8 [From [7]]).
[Figure: classification accuracy and error vs. number of features; accuracy peaks at the desired number of features and then degrades.]

Principal Component Analysis

A way to linearly transform a set of d-dimensional vectors x̄_1, x̄_2, ..., x̄_n into another set of m-dimensional vectors.
Has the property that most of the information content is stored in the first few dimensions, so we can have m < d.
The main idea is that high information corresponds to high variance (more discriminating).
The direction of maximum variance is parallel to the eigenvector corresponding to the largest eigenvalue of the covariance matrix of the sample matrix A. [From [4]]

Intuitive View

Find new axes which are better in terms of variance and errors than the original ones.
PCA is equivalent to minimizing the mean-square error; it also maximizes the scatter.
See the Jain and Dubes book [From [4]] for the derivation.
Visual explanation: http://setosa.io/ev/principal-component-analysis/

Implementing PCA

Let R be the d × d covariance matrix of A.
A is normalized by subtracting the mean: x'_ij = x_ij − x̄_j, for i = 1, ..., n and j = 1, ..., d.
R is symmetric and positive semi-definite, so its eigenvalues are real and non-negative.
Now we apply an orthogonal transformation to R to diagonalize it:

$$ C R C^T = \Lambda_d \quad (1) $$

where Λ_d is a diagonal matrix of the d eigenvalues of R and C is the matrix whose rows are the corresponding eigenvectors of R.
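A minimal numpy sketch of these steps (mean-centering, forming R, and diagonalizing it); the toy data and variable names are illustrative, not from the course:

```python
import numpy as np

# Center A, form the covariance matrix R, and diagonalize it as in equation (1).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 5))                 # toy data: n = 100 samples, d = 5 features

A_c = A - A.mean(axis=0)                      # x'_ij = x_ij - x̄_j
R = (A_c.T @ A_c) / A_c.shape[0]              # d x d sample covariance matrix

# R is symmetric, so eigh returns real eigenvalues (ascending) and orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder to descending

C = eigvecs.T                                 # rows of C are the eigenvectors ĉ_i^T
Lambda_d = C @ R @ C.T                        # ≈ diag(λ_1, ..., λ_d), equation (1)
```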
Implementing PCA

We can sort the eigenvalues of R such that

$$ \lambda_1 \geq \lambda_2 \geq \lambda_3 \geq \dots \geq \lambda_d \geq 0 \quad (2) $$

where ĉ_1, ĉ_2, ĉ_3, ..., ĉ_d are the corresponding eigenvectors, called the principal components (axes).
To reduce the dimensions while keeping a large percentage of the variance, we select the first m eigenvalues and eigenvectors.
Let

$$ H_m = \begin{bmatrix} \hat{c}_1^T \\ \hat{c}_2^T \\ \vdots \\ \hat{c}_m^T \end{bmatrix} $$

be an m × d matrix.

Implementing PCA

Then the projected points are

$$ \bar{y}_i = H_m \bar{x}_i, \quad i = 1, 2, \dots, n \qquad (m \times 1 = m \times d \cdot d \times 1) $$

The projected matrix B_m is

$$ B_m = \begin{bmatrix} \bar{y}_1^T \\ \bar{y}_2^T \\ \vdots \\ \bar{y}_n^T \end{bmatrix} = \begin{bmatrix} \bar{x}_1^T \\ \bar{x}_2^T \\ \vdots \\ \bar{x}_n^T \end{bmatrix} H_m^T = A H_m^T $$

where x̄_k^T = [x_k1, x_k2, ..., x_kd] and ȳ_k^T = [y_k1, y_k2, ..., y_km].

Implementing PCA

The covariance matrix in the new space can be written as

$$ \frac{1}{n} B_m^T B_m = \frac{1}{n} \sum_{i=1}^{n} \bar{y}_i \bar{y}_i^T = H_m R H_m^T = H_m (C^T \Lambda C) H_m^T = (H_m C^T) \Lambda (H_m C^T)^T = \Lambda_m = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_m) $$

since

$$ H_m C^T = \begin{bmatrix} \hat{c}_1^T \\ \vdots \\ \hat{c}_m^T \end{bmatrix} \begin{bmatrix} \hat{c}_1 & \hat{c}_2 & \cdots & \hat{c}_d \end{bmatrix} = \begin{bmatrix} I_m & 0 \end{bmatrix} $$

which means the m new features are uncorrelated.

Sum of Eigenvalues

The sum of the eigenvalues of R is the total sample variance in the new space.
One would choose m such that

$$ r_m = \frac{\sum_{i=1}^{m} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq \tau, \qquad \tau < 1 $$

e.g. choosing τ = 0.95 ensures that 95% of the variance is retained in the new space.
One way to find the right value of m is to use a scree plot. The scree plot can be done in different ways:
simply plot the eigenvalues of the components (in descending order) and look for a gap or a knee in the plot,
or plot r_m as a function of m and look for a knee in the curve,
or plot the cumulative variance. [From [8]]

Scree Plot (Descending Eigenvalues)

[Figure: scree plot of the eigenvalues for PC1–PC7, in descending order.]

Scree Plot (Normalized Eigenvalues)

[Figure: normalized eigenvalues (eigenvalue / sum) and r_m for PC1–PC7.]

Cost of Computation

The covariance matrix R = A^T A (ignoring the 1/n factor) is a d × d matrix, and d (features) may be much larger than n (samples).
Using this approach directly could be too expensive.

Singular Value Decomposition

The matrix A can be decomposed using the singular value decomposition into A_{n×d} = U S V^T, where:
U is an n × d matrix with orthonormal columns (U^T U = I), the left singular vectors;
V is a d × d matrix with orthonormal columns (V^T V = I), the right singular vectors;
S is a d × d diagonal matrix of singular values.
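Before the SVD route that follows, here is a sketch of the eigenvalue-based recipe above: sort the eigenvalues, pick the smallest m with r_m ≥ τ, and project with H_m. The helper name, toy data, and τ value are illustrative, not prescribed by the course:

```python
import numpy as np

def pca_reduce(A, tau=0.95):
    """Project A onto the fewest principal axes that retain a fraction tau of the variance."""
    A_c = A - A.mean(axis=0)
    R = (A_c.T @ A_c) / A_c.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]              # descending eigenvalues, as in (2)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    r = np.cumsum(eigvals) / eigvals.sum()         # retained-variance ratio r_m for m = 1..d
    m = int(np.searchsorted(r, tau)) + 1           # smallest m with r_m >= tau
    H_m = eigvecs[:, :m].T                         # m x d matrix with rows ĉ_1^T ... ĉ_m^T
    B_m = A_c @ H_m.T                              # n x m matrix of projected points (scores)
    return B_m, eigvals, m

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # toy data with correlated features
B_m, eigvals, m = pca_reduce(A)
print(m, np.round(eigvals, 3))    # the eigenvalues, in descending order, give the scree plot
```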
PCA Using Singular Value Decomposition

SVD can be used to obtain PCA. Now

$$ A A^T = U S V^T (V S U^T) = U S^2 U^T \qquad \text{and} \qquad A^T A = V S U^T (U S V^T) = V S^2 V^T $$

which leads to the following facts:
The singular values are the square roots of the eigenvalues of the covariance matrix.
The right singular vectors are the eigenvectors of the covariance matrix.
So the SVD gives us the d (ordered) eigenvalues as well as the principal components, and we can reduce the dimensions by selecting the largest m.

Interpretation of PCA

PCA is also called the Karhunen-Loeve transform (KLT) or the Hotelling transform.
The transformed points ȳ_1, ȳ_2, ..., ȳ_n are sometimes called scores.
The eigenvectors or principal components c̄_1, c̄_2, ..., c̄_d are called loadings, represented by coefficients.
The eigenvalues of the covariance matrix are sometimes called the latent representation.
The Hotelling T^2 statistic measures the distance from the mean.

Interpretation of PCA

PCA is optimal in the sense of minimizing the sum of squared errors.
It obtains the maximum-variance projection by finding orthogonal linear combinations of the original variables.
It mainly rotates the coordinates (for zero-mean data) to find new axes that have the maximum variance.
It de-correlates the axes. For uni-modal Gaussian data this amounts to finding independent axes; for data that does not follow this distribution, the axes are not necessarily independent.
The principal components may not be the best axes for discriminating between classes.

Whitening (Sphering)

The goal is to have features that are less correlated with each other, each representing something independently important, and all with the same variance (a kind of normalization).
A whitening transformation:
given a vector of random variables with a known covariance matrix,
we want a linear transformation into a set of new variables such that the covariance is the identity matrix (i.e. they are uncorrelated and all have variance 1).
The transformation changes the input vector into a "white noise" vector.

Performing Whitening

Shift the data to zero mean (subtract the mean from each feature).
Compute the eigenvectors U and eigenvalues S using the SVD.
Project the data using U and S to obtain the whitened data points.
See http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/ for an example using MATLAB.
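A minimal sketch of PCA and PCA-whitening via the SVD, following the steps above; the toy data and the small epsilon added for numerical stability are assumptions, not from the slides:

```python
import numpy as np

# PCA and PCA-whitening via the SVD of the centered data matrix (A = U S V^T).
rng = np.random.default_rng(0)
A = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # toy correlated data
A_c = A - A.mean(axis=0)                                  # shift to zero mean

U, s, Vt = np.linalg.svd(A_c, full_matrices=False)        # singular values s, descending
eigvals = s**2 / A_c.shape[0]                             # eigenvalues of the covariance matrix
scores = A_c @ Vt.T                                       # PCA scores (right singular vectors = PCs)

eps = 1e-8                                                # assumed constant to avoid division by zero
A_white = scores / np.sqrt(eigvals + eps)                 # rescale each axis to unit variance

print(np.round(np.cov(A_white, rowvar=False), 2))         # ≈ identity: uncorrelated, variance 1
```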
Linear Discriminant Analysis

Linear Discriminant Analysis (LDA)

If we know the labels of the data (classes), we can use a supervised approach such as discriminant analysis.
The emphasis is on finding projections that best discriminate the data in lower dimensions.
Intuition: maximize the between-class scatter while holding the within-class scatter fixed. [From [3]]

PCA vs. LDA

[Figure, from Duda, Hart and Stork [3]: two-class data in the (x1, x2) plane, comparing a direction that is better in terms of variance and errors (PCA) with one that is better for discrimination (LDA).]

Fisher Linear Discriminant

FLD is a special case of LDA.
Consider the 2-class problem. We have n samples x, each of d dimensions, divided into two classes: C_1 has n_1 samples and C_2 has n_2 samples.
We want to project these onto a line w such that the projected n points y (each a scalar) are divided into two groups, D_1 and D_2, where y = w^T x.
Our goal is a projection that well separates the projected points into the two classes.

Separation Measures

Use the difference between the means of the projected points of each group.
The means of the original sample points in each class are

$$ \mathbf{m}_i = \frac{1}{n_i} \sum_{\mathbf{x} \in C_i} \mathbf{x}, \quad i = 1, 2 $$

and the means of the projected points are

$$ m'_i = \frac{1}{n_i} \sum_{y \in D_i} y = \frac{1}{n_i} \sum_{\mathbf{x} \in C_i} \mathbf{w}^T \mathbf{x} = \mathbf{w}^T \mathbf{m}_i $$

Simple Mean Difference

Note that the original sample means m_i are vectors of the feature means for the samples in each class, while the projected means m'_i = w^T m_i are scalars. The separation between the projected means is

$$ |m'_1 - m'_2| = |\mathbf{w}^T (\mathbf{m}_1 - \mathbf{m}_2)| $$

We can find w that maximizes this difference. However, the difference can be made large simply by scaling w, so we need to normalize it.

Within-Class Scatter

Fisher suggested normalizing the difference by the within-class scatter. The scatter of the projected points is defined as

$$ s_i^2 = \sum_{y \in D_i} (y - m'_i)^2 $$

and s_1^2 + s_2^2 is called the within-class scatter of the projected points (n times the variance).
Define the within-class scatter of the original samples as

$$ S_W = S_1 + S_2, \qquad S_i = \sum_{\mathbf{x} \in C_i} (\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T $$

Between-Class Scatter

Substituting for y and m':

$$ s_1^2 + s_2^2 = \mathbf{w}^T S_W \mathbf{w}, \qquad (m'_1 - m'_2)^2 = \mathbf{w}^T S_B \mathbf{w} $$

where S_B = (m_1 − m_2)(m_1 − m_2)^T is the between-class scatter.
So the problem can now be stated as finding w that maximizes

$$ \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}} $$

The solution w needs to satisfy

$$ S_W^{-1} S_B \mathbf{w} = \lambda \mathbf{w} $$

In general this is an eigenvalue problem, but in the 2-class case S_B w is always in the direction of (m_1 − m_2), so w ∝ S_W^{-1}(m_1 − m_2).

Generalization of FLD to K classes

Given K classes, we find a projection into a (K − 1)-dimensional subspace.
The problem reduces to finding a (K − 1) × d projection matrix W such that the projected sample points y = W x are well separated.
Find the projection matrix W by maximizing the between-group scatter while holding the within-group scatter constant. [From [4]]
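Before the K-class details, a minimal sketch of the two-class Fisher discriminant just derived, w ∝ S_W^{-1}(m_1 − m_2); the toy Gaussian blobs stand in for real labelled data:

```python
import numpy as np

# Two-class Fisher linear discriminant: project d-dimensional points onto w = S_W^{-1}(m1 - m2).
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class C1
X2 = rng.normal(loc=[3.0, 1.0], scale=1.0, size=(60, 2))   # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)                  # class means
S1 = (X1 - m1).T @ (X1 - m1)                               # scatter of class 1
S2 = (X2 - m2).T @ (X2 - m2)                               # scatter of class 2
S_W = S1 + S2                                              # within-class scatter

w = np.linalg.solve(S_W, m1 - m2)                          # direction of S_W^{-1}(m1 - m2)
w /= np.linalg.norm(w)                                     # normalize (the scale of w is arbitrary)

y1, y2 = X1 @ w, X2 @ w                                    # projected (scalar) points
print(abs(y1.mean() - y2.mean()), y1.std(), y2.std())      # large mean gap, small within-class spread
```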
Generalization of FLD to K classes

Let there be K classes, each of size n_ℓ, for ℓ ∈ [1, K].
Let the points in the ℓ-th group be the vectors x̄_1^(ℓ), ..., x̄_{n_ℓ}^(ℓ), where x̄_j^(ℓ) = [x_{j1}^(ℓ), ..., x_{jd}^(ℓ)]^T.
The mean of the i-th feature for the ℓ-th group is

$$ m_i^{(\ell)} = \frac{1}{n_\ell} \sum_{j=1}^{n_\ell} x_{ji}^{(\ell)} $$

Generalization to K classes

The vector of feature means is then

$$ \bar{\mathbf{m}}^{(\ell)} = [m_1^{(\ell)}, m_2^{(\ell)}, \dots, m_d^{(\ell)}]^T $$

and the pooled mean m̄ is the mean vector over all samples:

$$ \bar{\mathbf{m}} = \frac{1}{n} \sum_{\ell=1}^{K} n_\ell \, \bar{\mathbf{m}}^{(\ell)}, \qquad n = \sum_{\ell=1}^{K} n_\ell $$

Scatter Matrix

The total scatter matrix is

$$ S = \sum_{\ell=1}^{K} \sum_{j=1}^{n_\ell} (\bar{\mathbf{x}}_j^{(\ell)} - \bar{\mathbf{m}})(\bar{\mathbf{x}}_j^{(\ell)} - \bar{\mathbf{m}})^T $$

and the scatter matrix of the ℓ-th group is

$$ S^{(\ell)} = \sum_{j=1}^{n_\ell} (\bar{\mathbf{x}}_j^{(\ell)} - \bar{\mathbf{m}}^{(\ell)})(\bar{\mathbf{x}}_j^{(\ell)} - \bar{\mathbf{m}}^{(\ell)})^T $$

Within-Group and Between-Group Scatter

The within-group scatter matrix is

$$ S_W = \sum_{\ell=1}^{K} S^{(\ell)} $$

and the between-group scatter matrix is

$$ S_B = \sum_{\ell=1}^{K} n_\ell (\bar{\mathbf{m}}^{(\ell)} - \bar{\mathbf{m}})(\bar{\mathbf{m}}^{(\ell)} - \bar{\mathbf{m}})^T = \sum_{\ell=1}^{K} n_\ell \, \bar{\mathbf{m}}^{(\ell)} (\bar{\mathbf{m}}^{(\ell)})^T - n \, \bar{\mathbf{m}} \bar{\mathbf{m}}^T $$

Combining Scatter Matrices

S = S_B + S_W.
The projection is chosen to maximize S_B while keeping S_W/S constant.
The solution gives the rows of the projection matrix W as the K − 1 eigenvectors of S_W^{-1} S_B whose eigenvalues are non-zero.
Reducing dimensions: to project to t ≤ (K − 1) dimensions, use the eigenvectors of the largest t eigenvalues.
LDA vs. PCA: LDA is a supervised method and it is fast. It is eigenvector-based like PCA but generally better than PCA for classification; it is limited to K − 1 components.
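A minimal sketch of this K-class construction, keeping the eigenvectors of S_W^{-1} S_B with the largest eigenvalues; the helper name and the three-class toy data are illustrative:

```python
import numpy as np

def multiclass_lda(X, y, t=None):
    """Rows of the returned W are the top eigenvectors of S_W^{-1} S_B (at most K-1 of them)."""
    classes = np.unique(y)
    d = X.shape[1]
    m_all = X.mean(axis=0)                                   # pooled mean
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        m_c = Xc.mean(axis=0)
        S_W += (Xc - m_c).T @ (Xc - m_c)                     # within-group scatter
        S_B += len(Xc) * np.outer(m_c - m_all, m_c - m_all)  # between-group scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]                   # largest eigenvalues first
    t = t or (len(classes) - 1)                              # at most K - 1 useful components
    return eigvecs.real[:, order[:t]].T                      # t x d projection matrix W

# Toy data: three Gaussian blobs in 4 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=mu, size=(40, 4)) for mu in ([0, 0, 0, 0], [3, 0, 0, 0], [0, 3, 0, 0])])
y = np.repeat([0, 1, 2], 40)
W = multiclass_lda(X, y)
Y = X @ W.T                                                  # projected points, (K-1)-dimensional
print(W.shape, Y.shape)
```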
Independent Component Analysis (ICA)

PCA uses second-order statistics (i.e. mean and variance).
Linear projection methods can also use higher-order statistics, so they can be used for non-Gaussian distributed data: e.g. ICA, Projection Pursuit.
ICA was proposed for blind source separation of signals ("the cocktail party problem").
It finds projections that are independent but may not be orthogonal.
Each source can be any non-Gaussian distribution.

PCA vs. ICA

[Figures: the ground-truth source signals compared with the observed mixed signals, with the PCA estimates, and with the ICA estimates.]

Projection Pursuit

How do we choose each projection dimension?
Projection pursuit uses a measure of "interestingness" of a projection.
Interestingness is a measure of some aspect of not being Gaussian (such as entropy).
It tries to find a projection that maximizes this measure; the data is then reduced by removing the components along this projection.
Projections are pursued one at a time such that the extracted signal is as non-Gaussian as possible.
The Gaussian distribution is the maximum-entropy distribution, so we can maximize the negative entropy (negentropy).
This is equivalent to MLE up to a sign change and the addition of a constant.

Implementing ICA

Maximum likelihood estimation formulation:

$$ \mathbf{x}_t = W \mathbf{z}_t + \boldsymbol{\epsilon}_t $$

where x_t is the observed (mixed) signal vector at time t, z_t is the vector of independent source components, and ε_t is a noise term.
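A comparison like the one in the figures above can be sketched with scikit-learn, assuming it is available; the synthetic sources and mixing matrix below are illustrative, not the course's data:

```python
import numpy as np
from sklearn.decomposition import FastICA, PCA

# Synthetic "cocktail party": three non-Gaussian sources, linearly mixed.
rng = np.random.default_rng(0)
t = np.linspace(0, 8, 500)
S = np.c_[np.sin(2 * t),                       # sinusoid
          np.sign(np.sin(3 * t)),              # square wave
          rng.laplace(size=t.size)]            # heavy-tailed noise source
A_mix = rng.normal(size=(3, 3))                # unknown mixing matrix
X = S @ A_mix.T                                # observed mixed signals (n_samples x 3)

ica = FastICA(n_components=3, random_state=0)
S_ica = ica.fit_transform(X)                   # ICA estimate of the sources

pca = PCA(n_components=3)
S_pca = pca.fit_transform(X)                   # PCA projection: orthogonal, max-variance axes

# ICA components tend to match the sources up to permutation, sign, and scale;
# the PCA components generally do not, which is the point of the comparison figures.
print(S_ica.shape, S_pca.shape)
```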