
COMP 3115 Exploratory Data Analysis and Visualization
Lecture 6: Dimensionality Reduction
§ Revise: Eigenvalues and Eigenvectors
§ Why do we need dimensionality reduction?
§ Projection
§ Linear Dimensionality Reduction
– Principal Component Analysis (PCA)
§ Dimensionality Reduction – Nonlinear Methods
– Locally Linear Embedding


sepal_length  sepal_width  petal_length  petal_width  species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
7.0           3.2          4.7           1.4          versicolor
6.3           3.3          6.0           2.5          virginica
...           ...          ...           ...          ...
§ X: the n*d feature matrix
– n: the number of data samples
– d: the dimensionality of each data sample
§ y: n*1 label vector
§ X(i,:): the i-th row in matrix X – i.e., the i-th data sample
§ xi: the i-th sample in column vector representation
– xi = X(i,:)T
§ x(i,j): the j-th feature of the i-th sample
§ yi: the label of the i-th sample
Capital bold letters denote matrices, small bold letters denote vectors, and small (italic) letters denote scalars. Vectors are column vectors.
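The table above is the Iris dataset; a minimal sketch of loading it into this notation with scikit-learn (assuming scikit-learn is available):

```python
# A hedged sketch: the Iris data as an n*d feature matrix X and an n*1 label vector y.
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target          # X: (150, 4) feature matrix, y: (150,) labels

n, d = X.shape
print(n, d)                            # 150 4
print(X[0, :])                         # X(1,:) -- the first data sample: [5.1 3.5 1.4 0.2]
print(iris.target_names[y[0]])         # its label: 'setosa'
```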

Matrix-vector multiplication
§ Matrix-vector multiplication
$$\mathbf{A}\mathbf{x} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1\cdot x_1 + 0\cdot x_2 + 0\cdot x_3 \\ -1\cdot x_1 + 1\cdot x_2 + 0\cdot x_3 \\ 0\cdot x_1 - 1\cdot x_2 + 1\cdot x_3 \end{bmatrix} = \begin{bmatrix} x_1 \\ -x_1 + x_2 \\ -x_2 + x_3 \end{bmatrix}$$
$$\begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ 4 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix}$$
Most of the time, Ax will not be in the same direction as x.
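As a quick numerical check of the example above (assuming NumPy is available):

```python
import numpy as np

# The difference matrix A from the slide and the vector x = [1, 4, 9]^T.
A = np.array([[ 1, 0, 0],
              [-1, 1, 0],
              [ 0,-1, 1]])
x = np.array([1, 4, 9])

print(A @ x)   # [1 3 5] -- in general not a multiple of x
```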

Eigenvalues and Eigenvectors
§ Matrix-Vector multiplication: Ax.
– Almost all vectors change direction when they are multiplied by a matrix A.
§ Eigenvectors (“eigen” is a German word meaning “characteristic”)
– There are certain exceptional vectors x whose direction is the same as that of Ax.
– Multiplying by the matrix A does not change the direction of these vectors.
– These vectors are “eigenvectors”.
§ Eigenvalues
– For these eigenvectors, multiplying by the matrix A is the same as multiplying by a number.
– Therefore, 𝐀𝐱 = 𝜆𝐱.
– The number 𝜆 is an eigenvalue of A.

Eigenvalues and Eigenvectors
§ Eigenvalue equation 𝐀𝐱 = 𝜆𝐱
§ The eigenvalue λ tells whether the eigenvector x is stretched, shrunk, reversed, or left unchanged when it is multiplied by the matrix A.
§ E.g.,
$$\begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix}\begin{bmatrix} 2 \\ 1 \end{bmatrix} = 5\begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

Eigenvalues and Eigenvectors
§ If A is the identity matrix, every vector satisfies Ax = x. All vectors are eigenvectors of I, and all eigenvalues are 1.
$$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \end{bmatrix} \qquad\qquad \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}$$

Computing eigenvalues, eigenvectors
§ How to compute eigenvalues, eigenvectors based on eigenvalue equation 𝐀𝐱 = 𝜆𝐱
§ Let us rewrite 𝐀𝐱 = 𝜆𝐱 as (𝐀 − 𝜆𝐈)𝐱 = 0.
– The matrix A − λI times the eigenvector x is the zero vector.
– If A − λI is invertible, then $\mathbf{x} = (\mathbf{A} - \lambda\mathbf{I})^{-1}\mathbf{0} = \mathbf{0}$. However, we are NOT interested in the trivial solution x = 0.
– Since x ≠ 0, this requires that the matrix A − λI is not invertible.

Inverse matrix
§ Suppose A is a square matrix. We look for an inverse matrix A−1 of the same size, such that A−1 times A equals I, the identity matrix. The matrix A−1 is called “A inverse”.
§ The product A−1A is like multiplying by a number and then dividing by that number.
– E.g., 5−1 · 5 = 1.
– A number has an inverse if it is not zero; 0−1 is undefined.
– The inverse of a matrix is more complicated and more interesting.
§ The matrix A is invertible if there exists a matrix A−1 that “inverts” A: A−1A = AA−1 = I
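A small NumPy check of this definition, using an arbitrary invertible 2×2 matrix (the particular matrix is just an illustration):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
A_inv = np.linalg.inv(A)   # "A inverse"

print(A_inv @ A)           # approximately the identity matrix I
print(A @ A_inv)           # approximately the identity matrix I
```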

Which square matrices are invertible?
§ Consider a matrix
$$\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$
§ If we multiply A with another matrix B
$$\mathbf{B} = \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
§ We obtain
$$\mathbf{A}\mathbf{B} = \begin{bmatrix} ad-bc & 0 \\ 0 & ad-bc \end{bmatrix} = (ad-bc)\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} = (ad-bc)\,\mathbf{I}$$


Which square matrices are invertible? (cont’d)
§ AB = (ad − bc)I
§ We know A−1A = AA−1 = I if A−1 exists.
§ Therefore,
$$\mathbf{A}^{-1} = \frac{1}{ad-bc}\,\mathbf{B} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$
This inverse exists if and only if (ad − bc) ≠ 0. When (ad − bc) = 0, we would have to divide by zero, which we cannot do, so A has no inverse.

Determinant of a matrix
§ The single number (ad − bc) is the determinant of $\mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$.
§ This single number immediately tells whether the matrix is invertible.
§ For any square matrix A, A is invertible if and only if the determinant of A is non-zero.


Determinant of a matrix
§ The determinant of a square matrix A is a scalar value that can be computed from the elements of A. We usually use det(A) or |A| to denote the determinant of A.
§ We have explicit expression for determinant of small matrices in terms of the elements of the matrix.
$$\det\begin{bmatrix} a \end{bmatrix} = a \qquad\qquad \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc$$
$$\det\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} = aei + bfg + cdh - ceg - bdi - afh$$
The n = 3 case is known as Sarrus’ rule.
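A quick numerical check with NumPy: the determinant immediately reveals whether a matrix is invertible (the two matrices below are just illustrations):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
print(np.linalg.det(A))     # ad - bc = 4*3 - 2*1 = 10 (non-zero) -> A is invertible

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
print(np.linalg.det(B))     # 1*4 - 2*2 = 0 -> B has no inverse
```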

Computing eigenvalues and eigenvectors
§ How to compute eigenvalues and eigenvectors based on eigenvalue equation 𝐀𝐱 = 𝜆𝐱
§ Let us rewrite 𝐀𝐱 = 𝜆𝐱 as (𝐀 − 𝜆𝐈)𝐱 = 0.
– The matrix A − λI times the eigenvector x is the zero vector.
– If A − λI is invertible, then $\mathbf{x} = (\mathbf{A} - \lambda\mathbf{I})^{-1}\mathbf{0} = \mathbf{0}$. However, we are NOT interested in the trivial solution x = 0.
§ Obtain eigenvalues first
– Since x ≠ 0, this requires that the matrix A − λI is not invertible.
– Therefore, det(A − λI) = 0. This equation involves only λ, not x.
§ For each eigenvalue λ, solve (A − λI)x = 0 to find an eigenvector x.

Example of Computing eigenvalues and eigenvectors
§ Let us find the eigenvalues and eigenvectors of the following 2×2 matrix:
$$\mathbf{A} = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix}$$
§ Find A − λI by subtracting λ from the diagonal of A.
§ Obtain the eigenvalues by solving det(A − λI) = 0:
$$\mathbf{A} - \lambda\mathbf{I} = \begin{bmatrix} 4-\lambda & 2 \\ 1 & 3-\lambda \end{bmatrix}, \qquad (4-\lambda)(3-\lambda) - 2\cdot 1 = \lambda^2 - 7\lambda + 10 = (2-\lambda)(5-\lambda) = 0$$
$$\lambda_1 = 2 \quad \text{and} \quad \lambda_2 = 5$$

Example of Computing eigenvalues and eigenvectors
§ Now find the eigenvectors by solving (A − λI)x = 0 separately for λ₁ = 2 and λ₂ = 5.
§ Let us denote x as $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$.
(1) For λ₁ = 2, we obtain (A − 2I)x = 0:
$$\begin{bmatrix} 4-2 & 2 \\ 1 & 3-2 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \;\Rightarrow\; \begin{aligned} 2x_1 + 2x_2 &= 0 \\ x_1 + x_2 &= 0 \end{aligned}$$
This means any vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ with $x_2 = -x_1$, such as $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ (or $\begin{bmatrix} 2 \\ -2 \end{bmatrix}$), is an eigenvector of A with eigenvalue 2.
There are multiple solutions for the eigenvector. In practice, we will use the unit eigenvector obtained by dividing by the length, i.e., $\begin{bmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{bmatrix}$ for this example.

Example of Computing eigenvalues and eigenvectors
§ Now find the eigenvectors by solving (A − λI)x = 0 separately for λ₁ = 2 and λ₂ = 5.
§ Let us denote x as $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$.
(2) For λ₂ = 5, we obtain (A − 5I)x = 0:
$$\begin{bmatrix} 4-5 & 2 \\ 1 & 3-5 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \;\Rightarrow\; \begin{aligned} -x_1 + 2x_2 &= 0 \\ x_1 - 2x_2 &= 0 \end{aligned}$$
This means any vector $\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$ with $x_2 = \tfrac{x_1}{2}$, such as $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$ (or $\begin{bmatrix} 4 \\ 2 \end{bmatrix}$), is an eigenvector of A with eigenvalue 5.
There are multiple solutions for the eigenvector. In practice, we will use the unit eigenvector obtained by dividing by the length, i.e., $\begin{bmatrix} 2/\sqrt{5} \\ 1/\sqrt{5} \end{bmatrix}$ for this example.
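A minimal NumPy check of this worked example; note that np.linalg.eig returns unit-length eigenvectors whose order and sign may differ from the hand calculation:

```python
import numpy as np

# Verify the worked example: A has eigenvalues 5 and 2.
A = np.array([[4.0, 2.0],
              [1.0, 3.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

print(eigenvalues)     # 5 and 2 (order may differ)
print(eigenvectors)    # columns are unit eigenvectors, e.g. [2/sqrt(5), 1/sqrt(5)]
                       # for eigenvalue 5 (signs may be flipped)

# Check the eigenvalue equation A x = lambda x for the first returned pair.
print(A @ eigenvectors[:, 0] - eigenvalues[0] * eigenvectors[:, 0])   # ~[0, 0]
```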

§ Revise: Eigenvalues and Eigenvectors
§ Why do we need dimensionality reduction?
§ Projection
§ Linear Dimensionality Reduction
– Principal Component Analysis (PCA)
§ Dimensionality Reduction – Nonlinear Methods
– Locally Linear Embedding

Why do we need dimensionality reduction?
§ High-dimensional datasets are all around us …
Face Image Data
Gene Expression Data

Why do we need dimensionality reduction?
§ Visualization: projection of high-dimensional data onto 2D or 3D
§ A new representation with fewer dimensions helps prevent overfitting in subsequent learning algorithms (e.g., clustering, regression, classification, etc.)
§ Speeding up learning algorithms
§ Less storage requirements (data compression)

§ Revise: Eigenvalues and Eigenvectors
§ Why do we need dimensionality reduction?
§ Projection
§ Linear Dimensionality Reduction
– Principal Component Analysis (PCA)
§ Dimensionality Reduction – Nonlinear Methods
– Locally Linear Embedding

Projection
§ Vector projection
– Dot product of two vectors
– a = [a1, a2], b = [b1, b2]
– $\mathbf{a}\cdot\mathbf{b} = a_1 b_1 + a_2 b_2 = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta$
§ Projection on the “standard coordinate system”
– The projection of the vector [2, 4] onto the x-axis is the dot product of [2, 4] and [1, 0]: 2·1 + 4·0 = 2
– The projection of the vector [2, 4] onto the y-axis is the dot product of [2, 4] and [0, 1]: 2·0 + 4·1 = 4
(Figure: the points (2, 4), (−1, 2), and (−3, −1) in the plane.)

Projection on other directions
§ Project onto the direction $\mathbf{u} = \left[\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right]$
§ Project [2, 4] onto direction u: $2\cdot\tfrac{1}{\sqrt{2}} + 4\cdot\tfrac{1}{\sqrt{2}} = \tfrac{6}{\sqrt{2}}$
§ Project [−1, 2] onto direction u: $-1\cdot\tfrac{1}{\sqrt{2}} + 2\cdot\tfrac{1}{\sqrt{2}} = \tfrac{1}{\sqrt{2}}$
§ Project [−3, −1] onto direction u: $-3\cdot\tfrac{1}{\sqrt{2}} + (-1)\cdot\tfrac{1}{\sqrt{2}} = -\tfrac{4}{\sqrt{2}}$
(Figure: the same points projected onto the line through the origin in the direction $\mathbf{u} = \left[\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right]$; the point (2, 4) is highlighted.)


Projection for Dimensionality Reduction
This process projects the two-dimensional data points onto one-dimensional data points (dimensionality reduction).
Data Points in 2D → Projection onto 1D → Data Points in 1D:
$$\mathbf{X} = \begin{bmatrix} 2 & 4 \\ -1 & 2 \\ -3 & -1 \end{bmatrix}, \qquad \mathbf{Z} = \mathbf{X}\mathbf{u} = \begin{bmatrix} 2 & 4 \\ -1 & 2 \\ -3 & -1 \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{bmatrix} = \begin{bmatrix} \tfrac{6}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \\ -\tfrac{4}{\sqrt{2}} \end{bmatrix}$$

Reconstruction from the Projected Data
$$\mathbf{Z} = \begin{bmatrix} \tfrac{6}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \\ -\tfrac{4}{\sqrt{2}} \end{bmatrix}, \qquad \hat{\mathbf{X}} = \mathbf{Z}\mathbf{u}^{T} = \begin{bmatrix} \tfrac{6}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \\ -\tfrac{4}{\sqrt{2}} \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix} = \begin{bmatrix} 3 & 3 \\ \tfrac{1}{2} & \tfrac{1}{2} \\ -2 & -2 \end{bmatrix}$$
§ X can be approximately reconstructed as $\hat{\mathbf{X}} = \mathbf{Z}\mathbf{u}^{T} = \mathbf{X}\mathbf{u}\mathbf{u}^{T}$.
§ The original representation (e.g., X) cannot be perfectly recovered from the low-dimensional representation (e.g., the 1D projected representation Z).
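A small NumPy sketch reproducing the projection and (approximate) reconstruction above:

```python
import numpy as np

X = np.array([[ 2.0,  4.0],
              [-1.0,  2.0],
              [-3.0, -1.0]])
u = np.array([1.0, 1.0]) / np.sqrt(2)     # projection direction from the slides

Z = X @ u                                 # 1-D projections: [6/sqrt(2), 1/sqrt(2), -4/sqrt(2)]
X_hat = np.outer(Z, u)                    # reconstruction Z u^T: [[3, 3], [0.5, 0.5], [-2, -2]]

print(Z)
print(X_hat)
print(np.allclose(X, X_hat))              # False: the 2-D data is not perfectly recovered
```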

Linear Dimensionality Reduction
§ A projection matrix U = [u1 u2 … uk] of size d*k defines k linear projection directions.
§ Each column uk in U denotes a linear projection direction for d-dimensional data (assume k < d).
§ The projection matrix U can then be used to transform a high-dimensional sample X(i,:) (i.e., $\mathbf{x}_i^T$) into a low-dimensional sample Z(i,:) (i.e., $\mathbf{z}_i^T$) as shown below:
$$\underbrace{\mathbf{Z}(i,:)}_{1\times k} = \underbrace{\mathbf{X}(i,:)}_{1\times d}\,\underbrace{\mathbf{U}}_{d\times k}$$
Z(i,:) is the 1*k embedding of xi; zi is also low-dimensional.

Linear Dimensionality Reduction
§ X is an n*d matrix denoting the input data with n data points, where each data point is represented by d features.
§ Z is an n*k matrix denoting the low-dimensional embedding of the input data matrix X.

Linear Dimensionality Reduction
§ How do we learn the “best” projection matrix U?
§ What criteria should we optimize for learning U?
§ Principal Component Analysis (PCA) is an algorithm for doing this.
There are infinitely many ways to project the data X.

§ Revise: Eigenvalues and Eigenvectors
§ Why do we need dimensionality reduction?
§ Projection
§ Linear Dimensionality Reduction
– Principal Component Analysis (PCA)
§ Dimensionality Reduction – Nonlinear Methods
– Locally Linear Embedding

Principal Component Analysis
§ A classic linear dimensionality reduction method (Pearson, 1901; Hotelling, 1930)
§ PCA can be seen as
– Learning projection directions that capture the maximum variance in the data
– Learning projection directions that result in the smallest reconstruction error

PCA as Maximizing Variance

PCA as Maximizing Variance: A Simple Illustration
§ Consider this two-dimensional dataset
§ The coordinates of each data point x can be represented by [x1, x2]
§ Consider ignoring the feature x2 for each data sample
§ The coordinates of each data point x now become one-dimensional: [x1]
§ Are we losing much information by simply removing x2?
– No. Most of the data spread is along x1 (very little variance along x2).

PCA as Maximizing Variance: A Simple Illustration
§ Consider this two-dimensional data
§ The coordinates of each data point x can be represented by [x1, x2]
§ Consider ignoring the feature x2 for each data sample
§ The coordinates of each data point x now become one-dimensional: [x1]
§ Are we losing much information by simply removing x2?
– Yes. This data has substantial variance along both features.

PCA as Maximizing Variance: A Simple Illustration
§ Now consider projecting the data onto another two directions u1, u2
§ Each data sample x is then represented by 2 features [z1, z2]
§ Consider ignoring the feature z2 for each data sample
§ Each 2-dimensional data sample x now becomes one-dimensional: [z1]
§ Are we losing much information by simply removing z2?
– No. Most of the data spread is along z1 (very little variance along z2).

PCA as Maximizing Variance
§ Consider projecting xi (a d-dimensional feature vector) to a one-dimensional vector zi by u1 (a column vector):
$$\underbrace{\mathbf{Z}(i,:)}_{1\times 1} = \underbrace{\mathbf{X}(i,:)}_{1\times d}\,\underbrace{\mathbf{u}_1}_{d\times 1}$$
It can also be written as $z_i = \mathbf{x}_i^{T}\mathbf{u}_1$.

PCA as Maximizing Variance
§ Projecting xi (a d-dimensional feature vector) to a one-dimensional vector zi by u1: $z_i = \mathbf{x}_i^{T}\mathbf{u}_1$
– $\mathbf{x}_i^{T}\mathbf{u}_1$ denotes the location of the green point (the projection of xi) along the purple line representing u1.

PCA as Maximizing Variance
§ Projecting xi (a d-dimensional feature vector) to a one-dimensional vector zi by u1: $z_i = \mathbf{x}_i^{T}\mathbf{u}_1$
§ Therefore, the mean of the projections of all the data (i.e., the “center” of the green points) can be computed as
$$\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i^{T}\mathbf{u}_1 = \left(\frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i^{T}\right)\mathbf{u}_1 = \bar{\mathbf{x}}^{T}\mathbf{u}_1$$
§ Variance of the projected data (i.e., the “spread” of the green points):
$$\frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{x}_i^{T}\mathbf{u}_1 - \bar{\mathbf{x}}^{T}\mathbf{u}_1\right)^2 = \frac{1}{n}\sum_{i=1}^{n}\left((\mathbf{x}_i - \bar{\mathbf{x}})^{T}\mathbf{u}_1\right)^2$$
where $\bar{\mathbf{x}}$ is the mean feature vector, $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$.
For simplicity, we use the “population” variance, where the denominator is n.
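As a quick numerical check of the identities above (mean and variance of the projected data), a small NumPy sketch on randomly generated, purely illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # n = 100 samples, d = 3 features (illustrative data)

u1 = np.array([1.0, 2.0, 3.0])
u1 = u1 / np.linalg.norm(u1)              # unit-length projection direction

z = X @ u1                                # projections z_i = x_i^T u1
x_bar = X.mean(axis=0)                    # mean feature vector

print(z.mean(), x_bar @ u1)               # mean of the projections equals x_bar^T u1
print(np.mean((z - z.mean())**2),
      np.mean(((X - x_bar) @ u1)**2))     # the two variance expressions agree
```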
PCA as Maximizing Variance
§ Variance of the projected data:
$$\frac{1}{n}\sum_{i=1}^{n}\left((\mathbf{x}_i - \bar{\mathbf{x}})^{T}\mathbf{u}_1\right)^2 = \mathbf{u}_1^{T}\left(\frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{T}\right)\mathbf{u}_1$$
§ Let $\mathbf{S} = \frac{1}{n}\sum_{i=1}^{n}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{T}$; then the variance of the projected data is $\mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1$.
§ S is the d*d data covariance matrix. If the data is already centered (i.e., $\bar{\mathbf{x}} = \mathbf{0}$), then $\mathbf{S} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i\mathbf{x}_i^{T} = \frac{1}{n}\mathbf{X}^{T}\mathbf{X}$.

Direction of Maximum Variance
§ Variance of the projected data: $\mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1$
§ Objective: we want the $\mathbf{u}_1$ for which the variance of the projected data is maximized: $\max_{\mathbf{u}_1}\; \mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1$
§ To prevent the trivial solution (maximum variance = infinity), assume $\|\mathbf{u}_1\|^2 = \mathbf{u}_1^{T}\mathbf{u}_1 = 1$.
§ Therefore, $\mathbf{u}_1$ can be obtained by solving the following optimization problem:
$$\max_{\mathbf{u}_1}\; \mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1 + \lambda_1(1 - \mathbf{u}_1^{T}\mathbf{u}_1)$$
where $\lambda_1$ is a Lagrange multiplier.

Direction of Maximum Variance
§ The objective: $\max_{\mathbf{u}_1}\; \mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1 + \lambda_1(1 - \mathbf{u}_1^{T}\mathbf{u}_1)$
§ Obtain the optimal solution by taking the derivative with respect to $\mathbf{u}_1$ and setting it to zero (using $\tfrac{\partial\, \mathbf{x}^{T}\mathbf{A}\mathbf{x}}{\partial \mathbf{x}} = 2\mathbf{A}\mathbf{x}$ for symmetric A):
$$\mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1$$
§ Thus $\mathbf{u}_1$ is an eigenvector of S (with corresponding eigenvalue $\lambda_1$).
§ S is a d*d matrix, so there are d possible eigenvectors. Which one should we take?

Direction of Maximum Variance
§ Note that with the constraint $\mathbf{u}_1^{T}\mathbf{u}_1 = 1$, the variance of the projected data is
$$\mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1 = \mathbf{u}_1^{T}\lambda_1\mathbf{u}_1 = \lambda_1\mathbf{u}_1^{T}\mathbf{u}_1 = \lambda_1$$
§ Therefore, the variance is maximized when $\mathbf{u}_1$ is the (top) eigenvector with the largest eigenvalue.
§ The (top) eigenvector $\mathbf{u}_1$ can be used to obtain the first Principal Component (PC): $\tilde{\mathbf{X}}\mathbf{u}_1$, where $\tilde{\mathbf{X}}$ is the centered data matrix.
§ Other directions can also be found similarly (with each being orthogonal to all previous ones) using the eigendecomposition of S (this is PCA).

Steps of Principal Component Analysis
§ Center the data (subtract the mean $\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i$ from each data point) to get the centered data matrix $\tilde{\mathbf{X}}$.
§ Compute the covariance matrix S using the centered data: $\mathbf{S} = \frac{1}{n}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}$
§ Do an eigendecomposition of the covariance matrix S.
§ Take the first k leading eigenvectors $\{\mathbf{u}_1, \dots, \mathbf{u}_k\}$ with the k largest eigenvalues $\{\lambda_1, \dots, \lambda_k\}$.
§ The final k-dimensional representation of the data is obtained by
$$\underbrace{\mathbf{Z}}_{n\times k} = \underbrace{\tilde{\mathbf{X}}}_{n\times d}\,\underbrace{\mathbf{U}}_{d\times k}$$

PCA as Minimizing the Reconstruction Error

Reconstruction of Data from Projections
§ Data reconstruction: $\hat{\mathbf{X}} = \mathbf{Z}\mathbf{u}^{T} = \mathbf{X}\mathbf{u}\mathbf{u}^{T}$
Data Points in 2D, Projection onto 1D, and Reconstructed Data in 2D:
$$\mathbf{X} = \begin{bmatrix} 2 & 4 \\ -1 & 2 \\ -3 & -1 \end{bmatrix}, \qquad \mathbf{Z} = \mathbf{X}\mathbf{u} = \begin{bmatrix} 2 & 4 \\ -1 & 2 \\ -3 & -1 \end{bmatrix}\begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{bmatrix} = \begin{bmatrix} \tfrac{6}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \\ -\tfrac{4}{\sqrt{2}} \end{bmatrix}, \qquad \hat{\mathbf{X}} = \mathbf{Z}\mathbf{u}^{T} = \begin{bmatrix} 3 & 3 \\ \tfrac{1}{2} & \tfrac{1}{2} \\ -2 & -2 \end{bmatrix}$$

Reconstruction of Data from Projections
§ Assume complete orthonormal basis vectors u1, u2 for the two-dimensional data, $\mathbf{U} = \begin{bmatrix} \tfrac{1}{\sqrt{2}} & -\tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & \tfrac{1}{\sqrt{2}} \end{bmatrix}$.
Data Points in 2D, Projection onto 2D, and Reconstructed Data in 2D:
$$\mathbf{X} = \begin{bmatrix} 2 & 4 \\ -1 & 2 \\ -3 & -1 \end{bmatrix}, \qquad \mathbf{Z} = \mathbf{X}\mathbf{U} = \begin{bmatrix} \tfrac{6}{\sqrt{2}} & \tfrac{2}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} & \tfrac{3}{\sqrt{2}} \\ -\tfrac{4}{\sqrt{2}} & \tfrac{2}{\sqrt{2}} \end{bmatrix}, \qquad \hat{\mathbf{X}} = \mathbf{Z}\mathbf{U}^{T} = \begin{bmatrix} 2 & 4 \\ -1 & 2 \\ -3 & -1 \end{bmatrix} = \mathbf{X}$$

Reconstruction of Data from Projections
§ Assume complete orthonormal basis vectors u1, u2, ..., ud, with U = [u1 u2 ... ud] and $\mathbf{U}^{T}\mathbf{U} = \mathbf{I}_d$ ($\mathbf{I}_d$ denotes the identity matrix of size d*d).
§ Therefore, the reconstruction of X from Z is exact if we use all d basis vectors u1, u2, ..., ud (as shown on the previous slide):
$$\hat{\mathbf{X}} = \mathbf{Z}\mathbf{U}^{T} = \mathbf{X}\mathbf{U}\mathbf{U}^{T} = \mathbf{X}(\mathbf{U}^{T}\mathbf{U})^{T} = \mathbf{X}$$

Reconstruction of Data from Projections
§ The reconstruction will be approximate if we only use k < d basis vectors: $\hat{\mathbf{X}} = \mathbf{Z}\mathbf{U}^{T} = \mathbf{X}\mathbf{U}\mathbf{U}^{T}$, where U is a d*k matrix.
§ Let us use k = 1 basis vector; then the one-dimensional embedding of X is $\mathbf{Z} = \mathbf{X}\mathbf{u}_1$.
§ We can now try to “reconstruct” X (an n*d matrix) from the one-dimensional embedding Z (an n*1 matrix): $\hat{\mathbf{X}} = \mathbf{Z}\mathbf{u}_1^{T} = \mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}$
§ The total error or “loss” in reconstructing X is $l(\mathbf{u}_1) = \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2$.
– The Frobenius matrix norm is defined as $\|\mathbf{A}\|_F = \sqrt{\textstyle\sum_i\sum_j a_{ij}^2} = \sqrt{\mathrm{trace}(\mathbf{A}^{T}\mathbf{A})} = \sqrt{\mathrm{trace}(\mathbf{A}\mathbf{A}^{T})}$, where the trace is the sum of the elements on the main diagonal.
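Putting the steps of PCA above together, a minimal NumPy sketch (eigendecomposition of the population covariance matrix) along with the Frobenius-norm reconstruction error; the randomly generated data and the choice k = 2 are purely illustrative:

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the (population) covariance matrix."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                              # step 1: center the data
    S = Xc.T @ Xc / X.shape[0]                  # step 2: d*d covariance matrix (denominator n)
    eigvals, eigvecs = np.linalg.eigh(S)        # step 3: eigendecomposition (S is symmetric)
    order = np.argsort(eigvals)[::-1]           # sort eigenvalues in decreasing order
    U = eigvecs[:, order[:k]]                   # step 4: k leading eigenvectors (d*k)
    Z = Xc @ U                                  # step 5: n*k low-dimensional representation
    return Z, U, x_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))   # illustrative correlated data

Z, U, x_bar = pca(X, k=2)
X_hat = Z @ U.T + x_bar                          # reconstruction from k components
print(np.linalg.norm(X - X_hat, 'fro')**2)       # Frobenius-norm reconstruction error
```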
Direction of Minimum Reconstruction Error
§ We want to find the $\mathbf{u}_1$ that minimizes the reconstruction error:
$$
\begin{aligned}
l(\mathbf{u}_1) &= \|\mathbf{X} - \hat{\mathbf{X}}\|_F^2 = \|\mathbf{X} - \mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}\|_F^2 \\
&= \mathrm{trace}\!\left[(\mathbf{X} - \mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T})(\mathbf{X}^{T} - \mathbf{u}_1\mathbf{u}_1^{T}\mathbf{X}^{T})\right] \\
&= \mathrm{trace}\!\left[\mathbf{X}\mathbf{X}^{T} - \mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}\mathbf{X}^{T} - \mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}\mathbf{X}^{T} + \mathbf{X}\mathbf{u}_1(\mathbf{u}_1^{T}\mathbf{u}_1)\mathbf{u}_1^{T}\mathbf{X}^{T}\right] \\
&= \mathrm{trace}\!\left[\mathbf{X}\mathbf{X}^{T} - \mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}\mathbf{X}^{T}\right] \qquad (\text{using } \mathbf{u}_1^{T}\mathbf{u}_1 = 1) \\
&= \mathrm{trace}(\mathbf{X}\mathbf{X}^{T}) - \mathrm{trace}(\mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}\mathbf{X}^{T})
\end{aligned}
$$
where $\mathrm{trace}(\mathbf{X}\mathbf{X}^{T})$ is a constant (it does not depend on $\mathbf{u}_1$).
§ Therefore, minimizing the reconstruction error is equivalent to maximizing $\mathrm{trace}(\mathbf{X}\mathbf{u}_1\mathbf{u}_1^{T}\mathbf{X}^{T}) = \mathbf{u}_1^{T}\mathbf{X}^{T}\mathbf{X}\mathbf{u}_1$ (using $\mathrm{trace}(\mathbf{b}\mathbf{a}^{T}) = \mathbf{a}^{T}\mathbf{b}$).
§ Assuming the data is centered, we obtain the same objective that we had when maximizing the variance: $\mathbf{u}_1^{T}\mathbf{S}\mathbf{u}_1$, where $\mathbf{S} = \mathbf{X}^{T}\mathbf{X}$ is (up to the constant factor 1/n) the covariance matrix.

PCA example
$$\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 2 & 2 \\ 3 & 3 \end{bmatrix}, \qquad \bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{x}_i = [2, 2]$$
Center the data:
$$\tilde{\mathbf{X}} = \begin{bmatrix} -1 & -1 \\ 0 & 0 \\ 1 & 1 \end{bmatrix}, \qquad \mathbf{S} = \frac{1}{n}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \begin{bmatrix} 2/3 & 2/3 \\ 2/3 & 2/3 \end{bmatrix}$$
Figure out the eigenvalues (eigendecomposition of S):
$$\mathbf{S}\mathbf{u} = \lambda\mathbf{u} \;\Rightarrow\; (\mathbf{S} - \lambda\mathbf{I})\mathbf{u} = \mathbf{0} \;\Rightarrow\; \det(\mathbf{S} - \lambda\mathbf{I}) = 0$$
$$\det\begin{bmatrix} \tfrac{2}{3}-\lambda & \tfrac{2}{3} \\ \tfrac{2}{3} & \tfrac{2}{3}-\lambda \end{bmatrix} = 0 \;\Rightarrow\; \left(\tfrac{2}{3}-\lambda\right)\left(\tfrac{2}{3}-\lambda\right) - \tfrac{2}{3}\cdot\tfrac{2}{3} = 0 \;\Rightarrow\; \lambda_1 = \tfrac{4}{3},\; \lambda_2 = 0$$
Figure out the eigenvectors (for $\lambda_1 = \tfrac{4}{3}$):
$$\begin{bmatrix} \tfrac{2}{3} & \tfrac{2}{3} \\ \tfrac{2}{3} & \tfrac{2}{3} \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix} = \tfrac{4}{3}\begin{bmatrix} a \\ b \end{bmatrix} \;\Rightarrow\; \tfrac{2}{3}a + \tfrac{2}{3}b = \tfrac{4}{3}a \;\Rightarrow\; b = a \;\Rightarrow\; \mathbf{u}_1 = \begin{bmatrix} \tfrac{1}{\sqrt{2}} \\ \tfrac{1}{\sqrt{2}} \end{bmatrix} \;(\text{the length is } 1)$$
Project the data:
$$\mathbf{Z} = \tilde{\mathbf{X}}\mathbf{u}_1 = \begin{bmatrix} -\tfrac{2}{\sqrt{2}} \\ 0 \\ \tfrac{2}{\sqrt{2}} \end{bmatrix} = \begin{bmatrix} -\sqrt{2} \\ 0 \\ \sqrt{2} \end{bmatrix}$$

PCA example in Python
The output of sklearn is different from our calculation on the previous slide. This is because the PCA implementation in sklearn uses the sample covariance
$$\mathbf{S} = \frac{1}{n-1}\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}$$

How many Principal Components to Use?
§ Eigenvalue $\lambda_i$ measures the variance captured by the corresponding projection direction $\mathbf{u}_i$:
$$\mathbf{u}_i^{T}\mathbf{S}\mathbf{u}_i = \mathbf{u}_i^{T}\lambda_i\mathbf{u}_i = \lambda_i\mathbf{u}_i^{T}\mathbf{u}_i = \lambda_i$$
§ The “left-over” variance after keeping the first k directions is therefore $\sum_{i=k+1}^{d}\lambda_i$.
§ We can choose k by looking at what fraction of the variance is captured by the first k projection directions u1, u2, ..., uk.
§ Another direct way is to look at the spectrum of the eigenvalue plot, or at the plot of the reconstruction error vs. k.

§ Revise: Eigenvalues and Eigenvectors
§ Why do we need dimensionality reduction?
§ Projection
§ Linear Dimensionality Reduction
– Principal Component Analysis (PCA)
§ Dimensionality Reduction – Nonlinear Methods
– Locally Linear Embedding

Nonlinear Dimensionality Reduction
§ Given: a low-dimensional surface embedded nonlinearly in a high-dimensional space
– Such a structure is called a manifold
§ Goal: recover the low-dimensional surface

Nonlinear Manifold
§ PCA measures the Euclidean distance.
§ What is important is the distance along the nonlinear manifold (i.e., the geodesic distance)
– The distance along the surface

Linear Projection may not be good enough
§ Consider the swiss-roll dataset
§ Linear projection methods (e.g., PCA) cannot capture intrinsic nonlinearities

Nonlinear Dimensionality Reduction
§ We want to do nonlinear projections
§ Different criteria could be used for such projections
§ Most nonlinear methods try to preserve the neighborhood information
– Locally linear structure
– Pairwise distances along the nonlinear manifold
§ This roughly translates to “unrolling” the manifold

Nonlinear Projection

Locally Linear Embedding
§ Based on a simple geometric intuition
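To make the contrast between linear and nonlinear projection concrete, a hedged sketch comparing PCA with scikit-learn's LocallyLinearEmbedding on the swiss-roll dataset mentioned above; the parameter values (e.g., n_neighbors=12) are illustrative choices, not taken from the lecture:

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1000, random_state=0)   # 3-D swiss-roll data

# A linear projection (PCA) cannot "unroll" the manifold ...
Z_pca = PCA(n_components=2).fit_transform(X)

# ... while LLE preserves local neighborhoods and produces a 2-D embedding.
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
Z_lle = lle.fit_transform(X)

print(Z_pca.shape, Z_lle.shape)    # (1000, 2) (1000, 2)
```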