Machine Learning
Lecture 10: Dimensionality Reduction & Matrix Factorization
Prof. Dr. Günnemann
Data Analytics and Machine Learning Group Technical University of Munich
Winter term 2021/2022
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Matrix Factorization
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Introduction: Unsupervised Learning (I)
• Supervised learning aims to map inputs to targets with y = f(x), or, in a probabilistic framework, it models p(y | x)
• Unsupervised learning can be seen as modelling p(x)
• We are trying to find the (hidden / latent) structure in the data
– e.g. find a latent distribution p(z) and a generative transformation p(x | z); we can then obtain p(x) = ∫ p(x | z) p(z) dz
– the latent z is usually unknown and has to be estimated
• Examples:
– Clustering: the cluster label is the latent state
– Anomaly detection: treat instances with low p(x) as anomalies
Introduction: Unsupervised Learning (II)
• Unsupervised learning can be viewed as compression
– compress a data point to a single label corresponding to its cluster
– compress a data point from a higher dim. to a lower dim. latent space
• Unsupervised learning can be used …
– … as a stand-alone method (e.g. to understand your data, visualization)
– … as a pre-processing step (e.g. use cluster label as feature for subsequent classification task; obtain small number of relevant features)
– … to leverage large amounts of unlabeled data for pretraining
• This lecture: Dimensionality Reduction & Matrix Factorization
Dimensionality Reduction: Motivation
• Often data has very many features, i.e. high-dimensional data
• High-dimensional data is challenging:
– Similarity search/computation is expensive because of high complexity of distance functions
– Highly correlated dimensions can cause trouble for some algorithms
– Curse of dimensionality: we need exponentially more data to characterize the density as the dimensionality goes up
– It is hard to visualize high-dimensional data
• Often the data lies on a low-dim. manifold, embedded in a high-dim. space
• Goal: Try to reduce the dimensionality while losing as little information as possible
• Benefits:
– Computational or memory savings
– Uncover the intrinsic dimensionality of the data
– (more benefits later….)
Feature (Sub-)Selection
• Choose “good” dimensions using a-priori knowledge or appropriate heuristics
– e.g. remove low-variance dimensions
– Depending on the application only a few dimensions might be of interest
– Example: shoe size interesting for shoe purchases, not so for car purchases
• Advantages:
– No need for an intensive preprocessing or training phase to determine relevant dimensions
• Disadvantages:
– Expert knowledge required; misjudgment possible
– Univariate feature selection ignores correlations
Beyond Feature (Sub-)Selection
• Can we do
– better?
– automatically?
• Obviously: simply discarding whole features is not a good idea
– Features are often correlated
transformation of coordinate axes
After the transformation, the low-variance dimension is barely necessary to describe the data in the new coordinate system → it can be ignored
Dim. Reduction via Linear Transformations
• Represent data in a different coordinate system via linear transformations – change of basis (orthogonal basis transformations)
+ potentially discarding dimensions
• Technical:
– use an orthonormal transformation matrix F ∈ ℝ^(d×k)
– (x′)ᵀ = xᵀ · F is the transformation of the (column) vector x into the new coordinate system defined by F
– X′ = X · F is the matrix containing all the transformed points x′_i
(first center to the mean, then transform with F)
Discussion: Linear Transformations
• Feature selection is a linear transformation – What is the matrix 𝑭?
• Let x̄ be the mean vector (here: a row vector) in the original data space; the mean vector in the transformed space is given by x̄′ = x̄ · F
• Let Σ_X be the covariance matrix in the original data space; the covariance matrix in the transformed space is then Σ_X′ = Fᵀ · Σ_X · F
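The following is a minimal numpy sketch (not part of the original slides) that checks both identities on made-up data; F is an orthonormal matrix obtained from a QR decomposition of a random matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # N = 100 points, d = 3 (toy data)
F, _ = np.linalg.qr(rng.normal(size=(3, 2)))   # orthonormal F in R^(3x2): F.T @ F = I

X_new = X @ F                                  # transformed points X' = X . F

# mean transforms as x̄' = x̄ . F
x_bar = X.mean(axis=0)
assert np.allclose(X_new.mean(axis=0), x_bar @ F)

# covariance transforms as Σ_X' = F^T . Σ_X . F   (1/N convention as in the slides)
Sigma_X = np.cov(X, rowvar=False, bias=True)
Sigma_Xnew = np.cov(X_new, rowvar=False, bias=True)
assert np.allclose(Sigma_Xnew, F.T @ Sigma_X @ F)
```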
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
4. Matrix Factorization
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Principal Component Analysis: Motivation
• Question: Which transformation matrix 𝑭 to use?
– Is there an optimal orthogonal transformation (depending on the data)?
– Optimality: approximate the data as well as possible with few coefficients
transformation of coordinate axes
• Approach: Principal Component Analysis (PCA)
– Find a coordinate system in which the (possibly originally correlated) points are linearly uncorrelated
– The dimensions with no or low variance can then be ignored
Determine the Principal Components
– Transform the data such that the covariance between the new dimensions is 0
– The transformed data points are not linearly correlated any more
• Given: N d-dimensional data points {x_i}, x_i ∈ ℝ^d ∀ i ∈ {1, …, N}
• We represent this set of points by a matrix X ∈ ℝ^(N×d):
X = ( x_11 ⋯ x_1d
      ⋮    ⋱   ⋮
      x_N1 ⋯ x_Nd )
– The row x_i = (x_i1, …, x_id) ∈ ℝ^d denotes the i-th point and the column X_:,j denotes the vector containing all values from the j-th dimension
Determine the Principal Components
– Transform the data such that the covariance between the new dimensions is 0
– The transformed data points are not linearly correlated any more
• General approach
1. Center the data
2. Compute the covariance matrix
3. Use the eigendecomposition V · W · Vᵀ of the covariance matrix to transform the coordinate system
V: matrix of eigenvectors
W: diagonal matrix of eigenvalues
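A minimal numpy sketch of these three steps (illustrative only; the toy data is made up and the covariance uses the 1/N convention of the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])  # correlated toy data

# 1. center the data
X_c = X - X.mean(axis=0)

# 2. compute the covariance matrix (d x d)
Sigma = (X_c.T @ X_c) / X.shape[0]

# 3. eigendecomposition of the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)       # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]              # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = X_c @ eigvecs                              # data in the new coordinate system
# covariance of Y is (approximately) diagonal, with the eigenvalues on the diagonal
print(np.round((Y.T @ Y) / X.shape[0], 6))
print(np.round(eigvals, 6))
```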
Determine the Principal Components
• Given: X ∈ ℝ^(N×d) as defined before
• Shift the points by their mean x̄ ∈ ℝ^d (centered data): x̃_i = x_i − x̄
Statistics:
Zero order statistic: the number of points N
First order statistic: the mean of the N points, the vector x̄ ∈ ℝ^d:
x̄ = (1/N) · Xᵀ · 1_N
where 1_N is an N-dimensional vector of ones
Determine the Principal Components
• Determine the variance Var(X̃_j) for each dimension j ∈ {1, …, d}
• Determine the covariance Cov(X̃_j1, X̃_j2) between dimensions j1 and j2, ∀ j1 ≠ j2 ∈ {1, …, d}
Ø This leads to the covariance matrix Σ_X̃ ∈ ℝ^(d×d)
Statistics:
Second order statistic: variance and covariance
The variance within the j-th dimension of X is:
Var(X_j) = (1/N) Σ_i (x_ij − x̄_j)² = (1/N) · X_:,jᵀ X_:,j − x̄_j²
The covariance between dimensions j1 and j2 is:
Cov(X_j1, X_j2) = (1/N) Σ_i (x_i,j1 − x̄_j1) · (x_i,j2 − x̄_j2) = (1/N) · X_:,j1ᵀ X_:,j2 − x̄_j1 · x̄_j2
Determine the Principal Components
Statistics (continued):
For the set of points contained in X, the corresponding covariance matrix is defined as:
Σ_X = ( Var(X_1)       Cov(X_1, X_2)  …  Cov(X_1, X_d)
        Cov(X_2, X_1)  Var(X_2)            ⋮
        ⋮                              ⋱   ⋮
        Cov(X_d, X_1)       ⋯             Var(X_d) )
     = (1/N) · Xᵀ X − x̄ᵀ x̄   (with x̄ a row vector)
• Remark: Covariance matrices are symmetric
Determine the Principal Components
• Goal of PCA: transformation of the coordinate system such that the covariances between the new axes are 0
• Approach:
– Diagonalization by changing the basis (= adapt the coordinate system)
– According to the spectral theorem, the eigenvectors of a symmetric matrix
form an orthogonal basis
Ø Eigendecomposition of the covariance matrix: Σ_X̃ = Γ · Λ · Γᵀ
Eigendecomposition (spectral decomposition) is the factorization of a symmetric matrix A ∈ ℝ^(d×d): A = Γ · Λ · Γᵀ
→ matrices Γ, Λ ∈ ℝ^(d×d), with the columns of Γ being the normalized eigenvectors γ_i
→ Γ is an orthonormal matrix: Γ · Γᵀ = Γᵀ · Γ = Id   (Γᵀ = Γ⁻¹)
→ Λ is a diagonal matrix with the eigenvalues λ_i as diagonal elements
Cov′ = ( Var(1)′    0      ⋯    0
          0       Var(2)′  ⋯    0
          ⋮          ⋮     ⋱    ⋮
          0          0     ⋯  Var(d)′ )
Determine the Principal Components
Ø Eigendecomposition of the covariance matrix: Σ_X̃ = Γ · Λ · Γᵀ
• The new coordinate system is defined by the eigenvectors γ_i:
– Transformed data: Y = X̃ · Γ
– Λ is the covariance matrix in this new coordinate system
Ø The new system has variance λ_i in dimension i
Ø ∀ i1 ≠ i2: Cov(Y_i1, Y_i2) = 0
Dimensionality Reduction with PCA
• Approach
– The coordinates with low variance (hence low λ_i) can be ignored
– W.l.o.g. let us assume λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_d
Ø Truncation of Γ
– Keep only the columns (i.e. eigenvectors) of Γ corresponding to the k largest eigenvalues λ_1, …, λ_k
– Y_reduced = X̃ · Γ_truncated
• How to pick 𝑘?
– Frequently used: 90% rule; the k retained variances should explain 90% of the energy
– k = smallest value ensuring Σ_{i=1}^{k} λ_i ≥ 0.9 · Σ_{i=1}^{d} λ_i
Ø The modified points (transformed and truncated) contain most of the information of the original points and are low dimensional
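A small illustrative sketch of the 90% rule (the eigenvalues below are made up):

```python
import numpy as np

def choose_k(eigvals, threshold=0.9):
    """Smallest k such that the top-k eigenvalues explain `threshold` of the total variance."""
    lam = np.sort(eigvals)[::-1]                  # λ1 >= λ2 >= ... >= λd
    explained = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(explained, threshold) + 1)

# toy eigenvalues (variances): cumulative shares are 0.52, 0.82, 0.93, ... -> k = 3
lam = np.array([5.0, 3.0, 1.0, 0.5, 0.2])
print(choose_k(lam))
# reduced data would then be: Y_reduced = X_c @ Gamma[:, :k]  (Gamma from the eigendecomposition)
```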
Complexity
• Complexity of PCA:
O(N·d²) + O(d³) + O(N·d·k) = O(N·d² + d³)
(compute the covariance matrix)  (eigenvalue decomposition)  (project the data onto the k-dimensional space)
• Remarks on eigenvalue decomposition:
– Usually we are interested in the reduced data only
Ø Only the k largest eigenvectors are required (i.e. not all of them)
– Use iterative approaches (next slide) for finding eigenvectors
– Complexity: O(#it · d²)   // #it = number of iterations
– For sparse data even faster: O(#it · #nz)   // #nz = number of nonzero elements in the matrix
How to Compute Eigenvectors?
• Eigenvalues are important for many machine learning/data mining tasks
– PCA, Ranking of Websites, Community Detection, … // see our other lecture!
– How to compute them efficiently?
• Power iteration (a.k.a. von Mises iteration)
– Iterative approach to compute a single eigenvector
• Let A be a matrix and v an arbitrary (normalized) vector
– Iteratively compute v ← A · v until convergence
– in each step, v is simply multiplied with A and then normalized
– v converges to the eigenvector of A whose eigenvalue has the largest absolute value
– Highly efficient for sparse data
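A minimal sketch of power iteration in numpy (the example matrix is made up; for simplicity the loop runs a fixed number of iterations instead of an explicit convergence test):

```python
import numpy as np

def power_iteration(A, num_iter=200):
    """Dominant eigenpair of A: the eigenvalue largest in absolute value and its eigenvector."""
    rng = np.random.default_rng(0)
    v = rng.normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iter):
        v = A @ v                    # multiply with A ...
        v /= np.linalg.norm(v)       # ... and normalize
    lam = v @ A @ v                  # Rayleigh quotient gives the eigenvalue
    return lam, v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
lam, v = power_iteration(A)
print(lam)                           # ~3.618, the largest eigenvalue of A
print(np.allclose(A @ v, lam * v))   # True: v is the corresponding eigenvector
```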
How to Compute Eigenvectors?
• Convergence:
– Linear convergence with rate λ_2 / λ_1
– Fast convergence if the first and second eigenvalue are dissimilar
• How to find multiple (the k largest) eigenvectors?
– Let us focus on symmetric matrices 𝑨
– The eigenvalue decomposition leads to: A = Γ · Λ · Γᵀ = Σ_{i=1}^{d} λ_i · γ_i · γ_iᵀ
– Define the deflated matrix: Ã = A − λ_1 · γ_1 · γ_1ᵀ
– Ã has the same eigenvectors as A, but the eigenvalue corresponding to the first one has become zero
Ø Apply power iteration on 𝑨W to find the second largest eigenvector of 𝑨
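A minimal sketch combining power iteration and deflation to obtain the top-k eigenpairs; it assumes a symmetric positive semi-definite matrix (such as a covariance matrix), so the largest eigenvalues are also largest in absolute value. The example matrix is made up, and power iteration is redefined here so the snippet is self-contained:

```python
import numpy as np

def power_iteration(A, num_iter=500):
    v = np.random.default_rng(0).normal(size=A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iter):
        v = A @ v
        v /= np.linalg.norm(v)
    return v @ A @ v, v              # (eigenvalue via Rayleigh quotient, eigenvector)

def top_k_eigen(A, k):
    """Top-k eigenpairs of a symmetric PSD matrix via power iteration + deflation."""
    A_defl = A.copy()
    lams, vecs = [], []
    for _ in range(k):
        lam, gamma = power_iteration(A_defl)
        lams.append(lam)
        vecs.append(gamma)
        # deflate: remove the component that was just found
        A_defl = A_defl - lam * np.outer(gamma, gamma)
    return np.array(lams), np.column_stack(vecs)

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lams, Gamma = top_k_eigen(A, k=2)
print(np.round(lams, 4))
print(np.round(np.linalg.eigvalsh(A)[::-1][:2], 4))   # reference values for comparison
```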
Alternative views of PCA
Images adapted from Alex H. Williams
PCA vs. Regression
Image adapted from …
PCA: Summary
• PCA finds the optimal transformation by deriving uncorrelated dimensions
– Exploits the eigendecomposition of the covariance matrix
• Dimensionality reduction
– After transformation simply remove dimensions with lowest variance
(or use only the 𝑘 largest eigenvectors for transformation)
• Limitations
– Only captures linear relationships (one solution: Kernel PCA)
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
– Idea: Low Rank Approximation
– SVD & Latent Factors
– Dimensionality Reduction
4. Matrix Factorization
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Low-Dimensional Manifold
• Data often lies on a low-dimensional manifold embedded in higher dimensional space
• How can we measure the dimensionality of this manifold?
– put differently: how do we measure the intrinsic dimensionality of the data?
• How can we find this manifold?
Rank of a Matrix
• Q: What is the rank of a matrix A?
• A: Number of linearly independent columns/rows of A
• Example:
– Matrix A = [1 2 1; -2 -3 1; 3 5 0] has rank r = 2
– Why? The first two rows are linearly independent, so the rank is at least 2, but all three rows are linearly dependent (the first is equal to the sum of the second and third), so the rank must be less than 3.
• Why do we care about low rank?
– We can write A using two "basis" vectors: [1 2 1] and [-2 -3 1]
– And the new coordinates of the rows are: [1 0], [0 1], [1 -1]
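A small numpy check of this example (illustrative only; the matrix values match the low-rank approximation slide further below):

```python
import numpy as np

A = np.array([[ 1.0,  2.0, 1.0],
              [-2.0, -3.0, 1.0],
              [ 3.0,  5.0, 0.0]])
print(np.linalg.matrix_rank(A))          # 2

# the same matrix as "coordinates x basis": A = C @ B
B = np.array([[ 1.0,  2.0, 1.0],         # basis vector 1
              [-2.0, -3.0, 1.0]])        # basis vector 2
C = np.array([[1.0,  0.0],               # row 1 = 1*b1 + 0*b2
              [0.0,  1.0],               # row 2 = 0*b1 + 1*b2
              [1.0, -1.0]])              # row 3 = 1*b1 - 1*b2
print(np.allclose(A, C @ B))             # True
```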
Rank is “Dimensionality”
• Cloud of points in 3D space:
Think of the point positions as a matrix, with 1 row per point (here the points A, B, C)
• We can rewrite the coordinates more efficiently!
– Old basis vectors: [1 0 0], [0 1 0], [0 0 1]
– New basis vectors: [1 2 1], [-2 -3 1]
– Then A has new coordinates [1 0], B: [0 1], C: [1 -1]
– Notice: we reduced the number of coordinates!
Low Rank Approximation
• Idea: approximate original data A by a low rank matrix B
A = [1.01 2.05 0.9; -2.1 -3.05 1.1; 2.99 5.01 0.3] ≈ [1 2 1; -2 -3 1; 3 5 0] = B
rank(𝑨) = 3
we need 3 coordinates to describe each point
rank(𝑩) = 2
we need only 2 coordinates per point
Low Rank Approximation
• Idea: approximate the original data A by a low-rank matrix B
A = [1.01 2.05 0.9; -2.1 -3.05 1.1; 2.99 5.01 0.3] ≈ [1 2 1; -2 -3 1; 3 5 0] = B
• Important: even though both A and B are in ℝ^(3×3), we need only two coordinates per point to describe B
– rank(A) = 3 vs. rank(B) = 2 (3 vs. 2 coordinates per point)
• Goal: Find the best low rank approximation
– best = minimize the sum of (squared) reconstruction errors, i.e. the Frobenius norm of the difference
– Given a matrix A ∈ ℝ^(m×n), find B ∈ ℝ^(m×n) with rank(B) = k that minimizes
‖A − B‖_F = √( Σ_i Σ_j (a_ij − b_ij)² )
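A small numpy sketch computing the reconstruction error for the example matrices above (illustrative only):

```python
import numpy as np

A = np.array([[ 1.01,  2.05, 0.9],
              [-2.1,  -3.05, 1.1],
              [ 2.99,  5.01, 0.3]])
B = np.array([[ 1.0,  2.0, 1.0],
              [-2.0, -3.0, 1.0],
              [ 3.0,  5.0, 0.0]])

# Frobenius norm of the reconstruction error ||A - B||_F
err = np.linalg.norm(A - B, ord='fro')
print(err)                                                  # small: B is a good rank-2 approximation
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B))   # 3 vs. 2
```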
• Chapter: Dimensionality Reduction & Matrix Factorization
1. Introduction
2. Principal Component Analysis (PCA)
3. Singular Value Decomposition (SVD)
– Idea: Low Rank Approximation
– SVD & Latent Factors
– Dimensionality Reduction
4. Matrix Factorization
5. Neighbor Graph Methods
6. Autoencoders (Non-linear Dimensionality Reduction)
Singular Value Decomposition (SVD): Definition
• Each real matrix A ∈ ℝ^(m×n) can be decomposed into A = U · Σ · Vᵀ (note: exact representation, no approximation), where
– U, V: column orthonormal
– i.e. UᵀU = I; VᵀV = I (I: identity matrix)
– the columns of U are called the left singular vectors, those of V the right singular vectors
– Σ: diagonal
– r×r diagonal matrix (r: rank of the matrix A)
– the entries (called singular values) are positive and sorted in decreasing order (σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0)
• Remark: The decomposition is (almost) unique
– see e.g. multiplication by -1
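A quick numpy check of this definition on a random full-rank matrix (illustrative, not part of the lecture); np.linalg.svd with full_matrices=False returns the reduced factors:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                       # toy matrix, full column rank r = 4

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # U: 6x4, s: 4, Vt: 4x4

assert np.allclose(U.T @ U, np.eye(4))            # U column orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(4))          # V column orthonormal
assert np.all(np.diff(s) <= 0) and np.all(s > 0)  # singular values positive, decreasing
assert np.allclose(A, U @ np.diag(s) @ Vt)        # exact reconstruction
```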
Singular Value Decomposition
here: 𝑟 = 2
Singular Value Decomposition
A = U Σ Vᵀ = ∑_{i=1}^{r} σ_i · u_i ∘ v_iᵀ
i.e. a sum of rank-1 matrices: σ_1 · u_1 ∘ v_1ᵀ + σ_2 · u_2 ∘ v_2ᵀ + …
here: r = 2
SVD Example: Users-to-Movies
• A = U Σ Vᵀ – example: Users to Movies (rows = users, columns = movies, e.g. Matrix, Alien, …)
• The two latent dimensions correspond to a SciFi-concept and a Romance-concept:
A = [1 1 1 0 0; 3 3 3 0 0; 4 4 4 0 0; 5 5 5 0 0; 0 0 0 4 4; 0 0 0 5 5; 0 0 0 2 2]
U = [0.14 0; 0.42 0; 0.56 0; 0.70 0; 0 0.59; 0 0.74; 0 0.29]
Σ = [12.4 0; 0 9.48]
Vᵀ = [0.57 0.57 0.57 0 0; 0 0 0 0.70 0.70]
SVD Example: Users-to-Movies
• A = U Σ Vᵀ – the same example (note the multiplication by -1: the decomposition is only unique up to such sign flips)
U = [0.14 0; 0.42 0; 0.56 0; 0.70 0; 0 -0.59; 0 -0.74; 0 -0.29]
Σ = [12.4 0; 0 9.48]
Vᵀ = [0.57 0.57 0.57 0 0; 0 0 0 -0.70 -0.70]
SVD Example: Latent Factors
• A = U Σ Vᵀ – example: Users to Movies
A = [1 1 1 0 0; 3 3 3 0 0; 4 4 4 0 0; 5 5 5 0 0; 0 0 0 4 4; 0 0 0 5 5; 0 0 0 2 2]
• The "concepts" (here: SciFi and Romance) are a.k.a. latent dimensions, a.k.a. latent factors
SVD Example: Beyond Blocks
• Now the ratings do not form perfect blocks:
A = [1 1 1 0 0; 3 3 3 0 0; 4 4 4 0 0; 5 5 5 0 0; 0 2 0 4 4; 0 0 0 5 5; 0 1 0 2 2]
• A = U Σ Vᵀ with
U = [0.13 -0.02 -0.01; 0.41 -0.07 -0.03; 0.55 -0.09 -0.04; 0.68 -0.11 -0.05; 0.15 0.59 0.65; 0.07 0.73 -0.67; 0.07 0.29 0.32]
Σ = [12.4 0 0; 0 9.5 0; 0 0 1.3]
Vᵀ = [0.56 0.59 0.56 0.09 0.09; -0.12 0.02 -0.12 0.69 0.69; 0.40 -0.80 0.40 0.09 0.09]
SVD Example: Beyond Blocks
• Interpreting the factors of the decomposition above:
– The first latent dimension corresponds to the SciFi-concept, the second to the Romance-concept
– U is the "user-to-concept" similarity matrix
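To see this decomposition in practice, a minimal numpy sketch (not part of the lecture) that recomputes the SVD of the ratings matrix above; the printed values should roughly match the slide values, up to rounding and possible sign flips of whole columns/rows:

```python
import numpy as np

# users-to-movies rating matrix from the example above
A = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 2, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 1, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 1))            # leading singular values ≈ diag(Σ) above (plus values ≈ 0)
print(np.round(U[:, :2], 2))     # user-to-concept weights (signs may be flipped)
print(np.round(Vt[:2], 2))       # concept-to-movie weights
```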