Machine Learning Dimensionality Reduction
Department of Computer Science University College London
Lecture Overview
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction
Lecture Overview
Lecture Overview
By the end of this lecture you should:
1 Understand the problem of the curse of dimensionality and further motivations for dimensionality reduction
2 Know how Principal Components Analysis can be used to project data onto a lower-dimensional subspace
3 Be familiar with manifold learning as a means to uncover the non-linear low-dimensional structure in high dimensional data
Introduction
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction
Introduction
The Curse of Dimensionality
As the dimensionality of our input space increases, the number of instances that we need to ‘fill’ that space grows exponentially: for example, covering each of m dimensions with just 10 bins requires on the order of 10^m cells
Introduction
Specific Motivations
Data Visualisation
Data Compression
Denoising
Introduction
Dimensionality Reduction
At the heart of dimensionality reduction techniques is the idea that, although our data is high-dimensional, it actually lies on or near a low-dimensional subspace or manifold
If we assume our data lies on a subspace, we can use linear techniques to obtain a low-dimensional estimation
If we assume our data lies on a manifold, then we will need to use
non-linear techniques
Linear Dimensionality Reduction & PCA
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction
Linear Dimensionality Reduction & PCA
Linear Subspaces
The data shown is x ∈ R3
Along with this data is drawn a hyperplane that passes through both the origin and the data. The hyperplane is
referred to as a subspace of R3
Linear Dimensionality Reduction & PCA
Linear Subspaces
The hyperplane is a subset of R3:
Each point on the plane is represented by a single point in R^3. But…
…We only require two coordinates (i.e. a point in R^2) to describe a single point on the hyperplane
So if our data lies on (or near) a d-dimensional subspace of R^m, where d ≪ m, then we can reduce the dimensionality of our data
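To make this concrete, here is a minimal NumPy sketch (the data and variable names are illustrative, not from the lecture) showing that points lying on a 2-D plane through the origin in R^3 are described exactly by 2 coordinates once we have an orthonormal basis for the plane:

import numpy as np

rng = np.random.default_rng(0)

# Orthonormal basis for a 2-D subspace (plane through the origin) in R^3
U = np.linalg.qr(rng.normal(size=(3, 2)))[0]   # 3x2, columns orthonormal

# Generate points that lie exactly on the plane: x = U z with z in R^2
Z = rng.normal(size=(100, 2))                  # 2-D coordinates
X = Z @ U.T                                    # corresponding points in R^3

# Recover the 2-D coordinates by projecting onto the basis
Z_recovered = X @ U
print(np.allclose(Z, Z_recovered))             # True: two numbers suffice per point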
Linear Dimensionality Reduction & PCA
Principal Components Analysis
Principal Component Analysis (PCA) is one of the oldest methods for linear dimensionality reduction
One view of PCA is that of Projected Variance Maximisation (other, equivalent views exist):
Here the idea is to project our high-dimensional data onto a lower-dimensional subspace (defined up to a rotation)…
…Such that the sample variance is maximally preserved
This subspace encapsulates the directions along which the data varies the most
Linear Dimensionality Reduction & PCA
PCA: Setting
As usual, we assume that our input data consists of n instances:
{x^(i)}_{i=1}^n, where x^(i) ∈ R^m
Our goal is to project this data (linearly) onto a space R^d, where d < m, such that the variance of the projected data is maximised
This space is spanned by a set of basis vectors {u^[j]}_{j=1}^d, with each u^[j] ∈ R^m
Since this will not fully define the space in which we are interested, we remove some degeneracy (and ease our mathematical analysis) by requiring orthonormality:
u^[i] · u^[j] = δ_ij
Linear Dimensionality Reduction & PCA
PCA: Setting
For each basis vector u^[j], each data point x^(i) is then projected onto a scalar value u^[j] · x^(i)
The mean of the projected data onto this basis vector is u^[j] · x̄, where x̄ is the sample mean:
x̄ = (1/n) Σ_{i=1}^n x^(i)
Linear Dimensionality Reduction & PCA
Projected Variance Maximisation
The sample variance of the projected data along u^[j] is given by:
(1/n) Σ_{i=1}^n (u^[j] · x^(i) − u^[j] · x̄)^2 = u^[j]T S u^[j]
Where S is the sample covariance matrix, defined by:
S = (1/n) Σ_{i=1}^n (x^(i) − x̄)(x^(i) − x̄)^T = (1/n) X^T X
Linear Dimensionality Reduction & PCA
Projected Variance Maximisation
Here we note that X is the centred design matrix, whose rows are the centred data points:
X = [(x^(1) − x̄)^T ; (x^(2) − x̄)^T ; … ; (x^(n) − x̄)^T]
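As a quick sanity check, a minimal NumPy sketch (illustrative data and variable names) of the centred design matrix, the sample covariance S = (1/n) X^T X, and the projected variance u^T S u for a unit direction u:

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3))          # n = 500 instances in R^3

X = data - data.mean(axis=0)              # centred design matrix (rows are x(i) - x_bar)
S = X.T @ X / X.shape[0]                  # sample covariance, S = (1/n) X^T X

u = np.array([1.0, 0.0, 0.0])             # a unit direction
proj = X @ u                              # projected (centred) data, u . (x(i) - x_bar)
print(np.isclose(proj.var(), u @ S @ u))  # projected variance equals u^T S u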
Linear Dimensionality Reduction & PCA
Aside: Reconstruction Error Minimisation
We can also view PCA as a search for the d dimensional subspace which minimises the reconstruction error:
(1/n) Σ_{i=1}^n || (x^(i) − x̄) − Σ_{j=1}^d (u^[j] · (x^(i) − x̄)) u^[j] ||^2
Linear Dimensionality Reduction & PCA
PCA: Problem Formulation
We are interested in finding {u^[j]}_{j=1}^d such that the sum of the variances of the projected sample data is maximised
In other words we wish to solve the following optimisation problem:
argmax_{ {u^[j]}_{j=1}^d } L where: L = Σ_{j=1}^d u^[j]T S u^[j]   (1)
Subject to orthonormality of {u^[j]}_{j=1}^d
Linear Dimensionality Reduction & PCA
Eigendecomposition
Consider a square symmetric matrix, S, with eigenvalues {λ_i}_{i=1}^m and associated (orthonormal) eigenvectors {q_i ∈ R^m}_{i=1}^m:
S q_i = λ_i q_i where: q_i · q_j = δ_ij
Then the eigendecomposition of S is given by: S = Q Λ Q^T where:
Λ = diag(λ_1, ..., λ_m), Q = [q_1, ..., q_m]
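A minimal NumPy sketch (assuming a symmetric S, as above) confirming the decomposition S = Q Λ Q^T; note that np.linalg.eigh returns eigenvalues in ascending order:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
S = A @ A.T                                     # a symmetric (positive semi-definite) matrix

lam, Q = np.linalg.eigh(S)                      # eigenvalues (ascending) and orthonormal eigenvectors
print(np.allclose(S, Q @ np.diag(lam) @ Q.T))   # S = Q Lambda Q^T
print(np.allclose(Q.T @ Q, np.eye(4)))          # Q is orthogonal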
Linear Dimensionality Reduction & PCA
PCA: Greedy Solution
Let us proceed in a step-wise fashion and attempt to learn each
dimension greedily
First, let us search for the direction of highest variance, and write
our objective as the following Lagrangian, L:
L(u[1], λ[1]) = u[1]T Su[1] − λ[1](u[1] · u[1] − 1)
Where λ^[1] is a Lagrange multiplier associated with the normalisation constraint u^[1] · u^[1] = 1.
Seeking stationarity wrt u[1] gives:
2Su[1] − 2λ[1]u[1] = 0
=⇒ Su[1] = λ[1]u[1]
Linear Dimensionality Reduction & PCA
PCA: First Basis vector
Since u^[1] satisfies the eigenvalue equation, we can equate it with some eigenvector q_i of S
But which eigenvector? Let us calculate the variance of the projected data:
u[1]T Su[1] = λ[1]
We want to maximise this variance so we select u[1] = q1, the
eigenvector associated with λ1, the largest eigenvalue. Thus:
u[1] = q1 λ[1] = λ1
Linear Dimensionality Reduction & PCA
PCA: Second Basis Vector
Now let us find another direction, u[2], to further increase the
projected variance, such that u[2] · u[2] = 1 and u[1] · u[2] = 0: We write a Lagrangian, L, for this problem as follows:
L(u[2], λ[2], λ[1][2]) = u[2]T Su[2]−λ[2](u[2]·u[2]−1)−λ[1][2](u[2]·u[1]−0)
Where λ[2] and λ[1][2] are Lagrange multipliers.
Seeking stationarity wrt u[2] gives:
2Su[2] − 2λ[2]u[2] − λ[1][2]u[1] = 0 (2)
Linear Dimensionality Reduction & PCA
PCA: Second Basis Vector
Left multiplying equation (2) by u^[1]T gives:
2u^[1]T S u^[2] − 2λ^[2] u^[1] · u^[2] − λ^[1][2] u^[1] · u^[1] = 0
Therefore, using u^[1] · u^[2] = 0, u^[1] · u^[1] = 1, the symmetry of S, and S u^[1] = λ^[1] u^[1]:
2u^[2]T S u^[1] − λ^[1][2] = 0  =⇒  2λ^[1] u^[2] · u^[1] − λ^[1][2] = 0  =⇒  λ^[1][2] = 0
Substituting λ^[1][2] = 0 back into equation (2) then gives:
S u^[2] = λ^[2] u^[2]
Linear Dimensionality Reduction & PCA
PCA: Second Basis vector
Since u^[2] satisfies the eigenvalue equation, we can again equate it with some eigenvector q_i of S
But which eigenvector? Let us calculate the variance of the projected data:
u[2]T Su[2] = λ[2]
We want to maximise this variance so we select u[2] = q2, the eigenvector associated with λ2, the largest remaining eigenvalue. Thus:
u[2] = q2 λ[2] = λ2
Linear Dimensionality Reduction & PCA
PCA: Subsequent Basis Vectors
The solution proceeds in a similar step-wise fashion, to give:
u^[j] = q_j, λ^[j] = λ_j for j = 1, ..., d
And this implies a total projected variance of Σ_{j=1}^d λ_j
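Putting the pieces together, a minimal PCA sketch in NumPy (illustrative data; the variable names are ours, not the lecture's): centre the data, eigendecompose S, keep the top-d eigenvectors, and project:

import numpy as np

def pca(data, d):
    """Project data (n x m) onto its top-d principal components."""
    x_bar = data.mean(axis=0)
    X = data - x_bar                          # centred design matrix
    S = X.T @ X / X.shape[0]                  # sample covariance
    lam, Q = np.linalg.eigh(S)                # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]             # sort descending
    lam, Q = lam[order], Q[:, order]
    U = Q[:, :d]                              # top-d eigenvectors span the subspace
    Z = X @ U                                 # projected coordinates (n x d)
    return Z, U, lam

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, U, lam = pca(data, d=2)
print(Z.shape, lam[:2].sum())                 # total projected variance = lambda_1 + lambda_2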
Linear Dimensionality Reduction & PCA
PCA: Non-Uniqueness
Note that we have made a number of choices in this derivation...
Orthogonality of basis vectors
Normality of different basis vectors
Orientation of the first basis vector
Other choices are possible which would also yield similar solutions
This suggests that we should check whether a particular such solution is globally optimal
Linear Dimensionality Reduction & PCA
PCA: Non-Convexity
The trouble is that our original problem is non-convex
However, we can show that the greedy local optima we have
investigated give rise to globally optimal points
It turns out that PCA is a rare example of a non-convex optimisation problem which we can solve globally!
Linear Dimensionality Reduction & PCA
PCA: Examining the Eigenvalues
Our PCA solution allows us to estimate the subspace as the space spanned by the first d eigenvectors of the sample covariance matrix S ordered by size of eigenvalue
If we refer to the dimensionality of the input space as the ambient dimensionality of the data, then the intrinsic dimensionality is the dimensionality of the subspace upon which the data is assumed to lie
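In practice the intrinsic dimensionality is often estimated by inspecting the eigenvalue spectrum, e.g. keeping enough components to explain a chosen fraction of the variance. A hedged sketch (the 95% threshold is an arbitrary illustration, not a rule from the lecture), reusing the eigenvalues from the pca sketch above:

import numpy as np

# lam: eigenvalues of S sorted in descending order (e.g. from the pca sketch above)
def choose_d(lam, threshold=0.95):
    """Smallest d whose leading eigenvalues explain `threshold` of the total variance."""
    explained = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(explained, threshold) + 1)

lam = np.array([5.0, 2.0, 0.5, 0.1, 0.05])
print(choose_d(lam))        # -> 3 (here 5 + 2 + 0.5 = 7.5 of 7.65, about 98%)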
Linear Dimensionality Reduction & PCA
Examining the Eigenvalues
Linear Dimensionality Reduction & PCA
Application: Swaps Curve
Linear Dimensionality Reduction & PCA
Application: Swaps Curve
‘Shift’, ‘Tilt’, ‘Curvature’ effects dominate
Linear Dimensionality Reduction & PCA
Pattern Stability
PCA is intuitively plausible, but can we be sure that our training sample has allowed us to discern a reliable subspace?
At least two possible responses to this, which flow from different ML paradigms:
Linear Dimensionality Reduction & PCA
Pattern Stability
Generative Modelling:
Probabilistic PCA: seeks a Gaussian latent variable model
Learn parameters using MLE or a Bayesian treatment
PAC Approach:
Seeks to bound the generalisation error resulting from projection:
E_D[ || x − Σ_{j=1}^d (u^[j] · x) u^[j] ||^2 ], with high probability
Bound contains terms in sample residual eigenspectrum and in complexity of data space
Indicates low generalisation error if subspace captures a high proportion of the variance in a dimensionality small compared to training data size
Linear Dimensionality Reduction & Probabilistic PCA
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction
Linear Dimensionality Reduction & Probabilistic PCA
Probabilistic PCA
Let’s investigate the probabilistic generalisation of PCA
Here it’s valuable to recall the probabilistic generalisation of the k-means model to the Mixture of Gaussians model for clustering
Can we build a latent variable version of PCA?
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Setting
As before, our input data consists of n instances, {x^(i) ∈ R^m}_{i=1}^n, with sample mean x̄, and associated sample covariance matrix, S = (1/n) X^T X (with X the centred design matrix)
And as before, we seek some d-dimensional principal component subspace, where d < m
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
Each point has an unknown, latent, variable, z ∈ Rd associated with it, corresponding to its position in the principal component subspace. This variable is the outcome of a Gaussian random variable, Z, such that:
z ∼ N(0,Id)
Contingent on the principal component subspace variable, each x is
the outcome of a Gaussian random variable, X, such that: x|z ∼ N(Wz + μ, σ2Im)
where: W ∈ Rm×d , which defines the directions of the principal subspace, μ ∈ Rm, σ ∈ R+
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
We can view the model from a generative standpoint: First a value is drawn for the latent variable, z
Then the observed variable is sampled, conditional on this latent variable, x|z
And residual noise is captured by ε, which is an outcome of an m-dimensional random variable, ε, distributed like a zero-mean Gaussian:
x = Wz + μ + ε   (3)
z ∼ N(0, I_d)
ε ∼ N(0, σ^2 I_m)
where Z and ε are uncorrelated
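A minimal sketch of the generative story in equation (3); the parameter values (W, μ, σ) here are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
m, d, n = 5, 2, 1000
W = rng.normal(size=(m, d))                     # principal subspace directions (illustrative)
mu = rng.normal(size=m)
sigma = 0.1

Z = rng.normal(size=(n, d))                     # z ~ N(0, I_d)
eps = sigma * rng.normal(size=(n, m))           # eps ~ N(0, sigma^2 I_m)
X = Z @ W.T + mu + eps                          # x = W z + mu + eps

# The empirical covariance should approach C = W W^T + sigma^2 I_m as n grows
C = W @ W.T + sigma**2 * np.eye(m)
print(np.abs(np.cov(X, rowvar=False, bias=True) - C).max())   # small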
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
Recall the properties of the marginal & conditional distributions associated with Linear Gaussian Models that we encountered in the Probability Lecture:
Given a marginal distribution for x and a conditional Gaussian distribution for y given x:
x ∼ N(μ, Λ^{-1})
y|x ∼ N(Ax + b, L^{-1})
Where: x ∈ R^n, y ∈ R^m, μ ∈ R^n, Λ ∈ R^{n×n}, A ∈ R^{m×n}, b ∈ R^m, L ∈ R^{m×m}
Then:
y ∼ N(Aμ + b, L^{-1} + A Λ^{-1} A^T)
x|y ∼ N(Σ[A^T L(y − b) + Λμ], Σ)
Where: Σ = (Λ + A^T L A)^{-1}
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
From this, we can see that x is drawn from a Gaussian distribution with:
Mean:
E[X] = W E[Z] + μ + E[ε] = μ
Covariance:
C = E[(WZ + μ + ε − E[X])(WZ + μ + ε − E[X])^T]
= E[(WZ + ε)(WZ + ε)^T]
= E[W Z Z^T W^T] + E[ε ε^T] + E[W Z ε^T] + E[ε Z^T W^T]
= W E[Z Z^T] W^T + E[ε ε^T]   (the cross terms vanish since Z and ε are uncorrelated and zero-mean)
= W W^T + σ^2 I_m
So:
x ∼ N(μ, C)
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Interpretation
We can interpret pX(x) as a density defined by taking an isotropic Gaussian ‘spray can’...
...Then moving across the principal subspace, spraying Gaussian ink with density determined by σ2...
...And weighted by the prior distribution, pZ(z)
This results in an ink density which has a pancake shaped distribution, which represents pX(x)
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Interpretation
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Rotational Invariance
Let R be an orthogonal (rotation) matrix (for which RT R = I)
Now apply this rotation to the latent space coordinate matrix, W, to give W̃ = WR:
=⇒ W̃ W̃^T = W R R^T W^T = W W^T
Thus p_X(x) is as well characterised by W̃ as it is by W
This is the analogue of the non-uniqueness which we encountered in PCA
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Learning Problem
Now that we have defined our generative model we need to learn its parameters: μ, W, σ2
Let’s use a Maximum Likelihood approach:
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Log Likelihood
L({x^(i)}_{i=1}^n; W, μ, σ^2) = Σ_{i=1}^n ln p_X(x^(i); W, μ, σ^2)
= −(nm/2) ln(2π) − (n/2) ln|C| − (1/2) Σ_{i=1}^n (x^(i) − μ)^T C^{-1} (x^(i) − μ)
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Optimisation
Tipping & Bishop (‘99) demonstrate the following closed form solutions which flow from the optimisation of this function:
W_MLE = Q (Λ − σ^2 I)^{1/2} R
Q is the m × d matrix whose columns are given by the leading d eigenvectors of the covariance matrix S
Λ is the diagonal matrix of the d leading eigenvalues associated with these eigenvectors
R is an arbitrary orthogonal matrix
σ^2_MLE = (1/(m − d)) Σ_{i=d+1}^m λ_i
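A hedged NumPy sketch of these closed-form estimates (taking R = I; the function name and data are ours, not a library API), built on the eigendecomposition of S:

import numpy as np

def ppca_mle(data, d):
    """Closed-form PPCA maximum likelihood estimates (Tipping & Bishop), with R = I; a sketch."""
    mu = data.mean(axis=0)
    X = data - mu
    S = X.T @ X / X.shape[0]                      # sample covariance
    lam, Q = np.linalg.eigh(S)
    lam, Q = lam[::-1], Q[:, ::-1]                # descending eigenvalue order
    sigma2 = lam[d:].mean()                       # average of the discarded eigenvalues
    W = Q[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))
    return W, mu, sigma2

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
W, mu, sigma2 = ppca_mle(data, d=2)
print(W.shape, sigma2)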
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Interpretation of Solution
We can set R = I without loss of generality, in which case the columns of W are the principal component eigenvectors scaled by the square root of the variance parameters (λi − σ2)1/2
So the variance of p_X(x) in the q_i direction (for i ≤ d) is given by:
q_i^T C q_i = q_i^T W W^T q_i + σ^2 q_i^T q_i = (λ_i − σ^2) + σ^2 = λ_i
So this model captures the variance of the data in the direction of
the principal axes
While σ2MLE is the average variance associated with the discarded dimensions
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
Recall the distributional forms for z, and x|z:
z ∼ N(0,Id)
x|z ∼ N(Wz + μ, σ^2 I_m)
Then using the conditional distribution property for Linear Gaussian Models once again:
z|x ∼ N(M^{-1} W_MLE^T (x − μ), σ^2 M^{-1})
Where: M = W_MLE^T W_MLE + σ^2 I_d
This is the particular distribution of z given a point x in the input data space.
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
The expectation of Z|x gives a summary of the point x in latent space:
E[Z|x] = M^{-1} W_MLE^T (x − μ)
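A minimal sketch (continuing the hypothetical ppca_mle sketch above) of mapping a data point to its expected latent coordinates and back to the data space:

import numpy as np

def ppca_project(x, W, mu, sigma2):
    """E[Z|x] and its reconstruction in data space, following the formulas above."""
    d = W.shape[1]
    M = W.T @ W + sigma2 * np.eye(d)              # M = W^T W + sigma^2 I_d
    z_mean = np.linalg.solve(M, W.T @ (x - mu))   # E[Z|x] = M^{-1} W^T (x - mu)
    x_recon = W @ z_mean + mu                     # W E[Z|x] + mu (shrunk towards mu)
    return z_mean, x_recon

# Example usage (with W, mu, sigma2 from the ppca_mle sketch):
# z, x_hat = ppca_project(data[0], W, mu, sigma2)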
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
And this latent variable point projects back to a point in the input data space, the expectation of which is given by:
W_MLE E[Z|x] + μ
= W_MLE M^{-1} W_MLE^T (x − μ) + μ
= W_MLE (W_MLE^T W_MLE + σ^2 I)^{-1} W_MLE^T (x − μ) + μ
= W_MLE [(Λ − σ^2 I)^{1/2} Q^T Q (Λ − σ^2 I)^{1/2} + σ^2 I]^{-1} W_MLE^T (x − μ) + μ
= W_MLE Λ^{-1} W_MLE^T (x − μ) + μ
= Q (Λ − σ^2 I)^{1/2} Λ^{-1} (Λ − σ^2 I)^{1/2} Q^T (x − μ) + μ
= Q diag((λ_1 − σ^2)/λ_1, ..., (λ_d − σ^2)/λ_d) Q^T (x − μ) + μ
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
Now, as σ^2 → 0, then (W_MLE E[Z|x] + μ) → Q Q^T (x − μ) + μ
Thus each data point is approximated by a mapping into a linear subspace defined by the eigenvectors of S, given by Q, such that each point is orthogonally projected into this subspace
...Just as in PCA, where the coordinate of x along each principal direction is:
q_i^T (x − μ)
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
But for σ^2 > 0, each projection is scaled by (λ_i − σ^2)/λ_i < 1, and thus the projection is shrunk towards μ:
((λ_i − σ^2)/λ_i) q_i^T (x − μ)
Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Setting d
Just as in the Mixture of Gaussians model for clustering, we can use our generative model to select d in a principled way:
Evaluate the likelihood of the data on a validation set for various settings of d, and select the setting which gives rise to the maximal validation likelihood
A Bayesian treatment of PPCA or the maximisation of the PAC bound offer different resolutions to this problem
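A hedged sketch of the validation-likelihood recipe above, reusing the hypothetical ppca_mle sketch from earlier and SciPy's multivariate normal for the marginal density x ∼ N(μ, C):

import numpy as np
from scipy.stats import multivariate_normal

def ppca_val_loglik(train, val, d):
    """Validation log-likelihood of a PPCA model fit on `train` with latent dimension d."""
    W, mu, sigma2 = ppca_mle(train, d)                  # from the earlier sketch
    C = W @ W.T + sigma2 * np.eye(train.shape[1])       # marginal covariance: x ~ N(mu, C)
    return multivariate_normal(mean=mu, cov=C).logpdf(val).sum()

# Pick the d with the highest validation log-likelihood, e.g.:
# best_d = max(range(1, m), key=lambda d: ppca_val_loglik(train, val, d))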
Linear Dimensionality Reduction & Probabilistic PCA
Where PCA Fails
One of the restrictions of PCA is that it assumes the data lies on or near a linear subspace
What happens if that assumption fails?
Non-Linear Dimensionality Reduction
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction
Non-Linear Dimensionality Reduction
From Subspaces to (Sub)Manifolds
Non-Linear Dimensionality Reduction
Manifold Learning Techniques
Kernel PCA
Locally Linear Embedding (LLE)
Autoencoders
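For orientation only, the first two of these are available off the shelf; a minimal scikit-learn sketch (the Swiss-roll dataset and the parameter values are illustrative choices, not prescriptions from the lecture):

import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)    # data on a 2-D manifold in R^3

# Kernel PCA: PCA in an implicit feature space defined by a kernel
Z_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.05).fit_transform(X)

# Locally Linear Embedding: preserves local linear reconstruction weights
Z_lle = LocallyLinearEmbedding(n_components=2, n_neighbors=12).fit_transform(X)

print(Z_kpca.shape, Z_lle.shape)    # (1000, 2) (1000, 2)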
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction
1 Dimensionality Reduction falls into two broad categories depending on whether the low-dimensional space we are mapping to is linear or non-linear
2 Principal Component Analysis is a linear technique for reducing the dimensionality of the data by projecting it onto the maximum-variance subspace
3 Manifold learning techniques allow us to uncover non-linear low-dimensional structure in high-dimensional data