
Machine Learning Dimensionality Reduction

Department of Computer Science, University College London


Lecture Overview
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction

Lecture Overview
Lecture Overview
By the end of this lecture you should:
1 Understand the problem of the curse of dimensionality and further motivations for dimensionality reduction
2 Know how Principal Components Analysis can be used to project data onto a lower-dimensional subspace
3 Be familiar with manifold learning as a means to uncover the non-linear low-dimensional structure in high dimensional data

Introduction
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction

Introduction
The Curse of Dimensionality
As the dimensionality of our input space increases, the number of instances that we need to ‘fill’ that space grows exponentially
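
As a rough, hypothetical illustration (not from the slides): covering the unit hypercube at a fixed resolution requires a number of cells that is exponential in the dimension, so the number of instances needed to populate the space grows correspondingly.

```python
# Cells needed to cover [0, 1]^m with a grid of side length 0.1 in each
# dimension: the count is 10^m, i.e. exponential in the dimensionality m.
for m in (1, 2, 3, 10, 20):
    print(f"m = {m:2d}: {10 ** m} cells")
```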

Introduction
Specific Motivations
Data Visualisation
Data Compression
Denoising

Introduction
Dimensionality Reduction
At the heart of dimensionality reduction techniques is the idea that, although our data is high-dimensional, it actually lies on or near a low-dimensional subspace or manifold
If we assume our data lies on or near a linear subspace, we can use linear techniques to obtain a low-dimensional representation
If we assume our data lies on or near a manifold, then we will need to use non-linear techniques

Linear Dimensionality Reduction & PCA
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction

Linear Dimensionality Reduction & PCA
Linear Subspaces
The data shown is x ∈ R3
Along with this data is drawn a hyperplane that passes through both the origin and the data
This hyperplane is referred to as a subspace of R3

Linear Dimensionality Reduction & PCA
Linear Subspaces
The hyperplane is a subset of R3:
Each point on the plane is represented by a single point in R3. But…
…We only require two co-ordinates (i.e. a point in R2) to describe a single point on the hyperplane
So if our data lies on a d-dimensional subspace of Rm, where d ≪ m, then we can reduce the dimensionality of our data (see the sketch below)
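
A minimal NumPy sketch of this idea (the basis vectors and coordinates below are illustrative choices, not from the slides): a point on a plane through the origin in R3 is fully described by two coordinates with respect to an orthonormal basis of that plane.

```python
import numpy as np

# Orthonormal basis (hypothetical) of a 2-D subspace of R^3.
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2.0)

c = np.array([2.0, -1.0])                 # 2-D coordinates on the subspace
x = c[0] * u1 + c[1] * u2                 # the corresponding point in R^3
recovered = np.array([u1 @ x, u2 @ x])    # dot products recover (c1, c2)
print(x, recovered)
```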

Linear Dimensionality Reduction & PCA
Principal Components Analysis
Principal Component Analysis (PCA) is one of the oldest methods for linear dimensionality reduction
One view of PCA is that of Projected Variance Maximisation (other, equivalent views exist):
Here the idea is to project our high-dimensional data onto a lower-dimensional subspace (defined up to a rotation)…
…Such that the sample variance is maximally preserved
This subspace encapsulates the directions along which the data varies the most

Linear Dimensionality Reduction & PCA
PCA: Setting
As usual, we assume that our input data consists of n instances:
{x^(i)}_{i=1}^n, where x^(i) ∈ R^m
Our goal is to project this data (linearly) onto a space R^d, where d < m, such that the variance of the projected data is maximised
This space is spanned by the set of basis vectors {u^[j]}_{j=1}^d, where u^[j] ∈ R^m
Since this alone will not fully define the space in which we are interested, we remove some degeneracy (and ease our mathematical analysis) by requiring orthonormality: u^[i] · u^[j] = δ_ij

Linear Dimensionality Reduction & PCA
PCA: Setting
For each basis vector u^[j], each data point x^(i) is then projected onto a scalar value u^[j] · x^(i)
The mean of the data projected onto this basis vector is u^[j] · x̄, where x̄ is the sample mean:
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}

Linear Dimensionality Reduction & PCA
Projected Variance Maximisation
The sample variance of the projected data is given by:
\frac{1}{n} \sum_{i=1}^{n} \left( u^{[j]} \cdot x^{(i)} - u^{[j]} \cdot \bar{x} \right)^2 = u^{[j]T} S u^{[j]}
Where S is the sample covariance matrix, defined by:
S = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} - \bar{x} \right) \left( x^{(i)} - \bar{x} \right)^T = \frac{1}{n} X^T X

Linear Dimensionality Reduction & PCA
Projected Variance Maximisation
Here we note that X is the centred design matrix:
X = \begin{pmatrix} (x^{(1)} - \bar{x})^T \\ (x^{(2)} - \bar{x})^T \\ \vdots \\ (x^{(n)} - \bar{x})^T \end{pmatrix}

Linear Dimensionality Reduction & PCA
Aside: Reconstruction Error Minimisation
We can also view PCA as a search for the d-dimensional subspace which minimises the reconstruction error:
\sum_{i=1}^{n} \Big\| x^{(i)} - \bar{x} - \sum_{j=1}^{d} \big( u^{[j]} \cdot (x^{(i)} - \bar{x}) \big) u^{[j]} \Big\|^2

Linear Dimensionality Reduction & PCA
PCA: Problem Formulation
We are interested in finding {u^[j]}_{j=1}^d such that the sum of the variances of the projected sample data is maximised
In other words, we wish to solve the following optimisation problem:
\operatorname{argmax}_{\{u^{[j]}\}_{j=1}^{d}} L \quad \text{where:} \quad L = \sum_{j=1}^{d} u^{[j]T} S u^{[j]} \quad (1)
Subject to orthonormal {u^[j]}_{j=1}^d

Linear Dimensionality Reduction & PCA
Eigendecomposition
Consider a square symmetric matrix, S, with eigenvalues {λ_i}_{i=1}^m and associated (orthonormal) eigenvectors {q_i ∈ R^m}_{i=1}^m:
S q_i = λ_i q_i where: q_i · q_j = δ_ij
Then the eigendecomposition of S is given by:
S = Q Λ Q^T where: Λ = diag(λ_1, ..., λ_m), Q = [q_1, ..., q_m]

Linear Dimensionality Reduction & PCA
PCA: Greedy Solution
Let us proceed in a step-wise fashion and attempt to learn each dimension greedily
First, let us search for the direction of highest variance, and write our objective as the following Lagrangian, L:
L(u^[1], λ^[1]) = u^[1]T S u^[1] − λ^[1] (u^[1] · u^[1] − 1)
Where λ^[1] is a Lagrange multiplier associated with the normalisation constraint. Seeking stationarity wrt u^[1] gives:
2 S u^[1] − 2 λ^[1] u^[1] = 0 =⇒ S u^[1] = λ^[1] u^[1]

Linear Dimensionality Reduction & PCA
PCA: First Basis Vector
Since u^[1] satisfies the eigenvalue equation, this lets us equate it with some eigenvector q_i
But which eigenvector? Let us calculate the variance of the projected data:
u^[1]T S u^[1] = λ^[1]
We want to maximise this variance, so we select u^[1] = q_1, the eigenvector associated with λ_1, the largest eigenvalue. Thus:
u^[1] = q_1, λ^[1] = λ_1
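
To make the eigendecomposition view concrete, here is a minimal NumPy sketch (not part of the slides; the toy data and variable names are illustrative) that builds the sample covariance matrix, extracts its leading eigenvectors, and checks that the projected variances match the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n points in R^3 lying near a 2-D subspace (hypothetical example).
n, m, d = 500, 3, 2
latent = rng.normal(size=(n, d)) @ np.diag([3.0, 1.0])
W_true = np.linalg.qr(rng.normal(size=(m, d)))[0]          # orthonormal basis
X = latent @ W_true.T + 0.05 * rng.normal(size=(n, m))

# Centre the data and form the sample covariance matrix S = (1/n) X^T X.
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = (Xc.T @ Xc) / n

# Eigendecomposition of the symmetric matrix S; sort by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, Q = eigvals[order], eigvecs[:, order]

# The first d eigenvectors span the maximum-variance subspace;
# the projected variances equal the corresponding eigenvalues.
U = Q[:, :d]
Z = Xc @ U                       # d-dimensional representation
print(eigvals)                   # lambda_1 >= lambda_2 >= lambda_3
print(Z.var(axis=0, ddof=0))     # approximately lambda_1, lambda_2
```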

Linear Dimensionality Reduction & PCA
PCA: Second Basis Vector
Now let us find another direction, u^[2], which further increases the projected variance, such that u^[2] · u^[2] = 1 and u^[1] · u^[2] = 0
We write a Lagrangian, L, for this problem as follows:
L(u^[2], λ^[2], λ^[1][2]) = u^[2]T S u^[2] − λ^[2] (u^[2] · u^[2] − 1) − λ^[1][2] (u^[2] · u^[1] − 0)
Where λ^[2] and λ^[1][2] are Lagrange multipliers. Seeking stationarity wrt u^[2] gives:
2 S u^[2] − 2 λ^[2] u^[2] − λ^[1][2] u^[1] = 0 (2)

Linear Dimensionality Reduction & PCA
PCA: Second Basis Vector
Left multiplying equation (2) by u^[1]T gives:
2 u^[1]T S u^[2] − 2 λ^[2] u^[1] · u^[2] − λ^[1][2] u^[1] · u^[1] = 0
Therefore:
=⇒ 2 u^[2]T S u^[1] − λ^[1][2] = 0
=⇒ 2 λ^[1] u^[2] · u^[1] − λ^[1][2] = 0
=⇒ λ^[1][2] = 0
So equation (2) reduces to the eigenvalue equation:
S u^[2] = λ^[2] u^[2]

Linear Dimensionality Reduction & PCA
PCA: Second Basis Vector
Since u^[2] satisfies the eigenvalue equation, this lets us equate it with some eigenvector q_i
But which eigenvector? Let us calculate the variance of the projected data:
u^[2]T S u^[2] = λ^[2]
We want to maximise this variance, so we select u^[2] = q_2, the eigenvector associated with λ_2, the largest remaining eigenvalue. Thus:
u^[2] = q_2, λ^[2] = λ_2

Linear Dimensionality Reduction & PCA
PCA: Subsequent Basis Vectors
The solution proceeds in a similar step-wise fashion, to give:
u^[j] = q_j, λ^[j] = λ_j for j = 1, ..., d
And this implies a total projected variance of:
\sum_{j=1}^{d} λ_j

Linear Dimensionality Reduction & PCA
PCA: Non-Uniqueness
Note that we have made a number of choices in this derivation:
Orthogonality of the basis vectors
Normality of the basis vectors
Orientation of the first basis vector
Other choices are possible which would also yield similar solutions
This suggests that we should check whether a particular solution is globally optimal

Linear Dimensionality Reduction & PCA
PCA: Non-Convexity
The trouble is that our original problem is non-convex
However, we can show that the greedy local optima we have investigated give rise to globally optimal points
It turns out that PCA is a rare example of a non-convex optimisation problem which we can solve globally!

Linear Dimensionality Reduction & PCA
PCA: Examining the Eigenvalues
Our PCA solution allows us to estimate the subspace as the space spanned by the first d eigenvectors of the sample covariance matrix S, ordered by size of eigenvalue
If we refer to the dimensionality of the input space as the ambient dimensionality of the data, then the intrinsic dimensionality is the dimensionality of the subspace upon which the data is assumed to lie

Linear Dimensionality Reduction & PCA
Examining the Eigenvalues

Linear Dimensionality Reduction & PCA
Application: Swaps Curve

Linear Dimensionality Reduction & PCA
Application: Swaps Curve
‘Shift’, ‘Tilt’, ‘Curvature’ effects dominate
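
As a hedged illustration of examining the eigenvalues (the toy data below is a hypothetical choice, not from the slides), one can look at the fraction of total variance captured by the leading d eigenvalues and use it to judge the intrinsic dimensionality:

```python
import numpy as np

def explained_variance(X, d):
    """Fraction of total sample variance captured by the top-d principal
    components of X (rows are instances). Minimal illustrative helper."""
    Xc = X - X.mean(axis=0)
    S = (Xc.T @ Xc) / X.shape[0]           # sample covariance matrix
    eigvals = np.linalg.eigvalsh(S)[::-1]  # eigenvalues, largest first
    return eigvals[:d].sum() / eigvals.sum()

# Example: data with ambient dimensionality 10 but intrinsic dimensionality ~3;
# the captured variance saturates once d reaches the intrinsic dimensionality.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
for d in range(1, 11):
    print(d, round(explained_variance(X, d), 3))
```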

Linear Dimensionality Reduction & PCA
Pattern Stability
PCA is intuitively plausible, but can we be sure that our training sample has allowed us to discern a reliable subspace?
There are at least two possible responses to this, which flow from different ML paradigms:

Linear Dimensionality Reduction & PCA
Pattern Stability
Generative Modelling: Probabilistic PCA
Seeks a Gaussian latent variable model
Learn parameters using MLE or a Bayesian treatment
PAC Approach:
Seeks to bound the generalisation error resulting from projection, E_D\big[ \| x - \sum_{j=1}^{d} (u^{[j]} \cdot x) u^{[j]} \|^2 \big], with high probability
The bound contains terms in the sample residual eigenspectrum and in the complexity of the data space
It indicates low generalisation error if the subspace captures a high proportion of the variance in a dimensionality small compared to the training data size

Linear Dimensionality Reduction & Probabilistic PCA
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction

Linear Dimensionality Reduction & Probabilistic PCA
Probabilistic PCA
Let’s investigate the probabilistic generalisation of PCA
Here it’s valuable to recall the probabilistic generalisation of the k-means model to the Mixture of Gaussians model for clustering
Can we build a Latent Variable version of PCA?

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Setting
As before, our input data consists of n instances, {x^(i) ∈ R^m}_{i=1}^n, with sample mean x̄ and associated sample covariance matrix S = (1/n) X^T X
And as before, we seek some d-dimensional principal component subspace, where d < m

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
Each point has an unknown, latent, variable z ∈ R^d associated with it, corresponding to its position in the principal component subspace. This variable is the outcome of a Gaussian random variable, Z, such that:
z ∼ N(0, I_d)
Contingent on the principal component subspace variable, each x is the outcome of a Gaussian random variable, X, such that:
x | z ∼ N(W z + μ, σ^2 I_m)
where W ∈ R^{m×d}, which defines the directions of the principal subspace, μ ∈ R^m, and σ ∈ R_+

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
We can view the model from a generative standpoint:
First a value is drawn for the latent variable, z
Then the observed variable is sampled, conditional on this latent variable, x | z
And residual noise is captured by ε, which is an outcome of an m-dimensional random variable distributed like a zero-mean Gaussian:
x = W z + μ + ε (3)
z ∼ N(0, I_d), ε ∼ N(0, σ^2 I_m)
where Z and ε are uncorrelated

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
Recall the properties of the marginal & conditional distributions associated with Linear Gaussian Models that we encountered in the Probability Lecture:
Given a marginal Gaussian distribution for x̃ and a conditional Gaussian distribution for ỹ given x̃:
x̃ ∼ N(μ, Λ^{-1})
ỹ | x̃ ∼ N(A x̃ + b, L^{-1})
Where: x̃ ∈ R^n, ỹ ∈ R^m, μ ∈ R^n, Λ ∈ R^{n×n}, A ∈ R^{m×n}, b ∈ R^m, L ∈ R^{m×m}
Then:
ỹ ∼ N(A μ + b, L^{-1} + A Λ^{-1} A^T)
x̃ | ỹ ∼ N(Σ [A^T L (ỹ − b) + Λ μ], Σ) where: Σ = (Λ + A^T L A)^{-1}

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Model
From this, we can see that x is drawn from a Gaussian distribution with:
Mean:
E[X] = W E[Z] + μ + E[ε] = μ
Covariance:
C = E[(W Z + μ + ε − E[X]) (W Z + μ + ε − E[X])^T]
= E[(W Z + ε) (W Z + ε)^T]
= E[W Z Z^T W^T] + E[ε ε^T] + E[W Z ε^T] + E[ε Z^T W^T]
= W E[Z Z^T] W^T + E[ε ε^T]
= W W^T + σ^2 I_m
So: x ∼ N(μ, C)
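
The following sketch (illustrative parameters, not from the slides) samples from the generative model of equation (3) and checks that the sample covariance approaches C = W W^T + σ^2 I_m:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical dimensions and parameters (for illustration only).
m, d, n = 5, 2, 1000
W = rng.normal(size=(m, d))          # principal subspace directions
mu = rng.normal(size=m)              # data mean
sigma2 = 0.1                         # isotropic noise variance

# Generative process of equation (3): z ~ N(0, I_d), x = W z + mu + eps.
Z = rng.normal(size=(n, d))
eps = np.sqrt(sigma2) * rng.normal(size=(n, m))
X = Z @ W.T + mu + eps

# The implied marginal covariance should match C = W W^T + sigma^2 I_m.
C_model = W @ W.T + sigma2 * np.eye(m)
C_sample = np.cov(X, rowvar=False, bias=True)
print(np.abs(C_model - C_sample).max())   # small for large n
```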

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Interpretation
We can interpret pX(x) as a density defined by taking an isotropic Gaussian ‘spray can’...
...Then moving across the principal subspace, spraying Gaussian ink with density determined by σ^2...
...And weighted by the prior distribution, pZ(z)
This results in an ink density which has a pancake-shaped distribution, which represents pX(x)

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Interpretation

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Rotational Invariance
Let R be an orthogonal (rotation) matrix (for which R^T R = I)
Now apply this rotation to the latent space coordinate matrix, W, to give W̃ = W R:
=⇒ W̃ W̃^T = W R R^T W^T = W W^T
Thus pX(x) is as well characterised by any W̃ as it is by W
This is the analogue of the non-uniqueness which we encountered in PCA

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Learning Problem
Now that we have defined our generative model, we need to learn its parameters: μ, W, σ^2
Let’s use a Maximum Likelihood approach:

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Log Likelihood
\ln p(\{x^{(i)}\}_{i=1}^{n}; W, \mu, \sigma^2) = \sum_{i=1}^{n} \ln p_X(x^{(i)}; W, \mu, \sigma^2)
= -\frac{nm}{2} \ln(2\pi) - \frac{n}{2} \ln|C| - \frac{1}{2} \sum_{i=1}^{n} (x^{(i)} - \mu)^T C^{-1} (x^{(i)} - \mu)

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Optimisation
Tipping & Bishop (‘99) demonstrate the following closed-form solutions which flow from the optimisation of this function:
W_MLE = Q (Λ − σ^2 I)^{1/2} R
Q is the m × d matrix whose columns are given by the leading d eigenvectors of the covariance matrix S
Λ is the diagonal matrix of the d leading eigenvalues associated with these eigenvectors
R is an arbitrary orthogonal matrix
\sigma^2_{MLE} = \frac{1}{m - d} \sum_{j=d+1}^{m} \lambda_j

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Interpretation of Solution
We can set R = I without loss of generality, in which case the columns of W are the principal component eigenvectors scaled by the square root of the variance parameters (λ_i − σ^2)^{1/2}
So the variance of pX(x) in the q_i direction is given by:
q_i^T C q_i = q_i^T W W^T q_i + σ^2 q_i^T q_i = λ_i − σ^2 + σ^2 = λ_i
So this model captures the variance of the data in the direction of the principal axes
While σ^2_MLE is the average variance associated with the discarded dimensions

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
Recall the distributional forms for z and x | z:
z ∼ N(0, I_d)
x | z ∼ N(W z + μ, σ^2 I_m)
Then using the conditional distribution property for Linear Gaussian Models once again:
z | x ∼ N(M^{-1} W_MLE^T (x − μ), σ^2 M^{-1}) where: M = W_MLE^T W_MLE + σ^2 I
This is the distribution of z given a point x in the input data space.
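
A minimal sketch of these closed-form estimates, assuming R = I and using a toy data set (the names and numbers are illustrative, not from the slides):

```python
import numpy as np

def ppca_mle(X, d):
    """Closed-form PPCA maximum-likelihood estimates (Tipping & Bishop),
    sketched here with R = I; X has one instance per row."""
    n, m = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / n
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    lam, Q = eigvals[order], eigvecs[:, order]
    sigma2 = lam[d:].mean()                        # average discarded variance
    W = Q[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))
    return W, mu, sigma2

def posterior_mean(x, W, mu, sigma2):
    """E[Z | x] = M^{-1} W^T (x - mu), with M = W^T W + sigma^2 I."""
    M = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M, W.T @ (x - mu))

# Usage on toy data (hypothetical numbers).
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(300, 6))
W, mu, sigma2 = ppca_mle(X, d=2)
print(sigma2, posterior_mean(X[0], W, mu, sigma2))
```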

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
The expectation of Z | x gives a summary of the point x in latent space:
E[Z | x] = M^{-1} W_MLE^T (x − μ)

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
And this latent variable point projects back to a point in the input data space, the expectation of which is given by:
W_MLE E[Z | x] + μ = W_MLE M^{-1} W_MLE^T (x − μ) + μ
= W_MLE (W_MLE^T W_MLE + σ^2 I)^{-1} W_MLE^T (x − μ) + μ
= W_MLE ((Λ − σ^2 I)^{1/2} Q^T Q (Λ − σ^2 I)^{1/2} + σ^2 I)^{-1} W_MLE^T (x − μ) + μ
= W_MLE Λ^{-1} W_MLE^T (x − μ) + μ
= Q (Λ − σ^2 I)^{1/2} Λ^{-1} (Λ − σ^2 I)^{1/2} Q^T (x − μ) + μ
= Q diag((λ_1 − σ^2)/λ_1, ..., (λ_d − σ^2)/λ_d) Q^T (x − μ) + μ

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
Now, as σ^2 → 0, then (W_MLE E[Z | x] + μ) → (Q Q^T (x − μ) + μ)
Thus each data point is approximated by a mapping into a linear subspace defined by the eigenvectors of S, given by Q, such that each point is orthogonally projected into this subspace
...Just as in PCA, where the coordinate along each principal axis is q_i^T (x − μ)

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Connection with PCA
But for σ^2 > 0, each projected coordinate is scaled by (λ_i − σ^2)/λ_i < 1, thus the projection is shrunk towards μ:
\frac{λ_i − σ^2}{λ_i} q_i^T (x − μ)
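
To illustrate the shrinkage and the σ^2 → 0 limit numerically, here is a sketch (toy data and illustrative settings, not from the slides) comparing the PPCA posterior-mean reconstruction with the orthogonal PCA projection:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data (illustrative): build the top-d eigenpairs of the sample covariance.
n, m, d = 400, 4, 2
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, m)) + 0.2 * rng.normal(size=(n, m))
mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu) / n
vals, vecs = np.linalg.eigh(S)
lam, Q = vals[::-1], vecs[:, ::-1]

def ppca_reconstruction(x, sigma2):
    """W_MLE E[Z|x] + mu for a chosen noise level sigma2 (taking R = I)."""
    W = Q[:, :d] @ np.diag(np.sqrt(lam[:d] - sigma2))
    M = W.T @ W + sigma2 * np.eye(d)
    return W @ np.linalg.solve(M, W.T @ (x - mu)) + mu

x = X[0]
pca_proj = Q[:, :d] @ Q[:, :d].T @ (x - mu) + mu     # orthogonal PCA projection
for frac in (0.9, 0.5, 1e-6):
    sigma2 = frac * lam[d - 1]   # keep sigma2 below the retained eigenvalues
    # PPCA shrinks towards mu; the gap to the PCA projection vanishes as sigma2 -> 0.
    print(round(sigma2, 4), np.linalg.norm(ppca_reconstruction(x, sigma2) - pca_proj))
```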

Linear Dimensionality Reduction & Probabilistic PCA
PPCA: Setting d
Just as in the Mixture of Gaussians model for clustering, we can use our generative model to select d in a principled way:
Evaluate the likelihood of the data on a validation set for various settings of d and select the one which gives rise to the maximal likelihood
A Bayesian treatment of PPCA or the maximisation of the PAC bound offer different resolutions to this problem

Linear Dimensionality Reduction & Probabilistic PCA
Where PCA Fails
One of the restrictions of PCA is that it assumes the data lies on or near a linear subspace
What happens if that assumption fails?

Non-Linear Dimensionality Reduction
Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction

Non-Linear Dimensionality Reduction
From Subspaces to (Sub)Manifolds

Non-Linear Dimensionality Reduction
Manifold Learning Techniques
Kernel PCA
Locally Linear Embedding (LLE)
Autoencoder

Lecture Overview
1 Lecture Overview
2 Introduction
3 Linear Dimensionality Reduction & PCA
4 Linear Dimensionality Reduction & Probabilistic PCA
5 Non-Linear Dimensionality Reduction

Summary
1 Dimensionality Reduction falls into two broad categories, depending on whether the low-dimensional space we are mapping to is linear or non-linear
2 Principal Component Analysis is a linear technique for reducing the dimensionality of the data by projecting it onto the maximum covariance subspace
3 Manifold learning techniques allow us to uncover the non-linear low-dimensional structure in high-dimensional data