
Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
© 2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outlines

Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part IX
Principal Component Analysis
Motivation
Eigenvectors
Singular Value Decomposition
Principal Component Analysis
Principal Component Analysis (PCA)
Independent Component Analysis

Motivation: Pre-training Deep Neural Networks
Empirical observations – pre 2006:
Deep architectures get stuck in local minima or plateaus
As the architecture gets deeper, it becomes more difficult to obtain good generalisation
Hard to initialise random weights well
1 or 2 hidden layers seem to perform better
2006: Unsupervised pre-training of each layer; deeper models possible
Usually based on auto-encoders (tomorrow’s lecture)
Similar in spirit to PCA (today’s lecture)

Motivation: Exploratory Analysis
Given a dataset of numerical features:
Low dimensional data may be easy to plot
High dimensional data is challenging
Dimensionality reduction (e.g. PCA):
Try to explain with fewer dimensions
Enables visualisation
The new basis may yield insights
Aside: can simplify/speed up subsequent analysis e.g. regression

Supervised Learning
Given are pairs of data x_i ∈ X and targets t_i ∈ T in the form (x_i, t_i), where i = 1, ..., N.
Learn a mapping between the data X and the target t which generalises well to new data.

Unsupervised Learning
Given only the data x_i ∈ X.
Discover (=learn) some interesting structure inherent in the data X.


Testing – Supervised versus Unsupervised Learning

Recall: Fisher’s Linear Discriminant
Samples from two classes in a two-dimensional input space and their histogram when projected to two different one-dimensional spaces.

Eigenvectors
Every square matrix A ∈ R^{n×n} has eigenvalue/eigenvector pairs satisfying
$$ A x = \lambda x, $$
where x ∈ C^n and λ ∈ C. Example:
$$ \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} x = \lambda x,
\qquad \lambda \in \{-i,\; i\}, \qquad
x \in \left\{ \begin{bmatrix} i \\ 1 \end{bmatrix},\ \begin{bmatrix} -i \\ 1 \end{bmatrix} \right\}. $$
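A quick numerical check of this example (a minimal sketch assuming NumPy; note that np.linalg.eig normalises the eigenvectors, so they match (±i, 1)^T only up to a scalar factor):

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors
print(eigvals)                        # purely imaginary: approximately [0.+1.j, 0.-1.j]

# Check A x = lambda x for each pair (up to numerical precision).
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)
```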

Eigenvectors
How many eigenvalue/eigenvector pairs?
$$ A x = \lambda x $$
is equivalent to
$$ (A - \lambda I) x = 0, $$
which has a non-trivial solution only for det(A − λI) = 0, a polynomial of order n in λ; hence there are at most n distinct solutions (eigenvalues).
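A small sketch (assuming NumPy, with an arbitrary 2 × 2 example matrix) illustrating that the eigenvalues are the roots of this nth-order polynomial:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

coeffs = np.poly(A)            # coefficients of the characteristic polynomial of A
roots = np.roots(coeffs)       # at most n roots for an n x n matrix

print(np.sort(roots))                   # [1. 3.]
print(np.sort(np.linalg.eigvals(A)))    # [1. 3.] -- the same values
```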

Real Eigenvalues
How can we enforce real eigenvalues?
Let’s look at matrices with complex entries A ∈ C^{n×n}. Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.
$$ \begin{bmatrix} 1+2i & 3+4i \\ 5+6i & 7+8i \end{bmatrix}^H
 = \begin{bmatrix} 1-2i & 5-6i \\ 3-4i & 7-8i \end{bmatrix}. $$
Denote the complex conjugate of a complex number λ by $\bar{\lambda}$.

Real Eigenvalues
How can we enforce real eigenvalues? Let’s assume A ∈ C^{n×n} is Hermitian (A^H = A). For an eigenvector x ∈ C^n of A with eigenvalue λ, calculate
$$ x^H A x = \lambda x^H x. $$
Another way to calculate x^H A x:
$$ \begin{aligned}
x^H A x &= x^H A^H x && \text{(A is Hermitian)} \\
        &= (x^H A x)^H && \text{(a scalar; reverse the order)} \\
        &= (\lambda x^H x)^H = \bar{\lambda}\, x^H x && \text{(eigenvalue equation)}
\end{aligned} $$
and therefore $\lambda = \bar{\lambda}$, i.e. λ is real.
If A is Hermitian, then all eigenvalues are real.
Special case: If A has only real entries and is symmetric, then all eigenvalues are real.
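A brief numerical illustration (assuming NumPy; the Hermitian matrix is an arbitrary example). np.linalg.eigvalsh exploits the Hermitian/symmetric structure and returns real eigenvalues:

```python
import numpy as np

A = np.array([[2.0, 1 + 2j],
              [1 - 2j, 3.0]])
assert np.allclose(A, A.conj().T)   # A is Hermitian: equal to its conjugate transpose

print(np.linalg.eigvalsh(A))        # real eigenvalues, returned in ascending order
```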

Singular Value Decomposition
Every matrix A ∈ Rn×p can be decomposed into a product of three matrices
$$ A = U \Sigma V^T $$
where U ∈ R^{n×n} and V ∈ R^{p×p} are orthogonal matrices (U^T U = I and V^T V = I), and Σ ∈ R^{n×p} has nonnegative numbers on the diagonal.
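A sketch using NumPy’s np.linalg.svd on a random matrix (chosen only for illustration), confirming the shapes and properties stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                        # n = 5, p = 3

U, s, Vt = np.linalg.svd(A, full_matrices=True)    # U: 5x5, s: singular values, Vt: 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)                         # embed the singular values in an n x p matrix

assert np.allclose(A, U @ Sigma @ Vt)              # A = U Sigma V^T
assert np.all(s >= 0)                              # nonnegative diagonal entries
```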

Linear Curve Fitting – Least Squares
N = 10 data points with inputs x ≡ (x_1, ..., x_N)^T, x_i ∈ R, and targets t ≡ (t_1, ..., t_N)^T, t_i ∈ R, for i = 1, ..., N.
Straight-line model y(x, w) = w_1 x + w_0 with design matrix X ≡ [x 1].
Least-squares solution:
$$ w^* = (X^T X)^{-1} X^T t. $$
[Figure: the targets t plotted against the inputs x, with the fitted straight line.]
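A minimal NumPy sketch of this fit on synthetic data (the true weights 0.5 and 1.0 and the noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10)                        # N = 10 inputs
t = 0.5 * x + 1.0 + rng.normal(scale=0.3, size=10)     # noisy synthetic targets

X = np.column_stack([x, np.ones_like(x)])              # design matrix [x 1]
w_star = np.linalg.solve(X.T @ X, X.T @ t)             # normal equations (or np.linalg.lstsq)
print(w_star)                                          # approximately [0.5, 1.0]
```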

Inverse of a matrix
Assume a full-rank symmetric real matrix A. Then A = U^T Λ U, where
Λ is a diagonal matrix with the (real) eigenvalues
U is orthogonal and contains the eigenvectors
$$ \begin{aligned}
A^{-1} &= (U^T \Lambda U)^{-1} \\
       &= U^{-1} \Lambda^{-1} U^{-T} && \text{(inversion reverses the order)} \\
       &= U^T \Lambda^{-1} U && (U^T U = I)
\end{aligned} $$
The inverse of a diagonal matrix is the diagonal matrix of the reciprocals of its diagonal elements.
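A quick numerical confirmation of this identity (a sketch assuming NumPy; the matrix is an arbitrary illustrative example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                 # symmetric, full rank

lam, V = np.linalg.eigh(A)                 # columns of V are eigenvectors: A = V diag(lam) V^T
U = V.T                                    # so A = U^T Lambda U in the slide's notation

A_inv = U.T @ np.diag(1.0 / lam) @ U       # U^T Lambda^{-1} U
assert np.allclose(A_inv, np.linalg.inv(A))
```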

Dimensionality Reduction
Main goal of Principal Component Analysis:
dimensionality reduction
Many applications in visualisation, feature extraction, signal processing, data compression . . .
Example: Use hand-written digits (binary data) and place them into a larger frame (100 × 100) varying the position and the rotation angle.
The dimensionality of the data space is 10 000 (100 × 100 pixels).
But data live on a three-dimensional manifold (x, y, and the rotation angle).
FYI only: this manifold is not linear and requires bleeding edge models like capsule networks (Hinton 2017); still we can locally approximate with PCA.

Principal Component Analysis (PCA)
Idea: linearly project the data points onto a lower dimensional subspace such that
the variance of the projected data is maximised, or
the distortion error from the projection is minimised.
Both formulations lead to the same result.
Need to find the lower dimensional subspace, called the principal subspace.
[Figure: data points x_n in the (x_1, x_2) plane and their projections x̃_n onto the principal direction u_1.]

Principal Component Analysis (PCA)
Given N observations xn ∈ RD, n = 1,…,N.
Project onto a space with dimensionality M < D while maximising the variance.
More advanced: how to calculate M from the data. Here, therefore, M is fixed.
Consider a 1-dimensional subspace spanned by some unit vector u_1 ∈ R^D, u_1^T u_1 = 1.
[Figure: data points x_n in the (x_1, x_2) plane and their projections x̃_n onto the direction u_1.]

PCA - Maximise Variance

Each data point x_n is then projected onto a scalar value u_1^T x_n.
The mean of the projected data is u_1^T x̄, where x̄ is the sample mean
$$ \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n. $$
The variance of the projected data is then
$$ \frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1 $$
with the covariance matrix
$$ S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T. $$
Maximising u_1^T S u_1 under the constraint u_1^T u_1 = 1 (why do we need to bound u_1?) leads to the Lagrangian
$$ L(u_1, \lambda_1) = u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1), $$
which has a stationary point if u_1 is an eigenvector of S with eigenvalue λ_1:
$$ S u_1 = \lambda_1 u_1. $$
The variance is then u_1^T S u_1 = λ_1.
The variance is maximised if u_1 is the eigenvector of the covariance S with the largest eigenvalue.
Continue maximising the variance amongst all possible directions orthogonal to those already considered.
The optimal linear projection onto an M-dimensional space for which the variance is maximised is defined by the M eigenvectors u_1, ..., u_M of the covariance matrix S corresponding to the M largest eigenvalues λ_1, ..., λ_M.
Is this subspace always uniquely defined? Not if λ_M = λ_{M+1}.
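The variance-maximisation recipe above can be summarised in a few lines of NumPy. This is a minimal sketch; the helper name pca and the toy data are illustrative, not from the lecture:

```python
import numpy as np

def pca(X, M):
    """Return the sample mean, the M leading eigenvectors of the covariance S,
    and the corresponding eigenvalues (largest first)."""
    x_bar = X.mean(axis=0)                      # sample mean
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]                  # covariance matrix S
    lam, U = np.linalg.eigh(S)                  # eigh: S is symmetric, eigenvalues are real
    order = np.argsort(lam)[::-1][:M]           # indices of the M largest eigenvalues
    return x_bar, U[:, order], lam[order]

# Toy usage: correlated 2-D data, keep the single direction of maximal variance.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])
x_bar, U_M, lam_M = pca(X, M=1)
print(lam_M)   # variance captured along u_1 (the largest eigenvalue of S)
```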
PCA - Minimise Distortion Error

The distortion between the data points x_n and their projections x̃_n,
$$ J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2, $$
is minimised exactly when the variance is maximised. The distortion error is then
$$ J = \sum_{i=M+1}^{D} \lambda_i, $$
where λ_i, i = M+1, ..., D are the smallest eigenvalues of the covariance matrix S.
In signal processing we speak of the signal space (the principal subspace) and the noise space (orthogonal to the principal subspace).

PCA - Applications

The eigenvectors of the covariance matrix are elements of the original vector space, u_i ∈ R^D. If the input data are images, the eigenvectors are also images.
[Figure: the mean and the first four eigenvectors u_1, ..., u_4 of a set of handwritten digits of ‘three’, with eigenvalues λ_1 = 3.4·10^5, λ_2 = 2.8·10^5, λ_3 = 2.4·10^5, λ_4 = 1.6·10^5. Blue corresponds to positive values, white is zero and yellow corresponds to negative values.]
The eigenvalues of the covariance matrix express the variance of the data set in the direction of the corresponding eigenvectors.
[Figure (a): plot of the eigenvalue spectrum λ_i for the digit ‘three’ data set.]
The sum of the eigenvalues of the discarded directions expresses the distortion error
$$ J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2. $$
[Figure (b): plot of the distortion error J versus the number of dimensions M of the subspace considered for projection.]

PCA - Compression

The approximated data vector x̃_n can be written in the form
$$ \tilde{x}_n = \bar{x} + \sum_{i=1}^{M} \left( u_i^T (x_n - \bar{x}) \right) u_i. $$
Codebook: M + 1 vectors of dimension D (x̄ and the u_i).
Compressed x_n: the M coefficients u_i^T (x_n − x̄).
[Figure: reconstruction of a digit image retaining M = 1, 10, 50, 250 principal components, compared with the original.]
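A sketch of this compress/reconstruct step, continuing the hypothetical pca() helper and toy data from the earlier sketch (function names are illustrative):

```python
import numpy as np

def compress(X, x_bar, U_M):
    """Coefficients u_i^T (x_n - x_bar) of each point in the principal subspace."""
    return (X - x_bar) @ U_M                    # shape (N, M)

def reconstruct(Z, x_bar, U_M):
    """Approximation x_tilde_n = x_bar + sum_i z_i u_i from the M coefficients."""
    return x_bar + Z @ U_M.T                    # shape (N, D)

# Usage with the toy data and pca() from the previous sketch:
# Z = compress(X, x_bar, U_M)
# X_tilde = reconstruct(Z, x_bar, U_M)
# The average distortion equals the sum of the discarded eigenvalues of S
# (up to numerical precision):
# J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))
```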
PCA - Data Preprocessing

Standardise certain features of a data set (for instance as a preprocessing step for subsequent algorithms expecting these features).
Usually, individual standardisation: each variable (dimension) has zero mean and unit variance. But the variables are still correlated.
PCA can do more: create decorrelated data (covariance equal to the identity; also called whitening or sphering of the data).
Write the eigenvector equation for the covariance matrix S as
$$ S U = U L, $$
where L is the diagonal matrix of (positive!) eigenvalues.
Transform the original data by
$$ y_n = L^{-1/2} U^T (x_n - \bar{x}). $$
The set {y_n} has zero mean and covariance given by the identity.
Mean of the set {y_n}:
$$ \frac{1}{N} \sum_{n=1}^{N} y_n
 = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x})
 = L^{-1/2} U^T \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x}) = 0. $$
Covariance of the set {y_n}:
$$ \frac{1}{N} \sum_{n=1}^{N} y_n y_n^T
 = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x})(x_n - \bar{x})^T U L^{-1/2}
 = L^{-1/2} U^T S U L^{-1/2}
 = L^{-1/2} U^T U L L^{-1/2}
 = I. $$

PCA - The Effect of Whitening

Compare standardising and whitening of a data set. Panel (b) also shows the principal axes of the normalised data set, plotted as red lines over the range ±λ_i^{1/2}.
[Figure: the original data (note the different axes); the data standardised to zero mean and unit variance; the data whitened to achieve unit covariance.]
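A short NumPy sketch of the whitening transform y_n = L^{-1/2} U^T (x_n − x̄), numerically checking the zero-mean and identity-covariance properties derived above (again using illustrative toy data, not code from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 1.0], [0.0, 0.5]])   # correlated data

x_bar = X.mean(axis=0)
S = (X - x_bar).T @ (X - x_bar) / X.shape[0]      # covariance matrix S
lam, U = np.linalg.eigh(S)                        # S U = U L with L = diag(lam)

Y = (X - x_bar) @ U @ np.diag(lam ** -0.5)        # row-wise y_n = L^{-1/2} U^T (x_n - x_bar)

print(Y.mean(axis=0))                             # approximately [0, 0]
print(Y.T @ Y / Y.shape[0])                       # approximately the identity matrix
```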
Extensions of PCA

Kernel PCA: use Φ(x) as features, and express everything in terms of the kernel matrix K. The covariance matrix S and the (centred) kernel matrix K have the same eigenvalues.
Probabilistic PCA: explicitly model a latent variable z ∼ N(z | 0, I). The mean of the observed variable is given by W z + μ, and the conditional distribution of the observed variable is x ∼ N(x | W z + μ, σ^2 I).

Independence versus Uncorrelatedness

Independence: p(x_1, x_2) = p(x_1) p(x_2).
Uncorrelatedness (defined via a zero covariance): E[x_1 x_2] − E[x_1] E[x_2] = 0.
Independence implies uncorrelatedness (prove it!). BUT uncorrelatedness does NOT imply independence.
Example: draw the pair (x_1, x_2) with equal probability from the set {(0, 1), (0, −1), (1, 0), (−1, 0)}.
Then x_1 and x_2 are uncorrelated because E[x_1] = E[x_2] = E[x_1 x_2] = 0.
But x_1 and x_2 are NOT independent:
$$ p(x_1 = 0, x_2 = -1) = \frac{1}{4}
 \neq p(x_1 = 0)\, p(x_2 = -1) = \frac{1}{2} \times \frac{1}{4}. $$

Independent Component Analysis - Overview

Assume we have K signals and K recordings, each recording containing a mixture of the signals.
‘Cocktail party’ problem: K people speak at the same time in a room, and K microphones pick up a mixture of what they say.
Given unknown source signals S ∈ R^{N×K} and an unknown mixing matrix A, producing the observed data X ∈ R^{N×K}:
$$ X = S A. $$
Can we recover the original signals (blind source separation)? Yes, under the assumptions that:
at most one of the signals is Gaussian distributed;
we don’t care about the amplitude (including the sign);
we don’t care about the order of the recovered signals;
we have at least as many observed mixtures as signals, and the matrix A has full rank and can be inverted.
Uncorrelated variables are not necessarily independent.
ICA maximises the statistical independence of the estimated components.
Find A in such a way that the columns of S = X A^{-1} are maximally independent.
Several definitions of statistical independence are possible.
Central Limit Theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions).
FastICA algorithm.
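As an illustration, scikit-learn’s FastICA can be used to attempt blind source separation on a toy cocktail-party mixture. A minimal sketch, assuming scikit-learn is available; the sources, mixing matrix and parameters are made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
N = 2000
time = np.linspace(0, 8, N)

# Two toy (non-Gaussian) source signals: a sine wave and a square wave.
S = np.column_stack([np.sin(2 * time),
                     np.sign(np.sin(3 * time))])   # S in R^{N x K}, K = 2

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                         # unknown mixing matrix
X = S @ A                                          # observed mixtures X = S A

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                       # estimated sources, recovered only up to
                                                   # sign, scale and permutation
```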