
Statistical Machine Learning

©2020 Ong & Walder & Webers
Data61 | CSIRO
The Australian National University

Outlines

Overview
Introduction
Linear Algebra

Probability

Linear Regression 1

Linear Regression 2

Linear Classification 1

Linear Classification 2

Kernel Methods
Sparse Kernel Methods

Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis

Autoencoders
Graphical Models 1

Graphical Models 2

Graphical Models 3

Sampling

Sequential Data 1

Sequential Data 2


Statistical Machine Learning

Christian Walder

Machine Learning Research Group
CSIRO Data61

and

College of Engineering and Computer Science
The Australian National University

Canberra
Semester One, 2020.

(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)


Part IX

Principal Component Analysis


Motivation: Pre-training Deep Neural Networks

Empirical observations – pre 2006:
Deep architectures get stuck in local minima or plateaus
As the architecture gets deeper, it becomes more difficult to obtain good generalisation
Hard to initialise random weights well
1 or 2 hidden layers seem to perform better
2006: Unsupervised pre-training of each layer; deeper
models possible

Usually based on auto-encoders (tomorrow’s lecture)
Similar in spirit to PCA (today’s lecture)


Motivation: Exploratory Analysis

Given a dataset of numerical features:
Low dimensional data may be easy to plot
High dimensional data is challenging
Dimensionality reduction (e.g. PCA)

Try to explain with fewer dimensions
Enables visualisation
The new basis may yield insights
Aside: can simplify/speed up subsequent analysis e.g.
regression


Supervised Learning

We are given pairs of data xi ∈ X and targets ti ∈ T in the
form (xi, ti), where i = 1, . . . , N.
Learn a mapping between the data X and the target t
which generalises well to new data.


Unsupervised Learning

Given only the data xi ∈ X .
Discover (=learn) some interesting structure inherent in
the data X.



Testing – Supervised versus Unsupervised Learning


Recall: Fisher’s Linear Discriminant

Samples from two classes in a two-dimensional input space
and their histogram when projected to two different
one-dimensional spaces.



Eigenvectors

Every square matrix $A \in \mathbb{R}^{n \times n}$ has at least one eigenvalue/eigenvector pair satisfying
$$ A x = \lambda x, $$
where $x \in \mathbb{C}^n$, $x \neq 0$, and $\lambda \in \mathbb{C}$.
Example:
$$ \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} x = \lambda x,
\qquad \lambda \in \{-\imath, \imath\}, \qquad
x \in \left\{ \begin{bmatrix} \imath \\ 1 \end{bmatrix}, \begin{bmatrix} -\imath \\ 1 \end{bmatrix} \right\}. $$
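A minimal sketch of this example, assuming NumPy is available; np.linalg.eig returns the eigenvalues and (as columns) the eigenvectors, which here are complex:

```python
# Verify the example above: the matrix [[0, 1], [-1, 0]] has eigenvalues +/- i.
import numpy as np

A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are the eigenvectors
print(eigvals)                        # approximately [0.+1.j, 0.-1.j]

# Check A x = lambda x for each eigenvalue/eigenvector pair.
for lam, x in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ x, lam * x)
```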


Eigenvectors

How many eigenvalue/eigenvector pairs?

Ax = λx

is equivalent to
(A− λI)x = 0

This has a non-trivial solution only if det{A − λI} = 0:
a polynomial of degree n in λ, hence at most n distinct solutions.
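A small sketch, assuming NumPy: np.poly returns the coefficients of the characteristic polynomial of a square matrix, whose roots are the eigenvalues (here with a repeated root, so fewer than n distinct values):

```python
# The characteristic polynomial det(lambda*I - A) has degree n, so a matrix has
# at most n distinct eigenvalues; repeated roots give fewer distinct values.
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 3.0]])   # eigenvalues 2 (twice) and 3

coeffs = np.poly(A)               # [1., -7., 16., -12.], i.e. l^3 - 7 l^2 + 16 l - 12
print(np.roots(coeffs))           # approximately [3., 2., 2.]
```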


Real Eigenvalues

How can we enforce real eigenvalues?
Let’s look at matrices with complex entries A ∈ Cn×n.
Transposition is replaced by the Hermitian adjoint (conjugate transpose), e.g.
$$ \begin{bmatrix} 1 + \imath 2 & 3 + \imath 4 \\ 5 + \imath 6 & 7 + \imath 8 \end{bmatrix}^H
= \begin{bmatrix} 1 - \imath 2 & 5 - \imath 6 \\ 3 - \imath 4 & 7 - \imath 8 \end{bmatrix}. $$
Denote the complex conjugate of a complex number $\lambda$ by $\bar{\lambda}$.


Real Eigenvalues

How can we enforce real eigenvalues?
Let’s assume $A \in \mathbb{C}^{n \times n}$ is Hermitian ($A^H = A$).
Calculate, for an eigenvector $x \in \mathbb{C}^n$ of $A$ with eigenvalue $\lambda$,
$$ x^H A x = \lambda\, x^H x. $$
Another way to calculate $x^H A x$:
$$
\begin{aligned}
x^H A x &= x^H A^H x && (A \text{ is Hermitian}) \\
        &= (x^H A x)^H && (\text{reverse order}) \\
        &= (\lambda\, x^H x)^H && (\text{eigenvalue equation}) \\
        &= \bar{\lambda}\, x^H x,
\end{aligned}
$$
and therefore $\lambda = \bar{\lambda}$, i.e. $\lambda$ is real.

If A is Hermitian, then all eigenvalues are real.
Special case: If A has only real entries and is symmetric,
then all eigenvalues are real.
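A quick sketch of the special case, assuming NumPy: for a real symmetric matrix, np.linalg.eigh exploits the Hermitian structure and returns real eigenvalues:

```python
# Real symmetric (hence Hermitian) matrices have real eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                     # symmetrise to obtain a real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(A)  # eigh assumes a Hermitian/symmetric input
print(eigvals.dtype)                  # float64 -- all eigenvalues are real
assert np.allclose(A @ eigvecs, eigvecs * eigvals)
```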


Singular Value Decomposition

Every matrix $A \in \mathbb{R}^{n \times p}$ can be decomposed into a product of three matrices
$$ A = U \Sigma V^T, $$
where $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{p \times p}$ are orthogonal matrices ($U^T U = I$ and $V^T V = I$), and $\Sigma \in \mathbb{R}^{n \times p}$ has nonnegative numbers on the diagonal.
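A minimal sketch of the decomposition, assuming NumPy; the matrix sizes below are arbitrary examples:

```python
# Compute A = U Sigma V^T and check the stated properties.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))                   # n = 5, p = 3

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # s holds the singular values
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)                        # embed them on the diagonal of an n x p matrix

assert np.allclose(U @ Sigma @ Vt, A)             # reconstructs A
assert np.allclose(U.T @ U, np.eye(5))            # U orthogonal
assert np.allclose(Vt @ Vt.T, np.eye(3))          # V orthogonal
```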


Linear Curve Fitting – Least Squares

$N = 10$, with inputs $\mathbf{x} \equiv (x_1, \dots, x_N)^T$ and targets $\mathbf{t} \equiv (t_1, \dots, t_N)^T$, where $x_i \in \mathbb{R}$ and $t_i \in \mathbb{R}$ for $i = 1, \dots, N$.
Model: $y(x, \mathbf{w}) = w_1 x + w_0$, with design matrix $X \equiv [\mathbf{x}\ \ \mathbf{1}]$.
Least-squares solution:
$$ \mathbf{w}^* = (X^T X)^{-1} X^T \mathbf{t}. $$
[Figure: scatter plot of the $N = 10$ data points, $x$ on the horizontal axis and $t$ on the vertical axis.]
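A short sketch of this fit, assuming NumPy; the synthetic data below is illustrative, and np.linalg.lstsq is used instead of forming the normal equations explicitly:

```python
# Fit y(x, w) = w1 * x + w0 by least squares, w* = (X^T X)^{-1} X^T t.
import numpy as np

rng = np.random.default_rng(2)
N = 10
x = np.linspace(0.0, 10.0, N)
t = 0.7 * x + 1.0 + 0.5 * rng.standard_normal(N)   # hypothetical noisy targets

X = np.column_stack([x, np.ones(N)])               # design matrix X = [x 1]
w_star, *_ = np.linalg.lstsq(X, t, rcond=None)     # numerically safer than inverting X^T X
w1, w0 = w_star
print(w1, w0)
```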


Inverse of a matrix

Assume a full-rank symmetric real matrix A.
Then $A = U^T \Lambda U$, where
$\Lambda$ is a diagonal matrix of the (real) eigenvalues, and
$U$ is orthogonal ($U^T U = I$) and contains the eigenvectors.
$$
\begin{aligned}
A^{-1} &= (U^T \Lambda U)^{-1} \\
       &= U^{-1} \Lambda^{-1} U^{-T} && \text{(the inverse reverses the order)} \\
       &= U^T \Lambda^{-1} U && (U^T U = I).
\end{aligned}
$$
The inverse of a diagonal matrix is the diagonal matrix of the reciprocals of its elements.
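A small sketch, assuming NumPy. Note that np.linalg.eigh returns the decomposition in the arrangement $A = U \Lambda U^T$ (eigenvectors as columns of U), so the inverse is $U \Lambda^{-1} U^T$:

```python
# Invert a full-rank symmetric matrix via its eigendecomposition.
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                    # symmetric, full rank

lam, U = np.linalg.eigh(A)                 # A = U diag(lam) U^T
A_inv = U @ np.diag(1.0 / lam) @ U.T       # invert the diagonal elementwise

assert np.allclose(A_inv, np.linalg.inv(A))
```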


Dimensionality Reduction

Main goal of Principal Component Analysis:
dimensionality reduction
Many applications in visualisation, feature extraction,
signal processing, data compression . . .
Example: Use hand-written digits (binary data) and place
them into a larger frame (100 × 100) varying the position
and the rotation angle.
Dimensionality of the data space = 10 000.
But the data live on a three-dimensional manifold (x, y, and the rotation angle).
FYI only: this manifold is not linear and requires bleeding-edge models like capsule networks (Hinton 2017); still, we can locally approximate it with PCA.


Principal Component Analysis (PCA)

Idea: Linearly project the data points onto a lower
dimensional subspace such that

the variance of the projected data is maximised, or
the distortion error from the projection is minimised.

Both formulations lead to the same result.
Need to find the lower dimensional subspace, called the
principal subspace.

[Figure: data points $x_n$ and their projections $\tilde{x}_n$ onto the principal direction $u_1$.]


Principal Component Analysis (PCA)

Given N observations xn ∈ RD, n = 1, . . . ,N.
Project onto a space with dimensionality M < D while maximising the variance.
More advanced: how to calculate M from the data. Therefore, here M is fixed.
Consider a 1-dimensional subspace spanned by some unit vector $u_1 \in \mathbb{R}^D$, $u_1^T u_1 = 1$.
[Figure: as before, data points $x_n$ and their projections $\tilde{x}_n$ onto $u_1$.]

PCA - Maximise Variance

Each data point $x_n$ is projected onto a scalar value $u_1^T x_n$.
The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean
$$ \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n. $$
The variance of the projected data is then
$$ \frac{1}{N} \sum_{n=1}^{N} \left\{ u_1^T x_n - u_1^T \bar{x} \right\}^2 = u_1^T S u_1 $$
with the covariance matrix
$$ S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T. $$

PCA - Maximise Variance

Maximising $u_1^T S u_1$ under the constraint $u_1^T u_1 = 1$ (why do we need to bound $u_1$?) leads to the Lagrangian
$$ L(u_1, \lambda_1) = u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1), $$
which has a stationary point if $u_1$ is an eigenvector of $S$ with eigenvalue $\lambda_1$:
$$ S u_1 = \lambda_1 u_1. $$
The variance is then $u_1^T S u_1 = \lambda_1$.
The variance is maximised if $u_1$ is the eigenvector of the covariance $S$ with the largest eigenvalue.

PCA - Maximise Variance

Continue maximising the variance amongst all directions orthogonal to those already considered.
The optimal linear projection onto an M-dimensional space for which the variance is maximised is defined by the M eigenvectors $u_1, \dots, u_M$ of the covariance matrix $S$ corresponding to the M largest eigenvalues $\lambda_1, \dots, \lambda_M$.
Is this subspace always uniquely defined? Not if $\lambda_M = \lambda_{M+1}$.
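A compact sketch of this procedure, assuming NumPy: form the sample covariance, take the eigenvectors with the largest eigenvalues via np.linalg.eigh, and project the centred data onto them (the function and variable names are illustrative):

```python
# PCA by eigendecomposition of the covariance matrix.
import numpy as np

def pca(X, M):
    x_bar = X.mean(axis=0)                  # sample mean
    Xc = X - x_bar                          # centred data
    S = Xc.T @ Xc / X.shape[0]              # covariance matrix (1/N convention)
    lam, U = np.linalg.eigh(S)              # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:M]         # pick the M largest eigenvalues
    return lam[idx], U[:, idx], x_bar       # variances, principal directions, mean

# Usage: project onto the first two principal components.
X = np.random.default_rng(4).standard_normal((100, 5))
lam, U_M, x_bar = pca(X, M=2)
Z = (X - x_bar) @ U_M                       # N x M matrix of projected coordinates
```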
PCA - Minimise Distortion Error

The distortion between the data points $x_n$ and their projections $\tilde{x}_n$,
$$ J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2, $$
is minimised if the variance is maximised. The distortion error is then
$$ J = \sum_{i=M+1}^{D} \lambda_i, $$
where $\lambda_i$, $i = M+1, \dots, D$, are the smallest eigenvalues of the covariance matrix $S$.
In signal processing we speak of the signal space (the principal subspace) and the noise space (orthogonal to the principal subspace).

PCA - Applications

The eigenvectors of the covariance matrix are elements of the original vector space, $u_i \in \mathbb{R}^D$. If the input data are images, the eigenvectors are also images.
[Figure: the mean and the first four eigenvectors $u_1, \dots, u_4$ of a set of handwritten digits of ‘three’, with eigenvalues $\lambda_1 = 3.4 \cdot 10^5$, $\lambda_2 = 2.8 \cdot 10^5$, $\lambda_3 = 2.4 \cdot 10^5$, $\lambda_4 = 1.6 \cdot 10^5$. Blue corresponds to positive values, white to zero and yellow to negative values.]

PCA - Applications

The eigenvalues of the covariance matrix express the variance of the data set in the direction of the corresponding eigenvectors.
[Figure: plot of the eigenvalue spectrum $\lambda_i$ against $i$ for the digits-of-three data set.]

PCA - Applications

The sum of the eigenvalues of the covariance matrix over the discarded directions expresses the distortion error
$$ J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2. $$
[Figure: plot of the distortion error $J$ versus the number of dimensions $M$ of the subspace used for projection.]

PCA - Compression

The approximated data vector $\tilde{x}_n$ can be written in the form
$$ \tilde{x}_n = \bar{x} + \sum_{i=1}^{M} \left( u_i^T (x_n - \bar{x}) \right) u_i. $$
Codebook: $M + 1$ vectors of dimension $D$ ($\bar{x}$ and the $u_i$).
Compressed $x_n$: the $M$ coefficients $u_i^T (x_n - \bar{x})$.
[Figure: reconstruction of a digit image retaining M principal components, for the original image and M = 1, 10, 50, 250.]
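A brief sketch of this compression and reconstruction, assuming NumPy; the orthonormal directions and data below are illustrative stand-ins for the principal directions $u_i$ and mean $\bar{x}$ computed earlier:

```python
# Keep M coefficients per point, then rebuild x~_n = x_bar + sum_i z_ni u_i.
import numpy as np

def compress(X, U_M, x_bar):
    return (X - x_bar) @ U_M                          # N x M coefficients u_i^T (x_n - x_bar)

def reconstruct(Z, U_M, x_bar):
    return x_bar + Z @ U_M.T                          # approximated data vectors x~_n

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 20))
U_M, _ = np.linalg.qr(rng.standard_normal((20, 3)))   # 3 illustrative orthonormal directions
x_bar = X.mean(axis=0)

Z = compress(X, U_M, x_bar)
X_tilde = reconstruct(Z, U_M, x_bar)
print(np.mean(np.sum((X - X_tilde) ** 2, axis=1)))    # distortion J
```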
PCA - Data Preprocessing

Standardise certain features of a data set (for instance as a preprocessing step for subsequent algorithms that expect such features).
Usually, individual standardisation: each variable (dimension) has zero mean and unit variance. But the variables are still correlated.
PCA can do more: create decorrelated data whose covariance is the identity (also called whitening or sphering of the data).
Write the eigenvector equation for the covariance matrix $S$ as
$$ S U = U L, $$
where $L$ is the diagonal matrix of (positive!) eigenvalues.
Transform the original data by
$$ y_n = L^{-1/2}\, U^T (x_n - \bar{x}). $$
The set $\{y_n\}$ has zero mean and covariance given by the identity.

PCA - Data Preprocessing

Transform the original data by $y_n = L^{-1/2}\, U^T (x_n - \bar{x})$.
Mean of the set $\{y_n\}$:
$$ \frac{1}{N} \sum_{n=1}^{N} y_n = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x}) = L^{-1/2} U^T \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x}) = 0. $$
Covariance of the set $\{y_n\}$:
$$ \frac{1}{N} \sum_{n=1}^{N} y_n y_n^T = \frac{1}{N} \sum_{n=1}^{N} L^{-1/2} U^T (x_n - \bar{x})(x_n - \bar{x})^T U L^{-1/2} = L^{-1/2} U^T S U L^{-1/2} = L^{-1/2} U^T U L L^{-1/2} = I. $$

PCA - The Effect of Whitening

Compare standardising and whitening of a data set. Panel (b) also shows the principal axes of the normalised data set, plotted as red lines over the range $\pm \lambda_i^{1/2}$.
[Figure: (a) the original data (note the different axes); (b) standardising to zero mean and unit variance; (c) whitening to achieve unit covariance.]
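A minimal sketch of the whitening transform described above, assuming NumPy; X is an illustrative N x D data matrix:

```python
# Whitening (sphering): y_n = L^{-1/2} U^T (x_n - x_bar) has identity covariance.
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((500, 3)) @ np.array([[2.0, 0.3, 0.0],
                                              [0.0, 1.0, 0.5],
                                              [0.0, 0.0, 0.2]])   # correlated data

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False, bias=True)       # covariance with the 1/N convention
lam, U = np.linalg.eigh(S)                   # S U = U L
Y = (X - x_bar) @ U / np.sqrt(lam)           # rows are y_n = L^{-1/2} U^T (x_n - x_bar)

print(np.allclose(np.cov(Y, rowvar=False, bias=True), np.eye(3)))  # identity covariance
```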
Extensions of PCA

Kernel PCA
Use $\Phi(x)$ as features, and express everything in terms of the kernel matrix $K$.
The covariance matrix $S$ and the (centred) kernel matrix $K$ have the same eigenvalues.
Probabilistic PCA
Explicitly model a latent variable $z \sim \mathcal{N}(z \mid 0, I)$.
The mean of the observed variable is given by $W z + \mu$.
Conditional distribution of the observed variable: $x \sim \mathcal{N}(x \mid W z + \mu, \sigma^2 I)$.

Independence versus Uncorrelatedness

Independence: $p(x_1, x_2) = p(x_1)\, p(x_2)$.
Uncorrelatedness (defined via a zero covariance): $\mathbb{E}[x_1 x_2] - \mathbb{E}[x_1]\, \mathbb{E}[x_2] = 0$.
Independence implies uncorrelatedness (prove it!). BUT uncorrelatedness does NOT imply independence.
Example: draw the pair $(x_1, x_2)$ with equal probability from the set $\{(0, 1), (0, -1), (1, 0), (-1, 0)\}$.
Then $x_1$ and $x_2$ are uncorrelated because $\mathbb{E}[x_1] = \mathbb{E}[x_2] = \mathbb{E}[x_1 x_2] = 0$.
But $x_1$ and $x_2$ are NOT independent:
$$ p(x_1 = 0, x_2 = -1) = \tfrac{1}{4} \neq p(x_1 = 0)\, p(x_2 = -1) = \tfrac{1}{2} \times \tfrac{1}{4}. $$

Independent Component Analysis - Overview

Assume we have K signals and K recordings, each recording containing a mixture of the signals.
‘Cocktail party’ problem: K people speak at the same time in a room, and K microphones pick up a mixture of what they say.
Given unknown source signals $S \in \mathbb{R}^{N \times K}$ and an unknown mixing matrix $A$, producing the observed data $X \in \mathbb{R}^{N \times K}$:
$$ X = S A. $$
Can we recover the original signals (blind source separation)? Yes, under the assumptions that
at most one of the signals is Gaussian distributed,
we don’t care about the amplitude (including the sign),
we don’t care about the order of the recovered signals,
we have at least as many observed mixtures as signals, and the matrix A has full rank and can be inverted.

Independent Component Analysis - Overview

Uncorrelated variables are not necessarily independent.
ICA maximises the statistical independence of the estimated components.
Find A in such a way that the columns of $S = X A^{-1}$ are maximally independent.
Several definitions of statistical independence are possible.
Central Limit Theorem: the distribution of a sum of independent random variables tends toward a Gaussian distribution (under certain conditions).
FastICA algorithm.
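A brief illustration of blind source separation in this setting, assuming NumPy and scikit-learn are available; the sources and mixing matrix below are synthetic examples, and FastICA recovers the signals only up to order, scale and sign:

```python
# Mix two non-Gaussian sources with a mixing matrix A, observe X = S A,
# then estimate the sources with FastICA.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(7)
n = 2000
t = np.linspace(0.0, 8.0, n)
S = np.column_stack([np.sign(np.sin(3 * t)),   # square wave (non-Gaussian)
                     rng.laplace(size=n)])     # heavy-tailed source (non-Gaussian)
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                     # hypothetical mixing matrix (full rank)
X = S @ A                                      # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                   # estimated sources, up to order/scale/sign
```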