Statistical Machine Learning
©2020 Ong & Walder & Webers, Data61 | CSIRO, The Australian National University
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part X
Neural Network 3
Number of layers
The expression of a function is compact when it has few computational elements, i.e. few degrees of freedom that need to be tuned by learning
For a fixed number of training examples, we expect compact representations of the target function to yield better generalisation
Example representations
affine operations + sigmoid ⇒ logistic regression has depth 1, with a fixed number of units (a.k.a. neurons)
fixed kernel + affine operations ⇒ a kernel machine (e.g. SVM) has two levels, with as many units as data points
stacked layers of “linear transformation followed by a non-linearity” ⇒ a deep neural network has arbitrary depth, with an arbitrary number of units per layer
Easier to represent with more layers
An old result:
functions that can be compactly represented by a depth k
architecture might require an exponential number of
computational elements to be represented by a depth
k − 1 architecture
Example: the d-bit parity function
$$\mathrm{parity}\colon (b_1, \dots, b_d) \in \{0,1\}^d \mapsto \begin{cases} 1 & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\ 0 & \text{otherwise} \end{cases}$$
Theorem: d-bit parity circuits of depth 2 have exponential size
Analogous in modern deep learning:
“Shallow networks require exponentially more parameters
for the same number of modes” — Canadian deep
learning mafia.
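A minimal sketch of this parity map in Python (illustrative only, not part of the lecture material):

```python
# d-bit parity as defined above: 1 if the number of set bits is even, else 0.
def parity(bits):
    return 1 if sum(bits) % 2 == 0 else 0

assert parity([0, 0, 0]) == 1      # zero ones: even
assert parity([1, 0, 1, 1]) == 0   # three ones: odd
```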
Recall: Multi-layer Neural Network Architecture
$$y_k(\mathbf{x}, \mathbf{w}) = g\left( \sum_{j=0}^{M} w^{(2)}_{kj} \, h\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right) \right)$$
where w now contains all weight and bias parameters.
[Figure: two-layer network diagram with inputs x_0, x_1, …, x_D, hidden units z_0, z_1, …, z_M, outputs y_1, …, y_K, and weights w^{(1)}_{MD}, w^{(2)}_{KM}, w^{(2)}_{10}.]
We could add more hidden layers
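As a minimal illustration (not the lecture's reference code), the forward pass above can be written directly in NumPy; the weight shapes and the choice of tanh/identity activations are assumptions of this sketch:

```python
import numpy as np

def forward(x, W1, b1, W2, b2, h=np.tanh, g=lambda a: a):
    """Two-layer network: y_k(x, w) = g(sum_j w2_kj h(sum_i w1_ji x_i)).
    The biases b1, b2 play the role of the weights attached to x_0 = 1 and z_0 = 1."""
    z = h(W1 @ x + b1)      # hidden activations z_1 .. z_M
    y = g(W2 @ z + b2)      # outputs y_1 .. y_K
    return y

# Example with D=3 inputs, M=4 hidden units, K=2 outputs.
rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
print(forward(rng.normal(size=D), W1, b1, W2, b2))
```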
Empirical observations – pre 2006
Deep architectures get stuck in local minima or plateaus
As the architecture gets deeper, it becomes more difficult to obtain good generalisation
Hard to initialise random weights well
1 or 2 hidden layers seem to perform better
2006: unsupervised pre-training to find a distributed representation
Deep representation – intuition
Bengio, “Learning Deep Architectures for AI”, 2009
Deep representation – practice
AlexNet / VGG-F network visualized by mNeuron.
Recall: PCA
Idea: Linearly project the data points onto a lower
dimensional subspace such that
the variance of the projected data is maximised, or
the distortion error from the projection is minimised.
Both formulations lead to the same result.
Need to find the lower dimensional subspace, called the
principal subspace.
[Figure: two-dimensional data points x_n projected onto the principal subspace spanned by u_1, giving the projections x̃_n.]
Multiple PCA layers? – linear transforms
Principal Component Analysis is a linear transformation (because it is a projection)
The composite of two linear transformations is linear
Linear transformations $M : \mathbb{R}^m \to \mathbb{R}^n$ are matrices
Let S and T be matrices of appropriate dimensions such that ST is defined; then
$$ST(X + X') = ST(X) + ST(X')$$
and similarly for multiplication by a scalar
⇒ multiple PCA layers pointless
Multiple PCA layers? – projection
Let $X^T X = U \Lambda U^T$ be the eigenvalue decomposition of the covariance matrix (what is assumed about the mean?).
Define $U_k$ to be the matrix of the first k columns of U, for the k largest eigenvalues. Define $\Lambda_k$ similarly.
Consider the features formed by projecting onto the principal components
$$Z = X U_k$$
We perform PCA a second time, $Z^T Z = V \Lambda_Z V^T$.
By the definition of Z and $X^T X$, and the orthogonality of U,
$$Z^T Z = (X U_k)^T (X U_k) = U_k^T X^T X U_k = U_k^T U \Lambda U^T U_k = \Lambda_k$$
Hence $\Lambda_Z = \Lambda_k$ and V is the identity, therefore the second PCA has no effect
⇒ again, multiple PCA layers pointless
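A quick numerical check of this argument (a sketch assuming zero-mean data and NumPy's eigendecomposition; not part of the lecture material):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X -= X.mean(axis=0)                       # centre the data

# First PCA: eigendecomposition of X^T X, keep the top-k eigenvectors U_k.
lam, U = np.linalg.eigh(X.T @ X)
order = np.argsort(lam)[::-1]
k = 2
Uk = U[:, order[:k]]
Z = X @ Uk                                # projected features

# Second PCA on Z: Z^T Z is already diagonal (= Lambda_k), so its eigenvectors
# are standard basis vectors (up to sign and ordering) -- no new directions.
lamZ, V = np.linalg.eigh(Z.T @ Z)
print(np.round(Z.T @ Z, 6))               # ~diagonal
print(np.round(np.abs(V), 6))             # columns are ~standard basis vectors
```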
Autoencoder
An autoencoder is trained to encode the input x into some
representation c(x) so that the input can be reconstructed
from that representation
the target output of the autoencoder is the autoencoder
input itself
With one linear hidden layer and the mean squared error
criterion, the k hidden units learn to project the input in the
span of the first k principal components of the data
If the hidden layer is nonlinear, the autoencoder behaves
differently from PCA, with the ability to capture multimodal
aspects of the input distribution
Let f be the decoder. We want to minimise the
reconstruction error
$$\sum_{n=1}^{N} \ell\big(x_n, f(c(x_n))\big)$$
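A compact sketch of a one-hidden-layer autoencoder trained by gradient descent on the squared reconstruction error (NumPy, with an assumed sigmoid encoder and linear decoder; not the lecture's reference implementation):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))            # toy data: N=200 inputs of dimension d=10
d, k, lr = X.shape[1], 3, 0.01            # k-dimensional code, learning rate

W1, b1 = rng.normal(scale=0.1, size=(d, k)), np.zeros(k)   # encoder c(x)
W2, b2 = rng.normal(scale=0.1, size=(k, d)), np.zeros(d)   # decoder f(.)

for _ in range(500):
    Z = sigmoid(X @ W1 + b1)              # codes c(x_n)
    Xhat = Z @ W2 + b2                    # reconstructions f(c(x_n))
    R = Xhat - X                          # residuals
    # Backprop of the summed squared reconstruction error sum_n ||x_n - f(c(x_n))||^2
    gW2, gb2 = Z.T @ R, R.sum(axis=0)
    dZ = (R @ W2.T) * Z * (1 - Z)         # through the sigmoid encoder
    gW1, gb1 = X.T @ dZ, dZ.sum(axis=0)
    W1 -= lr * gW1 / len(X); b1 -= lr * gb1 / len(X)
    W2 -= lr * gW2 / len(X); b2 -= lr * gb2 / len(X)

print("mean squared reconstruction error:", np.mean(R ** 2))
```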
Cost function
Recall: f (c(x)) is the reconstruction produced by the
network
Minimisation of the negative log likelihood of the reconstruction, given the encoding c(x):
$$RE = -\log P(x \mid c(x))$$
If x|c(x) is Gaussian, we recover the familiar squared error
If the inputs $x_i$ are either binary or considered to be binomial probabilities, then the loss function is the cross entropy
$$-\log P(x \mid c(x)) = -\sum_i \big[ x_i \log f_i(c(x)) + (1 - x_i) \log(1 - f_i(c(x))) \big]$$
where $f_i(\cdot)$ is the ith component of the decoder
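A small sketch of this Bernoulli cross-entropy reconstruction loss (NumPy; the clipping constant is an assumption added for numerical safety):

```python
import numpy as np

def cross_entropy_reconstruction(x, f_of_c, eps=1e-12):
    """-log P(x | c(x)) for binary / binomial-probability inputs:
    -sum_i [ x_i log f_i(c(x)) + (1 - x_i) log(1 - f_i(c(x))) ]."""
    f = np.clip(f_of_c, eps, 1 - eps)     # avoid log(0)
    return -np.sum(x * np.log(f) + (1 - x) * np.log(1 - f))

x = np.array([1.0, 0.0, 1.0])
print(cross_entropy_reconstruction(x, np.array([0.9, 0.1, 0.8])))  # good reconstruction: small loss
print(cross_entropy_reconstruction(x, np.array([0.1, 0.9, 0.2])))  # bad reconstruction: large loss
```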
Undercomplete Autoencoder
Consider a small number of hidden units.
c(x) is viewed as a lossy compression of x
Cannot have small loss for all x, so focus on training
examples
Hope code c(x) is a distributed representation that
captures the main factors of variation in the data
Stacking autoencoders
Let $c_j$ and $f_j$ be the encoder and corresponding decoder of the jth layer
Let $z_j$ be the representation after the encoder $c_j$
We can define multiple layers of autoencoders recursively.
For example, let $z_1 = c_1(x)$ and $z_2 = c_2(z_1)$;
the corresponding decoding is given by $f_1(f_2(z_2))$
Because of non-linear activation functions, the latent feature $z_2$ can capture more complex patterns than $z_1$.
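The recursive composition could be sketched as follows (plain Python; each encoder/decoder here is a placeholder for an already-trained layer, not the lecture's code):

```python
# Sketch: stacking autoencoder layers (c_1, f_1), (c_2, f_2), ...
def stack_encode(x, encoders):
    """z_j = c_j(c_{j-1}(... c_1(x) ...))"""
    for c in encoders:
        x = c(x)
    return x

def stack_decode(z, decoders):
    """Decode in reverse order: f_1(f_2(... z ...))"""
    for f in reversed(decoders):
        z = f(z)
    return z

# Usage: x_hat = stack_decode(stack_encode(x, [c1, c2]), [f1, f2])
```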
Higher level image features – faces
codingplayground.blogspot.com
Pre-training supervised neural networks
Latent features $z_j$ in layer j can capture high-level patterns:
$$z_j = c_j(c_{j-1}(\cdots c_2(c_1(x)) \cdots))$$
These features may also be useful for supervised learning tasks.
In contrast to the feed-forward network, the features $z_j$ are constructed in an unsupervised fashion.
Discard the decoding layers, and directly use $z_j$ with a supervised training method, such as logistic regression.
Various such pre-trained networks are available online, e.g. VGG-19.
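A toy sketch of this "encode, then classify" pipeline (Python with scikit-learn assumed available; the pre-trained encoders are replaced by fixed random nonlinear maps purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy labels

# Stand-ins for pre-trained encoders c_1, c_2 (here just fixed random nonlinear maps).
W1, W2 = rng.normal(size=(20, 10)), rng.normal(size=(10, 5))
def encode(X):
    z1 = np.tanh(X @ W1)                  # z_1 = c_1(x)
    return np.tanh(z1 @ W2)               # z_2 = c_2(z_1)

# Discard the decoders; fit a supervised model on the codes z_2.
clf = LogisticRegression(max_iter=1000).fit(encode(X), y)
print("training accuracy:", clf.score(encode(X), y))
```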
Xavier Initialisation / ReLU
Layer-wise unsupervised pre-training helps by extracting
useful features for subsequent supervised backprop.
Pre-training also avoids saturation (large magnitude
arguments to, say, sigmoidal units).
Simpler Xavier initialization can also avoid saturation.
Let the inputs $x_i \sim \mathcal{N}(0, 1)$, weights $w_i \sim \mathcal{N}(0, \sigma^2)$ and activation $z = \sum_{i=1}^{m} x_i w_i$. Then:
$$\operatorname{VAR}[z] = \mathbb{E}[(z - \mathbb{E}[z])^2] = \mathbb{E}[z^2] = \mathbb{E}\Big[\Big(\sum_{i=1}^{m} x_i w_i\Big)^2\Big] = \sum_{i=1}^{m} \mathbb{E}[(x_i w_i)^2] = \sum_{i=1}^{m} \mathbb{E}[x_i^2]\,\mathbb{E}[w_i^2] = m\sigma^2.$$
So we set $\sigma = 1/\sqrt{m}$ to have “nice” activations.
Glorot initialization takes care to have nice
back-propagated signals — see the auto-encoder lab.
ReLU activations h(x) = max(x, 0) also help in practice.
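A quick empirical check of the $\sigma = 1/\sqrt{m}$ rule (a sketch; the layer width and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_samples = 512, 10_000
X = rng.normal(size=(n_samples, m))                  # inputs x_i ~ N(0, 1)

for sigma in (1.0, 1.0 / np.sqrt(m)):
    W = rng.normal(scale=sigma, size=(m,))           # weights w_i ~ N(0, sigma^2)
    z = X @ W                                        # activations z = sum_i x_i w_i
    print(f"sigma={sigma:.4f}  empirical VAR[z]={z.var():.3f}  (theory: {m * sigma**2:.3f})")
```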
Higher dimensional hidden layer
If there is no other constraint, then an autoencoder with d-dimensional input and an encoding of dimension at least d could potentially just learn the identity function
Avoid this by:
Regularisation
Early stopping of stochastic gradient descent
Adding noise to the encoding
A sparsity constraint on the code c(x)
Denoising autoencoder
Add noise to the input, keeping the clean example as the target output
The autoencoder tries to:
1 preserve the information of the input
2 undo the stochastic corruption process
Reconstruction log likelihood
$$-\log P(x \mid c(\hat{x}))$$
where x is the noise-free input and x̂ the corrupted version
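The corruption step is the only change relative to the plain autoencoder sketch earlier: noise is added to the input while the clean x stays as the target (a sketch; the Gaussian corruption and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(200, 10))                 # targets: the noise-free x
noise_level = 0.3                                    # assumed corruption strength
X_corrupted = X_clean + noise_level * rng.normal(size=X_clean.shape)   # inputs: x̂

# Training pairs for a denoising autoencoder: feed X_corrupted through the
# encoder/decoder, but measure the reconstruction error against X_clean,
# i.e. minimise sum_n || x_n - f(c(x̂_n)) ||^2  (or the -log P(x | c(x̂)) above).
```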
Image denoising
Images with Gaussian noise added.
Denoised using Stacked Sparse Denoising Autoencoder
Images from Xie et al., NIPS 2012
Inpainting – 1
Free a bird
Image from http://cimg.eu/greycstoration/demonstration.shtml
Inpainting – 2
Undo text over image
Image from Bach et al., ICCV tutorial 2009
Recall: Basis functions
For fixed basis functions φ(x), we use domain knowledge to encode features
Neural networks use data to learn a set of transformations.
φi(x) is the ith component of the feature vector, and is
learned by the network.
The transformations φi(·) for a particular dataset may no
longer be orthogonal, and furthermore may be minor
variations of each other.
We collect all the transformed features into a matrix Φ.
Sparse representations
Idea: Have many hidden nodes, but only a few active for a
particular code c(x).
Student t prior on codes
$\ell_1$ penalty on coefficients $\alpha$
Given bases in matrix $\Phi$, look for codes by choosing $\alpha$ such that the input signal x is reconstructed with low $\ell_2$ reconstruction error, while $\alpha$ is sparse
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_1$$
$\Phi$ is overcomplete, no longer orthogonal
Sparse ⇒ small number of non-zero $\alpha_i$.
Exact recovery under certain conditions (coherence): $\ell_1 \to \ell_0$.
$\ell_1$ regulariser ∼ Laplace prior $p(\alpha_i) = \frac{\lambda}{2} \exp(-\lambda |\alpha_i|)$.
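A minimal ISTA-style sketch for the $\ell_1$ problem above (NumPy; the step size, λ and dictionary are illustrative, and ISTA is just one of several standard solvers, not necessarily the one used in the lecture):

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def sparse_code(x, Phi, lam=0.1, n_iter=500):
    """Minimise 0.5 * ||x - Phi a||_2^2 + lam * ||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(Phi, 2) ** 2        # Lipschitz constant of the gradient
    a = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 50))
Phi /= np.linalg.norm(Phi, axis=0)         # unit-norm columns (overcomplete dictionary)
a_true = np.zeros(50); a_true[[3, 17, 41]] = [1.0, -2.0, 0.5]
x = Phi @ a_true + 0.01 * rng.normal(size=20)
a_hat = sparse_code(x, Phi)
print("non-zero coefficients:", np.flatnonzero(np.abs(a_hat) > 1e-3))
```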
The image denoising problem
$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$
Sparsity assumption
We only have noisy measurements
$$\underbrace{y}_{\text{measurements}} = \underbrace{x_{\text{orig}}}_{\text{original image}} + \underbrace{w}_{\text{noise}}$$
Given $\Phi \in \mathbb{R}^{m \times p}$, find $\alpha$ such that $\|\alpha\|_0$ is small for $x \approx \Phi\alpha$,
where $\|\cdot\|_0$ is the number of non-zero elements of $\alpha$.
Φ is not necessarily features constructed from training
data.
Minimise the reconstruction error
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_0$$
Convex relaxation
Want to minimise the number of non-zero components
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_0$$
but $\|\cdot\|_0$ is hard to optimise
Relax to a convex norm
$$\min_{\alpha} \sum_{n=1}^{N} \frac{1}{2} \|x_n - \Phi \alpha_n\|_2^2 + \lambda \|\alpha_n\|_1$$
where $\|\alpha\|_1 = \sum_i |\alpha_i|$.
In which settings does minimisation with $\ell_1$ regularisation give the same solution as minimisation with $\ell_0$ regularisation (exact recovery)?
Mutual coherence
Expect recovery to work when the columns of $\Phi$ are “not too parallel”
Assume the columns of $\Phi$ are normalised to unit norm
Let $K = \Phi^T \Phi$ be the Gram matrix; then $K(i, j)$ is the value of the inner product between $\phi_i$ and $\phi_j$.
Define the mutual coherence
$$M = M(\Phi) = \max_{i \neq j} |K(i, j)|$$
If we have an orthogonal basis, then $\Phi$ is an orthogonal matrix, hence $K(i, j) = 0$ when $i \neq j$.
However, if we have very similar columns, then $M \approx 1$.
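Computing the mutual coherence is a one-liner once the columns are normalised (a sketch, not part of the lecture material):

```python
import numpy as np

def mutual_coherence(Phi):
    """M(Phi) = max_{i != j} |<phi_i, phi_j>| for unit-norm columns phi_i."""
    Phi = Phi / np.linalg.norm(Phi, axis=0)       # normalise columns
    K = Phi.T @ Phi                               # Gram matrix of the columns
    np.fill_diagonal(K, 0.0)                      # ignore the i = j entries
    return np.max(np.abs(K))

rng = np.random.default_rng(0)
print(mutual_coherence(np.eye(5)))                    # orthogonal basis: 0
print(mutual_coherence(rng.normal(size=(20, 50))))    # overcomplete dictionary: between 0 and 1
```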
Exact recovery conditions
If a minimiser $\alpha^*$ of the true $\ell_0$ problem satisfies
$$\|\alpha^*\|_0 < \frac{1}{M}$$
then it is the unique sparsest solution.
If $\alpha^*$ satisfies the stronger condition
$$\|\alpha^*\|_0 < \frac{1}{2}\left(1 + \frac{1}{M}\right)$$
then the minimiser of the $\ell_1$ relaxation has the same sparsity pattern as $\alpha^*$.

References

Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009
http://deeplearning.net/tutorial/
http://www.deeplearningbook.org/contents/autoencoders.html
Fuchs, "On Sparse Representations in Arbitrary Redundant Bases", IEEE Trans. Info. Theory, 2004
Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010