
Statistical Machine Learning
Christian Walder
Machine Learning Research Group CSIRO Data61
and
College of Engineering and Computer Science The Australian National University
Canberra Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2

Part X
Neural Networks 3

Number of layers
The expression of a function is compact when it has few computational elements, i.e. few degrees of freedom that need to be tuned by learning.
For a fixed number of training examples, we expect compact representations of the target function to yield better generalisation.
Example representations
affine operations, sigmoid ⇒ logistic regression has depth 1, fixed number of units (a.k.a. neurons)
fixed kernel, affine operations ⇒ kernel machine (e.g. SVM) has two levels, with as many units as data points
stacked neural network of multiple “linear transformation followed by a non-linearity” ⇒ deep neural network has arbitrary depth with arbitrary number of units per layer

Easier to represent with more layers
An old result:
Functions that can be compactly represented by a depth k architecture might require an exponential number of computational elements to be represented by a depth k − 1 architecture.
Example: the d-bit parity function

parity : (b_1, . . . , b_d) ∈ {0, 1}^d ↦ 1 if Σ_{i=1}^d b_i is even, 0 otherwise

Theorem: d-bit parity circuits of depth 2 have exponential size.
Analogous in modern deep learning:
“Shallow networks require exponentially more parameters for the same number of modes” — Canadian deep learning mafia.
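To make the parity function above concrete, here is a tiny Python sketch (purely illustrative, not part of the original slides):

    # d-bit parity as defined above: 1 if the number of ones is even, 0 otherwise
    def parity(bits):
        return 1 if sum(bits) % 2 == 0 else 0

    assert parity([1, 1, 0]) == 1     # two ones: even number of ones
    assert parity([1, 0, 1, 1]) == 0  # three ones: odd number of ones

A chain of pairwise XORs computes this easily with depth that grows with d; the theorem above says that squashing the circuit to depth 2 forces an exponential blow-up in size.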

Recall: Multi-layer Neural Network Architecture
M 􏲁D 􏲂 y(x,w)=g 􏱿w(2)h 􏱿w(1)x
k kj jii j=0 i=0
where w now contains all weight and bias parameters.
[Figure: two-layer network diagram, with inputs x_0, . . . , x_D, hidden units z_0, . . . , z_M, weights w^{(1)} and w^{(2)}, and outputs y_1, . . . , y_K.]

We could add more hidden layers.
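A minimal numpy sketch of this forward pass (the tanh hidden nonlinearity, identity output activation and the layer sizes are assumptions made only for illustration):

    import numpy as np

    def forward(x, W1, b1, W2, b2):
        # hidden units: z_j = h( sum_i w_ji^(1) x_i + bias ), with h = tanh
        z = np.tanh(W1 @ x + b1)
        # outputs: y_k = g( sum_j w_kj^(2) z_j + bias ), with g = identity
        return W2 @ z + b2

    D, M, K = 4, 3, 2                                # input, hidden and output sizes
    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(M, D)), np.zeros(M)
    W2, b2 = rng.normal(size=(K, M)), np.zeros(K)
    y = forward(rng.normal(size=D), W1, b1, W2, b2)  # shape (K,)

Adding more hidden layers simply means inserting further "linear transformation followed by a non-linearity" steps between the input and the output.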

Empirical observations – pre 2006
Deep architectures get stuck in local minima or plateaus
As architecture gets deeper, more difficult to obtain good generalisation
Hard to initialise random weights well
1 or 2 hidden layers seem to perform better
2006: Unsupervised pre-training, find distributed representation

Deep representation – intuition
Bengio, “Learning Deep Architectures for AI”, 2009

Deep representation – practice
AlexNet / VGG-F network visualized by mNeuron.

Recall: PCA
Idea: Linearly project the data points onto a lower dimensional subspace such that
the variance of the projected data is maximised, or the distortion error from the projection is minimised.
Both formulations lead to the same result.
Need to find the lower dimensional subspace, called the principal subspace.

[Figure: data points x_n in the (x_1, x_2) plane projected onto the principal subspace spanned by u_1, giving projected points x̃_n.]

Multiple PCA layers? – linear transforms
Principal Component Analysis is a linear transformation (because it is a projection)
The composite of two linear transformations is linear
Linear transformations M : R^m → R^n are matrices
Let S and T be matrices of appropriate dimension such that ST is defined; then
ST(X + X′) = ST(X) + ST(X′)
and similarly for multiplication with a scalar
⇒ multiple PCA layers are pointless

Multiple PCA layers? – projection
Let X^T X = U Λ U^T be the eigenvalue decomposition of the covariance matrix (what is assumed about the mean?).
Define U_k to be the matrix of the first k columns of U, for the k largest eigenvalues. Define Λ_k similarly.
Consider the features formed by projecting onto the principal components
Z = X U_k
We perform PCA a second time, Z^T Z = V Λ_Z V^T.
By the definition of Z and X^T X, and the orthogonality of U,
Z^T Z = (X U_k)^T (X U_k) = U_k^T X^T X U_k = U_k^T U Λ U^T U_k = Λ_k
Hence Λ_Z = Λ_k and V is the identity, therefore the second PCA has no effect
⇒ again, multiple PCA layers pointless
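The argument can be checked numerically; here is a small numpy sketch (centred data and k = 3 are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
    X -= X.mean(axis=0)                    # zero mean, as assumed above

    lam, U = np.linalg.eigh(X.T @ X)       # eigendecomposition X^T X = U Lam U^T
    Uk = U[:, np.argsort(lam)[::-1][:3]]   # columns for the 3 largest eigenvalues

    Z = X @ Uk                             # first PCA projection
    C = Z.T @ Z                            # equals Lam_k, already diagonal
    print(np.allclose(C, np.diag(np.diag(C))))   # True: a second PCA finds nothing new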

Autoencoder
An autoencoder is trained to encode the input x into some representation c(x) so that the input can be reconstructed from that representation
the target output of the autoencoder is the autoencoder input itself
With one linear hidden layer and the mean squared error criterion, the k hidden units learn to project the input in the span of the first k principal components of the data
If the hidden layer is nonlinear, the autoencoder behaves differently from PCA, with the ability to capture multimodal aspects of the input distribution
Let f be the decoder. We want to minimise the reconstruction error

Σ_{n=1}^N l(x_n, f(c(x_n)))
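A minimal sketch of this objective with one nonlinear hidden layer (the tanh encoder, linear decoder and squared-error loss l are assumptions chosen to match the discussion above):

    import numpy as np

    def encode(x, We, be):       # c(x)
        return np.tanh(We @ x + be)

    def decode(code, Wd, bd):    # f(c(x))
        return Wd @ code + bd

    def reconstruction_error(X, We, be, Wd, bd):
        # sum_n l(x_n, f(c(x_n))) with squared-error loss l
        return sum(np.sum((x - decode(encode(x, We, be), Wd, bd)) ** 2) for x in X)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 8))                          # 100 inputs of dimension 8
    We, be = 0.1 * rng.normal(size=(3, 8)), np.zeros(3)    # 3 hidden units
    Wd, bd = 0.1 * rng.normal(size=(8, 3)), np.zeros(8)
    print(reconstruction_error(X, We, be, Wd, bd))

Training adjusts We, be, Wd, bd (for example by gradient descent) to reduce this error.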

Cost function
Recall: f (c(x)) is the reconstruction produced by the network
Minimisation of the negative log likelihood of the reconstruction, given the encoding c(x)
RE = − log P(x|c(x))
If x|c(x) is Gaussian, we recover the familiar squared error
If the inputs xi are either binary or considered to be binomial probabilities, then the loss function would be the cross entropy
− log P(x|c(x)) = − Σ_i [ x_i log f_i(c(x)) + (1 − x_i) log(1 − f_i(c(x))) ]
where f_i(·) is the ith component of the decoder
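A sketch of the cross-entropy reconstruction loss in numpy (the small clipping constant is an assumption added only to avoid log(0)):

    import numpy as np

    def cross_entropy_reconstruction(x, f_c, eps=1e-12):
        # -sum_i [ x_i log f_i(c(x)) + (1 - x_i) log(1 - f_i(c(x))) ]
        f = np.clip(f_c, eps, 1 - eps)
        return -np.sum(x * np.log(f) + (1 - x) * np.log(1 - f))

    x   = np.array([1.0, 0.0, 1.0])   # binary input
    f_c = np.array([0.9, 0.2, 0.6])   # decoder outputs f(c(x)), each in (0, 1)
    print(cross_entropy_reconstruction(x, f_c))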

Undercomplete Autoencoder
Consider a small number of hidden units. c(x) is viewed as a lossy compression of x
Cannot have small loss for all x, so focus on training examples
Hope code c(x) is a distributed representation that captures the main factors of variation in the data

Stacking autoencoders
Let cj and fj be the encoder and corresponding decoder of the jth layer
Let zj be the representation after the encoder cj
We can define multiple layers of autoencoders recursively. For example, let z1 = c1(x) and z2 = c2(z1); the corresponding decoding is given by f1(f2(z2)).
Because of non-linear activation functions, the latent feature z2 can capture more complex patterns than z1.
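A sketch of the recursive composition described above (the weights are random placeholders standing in for layer-wise trained autoencoders):

    import numpy as np

    rng = np.random.default_rng(0)
    W1e, W1d = 0.1 * rng.normal(size=(8, 16)), 0.1 * rng.normal(size=(16, 8))
    W2e, W2d = 0.1 * rng.normal(size=(4, 8)),  0.1 * rng.normal(size=(8, 4))

    c1 = lambda x: np.tanh(W1e @ x)   # first-layer encoder
    f1 = lambda z: W1d @ z            # first-layer decoder
    c2 = lambda z: np.tanh(W2e @ z)   # second-layer encoder
    f2 = lambda z: W2d @ z            # second-layer decoder

    x     = rng.normal(size=16)
    z1    = c1(x)                     # z1 = c1(x)
    z2    = c2(z1)                    # z2 = c2(z1)
    x_hat = f1(f2(z2))                # reconstruction f1(f2(z2))

In practice each layer is typically trained as an autoencoder on the codes produced by the layer below it, rather than initialised randomly as here.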

Higher level image features – faces
Image from codingplayground.blogspot.com

Pre-training supervised neural networks
Latent features zj in layer j can capture high level patterns:
zj = cj(cj−1(· · · c2(c1(x)) · · · ))
These features may also be useful for supervised learning tasks.
In contrast to the feed forward network, the features zj are constructed in an unsupervised fashion.
Discard the decoding layers, and directly use zj with a supervised training method, such as logistic regression.
Various such pre-trained networks are available on-line, e.g. VGG-19.
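A sketch of the overall recipe: compute the unsupervised codes, discard the decoders, and train a supervised classifier on the codes (the encoder weights are random placeholders; scikit-learn's logistic regression is just one convenient choice of supervised method):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))               # raw inputs
    y = (X[:, 0] + X[:, 1] > 0).astype(int)      # toy labels

    W = 0.5 * rng.normal(size=(8, 16))           # stands in for pre-trained encoder weights
    Z = np.tanh(X @ W.T)                         # codes z = c(x); decoders discarded

    clf = LogisticRegression().fit(Z, y)         # supervised training on the codes
    print(clf.score(Z, y))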

Xavier Initialisation / ReLU
Layer-wise unsupervised pre-training helps by extracting useful features for subsequent supervised backprop.
Pre-training also avoids saturation (large magnitude arguments to, say, sigmoidal units).
Simpler Xavier initialisation can also avoid saturation.
Let the inputs x_i ∼ N(0, 1), the weights w_i ∼ N(0, σ^2), all independent, and the activation z = Σ_{i=1}^m x_i w_i. Then:

VAR[z] = E[(z − E[z])^2] = E[z^2] = E[(Σ_{i=1}^m x_i w_i)^2] = Σ_{i=1}^m E[(x_i w_i)^2] = Σ_{i=1}^m E[x_i^2] E[w_i^2] = m σ^2,

where the cross terms vanish because the factors are independent with zero mean.
So we set σ = 1/√m to have “nice” activations.
Glorot initialisation takes care to also have nice back-propagated signals (see the auto-encoder lab).
ReLU activations h(x) = max(x, 0) also help in practice.
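A sketch of this fan-in initialisation in numpy (this is the simple 1/√m variant derived above; Glorot and Bengio's scheme also balances the fan-out):

    import numpy as np

    def xavier_init(fan_in, fan_out, rng):
        # weights ~ N(0, sigma^2) with sigma = 1/sqrt(fan_in), so VAR[z] is about 1
        return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_out, fan_in))

    rng = np.random.default_rng(0)
    W = xavier_init(256, 128, rng)
    x = rng.normal(size=256)     # standard-normal inputs, as assumed above
    print(np.var(W @ x))         # close to 1: activations neither explode nor vanish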

Higher dimensional hidden layer
if there is no other constraint, then an autoencoder with d-dimensional input and an encoding of dimension at least d could potentially just learn the identity function
Avoid by:
Regularisation
Early stopping of stochastic gradient descent
Add noise in the encoding
Sparsity constraint on code c(x).

Denoising autoencoder
Add noise to the input, keeping the perfect example as output.
The autoencoder tries to:
1 preserve information of the input
2 undo the stochastic corruption process
Reconstruction log likelihood
− log P(x|c(x̂)), where x is noise free and x̂ is corrupted
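A sketch of how the training pairs are formed (Gaussian corruption and the noise level are assumptions; other corruption processes such as masking inputs are also common):

    import numpy as np

    def corrupt(x, rng, sigma=0.3):
        # stochastic corruption: x_hat = x + Gaussian noise
        return x + rng.normal(0.0, sigma, size=x.shape)

    rng = np.random.default_rng(0)
    x_clean = rng.normal(size=32)      # noise-free example x (the target)
    x_noisy = corrupt(x_clean, rng)    # corrupted input x_hat
    # train with input x_noisy and target x_clean, i.e. minimise -log P(x | c(x_hat)),
    # which for Gaussian P is the squared reconstruction error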

Image denoising
Images with Gaussian noise added.
Denoised using Stacked Sparse Denoising Autoencoder
Images from Xie et al., NIPS 2012

Inpainting – 1
Free a bird
Image from http://cimg.eu/greycstoration/demonstration.shtml

Inpainting – 2
Undo text over image
Image from Bach et al., ICCV tutorial 2009

Recall: Basis functions
For fixed basis functions φ(x), we use domain knowledge for encoding features
Neural networks use data to learn a set of transformations. φi(x) is the ith component of the feature vector, and is learned by the network.
The transformations φi(·) for a particular dataset may no longer be orthogonal, and furthermore may be minor variations of each other.
We collect all the transformed features into a matrix Φ.

Sparse representations
Idea: Have many hidden nodes, but only a few active for a particular code c(x).
Student t prior on codes
l1 penalty on coefficients α
Given bases in matrix Φ, look for codes by choosing α such that the input signal x is reconstructed with low l2 reconstruction error, while α is sparse:

min_α Σ_{n=1}^N (1/2) ∥x_n − Φ α_n∥^2 + λ ∥α_n∥_1
Φ is overcomplete, no longer orthogonal
Sparse ⇒ small number of non-zero αi.
Exact recovery under certain conditions (coherence): l1 →l0.
l1 regulariser ∼ Laplace prior p(αi) = (λ/2) exp(−λ|αi|).
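A sketch of solving the l1-penalised reconstruction for a single signal with an overcomplete dictionary, using scikit-learn's Lasso (the dictionary, the planted sparse code and the penalty strength are arbitrary illustrative choices; note that scikit-learn divides the squared-error term by the number of rows of Φ, so its alpha is a rescaled λ):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(20, 50))              # overcomplete: 50 atoms in 20 dimensions
    Phi /= np.linalg.norm(Phi, axis=0)           # unit-norm columns

    alpha_true = np.zeros(50)
    alpha_true[[3, 17, 31]] = [1.0, -2.0, 0.5]   # sparse code
    x = Phi @ alpha_true                         # signal built from 3 atoms

    # approximately min_alpha 1/2 ||x - Phi alpha||^2 + lambda ||alpha||_1
    lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(Phi, x)
    print(np.flatnonzero(np.abs(lasso.coef_) > 1e-3))   # recovered support, ideally [3 17 31]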

The image denoising problem
y = x_orig + w

where y are the noisy measurements, x_orig is the original image, and w is noise.

Sparsity assumption
Only have noisy measurements

y = x_orig + w   (y: measurements, x_orig: original image, w: noise)

Given Φ ∈ R^{m×p}, find α such that ∥α∥_0 is small for x ≈ Φα, where ∥·∥_0 is the number of non-zero elements of α.
Φ is not necessarily features constructed from training data.
Minimise the reconstruction error

min_α Σ_{n=1}^N (1/2) ∥x_n − Φ α_n∥^2 + λ ∥α_n∥_0

Convex relaxation
Want to minimise the number of components

min_α Σ_{n=1}^N (1/2) ∥x_n − Φ α_n∥^2 + λ ∥α_n∥_0

but ∥·∥_0 is hard to optimise.
Relax to a convex norm

min_α Σ_{n=1}^N (1/2) ∥x_n − Φ α_n∥^2 + λ ∥α_n∥_1

where ∥α∥_1 = Σ_i |α_i|.
In some settings, does minimisation with l1 regularisation give the same solution as minimisation with l0 regularisation (exact recovery)?

Mutual coherence
Expect to be ok when the columns of Φ are “not too parallel”
Assume the columns of Φ are normalised to unit norm
Let K = Φ^T Φ be the Gram matrix, so that K(i, j) is the inner product between the columns φ_i and φ_j
Define the mutual coherence

M = M(Φ) = max_{i≠j} |K(i, j)|

If we have an orthogonal basis, then Φ is an orthogonal matrix, hence K(i, j) = 0 when i ≠ j.
However, if we have very similar columns, then M ≈ 1.
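A short numpy sketch of computing the mutual coherence of a dictionary (the random dictionary is an illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(0)
    Phi = rng.normal(size=(20, 50))
    Phi /= np.linalg.norm(Phi, axis=0)            # unit-norm columns, as assumed above

    K = Phi.T @ Phi                               # Gram matrix of the columns
    M = np.max(np.abs(K - np.diag(np.diag(K))))   # mutual coherence: max over i != j of |K(i, j)|
    print(M)   # this dictionary's coherence; 0 for an orthonormal basis, near 1 for near-parallel columns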

Exact recovery conditions
If a minimiser α∗ of the true l0 problem satisfies

∥α∗∥_0 < 1/M

then it is the unique sparsest solution. If α∗ satisfies the stronger condition

∥α∗∥_0 < (1/2) (1 + 1/M)

then the minimiser of the l1 relaxation has the same sparsity pattern as α∗.

References

Yoshua Bengio, "Learning Deep Architectures for AI", Foundations and Trends in Machine Learning, 2009.
http://deeplearning.net/tutorial/
http://www.deeplearningbook.org/contents/autoencoders.html
Fuchs, "On Sparse Representations in Arbitrary Redundant Bases", IEEE Trans. Info. Theory, 2004.
Xavier Glorot and Yoshua Bengio, "Understanding the difficulty of training deep feedforward neural networks", 2010.