Outlines
Overview
Introduction
Linear Algebra
Probability
Linear Regression 1
Linear Regression 2
Linear Classification 1
Linear Classification 2
Kernel Methods
Sparse Kernel Methods
Mixture Models and EM 1
Mixture Models and EM 2
Neural Networks 1
Neural Networks 2
Principal Component Analysis
Autoencoders
Graphical Models 1
Graphical Models 2
Graphical Models 3
Sampling
Sequential Data 1
Sequential Data 2
Statistical Machine Learning
Christian Walder
Machine Learning Research Group
CSIRO Data61
and
College of Engineering and Computer Science
The Australian National University
Canberra
Semester One, 2020.
(Many figures from C. M. Bishop, “Pattern Recognition and Machine Learning”)
Part VIII
Neural Networks 2
Neural Networks 2: Overview
Recall: we would like gradients w.r.t. parameters so that
we can optimise.
Today: gradients of neural network parameters via the
backpropagation of gradients algorithm.
Regularisation/model selection.
Incorporating invariances/domain knowledge.
Bayesian neural network (Laplace’s method).
Good News
We study backpropagation for pedagogical reasons: in practice one uses automatic differentiation, which is far more general and efficient (see e.g. the especially easy to use PyTorch).
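As a minimal illustration (added here, not from the slides) of automatic differentiation, the following PyTorch sketch computes the gradient of a squared error through a tiny two-layer network; the shapes and names (W1, W2) are arbitrary toy choices.

```python
# Minimal sketch: automatic differentiation with PyTorch autograd.
# The network and data here are toy examples.
import torch

torch.manual_seed(0)
x = torch.randn(3)                            # input
t = torch.randn(2)                            # target
W1 = torch.randn(4, 3, requires_grad=True)    # first-layer weights
W2 = torch.randn(2, 4, requires_grad=True)    # second-layer weights

y = W2 @ torch.tanh(W1 @ x)                   # forward pass: two-layer network
E = 0.5 * torch.sum((y - t) ** 2)             # squared error

E.backward()                                  # reverse-mode autodiff (backpropagation)
print(W1.grad.shape, W2.grad.shape)           # gradients w.r.t. all weights
```

Calling E.backward() populates W1.grad and W2.grad with exactly the gradients that the backpropagation derivation below computes by hand.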
Chain Rule / Total Derivative
The composition of two functions is given by
$$ (f \circ g)(x) = f(g(x)). $$
Let $f$ and $g$ be differentiable functions with derivatives $f'$ and $g'$ respectively.
Chain rule:
$$ \frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x). $$
If we write $u = g(x)$ and $y = f(u)$,
$$ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}. $$
In the multivariate case we also need the total derivative, e.g.
$$ \frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}. $$
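As a quick numerical check (an added illustration, not part of the slides), the total derivative formula can be verified against a central finite difference for a toy choice of f, x(t) and y(t):

```python
# Minimal sketch: numerically verifying the total derivative
# d/dt f(x(t), y(t)) = df/dx * dx/dt + df/dy * dy/dt
# for the toy example f(x, y) = x * y^2, x(t) = sin(t), y(t) = t^2.
import math

def f(x, y): return x * y**2
def x_of(t): return math.sin(t)
def y_of(t): return t**2

t, eps = 0.7, 1e-6

# left-hand side: central difference of the composition
lhs = (f(x_of(t + eps), y_of(t + eps)) - f(x_of(t - eps), y_of(t - eps))) / (2 * eps)

# right-hand side: partial derivatives times the time derivatives
df_dx = y_of(t) ** 2             # df/dx = y^2
df_dy = 2 * x_of(t) * y_of(t)    # df/dy = 2xy
dx_dt = math.cos(t)
dy_dt = 2 * t
rhs = df_dx * dx_dt + df_dy * dy_dt

print(lhs, rhs)   # the two values agree up to finite-difference error
```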
Error Backpropagation
Goal: efficiently update the weights in order to find a local minimum of some error function E(w), using the gradient of the error function.
Core ideas:
1 Propagate the errors backwards through the network to efficiently calculate the gradient.
2 Update the weights using the calculated gradient.
Sequential procedure: calculate the gradient and update the weights for each data/target pair.
Batch procedure: collect gradient information for all data/target pairs at the same weight setting, then adjust the weights.
Main question in both cases: How to calculate the gradient
of E(w) given one data/target pair?
Error Backpropagation
Assume the error is a sum over errors for each data/target pair
$$ E(w) = \sum_{n=1}^{N} E_n(w). $$
After applying input $x_n$ to the network, we get the output $y_n$ and calculate the error $E_n(w)$.
What is the gradient for one such term $E_n(w)$?
Note: in the following, we will drop the subscript $n$ in order to unclutter the equations.
Notation: the input pattern is $x$; the scalar $x_i$ is the $i$-th component of the input pattern $x$.
Backprop – One Layer – Scalar View
Simple linear model without hidden layers
One layer only, identity function as activation function!
$$ y_k = \sum_l w_{kl}\, x_l, $$
and error after applying input $x_n$
$$ E_n(w) = \frac{1}{2} \sum_k (y_k - t_k)^2. $$
The gradient with respect to $w_{ji}$ is now
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \sum_k (y_k - t_k) \frac{\partial y_k}{\partial w_{ji}} = \sum_k (y_k - t_k) \frac{\partial}{\partial w_{ji}} \sum_l w_{kl}\, x_l = \sum_k (y_k - t_k) \sum_l x_l\, \delta_{jk}\delta_{il} = (y_j - t_j)\, x_i, $$
where $\delta_{jk}$ denotes the Kronecker delta.
Backprop – One Layer – Vector Calculus
Vector setup:
$$ y = W x, \qquad W \in \mathbb{R}^{D_2 \times D_1}, \quad x \in \mathbb{R}^{D_1}, \quad y \in \mathbb{R}^{D_2}. $$
Error after applying input training pair $(x, t)$:
$$ E_n(W) = \frac{1}{2} \| y - t \|^2. $$
Using the vector calculus rules gives
$$ \nabla_W E_n(W) = \nabla_W \tfrac{1}{2} \| y - t \|^2 = (y - t)\, \nabla_W y = (y - t)\, x^\top. $$
FYI Only:
Backprop – One Layer – Directional Derivative
Do the same using the directional derivative:
$$ y = W x, \qquad W \in \mathbb{R}^{D_2 \times D_1}, $$
and error after applying input training pair $(x, t)$
$$ E_n(W) = \frac{1}{2} (y - t)^\top (y - t) = \frac{1}{2} (Wx - t)^\top (Wx - t). $$
Define $\nabla_d f(x) = \frac{d}{dh}\big|_{h=0} f(x + h d)$.
Relate this to the gradient by $\nabla_d f(x) = \langle \nabla f(x), d \rangle$.
The directional derivative with respect to $W$ in direction $\xi$ is now
$$ \nabla_\xi E_n(W) = \frac{1}{2} \left( (\xi x)^\top (y - t) + (y - t)^\top \xi x \right) = x^\top \xi^\top (y - t). $$
With the canonical inner product $\langle A, B \rangle = \mathrm{tr}\{A^\top B\}$, the directional derivative of $E_n(W)$ is
$$ D E_n(W)(\xi) = \mathrm{tr}\{\underbrace{x^\top \xi^\top (y - t)}_{\text{scalar}}\} = \mathrm{tr}\{\xi^\top \underbrace{(y - t)\, x^\top}_{\text{gradient}}\}, $$
so the gradient is again $(y - t)\, x^\top$.
Error Backpropagation
The gradient
$$ \nabla_W E_n(W) = (y - t)\, x^\top, $$
or in components
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = (y_j - t_j)\, x_i, $$
looks like the product of the output error $(y_j - t_j)$ with the input $x_i$ associated with the edge for $w_{ji}$ in the network diagram.
Can we generalise this idea to nonlinear activation
functions?
Error Backpropagation
Now consider a network with nonlinear activation functions $h(\cdot)$: the units $z_i$ in one layer are connected by edges with weights $w_{ji}$ to the units $z_j$ in the next layer via
$$ a_j = \sum_i w_{ji}\, z_i, \qquad z_j = h(a_j). $$
Use the chain rule to calculate the gradient
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \frac{\partial E_n(w)}{\partial a_j}\, \frac{\partial a_j}{\partial w_{ji}} = \delta_j\, z_i, $$
where we defined the error (a slight misnomer hailing from the derivative of the squared error)
$$ \delta_j = \frac{\partial E_n(w)}{\partial a_j}. $$
Same intuition as before: gradient is output error times the
input associated with the edge for wji.
Error Backpropagation
Need to calculate the errors δ in every layer.
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \delta_j\, z_i, \qquad \delta_j = \frac{\partial E_n(w)}{\partial a_j}. $$
Start the recursion; for output units with squared error:
$$ \delta_k = y_k - t_k. $$
For the hidden units we use the total derivative, e.g.
$$ \frac{d}{dt} f(x(t), y(t)) = \frac{\partial f}{\partial x}\frac{dx}{dt} + \frac{\partial f}{\partial y}\frac{dy}{dt}, $$
to calculate
$$ \delta_j = \frac{\partial E_n(w)}{\partial a_j} = \sum_k \frac{\partial E_n(w)}{\partial a_k}\, \frac{\partial a_k}{\partial a_j} = \sum_k \delta_k\, \frac{\partial a_k}{\partial a_j}, $$
using the definition of $\delta_k$.
Error Backpropagation
Express $a_k$ as a function of the incoming $a_j$:
$$ a_k = \sum_j w_{kj}\, z_j = \sum_j w_{kj}\, h(a_j), $$
and differentiate
$$ \frac{\partial a_k}{\partial a_j} = w_{kj}\, \frac{\partial h(a_j)}{\partial a_j} = w_{kj} \left. \frac{\partial h(s)}{\partial s} \right|_{s = a_j} = w_{kj}\, h'(a_j). $$
Finally, we get for the error in the previous layer
$$ \delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k. $$
Error Backpropagation
The backpropagation formula:
$$ \delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k. $$
The form of $h'(\cdot)$ is available, because we choose the activation function $h(\cdot)$ to be a fixed function such as tanh.
[Figure: part of a network showing units $z_i$ and $z_j$ connected by weight $w_{ji}$; the errors $\delta_k$ of the following layer are propagated backwards along the weights $w_{kj}$ to form $\delta_j$.]
Error Backpropagation Algorithms
1 Apply the input vector x to the network and forward
propagate through the network to calculate all activations
and outputs of each unit.
2 Compute the gradients of the error at the output.
3 Backpropagate the gradients backwards through the
network using the backpropagation formula.
4 Calculate all components of $\nabla E_n$ by
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \delta_j\, z_i. $$
5 Update the weights $w$ using $\frac{\partial E_n(w)}{\partial w_{ji}}$ (a code sketch of these steps follows below).
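The following NumPy sketch (added for illustration; the layer sizes, data and learning rate are arbitrary toy choices) walks through the five steps above for a network with one tanh hidden layer, linear outputs and squared error.

```python
# Minimal sketch: one backpropagation step for a network with one tanh
# hidden layer and linear outputs, following steps 1-5 above. Toy example.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # input pattern
t = rng.normal(size=2)                 # target
W1 = rng.normal(size=(3, 5)) * 0.1     # input -> hidden weights
W2 = rng.normal(size=(2, 3)) * 0.1     # hidden -> output weights
eta = 0.1                              # learning rate

# 1. Forward propagate: activations a and unit outputs z.
a1 = W1 @ x
z1 = np.tanh(a1)
y = W2 @ z1                            # identity activation at the output

# 2. Gradient of the error at the output: delta_k = y_k - t_k.
delta2 = y - t

# 3. Backpropagate: delta_j = h'(a_j) * sum_k w_kj delta_k, with h' = 1 - tanh^2.
delta1 = (1.0 - np.tanh(a1) ** 2) * (W2.T @ delta2)

# 4. Components of the gradient: dE/dw_ji = delta_j * z_i.
grad_W2 = np.outer(delta2, z1)
grad_W1 = np.outer(delta1, x)

# 5. Update the weights (gradient descent).
W2 -= eta * grad_W2
W1 -= eta * grad_W1
```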
Error Backpropagation
For batch processing, we repeat backpropagation for each
pattern in the training set and then sum over all patterns
$$ \frac{\partial E(w)}{\partial w_{ji}} = \sum_{n=1}^{N} \frac{\partial E_n(w)}{\partial w_{ji}}. $$
Backpropagation can be generalised by assuming that
each node has a different activation function h(·).
Easy Backprop
Let
$$ z^{(0)} = x \ \text{(the input)}, \qquad a^{(l)} = W^{(l)} z^{(l-1)}, \qquad z^{(l)} = h(a^{(l)}), $$
$$ E = L(a^{(L)}) = L(y), \quad \text{e.g.} \quad L(y) = \frac{1}{2} \| y - t \|^2. $$
The gradients of $E$ w.r.t. the parameters are (total derivative)
$$ \frac{\partial E}{\partial W^{(l)}} = \frac{\partial E}{\partial a^{(l)}} \cdot \frac{\partial a^{(l)}}{\partial W^{(l)}} = \delta^{(l)}\, z^{(l-1)\top}, $$
where (neglecting transposes; assume conformant shapes)
$$ \delta^{(l)} \equiv \frac{\partial E}{\partial a^{(l)}} = \frac{\partial a^{(l+1)}}{\partial a^{(l)}}\, \frac{\partial a^{(l+2)}}{\partial a^{(l+1)}} \cdots \frac{\partial a^{(L)}}{\partial a^{(L-1)}}\, \frac{\partial L(a^{(L)})}{\partial a^{(L)}} $$
has the recursion $\delta^{(L)} = \frac{\partial L(a^{(L)})}{\partial a^{(L)}}$ along with
$$ \delta^{(l-1)} = \frac{\partial a^{(l)}}{\partial a^{(l-1)}}\, \delta^{(l)}, \qquad \frac{\partial a^{(l)}}{\partial a^{(l-1)}} = \frac{\partial\, W^{(l)} h(a^{(l-1)})}{\partial a^{(l-1)}} = \mathrm{diag}\{h'(a^{(l-1)})\}\, W^{(l)\top}. $$
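A vectorised sketch of this recursion for a general stack of layers (added here for illustration; layer widths, data and the tanh/linear/squared-error choices are toy assumptions):

```python
# Minimal sketch: forward pass and the delta recursion
# delta^(l-1) = diag{h'(a^(l-1))} W^(l)^T delta^(l) for a stack of layers
# with tanh hidden units, linear output and squared error. Toy sizes.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 6, 5, 2]                          # layer widths, input -> output
Ws = [rng.normal(size=(m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])
t = rng.normal(size=sizes[-1])

# Forward pass: store activations a^(l) and unit outputs z^(l).
z = [x]
a = []
for l, W in enumerate(Ws):
    a.append(W @ z[-1])
    last = (l == len(Ws) - 1)
    z.append(a[-1] if last else np.tanh(a[-1]))   # linear output layer

# Backward pass: delta^(L) = dL/da^(L) = y - t for squared error.
delta = z[-1] - t
grads = [None] * len(Ws)
for l in reversed(range(len(Ws))):
    grads[l] = np.outer(delta, z[l])              # dE/dW^(l) = delta^(l) z^(l-1)^T
    if l > 0:
        delta = (1.0 - np.tanh(a[l - 1]) ** 2) * (Ws[l].T @ delta)

print([g.shape for g in grads])                   # one gradient per weight matrix
```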
Efficiency of Backpropagation
For dense weight matrices, the complexity of calculating the gradient $\frac{\partial E_n(w)}{\partial w_{ji}}$ via backpropagation is $O(W)$, where $W$ is the number of weights.
Compare this to numerical differentiation using e.g.
$$ \frac{\partial E_n(w)}{\partial w_{ji}} = \frac{E_n(w_{ji} + \epsilon) - E_n(w_{ji} - \epsilon)}{2\epsilon} + O(\epsilon^2), $$
which needs $O(W^2)$ operations and is less accurate.
FYI only — as in the previous lecture: In general we have the
“cheap gradient principle”. See (Griewank, A., 2000.
Evaluating Derivatives: Principles and Techniques of
Algorithmic Differentiation, Section 5.1).
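For illustration (not from the slides), the following sketch computes the full numerical gradient of a one-layer model by looping over all W weights, two forward passes each, and compares it with the single-pass backpropagation gradient (y − t)xᵀ derived earlier:

```python
# Minimal sketch: central-difference gradient for every weight of a linear
# model needs two forward passes per weight (O(W) passes of O(W) cost each),
# whereas backpropagation delivers the whole gradient in one pass.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
t = rng.normal(size=3)
eps = 1e-6

def error(W):                          # one forward pass
    return 0.5 * np.sum((W @ x - t) ** 2)

# Numerical gradient: loop over all weights -> O(W^2) total cost.
num_grad = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    num_grad[idx] = (error(Wp) - error(Wm)) / (2 * eps)

# Backpropagation gradient for the same model: a single pass.
bp_grad = np.outer(W @ x - t, x)

print(np.max(np.abs(num_grad - bp_grad)))   # small finite-difference discrepancy
```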
Regularisation in Neural Networks
Number of input and output nodes determined by the
application.
Number of hidden nodes is a free parameter.
[Figure: training a two-layer network with 1 hidden node (M = 1).]
[Figure: training a two-layer network with 3 hidden nodes (M = 3).]
[Figure: training a two-layer network with 10 hidden nodes (M = 10).]
Regularisation in Neural Networks
Model complexity matters again.
[Figure: the fits obtained with M = 1, M = 3, and M = 10 hidden units, shown side by side.]
As before, we can use the regularised error
$$ \tilde{E}(w) = E(w) + \frac{\lambda}{2}\, w^\top w. $$
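In terms of the gradient, this quadratic regulariser simply adds λw to the unregularised gradient; a minimal sketch (added illustration, with made-up numbers):

```python
# Minimal sketch: the gradient of the regularised error
# E~(w) = E(w) + (lambda/2) w^T w  is  grad E(w) + lambda * w.
import numpy as np

def regularised_grad(grad_E, w, lam):
    """Add the weight-decay term to an unregularised gradient."""
    return grad_E + lam * w

# toy usage with a made-up gradient and weight vector
w = np.array([0.5, -1.0, 2.0])
grad_E = np.array([0.1, 0.0, -0.2])
print(regularised_grad(grad_E, w, lam=0.01))
```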
Regularisation via Early Stopping
Stop training at the minimum of the validation set error.
[Figure: training set error (left) and validation set error (right) as a function of the training iteration.]
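A hedged sketch of an early-stopping loop (added for illustration; train_epoch and validation_error are hypothetical helpers, and max_epochs/patience are arbitrary):

```python
# Minimal sketch of early stopping: keep the weights that achieve the lowest
# validation error and stop once it has not improved for `patience` epochs.
# `train_epoch(w)` and `validation_error(w)` are assumed/hypothetical helpers.
import copy

def early_stopping_train(w, train_epoch, validation_error,
                         max_epochs=200, patience=10):
    best_w = copy.deepcopy(w)
    best_err = validation_error(w)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        w = train_epoch(w)                  # one pass of gradient updates
        err = validation_error(w)
        if err < best_err:
            best_err, best_w = err, copy.deepcopy(w)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break                       # validation error stopped improving
    return best_w

# toy usage: "training" moves w towards 1.5, validation prefers w near 1.0
w_best = early_stopping_train(
    w=0.0,
    train_epoch=lambda w: w + 0.1 * (1.5 - w),
    validation_error=lambda w: (w - 1.0) ** 2)
print(w_best)   # close to 1.0, not the training optimum 1.5
```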
Invariances
If input data should be invariant with respect to some
transformations, we can utilise this for training.
Use training patterns including these transformations (e.g.
handwritten digits translated in the input space).
Or create extra artificial input data by applying several transformations to the original input data (a sketch follows below).
Alternatively, preprocess the input data to remove the
transformation.
Or use convolutional neural networks (e.g. in image
processing where close pixels are more correlated than far
away pixels; therefore extract local features first and later
feed into a network extracting higher-order features).
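A minimal sketch of the second option, creating extra artificial training data by random translations (added illustration; the image sizes and shift range are toy assumptions, and np.roll wraps pixels around the border rather than padding):

```python
# Minimal sketch: augmenting image data with random translations, so the
# network sees (approximately) translation-invariant training examples.
import numpy as np

def random_translate(image, max_shift=2, rng=None):
    """Return a copy of `image` shifted by a random number of pixels.
    np.roll wraps pixels around the border (a crude form of translation)."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

# toy usage: augment a batch of 28x28 "digits" with 3 shifted copies each
rng = np.random.default_rng(0)
batch = rng.random(size=(8, 28, 28))
augmented = np.stack([random_translate(img, rng=rng)
                      for img in batch for _ in range(3)])
print(augmented.shape)   # (24, 28, 28)
```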
Synthetic Warping
Create synthetic data by warping handwritten digits.
[Figure. Left: original digitised image. Right: examples of warped images (above) and their corresponding displacement fields (below).]
Bayesian Neural Networks
Predict a single target $t$ from a vector of inputs $x$.
Assume the conditional distribution to be Gaussian with precision $\beta$:
$$ p(t \mid x, w, \beta) = \mathcal{N}(t \mid y(x, w), \beta^{-1}). $$
The prior distribution over the weights $w$ is also assumed to be Gaussian:
$$ p(w \mid \alpha) = \mathcal{N}(w \mid 0, \alpha^{-1} I). $$
For an i.i.d. training data set $\{x_n, t_n\}_{n=1}^{N}$, the likelihood of the targets $\mathcal{D} = \{t_1, \ldots, t_N\}$ is
$$ p(\mathcal{D} \mid w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid y(x_n, w), \beta^{-1}). $$
Posterior distribution:
$$ p(w \mid \mathcal{D}, \alpha, \beta) \propto p(w \mid \alpha)\, p(\mathcal{D} \mid w, \beta). $$
Bayesian Neural Networks
But y(x,w) is nonlinear, and therefore we can no longer
calculate the posterior in closed form.
Use the Laplace approximation:
1 Find a (local) maximum $w_{\text{MAP}}$ of the posterior via numerical optimisation.
2 Evaluate the matrix of second derivatives of the negative log posterior distribution.
Find a (local) maximum using the log posterior
$$ \ln p(w \mid \mathcal{D}, \alpha, \beta) = -\frac{\alpha}{2}\, w^\top w - \frac{\beta}{2} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \text{const}. $$
Find the matrix of second derivatives of the negative log posterior distribution:
$$ A = -\nabla \nabla \ln p(w \mid \mathcal{D}, \alpha, \beta) = \alpha I + \beta H, $$
where $H$ is the Hessian matrix of the sum-of-squares error function with respect to the components of $w$.
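A hedged sketch of these two steps for a tiny single-hidden-layer network (added illustration; the data, network and the finite-difference Hessian are toy assumptions, not the lecture's implementation):

```python
# Minimal sketch of the Laplace approximation: (1) find w_MAP by minimising
# the negative log posterior, (2) form A = alpha*I + beta*H, here obtained
# directly as the second-derivative matrix of the negative log posterior,
# estimated by central finite differences. Toy data and a tiny tanh network.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(20, 1))
T = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=20)
alpha, beta = 1.0, 25.0
n_hidden = 3
n_w = n_hidden * 1 + n_hidden            # hidden weights (3x1) + output weights (3)

def predict(w, X):
    W1 = w[:n_hidden].reshape(n_hidden, 1)
    w2 = w[n_hidden:]
    return np.tanh(X @ W1.T) @ w2

def neg_log_posterior(w):
    err = predict(w, X) - T
    return 0.5 * alpha * w @ w + 0.5 * beta * err @ err

# (1) w_MAP: maximum of the posterior = minimum of the negative log posterior
w_map = minimize(neg_log_posterior, rng.normal(size=n_w) * 0.1).x

# (2) A: second derivatives of the negative log posterior at w_MAP
eps = 1e-3
A = np.zeros((n_w, n_w))
for i in range(n_w):
    for j in range(n_w):
        w_pp = w_map.copy(); w_pp[i] += eps; w_pp[j] += eps
        w_pm = w_map.copy(); w_pm[i] += eps; w_pm[j] -= eps
        w_mp = w_map.copy(); w_mp[i] -= eps; w_mp[j] += eps
        w_mm = w_map.copy(); w_mm[i] -= eps; w_mm[j] -= eps
        A[i, j] = (neg_log_posterior(w_pp) - neg_log_posterior(w_pm)
                   - neg_log_posterior(w_mp) + neg_log_posterior(w_mm)) / (4 * eps**2)

print(w_map.round(2), np.linalg.eigvalsh(A).min())   # A should be positive definite
```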
Bayesian Neural Networks
Having $w_{\text{MAP}}$ and $A$, we can approximate the posterior by a Gaussian
$$ q(w \mid \mathcal{D}, \alpha, \beta) = \mathcal{N}(w \mid w_{\text{MAP}}, A^{-1}). $$
For the predictive distribution, further linearly approximate
$$ y(x, w) \approx y(x, w_{\text{MAP}}) + g^\top (w - w_{\text{MAP}}), $$
where $g = \nabla_w y(x, w)\big|_{w = w_{\text{MAP}}}$.
Then
$$ p(t \mid x, \mathcal{D}, \alpha, \beta) = \mathcal{N}(t \mid y(x, w_{\text{MAP}}), \sigma^2(x)), $$
where
$$ \sigma^2(x) = \beta^{-1} + g^\top A^{-1} g. $$
(Recall the multivariate normal conditionals.)
Variance due to the intrinsic noise on the target: $\beta^{-1}$.
Variance due to the model parameters $w$: $g^\top A^{-1} g$.
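A short sketch of the predictive-variance formula (added illustration; A, g and β here are placeholder values standing in for the quantities defined above):

```python
# Minimal sketch: predictive variance sigma^2(x) = 1/beta + g^T A^{-1} g.
# A, g and beta are toy placeholders for the quantities defined above.
import numpy as np

beta = 25.0
A = np.array([[4.0, 0.5],
              [0.5, 3.0]])          # alpha*I + beta*H at w_MAP (toy values)
g = np.array([0.2, -0.1])           # gradient of y(x, w) at w_MAP (toy values)

sigma2 = 1.0 / beta + g @ np.linalg.solve(A, g)
print(sigma2)
```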