The University of Sydney Page 1
Multilayer Neural Network
Dr Chang Xu
School of Computer Science
The University of Sydney Page 2
Perceptron: the Prelude of Deep Learning
A neuron cell
Image credit to: thinglink.com
The University of Sydney Page 3
Perceptron: the Prelude of DL
S
x1
x2
xd
…
w1
w2
wd
b
o
!” =$
!”#
$
%! ⋅ ‘! + )
[Rosenblatt, et al. 1958]
The University of Sydney Page 4
Perceptron
– The neuron has a real-valued output which is a weighted
sum of its inputs.
Neuron’s estimate of
the desired output
input
vector
weight
vector
Recall the decision rule
If an example can be correctly classified, we have
! “!# + % > 0
otherwise
! “!# + % < 0
)! =+
"#$
%
""#" + % = "!# + % bias
The University of Sydney Page 5
Perceptron
If an example can be correctly classified, we have
! "!# + % > 0
otherwise
! “!# + % < 0
The objective function of Perceptron can be written as
min
&,(
/ ", % = − +
)!∈ℳ
!"("!#" + %)
where ℳ stands for the set of the misclassified examples.
The University of Sydney Page 6
Perceptron
min
&,(
/ ", % = − +
)!∈ℳ
!"("!#" + %)
Gradient descent to solve the optimization problem.
∇&/ ", % = − +
)!∈ℳ
!"#"
∇(/ ", % = − +
)!∈ℳ
!"
The University of Sydney Page 7
Perceptron
g( #) = &'() *
!"#
$
+!,! + . = &'() /
%#
sign & = 0
1, '3 & ≥ 0
−1, '3 & < 0
x1 x2
ŏ
xdŏ
x0 = 1
w0
w1 w2
wd
g(x)
Linear classifier!
b
The University of Sydney Page 8
Perceptron
Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.
• Before long researchers had begun to discover the
Perceptron’s limitations.
• Unless input categories were “linearly separable”, a
perceptron could not learn to discriminate between them.
• Unfortunately, it appeared that many important
categories were not linearly separable.
• E.g., those inputs to an XOR gate that give an output of 1
(namely 10 & 01) are not linearly separable from those
that do not (00 & 11).
“AI winter”
The University of Sydney Page 9
Perceptron: Limitations
– Solving XOR (exclusive-or) with Perceptron?
– Linear classifier cannot identify non-linear patterns.
!! !" y
0 0 0
0 1 1
1 0 1
1 1 0
x1
x2
1
10
"$#$ +",#, + %
This failure caused the majority of researchers to walk away.
The University of Sydney Page 10
Breakthrough: Multi-Layer Perceptron
– In1986, Geoffrey Hinton, David Rumelhart, and Ronald
Williams published a paper “Learning representations by
back-propagating errors”, which introduced
– Backpropagation, a procedure to repeatedly adjust the
weights so as to minimize the difference between actual
output and desired output
– Hidden Layers, which are neuron nodes stacked in
between inputs and outputs, allowing neural networks to
learn more complicated features (such as XOR logic)
The University of Sydney Page 11
Multilayer Neural Network
The University of Sydney Page 12
Multilayer Neural Networks
– Function composition can be described by a directed acyclic
graph (hence feedforward networks)
– Depth is the maximum " in the function composition chain
– Final layer is called the output layer
The University of Sydney Page 13
Two-layer Neural Network
x1 x2
ŏ
xdŏ
x0 = 1
w0
w1 w2
wd
g(x)
x1
x2
1
10
Linear classifier!
b
The University of Sydney Page 14
Three-layer Neural Network (with a hidden layer)
x1 x2
x0 = 1
g(x)
1
1 1
1
-1.5
-1
0.5
0.7
-0.4
x1
x2
1
1-1
-1
The University of Sydney Page 15
Three-layer Neural Network (with a hidden layer)
– A 2D-input example
separating hyperplane
convex polygon region
composition of polygons
composition of polygons
separating hyperplane
convex polygon region
The University of Sydney Page 16
… …
#$%8 ='
9:;
<
(9)89 = *8
=+
!! !" !#
… …
… …
"! "$ "%
!- = 5(678-)
#! #& #'
Feedforward
The University of Sydney Page 17
Universal Approximation Theorem
Let .(⋅) be a nonconstant, bounded, and monotonically-increasing continuous
function. Let 2! denote the 3-dimensional unit hypercube 0,1 !. The space of
continuous functions on 2! is denoted by 6(2!). Then, given any 7 > 0 and any
function . ∈ 6(2!), there exists an integer 9, real constants :”, ;” ∈ ℝ! and real
vectors =” ∈ ℝ!, where > = 1,⋯ ,9″ such that we may define:
AB(C) =D
”
#
:”.(=”
$C + ;”)
as an approximate realization of the function B where B is independent of .; that is,
AB C − B C < 7
for all C ∈ 2!. In other words, functions of the form AB C are dense in 6(2!).
Universality theorem (Hecht-Nielsen 1989): “Neural networks with a single hidden
layer can be used to approximate any continuous function to any desired precision.”
The University of Sydney Page 18
Expressive power of a three-layer neural network
– A visual explanation on one input and one output functions
Video credit to: https://neuralnetworksanddeeplearning.com
– The weight changes the shape of the curve.
– The larger the weight, the steeper the curve.
– The bias moves the graph, but doesn’t change its shape.
,()( + /)
The University of Sydney Page 19
Expressive power of a three-layer neural network
– A visual explanation on one input and one output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– Increase the weight so that the output is a very good approximation
of a step function.
– The step is at position H = −;/=.
– We simplify the problem by describing hidden neurons using just a
single parameter H.
,()( + /)
The University of Sydney Page 20
Expressive power of a three-layer neural network
– A visual explanation on one input and one output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– Left: Setting the weights of output to 1.2 and −1.2, we get a
‘bump’ function, which starts at 0.4, ends at 0.6 and has height 1.2.
– Right: By adding another pair of hidden neurons, we can get more
bumps in the same network.
The University of Sydney Page 21
Expressive power of a three-layer neural network
– A visual explanation on one input and one output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– By increasing the number of hidden neurons, we can make the
approximation much better.
The University of Sydney Page 22
Expressive power of a three-layer neural network
– Overall procedure
x1
x0 = 1
w0=-100
w1=100
g(x)
sigmoid
-10 -8 -6 -4 -2 0 2 4 6 8 10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
-10 -8 -6 -4 -2 0 2 4 6 8 10
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
b=-100
"($)
The University of Sydney Page 23
Expressive power of a three-layer neural network
– How about multiple input functions? The two input case:
Video credit to: https://neuralnetworksanddeeplearning.com
– Similar to one input case, the weight changes the shape and the
bias changes the position.
– The step point: H% = −;/=&
Fix ,& = 0, change ,# and .:
The University of Sydney Page 24
Expressive power of a three-layer neural network
– How about multiple input functions? The two input case:
Video credit to: https://neuralnetworksanddeeplearning.com
The University of Sydney Page 25
Expressive power of a three-layer neural network
– How about multiple input functions? The two input case:
Image credit to: https://neuralnetworksanddeeplearning.com
The University of Sydney Page 26
Expressive power of a three-layer neural network
– How about multiple input functions? The two input case:
Image credit to: https://neuralnetworksanddeeplearning.com
– By adding bias to the output neuron and adjusting the value of ℎ
and ., we can construct a ‘tower’ function.
– The same idea can be used to compute as many towers as we like.
We can also make them as thin as we like, and whatever height we
like. As a result, we can ensure that the weighted output from the
second hidden layer approximates any desired function of two
variables.
The University of Sydney Page 27
http://playground.tensorflow.org/
A toy example
1. Linear case without an hidden layer
2. Nonlinear case with a hidden layer
The University of Sydney Page 28
Activation Functions
The University of Sydney Page 29
Activation functions
– Must be nonlinear: otherwise it is equivalent to a linear classifier
5 !
"!! Linear
function!
The University of Sydney Page 30
Activation functions
– Continuous and differentiable almost everywhere
– Monotonicity: otherwise it introduces additional local extrema
in the error surface
" #
An output value might
correspond to multiple input
values.
The University of Sydney Page 31
Activation function $(&)
– Popular choice of activation functions (single input)
– Sigmoid function
" # =
1
1 + ''(
– Tanh function: shift the center of Sigmoid to the origin
" # =
'( − ''(
'( + ''(
– Rectified linear unit (ReLU)
" # = max (0, #)
– Leaky ReLU
" # = 0
#, if # ≥ 0
4#, if # < 0
, 4 is a small constant.
The University of Sydney Page 32
Vanishing gradient problems
– Saturation.
, 1 =
1
1 + $?@ , 1 =
$@ − $?@
$@ + $?@
v At their extremes, their derivatives are close to 0, which
kills gradient and learning process
Sigmoid function Tanh function
The University of Sydney Page 33
Not zero centered activation
, 1 =
1
1 + $?@
v During training, all gradients will be all positive or all
negative that will cause problem with learning.
Sigmoid function Relu function
, 1 = max (0, 1)
Possible
update direction
Possible
update direction
Optimum point
The University of Sydney Page 34
Dead ReLUs
v For negative numbers, ReLU gives 0, which means
some part of neurons will not be activated and be dead.
ReLU function , 1 = max (0, 1)
v Avoid large learning rate and wrong weight initialization,
and try Leaky ReLU, etc.
The University of Sydney Page 35
Backpropagation
The University of Sydney Page 36
Recall the feedforward calculation
… …
#$%8 ='
9:;
<
(9)89 = *8
=+
!! !" !#
… …
… …
"! "$ "%
#= = "()*+=)
#! #& #'
The University of Sydney Page 37
Training error
– If both %A and 9A are probability distributions, we can use
cross entropy.
B C, D =
1
2
*
'"#
(
C' − D'
& =
1
2
F − G &
B C, D = −*
'"#
(
C' log D'
– Euclidean distance.
– Cross entropy is asymmetric. In some cases we may use this
symmetric form.
B C, D = −*
'"#
(
(C'log D' +D' log C')
The University of Sydney Page 38
Cross Entropy Loss
– Given two probability distributions t and z,
>?@##ABC?@DE F, G = −H
”
C” log J”
= −H
”
C” log C” +H
”
C” log
C”
J”
= ABC?@DE F + K)*(F|G)
KL (Kullback-Leibler)
divergence is a measure
of the information lost
when Z is used to
approximate T
MNtropy
is a measure of
degree of disorder
or randomness.
= −H
”
C” log C” +H
”
C” log C” −H
”
C” log J”
The University of Sydney Page 39
Softmax Function
v In multi-class classification, softmax function before the
output layer is to assign conditional probabilities (given () to
each one of the : classes.
)JC#
)JC&
)JC)
)JC’
D#
D&
D)
D’
C#
C&
C)
C’
Output Ground-truth
0.1
0.2
0.6
0.1
0
0
M
0
;< =>?11A ( = 9A =
$NOP!
∑9:Q
R $NOP”
The University of Sydney Page 40
Gradient descent
– Weights are initialized with random values, and then are
updated in a direction reducing the error.
where A is the learning rate.
Iteratively update
∆* = −A
CD
C*
* % + 1 = * % + ∆*(%)
The University of Sydney Page 41
Hidden-to-output weight ,OP
– Chain rule
-.
-/Q=
= -.-)*+Q
-)*+Q
-/Q=
= -.-)*+Q
#=
D * =
1
2
F − G S
9A = ,(#$%A)
#$%A =’
8:Q
N#
H8)A8 +)A;
x1 xi xd
… …
…
…
…
…
ynHyjy1
z1 zk zC
wkj
wji
The University of Sydney Page 42
Input-to-hidden weight ,PR
-.
-/=S
=
-.
-#=
-#=
-)*+=
-)*+=
/=S
x1 xi xd
… …
…
…
…
…
ynHyjy1
z1 zk zC
wkj
wji )*+= =0
STU
V
!=/=S +/=W
-.
-#=
= 0
QTU
X -.
-2Q
-2Q
-#=
#= = “()*+=)
The University of Sydney Page 43
Vanishing gradients problem
-.
-/=S
=
-.
-#=
-#=
-)*+=
-)*+=
/=S
x1 xi xd
… …
…
…
…
…
ynHyjy1
z1 zk zC
wkj
wji
TU*
T)JC*
= V+(*
!”#
,
+*,*! + ,*-)
The University of Sydney Page 44
Stochastic Gradient Descent
The University of Sydney Page 45
Stochastic Gradient Descent (SGD)
– Given # training examples, the target function can be
expressed as
D * = ‘
W:Q
N
DW(*)
– Batch gradient descent
* ← *−
A
#
‘
W:Q
N
JDW *
– In SGD, the gradient of D(*) is approximated on a single
example or a mini-batch of examples
* ← *− AJDW(*)
The University of Sydney Page 46
One-example based Backpropagation
– Algorithm 1 (one-example based BP)
– 1 begin initialize network topology (B+), Y, criterion Z, [, \ ← 0
– 2 do \ ← \ + 1
– 3 ^! ← randomly chosen pattern
– 4 _,” ← _,” + [ ,̀^”; _-, ← _-, + [`-E,
– 5 until ab Y < Z
– 6 return Y
– 7 end
– In one-example based training, a weight update may
reduce the error on the single pattern being presented,
yet increase the error on the full training set.
The University of Sydney Page 47
Mini-batch based SGD
– Divide the training set into mini-batches.
– In each epoch, randomly permute mini-batches and take a
mini-batch sequentially to approximate the gradient
– One epoch is usually defined to be one complete run through all of the
training data.
– The estimated gradient at each iteration is more reliable.
The University of Sydney Page 48
Mini-Batch Backpropagation
– Algorithm 2 (mini-batch based BP)
– 1 begin initialize network topology (B+), Y, criterion Z, [, r ← 0
– 2 do ? ← ? + 1 (increment epoch)
– 3 \ ← 0; ∆_,"← 0; ∆_-,← 0;
– 4 do \ ← \ + 1
– 5 ^! ← randomly chosen pattern
– 6 ∆_,"← ∆_," + [ ,̀^"; ∆_-, ← ∆_-, + [`-E,
– 7 until \ = B
– 8 _,"← _," + ∆_,"; _-, ← _-, + ∆_-,
– 9 until ab Y < Z
– 10 return Y
– 11 end
The University of Sydney Page 49
SGD Analysis
– One-example based SGD
– Estimation of the gradient is noisy, and the weights may not move
precisely down the gradient at each iteration
– Faster than batch learning, especially when training data has
redundancy
– Noise often results in better solutions
– The weights fluctuate and it may not fully converge to a local minimum
– Min-batch based SGD
– Conditions of convergence are well understood
– Some acceleration techniques only operate in batch learning
– Theoretical analysis of the weight dynamics and convergence rates are
simpler
The University of Sydney Page 50
Summarization
The University of Sydney Page 51
https://github.com/pytorch/examples/blob/master/mnist/main.py
A toy example
https://pytorch.org/docs/stable/nn.functional.html#non-
linear-activation-functions
https://pytorch.org/docs/stable/nn.html#linear
https://pytorch.org/docs/stable/nn.functional.html
https://pytorch.org/docs/stable/nn.html
The University of Sydney Page 52
https://github.com/pytorch/examples/blob/master/mnist/main.py
A toy example
The University of Sydney Page 53
Thank you!