Multilayer Neural Network
Dr Chang Xu
School of Computer Science
The University of Sydney
Perceptron: the Prelude of Deep Learning
A neuron cell
Image credit to: thinglink.com
Perceptron: the Prelude of DL
[Rosenblatt, et al. 1958]
[Diagram: a perceptron with inputs $x_1, \dots, x_d$, weights $w_1, \dots, w_d$, bias $b$, and output $o$]

$o = \sum_{i=1}^{d} w_i x_i + b$
Perceptron
– The neuron has a real-valued output which is a weighted
sum of its inputs.
$o = \sum_{i=1}^{d} w_i x_i + b = \mathbf{w}^\top \mathbf{x} + b$
where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input vector, $b$ is the bias, and $o$ is the neuron's estimate of the desired output.
Recall the decision rule
If an example $(\mathbf{x}_n, y_n)$ can be correctly classified, we have $y_n(\mathbf{w}^\top \mathbf{x}_n + b) > 0$;
otherwise $y_n(\mathbf{w}^\top \mathbf{x}_n + b) < 0$.
Perceptron
If an example $(\mathbf{x}_n, y_n)$ can be correctly classified, we have $y_n(\mathbf{w}^\top \mathbf{x}_n + b) > 0$;
otherwise $y_n(\mathbf{w}^\top \mathbf{x}_n + b) < 0$.
The objective function of Perceptron can be written as
$\min_{\mathbf{w}, b} J(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n (\mathbf{w}^\top \mathbf{x}_n + b)$
where $M$ stands for the set of the misclassified examples.
Perceptron
$\min_{\mathbf{w}, b} J(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n (\mathbf{w}^\top \mathbf{x}_n + b)$
Use gradient descent to solve the optimization problem.
$\nabla_{\mathbf{w}} J(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n \mathbf{x}_n$
$\nabla_{b} J(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n$
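Below is a minimal NumPy sketch of these update rules (not from the slides): it assumes labels $y_n \in \{-1, +1\}$ and repeatedly applies the gradients above to the misclassified set $M$; the function name, learning rate, and toy data are illustrative.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Minimise J(w, b) = -sum_{x_n in M} y_n (w^T x_n + b) by gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        M = y * (X @ w + b) <= 0                         # misclassified set M
        if not M.any():                                  # every y_n (w^T x_n + b) > 0
            break
        w -= lr * -(y[M][:, None] * X[M]).sum(axis=0)    # grad_w J = -sum_M y_n x_n
        b -= lr * -y[M].sum()                            # grad_b J = -sum_M y_n
    return w, b

# Toy linearly separable data with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))   # matches y once training has converged
```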
Perceptron
$g(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{d} w_i x_i + b\right) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x}), \qquad \mathrm{sign}(a) = \begin{cases} 1, & \text{if } a \ge 0 \\ -1, & \text{if } a < 0 \end{cases}$
where the bias is absorbed by setting $x_0 = 1$ and $w_0 = b$.
Linear classifier!

[Diagram: a single neuron with inputs $x_0 = 1, x_1, \dots, x_d$, weights $w_0, w_1, \dots, w_d$, and output $g(\mathbf{x})$]
Perceptron “AI winter”
Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.
• Before long researchers had begun to discover the Perceptron’s limitations.
• Unless input categories were “linearly separable”, a perceptron could not learn to discriminate between them.
• Unfortunately, it appeared that many important categories were not linearly separable.
• E.g., those inputs to an XOR gate that give an output of 1 (namely 10 & 01) are not linearly separable from those that do not (00 & 11).
Perceptron: Limitations
– Solving XOR (exclusive-or) with the Perceptron?

x1  x2  |  y
 0   0  |  0
 0   1  |  1
 1   0  |  1
 1   1  |  0
– A linear classifier $w_1 x_1 + w_2 x_2 + b$ cannot identify non-linear patterns.
This failure caused the majority of researchers to walk away.
[Figure: the four XOR points plotted in the $(x_1, x_2)$ plane; no single line separates the 1s from the 0s]
Breakthrough: Multi-Layer Perceptron
– In 1986, Geoffrey Hinton, David Rumelhart, and Ronald Williams published a paper, “Learning representations by back-propagating errors”, which introduced
– Backpropagation, a procedure to repeatedly adjust the weights so as to minimize the difference between actual output and desired output
– Hidden Layers, which are neuron nodes stacked in between inputs and outputs, allowing neural networks to learn more complicated features (such as XOR logic)
Multilayer Neural Network
Multilayer Neural Networks
– Function composition can be described by a directed acyclic graph (hence feedforward networks)
– Depth is the maximum $n$ in the function composition chain
– The final layer is called the output layer
Two-layer Neural Network
Linear classifier!
[Diagram: a two-layer network with inputs $x_0 = 1, x_1, \dots, x_d$, weights $w_0, w_1, \dots, w_d$, and output $g(\mathbf{x})$; with a single layer of weights it is still a linear classifier]
Three-layer Neural Network (with a hidden layer)
[Diagram: a three-layer network with inputs $x_0 = 1, x_1, x_2$, one hidden layer, and output $g(\mathbf{x})$, shown with concrete weight values (e.g. $\pm 1$, $0.5$, $0.7$, $-0.4$, $-1.5$)]
Three-layer Neural Network (with a hidden layer)
– A 2D-input example
[Figure: each hidden unit defines a separating hyperplane; together they carve out a convex polygon region, and further layers form compositions of polygons]
Feedforward
[Diagram: feedforward network with inputs $x_1, \dots, x_d$, hidden units $y_1, \dots, y_{n_H}$, and outputs $z_1, \dots, z_C$]

$\mathrm{net}_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}, \qquad y_j = f(\mathrm{net}_j)$
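A minimal NumPy sketch of this feedforward computation, assuming sigmoid activations and biases handled by an appended constant input; the layer sizes and random weights below are illustrative, not the slides' example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feedforward(x, W_hidden, W_output):
    """Compute y_j = f(net_j) and z_k = f(net_k) for one input vector x.

    W_hidden : (n_H, d+1) input-to-hidden weights (last column is the bias w_j0).
    W_output : (C, n_H+1) hidden-to-output weights (last column is the bias w_k0).
    """
    x_aug = np.append(x, 1.0)            # constant input for the bias term
    net_j = W_hidden @ x_aug             # net_j = sum_i w_ji x_i + w_j0
    y = sigmoid(net_j)                   # hidden activations y_j = f(net_j)
    y_aug = np.append(y, 1.0)
    net_k = W_output @ y_aug             # net_k = sum_j w_kj y_j + w_k0
    z = sigmoid(net_k)                   # outputs z_k = f(net_k)
    return y, z

# Example: d = 2 inputs, n_H = 3 hidden units, C = 2 outputs
rng = np.random.default_rng(0)
y, z = feedforward(np.array([0.5, -1.0]),
                   rng.normal(size=(3, 3)), rng.normal(size=(2, 4)))
print(z)
```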
Universal Approximation Theorem
Let $\varphi(\cdot)$ be a nonconstant, bounded, and monotonically-increasing continuous function. Let $I_m$ denote the $m$-dimensional unit hypercube $[0,1]^m$. The space of continuous functions on $I_m$ is denoted by $C(I_m)$. Then, given any $\varepsilon > 0$ and any function $f \in C(I_m)$, there exists an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $\mathbf{w}_i \in \mathbb{R}^m$, where $i = 1, \dots, N$, such that we may define
$F(\mathbf{x}) = \sum_{i=1}^{N} v_i\, \varphi(\mathbf{w}_i^\top \mathbf{x} + b_i)$
as an approximate realization of the function $f$, where $f$ is independent of $\varphi$; that is,
$|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$
for all $\mathbf{x} \in I_m$. In other words, functions of the form $F(\mathbf{x})$ are dense in $C(I_m)$.
Universality theorem (Hecht-Nielsen 1989): “Neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision.”
Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions $\sigma(wx + b)$
Video credit to: https://neuralnetworksanddeeplearning.com
– The weight changes the shape of the curve.
– The larger the weight, the steeper the curve.
– The bias moves the graph, but doesn’t change its shape.
Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions $\sigma(wx + b)$
Image credit to: https://neuralnetworksanddeeplearning.com
– Increase the weight so that the output is a very good approximation of a step function.
– The step is at position $s = -b/w$.
– We simplify the problem by describing hidden neurons using just a single parameter $s$.
Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– Left: Setting the weights of output to 1.2 and −1.2, we get a
‘bump’ function, which starts at 0.4, ends at 0.6 and has height 1.2.
– Right: By adding another pair of hidden neurons, we can get more
bumps in the same network.
Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– By increasing the number of hidden neurons, we can make the
approximation much better.
Expressive power of a three-layer neural network
– Overall procedure
[Figure: a single sigmoid unit with input $x_0 = 1$, weights $w_0 = -100$ and $w_1 = 100$, and output $g(\mathbf{x})$; the large weight makes the sigmoid output approximate a step function at $x_1 = 1$]
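A small sketch of the same idea with illustrative parameter values (not the slides' exact numbers): a pair of steep sigmoid hidden neurons with output weights $+h$ and $-h$ produces a "bump" between two step positions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def step_neuron(x, s, w=50.0):
    """A sigmoid hidden neuron with a large weight w approximates a step at x = s = -b/w."""
    return sigmoid(w * x - w * s)            # w*x + b with b = -w*s

def bump(x, s1=0.4, s2=0.6, h=1.2):
    """Weighted output of a pair of hidden neurons: a bump of height ~h between s1 and s2."""
    return h * step_neuron(x, s1) - h * step_neuron(x, s2)

x = np.linspace(0.0, 1.0, 11)
print(np.round(bump(x), 2))                  # ~0 outside [0.4, 0.6], ~1.2 inside
```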
Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case: fix $w_2 = 0$ and change $w_1$ and $b$:
Video credit to: https://neuralnetworksanddeeplearning.com
– Similar to one input case, the weight changes the shape and the bias changes the position.
– The step point: $s_1 = -b/w_1$
Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Video credit to: https://neuralnetworksanddeeplearning.com
Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Image credit to: https://neuralnetworksanddeeplearning.com
Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Image credit to: https://neuralnetworksanddeeplearning.com
– By adding a bias to the output neuron and adjusting the values of $h$ and $b$, we can construct a 'tower' function.
– The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables.
A toy example
http://playground.tensorflow.org/
1. Linear case without a hidden layer
2. Nonlinear case with a hidden layer
Activation Functions
Activation functions
– Must be nonlinear: otherwise it is equivalent to a linear classifier
"!!
Linear function!
5
!
The University of Sydney
Page 29
Activation functions
– Continuous and differentiable almost everywhere
– Monotonicity: otherwise it introduces additional local extrema in the error surface
"
#
The University of Sydney
Page 30
An output value might correspond to multiple input values.
Activation function $f(x)$
– Popular choice of activation functions (single input)
– Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$
– Tanh function (shifts the center of the sigmoid to the origin): $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
– Rectified linear unit (ReLU): $f(x) = \max(0, x)$
– Leaky ReLU: $f(x) = \begin{cases} x, & \text{if } x \ge 0 \\ \alpha x, & \text{if } x < 0 \end{cases}$, where $\alpha$ is a small constant.
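A quick NumPy sketch of these four activations; the Leaky ReLU slope $\alpha = 0.01$ is an illustrative choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # f(x) = 1 / (1 + e^{-x})

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # zero-centred sigmoid

def relu(x):
    return np.maximum(0.0, x)                  # f(x) = max(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x >= 0, x, alpha * x)      # small slope alpha for x < 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, np.round(f(x), 3))
```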
Vanishing gradient problems
– Saturation. Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$; Tanh function: $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
– At their extremes, their derivatives are close to 0, which kills the gradient and the learning process.
Not zero-centered activation
Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$; ReLU function: $f(x) = \max(0, x)$
– During training, the gradients for all weights will be all positive or all negative, which causes problems with learning.
[Figure: the possible update directions compared with the direction toward the optimum point]
Dead ReLUs
ReLU function: $f(x) = \max(0, x)$
– For negative inputs, ReLU outputs 0, which means some neurons will never be activated and become "dead".
– Avoid large learning rates and poor weight initialization, and try Leaky ReLU, etc.
Backpropagation
Recall the feedforward calculation
[Diagram: feedforward network with inputs $x_1, \dots, x_d$, hidden units $y_1, \dots, y_{n_H}$, and outputs $z_1, \dots, z_C$]

$\mathrm{net}_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}, \qquad y_j = f(\mathrm{net}_j)$
Training error
– Euclidean distance:
$J(\mathbf{t}, \mathbf{z}) = \frac{1}{2}\sum_{k=1}^{C}(t_k - z_k)^2 = \frac{1}{2}\|\mathbf{t} - \mathbf{z}\|^2$
– If both $\mathbf{t}$ and $\mathbf{z}$ are probability distributions, we can use cross entropy:
$J(\mathbf{t}, \mathbf{z}) = -\sum_{k=1}^{C} t_k \log z_k$
– Cross entropy is asymmetric. In some cases we may use this symmetric form:
$J(\mathbf{t}, \mathbf{z}) = -\sum_{k=1}^{C}(t_k \log z_k + z_k \log t_k)$
Cross Entropy Loss
– Given two probability distributions $\mathbf{t}$ and $\mathbf{z}$,
$\mathrm{CrossEntropy}(\mathbf{t}, \mathbf{z}) = -\sum_{k} t_k \log z_k$
$= -\sum_{k} t_k \log t_k + \sum_{k} t_k \log t_k - \sum_{k} t_k \log z_k$
$= -\sum_{k} t_k \log t_k + \sum_{k} t_k \log \frac{t_k}{z_k}$
$= \mathrm{Entropy}(\mathbf{t}) + \mathrm{KL}(\mathbf{t}\,\|\,\mathbf{z})$
– Entropy is a measure of the degree of disorder or randomness.
– KL (Kullback-Leibler) divergence is a measure of the information lost when $\mathbf{z}$ is used to approximate $\mathbf{t}$.
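A short numeric check of this decomposition (the distributions t and z below are illustrative): cross entropy equals entropy plus KL divergence, and swapping the arguments gives a different value, which shows the asymmetry.

```python
import numpy as np

t = np.array([0.7, 0.2, 0.1])       # "ground-truth" distribution (illustrative)
z = np.array([0.5, 0.3, 0.2])       # model distribution (illustrative)

cross_entropy = -np.sum(t * np.log(z))
entropy       = -np.sum(t * np.log(t))
kl            =  np.sum(t * np.log(t / z))

print(cross_entropy, entropy + kl)  # the two values agree
print(-np.sum(z * np.log(t)))       # swapping t and z gives a different value: asymmetric
```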
Softmax Function
– In multi-class classification, the softmax function before the output layer assigns conditional probabilities (given the input $\mathbf{x}$) to each of the $C$ classes.
$z_k = P(y = k \mid \mathbf{x}) = \dfrac{e^{\mathrm{net}_k}}{\sum_{j=1}^{C} e^{\mathrm{net}_j}}$

Example with $C = 4$ classes (from $\mathrm{net}_1, \dots, \mathrm{net}_4$):
Output:        $z_1 = 0.1$, $z_2 = 0.2$, $z_3 = 0.6$, $z_4 = 0.1$
Ground truth:  $t_1 = 0$,  $t_2 = 0$,  $t_3 = 1$,  $t_4 = 0$
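A sketch of the softmax and cross-entropy computation in NumPy; the pre-activations net_k below are illustrative values chosen so the resulting probabilities roughly match the example above.

```python
import numpy as np

def softmax(net):
    """z_k = exp(net_k) / sum_j exp(net_j), computed stably."""
    e = np.exp(net - np.max(net))        # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(t, z):
    return -np.sum(t * np.log(z))

net = np.array([1.0, 1.7, 2.8, 1.0])     # illustrative pre-activations net_k
z = softmax(net)
t = np.array([0.0, 0.0, 1.0, 0.0])       # one-hot ground truth (class 3)
print(np.round(z, 2))                    # probabilities summing to 1, roughly [0.1, 0.2, 0.6, 0.1]
print(cross_entropy(t, z))               # = -log z_3
```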
Gradient descent
– Weights are initialized with random values, and then are updated in a direction reducing the error.
$\Delta \mathbf{w} = -\eta \dfrac{\partial J}{\partial \mathbf{w}}$
where $\eta$ is the learning rate.
Iteratively update
$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta \mathbf{w}(t)$
Hidden-to-output weight $w_{kj}$ – Chain rule
$\dfrac{\partial J}{\partial w_{kj}} = \dfrac{\partial J}{\partial \mathrm{net}_k} \dfrac{\partial \mathrm{net}_k}{\partial w_{kj}} = \dfrac{\partial J}{\partial \mathrm{net}_k}\, y_j$
where
$J(\mathbf{w}) = \frac{1}{2}\|\mathbf{t} - \mathbf{z}\|^2, \qquad z_k = f(\mathrm{net}_k), \qquad \mathrm{net}_k = \sum_{j=1}^{n_H} w_{kj} y_j + w_{k0}$
[Diagram: network with inputs $x_1, \dots, x_d$, hidden units $y_1, \dots, y_{n_H}$, outputs $z_1, \dots, z_C$, input-to-hidden weights $w_{ji}$, and hidden-to-output weights $w_{kj}$]
Input-to-hidden weight $w_{ji}$
$\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j} \dfrac{\partial y_j}{\partial \mathrm{net}_j} \dfrac{\partial \mathrm{net}_j}{\partial w_{ji}}$
where
$\dfrac{\partial J}{\partial y_j} = \sum_{k=1}^{C} \dfrac{\partial J}{\partial z_k} \dfrac{\partial z_k}{\partial y_j}, \qquad \mathrm{net}_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}, \qquad y_j = f(\mathrm{net}_j)$
[Diagram: the error signal for hidden unit $y_j$ is collected from all output units $z_1, \dots, z_C$ through the hidden-to-output weights $w_{kj}$]
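A single-example sketch of these two chain-rule computations, assuming the squared error $J = \frac{1}{2}\|\mathbf{t} - \mathbf{z}\|^2$ and sigmoid activations on both layers (so $f'(\mathrm{net}) = f(\mathrm{net})(1 - f(\mathrm{net}))$); the weight layout follows the earlier feedforward sketch and is an assumption, not the slides' code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_single(x, t, W_hidden, W_output):
    """Gradients of J = 0.5 * ||t - z||^2 w.r.t. w_kj and w_ji for one example."""
    # Forward pass (same conventions as the feedforward sketch)
    x_aug = np.append(x, 1.0)
    net_j = W_hidden @ x_aug
    y = sigmoid(net_j)
    y_aug = np.append(y, 1.0)
    net_k = W_output @ y_aug
    z = sigmoid(net_k)

    # Hidden-to-output: dJ/dw_kj = dJ/dnet_k * y_j, with dJ/dnet_k = (z - t) * f'(net_k)
    delta_k = (z - t) * z * (1.0 - z)                  # sigmoid derivative: z * (1 - z)
    grad_W_output = np.outer(delta_k, y_aug)

    # Input-to-hidden: dJ/dy_j = sum_k dJ/dnet_k * w_kj, then multiply by f'(net_j) and x_i
    delta_j = (W_output[:, :-1].T @ delta_k) * y * (1.0 - y)
    grad_W_hidden = np.outer(delta_j, x_aug)
    return grad_W_hidden, grad_W_output

# Example: d = 2, n_H = 3, C = 2 (same layout as the feedforward sketch)
rng = np.random.default_rng(0)
gW_h, gW_o = backprop_single(np.array([0.5, -1.0]), np.array([1.0, 0.0]),
                             rng.normal(size=(3, 3)), rng.normal(size=(2, 4)))
print(gW_h.shape, gW_o.shape)   # (3, 3) and (2, 4), matching the weight matrices
```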
Vanishing gradients problem
$\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j} \dfrac{\partial y_j}{\partial \mathrm{net}_j} \dfrac{\partial \mathrm{net}_j}{\partial w_{ji}}$
[Diagram: propagating the error back from the outputs $z_1, \dots, z_C$ to an input-to-hidden weight $w_{ji}$ multiplies several derivative terms together; with saturating activations each factor is small, so the gradient can vanish in earlier layers]
Stochastic Gradient Descent
Stochastic Gradient Descent (SGD)
– Given $N$ training examples, the target function can be expressed as
$J(\mathbf{w}) = \sum_{n=1}^{N} J_n(\mathbf{w})$
– Batch gradient descent
$\mathbf{w} \leftarrow \mathbf{w} - \eta \sum_{n=1}^{N} \nabla J_n(\mathbf{w})$
– In SGD, the gradient of $J(\mathbf{w})$ is approximated on a single example or a mini-batch of examples
$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla J_n(\mathbf{w})$
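A minimal sketch contrasting the batch and stochastic updates, using an illustrative least-squares objective $J_n(\mathbf{w}) = \frac{1}{2}(\mathbf{w}^\top \mathbf{x}_n - t_n)^2$; the data, batch size, and learning rate are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # N = 200 examples, d = 3
w_true = np.array([1.0, -2.0, 0.5])
t = X @ w_true + 0.01 * rng.normal(size=200)   # targets for J_n(w) = 0.5*(w^T x_n - t_n)^2

def grad_J(w, idx):
    """Gradient of the sum of J_n over the examples in idx."""
    return X[idx].T @ (X[idx] @ w - t[idx])

eta = 0.001

# Batch gradient descent: use all N examples per update
w = np.zeros(3)
for _ in range(500):
    w = w - eta * grad_J(w, np.arange(len(X)))

# SGD: approximate the gradient with a random mini-batch
w_sgd = np.zeros(3)
for _ in range(500):
    batch = rng.choice(len(X), size=16, replace=False)
    w_sgd = w_sgd - eta * grad_J(w_sgd, batch)

print(np.round(w, 2), np.round(w_sgd, 2))      # both approach w_true
```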
One-example based Backpropagation
– Algorithm 1 (one-example based BP)
– 1 begin initialize network topology ($n_H$), $\mathbf{w}$, criterion $\theta$, learning rate $\eta$, $m \leftarrow 0$
– 2 do $m \leftarrow m + 1$
– 3     $\mathbf{x}^m \leftarrow$ randomly chosen pattern
– 4     $w_{ji} \leftarrow w_{ji} + \eta\, \delta_j x_i$;  $w_{kj} \leftarrow w_{kj} + \eta\, \delta_k y_j$
– 5 until $\|\nabla J(\mathbf{w})\| < \theta$
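A compact, self-contained sketch of this one-example backpropagation loop applied to the XOR problem from earlier. The network size, learning rate, and stopping criterion are illustrative choices, and the update is written in gradient-descent form ($-\eta\, \partial J / \partial w$), which is equivalent to the $+\eta\delta$ form above with $\delta = -\partial J / \partial \mathrm{net}$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR training set from the earlier slides (targets 0/1)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([0., 1., 1., 0.])

rng = np.random.default_rng(1)
W_hidden = rng.normal(size=(4, 3))        # 4 hidden units, 2 inputs + bias (x_0 = 1)
W_output = rng.normal(size=(1, 5))        # 1 output, 4 hidden units + bias
eta, theta = 0.5, 0.01                    # learning rate and stopping criterion

def forward(x):
    y = sigmoid(W_hidden @ np.append(x, 1.0))
    return y, sigmoid(W_output @ np.append(y, 1.0))

for m in range(1, 200001):                                 # do m <- m + 1
    n = rng.integers(len(X))                               # x^m <- randomly chosen pattern
    x_aug = np.append(X[n], 1.0)
    y = sigmoid(W_hidden @ x_aug)
    y_aug = np.append(y, 1.0)
    z = sigmoid(W_output @ y_aug)
    delta_k = (z - T[n]) * z * (1.0 - z)                   # dJ/dnet_k for J = 0.5*(t - z)^2
    delta_j = (W_output[:, :-1].T @ delta_k) * y * (1.0 - y)
    W_output -= eta * np.outer(delta_k, y_aug)             # hidden-to-output update
    W_hidden -= eta * np.outer(delta_j, x_aug)             # input-to-hidden update
    if m % 1000 == 0:                                      # "until" criterion, checked periodically
        err = 0.5 * sum((T[i] - forward(X[i])[1][0]) ** 2 for i in range(len(X)))
        if err < theta:
            break

for x, t in zip(X, T):
    print(x, t, np.round(forward(x)[1][0], 2))             # output approaches the XOR target
```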