09_intro2dl
Qiuhong Ke
Introduction to Deep Learning
– Recap of Neural Networks
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
Before we start
Books & resources
• Deep Learning with Python, by Francois Chollet, available in the Unimelb library
• Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville, https://www.deeplearningbook.org/
2
Angry birds
vs
3
Source: by Papachan (CC BY)
https://www.sketchport.com/drawing/4524248689278976/bomb-and-chuck
https://heroes-of-the-characters.fandom.com/wiki/Chuck
https://www.pexels.com/photo/angry-angry-bird-animal-animated-415381/
https://www.sketchport.com/user/4893810324668416/papachan
Bird classification
Input → Feature engineering (colour or body shape or …) → Classification (SVM, …) → Output: bird class
4
Source: by Papachan (CC BY)
https://www.sketchport.com/drawing/4524248689278976/bomb-and-chuck
https://www.pexels.com/photo/angry-angry-bird-animal-animated-415381/
https://www.sketchport.com/user/4893810324668416/papachan
In real life…
5
Source: Welinder, Peter, et al. “Caltech-UCSD birds 200.” (2010).
https://vision.cornell.edu/se3/wp-content/uploads/2014/09/WelinderEtal10_CUB-200.pdf
More features for real birds…
6
Source: Welinder, Peter, et al. “Caltech-UCSD birds 200.” (2010).
‘REAL Bird’ classification
Input → Feature engineering + Classification (use deep learning models!) → Output: bird class
7
Outline
• Introduction
• Neural network (Perceptron and Multi-layer Perceptron)
• Gradient descent algorithm
8
Deep neural network
Input → Feature learning + classification (many hidden layers) → Output
[Diagram: deep neural network with an input layer, many hidden layers, and an output layer]
9
Deep learning is everywhere
Near-human-level
image classification
speech recognition
handwriting transcription
autonomous driving
….
• Large dataset
• Powerful computing resources
• Better algorithms:
• Weight-initialization schemes
• Optimisation schemes
• Activation functions
• …
Networks: deeper and deeper
Why is deep learning popular now?
11
Perceptron
[Diagram: perceptron with inputs x_1, x_2, a constant input 1, weights w_1, w_2, bias weight w_0, a summation node producing s, and activation f(s)]
• x_1, x_2 – inputs
• w_1, w_2 – synaptic weights
• w_0 – bias weight
• f – activation function
Predict class A if s ≥ 0
Predict class B if s < 0
where s = ∑_{i=0}^{m} x_i w_i
12
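As a concrete illustration of the prediction rule above, here is a minimal Python/NumPy sketch (not from the slides; the function name perceptron_predict is illustrative, and the example weights are taken from the worked example later in the lecture):

```python
import numpy as np

def perceptron_predict(x, w):
    """Perceptron forward pass: s = sum_i w_i * x_i, where x[0] is the constant bias input 1."""
    s = np.dot(w, x)            # weighted sum, including the bias weight w[0]
    return 1 if s >= 0 else -1  # class A (+1) if s >= 0, class B (-1) otherwise

w = np.array([-0.1, 0.2, 0.0])   # [w0, w1, w2]
x = np.array([1.0, 1.0, 0.0])    # bias input 1, then (x1, x2) = (1, 0)
print(perceptron_predict(x, w))  # -> 1, since s = 0.1 >= 0
```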
Step function: f(s) = 1 if s ≥ 0, 0 if s < 0
Sign function: f(s) = 1 if s ≥ 0, −1 if s < 0
Perceptron is a linear binary classifier
if s ≥ 0 : f(s) is positive class
if s < 0 : f(s) is negative class
13
Simple example
Exercise: find weights of
a perceptron capable of
perfect classification of
the following dataset
14
[Dataset shown as a figure on the slide]
Limitations of perceptron learning
• If the data is linearly separable, the perceptron training algorithm will converge to a correct solution
  * It will converge to some solution (separating boundary), one of infinitely many possible ones ← bad!
• However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution
  * Ugly ☹
Some problems are linearly separable (e.g. AND, OR), but many are not (e.g. XOR)
15
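The XOR claim can be checked numerically. The brute-force search below (an illustrative sketch, not from the slides; the grid and function name are assumptions) scans candidate weights (w_0, w_1, w_2) and reports whether any single perceptron classifies the dataset perfectly. It finds weights for AND and OR but none for XOR:

```python
import itertools
import numpy as np

def separable(X, y, grid):
    """Return True if some perceptron with weights (w0, w1, w2) from the grid classifies (X, y) perfectly."""
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        s = w0 + w1 * X[:, 0] + w2 * X[:, 1]
        pred = np.where(s >= 0, 1, -1)
        if np.all(pred == y):
            return True
    return False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
grid = np.linspace(-2, 2, 41)        # candidate weight values
y_and = np.array([-1, -1, -1, 1])
y_or  = np.array([-1, 1, 1, 1])
y_xor = np.array([-1, 1, 1, -1])
print(separable(X, y_and, grid))  # True
print(separable(X, y_or,  grid))  # True
print(separable(X, y_xor, grid))  # False: XOR is not linearly separable
```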
Multi-layer perceptron: function composition
[Diagram: multi-layer perceptron with input layer (x_1, …, x_m), hidden layer(s) (u_1, …, u_p), and output layer (z_1, …, z_q)]
Hidden layer (activation function g):
u_j = g(r_j), where r_j = v_{0j} + ∑_{i=1}^{m} x_i v_{ij} = ∑_{i=0}^{m} x_i v_{ij} (with constant input x_0 = 1)
Output layer (activation function h):
z_k = h(s_k), where s_k = w_{0k} + ∑_{j=1}^{p} u_j w_{jk} = ∑_{j=0}^{p} u_j w_{jk} (with constant input u_0 = 1)
16
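The hidden- and output-layer computations above can be written compactly with matrices. A minimal sketch follows (not from the slides; it assumes sigmoid activations for both g and h, as in the later slides, and the names V and W are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, V, W):
    """One-hidden-layer MLP forward pass.
    x: inputs (m,), V: input-to-hidden weights (m+1, p), W: hidden-to-output weights (p+1, q).
    A leading 1 is prepended to carry the bias weights v_0j and w_0k."""
    x_b = np.concatenate(([1.0], x))   # add bias input
    r = x_b @ V                        # r_j = v_0j + sum_i x_i v_ij
    u = sigmoid(r)                     # u_j = g(r_j)
    u_b = np.concatenate(([1.0], u))   # add bias input for the next layer
    s = u_b @ W                        # s_k = w_0k + sum_j u_j w_jk
    z = sigmoid(s)                     # z_k = h(s_k)
    return z

rng = np.random.default_rng(0)
z = mlp_forward(np.array([0.5, -1.0, 2.0]), rng.normal(size=(4, 5)), rng.normal(size=(6, 2)))
print(z)  # two outputs, each in (0, 1)
```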
Common non-linear activation functions
ReLU: f(x) = max(0, x)
Sigmoid: f(x) = 1 / (1 + e^{−x}) = e^x / (e^x + 1)
TanH: f(x) = 2 / (1 + e^{−2x}) − 1 = 2e^{2x} / (e^{2x} + 1) − 1
Softmax (multi-class classification): f(x_i) = e^{x_i} / ∑_{j=1}^{C} e^{x_j}, for i = 1, …, C
17
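A sketch of these activation functions in NumPy (illustrative, not from the slides; softmax is written with the usual max-subtraction for numerical stability, an implementation detail not on the slide):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)                    # f(x) = max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))              # f(x) = 1 / (1 + e^{-x})

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0  # equivalent to np.tanh(x)

def softmax(x):
    e = np.exp(x - np.max(x))                    # subtract max for numerical stability
    return e / e.sum()                           # f(x_i) = e^{x_i} / sum_j e^{x_j}

x = np.array([-1.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x), softmax(x), softmax(x).sum())  # softmax sums to 1
```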
How to train your dragon network?
Adapted from Movie Poster from
Flickr user jdxyw (CC BY-SA 2.0)
18
“Training”: adjust weights to minimise loss.
[Training loop diagram: Input → Neural network → Predictions; Predictions and Targets (labels) → Loss function → Loss score (a penalty/loss is added for misclassified examples); the loss score is used to update the weights]
How?
19
Derivative
Input x, output y: f(x)
Δx → 0: f(x_1 + Δx) − f(x_1) = a·Δx
a: rate of change (derivative) of f at x_1
Example: y = f(x) = x²
The derivative can be viewed as a vector from the origin; moving x in the OPPOSITE direction of this vector decreases f(x)
[Figure: parabola y = x² with points (x_1, y_1) and (x_2, y_2)]
20
Gradient
z = f(x, y) = x² + y²
Gradient: G = [∂z/∂x, ∂z/∂y]
Move the input (x, y) in the OPPOSITE direction of the gradient vector to decrease the output f(x, y)
[Figure: surface plot of z over the (x, y) plane]
21
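A small numeric check of this idea (an illustrative sketch, not from the slides): estimate the gradient of f(x, y) = x² + y² by finite differences and confirm that a step in the opposite direction decreases the output.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2

def numeric_gradient(f, p, eps=1e-6):
    """Finite-difference estimate of [df/dx, df/dy] at point p."""
    g = np.zeros_like(p)
    for i in range(len(p)):
        d = np.zeros_like(p)
        d[i] = eps
        g[i] = (f(p + d) - f(p - d)) / (2 * eps)
    return g

p = np.array([1.0, 2.0])
g = numeric_gradient(f, p)   # ~ [2x, 2y] = [2, 4]
p_new = p - 0.1 * g          # move opposite to the gradient
print(g, f(p), f(p_new))     # f decreases: 5.0 -> ~3.2
```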
Loss = function (weights)
[The same training loop diagram as on slide 19: the loss score is a function of the network weights and is used to update them]
22
Loss = function(weights)
To reduce the loss, update the weights:
∇L(w_i) = ∂L/∂w_i
w_i^{new} = w_i^{old} − η · ∇L(w_i)
η: learning rate
23
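As a minimal sketch of this update rule (illustrative, not from the slides; a one-parameter loss L(w) = (w − 3)² stands in for a network's loss):

```python
def loss(w):
    return (w - 3.0) ** 2   # stand-in loss with minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)  # dL/dw

eta = 0.1                   # learning rate
w = 0.0                     # initial weight
for step in range(50):
    w = w - eta * grad(w)   # w_new = w_old - eta * dL/dw
print(w, loss(w))           # w approaches 3, loss approaches 0
```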
Figure 2.11 in Deep Learning with Python by Francois Chollet
η: learning rate
Large η: the update may jump to a near-random location on the loss curve
Small η: the update may get stuck at a local optimal value
Loss = function(weights)
[Figure: loss value vs. parameter value]
24
Figure 2.13 in Deep Learning with Python by Francois Chollet
Gradient descent algorithm
• Randomly shuffle/split all training examples into B batches
• Choose initial θ^(0)
• For i from 1 to T
  • For j from 1 to B
    • Do gradient descent update using data from batch j
• Advantage of such an approach: computational feasibility for
large datasets
Iterations over the
entire dataset are
called epochs
25
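A structural sketch of the mini-batch loop above (illustrative Python, not from the slides; `compute_batch_gradient` is a hypothetical placeholder for the gradient of the loss on one batch, and `mse_gradient` below is just one possible choice for it):

```python
import numpy as np

def minibatch_gradient_descent(X, y, theta, compute_batch_gradient,
                               batch_size=32, epochs=10, eta=0.01):
    """Shuffle the data, split it into batches, and do one gradient update per batch.
    One pass over all batches (the inner loop) is one epoch."""
    n = len(X)
    for epoch in range(epochs):                # for i from 1 to T
        idx = np.random.permutation(n)         # randomly shuffle the training examples
        for start in range(0, n, batch_size):  # for j from 1 to B
            batch = idx[start:start + batch_size]
            g = compute_batch_gradient(theta, X[batch], y[batch])
            theta = theta - eta * g            # gradient descent update on this batch
    return theta

# Example use: mean-squared-error gradient for a linear model y ≈ X @ theta
def mse_gradient(theta, Xb, yb):
    return 2.0 / len(Xb) * Xb.T @ (Xb @ theta - yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = minibatch_gradient_descent(X, y, np.zeros(3), mse_gradient, epochs=200, eta=0.05)
print(theta)  # approaches [1.0, -2.0, 0.5]
```

Computing the gradient on a small batch rather than on the full dataset is what makes each update cheap enough to be feasible for large datasets.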
Stochastic gradient descent for perceptron
Choose initial guess w^(0), k = 0
For i from 1 to T (epochs)
  For j from 1 to N (training examples)
    Consider example x_j, y_j
    Update*: w^(k+1) = w^(k) − η ∇L(w^(k))
26
Simple example: Perceptron Model
[Perceptron diagram as on slide 12]
Encode classes: A: +1, B: −1
Predict class A if s ≥ 0
Predict class B if s < 0
where s = ∑_{i=0}^{m} x_i w_i
if s ≥ 0 : prediction f(s) = 1
if s < 0 : prediction f(s) = − 1
27
Simple example: Loss
L(s, y) = max(0, −ys) = max(0, −y · ∑_{i=0}^{m} x_i w_i)
28
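In code, this loss is zero for correctly classified examples and grows with the size of s otherwise. A small sketch (not from the slides; the name `perceptron_loss` and the example weights are illustrative):

```python
import numpy as np

def perceptron_loss(w, x, y):
    """L(s, y) = max(0, -y*s) with s = sum_i w_i * x_i; x[0] is the constant bias input 1."""
    s = np.dot(w, x)
    return max(0.0, -y * s)

w = np.array([-0.1, 0.2, 0.0])  # [w0, w1, w2]
print(perceptron_loss(w, np.array([1.0, 1.0, 0.0]), -1))  # misclassified (s = 0.1, y = -1): loss 0.1
print(perceptron_loss(w, np.array([1.0, 1.0, 0.0]), +1))  # correctly classified: loss 0.0
```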
Simple example: gradient
L(w) = max(0, −ys) = max(0, −y ∑_{i=0}^{m} x_i w_i)
[Figure: L(s, y) plotted against sy: zero for sy ≥ 0, growing linearly as sy decreases below 0]
∂L/∂w_i = −y x_i when sy < 0
∂L/∂w_i = 0 when sy > 0
29
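Putting slides 26–29 together, here is a minimal sketch of stochastic gradient descent for the perceptron (illustrative code, not from the slides; the dataset and initial weights mirror the worked example on the following slides):

```python
import numpy as np

def train_perceptron(X, y, w, eta=0.1, epochs=10):
    """SGD for the perceptron: for each example, if s*y < 0 then dL/dw_i = -y*x_i, else no update."""
    for epoch in range(epochs):
        for xi, yi in zip(X, y):
            s = np.dot(w, xi)
            if s * yi < 0:                # misclassified under the loss max(0, -ys)
                w = w - eta * (-yi * xi)  # w <- w - eta * dL/dw
    return w

# Each row is [1, x1, x2]; the leading 1 carries the bias weight w0.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
y = np.array([-1, 1])
w = train_perceptron(X, y, w=np.array([-0.1, 0.2, 0.0]))
print(w)  # final weights; both points are classified correctly under f(s) = 1 if s >= 0
```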
Learn perceptron on the simple example: basic setup
f(s) = 1 if s ≥ 0, −1 if s < 0, where s = w_1 x_1 + w_2 x_2 + w_0 (learning rate η = 0.1)
30
Learn perceptron on the simple example: start with random weights
w_1 = 0.2, w_2 = 0.0, w_0 = −0.1
31
Learn perceptron on the simple example: epoch 1, data point 1
Current weights: w_1 = 0.2, w_2 = 0.0, w_0 = −0.1
Prediction s on (1,0): w_1 × 1 + w_2 × 0 + w_0 × 1 = 0.1 > 0
[Figure: (x_1, x_2) plane showing the point (1,0) labelled class −1, a class 1 region, and the current decision boundary]
32
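The single update worked through on the next slides can be checked directly with a small verification sketch (illustrative, not from the slides; weights are ordered [w_0, w_1, w_2]):

```python
import numpy as np

eta = 0.1
w = np.array([-0.1, 0.2, 0.0])  # w0 = -0.1, w1 = 0.2, w2 = 0.0
x = np.array([1.0, 1.0, 0.0])   # data point (1, 0) with bias input 1
y = -1                          # its class

s = np.dot(w, x)                # 0.1 > 0, so predicted class is +1: misclassified
grad = -y * x                   # dL/dw_i = -y * x_i since s*y < 0
w = w - eta * grad
print(s, w)                     # 0.1, then w0 = -0.2, w1 = 0.1, w2 = 0
```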
Learn perceptron on the simple example: epoch 1, data point 1
s > 0, y = −1, sy < 0: ∂L/∂w_i = −y x_i
Gradient step on (1,0) (learning rate η = 0.1):
w_1 ← w_1 − η · 1 = 0.1
w_2 ← w_2 − η · 0 = 0
w_0 ← w_0 − η · 1 = −0.2
33
Learn perceptron on the simple example: update weights
w_1 = 0.1, w_2 = 0, w_0 = −0.2
[Figure: the decision boundary before and after the update in the (x_1, x_2) plane]
34
Learn perceptron on the simple example: epoch 1, data point 2
Current weights: w_1 = 0.1, w_2 = 0.0, w_0 = −0.2
Prediction s on (1,1): w_1 × 1 + w_2 × 1 + w_0 × 1 = −0.1 < 0
35
Learn perceptron on the simple example: epoch 1, data point 2
s < 0, y = 1, sy < 0: ∂L/∂w_i = −y x_i
Gradient step on (1,1):
w_1 ← w_1 − η · (−1) = 0.2
w_2 ← w_2 − η · (−1) = 0.1
w_0 ← w_0 − η · (−1) = −0.1
36
Learn perceptron on the simple example: update weights
w_1 = 0.2, w_2 = 0.1, w_0 = −0.1
[Figure: the updated decision boundary now separates (1,0) (class −1) from (1,1) (class 1)]
37
Stochastic gradient descent for multi-layer perceptron
[Diagram: multi-layer perceptron with input layer, hidden layer(s), and output layer]
38
Stochastic gradient descent for multi-layer perceptron
Choose initial guess θ^(0), k = 0
Here θ is the set of all weights from all layers
For i from 1 to T (epochs)
  For j from 1 to N (training examples)
    Consider example x_j, y_j
    Update: θ^(k+1) = θ^(k) − η ∇L(θ^(k)); k ← k+1
Need to compute the partial derivatives ∂L/∂w_j and ∂L/∂v_ij
39
Chain rule
Given z = g(u), u = f(x):
dz/dx = (dz/du)(du/dx)
Example: z = sin(x²), u = x², z = sin(u)
dz/du = cos(u)
dz/dx = (dz/du)(du/dx) = 2x cos(u)
40
Multi-layer perceptron: function composition
Forward prediction: x → r → u → s → z (f: activation functions, L: loss)
r_j = ∑_{i=0}^{m} x_i v_ij
u_j = f(r_j) = sigmoid(r_j) = 1 / (1 + e^{−r_j})
s = ∑_{j=0}^{p} w_j u_j
z = f(s) = sigmoid(s) = 1 / (1 + e^{−s})
41
Binary cross-entropy
L = −[y log(z) + (1 − y) log(1 − z)]
y: label of the data; y = 1 for the positive class, 0 for the negative class
42
Multi-layer perceptron: chain rule
x → r → u → s → z → L
z = sigmoid(s) = 1 / (1 + e^{−s}), L = −[y log(z) + (1 − y) log(1 − z)], s = ∑_{j=0}^{p} u_j w_j
Backward propagation: ∂L/∂w_j = (∂L/∂z)(∂z/∂s)(∂s/∂w_j)
43
Multi-layer perceptron: chain rule
x → r → u → s → z → L
u_j = sigmoid(r_j) = 1 / (1 + e^{−r_j}), s = ∑_{j=0}^{p} u_j w_j, r_j = ∑_{i=0}^{m} x_i v_ij
Backward propagation: ∂L/∂v_ij = (∂L/∂z)(∂z/∂s)(∂s/∂u_j)(∂u_j/∂r_j)(∂r_j/∂v_ij)
44
Backward propagation
Forward prediction: r_j = ∑_{i=0}^{m} x_i v_ij, s = ∑_{j=0}^{p} u_j w_j
z = sigmoid(s) = 1 / (1 + e^{−s})
L = −[y log(z) + (1 − y) log(1 − z)]
∂L/∂z = −y/z + (1 − y)/(1 − z)
∂z/∂s = z(1 − z)
∂L/∂s = (∂L/∂z)(∂z/∂s) = z − y
∂L/∂w_j = (∂L/∂s)(∂s/∂w_j) = (∂L/∂s) u_j
∂L/∂u_j = (∂L/∂s)(∂s/∂u_j) = (∂L/∂s) w_j
u_j = sigmoid(r_j) = 1 / (1 + e^{−r_j})
∂L/∂r_j = (∂L/∂u_j)(∂u_j/∂r_j) = (∂L/∂u_j) u_j (1 − u_j)
∂L/∂v_ij = (∂L/∂r_j)(∂r_j/∂v_ij) = (∂L/∂r_j) x_i
45
Summary
• What is deep learning and how does it differ from traditional machine learning?
• How to train a neural network using the gradient descent algorithm?
• How to train a perceptron using stochastic gradient descent?
• What is the chain rule?
• How to perform forward prediction and backward propagation in a multi-layer perceptron?
46
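To close the recap, here is a minimal sketch of forward prediction and backward propagation for the one-hidden-layer network above (illustrative code, not from the slides; sigmoid activations and binary cross-entropy as in slides 41–45, variable names V and W mirror the weights v_ij and w_j, and bias terms are omitted for brevity):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def forward(x, V, W):
    """Forward prediction: x -> r -> u -> s -> z."""
    r = V.T @ x      # r_j = sum_i x_i v_ij
    u = sigmoid(r)   # u_j = sigmoid(r_j)
    s = W @ u        # s = sum_j u_j w_j
    z = sigmoid(s)   # z = sigmoid(s)
    return r, u, s, z

def backward(x, y, V, W):
    """Backward propagation of the binary cross-entropy L = -[y log z + (1-y) log(1-z)]."""
    r, u, s, z = forward(x, V, W)
    dL_ds = z - y                # dL/ds = (dL/dz)(dz/ds) = z - y
    dL_dW = dL_ds * u            # dL/dw_j = (dL/ds) u_j
    dL_du = dL_ds * W            # dL/du_j = (dL/ds) w_j
    dL_dr = dL_du * u * (1 - u)  # dL/dr_j = (dL/du_j) u_j (1 - u_j)
    dL_dV = np.outer(x, dL_dr)   # dL/dv_ij = (dL/dr_j) x_i
    return dL_dV, dL_dW

# Tiny example with m = 3 inputs and p = 4 hidden units
rng = np.random.default_rng(0)
x, y = np.array([0.5, -1.0, 2.0]), 1.0
V, W = rng.normal(size=(3, 4)), rng.normal(size=4)
dV, dW = backward(x, y, V, W)
print(dV.shape, dW.shape)  # (3, 4) and (4,)
```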