
09_intro2dl

Qiuhong Ke

Introduction to Deep Learning
——Recap of Neural Networks
COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

Before we start
Books & resources

• Deep Learning with Python, by Francois Chollet, available in the Unimelb library

• Deep Learning, by Ian Goodfellow, Yoshua Bengio and Aaron Courville, https://www.deeplearningbook.org/

2

Angry birds

vs

3

Source: by Papachan (CC BY)
https://www.sketchport.com/drawing/4524248689278976/bomb-and-chuck

https://heroes-of-the-characters.fandom.com/wiki/Chuck
https://www.pexels.com/photo/angry-angry-bird-animal-animated-415381/
https://www.sketchport.com/user/4893810324668416/papachan

Bird classification

Input → Feature engineering (color, or body shape, or …) → Classification (SVM) → Output: bird class

4

Source: by Papachan (CC BY)
https://www.sketchport.com/drawing/4524248689278976/bomb-and-chuck

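The pipeline above (hand-engineered features such as colour or body shape, fed to a classifier like an SVM) can be sketched in a few lines. This is only an illustration: scikit-learn, the made-up feature values and the placeholder class labels below are my own choices, not part of the slides.

from sklearn.svm import SVC

# Hypothetical hand-engineered features per bird image: [redness, roundness]
X = [[0.9, 0.8],   # very red, round
     [0.1, 0.2],   # not red, slim
     [0.8, 0.9],
     [0.2, 0.1]]
y = ["bird_A", "bird_B", "bird_A", "bird_B"]   # bird classes (placeholder names)

clf = SVC(kernel="linear")          # the classification stage of the pipeline
clf.fit(X, y)                       # learn from the engineered features
print(clf.predict([[0.85, 0.75]]))  # classify a new bird from its features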

In real life…

5
Source: Welinder, Peter, et al. “Caltech-UCSD birds 200.” (2010).

https://vision.cornell.edu/se3/wp-content/uploads/2014/09/WelinderEtal10_CUB-200.pdf

More features for real birds…

6
Source: Welinder, Peter, et al. “Caltech-UCSD birds 200.” (2010).

‘REAL Bird’ classification

Input → Feature engineering → Classification → Output: bird class

Use deep learning models!

7

Outline

• Introduction

• Neural network (Perceptron and Multi-layer Perceptron)
• Gradient descent algorithm

8

Deep learning: deep neural network

Input → Feature learning + classification → Output

Deep neural network: many hidden layers

[Figure: deep neural network with many hidden layers, mapping inputs x1 … xm through hidden units to the outputs]

9
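The slides do not prescribe a library, but as a rough sketch, a deep neural network like the one above (many hidden layers, with feature learning and classification in one model) could look as follows in Keras, the library used in the recommended Chollet book. The layer sizes, input dimension and number of classes are arbitrary placeholders.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(64,)),                 # 64 raw input features (placeholder)
    layers.Dense(128, activation="relu"),     # hidden layers learn the features...
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),   # ...and the last layer classifies (10 classes, placeholder)
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()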

Deep learning is everywhere

Near-human-level:

• image classification

• speech recognition

• handwriting transcription

• autonomous driving

• …

Why is deep learning popular now?

• Large datasets

• Powerful computing resources

• Better algorithms:

  • Weight-initialisation schemes

  • Optimisation schemes

  • Activation functions

  • …

• Networks: deeper and deeper

11

Perceptron

[Figure: perceptron with inputs x1, x2 and a constant input 1, multiplied by weights w1, w2 and w0, summed (Σ) to give s, then passed through the activation f(s)]

• x1, x2 – inputs
• w1, w2 – synaptic weights
• w0 – bias weight
• f – activation function

Predict class A if s ≥ 0
Predict class B if s < 0
where s = Σ_{i=0}^{m} x_i w_i

12

Perceptron is a linear binary classifier

Step function: f(s) = 1 if s ≥ 0; 0 if s < 0
Sign function: f(s) = 1 if s ≥ 0; −1 if s < 0

If s ≥ 0: f(s) gives the positive class
If s < 0: f(s) gives the negative class

13

Simple example

Exercise: find weights of a perceptron capable of perfect classification of the following dataset
[dataset table shown on the slide]

14

Limitations of perceptron learning

• If the data is linearly separable, the perceptron training algorithm will converge to a correct solution
  * It will converge to some solution (separating boundary), one of infinitely many possible ← bad!
• However, if the data is not linearly separable, the training will fail completely rather than give some approximate solution
  * Ugly :(

Some problems are linearly separable, but many are not (e.g. AND and OR are; XOR is not)

15

Multi-layer perceptron: function composition

[Figure: input layer x1 … xm, hidden layer(s), output layer]

Hidden layer (activation function g): u_j = g(r_j), where r_j = Σ_{i=0}^{m} x_i v_ij
Output layer (activation function h): z = h(s), where s = Σ_{j=0}^{p} u_j w_j

16

Common non-linear activation functions

ReLU: f(x) = max(0, x)
Sigmoid: f(x) = 1 / (1 + e^{−x}) = e^x / (e^x + 1)
Softmax (multi-class classification): f(x_i) = e^{x_i} / Σ_{j=1}^{C} e^{x_j}, for i = 1, …, C
TanH: f(x) = 2 / (1 + e^{−2x}) − 1 = 2e^{2x} / (e^{2x} + 1) − 1

17

How to train your dragon network?

Adapted from movie poster, Flickr user jdxyw (CC BY-SA 2.0)

18

"Training": adjust weights to minimise loss

Training loop:
Input → Neural network → Predictions
Predictions and Targets (labels) → Loss function → Loss score (adds a penalty (loss) for misclassified examples)
Loss score → Update weights. How?

19

Derivative

Input x, output y: f(x)
As Δx → 0: f(x1 + Δx) − f(x1) = a Δx
a: rate of change (derivative) at x1
Example: y = f(x) = x²
[Plot: the curve y = x² with two points (x1, y1) and (x2, y2)]
The derivative can be viewed as a vector from the origin; moving x along the OPPOSITE direction of that vector decreases f(x)

20

Gradient

z = f(x, y) = x² + y²
G = [∂z/∂x, ∂z/∂y]
[Plot: the surface z = x² + y²]
Moving the input (x, y) along the OPPOSITE direction of the gradient vector decreases the output f(x, y)

21
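To make "move against the gradient" concrete before the next slides formalise it, here is a small sketch (my own illustration, not from the slides) that runs a few gradient descent steps on z = f(x, y) = x² + y², whose gradient is G = [2x, 2y]; the printed value of f decreases at every step.

import numpy as np

def f(p):
    # z = x^2 + y^2
    return p[0] ** 2 + p[1] ** 2

def grad_f(p):
    # G = [dz/dx, dz/dy] = [2x, 2y]
    return np.array([2.0 * p[0], 2.0 * p[1]])

p = np.array([3.0, -2.0])   # starting point (x, y)
eta = 0.1                   # step size (learning rate)

for step in range(5):
    p = p - eta * grad_f(p)     # move OPPOSITE to the gradient
    print(step + 1, p, f(p))    # f(p) shrinks towards the minimum at (0, 0)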
Loss = function (weights)

[Training loop as before: Input → Neural network → Predictions; Predictions and Targets (labels) → Loss function → Loss score; Update weights]

22

Loss = function (weights)

To reduce the loss, update the weights:
ΔL(w_i) = ∂L/∂w_i
w_i^new = w_i^old − η · ΔL(w_i)
η: learning rate

23
Figure 2.11 in Deep Learning with Python by Francois Chollet

Loss = function (weights)

η: learning rate
Large η: may end up at a random location
Small η: may get stuck at a local optimal value
[Plot: loss value against parameter value]

24
Figure 2.13 in Deep Learning with Python by Francois Chollet

Gradient descent algorithm

• Randomly shuffle/split all training examples into B batches
• Choose initial θ^(0)
• For i from 1 to T
  • For j from 1 to B
    • Do a gradient descent update using the data from batch j
• Advantage of such an approach: computational feasibility for large datasets

Iterations over the entire dataset are called epochs

25

Stochastic gradient descent for perceptron

Choose initial guess w^(0), k = 0
For i from 1 to T (epochs)
  For j from 1 to N (training examples)
    Consider example (x_j, y_j)
    Update: w^(k+1) = w^(k) − η ∇L(w^(k)); k ← k + 1

26

Simple example: perceptron model

[Figure: perceptron with inputs x1, x2 and bias input 1, weights w1, w2, w0, sum s, activation f(s)]

Encode the classes: A as +1, B as −1
Predict class A if s ≥ 0; predict class B if s < 0, where s = Σ_{i=0}^{m} x_i w_i
If s ≥ 0: prediction f(s) = 1
If s < 0: prediction f(s) = −1

27

Simple example: loss

L(s, y) = max(0, −ys) = max(0, −y · Σ_{i=0}^{m} x_i w_i)

28

Simple example: gradient

L(w) = max(0, −ys) = max(0, −y Σ_{i=0}^{m} x_i w_i)
[Plot: L(s, y) against sy: zero for sy ≥ 0, increasing linearly as sy decreases below 0]
∂L/∂w_i = −y x_i when sy < 0
∂L/∂w_i = 0 when sy > 0

29
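Putting the algorithm and the loss gradient above together, here is a minimal NumPy sketch of stochastic gradient descent for the perceptron. The toy dataset and the initial weights are my own placeholders (an AND-style labelling), not the lecture's example.

import numpy as np

def perceptron_sgd(X, y, eta=0.1, epochs=10, w_init=None):
    # X: (N, m) inputs; y: length-N labels encoded as +1 / -1.
    # A constant 1 is prepended to each example so that w[0] acts as the bias
    # weight w0, giving s = w0*1 + w1*x1 + ... + wm*xm.
    Xb = np.hstack([np.ones((len(X), 1)), np.asarray(X, dtype=float)])
    w = np.zeros(Xb.shape[1]) if w_init is None else np.asarray(w_init, dtype=float)
    for _ in range(epochs):                 # one pass over the data = one epoch
        for xi, yi in zip(Xb, y):
            s = xi @ w                      # s = sum_i x_i * w_i
            if yi * s < 0:                  # loss is active: dL/dw_i = -y * x_i
                w = w + eta * yi * xi       # w <- w - eta * (-y * x)
    return w

# Placeholder data: class +1 only when both inputs are 1 (AND-like labels)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, +1])
print(perceptron_sgd(X, y, w_init=[-0.1, 0.2, 0.0]))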


Learn perceptron on the simple example: basic setup

[Figure: perceptron with inputs x1, x2 and bias input 1, weights w1, w2, w0; s = w1·x1 + w2·x2 + w0; f(s) = −1 if s < 0, 1 if s ≥ 0]
(learning rate η = 0.1)

30

Learn perceptron on the simple example: start with random weights

w1 = 0.2, w2 = 0.0, w0 = −0.1
(learning rate η = 0.1)
[Plot: the training data in the (x1, x2) plane]

31

Learn perceptron on the simple example: epoch 1, data point 1

w1 = 0.2, w2 = 0.0, w0 = −0.1
Prediction s on (1, 0): w1 × 1 + w2 × 0 + w0 × 1 = 0.1 > 0
[Plot: the point (1, 0), labelled class −1, in the (x1, x2) plane]
32
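A quick check of the prediction above, using the same numbers as the slide:

w0, w1, w2 = -0.1, 0.2, 0.0     # the initial weights from the slide
x1, x2 = 1, 0                   # data point 1
s = w1 * x1 + w2 * x2 + w0 * 1  # = 0.1 > 0, so the perceptron predicts class +1
print(s)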


Learn perceptron on the simple example: epoch 1, data point 1

Gradient on (1, 0): s > 0, y = −1, sy < 0, so ∂L/∂w_i = −y x_i

w1 ← w1 − η · 1 = 0.1
w2 ← w2 − η · 0 = 0
w0 ← w0 − η · 1 = −0.2

(learning rate η = 0.1)

33

Learn perceptron on the simple example: update weights

w1 = 0.1, w2 = 0, w0 = −0.2
(learning rate η = 0.1)
[Plot: the decision boundary after the update, with (1, 0) now on the class −1 side]

34

Learn perceptron on the simple example: epoch 1, data point 2

w1 = 0.1, w2 = 0.0, w0 = −0.2
Prediction s on (1, 1): w1 × 1 + w2 × 1 + w0 × 1 = −0.1 < 0
(learning rate η = 0.1)
[Plot: training points (1, 0), class −1, and (1, 1), class +1]

35

Learn perceptron on the simple example: epoch 1, data point 2

Gradient on (1, 1): s < 0, y = 1, sy < 0, so ∂L/∂w_i = −y x_i

w1 ← w1 − η · (−1) = 0.2
w2 ← w2 − η · (−1) = 0.1
w0 ← w0 − η · (−1) = −0.1

(learning rate η = 0.1)

36

Learn perceptron on the simple example: update weights

w1 = 0.2, w2 = 0.1, w0 = −0.1
(learning rate η = 0.1)
[Plot: the final decision boundary separates (1, 0), class −1, from (1, 1), class +1]

37

Stochastic gradient descent for multi-layer perceptron

[Figure: multi-layer perceptron with input layer x1 … xm, hidden layer(s), output layer]

38

Stochastic gradient descent for multi-layer perceptron

Choose initial guess θ^(0), k = 0; here θ is the set of all weights from all layers
For i from 1 to T (epochs)
  For j from 1 to N (training examples)
    Consider example (x_j, y_j)
    Update: θ^(k+1) = θ^(k) − η ∇L(θ^(k)); k ← k + 1
Need to compute the partial derivatives ∂L/∂w_j and ∂L/∂v_ij

39

Chain rule

Given z = g(u) and u = f(x):
dz/dx = (dz/du) · (du/dx)

Example: z = sin(x²)
u = x², z = sin(u), dz/du = cos(u)
dz/dx = (dz/du) · (du/dx) = 2x cos(u)

40

Multi-layer perceptron: function composition

Forward prediction: x → r → u → s → z → L (f: activation functions)
r_j = Σ_{i=0}^{m} x_i v_ij
u_j = f(r_j) = sigmoid(r_j) = 1 / (1 + e^{−r_j})
s = Σ_{j=0}^{p} w_j u_j
z = f(s) = sigmoid(s) = 1 / (1 + e^{−s})

41

Binary cross-entropy

L = −[y log(z) + (1 − y) log(1 − z)]
y: label of the data; y = 1 for the positive class, 0 for the negative class

42

Multi-layer perceptron: chain rule

Forward: x → r → u → s → z → L
z = sigmoid(s) = 1 / (1 + e^{−s}), s = Σ_{j=0}^{p} u_j w_j, L = −[y log(z) + (1 − y) log(1 − z)]
Backward propagation:
∂L/∂w_j = (∂L/∂z) · (∂z/∂s) · (∂s/∂w_j)

43

Multi-layer perceptron: chain rule

Forward: x → r → u → s → z → L
u_j = sigmoid(r_j) = 1 / (1 + e^{−r_j}), s = Σ_{j=0}^{p} u_j w_j, r_j = Σ_{i=0}^{m} x_i v_ij
Backward propagation:
∂L/∂v_ij = (∂L/∂z) · (∂z/∂s) · (∂s/∂u_j) · (∂u_j/∂r_j) · (∂r_j/∂v_ij)

44

Backward propagation

Forward prediction:
r_j = Σ_{i=0}^{m} x_i v_ij, u_j = sigmoid(r_j) = 1 / (1 + e^{−r_j})
s = Σ_{j=0}^{p} u_j w_j, z = sigmoid(s) = 1 / (1 + e^{−s})
L = −[y log(z) + (1 − y) log(1 − z)]

Backward propagation:
∂L/∂z = −y/z + (1 − y)/(1 − z)
∂z/∂s = z(1 − z), so ∂L/∂s = (∂L/∂z)(∂z/∂s) = z − y
∂L/∂w_j = (∂L/∂s)(∂s/∂w_j) = (∂L/∂s) u_j
∂L/∂u_j = (∂L/∂s)(∂s/∂u_j) = (∂L/∂s) w_j
∂L/∂r_j = (∂L/∂u_j)(∂u_j/∂r_j) = (∂L/∂u_j) u_j (1 − u_j)
∂L/∂v_ij = (∂L/∂r_j)(∂r_j/∂v_ij) = (∂L/∂r_j) x_i

45
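To tie the forward prediction and the backward-propagation formulas above together, here is a minimal NumPy sketch of one training step for this one-hidden-layer network with sigmoid activations and binary cross-entropy. Bias terms are omitted for brevity (the slides fold them in via x_0 = 1 and u_0 = 1), and the array shapes and variable names are my own choices.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_step(x, y, V, w, eta=0.1):
    # x: (m,) input; y: scalar label in {0, 1};
    # V: (m, p) input-to-hidden weights v_ij; w: (p,) hidden-to-output weights w_j.
    # Forward prediction: x -> r -> u -> s -> z -> L
    r = x @ V                       # r_j = sum_i x_i * v_ij
    u = sigmoid(r)                  # u_j = sigmoid(r_j)
    s = u @ w                       # s = sum_j u_j * w_j
    z = sigmoid(s)                  # z = sigmoid(s)
    L = -(y * np.log(z) + (1 - y) * np.log(1 - z))

    # Backward propagation (chain rule)
    dL_ds = z - y                   # dL/ds = dL/dz * dz/ds = z - y
    dL_dw = dL_ds * u               # dL/dw_j = dL/ds * u_j
    dL_du = dL_ds * w               # dL/du_j = dL/ds * w_j
    dL_dr = dL_du * u * (1 - u)     # dL/dr_j = dL/du_j * u_j * (1 - u_j)
    dL_dV = np.outer(x, dL_dr)      # dL/dv_ij = dL/dr_j * x_i

    # Gradient descent update of all weights
    return V - eta * dL_dV, w - eta * dL_dw, L

# Illustrative run on made-up numbers: the loss shrinks over a few steps
rng = np.random.default_rng(0)
x, y = np.array([1.0, 0.5, -0.2]), 1.0
V, w = rng.normal(size=(3, 4)) * 0.1, rng.normal(size=4) * 0.1
for _ in range(5):
    V, w, L = train_step(x, y, V, w)
    print(L)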
Summary

• What is deep learning, and how does it differ from traditional machine learning?
• How to train a neural network using the gradient descent algorithm?
• How to train a perceptron using stochastic gradient descent?
• What is the chain rule?
• How to perform forward prediction and back-propagation in a multi-layer perceptron?

46