
Multilayer Neural Network

Dr Chang Xu

School of Computer Science


Perceptron: the Prelude of Deep Learning

A neuron cell

Image credit to: thinglink.com


Perceptron: the Prelude of DL

[Figure: a perceptron. Inputs $x_1, x_2, \dots, x_d$ are multiplied by weights $w_1, w_2, \dots, w_d$, summed together with a bias $b$, and produce the output $o$.]

  $o = \sum_{i=1}^{d} w_i \cdot x_i + b$

[Rosenblatt, et al. 1958]


Perceptron
– The neuron has a real-valued output which is a weighted sum of its inputs:

  $\hat{y} = \sum_{i=1}^{d} w_i x_i + b = \mathbf{w}^{T}\mathbf{x} + b$

  where $\mathbf{x}$ is the input vector, $\mathbf{w}$ is the weight vector, $b$ is the bias, and $\hat{y}$ is the neuron's estimate of the desired output.

– Recall the decision rule. If an example $(\mathbf{x}, y)$ is correctly classified, we have

  $y\,(\mathbf{w}^{T}\mathbf{x} + b) > 0$

  otherwise

  $y\,(\mathbf{w}^{T}\mathbf{x} + b) < 0$
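To make the decision rule concrete, here is a minimal NumPy sketch (not from the slides): it checks whether an example is correctly classified and applies the classic misclassification-driven update, which is exactly the gradient step derived on the following slides. The toy data and learning rate are made up for illustration.

```python
import numpy as np

def correctly_classified(w, b, x, y):
    """Decision rule: the example (x, y) with label y in {-1, +1}
    is correctly classified iff y * (w^T x + b) > 0."""
    return y * (np.dot(w, x) + b) > 0

def perceptron_update(w, b, x, y, lr=1.0):
    """For a misclassified example, move the hyperplane towards it:
    w <- w + lr * y * x,  b <- b + lr * y."""
    return w + lr * y * x, b + lr * y

# Toy usage with made-up 2-D data.
w, b = np.zeros(2), 0.0
X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.5, -0.5]])
y = np.array([+1, -1, -1])
for epoch in range(10):
    for xi, yi in zip(X, y):
        if not correctly_classified(w, b, xi, yi):
            w, b = perceptron_update(w, b, xi, yi)
print(w, b)
```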

– The objective function of the Perceptron can be written as

  $\min_{\mathbf{w},\, b}\; L(\mathbf{w}, b) = -\sum_{\mathbf{x}_i \in \mathcal{M}} y_i\, (\mathbf{w}^{T}\mathbf{x}_i + b)$

  where $\mathcal{M}$ stands for the set of the misclassified examples.

– Gradient descent is used to solve the optimization problem:

  $\nabla_{\mathbf{w}} L(\mathbf{w}, b) = -\sum_{\mathbf{x}_i \in \mathcal{M}} y_i\, \mathbf{x}_i, \qquad \nabla_{b} L(\mathbf{w}, b) = -\sum_{\mathbf{x}_i \in \mathcal{M}} y_i$

– The perceptron is a linear classifier:

  $g(\mathbf{x}) = \mathrm{sign}\!\left(\sum_{i=1}^{d} w_i x_i + b\right) = \mathrm{sign}(\mathbf{w}^{T}\mathbf{x} + b), \qquad \mathrm{sign}(a) = \begin{cases} 1, & a \ge 0 \\ -1, & a < 0 \end{cases}$

[Figure: a single-layer perceptron with bias input $x_0 = 1$, inputs $x_1, \dots, x_d$, weights $w_0, w_1, \dots, w_d$ and output $g(\mathbf{x})$. Linear classifier!]

Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.

• Before long, researchers had begun to discover the Perceptron's limitations.
• Unless input categories were "linearly separable", a perceptron could not learn to discriminate between them.
• Unfortunately, it appeared that many important categories were not linearly separable.
• E.g., those inputs to an XOR gate that give an output of 1 (namely 10 & 01) are not linearly separable from those that do not (00 & 11).

"AI winter"

Perceptron: Limitations

– Solving XOR (exclusive-or) with a Perceptron?
– A linear classifier cannot identify non-linear patterns.

  x1  x2  y
  0   0   0
  0   1   1
  1   0   1
  1   1   0

[Figure: the four XOR points in the $(x_1, x_2)$ plane; no single line $w_1 x_1 + w_2 x_2 + b = 0$ separates the two classes.]

This failure caused the majority of researchers to walk away.

Breakthrough: Multi-Layer Perceptron

– In 1986, Geoffrey Hinton, David Rumelhart, and Ronald Williams published the paper "Learning representations by back-propagating errors", which introduced
– Backpropagation, a procedure to repeatedly adjust the weights so as to minimize the difference between the actual output and the desired output
– Hidden Layers, which are neuron nodes stacked in between inputs and outputs, allowing neural networks to learn more complicated features (such as XOR logic)

Multilayer Neural Network

Multilayer Neural Networks

– Function composition can be described by a directed acyclic graph (hence feedforward networks)
– Depth is the maximum layer index in the function composition chain
– The final layer is called the output layer

Two-layer Neural Network

[Figure: a two-layer network, i.e. inputs $x_0 = 1, x_1, \dots, x_d$ feeding a single output unit $g(\mathbf{x})$; it is still a linear classifier.]

Three-layer Neural Network (with a hidden layer)

[Figure: a three-layer network with inputs $x_0 = 1, x_1, x_2$, one hidden layer and output $g(\mathbf{x})$, whose hand-set weights (the values 1, 1, 1, 1, -1.5, -1, 0.5, 0.7, -0.4 in the slide figure) produce a non-linear decision region that solves XOR.]

Three-layer Neural Network (with a hidden layer)

– A 2D-input example: each hidden unit defines a separating hyperplane, a group of hidden units carves out a convex polygon region, and further units combine these into compositions of polygons.

Feedforward

[Figure: a three-layer network with inputs $x_1, \dots, x_d$, hidden units $y_1, \dots, y_{n_H}$ and outputs $z_1, \dots, z_C$.]

  $net_j = \sum_{i=1}^{d} w_{ji}\, x_i + w_{j0}, \qquad y_j = f(net_j)$

  $net_k = \sum_{j=1}^{n_H} w_{kj}\, y_j + w_{k0}, \qquad z_k = f(net_k)$
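As a concrete illustration of the feedforward equations above, here is a small NumPy sketch (not from the slides); the layer sizes, the random weights and the choice of a sigmoid for $f$ are arbitrary assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def feedforward(x, W_hidden, b_hidden, W_out, b_out, f=sigmoid):
    """Three-layer network: net_j = sum_i w_ji x_i + w_j0, y_j = f(net_j);
    net_k = sum_j w_kj y_j + w_k0, z_k = f(net_k)."""
    net_hidden = W_hidden @ x + b_hidden   # shape (n_H,)
    y = f(net_hidden)                      # hidden activations y_j
    net_out = W_out @ y + b_out            # shape (C,)
    z = f(net_out)                         # outputs z_k
    return y, z

# Toy usage: d = 3 inputs, n_H = 4 hidden units, C = 2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
y, z = feedforward(x,
                   rng.normal(size=(4, 3)), np.zeros(4),
                   rng.normal(size=(2, 4)), np.zeros(2))
print(z)
```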
Universal Approximation Theorem

Let $\varphi(\cdot)$ be a nonconstant, bounded, and monotonically increasing continuous function. Let $I_m$ denote the $m$-dimensional unit hypercube $[0,1]^m$, and let $C(I_m)$ denote the space of continuous functions on $I_m$. Then, given any $\varepsilon > 0$ and any function $f \in C(I_m)$, there exist an integer $N$, real constants $\alpha_i, b_i \in \mathbb{R}$, and real vectors $\mathbf{w}_i \in \mathbb{R}^m$, where $i = 1, \dots, N$, such that we may define

  $F(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i\, \varphi(\mathbf{w}_i^{T} \mathbf{x} + b_i)$

as an approximate realization of the function $f$, where $f$ is independent of $\varphi$; that is,
  $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in I_m$.

In other words, functions of the form $F(\mathbf{x})$ are dense in $C(I_m)$.

Universality theorem (Hecht-Nielsen 1989): "Neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision."

Expressive power of a three-layer neural network

– A visual explanation with one-input, one-output functions $\sigma(wx + b)$.
  (Video and image credit to: https://neuralnetworksanddeeplearning.com)
– The weight changes the shape of the curve: the larger the weight, the steeper the curve. The bias moves the graph, but doesn't change its shape.
– Increase the weight so that the output is a very good approximation of a step function. The step is at position $s = -b/w$, so we can simplify the problem by describing each hidden neuron using just the single parameter $s$.
– Left figure: setting the output weights to $1.2$ and $-1.2$, we get a "bump" function, which starts at $0.4$, ends at $0.6$ and has height $1.2$. Right figure: by adding another pair of hidden neurons, we can get more bumps in the same network.
– By increasing the number of hidden neurons, we can make the approximation much better.
– Overall procedure: [Figure: a single sigmoid unit with input $x_1$, bias input $x_0 = 1$ and weights $w_1 = 100$, $w_0 = b = -100$; the large weight turns the sigmoid into an approximation of a step function at $x = 1$.]

– How about functions of multiple inputs? The two-input case: similar to the one-input case, the weight changes the shape and the bias changes the position. Fixing $w_2 = 0$ and changing $w_1$ and $b$, the step point is $s_x = -b/w_1$.
– By adding a bias to the output neuron and adjusting the values of $h$ and $b$, we can construct a "tower" function. The same idea can be used to compute as many towers as we like; we can also make them as thin as we like, and of whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables.

A toy example: http://playground.tensorflow.org/

1. Linear case without a hidden layer
2. Nonlinear case with a hidden layer

Activation Functions

– Must be nonlinear: otherwise $f\!\left(\sum_i w_i x_i\right)$ stays linear in the inputs and the network is equivalent to a linear classifier.
– Continuous and differentiable almost everywhere.
– Monotonic: otherwise the activation introduces additional local extrema in the error surface (an output value might correspond to multiple input values).

Popular choices of activation function $f(a)$ (single input):

– Sigmoid function: $f(a) = \dfrac{1}{1 + e^{-a}}$
– Tanh function (shifts the center of the sigmoid to the origin): $f(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
– Rectified linear unit (ReLU): $f(a) = \max(0, a)$
– Leaky ReLU: $f(a) = a$ if $a \ge 0$, and $f(a) = \alpha a$ if $a < 0$, where $\alpha$ is a small constant.

Vanishing gradient problems

– Saturation: at their extremes, the derivatives of the sigmoid and tanh functions are close to 0, which kills the gradient and the learning process.

Not zero-centered activation

– The sigmoid is not zero-centered: during training, the gradients with respect to a neuron's weights will all be positive or all be negative, which causes problems with learning (the possible update directions are restricted, so the weights take an indirect path towards the optimum point). The slide contrasts the sigmoid with ReLU, $f(a) = \max(0, a)$.

Dead ReLUs

– For negative inputs, ReLU gives 0, which means some neurons will never be activated: they are "dead".
– Avoid large learning rates and poor weight initialization, and try Leaky ReLU, etc.

Backpropagation

Recall the feedforward calculation

[Figure: a three-layer network with inputs $x_1, \dots, x_d$, hidden units $y_1, \dots, y_{n_H}$ and outputs $z_1, \dots, z_C$.]

  $net_j = \sum_{i=1}^{d} w_{ji}\, x_i + w_{j0}, \qquad y_j = f(net_j)$

  $net_k = \sum_{j=1}^{n_H} w_{kj}\, y_j + w_{k0}, \qquad z_k = f(net_k)$

Training error

– Euclidean distance:

  $J(\mathbf{t}, \mathbf{z}) = \frac{1}{2} \sum_{k=1}^{C} (t_k - z_k)^2 = \frac{1}{2} \|\mathbf{t} - \mathbf{z}\|^2$

– If both $\mathbf{t}$ and $\mathbf{z}$ are probability distributions, we can use the cross entropy:

  $J(\mathbf{t}, \mathbf{z}) = -\sum_{k=1}^{C} t_k \log z_k$

– Cross entropy is asymmetric. In some cases we may use this symmetric form:

  $J(\mathbf{t}, \mathbf{z}) = -\sum_{k=1}^{C} (t_k \log z_k + z_k \log t_k)$
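To make the two training-error measures concrete, here is a small NumPy sketch (not from the slides) computing the squared-error and cross-entropy losses for one target/output pair; the example vectors are the ones used on the softmax slide below.

```python
import numpy as np

def squared_error(t, z):
    """J(t, z) = 1/2 * sum_k (t_k - z_k)^2."""
    return 0.5 * np.sum((t - z) ** 2)

def cross_entropy(t, z):
    """J(t, z) = -sum_k t_k log z_k (t and z are probability distributions)."""
    return -np.sum(t * np.log(z))

t = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot ground truth
z = np.array([0.1, 0.2, 0.6, 0.1])   # network output (probabilities)
print(squared_error(t, z), cross_entropy(t, z))
```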


Cross Entropy Loss

– Given two probability distributions $\mathbf{t}$ and $\mathbf{z}$,

  $\mathrm{CrossEntropy}(\mathbf{t}, \mathbf{z}) = -\sum_{k} t_k \log z_k$
  $\qquad = -\sum_{k} t_k \log t_k + \sum_{k} t_k \log t_k - \sum_{k} t_k \log z_k$
  $\qquad = -\sum_{k} t_k \log t_k + \sum_{k} t_k \log \dfrac{t_k}{z_k}$
  $\qquad = \mathrm{Entropy}(\mathbf{t}) + \mathrm{KL}(\mathbf{t} \,\|\, \mathbf{z})$

– Entropy is a measure of the degree of disorder or randomness.
– KL (Kullback-Leibler) divergence is a measure of the information lost when $\mathbf{z}$ is used to approximate $\mathbf{t}$.
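As a quick numerical check of the decomposition above, this sketch (not from the slides) verifies that the cross-entropy equals entropy plus KL divergence for a made-up pair of distributions.

```python
import numpy as np

t = np.array([0.7, 0.2, 0.1])   # "ground-truth" distribution (made up)
z = np.array([0.5, 0.3, 0.2])   # approximating distribution (made up)

cross_entropy = -np.sum(t * np.log(z))
entropy = -np.sum(t * np.log(t))
kl = np.sum(t * np.log(t / z))

# CrossEntropy(t, z) = Entropy(t) + KL(t || z)
assert np.isclose(cross_entropy, entropy + kl)
print(cross_entropy, entropy, kl)
```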


Softmax Function

– In multi-class classification, a softmax function before the output layer is used to assign conditional probabilities (given $\mathbf{x}$) to each one of the $C$ classes.

[Figure: the output layer computes $net_1, \dots, net_C$; the softmax turns them into probabilities $z_1, \dots, z_C$, which are compared with the ground-truth targets $t_1, \dots, t_C$. Example with $C = 4$: output $(0.1,\ 0.2,\ 0.6,\ 0.1)$ versus ground truth $(0,\ 0,\ 1,\ 0)$.]

  $P(\text{class } k \mid \mathbf{x}) = z_k = \dfrac{e^{net_k}}{\sum_{j=1}^{C} e^{net_j}}$
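A minimal NumPy sketch (not from the slides) of the softmax computation above; the raw scores are made-up values chosen so that the resulting probabilities are roughly the $(0.1, 0.2, 0.6, 0.1)$ of the example.

```python
import numpy as np

def softmax(net):
    """z_k = exp(net_k) / sum_j exp(net_j), computed in a numerically
    stable way by subtracting the maximum score first."""
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

net = np.array([0.5, 1.2, 2.3, 0.5])   # made-up raw scores for C = 4 classes
z = softmax(net)
t = np.array([0.0, 0.0, 1.0, 0.0])     # one-hot ground truth
loss = -np.sum(t * np.log(z))          # cross-entropy between t and z
print(z.round(2), loss)
```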


Gradient descent

– Weights are initialized with random values, and then are updated in a direction reducing the error.
– Iteratively update

  $w(t+1) = w(t) + \Delta w(t), \qquad \Delta w = -\eta\, \dfrac{\partial J}{\partial w}$

  where $\eta$ is the learning rate.
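A minimal sketch of this update rule (not from the slides), run on the made-up one-dimensional error $J(w) = (w - 3)^2$, whose gradient is known in closed form.

```python
def grad_J(w):
    """Gradient of the made-up error J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

eta = 0.1          # learning rate
w = 0.0            # arbitrary initialization
for t in range(50):
    delta_w = -eta * grad_J(w)   # delta_w = -eta * dJ/dw
    w = w + delta_w              # w(t+1) = w(t) + delta_w(t)
print(w)   # converges towards the minimizer w = 3
```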


Hidden-to-output weight $w_{kj}$

[Figure: a three-layer network with inputs $x_1, \dots, x_d$, hidden units $y_1, \dots, y_{n_H}$, outputs $z_1, \dots, z_C$, input-to-hidden weights $w_{ji}$ and hidden-to-output weights $w_{kj}$.]

– Error, output and net activation:

  $J(\mathbf{w}) = \frac{1}{2} \|\mathbf{t} - \mathbf{z}\|^2, \qquad z_k = f(net_k), \qquad net_k = \sum_{j=1}^{n_H} w_{kj}\, y_j + w_{k0}$

– Chain rule:

  $\dfrac{\partial J}{\partial w_{kj}} = \dfrac{\partial J}{\partial net_k}\, \dfrac{\partial net_k}{\partial w_{kj}} = \dfrac{\partial J}{\partial net_k}\, y_j$
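A small NumPy sketch (not from the slides) of this output-layer gradient for the squared error with a sigmoid output, where $\partial J / \partial net_k = -(t_k - z_k)\, f'(net_k)$; the hidden activations, weights and target are made up, and the sigmoid is just one possible choice of $f$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

# Made-up hidden activations, hidden-to-output weights and targets.
rng = np.random.default_rng(0)
n_H, C = 4, 3
y = rng.uniform(size=n_H)                  # hidden activations y_j
W_kj = rng.normal(size=(C, n_H))           # hidden-to-output weights
w_k0 = np.zeros(C)                         # output biases
t = np.array([1.0, 0.0, 0.0])              # target

net_k = W_kj @ y + w_k0                    # net_k = sum_j w_kj y_j + w_k0
z = sigmoid(net_k)                         # z_k = f(net_k)

dJ_dnet = -(t - z) * sigmoid_prime(net_k)  # dJ/dnet_k
dJ_dW = np.outer(dJ_dnet, y)               # dJ/dw_kj = dJ/dnet_k * y_j
print(dJ_dW)
```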


Input-to-hidden weight $w_{ji}$

[Figure: the same three-layer network, with input-to-hidden weights $w_{ji}$ and hidden-to-output weights $w_{kj}$.]

– Chain rule:

  $\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j}\, \dfrac{\partial y_j}{\partial net_j}\, \dfrac{\partial net_j}{\partial w_{ji}}, \qquad net_j = \sum_{i=1}^{d} x_i\, w_{ji} + w_{j0}, \qquad y_j = f(net_j)$

– A hidden unit $y_j$ influences the error through all of the outputs, so

  $\dfrac{\partial J}{\partial y_j} = \sum_{k=1}^{C} \dfrac{\partial J}{\partial z_k}\, \dfrac{\partial z_k}{\partial y_j}$
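A self-contained NumPy sketch (not from the slides, with biases omitted for brevity) that applies this chain rule to an input-to-hidden weight and checks one entry against a finite-difference approximation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
d, n_H, C = 3, 4, 2
x = rng.normal(size=d)
W_ji = rng.normal(size=(n_H, d))           # input-to-hidden weights
W_kj = rng.normal(size=(C, n_H))           # hidden-to-output weights
t = np.array([1.0, 0.0])                   # target

def loss(W):
    y = sigmoid(W @ x)                     # y_j = f(net_j)
    z = sigmoid(W_kj @ y)                  # z_k = f(net_k)
    return 0.5 * np.sum((t - z) ** 2)      # J = 1/2 ||t - z||^2

# Backpropagated gradient for the input-to-hidden weights.
net_j = W_ji @ x
y = sigmoid(net_j)
net_k = W_kj @ y
z = sigmoid(net_k)
dJ_dnet_k = -(t - z) * z * (1.0 - z)       # output-layer dJ/dnet_k
dJ_dy = W_kj.T @ dJ_dnet_k                 # dJ/dy_j = sum_k dJ/dnet_k * w_kj
dJ_dnet_j = dJ_dy * y * (1.0 - y)          # times f'(net_j)
dJ_dW_ji = np.outer(dJ_dnet_j, x)          # times dnet_j/dw_ji = x_i

# Finite-difference check on one entry.
eps = 1e-6
W_pert = W_ji.copy()
W_pert[0, 0] += eps
numeric = (loss(W_pert) - loss(W_ji)) / eps
print(dJ_dW_ji[0, 0], numeric)   # should be close
```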


Vanishing gradients problem

[Figure: the same three-layer network with weights $w_{ji}$ and $w_{kj}$.]

  $\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j}\, \dfrac{\partial y_j}{\partial net_j}\, \dfrac{\partial net_j}{\partial w_{ji}}, \qquad \dfrac{\partial y_j}{\partial net_j} = f'\!\left(\sum_{i=1}^{d} w_{ji}\, x_i + w_{j0}\right)$

– The gradient contains the derivative of the activation function; when the activation saturates, $f'(\cdot)$ is close to 0, so the gradient that reaches the earlier layers vanishes.
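A tiny sketch (not from the slides) illustrating the effect: for a chain of sigmoid units the gradient is a product of factors $w\, f'(net)$ with $f'(net) \le 0.25$, so it typically shrinks quickly with depth. The depth and weights are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
depth = 20
a = 1.0
grad = 1.0
for layer in range(depth):
    w = rng.normal()           # made-up weight for this layer
    net = w * a
    a = sigmoid(net)
    # Each layer multiplies the gradient by w * f'(net),
    # and f'(net) = sigmoid(net) * (1 - sigmoid(net)) <= 0.25.
    grad *= w * sigmoid(net) * (1.0 - sigmoid(net))
print(grad)   # typically a very small number
```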


Stochastic Gradient Descent


Stochastic Gradient Descent (SGD)

– Given $n$ training examples, the target function can be expressed as

  $J(\mathbf{w}) = \sum_{i=1}^{n} J_i(\mathbf{w})$

– Batch gradient descent
  $\mathbf{w} \leftarrow \mathbf{w} - \dfrac{\eta}{n} \sum_{i=1}^{n} \nabla J_i(\mathbf{w})$

– In SGD, the gradient of $J(\mathbf{w})$ is approximated on a single example or a mini-batch of examples

  $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla J_i(\mathbf{w})$
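A minimal sketch (not from the slides) of the mini-batch update on a made-up least-squares problem; the batch size, learning rate and data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
t = X @ true_w + 0.01 * rng.normal(size=n)     # made-up regression targets

def grad_Ji(w, i):
    """Gradient of J_i(w) = 1/2 (x_i^T w - t_i)^2 for a single example."""
    return (X[i] @ w - t[i]) * X[i]

eta, batch_size = 0.01, 20
w = np.zeros(d)
for epoch in range(30):
    order = rng.permutation(n)                 # shuffle once per epoch
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        g = np.mean([grad_Ji(w, i) for i in batch], axis=0)
        w = w - eta * g                        # w <- w - eta * (mini-batch gradient)
print(np.linalg.norm(w - true_w))              # should be small
```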


One-example based Backpropagation

– Algorithm 1 (one-example based BP)
– 1 begin initialize network topology ($n_H$), $\mathbf{w}$, criterion $\theta$, $\eta$, $m \leftarrow 0$
– 2 do $m \leftarrow m + 1$
– 3 $\mathbf{x}^m \leftarrow$ randomly chosen pattern
– 4 $w_{ji} \leftarrow w_{ji} + \eta\, \delta_j x_i$; $\quad w_{kj} \leftarrow w_{kj} + \eta\, \delta_k y_j$
– 5 until $\|\nabla J(\mathbf{w})\| < \theta$
– 6 return $\mathbf{w}$
– 7 end

– In one-example based training, a weight update may reduce the error on the single pattern being presented, yet increase the error on the full training set.

Mini-batch based SGD

– Divide the training set into mini-batches.
– In each epoch, randomly permute the mini-batches and take one mini-batch at a time to approximate the gradient.
– One epoch is usually defined to be one complete run through all of the training data.
– The estimated gradient at each iteration is more reliable.

Mini-Batch Backpropagation

– Algorithm 2 (mini-batch based BP)
– 1 begin initialize network topology ($n_H$), $\mathbf{w}$, criterion $\theta$, $\eta$, $r \leftarrow 0$
– 2 do $r \leftarrow r + 1$ (increment epoch)
– 3 $m \leftarrow 0$; $\Delta w_{ji} \leftarrow 0$; $\Delta w_{kj} \leftarrow 0$
– 4 do $m \leftarrow m + 1$
– 5 $\mathbf{x}^m \leftarrow$ randomly chosen pattern
– 6 $\Delta w_{ji} \leftarrow \Delta w_{ji} + \eta\, \delta_j x_i$; $\quad \Delta w_{kj} \leftarrow \Delta w_{kj} + \eta\, \delta_k y_j$
– 7 until $m = n$
– 8 $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$; $\quad w_{kj} \leftarrow w_{kj} + \Delta w_{kj}$
– 9 until $\|\nabla J(\mathbf{w})\| < \theta$
– 10 return $\mathbf{w}$
– 11 end

SGD Analysis

– One-example based SGD
  – Estimation of the gradient is noisy, and the weights may not move precisely down the gradient at each iteration.
  – Faster than batch learning, especially when the training data has redundancy.
  – Noise often results in better solutions.
  – The weights fluctuate and may not fully converge to a local minimum.
– Mini-batch based SGD
  – Conditions of convergence are well understood.
  – Some acceleration techniques only operate in batch learning.
  – Theoretical analysis of the weight dynamics and convergence rates is simpler.

Summary

A toy example
https://github.com/pytorch/examples/blob/master/mnist/main.py
https://pytorch.org/docs/stable/nn.html
https://pytorch.org/docs/stable/nn.html#linear
https://pytorch.org/docs/stable/nn.functional.html
https://pytorch.org/docs/stable/nn.functional.html#non-linear-activation-functions

Thank you!