
Multilayer Neural Network
Dr Chang Xu
School of Computer Science
The University of Sydney

Perceptron: the Prelude of Deep Learning
A neuron cell
Image credit to: thinglink.com

Perceptron: the Prelude of DL
[Rosenblatt, et al. 1958]
[Figure: a single neuron with inputs $x_1, x_2, \ldots, x_d$, weights $w_1, w_2, \ldots, w_d$, bias $b$, and output $o$]

$o = \sum_{i=1}^{d} w_i x_i + b$
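A minimal Python sketch of this weighted-sum computation (the variable names and example numbers are illustrative, not from the slides):

import numpy as np

def perceptron_output(x, w, b):
    """Weighted sum of the inputs plus bias: o = sum_i w_i * x_i + b."""
    return np.dot(w, x) + b

# Example: 3 inputs with arbitrary weights and bias
x = np.array([1.0, 0.5, -2.0])
w = np.array([0.2, -0.4, 0.1])
b = 0.3
print(perceptron_output(x, w, b))   # a single real-valued output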

Perceptron
– The neuron has a real-valued output which is a weighted sum of its inputs.
$\hat{y} = \sum_{i=1}^{d} w_i x_i + b = \mathbf{w}^{\top}\mathbf{x} + b$

where $\mathbf{w}$ is the weight vector, $\mathbf{x}$ is the input vector, $b$ is the bias, and $\hat{y}$ is the neuron's estimate of the desired output.
Recall the decision rule
If an example $(\mathbf{x}_n, y_n)$ can be correctly classified, we have
$y_n(\mathbf{w}^{\top}\mathbf{x}_n + b) > 0$,
otherwise
$y_n(\mathbf{w}^{\top}\mathbf{x}_n + b) < 0$.

Perceptron
The objective function of the Perceptron can be written as
$\min_{\mathbf{w}, b} L(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n(\mathbf{w}^{\top}\mathbf{x}_n + b)$
where $M$ stands for the set of the misclassified examples.

Perceptron
Gradient descent is used to solve the optimization problem:
$\nabla_{\mathbf{w}} L(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n \mathbf{x}_n$
$\nabla_{b} L(\mathbf{w}, b) = -\sum_{\mathbf{x}_n \in M} y_n$

Perceptron
$g(\mathbf{x}) = \mathrm{sign}\left(\sum_{i=1}^{d} w_i x_i + b\right) = \mathrm{sign}(\mathbf{w}^{\top}\mathbf{x})$ (with $x_0 = 1$ and $w_0 = b$), where
$\mathrm{sign}(a) = \begin{cases} 1, & \text{if } a \geq 0 \\ -1, & \text{if } a < 0 \end{cases}$
A linear classifier!
[Figure: a single-layer network with inputs $x_0 = 1, x_1, \ldots, x_d$ and weights $w_0, w_1, \ldots, w_d$ feeding the output $g(\mathbf{x})$]

Perceptron "AI winter"
Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.
• Before long researchers had begun to discover the Perceptron's limitations.
• Unless input categories were "linearly separable", a perceptron could not learn to discriminate between them.
• Unfortunately, it appeared that many important categories were not linearly separable.
• E.g., those inputs to an XOR gate that give an output of 1 (namely 10 & 01) are not linearly separable from those that do not (00 & 11).

Perceptron: Limitations
– Solving XOR (exclusive-or) with a Perceptron?

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

– A linear classifier $w_1 x_1 + w_2 x_2 + b$ cannot identify non-linear patterns.
[Figure: the four XOR points in the $(x_1, x_2)$ plane; no single line separates the class-1 points $(0,1), (1,0)$ from the class-0 points $(0,0), (1,1)$]
This failure caused the majority of researchers to walk away.

Breakthrough: Multi-Layer Perceptron
– In 1986, Geoffrey Hinton, David Rumelhart, and Ronald Williams published a paper "Learning representations by back-propagating errors", which introduced
– Backpropagation, a procedure to repeatedly adjust the weights so as to minimize the difference between the actual output and the desired output
– Hidden layers, which are neuron nodes stacked in between inputs and outputs, allowing neural networks to learn more complicated features (such as XOR logic)

Multilayer Neural Network

Multilayer Neural Networks
– Function composition can be described by a directed acyclic graph (hence feedforward networks)
– Depth is the maximum number of functions in the composition chain
– The final layer is called the output layer

Two-layer Neural Network
[Figure: a two-layer network with inputs $x_0 = 1, x_1, \ldots, x_d$, weights $w_0, w_1, \ldots, w_d$ and output $g(\mathbf{x})$; still a linear classifier, so it cannot separate the XOR points]

Three-layer Neural Network (with a hidden layer)
[Figure: a three-layer network with one hidden layer whose weights (values such as 0.5, $-1$, 0.7, 1, $-0.4$, $-1.5$, $\pm 1$) implement a nonlinear decision boundary that separates the XOR points]

Three-layer Neural Network (with a hidden layer)
– A 2D-input example
[Figure: each hidden unit defines a separating hyperplane; combining them yields a convex polygon region, and compositions of polygons give more general regions]

Feedforward
[Figure: feedforward network with inputs $x_1, \ldots, x_d$, hidden units $y_j$, and outputs]
$net_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}, \quad y_j = f(net_j)$

Universal Approximation Theorem
Let $\varphi(\cdot)$ be a nonconstant, bounded, and monotonically-increasing continuous function. Let $I_m$ denote the $m$-dimensional unit hypercube $[0,1]^m$. The space of continuous functions on $I_m$ is denoted by $C(I_m)$. Then, given any $\varepsilon > 0$ and any function $f \in C(I_m)$, there exist an integer $N$, real constants $v_i, b_i \in \mathbb{R}$ and real vectors $\mathbf{w}_i \in \mathbb{R}^m$, where $i = 1, \ldots, N$, such that we may define
$F(\mathbf{x}) = \sum_{i=1}^{N} v_i \, \varphi(\mathbf{w}_i^{\top}\mathbf{x} + b_i)$
as an approximate realization of the function $f$ (which is independent of $\varphi$); that is, $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in I_m$. In other words, functions of the form $F(\mathbf{x})$ are dense in $C(I_m)$.
Universality theorem (Hecht-Nielsen 1989): "Neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision."
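Going back to the perceptron update derived earlier in this section, here is a minimal NumPy sketch (toy data, learning rate and epoch count are invented for illustration) of gradient descent on the misclassified set; on XOR-style labels the loop never terminates, which is exactly the limitation discussed above:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Gradient descent on the perceptron loss, summed over misclassified points."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        scores = y * (X @ w + b)          # y_n (w^T x_n + b)
        M = scores <= 0                   # misclassified set
        if not M.any():
            break                         # all points correctly classified
        w += lr * (y[M][:, None] * X[M]).sum(axis=0)   # -grad_w = sum over M of y_n x_n
        b += lr * y[M].sum()                            # -grad_b = sum over M of y_n
    return w, b

# Linearly separable toy data (labels in {-1, +1}) -- converges quickly.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_and = np.array([-1, -1, -1, 1])     # AND is linearly separable
print(train_perceptron(X, y_and))
# XOR labels (-1, 1, 1, -1) are NOT linearly separable, so the loop never breaks.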
Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions: $\sigma(wx + b)$
Video credit to: https://neuralnetworksanddeeplearning.com
– The weight changes the shape of the curve.
– The larger the weight, the steeper the curve.
– The bias moves the graph, but doesn't change its shape.

Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– Increase the weight so that the output is a very good approximation of a step function.
– The step is at position $s = -b/w$.
– We simplify the problem by describing hidden neurons using just a single parameter $s$.

Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– Left: Setting the output weights to 1.2 and $-1.2$, we get a 'bump' function, which starts at 0.4, ends at 0.6 and has height 1.2.
– Right: By adding another pair of hidden neurons, we can get more bumps in the same network.

Expressive power of a three-layer neural network
– A visual explanation on one-input, one-output functions
Image credit to: https://neuralnetworksanddeeplearning.com
– By increasing the number of hidden neurons, we can make the approximation much better.

Expressive power of a three-layer neural network
– Overall procedure
[Figure: a hidden neuron with weight $w_1 = 100$ and bias $w_0 = b = -100$ (so the step is at $x = 1$) passed through a sigmoid, producing a near-step output]

Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Fix $w_2 = 0$, change $w_1$ and $b$:
Video credit to: https://neuralnetworksanddeeplearning.com
– Similar to the one-input case, the weight changes the shape and the bias changes the position.
– The step point: $s_x = -b/w_1$

Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Video credit to: https://neuralnetworksanddeeplearning.com

Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Image credit to: https://neuralnetworksanddeeplearning.com

Expressive power of a three-layer neural network
– How about multiple-input functions? The two-input case:
Image credit to: https://neuralnetworksanddeeplearning.com
– By adding a bias to the output neuron and adjusting the values of $h$ and $b$, we can construct a 'tower' function.
– The same idea can be used to compute as many towers as we like. We can also make them as thin as we like, and whatever height we like. As a result, we can ensure that the weighted output from the second hidden layer approximates any desired function of two variables.

A toy example
http://playground.tensorflow.org/
1. Linear case without a hidden layer
2. Nonlinear case with a hidden layer
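The bump construction described above is easy to reproduce. The following NumPy sketch (step positions, heights, and steepness are arbitrary choices for illustration) builds a one-hidden-layer network whose paired, steep sigmoid units form weighted bumps approximating a 1D target function:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump_network(x, steps, heights, steepness=1000.0):
    """One hidden layer of paired sigmoid units.
    Each pair (s_start, s_end) with output weights (+h, -h) forms a bump of height h."""
    out = np.zeros_like(x)
    for (s_start, s_end), h in zip(steps, heights):
        # A steep sigmoid steps near x = -b/w = s
        out += h * sigmoid(steepness * (x - s_start))
        out -= h * sigmoid(steepness * (x - s_end))
    return out

x = np.linspace(0, 1, 200)
target = np.sin(2 * np.pi * x)                     # function to approximate
steps = [(0.0, 0.2), (0.2, 0.4), (0.4, 0.6), (0.6, 0.8), (0.8, 1.0)]
heights = [target[(x >= a) & (x < b)].mean() for a, b in steps]  # piecewise-constant fit
approx = bump_network(x, steps, heights)
print(np.max(np.abs(approx - target)))   # error shrinks as more, thinner bumps are used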
Activation Functions

Activation functions
– Must be nonlinear: otherwise the whole network is equivalent to a linear classifier.
[Figure: a linear activation $f(net) = net$]

Activation functions
– Continuous and differentiable almost everywhere
– Monotonicity: otherwise it introduces additional local extrema in the error surface
[Figure: a non-monotonic activation, for which an output value might correspond to multiple input values]

Activation function $f(net)$
– Popular choices of activation functions (single input)
– Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$
– Tanh function (shifts the center of the Sigmoid to the origin): $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
– Rectified linear unit (ReLU): $f(x) = \max(0, x)$
– Leaky ReLU: $f(x) = \begin{cases} x, & \text{if } x \geq 0 \\ \alpha x, & \text{if } x < 0 \end{cases}$ where $\alpha$ is a small constant.

Vanishing gradient problems
– Saturation.
Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$
Tanh function: $f(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
At their extremes, their derivatives are close to 0, which kills the gradient and the learning process.

Not zero-centered activation
Sigmoid function: $f(x) = \dfrac{1}{1 + e^{-x}}$
ReLU function: $f(x) = \max(0, x)$
During training, all gradients will be all positive or all negative, which causes problems with learning.
[Figure: the possible update directions can only zig-zag towards the optimum point]

Dead ReLUs
ReLU function: $f(x) = \max(0, x)$
For negative inputs, ReLU gives 0, which means some neurons will not be activated and become dead.
Avoid large learning rates and wrong weight initialization, and try Leaky ReLU, etc.

Backpropagation

Recall the feedforward calculation
[Figure: feedforward network with inputs $x_1, \ldots, x_d$, hidden units $y_j$, and outputs $z_k$]
$net_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}, \quad y_j = f(net_j), \quad net_k = \sum_{j=1}^{n_H} w_{kj} y_j + w_{k0}, \quad z_k = f(net_k)$

Training error
– Euclidean distance:
$J(\mathbf{t}, \mathbf{z}) = \frac{1}{2}\sum_{k=1}^{c}(t_k - z_k)^2 = \frac{1}{2}\|\mathbf{t} - \mathbf{z}\|^2$
– If both $\mathbf{t}$ and $\mathbf{z}$ are probability distributions, we can use cross entropy:
$J(\mathbf{t}, \mathbf{z}) = -\sum_{k=1}^{c} t_k \log z_k$
– Cross entropy is asymmetric. In some cases we may use this symmetric form:
$J(\mathbf{t}, \mathbf{z}) = -\sum_{k=1}^{c} (t_k \log z_k + z_k \log t_k)$
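Returning to the activation functions listed above, a small NumPy sketch of the functions and their derivatives (the leak constant 0.01 is just an illustrative choice); evaluating the sigmoid and tanh derivatives at large |x| makes the saturation problem visible:

import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x >= 0, x, a * x)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)

def tanh_grad(x):
    return 1 - np.tanh(x) ** 2

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(x))   # ~0 at the extremes: the sigmoid saturates
print(tanh_grad(x))      # ~0 at the extremes: tanh saturates too
print(relu(x), leaky_relu(x))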
Cross Entropy Loss
– Given two probability distributions $\mathbf{t}$ and $\mathbf{z}$,
$\mathrm{CrossEntropy}(T, Z) = -\sum_{k} t_k \log z_k$
$= -\sum_{k} t_k \log t_k + \sum_{k} t_k \log t_k - \sum_{k} t_k \log z_k$
$= -\sum_{k} t_k \log t_k + \sum_{k} t_k \log \frac{t_k}{z_k}$
$= \mathrm{Entropy}(T) + \mathrm{KL}(T \,\|\, Z)$
– Entropy is a measure of the degree of disorder or randomness.
– KL (Kullback-Leibler) divergence is a measure of the information lost when $Z$ is used to approximate $T$.
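A quick numerical check of this decomposition (the two distributions below are arbitrary examples):

import numpy as np

t = np.array([0.7, 0.2, 0.1])          # "ground-truth" distribution T
z = np.array([0.5, 0.3, 0.2])          # approximating distribution Z

cross_entropy = -np.sum(t * np.log(z))
entropy       = -np.sum(t * np.log(t))
kl            =  np.sum(t * np.log(t / z))

print(cross_entropy, entropy + kl)     # the two values agree: CE(T,Z) = H(T) + KL(T||Z)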

Softmax Function
– In multi-class classification, the softmax function before the output layer assigns conditional probabilities (given $\mathbf{x}$) to each one of the $C$ classes:
$P(y = k \mid \mathbf{x}) = \dfrac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}$
[Figure: four logits pass through softmax; the output probabilities $z_1 = 0.1$, $z_2 = 0.2$, $z_3 = 0.6$, $z_4 = 0.1$ are compared with the one-hot ground truth $t_1 = 0$, $t_2 = 0$, $t_3 = 1$, $t_4 = 0$]
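A minimal sketch of softmax followed by the cross-entropy loss, using a one-hot target like the example above (the logit values themselves are made up):

import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))   # shift for numerical stability
    return e / e.sum()

logits = np.array([0.0, 0.7, 1.8, 0.0])   # hypothetical pre-softmax scores
z = softmax(logits)                        # probabilities summing to 1
t = np.array([0.0, 0.0, 1.0, 0.0])         # one-hot ground truth
loss = -np.sum(t * np.log(z))              # cross-entropy loss = -log z_3
print(z, loss)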

Gradient descent
– Weights are initialized with random values, and are then updated in a direction reducing the error:
$\Delta \mathbf{w} = -\eta \dfrac{\partial J}{\partial \mathbf{w}}$
where $\eta$ is the learning rate.
– Iteratively update
$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta \mathbf{w}(t)$
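A generic sketch of this update loop (the gradient function, learning rate, and stopping rule are placeholders):

import numpy as np

def gradient_descent(grad_J, w0, lr=0.1, n_iters=100):
    """w(t+1) = w(t) - lr * dJ/dw, repeated for a fixed number of iterations."""
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        w -= lr * grad_J(w)
    return w

# Example: minimize J(w) = ||w||^2 / 2, whose gradient is simply w.
print(gradient_descent(lambda w: w, w0=[3.0, -2.0]))   # converges towards [0, 0]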

Hidden-to-output weight $w_{kj}$
– Chain rule
$\dfrac{\partial J}{\partial w_{kj}} = \dfrac{\partial J}{\partial net_k} \dfrac{\partial net_k}{\partial w_{kj}} = \dfrac{\partial J}{\partial net_k}\, y_j$
where
$J(\mathbf{w}) = \frac{1}{2}\|\mathbf{t} - \mathbf{z}\|^2, \quad z_k = f(net_k), \quad net_k = \sum_{j=1}^{n_H} w_{kj} y_j + w_{k0}$
[Figure: network with inputs $x_1, \ldots, x_i, \ldots, x_d$, hidden units $y_1, \ldots, y_j, \ldots, y_{n_H}$, outputs $z_1, \ldots, z_k, \ldots, z_C$, and weights $w_{ji}$ (input-to-hidden) and $w_{kj}$ (hidden-to-output)]
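A NumPy sketch of this output-layer gradient for the squared error with sigmoid outputs (the sensitivity term $\partial J / \partial net_k$ is computed explicitly; the sizes and numbers are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([0.2, 0.9, 0.5])                  # hidden activations y_j (n_H = 3)
W_out = np.array([[0.1, -0.3, 0.8],            # w_kj, shape (C, n_H), C = 2
                  [0.5,  0.2, -0.1]])
b_out = np.array([0.0, 0.1])                   # w_k0
t = np.array([1.0, 0.0])                       # targets t_k

net_k = W_out @ y + b_out
z = sigmoid(net_k)                              # z_k = f(net_k)
dJ_dnet = -(t - z) * z * (1 - z)                # dJ/dnet_k for J = 0.5 * ||t - z||^2
grad_W_out = np.outer(dJ_dnet, y)               # dJ/dw_kj = dJ/dnet_k * y_j
grad_b_out = dJ_dnet
print(grad_W_out)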

Input-to-hidden weight $w_{ji}$
– Chain rule
$\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j} \dfrac{\partial y_j}{\partial net_j} \dfrac{\partial net_j}{\partial w_{ji}}$
with
$\dfrac{\partial J}{\partial y_j} = \sum_{k=1}^{C} \dfrac{\partial J}{\partial z_k} \dfrac{\partial z_k}{\partial y_j}$
$net_j = \sum_{i=1}^{d} w_{ji} x_i + w_{j0}, \quad y_j = f(net_j)$
[Figure: the same network, highlighting an input-to-hidden weight $w_{ji}$]
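Continuing the sketch above, the hidden-layer gradient backpropagates the output sensitivities through $w_{kj}$ (the sensitivities $\partial J / \partial net_k$ would come from the previous step; here they are placeholder values):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0, 0.0])            # inputs x_i (d = 4)
W_hid = np.random.randn(3, 4) * 0.1            # w_ji, shape (n_H, d)
b_hid = np.zeros(3)                            # w_j0

net_j = W_hid @ x + b_hid
y = sigmoid(net_j)                             # y_j = f(net_j)

dJ_dnet = np.array([0.05, -0.02])              # dJ/dnet_k from the output layer (placeholder)
W_out = np.array([[0.1, -0.3, 0.8],
                  [0.5,  0.2, -0.1]])

dJ_dy = W_out.T @ dJ_dnet                      # dJ/dy_j = sum_k dJ/dnet_k * w_kj
dJ_dnet_j = dJ_dy * y * (1 - y)                # multiply by f'(net_j) for the sigmoid
grad_W_hid = np.outer(dJ_dnet_j, x)            # dJ/dw_ji = dJ/dnet_j * x_i
print(grad_W_hid)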

Vanishing gradients problem
– For a deep composition such as $z = f(w_3 f(w_2 f(w_1 x)))$, the chain rule expresses the gradient with respect to an early-layer weight as a product of per-layer factors, e.g.
$\dfrac{\partial J}{\partial w_{ji}} = \dfrac{\partial J}{\partial y_j}\dfrac{\partial y_j}{\partial net_j}\dfrac{\partial net_j}{\partial w_{ji}}$,
where each backpropagated factor contains a derivative $f'(net)$.
– With saturating activations such as the sigmoid, each $f'(net)$ factor is small (at most $1/4$ for the sigmoid), so the product of many such factors shrinks towards zero and the gradients in the early layers vanish.
[Figure: the same multilayer network, illustrating the long chain of factors from the output back to an input-to-hidden weight]
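A tiny numerical illustration of this effect (the depth and weight scale are chosen arbitrarily): multiplying many sigmoid-derivative factors drives the accumulated gradient towards zero.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
depth = 20
x = 1.0
grad = 1.0
for _ in range(depth):
    w = np.random.randn() * 0.5                       # a weight for this layer
    net = w * x
    x = sigmoid(net)
    grad *= sigmoid(net) * (1 - sigmoid(net)) * w     # chain-rule factor f'(net) * w
print(grad)   # after 20 layers the accumulated factor is vanishingly small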

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD)
– Given $n$ training examples, the target function can be expressed as
$J(\mathbf{w}) = \sum_{i=1}^{n} J_i(\mathbf{w})$
– Batch gradient descent
$\mathbf{w} \leftarrow \mathbf{w} - \dfrac{\eta}{n} \sum_{i=1}^{n} \nabla J_i(\mathbf{w})$
– In SGD, the gradient of $J(\mathbf{w})$ is approximated on a single example or a mini-batch of examples
$\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla J_i(\mathbf{w})$
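A generic mini-batch SGD sketch (the per-example gradient function, batch size, learning rate, and toy regression problem are all placeholders for illustration):

import numpy as np

def sgd(grad_Ji, X, Y, w0, lr=0.01, batch_size=8, epochs=10):
    """Update w on randomly sampled mini-batches instead of the full training set."""
    w = np.array(w0, dtype=float)
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            # Average the per-example gradients over the mini-batch
            g = np.mean([grad_Ji(w, X[i], Y[i]) for i in batch], axis=0)
            w -= lr * g
    return w

# Example: least-squares regression, J_i(w) = 0.5 * (w^T x_i - y_i)^2
grad_Ji = lambda w, x, y: (w @ x - y) * x
X = np.random.randn(100, 3)
Y = X @ np.array([1.0, -2.0, 0.5])
print(sgd(grad_Ji, X, Y, w0=np.zeros(3), lr=0.1, epochs=50))   # approaches [1, -2, 0.5]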

One-example based Backpropagation
– Algorithm 1 (one-example based BP)
– 1 begin initialize network topology ($n_H$), $\mathbf{w}$, criterion $\theta$, $\eta$, $m \leftarrow 0$
– 2   do $m \leftarrow m + 1$
– 3     $\mathbf{x}^m \leftarrow$ randomly chosen pattern
– 4     $w_{kj} \leftarrow w_{kj} + \eta\, \delta_k y_j$;  $w_{ji} \leftarrow w_{ji} + \eta\, \delta_j x_i$
– 5   until $\|\nabla J(\mathbf{w})\| < \theta$
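Putting the pieces together, a compact sketch of this one-example backpropagation loop for a network with one sigmoid hidden layer (architecture, data, and stopping rule are illustrative; training runs for a fixed number of pattern presentations rather than until the gradient criterion θ is met):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_example_bp(X, T, n_hidden=8, lr=1.0, n_steps=20000):
    rng = np.random.default_rng(0)
    d, c = X.shape[1], T.shape[1]
    W_hid = rng.normal(scale=0.5, size=(n_hidden, d)); b_hid = np.zeros(n_hidden)
    W_out = rng.normal(scale=0.5, size=(c, n_hidden)); b_out = np.zeros(c)
    for _ in range(n_steps):
        i = rng.integers(len(X))                    # randomly chosen pattern
        x, t = X[i], T[i]
        y = sigmoid(W_hid @ x + b_hid)              # forward pass
        z = sigmoid(W_out @ y + b_out)
        delta_k = (t - z) * z * (1 - z)             # output sensitivities
        delta_j = (W_out.T @ delta_k) * y * (1 - y) # backpropagated hidden sensitivities
        W_out += lr * np.outer(delta_k, y); b_out += lr * delta_k   # w_kj += eta * delta_k * y_j
        W_hid += lr * np.outer(delta_j, x); b_hid += lr * delta_j   # w_ji += eta * delta_j * x_i
    return lambda x: sigmoid(W_out @ sigmoid(W_hid @ x + b_hid) + b_out)

# XOR: the problem the single-layer perceptron could not solve
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
net = train_one_example_bp(X, T)
print([float(net(x)) for x in X])   # typically close to [0, 1, 1, 0]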