
Introduction to Machine Learning Training Neural Networks
Prof. Kutty

Neural Networks


Neural Networks
architecture: input layer, hidden layers, output layer; parameters W
h(x̄, W) = f(z)
Fully connected (FC): each node is connected to all nodes from the previous layer

Neural Networks
architecture: input layer, hidden layers, output layer
h(x̄, W) = f(z)
e.g., for 28 × 28 images the input layer has 28 × 28 = 784 units
at each node: a weighted sum of inputs followed by a non-linear transformation (activation)
examples of activation functions:
• threshold
• logistic: σ(z) = 1 / (1 + e^{−z}), range 0 to 1
• hyperbolic tangent: tanh(z) = 2σ(2z) − 1, range −1 to 1
• ReLU: f(z) = max(0, z)
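A minimal NumPy sketch of these activation functions (the function names are mine, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    # logistic activation: output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh_act(z):
    # hyperbolic tangent, equivalently 2*sigmoid(2z) - 1: output in (-1, 1)
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    # rectified linear unit: f(z) = max(0, z)
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(sigmoid(z), tanh_act(z), relu(z))
```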

Neural Networks
Input layer, hidden layer, output layer; activation g at each hidden node

z_1^{(2)} = W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{10}^{(1)},   h_1^{(2)} = g(z_1^{(2)})
z_2^{(2)} = W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{20}^{(1)},   h_2^{(2)} = g(z_2^{(2)})
z_3^{(2)} = W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{30}^{(1)},   h_3^{(2)} = g(z_3^{(2)})

The hidden layer essentially transforms the input,
φ(x̄) = [h_1^{(2)}, h_2^{(2)}, h_3^{(2)}, 1],
z^{(3)} = W_{11}^{(2)} h_1^{(2)} + W_{12}^{(2)} h_2^{(2)} + W_{13}^{(2)} h_3^{(2)} + W_{10}^{(2)}
h(x̄, W) = f(z^{(3)})
…so that we can learn a linear classifier in a different feature space.
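As a sketch, the forward pass above can be written as follows (the weight layout, the ReLU hidden activation g, and the identity output f are assumptions for illustration, not the lecture's code):

```python
import numpy as np

def forward(x, W1, W2, g=lambda z: np.maximum(0.0, z), f=lambda z: z):
    # hidden layer: z^(2) = W^(1) [x; 1], then h^(2) = g(z^(2))
    z2 = W1 @ np.append(x, 1.0)      # W1: 3 hidden units x (2 inputs + bias)
    h2 = g(z2)
    # output layer: z^(3) = W^(2) [h^(2); 1], then h(x, W) = f(z^(3))
    z3 = W2 @ np.append(h2, 1.0)     # W2: 1 output x (3 hidden units + bias)
    return f(z3)

W1 = np.random.randn(3, 3) * 0.1     # rows: [W_j1, W_j2, W_j0] for j = 1, 2, 3
W2 = np.random.randn(1, 4) * 0.1     # [W_11, W_12, W_13, W_10]
print(forward(np.array([0.5, -1.0]), W1, W2))
```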

Given the following data, you learn a 2-layer NN with two hidden units, described by the following equations:
z_1^{(2)} = W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{10}^{(1)},   h_1^{(2)} = max{0, z_1^{(2)}}
z_2^{(2)} = W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{20}^{(1)},   h_2^{(2)} = max{0, z_2^{(2)}}
Show that the points in the new feature space are linearly separable.
(The lines z_1^{(2)} = 0 and z_2^{(2)} = 0 mark where each hidden unit switches on.)
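The exercise's data is not reproduced here. As a hypothetical illustration of the same idea (XOR-style points and hand-picked weights, both my own assumptions), two ReLU hidden units can make non-separable points separable:

```python
import numpy as np

# hypothetical XOR-style data (NOT the exercise's data): not linearly separable in x-space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

# hand-picked weights for the two ReLU hidden units (one possible choice):
#   z1 = x1 + x2 - 0.5,   z2 = x1 + x2 - 1.5
W = np.array([[1.0, 1.0, -0.5],
              [1.0, 1.0, -1.5]])
H = np.maximum(0.0, np.c_[X, np.ones(len(X))] @ W.T)   # feature map [h1, h2]
print(H)
# new features: (0,0)->(0,0), (0,1)->(0.5,0), (1,0)->(0.5,0), (1,1)->(1.5,0.5)
# the line h1 - 3*h2 = 0.25 now separates the two classes
```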

Training Neural Networks
Idea: use back-propagation
back-propagation ≈ stochastic gradient descent + chain rule

Single layer NN
Input layer with a bias/offset term +1 and inputs x_1, ..., x_d
h(x̄; w̄) = Σ_{i=1}^{d} w_i x_i + w_0 = z
Want to learn the weights w̄ = [w_1, w_2, ..., w_d]^T

Learning Neural Networks
S_n = {(x̄^{(i)}, y^{(i)})}_{i=1}^{n},   x̄ ∈ R^d,   y ∈ {−1, +1}
Goal: learn model parameters θ̄ so as to minimize loss over the training examples:
J(θ̄) = (1/n) Σ_{i=1}^{n} Loss(y^{(i)} h(x̄^{(i)}; θ̄))
where y^{(i)} is the true label and h(x̄^{(i)}; θ̄) is the prediction.

Overview of Optimization Procedure: SGD
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
θ̄^{(k+1)} = θ̄^{(k)} − η_k ∇_{θ̄} Loss(y^{(i)} h(x̄^{(i)}; θ̄))
Let’s compute this for a single layer NN

Single layer NN
Input layer with a bias/offset term +1 and inputs x_1, ..., x_d
w̄ = [w_1, w_2, ..., w_d]^T
Suppose the output is h(x̄; w̄) = f(z), where z = Σ_{i=1}^{d} w_i x_i + w_0
• The loss of this classifier wrt example (x̄, y) is given by Loss(yz).
Assuming hinge loss:
∇_{w_i} Loss(yz) = ∇_{w_i} max{1 − yz, 0}
By the chain rule, ∂Loss(yz)/∂w_i = (∂Loss(yz)/∂z)(∂z/∂w_i)
= −y x_i if 1 − yz > 0, and 0 otherwise
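A minimal sketch of this gradient computation (the function name and the explicit bias term are mine):

```python
import numpy as np

def hinge_loss_grad(x, y, w, w0):
    # z = w . x + w0, hinge loss = max{1 - y*z, 0}
    z = w @ x + w0
    if 1 - y * z > 0:
        # chain rule: dLoss/dw_i = (dLoss/dz)(dz/dw_i) = -y * x_i  (and -y for the bias)
        return -y * x, -y
    return np.zeros_like(w), 0.0
```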

SGD for a Single Layer NN
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
#”($%&) = #”($) ++$ ( !”
#”($%&) =#”($) ++$ (!” 1−&’>0
$ ̅ ( , – ) ) = $ ̅ ( , ) − 1 , ∇ 0/ L o s s , ‘ – . ̅ ‘ ; $ ̅
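Putting the update rule into a loop gives a sketch like the one below; the fixed learning rate eta, the epoch count, and the random initialization scale are assumptions, since the slides do not fix them:

```python
import numpy as np

def sgd_single_layer(X, y, eta=0.1, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = rng.normal(scale=0.01, size=d)       # (0) small random initialization
    w0 = 0.0
    for _ in range(epochs):
        i = rng.integers(n)                  # (1) select a point at random
        z = w @ X[i] + w0
        if 1 - y[i] * z > 0:                 # (2) hinge-loss SGD update
            w += eta * y[i] * X[i]
            w0 += eta * y[i]
    return w, w0
```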

Learning Neural Networks
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
θ̄^{(k+1)} = θ̄^{(k)} − η_k ∇_{θ̄} Loss(y^{(i)} h(x̄^{(i)}; θ̄))
Let’s compute this for 2 layer NNs

SGD – two layer NN
Goal: work out update for each parameter
Since there are only two layers, for notational convenience,
let’s drop the superscripts:
We will consider hinge loss: Loss(yz) = max{0, 1 − yz}

SGD – two layer NN

SGD – two layer NN
Goal: work out update for each parameter. Idea: use back-propagation
back-propagation = gradient descent + chain rule
go through each layer in reverse to measure the error contribution
of each connection (backward propagate)

SGD – two layer NN
Goal: work out update for each parameter. Start with the last layer and work backwards.
Notice that the last layer is a simple linear classifier:
v_j^{(k+1)} = v_j^{(k)} + η_k y h_j [[(1 − yz) > 0]]

SGD – two layer NN
Goal: work out update for each parameter
Update for parameters in the hidden layer is more complicated: it requires
computing the derivative of the composition of two or more functions.
Idea: use back-propagation
back-propagation = gradient descent + chain rule
∂Loss(yz)/∂w_ji = (∂Loss(yz)/∂z)(∂z/∂h_j)(∂h_j/∂z_j)(∂z_j/∂w_ji)

SGD – two layer NN
Goal: work out the update for each parameter w_ji, for j ∈ {1, …, m} and i ∈ {1, …, d}
∂Loss(yz)/∂w_ji = (∂Loss(yz)/∂z)(∂z/∂h_j)(∂h_j/∂z_j)(∂z_j/∂w_ji)
= −y [[(1 − yz) > 0]] v_j [[z_j > 0]] x_i
w_ji^{(k+1)} = w_ji^{(k)} + η_k y [[(1 − yz) > 0]] v_j [[z_j > 0]] x_i

Estimating parameters of NNs: SGD
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
(for a two-layer NN):
v_j^{(k+1)} = v_j^{(k)} + η_k y h_j [[(1 − yz) > 0]]
w_ji^{(k+1)} = w_ji^{(k)} + η_k y [[(1 − yz) > 0]] v_j [[z_j > 0]] x_i
(general form: θ̄^{(k+1)} = θ̄^{(k)} − η_k ∇_{θ̄} Loss(y^{(i)} h(x̄^{(i)}; θ̄)))
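A sketch of these two update rules for a two-layer net with ReLU hidden units and hinge loss; the number of hidden units m, the learning rate, and the initialization are assumptions for illustration:

```python
import numpy as np

def sgd_two_layer(X, y, m=3, eta=0.1, epochs=200, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(m, d))   # hidden weights w_ji
    b = np.zeros(m)                            # hidden biases w_j0
    v = rng.normal(scale=0.01, size=m)         # output weights v_j
    v0 = 0.0
    for _ in range(epochs):
        i = rng.integers(n)
        # forward pass
        zh = W @ X[i] + b
        h = np.maximum(0.0, zh)                # ReLU hidden units
        z = v @ h + v0
        if 1 - y[i] * z > 0:                   # hinge loss is active
            # hidden-layer factor uses the current v_j (before v is changed)
            delta = y[i] * v * (zh > 0)
            # output layer: v_j <- v_j + eta * y * h_j
            v += eta * y[i] * h
            v0 += eta * y[i]
            # hidden layer: w_ji <- w_ji + eta * y * v_j * [[z_j > 0]] * x_i
            W += eta * np.outer(delta, X[i])
            b += eta * delta
    return W, b, v, v0
```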

Use backprop to compute the gradient in SGD
• For each training instance (x̄^{(i)}, y^{(i)}):
– make a prediction h(x̄^{(i)}; θ̄)
• called forward propagation
– measure the loss of the prediction h(x̄^{(i)}; θ̄) with respect to the true label y^{(i)}
• Loss(y^{(i)} h(x̄^{(i)}; θ̄))
– go through each layer in reverse to measure the error contribution of each connection, i.e., compute ∂Loss(y^{(i)} h(x̄^{(i)}; θ̄))/∂θ for each parameter
• called backward propagation
– tweak the weights to reduce the error (SGD update step)

SGD – three layer NN
Input layer, hidden layers, output layer
First hidden layer:  z_j^{(2)} = Σ_{i=1}^{2} w_{ji}^{(1)} x_i + w_{j0}^{(1)},   h_j^{(2)} = g(z_j^{(2)})
Second hidden layer: z_j^{(3)} = Σ_i w_{ji}^{(2)} h_i^{(2)} + w_{j0}^{(2)},   h_j^{(3)} = g(z_j^{(3)})
Output layer:        ŷ = Σ_j w_{1j}^{(3)} h_j^{(3)} + w_{10}^{(3)}

SGD – three layer NN (same network as above)
Chain rule, for the weight w_{11}^{(1)}:
∂L/∂w_{11}^{(1)} = Σ_{j=1}^{3} (∂L/∂ŷ)(∂ŷ/∂h_j^{(3)})(∂h_j^{(3)}/∂z_j^{(3)})(∂z_j^{(3)}/∂h_1^{(2)})(∂h_1^{(2)}/∂z_1^{(2)})(∂z_1^{(2)}/∂w_{11}^{(1)})

SGD – three layer NN (same network as above)
Chain rule, now for the weight w_{12}^{(1)}:
∂L/∂w_{12}^{(1)} = Σ_{j=1}^{3} (∂L/∂ŷ)(∂ŷ/∂h_j^{(3)})(∂h_j^{(3)}/∂z_j^{(3)})(∂z_j^{(3)}/∂h_1^{(2)})(∂h_1^{(2)}/∂z_1^{(2)})(∂z_1^{(2)}/∂w_{12}^{(1)})
local derivatives are shared!
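This sharing is what makes backprop efficient: everything in the chain except the last factor ∂z_1^{(2)}/∂w_{1i}^{(1)} = x_i is the same for every weight into hidden unit 1, so it can be computed once and reused. A sketch of that idea (the names delta_1 and grads_into_unit are mine):

```python
import numpy as np

# delta_1: the shared upstream factor for hidden unit 1 of the first hidden layer,
#   sum_{j=1}^{3} (dL/dyhat)(dyhat/dh_j^(3))(dh_j^(3)/dz_j^(3))(dz_j^(3)/dh_1^(2))(dh_1^(2)/dz_1^(2))
# Every weight w_1i^(1) into that unit reuses the same delta_1:
def grads_into_unit(delta_1, x):
    # dL/dw_1i^(1) = delta_1 * x_i   (and delta_1 * 1 for the bias weight w_10^(1))
    return delta_1 * np.append(x, 1.0)

print(grads_into_unit(0.25, np.array([0.5, -1.0])))
```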

NNs: details
• How to choose an architecture
• Avoiding Overfitting
• How to pick learning rate
• Issues with gradient values

Choice of architecture
Number of hidden layers
• NN with one hidden layer can model complex functions provided it has enough neurons (i.e., is wide)
• in fact, 2 layer NNs can approximate any* function
• Deeper nets can converge faster to a good solution
• Take-away: Start with 1-2 hidden layers and ramp up
until you start to overfit
Number of Neurons per Hidden Layer
• # nodes in input & output determined by task at hand
• Common practice: same size for all hidden layers (one tuning
parameter)
• Take-away: increase number of neurons until you start to overfit

How to set learning rate?
• Simplest recipe: keep it fixed and use the same for all parameters.
– Note: better results can generally be obtained by allowing learning rates to decrease (learning schedule)
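A simple sketch of such a decreasing schedule; the specific form η_k = η_0 / (1 + k/T) and the constants are just one common choice, not one prescribed by the slides:

```python
def learning_rate(k, eta0=0.1, T=100):
    # one common decreasing schedule: eta_k = eta_0 / (1 + k/T)
    return eta0 / (1.0 + k / T)

# e.g. eta_0 = 0.1 at k = 0, about 0.05 at k = 100, about 0.01 at k = 900
```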

Reducing Overfitting
1. Early stopping: interrupt training when performance on the validation set starts to decrease
2. L1 & L2 regularization: applied as before to all weights
3. Dropout

Reducing Overfitting: Dropout
• dropout: during training, with probability p each unit might be turned off randomly at each iteration
• when a unit is turned off, its associated weights are not updated
• at test time, use all units but multiply the weights by (1 − p)
Figure 1: Dropout Neural Net Model. (a) Left: a standard neural net with 2 hidden layers. (b) Right: an example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.
[Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," 2014]
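A sketch of this scheme, using the slide's convention that p is the drop probability (units kept with probability 1 − p during training, activations scaled by 1 − p at test time); the function name is mine:

```python
import numpy as np

def dropout_forward(h, p, train, rng=np.random.default_rng(0)):
    # h: hidden-layer activations; p: probability of turning a unit off
    if train:
        mask = (rng.random(h.shape) >= p).astype(float)   # keep each unit with prob 1 - p
        return h * mask, mask    # dropped units output 0 and their weights get no update
    # at test time: use all units, scaled by (1 - p) to match expected training-time activity
    return h * (1.0 - p), None
```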

Vanishing/Exploding Gradients
• It turns out that the gradient in deep neural networks
is unstable, tending to either explode or vanish in earlier layers.
• e.g., sometimes gradients get smaller and smaller as the algorithm progresses down to the lower layers; this can leave the lower-layer connection weights virtually unchanged, so training never converges to a good solution
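A small numeric illustration of the vanishing effect (the depth of 10 and the random pre-activations are arbitrary choices of mine): with sigmoid activations, each layer multiplies the backpropagated gradient by σ'(z) ≤ 0.25, so the product shrinks quickly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
grad = 1.0
for layer in range(10):                       # 10 sigmoid layers
    z = rng.normal()
    grad *= sigmoid(z) * (1.0 - sigmoid(z))   # each factor sigma'(z) <= 0.25
print(grad)   # roughly 1e-7 or smaller: early-layer weights barely move
```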
