Introduction to Machine Learning: Training Neural Networks
Prof. Kutty
Neural Networks
Neural Networks: architecture
• Input layer
• Hidden layers
• Output layer (parameters W)
• output of the network: h(x̄, W) = f(z)
Fully connected (FC): each node is connected to all nodes from previous layer
Neural Networks: architecture
• Input layer (e.g., a 28 × 28 input image gives 28 × 28 = 784 input units)
• Hidden layers
• Output layer
• output of the network: h(x̄, W) = f(z)
at each node: a weighted sum of inputs, followed by a non-linear transformation (the activation function)
examples of activation functions:
• threshold
• logistic: σ(z) = 1 / (1 + e^(−z)), range 0 to 1
• hyperbolic tangent: tanh(z) = 2σ(2z) − 1, range −1 to 1
• ReLU: f(z) = max(0, z)
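As a small, hedged illustration (not part of the original slides), these activations can be written directly in NumPy:

```python
import numpy as np

def logistic(z):
    # sigma(z) = 1 / (1 + e^(-z)); output range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh_act(z):
    # tanh(z) = 2 * sigma(2z) - 1; output range (-1, 1)
    return 2.0 * logistic(2.0 * z) - 1.0

def relu(z):
    # f(z) = max(0, z)
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(logistic(z), tanh_act(z), relu(z))
```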
Neural Networks
activation (input layer → hidden layer → output layer):
z_1^(2) = W_11^(1) x_1 + W_12^(1) x_2 + W_10^(1),   h_1^(2) = g(z_1^(2))
z_2^(2) = W_21^(1) x_1 + W_22^(1) x_2 + W_20^(1),   h_2^(2) = g(z_2^(2))
z_3^(2) = W_31^(1) x_1 + W_32^(1) x_2 + W_30^(1),   h_3^(2) = g(z_3^(2))
The hidden layer essentially transforms the input into a new feature representation
φ(x̄) = [h_1^(2), h_2^(2), h_3^(2), 1]
z^(3) = W_11^(2) h_1^(2) + W_12^(2) h_2^(2) + W_13^(2) h_3^(2) + W_10^(2)
h(x̄, W) = f(z^(3))
…so that we can learn a linear classifier in a different feature space.
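A minimal sketch of this forward pass in NumPy (the weights below are random placeholders, not values from the slides):

```python
import numpy as np

def forward(x, W1, b1, W2, b2, g=np.tanh):
    # hidden layer: z^(2) = W^(1) x + offsets, h^(2) = g(z^(2))
    z2 = W1 @ x + b1
    h2 = g(z2)
    # output layer: z^(3) = W^(2) h^(2) + offset
    z3 = W2 @ h2 + b2
    # returning z^(3); the final f (e.g., sign for classification) is applied on top
    return z3

# 2 inputs, 3 hidden units, 1 output, matching the slide's equations
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(forward(np.array([1.0, -1.0]), W1, b1, W2, b2))
```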
Given the following data, you learn a 2-layer NN with two hidden units, described by the following equations:
z_1^(2) = W_11^(1) x_1 + W_12^(1) x_2 + W_10^(1),   h_1^(2) = max{0, z_1^(2)}
z_2^(2) = W_21^(1) x_1 + W_22^(1) x_2 + W_20^(1),   h_2^(2) = max{0, z_2^(2)}
Show that the points in the new feature space (h_1^(2), h_2^(2)) are linearly separable.
(Hint: consider the lines z_1^(2) = 0 and z_2^(2) = 0 in the input space.)
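The data table itself did not survive extraction; purely as a hypothetical illustration (XOR-like points and hand-picked weights of my own, not from the slides), one can check numerically that ReLU hidden units can make such points linearly separable:

```python
import numpy as np

# hypothetical XOR-like data (not from the slides)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])

# hand-picked hidden-unit weights (also hypothetical)
def hidden(x):
    z1 = x[0] + x[1] - 0.5   # z_1^(2)
    z2 = x[0] + x[1] - 1.5   # z_2^(2)
    return np.array([max(0.0, z1), max(0.0, z2)])  # ReLU features

H = np.array([hidden(x) for x in X])
# a linear classifier in the new feature space: sign(h1 - 3*h2 - 0.25)
preds = np.sign(H[:, 0] - 3 * H[:, 1] - 0.25)
print(H)            # transformed points
print(preds == y)   # all True -> linearly separable in feature space
```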
Training Neural Networks
Idea: use back-propagation
back-propagation ≈ stochastic gradient descent + chain rule
Single layer NN
Input layer: x_1, …, x_d with weights w_1, …, w_d, plus a bias/offset input +1 with weight w_0
h(x̄; w̄) = Σ_{i=1}^d w_i x_i + w_0 = z
Want to learn the weights w̄ = [w_1, w_2, …, w_d]^T
Learning Neural Networks
Sn = {(x̄^(i), y^(i))}_{i=1}^n,   x̄ ∈ R^d,   y ∈ {−1, 1}
Goal: learn model parameters θ̄ so as to minimize loss over the training examples
J(θ̄) = Σ_{i=1}^n Loss(y^(i) h(x̄^(i); θ̄))
(y^(i) is the true label, h(x̄^(i); θ̄) is the prediction)
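A hedged sketch of this objective (assuming a linear model h(x̄; θ̄) = θ̄ · x̄ and hinge loss, consistent with the later slides; the data here is a toy placeholder):

```python
import numpy as np

def empirical_loss(theta, X, y):
    # J(theta) = sum_i max{0, 1 - y^(i) * h(x^(i); theta)}  (hinge loss)
    z = X @ theta                      # h(x; theta) for every example
    return np.sum(np.maximum(0.0, 1.0 - y * z))

X = np.array([[1.0, 2.0], [-1.0, 0.5]])   # toy data (hypothetical)
y = np.array([1, -1])
print(empirical_loss(np.array([0.1, -0.2]), X, y))
```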
Overview of Optimization Procedure: SGD
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss(y^(i) h(x̄^(i); θ̄))
Let’s compute this for a single layer NN
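A generic sketch of this procedure (loss_grad is a hypothetical stand-in for the per-example gradient, which is worked out for specific architectures below):

```python
import numpy as np

def sgd(X, y, loss_grad, eta=0.1, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.normal(size=X.shape[1])       # (0) small random initialization
    for k in range(n_steps):
        i = rng.integers(len(y))                     # (1) select a point at random
        theta -= eta * loss_grad(theta, X[i], y[i])  # (2) step against the gradient
    return theta
```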
Single layer NN
Input layer: x_1, …, x_d with weights w_1, …, w_d, plus a bias/offset input +1 with weight w_0
w̄ = [w_1, w_2, …, w_d]^T
Suppose the output is h(x̄; w̄) = f(z), where z = Σ_{i=1}^d w_i x_i + w_0
• The loss of this classifier with respect to example (x̄, y) is given by Loss(yz).
Assuming hinge loss:
∇_w̄ Loss(yz) = ∇_w̄ max{1 − yz, 0}
By the chain rule:
∂Loss(yz)/∂w_i = (∂Loss(yz)/∂z) · (∂z/∂w_i)
             = −y x_i   if 1 − yz > 0
             = 0        otherwise
SGD for a Single Layer NN
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
w_i^(k+1) = w_i^(k) + η_k y x_i    if 1 − yz > 0
w_i^(k+1) = w_i^(k)                otherwise
(general form: θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss(y^(i) h(x̄^(i); θ̄)))
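A hedged sketch of this update in NumPy (function and variable names are mine; the bias is handled by appending a constant +1 feature):

```python
import numpy as np

def sgd_step_single_layer(w, x, y, eta):
    # z = w . x  (the last entry of x is the constant +1 offset feature)
    z = w @ x
    if 1 - y * z > 0:                # hinge loss is active
        w = w + eta * y * x          # w_i^(k+1) = w_i^(k) + eta_k * y * x_i
    return w                         # otherwise the weights are unchanged

w = np.zeros(3)
x, y = np.array([1.0, 2.0, 1.0]), -1
print(sgd_step_single_layer(w, x, y, eta=0.1))
```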
Learning Neural Networks
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss(y^(i) h(x̄^(i); θ̄))
Let’s compute this for 2 layer NNs
SGD – two layer NN
Goal: work out update for each parameter
Since there are only two layers, for notational convenience, let's drop the superscripts:
hidden layer: z_j = Σ_i w_ji x_i + w_j0,   h_j = max{0, z_j}
output layer: z = Σ_j v_j h_j + v_0
We will consider hinge loss: Loss(yz) = max{0, 1 − yz}
SGD – two layer NN
Goal: work out update for each parameter Idea: use back-propagation
back-propagation = gradient descent + chain rule
go through each layer in reverse to measure the error contribution
of each connection (backward propagate)
SGD – two layer NN
Goal: work out update for each parameter Start with the last layer and work backwards
Notice that the last layer is a simple linear classifier
v_j^(k+1) = v_j^(k) + η_k y h_j [[(1 − yz) > 0]]
SGD – two layer NN
Goal: work out update for each parameter
Update for the parameters in the hidden layer is more complicated: it requires
computing the derivative of the composition of two or more functions (the chain rule).
Idea: use back-propagation
back-propagation = gradient descent + chain rule
∂Loss(yz)/∂w_ji = (∂Loss(yz)/∂z) · (∂z/∂h_j) · (∂h_j/∂z_j) · (∂z_j/∂w_ji)
SGD – two layer NN
Goal: work out the update for each parameter w_ji, for each hidden unit j and each input i
∂Loss(yz)/∂w_ji = (∂Loss(yz)/∂z) · (∂z/∂h_j) · (∂h_j/∂z_j) · (∂z_j/∂w_ji)
              = −y [[(1 − yz) > 0]] · v_j · [[z_j > 0]] · x_i
w_ji^(k+1) = w_ji^(k) + η_k y [[(1 − yz) > 0]] v_j [[z_j > 0]] x_i
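A hedged sketch of one SGD step for this two-layer network (ReLU hidden units, hinge loss, single output; the bias updates follow the same pattern with x_0 = 1 and h_0 = 1; variable names are mine). Note that [[z_j > 0]] is exactly the derivative of h_j = max{0, z_j}.

```python
import numpy as np

def sgd_step_two_layer(W, w0, v, v0, x, y, eta):
    # forward pass
    zh = W @ x + w0                  # z_j = sum_i w_ji x_i + w_j0
    h = np.maximum(0.0, zh)          # h_j = max{0, z_j}
    z = v @ h + v0                   # output z = sum_j v_j h_j + v_0

    if 1 - y * z > 0:                # [[ (1 - yz) > 0 ]]: hinge loss is active
        # hidden-layer error signal, computed with the current v_j and shared over inputs i
        delta = y * v * (zh > 0)
        # output-layer update: v_j += eta * y * h_j
        v = v + eta * y * h
        v0 = v0 + eta * y
        # hidden-layer update: w_ji += eta * y * [[ (1 - yz) > 0 ]] * v_j * [[z_j > 0]] * x_i
        W = W + eta * np.outer(delta, x)
        w0 = w0 + eta * delta
    return W, w0, v, v0
```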
Estimating parameters of NNs: SGD
Idea: sample a point at random, nudge parameters toward values that would improve classification on that particular example
(0) Initialize parameters to small random values
(1) Select a point at random
(2) Update the parameters based on that point and the gradient:
(for a two-layer NN):
v_j^(k+1) = v_j^(k) + η_k y h_j [[(1 − yz) > 0]]
w_ji^(k+1) = w_ji^(k) + η_k y [[(1 − yz) > 0]] v_j [[z_j > 0]] x_i
θ̄^(k+1) = θ̄^(k) − η_k ∇_θ̄ Loss(y^(i) h(x̄^(i); θ̄))
Use backprop to compute the gradient in SGD
• For each training instance (x̄^(i), y^(i)):
– make a prediction h(x̄^(i), θ̄)
• called forward propagation
– measure the loss of the prediction h(x̄^(i), θ̄) with respect to the true label y^(i)
• Loss(y^(i) h(x̄^(i), θ̄))
– go through each layer in reverse to measure the error contribution of each connection
• called backward propagation
• i.e., compute ∂Loss(y^(i) h(x̄^(i); θ̄)) / ∂θ for each parameter θ
– tweak the weights to reduce the error (SGD update step)
SGD – three layer NN
Input layer → Hidden layers → Output layer; at each layer, z_j = Σ_i w_ji h_i + w_j0:
z_j^(2) = Σ_{i=1}^2 w_ji^(1) x_i + w_j0^(1),   h_j^(2) = g(z_j^(2))
z_j^(3) = Σ_i w_ji^(2) h_i^(2) + w_j0^(2),     h_j^(3) = g(z_j^(3))
ŷ = Σ_j w_1j^(3) h_j^(3) + w_10^(3)
SGD – three layer NN
(same network as above)
Chain rule:
∂L/∂w_11^(1) = Σ_{j=1}^3 (∂L/∂ŷ) (∂ŷ/∂h_j^(3)) (∂h_j^(3)/∂z_j^(3)) (∂z_j^(3)/∂h_1^(2)) (∂h_1^(2)/∂z_1^(2)) (∂z_1^(2)/∂w_11^(1))
SGD – three layer NN
(same network as above)
Chain rule:
∂L/∂w_12^(1) = (∂L/∂ŷ) Σ_{j=1}^3 (∂ŷ/∂h_j^(3)) (∂h_j^(3)/∂z_j^(3)) (∂z_j^(3)/∂h_1^(2)) (∂h_1^(2)/∂z_1^(2)) (∂z_1^(2)/∂w_12^(1))
local derivatives are shared!
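To make the "shared local derivatives" point concrete, here is a hedged sketch (sigmoid activations and random placeholder weights of my own): the gradients for w_11^(1) and w_12^(1) reuse the same per-node factor and differ only in the final input term x_i.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden layer (2)
W2, b2 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden (2) -> hidden (3)
w3, b3 = rng.normal(size=3), 0.0                # hidden (3) -> output y_hat

# forward pass
z2 = W1 @ x + b1;  h2 = sigmoid(z2)
z3 = W2 @ h2 + b2; h3 = sigmoid(z3)
y_hat = w3 @ h3 + b3
dL_dyhat = 1.0                                   # placeholder for dL/d y_hat

# shared factor ("delta") for hidden-layer-(2) unit 1:
# sum_j (dL/dy)(dy/dh3_j)(dh3_j/dz3_j)(dz3_j/dh2_1) * (dh2_1/dz2_1)
delta_1 = (dL_dyhat * w3 * h3 * (1 - h3) * W2[:, 0]).sum() * h2[0] * (1 - h2[0])

# the two gradients differ only in the last factor dz2_1/dw_{1i}^(1) = x_i
grad_w11 = delta_1 * x[0]
grad_w12 = delta_1 * x[1]
print(grad_w11, grad_w12)
```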
NNs: details
• How to choose an architecture
• Avoiding Overfitting
• How to pick learning rate
• Issues with gradient values
Choice of architecture
Number of hidden layers
• NN with one hidden layer can model complex functions provided it has enough neurons (i.e., is wide)
• in fact, 2 layer NNs can approximate any* function
• Deeper nets can converge faster to a good solution
• Take-away: Start with 1-2 hidden layers and ramp up
until you start to overfit
Number of Neurons per Hidden Layer
• # nodes in input & output determined by task at hand
• Common practice: same size for all hidden layers (one tuning
parameter)
• Take-away: increase number of neurons until you start to overfit
How to set learning rate?
• Simplest recipe: keep it fixed and use the same for all parameters.
– Note: better results can generally be obtained by allowing learning rates to decrease (learning schedule)
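One common decaying schedule, as a hedged example (the specific form and constants are mine, not prescribed by the slides):

```python
def learning_rate(k, eta0=0.1, decay=0.01):
    # eta_k = eta0 / (1 + decay * k): starts at eta0 and decreases over iterations
    return eta0 / (1.0 + decay * k)

print([round(learning_rate(k), 4) for k in (0, 100, 1000)])
```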
Reducing Overfitting
1. Early stopping: interrupt training when performance on the validation set starts to decrease
2. L1 & L2 regularization: applied as before to all weights
3. Dropout
Reducing Overfitting: Dropout
• dropout: during training, with probability p, each unit might be turned off randomly at each iteration
• when a unit is turned off, the associated weights are not updated
• at test time, use all units but scale the weights by (1 − p)
[Figure 1: Dropout Neural Net Model. Left: a standard neural net with 2 hidden layers. Right: an example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.]
[Srivastava et al., “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” 2014]
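A hedged, minimal NumPy sketch of this scheme (drop probability p; scaling a unit's outgoing activation by 1 − p at test time, as done here, is equivalent to scaling its outgoing weights by 1 − p):

```python
import numpy as np

def dropout_forward(h, p, training, rng=None):
    # apply dropout to a layer's activations h with drop probability p
    rng = rng or np.random.default_rng()
    if training:
        mask = rng.random(h.shape) >= p    # keep each unit with probability 1 - p
        return h * mask                    # dropped units contribute nothing this iteration
    return h * (1.0 - p)                   # test time: all units, scaled by (1 - p)

h = np.array([0.2, 1.5, 0.7, 0.0, 2.1])
print(dropout_forward(h, p=0.5, training=True))
print(dropout_forward(h, p=0.5, training=False))
```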
Vanishing/Exploding Gradients
• It turns out that the gradient in deep neural networks
is unstable, tending to either explode or vanish in earlier layers.
• e.g., sometimes gradients get smaller and smaller as the algorithm progresses down to the lower layers; this can leave the lower-layer connection weights virtually unchanged, and training never converges to a good solution
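A small hedged illustration of the vanishing case (my sketch, using the fact that the logistic derivative is at most 0.25, so the backpropagated product of per-layer factors shrinks geometrically in a deep chain of sigmoid units):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the backpropagated gradient picks up one factor sigma'(z) * w per layer;
# with |w| around 1 and sigma'(z) <= 0.25, the product shrinks geometrically
rng = np.random.default_rng(0)
factor = 1.0
for layer in range(20):
    z = rng.normal()
    w = rng.normal()
    factor *= sigmoid(z) * (1 - sigmoid(z)) * w
    print(f"layer {layer:2d}: accumulated gradient factor = {factor:.2e}")
```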