Foundations of Machine Learning: Neural Networks
Kate Farrahi
ECS Southampton
December 8, 2020
Gradient Descent
repeat until convergence:
    w ← w − η ∂J/∂w   (1)

where w is a multidimensional vector representing all of the weights in the model and η is the learning rate.
In order to get gradient descent working in practice, we need to compute ∂J/∂w. For neural networks, there are two stages to this computation: (1) the forward pass and (2) the backward pass.
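The update rule above can be sketched in a few lines of Python; the cost J(w) = (w − 3)², its gradient and the learning rate are illustrative choices, not from the lecture:

```python
# Gradient descent on the illustrative cost J(w) = (w - 3)^2,
# whose gradient is dJ/dw = 2(w - 3) and whose minimum is at w = 3.
def grad_J(w):
    return 2.0 * (w - 3.0)

w = 0.0      # initial weight
eta = 0.1    # learning rate
for _ in range(200):            # "repeat until convergence"
    w = w - eta * grad_J(w)     # w <- w - eta * dJ/dw
```

After a few hundred iterations, w is numerically indistinguishable from the minimiser w = 3.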
Batch Gradient Descent
begin initialize w, Th, η, m ← 0, r ← 0
    do r ← r + 1 (increment epoch)
        m ← 0; ∆w ← 0
        do m ← m + 1
            x_m ← selected pattern
            ∆w ← ∆w − η ∂J/∂w
        until m = M
        w ← w + ∆w
    until J(w) < Th
    return w
end
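The pseudocode above can be sketched as follows on a toy least-squares problem; the data, learning rate and threshold Th are illustrative choices:

```python
import numpy as np

# Batch gradient descent: accumulate Delta w over all M patterns,
# then apply a single update per epoch, stopping once J(w) < Th.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # M = 20 patterns
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                          # noiseless targets

w = np.zeros(3)
eta, Th = 0.02, 1e-8
for r in range(10_000):                 # epochs
    dw = np.zeros_like(w)               # Delta w <- 0
    for m in range(len(X)):             # loop over all patterns
        grad = (X[m] @ w - y[m]) * X[m] # dJ_m/dw for pattern m
        dw -= eta * grad                # Delta w <- Delta w - eta * dJ/dw
    w = w + dw                          # w <- w + Delta w
    if 0.5 * np.sum((X @ w - y) ** 2) < Th:   # until J(w) < Th
        break
```
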
Stochastic Gradient Descent
begin initialize w, Th, η, m ← 0, r ← 0
    do r ← r + 1 (increment epoch)
        m ← 0
        do m ← m + 1
            x_m ← randomly selected pattern
            w ← w − η ∂J/∂w
        until m = M
    until J(w) < Th
    return w
end
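A sketch of the stochastic variant on the same style of toy problem (data and learning rate again illustrative); note that the weights are updated after every randomly selected pattern rather than once per epoch:

```python
import numpy as np

# Stochastic gradient descent: pick a random pattern, update immediately.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))            # M = 20 patterns
w_true = np.array([0.5, 1.5, -1.0])
y = X @ w_true                          # noiseless targets

w = np.zeros(3)
eta = 0.05
for r in range(500):                    # epochs
    for _ in range(len(X)):
        m = rng.integers(len(X))        # x_m <- randomly selected pattern
        grad = (X[m] @ w - y[m]) * X[m] # dJ_m/dw for that pattern
        w = w - eta * grad              # w <- w - eta * dJ/dw
    if 0.5 * np.sum((X @ w - y) ** 2) < 1e-8:   # until J(w) < Th
        break
```
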
Backpropagation: The Forward Pass
We need to compute the multilayer perceptron outputs y_k in order to compute the cost function J(w). For a 3-layer network (input, hidden and output layers):

a^(1) = f(w^(1) x + b^(1))   (2)
ŷ = a^(2) = f(w^(2) a^(1) + b^(2))   (3)
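Equations (2) and (3) translate directly into code; the layer sizes, random weights and the sigmoid choice of f below are illustrative:

```python
import numpy as np

# Forward pass of the 3-layer network in equations (2)-(3).
def f(z):                                # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=2)                           # input vector
W1, b1 = rng.normal(size=(4, 2)), np.zeros(4)    # hidden-layer parameters
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # output-layer parameters

a1 = f(W1 @ x + b1)        # equation (2)
y_hat = f(W2 @ a1 + b2)    # equation (3)
```
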
Backpropagation: The Backwards Pass
In order to update the weights, we need to compute the gradient of the cost function with respect to each of the weights. Let us consider the quadratic cost function as follows:
J(w) = (1/2) Σ_{k=1}^{M} (ŷ_k − y_k)²   (4)

Note, we are considering the case of ŷ = a^(2).
To compute the weight updates, we compute the derivative of the cost function with respect to each weight. The derivative of J with respect to the weights at the output layer can be computed using the chain rule as follows:

∂J/∂w^(2)_{kj} = (∂J/∂a^(2)_k) · (∂a^(2)_k/∂z^(2)_k) · (∂z^(2)_k/∂w^(2)_{kj})   (5)
Backpropagation: The Backwards Pass
Let us assume the quadratic cost function and the sigmoid activation function.
∂J/∂a^(2)_k = (a^(2)_k − y_k)   (6)

If we assume a sigmoid activation function, then a^(2)_k = 1/(1 + e^(−z^(2)_k)), so

∂a^(2)_k/∂z^(2)_k = a^(2)_k (1 − a^(2)_k)   (7)

Since z^(2)_k = Σ_j w^(2)_{kj} a^(1)_j, we have

∂z^(2)_k/∂w^(2)_{kj} = a^(1)_j   (8)

Combining equations (6), (7) and (8):

∂J/∂w^(2)_{kj} = (a^(2)_k − y_k) a^(2)_k (1 − a^(2)_k) a^(1)_j   (9)
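Equation (9) can be checked numerically against a finite-difference estimate of the gradient; the sizes and random values below are illustrative:

```python
import numpy as np

# Check equation (9): analytic gradient of the quadratic cost w.r.t.
# an output weight vs. a central finite-difference estimate.
def f(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w2, a1, y):
    a2 = f(w2 @ a1)
    return 0.5 * np.sum((a2 - y) ** 2)

rng = np.random.default_rng(0)
a1 = rng.normal(size=4)          # hidden activations (held fixed)
w2 = rng.normal(size=(3, 4))     # output-layer weights
y = rng.uniform(size=3)          # targets

a2 = f(w2 @ a1)
# equation (9): dJ/dw2[k, j] = (a2_k - y_k) a2_k (1 - a2_k) a1_j
grad = np.outer((a2 - y) * a2 * (1 - a2), a1)

eps = 1e-6                       # finite-difference check of one entry
w2p = w2.copy(); w2p[1, 2] += eps
w2m = w2.copy(); w2m[1, 2] -= eps
fd = (cost(w2p, a1, y) - cost(w2m, a1, y)) / (2 * eps)
```
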
Backpropagation: The Backwards Pass
To compute the weight updates with respect to the input layer:
∂J/∂w^(1)_{ji} = (∂J/∂a^(1)_j) · (∂a^(1)_j/∂z^(1)_j) · (∂z^(1)_j/∂w^(1)_{ji})   (10)
Backpropagation: The Backwards Pass
Again, we need the derivative of the neuron's output with respect to its input, through the sigmoid activation function. Since a^(1)_j = 1/(1 + e^(−z^(1)_j)),

∂a^(1)_j/∂z^(1)_j = a^(1)_j (1 − a^(1)_j)   (11)

Since z^(1)_j = Σ_i w^(1)_{ji} x_i,

∂z^(1)_j/∂w^(1)_{ji} = x_i   (12)
Backpropagation: The Backwards Pass
The remaining partial derivative, ∂J/∂a^(1)_j, sums over all of the output neurons, since a^(1)_j feeds into every output unit:

∂J/∂a^(1)_j = Σ_{k=1}^{M} (∂J/∂a^(2)_k) · (∂a^(2)_k/∂z^(2)_k) · (∂z^(2)_k/∂a^(1)_j)   (13)

where ∂z^(2)_k/∂a^(1)_j = w^(2)_{kj}, since z^(2)_k = Σ_j w^(2)_{kj} a^(1)_j; this time, however, we take the derivative with respect to a^(1)_j instead of w^(2)_{kj}.
Backpropagation in General
We define the error of neuron j in layer l by

δ^l_j = ∂J/∂z^l_j   (14)

Then

δ^L = ∇_a J ⊙ σ′(z^L)   (15)

For the case of a MSE cost function: δ^L = (a^L − y) ⊙ σ′(z^L)
Backpropagation in General
δ^L = ∇_a J ⊙ σ′(z^L)   (16)

δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ′(z^l)   (17)

∂J/∂w^l_{jk} = a^(l−1)_k δ^l_j   (18)

∂J/∂b^l_j = δ^l_j   (19)
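Equations (16) to (19) give the complete algorithm. A compact sketch for a two-layer sigmoid network with MSE cost, verified against a finite-difference gradient (all sizes and values illustrative):

```python
import numpy as np

def sigma(z):                                # sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):                          # its derivative
    s = sigma(z)
    return s * (1 - s)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = rng.uniform(size=2)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

# forward pass
z1 = W1 @ x + b1; a1 = sigma(z1)
z2 = W2 @ a1 + b2; a2 = sigma(z2)

# backward pass
delta2 = (a2 - y) * sigma_prime(z2)          # eq. (16), MSE case
delta1 = (W2.T @ delta2) * sigma_prime(z1)   # eq. (17)
dW2 = np.outer(delta2, a1)                   # eq. (18)
dW1 = np.outer(delta1, x)                    # eq. (18)
db2, db1 = delta2, delta1                    # eq. (19)

# finite-difference check on one hidden-layer weight
def cost(W1_):
    a1_ = sigma(W1_ @ x + b1)
    a2_ = sigma(W2 @ a1_ + b2)
    return 0.5 * np.sum((a2_ - y) ** 2)

eps = 1e-6
W1p = W1.copy(); W1p[2, 1] += eps
W1m = W1.copy(); W1m[2, 1] -= eps
fd = (cost(W1p) - cost(W1m)) / (2 * eps)
```
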
Gradient Descent with Momentum
Allows the network to learn more quickly than standard gradient descent
Learning with momentum reduces the variation in overall gradient directions to speed learning
Learning with momentum is given by
w(m + 1) = w(m) + (1 − α) ∆w_bp(m) + α ∆w(m − 1)

where ∆w(m) = w(m) − w(m − 1) and ∆w_bp(m) is the change in weight given by the backpropagation algorithm.
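The momentum update above can be sketched as follows; the cost J(w) = w², α and η are illustrative choices:

```python
# Gradient descent with momentum on the illustrative cost J(w) = w^2,
# whose gradient is dJ/dw = 2w.
def grad_J(w):
    return 2.0 * w

w, w_prev = 5.0, 5.0
eta, alpha = 0.1, 0.5
for _ in range(200):
    dw_bp = -eta * grad_J(w)      # Delta w_bp(m): the backprop step
    dw_prev = w - w_prev          # Delta w(m - 1) = w(m) - w(m - 1)
    w_next = w + (1 - alpha) * dw_bp + alpha * dw_prev
    w_prev, w = w, w_next
```
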
Practical Considerations – Initializing Weights
All of the weights should be randomly initialized to a small random number, close to zero but not identically zero.
If they’re all set to zero, they will all undergo the exact same parameter updates during backprop
There will be no source of asymmetry if the weights are all initialized to be the same
It is common to initialize all of the biases to zero or a small number such as 0.01.
Calibrating the variances by scaling each weight by 1/√n, where n is the number of inputs to the neuron, ensures that all neurons in the network initially have approximately the same output distribution, and empirically improves the rate of convergence.
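The effect of the 1/√n calibration is easy to see empirically; the layer sizes below are illustrative:

```python
import numpy as np

# With std-1 weights the pre-activation scale grows like sqrt(n);
# dividing the weights by sqrt(n) keeps it near 1 for any fan-in n.
rng = np.random.default_rng(0)
n = 1000                              # fan-in of each neuron
x = rng.normal(size=n)                # unit-variance inputs

W_raw = rng.normal(size=(500, n))     # 500 neurons, uncalibrated
W_cal = W_raw / np.sqrt(n)            # calibrated by 1/sqrt(n)

z_raw = W_raw @ x                     # scale ~ sqrt(n)
z_cal = W_cal @ x                     # scale ~ 1
```
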
Practical Considerations – Learning Rates
Plot the cost function J as a function of iterations (epochs)
J should decrease after every iteration on your training data!
If J is increasing then something is wrong, it is likely that the learning rate is too high
Standard test for convergence: ∆J < Th, where Th = 10⁻³
Note, it is difficult to choose a threshold; looking at the overall plot of the cost function vs. iterations is usually the most informative check.
L1 Regularization
In general
Cost function = Loss + Regularization Term
In L1 regularization, the absolute values of the weights are penalized, which can push weights all the way to zero. This is useful if we are trying to compress the model.
For L1:

γ Σ_{i=1}^{n} |w_i|   (20)
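A sketch of the zeroing effect, using a standard proximal (soft-thresholding) step for the L1 term on a toy problem with one informative and one irrelevant feature; the data, η and γ are illustrative choices, not values from the lecture:

```python
import numpy as np

# L1-regularized least squares: the irrelevant weight w[1] is pushed
# to exactly zero by the soft-threshold step.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2.0 * X[:, 0]                          # only feature 0 matters

w = np.zeros(2)
eta, gamma = 0.1, 0.5
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(X)      # gradient of the loss term
    w = w - eta * grad                     # plain gradient step
    # proximal step for gamma * sum_i |w_i| (soft-thresholding)
    w = np.sign(w) * np.maximum(np.abs(w) - eta * gamma, 0.0)
```
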
L2 Regularization
L2 Regularization is also known as weight decay as it forces the weights to decay towards zero.
For L2:
γ Σ_{i=1}^{n} w_i²   (21)
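The "weight decay" name follows directly from the gradient of the penalty: each update multiplies the weights by a factor slightly less than one. A minimal sketch (η and γ illustrative):

```python
import numpy as np

# Gradient of gamma * sum_i w_i^2 is 2 * gamma * w, so a gradient step
# with a zero data-loss gradient shrinks w by the factor (1 - 2*eta*gamma).
eta, gamma = 0.1, 0.05
w = np.array([1.0, -2.0, 0.5])
grad_loss = np.zeros(3)                    # pretend the data term is flat

w_new = w - eta * (grad_loss + 2 * gamma * w)
decayed = (1 - 2 * eta * gamma) * w        # same update, seen as decay
```
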