
Foundations of Machine Learning Neural Networks
Kate Farrahi
ECS Southampton
December 8, 2020

Gradient Descent
repeat until convergence:
w ← w − η ∂J/∂w   (1)
where w is a multidimensional vector representing all of the weights in the model and η is the learning rate.
In order to get gradient descent working in practice, we need to compute ∂J/∂w. For neural networks, there are two stages to this computation: (1) the forward pass and (2) the backwards pass.
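As a concrete illustration, here is a minimal Python sketch of update rule (1) applied to a toy objective of my own choosing (not from the lecture): J(w) = ½‖w‖², whose gradient is simply w.

```python
import numpy as np

def gradient_descent(w, grad_J, eta=0.1, max_iters=1000, tol=1e-6):
    """Repeat w <- w - eta * dJ/dw until the update becomes negligibly small."""
    for _ in range(max_iters):
        step = eta * grad_J(w)
        w = w - step
        if np.linalg.norm(step) < tol:   # crude convergence test on the step size
            break
    return w

# Toy objective J(w) = 0.5 * ||w||^2, so dJ/dw = w; the minimiser is w = 0.
w_star = gradient_descent(np.array([3.0, -2.0]), grad_J=lambda w: w)
print(w_star)   # close to [0, 0]
```

For a neural network, grad_J would instead be computed by the forward and backwards passes described below.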

Batch Gradient Descent

begin initialize w, Th, η, m ← 0, r ← 0
   do r ← r + 1 (increment epoch)
      m ← 0; ∆w ← 0
      do m ← m + 1
         x_m ← selected pattern
         ∆w ← ∆w − η ∂J/∂w
      until m = M
      w ← w + ∆w
   until J(w) < Th
   return w
end
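A minimal numpy sketch of this epoch loop follows. The helpers grad_J(w, x) (the per-pattern gradient) and J(w, X) (the total cost) are assumed to exist; they are not defined in the slides.

```python
import numpy as np

def batch_gradient_descent(w, X, grad_J, J, eta=0.01, Th=1e-3, max_epochs=1000):
    """Accumulate the gradient over all M patterns, then update once per epoch."""
    for r in range(max_epochs):            # r counts epochs
        delta_w = np.zeros_like(w)
        for x_m in X:                       # m = 1, ..., M
            delta_w -= eta * grad_J(w, x_m)
        w = w + delta_w                     # single update at the end of the epoch
        if J(w, X) < Th:                    # convergence test from the pseudocode
            break
    return w
```

Stochastic gradient descent (next slide) differs only in that the update w ← w − η ∂J/∂w is applied immediately after each randomly selected pattern.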
Stochastic Gradient Descent

begin initialize w, Th, η, m ← 0, r ← 0
   do r ← r + 1 (increment epoch)
      m ← 0
      do m ← m + 1
         x_m ← selected pattern (randomly)
         w ← w − η ∂J/∂w
      until m = M
   until J(w) < Th
   return w
end

Backpropagation: The Forward Pass

We need to compute the multilayer perceptron outputs y_k in order to compute the cost function J(w). For a 3-layer network:

a^(1) = f(w^(1) x + b^(1))   (2)
ŷ = a^(2) = f(w^(2) a^(1) + b^(2))   (3)

Backpropagation: The Backwards Pass

In order to update the weights, we need to compute the gradient of the cost function with respect to each of the weights. Let us consider the quadratic cost function:

J(w) = (1/2) Σ_{k=1}^{M} (ŷ_k − y_k)²   (4)

Note, we are considering the case ŷ = a^(2). To compute the weight updates, we compute the derivative of the cost function with respect to each weight. The derivative of J with respect to the weights at the output layer can be computed by the chain rule:

∂J/∂w^(2)_kj = (∂J/∂a^(2)_k) (∂a^(2)_k/∂z^(2)_k) (∂z^(2)_k/∂w^(2)_kj)   (5)

Let us assume the quadratic cost function and the sigmoid activation function. Then

∂J/∂a^(2)_k = (a^(2)_k − y_k)   (6)

If we assume a sigmoid activation function, then a^(2)_k = 1 / (1 + e^(−z^(2)_k)) and

∂a^(2)_k/∂z^(2)_k = a^(2)_k (1 − a^(2)_k)   (7)

Since z^(2)_k = Σ_j w^(2)_kj a^(1)_j,

∂z^(2)_k/∂w^(2)_kj = a^(1)_j   (8)

Therefore,

∂J/∂w^(2)_kj = (a^(2)_k − y_k) a^(2)_k (1 − a^(2)_k) a^(1)_j   (9)

To compute the weight updates with respect to the input-layer weights:

∂J/∂w^(1)_ji = (∂J/∂a^(1)_j) (∂a^(1)_j/∂z^(1)_j) (∂z^(1)_j/∂w^(1)_ji)   (10)

Again, the derivative of the neuron's output with respect to its input follows from the sigmoid activation function: since a^(1)_j = 1 / (1 + e^(−z^(1)_j)),

∂a^(1)_j/∂z^(1)_j = a^(1)_j (1 − a^(1)_j)   (11)

Since z^(1)_j = Σ_i w^(1)_ji x_i,

∂z^(1)_j/∂w^(1)_ji = x_i   (12)

Finally, we can work out the remaining partial derivative, ∂J/∂a^(1)_j, which sums the contributions of all M output units:

∂J/∂a^(1)_j = Σ_{k=1}^{M} (∂J/∂a^(2)_k) (∂a^(2)_k/∂z^(2)_k) (∂z^(2)_k/∂a^(1)_j)   (13)

where ∂z^(2)_k/∂a^(1)_j = w^(2)_kj, since z^(2)_k = Σ_j w^(2)_kj a^(1)_j as before; this time, however, we take the derivative with respect to a^(1)_j instead of w^(2)_kj.
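The derivation above maps almost line-for-line onto code. Below is a minimal numpy sketch of the forward pass (2)–(3) and the weight gradients (9) and (10)–(13) for a single training pair, assuming sigmoid activations, the quadratic cost, and column-vector conventions; the variable names (W1, b1, W2, b2) and shapes are my own choices, not the lecture's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    """One forward and backwards pass for a single (x, y) pair.

    Assumed shapes: x (d,1), y (M,1), W1 (h,d), b1 (h,1), W2 (M,h), b2 (M,1).
    Returns the gradients of J = 0.5 * sum((a2 - y)^2) w.r.t. W1 and W2.
    """
    # Forward pass, equations (2)-(3)
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)                    # a2 = y_hat

    # Output layer, equation (9): (a2_k - y_k) * a2_k * (1 - a2_k) * a1_j
    delta2 = (a2 - y) * a2 * (1 - a2)
    dJ_dW2 = delta2 @ a1.T

    # Hidden layer, equations (10)-(13)
    dJ_da1 = W2.T @ delta2              # equation (13): sum over the M output units
    delta1 = dJ_da1 * a1 * (1 - a1)     # sigmoid derivative, equation (11)
    dJ_dW1 = delta1 @ x.T               # multiply by the input x_i, equation (12)
    return dJ_dW1, dJ_dW2
```

The bias gradients are omitted here; they correspond to equation (19) on the next slide.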
Backpropagation in General

We define the error of neuron j in layer l by

δ^l_j = ∂J/∂z^l_j   (14)

Then

δ^L = ∇_a J ⊙ σ′(z^L)   (15)

For the case of a MSE cost function:

δ^L = (a^L − y) ⊙ σ′(z^L)

Backpropagation in General

δ^L = ∇_a J ⊙ σ′(z^L)   (16)

δ^l = ((w^(l+1))^T δ^(l+1)) ⊙ σ′(z^l)   (17)

∂J/∂w^l_jk = a^(l−1)_k δ^l_j   (18)

∂J/∂b^l_j = δ^l_j   (19)

Gradient Descent with Momentum

- Allows the network to learn more quickly than standard gradient descent.
- Learning with momentum reduces the variation in overall gradient directions to speed learning.
- Learning with momentum is given by

  w(m + 1) = w(m) + (1 − α) ∆w_bp(m) + α ∆w(m − 1)
  ∆w(m) = w(m) − w(m − 1)

  where ∆w_bp(m) is the change in weight given by the backpropagation algorithm.

Practical Considerations – Initializing Weights

- All of the weights should be randomly initialized to a small random number, close to zero but not identically zero.
- If they're all set to zero, they will all undergo the exact same parameter updates during backprop.
- There will be no source of asymmetry if the weights are all initialized to be the same.
- It is common to initialize all of the biases to zero or a small number such as 0.01.
- Calibrating the variances to 1/√n ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence.

Practical Considerations – Learning Rates

- Plot the cost function J as a function of iterations (epochs).
- J should decrease after every iteration on your training data!
- If J is increasing then something is wrong; it is likely that the learning rate is too high.
- A standard test for convergence is ∆J < Th, where Th = 10⁻³.
- Note, it is difficult to choose a threshold. Looking at the overall plot of the cost function vs. iterations on the data is always most informative.

L1 Regularization

In general,

Cost function = Loss + Regularization Term

In L1 regularization, the absolute values of the weights are penalized, which can push the weights to reduce to zero. This is useful if we are trying to compress the model. For L1, the regularization term is

γ Σ_{i=1}^{n} ||w_i||   (20)

L2 Regularization

L2 regularization is also known as weight decay, as it forces the weights to decay towards zero. For L2, the regularization term is

γ Σ_{i=1}^{n} ||w_i||²   (21)
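To make the effect of these penalty terms concrete, here is a small sketch of how they enter a single gradient step. The coefficient name gamma follows (20)–(21); treating the L1 gradient as sign(w) (a subgradient at zero, applied per weight) is a common convention, not something stated in the slides.

```python
import numpy as np

def regularized_step(w, grad_loss, eta=0.01, gamma=1e-4, penalty="l2"):
    """One gradient-descent step on (Loss + regularization term).

    grad_loss is the gradient of the data loss alone (e.g. from backpropagation).
    L2, equation (21): adds 2*gamma*w to the gradient, i.e. weight decay.
    L1, equation (20): adds gamma*sign(w), pushing weights towards exactly zero.
    """
    if penalty == "l2":
        grad = grad_loss + 2.0 * gamma * w
    elif penalty == "l1":
        grad = grad_loss + gamma * np.sign(w)   # subgradient convention at w = 0
    else:
        grad = grad_loss
    return w - eta * grad
```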