Tutorial Questions | Week 4
COSC2779 – Deep Learning
This tutorial is aimed at reviewing optimization techniques and regularization in deep learning.
Please try the questions before you join the session.
1. Calculate the gradient update for the neural network below with the given initialized weights, if the input data point is x = [1, 2] and the expected output is 3. The loss is mean squared error, the hidden units have ReLU activation, the output unit has linear activation, and the learning rate is 0.1. Initial biases are all zero.
[Network diagram: inputs x1 = 1 and x2 = 2 feed three ReLU hidden units, which feed a linear output ŷ. Hidden-layer weights: w11 = 2, w12 = 1, w21 = 1, w22 = −2, w31 = 1, w32 = 2. Output weights: w1 = −1, w2 = 3, w3 = 2.]
Solution:
We can compute the derivatives using the backpropagation rule. Forward pass: the hidden pre-activations are z1 = 4, z2 = −3, z3 = 5, so after ReLU h = [4, 0, 5], and ŷ = (−1)(4) + (3)(0) + (2)(5) = 6. With mean squared error, ∂L/∂ŷ = 2(ŷ − y) = 6.
Output neuron: ∂L/∂w1 = 24; ∂L/∂w2 = 0; ∂L/∂w3 = 30; ∂L/∂bo = 6.
Hidden neuron 1: ∂L/∂w11 = −6; ∂L/∂w12 = −12; ∂L/∂b1 = −6.
Hidden neuron 2: ∂L/∂w21 = 0; ∂L/∂w22 = 0; ∂L/∂b2 = 0.
Hidden neuron 3: ∂L/∂w31 = 12; ∂L/∂w32 = 24; ∂L/∂b3 = 12.
Update each parameter by gradient descent: wj ← wj − 0.1 ∂L/∂wj.
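The forward and backward passes for Question 1 can be verified with a short NumPy sketch (the variable names W, v, dz and so on are ours, not from the question):

```python
import numpy as np

# Network from Question 1: two inputs, three ReLU hidden units, linear output.
x = np.array([1.0, 2.0])
y = 3.0
W = np.array([[2.0, 1.0],     # w11, w12
              [1.0, -2.0],    # w21, w22
              [1.0, 2.0]])    # w31, w32
v = np.array([-1.0, 3.0, 2.0])  # output weights w1, w2, w3; all biases zero

# Forward pass
z = W @ x                 # hidden pre-activations: [4, -3, 5]
h = np.maximum(z, 0.0)    # ReLU: [4, 0, 5]
y_hat = v @ h             # 6.0
loss = (y_hat - y) ** 2   # squared error for a single point: 9.0

# Backward pass
dy = 2.0 * (y_hat - y)    # dL/d(y_hat) = 6
dv = dy * h               # output-weight gradients: [24, 0, 30]
db_o = dy                 # output-bias gradient: 6
dz = dy * v * (z > 0)     # backprop through ReLU: [-6, 0, 12]
dW = np.outer(dz, x)      # hidden-weight gradients, one row per neuron
db = dz                   # hidden-bias gradients

# Gradient-descent update with learning rate 0.1
v_new = v - 0.1 * dv
W_new = W - 0.1 * dW
```

Running this reproduces every gradient in the solution above, e.g. `dW` is `[[-6, -12], [0, 0], [12, 24]]`.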
2. What problem(s) will result from using a learning rate that’s too
(a) high?
Solution: The cost function does not converge to an optimal solution and can even diverge. To detect this, look at the cost after each iteration (plot the cost vs. the number of iterations). If the cost oscillates wildly, the learning rate is too high. For batch gradient descent, if the cost increases, the learning rate is too high.
(b) low?
Solution: The cost function may not converge to an optimal solution, or will converge only after a very long time. To detect this, look at the cost after each iteration (plot the cost vs. the number of iterations). The cost decreases very slowly (almost linearly). You could also try higher learning rates to see whether performance improves.
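Both failure modes are easy to reproduce on a toy quadratic cost f(w) = w², whose gradient is 2w (our example, not from the sheet):

```python
# Gradient descent on f(w) = w^2, starting from w = 1.0.
# Each step: w <- w - lr * 2w, i.e. w is multiplied by (1 - 2*lr).
def descend(lr, steps=20):
    w = 1.0
    costs = []
    for _ in range(steps):
        w -= lr * 2 * w
        costs.append(w ** 2)
    return costs

high = descend(1.1)    # |1 - 2.2| = 1.2 > 1: the cost grows every step (diverges)
low = descend(0.001)   # cost shrinks by factor 0.998^2 per step: painfully slow
good = descend(0.1)    # cost shrinks by factor 0.8^2 per step: fast convergence
```

Plotting `high`, `low`, and `good` against the iteration number gives exactly the diagnostic curves described above.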
3. What is a saddle point? What is the advantage/disadvantage of Stochastic Gradient Descent in dealing
with saddle points?
Solution: Saddle point – a point where the gradient is zero, but which is neither a local minimum nor a local maximum. Also accepted – the gradient is zero and the function has a local maximum in one direction but a local minimum in another direction. SGD has noisier updates, and this noise can help it escape from a saddle point.
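A minimal illustration (our example): f(x, y) = x² − y² has a saddle at the origin. Exact full-batch gradient descent started there never moves, while adding gradient noise (standing in for minibatch sampling in SGD) kicks the iterate off the saddle:

```python
import random

# f(x, y) = x^2 - y^2: gradient (2x, -2y) is zero at the origin,
# which is a minimum along x but a maximum along y -- a saddle point.
def grad(x, y):
    return 2 * x, -2 * y

# Full-batch GD started exactly at the saddle stays stuck forever.
x, y = 0.0, 0.0
for _ in range(100):
    gx, gy = grad(x, y)
    x -= 0.1 * gx
    y -= 0.1 * gy
stuck = (x, y)  # still (0.0, 0.0)

# "SGD-like" noisy gradients: the noise perturbs y off zero, and the
# -y^2 direction then carries the iterate away from the saddle.
random.seed(0)
x, y = 0.0, 0.0
for _ in range(100):
    gx, gy = grad(x, y)
    x -= 0.1 * (gx + random.gauss(0, 0.01))
    y -= 0.1 * (gy + random.gauss(0, 0.01))
escaped = (x, y)  # |y| is now far from zero
```

The disadvantage is the flip side of the same noise: near a true minimum, SGD keeps bouncing around instead of settling exactly.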
4. One of the difficulties with the sigmoid activation function is that of saturated units. Briefly explain the problem, and whether switching to tanh fixes it.
Solution: The derivative of σ(z) is close to zero for large negative or positive z, so saturated units pass almost no gradient back during training.
No, switching to tanh does not fix the problem; the same issue persists in tanh(z). Both functions have a sigmoidal shape, and tanh is effectively a scaled and translated sigmoid: tanh(z) = 2σ(2z) − 1.
However, tanh activations are centered around zero, whereas sigmoid outputs are centered around 0.5. Centering the data/hidden activations can help optimization, with an effect similar to batch normalization without the division by the standard deviation.
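The identity tanh(z) = 2σ(2z) − 1 and the shared saturation behaviour are easy to check numerically:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# tanh is a scaled and translated sigmoid: tanh(z) = 2*sigmoid(2z) - 1.
for z in [-5.0, -1.0, 0.0, 0.5, 3.0]:
    assert abs(math.tanh(z) - (2 * sigmoid(2 * z) - 1)) < 1e-12

# Both saturate: their derivatives are tiny for large |z|.
dsig = sigmoid(5) * (1 - sigmoid(5))   # sigma'(5), about 0.0066
dtanh = 1 - math.tanh(5) ** 2          # tanh'(5), about 0.00018
```

Note that tanh'(5) is even smaller than σ'(5), so saturation is not cured, only the centering changes.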
5. What happens if we use batch norm after an affine layer with bias? What effect does increasing the bias have?
Solution:
z^(i) = W^(l) x^(i) + b^(l)    (affine transform)
μ_j = (1/m) Σ_i z_j^(i);   σ_j^2 = (1/m) Σ_i (z_j^(i) − μ_j)^2
ẑ_j^(i) = (z_j^(i) − μ_j) / √(σ_j^2 + ε)    (mean zero, variance 1)
z̃_j^(i) = γ ẑ_j^(i) + β    (γ, β learnable parameters)
Substituting the affine transform, and using mean(z) = mean(W^(l) x) + b^(l):
ẑ_j^(i) = (W^(l) x^(i) + b^(l) − mean(W^(l) x) − b^(l)) / √(σ_j^2 + ε)    (mean zero, variance 1)
The bias is removed by the normalization step of batch norm. Therefore increasing it has no effect.
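The cancellation can be demonstrated numerically: adding any constant bias to a layer's pre-activations leaves the batch-norm output unchanged (a sketch with names of our choosing):

```python
import numpy as np

# Normalize a minibatch of pre-activations per feature (no gamma/beta,
# which would be applied afterwards and are unaffected by the bias).
def batch_norm(Z, eps=1e-5):
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    return (Z - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # minibatch of 8 examples, 4 input features
W = rng.normal(size=(4, 3))   # affine layer with 3 output units

Z_no_bias = X @ W
Z_big_bias = X @ W + 100.0    # add a large constant bias to every unit

# The bias shifts the mean by exactly the same constant, so it cancels
# in (z - mean(z)): the normalized outputs are identical.
out1 = batch_norm(Z_no_bias)
out2 = batch_norm(Z_big_bias)
```

This is why frameworks typically disable the bias (e.g. `bias=False`) on layers that feed directly into batch norm.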
6. Explain why dropout in a neural network acts as a regularizer.
Solution: There are several explanations:
• Dropout is a form of model averaging. In particular, for a layer of H nodes, we sample from 2^H architectures, where an arbitrary subset of the nodes remains active. The learned weights are shared across all these models, which means that the various models regularize one another.
• Dropout helps prevent feature co-adaptation, which has a regularizing effect.
• Dropout adds noise to the learning process, and training with noise in general has a regularizing
effect.
• Dropout leads to more sparsity in the hidden units, which has a regularizing effect (“shrinking
the weights” or “spreading out the weights”).
7. Why do we often refer to L2-regularization as “weight decay”? Derive a mathematical expression to explain
your point.
Solution: In the case of L2 regularization (penalty (λ/2)‖w‖²), we can derive the following update rule for the weights:

w ← (1 − αλ)w − α ∂L/∂w

where α is the learning rate and λ is the regularization hyperparameter (αλ ≪ 1). This shows that at every iteration, w's value is multiplied by (1 − αλ), i.e. pushed ("decayed") closer to zero.

8. Explain what effect the following operations will generally have on the bias and variance of your model.
(a) Regularizing the weights
Solution: bias: increases; variance: decreases
(b) Increasing the width of the layers
Solution: bias: decreases; variance: increases
(c) Using dropout to train a deep neural network
Solution: bias: increases; variance: decreases
(d) Getting more training data
Solution: bias: no change; variance: decreases
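The weight-decay identity from Question 7 can be checked numerically on a toy loss (our example, not from the sheet): one gradient step on the penalized loss L(w) + (λ/2)w² equals the decay-form update.

```python
# Toy base loss L(w) = (w - 3)^2, so dL/dw = 2*(w - 3).
w, a, lam = 5.0, 0.1, 0.01   # weight, learning rate alpha, regularizer lambda

dL = 2 * (w - 3)

# One step of gradient descent on the L2-regularized loss...
step_regularized = w - a * (dL + lam * w)

# ...equals the "weight decay" form: shrink w by (1 - a*lam), then step.
step_decay = (1 - a * lam) * w - a * dL
```

Both expressions give the same new weight; the (1 − αλ) factor is the multiplicative "decay" toward zero that gives the technique its name.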