Regularizations
for Deep Models
Dr Chang Xu
School of Computer Science
What is regularization?
In general: any method to prevent overfitting or help the optimization.
Regression using polynomials:
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots + w_M x^M + \epsilon
Overfitting
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + \cdots + w_M x^M + \epsilon
Prevent overfitting
• A larger data set helps.
• Throwing away useless hypotheses also helps.
  – Classical regularization: principled ways to constrain the hypotheses.
  – Other types of regularization: data augmentation, early stopping, etc.
Regularization as hard constraint
Training objective:
\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) \quad \text{s.t. } R(\theta) \le r

Example: \ell_2 regularization
\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) \quad \text{s.t. } \|\theta\|_2^2 \le r
Regularization as soft constraint
The hard-constraint optimization is equivalent to the soft-constraint form
\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* R(\theta)

Example: \ell_2 regularization
\min_\theta \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2

for some hyperparameter \lambda^* > 0.
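
As a concrete illustration (not part of the slides), here is a minimal PyTorch sketch of the soft-constraint \ell_2 objective above; the model, data, and value of lambda are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)        # placeholder model
criterion = nn.MSELoss()        # the data-fitting loss l(theta, x_i, y_i)
lam = 1e-3                      # lambda*, the regularization strength

def regularized_loss(x, y):
    data_loss = criterion(model(x), y)
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())  # ||theta||_2^2
    return data_loss + lam * l2_penalty

x, y = torch.randn(32, 10), torch.randn(32, 1)
regularized_loss(x, y).backward()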
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: p(\theta)
• Posterior over the hypotheses: p(\theta \mid \{x_i, y_i\})
• Likelihood: p(\{x_i, y_i\} \mid \theta)
• Bayes' rule:
p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}
Regularization as Bayesian prior
• Bayes' rule:
p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}
• Maximum A Posteriori (MAP):
\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \big[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \big]
where \log p(\theta) plays the role of the regularization term and \log p(\{x_i, y_i\} \mid \theta) is the MLE loss.
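
As a short worked step (not on the slide), assuming a zero-mean Gaussian prior on the weights, the MAP objective recovers exactly the \ell_2 penalty:

% Assumption: p(\theta) = \mathcal{N}(0, \sigma^2 I), so the log-prior is quadratic.
\log p(\theta) = -\frac{1}{2\sigma^2} \|\theta\|_2^2 + \text{const}
\quad\Longrightarrow\quad
\max_\theta \log p(\theta \mid \{x_i, y_i\})
  = \max_\theta \Big[ \log p(\{x_i, y_i\} \mid \theta) - \frac{1}{2\sigma^2} \|\theta\|_2^2 \Big]
% i.e. maximum likelihood plus an l2 regularizer with strength 1/\sigma^2.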
Some examples
Weight Decay (Researchers’ view)
• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large
weights:
\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2

• Gradient of the regularized objective:
\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \lambda \theta

• Gradient descent update:
\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta \lambda \theta = (1 - \eta\lambda)\,\theta - \eta \nabla \hat{L}(\theta)
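
A minimal sketch of the decayed gradient step above in PyTorch; the loss and numerical values are placeholders, not from the slides:

import torch

eta, lam = 0.1, 1e-2                      # learning rate and weight-decay strength
theta = torch.randn(5, requires_grad=True)

loss = (theta ** 2).sum()                 # placeholder loss \hat{L}(\theta)
loss.backward()
with torch.no_grad():
    theta.mul_(1 - eta * lam)             # (1 - eta*lambda) * theta: the "decay"
    theta.sub_(eta * theta.grad)          # minus eta * gradient of \hat{L}
    theta.grad.zero_()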
Weight Decay (Engineers’ view)
• The L2 regularization technique for neural networks was worked
out by researchers in the 1990s.
• At the same time, engineers, working independently from researchers, noticed that if you simply shrink each weight a little on every training iteration, you get a trained model that is less likely to overfit.
\theta \leftarrow \theta - \eta \nabla \hat{L}(\theta)   // usual gradient update
\theta \leftarrow 0.98\,\theta                           // i.e. \theta = \theta - 0.02\,\theta
• The L2 approach has a solid underlying theory but is more complicated to implement. The weight-decay approach "just works" and is simple to implement.
https://pytorch.org/docs/stable/_modules/torch/optim/adam.html#Adam
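
In PyTorch this is exposed as the weight_decay argument of the optimizers (the Adam source linked above adds \lambda\theta to the gradient). A minimal usage sketch with a placeholder model:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3, weight_decay=1e-4)  # L2 / weight decay term
# torch.optim.AdamW implements the fully decoupled form of weight decay.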
Other types of regularization
• Robustness to noise
• Noise to the input
• Noise to the weights
• Data augmentation
• Early stopping
• Dropout
Multiple optimal solutions.
Add noise to the input
Too much noise causes data points to cross the decision boundary.
Equivalence to weight decay
• Suppose the hypothesis is f(x) = w^\top x and the noise is \epsilon \sim \mathcal{N}(0, \lambda I).
• After adding noise to the input, the loss is
L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f(x + \epsilon) - y)^2\big] = \mathbb{E}_{x,y,\epsilon}\big[(f(x) + w^\top \epsilon - y)^2\big]
     = \mathbb{E}_{x,y}\big[(f(x) - y)^2\big] + 2\,\mathbb{E}_{x,y,\epsilon}\big[w^\top \epsilon\,(f(x) - y)\big] + \mathbb{E}_{\epsilon}\big[(w^\top \epsilon)^2\big]
• Since \epsilon is zero-mean and independent of (x, y), the cross term vanishes and \mathbb{E}_{\epsilon}[(w^\top \epsilon)^2] = \lambda \|w\|^2, so
L(f) = \mathbb{E}_{x,y}\big[(f(x) - y)^2\big] + \lambda \|w\|^2
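
A minimal sketch of this input-noise training in PyTorch; the linear model, MSE loss, and noise variance mirror the assumptions above, and the data are placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1, bias=False)          # f(x) = w^T x
criterion = nn.MSELoss()
lam = 1e-2                                    # noise variance lambda

x, y = torch.randn(32, 10), torch.randn(32, 1)
eps = lam ** 0.5 * torch.randn_like(x)        # epsilon ~ N(0, lambda I)
loss = criterion(model(x + eps), y)           # in expectation: MSE + lambda * ||w||^2
loss.backward()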
Add noise to the weights
• For the loss on each data point, add a noise term to the weights before calculating the prediction:
\epsilon \sim \mathcal{N}(0, \lambda I), \qquad w' = w + \epsilon
• Prediction: f_{w'}(x) instead of f_w(x)
• The loss becomes
L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f_{w+\epsilon}(x) - y)^2\big]
Add noise to the weights
• The loss becomes
L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f_{w+\epsilon}(x) - y)^2\big]
• To simplify, use a Taylor expansion in the weights:
f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^\top \nabla_w f_w(x) + \frac{1}{2}\, \epsilon^\top \nabla_w^2 f_w(x)\, \epsilon
• Plugging in:
L(f) \approx \mathbb{E}_{x,y}\big[(f_w(x) - y)^2\big] + \lambda\, \mathbb{E}_{x,y}\big[\|\nabla_w f_w(x)\|^2\big] + O(\lambda^2)
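
A minimal sketch (placeholder model and data) of injecting Gaussian noise into the weights before each prediction, as in the derivation above:

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1, bias=False)
criterion = nn.MSELoss()
lam = 1e-2                                    # noise variance lambda

x, y = torch.randn(32, 10), torch.randn(32, 1)
w = model.weight
eps = lam ** 0.5 * torch.randn_like(w)        # epsilon ~ N(0, lambda I)
pred = F.linear(x, w + eps)                   # f_{w+eps}(x) instead of f_w(x)
criterion(pred, y).backward()                 # gradients still flow to w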
Data augmentation
Adding noise to the input is a special kind of data augmentation.
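
A minimal sketch of common image augmentations with torchvision; this is an assumed example, not tied to a particular dataset on the slides:

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                     # random crop + resize
    transforms.RandomHorizontalFlip(),                     # random mirror
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # photometric noise
    transforms.ToTensor(),
])
# e.g. torchvision.datasets.ImageFolder(root="path/to/train", transform=train_transform)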
The first very successful CNN on the ImageNet dataset.
Early stopping
• Idea: do not train the network until the training error becomes too small.
• During training, also monitor the validation error.
• Every time the validation error improves, store a copy of the weights.
• When the validation error has not improved for some time, stop.
• Return the stored copy of the weights.
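
A minimal sketch of this procedure; the linear model, random data, and training budget are placeholders:

import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x_tr, y_tr = torch.randn(64, 10), torch.randn(64, 1)
x_val, y_val = torch.randn(32, 10), torch.randn(32, 1)

max_epochs, patience = 200, 10
best_val, wait = float("inf"), 0
best_weights = copy.deepcopy(model.state_dict())

for epoch in range(max_epochs):
    optimizer.zero_grad()
    criterion(model(x_tr), y_tr).backward()      # one training step
    optimizer.step()
    with torch.no_grad():
        val_err = criterion(model(x_val), y_val).item()
    if val_err < best_val:                       # validation error improved
        best_val, wait = val_err, 0
        best_weights = copy.deepcopy(model.state_dict())  # store a copy
    else:
        wait += 1
        if wait >= patience:                     # no improvement for a while: stop
            break

model.load_state_dict(best_weights)              # return the stored copy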
Early stopping
• Hyperparameter selection: the number of training steps is the hyperparameter.
• Advantages
  – Efficient: runs alongside training; only an extra copy of the weights is stored.
  – Simple: no change to the model or the algorithm.
• Disadvantage: needs validation data.
Early stopping
• Strategy to remove the disadvantage
  – After early stopping in the first run, train a second run and reuse the validation data.
• How to reuse the validation data
  1. Start fresh and train with both the training data and the validation data, up to the number of epochs found in the first run.
  2. Start from the weights of the first run and train with both the training data and the validation data until the validation loss falls below the training loss at the early stopping point.
Early stopping as a regularizer
Dropout
Dropout in PyTorch
https://github.com/pytorch/pytorch/blob/10c4b98ade8349d841518d22f19a653a939e260c/torch/nn/modules/dropout.py#L36
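
A minimal usage sketch of torch.nn.Dropout (see the linked source). Note that PyTorch's p is the probability of zeroing a unit, i.e. 1 minus the retain probability used on these slides, and PyTorch implements the inverted variant, so no rescaling is needed at test time:

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # p = probability of dropping a unit
    nn.Linear(256, 10),
)

net.train()                   # dropout active: survivors scaled by 1/(1-p)
out_train = net(torch.randn(1, 784))
net.eval()                    # dropout disabled at test time, no scaling
out_test = net(torch.randn(1, 784))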
Dropout
Dropout is a simple but very effective technique that alleviates overfitting during training.
The inspiration for dropout (Hinton et al., 2013) came from the role of sex in evolution.
• Genes work well with other small random sets of genes.
• Similarly, dropout suggests that each unit should work with a random sample of other units.
Dropout
• At training (each iteration):
  Each unit is retained with probability p.
• At test:
  The network is used as a whole.
  The weights are scaled down by a factor of p (e.g. 0.5), so each unit's expected output matches its expectation under dropout at training time.
Inverted Dropout
• At training (each iteration):
  Each unit is retained with probability p.
  The retained activations are scaled up by a factor of 1/p.
• At test:
  The network is used as a whole.
  No scaling is applied.
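
A minimal from-scratch sketch of inverted dropout, with p the retain probability as on the slides:

import torch

def inverted_dropout(h, p=0.5, training=True):
    # Zero each unit with probability 1-p and scale survivors by 1/p at training time.
    if not training:
        return h                              # no scaling at test time
    mask = (torch.rand_like(h) < p).float()   # r ~ Bernoulli(p)
    return h * mask / p                       # scale retained activations by 1/p

h = torch.randn(4, 8)
h_train = inverted_dropout(h, p=0.5, training=True)
h_test = inverted_dropout(h, training=False)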
Dropout
In practice, dropout trains an ensemble of 2^n networks (n is the number of units).
Dropout is a technique that deals with overfitting by combining the predictions of many different large neural nets at test time:
y_{\text{test}} = \sum_{m} p(m)\, f_{m \odot w}(x_{\text{test}}) \approx f_{\left(\sum_{m} p(m)\, m\right) \odot w}(x_{\text{test}}) = f_{p\, w}(x_{\text{test}})
where m ranges over binary masks on the units and p(m) is the probability of sampling mask m.
Dropout
The feed-forward operation of a
standard neural network is
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}
y_i^{(l+1)} = f\big(z_i^{(l+1)}\big)
Dropout
With dropout:
r_j^{(l)} \sim \text{Bernoulli}(p)
\tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \odot \mathbf{y}^{(l)}
z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}
y_i^{(l+1)} = f\big(z_i^{(l+1)}\big)
Weight Decay
• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large
weights:
\mathcal{L}_{\text{new}}(\mathbf{w}) = \mathcal{L}_{\text{old}}(\mathbf{w}) + \frac{\lambda}{2} \|\mathbf{w}\|_2^2

Dropout has advantages over weight decay (Helmbold et al., 2016):
• Dropout is scale-free: dropout does not penalize the use of large weights when they are needed.
• Dropout is invariant to parameter scaling: dropout is unaffected if the weights in one layer are scaled up by a constant c and the weights in another layer are scaled down by the same constant c.
DropConnect
• DropConnect (Wan et al., 2013) generalizes dropout.
• Randomly drop connections (individual weights) in the network with probability 1 − p.
• As in dropout, p = 0.5 usually gives the best results.
DropConnect
A single neuron before the activation function:
u = (M \odot W)\, v, \qquad M_{ij} \sim \text{Bernoulli}(p)
\mathbb{E}[u] = p\, W v
\mathrm{Var}(u) = p(1-p)\, (W \odot W)(v \odot v)
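
A minimal sketch (an assumed example, not the paper's exact implementation) of a DropConnect-style linear layer in PyTorch; at test time it simply uses the expected weight pW rather than the Gaussian moment-matching above:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DropConnectLinear(nn.Linear):
    def __init__(self, in_features, out_features, p=0.5):
        super().__init__(in_features, out_features)
        self.p = p                                                  # retain probability

    def forward(self, x):
        if self.training:
            mask = (torch.rand_like(self.weight) < self.p).float()  # M ~ Bernoulli(p)
            return F.linear(x, self.weight * mask, self.bias)       # (M ⊙ W) v
        return F.linear(x, self.weight * self.p, self.bias)         # E[u] = p W v

layer = DropConnectLinear(10, 4, p=0.5)
out = layer(torch.randn(2, 10))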