Regularizations for Deep Models
Dr Chang Xu
School of Computer Science
The University of Sydney
What is regularization?
In general: any method to prevent overfitting or help the optimization.
Regression using polynomials:
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_M x^M + \epsilon$$
Overfitting
$$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_M x^M + \epsilon$$
Prevent overfitting
• A larger data set helps.
• Throwing away useless hypotheses also helps:
  – Classical regularization: some principal ways to constrain hypotheses.
  – Other types of regularization: data augmentation, early stopping, etc.
Regularization as hard constraint
Training objective:
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i) \qquad \text{s.t. } R(\theta) \le r$$

Example: $\ell_2$ regularization
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i) \qquad \text{s.t. } \|\theta\|_2^2 \le r^2$$
Regularization as soft constraint
The hard-constraint optimization is equivalent to a soft-constraint version
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* R(\theta)$$
for some hyperparameter $\lambda^* > 0$.

Example: $\ell_2$ regularization
$$\min_\theta \; \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$
Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
Regularization as Bayesian prior
• Bayes' rule:
$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$
• Maximum A Posteriori (MAP):
$$\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \Big[\underbrace{\log p(\theta)}_{\text{regularization}} + \underbrace{\log p(\{x_i, y_i\} \mid \theta)}_{\text{MLE loss}}\Big]$$
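As a concrete instance (a standard example; the Gaussian form of the prior is my assumption, not spelled out on this slide): with a zero-mean Gaussian prior on the weights,
$$p(\theta) \propto \exp\!\Big(-\tfrac{\lambda}{2}\|\theta\|_2^2\Big)
\;\;\Rightarrow\;\;
-\log p(\theta) = \tfrac{\lambda}{2}\|\theta\|_2^2 + \text{const},$$
so maximizing the posterior amounts to minimizing the MLE loss plus an $\ell_2$ penalty, i.e. weight decay.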
Some examples
Weight Decay (Researchers’ view)
• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large
weights:
$$\min_\theta \; \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\lambda}{2}\|\theta\|_2^2$$
• Gradient of the regularized objective: $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \lambda\theta$
• Gradient descent update:
$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta\lambda\theta = (1 - \eta\lambda)\,\theta - \eta \nabla \hat{L}(\theta)$$
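A minimal numerical check of this equivalence (my own sketch; the least-squares loss and all variable names are illustrative assumptions):

```python
import numpy as np

def grad_loss(theta, X, y):
    # gradient of the unregularized loss L(theta) = (1/n) * ||X @ theta - y||^2
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ theta - y)

def step_l2_penalty(theta, X, y, lam, eta):
    # gradient of L_R(theta) = L(theta) + (lam/2)*||theta||^2 is grad L(theta) + lam*theta
    return theta - eta * (grad_loss(theta, X, y) + lam * theta)

def step_weight_decay(theta, X, y, lam, eta):
    # the same update rearranged: shrink by (1 - eta*lam), then take a plain gradient step
    return (1.0 - eta * lam) * theta - eta * grad_loss(theta, X, y)

rng = np.random.default_rng(0)
X, y, theta = rng.normal(size=(32, 5)), rng.normal(size=32), rng.normal(size=5)
# both forms produce the same next iterate (up to floating-point error)
assert np.allclose(step_l2_penalty(theta, X, y, 0.01, 0.1),
                   step_weight_decay(theta, X, y, 0.01, 0.1))
```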
Weight Decay (Engineers' view)
• The L2 regularization technique for neural networks was worked out by researchers in the 1990s.
• At the same time, engineers, working independently from the researchers, noticed that if you simply decrease the value of each weight a little on each training iteration, you get an improved trained model that is not as likely to be overfitted:
  $\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta)$   // the L2-regularized update
  $\theta \leftarrow 0.98\,\theta$   // or equivalently $\theta = \theta - 0.02\,\theta$
• The L2 approach has a solid underlying theory but is more complicated to implement; the weight decay approach "just works" and is simple to implement.
https://pytorch.org/docs/stable/_modules/torch/optim/adam.html#Adam
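A short usage sketch of the linked optimizer (the learning rate and decay values below are placeholders): in torch.optim.Adam the weight_decay argument adds $\lambda\theta$ to the gradient before the update (the L2 view), while torch.optim.AdamW applies decoupled weight decay as a separate shrinkage step (closer to the engineers' view).

```python
import torch

model = torch.nn.Linear(10, 1)

# L2-style: weight_decay adds lam * param to the gradient before the Adam update
opt_l2 = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# decoupled weight decay: the weights are shrunk directly, separately from the gradient
opt_decoupled = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```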
Other types of regularizations
• Robustness to noise
  – Noise to the input
  – Noise to the weights
• Data augmentation
• Early stopping
• Dropout
Multiple optimal solutions.
Add noise to the input
Too much noise can push data points across the decision boundary.
Equivalence to weight decay
• Suppose the hypothesis is $f(x) = w^\top x$ and the noise is $\epsilon \sim \mathcal{N}(0, \lambda I)$.
• After adding noise to the input, the loss is
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f(x+\epsilon) - y)^2\big] = \mathbb{E}_{x,y,\epsilon}\big[(f(x) + w^\top\epsilon - y)^2\big]$$
$$= \mathbb{E}_{x,y}\big[(f(x) - y)^2\big] + 2\,\mathbb{E}_{x,y,\epsilon}\big[w^\top\epsilon\,(f(x) - y)\big] + \mathbb{E}_{\epsilon}\big[(w^\top\epsilon)^2\big]$$
Since $\mathbb{E}[\epsilon] = 0$, the cross term vanishes, and
$$L(f) = \mathbb{E}_{x,y}\big[(f(x) - y)^2\big] + \lambda \|w\|^2$$
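A quick Monte Carlo check of this equivalence (my own illustration; the data, the fixed hypothesis $w$, and $\lambda = 0.1$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5000, 10, 0.1
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)                                  # a fixed linear hypothesis f(x) = w^T x

eps = rng.normal(scale=np.sqrt(lam), size=(n, d))       # input noise eps ~ N(0, lam * I)
clean_loss = np.mean((X @ w - y) ** 2)
noisy_loss = np.mean(((X + eps) @ w - y) ** 2)

# the gap between the two losses should be close to lam * ||w||^2
print(noisy_loss - clean_loss, lam * np.sum(w ** 2))
```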
Add noise to the weights
• For the loss on each data point, add a noise term to the weights before calculating the prediction
$$\epsilon \sim \mathcal{N}(0, \lambda I), \qquad w' = w + \epsilon$$
• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• The loss becomes
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f_{w+\epsilon}(x) - y)^2\big]$$
Add noise to the weights
• The loss becomes
$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f_{w+\epsilon}(x) - y)^2\big]$$
• To simplify, use a Taylor expansion:
$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^\top \nabla f_w(x) + \frac{1}{2}\,\epsilon^\top \nabla^2 f_w(x)\,\epsilon$$
• Plugging in:
$$L(f) \approx \mathbb{E}_{x,y}\big[(f_w(x) - y)^2\big] + \lambda\, \mathbb{E}_{x,y}\big[\|\nabla f_w(x)\|^2\big] + O(\lambda^2)$$
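A training-loop sketch of weight-noise injection (my own illustration, not code from the slides; the model, data, and $\lambda$ are assumptions): noise is added to the weights before the forward pass and removed before the parameter update.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
lam = 0.01                                        # variance of the weight noise
x, y = torch.randn(64, 10), torch.randn(64, 1)

for _ in range(50):
    noise = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            noise[name] = lam ** 0.5 * torch.randn_like(p)
            p.add_(noise[name])                   # w' = w + eps
    loss = torch.nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()                               # gradient evaluated at the noisy weights
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.sub_(noise[name])                   # restore the clean weights
    opt.step()                                    # apply that gradient to the clean weights
```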
Data augmentation
Adding noise to the input is a special kind of data augmentation.
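A small torchvision sketch of common image augmentations (illustrative only; the particular transforms and parameters are my assumptions, not prescribed by the slides):

```python
from torchvision import transforms

# random crops, flips and colour jitter generate new training views of each image
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
# the transform is typically passed to a dataset, e.g.
# datasets.CIFAR10(root="data", train=True, transform=train_transform, download=True)
```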
The first very successful CNN on the ImageNet dataset made heavy use of data augmentation.
Early stopping
• Idea: do not train the network until the training error is too small.
• During training, also monitor the validation error.
• Every time the validation error improves, store a copy of the weights.
• When the validation error has not improved for some time, stop.
• Return the stored copy of the weights (a sketch of the procedure follows below).
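A sketch of the procedure (my own code, not from the slides; it assumes a PyTorch-style model, and train_one_epoch, evaluate, and the patience value are hypothetical):

```python
import copy

def early_stopping_fit(model, train_one_epoch, evaluate, max_epochs=200, patience=10):
    best_val, best_weights, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_err = evaluate(model)                            # validation error after this epoch
        if val_err < best_val:
            best_val = val_err
            best_weights = copy.deepcopy(model.state_dict()) # store a copy of the weights
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:                # no improvement for a while: stop
                break
    model.load_state_dict(best_weights)                      # return the stored weights
    return model
```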
Early stopping
• Hyperparameter selection: the number of training steps is the hyperparameter.
Advantages:
• Efficient: runs alongside training; only an extra copy of the weights needs to be stored.
• Simple: no change to the model or algorithm.
Disadvantage: requires validation data.
Early stopping
Strategy to remove the disadvantage:
• After early stopping in the first run, train a second run and reuse the validation data.
How to reuse the validation data:
1. Start fresh: train on both the training data and the validation data, up to the number of epochs found in the first run.
2. Start from the weights of the first run: train on both the training data and the validation data until the validation loss falls below the training loss recorded at the early stopping point.
Early stopping as a regularizer
Dropout
Dropout in PyTorch
https://github.com/pytorch/pytorch/blob/10c4b98ade8349d841518d22f19a653a939e260c/torch/nn/modules/dropout.py#L36
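A short usage sketch of the linked module. Note that torch.nn.Dropout takes the probability of dropping an element (not of retaining it, as $p$ denotes on the following slides), and it scales the surviving activations by $1/(1-p)$ at training time, i.e. inverted dropout:

```python
import torch

drop = torch.nn.Dropout(p=0.5)   # p here = probability of zeroing an element
x = torch.ones(4, 8)

drop.train()
print(drop(x))   # about half the entries are 0, the survivors are scaled to 2.0
drop.eval()
print(drop(x))   # identical to x: dropout is a no-op at evaluation time
```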
Dropout
Dropout is a simple but very effective technique for alleviating overfitting during training.
The inspiration for dropout (Hinton et al., 2013) came from the role of sex in evolution:
• Genes must work well with a small random set of other genes.
• Similarly, dropout suggests that each unit should work well with a random sample of other units.
Dropout
• At training time (each iteration):
  Each unit is retained with probability $p$.
  What is the expected output of each unit?
• At test time:
  The network is used as a whole.
  The weights are scaled down by a factor of $p$ (e.g. 0.5); see the check below.
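A small numerical check of the test-time scaling (my own sketch): under random masking with retention probability $p$, the expected pre-activation of the next layer equals the one obtained with weights scaled by $p$.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
y = rng.normal(size=1000)          # activations of the previous layer
w = rng.normal(size=1000)          # weights into one unit of the next layer

masked = np.mean([w @ (rng.binomial(1, p, size=y.shape) * y) for _ in range(5000)])
print(masked, (p * w) @ y)         # the two values should be close
```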
Inverted Dropout
• At training time (each iteration):
  Each unit is retained with probability $p$.
  The retained activations are scaled up by a factor of $1/p$.
• At test time:
  The network is used as a whole.
  No scaling is applied (see the sketch below).
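A minimal sketch of the inverted scheme (illustrative NumPy, not library code): the scaling by $1/p$ happens at training time so that nothing needs to change at test time.

```python
import numpy as np

def inverted_dropout(y, p, training, rng=np.random.default_rng()):
    if training:
        mask = rng.binomial(1, p, size=y.shape)   # keep each unit with probability p
        return mask * y / p                        # scale the survivors up by 1/p
    return y                                       # test time: no masking, no scaling
```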
Dropout
In practice, dropout trains $2^n$ networks ($n$ is the number of units).
Dropout is a technique that deals with overfitting by combining the predictions of many different large neural nets at test time:
$$y_{\text{ens}} = \mathbb{E}_{M\sim p(M)}\big[(M \ast W)\,x\big] \approx \big(\mathbb{E}_{p(M)}[M] \ast W\big)\,x = p\,W x$$
Dropout
The feed-forward operation of a standard neural network is
W"1/$ =X"1/$Y(1)+Z"1/$ /"1/$ =M W"1/$
Dropout
With dropout:
6" 1 ~ Bernoulli(<)
W"1/$ =X"1/$Y_(1)+Z"1/$ /"(1/$)=M W"1/$
The University of Sydney
Page 33
Y_ ( 1 ) = ` 1 ∗ Y ( 1 )
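A direct NumPy transcription of the equations above (my own sketch; the activation $f$, the shapes, and the non-inverted test-time scaling by $p$ are assumptions consistent with the earlier slides):

```python
import numpy as np

def dropout_layer_forward(y_prev, W, b, p, f, training=True, rng=np.random.default_rng()):
    if training:
        r = rng.binomial(1, p, size=y_prev.shape)   # r_j^(l) ~ Bernoulli(p)
        y_tilde = r * y_prev                        # y~^(l) = r^(l) * y^(l)
    else:
        y_tilde = p * y_prev                        # test time: scale by p instead of masking
    z = W @ y_tilde + b                             # z^(l+1) = W^(l+1) y~^(l) + b^(l+1)
    return f(z)                                     # y^(l+1) = f(z^(l+1))

# example: one hidden layer with ReLU
relu = lambda z: np.maximum(z, 0.0)
y0 = np.random.default_rng(0).normal(size=8)
W1, b1 = np.ones((4, 8)), np.zeros(4)
print(dropout_layer_forward(y0, W1, b1, p=0.5, f=relu, training=True))
```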
Weight Decay
• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large
weights:
$$L_{\text{new}}(W) = L_{\text{old}}(W) + \frac{\lambda}{2}\|W\|_2^2$$
Dropout has advantages over weight decay (Helmbold et al., 2016):
• Dropout is scale-free: it does not penalize the use of large weights when they are needed.
• Dropout is invariant to parameter scaling: it is unaffected if the weights in one layer are scaled up by a constant $c$ and the weights in another layer are scaled down by the same constant $c$.
DropConnect
• DropConnect (Yann LeCun et al., 2013) generalizes dropout.
• Randomly drop connections in the network with probability $1 - p$.
• As in dropout, $p = 0.5$ usually gives the best results.
DropConnect
A single neuron before the activation function:
$$u = (M \ast W)\,v$$
where $M$ is a binary mask with entries drawn from $\mathrm{Bernoulli}(p)$ and $v$ is the layer input.
Approximate $u$ by a Gaussian distribution via moment matching:
$$u \sim \mathcal{N}\big(p\,Wv,\; p(1-p)\,(W \ast W)(v \ast v)\big)$$
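An illustrative DropConnect forward pass (my sketch, not the paper's code; at test time it substitutes the simple expected-weight approximation $p\,W$ rather than sampling from the Gaussian above):

```python
import torch

def dropconnect_linear(v, W, b, p=0.5, training=True):
    if training:
        M = torch.bernoulli(torch.full_like(W, p))   # keep each connection with probability p
        return v @ (M * W).T + b                      # u = (M * W) v, applied to each sample
    return v @ (p * W).T + b                          # crude test-time approximation: E[M] = p

W, b = torch.randn(3, 8), torch.zeros(3)
v = torch.randn(16, 8)                                # a batch of layer inputs
u = dropconnect_linear(v, W, b, p=0.5, training=True)
```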