
Regularizations for Deep Models
Dr Chang Xu
School of Computer Science
The University of Sydney

What is regularization?
In general: any method to prevent overfitting or help the optimization.

Regression using polynomials:
$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots + \theta_M x^M + \epsilon$

Overfitting
$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \dots + \theta_M x^M + \epsilon$
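As an illustration (not part of the original slides), a small NumPy sketch of the polynomial-regression setting above; the target function, sample size, noise level, and degrees are arbitrary demo choices:

import numpy as np

rng = np.random.default_rng(0)

# A few noisy samples of a simple underlying function
n = 10
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)

def fit_poly(x, y, degree):
    """Least-squares fit of y = theta_0 + theta_1*x + ... + theta_M*x^M."""
    X = np.vander(x, degree + 1, increasing=True)   # design matrix [1, x, x^2, ...]
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

for degree in (1, 3, 9):
    theta = fit_poly(x, y, degree)
    X = np.vander(x, degree + 1, increasing=True)
    train_mse = np.mean((X @ theta - y) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.5f}, max |theta| = {np.abs(theta).max():.1f}")

# The degree-9 fit drives the training error to ~0 but its coefficients blow up --
# the signature of overfitting that the following slides address.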

Prevent overfitting
Ø Larger data set helps.
Ø Throwing away useless hypotheses also helps.
ü Classical regularization: some principled ways to constrain hypotheses.
ü Other types of regularization: data augmentation, early stopping, etc.

Regularization as hard constraint
Training objective:
$\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i)$
$\text{s.t. } R(\theta) \le r$

Example: $\ell_2$ regularization
$\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i)$
$\text{s.t. } \|\theta\|_2^2 \le r$
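One standard way to optimize under such a hard constraint is projected gradient descent: take an ordinary gradient step, then project the parameters back onto the feasible set. Below is a minimal NumPy sketch for the $\|\theta\|_2^2 \le r$ case; the quadratic toy loss, step size, and radius are placeholder choices, not anything prescribed by the slides.

import numpy as np

def project_l2_ball(theta, r):
    """Project theta onto {theta : ||theta||_2^2 <= r}."""
    norm_sq = np.dot(theta, theta)
    if norm_sq <= r:
        return theta
    return theta * np.sqrt(r / norm_sq)

def projected_gradient_descent(grad_fn, theta0, r, lr=0.1, steps=100):
    """Minimize a loss (given via its gradient grad_fn) subject to ||theta||^2 <= r."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)   # unconstrained gradient step
        theta = project_l2_ball(theta, r)     # enforce the hard constraint
    return theta

# Toy usage: least-squares loss L(theta) = ||X theta - y||^2 / n
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 5)), rng.standard_normal(50)
grad = lambda th: 2 * X.T @ (X @ th - y) / len(y)
theta_hat = projected_gradient_descent(grad, np.zeros(5), r=1.0)
print(np.dot(theta_hat, theta_hat))           # stays <= r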

Regularization as soft constraint
The hard-constraint optimization is equivalent to the soft-constraint form
$\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* R(\theta)$
for some hyperparameter $\lambda^* > 0$.

Example: $\ell_2$ regularization
$\min_\theta \hat{L}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$
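A minimal PyTorch sketch of the soft-constraint form: the $\ell_2$ penalty is simply added to the data loss before backpropagation. The model, mini-batch, and the value of lambda are placeholder choices.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
lam = 1e-3                                        # hyperparameter lambda* > 0

x, y = torch.randn(32, 10), torch.randn(32, 1)    # placeholder mini-batch

optimizer.zero_grad()
data_loss = loss_fn(model(x), y)
l2_penalty = sum((p ** 2).sum() for p in model.parameters())
loss = data_loss + lam * l2_penalty               # L_hat(theta) + lambda* ||theta||^2
loss.backward()
optimizer.step()

For the $\ell_2$ case, most PyTorch optimizers expose the same mechanism through a weight_decay argument (up to the usual 1/2-factor convention), which is the implementation linked on the weight-decay slides below.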

Regularization as Bayesian prior
• Bayesian view: everything is a distribution
• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes rule: $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$

Regularization as Bayesian prior
• Bayes rule: $p(\theta \mid \{x_i, y_i\}) = \dfrac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$
• Maximum A Posteriori (MAP):
$\max_\theta \log p(\theta \mid \{x_i, y_i\}) = \max_\theta \left[ \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \right]$
where $\log p(\theta)$ acts as the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ is the MLE loss.

Some examples

Weight Decay (Researchers' view)
• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large weights:
$\min_\theta \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\lambda}{2}\|\theta\|^2$
• Gradient of the regularized objective:
$\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \lambda\theta$
• Gradient descent update:
$\theta \leftarrow \theta - \eta\,\nabla \hat{L}_R(\theta) = \theta - \eta\,\nabla \hat{L}(\theta) - \eta\lambda\theta = (1 - \eta\lambda)\,\theta - \eta\,\nabla \hat{L}(\theta)$

Weight Decay (Engineers' view)
• The L2 regularization technique for neural networks was worked out by researchers in the 1990s.
• At the same time, engineers, working independently of researchers, noticed that if you simply decrease the value of each weight on each training iteration, you get an improved trained model that is less likely to be overfitted.
$\theta \leftarrow \theta - \eta\,\nabla \hat{L}(\theta)$   // gradient update
$\theta \leftarrow 0.98\,\theta$   // or $\theta \leftarrow \theta - 0.02\,\theta$
• The L2 approach has a solid underlying theory but is complicated to implement. The weight decay approach "just works" and is simple to implement.

Weight decay in PyTorch (Adam implementation):
https://pytorch.org/docs/stable/_modules/torch/optim/adam.html#Adam

Other types of regularizations
• Robustness to noise
  • Noise to the input
  • Noise to the weights
• Data augmentation
• Early stopping
• Dropout

Multiple optimal solutions.

Add noise to the input
Too much noise leads to data points crossing the boundary.

Equivalence to weight decay
• Suppose the hypothesis is $f(x) = w^\top x$ and the noise is $\epsilon \sim \mathcal{N}(0, \lambda I)$.
• After adding noise to the input, the loss is
$\tilde{L}(f) = \mathbb{E}_{x,y,\epsilon}\!\left[(f(x+\epsilon) - y)^2\right] = \mathbb{E}_{x,y,\epsilon}\!\left[(f(x) + w^\top\epsilon - y)^2\right]$
$= \mathbb{E}_{x,y}\!\left[(f(x) - y)^2\right] + 2\,\mathbb{E}_{x,y,\epsilon}\!\left[w^\top\epsilon\,(f(x) - y)\right] + \mathbb{E}_{\epsilon}\!\left[(w^\top\epsilon)^2\right]$
• Since $\epsilon$ is zero-mean and independent of $(x, y)$, the cross term vanishes and
$\tilde{L}(f) = \mathbb{E}_{x,y}\!\left[(f(x) - y)^2\right] + \lambda\,\|w\|^2$

Add noise to the weights
• For the loss on each data point, add a noise term to the weights before calculating the prediction:
$\epsilon \sim \mathcal{N}(0, \lambda I), \quad w' = w + \epsilon$
• Prediction: $f_{w'}(x)$ instead of $f_w(x)$
• The loss becomes
$\tilde{L}(f) = \mathbb{E}_{x,y,\epsilon}\!\left[(f_{w+\epsilon}(x) - y)^2\right]$

Add noise to the weights
• The loss becomes
$\tilde{L}(f) = \mathbb{E}_{x,y,\epsilon}\!\left[(f_{w+\epsilon}(x) - y)^2\right]$
• To simplify, use a Taylor expansion:
$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^\top \nabla f_w(x) + \tfrac{1}{2}\,\epsilon^\top \nabla^2 f_w(x)\,\epsilon$
• Plugging in:
$\tilde{L}(f) \approx \mathbb{E}_{x,y}\!\left[(f_w(x) - y)^2\right] + \lambda\,\mathbb{E}_{x,y}\!\left[\|\nabla_w f_w(x)\|^2\right] + O(\lambda^2)$

Data augmentation
Adding noise to the input is a special kind of augmentation.

The first very successful CNN on the ImageNet dataset.
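As a concrete illustration (not from the slides), a typical image-augmentation pipeline using torchvision: each training image is randomly cropped, flipped, and colour-jittered on the fly, so the network effectively sees a much larger, noisier dataset. The particular transforms and parameter values are arbitrary example choices.

import torch
from torchvision import transforms

# Random transformations applied independently every time an image is drawn
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                # random crop + rescale
    transforms.RandomHorizontalFlip(p=0.5),           # mirror half of the images
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    # Adding Gaussian noise to the input -- the "noise as augmentation" view above
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),
])

# Evaluation data is left untouched (no augmentation at test time)
eval_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])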
Early stopping
• Idea: don't train the network to too small a training error.
• When training, also monitor the validation error.
• Every time the validation error improves, store a copy of the weights.
• When the validation error has not improved for some time, stop.
• Return the stored copy of the weights.

Early stopping
• Hyperparameter selection: the number of training steps is the hyperparameter.
Ø Advantages
• Efficient: runs alongside training; only an extra copy of the weights is stored.
• Simple: no change to the model or algorithm.
Ø Disadvantage: needs validation data.

Early stopping
Ø Strategy to get rid of the disadvantage
• After early stopping of the first run, train a second run and reuse the validation data.
Ø How to reuse the validation data
1. Start fresh, and train with both training data and validation data up to the number of epochs found in the first run.
2. Start from the weights of the first run, and train with both training data and validation data until the validation loss falls below the training loss at the early-stopping point.

Early stopping as a regularizer

Dropout
Dropout in PyTorch:
https://github.com/pytorch/pytorch/blob/10c4b98ade8349d841518d22f19a653a939e260c/torch/nn/modules/dropout.py#L36

Dropout
It is a simple but very effective technique for alleviating overfitting during training. The inspiration for dropout (Hinton et al., 2013) came from the role of sex in evolution:
• Genes must work well with another small, random set of genes.
• Similarly, dropout suggests that each unit should work with a random sample of other units.

Dropout
• At training (each iteration): each unit is retained with probability $p$.
• At test: the network is used as a whole; the weights are scaled down by a factor of $p$ (e.g. 0.5).

Inverted Dropout
• At training (each iteration): each unit is retained with probability $p$.
ü The retained activations (equivalently, the weights) are scaled up by a factor of $1/p$ during training.
• At test: the network is used as a whole.
ü No scaling is applied.

Dropout
In practice, dropout trains $2^n$ networks ($n$ is the number of units). Dropout is a technique that deals with overfitting by combining the predictions of many different large neural nets at test time:
$y_{\text{test}} = \mathbb{E}_{p(M)}\!\left[(M * W)\,y\right] \approx \left(\mathbb{E}_{p(M)}[M] * W\right) y = p\,W y$

Dropout
The feed-forward operation of a standard neural network is
$z_i^{(l+1)} = w_i^{(l+1)} y^{(l)} + b_i^{(l+1)}$
$y_i^{(l+1)} = f\!\left(z_i^{(l+1)}\right)$

Dropout
With dropout:
$r_j^{(l)} \sim \text{Bernoulli}(p)$
$\tilde{y}^{(l)} = r^{(l)} * y^{(l)}$
$z_i^{(l+1)} = w_i^{(l+1)} \tilde{y}^{(l)} + b_i^{(l+1)}$
$y_i^{(l+1)} = f\!\left(z_i^{(l+1)}\right)$

Weight Decay
• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large weights:
$L_{\text{new}}(W) = L_{\text{old}}(W) + \tfrac{1}{2}\lambda\,\|W\|_2^2$
Dropout has advantages over weight decay (Helmbold et al., 2016):
• Dropout is scale-free: dropout does not penalize the use of large weights when needed.
• Dropout is invariant to parameter scaling: dropout is unaffected if the weights in one layer are scaled up by a constant $c$ and the weights in another layer are scaled down by the same constant $c$.
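A minimal sketch of the inverted-dropout forward pass described above (the helper name inverted_dropout and the shapes are placeholder choices): a Bernoulli mask keeps each unit with probability $p$, and the surviving activations are scaled by $1/p$ so that no rescaling is needed at test time.

import torch

def inverted_dropout(y, p=0.5, training=True):
    """Apply inverted dropout to activations y; p is the keep probability."""
    if not training or p == 1.0:
        return y                                   # test time: whole network, no scaling
    r = torch.bernoulli(torch.full_like(y, p))     # r_j ~ Bernoulli(p), one mask per unit
    return (r * y) / p                             # y_tilde = r * y, scaled up by 1/p

# Usage inside a forward pass (weights and activation function are placeholders)
y = torch.randn(32, 128)                           # activations y^(l) for a mini-batch
y_tilde = inverted_dropout(y, p=0.5, training=True)
# z^(l+1) = w^(l+1) y_tilde + b^(l+1);  y^(l+1) = f(z^(l+1)) as in the equations above

Note that the built-in torch.nn.Dropout(p) linked above takes the drop probability as its argument, i.e. $1 - p$ in the slides' notation.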
DropConnect
• DropConnect (Yann LeCun et al., 2013) generalizes dropout.
• Randomly drop connections in the network with probability $1 - p$.
• As in dropout, $p = 0.5$ usually gives the best results.

DropConnect
A single neuron before the activation function:
$u = (M * W)\,v$
where $M$ is a random binary mask over the weights.
Approximate $u$ by a Gaussian distribution.
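A minimal sketch of the DropConnect idea, not the authors' reference implementation (the helper name dropconnect_linear and the layer sizes are placeholder choices): during training an independent Bernoulli mask is drawn over the weight matrix itself, so individual connections rather than whole units are dropped.

import torch

def dropconnect_linear(v, W, b, keep_prob=0.5, training=True):
    """Pre-activation u = (M * W) v + b with a Bernoulli mask M over the weights."""
    if training:
        M = torch.bernoulli(torch.full_like(W, keep_prob))  # keep each connection w.p. p
        return (M * W) @ v + b
    # Simplified test-time choice used here: the expected weights p * W.
    # (The slide above notes that DropConnect instead approximates u by a Gaussian.)
    return (keep_prob * W) @ v + b

# Placeholder usage for a single layer with 128 inputs and 64 outputs
W, b = torch.randn(64, 128), torch.zeros(64)
v = torch.randn(128)
u = dropconnect_linear(v, W, b, keep_prob=0.5, training=True)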