
Regularizations
for Deep Models

Dr Chang Xu

School of Computer Science


What is regularization?

In general: any method to prevent overfitting or help the optimization.
Regression using polynomials:

$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_M x^M + \epsilon$


Overfitting

$y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \cdots + \theta_M x^M + \epsilon$
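As a quick illustration (my own, not from the slides), the sketch below fits polynomials of increasing degree $M$ to a few noisy points with `numpy.polyfit`: the training error keeps shrinking while the error on held-out points blows up at high degree, which is the overfitting pattern this slide refers to.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small noisy data set drawn from an underlying sine curve.
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)     # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```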


Prevent overfitting

• A larger data set helps.
• Throwing away useless hypotheses also helps.

✓ Classical regularization: some principled ways to constrain hypotheses.

✓ Other types of regularization: data augmentation, early stopping, etc.


Regularization as hard constraint

Training objective:

$$\min_{\theta}\; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) \qquad \text{s.t. } R(\theta) \le r$$

Example: $\ell_2$ regularization

$$\min_{\theta}\; \hat{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) \qquad \text{s.t. } \|\theta\|_2^2 \le r$$
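The slides do not spell out how such a constraint would be enforced; one standard option (an illustrative sketch of my own, not necessarily what the lecture has in mind) is projected gradient descent: take a gradient step on the unconstrained loss, then project the parameters back onto the $\ell_2$ ball of radius $r$.

```python
import torch

def project_onto_l2_ball(theta: torch.Tensor, r: float) -> torch.Tensor:
    """Scale theta back so that ||theta||_2^2 <= r (no-op if already feasible)."""
    norm_sq = theta.pow(2).sum()
    if norm_sq > r:
        theta = theta * torch.sqrt(torch.tensor(r) / norm_sq)
    return theta

# Toy least-squares problem: loss(theta) = mean((X @ theta - y)^2).
X = torch.randn(50, 5)
y = torch.randn(50)
theta = torch.zeros(5, requires_grad=True)
lr, r = 0.1, 1.0

for step in range(100):
    loss = ((X @ theta - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        theta -= lr * theta.grad                        # gradient step on the unconstrained loss
        theta.copy_(project_onto_l2_ball(theta, r))     # enforce the hard constraint
    theta.grad.zero_()
```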


Regularization as soft constraint

$$\min_{\theta}\; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* R(\theta)$$

The hard-constraint optimization is equivalent to the soft-constraint one for some hyperparameter $\lambda^* > 0$.

Example: $\ell_2$ regularization

$$\min_{\theta}\; \hat{L}_R(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, x_i, y_i) + \lambda^* \|\theta\|_2^2$$
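A minimal sketch (my own, not from the slides) of soft-constraint $\ell_2$ regularization in PyTorch: the penalty $\lambda \|\theta\|_2^2$ is simply added to the data loss before backpropagation.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                  # stand-in for any deep model
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lam = 1e-3                                # the hyperparameter lambda*

x, y = torch.randn(32, 10), torch.randn(32, 1)

for epoch in range(100):
    optimizer.zero_grad()
    data_loss = criterion(model(x), y)
    # Soft constraint: add the l2 penalty over all parameters to the loss.
    l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
    loss = data_loss + lam * l2_penalty
    loss.backward()
    optimizer.step()
```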


Regularization as Bayesian prior

• Bayesian view: everything is a distribution

• Prior over the hypotheses: $p(\theta)$
• Posterior over the hypotheses: $p(\theta \mid \{x_i, y_i\})$
• Likelihood: $p(\{x_i, y_i\} \mid \theta)$
• Bayes' rule:

$$p(\theta \mid \{x_i, y_i\}) = \frac{p(\theta)\, p(\{x_i, y_i\} \mid \theta)}{p(\{x_i, y_i\})}$$

• Maximum A Posteriori (MAP):

$$\max_{\theta} \log p(\theta \mid \{x_i, y_i\}) = \max_{\theta} \big[\, \log p(\theta) + \log p(\{x_i, y_i\} \mid \theta) \,\big]$$

where $\log p(\theta)$ gives the regularization term and $\log p(\{x_i, y_i\} \mid \theta)$ gives the MLE loss.

Some examples

Weight Decay (Researchers' view)

• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large weights:

$$\min_{\theta}\; \hat{L}_R(\theta) = \hat{L}(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$

• Gradient of the regularized objective: $\nabla \hat{L}_R(\theta) = \nabla \hat{L}(\theta) + \lambda\theta$
• Gradient descent update:

$$\theta \leftarrow \theta - \eta \nabla \hat{L}_R(\theta) = \theta - \eta \nabla \hat{L}(\theta) - \eta\lambda\theta = (1 - \eta\lambda)\,\theta - \eta \nabla \hat{L}(\theta)$$

Weight Decay (Engineers' view)

• The L2 regularization technique for neural networks was worked out by researchers in the 1990s.
• At the same time, engineers, working independently of the researchers, noticed that if you simply decrease the value of each weight a little on each training iteration, you get an improved trained model that is less likely to be overfitted:

$\theta \leftarrow \theta - \eta \nabla \hat{L}(\theta)$   // the usual update
$\theta \leftarrow 0.98\,\theta$   // or equivalently $\theta = \theta - 0.02\,\theta$

• The L2 approach has a solid underlying theory but is more complicated to implement. The weight decay approach "just works" and is simple to implement.

Weight decay in PyTorch's Adam (the weight_decay argument): https://pytorch.org/docs/stable/_modules/torch/optim/adam.html#Adam
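Both views fit in a few lines. The sketch below is my own illustration, not the lecture's code: with plain SGD, shrinking the weights by $(1-\eta\lambda)$ before the usual gradient step gives the same update as letting the optimizer add $\lambda\theta$ to the gradient via its built-in `weight_decay` argument.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)
lr, lam = 0.1, 1e-4
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Engineers' view: shrink every weight a little, then take the usual gradient step.
loss = F.mse_loss(model(x), y)
loss.backward()
with torch.no_grad():
    for p in model.parameters():
        p *= (1 - lr * lam)        # weight decay: theta <- (1 - eta*lambda) * theta
        p -= lr * p.grad           # usual step:   theta <- theta - eta * grad
        p.grad = None

# Researchers' view: ask the optimizer to add lambda * theta to the gradient instead.
optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=lam)
optimizer.zero_grad()
loss = F.mse_loss(model(x), y)
loss.backward()
optimizer.step()
```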
Other types of regularizations

• Robustness to noise
  • Noise to the input
  • Noise to the weights
• Data augmentation
• Early stopping
• Dropout

Multiple optimal solutions.

Add noise to the input

Too much noise leads to data points crossing the boundary.

Equivalence to weight decay

• Suppose the hypothesis is $f(x) = w^\top x$ and the noise is $\epsilon \sim \mathcal{N}(0, \lambda I)$.
• After adding noise to the input, the loss is

$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f(x+\epsilon) - y)^2\big] = \mathbb{E}_{x,y,\epsilon}\big[(f(x) + w^\top\epsilon - y)^2\big]$$
$$= \mathbb{E}_{x,y}\big[(f(x) - y)^2\big] + 2\,\mathbb{E}_{x,y,\epsilon}\big[w^\top\epsilon\,(f(x) - y)\big] + \mathbb{E}_{\epsilon}\big[(w^\top\epsilon)^2\big]$$
$$L(f) = \mathbb{E}_{x,y}\big[(f(x) - y)^2\big] + \lambda \|w\|^2$$

(the cross term vanishes because $\epsilon$ is zero-mean and independent of $(x, y)$, and $\mathbb{E}_{\epsilon}[(w^\top\epsilon)^2] = \lambda\|w\|^2$)

Add noise to the weights

• For the loss on each data point, add a noise term to the weights before calculating the prediction:

$$L(f) = \mathbb{E}_{x,y,\epsilon}\big[(f_{w+\epsilon}(x) - y)^2\big], \qquad \epsilon \sim \mathcal{N}(0, \lambda I), \quad w' = w + \epsilon$$

• Prediction: $f_{w'}(x)$ instead of $f_w(x)$.
• To simplify the resulting loss, use the Taylor expansion

$$f_{w+\epsilon}(x) \approx f_w(x) + \epsilon^\top \nabla_w f_w(x) + \frac{1}{2}\,\epsilon^\top \nabla_w^2 f_w(x)\,\epsilon$$

• Plugging this in gives

$$L(f) \approx \mathbb{E}_{x,y}\big[(f_w(x) - y)^2\big] + \lambda\,\mathbb{E}_{x,y}\big[\|\nabla_w f_w(x)\|^2\big] + O(\lambda^2)$$

Data augmentation

Adding noise to the input is a special kind of augmentation.

The first very successful CNN on the ImageNet dataset.

Early stopping

• Idea: don't train the network until the training error is too small.
• While training, also monitor the validation error.
• Every time the validation error improves, store a copy of the weights.
• When the validation error has not improved for some time, stop.
• Return the stored copy of the weights. (A code sketch of this loop follows below.)

Early stopping

• Hyperparameter selection: the number of training steps is the hyperparameter.
• Advantages:
  ✓ Efficient: it runs alongside training, and only an extra copy of the weights is stored.
  ✓ Simple: no change to the model or the algorithm.
• Disadvantage: it needs validation data.

Early stopping

• Strategy to get rid of the disadvantage: after early stopping in the first run, train a second run and reuse the validation data.
• How to reuse the validation data:
  1. Start fresh, and train with both the training data and the validation data up to the number of epochs found in the first run.
  2. Start from the weights of the first run, and train with both the training data and the validation data until the validation loss falls below the training loss at the early-stopping point.
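A minimal sketch of the early-stopping loop described above (my own illustration; `train_one_epoch` and `validation_loss` are hypothetical callbacks supplied by the caller): keep the best weights seen so far and stop once the validation error has not improved for a given number of epochs.

```python
import copy
import torch

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=200, patience=10):
    """Only the early-stopping logic is shown; training and evaluation
    are delegated to the caller-provided callbacks."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss = val_loss                                # validation error improved:
            best_state = copy.deepcopy(model.state_dict())      # store a copy of the weights
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                           # no improvement for a while: stop
    model.load_state_dict(best_state)                           # return the stored weights
    return model
```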
Early stopping as a regularizer

Dropout

Dropout in PyTorch: https://github.com/pytorch/pytorch/blob/10c4b98ade8349d841518d22f19a653a939e260c/torch/nn/modules/dropout.py#L36

Dropout is a simple but very effective technique that alleviates overfitting during the training phase. The inspiration for dropout (Hinton et al., 2013) came from the role of sex in evolution:

• Genes work well with another small random set of genes.
• Similarly, dropout suggests that each unit should learn to work with a random sample of other units.

Dropout

• At training (each iteration): each unit is retained with probability $p$.
• At test: the network is used as a whole, and the weights are scaled down by a factor of $p$ (e.g. 0.5), so that each unit's input at test time matches its expectation under dropout at training time.

Inverted Dropout

• At training (each iteration): each unit is retained with probability $p$.
  ✓ The weights are scaled up by a factor of $1/p$.
• At test: the network is used as a whole.
  ✓ No scaling is applied.

Dropout

In practice, dropout trains $2^H$ networks ($H$ – the number of units). Dropout deals with overfitting by combining the predictions of many different large neural nets at test time:

$$y_{\text{out}} = \sum_{\mu} p(\mu)\, f\big((\mu \ast W)\, y_{\text{in}}\big) \approx f\Big(\sum_{\mu} p(\mu)\,(\mu \ast W)\, y_{\text{in}}\Big) = f\big(p\,W\, y_{\text{in}}\big),$$

where $\mu$ ranges over dropout masks.

Dropout

The feed-forward operation of a standard neural network is

$$z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \mathbf{y}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big)$$

Dropout

With dropout:

$$r_j^{(l)} \sim \mathrm{Bernoulli}(p), \qquad \tilde{\mathbf{y}}^{(l)} = \mathbf{r}^{(l)} \ast \mathbf{y}^{(l)}$$
$$z_i^{(l+1)} = \mathbf{w}_i^{(l+1)} \tilde{\mathbf{y}}^{(l)} + b_i^{(l+1)}, \qquad y_i^{(l+1)} = f\big(z_i^{(l+1)}\big)$$

Weight Decay

• Limiting the growth of the weights in the network.
• A term is added to the original loss function, penalizing large weights:

$$\mathcal{L}_{\text{reg}}(W) = \mathcal{L}_{\text{orig}}(W) + \frac{1}{2}\,\lambda \|W\|_2^2$$

Dropout has advantages over weight decay (Helmbold et al., 2016):

• Dropout is scale-free: it does not penalize the use of large weights when they are needed.
• Dropout is invariant to parameter scaling: it is unaffected if the weights in one layer are scaled up by a constant $c$ and the weights in another layer are scaled down by the same constant $c$.

DropConnect

• DropConnect (Yann LeCun et al., 2013) generalizes dropout.
• Randomly drop connections in the network with probability $1 - p$.
• As in dropout, $p = 0.5$ usually gives the best results.

A single neuron before the activation function:

$$u = (M \ast W)\, v,$$

where $M$ is a binary mask over the weights and $v$ is the layer input. At test time the pre-activation is approximated by a Gaussian with moments

$$\mathbb{E}[u] = p\,W v, \qquad \mathrm{Var}[u] = p(1-p)\,(W \ast W)(v \ast v)$$
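To make the train/test asymmetry concrete, here is a small sketch of my own (not the lecture's code) of dropout in the inverted form, next to a DropConnect-style mask on the weights. The DropConnect test-time line uses the simple mean approximation $\mathbb{E}[M \ast W] = pW$ rather than the Gaussian sampling described above.

```python
import torch

def inverted_dropout(x: torch.Tensor, p_keep: float, training: bool) -> torch.Tensor:
    """Dropout on activations: retain each unit with probability p_keep and
    scale by 1/p_keep at training time, so no rescaling is needed at test time."""
    if not training or p_keep == 1.0:
        return x                                    # test time: use the whole network
    mask = (torch.rand_like(x) < p_keep).float()    # r ~ Bernoulli(p_keep)
    return x * mask / p_keep

def dropconnect_linear(x: torch.Tensor, W: torch.Tensor, b: torch.Tensor,
                       p_keep: float, training: bool) -> torch.Tensor:
    """DropConnect: drop individual connections (weights) with probability 1 - p_keep."""
    if not training:
        return x @ (p_keep * W).t() + b             # mean approximation E[M * W] = p * W
    M = (torch.rand_like(W) < p_keep).float()       # binary mask over the weight matrix
    return x @ (M * W).t() + b                      # u = (M * W) v

h = torch.randn(32, 100)                            # previous layer's activations
W, b = torch.randn(50, 100), torch.zeros(50)

y_dropout = torch.relu(inverted_dropout(h, 0.5, training=True) @ W.t() + b)
y_dropconnect = torch.relu(dropconnect_linear(h, W, b, 0.5, training=True))
```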