Bells and whistles in neural net training
Tricks in training neural networks
There are various tricks that people use when training neural networks:
I Regularization: Adjusting the gradient
I Dropout: Adjusting the hidden units
I Optimization methods: Adjusting the learning rate
I Initialization: Using particular forms of initialization
Regularization
Neural networks can be regularized in much the same way as linear models. A common choice is the Frobenius norm, a straightforward extension of the L2 norm to matrices; in many cases it is simply referred to as L2 regularization.
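$$L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \|\Theta^{(z \to y)}\|_F^2 + \lambda_{x \to z} \|\Theta^{(x \to z)}\|_F^2$$
where $\|\Theta\|_F^2 = \sum_{i,j} \theta_{i,j}^2$ is the squared Frobenius norm, which generalizes the L2 norm to matrices. The bias parameters $b$ are not regularized, as they do not contribute to the sensitivity of the classifier to the inputs.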
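As a rough illustration of the regularized loss above, the following sketch adds the two weighted squared Frobenius norms to a sum of per-example losses. The function names, the list-of-rows matrix representation, and all numeric values are assumptions made for this example; they do not come from the slides.

```python
def squared_frobenius_norm(matrix):
    """Sum of the squared entries of a matrix given as a list of rows."""
    return sum(w * w for row in matrix for w in row)


def regularized_loss(example_losses, theta_zy, theta_xz, lam_zy, lam_xz):
    """Sum of per-example losses plus weighted squared Frobenius penalties."""
    data_term = sum(example_losses)
    penalty = (lam_zy * squared_frobenius_norm(theta_zy)
               + lam_xz * squared_frobenius_norm(theta_xz))
    return data_term + penalty


# Hypothetical toy values, chosen only for illustration.
theta_xz = [[1.0, -2.0], [0.5, 0.0]]  # input-to-hidden weights
theta_zy = [[3.0, 1.0]]               # hidden-to-output weights
total = regularized_loss([0.2, 0.3], theta_zy, theta_xz,
                         lam_zy=0.1, lam_xz=0.01)
```

The bias vectors are deliberately left out of the penalty, matching the remark that biases are not regularized.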
L2 regularization
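I Compute the gradient of a loss with L2 regularization (the constant factor 2 from differentiating the squared norm is absorbed into $\lambda$):
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\theta$$
I Update the weights:
$$\theta = \theta - \eta\left(\sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\theta\right)$$
I "Weight decay factor": $\lambda$ is a tunable hyperparameter that pulls a weight back when it has become too big.
I Question: Does it matter which layer $\theta$ is from when computing the regularization term?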
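The L2-regularized update can be sketched for a single scalar weight as follows. This is a minimal sketch, not a full training loop: `l2_update`, `grad_sum` (standing for the summed per-example gradient), and the numeric values are hypothetical names and settings chosen for illustration.

```python
def l2_update(theta, grad_sum, lam, eta):
    """One gradient step with an L2 penalty term lam * theta."""
    return theta - eta * (grad_sum + lam * theta)


# With a zero data gradient, the penalty alone shrinks the weight
# by an amount proportional to its current value (weight decay).
big = l2_update(theta=2.0, grad_sum=0.0, lam=0.1, eta=0.5)
small = l2_update(theta=0.2, grad_sum=0.0, lam=0.1, eta=0.5)
```

Here the larger weight moves ten times further than the smaller one, which is the "pull back" behaviour attributed to the weight decay factor.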
L1 regularization
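I L1 regularization loss:
$$L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \|\Theta^{(z \to y)}\|_1 + \lambda_{x \to z} \|\Theta^{(x \to z)}\|_1$$
I Compute the gradient:
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\,\mathrm{sign}(\theta)$$
I Update the weights:
$$\theta = \theta - \eta\left(\sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\,\mathrm{sign}(\theta)\right)$$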
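The L1 update differs from the L2 update only in the penalty term, which uses the sign of the weight. A minimal scalar sketch, with hypothetical function names and values, and with sign(0) taken to be 0 as an assumption:

```python
def sign(x):
    """Return 1 for positive x, -1 for negative x, and 0 for zero."""
    return (x > 0) - (x < 0)


def l1_update(theta, grad_sum, lam, eta):
    """One gradient step with an L1 penalty term lam * sign(theta)."""
    return theta - eta * (grad_sum + lam * sign(theta))


# With a zero data gradient, every nonzero weight moves toward 0
# by the same constant amount eta * lam, whatever its magnitude.
positive = l1_update(theta=2.0, grad_sum=0.0, lam=0.1, eta=0.5)
negative = l1_update(theta=-0.3, grad_sum=0.0, lam=0.1, eta=0.5)
```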
Comparison of L1 and L2
I In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount that is proportional to $\theta$.
I When a particular weight has a large absolute value, $|\theta|$, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when $|\theta|$ is small, L1 regularization shrinks the weight much more than L2 regularization.
I The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero. So L1 regularization effectively does feature selection.
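The comparison above can be checked numerically by looking only at the part of each update that comes from the penalty. The helper names and the settings `lam = 0.1`, `eta = 0.5` are assumptions for this sketch.

```python
def l1_shrink(theta, lam, eta):
    """Penalty-induced change under L1: constant size eta * lam."""
    sign = (theta > 0) - (theta < 0)
    return eta * lam * sign


def l2_shrink(theta, lam, eta):
    """Penalty-induced change under L2: proportional to theta."""
    return eta * lam * theta


lam, eta = 0.1, 0.5
large, tiny = 5.0, 0.01

# For each weight, collect (L1 shrinkage, L2 shrinkage).
shrinkage = {
    "large": (l1_shrink(large, lam, eta), l2_shrink(large, lam, eta)),
    "tiny": (l1_shrink(tiny, lam, eta), l2_shrink(tiny, lam, eta)),
}
```

For the large weight the L2 shrinkage dominates, while for the tiny weight the L1 shrinkage is larger than the weight itself, which is why small weights end up at or near zero under L1.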
Dropout
I Randomly drops a certain percentage of the nodes to prevent over-reliance on a few features or hidden units, or feature co-adaptation, where some features are only useful when working together with a few other features. The ultimate goal is to avoid overfitting.
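I Dropout can be achieved using a mask:
$$\begin{aligned}
z^{(1)} &= g^{1}(\Theta^{(1)} x + b^{1}) \\
m^{1} &\sim \mathrm{Bernoulli}(r^{1}) \\
\tilde{z}^{(1)} &= m^{1} \odot z^{(1)} \\
z^{(2)} &= g^{2}(\Theta^{(2)} \tilde{z}^{(1)} + b^{2}) \\
m^{2} &\sim \mathrm{Bernoulli}(r^{2}) \\
\tilde{z}^{(2)} &= m^{2} \odot z^{(2)} \\
y &= \Theta^{(3)} \tilde{z}^{(2)}
\end{aligned}$$
where $m^{1}$ and $m^{2}$ are mask vectors and $\odot$ denotes elementwise multiplication. The values of the elements in these vectors are either 1 or 0, drawn from a Bernoulli distribution with parameter $r$ (usually $r = 0.5$).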
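The masking step can be sketched for one layer of hidden activations as follows. This is a minimal sketch under stated assumptions: `dropout` and the example activations are hypothetical, and the Bernoulli parameter is read as the probability that a mask element equals 1 (the unit is kept).

```python
import random


def dropout(z, r, rng):
    """Multiply each activation by a 0/1 mask element drawn from Bernoulli(r).

    r is interpreted as the probability that a mask element is 1.
    """
    mask = [1 if rng.random() < r else 0 for _ in z]
    z_tilde = [m * value for m, value in zip(mask, z)]
    return z_tilde, mask


rng = random.Random(0)               # seeded only to make the run repeatable
z1 = [0.7, 1.2, 0.0, 2.5, 0.3, 1.1]  # hypothetical hidden-layer activations
z1_tilde, m1 = dropout(z1, r=0.5, rng=rng)
```

A fresh mask would be drawn for every layer and every training step, so a different subset of units is silenced each time.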
Optimization methods
I SGD with Momentum
I AdaGrad
I Root Mean Square Prop (RMSProp)
I Adam
SGD with Momentum
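I At each timestep $t$, compute $\nabla_\theta L$, and then compute the momentum as follows:
$$\begin{aligned}
V_0 &= 0, \quad \beta \approx 0.9 \\
V_t &= \beta V_{t-1} + (1 - \beta)\nabla_\theta L \\
\theta_j &= \theta_j - \eta V_t
\end{aligned}$$
I The momentum term increases for dimensions whose gradients keep pointing in the same direction and reduces updates for dimensions whose gradients change direction.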
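The momentum update can be tried on a one-dimensional toy problem. In this sketch the objective L(theta) = theta^2 (gradient 2 * theta), the learning rate 0.1, and the function name `momentum_step` are all assumptions made for illustration; only the value beta = 0.9 follows the text.

```python
def momentum_step(theta, v, grad, eta=0.1, beta=0.9):
    """One SGD-with-momentum step on a scalar parameter."""
    v = beta * v + (1 - beta) * grad  # exponentially weighted average of gradients
    theta = theta - eta * v           # step along the averaged gradient
    return theta, v


# Minimize the toy objective L(theta) = theta**2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2 * theta)
```

Because V averages recent gradients, steps in a consistent direction accumulate, while gradients that flip sign from step to step partly cancel.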
AdaGrad
I Keep a running sum $V_t$ of the squared gradient for each weight. When updating a weight $\theta_j$, divide its gradient by the square root of this term:
$$\begin{aligned}
V_0 &= 0 \\
V_t &= V_{t-1} + (\nabla_\theta L)^2 \\
\theta_j &= \theta_j - \eta \frac{\nabla_\theta L}{\sqrt{V_t} + \epsilon}
\end{aligned}$$
e.g., $\epsilon = 10^{-8}$
I The net effect is to slow down the update for weights with large gradients and accelerate the update for weights with small gradients.
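A scalar sketch of the AdaGrad step, assuming the small constant is added outside the square root; the function name and the learning rate 0.1 are hypothetical choices for this example.

```python
import math


def adagrad_step(theta, v, grad, eta=0.1, eps=1e-8):
    """One AdaGrad step on a scalar parameter."""
    v = v + grad ** 2                                   # running sum of squared gradients
    theta = theta - eta * grad / (math.sqrt(v) + eps)   # per-weight scaled step
    return theta, v


# Two weights with very different gradient magnitudes take
# almost the same first step, of size close to eta.
steep, v_steep = adagrad_step(theta=0.0, v=0.0, grad=10.0)
flat, v_flat = adagrad_step(theta=0.0, v=0.0, grad=0.1)
```

Dividing by the accumulated magnitude equalizes the step sizes, which is the slowing and accelerating effect described above. Since the sum never decreases, the effective step size can only shrink over time.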
Root Mean Square Prop (RMSProp)
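I A minor adjustment of AdaGrad. Instead of letting the sum of squared gradients grow continuously, we let the sum decay:
$$\begin{aligned}
V_0 &= 0 \\
V_t &= \beta V_{t-1} + (1 - \beta)(\nabla_\theta L)^2 \\
\theta_j &= \theta_j - \eta \frac{\nabla_\theta L}{\sqrt{V_t} + \epsilon}
\end{aligned}$$
e.g., $\beta \approx 0.9$, $\eta = 0.001$, $\epsilon = 10^{-8}$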
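The decaying sum can be sketched as below, using the example hyperparameters from the text as defaults and assuming the small constant sits outside the square root. The function name and the constant gradient of 2.0 are assumptions for illustration.

```python
import math


def rmsprop_step(theta, v, grad, eta=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step on a scalar parameter."""
    v = beta * v + (1 - beta) * grad ** 2               # decaying average of squared gradients
    theta = theta - eta * grad / (math.sqrt(v) + eps)   # per-weight scaled step
    return theta, v


# Under a constant gradient, v settles near grad**2 instead of growing forever.
theta, v = 1.0, 0.0
for _ in range(100):
    theta, v = rmsprop_step(theta, v, grad=2.0)
```

Because the accumulator stays bounded, the effective step size does not keep shrinking the way it does with a plain running sum.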
Adaptive Moment Estimation (Adam)
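I Weight update at time step $t$ for Adam:
$$\begin{aligned}
V_0 &= 0, \quad S_0 = 0 \\
V_t &= \beta_1 V_{t-1} + (1 - \beta_1)\nabla_\theta L && \text{(Momentum)} \\
S_t &= \beta_2 S_{t-1} + (1 - \beta_2)(\nabla_\theta L)^2 && \text{(RMSProp)} \\
V_t^{\text{corrected}} &= \frac{V_t}{1 - \beta_1^t} \\
S_t^{\text{corrected}} &= \frac{S_t}{1 - \beta_2^t} \\
\theta_j &= \theta_j - \eta \frac{V_t^{\text{corrected}}}{\sqrt{S_t^{\text{corrected}}} + \epsilon}
\end{aligned}$$
I Adam combines Momentum and RMSProp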
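The Adam update can be sketched for a scalar parameter as follows. The slides give no hyperparameter values for Adam, so the defaults below (beta1 = 0.9, beta2 = 0.999, eta = 0.001, eps = 1e-8) are assumptions based on commonly used settings, and `adam_step` is a hypothetical name. The step counter `t` is assumed to start at 1, since the bias correction divides by 1 - beta^t.

```python
import math


def adam_step(theta, v, s, grad, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on a scalar parameter; t counts steps starting at 1."""
    v = beta1 * v + (1 - beta1) * grad        # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSProp-style average of squared gradients
    v_corrected = v / (1 - beta1 ** t)        # undo the bias toward the zero start
    s_corrected = s / (1 - beta2 ** t)
    theta = theta - eta * v_corrected / (math.sqrt(s_corrected) + eps)
    return theta, v, s


# On the first step the corrected averages equal grad and grad**2,
# so the parameter moves by almost exactly eta.
theta, v, s = adam_step(theta=1.0, v=0.0, s=0.0, grad=3.0, t=1)
```

Without the correction, the first few steps would be computed from averages that are still close to their zero initial values.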
Initialization
Xavier Initialization:
$\Theta \sim U[\ldots]$