Bells and whistles in neural net training

Tricks in training neural networks
There are various tricks that people use when training neural networks:
- Regularization: Adjusting the gradient
- Dropout: Adjusting the hidden units
- Optimization methods: Adjusting the learning rate
- Initialization: Using particular forms of initialization

Neural networks can be regularized in much the same way as linear models. Their weight matrices can be penalized with the Frobenius norm, which is a straightforward extension of the L2 norm to matrices. In fact, in many cases it is simply referred to as L2 regularization.
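$$L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \lVert \Theta^{(z \to y)} \rVert_F^2 + \lambda_{x \to z} \lVert \Theta^{(x \to z)} \rVert_F^2$$
where $\lVert \Theta \rVert_F^2 = \sum_{i,j} \theta_{i,j}^2$ is the squared Frobenius norm, which generalizes the L2 norm to matrices. The bias parameters $b$ are not regularized, as they do not contribute to the sensitivity of the classifier to the inputs.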
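As a minimal sketch of the regularized objective above, the following pure-Python snippet computes the squared Frobenius norm of two small weight matrices and adds the weighted penalties to a data loss. The matrices, the two λ values, and the data-loss value are made-up illustration numbers, and the helper name `squared_frobenius` is hypothetical.

```python
def squared_frobenius(matrix):
    """Sum of squared entries of a matrix given as a list of rows."""
    return sum(value ** 2 for row in matrix for value in row)

# Illustrative values only (not from the lecture).
theta_x_to_z = [[1.0, -2.0], [0.5, 0.0]]  # input-to-hidden weights
theta_z_to_y = [[3.0, 1.0]]               # hidden-to-output weights
lambda_x_to_z = 0.01
lambda_z_to_y = 0.1
data_loss = 2.0  # stands in for the sum of per-example losses

# Only the weight matrices are penalized; biases are left out.
penalty = (lambda_z_to_y * squared_frobenius(theta_z_to_y)
           + lambda_x_to_z * squared_frobenius(theta_x_to_z))
total_loss = data_loss + penalty
```

Using a separate λ per layer mirrors the two penalty terms in the loss; a single shared value would work the same way.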

L2 regularization
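- Compute the gradient of a loss with L2 regularization:
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\theta$$
(the factor of 2 from differentiating the squared norm is absorbed into $\lambda$)
- Update the weights:
$$\theta = \theta - \eta \left( \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\theta \right)$$
- "Weight decay factor": $\lambda$ is a tunable hyperparameter that pulls a weight back when it has become too big.
- Question: Does it matter which layer $\theta$ is from when computing the regularization term?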
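The L2 update can be sketched for a single scalar weight as follows. This is an illustrative sketch only: the function name `l2_update` and all numeric values are assumptions, and `weight_decay` plays the role of λ.

```python
def l2_update(theta, grad_loss, lr, weight_decay):
    """One gradient step on a single weight with an L2 penalty.

    grad_loss is the summed data-loss gradient for this weight;
    weight_decay plays the role of lambda.
    """
    return theta - lr * (grad_loss + weight_decay * theta)

# With a zero data gradient the update only shrinks the weight,
# and a larger weight is pulled back by a larger amount.
big = l2_update(theta=10.0, grad_loss=0.0, lr=0.1, weight_decay=0.5)
small = l2_update(theta=0.1, grad_loss=0.0, lr=0.1, weight_decay=0.5)
```

Here the large weight moves by 0.5 while the small weight moves by only 0.005, which is the "pull back" behaviour of the weight decay factor.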

L1 regularization
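- L1 regularization loss:
$$L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \lVert \Theta^{(z \to y)} \rVert_1 + \lambda_{x \to z} \lVert \Theta^{(x \to z)} \rVert_1$$
- Compute the gradient:
$$\frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\,\mathrm{sign}(\theta)$$
- Update the weights:
$$\theta = \theta - \eta \left( \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\,\mathrm{sign}(\theta) \right)$$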
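A scalar sketch of the L1 update, under the same assumptions as any toy example: the names `sign` and `l1_update` and the numeric values are hypothetical, and `l1_strength` stands for λ.

```python
def sign(value):
    """Return -1.0, 0.0 or 1.0 according to the sign of value."""
    return float((value > 0) - (value < 0))

def l1_update(theta, grad_loss, lr, l1_strength):
    """One gradient step on a single weight with an L1 penalty."""
    return theta - lr * (grad_loss + l1_strength * sign(theta))

# With a zero data gradient, both weights move toward zero by the
# same fixed amount (lr * l1_strength), regardless of their size.
positive = l1_update(theta=10.0, grad_loss=0.0, lr=0.1, l1_strength=0.5)
negative = l1_update(theta=-0.2, grad_loss=0.0, lr=0.1, l1_strength=0.5)
```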

Comparison of L1 and L2
- In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount that is proportional to $\theta$.
- When a particular weight has a large absolute value $|\theta|$, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when $|\theta|$ is small, L1 regularization shrinks the weight much more than L2 regularization.
- The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero. So L1 regularization effectively does feature selection.
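To make the comparison concrete, this small sketch prints the size of the regularization pull for a large, a medium, and a small weight. The learning rate and penalty strength are arbitrary illustration values, shared by both penalties only so the numbers are comparable.

```python
lr = 0.1
strength = 0.5  # same lambda for both penalties, for comparison only

def l1_shrink(theta):
    """Magnitude of the L1 pull toward zero: constant for any nonzero weight."""
    return lr * strength if theta != 0 else 0.0

def l2_shrink(theta):
    """Magnitude of the L2 pull toward zero: proportional to |theta|."""
    return lr * strength * abs(theta)

for theta in (10.0, 1.0, 0.01):
    print(f"theta={theta:>5}: L1 shrink={l1_shrink(theta):.4f}, "
          f"L2 shrink={l2_shrink(theta):.4f}")
```

With these values the two penalties pull equally at |θ| = 1; above that L2 pulls harder, and below it L1 does.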

Dropout
- Dropout randomly drops a certain percentage of the nodes to prevent over-reliance on a few features or hidden units, and to prevent feature co-adaptation, where some features are only useful when working together with a few other features. The ultimate goal is to avoid overfitting.

- Dropout can be achieved using a mask:
$$z^{(1)} = g_1(\Theta^{(1)} x + b_1)$$
$$m_1 \sim \mathrm{Bernoulli}(r_1)$$
$$\tilde{z}^{(1)} = m_1 \odot z^{(1)}$$
$$z^{(2)} = g_2(\Theta^{(2)} \tilde{z}^{(1)} + b_2)$$
$$m_2 \sim \mathrm{Bernoulli}(r_2)$$
$$\tilde{z}^{(2)} = m_2 \odot z^{(2)}$$
$$y = \Theta^{(3)} \tilde{z}^{(2)}$$
where $m_1$ and $m_2$ are mask vectors. The values of the elements in these vectors are either 1 or 0, drawn from a Bernoulli distribution with parameter $r$ (usually $r = 0.5$).
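The masking step can be sketched in plain Python as below. The helper names `dropout_mask` and `apply_mask`, the activation values, and the fixed seed are assumptions for illustration; `keep_prob` corresponds to the Bernoulli parameter r, the probability that a mask entry is 1.

```python
import random

def dropout_mask(size, keep_prob, rng):
    """Draw a 0/1 mask; each entry is 1 with probability keep_prob."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(size)]

def apply_mask(mask, activations):
    """Elementwise product of the mask and the hidden activations."""
    return [m * z for m, z in zip(mask, activations)]

rng = random.Random(0)      # fixed seed so the sketch is repeatable
z1 = [0.3, -1.2, 0.8, 2.0]  # made-up hidden activations
m1 = dropout_mask(len(z1), keep_prob=0.5, rng=rng)
z1_tilde = apply_mask(m1, z1)  # dropped units become 0, kept units pass through
```

A fresh mask would be drawn for every layer and every forward pass, as in the equations for $m_1$ and $m_2$.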

Optimization methods
- SGD with Momentum
- AdaGrad
- Root Mean Square Prop (RMSProp)
- Adam

SGD with Momentum
- At each timestep $t$, compute $\nabla_\theta L$, and then compute the momentum as follows:
$$V_0 = 0, \quad \beta \approx 0.9$$
$$V_t = \beta V_{t-1} + (1 - \beta) \nabla_\theta L$$
$$\theta_j = \theta_j - \eta V_t$$
- The momentum term increases updates for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction.
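The momentum update can be sketched on a one-dimensional toy problem. The loss L(θ) = θ², the learning rate of 0.1, the starting point, and the function name `momentum_step` are all assumptions made for illustration; β = 0.9 follows the value given above.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One SGD-with-momentum step for a single weight."""
    velocity = beta * velocity + (1 - beta) * grad  # running average of gradients
    theta = theta - lr * velocity                   # step along the averaged direction
    return theta, velocity

# Minimise the toy loss L(theta) = theta**2, whose gradient is 2 * theta.
theta, velocity = 5.0, 0.0
for _ in range(200):
    theta, velocity = momentum_step(theta, velocity, grad=2 * theta)
```

After the loop, θ has moved from 5.0 to a value close to the minimum at 0.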

AdaGrad
- Keep a running sum of the squared gradients, $V_t$, for each weight. When updating that weight, divide the gradient by the square root of this term:
$$V_0 = 0$$
$$V_t = V_{t-1} + (\nabla_\theta L)^2$$
$$\theta_j = \theta_j - \eta \frac{\nabla_\theta L}{\sqrt{V_t} + \epsilon}$$
e.g., $\epsilon = 10^{-8}$
- The net effect is to slow down the updates for weights with large gradients and accelerate the updates for weights with small gradients.
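A scalar sketch of the AdaGrad update follows. The function name `adagrad_step`, the learning rate of 0.1, and the gradient values are illustrative assumptions; ε is added outside the square root here, which is one common convention.

```python
import math

def adagrad_step(theta, sq_sum, grad, lr=0.1, eps=1e-8):
    """One AdaGrad step for a single weight."""
    sq_sum = sq_sum + grad ** 2                       # running sum of squared gradients
    theta = theta - lr * grad / (math.sqrt(sq_sum) + eps)
    return theta, sq_sum

# Weights with very different gradient scales take nearly the same first step.
first_big, _ = adagrad_step(theta=0.0, sq_sum=0.0, grad=100.0)
first_small, _ = adagrad_step(theta=0.0, sq_sum=0.0, grad=0.01)

# With a constant gradient, the step size keeps shrinking as the sum grows.
theta, sq_sum = 0.0, 0.0
steps = []
for _ in range(3):
    new_theta, sq_sum = adagrad_step(theta, sq_sum, grad=1.0)
    steps.append(theta - new_theta)
    theta = new_theta
```

The shrinking entries in `steps` illustrate why the accumulated sum can eventually make AdaGrad's updates very small.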

Root Mean Square Prop (RMSProp)
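- A minor adjustment of AdaGrad. Instead of letting the sum of squared gradients grow continuously, we let the sum decay:
$$V_0 = 0$$
$$V_t = \beta V_{t-1} + (1 - \beta)(\nabla_\theta L)^2$$
$$\theta_j = \theta_j - \eta \frac{\nabla_\theta L}{\sqrt{V_t} + \epsilon}$$
e.g., $\beta \approx 0.9$, $\eta = 0.001$, $\epsilon = 10^{-8}$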
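The RMSProp update can be sketched for one weight as below, using the example hyperparameter values from this slide as defaults. The function name `rmsprop_step` and the constant gradient of 1.0 are assumptions for illustration.

```python
import math

def rmsprop_step(theta, sq_avg, grad, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step for a single weight."""
    sq_avg = beta * sq_avg + (1 - beta) * grad ** 2   # decaying average, not a sum
    theta = theta - lr * grad / (math.sqrt(sq_avg) + eps)
    return theta, sq_avg

# With a constant gradient of 1.0 the average levels off near 1
# instead of growing without bound as AdaGrad's sum would.
theta, sq_avg = 0.0, 0.0
for _ in range(100):
    theta, sq_avg = rmsprop_step(theta, sq_avg, grad=1.0)
```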

Adaptive Moment Estimation (Adam)
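- Weight update at time step $t$ for Adam:
$$V_0 = 0, \quad S_0 = 0$$
$$V_t = \beta_1 V_{t-1} + (1 - \beta_1) \nabla_\theta L$$
$$S_t = \beta_2 S_{t-1} + (1 - \beta_2)(\nabla_\theta L)^2$$
$$V_t^{\text{corrected}} = \frac{V_t}{1 - \beta_1^t}$$
$$S_t^{\text{corrected}} = \frac{S_t}{1 - \beta_2^t}$$
$$\theta_j = \theta_j - \eta \frac{V_t^{\text{corrected}}}{\sqrt{S_t^{\text{corrected}}} + \epsilon}$$
- Adam combines Momentum (the first-moment average $V_t$) and RMSProp (the second-moment average $S_t$).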
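A scalar sketch of one Adam step is shown below. The function name `adam_step` is hypothetical, and the defaults β₁ = 0.9, β₂ = 0.999, η = 0.001, ε = 1e-8 are commonly used values assumed here rather than taken from the slide.

```python
import math

def adam_step(theta, v, s, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a single weight; t counts steps starting from 1."""
    v = beta1 * v + (1 - beta1) * grad        # momentum-style average of gradients
    s = beta2 * s + (1 - beta2) * grad ** 2   # RMSProp-style average of squares
    v_hat = v / (1 - beta1 ** t)              # bias correction for the zero start
    s_hat = s / (1 - beta2 ** t)
    theta = theta - lr * v_hat / (math.sqrt(s_hat) + eps)
    return theta, v, s

theta, v, s = 1.0, 0.0, 0.0
theta, v, s = adam_step(theta, v, s, grad=4.0, t=1)
```

On the first step the bias correction recovers the raw gradient and its square, so the weight moves by almost exactly the learning rate even though the gradient is 4.0.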

Xavier Initialization:
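$$\Theta \sim U\left[ -\sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}},\ \sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}} \right]$$
where $n^{(l)}$ is the number of input units to $\Theta$ (fan-in) and $n^{(l+1)}$ is the number of output units from $\Theta$ (fan-out).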
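The sampling rule above can be sketched in plain Python. The function name `xavier_uniform`, the layer sizes, and the fixed seed are assumptions for illustration, and the matrix is stored as a list of `fan_out` rows with `fan_in` entries each.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U[-limit, limit]."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = xavier_uniform(fan_in=25, fan_out=3, rng=rng)
limit = math.sqrt(6.0 / (25 + 3))  # about 0.46 for this layer
```

In practice one would likely use a library initializer rather than hand-rolled sampling; PyTorch provides `torch.nn.init.xavier_uniform_` for this purpose.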

Neural net in PyTorch
    from torch import nn


    class Net(nn.Module):
        """Subclassing nn.Module is important for
        inspecting the parameters."""

        def __init__(self, in_dim=25, out_dim=3, batch_size=1):
            super(Net, self).__init__()
            self.in_dim = in_dim
            self.out_dim = out_dim
            self.linear = nn.Linear(self.in_dim, self.out_dim)
            self.softmax = nn.Softmax(dim=1)

        def forward(self, input_matrix):
            logit = self.linear(input_matrix)
            # return raw score, not normalized score
            return logit

        def xtropy_loss(self, input_matrix, target_label_vec):
            loss = nn.CrossEntropyLoss()
            logits = self.forward(input_matrix)
            return loss(logits, target_label_vec)

Use optimizers in PyTorch
    from operator import itemgetter

    import torch
    import torch.optim as optim

    # batchize, vectorize_batch, train_data, feat_index, label_index,
    # input_dim, output_dim, lrate, epochs, batch_size and net_path
    # are assumed to be defined elsewhere.
    net = Net(input_dim, output_dim)
    optimizer = optim.Adam(net.parameters(), lr=lrate)
    for epoch in range(epochs):
        total_nll = 0
        for batch in batchize(train_data, batch_size):
            optimizer.zero_grad()  # zero out the gradient
            vectorized = vectorize_batch(batch, feat_index, label_index)
            feat_vec = map(itemgetter(0), vectorized)
            label_vec = map(itemgetter(1), vectorized)
            feat_list = list(feat_vec)
            label_list = list(label_vec)
            x = torch.Tensor(feat_list)
            y = torch.LongTensor(label_list)
            loss = net.xtropy_loss(x, y)
            # .item() keeps a plain float rather than the whole graph
            total_nll += loss.item()
            loss.backward()
            optimizer.step()
    torch.save(net.state_dict(), net_path)