
Bells and whistles in neural net training

Tricks in training neural networks
There are various tricks that people use when training neural networks:
- Regularization: Adjusting the gradient
- Dropout: Adjusting the hidden units
- Optimization methods: Adjusting the learning rate
- Initialization: Using particular forms of initialization

Regularization
Neural networks can be regularized in a similar way to linear models. Neural networks can also be regularized with the Frobenius norm, which is a straightforward extension of the L2 norm to matrices; in fact, in many cases it is simply referred to as L2 regularization.

$$ L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \|\Theta^{(z \to y)}\|_F^2 + \lambda_{x \to z} \|\Theta^{(x \to z)}\|_F^2 $$

where $\|\Theta\|_F^2 = \sum_{i,j} \theta_{i,j}^2$ is the squared Frobenius norm, which generalizes the L2 norm to matrices. The bias parameters $b$ are not regularized, as they do not contribute to the sensitivity of the classifier to the inputs.

L2 regularization
- Compute the gradient of the loss with L2 regularization:

$$ \frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda \theta $$

- Update the weights:

$$ \theta = \theta - \eta \left( \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda \theta \right) $$

- "Weight decay factor": $\lambda$ is a tunable hyperparameter that pulls a weight back when it has become too big.
- Question: Does it matter which layer $\theta$ is from when computing the regularization term?
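Below is a minimal sketch, not from the slides, of a single gradient-descent step with the L2 penalty added; the names theta, grad_loss, lam, and lr and their values are purely illustrative.

import torch

# Illustrative weight matrix and gradient of the unregularized loss
theta = torch.randn(25, 3)
grad_loss = torch.randn(25, 3)   # stands in for sum_i d ell^(i) / d theta

lam = 1e-4   # weight decay factor lambda (tunable hyperparameter)
lr = 0.1     # learning rate eta

# L2-regularized gradient: gradient of the data loss plus lambda * theta
grad = grad_loss + lam * theta

# Gradient-descent update; large weights get pulled back toward zero
theta = theta - lr * grad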

L1 regularization
- L1 regularization loss:

$$ L = \sum_{i=1}^{N} \ell^{(i)} + \lambda_{z \to y} \|\Theta^{(z \to y)}\|_1 + \lambda_{x \to z} \|\Theta^{(x \to z)}\|_1 $$

- Compute the gradient:

$$ \frac{\partial L}{\partial \theta} = \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\, \mathrm{sign}(\theta) $$

- Update the weights:

$$ \theta = \theta - \eta \left( \sum_{i=1}^{N} \frac{\partial \ell^{(i)}}{\partial \theta} + \lambda\, \mathrm{sign}(\theta) \right) $$
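A matching sketch for the L1 update, again with illustrative names and values; the only change from the L2 version is the lambda * sign(theta) term.

import torch

theta = torch.randn(25, 3)
grad_loss = torch.randn(25, 3)   # gradient of the unregularized loss

lam = 1e-4   # L1 regularization strength lambda
lr = 0.1     # learning rate eta

# L1-regularized gradient: add lambda * sign(theta)
grad = grad_loss + lam * torch.sign(theta)

# Each nonzero weight is pushed toward zero by a constant-size step lr * lam
theta = theta - lr * grad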

Comparison of L1 and L2
- In L1 regularization, the weights shrink by a constant amount toward 0. In L2 regularization, the weights shrink by an amount that is proportional to the weight $\theta$.
- When a particular weight has a large absolute value $|\theta|$, L1 regularization shrinks the weight much less than L2 regularization does. By contrast, when $|\theta|$ is small, L1 regularization shrinks the weight much more than L2 regularization does.
- The net result is that L1 regularization tends to concentrate the weight of the network in a relatively small number of high-importance connections, while the other weights are driven toward zero. So L1 regularization effectively performs feature selection.
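A tiny numeric illustration of this difference, using made-up values for lambda and the learning rate:

lam, lr = 0.01, 0.1

for theta in (5.0, 0.05):               # one large weight, one small weight
    l1_shrink = lr * lam                # constant-size pull toward zero
    l2_shrink = lr * lam * theta        # pull proportional to theta
    print(f"theta={theta}: L1 shrink={l1_shrink:.5f}, L2 shrink={l2_shrink:.5f}")

# theta=5.0  -> L1 shrinks by 0.00100, L2 shrinks by 0.00500 (L2 shrinks more)
# theta=0.05 -> L1 shrinks by 0.00100, L2 shrinks by 0.00005 (L1 shrinks more)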

Dropout
- Randomly drops a certain percentage of the nodes to prevent over-reliance on a few features or hidden units, or feature co-adaptation, where some features are only useful when working together with a few other features. The ultimate goal is to avoid overfitting.

Dropout
- Dropout can be achieved using a mask:

$$ z^{(1)} = g^{1}(\Theta^{(1)} x + b^{1}) $$
$$ m^{1} \sim \mathrm{Bernoulli}(r^{1}) $$
$$ \tilde{z}^{(1)} = m^{1} \odot z^{(1)} $$
$$ z^{(2)} = g^{2}(\Theta^{(2)} \tilde{z}^{(1)} + b^{2}) $$
$$ m^{2} \sim \mathrm{Bernoulli}(r^{2}) $$
$$ \tilde{z}^{(2)} = m^{2} \odot z^{(2)} $$
$$ y = \Theta^{(3)} \tilde{z}^{(2)} $$

where $m^{1}$ and $m^{2}$ are mask vectors whose elements are either 1 or 0, drawn from a Bernoulli distribution with parameter $r$ (usually $r = 0.5$), and $\odot$ denotes element-wise multiplication.
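A minimal sketch of the masking step in PyTorch; r, z1, and m1 follow the notation above, and the hidden layer here is just a random stand-in.

import torch

r = 0.5                                # Bernoulli parameter: probability a unit is kept
z1 = torch.relu(torch.randn(1, 100))   # stand-in for the hidden layer z^(1)

m1 = torch.bernoulli(torch.full_like(z1, r))   # mask vector of 0s and 1s
z1_tilde = m1 * z1                             # masked hidden layer, element-wise product

# At test time no units are dropped; activations are typically rescaled by r
# (or "inverted dropout" rescales by 1/r during training instead).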

Optimization methods
- SGD with Momentum
- AdaGrad
- Root Mean Square Prop (RMSProp)
- Adam

SGD with Momentum
- At each time step $t$, compute $\nabla_\theta L$, and then compute the momentum as follows:

$$ V_0 = 0, \quad \beta \approx 0.9 $$
$$ V_t = \beta V_{t-1} + (1 - \beta)\, \nabla_\theta L $$
$$ \theta_j = \theta_j - \eta V_t $$

- The momentum term grows for dimensions whose gradients point in the same direction and reduces the updates for dimensions whose gradients change direction.
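A minimal sketch of this update rule on a toy objective (minimizing ||theta||^2); beta, lr, and the gradient are illustrative.

import torch

beta, lr = 0.9, 0.1
theta = torch.randn(10)
V = torch.zeros_like(theta)              # V_0 = 0

for t in range(100):
    grad = 2 * theta                     # toy gradient of ||theta||^2
    V = beta * V + (1 - beta) * grad     # exponential moving average of gradients
    theta = theta - lr * V               # update using the momentum term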

AdaGrad
- Keep a running sum of the squared gradient. When updating a weight $\theta_j$, divide the gradient by the square root of this sum:

$$ V_0 = 0 $$
$$ V_t = V_{t-1} + (\nabla_\theta L)^2 $$
$$ \theta_j = \theta_j - \eta \frac{\nabla_\theta L}{\sqrt{V_t} + \epsilon}, \quad \text{e.g. } \epsilon = 10^{-8} $$

- The net effect is to slow down the updates for weights with large gradients and to speed up the updates for weights with small gradients.
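A minimal AdaGrad sketch on the same toy objective; all names and values are illustrative.

import torch

lr, eps = 0.1, 1e-8
theta = torch.randn(10)
V = torch.zeros_like(theta)                        # running sum of squared gradients

for t in range(100):
    grad = 2 * theta                               # toy gradient
    V = V + grad ** 2                              # accumulate the squared gradient
    theta = theta - lr * grad / (torch.sqrt(V) + eps)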

Root Mean Square Prop (RMSProp)
- A minor adjustment of AdaGrad: instead of letting the sum of squared gradients grow without bound, we let it decay:

$$ V_0 = 0 $$
$$ V_t = \beta V_{t-1} + (1 - \beta)(\nabla_\theta L)^2 $$
$$ \theta_j = \theta_j - \eta \frac{\nabla_\theta L}{\sqrt{V_t} + \epsilon}, \quad \text{e.g. } \beta \approx 0.9,\ \eta = 0.001,\ \epsilon = 10^{-8} $$
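The corresponding RMSProp sketch; the only change from the AdaGrad sketch above is the decayed accumulator.

import torch

beta, lr, eps = 0.9, 0.001, 1e-8
theta = torch.randn(10)
V = torch.zeros_like(theta)                        # decaying average of squared gradients

for t in range(1000):
    grad = 2 * theta                               # toy gradient
    V = beta * V + (1 - beta) * grad ** 2          # decay instead of a raw running sum
    theta = theta - lr * grad / (torch.sqrt(V) + eps)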

Adaptive Moment Estimation (Adam)
- Weight update at time step $t$ for Adam:

$$ V_0 = 0, \quad S_0 = 0 $$
$$ V_t = \beta_1 V_{t-1} + (1 - \beta_1)\, \nabla_\theta L $$
$$ S_t = \beta_2 S_{t-1} + (1 - \beta_2)(\nabla_\theta L)^2 $$
$$ V_t^{\text{corrected}} = \frac{V_t}{1 - \beta_1^t}, \quad S_t^{\text{corrected}} = \frac{S_t}{1 - \beta_2^t} $$
$$ \theta_j = \theta_j - \eta \frac{V_t^{\text{corrected}}}{\sqrt{S_t^{\text{corrected}}} + \epsilon} $$

- Adam combines Momentum (the $V_t$ term) and RMSProp (the $S_t$ term).
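A minimal Adam sketch following the update rule above; the names and the toy gradient are illustrative.

import torch

beta1, beta2, lr, eps = 0.9, 0.999, 0.001, 1e-8
theta = torch.randn(10)
V = torch.zeros_like(theta)   # first moment (the Momentum part)
S = torch.zeros_like(theta)   # second moment (the RMSProp part)

for t in range(1, 1001):
    grad = 2 * theta                                  # toy gradient
    V = beta1 * V + (1 - beta1) * grad
    S = beta2 * S + (1 - beta2) * grad ** 2
    V_hat = V / (1 - beta1 ** t)                      # bias correction
    S_hat = S / (1 - beta2 ** t)
    theta = theta - lr * V_hat / (torch.sqrt(S_hat) + eps)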

Initialization
Xavier Initialization:
$$ \Theta \sim U\left[ -\sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}},\ \sqrt{\frac{6}{n^{(l)} + n^{(l+1)}}} \right] $$

where $n^{(l)}$ is the number of input units to $\Theta$ (fan-in) and $n^{(l+1)}$ is the number of output units from $\Theta$ (fan-out).
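A minimal sketch of drawing a weight matrix with Xavier (Glorot) uniform initialization; the layer sizes here are illustrative.

import math
import torch

fan_in, fan_out = 25, 3                       # n^(l) and n^(l+1)
bound = math.sqrt(6.0 / (fan_in + fan_out))
theta = torch.empty(fan_out, fan_in).uniform_(-bound, bound)

# PyTorch also provides this directly, for any nn.Linear layer `layer`:
# torch.nn.init.xavier_uniform_(layer.weight)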

Neural net in PyTorch
from torch import nn

class Net(nn.Module):
    """Subclassing nn.Module is important for inspecting the parameters."""

    def __init__(self, in_dim=25, out_dim=3, batch_size=1):
        super(Net, self).__init__()
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.linear = nn.Linear(self.in_dim, self.out_dim)
        self.softmax = nn.Softmax(dim=1)

    def forward(self, input_matrix):
        logit = self.linear(input_matrix)
        # return the raw scores (logits), not normalized scores
        return logit

    def xtropy_loss(self, input_matrix, target_label_vec):
        loss = nn.CrossEntropyLoss()
        logits = self.forward(input_matrix)
        return loss(logits, target_label_vec)

Use optimizers in PyTorch
import torch
import torch.optim as optim
from operator import itemgetter

net = Net(input_dim, output_dim)
optimizer = optim.Adam(net.parameters(), lr=lrate)
for epoch in range(epochs):
    total_nll = 0
    for batch in batchize(train_data, batch_size):
        optimizer.zero_grad()  # zero out the gradient
        vectorized = vectorize_batch(batch, feat_index, label_index)
        feat_vec = map(itemgetter(0), vectorized)
        label_vec = map(itemgetter(1), vectorized)
        feat_list = list(feat_vec)
        label_list = list(label_vec)
        x = torch.Tensor(feat_list)
        y = torch.LongTensor(label_list)
        loss = net.xtropy_loss(x, y)
        total_nll += loss.item()  # .item() accumulates a plain number, not the graph
        loss.backward()
        optimizer.step()
torch.save(net.state_dict(), net_path)