
2b: Cross Entropy, Softmax, Weight Decay and
Momentum

Cross Entropy and Softmax

Loss Functions

In Week 1 we introduced the sum squared error (SSE) loss function, which is suitable for function
approximation tasks.

$$E = \frac{1}{2} \sum_i (t_i - z_i)^2$$

However, for binary classification tasks, where the target output is either zero or one, it may be
more logical to use the cross entropy error:

$$E = \sum_i \bigl( -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr)$$

In order to explain the motivation for both of these loss functions, we need to introduce the
mathematical concept of Maximum Likelihood.

Maximum Likelihood

Let $H$ be a class of hypotheses for predicting certain data $D$.

Let $\text{Prob}(D \mid h)$ be the probability of data $D$ being generated under hypothesis $h \in H$.

The logarithm of this probability, $\log \text{Prob}(D \mid h)$, is called the log likelihood.

The Maximum Likelihood Principle states that we should choose $h \in H$ which maximizes this
likelihood, i.e. maximizes $\text{Prob}(D \mid h)$ or, equivalently, maximizes $\log \text{Prob}(D \mid h)$.

In our case, the data $D$ consist of a target value $t_i$ for each set of input features $x_i$ in a Supervised
Learning task, and we can think of each hypothesis $h$ as a function $f()$ determined by a neural
network with specified weights or, to give a simpler example, $f()$ could be a straight line with a
specified slope and $y$-intercept.

Derivation of Least Squares

As previously mentioned, noise in the data is often caused by an accumulation of small errors due to
factors which are not captured by the model. The Central Limit Theorem tells us that when a large
number of independent random variables are added together, the combined error is well
approximated by a Gaussian distribution.

In order to accommodate this kind of noise, let's suppose that our data $D = \{t_i\}$ are generated by a
linear function $f()$ plus noise drawn from a Gaussian distribution with mean zero and standard
deviation $\sigma$. Then

$$\text{Prob}(D \mid h) = \text{Prob}(\{t_i\} \mid f) = \prod_i \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\bigl(t_i - f(x_i)\bigr)^2}$$

$$\log \text{Prob}(\{t_i\} \mid f) = \sum_i \Bigl( -\frac{1}{2\sigma^2}\bigl(t_i - f(x_i)\bigr)^2 - \log(\sigma) - \frac{1}{2}\log(2\pi) \Bigr)$$

$$f_{ML} = \operatorname*{argmax}_{f \in H} \log \text{Prob}(\{t_i\} \mid f) = \operatorname*{argmin}_{f \in H} \frac{1}{2} \sum_i \bigl(t_i - f(x_i)\bigr)^2$$

It is interesting to note that we do not need to know the value of $\sigma$.
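As a quick numerical check of this (not part of the original notes; the data, $\sigma$, and variable names below are made up for illustration), the following NumPy sketch scans a grid of candidate slopes and confirms that the slope which maximizes the Gaussian log likelihood is exactly the one which minimizes the sum squared error:

```python
# Sanity check: for noisy linear data, argmax of the Gaussian log
# likelihood over candidate slopes equals argmin of the SSE.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)   # true slope 2, noise sigma 0.3

slopes = np.linspace(0.0, 4.0, 401)                # candidate hypotheses f(x) = a*x
sse = np.array([0.5 * np.sum((t - a * x) ** 2) for a in slopes])

sigma = 0.3
log_lik = np.array([
    np.sum(-0.5 * (t - a * x) ** 2 / sigma**2
           - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    for a in slopes
])

# The two criteria select the same slope.
print(slopes[np.argmin(sse)], slopes[np.argmax(log_lik)])
```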

Derivation of Cross Entropy

For binary classification tasks, the target value $t_i$ is either $0$ or $1$. It makes sense to interpret the
output $f(x_i)$ of the neural network as the probability of the true value being $1$, i.e.

$$\text{Prob}(1 \mid f(x_i)) = f(x_i)$$

$$\text{Prob}(0 \mid f(x_i)) = 1 - f(x_i)$$

$$\text{Prob}(t_i \mid f(x_i)) = f(x_i)^{t_i}\, \bigl(1 - f(x_i)\bigr)^{1 - t_i}$$

Then

$$\log \text{Prob}(\{t_i\} \mid f) = \sum_i \Bigl( t_i \log f(x_i) + (1 - t_i) \log\bigl(1 - f(x_i)\bigr) \Bigr)$$

So, according to the Maximum Likelihood principle, we need to maximize the expression on the right
hand side (or minimize its negative).

Cross Entropy loss is often used in combination with sigmoid activation at the output node, which
guarantees that the output is strictly between $0$ and $1$, and also makes the backprop computations a
bit simpler, as follows:

$$E = \sum_i \bigl( -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr)$$

$$\frac{\partial E}{\partial z_i} = -\frac{t_i}{z_i} + \frac{1 - t_i}{1 - z_i} = \frac{z_i - t_i}{z_i (1 - z_i)}$$

If $z_i = \dfrac{1}{1 + e^{-s_i}}$, then $\dfrac{\partial z_i}{\partial s_i} = z_i (1 - z_i)$, so

$$\frac{\partial E}{\partial s_i} = \frac{\partial E}{\partial z_i}\, \frac{\partial z_i}{\partial s_i} = z_i - t_i$$
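The following short sketch (my own, not from the notes) verifies this result numerically, comparing $z_i - t_i$ against a finite-difference estimate of $\partial E / \partial s_i$:

```python
# Numerical check that sigmoid + cross entropy gives dE/ds = z - t.
import numpy as np

def loss(s, t):
    z = 1.0 / (1.0 + np.exp(-s))                   # sigmoid activation
    return -t * np.log(z) - (1 - t) * np.log(1 - z)

s, t = 0.7, 1.0
z = 1.0 / (1.0 + np.exp(-s))

eps = 1e-6
numeric = (loss(s + eps, t) - loss(s - eps, t)) / (2 * eps)
print(numeric, z - t)                              # the two values agree closely
```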

SSE and cross entropy behave a bit differently when it comes to outliers. SSE is more likely to allow
outliers in the training set to be misclassified, because the contribution to the loss function from each
item is bounded between $0$ and $1$. Cross Entropy is more likely to keep outliers correctly classified,
because the loss function grows logarithmically as the difference between the target and actual value
approaches $1$. For this reason, Cross Entropy works particularly well for classification tasks that are
unbalanced, in the sense of negative items vastly outnumbering positive ones (or vice versa).
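The difference is easy to see numerically. In this illustrative snippet (the values are arbitrary), a positive item ($t = 1$) is pushed towards $z = 0$; the SSE contribution stays bounded while the cross entropy contribution grows without limit:

```python
# SSE vs cross entropy for a single badly misclassified positive item.
import numpy as np

t = 1.0
for z in [0.5, 0.1, 0.01, 0.001]:
    sse = 0.5 * (t - z) ** 2                          # saturates as z -> 0
    ce = -t * np.log(z) - (1 - t) * np.log(1 - z)     # diverges as z -> 0
    print(f"z={z:6.3f}  SSE={sse:5.3f}  CE={ce:6.3f}")
```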

Log Softmax

Some Supervised Learning tasks require data to be classified into more than two classes. If the
number of classes is $N$ and we have a neural network with $N$ outputs $z_1, \ldots, z_N$, we can make the
assumption that the network's estimate for the probability of class $i$ is proportional to $\exp(z_i)$.

Because the probabilities must add up to $1$, we need to normalize $\exp(z_j)$ by dividing by their sum:

$$\text{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

$$\log \text{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)$$

If the correct class is $k$, we can treat $-\log \text{Prob}(k)$ as our loss function, and the gradient is

$$\frac{d}{dz_i} \log \text{Prob}(k) = \delta_{ik} - \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = \delta_{ik} - \text{Prob}(i)$$

where $\delta_{ik}$ is the Kronecker delta ($1$ if $i = k$, $0$ if $i \neq k$). The first term pushes up the correct class in proportion to the
distance of its assigned probability from $1$, while the second term pushes down the incorrect classes
in proportion to the probabilities assigned to them.

Softmax compared to Boltzmann Distribution and Sigmoid

If you have studied mathematics or physics, you may be interested to know that Softmax is related to
the Boltzmann Distribution, with the negative of output $z_i$ playing the role of the "energy" for "state"
$i$. The familiar Sigmoid function can also be seen as a special case of Softmax, with two classes and
one output, as follows. Consider a simplified case where there is a choice between two classes, Class
$0$ and Class $1$. We consider the output $z$ of the network to be associated with Class $1$, and we imagine
a fixed "output" for Class $0$ which is always equal to zero. In this case, the softmax becomes:

$$\text{Prob}(1) = \frac{e^z}{e^0 + e^z} = \frac{1}{1 + e^{-z}}$$

Further Reading

Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):

Maximum Likelihood (5.5)

Cross Entropy (3.13)

Softmax (6.2.2)

Exercise: Softmax and Backpropagation

Recall that the formula for Softmax is:

$$\text{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}, \qquad \log \text{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)$$

Consider a classification task with three classes $1$, $2$, $3$. Suppose a particular input is presented,
producing outputs:

$$z_1 = 1, \quad z_2 = 2, \quad z_3 = 3$$

and that the correct class is $2$.

Question 1

Compute each of the following, to two decimal places:

$\text{Prob}(1)$, $\text{Prob}(2)$, $\text{Prob}(3)$

Question 2

Compute each of the following, to two decimal places:

$d(\log \text{Prob}(2))/dz_1$, $d(\log \text{Prob}(2))/dz_2$, $d(\log \text{Prob}(2))/dz_3$

Question 3

Consider a degenerate case of supervised learning where the training set consists of just a single
input, repeated $100$ times. In $80$ of the $100$ cases, the target output value is $1$; in the other $20$, it is $0$.
What will a back-propagation neural network predict for this example, assuming that it has been
trained and reaches a global minimum? Does it make a difference whether the loss function is sum
squared error or cross entropy? (Hint: to find the global minimum, differentiate the loss function and
set the derivative to zero.)

Weight Decay and Momentum

Weight Decay

Sometimes a penalty term is added to the loss function which encourages the neural network weights
$w_j$ to remain small:

$$E = \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2$$

(Note: the sum squared error term may be replaced with cross entropy or softmax).

This additional loss term prevents the weights from "saturating" to very high values. It is sometimes
referred to as "elastic weights" because it simulates a force on each weight as if there were a spring
pulling it back towards the origin according to Hooke's Law. The scaling factor $\lambda$ needs to be
determined empirically.
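In gradient descent terms, the penalty contributes an extra $\lambda w_j$ to each gradient, so every update shrinks each weight slightly towards zero before the data gradient is applied; hence the name "weight decay". A minimal sketch of one such update (the numbers are placeholders, not from the notes):

```python
# One gradient descent step with an L2 (weight decay) penalty.
import numpy as np

lam, eta = 0.01, 0.1
w = np.array([1.5, -2.0, 0.3])
data_grad = np.array([0.2, -0.1, 0.05])   # stand-in for dE_data/dw

w = w - eta * (data_grad + lam * w)       # penalty adds lam * w to the gradient
print(w)
```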

In order to explain the theoretical justification for Weight Decay, we need to introduce Bayesian
Inference and Maximum A Posteriori (MAP) estimation.

Bayesian Inference and MAP estimation

Recall the Maximum Likelihood principle which selects the hypothesis $h \in H$ for which $\text{Prob}(D \mid h)$ is
maximal.

Bayesian Inference instead seeks to maximize $\text{Prob}(h \mid D)$, i.e. the probability that hypothesis $h$ is
correct, given that data $D$ have been observed.

According to Bayes' Theorem:

$$P(h \mid D)\, P(D) = P(D \mid h)\, P(h)$$

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

We do not need to know $\text{Prob}(D)$ in order to maximize this expression over $h$. However, we do
need to be able to estimate $\text{Prob}(h)$, which is called the prior probability of $h$ (i.e. our estimate of the
probability before the data have been observed). $\text{Prob}(h \mid D)$ is called the posterior probability
because it is our estimate after the data have been observed. For this reason, maximizing
$\text{Prob}(h \mid D)$ in this context is sometimes called Maximum A Posteriori or MAP estimation.

Note also that $\text{Prob}(h)$ must be a well-defined probability in the sense that its integral over $H$ must
be finite and not infinite.

Weight Decay as MAP Estimation

We assume a Gaussian prior distribution for the weights, i.e.

$$P(w) = \prod_j \frac{1}{\sigma_0 \sqrt{2\pi}}\, e^{-w_j^2 / 2\sigma_0^2}$$

Then

$$P(w \mid t) = \frac{P(t \mid w)\, P(w)}{P(t)} = \frac{1}{P(t)} \prod_i \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(z_i - t_i)^2} \prod_j \frac{1}{\sigma_0 \sqrt{2\pi}}\, e^{-w_j^2 / 2\sigma_0^2}$$

$$\log P(w \mid t) = -\frac{1}{2\sigma^2} \sum_i (z_i - t_i)^2 - \frac{1}{2\sigma_0^2} \sum_j w_j^2 + \text{constant}$$

$$w_{MAP} = \operatorname*{argmax}_{w \in H} \log P(w \mid t) = \operatorname*{argmin}_{w \in H} \Bigl( \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2 \Bigr)$$

where $\lambda = \sigma^2 / \sigma_0^2$.
2

Momentum

It is often helpful to maintain a running average of the gradient for each weight, and use this running
average to update that weight:

$$\delta w \leftarrow \alpha\, \delta w - \eta \frac{\partial E}{\partial w}$$

$$w \leftarrow w + \delta w$$

The parameter $\alpha \in [0, 1)$ is called the Momentum.

This is helpful in two situations. Firstly, if the weights travel through a flat region in the landscape,
momentum will help to speed up the learning in this region. Secondly, if the landscape is shaped like
a "rain gutter", weights will tend to oscillate without much improvement. Momentum will
theoretically dampen the sideways oscillations but amplify the downhill motion by a factor of $\frac{1}{1 - \alpha}$.

When momentum is introduced, we generally reduce the learning rate at the same time, in order to
compensate for this implicit factor of $\frac{1}{1 - \alpha}$.
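A minimal sketch of this update rule (my own toy example, not from the notes) on a quadratic "rain gutter" loss that is steep in one direction and shallow in the other:

```python
# Momentum on E(w) = 0.5 * (100 * w[0]**2 + w[1]**2):
# steep gutter wall along w[0], shallow downhill slope along w[1].
import numpy as np

alpha, eta = 0.9, 0.009
w = np.array([1.0, 1.0])
dw = np.zeros(2)

for step in range(300):
    grad = np.array([100.0, 1.0]) * w    # dE/dw
    dw = alpha * dw - eta * grad         # running average of the gradient
    w = w + dw

# Oscillations across the gutter are damped, while progress along it
# is amplified by roughly 1/(1 - alpha); w ends up near the minimum (0, 0).
print(w)
```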

Adaptive Moment Estimation (Adam)

Adaptive Moment Estimation or Adam maintains a running average of the gradients ($m_t$) and the
squared gradients ($v_t$) for each weight in the network (Kingma & Ba, 2015).

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

To speed up the training in the early stages, compensating for the fact that $m_t, v_t$ are initialized to
zero, we rescale as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Finally, each parameter is adjusted according to:

$$w_t = w_{t-1} - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Adam has been found to work well across a wide variety of task domains.

As with momentum, when switching from SGD to Adam we often need to reduce the learning rate $\eta$,
in order to compensate for the factor of $\frac{1}{\sqrt{\hat{v}_t} + \varepsilon}$.
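Putting the three formulas together, a compact sketch of the full update (my own transcription of the equations above, run on a toy quadratic loss) looks like this:

```python
# Adam, transcribed from the formulas above, minimizing E(w) = 0.5 * ||w||^2.
import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8
w = np.array([3.0, -2.0])
m = np.zeros_like(w)        # running average of gradients
v = np.zeros_like(w)        # running average of squared gradients

for t in range(1, 501):
    g = w                                   # dE/dw for this toy loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)              # compensate for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # w ends up near the origin, to within roughly the step size eta
```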

Second Order Methods

For smaller networks, optimisation methods such as Conjugate Gradients or Natural Gradients are
sometimes used, which compute the second derivative of the loss function with respect to every pair
of weights in the network. However, these methods are generally not practical in the context of deep
learning, because the computation time is proportional to the square of the number of weights,
which may be in the millions or even billions.

Adam is seen as a very cost-effective method which, in practice, provides a similar benefit to second
order methods, but with far less computation.

References

Kingma, D.P., & Ba, J., 2015. Adam: A method for stochastic optimization, International Conference on
Learning Representations (poster).

Further Reading

https://ruder.io/optimizing-gradient-descent/index.html

Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):

Weight Decay (5.2.2) and MAP Estimation (5.6.1)

Momentum (8.3) and Adam (8.5)

Quiz 3: Backprop Variations

This is a Quiz to test your understanding of the material from Week 2 on Backprop Variations.
You must attempt to answer each question yourself, before looking at the sample answer.

Question 1

Explain the difference between the following paradigms, in terms of what is presented to the system,
and what it aims to achieve:

Supervised Learning

Reinforcement Learning

Unsupervised Learning

Question 2

Explain what is meant by Overfitting in neural networks, and list four different methods for avoiding
it.

Question 3

Explain how Dropout is used for neural networks, in both the training and testing phase.

Question 4

Write the formulas for these Loss functions: Squared Error, Cross Entropy, Softmax, Weight Decay
(remember to define any variables you use).

Question 5

In the context of Supervised Learning, explain the difference between Maximum Likelihood
estimation and Bayesian Inference.

Question 6

Briefly explain the concept of Momentum, as an enhancement for Gradient Descent.

Coding Exercise – Gradient Descent with NumPy

Objective

In this exercise, you will learn how to use gradient descent to solve a linear regression problem.

Instructions

Complete the parts of the code marked as “TODO”.

To run the cell you can press Ctrl-Enter or hit the “Play” button at the top.
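The notebook itself is not reproduced here. As a rough guide only, a completed solution might look something like the following sketch (the data, variable names, and structure are assumptions, not the actual exercise code):

```python
# Hypothetical stand-in for the exercise: fit t = w*x + b by gradient
# descent on the (mean) squared error. The real notebook's TODOs may differ.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
t = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # noisy line, slope 3, intercept 1

w, b, eta = 0.0, 0.0, 0.5
for epoch in range(500):
    z = w * x + b
    grad_w = np.mean((z - t) * x)    # dE/dw for E = 0.5 * mean (z - t)^2
    grad_b = np.mean(z - t)          # dE/db
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)   # should come out close to 3.0 and 1.0
```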
