
2b: Cross Entropy, Softmax, Weight Decay and
Momentum

Cross Entropy and Softmax

Loss Functions

In Week 1 we introduced the sum squared error (SSE) loss function, which is suitable for function
approximation tasks.

$$E = \frac{1}{2} \sum_i (t_i - z_i)^2$$

However, for binary classification tasks, where the target output is either zero or one, it may be
more logical to use the cross entropy error:

$$E = \sum_i \bigl( -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr)$$

In order to explain the motivation for both of these loss functions, we need to introduce the
mathematical concept of Maximum Likelihood.

Maximum Likelihood

Let $H$ be a class of hypotheses for predicting certain data $D$.

Let $\text{Prob}(D \mid h)$ be the probability of data $D$ being generated under hypothesis $h \in H$.

The logarithm of this probability, $\log \text{Prob}(D \mid h)$, is called the log likelihood.

The Maximum Likelihood Principle states that we should choose $h \in H$ which maximizes this
likelihood, i.e. maximizes $\text{Prob}(D \mid h)$ or, equivalently, maximizes $\log \text{Prob}(D \mid h)$.

In our case, the data $D$ consist of a target value $t_i$ for each set of input features $x_i$ in a Supervised
Learning task, and we can think of each hypothesis $h$ as a function $f()$ determined by a neural
network with specified weights or, to give a simpler example, $f()$ could be a straight line with a
specified slope and $y$-intercept.

Derivation of Least Squares

As previously mentioned, noise in the data is often caused by an accumulation of small errors due to
factors which are not captured by the model. The Central Limit Theorem tells us that when a large
number of independent random variables are added together, the combined error is well
approximated by a Gaussian distribution.

In order to accommodate this kind of noise, let's suppose that our data $D = \{t_i\}$ are generated by a
linear function $f()$ plus noise drawn from a Gaussian distribution with mean zero and standard
deviation $\sigma$. Then

$$\text{Prob}(D \mid h) = \text{Prob}(\{t_i\} \mid f) = \prod_i \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\bigl(t_i - f(x_i)\bigr)^2}$$

$$\log \text{Prob}(\{t_i\} \mid f) = \sum_i \Bigl( -\frac{1}{2\sigma^2}\bigl(t_i - f(x_i)\bigr)^2 - \log(\sigma) - \frac{1}{2}\log(2\pi) \Bigr)$$

$$f_{ML} = \operatorname*{argmax}_{f \in H} \log \text{Prob}(\{t_i\} \mid f) = \operatorname*{argmin}_{f \in H} \frac{1}{2} \sum_i \bigl(t_i - f(x_i)\bigr)^2$$

It is interesting to note that we do not need to know the value of $\sigma$.
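As a quick numerical check of this (not part of the original notes; the data, $\sigma$, and variable names below are made up for illustration), the following NumPy sketch scans a grid of candidate slopes and confirms that the slope which maximizes the Gaussian log likelihood is exactly the one which minimizes the sum squared error:

```python
# Sanity check: for noisy linear data, argmax of the Gaussian log
# likelihood over candidate slopes equals argmin of the SSE.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = 2.0 * x + rng.normal(0.0, 0.3, size=x.shape)   # true slope 2, noise sigma 0.3

slopes = np.linspace(0.0, 4.0, 401)                # candidate hypotheses f(x) = a*x
sse = np.array([0.5 * np.sum((t - a * x) ** 2) for a in slopes])

sigma = 0.3
log_lik = np.array([
    np.sum(-0.5 * (t - a * x) ** 2 / sigma**2
           - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    for a in slopes
])

# The two criteria select the same slope.
print(slopes[np.argmin(sse)], slopes[np.argmax(log_lik)])
```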

Derivation of Cross Entropy

For binary classification tasks, the target value $t_i$ is either $0$ or $1$. It makes sense to interpret the
output $f(x_i)$ of the neural network as the probability of the true value being $1$, i.e.

$$\text{Prob}(1 \mid f(x_i)) = f(x_i)$$

$$\text{Prob}(0 \mid f(x_i)) = 1 - f(x_i)$$

$$\text{Prob}(t_i \mid f(x_i)) = f(x_i)^{t_i}\, \bigl(1 - f(x_i)\bigr)^{1 - t_i}$$

Then

$$\log \text{Prob}(\{t_i\} \mid f) = \sum_i \Bigl( t_i \log f(x_i) + (1 - t_i) \log\bigl(1 - f(x_i)\bigr) \Bigr)$$

So, according to the Maximum Likelihood principle, we need to maximize the expression on the right
hand side (or minimize its negative).

Cross Entropy loss is often used in combination with sigmoid activation at the output node, which
guarantees that the output is strictly between $0$ and $1$, and also makes the backprop computations a
bit simpler, as follows:

$$E = \sum_i \bigl( -t_i \log(z_i) - (1 - t_i) \log(1 - z_i) \bigr)$$

$$\frac{\partial E}{\partial z_i} = -\frac{t_i}{z_i} + \frac{1 - t_i}{1 - z_i} = \frac{z_i - t_i}{z_i (1 - z_i)}$$

If $z_i = \dfrac{1}{1 + e^{-s_i}}$, then $\dfrac{\partial z_i}{\partial s_i} = z_i (1 - z_i)$, so

$$\frac{\partial E}{\partial s_i} = \frac{\partial E}{\partial z_i}\, \frac{\partial z_i}{\partial s_i} = z_i - t_i$$
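The following short sketch (my own, not from the notes) verifies this result numerically, comparing $z_i - t_i$ against a finite-difference estimate of $\partial E / \partial s_i$:

```python
# Numerical check that sigmoid + cross entropy gives dE/ds = z - t.
import numpy as np

def loss(s, t):
    z = 1.0 / (1.0 + np.exp(-s))                   # sigmoid activation
    return -t * np.log(z) - (1 - t) * np.log(1 - z)

s, t = 0.7, 1.0
z = 1.0 / (1.0 + np.exp(-s))

eps = 1e-6
numeric = (loss(s + eps, t) - loss(s - eps, t)) / (2 * eps)
print(numeric, z - t)                              # the two values agree closely
```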

SSE and cross entropy behave a bit differently when it comes to outliers. SSE is more likely to allow
outliers in the training set to be misclassified, because the contribution to the loss function from each
item is bounded between $0$ and $1$. Cross Entropy is more likely to keep outliers correctly classified,
because the loss function grows logarithmically as the difference between the target and actual value
approaches $1$. For this reason, Cross Entropy works particularly well for classification tasks that are
unbalanced, in the sense of negative items vastly outnumbering positive ones (or vice versa).
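The difference is easy to see numerically. In this illustrative snippet (the values are arbitrary), a positive item ($t = 1$) is pushed towards $z = 0$; the SSE contribution stays bounded while the cross entropy contribution grows without limit:

```python
# SSE vs cross entropy for a single badly misclassified positive item.
import numpy as np

t = 1.0
for z in [0.5, 0.1, 0.01, 0.001]:
    sse = 0.5 * (t - z) ** 2                          # saturates as z -> 0
    ce = -t * np.log(z) - (1 - t) * np.log(1 - z)     # diverges as z -> 0
    print(f"z={z:6.3f}  SSE={sse:5.3f}  CE={ce:6.3f}")
```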

Log Softmax

Some Supervised Learning tasks require data to be classified into more than two classes. If the
number of classes is $N$ and we have a neural network with $N$ outputs $z_1, \ldots, z_N$, we can make the
assumption that the network's estimate for the probability of class $i$ is proportional to $\exp(z_i)$.

Because the probabilities must add up to $1$, we need to normalize $\exp(z_j)$ by dividing by their sum:

$$\text{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

$$\log \text{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)$$

If the correct class is $k$, we can treat $-\log \text{Prob}(k)$ as our loss function, and the gradient is

$$\frac{d}{dz_i} \log \text{Prob}(k) = \delta_{ik} - \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = \delta_{ik} - \text{Prob}(i)$$

where $\delta_{ik}$ is the Kronecker delta ($1$ if $i = k$, $0$ if $i \neq k$). The first term pushes up the correct class in proportion to the
distance of its assigned probability from $1$, while the second term pushes down the incorrect classes
in proportion to the probabilities assigned to them.

Softmax compared to Boltzmann Distribution and Sigmoid

If you have studied mathematics or physics, you may be interested to know that Softmax is related to
the Boltzmann Distribution, with the negative of output $z_i$ playing the role of the "energy" for "state"
$i$. The familiar Sigmoid function can also be seen as a special case of Softmax, with two classes and
one output, as follows. Consider a simplified case where there is a choice between two classes, Class
$0$ and Class $1$. We consider the output $z$ of the network to be associated with Class $1$, and we imagine
a fixed "output" for Class $0$ which is always equal to zero. In this case, the softmax becomes:

$$\text{Prob}(1) = \frac{e^z}{e^0 + e^z} = \frac{1}{1 + e^{-z}}$$

Further Reading

Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):

Maximum Likelihood (5.5)

Cross Entropy (3.13)

Softmax (6.2.2)

Exercise: Softmax and Backpropagation

Recall that the formula for Softmax is:

$$\text{Prob}(i) = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}, \qquad \log \text{Prob}(i) = z_i - \log \sum_{j=1}^{N} \exp(z_j)$$

Consider a classification task with three classes $1$, $2$, $3$. Suppose a particular input is presented,
producing outputs:

$$z_1 = 1, \quad z_2 = 2, \quad z_3 = 3$$

and that the correct class is $2$.

Question 1

Compute each of the following, to two decimal places:

$\text{Prob}(1)$, $\text{Prob}(2)$, $\text{Prob}(3)$

Question 2

Compute each of the following, to two decimal places:

$d(\log \text{Prob}(2))/dz_1$, $d(\log \text{Prob}(2))/dz_2$, $d(\log \text{Prob}(2))/dz_3$

Question 3

Consider a degenerate case of supervised learning where the training set consists of just a single
input, repeated $100$ times. In $80$ of the $100$ cases, the target output value is $1$; in the other $20$, it is $0$.
What will a back-propagation neural network predict for this example, assuming that it has been
trained and reaches a global minimum? Does it make a difference whether the loss function is sum
squared error or cross entropy? (Hint: to find the global minimum, differentiate the loss function and
set the derivative to zero.)

Weight Decay and Momentum

Weight Decay

Sometimes a penalty term is added to the loss function which encourages the neural network weights
$w_j$ to remain small:

$$E = \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2$$

(Note: the sum squared error term may be replaced with cross entropy or softmax).

This additional loss term prevents the weights from "saturating" to very high values. It is sometimes
referred to as "elastic weights" because it simulates a force on each weight as if there were a spring
pulling it back towards the origin according to Hooke's Law. The scaling factor $\lambda$ needs to be
determined empirically.
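In gradient descent terms, the penalty contributes an extra $\lambda w_j$ to each gradient, so every update shrinks each weight slightly towards zero before the data gradient is applied; hence the name "weight decay". A minimal sketch of one such update (the numbers are placeholders, not from the notes):

```python
# One gradient descent step with an L2 (weight decay) penalty.
import numpy as np

lam, eta = 0.01, 0.1
w = np.array([1.5, -2.0, 0.3])
data_grad = np.array([0.2, -0.1, 0.05])   # stand-in for dE_data/dw

w = w - eta * (data_grad + lam * w)       # penalty adds lam * w to the gradient
print(w)
```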

In order to explain the theoretical justification for Weight Decay, we need to introduce Bayesian
Inference and Maximum A Posteriori (MAP) estimation.

Bayesian Inference and MAP estimation

Recall the Maximum Likelihood principle which selects the hypothesis $h \in H$ for which $\text{Prob}(D \mid h)$ is
maximal.

Bayesian Inference instead seeks to maximize $\text{Prob}(h \mid D)$, i.e. the probability that hypothesis $h$ is
correct, given that data $D$ have been observed.

According to Bayes' Theorem:

$$P(h \mid D)\, P(D) = P(D \mid h)\, P(h)$$

$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$

We do not need to know $\text{Prob}(D)$ in order to maximize this expression over $h$. However, we do
need to be able to estimate $\text{Prob}(h)$, which is called the prior probability of $h$ (i.e. our estimate of the
probability before the data have been observed). $\text{Prob}(h \mid D)$ is called the posterior probability
because it is our estimate after the data have been observed. For this reason, maximizing
$\text{Prob}(h \mid D)$ in this context is sometimes called Maximum A Posteriori or MAP estimation.

Note also that $\text{Prob}(h)$ must be a well-defined probability in the sense that its integral over $H$ must
be finite and not infinite.

Weight Decay as MAP Estimation

We assume a Gaussian prior distribution for the weights, i.e.

$$P(w) = \prod_j \frac{1}{\sigma_0 \sqrt{2\pi}}\, e^{-w_j^2 / 2\sigma_0^2}$$

Then

$$P(w \mid t) = \frac{P(t \mid w)\, P(w)}{P(t)} = \frac{1}{P(t)} \prod_i \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}(z_i - t_i)^2} \prod_j \frac{1}{\sigma_0 \sqrt{2\pi}}\, e^{-w_j^2 / 2\sigma_0^2}$$

$$\log P(w \mid t) = -\frac{1}{2\sigma^2} \sum_i (z_i - t_i)^2 - \frac{1}{2\sigma_0^2} \sum_j w_j^2 + \text{constant}$$

$$w_{MAP} = \operatorname*{argmax}_{w \in H} \log P(w \mid t) = \operatorname*{argmin}_{w \in H} \Bigl( \frac{1}{2} \sum_i (z_i - t_i)^2 + \frac{\lambda}{2} \sum_j w_j^2 \Bigr)$$

where $\lambda = \sigma^2 / \sigma_0^2$.
2

Momentum

It is often helpful to maintain a running average of the gradient for each weight, and use this running
average to update that weight:

$$\delta w \leftarrow \alpha\, \delta w - \eta \frac{\partial E}{\partial w}$$

$$w \leftarrow w + \delta w$$

The parameter $\alpha \in [0, 1)$ is called the Momentum.

This is helpful in two situations. Firstly, if the weights travel through a flat region in the landscape,
momentum will help to speed up the learning in this region. Secondly, if the landscape is shaped like
a "rain gutter", weights will tend to oscillate without much improvement. Momentum will
theoretically dampen the sideways oscillations but amplify the downhill motion by a factor of $\frac{1}{1 - \alpha}$.

When momentum is introduced, we generally reduce the learning rate at the same time, in order to
compensate for this implicit factor of $\frac{1}{1 - \alpha}$.
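A minimal sketch of this update rule (my own toy example, not from the notes) on a quadratic "rain gutter" loss that is steep in one direction and shallow in the other:

```python
# Momentum on E(w) = 0.5 * (100 * w[0]**2 + w[1]**2):
# steep gutter wall along w[0], shallow downhill slope along w[1].
import numpy as np

alpha, eta = 0.9, 0.009
w = np.array([1.0, 1.0])
dw = np.zeros(2)

for step in range(300):
    grad = np.array([100.0, 1.0]) * w    # dE/dw
    dw = alpha * dw - eta * grad         # running average of the gradient
    w = w + dw

# Oscillations across the gutter are damped, while progress along it
# is amplified by roughly 1/(1 - alpha); w ends up near the minimum (0, 0).
print(w)
```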

Adaptive Moment Estimation (Adam)

Adaptive Moment Estimation or Adam maintains a running average of the gradients ($m_t$) and the
squared gradients ($v_t$) for each weight in the network (Kingma & Ba, 2015).

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

To speed up the training in the early stages, compensating for the fact that $m_t, v_t$ are initialized to
zero, we rescale as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Finally, each parameter is adjusted according to:

$$w_t = w_{t-1} - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}$$

Adam has been found to work well across a wide variety of task domains.

As with momentum, when switching from SGD to Adam we often need to reduce the learning rate $\eta$,
in order to compensate for the factor of $\frac{1}{\sqrt{\hat{v}_t} + \varepsilon}$.
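Putting the three formulas together, a compact sketch of the full update (my own transcription of the equations above, run on a toy quadratic loss) looks like this:

```python
# Adam, transcribed from the formulas above, minimizing E(w) = 0.5 * ||w||^2.
import numpy as np

beta1, beta2, eta, eps = 0.9, 0.999, 0.01, 1e-8
w = np.array([3.0, -2.0])
m = np.zeros_like(w)        # running average of gradients
v = np.zeros_like(w)        # running average of squared gradients

for t in range(1, 501):
    g = w                                   # dE/dw for this toy loss
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)              # compensate for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)   # w ends up near the origin, to within roughly the step size eta
```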

Second Order Methods

For smaller networks, optimisation methods such as Conjugate Gradients or Natural Gradients are
sometimes used, which compute the second derivative of the loss function with respect to every pair
of weights in the network. However, these methods are generally not practical in the context of deep
learning, because the computation time is proportional to the square of the number of weights,
which may be in the millions or even billions.

Adam is seen as a very cost-effective method which, in practice, provides a similar benefit to second
order methods, but with far less computation.

References

Kingma, D.P., & Ba, J., 2015. Adam: A method for stochastic optimization, International Conference on
Learning Representations (poster).

Further Reading

https://ruder.io/optimizing-gradient-descent/index.html

Textbook Deep Learning (Goodfellow, Bengio, Courville, 2016):

Weight Decay (5.2.2) and MAP Estimation (5.6.1)

Momentum (8.3) and Adam (8.5)

Quiz 3: Backprop Variations

This is a Quiz to test your understanding of the material from Week 2 on Backprop Variations.
You must attempt to answer each question yourself, before looking at the sample answer.

Question 1

Explain the difference between the following paradigms, in terms of what is presented to the system,
and what it aims to achieve:

Supervised Learning

Reinforcement Learning

Unsupervised Learning

Question 2

Explain what is meant by Overfitting in neural networks, and list four different methods for avoiding
it.

Question 3

Explain how Dropout is used for neural networks, in both the training and testing phase.

Question 4

Write the formulas for these Loss functions: Squared Error, Cross Entropy, Softmax, Weight Decay
(remember to define any variables you use).

Question 5

In the context of Supervised Learning, explain the difference between Maximum Likelihood
estimation and Bayesian Inference.

Question 6

Briefly explain the concept of Momentum, as an enhancement for Gradient Descent.

Coding Exercise – Gradient Descent with NumPy

Objective

In this exercise, you will learn how to use gradient descent to solve a linear regression problem.

Instructions

Complete the parts of the code marked as “TODO”.

To run the cell you can press Ctrl-Enter or hit the “Play” button at the top.
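The notebook itself is not reproduced here. As a rough guide only, a completed solution might look something like the following sketch (the data, variable names, and structure are assumptions, not the actual exercise code):

```python
# Hypothetical stand-in for the exercise: fit t = w*x + b by gradient
# descent on the (mean) squared error. The real notebook's TODOs may differ.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 100)
t = 3.0 * x + 1.0 + rng.normal(0, 0.1, 100)   # noisy line, slope 3, intercept 1

w, b, eta = 0.0, 0.0, 0.5
for epoch in range(500):
    z = w * x + b
    grad_w = np.mean((z - t) * x)    # dE/dw for E = 0.5 * mean (z - t)^2
    grad_b = np.mean(z - t)          # dE/db
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)   # should come out close to 3.0 and 1.0
```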
