Lecture 3, Part 2: Training a Classifier
Roger Grosse
1 Introduction
Now that we’ve defined what binary classification is, let’s actually train a classifier. We’ll approach this problem in much the same way as we did linear regression: define a model and a cost function, and minimize the cost using gradient descent. The one thing that makes the classification case harder is that it’s not obvious what loss function to use. We can’t just use the classification error itself, because the gradient is zero almost everywhere! Instead, we’ll define a surrogate loss function, i.e. an alternative loss function which is easier to optimize.
Copyright By PowCoder代写 加微信 powcoder
Learning Goals
Understand why classification error and squared error are problematic cost functions for classification.
Know what cross-entropy is and understand why it can be easier to optimize than squared error (assuming a logistic activation function).
Be able to derive the gradient descent updates for all of the models and cost functions mentioned in this lecture and to implement the learning algorithms in Python.
Know what hinge loss is, and how it relates to cross-entropy loss.
Understand how binary logistic regression can be generalized to mul- tiple variables.
Know what it means for a function to be convex, how to check con- vexity visually, and why it’s important for optimization.
– Algebraically proving functions to be convex is beyond the scope of this course.
Know how to check the correctness of gradients using finite differences.
Choosing a cost function
Recall the setup from the previous lecture. We have a set of training ex- amples {(x(i),t(i))}Ni=1, where the x(i) are vector-valued inputs, and t(i) are binary-valued targets. We would like to learn a binary linear classifier, where we compute a linear function of x(i) and threshold it at zero:
1 ifw⊤x>0
y = 0 otherwise. (1)
In the last lecture, our goal was to correctly classify every training exam- ple. But this might be impossible if the dataset isn’t linearly separable. Even if it’s possible to correctly classify every training example, it may be undesirable since then we might just overfit!
How can we define a sensible learning criterion when the dataset isn’t linearly separable? One natural criterion is to minimize the number of mis- classified training examples. We can formalize this with the classification error loss, or the 0-1 loss:
L0−1(y, t) = 1 otherwise. (2)
As always, the cost function is just the loss averaged over the training examples; in this case, that corresponds to the error rate, or fraction of misclassified examples. How do we make this small?
2.1 Attempt 1: 0-1 loss
As a first attempt, let’s try to minimize 0-1 loss directly. In order to compute the gradient descent updates, we need to compute the partial derivatives ∂L0−1/∂wj. Rather than mechanically deriving this derivative, let’s think about what it means. It means, how much does L0−1 change if you make a very small change to wj ? As long as we’re not on the classification boundary, making a small enough change to wj will have no effect on L0−1, because the prediction won’t change. This implies that ∂L0−1/∂wj = 0, as long as we’re not on the boundary. Gradient descent will go nowhere. (If we are on the boundary, the cost is discontinuous, which certainly isn’t any better!) OK, we certainly can’t optimize 0-1 loss with gradient descent.
2.2 Attempt 2: linear regression
Since that didn’t work, let’s try using something we already know: linear regression. Recall that this assumes a linear model and the squared error loss function:
y = w⊤x + b (3) LSE(y,t)= 12(y−t)2 (4)
We’ve already seen two ways of optimizing this: gradient descent, and a closed-form solution. But does it make sense for classification? One obvious problem is that the predictions are real-valued rather than binary. But that’s OK, since we can just pick some scheme for binarizing them, such as thresholding at y = 1/2. When we replace a loss function we trust with another one we trust less but which is easier to optimize, the replacement one is called a surrogate loss function.
But there’s still a problem. Suppose we have a positive example, i.e. t = 1. Ifwepredicty=1,wegetacostof0,whereasifwemakethewrong prediction y = 0, we get a cost of 1/2; so far, so good. But suppose we’re really confident that this is a positive example, and predict y = 9. Then we pay a cost of 21(9−1)2 = 32. This is far higher than the cost for y = 0, so the learning algorithm will try very hard to prevent this from happening.
Near the end of the course, when we discuss reinforcement learning, we’ll see an algorithm which can minimize 0-1 loss directly. It’s nowhere near as efficient as gradient descent, though, so we still need the techniques of this lecture!
That’s not bad in itself, but it means that something else might need to be sacrificed, if it’s impossible to match all of the targets exactly. Perhaps the sacrifice will be that it incorrectly classifies some other training examples.
2.3 Attempt 3: logistic nonlinearity
The problem with linear regression is that the predictions were allowed to take arbitrary real values. But it makes no sense to predict anything smaller than 0 or larger than 1. Let’s fix this problem by applying a nonlinearity, or activation function, which squashes the predictions to be between 0 and 1. In particular, we’ll use something called the logistic function:
σ(z) = 1 . (5) 1+e−z
This is a kind of sigmoidal, or S-shaped, function:
What’s important about this function is that it increases monotonically, with asymptotes at 0 and 1. (Plus, it’s smooth, so we can compute deriva- tives.)
We refine the model as follows:
z = w⊤x + b (6)
y = σ(z) (7) LSE(y,t)= 12(y−t)2. (8)
Notice that this model solves the problem we observed with linear regression. As the predictions get more and more confident on the correct answer, the loss continues to decrease.
To derive the gradient descent updates, we’ll need the partial derivatives of the cost function. We’ll do this by applying the Chain Rule twice: first to compute dLSE/dz, and then again to compute ∂LSE/∂wj. But first, let’s note the convenient fact that
∂z (1+e−z)2
= y(1 − y). (9)
If you predict y > 1, then regardless of the target, you can decrease the loss by setting y = 1. Similarly for y < 0.
Another advantage of the logistic function is that calculations tend to work out very nicely.
This is equivalent to the elegant identity σ′(z) = σ(z)(1 − σ(z)).
Figure 1: Visualization of derivatives of squared error loss with logistic nonlinearity, for a training example with t = 1. The derivative dJ/dz corresponds to the slope of the tangent line.
Now for the Chain Rule:
dLSE = dLSE dy dz dydz
= (y − t)y(1 − y) (10) ∂LSE = dLSE ∂z
∂wj dz ∂wj
= dLSE ·xj. (11)
Done! Why don’t we go one step further and plug Eqn. 10 into Eqn. 11? The reason is that our goal isn’t to compute a formula for ∂LSE/∂wj; as computer scientists, our goal is to come up with a procedure for computing it. The two formulas above give us a procedure which we can implement directly in Python. One advantage of doing it this way is that we can reuse some of the work we’ve done in computing the derivative with respect to the bias:
At this point, you should stop and sanity check the equations we just derived, e.g. checking that they have the sign that they ought to. Get in the habit of doing this.
Reusing computation of derivatives is one of the main insights behind backpropagation, one of the central algorithms in this course.
dLSE dLSE ∂z db = dz ∂b
If we had expanded out the entire formula, it might not be obvious to us that we can reuse computation like this.
So far, so good. But there’s one hitch. Let’s suppose you classify one of the training examples extremely wrong, e.g. you confidently predict a negative label with z = −5, which gives y ≈ 0.0067, for a positive example (i.e. t = 1). Plugging these values into Eqn 10, we find that ∂LSE/∂z ≈ −0.0066. This is a pretty small value, considering how big the mistake was. As shown in Figure 1, the more confident the wrong prediction, the smaller the gradient is! The most badly misclassified examples will have hardly any effect on the training. This doesn’t seem very good. We say the learning algorithm does not have a strong gradient signal for those training examples.
Figure 2: Plot of cross-entropy loss as a function of the input z to the logistic activation function.
The problem with squared error loss in the classification setting is that it doesn’t distinguish bad predictions from extremely bad predictions. If t = 1, then a prediction of y = 0.01 has roughly the same squared-error loss as a prediction of y = 0.00001, even though in some sense the latter is more wrong. This isn’t necessarily a problem in terms of the cost function itself: whether 0.00001 is inherently much worse than 0.01 depends on the situation. (If all we care about is classification error, they’re essentially equivalent.) But from the perspective of optimization, the fact that the losses are nearly equivalent is a big problem. If we can increase y from 0.00001 to 0.0001, that means we’re “getting warmer,” but this doesn’t show up in the squared-error loss. We’d like a loss function which reflects our intuitive notion of getting warmer.
2.4 Final touch: cross-entropy loss
The problem with squared-error loss is that it treats y = 0.01 and y = 0.00001 as nearly equivalent (for a positive example). We’d like a loss function which makes these very different. One such loss function is cross- entropy (CE). This is defined as follows:
− log y if t = 1
LCE(y,t)= −log1−y ift=0 (13)
In our earlier example, we see that LCE(0.01, 1) = 4.6, whereas LCE(0.00001, 1) = 11.5, so cross-entropy treats the latter as much worse than the former.
When we do calculations, it’s cumbersome to use the case notation, so we instead rewrite Eqn. 13 in the following form. You should check that they are equivalent:
LCE(y, t) = −t log y − (1 − t) log 1 − y. (14)
Remember, the logistic function squashes y to be between 0 and 1, but cross-entropy draws big distinctions between probabilities close to 0 or 1. Interestingly, these effects cancel out: Figure 2 plots the loss as a function of z. You get a sizable gradient signal even when the predictions are very wrong.
Think about how the argument in this paragraph relates to the one in the previous paragraph.
Actually, the effect discussed here can also be beneficial, because it makes the algorithm robust, in that it can learn to ignore mislabeled examples. Cost functions like this are sometimes used for this reason. However, when you do use it, you should be aware of the optimization difficulties it creates!
You’ll sometimes see cross-entropy abbreviated XE.
See if you can derive the equations for the asymptote lines.
When we combine the logistic activation function with cross-entropy loss, you get logistic regression:
z = w⊤x + b
y = σ(z) (15) LCE =−tlogy−(1−t)log1−y.
Now let’s compute the derivatives. We’ll do it two different ways: the mechanical way, and the clever way. Let’s do the mechanical way first, as an example of the chain rule for derivatives. Remember, our job here isn’t to produce a formula for the derivatives, the way we would in calculus class. Our job is to give a procedure for computing the derivatives which we could translate into NumPy code. The following does that:
dLCE = − t + 1 − t dy y1−y
dLCE = dLCE dy dz dydz
= dLCE ·y(1−y) (16) dy
∂LCE = dLCE ∂z ∂wj dz ∂wj
= dLCE · xj dz
This can be translated directly into NumPy (exercise: how do you vec- torize this?). If we were good little computer scientists, we would stop here. But today we’re going to be naughty computer scientists and break the abstraction barrier between the activation function (logistic) and the cost function (cross-entropy).
2.5 Logistic-cross-entropy function
There’s a big problem with Eqns. 15 and 16: what happens if we have a positive example (t = 1), but we confidently classify it as a negative example (z ≪ 0, implying y ≈ 0)? This is likely to happen at the very beginning of training, so we should be able to handle it. But if y is small enough, it could be smaller than the smallest floating point value, i.e. numerically zero. Then when we compute the cross-entropy, we take the log of 0 and get −∞. Or if this doesn’t happen, think about Eqn. 16. Since y appears in the denominator, dLCE/dy will be extremely large in magnitude, which again can cause numerical difficulties. These bugs are very subtle, and can be hard to track down if you don’t expect them.
What we do instead is combine the logistic function and cross-entropy loss into a single function, which we term logistic-cross-entropy:
LLCE(z, t) = LCE(σ(z), t) = t log(1 + e−z) + (1 − t) log(1 + ez) (17) This still isn’t numerically stable if we implement it directly, since ez could
blow up. But most scientific computing environments provide a numerically
The second step of this derivation uses Eqn. 9.
Figure 3: Comparison of the loss functions considered so far.
stable log-sum-exp routine.1 In numpy, this is np.logaddexp. So the following code would compute the logistic-cross-entropy:
E = t * np.logaddexp(0, -z) + (1-t) * np.logaddexp(0, z)
Now to compute the loss derivatives:
dLLCE = d tlog(1+e−z)+(1−t)log(1+ez)
=−t·1+e−z +(1−t)1+ez
e−z ez = −t(1 − y) + (1 − t)y
This is like magic! We took a somewhat complicated formula for the logistic activation function, combined it with a somewhat complicated formula for the cross-entropy loss, and wound up with a stunningly simple formula for the loss derivative! Observe that this is exactly the same formula as for dLSE/dy in the case of linear regression. And it has the same intuitive interpretation: if y > t, you made too positive a prediction, so you want to shift your prediction in the negative direction. Conversely, if y < t, you want to shift your prediction in the positive direction.
2.6 Another alternative: hinge loss
Another loss function you might encounter is hinge loss. Here, y is a real value, and t ∈ {−1, 1}.
LH(y, t) = max(0, 1 − ty)
Hinge loss is plotted in Figure 3 for a positive example. One useful property
of hinge loss is that it’s an upper bound on 0–1 loss; this is a useful property
1The log-sum-exp trick is pretty neat. https://hips.seas.harvard.edu/blog/2013/ 01/09/computing-log-sum-exp/
This isn’t a coincidence. The reason it happens is beyond the scope of this course, but if you’re curious, look up “generalized linear models.”
for a surrogate loss function, since it means that if you make the hinge loss small, you’ve also made 0–1 loss small. A linear model with hinge loss is known as a support vector machine (SVM):
y = w⊤x + b (19) LH = max(0, 1 − ty) (20)
If you take CSC411, you’ll learn a lot about SVMs, including their statis- tical motivation, how to optimize them efficiently and how to make them nonlinear (using something called the “kernel trick”). But you already know one optimization method: you already know enough to derive the gradient descent updates for an SVM.
Interestingly, even though SVMs came from a different community and had a different sort of motivation from logistic regression, the algorithms behave very similarly in practice. The reason has to do with the loss func- tions. Figure 3 compares hinge loss to cross-entropy loss; even though cross- entropy is smoother, the asymptotic behavior is the same, suggesting the loss functions are basically pretty similar.
All of the loss functions covered so far is shown in Figure 3. Take the time to review them, to understand their strengths and weaknesses.
3 Multiclass classification
So far we’ve talked about binary classification, but most classification prob- lems involve more than two categories. Fortunately, this doesn’t require any new ideas: everything pretty much works by analogy with the binary case. The first question is how to represent the targets. We could represent them as integers, but it’s more convenient to use a one-hot vector, also called a one-of-K encoding:
t = (0,...,0,1,0,...,0) (21) | {z }
entry k is 1
Now let’s design the whole model by analogy with the binary case.
First of all, consider the linear part of the model. We have K outputs and D inputs. To represent a linear function, we’ll need a K × D weight matrix, as well as a K-dimensional bias vector. We first compute the
intermediate quantities as follows:
z = Wx + b. (22)
This is the general form of a linear function from RD to RK.
Next, the activation function. We saw that the logistic function was a good thing to use in the binary case. There’s a multivariate generalization
of the logistic function called the softmax function:
ezk logistic function.
yk = softmax(z1,...,zK)k = Pk′ ezk′ (23) Importantly, the outputs of the softmax function are nonnegative and sum
to 1, so they can be interpreted as a probability distribution over the K 8
Try plugging in K = 2 to figure
out how the softmax relates to the
classes (just like the output of the logistic could be interpreted as a prob- ability). The inputs to the softmax are called the logits. Note that when one of the zk’s is much larger than the others, the output of the softmax will be approximately the argmax, in the one-hot encoding. Hence, a more accurate name might be “soft-argmax.”
Finally, the loss function. Cross-entropy can be generalized to the multiple-output case:
K LCE(y,t) = −Xtk logyk
= −t⊤(log y).
Here, logy represents the elementwise log. Note that only one of the tk’s is 1 and the rest are 0, so the summation has the effect of picking the relevant entry of the vector logy. (See how convenient the one-hot notation is?) Note that this loss function only makes sense for predictions which sum to 1; if you eliminate that constraint, you could trivially minimize the loss by making all the yk’s large.
When we put these things together, we get multiclass logistic regres- sion, or softmax regression:
z = Wx + b
y = softmax(z) LCE = −t⊤(log y)
We won’t go through the derivatives in detail, but it basically works out exactly like logistic regression. The softmax and cross-entropy functions interact nicely with each other, so we always combine them into a single softmax-cross-entropy function LSCE for purposes of numerical stability. The derivatives of LSCE have the same elegant formula we’ve been seeing repeatedly, except this time remember that t and y are both vectors:
∂LSCE =y−t (24) ∂z
Softmax regression is an elegant learning algorithm which can work very well in practice.
4 Convex Functions
An important criterion we often use to compare different loss functions is convexity. Recall that a set S is convex if the line segment connecting any two points in S lies entirely within S. Mathematically, this means that for x0,x1 ∈S,
(1−λ)x0+λx1 ∈S for0≤λ≤1.
The definition of a convex function is closely related. A function f is
convex if for any x0,x1 in the domain of f,
f((1 − λ)x0 + λx1) ≤ (1 − λ)f(x0) + λf(x1) (25)
Think about the logits as the “log-odds”, because when you exponentiate them you get the odds ratios of the probabilities.
You’ll sometimes see σ(z) used to denote the softmax function, by analogy with the logistic. But in this course, it will always denote the logistic function.
Try plugging in K = 2 to see how this relates to binary cross-entropy.
Figure 4: Left: Definition of convexity. Right: Proof-by-picture that if the model is linear and L is a convex function of z = w⊤x+b, then it’s also convex as a function of w and b.
This is shown graphically in Figure 4. Another way to phrase this require- ment is that the line segment connecting any two points on the graph of f must lie above the graph of f. Equivalently, the set of points lying above the graph of f must be a convex set. Intuitively, convex functions are bowl- shaped.
Convexity is a really important property from the standpoint of o
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com