Lecture 3, Part 1: Linear Classification
1 Introduction
Last time, we saw an example of a learning task called regression. There, the goal was to predict a scalar-valued target from a set of features. This time, we’ll focus on a slightly different task: binary classification, where the goal is to predict a binary-valued target. Here are some examples of binary classification problems:
• You want to train a medical diagnosis system to predict whether a patient has a given disease. You have a training set consisting of a set of patients, a set of features for those individuals (e.g. presence or absence of various symptoms), and a label saying whether or not the patient had the disease.
• You are running an e-mail service, and want to determine whether a given e-mail is spam. You have a large collection of e-mails which have been hand-labeled as spam or non-spam.
• You are running an online payment service, and want to determine whether or not a given transaction is fraudulent. You have a labeled training dataset of fraudulent and non-fraudulent transactions; features might include the type of transaction, the amount of money, or the time of day.
Like regression, binary classification is a very restricted kind of task. Most learning problems you’ll encounter won’t fit nicely into one of these two categories. Our motivation for focusing on binary classification is to introduce several fundamental ideas that we’ll use throughout the course. In this lecture, we discuss how to view both data points and linear classifiers as vectors. Next lecture, we discuss the perceptron, a particular classification algorithm, and use it as an example of how to efficiently implement a learning algorithm in Python. Starting next week, we’ll look at supervised learning in full generality, and see that regression and binary classification are just special cases of a more general learning framework.
This lecture focuses on the geometry of classification. We’ll look in particular at two spaces:
• The input space, where each data case corresponds to a vector. A classifier corresponds to a decision boundary, or a hyperplane such that the positive examples lie on one side, and negative examples lie on the other side.
• Weight space, where each set of classification weights corresponds to a vector. Each training case corresponds to a constraint in this space, where some regions of weight space are “good” (classify it correctly) and some regions are “bad” (classify it incorrectly).
The idea of weight space may seem pretty abstract, but it is very important that you become comfortable with it, since it underlies nearly everything we do in the course.
Using our understanding of input space and weight space, the limitations of linear classifiers will become immediately apparent. We’ll see some examples of datasets which are not linearly separable (i.e. no linear classifier can correctly classify all the training cases), but which become linearly separable if we use a basis function representation.
Learning goals
• Know what is meant by binary linear classification.
• Understand why an explicit threshold for a classifier is redundant.
• Understand how we can get rid of the bias term by adding a “dummy” feature.
• Be able to specify weights and biases by hand to represent simple functions (e.g. AND, OR, NOT).
• Be familiar with input space and weight space.
  – Be able to plot training cases and classification weights in both input space and weight space.
• Be aware of the limitations of linear classifiers.
  – Know what is meant by convexity, and be able to use convexity to show that a given set of training cases is not linearly separable.
  – Understand how we can sometimes still separate the classes using a basis function representation.
2 Binary linear classifiers
We’ll be looking at classifiers which are both binary (they distinguish between two categories) and linear (the classification is done using a linear function of the inputs). As in our discussion of linear regression, we assume each input is given in terms of D scalar values, called input dimensions or features, which we think summarize the important information for classification. (Some of the features, e.g. presence or absence of a symptom, may in fact be binary valued, but we’re going to treat these as real-valued anyway.) The jth feature for the ith training example is denoted x_j^(i). All of the features for a given training case are concatenated together to form a vector, which we’ll denote x^(i). (Recall that vectors and matrices are shown in boldface.)
Associated with each data case is a binary-valued target, the thing we’re trying to predict. By definition, a binary target takes two possible values, which we’ll call classes, and which are typically referred to as positive and negative. (E.g., the positive class might be “has disease” and the negative class might be “does not have disease.”) Data cases belonging to these classes are called positive examples and negative examples, respectively. The training set consists of a set of N pairs (x^(i), t^(i)), where x^(i) is the input and t^(i) is the binary-valued target, or label. Since the training cases come with labels, they’re referred to as labeled examples. Confusingly, even though we talk about positive and negative examples, the t^(i) typically take values in {0, 1}, where 0 corresponds to the “negative” class. Sorry, you’ll just have to live with this terminology.
Our goal is to correctly classify all the training cases (and, hopefully, examples not in the training set). In order to do the classification, we need to specify a model, which determines how the predictions are computed from the inputs. As we said before, our model for this week is binary linear classifiers.
The way binary linear classifiers work is simple: they compute a linear function of the inputs, and determine whether or not the value is larger than some threshold r. Recall from Lecture 2 that a linear function of the input can be written as
w_1 x_1 + ··· + w_D x_D + b = w^T x + b,

where w is a weight vector and b is a scalar-valued bias. Therefore, the prediction y can be computed as follows:

z = w^T x + b
y = 1 if z ≥ r, and y = 0 if z < r.
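To make the model concrete, here is a minimal Python/NumPy sketch of the prediction rule above (the function name and the use of NumPy are our own choices, not prescribed by the notes):

    import numpy as np

    def predict(w, x, b, r=0.0):
        # Binary linear classifier: y = 1 if w^T x + b >= r, else y = 0.
        # w and x are length-D NumPy arrays; b and r are scalars.
        z = np.dot(w, x) + b
        return 1 if z >= r else 0

Note that the explicit threshold r is redundant: replacing b with b − r and comparing z against zero gives exactly the same classifier, which is why we can always take r = 0.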
For the AND function, the four training cases, with inputs (0, 0), (0, 1), (1, 0), (1, 1) and targets 0, 0, 0, 1, require b < 0, w2 + b < 0, w1 + b < 0, and w1 + w2 + b ≥ 0. From these inequalities, we immediately see that b < 0 and w1, w2 > 0. The simplest way forward at this point is probably trial and error. Since the problem is symmetric with respect to w1 and w2, we might as well decide that w1 = w2. So let’s try b = −1, w1 = w2 = 1 and see if it works. The first and fourth inequalities are clearly satisfied, but the second and third are not, since w1 + b = w2 + b = 0. So let’s try making the bias a bit more negative. When we try b = −1.5, w1 = w2 = 1, we see that all four inequalities are satisfied, so we have our solution.
Following these examples, you should attempt the OR function on your own.
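As a quick sanity check (a small sketch assuming NumPy; not part of the notes), we can verify the AND solution b = −1.5, w1 = w2 = 1 by enumerating all four inputs:

    import numpy as np

    w, b = np.array([1.0, 1.0]), -1.5
    for x1 in (0, 1):
        for x2 in (0, 1):
            z = w @ np.array([x1, x2]) + b
            print(x1, x2, int(z >= 0))   # prints 1 only for the input (1, 1)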
3 The geometric picture
Now let’s move on to the main concepts of this lecture: data space and weight space. These are the spaces that the inputs and the weight vectors live in, respectively. It’s very important to become comfortable thinking about these spaces, since we’re going to treat the inputs and weights as vectors for the rest of the term.
In this lecture, we’re going to focus on two-dimensional input and weight spaces. But keep in mind that this is a vast oversimplification: in practical settings, these spaces are typically many thousands, or even millions, of dimensions. It’s pretty much impossible to visualize spaces this high-dimensional.
3.1 Data space
The first space to be familiar with is data space, or input space. Each point in this space corresponds to a possible input vector. (We’re going to abuse mathematical terminology a bit by using “point” and “vector” interchangeably.) It’s customary to represent positive and negative examples with the symbols “+” and “−”, respectively.
Once we’ve chosen the weights w and bias b, we can divide the data space into a region where the points are classified as positive (the positive region), and a region where the points are classified as negative (the negative region). The boundary between these regions, i.e. the set where w^T x + b = 0, is called the decision boundary. Think back to your linear algebra class, and recall that the set determined by this equation is a hyperplane. The set of points on one side of the hyperplane is called a half-space. Examples are shown in Figure 1.
When we plot examples in two dimensions, the hyperplanes are actually lines. But you shouldn’t think of them as lines — you should think of them as hyperplanes.
If it’s possible to choose a linear decision boundary that correctly classifies all of the training cases, the training set is said to be linearly separable. As we’ll see later, not all training sets are linearly separable.
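As an illustration (assuming NumPy; the function name is our own), checking whether a particular hyperplane separates a training set amounts to checking that every training case lands on its correct side:

    import numpy as np

    def separates(w, b, X, t):
        # True if the hyperplane w^T x + b = 0 classifies every training case correctly.
        # X is an N x D array of inputs; t is a length-N array of {0, 1} labels.
        # Points exactly on the boundary (z = 0) are treated as positive, as in the model above.
        z = X @ w + b
        y = (z >= 0).astype(int)
        return bool(np.all(y == t))

    # Example: the AND weights from Section 2 separate the AND training set.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    t = np.array([0, 0, 0, 1])
    print(separates(np.array([1.0, 1.0]), -1.5, X, t))   # True

Deciding whether any separating hyperplane exists at all is a harder question; the convexity argument in Section 5 gives one way to show that none exists.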
Figure 1: (a) Training examples for the NOT function, in data space. (b) NOT, in weight space. (c) Slice of data space for the AND function corresponding to x0 = 1. (d) Slice of weight space for the AND function corresponding to w0 = −1.
3.2 Weight space
As you’d expect from the name, weight vectors are also vectors, and the space they live in is called weight space. In this section, we’ll assume there is no bias parameter unless stated otherwise. (See Section 2.1.) Each point in weight space is a possible weight vector.
Consider a positive training case (x, 1). The set of weight vectors which correctly classify this training case is given by the linear inequality wT x ≥ 0. (In fact, it’s exactly the sort of inequality we derived in Examples 1 and 2.) Geometrically, the set of points satisfying this inequality is a half-space. For lack of a better term, we’ll refer to the side which satisfies the constraint as the good region, and the other side as the bad region. Similarly, the set of weight vectors which correctly classify a negative training case (x, 0) is given by wT x < 0; this is also a half-space. Examples are shown in Figure 1.
The set of weight vectors which correctly classify all of the training examples is the intersection of all the half-spaces corresponding to the individual examples. This set is called the feasible region. If the feasible region is nonempty, the problem is said to be feasible; otherwise it’s said to be infeasible. (Note that there is one constraint per training example. What happened to the fourth constraint in Figure 1(d)?)
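The same kind of check works in weight space. The following sketch (assuming NumPy, and assuming the bias has been folded in as a dummy feature so there is no separate b) tests whether a candidate weight vector lies in the feasible region, i.e. satisfies the constraint from every training example:

    import numpy as np

    def in_feasible_region(w, X, t):
        # Positive cases (t = 1) require w^T x >= 0; negative cases (t = 0) require w^T x < 0.
        z = X @ w
        return bool(np.all(z[t == 1] >= 0) and np.all(z[t == 0] < 0))

    # AND with a dummy feature x0 = 1, so w = (w0, w1, w2) and w0 plays the role of the bias.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
    t = np.array([0, 0, 0, 1])
    print(in_feasible_region(np.array([-1.5, 1.0, 1.0]), X, t))   # True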
When we draw the constraints in two dimensions, we typically draw the line corresponding to the boundary of the constraint set, and then indicate the good region with an arrow. As with our data space visualizations, you should think of the boundary as a hyperplane, not as a line.
We can visualize three-dimensional examples by looking at slices. As shown in Figure 2, these slices will resemble our previous visualizations, except that the decision boundaries and constraints need not pass through the origin.

Figure 2: Visualizing a slice of a 3-dimensional weight space.
4 The perceptron learning rule
The perceptron is a kind of binary linear classifier. Recall from last lecture that this means it makes predictions by computing w^T x + b and seeing if the result is positive or negative. Here, x is the input vector, w is the weight vector, and b is a scalar-valued bias. Recall as well that we can eliminate the bias by adding a dummy dimension to x. For the perceptron algorithm, it will be convenient to represent the positive and negative classes with 1 and -1, instead of 1 and 0 as we use in the rest of the course. Therefore, the classification model is as follows:

z = w^T x    (1)
y = 1 if z ≥ 0, and y = −1 if z < 0.    (2)

(We’re going to completely ignore the fact that one of these inequalities is strict and the other is not. The question of what happens on the decision boundary isn’t very interesting.)
Here’s a rough sketch of the perceptron algorithm. We examine each of the training cases one at a time. For each input x(i), we compute the prediction y(i) and see if it matches the target t(i). If the prediction is correct, we do nothing. If it is wrong, we adjust the weights in a direction that makes it more correct.
Now for the details. First of all, how do we determine if the prediction is correct? We could simply check if y(i) = t(i), but this has a slight problem: if x(i) lies exactly on the classification boundary, it is technically classified as positive according to the above definition. But we don’t want our training cases to lie on the decision boundary, since this means the classification may change if the input is perturbed even slightly. We’d like our classifiers to be more robust than this. Instead, we’ll use the stricter criterion
z(i)t(i) > 0. (3)
You should now check that this criterion correctly handles the various cases that may occur.
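If you want to check the criterion mechanically, the following tiny sketch (labels in {−1, +1}, as in this section) prints the outcome for each sign combination; note that z = 0 counts as a mistake, which is exactly the robustness we want:

    for t in (+1, -1):
        for z in (-2.0, 0.0, 2.0):
            print(f"t = {t:+d}, z = {z:+.1f}: correct = {z * t > 0}")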
The other question is, how do we adjust the weight vector? If the training case is positive and we classify it as negative, we’d like to increase the value of z. In other words, we’d like
z′ = w′^T x > w^T x = z,    (4)

where w′ and w are the new and old weight vectors, respectively. The perceptron algorithm achieves this using the update

w′ = w + αx,    (5)

where α > 0. We now check that (4) is satisfied:

w′^T x = (w + αx)^T x    (6)
       = w^T x + α x^T x    (7)
       = w^T x + α ∥x∥^2.    (8)
Here, ∥x∥ represents the Euclidean norm of x. Since the squared norm is positive (as long as x is nonzero), we have z′ > z.
Conversely, if it’s a negative example which we mistakenly classified as positive, we want to decrease z, so we use a negative value of α. Since it’s possible to show that the absolute value of α doesn’t matter, we generally use α = 1 for positive cases and α = −1 for negative cases. We can denote this compactly with
w ← w + tx.    (9)

This rule is known as the perceptron learning rule.
Now we write out the perceptron algorithm in full:

For each training case (x^(i), t^(i)):
    z^(i) ← w^T x^(i)
    If z^(i) t^(i) ≤ 0:
        w ← w + t^(i) x^(i)
In thinking about this algorithm, remember that we’re denoting the classes with -1 and 1 (rather than 0 and 1, as we do in the rest of the course).
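Here is a minimal Python/NumPy implementation of the algorithm as stated above (a sketch, not official course code). It assumes labels in {−1, +1}, a dummy feature for the bias, and simply sweeps over the training set until it makes no mistakes, which only happens if the data are linearly separable:

    import numpy as np

    def perceptron(X, t, max_epochs=100):
        # X: N x D array of inputs (include a column of 1s as the dummy feature).
        # t: length-N array of labels in {-1, +1}.
        N, D = X.shape
        w = np.zeros(D)
        for _ in range(max_epochs):
            mistakes = 0
            for i in range(N):
                z = w @ X[i]
                if z * t[i] <= 0:            # wrong, or exactly on the boundary
                    w = w + t[i] * X[i]      # the perceptron learning rule
                    mistakes += 1
            if mistakes == 0:                # every case classified correctly
                break
        return w

    # Example: learn AND with a dummy feature x0 = 1 and labels in {-1, +1}.
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([-1, -1, -1, +1])
    w = perceptron(X, t)
    print(w, np.sign(X @ w))   # the signs match t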
5 The limits of linear classifiers
Linear classifiers can represent a lot of things, but they can’t represent everything. The classic example of what they can’t represent is the XOR function. It should be pretty obvious from inspection that you can’t draw a line separating the two classes. But how do we actually prove this?
5.1 Convex sets
An important geometric concept which helps us out here is convexity. A set S is convex if the line segment connecting any two points in S must lie within S. It’s not too hard to show that if S is convex, then any weighted average of points in S must also lie within S. A weighted average of points x(1), . . . , x(N) is a point given by the linear combination
x^(avg) = λ_1 x^(1) + ··· + λ_N x^(N),

where 0 ≤ λ_i ≤ 1 and λ_1 + ··· + λ_N = 1. You can think of the weighted average as the center of mass, where the mass of each point is given by λ_i. In the context of binary classification, there are two important sets that are always convex:
1. In data space, the positive and negative regions are both convex. Both regions are half-spaces, and it should be visually obvious that half-spaces are convex. This implies that if inputs x^(1), . . . , x^(N) are all in the positive region, then any weighted average must also be in the positive region. Similarly for the negative region.
2. In weight space, the feasible region is convex. The rough mathematical argument is as follows. Each good region (the set of weights which correctly classify one data point) is convex because it’s a half-space. The feasible region is the intersection of all the good regions, so it must be convex because the intersection of convex sets is convex.
Figure 3: The XOR function is not linearly separable.

Figure 4: No linear hypothesis can separate these two patterns in all possible translations (with wrap-around). (The figure reproduces a slide asking: if we just use pixels as the features, can a binary threshold unit discriminate between different patterns that have the same number of on pixels? Not if the patterns can translate with wrap-around.)
5.2 Showing that functions aren’t linearly separable
Now let’s see how convexity can be used to show functions aren’t linearly separable.
Example 3. Let’s return to the XOR example. Since the positive region is convex, if we draw the line segment connecting the two positive examples (0, 1) and (1, 0), this entire line segment must be classified as positive. Similarly, if we draw the line segment connecting the two negative examples (0, 0) and (1, 1), the entire line segment must be classified as negative. But these two line segments intersect at (0.5, 0.5), which means this point must be classified as both positive and negative, which is impossible. (See Figure 3.) Therefore, XOR isn’t linearly separable.
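The intersection in Example 3 is easy to verify numerically (a small sketch, not part of the notes): with λ_1 = λ_2 = 0.5, the weighted average of the two positive examples and the weighted average of the two negative examples are the same point, so the two line segments meet at (0.5, 0.5):

    import numpy as np

    pos = np.array([[0, 1], [1, 0]])   # positive XOR examples
    neg = np.array([[0, 0], [1, 1]])   # negative XOR examples

    print(pos.mean(axis=0))   # [0.5 0.5]
    print(neg.mean(axis=0))   # [0.5 0.5], the same point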
Example 4. Our last example was somewhat artificial. Let’s now turn to a somewhat more troubling, and practically relevant, limitation of linear classifiers. Let’s say we want to give a robot a vision system which can recognize objects in the world. Since the robot could be looking any given direction, it needs to be able to recognize objects regardless of their location in its visual field. I.e., it should be able to recognize a pattern in any possible translation.
As a simplification of this situation, let’s say our inputs are 16-dimensional binary vectors and we want to distinguish two patterns, A and B (shown in Figure 4), which can be placed in any possible translation, with wrap-around.
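To make the setup concrete, here is a hedged sketch (assuming NumPy; the specific pattern is made up for illustration) that generates every translation with wrap-around of a 16-dimensional binary pattern. Every translation of a pattern has the same number of on pixels, which is what makes this discrimination task hard for a linear classifier:

    import numpy as np

    def all_translations(pattern):
        # All cyclic shifts (translations with wrap-around) of a 1-D binary pattern.
        return np.stack([np.roll(pattern, k) for k in range(len(pattern))])

    # A made-up 16-pixel pattern with four on pixels (purely illustrative).
    A = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])
    shifts = all_translations(A)
    print(shifts.shape)         # (16, 16): one row per translation
    print(shifts.sum(axis=1))   # every translation has the same number of on pixels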