
CSC 311: Introduction to Machine Learning
Lecture 3 – Linear Classification
Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews

Last class, we discussed linear regression, and used a modular approach to machine learning algorithm design:
- choose a model: y = f(x) = w⊤x + b
- choose a loss: L(y, t) = ½(y − t)²
- formulate an optimization problem: minimize Σ_{i=1}^N L(y^(i), t^(i)) over (w, b)
- solve the minimization problem using one of two strategies
  - direct solution (set derivatives to zero)
  - gradient descent
- vectorize the algorithm, i.e. represent it in terms of linear algebra
- make a linear model more powerful using feature expansion
- improve generalization by adding a regularizer
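To make the recap concrete, here is a minimal NumPy sketch of that pipeline (vectorized model, averaged squared-error cost, gradient descent); the toy data, learning rate, and iteration count are illustrative assumptions, not values from the course:

import numpy as np

# Toy dataset (illustrative values only): N examples, D features
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
t = np.array([1.0, 2.0, 3.0, 4.0])
N, D = X.shape

w, b = np.zeros(D), 0.0
alpha = 0.1                         # learning rate (assumed)

for _ in range(1000):
    y = X @ w + b                   # model: y = w^T x + b, vectorized over the dataset
    grad_w = X.T @ (y - t) / N      # gradient of the average squared-error cost w.r.t. w
    grad_b = np.mean(y - t)         # gradient w.r.t. b
    w -= alpha * grad_w
    b -= alpha * grad_b

print(w, b)                         # learned parameters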

Classification Setup
Classification: predicting a discrete-valued target
- Binary classification: predicting a binary-valued target

Recall the notation:
- Training data: (x^(1), t^(1)), (x^(2), t^(2)), …, (x^(N), t^(N))
- x^(i) are the inputs
- t^(i) are the (discrete-valued) targets

Examples of binary classification tasks:
- predict whether a patient has a disease, given the presence or absence of various symptoms
- classify e-mails as spam or non-spam
- predict whether a financial transaction is fraudulent

The Binary Linear Classification Model
- classification: predict a discrete-valued target
- binary: predict a binary target t ∈ {0, 1}
  - Training examples with t = 1 are called positive examples, and training examples with t = 0 are called negative examples.
  - The choice of t ∈ {0, 1} (or t ∈ {−1, +1}) is for computational convenience.
- linear: the model is a linear function of x, followed by a threshold r:
  z = w⊤x + b
  y = 1 if z ≥ r, and y = 0 if z < r

Example: the NOT function. Add a dummy feature x0 that is always 1 (so w0 plays the role of the bias) and take the threshold to be r = 0. This is our "training set":

x0  x1  t
 1   0  1
 1   1  0

What conditions are needed on w0, w1 to classify all examples?
- When x1 = 0, need: z = w0x0 + w1x1 > 0 ⇐⇒ w0 > 0
- When x1 = 1, need: z = w0x0 + w1x1 < 0 ⇐⇒ w0 + w1 < 0
Example solution: w0 = 1, w1 = −2. Is this the only solution?

Example: the AND function.

x0  x1  x2  t
 1   0   0  0
 1   0   1  0
 1   1   0  0
 1   1   1  1

z = w0x0 + w1x1 + w2x2
- need: w0 < 0
- need: w0 + w2 < 0
- need: w0 + w1 < 0
- need: w0 + w1 + w2 > 0
Example solution: w0 = −1.5, w1 = 1, w2 = 1
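As a quick sanity check (my own sketch, not part of the slides), the following NumPy snippet verifies that the example weights above classify the NOT and AND training sets correctly; the helper name predict is arbitrary:

import numpy as np

def predict(X, w):
    # Binary linear classifier with the dummy feature already included: y = 1 if w^T x >= 0, else 0
    return (X @ w >= 0).astype(int)

# NOT example: columns are (x0, x1), with dummy x0 = 1
X_not = np.array([[1, 0], [1, 1]])
t_not = np.array([1, 0])
print(predict(X_not, np.array([1.0, -2.0])))       # [1 0], matches t_not

# AND example: columns are (x0, x1, x2)
X_and = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])
t_and = np.array([0, 0, 0, 1])
print(predict(X_and, np.array([-1.5, 1.0, 1.0])))  # [0 0 0 1], matches t_and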

The Geometric Picture
Input Space, or Data Space, for the NOT example:

x0  x1  t
 1   0  1
 1   1  0

- Training examples are points.
- Weights (hypotheses) w can be represented by half-spaces: H+ = {x : w⊤x ≥ 0}, H− = {x : w⊤x < 0}
  - The boundaries of these half-spaces pass through the origin (why?)
- The boundary is the decision boundary: {x : w⊤x = 0}
  - In 2-D it is a line, but think of it as a hyperplane.
- If the training examples can be perfectly separated by a linear decision rule, we say the data is linearly separable.

Weight Space
- Weights (hypotheses) w are points.
- Each training example x specifies a half-space w must lie in to be correctly classified: w⊤x > 0 if t = 1.
- For the NOT example:
  - x0 = 1, x1 = 0, t = 1 ⟹ (w0, w1) ∈ {w : w0 > 0}
  - x0 = 1, x1 = 1, t = 0 ⟹ (w0, w1) ∈ {w : w0 + w1 < 0}
- The region satisfying all the constraints is the feasible region; if this region is nonempty, the problem is feasible, otherwise it is infeasible.

The AND example requires three dimensions, including the dummy one. To visualize data space and weight space for a 3-D example, we can look at a 2-D slice. The visualizations are similar.
- The feasible set will always have a corner at the origin.

Visualizations of the AND example:
- Data Space: slice for x0 = 1
  - example solution: w0 = −1.5, w1 = 1, w2 = 1
  - decision boundary: w0x0 + w1x1 + w2x2 = 0 ⟹ −1.5 + x1 + x2 = 0
- Weight Space: slice for w0 = −1.5 for the constraints
  - w0 + w1 < 0
  - w0 + w1 + w2 > 0
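The original plots do not survive text extraction, so here is a small matplotlib sketch (an illustrative reconstruction, not the original figure) of the data-space slice for the AND example at x0 = 1, together with the decision boundary −1.5 + x1 + x2 = 0:

import numpy as np
import matplotlib.pyplot as plt

# AND training set in the slice x0 = 1: plot (x1, x2), marked by target t
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
t = np.array([0, 0, 0, 1])

plt.scatter(x1[t == 1], x2[t == 1], marker='o', label='positive (t = 1)')
plt.scatter(x1[t == 0], x2[t == 0], marker='x', label='negative (t = 0)')

# Decision boundary for the example solution w = (-1.5, 1, 1):
# -1.5 + x1 + x2 = 0  =>  x2 = 1.5 - x1
xs = np.linspace(-0.5, 2.0, 50)
plt.plot(xs, 1.5 - xs, label='decision boundary')

plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.show()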

The Geometric Picture
Some datasets are not linearly separable, e.g. XOR
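Linear separability can be checked numerically with a linear-programming feasibility test; the sketch below (my own illustration using scipy.optimize.linprog, not something from the slides) confirms that the AND data is separable while XOR is not. Asking for a margin of 1 instead of a strict inequality loses no generality, since any strictly separating w can be rescaled.

import numpy as np
from scipy.optimize import linprog

def separable(X, t):
    # Feasibility LP: find w with w^T x >= 1 on positives and w^T x <= -1 on negatives.
    # Such a w exists iff the data is linearly separable.
    signs = np.where(t == 1, 1.0, -1.0)
    A_ub = -signs[:, None] * X           # signs * (X @ w) >= 1  <=>  A_ub @ w <= -1
    b_ub = -np.ones(len(t))
    res = linprog(c=np.zeros(X.shape[1]), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * X.shape[1])
    return res.success

X_and = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t_and = np.array([0, 0, 0, 1])
X_xor = X_and.copy()
t_xor = np.array([0, 1, 1, 0])

print(separable(X_and, t_and))   # True: a feasible weight vector exists
print(separable(X_xor, t_xor))   # False: XOR is not linearly separable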

Summary: binary linear classifiers
Binary Linear Classifiers. Targets t ∈ {0, 1}:
z = w⊤x + b
y = 1 if z ≥ 0, and y = 0 if z < 0
- If the training set is separable, we can solve for w, b using linear programming.
- If it's not separable, the problem is harder
  - data is almost never separable in real life.

What can we do instead? (Spoiler answer: Logistic Regression.) Let's derive Logistic Regression by considering a few alternatives first.

Loss Functions

Instead: define a loss function, then try to minimize the resulting cost function.
- Recall: the cost is the loss averaged (or summed) over the training set.
Seemingly obvious loss function: 0-1 loss
L0−1(y, t) = 0 if y = t, and 1 if y ≠ t; equivalently, L0−1(y, t) = I[y ≠ t]

Attempt 1: 0-1 Loss

Usually, the cost J is the averaged loss over training examples; for the 0-1 loss, this is the misclassification rate/error:
J = (1/N) Σ_{i=1}^N I[y^(i) ≠ t^(i)]
What is the issue with using this loss? The step function (0-1 loss) has zero derivative almost everywhere.

Attempt 2: Linear Regression

Sometimes we can replace the loss function we care about with one that is easier to optimize. This is known as relaxation with a smooth surrogate loss function.
One problem with L0−1 is that it is defined in terms of the final prediction, which inherently involves a discontinuity. Instead, define the loss in terms of w⊤x + b directly.
- Redo notation for convenience: z = w⊤x + b

We already know how to fit a linear regression model using the squared error loss. Can we use that same loss here?
z = w⊤x + b
LSE(z, t) = ½(z − t)²
It doesn't matter that the targets are actually binary; treat them as continuous values. For this loss function, it makes sense to make final predictions by thresholding z at ½ (why?)

The problem: this loss function penalizes you when you make correct predictions with high confidence! If t = 1, the loss is larger when z = 10 than when z = 0.

Attempt 3: Logistic Activation Function

There's obviously no reason to predict values outside [0, 1]. Let's squash y into this interval. The logistic function is a kind of sigmoid, or S-shaped function:
σ(z) = 1 / (1 + e^{−z})
σ^{−1}(y) = log(y / (1 − y)) is called the logit.
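A tiny NumPy sketch (illustrative only, not from the slides) of the logistic function and its inverse, the logit, confirming they invert each other on a few values:

import numpy as np

def sigmoid(z):
    # Logistic function sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def logit(y):
    # Inverse of the logistic function: log(y / (1 - y))
    return np.log(y / (1.0 - y))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
y = sigmoid(z)
print(y)         # values squashed into (0, 1); sigmoid(0) = 0.5
print(logit(y))  # recovers z, up to floating-point error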
A linear model with a logistic nonlinearity is known as log-linear:
z = w⊤x + b
y = σ(z)
LSE(y, t) = ½(y − t)²
Used in this way, σ is called an activation function.

The problem: (plot of LSE as a function of z, assuming t = 1)
∂L/∂wj = (∂L/∂z)(∂z/∂wj)
For z ≪ 0, we have σ(z) ≈ 0.
∂L/∂z ≈ 0 (check!) ⟹ ∂L/∂wj ≈ 0 ⟹ the derivative w.r.t. wj is small ⟹ wj is like a critical point.
If the prediction is really wrong, you should be far from a critical point (which is your candidate solution).

Logistic Regression

Because y ∈ [0, 1], we can interpret it as the estimated probability that t = 1. The pundits who were 99% confident Clinton would win were much more wrong than the ones who were only 90% confident.
Cross-entropy loss (aka log loss) captures this intuition:
LCE(y, t) = −log y if t = 1, and −log(1 − y) if t = 0
          = −t log y − (1 − t) log(1 − y)

Logistic Regression:
z = w⊤x + b
y = σ(z)
LCE = −t log y − (1 − t) log(1 − y)
(Plot is for target t = 1.)

Problem: what if t = 1 but you're really confident it's a negative example (z ≪ 0)? If y is small enough, it may be numerically zero. This can cause very subtle and hard-to-find bugs:
y = σ(z) ⟹ y ≈ 0
LCE = −t log y − (1 − t) log(1 − y) ⟹ computes log 0
Instead, we combine the activation function and the loss into a single logistic-cross-entropy function:
LLCE(z, t) = LCE(σ(z), t) = t log(1 + e^{−z}) + (1 − t) log(1 + e^{z})
Numerically stable computation:
E = t * np.logaddexp(0, -z) + (1-t) * np.logaddexp(0, z)

Comparison of loss functions (plot, for t = 1).
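To see the numerical issue concretely, here is a short NumPy sketch (my illustration; only the np.logaddexp line above comes from the slide) comparing the naive cross-entropy of σ(z) with the fused logistic-cross-entropy at a very negative z:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_naive(z, t):
    # Cross-entropy applied to y = sigma(z) directly; underflows for very negative z
    y = sigmoid(z)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

def ce_stable(z, t):
    # Logistic-cross-entropy, as on the slide: t*log(1+e^{-z}) + (1-t)*log(1+e^{z})
    return t * np.logaddexp(0, -z) + (1 - t) * np.logaddexp(0, z)

z, t = -1000.0, 1.0
print(ce_naive(z, t))   # inf: sigma(z) underflows to 0 and log(0) is taken (NumPy also warns)
print(ce_stable(z, t))  # 1000.0, the correct loss value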
Gradient Descent

How do we minimize the cost J in this case? No direct solution.
- Taking derivatives of J w.r.t. w and setting them to 0 doesn't have an explicit solution.
Now let's see a second way to minimize the cost function which is more broadly applicable: gradient descent.
Gradient descent is an iterative algorithm, which means we apply an update repeatedly until some criterion is met.
We initialize the weights to something reasonable (e.g. all zeros) and repeatedly adjust them in the direction of steepest descent.

Gradient for Logistic Regression

Back to logistic regression:
LCE(y, t) = −t log(y) − (1 − t) log(1 − y)
y = 1 / (1 + e^{−z}) and z = w⊤x + b
∂LCE/∂wj = (∂LCE/∂y)(∂y/∂z)(∂z/∂wj) = (−t/y + (1 − t)/(1 − y)) · y(1 − y) · xj = (y − t) xj
Exercise: verify this!
Gradient descent (coordinate-wise) update to find the weights of logistic regression:
wj ← wj − α ∂J/∂wj = wj − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x_j^(i)

Logistic Regression

Comparison of gradient descent updates:
- Linear regression (verify!): w ← w − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x^(i)
- Logistic regression: w ← w − (α/N) Σ_{i=1}^N (y^(i) − t^(i)) x^(i)
Not a coincidence! These are both examples of generalized linear models. But we won't go into further detail.
Notice the 1/N in front of the sums, due to averaged losses. This is why you need a smaller learning rate when you optimize the sum of losses instead (α′ = α/N).

Stochastic Gradient Descent

So far, the cost function J has been the average loss over the training examples:
J(θ) = (1/N) Σ_{i=1}^N L^(i) = (1/N) Σ_{i=1}^N L(y(x^(i), θ), t^(i)).
By linearity,
∂J/∂θ = (1/N) Σ_{i=1}^N ∂L^(i)/∂θ.
Computing the gradient requires summing over all of the training examples. This is known as batch training.
Batch training is impractical if you have a large dataset N ≫ 1 (e.g. millions of training examples)!
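Before moving on to stochastic gradients: the slide above says "Exercise: verify this!", and one quick way to do so is a finite-difference check. The following sketch (my own, not part of the slides; the single-example setup and helper names are assumptions) confirms numerically that the per-example gradient is (y − t) x:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, t):
    # Cross-entropy loss of a single example for logistic regression (bias omitted for brevity)
    y = sigmoid(w @ x)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

rng = np.random.default_rng(0)
w, x, t = rng.normal(size=3), rng.normal(size=3), 1.0

analytic = (sigmoid(w @ x) - t) * x          # claimed gradient: (y - t) x

eps = 1e-6                                   # central finite differences, coordinate by coordinate
numeric = np.array([(loss(w + eps * e, x, t) - loss(w - eps * e, x, t)) / (2 * eps)
                    for e in np.eye(3)])

print(np.allclose(analytic, numeric, atol=1e-5))  # True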
Stochastic Gradient Descent

Stochastic gradient descent (SGD): update the parameters based on the gradient for a single training example:
1. Choose i uniformly at random
2. θ ← θ − α ∂L^(i)/∂θ
The cost of each SGD update is independent of N. SGD can make significant progress before even seeing all the data!
Mathematical justification: if you sample a training example uniformly at random, the stochastic gradient is an unbiased estimate of the batch gradient:
E[∂L^(i)/∂θ] = (1/N) Σ_{i=1}^N ∂L^(i)/∂θ = ∂J/∂θ.
- The variance in this estimate may be high.
- If we only look at one training example at a time, we can't exploit efficient vectorized operations.

Compromise approach: compute the gradients on a randomly chosen medium-sized set of training examples M ⊂ {1, …, N}, called a mini-batch.
Stochastic gradients computed on larger mini-batches have smaller variance. This is similar to bagging.
The mini-batch size |M| is a hyperparameter.
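Putting the pieces of the lecture together, here is a minimal mini-batch SGD sketch for logistic regression in NumPy (my own illustration; the synthetic data, learning rate, batch size, and step count are assumed values). It uses the per-example gradient (y − t)x and the mini-batch averaging described above:

import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly linearly separable data (assumed for illustration): N examples, D features
N, D = 1000, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
t = (X @ w_true + 0.1 * rng.normal(size=N) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.zeros(D), 0.0
alpha, batch_size = 0.1, 50                   # hyperparameters (assumed values)

for step in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)   # random mini-batch M
    Xb, tb = X[idx], t[idx]
    y = sigmoid(Xb @ w + b)
    grad_w = Xb.T @ (y - tb) / batch_size     # average of (y - t) x over the mini-batch
    grad_b = np.mean(y - tb)
    w -= alpha * grad_w
    b -= alpha * grad_b

# Training misclassification rate (0-1 loss averaged over the training set)
y_all = sigmoid(X @ w + b) > 0.5
print(np.mean(y_all != t))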