CS 4610/5335 Logistic Regression
Robert Platt Northeastern University
Material adapted from:
1. Lawson Wong, CS 5100

Use features (x) to predict targets (y)
Classification
Classification
Targets y are now either:
– Binary: {0, 1}
– Multi-class: {1, 2, …, K}
We will focus on binary case (Ex5 Q6 covers multi-class)
2

Classification
Focus: Supervised learning (e.g., regression, classification)
Use features (x) to predict targets (y)
Input: Dataset of n samples: {x(i), y(i)}, i = 1, …, n
Each x(i) is a p-dimensional vector of feature values
Output: Hypothesis hθ(x) in some hypothesis class H
H is parameterized by d-dim. parameter vector θ
Goal: Find the best hypothesis θ* within H
What does “best” mean? Optimizes objective function:
J(θ): Error function; L(pred, y): Loss function
A learning algorithm is the procedure for optimizing J(θ)
3
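
For reference, a standard way to write this objective (assumed here; the slide's own equation is not in the extracted text) is the average loss over the training data:
J(\theta) \;=\; \frac{1}{n} \sum_{i=1}^{n} L\big(h_\theta(x^{(i)}),\, y^{(i)}\big)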

Recall: Linear Regression
Hypothesis class for linear regression:
4
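
For reference, the standard linear-regression hypothesis this refers to (assumed here, since the slide showed it as an image) is:
h_\theta(x) = \theta^\top x = \theta_0 + \theta_1 x_1 + \dots + \theta_p x_p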

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
5

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
The semicolon distinguishes random variables being conditioned on (x)
from parameters (θ), which are not treated as random variables
6

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
In the binary case, P(y=0 | x; θ) = 1 – hθ(x)
d = dimension of parameter vector θ
p = dimension of feature space x
For logistic regression, d = p + 1 (same as before)
7

Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
σ = sigmoid function (also known as logistic function)
8
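
For reference, the standard definitions referred to here (assumed, since the slide's equations were images) are:
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad h_\theta(x) = P(y = 1 \mid x;\, \theta) = \sigma(\theta^\top x) = \frac{1}{1 + e^{-\theta^\top x}}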

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
The hypothesis class is the set of linear classifiers
(Figure: example data plotted in the (x1, x2) feature plane with linear decision boundaries)
9

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
The hypothesis class is the set of linear classifiers
(Figure: example data plotted in the (x1, x2) feature plane with linear decision boundaries)
10

Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
11

Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
12

Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
Log-linear model
13
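
For reference, the log-odds identity behind the log-linear view (standard form, assumed here) is:
\log \frac{P(y=1 \mid x; \theta)}{P(y=0 \mid x; \theta)} = \log \frac{\sigma(\theta^\top x)}{1 - \sigma(\theta^\top x)} = \theta^\top x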

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
14
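
For reference, the squared-error objective used for linear regression (standard form, assumed here; the constant factor varies by convention) is:
J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2, \qquad L(h, y) = \tfrac{1}{2}(h - y)^2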

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
15

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)?
See Ex5 Q4 for a different error function to derive ridge regression, a variant of linear regression
16

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)?
See Ex5 Q4 for a different error function to derive ridge regression, a variant of linear regression
Why squared-error loss L(h, y)?
See Ex5 Q5 for a derivation of the squared-error loss and J(θ) using the maximum-likelihood principle, assuming that labels have Gaussian noise
17

Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
18

Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
19

Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
Will show: Max likelihood is equivalent to minimizing J(θ):
with the cross-entropy loss (logistic loss, log loss):
L(hθ(x(i)), y(i)) = −[ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
20
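
A minimal NumPy sketch of this loss, for illustration only (the function and variable names and the clipping constant are my own choices, not from the slides):

    import numpy as np

    def sigmoid(z):
        # logistic function: sigma(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy_loss(theta, X, y, eps=1e-12):
        # X: (n, d) feature matrix with a leading column of ones for the bias term
        # y: (n,) vector of 0/1 labels
        h = np.clip(sigmoid(X @ theta), eps, 1.0 - eps)   # P(y=1 | x; theta), clipped to avoid log(0)
        return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))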

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
21

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = ?
22

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H / (H + T)?
Why? Can we derive this formally?
23

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H / (H + T)?
Why? Can we derive this formally?
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
24

Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q)
q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H / (H + T)?
Why? Can we derive this formally?
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
25

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
26

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
27

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
For the biased coin (θ: P(H) = q):
28
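
For H observed heads and T observed tails, the likelihood (standard form, assumed here; the binomial coefficient does not affect the maximizer) is:
L(q) = q^{H} (1-q)^{T}, \qquad \log L(q) = H \log q + T \log(1 - q)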

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
29

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
Calculus review: Product rule of differentiation
30

Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
To find maximizing value of q, set derivative to 0 and solve:
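A reconstruction of the missing algebra, using the product rule as the previous slide suggests:
\frac{d}{dq}\Big[q^{H}(1-q)^{T}\Big] = H q^{H-1}(1-q)^{T} - T q^{H}(1-q)^{T-1} = q^{H-1}(1-q)^{T-1}\big[H(1-q) - Tq\big] = 0 \;\Rightarrow\; q = \frac{H}{H+T}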
31

Apply Maximum-Likelihood to Logistic Regression
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
For the biased coin (P(H) = q):
Maximizing likelihood leads us to infer q = H / (H + T)
32

Logistic regression: Example: MNIST
33

Logistic regression: Example: MNIST (0 vs. 1)
34

Features?
Logistic regression: Example
35

Features?
Logistic regression: Example
Bias-term only (θ0)
36

Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
37

Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
Why is line
not at 50%?
38

Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
Why is line
not at 50%?
In test set:
0: 980 samples; 1: 1135 samples
39
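
A plausible reading (my inference, not stated on the slide): with only a bias term the model predicts the majority class, 1, for every image, so the flat line sits at 1135 / (980 + 1135) ≈ 53.7% rather than at 50%.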

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Sum of pixel values
40

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Sum of pixel values
Even worse!
41

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28) (digit images are
28*28, grayscale)
42

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28) (digit images are
28*28, grayscale)
Theoretically, should not make a difference!
43

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28)
Theoretically, should not make a difference!
In practice, it does. Useful to normalize data
(0 ≤ mean ≤ 1)
44

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
45

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
46

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
47

Features?
Logistic regression: Example
θ0: Bias-term, θ1: Mean of pixel values
Learned weights: θ0 = 0.08538, θ1 = -0.7640
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
48

Features?
Logistic regression: Example
θ0: Bias-term, θ1: Mean of pixel values
Learned weights: θ0 = 0.08538, θ1 = -0.7640
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
Model predicts P(y=1 | x; θ) > 0.5 when θ0 + θ1 x > 0
→ when x < 0.08538 / 0.7640 ≈ 0.1118
49

Features?
Logistic regression: Example
θ0: Bias-term, θ1: Mean of pixel values
Learned weights: θ0 = 0.08538, θ1 = -0.7640
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
Model predicts P(y=1 | x; θ) > 0.5 when θ0 + θ1 x > 0
→ when x < 0.08538 / 0.7640 ≈ 0.1118
Conclusion: Classifying 0 and 1 is quite easy...
50

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
51-53

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
97.26% accuracy
Which features are useful?
54

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
97.26% accuracy
Which features are useful?
Do ablation analysis: Remove features and compare performance
55-56

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
93.85% accuracy
Which features are useful?
Do ablation analysis: Remove features and compare performance
57

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
93.85% accuracy
When row/col sums are present: pixel mean not useful, bias term is useful
58

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
59-60

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
99.91% test accuracy! (2 false positives: true 0, pred 1)
61-63

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
64-66

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
... continued in next lecture
67
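
A sketch of how feature vectors like the ones above could be assembled (my own illustration in NumPy; the slides show no code, and the exact preprocessing may differ):

    import numpy as np

    def features(img):
        # img: 28x28 grayscale digit, values in [0, 255]
        x = img.astype(float) / 255.0      # normalize so 0 <= values <= 1
        feats = [1.0]                      # theta_0: bias term
        feats.append(x.mean())             # theta_1: mean of pixels
        feats.extend(x.mean(axis=1))       # theta_2..theta_29: row means (28)
        feats.extend(x.mean(axis=0))       # theta_30..theta_57: col means (28)
        feats.extend(x.ravel())            # theta_58..theta_841: individual pixels (784)
        return np.array(feats)             # length 842 = 1 + 1 + 28 + 28 + 784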

Logistic Regression
Another look at the full algorithm...
68

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
This is a probabilistic model of the data!
Use maximum-likelihood principle to estimate θ
69

Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
This is a probabilistic model of the data!
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
70

Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
71

Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
Common assumption: Data samples are independent and identically distributed (IID), given θ
72

Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
Common assumption: Data samples are independent and identically distributed (IID), given θ
Easier to handle in log space: log-likelihood function l(θ):
Since log is monotonic, l(θ) and L(θ) have same maximizer
73

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
74

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Probabilistic model in logistic regression:
75-76

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Probabilistic model in logistic regression:
Simplify with a trick: P(y | x; θ) = hθ(x)^y (1 − hθ(x))^(1−y)
77

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
78

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
79

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted term is the (negative) loss function!
80

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted term is the (negative) loss function!
L(hθ(x(i)), y(i)) = −[ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
Names: Cross-entropy loss, logistic loss, log loss
81

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted term is the (negative) loss function!
L(hθ(x(i)), y(i)) = −[ y(i) log hθ(x(i)) + (1 − y(i)) log(1 − hθ(x(i))) ]
Names: Cross-entropy loss, logistic loss, log loss
Instead of defining the loss / error function arbitrarily,
we derived it using the maximum-likelihood principle
82

Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
To solve this optimization problem, find the gradient,
and do (stochastic) gradient ascent (max vs. min)
83

Logistic Regression
See next slide
84

Logistic Regression
85

Logistic Regression
Recall: σ′(z) = σ(z)(1 − σ(z))
Hence:
86

Logistic Regression
See prev slide
87

Logistic Regression
Bias term; equivalent if x0(i) = 1
See prev slide
88
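
The gradient slides above were shown as equations in the original. The standard result they derive (assumed here) is ∂l(θ)/∂θj = Σi (y(i) − hθ(x(i))) xj(i), using σ′(z) = σ(z)(1 − σ(z)). A minimal batch gradient-ascent sketch (the learning rate and iteration count are arbitrary choices of mine):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, lr=0.1, iters=1000):
        # X: (n, d) with a leading column of ones (bias); y: (n,) labels in {0, 1}
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(iters):
            h = sigmoid(X @ theta)      # P(y=1 | x; theta) for every sample
            grad = X.T @ (y - h) / n    # gradient of the (average) log-likelihood
            theta += lr * grad          # ascent step: maximize l(theta)
        return theta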

Logistic Regression
Similar form as linear regression gradient!
Both are generalized linear models (GLMs)
89

Logistic Regression
90

Logistic Regression
Can similarly extend to stochastic / minibatch versions
With linear algebra: Iteratively reweighted least squares
91

Logistic Regression
The remainder of this lecture is not covered on the exam.
92

Logistic regression: Example: MNIST
93

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
99.91% test accuracy! (2 false positives: true 0, pred 1)
94-96

Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means
θ30 to θ57: Col means
θ58 to θ841: Individual pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
97-99

Logistic regression: Example: MNIST (0 vs. 1)
We could try to match image patches... (e.g., lines, curves)
100

Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
101-102

Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
3 false positives. (99.86% accuracy)
Previous two, and a new one:
103

Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
104-105

Logistic regression: Example: MNIST
Multi-class: 10-class classification
106

Logistic regression: Example
Binary logistic regression:
107

Logistic regression: Example
Binary logistic regression:
Multi-class logistic regression:
There are now K parameter vectors
– one vector per class
– Number of parameters d = K * (p+1)
108

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
109-110
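
One common way to realize the K-parameter-vector model from the multi-class slides is a softmax over per-class scores, P(y = k | x; θ) = exp(θk·x) / Σj exp(θj·x). A small prediction sketch (my own, with a numerically stabilized softmax):

    import numpy as np

    def softmax_predict(Theta, x):
        # Theta: (K, p+1) matrix, one parameter vector per class; x: (p+1,) with bias 1
        scores = Theta @ x
        scores -= scores.max()      # subtract max for numerical stability
        probs = np.exp(scores)
        probs /= probs.sum()        # P(y=k | x; theta) for k = 0..K-1
        return probs.argmax(), probs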

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
87.99% accuracy
What accuracy does predicting at random achieve?
111

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
112

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
Visualize error: Confusion matrix
Row = true class, Col = predicted class (on diagonal = correct)
113

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
Visualize error: Confusion matrix
Row = true class, Col = predicted class (on diagonal = correct)
– if most correct, only looking at off-diagonal is useful
114

Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
87.99% accuracy
Maybe not enough data?
Graph only uses 3200 of 60000 samples available in training set
115

Logistic regression: Example
N = 180000 samples – using each sample 3 times
116-118

Logistic regression: Example
N = 180000 samples – using each sample 3 times
Sample ordering: Going through training set in order, round-robin through the classes
119

Logistic regression: Example
N = 180000 samples – using each sample 3 times
Sample ordering: Going through training set in order, round-robin through the classes
... but there are 6742 samples in class 1 (out of 60000)
– all classes except 1 and 7 have < 6000 samples
120

Logistic regression: Example
Smaller example:
Class 0: 1760 samples
Classes 1-9: 160 samples
Train round-robin classes 0-9
0123456789
0123456789
...
0000000000 (row 161)
...
121

Logistic regression: Example
Smaller example:
Class 0: 1760 samples
Classes 1-9: 160 samples
Train round-robin classes 0-9
Modified order:
0000000001
0000000002
...
122-123

Logistic regression: Example
Even in balanced case, “data splicing” (training round-robin through classes) is important
Order:
320 class 0 samples
320 class 1 samples
...
320 class 9 samples
124-125

Bag of tricks
Normalize features (whiten – mean 0, variance 1)
Numerical issues (exp, log) – log-sum-exp trick
Balanced data (do not have too much of one class)
Splicing data (alternate between classes during training)
Data augmentation (increase variation in training data)
“One in ten rule” – at least 10:1 sample:parameter ratio
Use the same random seed (for debugging)
If results are too good to be true, be very skeptical
– Did you use the test set for training?
Visualize! (Learning curves, feature weights, errors, etc.)
126

Role of hyperparameters
Choose with cross-validation
127-128

Logistic regression: Example
N = 320000 samples – using each sample 5-6 times
With appropriate data balancing and splicing:
91.93% test accuracy (vs. 87.99% for N = 3200)
129