CS 4610/5335 Logistic Regression
Robert Platt Northeastern University
Material adapted from:
1. Lawson Wong, CS 5100
Use features (x) to predict targets (y)
Classification
Classification
Targets y are now either: – Binary: {0, 1}
– Multi-class: {1, 2, …, K}
We will focus on binary case (Ex5 Q6 covers multi-class)
2
Classification
Focus: Supervised learning (e.g., regression, classification)
Use features (x) to predict targets (y)
Input: Dataset of n samples: {x(i), y(i)}, i = 1, …, n
Each x(i) is a p-dimensional vector of feature values
Output: Hypothesis hθ(x) in some hypothesis class H
H is parameterized by d-dim. parameter vector θ
Goal: Find the best hypothesis θ* within H
What does “best” mean? Optimizes objective function:
J(θ): error function; L(pred, y): loss function
A learning algorithm is the procedure for optimizing J(θ)
3
Recall: Linear Regression
Hypothesis class for linear regression: hθ(x) = θᵀx (with x0 = 1 for the bias term)
4
Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
5
Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
The semicolon distinguishes between random variables that are being conditioned on (x)
and parameters (θ), which are not treated as random variables
6
Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
In the binary case, P(y=0 | x; θ) = 1 – hθ(x)
d = dimension of parameter vector θ
p = dimension of feature space x
For logistic regression, d = p + 1 (same as before)
7
Logistic Regression
Hypothesis class for linear regression:
Instead of predicting a real value,
predict the probability that x belongs to some class:
σ = sigmoid function (also known as logistic function): σ(z) = 1 / (1 + exp(–z))
8
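As a minimal sketch (NumPy; not the course's own code), the hypothesis hθ(x) = σ(θᵀx) can be computed as:

import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    # P(y=1 | x; theta); x is assumed to include a leading 1 for the bias term
    return sigmoid(np.dot(theta, x))

For example, h_theta(np.array([0.0, 1.0]), np.array([1.0, 2.0])) returns σ(2) ≈ 0.88.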
Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
The hypothesis class is the set of linear classifiers
[Figure: two 2-D feature spaces with axes x1, x2, each showing a linear decision boundary]
9
Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
The hypothesis class is the set of linear classifiers
[Figure: two 2-D feature spaces with axes x1, x2, each showing a linear decision boundary]
10
Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
11
Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
12
Logistic Regression
Alternative interpretation of hypothesis class: Log-odds ratio is a linear function of the features
Log-linear model
13
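Written out (standard log-odds form; this is what "log-linear" refers to above):
\[
\log \frac{P(y=1 \mid x;\theta)}{P(y=0 \mid x;\theta)}
  = \log \frac{h_\theta(x)}{1 - h_\theta(x)} = \theta^\top x
\]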
Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
14
Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
15
Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)? See Ex5 Q4 for a different error function to derive ridge regression, a variant of linear regression
16
Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
For linear regression, why did we use this particular J(θ)? See Ex5 Q4 for a different error function to derive ridge regression, a variant of linear regression
Why squared-error loss L(h, y)?
See Ex5 Q5 for a derivation of the squared-error loss and J(θ) using the maximum-likelihood principle, assuming that labels have Gaussian noise
17
Recall: Linear Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression?
18
Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
19
Apply Maximum-Likelihood to Logistic Regression
A learning algorithm is the procedure for optimizing J(θ):
Should we do the same J(θ) for logistic regression? No. We will apply the maximum-likelihood principle.
Will show: Max likelihood is equivalent to minimizing J(θ):
with the cross-entropy loss (logistic loss, log loss):
L(hθ(x(i)), y(i)) = –[ y(i) log hθ(x(i)) + (1 – y(i)) log(1 – hθ(x(i))) ]
20
Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q) q is an unknown parameter
Estimation: Want to infer the value of q
21
Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q) q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = ?
22
Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q) q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H / (H + T). Why? Can we derive this formally?
23
Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q) q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H / (H + T). Why? Can we derive this formally?
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
24
Maximum-Likelihood Estimation
Imagine observing tosses from a biased coin
Probabilistic model:
Coin results in H with probability q (and T with prob. 1-q) q is an unknown parameter
Estimation: Want to infer the value of q
Suppose we have observed H heads and T tails. q = H / (H + T). Why? Can we derive this formally?
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
25
Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
26
Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
27
Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
For the biased coin (θ: P(H) = q): L(q) = q^H (1 – q)^T
28
Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
29
Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
Calculus review: Product rule of differentiation
30
Maximum-Likelihood Estimation
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
To find the maximizing value of q, set the derivative to 0 and solve (worked out below):
31
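A worked version of that derivation (standard Bernoulli maximum likelihood, using the product rule):
\[
L(q) = q^{H}(1-q)^{T}, \qquad
\frac{dL}{dq} = H q^{H-1}(1-q)^{T} - T q^{H}(1-q)^{T-1} = 0
\]
\[
\Rightarrow\; H(1-q) - Tq = 0 \;\Rightarrow\; q = \frac{H}{H+T}
\]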
Apply Maximum-Likelihood to Logistic Regression
Maximum-likelihood principle: Estimate parameter(s) such that the observations are most likely
under the defined model class
(What value of q leads to the highest probability of observing H heads and T tails?)
Likelihood function L(θ) – conventionally a function of θ
For the biased coin (P(H) = q): L(q) = q^H (1 – q)^T
Maximizing likelihood leads us to infer q = H / (H + T)
32
Logistic regression: Example: MNIST
33
Logistic regression: Example: MNIST (0 vs. 1)
34
Features?
Logistic regression: Example
35
Features?
Logistic regression: Example
Bias-term only (θ0)
36
Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
37
Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
Why is line
not at 50%?
38
Features?
Logistic regression: Example
Bias-term only (θ0)
Learning curve:
– Shows progress
of learning
– Steep is good!
Why is line
not at 50%?
In test set:
0: 980 samples, 1: 1135 samples
39
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Sum of pixel values
40
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Sum of pixel values
Even worse!
41
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28) (digit images are
28*28, grayscale)
42
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28) (digit images are
28*28, grayscale)
Theoretically, should not make a difference!
43
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Mean = Sum / (28*28)
Theoretically, should not make a difference!
In practice, it does. Useful to normalize data
(0 ≤ mean ≤ 1)
44
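For instance (a sketch assuming 8-bit grayscale MNIST images loaded as a NumPy array):

import numpy as np

# images: assumed uint8 array of shape (n, 28, 28) with pixel values 0..255
images = np.random.randint(0, 256, size=(4, 28, 28), dtype=np.uint8)  # placeholder data
X = images.astype(np.float64) / 255.0   # now every pixel (and the mean feature) lies in [0, 1]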
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
45
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
46
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixel values
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
47
Features?
Logistic regression: Example
θ0: Bias-term; θ1: Mean of pixel values
Learned weights: θ0 = 0.08538, θ1 = -0.7640
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
48
Features?
Logistic regression: Example
θ0: Bias-term; θ1: Mean of pixel values
Learned weights: θ0 = 0.08538, θ1 = -0.7640
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
Model predicts P(y=1 | x; θ) > 0.5 when θ0 + θ1x > 0
→ when x < 0.08538 / 0.7640 ≈ 0.1118
49
Features?
Logistic regression: Example
θ0: Bias-term; θ1: Mean of pixel values
Learned weights: θ0 = 0.08538, θ1 = -0.7640
Achieves 93% accuracy. Surprising?
Always visualize (where possible). Always check.
Model predicts P(y=1 | x; θ) > 0.5 when θ0 + θ1x > 0
→ when x < 0.08538 / 0.7640 ≈ 0.1118
Conclusion: Classifying
0 and 1 is quite easy...
50
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
51
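A sketch of how such a 58-dimensional feature vector might be assembled (the layout below is an assumption matching the indices on the slide):

import numpy as np

def make_features(img):
    # img: 28x28 grayscale image with values in [0, 1]
    return np.concatenate((
        [1.0],               # x0: constant 1 for the bias term theta_0
        [img.mean()],        # x1: mean of all pixels
        img.mean(axis=1),    # x2..x29: row means
        img.mean(axis=0),    # x30..x57: column means
    ))

Stacking make_features over all n images gives an n x 58 design matrix.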
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
52
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
53
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
97.26% accuracy
Which features are useful?
54
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
97.26% accuracy
Which features are useful?
Do ablation analysis:
Remove features
and compare performance
55
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
97.26% accuracy
Which features are useful?
Do ablation analysis:
Remove features
and compare performance
56
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
93.85% accuracy
Which features are useful?
Do ablation analysis:
Remove features
and compare performance
57
Features?
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means
93.85% accuracy
When row/col means are present:
– pixel mean not useful
– bias term is useful
Logistic regression: Example
58
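One way to run such an ablation (a sketch; X_train, y_train, X_test, y_test are assumed to be the 58-feature design matrices and labels built as above, and scikit-learn is used purely for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# X_train, y_train, X_test, y_test: assumed to exist (see lead-in)
# Column indices of each feature group in the 58-dim feature vector (assumed layout)
groups = {"bias": [0], "pixel mean": [1],
          "row means": list(range(2, 30)), "col means": list(range(30, 58))}

for name, cols in groups.items():
    keep = [c for c in range(58) if c not in cols]   # drop one group at a time
    clf = LogisticRegression(max_iter=1000, fit_intercept=False)
    clf.fit(X_train[:, keep], y_train)
    print(f"without {name}: test accuracy = {clf.score(X_test[:, keep], y_test):.4f}")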
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
59
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
60
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
99.91% test accuracy! (2 false positives:
true 0, pred 1)
61
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
99.91% test accuracy! (2 false positives:
true 0, pred 1)
62
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
99.91% test accuracy! (2 false positives:
true 0, pred 1)
63
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
64
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
65
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
66
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
... continued in next lecture
67
Logistic Regression
Another look at the full algorithm...
68
Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
This is a probabilistic model of the data!
Use maximum-likelihood principle to estimate θ
69
Logistic Regression
Hypothesis class:
Predict the probability that x belongs to some class:
This is a probabilistic model of the data!
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
70
Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
71
Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
Common assumption: Data samples are
independent and identically distributed (IID), given θ
72
Logistic Regression
Use maximum-likelihood principle to estimate θ
Likelihood function for supervised learning:
Common assumption: Data samples are
independent and identically distributed (IID), given θ
Easier to handle in log space: log-likelihood function l(θ)
Since log is monotonic, l(θ) and L(θ) have the same maximizer
73
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
74
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Probabilistic model in logistic regression:
75
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Probabilistic model in logistic regression:
76
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Probabilistic model in logistic regression:
Simplify with a trick:
77
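Spelled out, the trick and the resulting log-likelihood are (standard derivation):
\[
P(y \mid x;\theta) = h_\theta(x)^{\,y}\,\bigl(1-h_\theta(x)\bigr)^{1-y}, \qquad y \in \{0,1\}
\]
\[
\ell(\theta) = \sum_{i=1}^{n}\Bigl[\,y^{(i)}\log h_\theta\bigl(x^{(i)}\bigr)
 + \bigl(1-y^{(i)}\bigr)\log\bigl(1-h_\theta\bigl(x^{(i)}\bigr)\bigr)\Bigr]
\]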
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
78
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
79
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted term is the (negative) loss function!
80
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted term is the (negative) loss function!
L(hθ(x(i)), y(i)) = –[ y(i) log hθ(x(i)) + (1 – y(i)) log(1 – hθ(x(i))) ]
Names: Cross-entropy loss, logistic loss, log loss
81
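A minimal NumPy sketch of this loss (the clipping is a common numerical safeguard, not something the slides specify):

import numpy as np

def cross_entropy_loss(h, y, eps=1e-12):
    # h: predicted probability P(y=1 | x; theta); y: true label in {0, 1}
    h = np.clip(h, eps, 1.0 - eps)            # avoid log(0)
    return -(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))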
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
Recall that in supervised learning, we minimize J(θ):
The highlighted term is the (negative) loss function!
L(hθ(x(i)), y(i)) = –[ y(i) log hθ(x(i)) + (1 – y(i)) log(1 – hθ(x(i))) ]
Names: Cross-entropy loss, logistic loss, log loss
Instead of defining the loss / error function arbitrarily, we derived it using the maximum-likelihood principle
82
Logistic Regression
Use maximum-likelihood principle to estimate θ
Find θ that maximizes the log-likelihood function: maxθ l(θ)
To solve this optimization problem, find the gradient and do (stochastic) gradient ascent (ascent rather than descent, because we maximize instead of minimize)
83
Logistic Regression
See next slide
84
Logistic Regression
85
Logistic Regression
Recall: σ'(z) = σ(z)(1 – σ(z)). Hence:
∂l(θ)/∂θj = Σi (y(i) – hθ(x(i))) xj(i)
86
Logistic Regression
See prev slide
87
Logistic Regression
Bias term; equivalent if x0(i) = 1
See prev slide
88
Logistic Regression
Similar form as linear regression gradient! Both are generalized linear models (GLMs)
89
Logistic Regression
90
Logistic Regression
Can similarly extend to stochastic / minibatch versions
With linear algebra: Iteratively reweighted least squares
91
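A sketch of the resulting batch gradient-ascent loop (learning rate and iteration count are arbitrary illustrative choices):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    # X: (n, d) design matrix with a leading column of 1s for the bias term
    # y: (n,) labels in {0, 1}
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)                 # predicted P(y=1 | x; theta) for every sample
        theta += lr * (X.T @ (y - h)) / n      # gradient ascent on the log-likelihood
    return theta

Stochastic / minibatch versions apply the same per-sample gradient (y(i) – hθ(x(i))) x(i) to one sample or a small batch at a time.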
Logistic Regression
The remainder of this lecture is not covered on the exam.
92
Logistic regression: Example: MNIST
93
Logistic regression: Example Features?
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
99.91% test accuracy! (2 false positives:
true 0, pred 1)
94
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
99.91% test accuracy! (2 false positives:
true 0, pred 1)
95
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
99.91% test accuracy! (2 false positives:
true 0, pred 1)
96
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
97
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
98
Features?
Logistic regression: Example
θ0: Bias-term
θ1: Mean of pixels
θ2 to θ29: Row means θ30 to θ57: Col means θ58 to θ841: Individual
pixel values
This is just memorizing pixel locations...
If we perturb dataset with row/col shifts:
99
Logistic regression: Example: MNIST (0 vs. 1)
We could try to match image patches... (e.g., lines, curves)
100
Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
101
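A sketch of that hack using SciPy (assuming 28x28 images with intensities in [0, 1]; an all-black image would need a special case):

import numpy as np
from scipy import ndimage

def recenter(img):
    # Shift the image so its intensity-weighted center of mass sits at the image center
    cy, cx = ndimage.center_of_mass(img)
    dy = (img.shape[0] - 1) / 2.0 - cy
    dx = (img.shape[1] - 1) / 2.0 - cx
    return ndimage.shift(img, (dy, dx), order=1, mode='constant', cval=0.0)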
Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
102
Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
3 false positives (99.86% accuracy). The previous two, and a new one:
103
Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
104
Logistic regression: Example
Hack: Try to correct the perturbation with some processing
Compute the center of mass of the perturbed image, and shift it back to the center
105
Logistic regression: Example: MNIST
Multi-class: 10-class classification
106
Logistic regression: Example
Binary logistic regression:
107
Logistic regression: Example
Binary logistic regression:
Multi-class logistic regression:
There are now K parameter vectors – one vector per class
– Number of parameters d = K * (p+1)
108
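With one parameter vector θk per class, the usual multi-class (softmax) form of the model is:
\[
P(y=k \mid x;\theta) = \frac{\exp\bigl(\theta_k^\top x\bigr)}{\sum_{j=1}^{K}\exp\bigl(\theta_j^\top x\bigr)},
\qquad k = 1,\dots,K
\]
For K = 2 this reduces to the sigmoid model above (only the difference of the two parameter vectors matters).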
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
109
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
110
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
87.99% accuracy
What accuracy does predicting at random achieve?
111
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
112
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
Visualize error: Confusion matrix
Row = true class
Col = predicted class (on diagonal = correct)
113
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
Visualize error: Confusion matrix
Row = true class
Col = predicted class
(on diagonal = correct)
– if most predictions are correct, looking only at the off-diagonal entries is most useful
114
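A minimal NumPy sketch of building such a matrix (y_true and y_pred assumed to be arrays of integer class labels 0-9):

import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=10):
    # cm[i, j] = number of test samples with true class i that were predicted as class j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm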
Logistic regression: Example
Features:
θ0: Bias-term
θ1 to θ784: Individual pixel values
87.99% accuracy
Maybe not enough data? Graph only uses
3200 of 60000 samples available in training set
115
Logistic regression: Example
N = 180000 samples – using each sample 3 times
116
Logistic regression: Example
N = 180000 samples – using each sample 3 times
117
Logistic regression: Example
N = 180000 samples – using each sample 3 times
118
Logistic regression: Example
N = 180000 samples – using each sample 3 times
Sample ordering: Going through training set in order, round-robin through the classes
119
Logistic regression: Example
N = 180000 samples – using each sample 3 times
Sample ordering: Going through training set in order, round-robin through the classes
... but there are 6742 samples in class 1 (out of 60000) – all classes except 1 and 7 have < 6000 samples
120
Logistic regression: Example
Smaller example:
Class 0: 1760 samples; Classes 1-9: 160 samples each
Train round-robin through classes 0-9:
0123456789 0123456789 ...
0000000000 (row 161) ...
121
Logistic regression: Example
Smaller example:
Class 0: 1760 samples; Classes 1-9: 160 samples each
Train round-robin through classes 0-9
Modified order: 0000000001 0000000002 ...
122
Logistic regression: Example
Smaller example:
Class 0: 1760 samples; Classes 1-9: 160 samples each
Train round-robin through classes 0-9
Modified order: 0000000001 0000000002 ...
123
Logistic regression: Example
Even in balanced case, “data splicing”
(training round-robin through classes)
is important
Order:
320 class 0 samples, 320 class 1 samples, ..., 320 class 9 samples
124
Logistic regression: Example
Even in balanced case, “data splicing”
(training round-robin through classes)
is important
Order:
320 class 0 samples, 320 class 1 samples, ..., 320 class 9 samples
125
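A sketch of one way to implement such splicing: reorder the training set so consecutive samples cycle round-robin through the classes instead of appearing in contiguous blocks.

import numpy as np
from itertools import zip_longest

def splice_by_class(X, y, n_classes=10):
    # Indices of each class, then interleave them one-per-class per round
    by_class = [np.where(y == k)[0] for k in range(n_classes)]
    order = [i for round_ in zip_longest(*by_class)
               for i in round_ if i is not None]   # smaller classes simply run out
    order = np.asarray(order)
    return X[order], y[order]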
Bag of tricks
Normalize features (whiten – mean 0, variance 1)
Numerical issues (exp, log) – log-sum-exp trick (sketched below)
Balanced data (do not have too much of one class)
Splicing data (alternate between classes during training)
Data augmentation (increase variation in training data)
“One in ten rule” – at least 10:1 sample:parameter ratio
Use the same random seed (for debugging)
If results are too good to be true, be very skeptical – Did you use the test set for training?
Visualize! (Learning curves, feature weights, errors, etc.)
126
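For the numerical-issues point above, the log-sum-exp trick can be sketched as:

import numpy as np

def log_sum_exp(z):
    # Compute log(sum(exp(z))) stably by factoring out the maximum
    m = np.max(z)
    return m + np.log(np.sum(np.exp(z - m)))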
Role of hyperparameters
Choose with cross-validation
127
Role of hyperparameters
Choose with cross-validation
128
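A sketch of hyperparameter selection by k-fold cross-validation (here the learning rate of the gradient-ascent sketch above is used as the example hyperparameter; fit_logistic and sigmoid refer to that earlier sketch):

import numpy as np

def choose_lr_by_cv(X, y, candidate_lrs=(0.01, 0.1, 1.0), k=5, seed=0):
    # Split indices into k folds; train on k-1 folds, validate on the held-out fold
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    best_lr, best_acc = None, -1.0
    for lr in candidate_lrs:
        accs = []
        for i in range(k):
            val = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            theta = fit_logistic(X[train], y[train], lr=lr)      # from the earlier sketch
            preds = (sigmoid(X[val] @ theta) > 0.5).astype(int)
            accs.append(np.mean(preds == y[val]))
        if np.mean(accs) > best_acc:
            best_lr, best_acc = lr, float(np.mean(accs))
    return best_lr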
Logistic regression: Example
N = 320000 samples – using each sample 5-6 times
With appropriate data balancing and splicing: 91.93% test accuracy
(vs. 87.99% for N = 3200)
129