Machine learning lecture slides
COMS 4771 Fall 2020
Classification I: Linear classification
Outline
Logistic regression and linear classifiers
Example: text classification
Maximum likelihood estimation and empirical risk minimization
Linear separators
Surrogate loss functions
Logistic regression model
Suppose x is given by d real-valued features, so x ∈ ℝ^d, while y ∈ {−1, +1}.
Logistic regression model for (X, Y ):
Y | X = x is Bernoulli (but taking values in {−1, +1} rather than {0, 1}) with parameter σ(x^T w) := 1/(1 + exp(−x^T w))
Sigmoid function σ(t) := 1/(1 + e^(−t))
w ∈ ℝ^d is the parameter vector of interest
w not involved in marginal distribution of X (which we don’t
care much about)
Figure 1: Logistic (sigmoid) function
Log-odds in logistic regression model
Sigmoid function σ(t) := 1/(1 + e^(−t))
Useful property: 1 − σ(t) = σ(−t)
Pr(Y = +1 | X = x) = σ(x^T w)
Pr(Y = −1 | X = x) = 1 − σ(x^T w) = σ(−x^T w)
Convenient formula: for each y ∈ {−1, +1},
Pr(Y = y | X = x) = σ(y x^T w).
Log-odds in the model is given by a linear function:
ln [ Pr(Y = +1 | X = x) / Pr(Y = −1 | X = x) ] = x^T w.
Just like in linear regression, it is common to use feature expansion! E.g., affine feature expansion φ(x) = (1, x) ∈ ℝ^(d+1)
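A minimal NumPy sketch (not part of the slides) of these conditional probabilities; the weight vector w and point x below are made-up values for illustration only.

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def prob_label(x, y, w):
    """Pr(Y = y | X = x) = sigma(y * x^T w) for y in {-1, +1}."""
    return sigmoid(y * np.dot(x, w))

# Made-up weights and feature vector, for illustration only.
w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 1.5])
print(prob_label(x, +1, w), prob_label(x, -1, w))  # the two probabilities sum to 1
```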
Optimal classifier in logistic regression model
Recall that the Bayes classifier is
f⋆(x) = +1 if Pr(Y = +1 | X = x) > 1/2, and −1 otherwise.
If the distribution of (X, Y) comes from the logistic regression model with parameter w, then the Bayes classifier is
f⋆(x) = +1 if x^T w > 0, and −1 otherwise; i.e., f⋆(x) = sign(x^T w).
This is a linear classifier
Compute linear combination of features, then check if above
threshold (zero)
With affine feature expansion, threshold can be non-zero
Many other statistical models for classification data lead to a linear (or affine) classifier, e.g., Naive Bayes
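To make the "linear combination, then threshold" recipe concrete, here is a small sketch (not from the slides); the points and weights are made up, and the affine expansion supplies the nonzero threshold mentioned above.

```python
import numpy as np

def affine_expand(X):
    """Affine feature expansion phi(x) = (1, x): prepend a constant-1 feature."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def predict(X, w):
    """Linear classifier: sign of x^T w, with sign(0) mapped to -1 here."""
    return np.where(X @ w > 0, +1, -1)

# Made-up points in R^2 and made-up weights; w[0] acts as the (negated) threshold.
X = np.array([[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3], [0.0, 0.0]])
w = np.array([-0.5, 1.0, 2.0])
print(predict(affine_expand(X), w))  # array of labels in {-1, +1}
```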
Geometry of linear classifiers
Hyperplane specified by normal vector w ∈ ℝ^d: H = {x ∈ ℝ^d : x^T w = 0}
This is the decision boundary of a linear classifier
Angle θ between x and w satisfies
cos(θ) = x^T w / (∥x∥_2 ∥w∥_2)
Distance to hyperplane given by ∥x∥_2 · cos(θ)
x is on the same side of H as w iff x^T w > 0
Figure 3: Decision boundary of linear classifier
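A tiny numerical sketch of these quantities (illustration only; w and x are made-up values).

```python
import numpy as np

# Made-up normal vector w and point x, for illustration only.
w = np.array([1.0, 2.0])
x = np.array([3.0, -0.5])

cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
signed_distance = np.linalg.norm(x) * cos_theta   # equals (x^T w) / ||w||_2
same_side_as_w = (x @ w) > 0
print(cos_theta, signed_distance, same_side_as_w)
```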
Geometry of linear classifiers (2)
With feature expansion, can obtain other types of decision boundaries
Figure 4: Decision boundary of linear classifier with quadratic feature expansion
Figure 5: Decision boundary of linear classifier with quadratic feature expansion (another one)
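Figures 4 and 5 use a quadratic feature expansion; the map below is one common choice (an assumption, not necessarily the exact one behind those figures), shown with made-up weights that give a circular boundary.

```python
import numpy as np

def quadratic_expand(X):
    """One possible quadratic feature expansion for 2-D inputs:
    phi(x) = (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

# Made-up weights giving the circular boundary x1^2 + x2^2 = 1.
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
X = np.array([[0.1, 0.2], [2.0, 0.0]])
print(np.where(quadratic_expand(X) @ w > 0, +1, -1))  # inside circle -> -1, outside -> +1
```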
MLE for logistic regression
Treat training examples as iid, same distribution as test example
Log-likelihood of w given data (x_1, y_1), …, (x_n, y_n) ∈ ℝ^d × {−1, +1}:
∑_{i=1}^n −ln(1 + exp(−y_i x_i^T w)) + {terms not involving w}
No “closed form” expression for maximizer
(Later, we’ll discuss algorithms for finding approximate
maximizers using iterative methods like gradient descent.)
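Since there is no closed form, one can hand the negative log-likelihood to a generic iterative optimizer. Here is a sketch using scipy.optimize.minimize on synthetic, made-up data; this is just one way to find an approximate maximizer, not necessarily the method discussed later.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood, up to terms not involving w:
    sum_i ln(1 + exp(-y_i x_i^T w)); logaddexp(0, -s) = ln(1 + exp(-s))."""
    return np.sum(np.logaddexp(0.0, -y * (X @ w)))

# Made-up training data standing in for (x_1, y_1), ..., (x_n, y_n).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100) > 0, +1, -1)

w_hat = minimize(neg_log_likelihood, x0=np.zeros(3), args=(X, y)).x  # approximate MLE
```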
Example: Text classification (1)
Data: articles posted to various internet message boards
Label: −1 for articles on “religion”, +1 for articles on “politics”
Features:
Vocabulary of d = 61188 words
Each document is a binary vector x = (x_1, …, x_d) ∈ {0, 1}^d, where x_j = 1{document contains j-th vocabulary word}
Logistic regression model:
ln [ Pr_w(Y = politics | X = x) / Pr_w(Y = religion | X = x) ] = w^T x
Each weight in weight vector w = (w1, . . . , w61188) corresponds to a vocabulary word
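A toy sketch (not from the slides) of how such binary bag-of-words features can be built; the tiny vocabulary and document below are made up, whereas the real vocabulary has d = 61188 words.

```python
import numpy as np

def binary_bag_of_words(document, vocabulary):
    """x_j = 1{document contains j-th vocabulary word}."""
    words = set(document.lower().split())
    return np.array([1 if word in words else 0 for word in vocabulary])

# Made-up 5-word vocabulary and document, for illustration only.
vocabulary = ["government", "bible", "israel", "church", "gun"]
print(binary_bag_of_words("The government debated the gun bill", vocabulary))  # [1 0 0 0 1]
```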
Example: Text classification (2)
Found wˆ that approximately maximizes likelihood given 3028 training examples
Test error rate on 2017 examples is about 8.5%.
Vocabulary words with 10 highest (most positive) coefficients:
israel, gun, government, american, news, clinton, rights, guns, israeli, politics
Vocabulary words with 10 lowest (most negative) coefficients:
god, christian, bible, jesus, keith, christians, religion, church, christ, athos
Example: Text classification (3)
Figure 6: Histogram of Prwˆ(Y = politics | X = x) values on test data
Example: Text classification (4)
Article with Prwˆ(Y = politics | X = x) ≈ 0.0:
Rick, I think we can safely say, 1) Robert is not the only person who understands the Bible, and 2), the leadership of the LDS church historicly never has. Let’s consider some “personal interpretations” and see how much trust we should put in “Orthodox Mormonism”, which could never be confused with Orthodox Christianity. [. . . ]
Example: Text classification (5)
Article with Prwˆ(Y = politics | X = x) ≈ 0.5:
Does anyone know where I can access an online copy of the proposed “jobs” or “stimulus” legislation? Please E-mail me directly and if anyone else is interested, I can post this information.
Article with Prwˆ(Y = politics | X = x) ≈ 1.0:
[A reprint of a Village Voice article by Robert I. Friedman titled “The Enemy Within” about the Anti-Defamation League.]
Zero-one loss and ERM for linear classifiers
Recall: error rate of classifier f can also be written as risk: R(f) = E[1{f(X) ≠ Y}] = Pr(f(X) ≠ Y),
where loss function is zero-one loss.
For classification, we are ultimately interested in classifiers with
small error rate
I.e., small (zero-one loss) risk
Just like for linear regression, can apply plug-in principle to derive ERM, but now for linear classifiers.
Find w ∈ ℝ^d to minimize
R̂(w) := (1/n) ∑_{i=1}^n 1{sign(x_i^T w) ≠ y_i}.
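A direct NumPy sketch of this empirical zero-one risk (illustration only; treating sign(0) as −1 is an assumed convention).

```python
import numpy as np

def empirical_zero_one_risk(w, X, y):
    """Average of 1{sign(x_i^T w) != y_i} over the training examples.
    (sign(0) is treated as -1 here: an arbitrary but fixed convention.)"""
    predictions = np.where(X @ w > 0, +1, -1)
    return np.mean(predictions != y)
```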
Performance of ERM for linear classifiers
Theorem: In the IID model, the ERM solution ŵ satisfies
E[R(ŵ)] ≤ min_{w ∈ ℝ^d} R(w) + O(√(d/n)).
Inductive bias assumption: there is a linear classifier with low error rate, so min_{w ∈ ℝ^d} R(w) is small.
Unfortunately, solving this optimization problem, even for linear classifiers, is computationally intractable.
(Sharp contrast to ERM optimization problem for linear regression!)
Linearly separable data
Training data is linearly separable if there exists a linear classifier with training error rate zero.
(Special case where the ERM optimization problem is tractable.)
There exists w ∈ ℝ^d such that sign(x_i^T w) = y_i for all i = 1, …, n.
Equivalent: y_i x_i^T w > 0 for all i = 1, …, n
Figure 7: Linearly separable data
Figure 8: Data that is not linearly separable
Finding a linear separator I
Suppose training data (x_1, y_1), …, (x_n, y_n) ∈ ℝ^d × {−1, +1} is linearly separable.
How to find a linear separator (assuming one exists)?
Method 1: solve linear feasibility problem
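One way to pose the feasibility problem is as a linear program with a trivial objective; here is a sketch using scipy.optimize.linprog. The margin-1 rescaling and the LP solver are assumptions of this sketch, not prescribed by the slides.

```python
import numpy as np
from scipy.optimize import linprog

def find_linear_separator(X, y):
    """Find w with y_i x_i^T w >= 1 for all i (any strict separator can be
    rescaled to this margin-1 form), posed as an LP with a zero objective."""
    n, d = X.shape
    A_ub = -(y[:, None] * X)            # -y_i x_i^T w <= -1  <=>  y_i x_i^T w >= 1
    b_ub = -np.ones(n)
    result = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub, bounds=(None, None))
    return result.x if result.success else None   # None: no separator found
```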
Finding a linear separator II
Method 2: approximately solve logistic regression MLE
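For instance (not prescribed by the slides), scikit-learn's LogisticRegression with a very weak penalty approximates the unregularized MLE; on linearly separable data the fitted weights tend toward a separator. The specific C value and the synthetic data below are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up linearly separable data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.where(X @ np.array([2.0, -1.0]) > 0, +1, -1)

# Large C = weak regularization, so the fit approximates the MLE;
# fit_intercept=False matches the model Pr(Y = y | X = x) = sigma(y x^T w).
clf = LogisticRegression(C=1e6, fit_intercept=False, max_iter=1000).fit(X, y)
w_hat = clf.coef_.ravel()
print(np.mean(y * (X @ w_hat) > 0))  # fraction of training points correctly classified
```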
Surrogate loss functions I
Often, a linear separator will not exist.
Regard each term in the negative log-likelihood as a logistic loss
ℓ_logistic(s) := ln(1 + exp(−s))
Cf. zero-one loss: ℓ_{0/1}(s) := 1{s ≤ 0}
A scaling of ℓ_logistic is an upper bound on ℓ_{0/1}, i.e., a surrogate loss:
ℓ_{0/1}(s) ≤ (1/ln 2) ℓ_logistic(s) = log_2(1 + exp(−s)).
Small (empirical) ℓ_logistic-risk implies small (empirical) ℓ_{0/1}-risk
Figure 9: Comparing zero-one loss and (scaled) logistic loss
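A quick numerical sanity check (illustration only) that the scaled logistic loss upper-bounds the zero-one loss on a grid of margin values.

```python
import numpy as np

def zero_one_loss(s):
    """l_{0/1}(s) = 1{s <= 0}."""
    return (s <= 0).astype(float)

def scaled_logistic_loss(s):
    """(1 / ln 2) * ln(1 + exp(-s)); upper-bounds the zero-one loss."""
    return np.logaddexp(0.0, -s) / np.log(2.0)

s = np.linspace(-1.5, 1.5, 31)
print(np.all(scaled_logistic_loss(s) >= zero_one_loss(s)))  # True
```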
Surrogate loss functions II
Another example: squared loss
ℓ_sq(s) = (1 − s)^2
Note: (1 − y_i x_i^T w)^2 = (y_i − x_i^T w)^2 since y_i ∈ {−1, +1}
Weird: ℓ_sq(s) → ∞ as s → ∞.
Minimizing the empirical ℓ_sq-risk does not necessarily give a linear separator, even if one exists.
A fix: ℓ_msq(s) := max{0, 1 − s}^2 (modified squared loss)
Figure 10: Comparing zero-one loss and squared loss
Figure 11: Comparing zero-one loss and modified squared loss
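A small sketch (illustration only) of the two losses, showing that the squared loss blows up for large positive margins while the modified squared loss does not.

```python
import numpy as np

def squared_loss(s):
    """l_sq(s) = (1 - s)^2; note it grows as s -> +infinity."""
    return (1.0 - s) ** 2

def modified_squared_loss(s):
    """l_msq(s) = max{0, 1 - s}^2; zero once the margin s exceeds 1."""
    return np.maximum(0.0, 1.0 - s) ** 2

s = np.array([-1.0, 0.0, 0.5, 1.0, 3.0])
print(squared_loss(s))           # [4.   1.   0.25 0.   4.  ]
print(modified_squared_loss(s))  # [4.   1.   0.25 0.   0.  ]
```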
(Regularized) empirical risk minimization for classification with surrogate losses
We can combine these surrogate losses with regularizers, just as when we discussed linear regression
This leads to regularized ERM objectives:
(1/n) ∑_{i=1}^n ℓ(y_i x_i^T w) + Φ(w)
where
ℓ is a (surrogate) loss function
Φ is a regularizer (e.g., Φ(w) = λ∥w∥_2, Φ(w) = λ∥w∥_1)
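A sketch of one instance of this template (not the prescribed recipe): the logistic surrogate loss with a squared 2-norm regularizer, minimized by a generic optimizer; the data and the regularization strength λ = 0.1 are made-up choices.

```python
import numpy as np
from scipy.optimize import minimize

def regularized_erm_objective(w, X, y, lam):
    """(1/n) sum_i l(y_i x_i^T w) + Phi(w), here with the logistic surrogate loss
    and Phi(w) = lam * ||w||_2^2 (one possible choice of regularizer)."""
    losses = np.logaddexp(0.0, -y * (X @ w))   # l_logistic(y_i x_i^T w)
    return losses.mean() + lam * np.dot(w, w)

# Made-up data and an arbitrary regularization strength, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X @ np.array([1.0, -1.0, 0.5]) > 0, +1, -1)
w_hat = minimize(regularized_erm_objective, x0=np.zeros(3), args=(X, y, 0.1)).x
```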