Machine learning lecture slides
COMS 4771 Fall 2020
Classification I: Linear classification
Outline
- Logistic regression and linear classifiers
- Example: text classification
- Maximum likelihood estimation and empirical risk minimization
- Linear separators
- Surrogate loss functions
Logistic regression model
- Suppose x is given by d real-valued features, so x ∈ R^d, while y ∈ {−1,+1}.
- Logistic regression model for (X, Y):
  - Y | X = x is Bernoulli (but taking values in {−1,+1} rather than {0, 1}) with parameter σ(x^T w) := 1/(1 + exp(−x^T w)).
  - Sigmoid function σ(t) := 1/(1 + e^(−t))
  - w ∈ R^d is parameter vector of interest
  - w not involved in marginal distribution of X (which we don’t care much about)
Figure 1: Logistic (sigmoid) function
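- A minimal NumPy sketch of this model (the particular w and x values are made up for illustration):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    rng = np.random.default_rng(0)
    w = np.array([1.0, -2.0, 0.5])      # parameter vector of interest (hypothetical)
    x = np.array([0.2, 0.1, 1.0])       # one feature vector in R^3

    p_plus = sigmoid(x @ w)                      # Pr(Y = +1 | X = x)
    y = +1 if rng.random() < p_plus else -1      # draw a label from the model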
Log-odds in logistic regression model
- Sigmoid function σ(t) := 1/(1 + e^(−t))
- Useful property: 1 − σ(t) = σ(−t)
- Pr(Y = +1 | X = x) = σ(x^T w)
- Pr(Y = −1 | X = x) = 1 − σ(x^T w) = σ(−x^T w)
- Convenient formula: for each y ∈ {−1,+1},
  Pr(Y = y | X = x) = σ(y x^T w).
- Log-odds in the model is given by a linear function:
  ln [ Pr(Y = +1 | X = x) / Pr(Y = −1 | X = x) ] = x^T w.
- Just like in linear regression, common to use feature expansion!
  - E.g., affine feature expansion ϕ(x) = (1, x) ∈ R^(d+1)
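- A quick numeric check of these identities (a sketch; the w and x values are made up):

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    w = np.array([0.7, -1.2])
    x = np.array([1.5, 0.3])

    p_plus, p_minus = sigmoid(x @ w), sigmoid(-(x @ w))
    assert np.isclose(p_plus + p_minus, 1.0)            # 1 - sigmoid(t) = sigmoid(-t)
    assert np.isclose(np.log(p_plus / p_minus), x @ w)  # log-odds is linear in x

    phi = np.concatenate(([1.0], x))                    # affine feature expansion (1, x)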
Optimal classifier in logistic regression model
- Recall that Bayes classifier is
  f*(x) = +1 if Pr(Y = +1 | X = x) > 1/2, and −1 otherwise.
- If distribution of (X, Y) comes from logistic regression model with parameter w, then Bayes classifier is
  f*(x) = +1 if x^T w > 0, and −1 otherwise; i.e., f*(x) = sign(x^T w).
- This is a linear classifier
  - Compute linear combination of features, then check if above threshold (zero)
  - With affine feature expansion, threshold can be non-zero
- Many other statistical models for classification data lead to a linear (or affine) classifier, e.g., Naive Bayes
Figure 2: Implementation of a linear classifier
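- A minimal NumPy version of such an implementation (a sketch; not necessarily the code shown in Figure 2):

    import numpy as np

    def predict(w, X):
        # predict sign(x^T w) for each row of X (ties at 0 mapped to -1)
        return np.where(X @ w > 0, +1, -1)

    w = np.array([1.0, -1.0])                 # hypothetical parameter vector
    X = np.array([[2.0, 0.5], [0.1, 0.9]])    # two test points
    print(predict(w, X))                      # [ 1 -1]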
Geometry of linear classifiers
- Hyperplane specified by normal vector w ∈ R^d:
  - H = {x ∈ R^d : x^T w = 0}
  - This is the decision boundary of a linear classifier
- Angle θ between x and w has
  cos(θ) = x^T w / (‖x‖_2 ‖w‖_2)
- Distance to hyperplane given by ‖x‖_2 · cos(θ)
- x is on same side of H as w iff x^T w > 0
Figure 3: Decision boundary of linear classifier
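- The geometric quantities above, computed for made-up x and w (a sketch):

    import numpy as np

    w = np.array([3.0, 4.0])
    x = np.array([1.0, 2.0])

    cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
    signed_dist = np.linalg.norm(x) * cos_theta   # equals (x @ w) / ||w||_2
    same_side_as_w = (x @ w) > 0                  # True iff x^T w > 0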
Geometry of linear classifiers (2)
- With feature expansion, can obtain other types of decision boundaries
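- For example, one quadratic feature expansion for x ∈ R^2 (an illustrative choice; the figures below use some quadratic expansion, not necessarily this exact map):

    import numpy as np

    def phi(x):
        # phi(x) = (1, x1, x2, x1^2, x2^2, x1*x2); a linear classifier in
        # phi(x)-space gives a quadratic decision boundary in x-space
        x1, x2 = x
        return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

    w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0, 0.0])  # hypothetical weights: boundary x1^2 + x2^2 = 1
    x = np.array([0.5, 0.5])
    label = np.sign(phi(x) @ w)                    # -1 here: x lies inside the circle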
Figure 4: Decision boundary of linear classifier with quadratic feature expansion
Figure 5: Decision boundary of linear classifier with quadratic feature expansion (another one)
MLE for logistic regression
- Treat training examples as iid, same distribution as test example
- Log-likelihood of w given data (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × {−1,+1}:
  −∑_{i=1}^n ln(1 + exp(−y_i x_i^T w)) + {terms not involving w}
- No “closed form” expression for maximizer
- (Later, we’ll discuss algorithms for finding approximate maximizers using iterative methods like gradient descent.)
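- A preview sketch of one such iterative method (an assumption: plain full-batch gradient descent with a fixed step size, run on toy data), minimizing the negative log-likelihood:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def grad_neg_log_lik(w, X, y):
        # gradient of sum_i ln(1 + exp(-y_i x_i^T w)) with respect to w
        return -(X.T @ (y * sigmoid(-y * (X @ w))))

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                                          # toy features
    y = np.where(X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=100) > 0, 1, -1)

    w, step = np.zeros(3), 0.01
    for _ in range(1000):
        w -= step * grad_neg_log_lik(w, X, y)   # gradient descent step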
Example: Text classification (1)
- Data: articles posted to various internet message boards
- Label: −1 for articles on “religion”, +1 for articles on “politics”
- Features:
  - Vocabulary of d = 61188 words
  - Each document is a binary vector x = (x_1, . . . , x_d) ∈ {0, 1}^d, where x_j = 1{document contains j-th vocabulary word}
- Logistic regression model:
  ln [ Pr_w(Y = politics | X = x) / Pr_w(Y = religion | X = x) ] = w^T x
- Each weight in weight vector w = (w_1, . . . , w_61188) corresponds to a vocabulary word
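- A sketch of this encoding with a tiny made-up vocabulary standing in for the real d = 61188 words:

    import numpy as np

    vocab = ["israel", "gun", "god", "bible", "congress"]   # hypothetical mini-vocabulary
    word_index = {word: j for j, word in enumerate(vocab)}

    def encode(document):
        # x_j = 1 iff the document contains the j-th vocabulary word
        x = np.zeros(len(vocab))
        for word in document.lower().split():
            if word in word_index:
                x[word_index[word]] = 1.0
        return x

    x = encode("Congress debates a new gun bill")   # -> [0. 1. 0. 0. 1.]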
Example: Text classification (2)
- Found ŵ that approximately maximizes likelihood given 3028 training examples
- Test error rate on 2017 examples is about 8.5%.
- Vocabulary words with 10 highest (most positive) coefficients:
  - israel, gun, government, american, news, clinton, rights, guns, israeli, politics
- Vocabulary words with 10 lowest (most negative) coefficients:
  - god, christian, bible, jesus, keith, christians, religion, church, christ, athos
Example: Text classification (3)
Figure 6: Histogram of Pr_ŵ(Y = politics | X = x) values on test data
Example: Text classification (4)
- Article with Pr_ŵ(Y = politics | X = x) ≈ 0.0:
Rick, I think we can safely say, 1) Robert is not the only
person who understands the Bible, and 2), the leadership
of the LDS church historicly never has. Let’s consider
some “personal interpretations” and see how much trust
we should put in “Orthodox Mormonism”, which could
never be confused with Orthodox Christianity. [. . . ]
Example: Text classification (5)
- Article with Pr_ŵ(Y = politics | X = x) ≈ 0.5:
Does anyone know where I can access an online copy of the
proposed “jobs” or “stimulus” legislation? Please E-mail
me directly and if anyone else is interested, I can post this
information.
- Article with Pr_ŵ(Y = politics | X = x) ≈ 1.0:
[A reprint of a Village Voice article by Robert I. Friedman
titled “The Enemy Within” about the Anti-Defamation
League.]
Zero-one loss and ERM for linear classifiers
- Recall: error rate of classifier f can also be written as risk:
  R(f) = E[1{f(X) ≠ Y}] = Pr(f(X) ≠ Y),
  where loss function is zero-one loss.
- For classification, we are ultimately interested in classifiers with small error rate
  - I.e., small (zero-one loss) risk
- Just like for linear regression, can apply plug-in principle to derive ERM, but now for linear classifiers.
- Find w ∈ R^d to minimize
  R̂(w) := (1/n) ∑_{i=1}^n 1{sign(x_i^T w) ≠ y_i}.
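- Computing R̂(w) in NumPy (a sketch):

    import numpy as np

    def empirical_zero_one_risk(w, X, y):
        # (1/n) * number of training examples i with sign(x_i^T w) != y_i
        predictions = np.where(X @ w > 0, +1, -1)
        return np.mean(predictions != y)

- (Note that R̂ is piecewise constant as a function of w, which already hints at why minimizing it directly is hard.)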
Performance of ERM for linear classifiers
- Theorem: In IID model, ERM solution ŵ satisfies
  E[R(ŵ)] ≤ min_{w∈R^d} R(w) + O(√(d/n))
- Inductive bias assumption: there is a linear classifier with low error rate, so min_{w∈R^d} R(w) is small.
- Unfortunately, solving this optimization problem, even for linear classifiers, is computationally intractable.
  - (Sharp contrast to ERM optimization problem for linear regression!)
Linearly separable data
- Training data is linearly separable if there exists a linear classifier with training error rate zero.
  - (Special case where ERM optimization problem is tractable.)
- There exists w ∈ R^d such that sign(x_i^T w) = y_i for all i = 1, . . . , n.
- Equivalent:
  y_i x_i^T w > 0 for all i = 1, . . . , n
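- Checking this condition for a given w (a sketch):

    import numpy as np

    def separates(w, X, y):
        # True iff y_i * x_i^T w > 0 for all i = 1, ..., n
        return bool(np.all(y * (X @ w) > 0))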
Figure 7: Linearly separable data
Figure 8: Data that is not linearly separable
Finding a linear separator I
- Suppose training data (x_1, y_1), . . . , (x_n, y_n) ∈ R^d × {−1,+1} is linearly separable.
- How to find a linear separator (assuming one exists)?
- Method 1: solve linear feasibility problem
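- A sketch of Method 1 using an off-the-shelf LP solver (scipy's linprog is assumed here just for illustration). Since the constraints y_i x_i^T w > 0 are scale-invariant, we can equivalently require y_i x_i^T w ≥ 1:

    import numpy as np
    from scipy.optimize import linprog

    def find_separator(X, y):
        n, d = X.shape
        # linprog minimizes c^T w subject to A_ub @ w <= b_ub; take c = 0
        # (pure feasibility) and rewrite y_i x_i^T w >= 1 as -(y_i x_i)^T w <= -1
        A_ub = -(y[:, None] * X)
        b_ub = -np.ones(n)
        res = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * d)
        return res.x if res.success else None   # None if the data is not linearly separable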
Finding a linear separator II
- Method 2: approximately solve logistic regression MLE
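- A sketch of Method 2 (scikit-learn is assumed purely for illustration; a very weak penalty approximates the unregularized MLE):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.where(X @ np.array([2.0, -1.0]) > 0, 1, -1)   # linearly separable toy data

    clf = LogisticRegression(C=1e6, fit_intercept=False, max_iter=10_000).fit(X, y)
    w_hat = clf.coef_.ravel()
    train_error = np.mean(np.where(X @ w_hat > 0, 1, -1) != y)  # should be 0.0 here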
Surrogate loss functions I
- Often, a linear separator will not exist.
- Regard each term in negative log-likelihood as logistic loss
  ℓ_logistic(s) := ln(1 + exp(−s))
- Cf. zero-one loss: ℓ_{0/1}(s) := 1{s ≤ 0}
- Scaling of ℓ_logistic is an upper bound on ℓ_{0/1}: a surrogate loss:
  ℓ_{0/1}(s) ≤ (1/ln 2) ℓ_logistic(s) = log_2(1 + exp(−s)).
- Small (empirical) ℓ_logistic-risk implies small (empirical) ℓ_{0/1}-risk
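- A numeric check of this bound on a grid of margin values s (a sketch):

    import numpy as np

    s = np.linspace(-5, 5, 1001)
    zero_one = (s <= 0).astype(float)
    scaled_logistic = np.log1p(np.exp(-s)) / np.log(2)   # (1/ln 2) * logistic loss
    assert np.all(zero_one <= scaled_logistic + 1e-12)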
Figure 9: Comparing zero-one loss and (scaled) logistic loss
Surrogate loss functions II
- Another example: squared loss
  - ℓ_sq(s) := (1 − s)^2
  - Note: (1 − y_i x_i^T w)^2 = (y_i − x_i^T w)^2 since y_i ∈ {−1, +1}
- Weird: ℓ_sq(s) → ∞ as s → ∞.
  - Minimizing R̂_{ℓ_sq} does not necessarily give a linear separator, even if one exists.
- A fix: ℓ_msq(s) := max{0, 1 − s}^2 (modified squared loss)
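- Both losses in code (a sketch); the modified squared loss agrees with squared loss for s ≤ 1 but is flat at zero for s ≥ 1:

    import numpy as np

    def sq_loss(s):
        return (1.0 - s) ** 2

    def msq_loss(s):
        return np.maximum(0.0, 1.0 - s) ** 2   # modified squared loss

    s = np.linspace(-5, 5, 1001)
    assert np.all((s <= 0).astype(float) <= msq_loss(s))   # still upper-bounds zero-one loss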
Figure 10: Comparing zero-one loss and squared loss
Figure 11: Comparing zero-one loss and modified squared loss
(Regularized) empirical risk minimization for classification with surrogate losses
- We can combine these surrogate losses with regularizers, just as when we discussed linear regression
- This leads to regularized ERM objectives:
  (1/n) ∑_{i=1}^n ℓ(y_i x_i^T w) + Φ(w)
  where
  - ℓ is a (surrogate) loss function
  - Φ is a regularizer (e.g., Φ(w) = λ‖w‖_2^2, Φ(w) = λ‖w‖_1)
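- A sketch of one such objective: logistic surrogate loss with the L2 regularizer Φ(w) = λ‖w‖_2^2:

    import numpy as np

    def regularized_objective(w, X, y, lam):
        margins = y * (X @ w)                                 # s_i = y_i x_i^T w
        surrogate_risk = np.mean(np.log1p(np.exp(-margins)))  # (1/n) sum_i l_logistic(s_i)
        return surrogate_risk + lam * np.sum(w ** 2)          # + lambda * ||w||_2^2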