Machine learning lecture slides

COMS 4771 Fall 2020

Classification I: Linear classification

Outline

- Logistic regression and linear classifiers
- Example: text classification
- Maximum likelihood estimation and empirical risk minimization
- Linear separators
- Surrogate loss functions

Logistic regression model

- Suppose x is given by d real-valued features, so x ∈ Rd, while y ∈ {−1,+1}.
- Logistic regression model for (X,Y):
  - Y | X = x is Bernoulli (but taking values in {−1,+1} rather than {0, 1}) with parameter σ(xTw) := 1/(1 + exp(−xTw)).
  - Sigmoid function σ(t) := 1/(1 + e−t)
  - w ∈ Rd is the parameter vector of interest
  - w is not involved in the marginal distribution of X (which we don't care much about)

Figure 1: Logistic (sigmoid) function
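To make the model concrete, here is a minimal NumPy sketch (not from the slides; all names are illustrative) of the sigmoid and of the conditional probability Pr(Y = y | X = x) = σ(y xTw):

```python
import numpy as np

def sigmoid(t):
    """Logistic sigmoid sigma(t) = 1 / (1 + exp(-t)), computed stably."""
    t = np.asarray(t, dtype=float)
    p = np.exp(-np.abs(t))            # exp of a non-positive number: no overflow
    return np.where(t >= 0, 1.0 / (1.0 + p), p / (1.0 + p))

def prob_y_given_x(y, x, w):
    """Pr(Y = y | X = x) = sigma(y * x^T w) for y in {-1, +1}."""
    return sigmoid(y * np.dot(x, w))

# Quick check of the useful property 1 - sigma(t) = sigma(-t):
assert np.isclose(1.0 - sigmoid(2.0), sigmoid(-2.0))
```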

Log-odds in logistic regression model

- Sigmoid function σ(t) := 1/(1 + e−t)
- Useful property: 1 − σ(t) = σ(−t)
- Pr(Y = +1 | X = x) = σ(xTw)
- Pr(Y = −1 | X = x) = 1 − σ(xTw) = σ(−xTw)
- Convenient formula: for each y ∈ {−1,+1},

  Pr(Y = y | X = x) = σ(yxTw).

- Log-odds in the model is given by a linear function:

  ln [ Pr(Y = +1 | X = x) / Pr(Y = −1 | X = x) ] = xTw.

- Just like in linear regression, it is common to use feature expansion!
  - E.g., affine feature expansion ϕ(x) = (1, x) ∈ Rd+1
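A short sanity check on the log-odds identity, using the affine feature expansion; this is a sketch with arbitrary numbers, not part of the original slides:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))   # fine here; the inputs are moderate

rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)
phi = np.concatenate(([1.0], x))      # affine feature expansion phi(x) = (1, x)
w = rng.normal(size=d + 1)

p_pos = sigmoid(np.dot(phi, w))       # Pr(Y = +1 | X = x)
p_neg = sigmoid(-np.dot(phi, w))      # Pr(Y = -1 | X = x)

# The log-odds equals the linear function phi(x)^T w, and the probabilities sum to 1:
assert np.isclose(np.log(p_pos / p_neg), np.dot(phi, w))
assert np.isclose(p_pos + p_neg, 1.0)
```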

Optimal classifier in logistic regression model

- Recall that the Bayes classifier is

  f⋆(x) = +1 if Pr(Y = +1 | X = x) > 1/2, and −1 otherwise.

- If the distribution of (X,Y) comes from the logistic regression model with parameter w, then the Bayes classifier is

  f⋆(x) = +1 if xTw > 0, and −1 otherwise; i.e., f⋆(x) = sign(xTw).

- This is a linear classifier
  - Compute a linear combination of the features, then check whether it is above a threshold (zero)
  - With the affine feature expansion, the threshold can be non-zero
- Many other statistical models for classification data lead to a linear (or affine) classifier, e.g., Naive Bayes

Figure 2: Implementation of a linear classifier
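A minimal sketch of the prediction rule the figure depicts (illustrative names; ties at exactly zero are broken toward −1, matching the "otherwise" case above):

```python
import numpy as np

def predict(X, w):
    """Linear classifier: sign(x^T w) per row of X, with sign(0) taken to be -1."""
    scores = X @ w                     # one linear combination of features per example
    return np.where(scores > 0, +1, -1)
```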


Geometry of linear classifiers

- Hyperplane specified by normal vector w ∈ Rd:
  - H = {x ∈ Rd : xTw = 0}
  - This is the decision boundary of a linear classifier
- Angle θ between x and w satisfies

  cos(θ) = xTw / (‖x‖2 ‖w‖2)

- Distance to the hyperplane is given by ‖x‖2 · cos(θ)
- x is on the same side of H as w iff xTw > 0

Figure 3: Decision boundary of linear classifier
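A small sketch of these two geometric quantities (helper names are illustrative, not from the slides):

```python
import numpy as np

def cos_angle(x, w):
    """cos(theta) = x^T w / (||x||_2 ||w||_2)."""
    return np.dot(x, w) / (np.linalg.norm(x) * np.linalg.norm(w))

def signed_distance(x, w):
    """Signed distance from x to H = {z : z^T w = 0}.

    Equals ||x||_2 * cos(theta); positive iff x is on the same side of H as w.
    """
    return np.dot(x, w) / np.linalg.norm(w)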

Geometry of linear classifiers (2)

- With feature expansion, we can obtain other types of decision boundaries (see the sketch after the figures below)

Figure 4: Decision boundary of linear classifier with quadratic feature expansion

Figure 5: Decision boundary of linear classifier with quadratic feature expansion (another one)
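One way to realize a quadratic feature expansion like the one behind Figures 4 and 5 is to append all monomials of degree at most two; a minimal sketch (the helper name is illustrative, and other quadratic expansions are possible):

```python
import numpy as np
from itertools import combinations_with_replacement

def quadratic_features(x):
    """Map x in R^d to (1, x_1, ..., x_d, x_i * x_j for all i <= j)."""
    x = np.asarray(x, dtype=float)
    quad = [x[i] * x[j]
            for i, j in combinations_with_replacement(range(len(x)), 2)]
    return np.concatenate(([1.0], x, quad))

# A linear classifier sign(quadratic_features(x)^T w) has a quadratic decision
# boundary (e.g., an ellipse or a parabola) in the original coordinates.
```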

MLE for logistic regression

- Treat training examples as i.i.d., from the same distribution as the test example
- Log-likelihood of w given data (x1, y1), . . . , (xn, yn) ∈ Rd × {−1,+1}:

  − ∑_{i=1}^n ln(1 + exp(−yi xiTw)) + {terms not involving w}

- No "closed form" expression for the maximizer
- (Later, we'll discuss algorithms for finding approximate maximizers using iterative methods like gradient descent.)
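A minimal sketch of the corresponding objective that iterative methods would minimize, namely the negative log-likelihood up to constants; np.logaddexp(0, t) computes ln(1 + exp(t)) stably:

```python
import numpy as np

def negative_log_likelihood(w, X, y):
    """sum_i ln(1 + exp(-y_i x_i^T w)), up to terms not involving w.

    X has shape (n, d); y has entries in {-1, +1}.
    """
    margins = y * (X @ w)
    return np.sum(np.logaddexp(0.0, -margins))
```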

Example: Text classification (1)

- Data: articles posted to various internet message boards
- Label: −1 for articles on "religion", +1 for articles on "politics"
- Features:
  - Vocabulary of d = 61188 words
  - Each document is a binary vector x = (x1, . . . , xd) ∈ {0, 1}d, where xj = 1{document contains j-th vocabulary word}
- Logistic regression model:

  ln [ Prw(Y = politics | X = x) / Prw(Y = religion | X = x) ] = wTx

- Each weight in the weight vector w = (w1, . . . , w61188) corresponds to a vocabulary word
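A minimal sketch of this binary bag-of-words featurization; the tiny vocabulary and the whitespace tokenization are placeholders, not the preprocessing actually used for the dataset:

```python
import numpy as np

def binary_bag_of_words(document, vocabulary):
    """x_j = 1 if the document contains the j-th vocabulary word, else 0."""
    words = set(document.lower().split())       # crude whitespace tokenization
    return np.array([1 if v in words else 0 for v in vocabulary])

vocabulary = ["israel", "gun", "god", "bible", "clinton"]   # illustrative only
x = binary_bag_of_words("The Bible and the church", vocabulary)
# x == [0, 0, 0, 1, 0]
```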

Example: Text classification (2)

- Found ŵ that approximately maximizes the likelihood given 3028 training examples
- Test error rate on 2017 examples is about 8.5%.
- Vocabulary words with the 10 highest (most positive) coefficients:
  - israel, gun, government, american, news, clinton, rights, guns, israeli, politics
- Vocabulary words with the 10 lowest (most negative) coefficients:
  - god, christian, bible, jesus, keith, christians, religion, church, christ, athos

Example: Text classification (3)

Figure 6: Histogram of Prŵ(Y = politics | X = x) values on test data


Example: Text classification (4)

- Article with Prŵ(Y = politics | X = x) ≈ 0.0:
Rick, I think we can safely say, 1) Robert is not the only
person who understands the Bible, and 2), the leadership
of the LDS church historicly never has. Let’s consider
some “personal interpretations” and see how much trust
we should put in “Orthodox Mormonism”, which could
never be confused with Orthodox Christianity. [. . . ]


Example: Text classification (5)

- Article with Prŵ(Y = politics | X = x) ≈ 0.5:
Does anyone know where I can access an online copy of the
proposed “jobs” or “stimulus” legislation? Please E-mail
me directly and if anyone else is interested, I can post this
information.

- Article with Prŵ(Y = politics | X = x) ≈ 1.0:
[A reprint of a Village Voice article by Robert I. Friedman
titled “The Enemy Within” about the Anti-Defamation
League.]


Zero-one loss and ERM for linear classifiers

- Recall: the error rate of a classifier f can also be written as a risk:

  R(f) = E[1{f(X) ≠ Y}] = Pr(f(X) ≠ Y),

  where the loss function is the zero-one loss.
- For classification, we are ultimately interested in classifiers with small error rate
  - I.e., small (zero-one loss) risk
- Just like for linear regression, we can apply the plug-in principle to derive ERM, but now for linear classifiers.
- Find w ∈ Rd to minimize

  R̂(w) := (1/n) ∑_{i=1}^n 1{sign(xiTw) ≠ yi}.
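A minimal sketch of this empirical zero-one risk (illustrative names, with sign(0) again taken as −1):

```python
import numpy as np

def empirical_zero_one_risk(w, X, y):
    """Fraction of examples with sign(x_i^T w) != y_i."""
    predictions = np.where(X @ w > 0, +1, -1)
    return np.mean(predictions != y)
```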

Performance of ERM for linear classifiers

- Theorem: In the IID model, the ERM solution ŵ satisfies

  E[R(ŵ)] ≤ min_{w∈Rd} R(w) + O(√(d/n)).

- Inductive bias assumption: there is a linear classifier with low error rate, so min_{w∈Rd} R(w) is small.
- Unfortunately, solving this optimization problem, even for linear classifiers, is computationally intractable.
  - (Sharp contrast to the ERM optimization problem for linear regression!)

Linearly separable data

- Training data is linearly separable if there exists a linear classifier with training error rate zero.
  - (Special case where the ERM optimization problem is tractable.)
- I.e., there exists w ∈ Rd such that sign(xiTw) = yi for all i = 1, . . . , n.
- Equivalent condition:

  yi xiTw > 0 for all i = 1, . . . , n

Figure 7: Linearly separable data
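A one-line check of this condition, as a sketch with illustrative names:

```python
import numpy as np

def separates(w, X, y):
    """True iff w is a linear separator: y_i x_i^T w > 0 for every i."""
    return bool(np.all(y * (X @ w) > 0))
```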

Figure 8: Data that is not linearly separable


Finding a linear separator I

- Suppose training data (x1, y1), . . . , (xn, yn) ∈ Rd × {−1,+1} is linearly separable.
- How to find a linear separator (assuming one exists)?
- Method 1: solve a linear feasibility problem
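One possible way to pose the feasibility problem as a linear program: by rescaling w, the strict inequalities yi xiTw > 0 can be replaced by yi xiTw ≥ 1, which an off-the-shelf LP solver can handle with a zero objective. A sketch assuming SciPy is available (function name is illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def find_separator_lp(X, y):
    """Find w with y_i x_i^T w >= 1 for all i (any such w is a linear separator).

    Returns None if the data is not linearly separable.
    """
    n, d = X.shape
    # linprog solves: min c^T w  subject to  A_ub @ w <= b_ub.
    # The constraints y_i x_i^T w >= 1 become -(y_i x_i)^T w <= -1.
    A_ub = -(y[:, None] * X)
    b_ub = -np.ones(n)
    result = linprog(c=np.zeros(d), A_ub=A_ub, b_ub=b_ub,
                     bounds=[(None, None)] * d)
    return result.x if result.success else None
```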

Finding a linear separator II

- Method 2: approximately solve the logistic regression MLE
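A sketch of one simple way to do this, using plain gradient descent on the negative log-likelihood (step size and iteration count are arbitrary; on separable data the likelihood has no finite maximizer, but after enough iterations sign(X @ w) matches y):

```python
import numpy as np

def logistic_mle_separator(X, y, step=0.1, iters=1000):
    """Gradient descent on sum_i ln(1 + exp(-y_i x_i^T w))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        margins = y * (X @ w)
        p = np.exp(-np.abs(margins))
        s = np.where(margins >= 0, p / (1.0 + p), 1.0 / (1.0 + p))  # sigma(-margins)
        grad = -(X.T @ (y * s))   # gradient of the negative log-likelihood
        w -= step * grad
    return w
```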

Surrogate loss functions I

- Often, a linear separator will not exist.
- Regard each term in the negative log-likelihood as a logistic loss:

  ℓlogistic(s) := ln(1 + exp(−s))

- C.f. the zero-one loss: ℓ0/1(s) := 1{s ≤ 0}
- A scaling of ℓlogistic is an upper bound on ℓ0/1, i.e., a surrogate loss:

  ℓ0/1(s) ≤ (1/ln 2) · ℓlogistic(s) = log2(1 + exp(−s)).

- Small (empirical) ℓlogistic-risk implies small (empirical) ℓ0/1-risk

Figure 9: Comparing zero-one loss and (scaled) logistic loss
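A small numeric check of the surrogate bound shown in Figure 9 (a sketch; the grid of margins is arbitrary):

```python
import numpy as np

def zero_one_loss(s):
    return (s <= 0).astype(float)

def logistic_loss(s):
    return np.logaddexp(0.0, -s)          # ln(1 + exp(-s))

# The rescaled logistic loss log2(1 + exp(-s)) upper-bounds the zero-one loss:
s = np.linspace(-5, 5, 101)
assert np.all(zero_one_loss(s) <= logistic_loss(s) / np.log(2))
```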

Surrogate loss functions II

- Another example: squared loss
  - ℓsq(s) = (1 − s)2
  - Note: (1 − yi xiTw)2 = (yi − xiTw)2 since yi ∈ {−1, +1}
  - Weird: ℓsq(s) → ∞ as s → ∞.
  - Minimizing R̂ℓsq does not necessarily give a linear separator, even if one exists.
- A fix: ℓmsq(s) := max{0, 1 − s}2 (modified squared loss)

Figure 10: Comparing zero-one loss and squared loss
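A brief sketch of these two losses (illustrative names), contrasting their behavior for large positive margins:

```python
import numpy as np

def squared_loss(s):
    return (1.0 - s) ** 2

def modified_squared_loss(s):
    return np.maximum(0.0, 1.0 - s) ** 2   # stays at zero for s >= 1

# Both upper-bound the zero-one loss 1{s <= 0}, but only the modified version
# does not penalize large positive margins.
s = np.linspace(-1.5, 1.5, 61)
assert np.all((s <= 0) <= modified_squared_loss(s))
```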

Figure 11: Comparing zero-one loss and modified squared loss

(Regularized) empirical risk minimization for classification
with surrogate losses

- We can combine these surrogate losses with regularizers, just as when we discussed linear regression
- This leads to regularized ERM objectives:

  (1/n) ∑_{i=1}^n ℓ(yi xiTw) + Φ(w)

  where
  - ℓ is a (surrogate) loss function
  - Φ is a regularizer (e.g., Φ(w) = λ‖w‖22, Φ(w) = λ‖w‖1)
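A minimal sketch of such an objective, here with the logistic loss and the squared ℓ2 (ridge) regularizer as one concrete instance (names are illustrative):

```python
import numpy as np

def regularized_erm_objective(w, X, y, loss, lam):
    """(1/n) sum_i loss(y_i x_i^T w) + lam * ||w||_2^2."""
    margins = y * (X @ w)
    return np.mean(loss(margins)) + lam * np.sum(w ** 2)

# Example instantiation: logistic loss with ridge regularization.
logistic = lambda s: np.logaddexp(0.0, -s)
# objective_value = regularized_erm_objective(w, X, y, logistic, lam=0.1)
```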
