CS计算机代考程序代写 Machine learning lecture slides

Machine learning lecture slides

Machine learning lecture slides

COMS 4771 Fall 2020

0 / 32

Prediction theory

Outline

I Statistical model for binary outcomes
I Plug-in principle and IID model
I Maximum likelihood estimation
I Statistical model for binary classification
I Analysis of nearest neighbor classifier
I Estimating the error rate of a classifier
I Beyond binary classificaiton and the IID model

1 / 32

Statistical model for binary outcomes

I Example: coin toss
I Physical model: hard
I Statistical model: outcome is random

I Bernoulli distribution with heads probability θ ∈ [0, 1]
I Encode heads as 1 and tails as 0
I Written as Bernoulli(θ)
I Notation: Y ∼ Bernoulli(θ) means Y is a random variable with

distribution Bernoulli(θ).
I Goal: correctly predict outcome

2 / 32

Optimal prediction

I Suppose Y ∼ Bernoulli(θ).
I Suppose θ known.
I Optimal prediction:

1{θ>1/2}
I Indicator function notation:

1{Q} :=

{
1 if Q is true
0 if Q is false

I The optimal prediction is incorrect with probability

min{θ, 1− θ}

3 / 32

Learning to make predictions

I If θ unknown:
I Assume we have data: outcomes of previous coin tosses
I Data should be related to what we want to predict: same coin

is being tossed

4 / 32

Plug-in principle and IID model

I Plug-in principle:
I Estimate unknown(s) based on data (e.g., θ)
I Plug estimates into formula for optimal prediction

I When can we estimate the unknowns?
I Observed data should be related to the outcome we want to

predict
I IID model: Observations & (unseen) outcome are iid random

variables
I iid: independent and identically distributed
I Crucial modeling assumption that makes learning possible

I When is the IID assumption not reasonable? . . .

5 / 32

Statistical models

I Parametric statistical model {Pθ : θ ∈ Θ}
I collection of parameterized probability distributions for data
I Θ is the parameter space
I One distribution per parameter value θ ∈ Θ

I E.g., distributions on n binary outcomes treated as iid Bernoulli
random variables
I Θ = [0, 1]
I Overload notation: Pθ is the probability mass function (pmf)

for the distribution.
I What is formula for Pθ(y1, . . . , yn) for (y1, . . . , yn) ∈ {0, 1}n?

6 / 32

Maximum likelihood estimation (1)

I Likelihood of parameter θ (given observed data)
I L(θ) = Pθ(y1, . . . , yn)

I Maximum likelihood estimation:
I Choose θ with highest likelihood

I Log-likelihood
I Sometimes more convenient
I ln is increasing, so lnL(θ) orders the parameters in the same

way as L(θ)

7 / 32

Maximum likelihood estimation (2)

I Coin toss example
I Log-likelihood

lnL(θ) =
n∑
i=1

yi ln θ + (1− yi) ln(1− θ)

I Use calculus to determine formula for maximizer
I This is a little annoying, but someone else has already done it

for you:

θ̂MLE :=
1
n

n∑
i=1

yi.

8 / 32

Back to plug-in principle

I We are given data y1, . . . , yn ∈ {0, 1}n, which we model using
the IID model from before

I Obtain estimate θ̂MLE of known θ based on y1, . . . , yn
I Plug-in θ̂MLE for θ in formula for optimal prediction:

Ŷ := 1{θ̂MLE>1/2}.

9 / 32

Analysis of the plug-in prediction (1)

I How good is the plug-in prediction?
I Study behavior under the IID model, where

Y1, . . . , YN , Y ∼iid Bernoulli(θ).
I Y1, . . . , Yn are the data we collected
I Y is the outcome to predict
I θ is the unknown parameter

I Recall: optimal prediction is incorrect with probability
min{θ, 1− θ}.

I We cannot hope Ŷ to beat this, but we can hope it is not much
worse.

10 / 32

Analysis of the plug-in prediction (2)

I Theorem:
Pr(Ŷ 6= Y ) ≤ min{θ, 1− θ}+ 12 · |θ − 0.5| · e

−2n(θ−0.5)2 .
I The first term is the optimal error probability.
I The second term comes from the probability that the θ̂MLE is

on the opposite side of 1/2 as θ.
I This probability is very small when n is large!
I If S is number of heads in n independent tosses of coin with

bias θ, then S ∼ Binomial(n, θ) (Binomial distribution)

11 / 32

n=20

0 1 2 3 4 5 6 7 8 9 10

k

0

0.05

0.1

0.15

P
r[

S
=

k
]

Figure 1: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 20

12 / 32

n=40

0 5 10 15 20 25 30 35 40

k

0

0.05

0.1

P
r[

S
=

k
]

Figure 2: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 40

13 / 32

n=60

0 10 20 30 40 50 60

k

0

0.02

0.04

0.06

0.08

0.1

P
r[

S
=

k
]

Figure 3: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 60

14 / 32

n=80

0 10 20 30 40 50 60 70 80

k

0

0.02

0.04

0.06

0.08

0.1

P
r[

S
=

k
]

Figure 4: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 80

15 / 32

Statistical model for labeled data in binary classification

I Example: spam filtering
I Labeled example: (x, y) ∈ X × {0, 1}
I X is input (feature) space; {0, 1} is the output (label) space

I X is not necessarily the space of inputs itself (e.g., space of all
emails), but rather the space of what we measure about inputs

I We only see x (email), and then must make prediction of y
(spam or not-spam)

I Statistical model: (X,Y ) is random
I X has some marginal probability distribution
I Conditional probability distribution of Y given X = x is

Bernoulli with heads probability η(x)
I η : X → [0, 1] is a function, sometimes called the

regression function or conditional mean function (since
E[Y | X = x] = η(x)).

16 / 32

Error rate of a classifier

I For a classifier f : X → {0, 1}, the error rate of f (with
respect to the distribution of (X,Y )) is

err(f) := Pr(f(X) 6= Y ).

Recall that we had previously used the notation

err(f, S) =
1
|S|


(x,y)∈S

1{f(x)6=y},

which is the same as Pr(f(X) 6= Y ) when the distribution of
(X,Y ) is uniform over the labeled examples in S.

I Caution: This notation err(f) does not make explicit the
dependence on (the distribution of) the random example
(X,Y ). You will need to determine this from context.

17 / 32

Conditional expectations (1)

I Consider any random variables A and B.
I Conditional expectation of A given B:

I Written E[A | B]
I A random variable! What is its expectation?
I Law of iterated expectations (a.k.a. tower property):

E[E[A | B]] = E[A]

18 / 32

Conditional expectations (2)

I Example: roll a fair 6-sided die
I A = number shown facing up
I B = parity of number shown facing up
I C := E[A | B] is random variable with

Pr
(
C = E[A | B = odd] =

1
3

(1 + 3 + 5) = 3
)

=
1
2

Pr
(
C = E[A | B = even] =

1
3

(2 + 4 + 6) = 4
)

=
1
2

19 / 32

Bayes classifier

I Optimal classifier (Bayes classifier):

f?(x) = 1{η(x)>1/2},

where η is the conditional mean function
I Classifier with smallest probability of mistake
I Depends on the function η, which is typically unknown!

I Optimal error rate (Bayes error rate):
I Write error rate as err(f?) = Pr(f?(X) 6= Y ) = E[1{f?(X)6=Y }]
I Conditional on X, probability of mistake is

min{η(X), 1− η(X)}.
I So, optimal error rate is

err(f?) = E[1{f?(X)6=Y }]
= E[E[1{f?(X)6=Y } | X]]
= E[min{η(X), 1− η(X)}].

20 / 32

Example: spam filtering

I Suppose input x is a single (binary) feature, “is email all-caps?”
I How to interpret “the probability that email is spam given
x = 1?”

I What does it mean for the Bayes classifier f? to be optimal?

21 / 32

Learning prediction functions

I What to do if η is unknown?
I Training data: (x1, y1), . . . , (xn, yn)
I Assume data are related to what we want to predict
I Let Z := (X,Y ), and Zi := (Xi, Yi) for i = 1, . . . , n.
I IID model: Z1, . . . , Zn, Z are iid random variables
I Z = (X,Y ) is the (unseen) “test” example
I (Technically, each labeled example is a (X × {0, 1})-valued

random variable. If X = Rd, can regard as vector of d+ 1
random variables.)

22 / 32

Performance of nearest neighbor classifier

I Study in context of IID model
I Assume η(x) ≈ η(x′) whenever x and x′ are close.

I This is where the modeling assumption comes in (via choice of
distance function)!

I Let (X,Y ) be the “test” example, and suppose (Xî, Yî) is the
nearest neighbor among training data
S = ((X1, Y1), . . . , (Xn, Yn)).

I For large n, X and Xî likely to be close enough so that
η(X) ≈ η(Xî).

I Prediction is Yî, true label is Y .
I Conditional on X and Xî, what is probability that Yî 6= Y ?

I η(X)(1− η(Xî)) + (1− η(X))η(Xî) ≈ 2η(X)(1− η(X))
I Conclusion: expected error rate is

E[err(NNS)] ≈ 2 · E[η(X)(1− η(X))] for large n
I Recall that optimal is E[min{η(X), 1− η(X)}].
I So E[err(NNS)] is at most twice optimal.
I Never exactly optimal unless η(x) ∈ {0, 1} for all x.

23 / 32

Test error rate (1)

I How to estimate error rate?
I IID model:

(X1, Y1), . . . , (Xn, Yn), (X ′1, Y ′1), . . . , (X ′m, Y ′m), (X,Y ) are iid
I Training examples (that you have): (X1, Y1), . . . , (Xn, Yn)
I Test examples (that you have): (X ′1, Y ′1), . . . , (X ′m, Y ′m)
I Test example (that you don’t have) used to define error rate:

(X,Y )
I Classifier f̂ is based only on training examples
I Hence, test examples are independent of f̂ (very

important!)
I We would like to estimate err(f̂)

I Caution: since f̂ depends on training data, it is random!
I Convention: When we write err(f̂) where f̂ is random, we

really mean Pr(f̂(X) 6= Y | f̂).
I Therefore err(f̂) is a random variable!

24 / 32

Test error rate (2)

I Conditional distribution of S :=
∑m
i=1 1{f̂(X′

i
)6=Y ′

i
} given

training data:
I S | training data ∼ Binomial(m, ε) where ε := err(f̂)
I By law of large numbers,

1
m
S → ε

as m→∞
I Therefore, test error rate

1
m

m∑
i=1

1{f̂(X′
i
) 6=Y ′

i
}

is close to ε when m is large
I How accurate is the estimate? Depends on the (conditional)

variance!
I var( 1

m
S | training data) = ε(1−ε)

m

I Standard deviation is

ε(1−ε)
m

25 / 32

Confusion tables

I True positive rate (recall): Pr(f(X) = 1 | Y = 1)
I False positive rate: Pr(f(X) = 1 | Y = 0)
I Precision: Pr(Y = 1 | f(X) = 1)
I . . .
I Confusion table

f(x) = 0 f(x) = 1
y = 0 # true negatives # false positives
y = 1 # false negatives # true positives

26 / 32

ROC curves

I Receiver operating characteristic (ROC) curve
I What points are achievable on the TPR-FPR plane?
I Use randomization to combine classifiers

27 / 32

FPR

TPR

(FPR1,TPR1)

(FPR2,TPR2)

0
1

1

Figure 5: TPR vs FPR plot with two points

28 / 32

FPR

TPR

0
1

1

Figure 6: TPR vs FPR plot with many points

29 / 32

More than two outcomes

I What if there are K > 2 possible outcomes?
I Replace coin with K-sided die
I Say Y has a categorical distribution over [K] := {1, . . . ,K},

determined probability vector θ = (θ1, . . . , θK)
I θk ≥ 0 for all k ∈ [K], and

∑K
k=1 θk = 1

I Pr(Y = k) = θk
I Optimal prediction of Y if θ is known

ŷ := arg max
k∈[K]

θk

30 / 32

Statistical model for multi-class classification

I Statistical model for labeled examples (X,Y ), where Y takes
values in [K]
I Now, Y | X = x has a categorical distribution with parameter

vector η(x) = (η(x)1, . . . , η(x)K)
I Conditional probability function: η(x)k := Pr(Y = k | X = x)
I Optimal classifier: f?(x) = arg maxk∈[K] η(x)k
I Optimal error rate: Pr(f?(X) 6= Y ) = 1− E[maxk η(X)k]

31 / 32

Potential downsides of the IID model

I Example: Train OCR digit classifier using data from Alice’s
handwriting, but eventually use on digits written by Bob.

I What is a better evaluation?

I What if we want to eventually use on digits written by both
Alice and Bob?

32 / 32

Prediction theory