
Machine learning lecture slides
COMS 4771 Fall 2020

Recitation: Friday, 10am-noon EDT. We'll have 2 parallel sessions running; work on problems in small groups. Watch your email for instructions about which session to join.

Prediction theory

Outline

- Statistical model for binary outcomes
- Plug-in principle and IID model
- Maximum likelihood estimation
- Statistical model for binary classification
- Analysis of nearest neighbor classifier
- Estimating the error rate of a classifier
- Beyond binary classification and the IID model

Statistical model for binary outcomes

- Example: coin toss
  - Physical model: hard
  - Statistical model: outcome is random
- Bernoulli distribution with heads probability θ ∈ [0, 1]
  - Encode heads as 1 and tails as 0
  - Written as Bernoulli(θ)
  - Notation: Y ∼ Bernoulli(θ) means Y is a random variable with distribution Bernoulli(θ).
- Goal: correctly predict the outcome

Optimal prediction

- Suppose Y ∼ Bernoulli(θ), with θ known.
- Optimal prediction: ŷ = 1{θ > 1/2}
- Indicator function notation:
  1{Q} := 1 if Q is true, 0 if Q is false
- The optimal prediction is incorrect with probability min{θ, 1 − θ}
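The error probability min{θ, 1 − θ} can be checked numerically. A minimal sketch (not part of the course materials; the function names are made up for illustration):

```python
import random

def optimal_prediction(theta):
    """Predict the more likely outcome: 1 if theta > 1/2, else 0."""
    return 1 if theta > 0.5 else 0

def simulated_error(theta, n_trials=100_000, seed=0):
    """Fraction of simulated tosses the optimal prediction gets wrong."""
    rng = random.Random(seed)
    yhat = optimal_prediction(theta)
    wrong = sum((1 if rng.random() < theta else 0) != yhat
                for _ in range(n_trials))
    return wrong / n_trials

print(simulated_error(0.7))  # close to min{0.7, 1 - 0.7} = 0.3
```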

Learning to make predictions

- If θ unknown:
  - Assume we have data: outcomes of previous coin tosses
  - Data should be related to what we want to predict: the same coin is being tossed

Plug-in principle and IID model

- Plug-in principle:
  - Estimate unknown(s) based on data (e.g., θ)
  - Plug estimates into the formula for the optimal prediction
- When can we estimate the unknowns?
  - Observed data should be related to the outcome we want to predict
- IID model: observations & (unseen) outcome are iid random variables
  - iid: independent and identically distributed
  - Crucial modeling assumption that makes learning possible
- When is the IID assumption not reasonable? . . .

Statistical models

- Parametric statistical model {P_θ : θ ∈ Θ}
  - A collection of parameterized probability distributions for data
  - Θ is the parameter space
  - One distribution per parameter value θ ∈ Θ
- E.g., distributions on n binary outcomes treated as iid Bernoulli random variables
  - Θ = [0, 1]
- Overload notation: P_θ is also the probability mass function (pmf) for the distribution.
- What is the formula for P_θ(y1, …, yn) for (y1, …, yn) ∈ {0, 1}^n?
  P_θ(y1, …, yn) = ∏_{i=1}^n θ^(yi) (1 − θ)^(1 − yi)

Maximum likelihood estimation (1)

- Likelihood of parameter θ (given observed data):
  L(θ) = P_θ(y1, …, yn)
- Maximum likelihood estimation: choose θ with the highest likelihood
- Log-likelihood ln L(θ) is sometimes more convenient
  - ln is increasing, so ln L(θ) orders the parameters in the same way as L(θ)

Maximum likelihood estimation (2)

- Coin toss example. Log-likelihood:
  ln L(θ) = Σ_{i=1}^n [ yi ln θ + (1 − yi) ln(1 − θ) ]
- Use calculus to determine a formula for the maximizer
- This is a little annoying, but someone else has already done it for you:
  θ̂_MLE := (1/n) Σ_{i=1}^n yi
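As a sketch of the closed form above (the helper names are made up), one can check that the sample mean really does maximize the log-likelihood, e.g., against a brute-force grid search:

```python
import math

def log_likelihood(theta, ys):
    """ln L(theta) = sum_i [ y_i ln(theta) + (1 - y_i) ln(1 - theta) ]."""
    return sum(y * math.log(theta) + (1 - y) * math.log(1 - theta) for y in ys)

def theta_mle(ys):
    """Closed-form maximizer of the log-likelihood: the sample mean."""
    return sum(ys) / len(ys)

ys = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 7 heads out of 10 tosses
print(theta_mle(ys))                  # 0.7

# The closed form agrees with a brute-force search over a grid of candidates:
best_grid = max((t / 100 for t in range(1, 100)),
                key=lambda t: log_likelihood(t, ys))
print(best_grid)                      # 0.7
```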

Back to plug-in principle

- We are given data y1, …, yn ∈ {0, 1}, which we model using the IID model from before
- Obtain the estimate θ̂_MLE of the unknown θ based on y1, …, yn
- Plug in θ̂_MLE for θ in the formula for the optimal prediction:
  Ŷ := 1{θ̂_MLE > 1/2}

Analysis of the plug-in prediction (1)

- How good is the plug-in prediction?
- Study behavior under the IID model, where Y1, …, Yn, Y ∼ iid Bernoulli(θ) with θ unknown.
  - Y1, …, Yn are the data we collected
  - Y is the outcome to predict
  - θ is the unknown parameter
- Recall: the optimal prediction is incorrect with probability min{θ, 1 − θ}.
- We cannot hope for Ŷ to beat this, but we can hope it is not much worse:
  Pr(Ŷ ≠ Y) ≤ min{θ, 1 − θ} + (something small),
  where we hope the "something small" is indeed small when n is large.

Analysis of the plug-in prediction (2)

- Theorem:
  Pr(Ŷ ≠ Y) ≤ min{θ, 1 − θ} + |θ − 0.5| · e^(−2n(θ − 0.5)²)
- The first term is the optimal error probability.
- The second term comes from the probability that θ̂_MLE is on the opposite side of 1/2 from θ.
  - This probability is very small when n is large!
- If S is the number of heads in n independent tosses of a coin with bias θ, then S ∼ Binomial(n, θ) (the Binomial distribution)
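A quick Monte Carlo sketch of the theorem's flavor (the setup and names are illustrative, not from the slides): simulate the plug-in predictor and compare its error to the bound.

```python
import math
import random

def plugin_error_prob(theta, n, trials=20_000, seed=1):
    """Monte Carlo estimate of Pr(Yhat != Y) for the plug-in predictor."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(trials):
        # Fit theta_mle on n fresh tosses, predict, and check a fresh outcome.
        theta_mle = sum(rng.random() < theta for _ in range(n)) / n
        yhat = 1 if theta_mle > 0.5 else 0
        y = 1 if rng.random() < theta else 0
        wrong += yhat != y
    return wrong / trials

theta, n = 0.7, 100
estimate = plugin_error_prob(theta, n)
bound = min(theta, 1 - theta) + abs(theta - 0.5) * math.exp(-2 * n * (theta - 0.5) ** 2)
print(estimate, bound)  # estimate near 0.3; bound barely above 0.3
```

For n = 100 the second term is already about 0.2 · e^(−8), so the plug-in prediction is essentially optimal.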

Figure 1: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 20

Figure 2: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 40

Figure 3: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 60

Figure 4: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 80
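The quantity plotted in the figures can be computed exactly from the Binomial pmf. A small sketch (assuming a bias of θ = 0.7 for concreteness):

```python
from math import comb

def prob_more_than_half_heads(n, theta):
    """Pr(S > n/2) for S ~ Binomial(n, theta), computed exactly from the pmf."""
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# As n grows, the majority of tosses lands on theta's side of 1/2
# with probability approaching 1.
for n in (20, 40, 60, 80):
    print(n, prob_more_than_half_heads(n, 0.7))
```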

Statistical model for labeled data in binary classification

- Example: spam filtering
- Labeled example: (x, y) ∈ X × {0, 1}
  - X is the input (feature) space; {0, 1} is the output (label) space
  - X is not necessarily the space of inputs itself (e.g., the space of all emails), but rather the space of what we measure about inputs
- We only see x (email), and then must make a prediction of y (spam or not-spam)
- Statistical model: (X, Y) is random
  - X has some marginal probability distribution
  - The conditional probability distribution of Y given X = x is Bernoulli with heads probability η(x)
  - η : X → [0, 1] is a function, sometimes called the regression function or conditional mean function (since E[Y | X = x] = η(x)).

Error rate of a classifier

- For a classifier f : X → {0, 1}, the error rate of f (with respect to the distribution of (X, Y)) is
  err(f) := Pr(f(X) ≠ Y).
  Recall that we had previously used the notation
  err(f, S) = (1/|S|) Σ_{(x,y)∈S} 1{f(x) ≠ y},
  which is the same as Pr(f(X) ≠ Y) when the distribution of (X, Y) is uniform over the labeled examples in S.
- Caution: the notation err(f) does not make explicit the dependence on (the distribution of) the random example (X, Y). You will need to determine this from context.
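The empirical version err(f, S) is a one-liner. A minimal sketch with a made-up threshold classifier and made-up data:

```python
def empirical_error(f, S):
    """err(f, S) = (1/|S|) * sum over (x, y) in S of 1{f(x) != y}."""
    return sum(f(x) != y for x, y in S) / len(S)

# Toy classifier on 1-d inputs: predict 1 if x > 0 (an illustrative rule).
f = lambda x: 1 if x > 0 else 0
S = [(-2, 0), (-1, 0), (1, 1), (2, 0), (3, 1)]
print(empirical_error(f, S))  # 1 mistake out of 5 examples = 0.2
```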

Conditional expectations (1)

- Consider any random variables A and B.
- Conditional expectation of A given B:
  - Written E[A | B]
  - A random variable! What is its expectation?
- Law of iterated expectations (a.k.a. tower property):
  E[E[A | B]] = E[A]

Conditional expectations (2)

- Example: roll a fair 6-sided die
  - A = number shown facing up
  - B = parity of number shown facing up
- C := E[A | B] is a random variable with
  Pr(C = E[A | B = odd]) = 1/2, where E[A | B = odd] = (1/3)(1 + 3 + 5) = 3
  Pr(C = E[A | B = even]) = 1/2, where E[A | B = even] = (1/3)(2 + 4 + 6) = 4
- Check: E[C] = (1/2)(3) + (1/2)(4) = 3.5 = E[A]
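The die example can be verified mechanically. A small sketch using exact arithmetic (the variable names are made up):

```python
from fractions import Fraction

die = [1, 2, 3, 4, 5, 6]   # fair 6-sided die, outcomes equally likely
parity = lambda a: a % 2   # B = parity of the number shown facing up

# E[A | B = b] for each value b of the parity
cond_mean = {b: Fraction(sum(a for a in die if parity(a) == b),
                         sum(1 for a in die if parity(a) == b))
             for b in (0, 1)}

# Law of iterated expectations: averaging E[A | B] over the die recovers E[A]
lhs = sum(cond_mean[parity(a)] for a in die) / len(die)  # E[E[A | B]]
rhs = Fraction(sum(die), len(die))                       # E[A]
print(lhs, rhs)  # 7/2 7/2
```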

Bayes classifier

- Optimal classifier (Bayes classifier): f*(x) = 1{η(x) > 1/2}, where η is the conditional mean function
  - Classifier with the smallest probability of mistake
  - Depends on the function η, which is typically unknown!
- Optimal error rate (Bayes error rate):
  - Write the error rate as err(f*) = Pr(f*(X) ≠ Y) = E[1{f*(X) ≠ Y}]
  - Conditional on X, the probability of a mistake is min{η(X), 1 − η(X)}.
  - So, the optimal error rate is
    err(f*) = E[1{f*(X) ≠ Y}]
            = E[E[1{f*(X) ≠ Y} | X]]
            = E[min{η(X), 1 − η(X)}].
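When X is discrete, the Bayes classifier and Bayes error rate can be computed directly from the last formula. A minimal sketch, with a made-up marginal distribution and made-up η values:

```python
# Toy discrete distribution: X takes three hypothetical values.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}  # marginal distribution of X
eta = {"a": 0.9, "b": 0.4, "c": 0.5}  # eta(x) = Pr(Y = 1 | X = x)

# f*(x) = 1{eta(x) > 1/2}
bayes_classifier = {x: 1 if eta[x] > 0.5 else 0 for x in p_x}

# err(f*) = E[min{eta(X), 1 - eta(X)}]
bayes_error = sum(p_x[x] * min(eta[x], 1 - eta[x]) for x in p_x)

print(bayes_classifier)  # {'a': 1, 'b': 0, 'c': 0}
print(bayes_error)       # 0.5*0.1 + 0.3*0.4 + 0.2*0.5, i.e. about 0.27
```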

Example: spam filtering

- Suppose the input x is a single (binary) feature, "is email all-caps?"
- How to interpret "the probability that email is spam given x = 1"?
- What does it mean for the Bayes classifier f* to be optimal?

Learning prediction functions

- What to do if η is unknown?
- Training data: (x1, y1), . . . , (xn, yn)
  - Assume data are related to what we want to predict
- Let Z := (X, Y), and Zi := (Xi, Yi) for i = 1, . . . , n.
- IID model: Z1, . . . , Zn, Z are iid random variables
  - Z = (X, Y) is the (unseen) "test" example
  - (Technically, each labeled example is an (X × {0, 1})-valued random variable. If X = R^d, we can regard it as a vector of d + 1 random variables.)

Performance of nearest neighbor classifier

- Study in the context of the IID model
- Assume η(x) ≈ η(x′) whenever x and x′ are close.
  - This is where the modeling assumption comes in (via the choice of distance function)!
- Let (X, Y) be the "test" example, and suppose (X̂, Ŷ) is the nearest neighbor among the training data S = ((X1, Y1), . . . , (Xn, Yn)).
- For large n, X and X̂ are likely to be close enough so that η(X) ≈ η(X̂).
- The prediction is Ŷ; the true label is Y.
- Conditional on X and X̂, what is the probability that Ŷ ≠ Y?
  η(X)(1 − η(X̂)) + (1 − η(X))η(X̂) ≈ 2η(X)(1 − η(X))
- Conclusion: the expected error rate is
  E[err(NN_S)] ≈ 2 · E[η(X)(1 − η(X))] for large n
- Recall that the optimal is E[min{η(X), 1 − η(X)}].
  - So E[err(NN_S)] is at most twice optimal.
  - Never exactly optimal unless η(x) ∈ {0, 1} for all x.
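The "twice optimal" behavior shows up in simulation. A minimal sketch under an assumed setup (X uniform on [0, 1] and a constant η(x) = 0.2, chosen only so the asymptotic error 2η(1 − η) = 0.32 is easy to read off; all names here are made up):

```python
import random

def eta(x):
    """Hypothetical conditional mean function; constant, for simplicity."""
    return 0.2

def sample(rng, n):
    """Draw n labeled examples: X uniform on [0, 1], Y | X ~ Bernoulli(eta(X))."""
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if rng.random() < eta(x) else 0) for x in xs]

def nn_predict(train, x):
    """1-nearest-neighbor prediction using |x - xi| as the distance."""
    return min(train, key=lambda xy: abs(xy[0] - x))[1]

rng = random.Random(0)
train = sample(rng, 200)
test = sample(rng, 5_000)
err = sum(nn_predict(train, x) != y for x, y in test) / len(test)
print(err)  # near 2 * 0.2 * 0.8 = 0.32, while the Bayes error is 0.2
```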

Test error rate (1)

- How to estimate the error rate?
- IID model: (X1, Y1), . . . , (Xn, Yn), (X1′, Y1′), . . . , (Xm′, Ym′), (X, Y) are iid
  - Training examples (that you have): (X1, Y1), . . . , (Xn, Yn)
  - Test examples (that you have): (X1′, Y1′), . . . , (Xm′, Ym′)
  - Test example (that you don't have) used to define the error rate: (X, Y)
- The classifier f̂ is based only on the training examples
  - Hence, the test examples are independent of f̂ (very important!)
- We would like to estimate err(f̂)
  - Caution: since f̂ depends on the training data, it is random!
  - Convention: when we write err(f̂) where f̂ is random, we really mean Pr(f̂(X) ≠ Y | f̂).
  - Therefore err(f̂) is a random variable!

Test error rate (2)

- Test error rate: (1/m) Σ_{i=1}^m 1{f̂(Xi′) ≠ Yi′}, computed on the m test examples
- Conditional distribution of S := Σ_{i=1}^m 1{f̂(Xi′) ≠ Yi′} given the training data:
  S | training data ∼ Binomial(m, ε), where ε := err(f̂)
- By the law of large numbers, (1/m) S → ε as m → ∞
  - Therefore, the test error rate is close to ε when m is large
- How accurate is the estimate? It depends on the (conditional) variance:
  var((1/m) S | training data) = ε(1 − ε) / m
  - Standard deviation is √(ε(1 − ε) / m)

Confusion tables

- True positive rate (recall): Pr(f(X) = 1 | Y = 1)
- False positive rate: Pr(f(X) = 1 | Y = 0)
- Precision: Pr(Y = 1 | f(X) = 1)
- . . .
- Confusion table:

            f(x) = 0            f(x) = 1
  y = 0     # true negatives    # false positives
  y = 1     # false negatives   # true positives
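The table's counts and the derived rates are straightforward to compute. A minimal sketch on made-up predictions and labels:

```python
def confusion_stats(preds, labels):
    """Confusion-table counts and derived rates for binary predictions."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return {
        "tpr": tp / (tp + fn),        # recall: Pr(f(X)=1 | Y=1)
        "fpr": fp / (fp + tn),        # Pr(f(X)=1 | Y=0)
        "precision": tp / (tp + fp),  # Pr(Y=1 | f(X)=1)
    }

preds  = [1, 1, 0, 0, 1, 0]
labels = [1, 0, 0, 1, 1, 0]
print(confusion_stats(preds, labels))  # tpr 2/3, fpr 1/3, precision 2/3
```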

ROC curves

- Receiver operating characteristic (ROC) curve
- What points are achievable on the TPR-FPR plane?
- Use randomization to combine classifiers

Figure 5: TPR vs FPR plot with two points, (FPR1, TPR1) and (FPR2, TPR2)
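Randomly choosing between two classifiers achieves any point on the segment between their (FPR, TPR) points. A minimal sketch with two trivial classifiers and made-up parity labels (everything here is illustrative):

```python
import random

rng = random.Random(0)
f1 = lambda x: 0  # never alarm:  (FPR, TPR) = (0, 0)
f2 = lambda x: 1  # always alarm: (FPR, TPR) = (1, 1)

def combined(x, p=0.25):
    """Randomized combination: use f1 with probability p, else f2."""
    return f1(x) if rng.random() < p else f2(x)

# Toy labeled data (parity labels, purely for illustration).
examples = [(x, x % 2) for x in range(20_000)]
pos = [x for x, y in examples if y == 1]
neg = [x for x, y in examples if y == 0]
tpr = sum(combined(x) == 1 for x in pos) / len(pos)
fpr = sum(combined(x) == 1 for x in neg) / len(neg)
print(tpr, fpr)  # both near 1 - p = 0.75: the interpolated point on the segment
```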

Figure 6: TPR vs FPR plot with many points

More than two outcomes

- What if there are K > 2 possible outcomes?
  - Replace the coin with a K-sided die
- Say Y has a categorical distribution over [K] := {1, . . . , K}, determined by a probability vector θ = (θ1, . . . , θK)
  - θk ≥ 0 for all k ∈ [K], and Σ_{k=1}^K θk = 1
  - Pr(Y = k) = θk
- Optimal prediction of Y if θ is known:
  ŷ := arg max_{k∈[K]} θk

Statistical model for multi-class classification

- Statistical model for labeled examples (X, Y), where Y takes values in [K]
- Now Y | X = x has a categorical distribution with parameter vector η(x) = (η(x)1, . . . , η(x)K)
  - Conditional probability function: η(x)k := Pr(Y = k | X = x)
- Optimal classifier: f*(x) = arg max_{k∈[K]} η(x)k
- Optimal error rate: Pr(f*(X) ≠ Y) = 1 − E[max_k η(X)k]
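For a discrete X, the multi-class Bayes classifier and its error rate follow directly from the formulas above. A minimal sketch with made-up numbers (K = 3, classes indexed from 0 here rather than 1 as on the slides):

```python
# Toy discrete example: X takes two hypothetical values, K = 3 classes.
p_x = {"a": 0.6, "b": 0.4}          # marginal distribution of X
eta = {"a": [0.7, 0.2, 0.1],        # eta("a")_k = Pr(Y = k | X = "a")
       "b": [0.3, 0.3, 0.4]}        # eta("b")_k = Pr(Y = k | X = "b")

def bayes_classifier(x):
    """f*(x) = argmax_k eta(x)_k."""
    probs = eta[x]
    return max(range(len(probs)), key=lambda k: probs[k])

# Optimal error rate: 1 - E[max_k eta(X)_k]
optimal_error = 1 - sum(p_x[x] * max(eta[x]) for x in p_x)

print(bayes_classifier("a"), bayes_classifier("b"))  # 0 2
print(optimal_error)  # 1 - (0.6*0.7 + 0.4*0.4), i.e. about 0.42
```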

Potential downsides of the IID model

- Example: Train an OCR digit classifier using data from Alice's handwriting, but eventually use it on digits written by Bob.
- What is a better evaluation?
- What if we want to eventually use it on digits written by both Alice and Bob?