Machine learning lecture slides
COMS 4771 Fall 2020
0/32
Prediction theory
Outline
Statistical model for binary outcomes
Plug-in principle and IID model
Maximum likelihood estimation
Statistical model for binary classification Analysis of nearest neighbor classifier
Estimating the error rate of a classifier
Beyond binary classificaiton and the IID model
1/32
Statistical model for binary outcomes
Example: coin toss
Physical model: hard
Statistical model: outcome is random
Bernoulli distribution with heads probability θ ∈ [0, 1]
Encode heads as 1 and tails as 0
Written as Bernoulli(θ)
Notation: Y ∼ Bernoulli(θ) means Y is a random variable with
distribution Bernoulli(θ).
Goal: correctly predict outcome
2/32
Optimal prediction
Suppose Y ∼ Bernoulli(θ). Suppose θ known.
Optimal prediction:
Indicator function notation:
1 ifQistrue 0 if Q is false
The optimal prediction is incorrect with probability min{θ, 1 − θ}
1{Q} :=
1{θ>1/2}
3/32
Learning to make predictions
If θ unknown:
Assume we have data: outcomes of previous coin tosses
Data should be related to what we want to predict: same coin
is being tossed
4/32
Plug-in principle and IID model
Plug-in principle:
Estimate unknown(s) based on data (e.g., θ)
Plug estimates into formula for optimal prediction
When can we estimate the unknowns?
Observed data should be related to the outcome we want to
predict
IID model: Observations & (unseen) outcome are iid random
variables
iid: independent and identically distributed
Crucial modeling assumption that makes learning possible
When is the IID assumption not reasonable? . . .
5/32
Statistical models
Parametric statistical model {Pθ : θ ∈ Θ}
collection of parameterized probability distributions for data
Θ is the parameter space
One distribution per parameter value θ ∈ Θ
E.g., distributions on n binary outcomes treated as iid Bernoulli random variables
Θ=[0,1]
Overload notation: Pθ is the probability mass function (pmf)
for the distribution.
What is formula for Pθ(y1,…,yn) for (y1,…,yn) ∈ {0,1}n?
6/32
Maximum likelihood estimation (1)
Likelihood of parameter θ (given observed data) L(θ) = Pθ(y1,…,yn)
Maximum likelihood estimation:
Choose θ with highest likelihood
Log-likelihood
Sometimes more convenient
ln is increasing, so ln L(θ) orders the parameters in the same
way as L(θ)
7/32
Maximum likelihood estimation (2)
Coin toss example Log-likelihood
n
ln L(θ) = yi ln θ + (1 − yi) ln(1 − θ)
i=1
Use calculus to determine formula for maximizer
This is a little annoying, but someone else has already done it
for you:
ˆ 1n
θMLE := n
yi.
i=1
8/32
Back to plug-in principle
We are given data y1,…,yn ∈ {0,1}n, which we model using the IID model from before
Obtain estimate θˆ of known θ based on y ,…,y MLE 1n
Plug-in θˆ for θ in formula for optimal prediction: MLE
Yˆ := 1{θˆMLE>1/2}.
9/32
Analysis of the plug-in prediction (1)
How good is the plug-in prediction?
Study behavior under the IID model, where
Y1,…,YN,Y ∼iid Bernoulli(θ).
Y1,…,Yn are the data we collected Y is the outcome to predict
θ is the unknown parameter
Recall: optimal prediction is incorrect with probability min{θ, 1 − θ}.
We cannot hope Yˆ to beat this, but we can hope it is not much worse.
10/32
Analysis of the plug-in prediction (2)
Theorem:
Pr(Yˆ ̸=Y)≤min{θ,1−θ}+1 ·|θ−0.5|·e−2n(θ−0.5)2.
2
The first term is the optimal error probability.
The second term comes from the probability that the θˆ is
on the opposite side of 1/2 as θ.
This probability is very small when n is large!
If S is number of heads in n independent tosses of coin with
bias θ, then S ∼ Binomial(n, θ) (Binomial distribution)
MLE
11/32
n=20
0.15 0.1 0.05 0
Figure 1: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 20
0 1 2 3 4 5 6 7 8 9 10 k
12/32
Pr[S=k]
n=40
0.1
0.05
0
0 5 10 15 20 25 30 35 40
k
Figure 2: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 40
13/32
Pr[S=k]
n=60
0.1 0.08 0.06 0.04 0.02 0
Figure 3: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 60
0 10 20 30 40 50 60 k
14/32
Pr[S=k]
0.1 0.08 0.06 0.04 0.02 0
Figure 4: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 80
n=80
0 10 20 30 40 50 60 70 80 k
15/32
Pr[S=k]
Statistical model for labeled data in binary classification
Example: spam filtering
Labeled example: (x, y) ∈ X × {0, 1}
X is input (feature) space; {0, 1} is the output (label) space
X is not necessarily the space of inputs itself (e.g., space of all emails), but rather the space of what we measure about inputs
We only see x (email), and then must make prediction of y (spam or not-spam)
Statistical model: (X, Y ) is random
X has some marginal probability distribution
Conditional probability distribution of Y given X = x is
Bernoulli with heads probability η(x)
η : X → [0, 1] is a function, sometimes called the
regression function or conditional mean function (since E[Y |X =x]=η(x)).
16/32
Error rate of a classifier
Foraclassifierf:X →{0,1},theerrorrateoff (with respect to the distribution of (X, Y )) is
err(f) := Pr(f(X) ̸= Y ). Recall that we had previously used the notation
err(f,S) = 1 1{f(x)̸=y}, |S| (x,y)∈S
which is the same as Pr(f(X) ̸= Y ) when the distribution of (X, Y ) is uniform over the labeled examples in S.
Caution: This notation err(f) does not make explicit the dependence on (the distribution of) the random example
(X, Y ). You will need to determine this from context.
17/32
Conditional expectations (1)
Consider any random variables A and B. Conditional expectation of A given B:
Written E[A | B]
A random variable! What is its expectation?
Law of iterated expectations (a.k.a. tower property):
E[E[A | B]] = E[A]
18/32
Conditional expectations (2)
Example: roll a fair 6-sided die
A = number shown facing up
B = parity of number shown facing up C := E[A | B] is random variable with
11 Pr C=E[A|B=odd]=3(1+3+5)=3 =2
11 Pr C=E[A|B=even]=3(2+4+6)=4 =2
19/32
Bayes classifier
Optimal classifier (Bayes classifier): f⋆(x) = 1{η(x)>1/2},
where η is the conditional mean function
Classifier with smallest probability of mistake
Depends on the function η, which is typically unknown!
Optimal error rate (Bayes error rate):
Write error rate as err(f⋆) = Pr(f⋆(X) ̸= Y ) = E[1{f⋆(X)̸=Y }]
Conditional on X, probability of mistake is
min{η(X), 1 − η(X)}.
So, optimal error rate is
err(f⋆) = E[1{f⋆(X)̸=Y }]
= E[E[1{f⋆(X)̸=Y } | X]]
= E[min{η(X), 1 − η(X)}].
20/32
Example: spam filtering
Suppose input x is a single (binary) feature, “is email all-caps?” How to interpret “the probability that email is spam given
x = 1?”
What does it mean for the Bayes classifier f⋆ to be optimal?
21/32
Learning prediction functions
What to do if η is unknown?
Training data: (x1, y1), . . . , (xn, yn)
Assume data are related to what we want to predict
LetZ:=(X,Y),andZi :=(Xi,Yi)fori=1,…,n.
IID model: Z1, . . . , Zn, Z are iid random variables
Z = (X, Y ) is the (unseen) “test” example
(Technically, each labeled example is a (X × {0, 1})-valued
random variable. If X = Rd, can regard as vector of d + 1 random variables.)
22/32
Performance of nearest neighbor classifier
Study in context of IID model
Assume η(x) ≈ η(x′) whenever x and x′ are close.
This is where the modeling assumption comes in (via choice of distance function)!
Let (X, Y ) be the “test” example, and suppose (Xˆi, Yˆi) is the nearest neighbor among training data
S = ((X1,Y1),…,(Xn,Yn)).
For large n, X and Xˆi likely to be close enough so that η(X) ≈ η(Xˆi).
Prediction is Yˆi, true label is Y .
Conditional on X and Xˆi, what is probability that Yˆi ̸= Y ?
η(X)(1−η(Xˆi))+(1−η(X))η(Xˆi)≈2η(X)(1−η(X))
Conclusion: expected error rate is
E[err(NNS)] ≈ 2 · E[η(X)(1 − η(X))] for large n
Recall that optimal is E[min{η(X), 1 − η(X)}].
So E[err(NNS)] is at most twice optimal.
Never exactly optimal unless η(x) ∈ {0, 1} for all x.
23/32
Test error rate (1)
How to estimate error rate? IID model:
(X1,Y1),…,(Xn,Yn),(X1′,Y1′),…,(Xm′ ,Ym′ ),(X,Y) are iid Training examples (that you have): (X1, Y1), . . . , (Xn, Yn)
Test examples (that you have): (X1′ , Y1′), . . . , (Xm′ , Ym′ )
Test example (that you don’t have) used to define error rate:
(X,Y)
Classifier fˆ is based only on training examples
Hence, test examples are independent of fˆ (very
important!)
ˆ We would like to estimate err(f)
Caution: since fˆ depends on training data, it is random! ˆˆ
Convention: When we write err(f) where f is random, we ˆˆ
really mean Pr(f(X) ̸= Y | f).
ˆ
Therefore err(f) is a random variable!
24/32
Test error rate (2)
Conditional distribution of S := mi=1 1{fˆ(Xi′)̸=Yi′} given training data:
:ˆ S | training data ∼ Binomial(m, ε) where ε = err(f )
By law of large numbers,
1S→ε
m
1 m
m
is close to ε when m is large
How accurate is the estimate? Depends on the (conditional)
variance!
var( 1 S | training data) = ε(1−ε) mm
Standard deviation is ε(1−ε) m
as m → ∞
Therefore, test error rate
i=1
1 { fˆ( X i′ ) ̸ = Y i′ }
25/32
Confusion tables
True positive rate (recall): Pr(f(X) = 1 | Y = 1) False positive rate: Pr(f(X) = 1 | Y = 0)
Precision: Pr(Y = 1 | f(X) = 1)
…
Confusion table
f(x) = 0
f(x) = 1
y = 0 # false positives y = 1 # false negatives # true positives
# true negatives
26/32
ROC curves
Receiver operating characteristic (ROC) curve
What points are achievable on the TPR-FPR plane? Use randomization to combine classifiers
27/32
TPR
1
(FPR1 , TPR1 )
(FPR2 , TPR2 )
0 1 FPR Figure 5: TPR vs FPR plot with two points
28/32
TPR
1
0 1 FPR Figure 6: TPR vs FPR plot with many points
29/32
More than two outcomes
What if there are K > 2 possible outcomes?
Replace coin with K-sided die
Say Y has a categorical distribution over [K] := {1, . . . , K},
determined probability vector θ = (θ1 , . . . , θK ) θk ≥0forallk∈[K],andKk=1θk =1
Pr(Y =k)=θk
Optimal prediction of Y if θ is known
yˆ := arg max θk k∈[K]
30/32
Statistical model for multi-class classification
Statistical model for labeled examples (X, Y ), where Y takes values in [K]
Now, Y | X = x has a categorical distribution with parameter vector η(x) = (η(x)1, . . . , η(x)K )
Conditional probability function: η(x)k := Pr(Y = k | X = x)
Optimal classifier: f⋆(x) = arg maxk∈[K] η(x)k
Optimal error rate: Pr(f⋆(X) ̸= Y ) = 1 − E[maxk η(X)k]
31/32
Potential downsides of the IID model
Example: Train OCR digit classifier using data from Alice’s handwriting, but eventually use on digits written by Bob.
What is a better evaluation?
What if we want to eventually use on digits written by both Alice and Bob?
32/32