Machine learning lecture slides
COMS 4771 Fall 2020
0 / 32
Prediction theory
Outline
I Statistical model for binary outcomes
I Plug-in principle and IID model
I Maximum likelihood estimation
I Statistical model for binary classification
I Analysis of nearest neighbor classifier
I Estimating the error rate of a classifier
I Beyond binary classification and the IID model
1 / 32
Statistical model for binary outcomes
I Example: coin toss
I Physical model: hard
I Statistical model: outcome is random
I Bernoulli distribution with heads probability θ ∈ [0, 1]
I Encode heads as 1 and tails as 0
I Written as Bernoulli(θ)
I Notation: Y ∼ Bernoulli(θ) means Y is a random variable with
distribution Bernoulli(θ).
I Goal: correctly predict outcome
2 / 32
Optimal prediction
I Suppose Y ∼ Bernoulli(θ).
I Suppose θ known.
I Optimal prediction:
1{θ>1/2}
I Indicator function notation:
1{Q} := 1 if Q is true, 0 if Q is false
I The optimal prediction is incorrect with probability
min{θ, 1− θ}
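I A minimal Python sketch of this rule (the function names are my own, not from the slides): predict 1{θ > 1/2} and report the error probability min{θ, 1 − θ}.

    def optimal_prediction(theta):
        # Predict heads (1) exactly when the heads probability exceeds 1/2.
        return 1 if theta > 0.5 else 0

    def optimal_error_probability(theta):
        # The optimal prediction is incorrect with probability min{theta, 1 - theta}.
        return min(theta, 1 - theta)

    # Example: theta = 0.7 -> predict 1, which is wrong with probability 0.3.
    print(optimal_prediction(0.7), optimal_error_probability(0.7))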
3 / 32
Learning to make predictions
I If θ unknown:
I Assume we have data: outcomes of previous coin tosses
I Data should be related to what we want to predict: same coin
is being tossed
4 / 32
Plug-in principle and IID model
I Plug-in principle:
I Estimate unknown(s) based on data (e.g., θ)
I Plug estimates into formula for optimal prediction
I When can we estimate the unknowns?
I Observed data should be related to the outcome we want to
predict
I IID model: Observations & (unseen) outcome are iid random
variables
I iid: independent and identically distributed
I Crucial modeling assumption that makes learning possible
I When is the IID assumption not reasonable? . . .
5 / 32
Statistical models
I Parametric statistical model {Pθ : θ ∈ Θ}
I collection of parameterized probability distributions for data
I Θ is the parameter space
I One distribution per parameter value θ ∈ Θ
I E.g., distributions on n binary outcomes treated as iid Bernoulli
random variables
I Θ = [0, 1]
I Overload notation: Pθ is the probability mass function (pmf)
for the distribution.
I What is the formula for Pθ(y1, . . . , yn) for (y1, . . . , yn) ∈ {0, 1}^n?
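I A small sketch of the computation (by independence, Pθ(y1, . . . , yn) = ∏i θ^yi (1 − θ)^(1−yi); its logarithm is the log-likelihood on the next slide; the function name below is my own):

    def bernoulli_iid_pmf(ys, theta):
        # P_theta(y_1, ..., y_n) = prod_i theta^{y_i} * (1 - theta)^{1 - y_i}
        p = 1.0
        for y in ys:
            p *= theta if y == 1 else (1 - theta)
        return p

    print(bernoulli_iid_pmf([1, 0, 1, 1], 0.6))  # 0.6 * 0.4 * 0.6 * 0.6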
6 / 32
Maximum likelihood estimation (1)
I Likelihood of parameter θ (given observed data)
I L(θ) = Pθ(y1, . . . , yn)
I Maximum likelihood estimation:
I Choose θ with highest likelihood
I Log-likelihood
I Sometimes more convenient
I ln is increasing, so lnL(θ) orders the parameters in the same
way as L(θ)
7 / 32
Maximum likelihood estimation (2)
I Coin toss example
I Log-likelihood
ln L(θ) = ∑_{i=1}^{n} [ yi ln θ + (1 − yi) ln(1 − θ) ]
I Use calculus to determine formula for maximizer
I This is a little annoying, but someone else has already done it
for you:
θ̂MLE := (1/n) ∑_{i=1}^{n} yi.
8 / 32
Back to plug-in principle
I We are given data y1, . . . , yn ∈ {0, 1}, which we model using
the IID model from before
I Obtain estimate θ̂MLE of the unknown θ based on y1, . . . , yn
I Plug in θ̂MLE for θ in the formula for the optimal prediction:
Ŷ := 1{θ̂MLE>1/2}.
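I A small end-to-end sketch under the IID model (variable names are illustrative): compute θ̂MLE as the sample mean, plug it into the prediction rule, and check numerically on a grid that the sample mean maximizes the log-likelihood.

    import math

    def mle(ys):
        # theta_hat_MLE = (1/n) * sum_i y_i
        return sum(ys) / len(ys)

    def log_likelihood(ys, theta):
        # ln L(theta) = sum_i [ y_i * ln(theta) + (1 - y_i) * ln(1 - theta) ]
        return sum(y * math.log(theta) + (1 - y) * math.log(1 - theta) for y in ys)

    ys = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
    theta_hat = mle(ys)                    # 0.7
    y_hat = 1 if theta_hat > 0.5 else 0    # plug-in prediction 1{theta_hat > 1/2}

    # Sanity check: on a fine grid, the log-likelihood is maximized at the sample mean.
    grid = [k / 1000 for k in range(1, 1000)]
    best = max(grid, key=lambda t: log_likelihood(ys, t))
    print(theta_hat, y_hat, best)          # best should be 0.7 (or very close)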
9 / 32
Analysis of the plug-in prediction (1)
I How good is the plug-in prediction?
I Study behavior under the IID model, where
Y1, . . . , Yn, Y ∼iid Bernoulli(θ).
I Y1, . . . , Yn are the data we collected
I Y is the outcome to predict
I θ is the unknown parameter
I Recall: optimal prediction is incorrect with probability
min{θ, 1− θ}.
I We cannot hope for Ŷ to beat this, but we can hope it is not much
worse.
10 / 32
Analysis of the plug-in prediction (2)
I Theorem:
Pr(Ŷ ≠ Y ) ≤ min{θ, 1 − θ} + (1/2) · |θ − 0.5| · e^{−2n(θ−0.5)²}.
I The first term is the optimal error probability.
I The second term comes from the probability that θ̂MLE is on the
opposite side of 1/2 from θ.
I This probability is very small when n is large!
I If S is the number of heads in n independent tosses of a coin with
bias θ, then S ∼ Binomial(n, θ) (the Binomial distribution)
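I A short sketch of where the second term comes from (this uses scipy.stats, an assumed dependency, not something the slides require): the probability that θ̂MLE lands on the wrong side of 1/2 is a Binomial tail probability, which shrinks quickly as n grows.

    from scipy.stats import binom

    theta = 0.4  # true bias; the optimal prediction is 0
    for n in [20, 40, 60, 80]:
        # Pr(S > n/2): the chance that theta_hat_MLE = S/n falls on the wrong side of 1/2.
        wrong_side = binom.sf(n / 2, n, theta)
        print(n, wrong_side)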
11 / 32
n=20
[Figure 1: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 20 (plot of Pr[S = k] vs. k)]
12 / 32
n=40
[Figure 2: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 40 (plot of Pr[S = k] vs. k)]
13 / 32
n=60
[Figure 3: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 60 (plot of Pr[S = k] vs. k)]
14 / 32
n=80
[Figure 4: Pr(S > n/2) for S ∼ Binomial(n, θ), n = 80 (plot of Pr[S = k] vs. k)]
15 / 32
Statistical model for labeled data in binary classification
I Example: spam filtering
I Labeled example: (x, y) ∈ X × {0, 1}
I X is input (feature) space; {0, 1} is the output (label) space
I X is not necessarily the space of inputs itself (e.g., space of all
emails), but rather the space of what we measure about inputs
I We only see x (email), and then must make prediction of y
(spam or not-spam)
I Statistical model: (X,Y ) is random
I X has some marginal probability distribution
I Conditional probability distribution of Y given X = x is
Bernoulli with heads probability η(x)
I η : X → [0, 1] is a function, sometimes called the
regression function or conditional mean function (since
E[Y | X = x] = η(x)).
16 / 32
Error rate of a classifier
I For a classifier f : X → {0, 1}, the error rate of f (with
respect to the distribution of (X,Y )) is
err(f) := Pr(f(X) ≠ Y ).
Recall that we had previously used the notation
err(f, S) = (1/|S|) ∑_{(x,y)∈S} 1{f(x) ≠ y},
which is the same as Pr(f(X) ≠ Y ) when the distribution of
(X,Y ) is uniform over the labeled examples in S.
I Caution: This notation err(f) does not make explicit the
dependence on (the distribution of) the random example
(X,Y ). You will need to determine this from context.
17 / 32
Conditional expectations (1)
I Consider any random variables A and B.
I Conditional expectation of A given B:
I Written E[A | B]
I A random variable! What is its expectation?
I Law of iterated expectations (a.k.a. tower property):
E[E[A | B]] = E[A]
18 / 32
Conditional expectations (2)
I Example: roll a fair 6-sided die
I A = number shown facing up
I B = parity of number shown facing up
I C := E[A | B] is a random variable with
Pr(C = E[A | B = odd] = (1/3)(1 + 3 + 5) = 3) = 1/2
Pr(C = E[A | B = even] = (1/3)(2 + 4 + 6) = 4) = 1/2
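I A quick simulation sketch of this example (pure Python; my own construction): averaging C = E[A | B] over many rolls recovers E[A] = 3.5, illustrating the tower property.

    import random

    def cond_exp_given_parity(parity):
        # E[A | B = odd] = (1 + 3 + 5)/3 = 3;  E[A | B = even] = (2 + 4 + 6)/3 = 4
        return 3.0 if parity == "odd" else 4.0

    random.seed(0)
    rolls = [random.randint(1, 6) for _ in range(100_000)]
    c_values = [cond_exp_given_parity("odd" if a % 2 == 1 else "even") for a in rolls]

    # Law of iterated expectations: E[E[A | B]] = E[A] = 3.5
    print(sum(c_values) / len(c_values), sum(rolls) / len(rolls))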
19 / 32
Bayes classifier
I Optimal classifier (Bayes classifier):
f*(x) = 1{η(x)>1/2},
where η is the conditional mean function
I Classifier with smallest probability of mistake
I Depends on the function η, which is typically unknown!
I Optimal error rate (Bayes error rate):
I Write error rate as err(f*) = Pr(f*(X) ≠ Y ) = E[1{f*(X) ≠ Y}]
I Conditional on X, probability of mistake is
min{η(X), 1− η(X)}.
I So, optimal error rate is
err(f*) = E[1{f*(X) ≠ Y}]
= E[E[1{f*(X) ≠ Y} | X]]
= E[min{η(X), 1 − η(X)}].
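I A minimal sketch for a finite input space (the toy distribution below is made up for illustration): threshold η(x) at 1/2 to get the Bayes classifier, and average min{η(x), 1 − η(x)} to get the Bayes error rate.

    # Toy distribution: Pr(X = x) and eta(x) = Pr(Y = 1 | X = x) for three inputs.
    p_x = {"a": 0.5, "b": 0.3, "c": 0.2}
    eta = {"a": 0.9, "b": 0.2, "c": 0.5}

    def bayes_classifier(x):
        # f*(x) = 1{eta(x) > 1/2}
        return 1 if eta[x] > 0.5 else 0

    # Bayes error rate: E[min{eta(X), 1 - eta(X)}]
    bayes_error = sum(p_x[x] * min(eta[x], 1 - eta[x]) for x in p_x)
    print({x: bayes_classifier(x) for x in p_x}, bayes_error)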
20 / 32
Example: spam filtering
I Suppose input x is a single (binary) feature, “is email all-caps?”
I How to interpret “the probability that email is spam given
x = 1?”
I What does it mean for the Bayes classifier f* to be optimal?
21 / 32
Learning prediction functions
I What to do if η is unknown?
I Training data: (x1, y1), . . . , (xn, yn)
I Assume data are related to what we want to predict
I Let Z := (X,Y ), and Zi := (Xi, Yi) for i = 1, . . . , n.
I IID model: Z1, . . . , Zn, Z are iid random variables
I Z = (X,Y ) is the (unseen) “test” example
I (Technically, each labeled example is an (X × {0, 1})-valued
random variable. If X = R^d, we can regard it as a vector of d + 1
random variables.)
22 / 32
Performance of nearest neighbor classifier
I Study in context of IID model
I Assume η(x) ≈ η(x′) whenever x and x′ are close.
I This is where the modeling assumption comes in (via choice of
distance function)!
I Let (X,Y ) be the “test” example, and suppose (Xî, Yî) is the
nearest neighbor among training data
S = ((X1, Y1), . . . , (Xn, Yn)).
I For large n, X and Xî likely to be close enough so that
η(X) ≈ η(Xî).
I Prediction is Yî, true label is Y .
I Conditional on X and Xî, what is the probability that Yî ≠ Y ?
I η(X)(1− η(Xî)) + (1− η(X))η(Xî) ≈ 2η(X)(1− η(X))
I Conclusion: expected error rate is
E[err(NNS)] ≈ 2 · E[η(X)(1− η(X))] for large n
I Recall that optimal is E[min{η(X), 1− η(X)}].
I So E[err(NNS)] is at most twice optimal.
I Never exactly optimal unless η(x) ∈ {0, 1} for all x.
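I A rough simulation sketch of this comparison (one-dimensional inputs and a made-up η, not from the slides): for large n, the 1-NN test error should be close to 2 · E[η(X)(1 − η(X))], hence at most about twice the optimal error.

    import bisect
    import random

    def eta(x):
        # A smooth, made-up conditional mean function on [0, 1].
        return 0.1 + 0.8 * x

    rng = random.Random(0)

    def sample(n):
        xs = sorted(rng.random() for _ in range(n))
        ys = [1 if rng.random() < eta(x) else 0 for x in xs]
        return xs, ys

    xs, ys = sample(5000)      # training data, sorted by x
    tx, ty = sample(20000)     # held-out examples

    def nn_label(x):
        # Nearest neighbor in 1-D via binary search over the sorted training inputs.
        j = bisect.bisect_left(xs, x)
        candidates = [i for i in (j - 1, j) if 0 <= i < len(xs)]
        i_hat = min(candidates, key=lambda i: abs(xs[i] - x))
        return ys[i_hat]

    nn_error = sum(nn_label(x) != y for x, y in zip(tx, ty)) / len(tx)

    # Monte Carlo estimates of 2 E[eta(X)(1 - eta(X))] and E[min{eta(X), 1 - eta(X)}].
    grid = [rng.random() for _ in range(200_000)]
    twice_term = 2 * sum(eta(x) * (1 - eta(x)) for x in grid) / len(grid)
    bayes = sum(min(eta(x), 1 - eta(x)) for x in grid) / len(grid)
    print(nn_error, twice_term, bayes)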
23 / 32
Test error rate (1)
I How to estimate error rate?
I IID model:
(X1, Y1), . . . , (Xn, Yn), (X′1, Y′1), . . . , (X′m, Y′m), (X,Y ) are iid
I Training examples (that you have): (X1, Y1), . . . , (Xn, Yn)
I Test examples (that you have): (X′1, Y′1), . . . , (X′m, Y′m)
I Test example (that you don’t have) used to define error rate:
(X,Y )
I Classifier f̂ is based only on training examples
I Hence, test examples are independent of f̂ (very
important!)
I We would like to estimate err(f̂)
I Caution: since f̂ depends on training data, it is random!
I Convention: When we write err(f̂) where f̂ is random, we
really mean Pr(f̂(X) ≠ Y | f̂).
I Therefore err(f̂) is a random variable!
24 / 32
Test error rate (2)
I Conditional distribution of S := ∑_{i=1}^{m} 1{f̂(X′i) ≠ Y′i} given
training data:
I S | training data ∼ Binomial(m, ε) where ε := err(f̂)
I By law of large numbers, (1/m)·S → ε as m → ∞
I Therefore, test error rate (1/m) ∑_{i=1}^{m} 1{f̂(X′i) ≠ Y′i}
is close to ε when m is large
I How accurate is the estimate? Depends on the (conditional)
variance!
I var((1/m)·S | training data) = ε(1 − ε)/m
I Standard deviation is √(ε(1 − ε)/m)
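I A short sketch of the estimate and its standard error (helper names are my own): count mistakes on the m test examples, divide by m, and report √(ε̂(1 − ε̂)/m) as an approximate standard deviation.

    import math

    def test_error_rate(classifier, test_examples):
        # Fraction of held-out examples the classifier gets wrong, plus a
        # plug-in estimate of the standard deviation of that fraction.
        m = len(test_examples)
        eps_hat = sum(classifier(x) != y for x, y in test_examples) / m
        std_err = math.sqrt(eps_hat * (1 - eps_hat) / m)
        return eps_hat, std_err

    # Usage with a trivial threshold classifier and a toy test set.
    f_hat = lambda x: 1 if x >= 0 else 0
    test = [(-1.0, 0), (0.5, 1), (2.0, 1), (-0.2, 1), (1.1, 0)]
    print(test_error_rate(f_hat, test))   # (0.4, ~0.219)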
25 / 32
Confusion tables
I True positive rate (recall): Pr(f(X) = 1 | Y = 1)
I False positive rate: Pr(f(X) = 1 | Y = 0)
I Precision: Pr(Y = 1 | f(X) = 1)
I . . .
I Confusion table
              f(x) = 0             f(x) = 1
    y = 0     # true negatives     # false positives
    y = 1     # false negatives    # true positives
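I A small sketch computing these quantities from predicted and true labels (pure Python; the helper name is my own):

    def confusion_counts(preds, labels):
        # Tally the four cells of the confusion table.
        tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
        fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
        fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
        tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
        return tn, fp, fn, tp

    preds  = [1, 0, 1, 1, 0, 0, 1]
    labels = [1, 0, 0, 1, 1, 0, 1]
    tn, fp, fn, tp = confusion_counts(preds, labels)
    tpr = tp / (tp + fn)          # true positive rate (recall)
    fpr = fp / (fp + tn)          # false positive rate
    precision = tp / (tp + fp)
    print(tpr, fpr, precision)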
26 / 32
ROC curves
I Receiver operating characteristic (ROC) curve
I What points are achievable on the TPR-FPR plane?
I Use randomization to combine classifiers
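I A sketch of the randomization idea (my own illustration, not code from the course): flip a λ-biased coin between two classifiers; the resulting (FPR, TPR) point is the convex combination λ·(FPR1, TPR1) + (1 − λ)·(FPR2, TPR2).

    import random

    def randomized_classifier(f1, f2, lam, rng=random):
        # With probability lam use f1, otherwise f2, independently for each input.
        def f(x):
            return f1(x) if rng.random() < lam else f2(x)
        return f

    # Usage: mix the always-0 and always-1 classifiers, whose ROC points are
    # (0, 0) and (1, 1); lam = 0.5 lands halfway along the diagonal.
    f = randomized_classifier(lambda x: 0, lambda x: 1, lam=0.5)
    print([f(None) for _ in range(10)])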
27 / 32
[Figure 5: TPR vs. FPR plot with two achievable points, (FPR1, TPR1) and (FPR2, TPR2)]
28 / 32
[Figure 6: TPR vs. FPR plot with many achievable points]
29 / 32
More than two outcomes
I What if there are K > 2 possible outcomes?
I Replace coin with K-sided die
I Say Y has a categorical distribution over [K] := {1, . . . ,K},
determined by a probability vector θ = (θ1, . . . , θK)
I θk ≥ 0 for all k ∈ [K], and ∑_{k=1}^{K} θk = 1
I Pr(Y = k) = θk
I Optimal prediction of Y if θ is known: ŷ := arg max_{k∈[K]} θk
30 / 32
Statistical model for multi-class classification
I Statistical model for labeled examples (X,Y ), where Y takes
values in [K]
I Now, Y | X = x has a categorical distribution with parameter
vector η(x) = (η(x)1, . . . , η(x)K)
I Conditional probability function: η(x)k := Pr(Y = k | X = x)
I Optimal classifier: f*(x) = arg max_{k∈[K]} η(x)k
I Optimal error rate: Pr(f*(X) ≠ Y ) = 1 − E[max_k η(X)k]
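I A minimal sketch for a finite input space with K = 3 (the toy distribution is made up): take the argmax of η(x) for the optimal classifier, and 1 − E[max_k η(X)k] for the optimal error rate.

    # Toy multi-class setup: two inputs, K = 3 labels.
    p_x = {"a": 0.6, "b": 0.4}
    eta = {"a": [0.7, 0.2, 0.1],   # eta(x)_k = Pr(Y = k | X = x), k = 1, ..., K
           "b": [0.3, 0.3, 0.4]}

    def bayes_classifier(x):
        # f*(x) = argmax_k eta(x)_k, returned as a label in {1, ..., K}
        probs = eta[x]
        return 1 + max(range(len(probs)), key=lambda k: probs[k])

    # Optimal error rate: 1 - E[max_k eta(X)_k]
    optimal_error = 1 - sum(p_x[x] * max(eta[x]) for x in p_x)
    print({x: bayes_classifier(x) for x in p_x}, optimal_error)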
31 / 32
Potential downsides of the IID model
I Example: Train OCR digit classifier using data from Alice’s
handwriting, but eventually use on digits written by Bob.
I What is a better evaluation?
I What if we want to eventually use on digits written by both
Alice and Bob?
32 / 32