Lecturer: Ben Rubinstein
Lecture 6. PAC Learning Theory
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
• Excess risk
∗ Decomposition: Estimation vs approximation
∗ Bayes risk – irreducible error
• Probably approximately correct (PAC) learning
• Bounding generalisation error with high probability
∗ Single model: Hoeffding’s inequality
∗ Finite model class: Also use the union bound
• Importance & limitations of uniform deviation bounds
Generalisation and Model Complexity
• Theory we’ve seen so far (mostly statistics)
∗ Asymptotic notions (consistency, efficiency)
∗ Convergence could be really slow
∗ Model complexity undefined
• Want: finite sample theory; convergence rates, trade-offs
• Want: define model complexity and relate it to test error
∗ Test error can’t be measured in real life, but it can be provably
bounded!
∗ Growth function, VC dimension
• Want: distribution-independent, learner-independent theory
∗ A fundamental theory applicable throughout ML
∗ Unlike bias-variance: distribution-dependent, with no notion of model complexity
Probably Approximately
Correct Learning
The bedrock of machine learning theory in
computer science.
Standard setup
• Supervised binary classification of data in 𝒳 into label set 𝒴 = {−1, 1}
• iid data {(x_i, y_i)}_{i=1}^m ~ D, some fixed unknown distribution
• Single test example drawn independently from the same D when representing generalisation performance (risk)
• Learning from a class of functions ℱ mapping (classifying) 𝒳 into 𝒴
• Which parts depend on the sample of data?
∗ Empirical risk R̂[f] that averages loss over the sample
∗ f_m ∈ ℱ the learned model (it could be trained on the same sample or a different one; the theory is fully general here)
Risk: The good, bad and ugly
R[f_m] − R* = (R[f_m] − R[f*]) + (R[f*] − R*)
excess risk = estimation error + approximation error
• Good: what we’d aim for in our class, with infinite data
∗ R[f*] the true risk of the best in class f* ∈ argmin_{f∈ℱ} R[f]
• Bad: we get what we get and don’t get upset
∗ R[f_m] the true risk of the learned f_m ∈ argmin_{f∈ℱ} R̂[f] + C‖f‖² (e.g.)
• Ugly: we usually cannot even hope for perfection!
∗ R* = inf_f R[f], called the Bayes risk; label noise makes it large
A familiar trade-off: More intuition
• simple family may underfit due to approximation error
• complex family may overfit due to estimation error
[Figure: excess risk vs. model class complexity; approximation error decreases while estimation error increases as the class goes from simple to complex]
About Bayes risk
• Bayes risk R* = inf_f R[f]
∗ Best risk possible, ever; but can be large
∗ Depends on distribution and loss function
• Bayes classifier achieves Bayes risk
∗ f_Bayes(x) = sgn 𝔼[Y|X = x]
[Figure: regression function 𝔼[Y|X = x] over 𝒳, with 𝒴 values ±1; the Bayes classifier predicts sgn 𝔼[Y|X = x]]
Once again, named after
Bayes. Not Bayesian ML.
Let’s focus on R[f_m]
• Since we don’t know the data distribution, we need to bound the generalisation error to be small
∗ Bound by the test error R̂[f_m] = (1/m) ∑_{i=1}^m f(X_i, Y_i)
∗ Abusing notation: f(X_i, Y_i) = l(Y_i, f(X_i))
R[f_m] ≤ R̂[f_m] + ε(m, ℱ)
• Unlucky training sets exist, so no guarantee that holds always is possible!
• With probability ≥ 1 − δ: R[f_m] ≤ R̂[f_m] + ε(m, ℱ, δ)
• Called Probably Approximately Correct (PAC) learning
∗ ℱ is called PAC learnable if m = O(poly(1/ε, 1/δ)) examples suffice to learn f_m for any ε, δ (i.e., we don’t require exponential growth in training size m)
∗ Won Leslie Valiant (Harvard) the 2010 Turing Award
• Later: Why this bounds estimation error.
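As a minimal sketch (my illustration, not from the slides), the empirical risk with 0-1 loss is just the fraction of sample points a classifier gets wrong; the classifier and data below are hypothetical:

```python
import numpy as np

def empirical_risk(f, X, Y):
    """0-1 empirical risk: (1/m) * sum_i 1[f(X_i) != Y_i]."""
    return float(np.mean(f(X) != Y))

# Hypothetical threshold classifier on 1-d inputs
f = lambda x: np.where(x >= 0.5, 1, -1)
X = np.array([0.1, 0.4, 0.6, 0.9])
Y = np.array([-1, 1, 1, 1])
print(empirical_risk(f, X, Y))  # one mistake out of four -> 0.25
```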
Mini Summary
• Excess risk as the goal of ML
• Decomposition into approximation, estimation errors
• Probably Approximately Correct (PAC) learning
∗ Like asymptotic theory in stats, but for finite sample size
∗ Worst-case on distributions: We don’t want to assume
something unrealistic about where the data comes from
∗ Worst-case on models: We don’t want a theory that applies to
narrow set of learners, but to ML in general
∗ We want it to produce a useful measure of model complexity
Next: First step to PAC theory – bounding single model risk
Bounding true risk
of one function
One step at a time
We need a concentration inequality
• R̂[f] is an unbiased estimate of R[f] for any fixed f (why?)
• That means on average R̂[f] lands on R[f]
• What’s the likelihood 1 − δ that R̂[f] lands within ε of R[f]? Or more precisely, what 1 − δ(m, ε) achieves a given ε > 0?
• Intuition: Just bounding the CDF of R̂[f], independent of the distribution!!
[Figure: PDF of R̂[f] centred at R[f]; the mass within ±ε of R[f] is 1 − δ(m, ε), with tail areas CDF(R[f] − ε) and 1 − CDF(R[f] + ε)]
Hoeffding’s inequality
• There are many such concentration inequalities; here is one of the simplest…
• Theorem: Let Z_1, …, Z_m, Z be iid random variables and h(z) ∈ [a, b] be a bounded function. For all ε > 0
Pr( |𝔼[h(Z)] − (1/m) ∑_{i=1}^m h(Z_i)| ≥ ε ) ≤ 2 exp(−2mε² / (b − a)²)
Pr( 𝔼[h(Z)] − (1/m) ∑_{i=1}^m h(Z_i) ≥ ε ) ≤ exp(−2mε² / (b − a)²)
• Two-sided case in words: The probability that the empirical average is far from the expectation is small.
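The theorem can be checked empirically. The sketch below (my illustration; the Bernoulli choice and parameters are assumptions) estimates the two-sided tail probability for sample means of fair coin flips and compares it to the Hoeffding bound:

```python
import numpy as np

rng = np.random.default_rng(0)
m, eps, trials = 100, 0.1, 10_000

# Z_i ~ Bernoulli(0.5) with h(z) = z, so h in [0, 1] and (b - a)^2 = 1
samples = rng.integers(0, 2, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - 0.5)

empirical = float(np.mean(deviations >= eps))   # observed tail probability
hoeffding = 2 * np.exp(-2 * m * eps**2)         # two-sided Hoeffding bound
print(empirical, hoeffding)
```

The bound holds but is loose here: the observed tail probability is several times smaller than the bound, consistent with Hoeffding using only boundedness, not variance.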
Common probability ‘tricks’
• Inversion:
∗ For any event A, Pr(Ā) = 1 − Pr(A)
∗ Application: Pr(X > ε) ≤ δ implies Pr(X ≤ ε) ≥ 1 − δ
• Solving for parameters in high-probability bounds:
∗ For a given ε, with δ(ε) a function of ε: Pr(X > ε) ≤ δ(ε)
∗ Given δ′, we can write ε = δ⁻¹(δ′): Pr(X > δ⁻¹(δ′)) ≤ δ′
∗ Lets you specify either parameter
∗ Sometimes the sample size m is a variable we can solve for too
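These inversion tricks can be sketched in code. Below (my illustration, assuming a [0, 1]-bounded loss so the one-sided tail is δ(ε) = exp(−2mε²)), each helper solves for a different parameter:

```python
import numpy as np

def delta_of_eps(m, eps):
    # One-sided Hoeffding tail for a [0, 1]-bounded loss
    return np.exp(-2.0 * m * eps**2)

def eps_of_delta(m, delta):
    # Inversion: eps = sqrt(log(1/delta) / (2 m))
    return np.sqrt(np.log(1.0 / delta) / (2.0 * m))

def m_for(eps, delta):
    # Solving for sample size: smallest m with delta_of_eps(m, eps) <= delta
    return int(np.ceil(np.log(1.0 / delta) / (2.0 * eps**2)))

print(m_for(0.05, 0.05))  # -> 600 examples suffice for eps = delta = 0.05
```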
Et voila: A bound on true risk!
Result! R[f] ≤ R̂[f] + √(log(1/δ) / (2m)) with high probability (w.h.p.) ≥ 1 − δ
Proof
• Take the Z_i to be the labelled examples (X_i, Y_i)
• Take h(X, Y) = l(Y, f(X)), the zero-one loss, for some fixed f ∈ ℱ; then h(X, Y) ∈ [0, 1]
• Apply one-sided Hoeffding: Pr(R[f] − R̂[f] ≥ ε) ≤ exp(−2mε²)
• Invert by setting δ = exp(−2mε²), i.e. ε = √(log(1/δ) / (2m))
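The result can be sanity-checked by simulation. In the sketch below (my illustration; the toy distribution and the fixed classifier are assumptions, chosen so the true risk is known exactly), the bound should fail on at most a δ fraction of training samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution: X ~ Uniform[0, 1]; clean label sign(X - 0.5), flipped w.p. 0.1.
# Fixed classifier f(x) = sign(x - 0.4) disagrees with the clean label on
# [0.4, 0.5] (prob 0.1), so R[f] = 0.1 * 0.9 + 0.9 * 0.1 = 0.18 exactly.
TRUE_RISK = 0.18

def bound_failure_rate(m, delta, trials=5000):
    X = rng.uniform(size=(trials, m))
    flip = rng.uniform(size=(trials, m)) < 0.1
    Y = np.where(flip, -np.sign(X - 0.5), np.sign(X - 0.5))
    emp = np.mean(np.sign(X - 0.4) != Y, axis=1)      # empirical risk per trial
    eps = np.sqrt(np.log(1.0 / delta) / (2.0 * m))    # bound slack
    return float(np.mean(TRUE_RISK > emp + eps))      # fraction of failures

rate = bound_failure_rate(m=200, delta=0.05)
print(rate)  # far below delta = 0.05 in practice
```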
Mini Summary
• Goal: Bound true risk of a classifier based on its
empirical risk plus “stuff”
• Caveat: Bound is “with high probability” since we
could be unlucky with the data
• Approach: Hoeffding’s inequality which bounds how
far a mean is likely to be from an expectation
Next: PAC learning as uniform deviation bounds
Uniform deviation bounds
Why we need our bound to simultaneously
(or uniformly) hold over a family of functions.
Our bound doesn’t hold for f = f_m
• The result says there’s a set S of good samples for which R[f] ≤ R̂[f] + √(log(1/δ) / (2m)), and Pr(𝐙 ∈ S) ≥ 1 − δ
• But for different functions f_1, f_2, … we might get very different sets S_1, S_2, …
• The observed S may be bad for f_m. Learning minimises R̂[f_m], exacerbating this
[Figure: true risk R[f] and empirical risks R̂_1[f], R̂_2[f] from two samples, plotted over ℱ; f_m sits where its own sample’s empirical curve dips, so it is problematic that f_m depends on the data]
Uniform deviation bounds
• We could analyse the risk of f_m from a specific learner
∗ But then we’d repeat the analysis for every new learner; how would we compare learners?
∗ Note there are ways to do this, even data-dependently
• Bound uniform deviations across the whole class ℱ
R[f_m] − R̂[f_m] ≤ sup_{f∈ℱ} ( R[f] − R̂[f] ) ≤ ?
∗ The worst deviation over an entire class bounds the learned model’s risk deviation!
∗ Convenient, but could be much worse than the actual gap for f_m
[Figure: deviations R[f] − R̂[f] over ℱ for three samples, each with its own uniform (worst-case) deviation; inset: the PDF of the uniform deviation]
Relation to estimation error?
• Recall estimation error? The learning part of excess risk!
R[f_m] − R* = (R[f_m] − R[f*]) + (R[f*] − R*)
Theorem: ERM’s estimation error is at most twice the uniform deviation
∗ Proof: a bunch of algebra! (the first step uses the ERM property R̂[f_m] ≤ R̂[f*])
R[f_m] ≤ (R̂[f*] − R̂[f_m]) + (R[f_m] − R[f*]) + R[f*]
 = (R̂[f*] − R[f*]) + (R[f_m] − R̂[f_m]) + R[f*]
 ≤ |R[f*] − R̂[f*]| + |R[f_m] − R̂[f_m]| + R[f*]
 ≤ 2 sup_{f∈ℱ} |R[f] − R̂[f]| + R[f*]
Mini Summary
• Why Hoeffding doesn’t cover a model f_m learned from data, only a fixed data-independent f
• Uniform deviation idea: Cover the worst case
deviation between risk and empirical risk, across ℱ
• Advantages: works for any learner, data distribution
• Connection back to bounding estimation error
Next: Next step for PAC learning – finite classes
Error bound for
finite function classes
Our first uniform deviation bound
The Union Bound
• If each model 𝑓𝑓 having large risk deviation is a “bad event”,
we need a tool to bound the probability that any bad event
happens. I.e. the union of bad events!
• Union bound: for a sequence of events A_1, A_2, …
Pr( ⋃_i A_i ) ≤ ∑_i Pr(A_i)
Proof:
Define B_i = A_i \ ⋃_{j=1}^{i−1} A_j, with B_1 = A_1.
1. We know: ⋃_i B_i = ⋃_i A_i (could prove by induction)
2. The B_i are disjoint (empty pairwise intersections), so Pr(⋃_i B_i) = ∑_i Pr(B_i) by additivity
3. We know: B_i ⊆ A_i, so Pr(B_i) ≤ Pr(A_i) by monotonicity
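The union bound, and its slack under overlap, can be seen numerically. In this sketch (my illustration; the three events are assumptions, deliberately overlapping), the probability of the union is well below the sum of the individual probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Three overlapping "bad events" driven by one shared uniform draw
u = rng.uniform(size=n)
A1 = u < 0.10
A2 = u < 0.15          # contains A1, so the events overlap heavily
A3 = (u > 0.50) & (u < 0.58)

p_union = float(np.mean(A1 | A2 | A3))               # ~ 0.15 + 0.08 = 0.23
p_sum = float(A1.mean() + A2.mean() + A3.mean())     # ~ 0.33, the looser bound
print(p_union, p_sum)
```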
Bound for finite classes ℱ
• A uniform deviation bound for any finite class and any distribution
Theorem: Consider any δ > 0 and finite class ℱ. Then with probability at least 1 − δ: For all f ∈ ℱ, R[f] ≤ R̂[f] + √( (log|ℱ| + log(1/δ)) / (2m) )
Proof:
• If each model f having large risk deviation is a “bad event”, we bound the probability that any bad event happens.
• Pr(∃f ∈ ℱ: R[f] − R̂[f] ≥ ε) ≤ ∑_{f∈ℱ} Pr(R[f] − R̂[f] ≥ ε) by the union bound
• ≤ |ℱ| exp(−2mε²) by one-sided Hoeffding applied to each fixed f
• Followed by inversion, setting δ = |ℱ| exp(−2mε²)
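The theorem’s penalty term grows only logarithmically in |ℱ|, which a quick sketch makes concrete (my illustration; the sample size and class sizes are assumptions):

```python
import numpy as np

def finite_class_bound(emp_risk, m, size_F, delta):
    # From delta = |F| exp(-2 m eps^2): eps = sqrt((log|F| + log(1/delta)) / (2m))
    eps = np.sqrt((np.log(size_F) + np.log(1.0 / delta)) / (2.0 * m))
    return emp_risk + eps

# The penalty grows only logarithmically in |F|
for k in [1, 10**3, 10**6]:
    print(k, finite_class_bound(0.1, m=1000, size_F=k, delta=0.05))
```

Even a million-model class only multiplies the single-model slack by a small constant here, which is why finite-class bounds remain usable.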
Discussion
• Hoeffding’s inequality only uses boundedness of the loss, not the
variance of the loss random variables
∗ Fancier concentration inequalities leverage variance
• Uniform deviation is worst-case; ERM on a very large over-parametrised ℱ may approach the worst case, but learners in general may not
∗ Custom analysis, data-dependent bounds, PAC-Bayes, etc.
• Dependent data?
∗ Martingale theory
• The union bound is in general loose: it is as bad as if all the bad events were independent (not necessarily the case, even though the underlying data are modelled as independent); and it requires finite ℱ
∗ VC theory coming up next!
Mini Summary
• More on uniform deviation bounds
• The union bound (generic tool in probability theory)
• Finite classes: Bounding uniform deviation with
union+Hoeffding
Next time: PAC learning with infinite function classes!