
Lecturer: Ben Rubinstein

Lecture 6. PAC Learning Theory

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne

This lecture

• Excess risk
∗ Decomposition: Estimation vs approximation
∗ Bayes risk – irreducible error

• Probably approximately correct learning

• Bounding generalisation error with high probability
∗ Single model: Hoeffding’s inequality
∗ Finite model class: Also use the union bound

• Importance & limitations of uniform deviation bounds

Turing Award Inside

Generalisation and Model Complexity
• Theory we’ve seen so far (mostly statistics)

∗ Asymptotic notions (consistency, efficiency)
∗ Convergence could be really slow
∗ Model complexity undefined

• Want: finite sample theory; convergence rates, trade-offs

• Want: define model complexity and relate it to test error
∗ Test error can't be measured in real life, but it can be provably bounded!
∗ Growth function, VC dimension

• Want: distribution-independent, learner-independent theory
∗ A fundamental theory applicable throughout ML
∗ Unlike bias-variance: distribution dependent, no model complexity,


Probably Approximately
Correct Learning

The bedrock of machine learning theory in
computer science.


Standard setup

• Supervised binary classification of
∗ data in 𝒳 into label set 𝒴 = {−1, 1}

• iid data {(x_i, y_i)}_{i=1}^m ~ D, some fixed unknown distribution

• Single test example independent from same D when representing generalisation performance (risk)

• Learning from a class of functions ℱ mapping (classifying) 𝒳 into 𝒴

• What parts depend on the sample of data?
∗ Empirical risk R̂[f] that averages loss over the sample (see the sketch below)
∗ f_m ∈ ℱ the learned model (it could be the same sample or different; theory is actually fully general here)
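To make the empirical risk definition concrete, here is a minimal Python sketch (not from the lecture; the toy data and helper names such as empirical_risk are hypothetical) that averages the 0-1 loss of a fixed classifier over a sample:

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    """0-1 loss: 1 for a mistake, 0 for a correct prediction."""
    return (y_true != y_pred).astype(float)

def empirical_risk(f, X, y):
    """Empirical risk R_hat[f]: average loss of classifier f over the sample (X, y)."""
    return zero_one_loss(y, f(X)).mean()

# Hypothetical toy data: 1-d inputs, labels in {-1, +1}, and a fixed threshold classifier.
f = lambda X: np.where(X > 0.0, 1, -1)
X = np.array([-1.2, -0.3, 0.4, 1.5, 2.0])
y = np.array([-1, 1, 1, 1, -1])
print(empirical_risk(f, X, y))  # 2 mistakes out of 5 -> 0.4
```

The rest of the lecture is about relating this sample average back to the true risk R[f].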


Risk: The good, bad and ugly

R[f_m] − R* = (R[f_m] − R[f*]) + (R[f*] − R*)
(excess risk = estimation error + approximation error)

• Good: what we'd aim for in our class, with infinite data
∗ R[f*] true risk of the best in class, f* ∈ argmin_{f∈ℱ} R[f]

• Bad: we get what we get and don't get upset
∗ R[f_m] true risk of learned f_m ∈ argmin_{f∈ℱ} R̂[f] + C‖f‖² (e.g.)

• Ugly: we usually cannot even hope for perfection!
∗ R* = inf_f R[f] called the Bayes risk; noisy labels make it large

A familiar trade-off: More intuition
• simple family may underfit due to approximation error

• complex family may overfit due to estimation error

[Figure: approximation error, estimation error, and excess risk plotted against model class complexity, from simple to complex model classes.]

About Bayes risk

• Bayes risk R* = inf_f R[f]
∗ Best risk possible, ever; but can be large
∗ Depends on distribution and loss function

• Bayes classifier achieves Bayes risk
∗ f_Bayes(x) = sgn 𝔼[Y|X = x]

[Figure: 𝔼[Y|X = x] plotted over 𝒳, taking values between −1 and 1 in 𝒴, together with its sign sgn 𝔼[Y|X = x], the Bayes classifier.]

Once again, named after
Bayes. Not Bayesian ML.
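To make the Bayes classifier and Bayes risk concrete, here is a small sketch under an assumed toy distribution (my own example, not from the slides): for labels in {−1, +1}, 𝔼[Y|X = x] = 2η(x) − 1 where η(x) = Pr(Y = 1|X = x), so the Bayes classifier is the sign of that quantity and, under 0-1 loss, the Bayes risk is 𝔼[min(η(X), 1 − η(X))].

```python
import numpy as np

# Hypothetical discrete marginal over X in {0, 1, 2} and conditional eta(x) = Pr(Y = +1 | X = x).
p_x = np.array([0.5, 0.3, 0.2])
eta = np.array([0.9, 0.5, 0.2])

# For Y in {-1, +1}: E[Y | X = x] = 2*eta(x) - 1; the Bayes classifier takes its sign.
bayes_pred = np.where(2 * eta - 1 >= 0, 1, -1)          # predictions [+1, +1, -1]

# Under 0-1 loss the conditional risk of the Bayes prediction is min(eta, 1 - eta),
# so the Bayes risk averages this over the marginal of X.
bayes_risk = np.sum(p_x * np.minimum(eta, 1 - eta))
print(bayes_pred, bayes_risk)                           # R* = 0.05 + 0.15 + 0.04 = 0.24
```

Note the Bayes risk here is large (0.24) even though the classifier is optimal: it is the irreducible error coming from noisy labels.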

Let's focus on R[f_m]

• Since we don't know the data distribution, we need to bound generalisation to be small
∗ Bound by test error R̂[f_m] = (1/m) ∑_{i=1}^m f(X_i, Y_i)
∗ Abusing notation: f(X_i, Y_i) = l(Y_i, f(X_i))

R[f_m] ≤ R̂[f_m] + ε(m, ℱ)

• Unlucky training sets mean no always-guarantees are possible!

• With probability ≥ 1 − δ: R[f_m] ≤ R̂[f_m] + ε(m, ℱ, δ)

• Called Probably Approximately Correct (PAC) learning
∗ ℱ called PAC learnable if m = O(poly(1/ε, 1/δ)) suffices to learn f_m for any ε, δ
∗ Won Leslie Valiant (Harvard) the 2010 Turing Award

• Later: Why this bounds estimation error.
[Photo: Leslie Valiant. CCA2.0 Renate Schmid]

Don't require exponential growth in training size m

Mini Summary

• Excess risk as the goal of ML
• Decomposition into approximation, estimation errors
• Probably Approximately Correct (PAC) learning

∗ Like asymptotic theory in stats, but for finite sample size
∗ Worst-case on distributions: We don't want to assume something unrealistic about where the data comes from
∗ Worst-case on models: We don't want a theory that applies to a narrow set of learners, but to ML in general
∗ We want it to produce a useful measure of model complexity

Next: First step to PAC theory – bounding single model risk

Bounding true risk
of one function

One step at a time

We need a concentration inequality

• R̂[f] is an unbiased estimate of R[f] for any fixed f (why?)

• That means on average R̂[f] lands on R[f]

• What's the likelihood 1 − δ that R̂[f] lands within ε of R[f]? Or more precisely, what 1 − δ(m, ε) achieves a given ε > 0?

• Intuition: Just bounding the CDF of R̂[f], independent of the distribution!!
[Figure: the PDF of R̂[f], centred at R[f]. The mass within ±ε of R[f] (an interval of width 2ε) is 1 − δ(m, ε); the left tail has mass CDF(R[f] − ε) and the right tail 1 − CDF(R[f] + ε). Individual estimates R̂_1, …, R̂_5 scatter around R[f]. Public domain, Wikipedia.]

Hoeffding’s inequality
• Many such concentration inequalities; one of the simplest…

• Theorem: Let Z_1, …, Z_m, Z be iid random variables and h(z) ∈ [a, b] be a bounded function. For all ε > 0

Pr( |𝔼[h(Z)] − (1/m) ∑_{i=1}^m h(Z_i)| ≥ ε ) ≤ 2 exp( −2mε² / (b − a)² )     (two-sided)

Pr( 𝔼[h(Z)] − (1/m) ∑_{i=1}^m h(Z_i) ≥ ε ) ≤ exp( −2mε² / (b − a)² )     (one-sided)

• Two-sided case in words: The probability
that the empirical average is far
from the expectation is small.
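As a quick sanity check of the two-sided bound, here is a Monte Carlo sketch I have added (not part of the slides), using Bernoulli(0.3) variables so that h(Z) = Z is bounded in [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, eps, trials = 100, 0.3, 0.1, 20000

# h(Z) = Z for Bernoulli(p) variables, so [a, b] = [0, 1] and E[h(Z)] = p.
samples = rng.binomial(1, p, size=(trials, m))
deviation = np.abs(samples.mean(axis=1) - p)        # |empirical mean - expectation|
empirical_prob = (deviation >= eps).mean()          # observed tail frequency

hoeffding_bound = 2 * np.exp(-2 * m * eps**2)       # two-sided bound with (b - a) = 1
print(empirical_prob, hoeffding_bound)
```

For these settings the bound is 2·exp(−2) ≈ 0.27, while the observed frequency is far smaller; the slack is the price of a bound that holds for every distribution on [a, b].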


Common probability ‘tricks’

• Inversion:
∗ For any event A, Pr(Ā) = 1 − Pr(A)
∗ Application: Pr(X > ε) ≤ δ implies Pr(X ≤ ε) ≥ 1 − δ (worked through for Hoeffding below)

• Solving for ε or δ in high-probability bounds:
∗ For given ε, with δ(ε) a function of ε: Pr(X > ε) ≤ δ(ε)
∗ Given δ′, can write ε = δ⁻¹(δ′): Pr(X > δ⁻¹(δ′)) ≤ δ′
∗ Lets you specify either parameter
∗ Sometimes sample size m is a variable we can solve for too
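As a worked instance of both tricks (added here to anticipate the next slide), take the one-sided Hoeffding tail for a [0, 1]-bounded loss and solve for ε:

```latex
% One-sided Hoeffding tail for a [0,1]-bounded loss; set it to \delta and invert:
\Pr\big( R[f] - \hat{R}[f] \ge \varepsilon \big) \le \exp(-2m\varepsilon^2) = \delta
\quad \Longrightarrow \quad
\varepsilon(m, \delta) = \sqrt{\frac{\log(1/\delta)}{2m}},
\qquad \text{so with probability at least } 1 - \delta:\quad
R[f] \le \hat{R}[f] + \sqrt{\frac{\log(1/\delta)}{2m}}.
```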

Et voila: A bound on true risk!

Result! With high probability (w.h.p.) at least 1 − δ:

R[f] ≤ R̂[f] + √( log(1/δ) / (2m) )

Proof
• Take the Z_i as labelled examples (X_i, Y_i)
• Take h(X, Y) = l(Y, f(X)), the zero-one loss for some fixed f ∈ ℱ; then h(X, Y) ∈ [0, 1]
• Apply one-sided Hoeffding: Pr( R[f] − R̂[f] ≥ ε ) ≤ exp(−2mε²)
• Invert: set δ = exp(−2mε²), i.e. ε = √( log(1/δ) / (2m) )

Mini Summary

• Goal: Bound true risk of a classifier based on its
empirical risk plus “stuff”

• Caveat: Bound is “with high probability” since we
could be unlucky with the data

• Approach: Hoeffding’s inequality which bounds how
far a mean is likely to be from an expectation

Next: PAC learning as uniform deviation bounds

Uniform deviation bounds

Why we need our bound to simultaneously
(or uniformly) hold over a family of functions.

Our bound doesn't hold for f = f_m

• Result says there's a set S of good samples for which R[f] ≤ R̂[f] + √( log(1/δ) / (2m) ) and Pr(𝐙 ∈ S) ≥ 1 − δ

• But for different functions f_1, f_2, … we might get very different sets S_1, S_2, …

• The S observed may be bad for f_m. Learning minimises R̂[f_m], exacerbating this

[Figure: the true risk curve R[f] and two empirical risk curves R̂_1[f], R̂_2[f] plotted against f, with the learned f_m and a band of width 2ε marked.]

Problematic that f_m depends on data
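A small simulation (my own sketch, not in the lecture) makes the selection effect visible: with pure-noise labels every classifier has true risk exactly 0.5, yet the empirical-risk minimiser over many candidate classifiers looks far better than 0.5 on the sample it was selected on, so its true-versus-empirical gap is much larger than the fixed-f analysis suggests.

```python
import numpy as np

rng = np.random.default_rng(1)
m, num_classifiers = 50, 1000

# Pure noise: labels carry no signal, so every classifier has true risk exactly 0.5.
y = rng.choice([-1, 1], size=m)
# A finite "class" of classifiers: each is just a fixed random labelling of the m points.
preds = rng.choice([-1, 1], size=(num_classifiers, m))

emp_risks = (preds != y).mean(axis=1)   # empirical risk of each classifier on this sample
print(emp_risks.min())                  # ERM's empirical risk: typically well below 0.5
print(0.5 - emp_risks.min())            # its true-minus-empirical gap, inflated by selection
```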

Uniform deviation bounds

• We could analyse risks of f_m from a specific learner
∗ But repeating for new learners? How to compare learners?
∗ Note there are ways to do this, and data-dependently

• Bound uniform deviations across the whole class ℱ

R[f_m] − R̂[f_m] ≤ sup_{f∈ℱ} |R[f] − R̂[f]| ≤ ?

∗ Worst deviation over an entire class bounds learned risk!
∗ Convenient, but could be much worse than the actual gap for f_m

[Figure: true risk R[f] and empirical risk R̂[f] curves for samples 1, 2 and 3, each with its own uniform deviation and the learned f_m; alongside, the PDF of the uniform deviation with realisations UD_1, UD_2, UD_3. Public domain, Wikipedia.]

Relation to estimation error?

• Recall estimation error? The learning part of excess risk!
R[f_m] − R* = (R[f_m] − R[f*]) + (R[f*] − R*)

Theorem: ERM's estimation error is at most twice the uniform deviation

∗ Proof: a bunch of algebra! (the first inequality uses R̂[f_m] ≤ R̂[f*], since ERM's f_m minimises empirical risk)

R[f_m] ≤ R̂[f*] − R̂[f_m] + R[f_m] − R[f*] + R[f*]
       = R̂[f*] − R[f*] + R[f_m] − R̂[f_m] + R[f*]
       ≤ |R[f*] − R̂[f*]| + |R[f_m] − R̂[f_m]| + R[f*]
       ≤ 2 sup_{f∈ℱ} |R[f] − R̂[f]| + R[f*]

Mini Summary

• Why Hoeffding doesn't cover a model f_m learned from data, only a fixed data-independent f

• Uniform deviation idea: Cover the worst case
deviation between risk and empirical risk, across ℱ

• Advantages: works for any learner, data distribution

• Connection back to bounding estimation error

Next: Next step for PAC learning – finite classes

Error bound for
finite function classes

Our first uniform deviation bound

The Union Bound
• If each model f having large risk deviation is a “bad event”, we need a tool to bound the probability that any bad event happens. I.e. the union of bad events!

• Union bound: for a sequence of events A_1, A_2, …

Pr( ⋃_i A_i ) ≤ ∑_i Pr(A_i)

Proof:
Define B_i = A_i \ ⋃_{j=1}^{i−1} A_j with B_1 = A_1.
1. We know: ⋃_i B_i = ⋃_i A_i (could prove by induction)
2. The B_i are disjoint (empty intersections)
3. We know: B_i ⊆ A_i so Pr(B_i) ≤ Pr(A_i) by monotonicity
4. Chain these: Pr(⋃_i A_i) = Pr(⋃_i B_i) = ∑_i Pr(B_i) ≤ ∑_i Pr(A_i), using additivity over the disjoint B_i
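A quick numerical check (a sketch I have added, with made-up, strongly dependent events) that the union bound holds, and a preview of how loose it can be when the bad events overlap heavily:

```python
import numpy as np

rng = np.random.default_rng(2)
trials = 100_000

# Made-up nested "bad events": A_i = {U > t_i} for a single shared uniform U.
U = rng.random(trials)
thresholds = np.array([0.90, 0.92, 0.94, 0.96, 0.98])
events = U[:, None] > thresholds[None, :]   # shape (trials, num_events)

prob_union = events.any(axis=1).mean()      # Pr(at least one bad event), about 0.10
sum_of_probs = events.mean(axis=0).sum()    # union-bound value, about 0.30
print(prob_union, sum_of_probs)             # bound holds, but is loose for nested events
```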

Bound for finite classes ℱ

• A uniform deviation bound for any finite class and any distribution

Theorem: Consider any δ > 0 and finite class ℱ. Then w.h.p. at least 1 − δ: For all f ∈ ℱ,

R[f] ≤ R̂[f] + √( (log|ℱ| + log(1/δ)) / (2m) )

Proof:
• If each model f having large risk deviation is a “bad event”, we bound the probability that any bad event happens.
• Pr( ∃f ∈ ℱ: R[f] − R̂[f] ≥ ε ) ≤ ∑_{f∈ℱ} Pr( R[f] − R̂[f] ≥ ε ) by the union bound
• ≤ |ℱ| exp(−2mε²) by one-sided Hoeffding applied to each fixed f
• Followed by inversion, setting δ = |ℱ| exp(−2mε²) and solving for ε
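As a sketch of using the theorem (helper names are my own, not course code), the deviation term, and the sample size it implies for a target ε, can be computed directly:

```python
import numpy as np

def finite_class_epsilon(m, class_size, delta):
    """Deviation term of the finite-class bound: sqrt((log|F| + log(1/delta)) / (2m))."""
    return np.sqrt((np.log(class_size) + np.log(1.0 / delta)) / (2.0 * m))

def sample_complexity(epsilon, class_size, delta):
    """Invert the bound: smallest m making the deviation term at most epsilon."""
    return int(np.ceil((np.log(class_size) + np.log(1.0 / delta)) / (2.0 * epsilon**2)))

print(finite_class_epsilon(m=1000, class_size=100, delta=0.05))     # about 0.062
print(sample_complexity(epsilon=0.05, class_size=100, delta=0.05))  # 1521
```

Since m grows like (log|ℱ| + log(1/δ)) / ε², it is polynomial in 1/ε and 1/δ, matching the PAC-learnability requirement m = O(poly(1/ε, 1/δ)) from earlier in the lecture.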

Discussion

• Hoeffding’s inequality only uses boundedness of the loss, not the
variance of the loss random variables
∗ Fancier concentration inequalities leverage variance

• Uniform deviation is worst-case; ERM on a very large over-parametrised ℱ may approach the worst case, but learners generally may not
∗ Custom analysis, data-dependent bounds, PAC-Bayes, etc.

• Dependent data?
∗ Martingale theory

• Union bound is in general loose: as bad as if all the bad events were independent (not necessarily the case, even though the underlying data are modelled as independent); and it only handles finite ℱ
∗ VC theory coming up next!

Mini Summary

• More on uniform deviation bounds

• The union bound (generic tool in probability theory)

• Finite classes: Bounding uniform deviation with
union+Hoeffding

Next time: PAC learning with infinite function classes!
