COMP90051 Statistical Machine Learning
Lecture 7. VC Theory
Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• PAC learning bounds:
Countably infinite case works as we’ve done so far
General infinite case? Needs new ideas!
• Growth functions for the general PAC case
Considering patterns of labels possible on a data set
Gives good PAC bounds provided possible patterns don’t grow too fast
in the data set size
• Vapnik‐Chervonenkis (VC) dimension
Max number of points that can be labelled in all ways
Beyond this point, growth function is polynomial in data set size
Leads to famous, general PAC bound from VC theory
• Optional proofs at end (just for fun)
Countably infinite ℱ?
• Hoeffding gives us, for a single f ∈ ℱ:

  $\Pr\left[\, \left| R[f] - \hat{R}_m[f] \right| \ge \sqrt{\frac{\log(2/\delta(f))}{2m}} \,\right] \le \delta(f)$

  …where we're free to choose (varying) δ(f) in (0,1).
• Union bound "works" (sort of) for this case:

  $\Pr\left[\, \exists f \in \mathcal{F}: \left| R[f] - \hat{R}_m[f] \right| \ge \sqrt{\frac{\log(2/\delta(f))}{2m}} \,\right] \le \sum_{f \in \mathcal{F}} \delta(f)$

• Choose confidences to sum to a constant δ, then this works
  E.g. $\delta(f) = \delta \cdot p(f)$ where $1 = \sum_{f \in \mathcal{F}} p(f)$
• By inversion: w.h.p. at least 1 − δ, for all f ∈ ℱ:

  $R[f] \le \hat{R}_m[f] + \sqrt{\frac{\log(1/p(f)) + \log(2/\delta)}{2m}}$

  (Try this for finite ℱ with uniform p(f) = 1/|ℱ|.)
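A quick numeric check of the finite-ℱ case with uniform p(f) = 1/|ℱ| (a minimal sketch, not from the slides; `pac_deviation` is a made-up helper name):

```python
import math

def pac_deviation(size_f, m, delta):
    """Deviation term of the finite-class PAC bound with uniform p(f) = 1/|F|:
    sqrt((log|F| + log(2/delta)) / (2m))."""
    return math.sqrt((math.log(size_f) + math.log(2 / delta)) / (2 * m))

# e.g. |F| = 1000 models, m = 10,000 examples, confidence 1 - delta = 0.95
print(pac_deviation(1000, 10_000, 0.05))  # ~ 0.023
```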
Ok fine, but general case?
• Much of ML has continuous parameters
  Countably infinite ℱ covers only discrete parameters
• Our argument fails!
  p(f) becomes a density
  It's zero for all f, so log(1/p(f)) blows up: we can't divide by zero!
  Need a new argument!
• Idea introduced by VC theory: intuition
  Don't focus on the whole class ℱ as if each f is different
  Focus on differences over the sample $Z_1, \dots, Z_m$
Mini Summary
• Can eke out PAC bounds on countably infinite families using the Hoeffding bound + union bound
• No good for general (uncountably infinite) cases
• Need another fundamentally new idea
Next: Organising analysis around patterns of labels possible on a data set, to avoid worst-case bad events
Growth Function
Focusing on the size of model families on data samples
Bad events: Unreasonably worst case?
• Bad event ℬ_f for model f:
  $\left| R[f] - \hat{R}_m[f] \right| \ge \varepsilon$, which occurs with probability at most $2\exp(-2m\varepsilon^2)$
• Union bound: assumes bad events don't overlap!?
  $\Pr\left[ \mathcal{B}_{f_1} \text{ or } \cdots \text{ or } \mathcal{B}_{f_{|\mathcal{F}|}} \right] \le \Pr[\mathcal{B}_{f_1}] + \cdots + \Pr[\mathcal{B}_{f_{|\mathcal{F}|}}] \le 2|\mathcal{F}|\exp(-2m\varepsilon^2)$
[Figure: two views of the sample space Ω containing bad events ℬ_{f1}, ℬ_{f2}, ℬ_{f3}. Left: "Tight bound: no overlaps". Right: "Loose bound: overlaps".]
How do overlaps arise?
[Figure: two similar classifiers f1 and f2 shown on the whole population and on a sample, each splitting the space into −1/+1 regions with nearly identical true-error regions: significantly overlapping events ℬ_{f1} and ℬ_{f2}.]
How do overlaps arise?
VC theory focuses on the pattern of labels any 𝑓 ∈ ℱ could make
Dichotomies and Growth Function
• Definition: Given a sample $x_1, \dots, x_m$ and family ℱ, a dichotomy is a tuple $(f(x_1), \dots, f(x_m)) \in \{-1, +1\}^m$ for some f ∈ ℱ.
• The unique dichotomies $\mathcal{F}(\mathbf{x}) = \{ (f(x_1), \dots, f(x_m)) : f \in \mathcal{F} \}$ are the patterns of labels possible with the family
• Even when ℱ is infinite, $|\mathcal{F}(\mathbf{x})| \le 2^m$ (why?)
• And also (relevant for finite, tiny ℱ), $|\mathcal{F}(\mathbf{x})| \le |\mathcal{F}|$ (why?)
• Intuition: $|\mathcal{F}(\mathbf{x})|$ might replace |ℱ| in the union bound? But how do we remove the dependence on 𝐱?
• Definition: The growth function $S_{\mathcal{F}}(m) = \sup_{\mathbf{x} \in \mathcal{D}^m} |\mathcal{F}(\mathbf{x})|$ is the max number of label patterns achievable by ℱ on any sample of size m.
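To make $|\mathcal{F}(\mathbf{x})|$ concrete, here's a minimal sketch (not from the slides) that lower-bounds the number of dichotomies linear classifiers achieve on a given sample, by drawing random weights and collecting distinct sign patterns; with enough draws it finds every achievable pattern with high probability:

```python
import numpy as np

def sampled_dichotomies(X, n_draws=100_000, seed=0):
    """Monte Carlo lower bound on |F(x)| for linear classifiers sign(w.x + b):
    draw random (w, b), record each distinct label pattern on the rows of X."""
    rng = np.random.default_rng(seed)
    patterns = set()
    for _ in range(n_draws):
        w, b = rng.standard_normal(X.shape[1]), rng.standard_normal()
        patterns.add(tuple(np.sign(X @ w + b).astype(int)))
    return patterns

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # 3 points, general position
print(len(sampled_dichotomies(X3)))                      # 8 = 2^3: shattered
X4 = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)   # add a 4th point
print(len(sampled_dichotomies(X4)))                      # 14 < 16: XOR patterns missing
```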
$S_{\mathcal{F}}(3)$ for linear classifiers in 2D?
[Figure: 3 points in general position, with a line realising each of the 8 labellings.]
$S_{\mathcal{F}}(3) = 8$
$S_{\mathcal{F}}(3)$ for linear classifiers in 2D?
[Figure: 3 collinear points; two labellings (marked X) are not achievable by any line.]
$|\mathcal{F}(\mathbf{x})| = 6$, but still $S_{\mathcal{F}}(3) = 8$ (the growth function takes the sup over samples)
$S_{\mathcal{F}}(4)$ for linear classifiers in 2D?
• What about 4 points?
• Can never produce the criss-cross (XOR) dichotomy
• In fact $S_{\mathcal{F}}(4) = 14 < 2^4$
• Guess/exercise: what about general m and dimension d?
PAC Bound with Growth Function
• Theorem: Consider any δ > 0 and any class ℱ. Then w.h.p. at least 1 − δ, for all f ∈ ℱ:

  $R[f] \le \hat{R}_m[f] + 2\sqrt{2\,\frac{\log S_{\mathcal{F}}(2m) + \log(4/\delta)}{m}}$

• Proof: out of scope ("only" 2–3 pages), optional reading.
• Compare to PAC bounds so far:
  A few negligible extra constants (the 2s, the 4)
  |ℱ| has become $S_{\mathcal{F}}(2m)$
  $S_{\mathcal{F}}(m) \le |\mathcal{F}|$: no "worse" than the union bound for finite ℱ
  $S_{\mathcal{F}}(m) = 2^m$ is very bad: a big family with an exponential growth function gets $R[f] \le \hat{R}_m[f] + \text{Big Constant}$. Even $R[f] \le \hat{R}_m[f] + 1$ is meaningless!!
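To see the last point concretely: substituting $S_{\mathcal{F}}(2m) = 2^{2m}$ into the theorem gives a deviation term of $2\sqrt{2\,\frac{2m\log 2 + \log(4/\delta)}{m}} \ge 2\sqrt{4\log 2} \approx 3.33$, a constant that never shrinks as m grows, while risks themselves lie in [0, 1].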
Mini Summary
• The previous PAC bound approach, which organises bad events by model and applies the union bound, is only tight if bad events are disjoint
• In reality some models generate overlapping bad events
• Better to organise families by possible patterns of labels on a data set: the dichotomies of the family
• Counting possible dichotomies gives the growth function
• The PAC bound with the growth function potentially tackles general (uncountably infinite) families, provided the growth function is sub-exponential in data size
Next: VC dimension for a computable bound on growth functions, with the polynomial behaviour we need! Gives our final, general PAC bound
The VC dimension
Computable, bounds growth function
Vapnik‐Chervonenkis dimension
• Definition: The VC dimension $\mathrm{VC}(\mathcal{F})$ of a family ℱ is the largest m such that $S_{\mathcal{F}}(m) = 2^m$.
  Points $\mathbf{x} = (x_1, \dots, x_m)$ are shattered by ℱ if $|\mathcal{F}(\mathbf{x})| = 2^m$
  So VC(ℱ) is the size of the largest set shattered by ℱ
• Example: linear classifiers in ℝ² have VC(ℱ) = 3
• Guess: VC-dim of linear classifiers in ℝᵈ?
[Figure: a point set that is shattered vs. a configuration that is not shattered.]
Example: VC(ℱ) from ℱ(𝐱) on the whole domain?

• Columns are all points in the domain; each row is a dichotomy on the entire input domain:

    x1  x2  x3  x4
     0   0   0   0
     0   1   1   0
     1   0   0   1
     1   1   0   1
     0   1   0   0
     1   0   1   0
     1   1   1   1
     0   0   1   1
     0   1   0   1
     1   1   1   0

• Obtain dichotomies on a subset of points 𝐱′ ⊆ {x1, …, x4} by: drop columns, drop duplicate rows
• ℱ shatters 𝐱′ if the number of remaining rows is $2^{|\mathbf{x}'|}$

This example:
• Dropping column x3 (duplicates shown below) leaves 8 unique rows behind, so ℱ shatters {x1, x2, x4}:

    x1  x2  x4
     0   0   0
     0   1   0
     1   0   1
     1   1   1
     0   1   0
     1   0   0
     1   1   1
     0   0   1
     0   1   1
     1   1   0

• The original table has only 10 < 2⁴ rows, so ℱ doesn't shatter more than 3 points
• Hence VC(ℱ) = 3

(Note we're using labels {0,1} instead of {−1,+1}. Why is that OK?)
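The drop-columns-and-count-rows procedure is mechanical; here is a minimal sketch (not from the slides; `shatters` and `vc_dimension` are made-up helper names) that computes the VC dimension of the table above:

```python
from itertools import combinations

def shatters(dichotomies, cols):
    """True if projecting the rows onto `cols` yields all 2^|cols| patterns."""
    projected = {tuple(row[c] for c in cols) for row in dichotomies}
    return len(projected) == 2 ** len(cols)

def vc_dimension(dichotomies):
    """Largest k such that some k columns are shattered (drop columns, dedupe rows)."""
    m = len(dichotomies[0])
    for k in range(1, m + 1):
        if not any(shatters(dichotomies, cols) for cols in combinations(range(m), k)):
            return k - 1  # no k-set shattered => no larger set can be
    return m

F = [(0,0,0,0), (0,1,1,0), (1,0,0,1), (1,1,0,1), (0,1,0,0),
     (1,0,1,0), (1,1,1,1), (0,0,1,1), (0,1,0,1), (1,1,1,0)]
print(vc_dimension(F))  # 3, e.g. columns {x1, x2, x4} are shattered
```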
Sauer‐Shelah Lemma
• Lemma (Sauer-Shelah): Consider any ℱ with finite $\mathrm{VC}(\mathcal{F}) = k$ and any sample size m. Then

  $S_{\mathcal{F}}(m) \le \sum_{i=0}^{k} \binom{m}{i}$

• From basic facts of binomial coefficients:
  The bound is $O(m^k)$: finite VC ⇒ eventually polynomial growth!
  For m ≥ k, it is bounded by $(em/k)^k$
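A minimal sketch (not from the slides) comparing the Sauer-Shelah sum against the exponential $2^m$ and the $(em/k)^k$ bound:

```python
import math

def sauer_shelah(m, k):
    """Sauer-Shelah bound on the growth function: sum_{i=0}^{k} C(m, i)."""
    return sum(math.comb(m, i) for i in range(k + 1))

k = 3
for m in (5, 10, 100):
    print(m, sauer_shelah(m, k), 2 ** m, round((math.e * m / k) ** k))
# the sum grows like m^3 while 2^m explodes
```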
• Theorem (VC bound): Consider any δ > 0 and any VC-k class ℱ. Then w.h.p. at least 1 − δ, for all f ∈ ℱ:

  $R[f] \le \hat{R}_m[f] + 2\sqrt{2\,\frac{k\log(2em/k) + \log(4/\delta)}{m}}$
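Plugging numbers into the theorem, a minimal sketch (not from the slides; `vc_bound_deviation` is a made-up name) for linear classifiers in ℝ², where k = 3:

```python
import math

def vc_bound_deviation(m, k, delta):
    """Deviation term of the VC PAC bound:
    2 * sqrt(2 * (k*log(2*e*m/k) + log(4/delta)) / m)."""
    return 2 * math.sqrt(2 * (k * math.log(2 * math.e * m / k)
                              + math.log(4 / delta)) / m)

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_bound_deviation(m, k=3, delta=0.05), 3))
# the deviation shrinks roughly like sqrt(log(m)/m)
```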
VC bound big picture
• The (uniform) difference between $R[f]$ and $\hat{R}_m[f]$ is $O\left(\sqrt{\frac{\mathrm{VC}(\mathcal{F})\log m}{m}}\right)$, down from ∞
• Limiting the complexity of ℱ leads to better generalisation
• VC dimension and growth function measure the "effective" size of ℱ
• VC dimension doesn't count functions, but uses the geometry of the family: projections of family members onto possible samples
• Example: linear "gap-tolerant" classifiers (like SVMs) with "margin" Δ have VC dimension $O(1/\Delta^2)$. Maximising the "margin" reduces the VC dimension.
Mini Summary
• VC-dim is the largest set size shattered by a family
  It is d + 1 for linear classifiers in ℝᵈ
  Can calculate it on entire-domain dichotomies of a family by dropping columns and counting unique rows
• Sauer-Shelah: the growth function grows only polynomially in the set size beyond the VC-dim
• As a result, the VC PAC bound controls the uniform deviation between risk and empirical risk by $O\left(\sqrt{\frac{\mathrm{VC}(\mathcal{F})\log m}{m}}\right)$
Next: Two selected proofs. Optional but beautiful.
Two Selected Proofs
Green slides: Not examinable.
Food for thought. Soul food.
Linear classifiers in d dimensions: VC(ℱ) ≥ d + 1
• Goal: construct m = d + 1 specific points in ℝᵈ that are shattered by the linear classifier family
• Data in the rows of 𝐗 (first column: the constant bias feature):

  $\mathbf{X} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{pmatrix}$

  𝐗 is invertible!
• For any dichotomy $\mathbf{y} \in \{-1, +1\}^{d+1}$, we need 𝐰 with $\mathrm{sign}(\mathbf{X}\mathbf{w}) = \mathbf{y}$
• $\mathbf{w} = \mathbf{X}^{-1}\mathbf{y}$ works!! $\mathrm{sign}(\mathbf{X}\mathbf{w}) = \mathrm{sign}(\mathbf{X}\mathbf{X}^{-1}\mathbf{y}) = \mathrm{sign}(\mathbf{y}) = \mathbf{y}$
• We've shown that ℱ can shatter d + 1 points: $\mathrm{VC}(\mathcal{F}) \ge d + 1$
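This construction is easy to verify numerically; a minimal sketch (not from the slides):

```python
import itertools
import numpy as np

d = 4
# rows of X: d+1 points, first coordinate is the constant bias feature
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

for y in itertools.product([-1, 1], repeat=d + 1):    # every dichotomy
    w = np.linalg.solve(X, np.array(y, dtype=float))  # w = X^{-1} y
    assert np.array_equal(np.sign(X @ w), np.array(y))
print(f"all {2 ** (d + 1)} dichotomies achieved: {d + 1} points shattered")
```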
Linear classifiers in d dimensions: VC(ℱ) ≤ d + 1
• Goal: cannot shatter any set of d + 2 points
• Any $\mathbf{x}_1, \dots, \mathbf{x}_{d+2}$ have more points than dimensions, so they are linearly dependent:
  $\mathbf{x}_j = \sum_{i \ne j} a_i \mathbf{x}_i$ for some j, where not all $a_i$'s are zero
• Is every dichotomy 𝐲 possible? Consider
  $y_i = \mathrm{sign}(a_i)$ if $i \ne j$, and $y_j = -1$
  Suppose some 𝐰 generated the labels for $i \ne j$: $\mathrm{sign}(a_i) = \mathrm{sign}(\mathbf{w}'\mathbf{x}_i)$, so $a_i \mathbf{w}'\mathbf{x}_i \ge 0$
  Can 𝐰 also generate the label for $i = j$??
  $\mathbf{w}'\mathbf{x}_j = \mathbf{w}' \sum_{i \ne j} a_i \mathbf{x}_i = \sum_{i \ne j} a_i \mathbf{w}'\mathbf{x}_i \ge 0$, so $\mathrm{sign}(\mathbf{w}'\mathbf{x}_j) \ne y_j$
• We've shown ℱ cannot shatter d + 2 points, in other words $\mathrm{VC}(\mathcal{F}) \le d + 1$
Proof of Sauer‐Shelah Lemma (by Haussler ’95)
• To show that the growth function $S_{\mathcal{F}}(m) \le \sum_{i=0}^{k} \binom{m}{i}$, we prove the bound for the dichotomies $\mathcal{F}(x_1, \dots, x_m)$ of an arbitrary sample, since $S_{\mathcal{F}}(m) = \sup_{\mathbf{x}} |\mathcal{F}(\mathbf{x})|$
• Write $\mathbf{Y} = \mathcal{F}(x_1, \dots, x_m) \subseteq \{0,1\}^m$, where we relabel $-1 \to 0$.
• Definition: Consider any column $1 \le i \le m$ and dichotomy 𝐲 ∈ 𝐘. The shift operator $H_i(\mathbf{y}; \mathbf{Y})$ returns 𝐲 if there exists some 𝐲′ ∈ 𝐘 differing from 𝐲 only in the ith coordinate; otherwise it returns 𝐲 with $y_i = 0$.
  Define $H_i(\mathbf{Y}) = \{ H_i(\mathbf{y}; \mathbf{Y}) : \mathbf{y} \in \mathbf{Y} \}$, the shifting of all dichotomies.
  Intuition: Shifting along a column drops a 1 to 0 in that column, so long as no other row would become duplicated.
• Definition: A set of dichotomies $\mathbf{V} \subseteq \{0,1\}^m$ is called closed below if for all $1 \le i \le m$, shifting does nothing: $H_i(\mathbf{V}) = \mathbf{V}$.
  Intuition: Every 𝐯 ∈ 𝐕 has, for every $1 \le i \le m$ with $v_i = 1$, some 𝐮 ∈ 𝐕 the same as 𝐯 except with $u_i = 0$.
Proof of Sauer‐Shelah Lemma (by Haussler ’95)
• Example set of 6 unique dichotomies on m = 3 points, with VC = 2:

    x1  x2  x3
     0   0   0
     0   1   1
     1   0   0
     1   1   0
     1   0   1
     1   1   1

[Figure: the dichotomies drawn as vertices of the cube {0,1}³; shifting down along i = 1, then down along i = 2, produces a set that is closed below.]
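The shift operator applied to this example, as a minimal sketch (not from the slides; `shift` is a made-up name):

```python
def shift(Y, i):
    """Haussler shift H_i: set coordinate i of each y to 0, unless the result
    would collide with a dichotomy already in Y (then keep y unchanged)."""
    shifted = set()
    for y in Y:
        down = y[:i] + (0,) + y[i + 1:]
        shifted.add(y if (y[i] == 1 and down in Y) else down)
    return shifted

Y = {(0,0,0), (0,1,1), (1,0,0), (1,1,0), (1,0,1), (1,1,1)}
Y1 = shift(Y, 0)                 # shift down along column i = 1
Y2 = shift(Y1, 1)                # then down along column i = 2
print(len(Y), len(Y1), len(Y2))  # 6 6 6: cardinality preserved
print(all(shift(Y2, i) == Y2 for i in range(3)))  # True: closed below
```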
Proof of Sauer‐Shelah Lemma (by Haussler ’95)
• Goal: show that (1) shifting preserves cardinality and doesn't increase VC dimension, all the way to a closed-below end; and (2) closed-below sets satisfy the desired Sauer-Shelah bound
• Shifting property 1: $|H_i(\mathbf{Y})| = |\mathbf{Y}|$ for any 𝐘.
  Proof: no two dichotomies in 𝐘 shift to the same dichotomy
• Shifting property 2: $\mathrm{VC}(H_i(\mathbf{Y})) \le \mathrm{VC}(\mathbf{Y})$ for any i, 𝐘.
  Proof sketch: if $H_i(\mathbf{Y})$ shatters a subset of points, then so too does 𝐘
• Shifting property 3: if 𝐘 is closed below, then every dichotomy 𝐲 ∈ 𝐘 has at most VC(𝐘)-many coordinates with $y_i = 1$ (the rest 0).
  Therefore $|\mathbf{Y}| \le \sum_{i=0}^{\mathrm{VC}(\mathbf{Y})} \binom{m}{i}$ by counting
  Proof sketch: if some 𝐲 ∈ 𝐘 had more 1s, all label combinations on those coordinates would exist "below" it, shattering more than VC(𝐘) points
• Together: there exists a sequence of shifts $i_1, \dots, i_T$ ending in a closed-below set $\mathbf{Y}^\star = H_{i_T}(\cdots H_{i_1}(\mathbf{Y}) \cdots)$, so

  $|\mathbf{Y}| = |H_{i_1}(\mathbf{Y})| = \cdots = |\mathbf{Y}^\star| \le \sum_{i=0}^{\mathrm{VC}(\mathbf{Y}^\star)} \binom{m}{i} \le \sum_{i=0}^{\mathrm{VC}(\mathbf{Y})} \binom{m}{i} \le \sum_{i=0}^{k} \binom{m}{i}$
Mini Summary
• Linear classifiers in ℝᵈ have VC dimension d + 1
  Lower-bound the VC-dim with specific points that are shattered
  Upper-bound the VC-dim via the linear dependence of any d + 2 points
• The Sauer-Shelah lemma bounds a family's growth function by a polynomial in the sample size, of degree the VC dimension
  An ingenious shifting operator transforms sets of dichotomies into boundable closed-below sets
  Along the way it keeps cardinality and VC-dim controlled
Next time: Support vector machines