COMP90051 Statistical Machine Learning
Lecture 7. VC Theory
Lecturer: Ben Rubinstein
Copyright: University of Melbourne
This lecture
• PAC learning bounds:
Countably infinite case works as we’ve done so far
General infinite case? Needs new ideas!
• Growth functions for the general PAC case
Considering patterns of labels possible on a data set
Gives good PAC bounds provided possible patterns don’t grow too fast
in the data set size
• Vapnik‐Chervonenkis (VC) dimension
Max number of points that can be labelled in all ways
Beyond this point, growth function is polynomial in data set size
Leads to famous, general PAC bound from VC theory
• Optional proofs at end (just for fun)
Countably infinite ℱ?
• Hoeffding gives us, for a single f ∈ ℱ:

  $\Pr\left[\, \left| R[f] - \hat{R}_m[f] \right| \ge \sqrt{\frac{\log(2/\delta(f))}{2m}} \,\right] \le \delta(f)$

  …where we're free to choose (varying) δ(f) in (0,1).
• Union bound "works" (sort of) for this case:

  $\Pr\left[\, \exists f \in \mathcal{F}: \left| R[f] - \hat{R}_m[f] \right| \ge \sqrt{\frac{\log(2/\delta(f))}{2m}} \,\right] \le \sum_{f \in \mathcal{F}} \delta(f)$

• Choose confidences to sum to a constant δ, then this works
  E.g. $\delta(f) = \delta \cdot p(f)$ where $1 = \sum_{f \in \mathcal{F}} p(f)$
• By inversion: w.h.p. at least 1 − δ, for all f ∈ ℱ:

  $R[f] \le \hat{R}_m[f] + \sqrt{\frac{\log(1/p(f)) + \log(2/\delta)}{2m}}$

  (Try this for finite ℱ with uniform p(f) = 1/|ℱ|.)
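A quick numeric check of the finite-ℱ case with uniform p(f) = 1/|ℱ| (a minimal sketch, not from the slides; `pac_deviation` is a made-up helper name):

```python
import math

def pac_deviation(size_f, m, delta):
    """Deviation term of the finite-class PAC bound with uniform p(f) = 1/|F|:
    sqrt((log|F| + log(2/delta)) / (2m))."""
    return math.sqrt((math.log(size_f) + math.log(2 / delta)) / (2 * m))

# e.g. |F| = 1000 models, m = 10,000 examples, confidence 1 - delta = 0.95
print(pac_deviation(1000, 10_000, 0.05))  # ~ 0.023
```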
Ok fine, but general case?
• Much of ML has continuous parameters
  Countably infinite ℱ covers only discrete parameters
• Our argument fails!
  p(f) becomes a density
  It's zero for all f, so log(1/p(f)) blows up: we can't divide by zero!
  Need a new argument!
• Idea introduced by VC theory: intuition
  Don't focus on the whole class ℱ as if each f is different
  Focus on differences over the sample $Z_1, \dots, Z_m$
Mini Summary
• Can eke out PAC bounds on countably infinite families using the Hoeffding bound + union bound
• No good for general (uncountably infinite) cases
• Need another fundamentally new idea
Next: Organising analysis around patterns of labels possible on a data set, to avoid worst-case bad events
Growth Function
Focusing on the size of model families on data samples
Bad events: Unreasonably worst case?
• Bad event ℬ_f for model f:
  $\left| R[f] - \hat{R}_m[f] \right| \ge \varepsilon$, which occurs with probability at most $2\exp(-2m\varepsilon^2)$
• Union bound: assumes bad events don't overlap!?
  $\Pr\left[ \mathcal{B}_{f_1} \text{ or } \cdots \text{ or } \mathcal{B}_{f_{|\mathcal{F}|}} \right] \le \Pr[\mathcal{B}_{f_1}] + \cdots + \Pr[\mathcal{B}_{f_{|\mathcal{F}|}}] \le 2|\mathcal{F}|\exp(-2m\varepsilon^2)$
[Figure: two views of the sample space Ω containing bad events ℬ_{f1}, ℬ_{f2}, ℬ_{f3}. Left: "Tight bound: no overlaps". Right: "Loose bound: overlaps".]
How do overlaps arise?
[Figure: two similar classifiers f1 and f2 shown on the whole population and on a sample, each splitting the space into −1/+1 regions with nearly identical true-error regions: significantly overlapping events ℬ_{f1} and ℬ_{f2}.]
How do overlaps arise?
VC theory focuses on the pattern of labels any 𝑓 ∈ ℱ could make
Dichotomies and Growth Function
• Definition: Given a sample $x_1, \dots, x_m$ and family ℱ, a dichotomy is a tuple $(f(x_1), \dots, f(x_m)) \in \{-1, +1\}^m$ for some f ∈ ℱ.
• The unique dichotomies $\mathcal{F}(\mathbf{x}) = \{ (f(x_1), \dots, f(x_m)) : f \in \mathcal{F} \}$ are the patterns of labels possible with the family
• Even when ℱ is infinite, $|\mathcal{F}(\mathbf{x})| \le 2^m$ (why?)
• And also (relevant for finite, tiny ℱ), $|\mathcal{F}(\mathbf{x})| \le |\mathcal{F}|$ (why?)
• Intuition: $|\mathcal{F}(\mathbf{x})|$ might replace |ℱ| in the union bound? But how do we remove the dependence on 𝐱?
• Definition: The growth function $S_{\mathcal{F}}(m) = \sup_{\mathbf{x} \in \mathcal{D}^m} |\mathcal{F}(\mathbf{x})|$ is the max number of label patterns achievable by ℱ on any sample of size m.
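To make $|\mathcal{F}(\mathbf{x})|$ concrete, here's a minimal sketch (not from the slides) that lower-bounds the number of dichotomies linear classifiers achieve on a given sample, by drawing random weights and collecting distinct sign patterns; with enough draws it finds every achievable pattern with high probability:

```python
import numpy as np

def sampled_dichotomies(X, n_draws=100_000, seed=0):
    """Monte Carlo lower bound on |F(x)| for linear classifiers sign(w.x + b):
    draw random (w, b), record each distinct label pattern on the rows of X."""
    rng = np.random.default_rng(seed)
    patterns = set()
    for _ in range(n_draws):
        w, b = rng.standard_normal(X.shape[1]), rng.standard_normal()
        patterns.add(tuple(np.sign(X @ w + b).astype(int)))
    return patterns

X3 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # 3 points, general position
print(len(sampled_dichotomies(X3)))                      # 8 = 2^3: shattered
X4 = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], float)   # add a 4th point
print(len(sampled_dichotomies(X4)))                      # 14 < 16: XOR patterns missing
```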
$S_{\mathcal{F}}(3)$ for linear classifiers in 2D?
[Figure: 3 points in general position, with a line realising each of the 8 labellings.]
$S_{\mathcal{F}}(3) = 8$
$S_{\mathcal{F}}(3)$ for linear classifiers in 2D?
[Figure: 3 collinear points; two labellings (marked X) are not achievable by any line.]
$|\mathcal{F}(\mathbf{x})| = 6$, but still $S_{\mathcal{F}}(3) = 8$ (the growth function takes the sup over samples)
$S_{\mathcal{F}}(4)$ for linear classifiers in 2D?
• What about 4 points?
• Can never produce the criss-cross (XOR) dichotomy
• In fact $S_{\mathcal{F}}(4) = 14 < 2^4$
• Guess/exercise: what about general m and dimension d?
PAC Bound with Growth Function
• Theorem: Consider any δ > 0 and any class ℱ. Then w.h.p. at least 1 − δ, for all f ∈ ℱ:

  $R[f] \le \hat{R}_m[f] + 2\sqrt{2\,\frac{\log S_{\mathcal{F}}(2m) + \log(4/\delta)}{m}}$

• Proof: out of scope ("only" 2–3 pages), optional reading.
• Compare to PAC bounds so far:
  A few negligible extra constants (the 2s, the 4)
  |ℱ| has become $S_{\mathcal{F}}(2m)$
  $S_{\mathcal{F}}(m) \le |\mathcal{F}|$: no "worse" than the union bound for finite ℱ
  $S_{\mathcal{F}}(m) = 2^m$ is very bad: a big family with an exponential growth function gets $R[f] \le \hat{R}_m[f] + \text{Big Constant}$. Even $R[f] \le \hat{R}_m[f] + 1$ is meaningless!!
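To see the last point concretely: substituting $S_{\mathcal{F}}(2m) = 2^{2m}$ into the theorem gives a deviation term of $2\sqrt{2\,\frac{2m\log 2 + \log(4/\delta)}{m}} \ge 2\sqrt{4\log 2} \approx 3.33$, a constant that never shrinks as m grows, while risks themselves lie in [0, 1].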
Mini Summary
• The previous PAC bound approach, which organises bad events by model and applies the union bound, is only tight if bad events are disjoint
• In reality some models generate overlapping bad events
• Better to organise families by possible patterns of labels on a data set: the dichotomies of the family
• Counting possible dichotomies gives the growth function
• The PAC bound with the growth function potentially tackles general (uncountably infinite) families, provided the growth function is sub-exponential in data size
Next: VC dimension for a computable bound on growth functions, with the polynomial behaviour we need! Gives our final, general PAC bound
The VC dimension
Computable, bounds growth function
Vapnik‐Chervonenkis dimension
• Definition: The VC dimension $\mathrm{VC}(\mathcal{F})$ of a family ℱ is the largest m such that $S_{\mathcal{F}}(m) = 2^m$.
  Points $\mathbf{x} = (x_1, \dots, x_m)$ are shattered by ℱ if $|\mathcal{F}(\mathbf{x})| = 2^m$
  So VC(ℱ) is the size of the largest set shattered by ℱ
• Example: linear classifiers in ℝ² have VC(ℱ) = 3
• Guess: VC-dim of linear classifiers in ℝᵈ?
[Figure: a point set that is shattered vs. a configuration that is not shattered.]
Example: VC(ℱ) from ℱ(𝐱) on the whole domain?

• Columns are all points in the domain; each row is a dichotomy on the entire input domain:

    x1  x2  x3  x4
     0   0   0   0
     0   1   1   0
     1   0   0   1
     1   1   0   1
     0   1   0   0
     1   0   1   0
     1   1   1   1
     0   0   1   1
     0   1   0   1
     1   1   1   0

• Obtain dichotomies on a subset of points 𝐱′ ⊆ {x1, …, x4} by: drop columns, drop duplicate rows
• ℱ shatters 𝐱′ if the number of remaining rows is $2^{|\mathbf{x}'|}$

This example:
• Dropping column x3 (duplicates shown below) leaves 8 unique rows behind, so ℱ shatters {x1, x2, x4}:

    x1  x2  x4
     0   0   0
     0   1   0
     1   0   1
     1   1   1
     0   1   0
     1   0   0
     1   1   1
     0   0   1
     0   1   1
     1   1   0

• The original table has only 10 < 2⁴ rows, so ℱ doesn't shatter more than 3 points
• Hence VC(ℱ) = 3

(Note we're using labels {0,1} instead of {−1,+1}. Why is that OK?)
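The drop-columns-and-count-rows procedure is mechanical; here is a minimal sketch (not from the slides; `shatters` and `vc_dimension` are made-up helper names) that computes the VC dimension of the table above:

```python
from itertools import combinations

def shatters(dichotomies, cols):
    """True if projecting the rows onto `cols` yields all 2^|cols| patterns."""
    projected = {tuple(row[c] for c in cols) for row in dichotomies}
    return len(projected) == 2 ** len(cols)

def vc_dimension(dichotomies):
    """Largest k such that some k columns are shattered (drop columns, dedupe rows)."""
    m = len(dichotomies[0])
    for k in range(1, m + 1):
        if not any(shatters(dichotomies, cols) for cols in combinations(range(m), k)):
            return k - 1  # no k-set shattered => no larger set can be
    return m

F = [(0,0,0,0), (0,1,1,0), (1,0,0,1), (1,1,0,1), (0,1,0,0),
     (1,0,1,0), (1,1,1,1), (0,0,1,1), (0,1,0,1), (1,1,1,0)]
print(vc_dimension(F))  # 3, e.g. columns {x1, x2, x4} are shattered
```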
Sauer‐Shelah Lemma
• Lemma (Sauer-Shelah): Consider any ℱ with finite $\mathrm{VC}(\mathcal{F}) = k$ and any sample size m. Then

  $S_{\mathcal{F}}(m) \le \sum_{i=0}^{k} \binom{m}{i}$

• From basic facts of binomial coefficients:
  The bound is $O(m^k)$: finite VC ⇒ eventually polynomial growth!
  For m ≥ k, it is bounded by $(em/k)^k$
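A minimal sketch (not from the slides) comparing the Sauer-Shelah sum against the exponential $2^m$ and the $(em/k)^k$ bound:

```python
import math

def sauer_shelah(m, k):
    """Sauer-Shelah bound on the growth function: sum_{i=0}^{k} C(m, i)."""
    return sum(math.comb(m, i) for i in range(k + 1))

k = 3
for m in (5, 10, 100):
    print(m, sauer_shelah(m, k), 2 ** m, round((math.e * m / k) ** k))
# the sum grows like m^3 while 2^m explodes
```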
• Theorem (VC bound): Consider any δ > 0 and any VC-k class ℱ. Then w.h.p. at least 1 − δ, for all f ∈ ℱ:

  $R[f] \le \hat{R}_m[f] + 2\sqrt{2\,\frac{k\log(2em/k) + \log(4/\delta)}{m}}$
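Plugging numbers into the theorem, a minimal sketch (not from the slides; `vc_bound_deviation` is a made-up name) for linear classifiers in ℝ², where k = 3:

```python
import math

def vc_bound_deviation(m, k, delta):
    """Deviation term of the VC PAC bound:
    2 * sqrt(2 * (k*log(2*e*m/k) + log(4/delta)) / m)."""
    return 2 * math.sqrt(2 * (k * math.log(2 * math.e * m / k)
                              + math.log(4 / delta)) / m)

for m in (100, 1_000, 10_000, 100_000):
    print(m, round(vc_bound_deviation(m, k=3, delta=0.05), 3))
# the deviation shrinks roughly like sqrt(log(m)/m)
```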
VC bound big picture
• The (uniform) difference between $R[f]$ and $\hat{R}_m[f]$ is $O\left(\sqrt{\frac{\mathrm{VC}(\mathcal{F})\log m}{m}}\right)$, down from ∞
• Limiting the complexity of ℱ leads to better generalisation
• VC dimension and growth function measure the "effective" size of ℱ
• VC dimension doesn't count functions, but uses the geometry of the family: projections of family members onto possible samples
• Example: linear "gap-tolerant" classifiers (like SVMs) with "margin" Δ have VC dimension $O(1/\Delta^2)$. Maximising the "margin" reduces the VC dimension.
Mini Summary
• VC-dim is the largest set size shattered by a family
  It is d + 1 for linear classifiers in ℝᵈ
  Can calculate it on entire-domain dichotomies of a family by dropping columns and counting unique rows
• Sauer-Shelah: the growth function grows only polynomially in the set size beyond the VC-dim
• As a result, the VC PAC bound controls the uniform deviation between risk and empirical risk by $O\left(\sqrt{\frac{\mathrm{VC}(\mathcal{F})\log m}{m}}\right)$
Next: Two selected proofs. Optional but beautiful.
Two Selected Proofs
Green slides: Not examinable.
Food for thought. Soul food.
Linear classifiers in d dimensions: VC(ℱ) ≥ d + 1
• Goal: construct m = d + 1 specific points in ℝᵈ that are shattered by the linear classifier family
• Data in the rows of 𝐗 (first column: the constant bias feature):

  $\mathbf{X} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 1 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \ddots & \vdots \\ 1 & 0 & 0 & \cdots & 1 \end{pmatrix}$

  𝐗 is invertible!
• For any dichotomy $\mathbf{y} \in \{-1, +1\}^{d+1}$, we need 𝐰 with $\mathrm{sign}(\mathbf{X}\mathbf{w}) = \mathbf{y}$
• $\mathbf{w} = \mathbf{X}^{-1}\mathbf{y}$ works!! $\mathrm{sign}(\mathbf{X}\mathbf{w}) = \mathrm{sign}(\mathbf{X}\mathbf{X}^{-1}\mathbf{y}) = \mathrm{sign}(\mathbf{y}) = \mathbf{y}$
• We've shown that ℱ can shatter d + 1 points: $\mathrm{VC}(\mathcal{F}) \ge d + 1$
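This construction is easy to verify numerically; a minimal sketch (not from the slides):

```python
import itertools
import numpy as np

d = 4
# rows of X: d+1 points, first coordinate is the constant bias feature
X = np.hstack([np.ones((d + 1, 1)), np.vstack([np.zeros(d), np.eye(d)])])

for y in itertools.product([-1, 1], repeat=d + 1):    # every dichotomy
    w = np.linalg.solve(X, np.array(y, dtype=float))  # w = X^{-1} y
    assert np.array_equal(np.sign(X @ w), np.array(y))
print(f"all {2 ** (d + 1)} dichotomies achieved: {d + 1} points shattered")
```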
Linear classifiers in d dimensions: VC(ℱ) ≤ d + 1
• Goal: cannot shatter any set of d + 2 points
• Any $\mathbf{x}_1, \dots, \mathbf{x}_{d+2}$ have more points than dimensions, so they are linearly dependent:
  $\mathbf{x}_j = \sum_{i \ne j} a_i \mathbf{x}_i$ for some j, where not all $a_i$'s are zero
• Is every dichotomy 𝐲 possible? Consider
  $y_i = \mathrm{sign}(a_i)$ if $i \ne j$, and $y_j = -1$
  Suppose some 𝐰 generated the labels for $i \ne j$: $\mathrm{sign}(a_i) = \mathrm{sign}(\mathbf{w}'\mathbf{x}_i)$, so $a_i \mathbf{w}'\mathbf{x}_i \ge 0$
  Can 𝐰 also generate the label for $i = j$??
  $\mathbf{w}'\mathbf{x}_j = \mathbf{w}' \sum_{i \ne j} a_i \mathbf{x}_i = \sum_{i \ne j} a_i \mathbf{w}'\mathbf{x}_i \ge 0$, so $\mathrm{sign}(\mathbf{w}'\mathbf{x}_j) \ne y_j$
• We've shown ℱ cannot shatter d + 2 points, in other words $\mathrm{VC}(\mathcal{F}) \le d + 1$
Proof of Sauer‐Shelah Lemma (by Haussler ’95)
• To show that the growth function $S_{\mathcal{F}}(m) \le \sum_{i=0}^{k} \binom{m}{i}$, we prove the bound for the dichotomies $\mathcal{F}(x_1, \dots, x_m)$ of an arbitrary sample, since $S_{\mathcal{F}}(m) = \sup_{\mathbf{x}} |\mathcal{F}(\mathbf{x})|$
• Write $\mathbf{Y} = \mathcal{F}(x_1, \dots, x_m) \subseteq \{0,1\}^m$, where we relabel $-1 \to 0$.
• Definition: Consider any column $1 \le i \le m$ and dichotomy 𝐲 ∈ 𝐘. The shift operator $H_i(\mathbf{y}; \mathbf{Y})$ returns 𝐲 if there exists some 𝐲′ ∈ 𝐘 differing from 𝐲 only in the ith coordinate; otherwise it returns 𝐲 with $y_i = 0$.
  Define $H_i(\mathbf{Y}) = \{ H_i(\mathbf{y}; \mathbf{Y}) : \mathbf{y} \in \mathbf{Y} \}$, the shifting of all dichotomies.
  Intuition: Shifting along a column drops a 1 to 0 in that column, so long as no other row would become duplicated.
• Definition: A set of dichotomies $\mathbf{V} \subseteq \{0,1\}^m$ is called closed below if for all $1 \le i \le m$, shifting does nothing: $H_i(\mathbf{V}) = \mathbf{V}$.
  Intuition: Every 𝐯 ∈ 𝐕 has, for every $1 \le i \le m$ with $v_i = 1$, some 𝐮 ∈ 𝐕 the same as 𝐯 except with $u_i = 0$.
Proof of Sauer‐Shelah Lemma (by Haussler ’95)
• Example set of 6 unique dichotomies on m = 3 points, with VC = 2:

    x1  x2  x3
     0   0   0
     0   1   1
     1   0   0
     1   1   0
     1   0   1
     1   1   1

[Figure: the dichotomies drawn as vertices of the cube {0,1}³; shifting down along i = 1, then down along i = 2, produces a set that is closed below.]
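The shift operator applied to this example, as a minimal sketch (not from the slides; `shift` is a made-up name):

```python
def shift(Y, i):
    """Haussler shift H_i: set coordinate i of each y to 0, unless the result
    would collide with a dichotomy already in Y (then keep y unchanged)."""
    shifted = set()
    for y in Y:
        down = y[:i] + (0,) + y[i + 1:]
        shifted.add(y if (y[i] == 1 and down in Y) else down)
    return shifted

Y = {(0,0,0), (0,1,1), (1,0,0), (1,1,0), (1,0,1), (1,1,1)}
Y1 = shift(Y, 0)                 # shift down along column i = 1
Y2 = shift(Y1, 1)                # then down along column i = 2
print(len(Y), len(Y1), len(Y2))  # 6 6 6: cardinality preserved
print(all(shift(Y2, i) == Y2 for i in range(3)))  # True: closed below
```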
Proof of Sauer‐Shelah Lemma (by Haussler ’95)
• Goal: show that (1) shifting preserves cardinality and doesn't increase VC dimension, all the way to a closed-below end; and (2) closed-below sets satisfy the desired Sauer-Shelah bound
• Shifting property 1: $|H_i(\mathbf{Y})| = |\mathbf{Y}|$ for any 𝐘.
  Proof: no two dichotomies in 𝐘 shift to the same dichotomy
• Shifting property 2: $\mathrm{VC}(H_i(\mathbf{Y})) \le \mathrm{VC}(\mathbf{Y})$ for any i, 𝐘.
  Proof sketch: if $H_i(\mathbf{Y})$ shatters a subset of points, then so too does 𝐘
• Shifting property 3: if 𝐘 is closed below, then every dichotomy 𝐲 ∈ 𝐘 has at most VC(𝐘)-many coordinates with $y_i = 1$ (the rest 0).
  Therefore $|\mathbf{Y}| \le \sum_{i=0}^{\mathrm{VC}(\mathbf{Y})} \binom{m}{i}$ by counting
  Proof sketch: if some 𝐲 ∈ 𝐘 had more 1s, all label combinations on those coordinates would exist "below" it, shattering more than VC(𝐘) points
• Together: there exists a sequence of shifts $i_1, \dots, i_T$ ending in a closed-below set $\mathbf{Y}^\star = H_{i_T}(\cdots H_{i_1}(\mathbf{Y}) \cdots)$, so

  $|\mathbf{Y}| = |H_{i_1}(\mathbf{Y})| = \cdots = |\mathbf{Y}^\star| \le \sum_{i=0}^{\mathrm{VC}(\mathbf{Y}^\star)} \binom{m}{i} \le \sum_{i=0}^{\mathrm{VC}(\mathbf{Y})} \binom{m}{i} \le \sum_{i=0}^{k} \binom{m}{i}$
Mini Summary
• Linear classifiers in ℝᵈ have VC dimension d + 1
  Lower-bound the VC-dim with specific points that are shattered
  Upper-bound the VC-dim via the linear dependence of any d + 2 points
• The Sauer-Shelah lemma bounds a family's growth function by a polynomial in the sample size, of degree the VC dimension
  An ingenious shifting operator transforms sets of dichotomies into boundable closed-below sets
  Along the way it keeps cardinality and VC-dim controlled
Next time: Support vector machines