The Vapnik-Chervonenkis Dimension
Prof. Dan A. Simovici
UMB
1/96
Outline
1 Basic Definitions for Vapnik-Chervonenkis Dimension
2 Growth Functions
3 The Sauer-Shelah Theorem
4 The Link between VCD and PAC Learning
5 The VCD of Collections of Sets
2/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Trace of a Collection of Sets
Definition
Let C be a collection of sets and let U be a set. The trace of collection C on the set U is the collection
CU = {U ∩ C | C ∈ C}.
If the trace of C on U, CU equals P(U), then we say that U is shattered
by C.
U is shattered by C if C can carve any subset of U as an intersection with a set in C.
3/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Example
Let U = {u1, u2} and let C be the collection of sets
C = {{u3}, {u1, u3}, {u2, u3}, {u1, u2, u3}}.
C shatters U because we can write:
∅ = {u1} = {u2} = {u1,u2} =
U∩{u3} U∩{u1,u3} U∩{u2,u3} U∩{u1,u2,u3}
4/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Definition
The Vapnik-Chervonenkis dimension of the collection C (called the VC-dimension for brevity) is the largest size of a set K that is shattered by C.
This largest size is denoted by VCD(C).
5/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Example
Note that the previous collection C cannot shatter the set
U′ = {u1, u2, u3} because this set has 8 subsets and C has just four sets. Thus, if is impossible to express all subsets of U′ as intersections of U′ with some set of C.
The VCD dimension of the collection C is 2.
6/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Note that:
We have VCD(C) = 0 if and only if |C| = 1.
If VCD(C) = d, then there exists a set K of size d such that for each subset L of K there exists a set C ∈ C such that L = K ∩ C .
C shatters K if and only if the trace of C on K denoted by CK shatters K. This allows us to assume without loss of generality that both the sets of the collection C and a set K shattered by C are subsets of a set U.
7/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Collections of Sets as Sets of Hypotheses
Let U be a set, K a subset, and let C be a collection of sets.
Each C ∈ C defines a hypothesis hC : U −→ {−1, 1} that is a dichotomy,
where
1 if u ∈ C, hC(u)= −1 ifu̸∈C.
K is shattered by C if and only if for every subset L of K there exists a hypothesis hC such that the set of positive examples {u ∈ U | hC (u) = 1} equals L.
8/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Finite Collections have Finite VC-Dimension
Let C be a collection of sets with VCD(C) = d and let K be a set shattered by C with |K| = d.
Since there exist 2d subsets of K, there are at least 2d subsets of C, so
2d |C|.
Consequently, VCD(C) log2 |C|. This shows that if C is finite, then VCD(C) is finite.
The converse is false: there exist infinite collections C that have a finite VC -dimension.
9/96
Basic Definitions for Vapnik-Chervonenkis Dimension
A Tabular Representation of Collections
If U = {u1,…,un} is a finite set, then the trace of a collection
C = {C1,…,Cp} of subsets of U on a subset K of U can be presented in an intuitive, tabular form.
Let θ be a table containing the rows t1, . . . , tp and the binary attributes u1,…,un.
Each tuple tk corresponds to a set Ck of C and is defined by
1 ifui∈Ck, tk [ui ] = 0 otherwise,
for 1 i n. Then, C shatters K if the content of the projection r[K] consists of 2|K| distinct rows.
10/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Example
Let U = {u1,u2,u3,u4} and let
C = {{u2, u3}, {u1, u3, u4}, {u2, u4}, {u1, u2}, {u2, u3, u4}} represented by:
TC
u1
u2
u3
u4
0 1 0 1 0
1 0 1 1 1
1 1 0 0 1
0 1 1 0 1
The set K = {u1, u3} is shattered by the collection C because the projection on K ((0, 1), (1, 1), (0, 0), (1, 0), (0, 1)). contains the all four necessary tuples (0, 1), (1, 1), (0, 0), and (1, 0).
No subset K of U that contains at least three elements can be shattered by C because this would require the projection r[K] to contain at least eight tuples. Thus, VCD(C) = 2.
11/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Observations:
Every collection of sets shatters the empty set.
If C shatters a set of size n, then it shatters a set of size p, where p n.
For a collection of sets C and for m ∈ N, let ΠC[m]=max{|CK| | |K|=m}
be the largest number of distinct subsets of a set having m elements that can be obtained as intersections of the set with members of C.
We have ΠC[m] 2m;
if C shatters a set of size m, then ΠC[m] = 2m.
12/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Definition
A Vapnik-Chervonenkis class (or a VC class) is a collection C of sets such that VCD(C) is finite.
13/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Example
Let R be the set of real numbers and let I be the collection of sets {(−∞,t) | t ∈ R}.
We claim that any singleton is shattered by I. Indeed, if S = {x} is a singleton, then P({x}) = {∅,{x}}. Thus, if t x, we have
(−∞, t) ∩ S = {x}; also, if t < x, we have (−∞, t) ∩ S = ∅, so
IS = P(S).
There is no set S with |S| = 2 that can be shattered by I. Indeed, suppose that S = {x,y}, where x < y. Then, any member of I that contains y includes the entire set S, so IS = {∅,{x},{x,y}} ≠ P(S). This shows that I is a VC class and VCD(I) = 1.
14/96
Basic Definitions for Vapnik-Chervonenkis Dimension
Example
Consider the collection I = {[a, b] | a, b ∈ R, a b} of closed intervals. We claim that VCD(I) = 2. To justify this claim, we need to show that there exists a set S = {x,y} such that IS = P(S) and no three-element set can be shattered by I.
For the first part of the statement, consider the intersections [u,v]∩S=∅, wherev
Then, there exists a subset T of S having at least k + 1 elements such that C shatters T.
44/96
The Sauer-Shelah Theorem
Proof
Let |SH(C)| be the number of sets shattered by C. We have |SH(C)| |C| by the previous theorem.
Let Pk(S) be the collection of subsets of S that contain k or fewer elements.
The inequality of the theorem means that |C| > |Pk(S)|, hence
|SH(C)| > |Pk(S)|. Therefore, there exists a subset T of S with at least k + 1 elements that is shattered by C.
45/96
The Sauer-Shelah Theorem
Theorem
Let φ : N2 −→ N be the function defined by
φ(d,m) = We have
for d,m ∈ N.
1
φ(d,m−1)+φ(d −1,m−1),
m φ(d,m)= d
if m = 0 or d = 0 otherwise.
46/96
The Sauer-Shelah Theorem
Proof
The argument is by strong induction on s = d + m.
The base case, s = 0, implies m = 0 and d = 0, and the equality is immediate.
47/96
The Sauer-Shelah Theorem
Suppose that the equality holds for φ(d′, m′), where d′ + m′ < d + m. We have:
φ(d,m−1)+φ(d −1,m−1) (by definition)
φ(d,m) =
= d m−1 + d −1 m−1
i=0 i i=0 i (by inductive hypothesis)
= d m−1 + d m−1 i=0 i i=1 i−1
(by changing the summation index in the second sum)
= d m−1 + d m−1
i=0 i i=0 (because m−1 = 0)
i−1
−1
= d m−1 + m−1
i=0 i i−1 = d m=m,
i=0i d which gives the desired conclusion.
48/96
The Sauer-Shelah Theorem
Another Inequality
Suppose that VCD(C) = d and |S| = n. Then SH(C) ⊆ Pd(S), hence
d n n
|C||SH(C)|
Together with the previous inequality we obtain:
i = d . 2 |C| d =φ(n,d).
d n
i=1
49/96
The Sauer-Shelah Theorem
Lemma
For d ∈ N and d 2 we have
d−1 dd
2 d!.
Proof: The argument is by induction on d. In the basis step, d = 2 both members are equal to 2.
50/96
The Sauer-Shelah Theorem
Suppose the inequality holds for d. We have
(d + 1)d+1 (d + 1)!
(d + 1)d dd = d! = d! ·
(d + 1)d dd
1d
= d!· 1+d 2d· 1+d 2d
dd 1d
(by inductive hypothesis)
because
1d 1
1 + d 1 + d d = 2.
This concludes the proof of the inequality.
51/96
The Sauer-Shelah Theorem
Lemma
Wehaveφ(d,m)2md foreverymdandd1. d!
Proof: The argument is by induction on d and n. If d = 1, then
φ(1, m) = m + 1 2m for m 1, so the inequality holds for every m 1, when d = 1.
52/96
The Sauer-Shelah Theorem
Proof (cont’d)
If m = d 2, then φ(d,m) = φ(d,d) = 2d and the desired inequality follows immediately from a previous Lemma.
Suppose that the inequality holds for m > d 1. We have
φ(d,m+1) =
φ(d,m)+φ(d −1,m) (by the definition of φ)
md md−1
2d!+2(d−1)!
(by inductive hypothesis)
= 2 md−1 1+m. (d − 1)! d
53/96
The Sauer-Shelah Theorem
Proof (cont’d)
It is easy to see that the inequality
2 md−1 1+m2(m+1)d
is equivalent to
(d − 1)! d d! d 1d
m+1 1+m
and, therefore, is valid. This yields immediately the inequality of the lemma.
54/96
The Sauer-Shelah Theorem
The Asymptotic Behavior of the Function φ
Theorem
The function φ satisfies the inequality: φ(d,m) < emd
d
for every m d and d 1.
Proof: By a previous Lemma, φ(d,m) 2md . Therefore, we need to
show only that
d!
dd
2 e
67/96
The Sauer-Shelah Theorem
Example
Suppose that C is finite. For any fixed set C0 ⊕ C ∈ ∆C0,ε(C), the probability that we fail to hit C0 ⊕ C in m random examples is at most
(1 − ε)m. Thus, the probability that we fail to hit some C0 ⊕ C ∈ ∆C0,ε(C) is bounded above by |C|(1 − ε)m.
68/96
The Link between VCD and PAC Learning
The Double Sample Theorem
Theorem
Let C be a concept class with VCD(C) = d.
Let A be any algorithm that given a set S of m labeled examples {(xi,c(xi)) | 1 i m} sampled iid according to some fixed but unknown distribution D over the instance space X produces as output a hypothesis h that is consistent with c. Then, A is a PAC algorithm and
1 1 d 1 mk0 εlogδ+εlogε .
for some positive constant k0.
69/96
The Link between VCD and PAC Learning
Proof
Draw a sample S1 of size m from D and let A be the event that the elements of S1 fail to form an ε-net for ∆C0,ε(C).
If A occurs, then S1 misses some region T, where T ∈ ∆C0,ε(C).
Fix this region T and draw an additional sample S2 of size m from D.
70/96
The Link between VCD and PAC Learning
Let V be a binomial random variable that gives the number of hits of T bythesampleS2. WehaveE(V)=mεandvar(V)=mε(1−ε)because the probability of an element of S2 hitting T is ε.
By Chebyshev’s Inequality applied to V we have
P(|V −mε|a) mε(1−ε). a2
Taking a = εm it follows that 2
P(|V−mε|εm) 4(1−ε) 2 εm
provided that m 8 . ε
41, εm 2
71/96
The Link between VCD and PAC Learning
Thus, if m 8 , ε
The inequality
P(|V −εm| εm) 1. 22
|V − εm| εm 2
is equivalent to εm V 3εm, which implies P(V εm) 1. 2222
72/96
The Link between VCD and PAC Learning
Proof (cont’d)
To summarize: we have calculated the probability that S2 will hit T many times given that T was fixed using the previous sampling, that is, given that S1 does not form an ε-net.
Let B be the event that S1 does not form an ε-net and that S2 hits T at least εm times. Then, we have shown that for m = O(1/ε) we have
2
P(B|A) 1. 2
73/96
The Link between VCD and PAC Learning
Proof (cont’d)
Since P(B|A) 1 we have 2
P(B) = P(B|A)P(A) 1P(A). 2
Our goal of bounding P(A) is equivalent to finding δ such that P(B) δ 2
because this would imply P(A) δ.
74/96
The Link between VCD and PAC Learning
Proof (cont’d)
Let S = S1 ∪ S2 be a random sample of 2m. Note that since the samples
are iid obtaining S is equivalent of sampling S1 and S2 separately and let
T beafixedsetsuchthat|T|εm. 2
Consider a random partition of S into S1 and S2 and consider the
probability that S1 ∩ T = ∅.
An Equivalent Problem: we have 2m balls each colored red or blue with
exactly l red balls, where l εm . Divide the 2m balls into groups of equal 2
size S1 and S2. Find an upper bound on the probability that all l balls fall in S2 (that is, the probability that S1 ∩ R = ∅).
75/96
The Link between VCD and PAC Learning
Proof (cont’d)
Yet Another Equivalent Problem: Divide 2m non-colored balls into S1 and
S2, choose l to be colored red, and compute the probability that all red
balls fall in S2. The probability of this taking place is: m
l 2m l
Note that
m l−1 m−i l−
l−11 1 εm 2m= 2m−i 2=2l=2 2.
l i=0 i=0
This is the probability for a fixed S and T. The probability that this
occurs for some T ∈ ∆C0,ε(S) such that |T| εm can be computed by 2
summing over all T and applying the union bound:
P(B)
|Π (εm)|2−εm |Π (εm)|2−εm ∆C ,ε(S) 2 ∆C (S) 2
0202
2εmd εm δ 2−2 .
d2
76/96
The Link between VCD and PAC Learning
Proof (cont’d)
The last inequality implies
mk0 εlogδ+εlogε .
for some positive constant k0.
1 1 d 1
77/96
The Link between VCD and PAC Learning
Optional Material
78/96
The VCD of Collections of Sets
Let u : B2k −→ B2 be a Boolean function of k arguments and let C1,…,Ck be k subsets of a set U. Define the set u(C1,…,Ck) as the subset C of U whose indicator function is IC = u(IC1 , . . . , ICk ).
Example
If u : B2 −→ B2 is the Boolean function u(a1, a2) = a1 ∨ a2, then
u(C1, C2) is C1 ∪ C2; similarly, if u(x1, x2) = x1 ⊕ x2, then u(C1, C2) is the symmetric difference C1 ⊕ C2 for every C1, C2 ∈ P(U).
79/96
The VCD of Collections of Sets
Let u : B2k −→ B2 and C1,…,Ck are k family of subsets of U, the family of sets u(C1,…,Ck) is
u(C1,…,Ck) = {u(C1,…,Ck) | C1 ∈ C1,…,Ck ∈ Ck}. Theorem
Let α(k) be the least integer a such that a > k. log(ea)
If C1, . . . , Ck are k collections of subsets of the set U such that
d = max{VCD(Ci) | 1 i k} and u : B2 −→ B2 is a Boolean function, then
VCD(u(C1,…,Ck)) α(k)·d.
80/96
The VCD of Collections of Sets
Proof
Let S be a subset of U that consists of m elements. The collection (Ci )S is not larger than φ(d,m). For a set in the collection W ∈ u(C1,…,Ck)S wecanwriteW =S∩u(C1,…,Ck),or,equivalently,
1W =1S ·u(1C1,…,1Ck).
There exists a Boolean function gS such that
1S ·u(1C1,…,1Ck)=gS(1S ·1C1,…,1S ·1Ck)=gS(1S∩C1,…,1S∩Ck).
Since there are at most φ(d , m) distinct sets of the form S ∩ Ci for every i , 1 i k, it follows that there are at most (φ(d,m))k distinct sets W, hence u(C1,…,Ck)[m] (φ(d,m))k.
81/96
The VCD of Collections of Sets
Proof (cont’d)
By a previous theorem,
u(C1,…,Ck)[m] d .
We observed that if ΠC[m] < 2m, then VCD(C) < m. Therefore, to limit
the Vapnik-Chervonenkis dimension of the collection u(C1, . . . , Ck ) it
suffices to require that emkd < 2m. d
emkd
Let a = m . The last inequality can be written as (ea)kd < 2ad ; d
equivalently, we have (ea)k < 2a, which yields k < a . If α(k) is the log(ea)
least integer a such that k < a , then m α(k)d, which gives our log(ea)
conclusion.
82/96
The VCD of Collections of Sets
Example
If k = 2, the least integer a such that a > 2 is k = 10, as it can be log(ea)
seen by graphing this function; thus, if C1,C2 are two collection of concepts with VCD(C1) = VCD(C2) = d, the Vapnik-Chervonenkis dimension of the collections C1 ∨ C2 or C1 ∧ C2 is not larger than 10d .
83/96
The VCD of Collections of Sets
Lemma
Let S,T be two sets and let f : S −→ T be a function. If D is a collection of subsets of T, U is a finite subset of S and C = f −1(D) is the collection {f−1(D) | D∈D},then|CU||Df(U)|.
Proof: Let V = f(U) and denote f`U by g. For D,D′ ∈ D we have
(U ∩f−1(D))⊕(U ∩f−1(D′))
= U∩(f−1(D)⊕f−1(D′))=U∩(f−1(D⊕D′))
= g−1(V ∩(D⊕D′))=g−1(V ∩D)⊕g−1(V ⊕D′).
Thus, C = U ∩ f −1(D) and C′ = U ∩ f −1(D′) are two distinct members of CU, then V ∩ D and V ∩ D′ are two distinct members of Df (U). This implies |CU| |Df (U)|.
84/96
The VCD of Collections of Sets
Theorem
Let S,T be two sets and let f : S −→ T be a function. If D is a collection of subsets of T and C = f −1(D) is the collection {f −1(D) | D ∈ D}, then VCD(C) VCD(D). Moreover, if f is a surjection, then
VCD(C) = VCD(D).
85/96
The VCD of Collections of Sets
Proof
Suppose that C shatters an n-element subset K = {x1, . . . , xn} of S, so |CK|=2n ByapreviousLemmawehave|CK||Df(U)|,so|Df(U)|2n, which implies |f (U)| = n and |Df (U)| = 2n, because f (U) cannot have more than n elements. Thus, D shatters f (U), so VCD(C) VCD(C). Suppose now that f is surjective and H = {t1, . . . , tm} is an m element set that is shattered by D. Consider the set L = {u1, . . . , um} such that
ui ∈f−1(ti)for1im.LetUbeasubsetofL.SinceHisshattered by D, there is a set D ∈ D such that f (U) = H ∩ D, which implies
U = L ∩ f −1(D). Thus, L is shattered by C and this means that
VCD(C) = VCD(D).
86/96
The VCD of Collections of Sets
Definition
The density of C is the number
denss(C)=inf{s∈R>0 | ΠC[m]c·ms foreverym∈N},
for some positive constant c.
87/96
The VCD of Collections of Sets
Theorem
Let S,T be two sets and let f : S −→ T be a function. If D is a collection of subsets of T and C = f −1(D) is the collection {f −1(D) | D ∈ D}, then denss(C) denss(D). Moreover, if f is a surjection, then
denss(C) = denss(D).
Proof: Let L be a subset of S such that |L| = m. Then, |CL| |Df (L)|. In general, we have |f (L)| m, so |Df (L)| D[m] cms . Therefore, we have |CL| |Df (L)| D[m] cms , which implies denss(C) denss(D).
If f is a surjection, then, for every finite subset M of T such that |M| = m there is a subset L of S such that |L| = |M| and f (L) = M. Therefore, D[m] ΠC[m] and this implies denss(C) = denss(D).
88/96
The VCD of Collections of Sets
If C, D are two collections of sets such that C ⊆ D, then VCD(C) VCD(D) and denss(C) denss(D).
Theorem
LetCbeacollectionofsubsetsofasetSandletC′={S−C |C∈C}. Then,foreveryK∈P(S)wehave|CK|=|CK′ |.
89/96
The VCD of Collections of Sets
Proof
We prove the statement by showing the existence of a bijection
f : CK −→ CK′ . If U ∈ CK , then U = K ∩ C , where C ∈ C. Then
S − C ∈ C′ and we define f (U) = K ∩ (S − C) = K − C ∈ CK′ . The function f is well-defined because if K ∩ C1 = K ∩ C2, then
K−C1 =K−(K∩C1)=K−(K∩C2)=K−C2. Itisclearthatiff(U)=f(V)forU,V ∈CK,U=K∩C1,and
V =K∩C2,thenK−C1 =K−C2,soK∩C1 =K∩C2 andthismeans thatU=V. Thus,f isinjective. IfW ∈CK′ ,thenW =K∩C′ forsome C′ ∈ C. Since C′ = S − C for some C ∈ C, it follows that W = K − C, so W = f (U), where U = K ∩ C.
90/96
The VCD of Collections of Sets
Corollary
LetCbeacollectionofsubsetsofasetSandletC′={S−C |C∈C}. We have denss(C) = denss(C′) and VCD(C) = VCD(C′).
91/96
The VCD of Collections of Sets
Theorem
For every collection of sets we have denss(C) VCD(C). Furthermore, if denss(C) is finite, then C is a VC-class.
Proof: If C is not a VC-class the inequality denss(C) VCD(C) is clearly satisfied. Suppose now that C is a VC-class and VCD(C) = d. By Sauer-Shelah Theorem we have ΠC[m] φ(d,m); then, we obtain
ΠC[m] emd, so denss(C) d.
d
Suppose now that denss(C) is finite. Since ΠC[m] cms 2m for m
sufficiently large, it follows that VCD(C) is finite, so C is a VC-class.
92/96
The VCD of Collections of Sets
Let D be a finite collection of subsets of a set S. The partition πD was
defined as consisting of the nonempty sets of the form
{Da1 ∩Da2 ∩···∩Dar,where(a,a,…,a)∈{0,1}r. 12r12r
Definition
A collection D = {D1,…,Dr} of subsets of a set S is independent if the partition πD has the maximum numbers of blocks, that is, it consists of 2r blocks.
If D is independent, then the Boolean subalgebra generated by D in the Boolean algebra (P(S), {∩, ∪, ̄, ∅, S}) contains 22r sets, because this subalgebra has 2r atoms. Thus, if D shatters a subset T with |T| = p, then the collection DT contains 2p sets, which implies 2p 22r , or p 2r .
93/96
The VCD of Collections of Sets
Definition
Let C be a collection of subsets of a set S. The independence number of C I(C) is:
I(C) = sup{r | {C1,…,Cr}
is independent for some finite {C1, . . . , Cr } ⊆ C}.
94/96
The VCD of Collections of Sets
Theorem
Let S,T be two sets and let f : S −→ T be a function. If D is a collection of subsets of T and C = f −1(D) is the collection {f −1(D) | D ∈ D}, then I (C) I (D). Moreover, if f is a surjection, then I (C) = I (D).
Proof: Let E = {D1,…,Dp} be an independent finite subcollection of D. The partition πE contains 2r blocks. The number of atoms of the subalgebra generated by {f −1(D1), . . . , f −1(Dp)} is not greater than 2r . Therefore, I(C) I(D); from the same supplement it follows that if f is surjective, then I (C) = I (D).
95/96
The VCD of Collections of Sets
Theorem
If C is a collection of subsets of a set S such that VCD(C) 2n, then I(C) n.
Proof: Suppose that VCD(C) 2n, that is, there exists a subset T of S that is shattered by C and has at least 2n elements. Then, the collection Ht contains at least 22n sets, which means that the Boolean subalgebra of P(T) generated by TC contains at least 2n atoms. This implies that the subalgebra of P(S) generated by C contains at least this number of atoms, so I(C) n.
96/96