Learning via Uniform Convergence
Prof. Dan A. Simovici
UMB
Outline
1 Uniform Convergence
2 Finite Classes are Agnostically PAC-learnable
Uniform Convergence
Reminder
For agnostic learning the generalization error is
$$L_{\mathcal D}(h) = \mathcal D(\{(x, y) \mid h(x) \neq y\}).$$
The empirical risk is
$$L_S(h) = \frac{|\{i \mid h(x_i) \neq y_i,\ 1 \leq i \leq m\}|}{m}.$$
Definition
Let $\mathcal H$ be a hypothesis class, and let $\mathcal D$ be a distribution. A training set $S$ is $\varepsilon$-representative (with respect to $\mathcal H$ and $\mathcal D$) if the absolute value of the difference between the empirical risk and the generalization error is at most $\varepsilon$ for all $h \in \mathcal H$, that is,
$$|L_S(h) - L_{\mathcal D}(h)| \leq \varepsilon.$$
Equivalently,
$$L_S(h) - \varepsilon \leq L_{\mathcal D}(h) \leq L_S(h) + \varepsilon.$$
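As an illustration (a minimal sketch of mine, not from the slides), the following Python code computes the generalization error $L_{\mathcal D}(h)$ and the empirical risk $L_S(h)$ for a single fixed classifier over a made-up finite domain with known probabilities, and checks $\varepsilon$-representativeness for that one hypothesis; the domain, distribution, labels, and classifier are all assumed toy values (a full check would range over every $h \in \mathcal H$).

```python
import random

# Toy setup (assumed values, not from the lecture): four points with known
# probabilities under D and deterministic labels.
probs = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
labels = {0: 0, 1: 0, 2: 1, 3: 1}

def h(x):
    """A fixed, imperfect classifier: predicts 1 iff x >= 1 (misclassifies x = 1)."""
    return 1 if x >= 1 else 0

def generalization_error(h):
    """L_D(h) = D({(x, y) : h(x) != y}), computable exactly since D is known."""
    return sum(p for x, p in probs.items() if h(x) != labels[x])

def empirical_risk(h, sample):
    """L_S(h) = |{i : h(x_i) != y_i}| / m."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def is_representative_for(h, sample, eps):
    """|L_S(h) - L_D(h)| <= eps for this single hypothesis h."""
    return abs(empirical_risk(h, sample) - generalization_error(h)) <= eps

# Draw an i.i.d. sample S of size m from D and check 0.05-representativeness for h.
m = 1000
xs = random.choices(list(probs), weights=list(probs.values()), k=m)
S = [(x, labels[x]) for x in xs]
print(generalization_error(h), empirical_risk(h, S), is_representative_for(h, S, 0.05))
```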
Definition
Let $\mathcal H$ be a class of hypotheses. Given a sample $S$, an ERM predictor for $\mathcal H$ is a hypothesis $g \in \mathcal H$ whose empirical risk $L_S(g)$ is minimal, that is, $L_S(g) \leq L_S(h)$ for every hypothesis $h \in \mathcal H$.
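A minimal Python sketch of the ERM rule, assuming a small finite hypothesis class of threshold classifiers and the 0-1 loss; the thresholds and the sample below are illustrative assumptions, not from the lecture.

```python
# A finite hypothesis class of threshold classifiers on the real line:
# h_t(x) = 1 iff x >= t, for a handful of candidate thresholds t (assumed values).
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

def empirical_risk(t, sample):
    """0-1 empirical risk of the threshold classifier h_t on the sample."""
    return sum(1 for x, y in sample if (1 if x >= t else 0) != y) / len(sample)

def erm(sample):
    """ERM rule: return a threshold minimizing the empirical risk over the class."""
    return min(thresholds, key=lambda t: empirical_risk(t, sample))

# Usage on a tiny labeled sample S = {(x_i, y_i)}.
S = [(0.1, 0), (0.3, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
t_hat = erm(S)
print(t_hat, empirical_risk(t_hat, S))
```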
The next lemma stipulates that when the sample is $\varepsilon/2$-representative, the ERM learning rule applied to a sample $S$ is guaranteed to return a good hypothesis $h_S$.
Lemma
Assume that a training set $S$ is $\varepsilon/2$-representative, that is, $|L_S(h) - L_{\mathcal D}(h)| \leq \varepsilon/2$ for every $h \in \mathcal H$. Then, any $h_S$ that minimizes the empirical risk,
$$h_S \in \operatorname{argmin}_{h \in \mathcal H} L_S(h),$$
satisfies
$$L_{\mathcal D}(h_S) \leq \min_{h \in \mathcal H} L_{\mathcal D}(h) + \varepsilon.$$
Proof
For every $h \in \mathcal H$ we have
$$
\begin{aligned}
L_{\mathcal D}(h_S) &\leq L_S(h_S) + \frac{\varepsilon}{2} &&\text{(by the $\varepsilon/2$-representativeness of $S$, applied to $h_S$)}\\
&\leq L_S(h) + \frac{\varepsilon}{2} &&\text{(because $h_S$ is an ERM predictor, hence $L_S(h_S) \leq L_S(h)$)}\\
&\leq L_{\mathcal D}(h) + \frac{\varepsilon}{2} + \frac{\varepsilon}{2} &&\text{(because $S$ is $\varepsilon/2$-representative, so $L_S(h) \leq L_{\mathcal D}(h) + \varepsilon/2$)}\\
&= L_{\mathcal D}(h) + \varepsilon.
\end{aligned}
$$
Since this holds for every $h \in \mathcal H$, it follows that $L_{\mathcal D}(h_S) \leq \min_{h \in \mathcal H} L_{\mathcal D}(h) + \varepsilon$.
Thus, to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that, with probability at least $1 - \delta$ over the random choice of a training set, the training set is $\varepsilon$-representative (with $\varepsilon$ taken to be half of the desired accuracy, as in the lemma).
Generalized Loss Functions
Definition
Given a hypothesis set $\mathcal H$ and some domain $Z$, let $\ell : \mathcal H \times Z \longrightarrow \mathbb{R}_{\geq 0}$ be a loss function.
The risk function is the expected loss of a classifier $h \in \mathcal H$, given by
$$L_{\mathcal D}(h) = \mathbb{E}_{z \sim \mathcal D}[\ell(h, z)].$$
The empirical risk for $S = \{s_1, \ldots, s_m\}$ is
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \ell(h, s_i).$$
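For instance (my own illustration, not from the slides), with $Z = \mathcal X \times \mathcal Y$ and the squared loss $\ell(h, (x, y)) = (h(x) - y)^2$, the empirical risk is just the sample mean of the losses:

```python
def empirical_risk(h, sample, loss):
    """L_S(h) = (1/m) * sum_i loss(h, s_i) for a generic loss function."""
    return sum(loss(h, s) for s in sample) / len(sample)

def squared_loss(h, z):
    """Squared loss on a point z = (x, y); h is any callable predictor."""
    x, y = z
    return (h(x) - y) ** 2

# Toy usage with an assumed identity predictor and a made-up sample.
S = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
print(empirical_risk(lambda x: x, S, squared_loss))
```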
Definition
A hypothesis class $\mathcal H$ has the uniform convergence property (relative to a domain $Z$ and a loss function $\ell$) if there exists a function $m^{UC}_{\mathcal H} : (0, 1)^2 \longrightarrow \mathbb N$ such that for every $\varepsilon, \delta \in (0, 1)$, if $S$ is a sample of size $m$ with $m \geq m^{UC}_{\mathcal H}(\varepsilon, \delta)$, then with probability at least $1 - \delta$, $S$ is $\varepsilon$-representative.
The term uniform refers to the fact that $m^{UC}_{\mathcal H}(\varepsilon, \delta)$ is the same for all hypotheses in $\mathcal H$ and all probability distributions $\mathcal D$.
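As a sanity check (a sketch of mine with a made-up finite class and distribution, not part of the lecture), one can estimate by Monte Carlo the probability that a sample of size $m$ is $\varepsilon$-representative simultaneously for every hypothesis in a small finite class of threshold classifiers, and compare it with $1 - \delta$:

```python
import random

# Toy setup (assumed, not from the lecture): domain {0,...,9} with the uniform
# distribution D, labels y = 1 iff x >= 5, and the finite class
# H = { h_t : h_t(x) = 1 iff x >= t } for t = 0,...,10.
points = list(range(10))
thresholds = list(range(11))

def label(x):
    return 1 if x >= 5 else 0

def true_risk(t):
    """Exact L_D(h_t) under the uniform distribution on the ten points."""
    return sum((1 if x >= t else 0) != label(x) for x in points) / len(points)

def empirical_risk(t, xs):
    return sum((1 if x >= t else 0) != label(x) for x in xs) / len(xs)

def representative(xs, eps):
    """True iff |L_S(h) - L_D(h)| <= eps simultaneously for every h in H."""
    return all(abs(empirical_risk(t, xs) - true_risk(t)) <= eps for t in thresholds)

def prob_representative(m, eps, trials=500):
    """Monte Carlo estimate of Pr[ S is eps-representative ] over samples of size m."""
    hits = sum(representative(random.choices(points, k=m), eps) for _ in range(trials))
    return hits / trials

eps, delta = 0.1, 0.1
for m in (50, 200, 800):
    p = prob_representative(m, eps)
    print(f"m = {m}: estimated Pr[S is {eps}-representative] = {p:.3f} (want >= {1 - delta})")
```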
REMINDER: Agnostic PAC Learning
The realizability assumption (the existence of a hypothesis $h^* \in \mathcal H$ such that $\mathbb{P}_{x \sim \mathcal D}(h^*(x) = f(x)) = 1$) is not realistic in many cases.
Agnostic learning replaces the realizability assumption and the target labeling function $f$ with a distribution $\mathcal D$ defined on pairs (data, labels), that is, with a distribution $\mathcal D$ on $\mathcal X \times \mathcal Y$.
Since $\mathcal D$ is defined over $\mathcal X \times \mathcal Y$, the generalization error is
$$L_{\mathcal D}(h) = \mathcal D(\{(x, y) \mid h(x) \neq y\}).$$
Theorem
If a class $\mathcal H$ has the uniform convergence property with a function $m^{UC}_{\mathcal H}$, then the class $\mathcal H$ is agnostically PAC learnable with sample complexity $m_{\mathcal H}(\varepsilon, \delta) \leq m^{UC}_{\mathcal H}(\varepsilon/2, \delta)$. Furthermore, in this case, the $\mathrm{ERM}_{\mathcal H}$ paradigm is a successful agnostic learner for $\mathcal H$.
Proof
Suppose that $\mathcal H$ has the uniform convergence property with a function $m^{UC}_{\mathcal H}$.
For every $\varepsilon, \delta \in (0, 1)$, if $S$ is a sample of size $m$ with $m \geq m^{UC}_{\mathcal H}(\varepsilon/2, \delta)$, then with probability at least $1 - \delta$ the sample $S$ is $\varepsilon/2$-representative, which means that for all $h \in \mathcal H$ we have
$$L_{\mathcal D}(h) \leq L_S(h) + \varepsilon/2 \quad\text{and}\quad L_S(h) \leq L_{\mathcal D}(h) + \varepsilon/2.$$
By the definition of $h_S$ (an empirical risk minimizer), we have
$$L_{\mathcal D}(h_S) \leq L_S(h_S) + \varepsilon/2 \leq \min_{h \in \mathcal H} L_S(h) + \varepsilon/2 \leq \min_{h \in \mathcal H} L_{\mathcal D}(h) + \varepsilon,$$
hence $\mathcal H$ is agnostically PAC-learnable with $m_{\mathcal H}(\varepsilon, \delta) \leq m^{UC}_{\mathcal H}(\varepsilon/2, \delta)$.
Finite Classes are Agnostically PAC-learnable
Theorem
Uniform convergence holds for a finite hypothesis class, provided the loss function takes values in $[0, 1]$.
Proof: Fix $\varepsilon, \delta \in (0, 1)$.
We need a sample size $m$ that guarantees that, for any $\mathcal D$, with probability at least $1 - \delta$ we have $|L_S(h) - L_{\mathcal D}(h)| \leq \varepsilon$ for all $h \in \mathcal H$ (a representative sample $S = \{s_1, \ldots, s_m\}$).
Equivalently,
$$\mathcal D^m(\{S \mid \exists h \in \mathcal H,\ |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}) < \delta.$$
Note that
$$\{S \mid \exists h \in \mathcal H,\ |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\} = \bigcup_{h \in \mathcal H} \{S \mid |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}.$$
By the union bound (the sub-additivity of $\mathcal D^m$), this implies
$$\mathcal D^m(\{S \mid \exists h \in \mathcal H,\ |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}) \leq \sum_{h \in \mathcal H} \mathcal D^m(\{S \mid |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}).$$
Next phase: show that each term $\mathcal D^m(\{S \mid |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\})$ on the right-hand side of the previous inequality is small enough (for large $m$).
Fix $h \in \mathcal H$ and let $\theta_i$ be the random variable $\theta_i = \ell(h, s_i)$. Since $h$ is fixed and $s_1, \ldots, s_m$ are i.i.d. random variables, it follows that $\theta_1, \ldots, \theta_m$ are also i.i.d. random variables, with
$$\mathbb{E}(\theta_1) = \cdots = \mathbb{E}(\theta_m) = \mu.$$
The range of $\ell$ is $[0, 1]$ and therefore the range of each $\theta_i$ is $[0, 1]$.
We have
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} \theta_i \quad\text{and}\quad L_{\mathcal D}(h) = \mu.$$
By Hoeffding's Inequality,
$$\mathcal D^m(\{S \mid |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}) = \mathbb{P}\left(\left|\frac{1}{m}\sum_{i=1}^{m} \theta_i - \mu\right| > \varepsilon\right) \leq 2 e^{-2 m \varepsilon^2},$$
hence
$$\sum_{h \in \mathcal H} \mathcal D^m(\{S \mid |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}) \leq 2 |\mathcal H| e^{-2 m \varepsilon^2}.$$
To have $\mathcal D^m(\{S \mid \exists h \in \mathcal H,\ |L_S(h) - L_{\mathcal D}(h)| > \varepsilon\}) \leq \delta$ it therefore suffices that $2 |\mathcal H| e^{-2 m \varepsilon^2} \leq \delta$, which is equivalent to
$$m \geq \frac{\log(2 |\mathcal H| / \delta)}{2 \varepsilon^2}.$$
A Corollary
Recall that the ERM algorithm returns a hypothesis $h$ for which $L_S(h)$ is minimal.
Corollary
Let $\mathcal H$ be a finite hypothesis class, let $Z$ be a domain, and let $\ell : \mathcal H \times Z \longrightarrow [0, 1]$ be a loss function. Then $\mathcal H$ enjoys the uniform convergence property with sample complexity
$$m^{UC}_{\mathcal H}(\varepsilon, \delta) \leq \left\lceil \frac{\log(2 |\mathcal H| / \delta)}{2 \varepsilon^2} \right\rceil.$$
Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity
$$m_{\mathcal H}(\varepsilon, \delta) \leq m^{UC}_{\mathcal H}(\varepsilon/2, \delta) \leq \left\lceil \frac{2 \log(2 |\mathcal H| / \delta)}{\varepsilon^2} \right\rceil.$$
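For a feel for the numbers (a worked example of mine, not from the slides), here is a tiny Python computation of the two bounds for an assumed class of size $|\mathcal H| = 1000$ with $\varepsilon = 0.1$ and $\delta = 0.05$:

```python
from math import ceil, log

def m_uc(eps, delta, h_size):
    """Uniform-convergence bound: ceil( log(2|H|/delta) / (2 eps^2) )."""
    return ceil(log(2 * h_size / delta) / (2 * eps ** 2))

def m_agnostic_pac(eps, delta, h_size):
    """Agnostic PAC bound via ERM: m_UC(eps/2, delta) = ceil( 2 log(2|H|/delta) / eps^2 )."""
    return m_uc(eps / 2, delta, h_size)

# Assumed illustrative values: |H| = 1000, eps = 0.1, delta = 0.05.
print(m_uc(0.1, 0.05, 1000))            # approx. 530 examples for 0.1-representativeness
print(m_agnostic_pac(0.1, 0.05, 1000))  # approx. 2120 examples for the ERM guarantee
```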