
Learning via Uniform Convergence
Prof. Dan A. Simovici
UMB

Outline
1 Uniform Convergence
2 Finite Classes are Agnostically PAC-learnable

Uniform Convergence
Reminder
For agnostic learning the generalization error is:
LD(h) = D({(x, y) | h(x) ≠ y}).
The empirical risk is:
LS(h) = |{i | h(xi) ≠ yi, 1 ≤ i ≤ m}| / m.

Uniform Convergence
Definition
Let H be a hypothesis class, and let D be a distribution. A training set S is ε-representative (with respect to H, D, and the loss) if, for all h ∈ H, the absolute value of the difference between the empirical risk and the generalization error is at most ε, that is,
|LS(h) − LD(h)| ≤ ε.
Equivalently,
LS(h) − ε ≤ LD(h) ≤ LS(h) + ε.
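When the distribution D is known (or can be simulated), ε-representativeness of a sample can be checked directly for a finite class by comparing empirical and true risks. Below is a minimal Python sketch, not part of the slides; the arguments H (a list of classifiers), true_risk, and the use of the 0-1 loss are illustrative assumptions.

import numpy as np

def is_eps_representative(H, S_x, S_y, true_risk, eps):
    # Checks |L_S(h) - L_D(h)| <= eps for every h in the finite class H,
    # where L_S is the 0-1 empirical risk and true_risk(h) returns L_D(h)
    # (known analytically or estimated separately, e.g. by Monte Carlo).
    for h in H:
        L_S = np.mean(h(S_x) != S_y)
        if abs(L_S - true_risk(h)) > eps:
            return False
    return True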

Uniform Convergence
Definition
Let H be a class of hypotheses. Given a training set S, an ERM predictor for H is a hypothesis g ∈ H whose empirical risk LS(g) is minimal, that is, LS(g) ≤ LS(h) for every hypothesis h ∈ H.
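For a finite class, the ERM rule can be implemented by exhaustive enumeration: compute LS(h) for each h ∈ H and return a minimizer. A minimal, self-contained Python sketch; the threshold class and the data-generating process are toy assumptions, not from the slides.

import numpy as np

def empirical_risk_01(h, S_x, S_y):
    # 0-1 empirical risk: L_S(h) = |{i : h(x_i) != y_i}| / m.
    return np.mean(h(S_x) != S_y)

def erm(H, S_x, S_y):
    # Exhaustive ERM over a finite class: return a hypothesis minimizing L_S.
    return min(H, key=lambda h: empirical_risk_01(h, S_x, S_y))

# Toy finite class of threshold classifiers h_t(x) = 1{x >= t} (illustrative only).
H = [lambda x, t=t: (x >= t).astype(int) for t in (0.3, 0.5, 0.7)]
rng = np.random.default_rng(1)
S_x = rng.uniform(size=500)
S_y = (S_x >= 0.5).astype(int)
h_S = erm(H, S_x, S_y)
print("L_S(h_S) =", empirical_risk_01(h_S, S_x, S_y))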

Uniform Convergence
The next lemma stipulates that when the sample is ε/2-representative, the ERM learning rule applied to a sample S is guaranteed to return a good hypothesis hS.
Lemma
Assume that a training set S is ε/2-representative, that is,
|LS(h) − LD(h)| ≤ ε/2 for all h ∈ H.
Then, any hS that minimizes the empirical risk,
hS ∈ argmin_{h∈H} LS(h),
satisfies
LD(hS) ≤ min_{h∈H} LD(h) + ε.

Uniform Convergence
Proof
For every h ∈ H we have
LD(hS) ≤ LS(hS) + ε/2   (by the ε/2-representativeness of S, applied to hS)
≤ LS(h) + ε/2   (because hS is an ERM predictor, hence LS(hS) ≤ LS(h))
≤ LD(h) + ε/2 + ε/2   (because S is ε/2-representative, so LS(h) ≤ LD(h) + ε/2)
= LD(h) + ε.
Since this holds for every h ∈ H, it follows that LD(hS) ≤ min_{h∈H} LD(h) + ε.
Thus, to ensure that the ERM rule is an agnostic PAC learner, it suffices to show that, with probability at least 1 − δ over the random choice of a training set, the training set is ε-representative.
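The lemma can also be sanity-checked by simulation: whenever a drawn sample happens to be ε/2-representative, the ERM hypothesis must be within ε of the best hypothesis in the class. A minimal Python sketch under assumed toy conditions (threshold classifiers, x ~ Uniform(0, 1), labels 1{x ≥ 0.5} flipped with probability 0.1, so LD(h_t) = 0.1 + 0.8|t − 0.5| in closed form); nothing here comes from the slides.

import numpy as np

rng = np.random.default_rng(2)
thresholds = np.linspace(0.1, 0.9, 9)      # finite class H of threshold rules

def true_risk(t):
    # Closed-form L_D(h_t) under the assumed distribution described above.
    return 0.1 + 0.8 * abs(t - 0.5)

eps, m = 0.1, 2000
best = min(true_risk(t) for t in thresholds)
for _ in range(20):
    x = rng.uniform(size=m)
    y = ((x >= 0.5).astype(int) ^ (rng.uniform(size=m) < 0.1)).astype(int)
    emp = {t: np.mean((x >= t).astype(int) != y) for t in thresholds}
    if all(abs(emp[t] - true_risk(t)) <= eps / 2 for t in thresholds):  # S is eps/2-representative
        h_S = min(thresholds, key=emp.get)                              # ERM hypothesis
        assert true_risk(h_S) <= best + eps                             # the lemma's guarantee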

Uniform Convergence
Generalized Loss Functions
Definition
Given a hypothesis set H and some domain Z, let l : H × Z → R≥0 be a loss function.
The risk function is the expected loss of a hypothesis h ∈ H, given by
LD(h) = Ez∼D[l(h, z)].
The empirical risk for S = {s1, …, sm} is
LS(h) = (1/m) Σ_{i=1}^{m} l(h, si).
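In code, the generalized empirical risk is simply the average of the loss over the sample. A minimal sketch; the loss function and the hypothesis used in the example are placeholders, not from the slides.

def empirical_risk(h, S, loss):
    # L_S(h) = (1/m) * sum_{i=1}^{m} loss(h, s_i) for S = [s_1, ..., s_m].
    return sum(loss(h, z) for z in S) / len(S)

# Example with the squared loss on points z = (x, y) and a hypothesis h: x -> prediction.
squared_loss = lambda h, z: (h(z[0]) - z[1]) ** 2
S = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
print(empirical_risk(lambda x: x, S, squared_loss))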

Uniform Convergence
Definition
A hypothesis class H has the uniform convergence property (relative to a domain Z and a loss function l) if there exists a function
mUC : (0, 1)² → N (the same for all hypotheses in H and all probability distributions D) such that for every ε, δ ∈ (0, 1), if S is a sample of size m drawn i.i.d. according to D, where m ≥ mUC(ε, δ), then with probability at least 1 − δ, S is ε-representative.
The term uniform refers to the fact that mUC(ε, δ) is the same for all hypotheses in H and all probability distributions D.

Uniform Convergence
REMINDER: Agnostic PAC Learning
The realizability assumption (the existence of a hypothesis h* ∈ H such that Px∼D(h*(x) = f(x)) = 1) is not realistic in many cases.
Agnostic learning replaces the realizability assumption and the target labeling function f with a distribution D defined on pairs (data, label), that is, with a distribution D on X × Y.
Since D is defined over X × Y, the generalization error is LD(h) = D({(x, y) | h(x) ≠ y}).

Uniform Convergence
Theorem
If a class H has the uniform convergence property with a function mUC, then the class H is agnostically PAC learnable with sample complexity mH(ε, δ) ≤ mUC(ε/2, δ). Furthermore, in this case, the ERM_H paradigm is a successful agnostic learner for H.

Uniform Convergence
Proof
Suppose that H has the uniform convergence property with a function mUC.
For every ε, δ ∈ (0, 1), if S is a sample of size m, where m ≥ mUC(ε/2, δ), then with probability at least 1 − δ, S is ε/2-representative, which means that for all h ∈ H we have:
|LS(h) − LD(h)| ≤ ε/2.
By the definition of hS as an ERM predictor we then have:
LD(hS) ≤ LS(hS) + ε/2 = min_{h∈H} LS(h) + ε/2 ≤ min_{h∈H} LD(h) + ε,
hence H is agnostically PAC-learnable with mH(ε, δ) ≤ mUC(ε/2, δ).
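In code, the theorem amounts to evaluating the uniform-convergence sample size at accuracy ε/2. A small hedged Python sketch (m_UC stands for any function with the signature from the definition above; it is an assumed argument, not defined in the slides):

def agnostic_pac_sample_size(eps, delta, m_UC):
    # m_H(eps, delta) <= m_UC(eps / 2, delta), by the theorem above.
    return m_UC(eps / 2, delta)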

Finite Classes are Agnostically PAC-learnable
Theorem
Every finite hypothesis class has the uniform convergence property (assuming, as below, a loss function with values in [0, 1]).
Proof: Fix ε, δ ∈ (0, 1).
We need a sample S = {s1, …, sm} of size m that guarantees that, for any D, with probability at least 1 − δ, we have |LS(h) − LD(h)| ≤ ε for all h ∈ H (a representative sample).
Equivalently,
Dm({S | ∃h ∈ H, |LS(h) − LD(h)| > ε}) < δ.
Note that
{S | ∃h ∈ H, |LS(h) − LD(h)| > ε} = ∪_{h∈H} {S | |LS(h) − LD(h)| > ε}.

Finite Classes are Agnostically PAC-learnable
By the union bound (the sub-additivity of the measure Dm), this implies
Dm({S | ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ Σ_{h∈H} Dm({S | |LS(h) − LD(h)| > ε}).

Finite Classes are Agnostically PAC-learnable
Next phase: show that each term on the right side of the previous inequality,
Dm({S | |LS(h) − LD(h)| > ε}),
is small enough for large m.
Fix h ∈ H and let θi be the random variable θi = l(h, si). Since h is fixed and s1, …, sm are i.i.d. random variables, θ1, …, θm are also i.i.d. random variables, with
E(θ1) = ··· = E(θm) = μ.
The range of l is [0, 1] and, therefore, the range of each θi is [0, 1].
We have:
LS(h) = (1/m) Σ_{i=1}^{m} θi and LD(h) = μ.

Finite Classes are Agnostically PAC-learnable
By Hoeffding’s Inequality, for each fixed h ∈ H,
Dm({S | |LS(h) − LD(h)| > ε}) = P(|(1/m) Σ_{i=1}^{m} θi − μ| > ε) ≤ 2e^(−2mε²),
and, therefore,
Σ_{h∈H} Dm({S | |LS(h) − LD(h)| > ε}) ≤ 2|H| e^(−2mε²).
To have Dm({S | ∃h ∈ H, |LS(h) − LD(h)| > ε}) ≤ δ it suffices that 2|H| e^(−2mε²) ≤ δ, which is equivalent to
m ≥ log(2|H|/δ) / (2ε²).
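The Hoeffding step can be sanity-checked numerically for a single fixed h by simulating the θi and comparing the observed deviation probability with 2e^(−2mε²). A minimal Python sketch with Bernoulli 0-1 losses; the parameters m, ε, μ and the Bernoulli model are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, eps, mu, trials = 200, 0.1, 0.3, 20_000

# theta_i ~ Bernoulli(mu) plays the role of the 0-1 loss l(h, s_i) for one fixed h.
theta = rng.binomial(1, mu, size=(trials, m))
deviation = np.abs(theta.mean(axis=1) - mu)
print("observed:", np.mean(deviation > eps))               # empirical P(|L_S(h) - L_D(h)| > eps)
print("Hoeffding bound:", 2 * np.exp(-2 * m * eps ** 2))   # = 2e^(-4), about 0.037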

Finite Classes are Agnostically PAC-learnable
A Corollary
Recall that the ERM algorithm returns a hypothesis h for which LS(h) is minimal.
Corollary
Let H be a finite hypothesis class, let Z be a domain, and let
l : H × Z → [0, 1] be a loss function. Then H enjoys the uniform convergence property with sample complexity
mUC(ε, δ) = ⌈ log(2|H|/δ) / (2ε²) ⌉.
Furthermore, the class is agnostically PAC learnable using the ERM algorithm with sample complexity
mH(ε, δ) ≤ mUC(ε/2, δ) ≤ ⌈ 2 log(2|H|/δ) / ε² ⌉.
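The corollary translates directly into a short sample-size calculator. A minimal Python sketch (natural logarithm, as in the derivation above); the example values of |H|, ε, δ are arbitrary.

import math

def m_uc(eps, delta, H_size):
    # Uniform-convergence sample complexity: ceil(log(2|H|/delta) / (2 eps^2)).
    return math.ceil(math.log(2 * H_size / delta) / (2 * eps ** 2))

def m_agnostic(eps, delta, H_size):
    # Agnostic PAC sample complexity: m_H(eps, delta) <= m_uc(eps/2, delta)
    # = ceil(2 log(2|H|/delta) / eps^2).
    return m_uc(eps / 2, delta, H_size)

print(m_uc(0.1, 0.05, 1000))        # 530 examples suffice for eps-representativeness
print(m_agnostic(0.1, 0.05, 1000))  # 2120 examples suffice for agnostic PAC learning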