Probably Approximately Correct (PAC) Learning
Prof. Dan A. Simovici
UMB
Outline
1 The Agnostic PAC Learning
2 The Scope of Learning Problems
A learning algorithm A starts with a hypothesis class H and a sample S; under certain conditions, it returns a hypothesis h that has a small true error. A precise definition is given next.
We assume that the data set X is equipped with a probability distribution D.
What is the PAC Model?
Definition
A hypothesis class H is PAC learnable if there exist a function
mH : (0, 1)² −→ N and a learning algorithm A such that for every
ε, δ ∈ (0, 1), every distribution D over X, and every labeling function
f : X −→ {0, 1}, if the realizability assumption holds with respect to H, D, f, then when running the algorithm A on a sample S that consists of
m ≥ mH(ε, δ) examples generated i.i.d. by D and labeled by f, A returns a hypothesis h such that, with probability at least 1 − δ (over the choice of examples), the true error satisfies

L(D,f)(h) ≤ ε.

Here ε is the accuracy parameter and δ is the confidence parameter.
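To make the definition concrete, here is a minimal simulation sketch. It assumes a finite class of threshold hypotheses on [0, 1], a uniform distribution D, and a target threshold at 0.5; all of these choices are illustrative, not part of the definition. The sketch runs empirical risk minimization on m examples many times and records how often the returned hypothesis has true error at most ε.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative finite hypothesis class: thresholds on [0, 1].
# h_a(x) = 1 iff x >= a.  The target f is the threshold at 0.5,
# so the realizability assumption holds.
thresholds = np.linspace(0.0, 1.0, 101)
target = 0.5

def true_error(a):
    # Under the uniform distribution D on [0, 1], h_a and the target
    # disagree exactly on the interval between a and 0.5.
    return abs(a - target)

eps, delta, m, trials = 0.1, 0.05, 200, 1000
successes = 0
for _ in range(trials):
    x = rng.uniform(0.0, 1.0, size=m)          # sample S ~ D^m
    y = (x >= target).astype(int)              # labels given by f
    # ERM: pick a hypothesis with minimal empirical error.
    emp_errors = [np.mean((x >= a).astype(int) != y) for a in thresholds]
    a_hat = thresholds[int(np.argmin(emp_errors))]
    successes += true_error(a_hat) <= eps

print(f"fraction of runs with L_(D,f)(h) <= {eps}: {successes / trials:.3f}"
      f"  (PAC asks for at least {1 - delta})")
```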
Approximation Parameters
the accuracy parameter ε determines how far the output classifier can be from the optimal one, and
the confidence parameter δ indicates how likely the classifier is to meet that accuracy requirement.
The Agnostic PAC Learning
What is Agnostic PAC Learning?
The realizability assumption (the existence of a hypothesis h∗ ∈ H such that Px∼D(h∗(x) = f(x)) = 1) is not realistic in many cases.
Agnostic learning replaces the realizability assumption and the target labeling function f with a distribution D defined on pairs (data, label), that is, with a distribution D on X × Y.
The Agnostic PAC Learning
When the probability distribution D was defined on X , the generalization error of a hypothesis was defined as:
L(D,f)(h) = D({x | h(x) ≠ f(x)}).
For agnostic learning D is defined over X × Y, so we redefine the
generalization error as:
LD(h) = D({(x, y) | h(x) ≠ y}).
We seek a predictor for which LD(h) is minimal.
The definition of the empirical risk remains the same:
LS(h) = |{i | h(xi) ≠ yi, 1 ≤ i ≤ m}| / m.
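As a quick illustration of the empirical risk, here is a short sketch on an invented sample with an arbitrary threshold hypothesis; both the data and the hypothesis are assumptions made only for the example.

```python
import numpy as np

# Hypothetical labeled sample S = ((x_1, y_1), ..., (x_m, y_m)).
x = np.array([0.2, 0.7, 0.4, 0.9, 0.1])
y = np.array([0,   1,   1,   1,   0])

def h(x):
    # An arbitrary hypothesis used only for illustration.
    return (x >= 0.5).astype(int)

# L_S(h) = |{i : h(x_i) != y_i}| / m
L_S = np.mean(h(x) != y)
print(L_S)   # 0.2: the hypothesis mislabels one of the five points
```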
The Agnostic PAC Learning
The Bayes Classifier and Its Optimality
Let D be any probability distribution over X × Y, where Y = {0, 1}.
Let X be a random variable ranging over X and Y be a random variable ranging over Y = {0, 1}.
The Bayes predictor is the function fD defined as

fD(x) = 1 if P(Y = 1 | X = x) ≥ 1/2, and fD(x) = 0 otherwise.
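A minimal sketch of the Bayes predictor, assuming the conditional probabilities P(Y = 1 | X = x) are known for a small finite domain; the table of probabilities below is invented for illustration.

```python
# Hypothetical conditional probabilities alpha_x = P(Y = 1 | X = x)
# on a finite domain X = {x1, x2, x3, x4}.
alpha = {"x1": 0.9, "x2": 0.3, "x3": 0.5, "x4": 0.1}

def f_D(x):
    # f_D(x) = 1 if P(Y = 1 | X = x) >= 1/2, and 0 otherwise.
    return 1 if alpha[x] >= 0.5 else 0

for x, a in alpha.items():
    # Given X = x, the Bayes predictor errs with probability min{a, 1 - a}.
    print(x, f_D(x), min(a, 1 - a))
```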
The Agnostic PAC Learning
Theorem
Given any probability distribution D over X × {0, 1}, the best label predicting function f : X −→ {0, 1} is the Bayes predictor fD.
In other words, we need to prove that for every hypothesis g : X −→ {0, 1} we have LD(fD) ≤ LD(g).
The Agnostic PAC Learning
Proof
Let X be a random variable ranging over X, Y be a random variable ranging over Y = {0, 1}, and let αx be the probability of having the label 1 given x, that is, αx = P(Y = 1 | X = x).
With this notation the Bayes predictor is

fD(x) = 1 if αx ≥ 1/2, and fD(x) = 0 if αx < 1/2.
The Agnostic PAC Learning
Proof (cont’d)
We have:
LD(fD) = P(fD(X) ≠ Y | X = x)
= P(fD(X) = 1 | X = x) P(Y = 0 | X = x) + P(fD(X) = 0 | X = x) P(Y = 1 | X = x)
= P(αx ≥ 1/2) P(Y = 0 | X = x) + P(αx < 1/2) P(Y = 1 | X = x).
Note: when we write P(fD(X) ≠ Y | X = x) we mean the probability that a pair (x, y) is such that fD(x) ≠ y, given that X = x. Similar conventions apply to all probabilities listed above.
The Agnostic PAC Learning
Proof (cont’d)
If αx ≥ 1/2, then

min{αx, 1 − αx} = 1 − αx,  P(αx ≥ 1/2) = 1,  P(αx < 1/2) = 0,

and

P(αx ≥ 1/2)(1 − αx) + P(αx < 1/2) αx = 1 − αx = min{1 − αx, αx}.

If αx < 1/2, then min{αx, 1 − αx} = αx, P(αx ≥ 1/2) = 0, P(αx < 1/2) = 1, and

P(αx ≥ 1/2)(1 − αx) + P(αx < 1/2) αx = αx = min{1 − αx, αx}.

In both cases, LD(fD) = min{αx, 1 − αx}.
The Agnostic PAC Learning
Proof (cont’d)
Let g be any other classifier. We have:
P(g(X) ≠ Y | X = x)
= P(g(X) = 0 | X = x) P(Y = 1 | X = x) + P(g(X) = 1 | X = x) P(Y = 0 | X = x)
= P(g(X) = 0 | X = x) αx + P(g(X) = 1 | X = x)(1 − αx)
≥ P(g(X) = 0 | X = x) min{αx, 1 − αx} + P(g(X) = 1 | X = x) min{αx, 1 − αx}
= (P(g(X) = 0 | X = x) + P(g(X) = 1 | X = x)) · min{αx, 1 − αx}
= min{αx, 1 − αx} = P(fD(X) ≠ Y | X = x).
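The optimality claim can also be checked numerically on a small finite distribution. The sketch below assumes an invented marginal for X and invented conditional probabilities αx, enumerates every classifier g, and confirms that none has smaller risk than the Bayes predictor.

```python
import itertools

# A small numeric check of the theorem on an invented finite distribution D
# over X x {0, 1}: no classifier g beats the Bayes predictor f_D.
p_x = {"a": 0.5, "b": 0.3, "c": 0.2}             # marginal of X
alpha = {"a": 0.8, "b": 0.4, "c": 0.55}           # P(Y = 1 | X = x)

def risk(g):
    # L_D(g) = sum_x P(X = x) * P(g(x) != Y | X = x)
    return sum(p_x[x] * (alpha[x] if g[x] == 0 else 1 - alpha[x]) for x in p_x)

bayes = {x: 1 if alpha[x] >= 0.5 else 0 for x in p_x}
best = min(risk(dict(zip(p_x, labels)))
           for labels in itertools.product([0, 1], repeat=len(p_x)))
print(risk(bayes), best)   # the two values coincide
```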
The Agnostic PAC Learning
Agnostic PAC-Learnability
Definition
A hypothesis class H is agnostic PAC learnable if there exists a function mH : (0, 1)² −→ N and a learning algorithm A with the following property: For every
ε, δ ∈ (0, 1) and
for every distribution D over X × Y,
when running A on m ≥ mH(ε, δ) iid examples generated by D, A returns a hypothesis h such that with probability at least 1 − δ (over the choice of the m training examples) we have

LD(h) ≤ min_{h′∈H} LD(h′) + ε.
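The sketch below illustrates the agnostic guarantee with an illustrative threshold class and labels flipped with probability 0.1, so no hypothesis in H has zero error; the class, the noise level, and the sample size are all assumptions made for the example. ERM returns a hypothesis whose true risk is close to the best achievable in the class.

```python
import numpy as np

rng = np.random.default_rng(1)

# Agnostic setting: thresholds h_a(x) = 1{x >= a} on [0, 1],
# uniform x, labels y = 1{x >= 0.5} flipped with probability `noise`.
thresholds = np.linspace(0.0, 1.0, 101)
noise = 0.1

def true_risk(a):
    # L_D(h_a): disagree with the clean label w.p. |a - 0.5|,
    # and the label itself is flipped with probability `noise`.
    clean_err = abs(a - 0.5)
    return clean_err * (1 - noise) + (1 - clean_err) * noise

m = 500
x = rng.uniform(0.0, 1.0, size=m)
y = (x >= 0.5).astype(int)
flip = rng.uniform(size=m) < noise
y = np.where(flip, 1 - y, y)

# ERM over H on the noisy sample.
emp = [np.mean((x >= a).astype(int) != y) for a in thresholds]
a_hat = thresholds[int(np.argmin(emp))]

best_in_class = min(true_risk(a) for a in thresholds)
print(true_risk(a_hat), best_in_class)   # L_D(h) should be close to min_h' L_D(h')
```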
The Agnostic PAC Learning
If the realizability assumption holds, agnostic PAC learning provides the same guarantees as PAC learning.
When the realizability assumption does not hold, no learner can guarantee an arbitrarily small error.
A learner A can declare success if the error is not much larger than the smallest error achievable by a hypothesis from H.
The Scope of Learning Problems
Multiclass Classification
Example
Let X be a set of document features, and Y a set of topics (sports, politics, health, etc.).
By document features we mean counts of certain key words, size, or origin of the document.
The loss function is the probability of the event that the predictor suggests a wrong label.
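A toy sketch of this setting, with invented keyword counts standing in for document features and a hypothetical keyword-based predictor; the empirical 0-1 loss is the fraction of documents assigned a wrong topic.

```python
# Hypothetical documents represented by keyword counts, labeled with topics.
docs = [
    {"goal": 4, "vote": 0, "virus": 0},
    {"goal": 0, "vote": 5, "virus": 0},
    {"goal": 0, "vote": 3, "virus": 2},
]
labels = ["sports", "politics", "health"]

def predict(doc):
    # A toy predictor: pick the topic whose keyword dominates the counts.
    keyword_to_topic = {"goal": "sports", "vote": "politics", "virus": "health"}
    top = max(doc, key=doc.get)
    return keyword_to_topic[top]

# Empirical 0-1 loss: fraction of documents assigned a wrong topic.
errors = sum(predict(d) != y for d, y in zip(docs, labels)) / len(docs)
print(errors)   # the third document is misclassified, so the loss is 1/3
```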
The Scope of Learning Problems
Regression
Example
In regression we seek to find a functional relationship h between the X and Y components of the data.
For example, to predict the weight of a baby at birth, X can be a set of triplets in R³

(head circumference, abdominal circumference, femur length)

and Y is the weight at birth. We seek h that minimizes the loss LD(h) = E(x,y)∼D((h(x) − y)²).
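Here is a minimal sketch of this regression problem on synthetic data (the measurements, coefficients, and noise level are invented); a linear hypothesis is fitted by least squares, which minimizes the empirical counterpart of LD(h).

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic triplets (head circumference, abdominal circumference, femur length)
# in cm, and a synthetic birth weight in kg; all numbers are invented.
m = 200
X = rng.normal(loc=[34.0, 33.0, 7.0], scale=[2.0, 2.5, 0.6], size=(m, 3))
true_w = np.array([0.05, 0.06, 0.12])
y = X @ true_w + 0.4 + rng.normal(scale=0.2, size=m)

# A linear hypothesis h(x) = w . x + b fitted by least squares, i.e. by
# minimizing the empirical version of L_D(h) = E[(h(x) - y)^2].
A = np.column_stack([X, np.ones(m)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((A @ coef - y) ** 2)
print(coef, mse)
```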
The Scope of Learning Problems
Generalized Loss Functions
Definition
Given a set of hypotheses H and a domain Z, a loss function is a function l : H × Z −→ R+.
For prediction problems we have Z = X × Y.
Definition
The risk function is the expected loss of the classifier h ∈ H with respect to a probability distribution D over Z, namely
LD(h) = Ez∼D(l(h, z)).
The empirical risk is the average loss over the sample S = (z1, . . . , zm) ∈ Z^m:

LS(h) = (1/m) Σ_{i=1}^m l(h, zi).
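These definitions translate directly into code. The sketch below defines a generic empirical risk together with the two losses discussed next (0-1 and square loss); the sample and the hypothesis are invented for illustration.

```python
# A generic empirical risk, directly following the definition:
# L_S(h) = (1/m) * sum_i l(h, z_i).
def empirical_risk(loss, h, sample):
    return sum(loss(h, z) for z in sample) / len(sample)

# Two standard losses over Z = X x Y.
def zero_one_loss(h, z):
    x, y = z
    return 0.0 if h(x) == y else 1.0

def square_loss(h, z):
    x, y = z
    return (h(x) - y) ** 2

# Invented sample and hypothesis, only to exercise the definitions.
sample = [(0.2, 0.0), (0.8, 1.0), (0.4, 1.0)]
h = lambda x: 1.0 if x >= 0.5 else 0.0
print(empirical_risk(zero_one_loss, h, sample),
      empirical_risk(square_loss, h, sample))
```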
The Scope of Learning Problems
0-1 Loss
The random variable z = (x, y) ranges over X × Y and the loss function is

l0−1(h, (x, y)) = 0 if h(x) = y, and 1 if h(x) ≠ y.
This is used in binary or multiclass classification problems.
For the 0-1 loss, the definition LD(h) = Ez∼D(l(h, z)) coincides with the earlier definition used in agnostic PAC learning, LD(h) = D({(x, y) | h(x) ≠ y}).
The Scope of Learning Problems
Square Loss
The random variable z ranges over X × Y and the loss function is lsq(h, (x, y)) = (h(x) − y)².
The Scope of Learning Problems
Agnostic PAC Learnability for General Loss Functions
Definition
A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function l : H × Z −→ R+ if there exists a function
mH : (0, 1)² −→ N and a learning algorithm A with the following property: For every ε, δ ∈ (0, 1) and for every distribution D over Z, when running A on m ≥ mH(ε, δ) iid examples generated by D, A returns a hypothesis h such that with probability at least 1 − δ (over the choice of the m training examples) we have

LD(h) ≤ min_{h′∈H} LD(h′) + ε,

where LD(h) = Ez∼D(l(h, z)).