
The Probably Approximately Correct (PAC) Learning
Prof. Dan A. Simovici
UMB

Outline
1 The Agnostic PAC Learning
2 The Scope of Learning Problems

A learning algorithm A starts with a hypothesis class H and a sample S; under certain conditions, it returns a hypothesis h that has a small true error. A precise definition is given next.
We assume that the data set X is equipped with a probability distribution D.

What is the PAC Model?
Definition
A hypothesis class H is PAC learnable if there exist a function mH : (0, 1)^2 −→ N and a learning algorithm A such that for every ε, δ ∈ (0, 1), every distribution D over X, and every labeling function f : X −→ {0, 1}, the following holds: if the realizability assumption holds with respect to H, D, f, then when running the algorithm A on a sample S that consists of m ≥ mH(ε, δ) examples generated i.i.d. by D and labeled by f, A returns a hypothesis h such that, with probability at least 1 − δ (over the choice of examples), the true error satisfies L(D,f)(h) ≤ ε; equivalently,
P(L(D,f)(h) ≤ ε) ≥ 1 − δ.
Here ε is the accuracy parameter and δ is the confidence parameter.
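The slides do not compute mH explicitly. For a finite hypothesis class under the realizability assumption, a standard bound (not stated in this excerpt) is mH(ε, δ) ≤ ⌈ln(|H|/δ)/ε⌉, achieved by any empirical risk minimizer. A minimal Python sketch of that bound; the function name and the example numbers are illustrative:

import math

def m_H(epsilon: float, delta: float, class_size: int) -> int:
    # Standard sample-complexity bound for a *finite* hypothesis class under
    # realizability: ceil(ln(|H| / delta) / epsilon) i.i.d. examples suffice
    # for an ERM learner to return, with probability at least 1 - delta,
    # a hypothesis with true error at most epsilon.
    return math.ceil(math.log(class_size / delta) / epsilon)

# Example: |H| = 1000, accuracy epsilon = 0.05, confidence delta = 0.05.
print(m_H(0.05, 0.05, 1000))  # 199 examples suffice under these assumptions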

Approximation Parameters
the accuracy parameter ε determines how far the output classifier can be from the optimal one, and
the confidence parameter δ indicates how likely the classifier is to meet that accuracy requirement.

The Agnostic PAC Learning
What is Agnostic PAC Learning?
The realizability assumption, that is, the existence of a hypothesis h∗ ∈ H such that Px∼D(h∗(x) = f(x)) = 1, is unrealistic in many practical situations.
Agnostic learning drops the realizability assumption and the target labeling function f, replacing them with a distribution D defined on (data, label) pairs, that is, a distribution D over X × Y.

When the probability distribution D was defined on X , the generalization error of a hypothesis was defined as:
L(D,f)(h) = D({x | h(x) ≠ f(x)}).
For agnostic learning D is defined over X × Y, so we redefine the
generalization error as:
LD(h) = D({(x, y) | h(x) ≠ y}).
We seek a predictor for which LD(h) is minimal.
The definition of the empirical risk remains the same:
LS(h) = |{i | h(xi) ≠ yi, 1 ≤ i ≤ m}| / m.
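As a concrete reading of this formula, here is a small sketch that computes LS(h) as the fraction of misclassified sample points; the threshold hypothesis and the toy sample are made up for illustration.

from typing import Callable, Sequence, Tuple

def empirical_risk(h: Callable[[float], int],
                   sample: Sequence[Tuple[float, int]]) -> float:
    # L_S(h): the fraction of labeled examples (x_i, y_i) in S that h misclassifies.
    m = len(sample)
    return sum(1 for x, y in sample if h(x) != y) / m

# Illustrative sample S and threshold hypothesis h(x) = 1 iff x >= 0.5.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 0)]
h = lambda x: 1 if x >= 0.5 else 0
print(empirical_risk(h, S))  # 0.25: h mislabels one of the four examples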

The Bayes Classifier and Its Optimality
Let D be any probability distribution over X × Y, where Y = {0, 1}.
Let X be a random variable ranging over X and Y be a random variable ranging over Y = {0, 1}.
The Bayes predictor is the function fD defined as
fD(x) = 1 if P(Y = 1 | X = x) ≥ 1/2, and fD(x) = 0 otherwise.
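A direct transcription of this definition, taking αx = P(Y = 1 | X = x) as input (the helper names and the sample values of αx are illustrative). It also reports the error the predictor incurs at x, anticipating the quantity min{αx, 1 − αx} that appears in the proof that follows.

def bayes_predictor(alpha_x: float) -> int:
    # f_D(x): predict 1 exactly when alpha_x = P(Y = 1 | X = x) >= 1/2.
    return 1 if alpha_x >= 0.5 else 0

def conditional_error(alpha_x: float) -> float:
    # P(f_D(X) != Y | X = x): (1 - alpha_x) when f_D predicts 1, alpha_x when
    # it predicts 0; either way this equals min(alpha_x, 1 - alpha_x).
    return (1 - alpha_x) if bayes_predictor(alpha_x) == 1 else alpha_x

for a in (0.1, 0.5, 0.8):  # illustrative values of alpha_x
    print(a, bayes_predictor(a), conditional_error(a))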

Theorem
Given any probability distribution D over X × {0, 1}, the best label-predicting function f : X −→ {0, 1} is the Bayes predictor fD.
In other words, we need to prove that for every hypothesis g : X −→ {0, 1} we have LD(fD) ≤ LD(g).

Proof
Let X be a random variable ranging over X, let Y be a random variable ranging over Y = {0, 1}, and let αx be the probability of having the label 1 given x, that is, αx = P(Y = 1 | X = x).
With this notation the Bayes predictor is
fD(x) = 1 if αx ≥ 1/2, and fD(x) = 0 if αx < 1/2.

Proof (cont'd)
We have:
LD(fD) = P(fD(X) ≠ Y | X = x)
= P(fD(x) = 1 | X = x) P(Y = 0 | X = x) + P(fD(x) = 0 | X = x) P(Y = 1 | X = x)
= P(αx ≥ 1/2) P(Y = 0 | X = x) + P(αx < 1/2) P(Y = 1 | X = x).
Note: when we write P(fD(X) ≠ Y | X = x) we mean the probability that a pair (x, y) is such that fD(X) ≠ y, assuming that X = x. Similar conventions apply to all the probabilities listed above.

Proof (cont'd)
If αx ≥ 1/2, then min{αx, 1 − αx} = 1 − αx, P(αx ≥ 1/2) = 1, P(αx < 1/2) = 0, and
P(αx ≥ 1/2)(1 − αx) + P(αx < 1/2)αx = 1 − αx = min{1 − αx, αx}.
If αx < 1/2, then min{αx, 1 − αx} = αx, P(αx ≥ 1/2) = 0, P(αx < 1/2) = 1, and
P(αx ≥ 1/2)(1 − αx) + P(αx < 1/2)αx = αx = min{1 − αx, αx}.

Proof (cont'd)
Let g be any other classifier. We have:
P(g(X) ≠ Y | X = x)
= P(g(X) = 0 | X = x) P(Y = 1 | X = x) + P(g(X) = 1 | X = x) P(Y = 0 | X = x)
= P(g(X) = 0 | X = x)αx + P(g(X) = 1 | X = x)(1 − αx)
≥ P(g(X) = 0 | X = x) min{αx, 1 − αx} + P(g(X) = 1 | X = x) min{αx, 1 − αx}
= (P(g(X) = 0 | X = x) + P(g(X) = 1 | X = x)) · min{αx, 1 − αx}
= min{αx, 1 − αx} = P(fD(X) ≠ Y | X = x).

Agnostic PAC-Learnability
Definition
A hypothesis class H is agnostic PAC learnable if there exist a function mH : (0, 1)^2 −→ N and a learning algorithm A with the following property: for every ε, δ ∈ (0, 1) and for every distribution D over X × Y, when running A on m ≥ mH(ε, δ) i.i.d. examples generated by D, A returns a hypothesis h such that, with probability at least 1 − δ (over the choice of the m training examples), we have
LD(h) ≤ min_{h′ ∈ H} LD(h′) + ε.

If the realizability assumption holds, agnostic PAC learning provides the same guarantees as PAC learning. When the realizability assumption does not hold, no learner can guarantee an arbitrarily small error. A learner A can declare success if its error is not much larger than the smallest error achievable by a hypothesis from H.

The Scope of Learning Problems
Multiclass Classification
Example
Let X be a set of document features and Y a set of topics (sports, politics, health, etc.). By document features we mean counts of certain key words, the size, or the origin of the document. The loss function will be the probability of the event that the predictor suggests a wrong label.

Regression
Example
In regression we seek a functional relationship h between the X and Y components of the data. For example, to predict the weight of a baby at birth, X can be a set of triplets in R^3 (head circumference, abdominal circumference, femur length) and Y is the weight at birth. We seek an h that minimizes the loss
LD(h) = E(x,y)∼D (h(x) − y)^2.

Generalized Loss Functions
Definition
Given a set of hypotheses H and a domain Z, a loss function is a function l : H × Z −→ R+. For prediction problems we have Z = X × Y.
Definition
The risk function is the expected loss of the classifier h ∈ H with respect to a probability distribution D over Z, namely LD(h) = Ez∼D(l(h, z)). The empirical risk is the average loss over the sample S = (z1, . . . , zm) ∈ Z^m, namely
LS(h) = (1/m) Σ_{i=1}^{m} l(h, zi).

0-1 Loss
The random variable z ranges over X × Y and the loss function is
l0−1(h, (x, y)) = 0 if h(x) = y, and 1 if h(x) ≠ y.
This loss is used in binary and multiclass classification problems. For the 0-1 loss, the definition LD(h) = Ez∼D(l(h, z)) coincides with the definition used earlier in agnostic PAC learning, LD(h) = D({(x, y) | h(x) ≠ y}).
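To make the generalized definitions concrete, here is a sketch of the general empirical risk LS(h) = (1/m) Σ l(h, zi) with the 0-1 loss plugged in; the sample, the hypothesis, and the helper names are illustrative and not from the slides. With the 0-1 loss, the general formula reduces to the misclassification rate used earlier.

from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]  # z = (x, y)

def zero_one_loss(h: Callable[[float], int], z: Example) -> float:
    # l_{0-1}(h, (x, y)) = 0 if h(x) = y, and 1 otherwise.
    x, y = z
    return 0.0 if h(x) == y else 1.0

def empirical_risk(h: Callable[[float], int], S: Sequence[Example], loss) -> float:
    # L_S(h) = (1/m) * sum_i loss(h, z_i), for any loss function.
    return sum(loss(h, z) for z in S) / len(S)

# Illustrative sample and hypothesis; with the 0-1 loss this is exactly the
# misclassification rate |{i : h(x_i) != y_i}| / m.
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 0)]
h = lambda x: 1 if x >= 0.5 else 0
print(empirical_risk(h, S, zero_one_loss))  # 0.25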
Square Loss
The random variable z ranges over X × Y and the loss function is
lsq(h, (x, y)) = (h(x) − y)^2.

Agnostic PAC Learnability for General Loss Functions
Definition
A hypothesis class H is agnostic PAC learnable with respect to a set Z and a loss function l : H × Z −→ R+ if there exist a function mH : (0, 1)^2 −→ N and a learning algorithm A with the following property: for every ε, δ ∈ (0, 1) and for every distribution D over Z, when running A on m ≥ mH(ε, δ) i.i.d. examples generated by D, A returns a hypothesis h such that, with probability at least 1 − δ (over the choice of the m training examples), we have
LD(h) ≤ min_{h′ ∈ H} LD(h′) + ε,
where LD(h) = Ez∼D(l(h, z)).
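As an illustration of this last definition, here is a sketch of one possible learner A: empirical risk minimization (ERM) over a small finite hypothesis class, instantiated with the squared loss on made-up regression data. The data, the class H, and the helper names are all illustrative, and ERM is one natural candidate for A, not the only one.

from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]  # z = (x, y)

def squared_loss(h: Callable[[float], float], z: Example) -> float:
    # l_sq(h, (x, y)) = (h(x) - y)^2.
    x, y = z
    return (h(x) - y) ** 2

def empirical_risk(h, S: Sequence[Example], loss) -> float:
    # L_S(h) = (1/m) * sum_i loss(h, z_i).
    return sum(loss(h, z) for z in S) / len(S)

def erm(H: Sequence[Callable[[float], float]], S: Sequence[Example], loss):
    # Return a hypothesis in the finite class H with minimal empirical risk.
    return min(H, key=lambda h: empirical_risk(h, S, loss))

# Illustrative regression sample and a small class of linear predictors x -> a*x.
S = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
H = [lambda x, a=a: a * x for a in (0.5, 1.0, 1.5, 2.0, 2.5)]
best = erm(H, S, squared_loss)
print(best(1.0), empirical_risk(best, S, squared_loss))  # picks the slope a = 2.0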