
The No-Free-Lunch Theorem
Prof. Dan A. Simovici
UMB

Outline
1 Preliminaries
2 The No-Free-Lunch Theorem

Preliminaries
Reminder
If K is an event such that P(K) = p, then 1_K is the random variable
$$\mathbf{1}_K = \begin{cases} 1 & \text{if } K \text{ takes place},\\ 0 & \text{otherwise.}\end{cases}$$
If P(K) = p, then
$$\mathbf{1}_K : \begin{pmatrix} 0 & 1 \\ 1-p & p \end{pmatrix}$$
and E(1_K) = p.
If X is a random variable
$$X : \begin{pmatrix} x_1 & \cdots & x_n \\ p_1 & \cdots & p_n \end{pmatrix},$$
then $X = \sum_{i=1}^{n} x_i \mathbf{1}_{X=x_i}$, where
$$\mathbf{1}_{X=x_i} : \begin{pmatrix} 0 & 1 \\ 1-p_i & p_i \end{pmatrix}.$$
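The decomposition above can be checked numerically. The following Python sketch (not part of the original slides; the particular values x_i and probabilities p_i are arbitrary) verifies that E(1_{X=x_i}) is approximately p_i and that X coincides with the weighted sum of its indicators.

```python
# Minimal sketch: empirical check that E(1_{X=x_i}) ~ p_i and that
# X = sum_i x_i * 1_{X=x_i} for a small discrete random variable.
import random

random.seed(0)
values = [2.0, 5.0, 7.0]      # x_1, ..., x_n (arbitrary)
probs = [0.2, 0.5, 0.3]       # p_1, ..., p_n (arbitrary)
N = 100_000

samples = random.choices(values, weights=probs, k=N)

# E(1_{X=x_i}) should be close to p_i.
for x, p in zip(values, probs):
    indicator_mean = sum(1 for s in samples if s == x) / N
    print(f"E(1_[X={x}]) ~ {indicator_mean:.3f}   (p_i = {p})")

# X = sum_i x_i * 1_{X=x_i}: each sample is recovered from its indicators.
reconstructed = [sum(x * (1 if s == x else 0) for x in values) for s in samples]
assert reconstructed == samples
```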

Preliminaries
First Lemma
Lemma
Let Z be a random variable that takes values in [0, 1] such that E[Z] = μ. Then, for every a ∈ (0, 1) we have
$$P(Z > 1-a) \ge \frac{\mu - (1-a)}{a} \quad\text{and}\quad P(Z > a) \ge \frac{\mu - a}{1-a} \ge \mu - a.$$
Proof: The random variable Y = 1 − Z is non-negative with E(Y) = 1 − E(Z) = 1 − μ. By Markov's inequality:
$$P(Z \le 1-a) = P(1-Z \ge a) = P(Y \ge a) \le \frac{E(Y)}{a} = \frac{1-\mu}{a}.$$
Therefore,
$$P(Z > 1-a) \ge 1 - \frac{1-\mu}{a} = \frac{a + \mu - 1}{a} = \frac{\mu - (1-a)}{a}.$$

Preliminaries
Proof (cont’d)
By replacing a by 1 − a we have:
$$P(Z > a) \ge \frac{\mu - a}{1-a} \ge \mu - a.$$
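As a sanity check, the Python sketch below (illustrative only; the Beta(4, 2) distribution for Z is an arbitrary choice with mean 2/3) estimates both probabilities by simulation and compares them with the lemma's lower bounds.

```python
# Monte Carlo sanity check of the two lower bounds for a [0,1]-valued Z.
# Beta(4, 2) is an arbitrary illustrative choice; its mean is 2/3.
import random

random.seed(1)
N = 200_000
a = 0.4

Z = [random.betavariate(4, 2) for _ in range(N)]
mu = sum(Z) / N

p_high = sum(1 for z in Z if z > 1 - a) / N   # P(Z > 1 - a)
p_low = sum(1 for z in Z if z > a) / N        # P(Z > a)

print(f"P(Z > 1-a) ~ {p_high:.3f} >= (mu-(1-a))/a = {(mu - (1 - a)) / a:.3f}")
print(f"P(Z > a)   ~ {p_low:.3f} >= (mu-a)/(1-a) = {(mu - a) / (1 - a):.3f} >= mu-a = {mu - a:.3f}")
```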

Preliminaries
Second Lemma
Lemma
Let θ be a random variable that ranges in the interval [0, 1] such that E(θ) ≥ 1/4. We have
$$P\left(\theta > \frac18\right) \ge \frac17.$$
Proof: From the second inequality of the previous lemma it follows that
$$P(\theta > a) \ge \frac{E(\theta) - a}{1-a}.$$
By substituting a = 1/8 we obtain:
$$P\left(\theta > \frac18\right) \ge \frac{\frac14 - \frac18}{1 - \frac18} = \frac17.$$

The No-Free-Lunch Theorem
A learning task is defined by an unknown probability distribution D over X × Y.
The goal of the learner is to find (to learn) a hypothesis h : X −→ Y such that its risk LD(h) is sufficiently small.
The choice of a hypothesis class H reflects some prior knowledge that the learner has about the task: a belief that a member of H is a low-error model for the task.
Fundamental Question: Does there exist a universal learner A and a training set size m such that, for every distribution D, if A receives m iid examples from D, then with high probability A produces a hypothesis h with low risk?

The No-Free-Lunch Theorem
The No-Free-Lunch (NFL) Theorem stipulates that a universal learner (for every distribution) does not exist!
A learner fails if, upon receiving a sequence of iid examples from a distribution D, its output hypothesis is likely to have a large loss (say, larger than 0.3), whereas for the same distribution there exists another learner that will output a hypothesis with a small loss.
More precise statement: for every learner for the task of binary prediction, there exists a distribution D on which that learner fails.
No learner can succeed on all learning tasks: every learner has tasks on which it fails whereas other learners succeed.

The No-Free-Lunch Theorem
Recall 0/1 Loss Function
The loss function is the 0/1-loss function $\ell_{0-1}$:
$$\ell_{0-1}(h,(x,y)) = \begin{cases} 0 & \text{if } h(x) = y,\\ 1 & \text{if } h(x) \neq y.\end{cases}$$
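For concreteness, here is a direct Python transcription of the 0/1-loss together with the corresponding empirical risk (a minimal sketch; the function names and the example hypothesis are ours, not the slides').

```python
# 0/1-loss for one labeled example, and the average 0/1-loss over a sample.
def zero_one_loss(h, example):
    """l_{0-1}(h, (x, y)) = 0 if h(x) == y, and 1 otherwise."""
    x, y = example
    return 0 if h(x) == y else 1

def empirical_risk(h, sample):
    """Average 0/1-loss of hypothesis h over a finite list of (x, y) pairs."""
    return sum(zero_one_loss(h, ex) for ex in sample) / len(sample)

# Example: a threshold hypothesis over integers.
h = lambda x: 1 if x >= 3 else 0
sample = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 1)]
print(empirical_risk(h, sample))   # 0.2: only (4, 0) is misclassified
```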

The No-Free-Lunch Theorem
The NFL Theorem
For a learning algorithm A denote by A(S) the hypothesis returned by the algorithm A upon receiving the training sequence S.
Theorem
Let A be any learning algorithm for the task of binary classification with respect to the 0/1-loss function over a domain X. Let m < |X|/2 be a number representing a training set size. There exists a distribution D over X × {0, 1} such that:
there exists a function f : X −→ {0, 1} with L_D(f) = 0;
with probability at least 1/7 over the choice of a sample S of size m, the hypothesis h = A(S) satisfies L_D(h) ≥ 1/8.

The No-Free-Lunch Theorem
Interpretation of the NFL Theorem
For every learner there is a task on which it fails, even though the same task can be learned successfully by another learner.

The No-Free-Lunch Theorem
Proof
Let C be a subset of X of size 2m; such a set exists because we assume that |X| > 2m.
Intuition of the proof: any algorithm that observes only half of the instances of C has no information about what the labels of the other half should be. Therefore, there exists a target function f that contradicts the labels that A(S) predicts on the unobserved instances of C.

The No-Free-Lunch Theorem
Note that:
There are T = 2^{2m} possible functions from C to {0, 1}: f_1, ..., f_T.
The set C × {0, 1} consists of the pairs
$$C \times \{0,1\} = \{(x_1, 0), (x_1, 1), \ldots, (x_{2m}, 0), (x_{2m}, 1)\}.$$
For each f_i let D_i be the distribution over C × {0, 1} given by
$$D_i(\{(x,y)\}) = \begin{cases} \frac{1}{|C|} & \text{if } y = f_i(x),\\ 0 & \text{otherwise.}\end{cases}$$
The probability of choosing a pair (x, y) is 1/|C| if y is the true label according to f_i, and 0 otherwise (if y ≠ f_i(x)). Clearly L_{D_i}(f_i) = 0.

The No-Free-Lunch Theorem
Intuition
Let m = 3, C = {x_1, x_2, x_3, x_4, x_5, x_6}. Suppose that
$$f_i(x_1) = 1,\; f_i(x_2) = 0,\; f_i(x_3) = 1,\; f_i(x_4) = 1,\; f_i(x_5) = 1,\; f_i(x_6) = 0.$$
The distribution D_i is:
$$D_i:\;
\begin{array}{cccccc}
(x_1,0) & (x_2,0) & (x_3,0) & (x_4,0) & (x_5,0) & (x_6,0)\\
0 & \frac16 & 0 & 0 & 0 & \frac16\\[4pt]
(x_1,1) & (x_2,1) & (x_3,1) & (x_4,1) & (x_5,1) & (x_6,1)\\
\frac16 & 0 & \frac16 & \frac16 & \frac16 & 0
\end{array}$$
Clearly, we have:
$$L_{D_i}(f_i) = P(\{(x,y) \mid f_i(x) \ne y\}) = 0.$$
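The construction of D_i can also be spelled out in code. The sketch below (illustrative; it uses the labeling f_i of the example above) builds D_i over C × {0, 1} and checks that L_{D_i}(f_i) = 0.

```python
# Build D_i for the example above and verify that L_{D_i}(f_i) = 0.
from fractions import Fraction

C = ["x1", "x2", "x3", "x4", "x5", "x6"]
f_i = {"x1": 1, "x2": 0, "x3": 1, "x4": 1, "x5": 1, "x6": 0}

# D_i({(x, y)}) = 1/|C| if y = f_i(x), and 0 otherwise.
D_i = {(x, y): (Fraction(1, len(C)) if y == f_i[x] else Fraction(0))
       for x in C for y in (0, 1)}
assert sum(D_i.values()) == 1               # D_i is a probability distribution

def true_risk(h, D):
    """L_D(h): total D-probability of the pairs (x, y) with h(x) != y."""
    return sum(p for (x, y), p in D.items() if h(x) != y)

print(true_risk(lambda x: f_i[x], D_i))     # 0: f_i has zero risk under D_i
```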

The No-Free-Lunch Theorem
Claim (*):
For every algorithm A that receives a training set of m examples from C × {0, 1} and returns a function A(S) : C −→ {0, 1} we have:
$$\max_{1 \le i \le T} E_{S \sim D_i^m}\big(L_{D_i}(A(S))\big) \ge \frac14.$$
This means that for every A′ that receives a training set of m examples from X × {0, 1} there exist a function f : X −→ {0, 1} and a distribution D over X × {0, 1} such that L_D(f) = 0 and $E_{S \sim D^m}\big(L_D(A'(S))\big) \ge \frac14$.

The No-Free-Lunch Theorem
Note that the index j refers to samples while i refers to hypotheses.
There are k = (2m)^m possible sequences (samples) S_1, ..., S_k of m examples from C.
If S_j = (x_1, ..., x_m), the sequence labeled by a function f_i is denoted by
$$S_j^i = ((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m))).$$
If the distribution is D_i, then the possible training sets that A can receive are S_1^i, ..., S_k^i, and all these training sets have the same probability of being sampled. Therefore, the expected error over the choice of the sample S is:
$$E_{S \sim D_i^m}\big(L_{D_i}(A(S))\big) = \frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i)).$$

The No-Free-Lunch Theorem
Recall that there are T = 2^{2m} possible functions from C to {0, 1}: f_1, ..., f_T, so 1 ≤ i ≤ T, where i is the superscript of S_j^i reflecting the labeling function f_i.
We have:
$$\begin{aligned}
\max_{1 \le i \le T} \frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i))
&\ge \frac{1}{T}\sum_{i=1}^{T}\frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i))\\
&= \frac{1}{k}\sum_{j=1}^{k}\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i))\\
&\ge \min_{1 \le j \le k}\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i)).
\end{aligned}$$

The No-Free-Lunch Theorem
Fix some j and let S_j = (x_1, ..., x_m). Let v_1, ..., v_p be the examples in C that do not appear in S_j. Clearly p ≥ m. Therefore, for each h : C −→ {0, 1} and every i we have:
$$L_{D_i}(h) = \frac{1}{2m}\sum_{x \in C} \mathbf{1}_{h(x) \ne f_i(x)} \ge \frac{1}{2m}\sum_{r=1}^{p} \mathbf{1}_{h(v_r) \ne f_i(v_r)} \ge \frac{1}{2p}\sum_{r=1}^{p} \mathbf{1}_{h(v_r) \ne f_i(v_r)}.$$
Hence,
$$\begin{aligned}
\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i))
&\ge \frac{1}{T}\sum_{i=1}^{T} \frac{1}{2p}\sum_{r=1}^{p} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)}\\
&= \frac{1}{2p}\sum_{r=1}^{p} \frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)}\\
&\ge \frac{1}{2}\min_{1 \le r \le p} \frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)}.
\end{aligned}$$
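The first of these inequalities is easy to check numerically for one fixed choice of h, f_i and sample. The Python sketch below is illustrative only: the labeling f_i, the hypothesis h, and the sample S_j are arbitrary.

```python
# Check L_{D_i}(h) >= (1/(2p)) * sum_r 1[h(v_r) != f_i(v_r)] for one fixed instance.
from fractions import Fraction

C = list(range(6))                 # C = {0, ..., 5}, so 2m = 6 and m = 3
f_i = {x: x % 2 for x in C}        # an arbitrary labeling function f_i
h = {x: 1 for x in C}              # an arbitrary hypothesis h : C -> {0, 1}
S_j = [0, 2, 2]                    # a sample of m = 3 instances (repetitions allowed)

unseen = [v for v in C if v not in S_j]      # v_1, ..., v_p
p = len(unseen)                              # here p = 4 >= m

L_Di_h = Fraction(sum(h[x] != f_i[x] for x in C), len(C))        # (1/2m) * sum over C
lower = Fraction(sum(h[v] != f_i[v] for v in unseen), 2 * p)     # (1/2p) * sum over unseen

print(L_Di_h, ">=", lower, ":", L_Di_h >= lower)                 # 1/2 >= 1/8 : True
```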

The No-Free-Lunch Theorem
Let v_r be an example in C that does not appear in the sample S_j. We can partition all functions in {f_1, ..., f_T} into T/2 disjoint sets {f_i, f_{i'}} such that we have
$$f_i(c) \ne f_{i'}(c) \text{ if and only if } c = v_r.$$
Since for a set {f_i, f_{i'}} we must have S_j^i = S_j^{i'}, it follows that
$$\mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)} + \mathbf{1}_{A(S_j^{i'})(v_r) \ne f_{i'}(v_r)} = 1,$$
which implies
$$\frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)} = \frac12.$$
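The pairing argument can be verified by brute force on a tiny domain. In the sketch below (illustrative; the learner A, which memorizes the sample and predicts 0 elsewhere, is an arbitrary choice) the average of the indicator over all T = 2^{|C|} labelings comes out to exactly 1/2.

```python
# Brute-force check: for a fixed deterministic A and an unseen point v_r, the
# average over all T labelings of 1[A(S_j^i)(v_r) != f_i(v_r)] equals 1/2
# (checked here for one arbitrary choice of A).
from itertools import product
from fractions import Fraction

C = [0, 1, 2, 3]                   # here 2m = 4 and m = 2
S_j = [0, 1]                       # instances of the sample; v_r = 2 is unseen
v_r = 2

def A(labeled_sample):
    """Arbitrary deterministic learner: memorize the sample, predict 0 elsewhere."""
    memory = dict(labeled_sample)
    return lambda x: memory.get(x, 0)

labelings = [dict(zip(C, bits)) for bits in product((0, 1), repeat=len(C))]  # f_1..f_T

total = Fraction(0)
for f in labelings:
    S_j_i = [(x, f[x]) for x in S_j]   # the sample S_j labeled by f
    total += int(A(S_j_i)(v_r) != f[v_r])

print(total / len(labelings))          # 1/2
```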

The No-Free-Lunch Theorem
Since
$$\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \ge \frac{1}{2}\min_{1 \le r \le p} \frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)}$$
and
$$\frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)} = \frac12,$$
we have
$$\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \ge \frac14.$$

The No-Free-Lunch Theorem
Thus,
$$\max_{1 \le i \le T}\frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \ge \min_{1 \le j \le k}\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i))$$
implies
$$\max_{1 \le i \le T}\frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \ge \frac14.$$

The No-Free-Lunch Theorem
We combined
$$E_{S \sim D_i^m}\big(L_{D_i}(A(S))\big) = \frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i)),$$
$$\max_{1 \le i \le T}\frac{1}{k}\sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \ge \min_{1 \le j \le k}\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i)),$$
$$\frac{1}{T}\sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \ge \frac{1}{2}\min_{1 \le r \le p}\frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)},$$
and
$$\frac{1}{T}\sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \ne f_i(v_r)} = \frac12$$
to obtain:
$$\max_{1 \le i \le T} E_{S \sim D_i^m}\big(L_{D_i}(A(S))\big) \ge \frac14.$$

The No-Free-Lunch Theorem
Thus, the Claim (*) is justified.
This means that for every algorithm A′ that receives a training set of m examples from X × {0, 1} there exist a function f : X −→ {0, 1} and a distribution D over X × {0, 1} such that L_D(f) = 0 and
$$E_{S \sim D^m}\big(L_D(A'(S))\big) \ge \frac14.$$
By the second Lemma this implies:
$$P\left(L_D(A'(S)) \ge \frac18\right) \ge \frac17.$$
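Claim (*) and the final bound can be verified exhaustively on a small domain. The sketch below is illustrative: it fixes |C| = 2m = 4, m = 2 and an arbitrary "memorize and predict 0" learner A, computes max_i E_{S∼D_i^m}(L_{D_i}(A(S))) exactly, and reports the probability that the returned hypothesis has risk at least 1/8 under the worst distribution.

```python
# Exhaustive check of Claim (*) for |C| = 2m = 4, m = 2, and one arbitrary learner A:
# max_i E_{S ~ D_i^m}(L_{D_i}(A(S))) >= 1/4, and under the worst D_i the probability
# that L_{D_i}(A(S)) >= 1/8 is at least 1/7.
from itertools import product
from fractions import Fraction

C, m = [0, 1, 2, 3], 2

def A(labeled_sample):
    """Arbitrary learner: memorize the observed labels, predict 0 on unseen points."""
    memory = dict(labeled_sample)
    return lambda x: memory.get(x, 0)

def risk(h, f):
    """L_{D_i}(h): uniform 0/1-risk of h against the true labeling f over C."""
    return Fraction(sum(h(x) != f[x] for x in C), len(C))

labelings = [dict(zip(C, bits)) for bits in product((0, 1), repeat=len(C))]  # f_1..f_T
sequences = list(product(C, repeat=m))                                       # S_1..S_k

expected, prob_bad = [], []
for f in labelings:
    risks = [risk(A([(x, f[x]) for x in S]), f) for S in sequences]  # one per S_j^i
    expected.append(sum(risks) / len(risks))
    prob_bad.append(Fraction(sum(r >= Fraction(1, 8) for r in risks), len(risks)))

i_star = max(range(len(labelings)), key=lambda i: expected[i])
print("max_i E[L_{D_i}(A(S))] =", expected[i_star])      # 9/16 >= 1/4
print("P(L_{D_i*}(A(S)) >= 1/8) =", prob_bad[i_star])    # 1    >= 1/7
```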