
Learning Theory
COMP9417 Machine Learning and Data Mining
Term 2, 2020
COMP9417 ML & DM Learning Theory Term 2, 2020 1 / 78

Acknowledgements
Material derived from slides for the book
“Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
Material derived from slides for the book
“Machine Learning: A Probabilistic Perspective” by P. Murphy MIT Press (2012)
http://www.cs.ubc.ca/~murphyk/MLbook
Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
Material derived from slides for the book
“Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
Material derived from figures for the book
“Python Data Science Handbook” by J. VanderPlas O’Reilly Media (2017) http://shop.oreilly.com/product/0636920034919.do
Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani, Goa, India (2016)
COMP9417 ML & DM Learning Theory Term 2, 2020 2 / 78

Aims
Aims
This lecture will introduce you to some foundational results that apply in machine learning irrespective of any particular algorithm, and will enable you to define and reproduce some of the fundamental approaches and results from computational and statistical learning theory. Following it you should be able to:
• describe a basic theoretical framework for sample complexity of learning
• describe the Probably Approximately Correct (PAC) learning framework
• describe the Vapnik-Chervonenkis (VC) dimension framework
• describe the Mistake Bounds framework and apply the Winnow
algorithm within this framework
• outline the “No Free Lunch” Theorem
COMP9417 ML & DM Learning Theory Term 2, 2020 3 / 78

Introduction
Introduction
Are there general laws that apply to inductive learning?
From computational learning theory and statistical learning theory:
• in both cases, theoretical results have been obtained that apply in very general settings
• provide elegant frameworks leading to results on what can or cannot be learned algorithmically
• however, theoretical results are not always easy to obtain for practically useful machine learning methods and applications
• need to make (sometimes unrealistic) assumptions
• results may be overly pessimistic, e.g., worst-case bounds
• nonetheless, both areas have contributed results which are important for understanding some key issues in machine learning
• have also led to important advances in practical algorithms, such as boosting and support vector machines
COMP9417 ML & DM Learning Theory Term 2, 2020 4 / 78

Introduction
Key Idea
Learning theory aims at a body of theory that captures all important aspects of the fundamentals of the learning process and of any algorithm or class of algorithms designed to do learning, i.e., we desire theory that captures the algorithm-independent aspects of machine learning.
BUT: we’re not quite there yet . . .
COMP9417 ML & DM Learning Theory Term 2, 2020 5 / 78

Computational Learning Theory
Computational Learning Theory
We seek theory to relate:
• Probability of successful learning
• Number of training examples
• Complexity of hypothesis space
• Time complexity of learning algorithm
• Accuracy to which target concept is approximated
• Manner in which training examples presented
COMP9417 ML & DM Learning Theory Term 2, 2020 6 / 78

Computational Learning Theory
Questions for Learning Theory
Computational Learning Theory
Some questions to ask, without focusing on any particular algorithm:
• Sample complexity
  • How many training examples required for learner to converge (with high probability) to a successful hypothesis ?
• Computational complexity
  • How much computational effort required for learner to converge (with high probability) to a successful hypothesis ?
• Hypothesis complexity
  • How do we measure the complexity of a hypothesis ?
  • How large is a hypothesis space ?
• Mistake bounds
  • How many training examples will the learner misclassify before converging to a successful hypothesis ?
COMP9417 ML & DM Learning Theory Term 2, 2020 7 / 78

Computational Learning Theory
Questions for Learning Theory
Computational Learning Theory
What do we consider to be a successful hypothesis:
• identical to target concept ?
• mostly agrees with target concept . . . ?
• … does this most of the time ?
COMP9417 ML & DM Learning Theory Term 2, 2020 8 / 78

Computational Learning Theory
Questions for Learning Theory
Computational Learning Theory
Instead of focusing on particular algorithms, learning theory aims to characterise classes of algorithms.
The framework of Probably Approximately Correct (PAC) learning can be used to address questions such as:
• How many training examples ?
• How much computational effort required ? • How complex a hypothesis class needed ? • How to quantify hypothesis complexity ? • How many mistakes will be made ?
We start to look at PAC learning using Concept Learning.
COMP9417 ML & DM Learning Theory Term 2, 2020 9 / 78

PAC Learning
Prototypical Concept Learning Task
Given:
Instances X: Possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
Target function c: EnjoySport : X → {0, 1}
Hypotheses H: Conjunctions of literals. E.g.
⟨?, Cold, High, ?, ?, ?⟩
Training examples D: Positive and negative examples of target function
⟨x1, c(x1)⟩, . . . ⟨xm, c(xm)⟩
Determine:
A hypothesis h in H such that h(x) = c(x) for all x in D?
A hypothesis h in H such that h(x) = c(x) for all x in X?
COMP9417 ML & DM Learning Theory Term 2, 2020 10 / 78

PAC Learning
Sample Complexity
Given:
  set of instances X
  set of hypotheses H
  set of possible target concepts C
  training instances generated by a fixed, unknown probability distribution D over X

Learner observes a sequence D of training examples of form ⟨x, c(x)⟩, for some target concept c ∈ C
  instances x are drawn from distribution D
  teacher provides target value c(x) for each

Learner must output a hypothesis h estimating c
  h is evaluated by its performance on subsequent instances drawn according to D
Note: randomly drawn instances, noise-free classifications
COMP9417 ML & DM Learning Theory Term 2, 2020 11 / 78

PAC Learning
True Error of a Hypothesis
[Figure: instance space X, showing the instances covered by target concept c and by hypothesis h; the region where c and h disagree corresponds to the true error of h.]
COMP9417 ML & DM Learning Theory Term 2, 2020 12 / 78

PAC Learning
True Error of a Hypothesis
Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probabil- ity that h will misclassify an instance drawn at random according to D.
errorD(h) ≡ Pr_{x∈D}[c(x) ≠ h(x)]
COMP9417 ML & DM Learning Theory Term 2, 2020 13 / 78

PAC Learning
Two Notions of Error
Training error of hypothesis h with respect to target concept c
• How often h(x) ≠ c(x) over training instances
True error of hypothesis h with respect to c
• How often h(x) ≠ c(x) over future random instances
Our concern:
• Can we bound the true error of h given the training error of h?
• First consider when training error of h is zero (i.e., h ∈ VS_H,D)
COMP9417 ML & DM Learning Theory Term 2, 2020 14 / 78

PAC Learning
Concept Learning as Search
Question: What can be learned ?
Answer: (only) what is in the hypothesis space
How big is the hypothesis space for EnjoySport ?
Instance space
Sky×AirTemp×…×Forecast = 3×2×2×2×2×2 = 96
COMP9417 ML & DM Learning Theory Term 2, 2020 15 / 78

PAC Learning
Concept Learning as Search
Hypothesis space
Sky×AirTemp×…×Forecast = 5×4×4×4×4×4 = 5120
(semantically distinct1 only) = 1+(4×3×3×3×3×3) = 973
The learning problem ≡ searching a hypothesis space. How ?
1Any hypothesis with an ∅ constraint covers no instances, hence all are semantically equivalent.
COMP9417 ML & DM Learning Theory Term 2, 2020 16 / 78

PAC Learning
Instances, Hypotheses, and More-General-Than
[Figure: instances x1, x2 in instance space X and hypotheses h1, h2, h3 in hypothesis space H, ordered from specific to general; arrows indicate which instances each hypothesis covers under the more-general-than relation.]
COMP9417 ML & DM Learning Theory Term 2, 2020 17 / 78

PAC Learning
A generality order on hypotheses
Definition: Let hj and hk be Boolean-valued functions defined over instances X. Then hj is more general than or equal to hk (written hj ≥g hk) if and only if
(∀x ∈ X)[(hk(x) = 1) → (hj(x) = 1)]
Intuitively, hj is more general than or equal to hk if any instance
satisfying hk also satisfies hj.
hj is (strictly) more general than hk (written hj >g hk) if and only if
(hj ≥g hk) ∧ (hk ̸≥g hj).
hj is more specific than hk when hk is more general than hj.
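Since the instance space here is finite, the ≥g relation can be checked directly by enumeration. A minimal sketch, with an illustrative hypothesis encoding and toy attribute domains (not from the slides):

```python
from itertools import product

def covers(h, x):
    """True if conjunctive hypothesis h labels instance x positive ('?' = any value)."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general_or_equal(hj, hk, domains):
    """Check hj >=_g hk by testing every instance x: hk(x)=1 implies hj(x)=1."""
    return all(covers(hj, x) for x in product(*domains) if covers(hk, x))

# toy check with two attributes
domains = [['Sunny', 'Rainy'], ['Warm', 'Cold']]
print(more_general_or_equal(('Sunny', '?'), ('Sunny', 'Warm'), domains))  # True
print(more_general_or_equal(('Sunny', 'Warm'), ('Sunny', '?'), domains))  # False
```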
COMP9417 ML & DM Learning Theory Term 2, 2020 18 / 78

PAC Learning
Version Space
A hypothesis h is consistent with a set of training examples D of target concept c if and only if h(x) = c(x) for each training example ⟨x, c(x)⟩ in D.
Consistent(h, D) ≡ (∀⟨x, c(x)⟩ ∈ D) h(x) = c(x)
The version space, V SH,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with all training examples in D.
V SH,D ≡ {h ∈ H|Consistent(h, D)}
COMP9417 ML & DM Learning Theory Term 2, 2020 19 / 78

PAC Learning
Exhausting the Version Space
[Figure: hypothesis space H containing six hypotheses annotated with (true error, training error r): (error = .1, r = .2), (error = .3, r = .4), (error = .3, r = .1), (error = .2, r = 0), (error = .1, r = 0), (error = .2, r = .3). The version space VS_H,D contains the hypotheses with training error r = 0.]
COMP9417 ML & DM Learning Theory Term 2, 2020 20 / 78

PAC Learning
Exhausting the Version Space
Note: in the diagram
(r = training error, error = true error)
Definition: The version space V SH,D is said to be ε-exhausted with respect to c and D, if every hypothesis h in V SH,D has error less than ε with respect to c and D.
(∀h ∈ VS_H,D) errorD(h) < ε
So VS_H,D is not ε-exhausted if it contains at least one h with errorD(h) ≥ ε.
COMP9417 ML & DM Learning Theory Term 2, 2020 21 / 78

PAC Learning
How many examples will ε-exhaust the VS?
Theorem: [Haussler, 1988]. If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space with respect to H and D is not ε-exhausted (with respect to c) is less than
|H| e^(−εm)
Interesting! This bounds the probability that any consistent learner will output a hypothesis h with error(h) ≥ ε
COMP9417 ML & DM Learning Theory Term 2, 2020 22 / 78

PAC Learning
How many examples will ε-exhaust the VS?
If we want this probability to be below δ
|H| e^(−εm) ≤ δ
then
m ≥ (1/ε)(ln |H| + ln(1/δ))
COMP9417 ML & DM Learning Theory Term 2, 2020 23 / 78

PAC Learning
How many examples will ε-exhaust the VS?
How many examples are sufficient to assure with probability at least (1 − δ) that every h in VS_H,D satisfies errorD(h) ≤ ε ?
Use our theorem:
m ≥ (1/ε)(ln |H| + ln(1/δ))
COMP9417 ML & DM Learning Theory Term 2, 2020 24 / 78

PAC Learning
Learning Conjunctions of Boolean Literals
Suppose H contains conjunctions of constraints on up to n Boolean attributes (i.e., n Boolean literals – any h can contain a literal, or its negation, or neither). Then |H| = 3^n, and
m ≥ (1/ε)(ln 3^n + ln(1/δ))
or
m ≥ (1/ε)(n ln 3 + ln(1/δ))
COMP9417 ML & DM Learning Theory Term 2, 2020 25 / 78

PAC Learning
How About EnjoySport?
m ≥ (1/ε)(ln |H| + ln(1/δ))
If H is as given in EnjoySport then |H| = 973, and
m ≥ (1/ε)(ln 973 + ln(1/δ))
COMP9417 ML & DM Learning Theory Term 2, 2020 26 / 78

PAC Learning
How About EnjoySport?
... if we want to assure that with probability 95%, VS contains only hypotheses with errorD(h) ≤ .1, then it is sufficient to have m examples, where
m ≥ (1/0.1)(ln 973 + ln(1/0.05))
m ≥ 10(ln 973 + ln 20)
m ≥ 10(6.88 + 3.00)
m ≥ 98.8
COMP9417 ML & DM Learning Theory Term 2, 2020 27 / 78

PAC Learning
PAC Learning
Consider a class C of possible target concepts defined over a set of instances X of length n, and a learner L using hypothesis space H.
Definition: C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 − δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n and size(c).
Probably Approximately Correct Learning
L. Valiant, (1984; 2013).
COMP9417 ML & DM Learning Theory Term 2, 2020 28 / 78

PAC Learning
PAC Learning in a Real-Valued Hypothesis Space
In a real-valued hypothesis space a ray is a one-dimensional threshold function defined for a real value θ on instance space R^1 by
r_θ(x) = 1 ⟺ x ≥ θ
A learning algorithm for rays should find the smallest ray containing all the positive examples in the training set. Given training examples D = {⟨x1, c(x1)⟩, ⟨x2, c(x2)⟩, . . . , ⟨xm, c(xm)⟩}, the hypothesis output by the learner L(D) should be r_λ, where
λ = argmin_{1≤i≤m} {xi | c(xi) = 1}
and λ = ∞ if the training set has no positive examples of the concept.
COMP9417 ML & DM Learning Theory Term 2, 2020 29 / 78

PAC Learning
PAC Learning in a Real-Valued Hypothesis Space
A “memoryless online” algorithm2 to learn rays based on finding the minimum of a set:
1. set λ = ∞
2. for i = 1 to m do
   • if c(xi) = 1 and xi < λ then set λ = xi
3. L(D) = r_λ
where ∞ is a symbol taken to be greater than any real number.
2See: Anthony and Biggs (1992).
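A minimal Python sketch of this memoryless online ray learner (function and variable names are illustrative):

```python
import math

def learn_ray(examples):
    """Memoryless online learner for rays r_theta(x) = 1 iff x >= theta.

    examples: iterable of (x, label) pairs with label in {0, 1}.
    Returns lambda, the threshold of the smallest ray covering all positives
    (math.inf if no positive example has been seen).
    """
    lam = math.inf                    # start with the ray that covers nothing
    for x, label in examples:
        if label == 1 and x < lam:    # shrink lambda only on positive examples
            lam = x
    return lam

def predict(lam, x):
    """Classify x with the learned ray r_lambda."""
    return 1 if x >= lam else 0
```

Run on the training set shown on the next slide, this returns 3.3413, i.e., the final hypothesis r_3.3413 given there.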
COMP9417 ML & DM Learning Theory Term 2, 2020 30 / 78

PAC Learning
PAC Learning in a Real-Valued Hypothesis Space
For example, if the training set is
{⟨3.6578, 1⟩, ⟨2.5490, 0⟩, ⟨3.4156, 1⟩, ⟨3.5358, 1⟩, ⟨3.3413, 1⟩, ⟨4.4987, 1⟩}
the corresponding sequence of hypotheses is
r_∞, r_3.6578, r_3.6578, r_3.4156, r_3.4156, r_3.3413, r_3.3413
COMP9417 ML & DM Learning Theory Term 2, 2020 31 / 78

PAC Learning
PAC Learning in a Real-Valued Hypothesis Space
If the training set is for some target concept r_θ, then the hypothesis returned by the learning algorithm should be a ray r_λ where λ ≥ θ.
As the size of the training set increases, we should expect a smaller error from predicting with r_λ instead of r_θ.
An error of ε corresponds to the interval between θ and λ (with λ ≥ θ). The probability that a single random example misses this interval is 1 − ε; for m examples this is (1 − ε)^m.
As in the previous setting, can show that the algorithm for learning rays is PAC.
COMP9417 ML & DM Learning Theory Term 2, 2020 32 / 78

PAC Learning
Agnostic PAC Learning
So far, assumed c ∈ H — consistent learners (noise-free)
Agnostic learning setting: don’t assume c ∈ H
• What do we want then?
  • The hypothesis h that makes fewest errors on training data
• What is sample complexity in this case?
m ≥ (1/(2ε²))(ln |H| + ln(1/δ))
derived from Hoeffding bounds:
Pr[errorD(h) > error_train(h) + ε] ≤ e^(−2mε²)
where error_train(h) denotes the training error of h on the m examples.
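The finite-|H| bounds above are easy to evaluate numerically. A small sketch, assuming the consistent-learner bound m ≥ (1/ε)(ln|H| + ln(1/δ)) and the agnostic bound m ≥ (1/(2ε²))(ln|H| + ln(1/δ)) exactly as given on these slides:

```python
import math

def sample_size_consistent(H_size, eps, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta)) for consistent learners."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def sample_size_agnostic(H_size, eps, delta):
    """m >= (1/(2 eps^2)) * (ln|H| + ln(1/delta)) in the agnostic setting."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

# EnjoySport: |H| = 973, eps = 0.1, delta = 0.05 -> 99 (the 98.8 bound, rounded up)
print(sample_size_consistent(973, 0.1, 0.05))
# the same (eps, delta) in the agnostic setting needs many more examples
print(sample_size_agnostic(973, 0.1, 0.05))
```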
COMP9417 ML & DM Learning Theory Term 2, 2020 33 / 78

PAC Learning
Unbiased Learners
Unbiased concept class C contains all target concepts definable on instance space X.
|C| = 2^|X|
Say X is defined using n Boolean features, then |X| = 2^n.
|C| = 2^(2^n)
An unbiased learner has a hypothesis space able to represent all possible target concepts, i.e., H = C.
m ≥ (1/ε)(2^n ln 2 + ln(1/δ))
i.e., exponential (in the number of features) sample complexity !
COMP9417 ML & DM Learning Theory Term 2, 2020 34 / 78

VC Dimension
How Complex is a Hypothesis ?
COMP9417 ML & DM Learning Theory Term 2, 2020 35 / 78

VC Dimension
How Complex is a Hypothesis ?
The solid curve is the function sin(50x) for x ∈ [0, 1].
The blue (solid) and green (hollow) points illustrate how the associated indicator function I(sin(α x) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency α.
Classes separated based on sin(α x), for frequency α, a single parameter.
COMP9417 ML & DM Learning Theory Term 2, 2020 36 / 78

VC Dimension
Shattering a Set of Instances
Definition: a dichotomy of a set S is a partition of S into two disjoint subsets.
Definition: a set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
COMP9417 ML & DM Learning Theory Term 2, 2020 37 / 78

VC Dimension
Three Instances Shattered
[Figure: instance space X with three instances; each of the eight possible dichotomies of the three instances is realised by some hypothesis, so the set is shattered.]
COMP9417 ML & DM Learning Theory Term 2, 2020 38 / 78

VC Dimension
Shattering a set of instances
Consider the following instances:
m = ManyTeeth ∧ ¬Gills ∧ ¬Short ∧ ¬Beak
g = ¬ManyTeeth ∧ Gills ∧ ¬Short ∧ ¬Beak
s = ¬ManyTeeth ∧ ¬Gills ∧ Short ∧ ¬Beak
b = ¬ManyTeeth ∧ ¬Gills ∧ ¬Short ∧ Beak
There are 16 different subsets of the set {m, g, s, b}. Can each of them be represented by its own conjunctive concept? The answer is yes: for every instance we want to exclude, we add the corresponding negated literal to the conjunction. Thus, {m, s} is represented by ¬Gills ∧ ¬Beak, {g, s, b} is represented by ¬ManyTeeth, {s} is represented by
¬ManyTeeth ∧ ¬Gills ∧ ¬Beak, and so on. We say that this set of four instances is shattered by the hypothesis language of conjunctive concepts.
COMP9417 ML & DM Learning Theory Term 2, 2020 39 / 78

VC Dimension
Shattering a set of instances
Suppose we have a dataset described by d Boolean features, and a hypothesis space of conjunctions of up to d Boolean literals. Then the largest subset of instances that can be shattered is at least d.
To see this argument more clearly, suppose d = 3 and that the i-th instance is represented by having only the i-th Boolean literal set to true, simplified here to the corresponding bitstring:
instance1: 100
instance2: 010
instance3: 001
Now any dichotomy can be constructed by a conjunctive hypothesis that excludes any instance simply by adding the appropriate literal. For example, the hypothesis including only instance2 is ¬l1 ∧ ¬l3. In fact, it can be shown that the largest subset of instances that can be shattered in this setting has size exactly d.
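A brute-force check of this claim, as a sketch (the bitstring encoding follows the slide; helper names are illustrative):

```python
from itertools import product, combinations

n = 3
# the i-th instance has only the i-th bit set, as above
instances = [tuple(1 if j == i else 0 for j in range(n)) for i in range(n)]

def conjunctions(n):
    """All conjunctions over n Boolean literals: +1 = literal, -1 = negated literal, 0 = absent."""
    return product((1, -1, 0), repeat=n)

def satisfies(conj, x):
    return all(c == 0 or (c == 1 and xi == 1) or (c == -1 and xi == 0)
               for c, xi in zip(conj, x))

# check that every subset (dichotomy) of the instances is realised by some conjunction
shattered = True
for k in range(len(instances) + 1):
    for positives in combinations(instances, k):
        pos = set(positives)
        if not any(all(satisfies(h, x) == (x in pos) for x in instances)
                   for h in conjunctions(n)):
            shattered = False
print("shattered:", shattered)   # expected: True
```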
COMP9417 ML & DM Learning Theory Term 2, 2020 40 / 78

VC Dimension
The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, V C(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then V C(H) ≡ ∞.
Note: the VC dimension can be defined for an infinite hypothesis space H since it depends only on the size of finite subsets of X.
COMP9417 ML & DM Learning Theory Term 2, 2020 41 / 78

VC Dimension
VC Dimension of Conjunctive Concepts
From the earlier slide on shattering a set of instances by a conjunctive hypothesis, if we have an instance space X where each instance is described by d Boolean features, and a hypothesis space H of conjunctions of up to d Boolean literals, then the VC Dimension V C(H) = d.
What about some other type of hypothesis space ? Let’s look at linear classifiers.
COMP9417 ML & DM Learning Theory Term 2, 2020 42 / 78

VC Dimension
VC Dimension of Linear Decision Surfaces
From the left, shown are three dichotomies of the same three instances. Can a linear classifier be found for the other five dichotomies? On the right, this set of four instances clearly cannot be shattered by a hypothesis space of linear classifiers.
COMP9417 ML & DM Learning Theory Term 2, 2020 43 / 78

VC Dimension
VC Dimension of Linear Decision Surfaces
Consider linear classifiers in two dimensions. What is the VC dimension of this class of hypotheses H?
Clearly, for a subset of 2 instances we can find a linear classifier for all possible dichotomies. What about a subset of 3 instances ?
The same argument as for 2 instances applies (see first three examples on previous slide, and case (a) on next slide), as long as the instances are not collinear (case (b) on next slide). So the VC dimension is at least 3.
However, in this setting, there is no set of 4 points that can be shattered. In general, for linear classifiers in d dimensions the VC dimension is d + 1.
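This can also be checked computationally. A sketch using an exact linear-programming feasibility test for strict linear separability (scipy is assumed to be available; the point sets are illustrative):

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """True iff some w, b satisfy y_i (w.x_i + b) >= 1 for all i (LP feasibility test)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])   # -(y_i)(x_i, 1).(w, b) <= -1
    res = linprog(c=np.zeros(X.shape[1] + 1), A_ub=A_ub, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * (X.shape[1] + 1), method="highs")
    return res.success

def shattered(X):
    """True iff every +/-1 labelling of the points in X is linearly separable."""
    return all(separable(X, y) for y in product([-1.0, 1.0], repeat=len(X)))

three_points = [(0, 0), (1, 0), (0, 1)]            # not collinear: all 8 dichotomies separable
four_points = [(0, 0), (1, 1), (1, 0), (0, 1)]     # the XOR-style labelling is not separable
print(shattered(three_points))   # expected: True
print(shattered(four_points))    # expected: False
```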
COMP9417 ML & DM Learning Theory Term 2, 2020 44 / 78

VC Dimension
VC Dimension of Linear Decision Surfaces
[Figure: (a) three non-collinear points in the plane, which can be shattered by linear classifiers; (b) three collinear points, which cannot.]
COMP9417 ML & DM Learning Theory Term 2, 2020 45 / 78

VC Dimension
Sample Complexity from VC Dimension
We can now generalise the PAC-learning result obtained earlier to answer the question: how many randomly drawn examples suffice to ε-exhaust
V SH,D with probability at least (1 − δ)?
m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
So we see that the concept of the VC dimension of a hypothesis class gives us a general framework for characterising the complexity or capacity of hypotheses in terms of their ability to express all possible target concepts in a particular learning setting.
For example, the argument developed for the VC dimension of linear classifiers can be extended to multi-layer perceptrons to obtain sample complexity bounds in the PAC-learning framework.
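As a quick sketch, the bound above can be evaluated in the same way as the finite-|H| bounds earlier (the function name and example values are illustrative):

```python
import math

def sample_size_vc(vc_dim, eps, delta):
    """m >= (1/eps) * (4 log2(2/delta) + 8 VC(H) log2(13/eps)), as on the slide."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

# e.g. linear classifiers in 2 dimensions: VC(H) = 3, eps = 0.1, delta = 0.05
print(sample_size_vc(3, 0.1, 0.05))
```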
COMP9417 ML & DM Learning Theory Term 2, 2020 46 / 78

Mistake Bounds
Mistake Bounds
So far: how many examples needed to learn ?
What about: how many mistakes before convergence ?
Let’s consider similar setting to PAC learning:
• Instances drawn at random from X according to distribution D
• Learner must classify each instance before receiving correct classification from teacher
• Can we bound the number of mistakes learner makes before converging?
Again, we start by analysing concept learning.
COMP9417 ML & DM Learning Theory Term 2, 2020 47 / 78

Mistake Bounds
The Find-S Algorithm
An online, specific-to-general, concept learning algorithm:
• Initialize h to the most specific hypothesis in H
• For each positive training instance x
  • For each attribute constraint ai in h
    • If the constraint ai in h is satisfied by x
      • Then do nothing
      • Else replace ai in h by the next more general constraint satisfied by x
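A minimal sketch of Find-S for conjunctive attribute-value hypotheses (the ∅/'?' encoding follows the earlier slides; names and the toy EnjoySport-style data are illustrative):

```python
def find_s(examples, n_attributes):
    """Find-S for conjunctions of attribute constraints.

    examples: iterable of (x, label) pairs, where x is a tuple of attribute values.
    Constraints: None = the empty constraint (matches nothing), '?' = any value.
    """
    h = [None] * n_attributes                 # most specific hypothesis <empty, ..., empty>
    for x, label in examples:
        if label != 1:                        # Find-S ignores negative examples
            continue
        for i, value in enumerate(x):
            if h[i] is None:                  # empty constraint -> the observed value
                h[i] = value
            elif h[i] != value and h[i] != '?':
                h[i] = '?'                    # generalise to "any value"
    return h

# hypothetical EnjoySport-style run
D = [(('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'), 1),
     (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'), 1),
     (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 0),
     (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 1)]
print(find_s(D, 6))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```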
COMP9417 ML & DM Learning Theory Term 2, 2020 48 / 78

Mistake Bounds
Hypothesis Space Search by Find-S
[Figure: trace of the hypothesis space search by Find-S. Starting from the most specific hypothesis h0 = ⟨∅, ∅, ∅, ∅, ∅, ∅⟩, the hypothesis is generalised through h1, h2 = h3, and h4 as the positive instances x1, x2, x4 in instance space X are processed; the negative instance x3 is ignored. Hypotheses are ordered from specific to general.]
COMP9417 ML & DM Learning Theory Term 2, 2020 49 / 78

Mistake Bounds
Find-S – does it work ?
Assume: a hypothesis hc ∈ H describes target function c, and training data is error-free.
By definition, hc is consistent with all positive training examples so can never cover a negative example.
For each h generated by Find-S hc is more general than or equal to h. So h can never cover a negative example.
COMP9417 ML & DM Learning Theory Term 2, 2020 50 / 78

Mistake Bounds
Complaints about Find-S
• Can’t tell whether it has learned the concept
  • the learned hypothesis is an element of the VS but may not be the only consistent hypothesis
• Can’t tell when training data is inconsistent
  • cannot handle noisy data
• Picks a maximally specific h (why?)
  • might require maximally general h
COMP9417 ML & DM Learning Theory Term 2, 2020 51 / 78

Mistake Bounds
Mistake Bounds: Find-S
Consider Find-S when H = conjunction of Boolean literals Find-S:
• Initialize h to the most specific hypothesis l1 ∧¬l1 ∧l2 ∧¬l2 …ln ∧¬ln
• For each positive training instance x
• Remove from h any literal that is not satisfied by x
• Output hypothesis h.
How many mistakes before converging to correct h?
COMP9417 ML & DM Learning Theory Term 2, 2020 52 / 78

Mistake Bounds
Mistake Bounds: Find-S
Find-S will converge to error-free hypothesis in the limit if
• C ⊆ H
• data are noise-free
How many mistakes before converging to correct h?
COMP9417 ML & DM Learning Theory Term 2, 2020 53 / 78

Mistake Bounds
Mistake Bounds: Find-S
• Find-S initially classifies all instances as negative
• will generalize for positive examples by dropping unsatisfied literals
• since c ∈ H, will never classify a negative instance as positive (if it did:
• would exist a hypothesis h′ such that c ≥g h′
• would exist an instance x such that h′(x) = 1 and c(x) ≠ 1
• which contradicts definition of generality ordering ≥g)
• mistake bound from number of positives classified as negatives
How many mistakes before converging to correct h?
COMP9417 ML & DM Learning Theory Term 2, 2020 54 / 78

Mistake Bounds
Mistake Bounds: Find-S
• 2n terms in initial hypothesis
• first mistake, remove half of these terms, leaving n
• each further mistake, remove at least 1 term
• in worst case, will have to remove all n remaining terms
• would be most general concept – everything is positive
• worst case number of mistakes would be n + 1
• worst case sequence of learning steps, removing only one literal per step
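As an illustrative worst-case trace (assuming the target is the always-true concept, i.e., the empty conjunction), take n = 3:
• initial h = l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ l3 ∧ ¬l3 (2n = 6 terms); every instance is classified negative
• positive example 111: mistake; drop ¬l1, ¬l2, ¬l3, leaving h = l1 ∧ l2 ∧ l3 (n = 3 terms)
• positive example 011: mistake; drop l1
• positive example 101: mistake; drop l2
• positive example 110: mistake; drop l3, leaving the empty (always-true) conjunction
giving n + 1 = 4 mistakes in total.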
COMP9417 ML & DM Learning Theory Term 2, 2020 55 / 78

Mistake Bounds
Mistake Bounds: Halving Algorithm
Find-S returns the single most-specific consistent hypothesis.
An extension of the Find-S concept learning algorithm is the Candidate-Elimination algorithm which returns all consistent hypotheses, i.e., it finds the Version Space.
Now consider the Halving Algorithm:
• Learns concept using Candidate-Elimination algorithm
• Classifies new instances by majority vote of Version Space hypotheses
How many mistakes will the Halving Algorithm make before converging to correct h?
• … in worst case?
• … in best case?
COMP9417 ML & DM Learning Theory Term 2, 2020 56 / 78

Mistake Bounds
Mistake Bounds: Halving Algorithm
Halving Algorithm learns exactly when the version space contains only one consistent hypothesis which corresponds to the target concept
• at each step, remove all hypotheses whose vote was incorrect
• mistake when majority vote classification is incorrect
• mistake bound in converging to correct h ?
COMP9417 ML & DM Learning Theory Term 2, 2020 57 / 78

Mistake Bounds
Mistake Bounds: Halving Algorithm
• how many mistakes worst case ?
• on every step, mistake because majority vote is incorrect
• each mistake, number of hypotheses reduced by at least half
• hypothesis space size |H|, worst-case mistake bound ⌊log2 |H|⌋
• how many mistakes best case ?
• on every step, no mistake because majority vote is correct
• still remove all incorrect hypotheses, up to half
• best case, no mistakes in converging to correct h
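A small sketch of the Halving Algorithm over the conjunctive hypothesis space used earlier (the hypothesis encoding and the toy target concept are illustrative assumptions):

```python
import math
from itertools import product

n = 3
instances = list(product([0, 1], repeat=n))

def satisfies(conj, x):
    """conj has one entry per literal: +1 (literal), -1 (negated literal), 0 (absent)."""
    return all(c == 0 or (c == 1 and xi == 1) or (c == -1 and xi == 0)
               for c, xi in zip(conj, x))

H = list(product((1, -1, 0), repeat=n))    # all 3^n conjunctions over n Boolean literals
target = (1, 0, -1)                        # toy target concept: l1 AND NOT l3

VS = list(H)                               # version space: initially all of H
mistakes = 0
for x in instances:                        # one online pass over the instance space
    votes_pos = sum(satisfies(h, x) for h in VS)
    prediction = votes_pos > len(VS) / 2   # majority vote of the version space
    truth = satisfies(target, x)
    if prediction != truth:
        mistakes += 1
    VS = [h for h in VS if satisfies(h, x) == truth]   # keep only hypotheses consistent with the label

print(mistakes, "mistakes; worst-case bound floor(log2 |H|) =", math.floor(math.log2(len(H))))
```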
COMP9417 ML & DM Learning Theory Term 2, 2020 58 / 78

Mistake Bounds
Winnow
Mistake bounds are based on the work of Nick Littlestone3 who developed the Winnow family of algorithms.
• online, mistake-driven algorithm
• similar to perceptron
• learns linear threshold function
• designed to learn in the presence of many irrelevant features
• number of mistakes grows only logarithmically with number of irrelevant features
• Next slide shows the algorithm known as Winnow2
• in Winnow1 attribute elimination not demotion
3N. Littlestone. “Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm”. Machine Learning 2(4): 285-318 (1987)
COMP9417 ML & DM Learning Theory Term 2, 2020 59 / 78

Mistake Bounds
Winnow2
While some instances are misclassified
  For each instance x
    classify x using current weights w
    If predicted class is incorrect
      If x has class 1
        For each xi = 1, wi ← α wi        # Promotion
        (if xi = 0, leave wi unchanged)
      Otherwise
        For each xi = 1, wi ← wi / α      # Demotion
        (if xi = 0, leave wi unchanged)

Here x and w are vectors of features and weights, respectively.
COMP9417 ML & DM Learning Theory Term 2, 2020 60 / 78

Mistake Bounds
Winnow2
• user-supplied threshold θ
• class is 1 if Σ wi ai > θ
• note similarity to perceptron training rule
• Winnow2 uses multiplicative weight updates
  • α > 1
• will do much better than perceptron with many irrelevant attributes
  • an attribute-efficient learner
• what are irrelevant attributes ?
  • assume that the target concept can be expressed using a finite subset r of attributes or features
  • if there are n attributes in total, n − r are irrelevant
• mistake-bounds for Winnow-type algorithms are logarithmic in the number of irrelevant attributes
• typically, the worst-case mistake-bound is something like O(r log n)
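A minimal Python sketch of Winnow2 for Boolean features, following the pseudocode above (the defaults α = 2 and θ = n/2, and the toy data, are illustrative assumptions):

```python
def winnow2(examples, n, alpha=2.0, theta=None, max_epochs=100):
    """Winnow2 for Boolean feature vectors x in {0,1}^n and labels in {0,1}.

    examples: list of (x, label) pairs.  theta defaults to n/2 here, although
    the slides treat it as user-supplied.  Returns the learned weight vector.
    """
    theta = n / 2 if theta is None else theta
    w = [1.0] * n                              # weights start at 1
    for _ in range(max_epochs):                # "while some instances are misclassified"
        mistakes = 0
        for x, label in examples:
            predicted = 1 if sum(wi * xi for wi, xi in zip(w, x)) > theta else 0
            if predicted == label:
                continue
            mistakes += 1
            for i, xi in enumerate(x):
                if xi == 1:                    # only active features are promoted/demoted
                    w[i] = w[i] * alpha if label == 1 else w[i] / alpha
        if mistakes == 0:
            break
    return w

# toy run: target concept is x1 OR x2, with two irrelevant features x3, x4
D = [((1, 0, 1, 0), 1), ((0, 1, 0, 1), 1), ((0, 0, 1, 1), 0), ((1, 1, 0, 0), 1), ((0, 0, 0, 0), 0)]
print(winnow2(D, n=4))
```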
COMP9417 ML & DM Learning Theory Term 2, 2020 61 / 78

Mistake Bounds
Optimal Mistake Bounds
Let MA(C) be the max number of mistakes made by algorithm A to learn concepts in C. (maximum over all possible c ∈ C, and all possible training sequences)
MA(C) ≡ max_{c∈C} MA(c)
Definition: Let C be an arbitrary non-empty concept class. The optimal mistake bound for C, denoted Opt(C), is the minimum over all possible learning algorithms A of MA(C).
Opt(C) ≡ min_{A ∈ learning algorithms} MA(C)
VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|).
COMP9417 ML & DM Learning Theory Term 2, 2020 62 / 78

Mistake Bounds
Weighted Majority
Based on4:
• a generalisation of Halving Algorithm
• predicts by weighted vote of set of prediction algorithms
• learns by altering weights for prediction algorithms
• any prediction algorithm simply predicts value of target concept given an instance; they can be
• elements of hypothesis space H, or even
• different learning algorithms
• if there is inconsistency between prediction algorithm and training example
• then reduce weight of prediction algorithm
• bound number of mistakes of ensemble by number of mistakes made
by best prediction algorithm
4N. Littlestone, M. Warmuth. “The Weighted Majority Algorithm”. Inf. Comput. 108(2): 212-261 (1994)
COMP9417 ML & DM Learning Theory Term 2, 2020 63 / 78

Mistake Bounds
Weighted Majority
ai is the i-th prediction algorithm
wi is the weight associated with ai

For all i, initialize wi ← 1
For each training example ⟨x, c(x)⟩
  Initialize q0 and q1 to 0
  For each prediction algorithm ai
    If ai(x) = 0 then q0 ← q0 + wi
    If ai(x) = 1 then q1 ← q1 + wi
  If q1 > q0 then predict c(x) = 1
  If q0 > q1 then predict c(x) = 0
  If q0 = q1 then predict 0 or 1 at random for c(x)
  For each prediction algorithm ai
    If ai(x) ≠ c(x) then wi ← β wi
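A compact Python sketch of this procedure (the pool of predictors, the data, and β = 0.5 are illustrative):

```python
import random

def weighted_majority(predictors, stream, beta=0.5):
    """Online Weighted Majority over a pool of prediction algorithms.

    predictors: list of functions x -> {0, 1}
    stream: iterable of (x, label) pairs revealed one at a time
    Returns (mistakes, final_weights).
    """
    w = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        preds = [a(x) for a in predictors]
        q = [0.0, 0.0]
        for wi, p in zip(w, preds):
            q[p] += wi                         # weighted votes for classes 0 and 1
        guess = 1 if q[1] > q[0] else 0 if q[0] > q[1] else random.randint(0, 1)
        if guess != label:
            mistakes += 1
        w = [wi * beta if p != label else wi for wi, p in zip(w, preds)]   # down-weight errors
    return mistakes, w

# toy pool: hypothetical predictors over 2-bit instances; the target here is the second bit
pool = [lambda x: x[0], lambda x: x[1], lambda x: 1 - x[0], lambda x: 1]
data = [((0, 1), 1), ((1, 0), 0), ((1, 1), 1), ((0, 0), 0)]
print(weighted_majority(pool, data))
```

Setting beta=0 recovers the Halving Algorithm, as noted on the next slide.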
COMP9417 ML & DM Learning Theory Term 2, 2020 64 / 78

Mistake Bounds
Weighted Majority
Weighted Majority algorithm begins by assigning weight of 1 to each prediction algorithm.
On misclassification by a prediction algorithm its weight is reduced by multiplication by a constant β, 0 ≤ β ≤ 1.
Equivalent to Halving Algorithm when β = 0. Otherwise, this just down-weights contribution of algorithms which make errors.
Key result: number of mistakes made by Weighted Majority algorithm will never be greater than a constant factor times the number of mistakes made by the best member of the pool, plus a term that grows only logarithmically in the number of prediction algorithms in the pool.
COMP9417 ML & DM Learning Theory Term 2, 2020 65 / 78

Limits on Machine Learning
Some questions about Machine Learning
1. Are there reasons to prefer one learning algorithm over another ?
2. Can we expect any method to be superior overall ?
3. Can we even find an algorithm that is overall superior to random guessing ?
COMP9417 ML & DM Learning Theory Term 2, 2020 66 / 78

Limits on Machine Learning
Some questions about Machine Learning
• Perhaps surprisingly, the answer to each of these questions is no!
• Unless one has some prior knowledge on, or can make some
assumptions about, the distribution of target functions.
• This is a consequence of a number of results collectively known as the “No Free Lunch” Theorem
• Since these results specifically address the issue of generalising from a training-set to minimize off training-set error they are also referred to as “Conservation Laws of Generalization”.
COMP9417 ML & DM Learning Theory Term 2, 2020 67 / 78

Limits on Machine Learning
No Free Lunch Theorem
Two main results are:
• Uniformly averaged over all target functions, the expected off-training-set error for all learning algorithms is the same.
• Assuming that the training set D can be learned correctly by all algorithms, averaged over all target functions no learning algorithm gives an off-training set error superior to any other:
ΣF [E1(E|F, D) − E2(E|F, D)] = 0
where F is the set of possible target functions, E is the off-training set
error, and E1, E2 are expectations for two learning algorithms.
COMP9417 ML & DM Learning Theory Term 2, 2020 68 / 78

Limits on Machine Learning
No Free Lunch Theorem
• There are also related results for cases where the distribution on target functions is not uniform.
• Therefore, all statements of the form “learning algorithm 1 is better than algorithm 2” are ultimately statements about the relevant target functions.
COMP9417 ML & DM Learning Theory Term 2, 2020 69 / 78

Limits on Machine Learning
No Free Lunch example
• Consider a data set with three Boolean features, labelled by a target function F
• Suppose we have two different (deterministic) algorithms that generate two different hypotheses, both of which fit the training data D exactly
• The first algorithm assumes all instances x are in the target function F, unless labelled otherwise in training set D, and the second algorithm assumes the opposite
• For this particular target function F the first algorithm is clearly superior in terms of off-training-set error
• But this cannot be determined from the performance on training data D alone !
COMP9417 ML & DM Learning Theory Term 2, 2020 70 / 78

Limits on Machine Learning
No Free Lunch example
 x      F    h1    h2
000     1     1     1   (in training set D)
001    -1    -1    -1   (in training set D)
010     1     1     1   (in training set D)
011    -1     1    -1
100     1     1    -1
101    -1     1    -1
110     1     1    -1
111     1     1    -1

E1(E|F, D) = 0.4
E2(E|F, D) = 0.6
COMP9417 ML & DM Learning Theory Term 2, 2020 71 / 78

Limits on Machine Learning
No Free Lunch example
• But note that the algorithm designer does not know F here
• Furthermore, if we have no prior knowledge about which F we are
trying to learn, neither algorithm is superior to the other
• Both fit the training data correctly, but there are a total of 2^5 (= 32) target functions consistent with D
• For each of these target functions there is exactly one other function whose output is inverted with respect to each of the off-training set instances
• So the performance of algorithms 1 and 2 will be inverted, and their off-training-set error differences cancel
• Thus ensuring expected (average) error difference of zero
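This cancellation can be verified directly by enumerating all 2^5 completions of the target function off the training set; a brief sketch (the instance encoding follows the table on the earlier slide):

```python
from itertools import product

# training set D and off-training-set instances, following the table above
train = {(0, 0, 0): 1, (0, 0, 1): -1, (0, 1, 0): 1}
off_train = [(0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]

h1 = dict(train)                       # algorithm 1: everything positive unless D says otherwise
h1.update({x: 1 for x in off_train})
h2 = dict(train)                       # algorithm 2: the opposite assumption
h2.update({x: -1 for x in off_train})

def off_error(h, F_off):
    """Off-training-set error of h against a completion F_off of the target function."""
    return sum(h[x] != fx for x, fx in zip(off_train, F_off)) / len(off_train)

# average the error difference over all 2^5 target functions consistent with D
diffs = [off_error(h1, F_off) - off_error(h2, F_off)
         for F_off in product([-1, 1], repeat=len(off_train))]
print(sum(diffs) / len(diffs))         # expected: 0.0
```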
COMP9417 ML & DM Learning Theory Term 2, 2020 72 / 78

Limits on Machine Learning
A Conservation Theorem of Generalization Performance
For every possible learning algorithm for binary classification the sum of performance over all possible target functions is exactly zero !
• on some problems we get positive performance
• so there must be other problems for which we get an equal and
opposite amount of negative performance
COMP9417 ML & DM Learning Theory Term 2, 2020 73 / 78

Limits on Machine Learning
A Conservation Theorem of Generalization Performance
COMP9417 ML & DM Learning Theory Term 2, 2020 74 / 78

Limits on Machine Learning
A Conservation Theorem of Generalization Performance
COMP9417 ML & DM Learning Theory Term 2, 2020 75 / 78

Limits on Machine Learning
A Conservation Theorem of Generalization Performance
Takeaways for machine learning practitioners:
• Even popular, theoretically well-founded algorithms will perform poorly on some domains, i.e., those where the learning algorithm is not a “good match” to the class of target functions.
• It is the assumptions about the learning domains, i.e., hidden target functions, that are relevant.
• Experience with a broad range of techniques is the best insurance for solving arbitrary new classification problems.
See: Duda, Hart & Stork (2001)
COMP9417 ML & DM Learning Theory Term 2, 2020 76 / 78

Limits on Machine Learning
Ugly Duckling Theorem
In the absence of assumptions there is no privileged or “best” feature representation.
For n objects, the set of concepts is 2^n. The number of concepts of which any pair of objects is a member (has the same concept definition, or pattern) is 2^(n−1)
• using a finite number of predicates to distinguish any two patterns denoting sets of objects, the number of predicates shared by any two such patterns is constant and independent of those patterns.
Therefore, even the notion of similarity between patterns depends on assumptions, i.e., inductive bias.
See: Duda et al. (2001).
COMP9417 ML & DM Learning Theory Term 2, 2020 77 / 78

Summary
Algorithm Independent Aspects of Machine Learning
• Algorithm independent analyses using techniques from computational complexity, pattern recognition, statistics, etc.
• Some results over-conservative from practical viewpoint, e.g., worst-case rather than average-case analyses
• Nonetheless, has led to many practically useful algorithms, e.g.,
• PAC
  • Boosting
• VC dimension
  • SVM
• Mistake Bounds
  • Winnow, Weighted Majority
• Bias-variance decomposition
  • Ensemble learning methods
COMP9417 ML & DM Learning Theory Term 2, 2020 78 / 78

References
Anthony, M. and Biggs, N. (1992). Computational Learning Theory. Cambridge University Press.
Duda, R., Hart, P., and Stork, D. (2001). Pattern Classification. Wiley.
COMP9417 ML & DM Learning Theory Term 2, 2020 78 / 78