
Learning Theory
COMP9417 Machine Learning & Data Mining
Term 1, 2021
Adapted from slides by Dr Michael Bain

Aims
This lecture introduces some foundational results that apply in machine learning irrespective of any particular algorithm, and that enable you to define and reproduce some of the fundamental approaches and results from computational and statistical learning theory. Following it you should be able to:
• describe a basic theoretical framework for sample complexity of learning
• describe the Probably Approximately Correct (PAC) learning framework
• describe the Vapnik-Chervonenkis (VC) dimension framework
COMP9417 T1, 2021 1

Introduction
Machine learning: Have a computer solve problems by learning from data
rather than being explicitly programmed.
o Regression
o Classification
o Clustering
o Ranking
o Reinforcement learning
o …

Computational Learning Theory
The goal of learning theory is to develop and analyse formal models that help us understand:
o what concepts we can hope to learn efficiently, and how much data is necessary to learn them
o what types of guarantees we might hope to achieve (error bounds, complexity bounds)
o why particular algorithms may or may not perform well under various conditions

Key Idea
Learning theory aims at a body of theory that captures all important aspects of the fundamentals of the learning process, and of any algorithm or class of algorithms designed to do learning; that is, we want theory that captures the algorithm-independent aspects of machine learning.
BUT: we’re not quite there yet …

Computational Learning Theory
Inductive learning is learning from examples; all machine learning algorithms are a kind of inductive learning. Some questions we would like to be able to answer in such a framework concern:
o Probability of successful learning
o Number of training examples
o Complexity of hypothesis space
o Time complexity of learning algorithm
o Accuracy to which target concept is approximated
o Manner in which training examples are presented

Computational Learning Theory
Some questions to ask, without focusing on any particular algorithm:
• Sample complexity
– How many training examples are required for the learner to converge (with high probability) to a successful hypothesis?
• Computational complexity
– How much computational effort is required for the learner to converge (with high probability) to a successful hypothesis?
• Hypothesis complexity
– How do we measure the complexity of a hypothesis ?
– How large is a hypothesis space?

Computational Learning Theory
What do we consider to be a successful hypothesis?
– identical to target concept ?
– mostly agrees with target concept … ?
– … does this most of the time ?

Computational Learning Theory
Instead of focusing on particular algorithms, learning theory aims to characterize classes of algorithms.
Probably Approximately Correct (PAC) is a framework for mathematical analysis of learning which was proposed by Valiant in 1984.

Probably Approximately Correct (PAC)
The framework of Probably Approximately Correct (PAC) learning can be used to address questions such as:
– How many training examples?
– How much computational effort is required?
– How complex a hypothesis class is needed?
– How to quantify hypothesis complexity?
– How many mistakes will be made ?

Sample Complexity
Given:
• set of instances 𝑋
• set of hypotheses 𝐻
• set of possible target concepts 𝐶
• training instances generated by a fixed, unknown probability distribution 𝒟 over 𝑋
Learner observes a sequence 𝐷 of training examples of the form ⟨x, 𝑐(x)⟩, for some target concept 𝑐 ∈ 𝐶
o instances x are drawn from distribution 𝒟
o teacher provides target value 𝑐(x) for each x
Learner must output a hypothesis h estimating 𝑐
o h is evaluated by its performance on subsequent instances drawn according to 𝒟
Note: randomly drawn instances

True Error of a Hypothesis
[Figure: instance space 𝑋, showing the region where target concept 𝑐 and hypothesis h disagree]

True Error of a Hypothesis
Definition: The true error (denoted 𝑒𝑟𝑟𝑜𝑟𝒟(h)) of hypothesis h with respect to target concept 𝑐 and distribution 𝒟 is the probability that h will misclassify an instance drawn at random according to 𝒟.
𝑒𝑟𝑟𝑜𝑟_𝒟(h) ≡ Pr_{x∈𝒟}[𝑐(x) ≠ h(x)]
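To make the definition concrete, here is a small sketch (my own toy example, not from the slides): a tiny instance space with a known target concept and hypothesis, comparing the exact true error with a Monte Carlo estimate from instances drawn according to 𝒟.

```python
import random

# Hypothetical toy setup (not from the slides): X = {1,...,10} with a
# uniform distribution D, target concept c(x) = [x >= 5], and a learned
# hypothesis h(x) = [x >= 6]. The two disagree only on x = 5.
X = list(range(1, 11))

def c(x):  # target concept
    return x >= 5

def h(x):  # hypothesis
    return x >= 6

# Exact true error: probability mass of the disagreement region under D.
true_error = sum(1 for x in X if c(x) != h(x)) / len(X)

# Monte Carlo estimate: classify fresh instances drawn according to D.
random.seed(0)
sample = [random.choice(X) for _ in range(100_000)]
estimate = sum(1 for x in sample if c(x) != h(x)) / len(sample)

print(true_error)  # 0.1
```

With a uniform distribution the true error is exactly 1/10; the estimate from 100,000 random draws should agree to about two decimal places.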

Two Notions of Error
Training error of hypothesis h with respect to target concept 𝑐:
o How often h(x) ≠ 𝑐(x) over training instances
True error of hypothesis h with respect to 𝑐:
o How often h(x) ≠ 𝑐(x) over future random instances
Our concern:
– Can we bound the true error of h given the training error of h?
– First consider the case when the training error of h is zero (i.e., h ∈ 𝑉𝑆_{𝐻,𝐷})

PAC Learning
Consider a class 𝐶 of possible target concepts defined over a set of instances 𝑋 of length 𝑛, and a learner 𝐿 using hypothesis space 𝐻.
Definition: 𝐶 is PAC-learnable by 𝐿 using 𝐻 if for all 𝑐 ∈ 𝐶, distributions 𝒟 over 𝑋, 𝜖 such that 0 < 𝜖 < 1/2, and 𝛿 such that 0 < 𝛿 < 1/2, learner 𝐿 will with probability at least (1 − 𝛿) output a hypothesis h ∈ 𝐻 such that 𝑒𝑟𝑟𝑜𝑟_𝒟(h) ≤ 𝜖, in time (and sample size) polynomial in 1/𝜖, 1/𝛿, 𝑛 and 𝑠𝑖𝑧𝑒(𝑐).
Probably Approximately Correct Learning, L. Valiant (1984; 2013).

PAC Learning
Given:
– 𝐶: concept class
– 𝐿: learner
– 𝐻: hypothesis space
– 𝑛: length of instances
– |𝐻|: size of hypothesis space
– 𝒟: distribution over inputs
– 𝜖: error goal (0 < 𝜖 < 1/2)
– 𝛿: certainty goal (0 < 𝛿 < 1/2)
𝐶 is PAC-learnable by 𝐿 using 𝐻 if learner 𝐿, with probability at least (1 − 𝛿), outputs a hypothesis h ∈ 𝐻 such that 𝑒𝑟𝑟𝑜𝑟_𝒟(h) ≤ 𝜖, for a sample size polynomial in 1/𝜖, 1/𝛿, 𝑛 and 𝑠𝑖𝑧𝑒(𝑐).
We start to look at PAC learning using concept learning.

Prototypical Concept Learning Task
    Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1   Sunny  Warm     Normal    Strong  Warm   Same      Yes
2   Sunny  Warm     High      Strong  Warm   Same      Yes
3   Rainy  Cold     High      Strong  Warm   Change    No
4   Sunny  Warm     High      Strong  Warm   Change    Yes
• A set of example days, each described by six attributes.
• The task is to learn to predict the value of EnjoySport for an arbitrary day, based on the values of its attributes.

Prototypical Concept Learning Task
Example:
Instances 𝑋: possible days, each described by the attributes Sky, AirTemp, Humidity, Wind, Water, Forecast
Target function 𝑐: EnjoySport: 𝑋 → {0, 1}
Hypotheses 𝐻: conjunctions of literals
Training examples 𝐷: positive and negative examples of the target function ⟨x₁, 𝑐(x₁)⟩, …, ⟨x_m, 𝑐(x_m)⟩
Determine:
A hypothesis h in 𝐻 such that h(x) = 𝑐(x) for all x in 𝐷?
A hypothesis h in 𝐻 such that h(x) = 𝑐(x) for all x in 𝑋?

Concept Learning
Concept learning can be formulated as the “problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples” (Tom Mitchell), or as “the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories” (Bruner, Goodnow, & Austin, 1967).
Concept learning: acquiring the definition of a general category from given positive and negative training examples of the category.

Concept Learning
Hypothesis representation for the concept learning task:
– Each hypothesis consists of a conjunction of constraints on the instance attributes.
– Each hypothesis is a vector of six constraints, specifying the values of the six attributes.
– Each attribute constraint is one of:
o ? – any value is acceptable for the attribute (don’t care)
o a single required value, e.g. Warm (specific)
o ∅ – no value is acceptable for the attribute (no value)

Concept Learning
• The EnjoySport concept learning task requires learning the set of days for which 𝐸𝑛𝑗𝑜𝑦𝑆𝑝𝑜𝑟𝑡 = 𝑦𝑒𝑠, describing this set by a conjunction of constraints over the instance attributes.
• One example hypothesis:
⟨Sky = Sunny, AirTemp = ?, Humidity = ?, Wind = Strong, Water = ?, Forecast = Same⟩
• The most general hypothesis:
⟨Sky = ?, AirTemp = ?, Humidity = ?, Wind = ?, Water = ?, Forecast = ?⟩
• The most specific hypothesis:
⟨Sky = ∅, AirTemp = ∅, Humidity = ∅, Wind = ∅, Water = ∅, Forecast = ∅⟩

Concept Learning as Search
Question: What can be learned?
Answer: (only) what is in the hypothesis space
• Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training examples.

EnjoySport – Hypothesis Space
How big is the hypothesis space for EnjoySport?
• Instance space:
𝑆𝑘𝑦 × 𝐴𝑖𝑟𝑇𝑒𝑚𝑝 × ⋯ × 𝐹𝑜𝑟𝑒𝑐𝑎𝑠𝑡 = 3 × 2 × 2 × 2 × 2 × 2 = 96 instances
• Hypothesis space, syntactically distinct:
5 × 4 × 4 × 4 × 4 × 4 = 5120 syntactically distinct hypotheses
• Hypothesis space, semantically distinct:
1 + 4 × 3 × 3 × 3 × 3 × 3 = 973 semantically distinct hypotheses
(Any hypothesis with an ∅ constraint covers no instances, hence all such hypotheses are semantically equivalent)
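The counting argument above can be checked by brute force. A sketch (my own code, not from the lecture); the attribute values follow Mitchell's EnjoySport task, with "Cloudy" assumed as the third Sky value:

```python
from itertools import product

# Attribute values for the EnjoySport task ("Cloudy" is an assumption,
# taken from Mitchell's version of the task).
values = {
    "Sky": ["Sunny", "Cloudy", "Rainy"],
    "AirTemp": ["Warm", "Cold"],
    "Humidity": ["Normal", "High"],
    "Wind": ["Strong", "Weak"],
    "Water": ["Warm", "Cool"],
    "Forecast": ["Same", "Change"],
}

instances = list(product(*values.values()))  # 3*2*2*2*2*2 = 96 instances

# Syntactic hypotheses: each attribute is '?', None (playing the role of the
# empty constraint written ∅ on the slide), or one concrete value.
syntactic = list(product(*[["?", None] + v for v in values.values()]))

def covers(h, x):
    # A conjunction covers x iff every constraint is '?' or matches x.
    return all(hc == "?" or hc == xc for hc, xc in zip(h, x))

# Semantically distinct hypotheses = distinct sets of covered instances.
extensions = {frozenset(x for x in instances if covers(h, x)) for h in syntactic}

print(len(instances), len(syntactic), len(extensions))  # 96 5120 973
```

All 5120 − 973 redundant syntactic hypotheses contain an ∅ constraint and collapse onto the single empty extension, which is exactly the "1 +" term in the slide's count.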

Instances, Hypotheses, and More-General-Than

A generality order on hypotheses
Definition: Let h₁ and h₂ be Boolean-valued functions defined over instances 𝑋. Then h₁ is more-general-than-or-equal-to h₂ (written h₁ ≥_g h₂) if and only if
(∀x ∈ 𝑋)[(h₂(x) = 1) → (h₁(x) = 1)]
Intuitively, h₁ is more-general-than-or-equal-to h₂ if any instance satisfying h₂ also satisfies h₁.
h₁ is (strictly) more general than h₂ (written h₁ >_g h₂) if and only if
(h₁ ≥_g h₂) ⋀ (h₂ ≱_g h₁).
h₁ is more specific than h₂ when h₂ is more general than h₁.
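Since the EnjoySport instance space has only 96 elements, the ≥_g relation can be checked directly by enumeration. A sketch (my own code; "Cloudy" assumed as the third Sky value, as in Mitchell's task):

```python
from itertools import product

# Attribute values for the EnjoySport task.
values = [["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
          ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]
instances = list(product(*values))

def covers(h, x):
    # A conjunctive hypothesis covers x iff each constraint is '?' or matches.
    return all(hc in ("?", xc) for hc, xc in zip(h, x))

def more_general_or_equal(h1, h2):
    # h1 >=_g h2 iff every instance satisfying h2 also satisfies h1.
    return all(covers(h1, x) for x in instances if covers(h2, x))

h1 = ("Sunny", "?", "?", "?", "?", "?")
h2 = ("Sunny", "Warm", "?", "Strong", "?", "?")
print(more_general_or_equal(h1, h2))  # True
print(more_general_or_equal(h2, h1))  # False: h1 >_g h2 holds strictly
```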

Version Space
A hypothesis h is consistent with a set of training examples 𝐷 of target concept 𝑐 if and only if h(x) = 𝑐(x) for each training example ⟨x, 𝑐(x)⟩ in 𝐷.
𝐶𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑡(h, 𝐷) ≡ (∀⟨x, 𝑐(x)⟩ ∈ 𝐷) h(x) = 𝑐(x)
The version space, 𝑉𝑆_{𝐻,𝐷}, with respect to hypothesis space 𝐻 and training examples 𝐷, is the subset of hypotheses from 𝐻 consistent with all training examples in 𝐷.
𝑉𝑆_{𝐻,𝐷} ≡ {h ∈ 𝐻 | 𝐶𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑡(h, 𝐷)}
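For the small conjunctive hypothesis space the version space can be computed by brute force over all 5120 syntactic hypotheses. A sketch (my own code, not from the lecture), using the training table shown earlier; "Cloudy" is assumed as the third Sky value:

```python
from itertools import product

# Attribute values for the EnjoySport task.
values = [["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
          ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]

D = [  # (instance, c(x)) pairs from the slide's training table
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High", "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Change"), True),
]

def covers(h, x):
    return all(hc in ("?", xc) for hc, xc in zip(h, x))

# None plays the role of the empty constraint ∅.
hypotheses = product(*[["?", None] + v for v in values])

# The version space: hypotheses consistent with every training example.
VS = [h for h in hypotheses
      if all(covers(h, x) == label for x, label in D)]

print(("Sunny", "?", "?", "?", "?", "?") in VS)  # True: excludes the Rainy negative
print(("?",) * 6 in VS)                          # False: covers the negative example
```

Every member of the version space must constrain Sky to Sunny or AirTemp to Warm (otherwise it covers the negative example), and must leave Humidity and Forecast unconstrained (the positives disagree on them).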

Exhausting the Version Space
[Figure: hypothesis space 𝐻 with version space 𝑉𝑆_{𝐻,𝐷}; each hypothesis is annotated with its training error 𝑟 and true error]

Exhausting the Version Space
Note: in the diagram
(𝑟 = training error, 𝑒𝑟𝑟𝑜𝑟 = true error)
Definition: The version space 𝑉𝑆_{𝐻,𝐷} is said to be 𝜖-exhausted with respect to 𝑐 and 𝒟, if every hypothesis h in 𝑉𝑆_{𝐻,𝐷} has error less than 𝜖 with respect to 𝑐 and 𝒟.
(∀h ∈ 𝑉𝑆_{𝐻,𝐷}) 𝑒𝑟𝑟𝑜𝑟_𝒟(h) < 𝜖
So 𝑉𝑆_{𝐻,𝐷} is not 𝜖-exhausted if it contains at least one h with 𝑒𝑟𝑟𝑜𝑟_𝒟(h) ≥ 𝜖.

How many examples will 𝝐-exhaust the VS?
Theorem [Haussler, 1988]. If the hypothesis space 𝐻 is finite, and 𝐷 is a sequence of 𝑚 ≥ 1 independent random examples of some target concept 𝑐, then for any 0 ≤ 𝜖 ≤ 1, the probability that the version space with respect to 𝐻 and 𝐷 is not 𝜖-exhausted (with respect to 𝑐) is less than:
|𝐻| 𝑒^(−𝜖𝑚)
Interesting! This bounds the probability that any consistent learner will output a hypothesis h with 𝑒𝑟𝑟𝑜𝑟(h) ≥ 𝜖.

How many examples will 𝝐-exhaust the VS?
If we want this probability to be below 𝛿, then
|𝐻| 𝑒^(−𝜖𝑚) ≤ 𝛿
𝑚 ≥ (1/𝜖)(ln |𝐻| + ln(1/𝛿))

How many examples will 𝝐-exhaust the VS?
How many examples are sufficient to assure with probability at least (1 − 𝛿) that every h in 𝑉𝑆_{𝐻,𝐷} satisfies 𝑒𝑟𝑟𝑜𝑟_𝒟(h) ≤ 𝜖? Use our theorem:
𝑚 ≥ (1/𝜖)(ln |𝐻| + ln(1/𝛿))

How About EnjoySport?
If 𝐻 is as given in EnjoySport then |𝐻| = 973, and
𝑚 ≥ (1/𝜖)(ln 973 + ln(1/𝛿))

How About EnjoySport?
... if we want to assure that with probability 95%, 𝑉𝑆 contains only hypotheses with 𝑒𝑟𝑟𝑜𝑟_𝒟(h) ≤ 0.1, then it is sufficient to have 𝑚 examples, where
𝑚 ≥ (1/0.1)(ln 973 + ln(1/0.05))
𝑚 ≥ 10(ln 973 + ln 20)
𝑚 ≥ 10(6.88 + 3)
𝑚 ≥ 98.8

Agnostic PAC Learning
So far, we assumed 𝑐 ∈ 𝐻 (consistent learners, noise-free).
Agnostic learning setting: don't assume 𝑐 ∈ 𝐻
• What do we want then?
– The hypothesis h that makes fewest errors on training data
Hoeffding bounds: for a single hypothesis h, and (by the union bound) for the best hypothesis chosen from 𝐻:
Pr[𝑒𝑟𝑟𝑜𝑟_𝒟(h) > 𝑒𝑟𝑟𝑜𝑟_𝐷(h) + 𝜖] ≤ 𝑒^(−2𝑚𝜖²)
Pr[𝑒𝑟𝑟𝑜𝑟_𝒟(h_best) > 𝑒𝑟𝑟𝑜𝑟_𝐷(h_best) + 𝜖] ≤ |𝐻| 𝑒^(−2𝑚𝜖²)
• What is sample complexity in this case?
𝑚 ≥ (1/(2𝜖²))(ln |𝐻| + ln(1/𝛿))
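The finite-|𝐻| sample-complexity bounds above are easy to evaluate directly. A sketch (my own code, not from the lecture), reproducing the EnjoySport arithmetic for both the realizable and the agnostic case:

```python
import math

def m_consistent(H_size, eps, delta):
    # Realizable case: m >= (1/eps) * (ln|H| + ln(1/delta)),
    # rounded up to a whole number of examples.
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def m_agnostic(H_size, eps, delta):
    # Agnostic case: m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta)).
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / (2 * eps ** 2))

# EnjoySport: |H| = 973, epsilon = 0.1, delta = 0.05.
print(m_consistent(973, 0.1, 0.05))  # 99  (the slide's 98.8, rounded up)
print(m_agnostic(973, 0.1, 0.05))    # 494
```

Note how dropping the realizability assumption changes the dependence on 𝜖 from 1/𝜖 to 1/(2𝜖²), roughly a five-fold increase here.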

PAC Learning
So far, we have only considered cases with a finite hypothesis space; however, in reality many of the hypothesis spaces that we use are infinite.
…what can we do/say about such hypothesis spaces?

Hypothesis Space Size
Question: Which hypothesis spaces are infinite, and which are finite?
• Decision trees (with discrete inputs)
• Neural networks
• Linear classifiers
• Decision trees (with continuous inputs)

Learners and Complexity
• Different learners have different complexity
• Complexity relates to the “representational power”
• Usual trade-off:
– More power = represent more complex systems, may overfit
– Less power: won’t overfit, but may not find “best learner”
Question: How can we quantify complexity?
– Not easily…
– One solution is VC (Vapnik-Chervonenkis) dimension

Learners and Complexity
• How does complexity relate to PAC learning?
– We want to be able to say something about the true error of the hypothesis based on the training error of the hypothesis
– We know that:
o in underfitting domain: those two errors are pretty similar
o In overfitting domain: test error (which is a representation of true error) might be much worse!

Shattering a Set of Instances
Definition: a dichotomy of a set 𝑆 is a partition of 𝑆 into two disjoint subsets.
Definition: a set of instances 𝑆 is shattered by hypothesis space 𝐻 if and only if for every dichotomy of 𝑆 there exists some hypothesis in 𝐻 consistent with this dichotomy.

Shattering
We say a classifier 𝑓(x; 𝜃) can shatter points x₁, …, x_m iff for all possible labellings 𝑦₁, …, 𝑦_m, 𝑓 can achieve zero error on the training data (x₁, 𝑦₁), …, (x_m, 𝑦_m) (i.e., there exists some 𝜃 that achieves zero error).
Example:
Can 𝑓(x) = sign(𝜃₀ + 𝜃₁𝑥₁ + 𝜃₂𝑥₂) shatter these points?

Shattering Example:
Can 𝑓(x) = sign(xᵀx − 𝜃) shatter these points?
This particular classifier cannot shatter these two points.
This configuration cannot be classified by 𝑓(x).

The Vapnik-Chervonenkis Dimension
Definition: The Vapnik-Chervonenkis dimension, 𝑉𝐶(𝐻), of hypothesis space 𝐻 defined over instance space 𝑋 is the size of the largest finite subset of 𝑋 shattered by 𝐻. If arbitrarily large finite sets of 𝑋 can be shattered by 𝐻, then 𝑉𝐶(𝐻) ≡ ∞.
𝑽𝑪 dimension: the largest set of inputs that the hypothesis class can shatter (label in all possible ways).
Note: the 𝑉𝐶 dimension can be defined for an infinite hypothesis space 𝐻 since it depends only on the size of finite subsets of 𝑋.

The Vapnik-Chervonenkis Dimension
Example:
𝑋 = {1,…,10}
𝐻 = {h(x) = [x ≥ 𝜃]}
For one input (e.g., 𝑆 = {6}), the hypothesis class can label it in all possible ways.
What about a pair of inputs (e.g., 𝑆 = {5, 7})?
• No: no threshold can label 5 positive and 7 negative. So 𝑉𝐶 dimension = 1.
Although 𝐻 here is an infinite hypothesis space, it is a weak hypothesis space and not very expressive!
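The shattering argument for the threshold class can be verified by brute force over all labellings. A sketch (my own code, not from the lecture):

```python
from itertools import product

# Threshold class from the slide: H = { h_theta(x) = [x >= theta] }.
def shatters_threshold(points):
    # Enough candidate thresholds: at each point, just above each point,
    # and below all points.
    cands = sorted(points) + [p + 0.5 for p in points] + [min(points) - 1]
    for labels in product([False, True], repeat=len(points)):
        # Does some threshold realise this labelling?
        if not any(all((x >= t) == l for x, l in zip(points, labels))
                   for t in cands):
            return False
    return True

print(shatters_threshold([6]))     # True: one point is shattered, VC >= 1
print(shatters_threshold([5, 7]))  # False: the labelling (+, -) is impossible
```

Because thresholds are monotone, the smaller point can never be labelled positive while the larger one is negative, so no pair of points is shattered and 𝑉𝐶(𝐻) = 1.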

The Vapnik-Chervonenkis Dimension
Question:
𝑋=R
𝐻 = {h(x)=x∈[𝑎,𝑏]}
• What is the size of parameter?
• What is the size of hypothesis space?
• What is the size of 𝑉𝐶 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛?

The Vapnik-Chervonenkis Dimension
Question:
𝑋=R
𝐻 = {h(x)=x∈[𝑎,𝑏]}
• What is the size of the parameter? 2 parameters, 𝑎 and 𝑏
• What is the size of the hypothesis space? Infinite
• What is the 𝑉𝐶 dimension? 𝑉𝐶 dimension = 2 (two points can always be shattered, but three cannot: the labelling +, −, + cannot be realised by a single interval)
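The same brute-force idea confirms the answer for the interval class. A sketch (my own code, not from the lecture):

```python
from itertools import product

# Interval class from the slide: H = { h(x) = [x in [a, b]] }.
def shatters_interval(points):
    # Candidate endpoints just around and at each point are sufficient.
    grid = sorted({p + d for p in points for d in (-0.5, 0, 0.5)})
    intervals = [(a, b) for a in grid for b in grid if a <= b]
    for labels in product([False, True], repeat=len(points)):
        # Does some interval [a, b] realise this labelling?
        if not any(all((a <= x <= b) == l for x, l in zip(points, labels))
                   for a, b in intervals):
            return False
    return True

print(shatters_interval([1, 2]))     # True: all four labellings, so VC >= 2
print(shatters_interval([1, 2, 3]))  # False: the +, -, + labelling fails
```

Any interval containing the two outer points must also contain the middle one, which is exactly why three points cannot be shattered and 𝑉𝐶(𝐻) = 2.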

The Vapnik-Chervonenkis Dimension
Question:
𝑋 = R)
𝐻 = {h(x) = [𝑤ᵀx ≥ 𝜃]}
• What is the size of parameter?
• What is the size of hypothesis space?
• What is the size of 𝑉𝐶 𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛?

The Vapnik-Chervonenkis Dimension
Question:
𝑋 = R)
𝐻 = {h(x) = [𝑤ᵀx ≥ 𝜃]}
• What is the size of the parameter? 3 parameters
• What is the size of the hypothesis space? Infinite
• What is the 𝑉𝐶 dimension? See the next 3 slides

VC Dimension of Linear Decision Surfaces
From the left, shown are three dichotomies of the same three instances. Can a linear classifier be found for the other five dichotomies? On the right, this set of four instances clearly cannot be shattered by a hypothesis space of linear classifiers.

VC Dimension of Linear Decision Surfaces
Clearly, for a subset of 2 instances we can find a linear classifier for all possible dichotomies.
The same argument as for 2 instances applies (see first three examples on previous slide, and case (a) on next slide), as long as the instances are not collinear (case (b) on next slide). So the 𝑉𝐶 dimension is at least 3.
However, in this setting, there is no set of 4 points that can be shattered. In general, for linear classifiers in 𝑑 dimensions (where 𝑑 is the number of features), the 𝑉𝐶 dimension is 𝑑 + 1.
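Both halves of the argument can be demonstrated empirically with a tiny perceptron, since the perceptron fits any linearly separable labelling and can never fit a non-separable one. A sketch (my own code, not from the lecture):

```python
from itertools import product

def fits(points, labels, epochs=1000):
    # Train a perceptron with bias on 2-D points with labels in {-1, +1};
    # return True iff it ends up separating the data.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # mistake: update
                w[0] += y * x1
                w[1] += y * x2
                b += y
    return all(y * (w[0] * x1 + w[1] * x2 + b) > 0
               for (x1, x2), y in zip(points, labels))

# Three non-collinear points: every one of the 2^3 labellings is linearly
# separable, so VC >= 3 = d + 1 for d = 2.
pts = [(0, 0), (1, 0), (0, 1)]
print(all(fits(pts, labels) for labels in product([-1, 1], repeat=3)))  # True

# The XOR labelling of four points is not linearly separable, so no number
# of epochs can ever fit it.
print(fits([(0, 0), (1, 1), (1, 0), (0, 1)], [-1, -1, 1, 1]))  # False
```

The perceptron convergence theorem guarantees that each separable labelling is fit after a bounded number of updates, well within the epoch budget used here.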

VC Dimension of Linear Decision Surfaces
[Figure: (a) three non-collinear points, which can be shattered by linear classifiers; (b) three collinear points, which cannot]

VC dimension and Number of parameters
Note that VC dimension does not necessarily equal the number of parameters!
Number of parameters does not necessarily equal complexity:
– We can define a classifier with a lot of parameters but not much power!
– We can define a classifier with one parameter, but lots of power! (see the example on the next slide)

How Complex is a Hypothesis ?
[Figure: the function sin(50𝑥) for 𝑥 ∈ [0, 1], with points separated by the indicator 𝐼(sin(𝛼𝑥) > 0)]

How Complex is a Hypothesis ?
The solid curve is the function sin(50𝑥) for 𝑥 ∈ [0,1].
The blue (solid) and green (hollow) points illustrate how the associated indicator function 𝐼(sin(𝛼 𝑥) > 0) can shatter (separate) an arbitrarily large number of points by choosing an appropriately high frequency 𝛼.
Classes separated based on sin(𝛼 𝑥), for frequency 𝛼, a single parameter.
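The one-parameter sine family really does realise every labelling of a small point set, which can be checked by sweeping the frequency 𝛼. A sketch (my own code, not from the lecture), using the points {1, 2, 3}:

```python
import math

# f_alpha(x) = sign(sin(alpha * x)): a single-parameter classifier.
# Sweep alpha over (0, 2*pi) and record which sign patterns appear on
# the three points.
points = [1.0, 2.0, 3.0]
patterns = set()
alpha = 0.013  # small offset so the sweep starts off the zeros of sin
while alpha < 2 * math.pi:
    patterns.add(tuple(math.sin(alpha * x) > 0 for x in points))
    alpha += 0.005
print(len(patterns))  # 8: all dichotomies of the three points are realised
```

With carefully chosen (e.g. geometrically spaced) points, the same family shatters arbitrarily large sets, which is why its 𝑉𝐶 dimension is infinite despite having a single parameter.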

Sample Complexity from VC Dimension
We can now generalize the PAC-learning result obtained earlier to answer the question: how many randomly drawn examples suffice to 𝜖-exhaust 𝑉𝑆_{𝐻,𝐷} with probability at least (1 − 𝛿)?
𝑚 ≥ (1/𝜖)(4 log₂(2/𝛿) + 8 𝑉𝐶(𝐻) log₂(13/𝜖))
So we see that the concept of the 𝑉𝐶 dimension of a hypothesis class gives us a general framework for characterizing the complexity or capacity of hypotheses in terms of their ability to express all possible target concepts in a particular learning setting.

Sample Complexity and 𝑽𝑪 dimension
• For a finite hypothesis space:
𝑚 ≥ (1/𝜖)(ln |𝐻| + ln(1/𝛿))
• For an infinite hypothesis space with finite 𝑉𝐶 dimension:
𝑚 ≥ (1/𝜖)(4 log₂(2/𝛿) + 8 𝑉𝐶(𝐻) log₂(13/𝜖))
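The VC-based bound above is just as easy to evaluate. A sketch (my own code, not from the lecture), using linear classifiers in 𝑑 = 2 dimensions, whose VC dimension is 𝑑 + 1 = 3:

```python
import math

def m_vc(vc, eps, delta):
    # m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps)),
    # rounded up to a whole number of examples.
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

# Linear classifiers in the plane, epsilon = 0.1, delta = 0.05.
print(m_vc(3, 0.1, 0.05))  # 1899
```

Bounds of this form are worst-case over all distributions and target concepts, so they are typically very loose compared with the number of examples needed in practice.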

𝑽𝑪 dimension of finite 𝑯
If 𝑑 = 𝑉𝐶(𝐻), then 𝐻 must contain hypotheses that shatter 𝑑 points, i.e., label them in all 2^𝑑 possible ways; each labelling requires a distinct hypothesis, so:
2^𝑑 ≤ |𝐻|
𝑑 ≤ log₂ |𝐻|
Theorem:
𝐻 is PAC-learnable if and only if its 𝑉𝐶 dimension is finite.

Summary
• PAC learning
• Sample complexity
• Version space
• VC dimension and hypothesis space complexity
• VC dimension captures PAC-learnability

Acknowledgements
• Material derived from slides for the book “Elements of Statistical Learning (2nd Ed.)” by T. Hastie, R. Tibshirani & J. Friedman. Springer (2009) http://statweb.stanford.edu/~tibs/ElemStatLearn/
• Material derived from slides for the book “Machine Learning: A Probabilistic Perspective” by K. Murphy MIT Press (2012) http://www.cs.ubc.ca/~murphyk/MLbook
• Material derived from slides for the book “Machine Learning” by P. Flach Cambridge University Press (2012) http://cs.bris.ac.uk/~flach/mlbook
• Material derived from slides for the book “Bayesian Reasoning and Machine Learning” by D. Barber Cambridge University Press (2012) http://www.cs.ucl.ac.uk/staff/d.barber/brml
• Material derived from slides for the book “Machine Learning” by T. Mitchell McGraw- Hill (1997) http://www-2.cs.cmu.edu/~tom/mlbook.html
• Material derived from slides for the course “Machine Learning” by A. Srinivasan BITS Pilani Goa Campus, India (2016)