COMS 4771 Introduction to Machine Learning
Nakul Verma

Towards formalizing ‘learning’
What does it mean to learn a concept?
• Gain knowledge or experience of the concept.
The basic process of learning
• Observe a phenomenon
• Construct a model from observations
• Use that model to make decisions / predictions
How can we make this more precise?

A statistical machinery for learning
Phenomenon of interest:
Input space: X    Output space: Y
There is an unknown distribution D over X × Y.
The learner observes m examples (x_1, y_1), ..., (x_m, y_m) drawn i.i.d. from D.
Construct a model:
Let F be a collection of models, where each f: X → Y predicts y given x.
Machine learning: from the m observations, select a model f_m ∈ F which predicts well, i.e. one with small
err(f_m) := P_{(x,y)~D}[ f_m(x) ≠ y ]    (generalization error of f_m)
We can say that we have learned the phenomenon if err(f_m) ≤ ε for any tolerance level ε > 0 of our choice.
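As a concrete illustration of this setup (a minimal sketch; the distribution, noise level, and the threshold 0.4 below are all invented for the example), one can define a toy distribution D over X × Y, fix a predictor f, and estimate its generalization error by Monte Carlo:

import numpy as np

rng = np.random.default_rng(0)

def sample_D(m):
    """Toy distribution D over X x Y: x ~ Uniform[0, 1], y = 1[x > 0.5] with 10% label noise."""
    x = rng.uniform(0, 1, size=m)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, size=m) < 0.1          # flip 10% of the labels
    return x, np.where(flip, 1 - y, y)

def f(x):
    """A fixed (deliberately suboptimal) model: threshold at 0.4."""
    return (x > 0.4).astype(int)

# Monte Carlo estimate of err(f) = P_{(x,y)~D}[ f(x) != y ]
x, y = sample_D(1_000_000)
print("estimated generalization error of f:", np.mean(f(x) != y))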

PAC Learning
For all tolerance levels ε > 0 and all confidence levels δ > 0, if there exists some model selection algorithm A that selects f_m ∈ F from m = m(ε, δ) observations, i.e. A: (x_1, y_1), ..., (x_m, y_m) ↦ f_m ∈ F, and has the property:
with probability at least 1 − δ over the draw of the sample,  err(f_m) ≤ min_{f ∈ F} err(f) + ε,
we call:
• The model class F PAC-learnable.
• If the sample size m is polynomial in 1/ε and 1/δ, then F is efficiently PAC-learnable.
A popular algorithm:
Empirical risk minimizer (ERM) algorithm:  f_ERM := argmin_{f ∈ F} (1/m) Σ_{i=1}^m 1[ f(x_i) ≠ y_i ]
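A minimal sketch of the ERM rule over a small finite class (threshold classifiers on the real line; the class, the data-generating threshold 0.5, and the noise rate are invented for illustration):

import numpy as np

rng = np.random.default_rng(1)

# A finite model class F: threshold classifiers f_t(x) = 1[x > t] on a grid of thresholds t.
thresholds = np.linspace(0, 1, 21)

# A sample S = (x_1, y_1), ..., (x_m, y_m) from some (here: synthetic) distribution.
m = 200
x = rng.uniform(0, 1, size=m)
y = ((x > 0.5) ^ (rng.uniform(size=m) < 0.1)).astype(int)   # true threshold 0.5, 10% label noise

# ERM: pick argmin_{f in F} (1/m) sum_i 1[f(x_i) != y_i]
sample_errors = [np.mean((x > t).astype(int) != y) for t in thresholds]
best = int(np.argmin(sample_errors))
print(f"ERM picks threshold {thresholds[best]:.2f} with sample error {sample_errors[best]:.3f}")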

PAC learning simple model classes
Theorem (finite size |F|):
Pick any tolerance level ε > 0 and any confidence level δ > 0.
Let (x_1, y_1), ..., (x_m, y_m) be m examples drawn i.i.d. from an unknown distribution D.
If m ≥ (1 / 2ε²) ln(2|F| / δ), then with probability at least 1 − δ: |err(f) − err_S(f)| ≤ ε for every f ∈ F, and hence
err(f_ERM) ≤ min_{f ∈ F} err(f) + 2ε
⟹ F is efficiently PAC learnable
Occam’s Razor Principle:
All things being equal, usually the simplest explanation
of a phenomenon is a good hypothesis. Simplicity = representational succinctness

Proof sketch
Define:
err(f)   := P_{(x,y)~D}[ f(x) ≠ y ]              (generalization error of f)
err_S(f) := (1/m) Σ_{i=1}^m 1[ f(x_i) ≠ y_i ]    (sample error of f)
We need to analyze err(f_ERM) − min_{f ∈ F} err(f). Writing f* := argmin_{f ∈ F} err(f),
err(f_ERM) − err(f*) = [err(f_ERM) − err_S(f_ERM)] + [err_S(f_ERM) − err_S(f*)] + [err_S(f*) − err(f*)]
The middle term is ≤ 0 by the definition of ERM, so it suffices to control the uniform deviation of the sample error err_S(f) from its expectation err(f), simultaneously over all f ∈ F.

Proof sketch
Fix any f ∈ F and a sample S = (x_1, y_1), ..., (x_m, y_m); define the random variables C_i := 1[ f(x_i) ≠ y_i ].
Then err(f) = E[C_i]                        (generalization error of f)
and err_S(f) = (1/m) Σ_{i=1}^m C_i          (sample error of f)
Lemma (Chernoff-Hoeffding bound ‘63):
Let C_1, ..., C_m be m Bernoulli r.v. drawn independently from B(p). For any tolerance level ε > 0:
P[ | (1/m) Σ_{i=1}^m C_i − p | > ε ] ≤ 2·exp(−2ε²m)
A classic result in concentration of measure, proof later
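The lemma is easy to check empirically. A small simulation (not from the lecture) comparing the observed deviation probability against the bound 2·exp(−2ε²m):

import numpy as np

rng = np.random.default_rng(2)
p, eps, trials = 0.3, 0.1, 20_000

for m in (10, 50, 100, 500):
    samples = rng.binomial(1, p, size=(trials, m))        # trials independent samples of m Bernoulli(p) r.v.
    deviated = np.abs(samples.mean(axis=1) - p) > eps     # |sample mean - p| > eps ?
    bound = min(2 * np.exp(-2 * eps**2 * m), 1.0)
    print(f"m={m:4d}  empirical={deviated.mean():.4f}  Chernoff-Hoeffding bound={bound:.4f}")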

Proof sketch
Need to analyze: P[ ∃ f ∈ F : |err_S(f) − err(f)| > ε ] ≤ Σ_{f ∈ F} P[ |err_S(f) − err(f)| > ε ] ≤ 2|F|·exp(−2ε²m)   (union bound + Chernoff-Hoeffding)
Equivalently, by choosing m ≥ (1 / 2ε²) ln(2|F| / δ): with probability at least 1 − δ, for all f ∈ F, |err_S(f) − err(f)| ≤ ε.

PAC learning simple model classes
Theorem (Occam’s Razor):
Pick any tolerance level ε > 0 and any confidence level δ > 0.
Let (x_1, y_1), ..., (x_m, y_m) be m examples drawn i.i.d. from an unknown distribution D.
If m ≥ (1 / 2ε²) ln(2|F| / δ), then with probability at least 1 − δ: |err(f) − err_S(f)| ≤ ε for every f ∈ F, and hence
err(f_ERM) ≤ min_{f ∈ F} err(f) + 2ε
⟹ F is efficiently PAC learnable
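Reading the bound as m ≥ (1 / 2ε²) ln(2|F| / δ) (the form used above; exact constants vary across textbook statements), the required sample size is a one-line computation:

import math

def occam_sample_size(eps, delta, class_size):
    """Smallest m with m >= (1 / (2 eps^2)) * ln(2 |F| / delta).
    Note: the constants follow the form stated above; other texts differ slightly."""
    return math.ceil(math.log(2 * class_size / delta) / (2 * eps**2))

# e.g. |F| = 10^6 models, tolerance eps = 0.05, confidence delta = 0.01
print(occam_sample_size(eps=0.05, delta=0.01, class_size=10**6))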

One thing left…
Still need to prove:
Lemma (Chernoff-Hoeffding bound ‘63):
Let C_1, ..., C_m be m Bernoulli r.v. drawn independently from B(p). For any tolerance level ε > 0:
P[ | (1/m) Σ_{i=1}^m C_i − p | > ε ] ≤ 2·exp(−2ε²m)
That is: how far the sample average (1/m) Σ_i C_i deviates from the true average p, as a function of the number of samples (m).
Need to analyze: how the probability measure concentrates towards a central value (like the mean).

Detour: Concentration of Measure
Let’s start with something simple:
Let X be a non-negative random variable.
For a given constant c > 0, what is P[X ≥ c]?
Observation: c · 1[X ≥ c] ≤ X. Take expectations on both sides: c · P[X ≥ c] ≤ E X.
Markov’s Inequality:  P[X ≥ c] ≤ E X / c
Now apply Markov’s inequality to the non-negative random variable (X − E X)²:
P[ |X − E X| ≥ c ] = P[ (X − E X)² ≥ c² ] ≤ E[(X − E X)²] / c²   (by Markov’s Inequality)
Chebyshev’s Inequality:  P[ |X − E X| ≥ c ] ≤ Var(X) / c²
True for all distributions!
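A quick numerical sanity check of both inequalities (illustration only, using an Exponential(1) random variable, for which E X = Var(X) = 1):

import numpy as np

rng = np.random.default_rng(3)
X = rng.exponential(scale=1.0, size=1_000_000)   # non-negative, EX = 1, Var(X) = 1
c = 3.0

print("P[X >= c]               ~", np.mean(X >= c))
print("Markov bound EX/c        =", X.mean() / c)
print("P[|X - EX| >= c]        ~", np.mean(np.abs(X - X.mean()) >= c))
print("Chebyshev bound Var/c^2  =", X.var() / c**2)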

Concentration of Measure
Sharper estimates using an exponential!
Let X be a random variable (not necessarily non-negative).
For some given constant c > 0, observe that for any t > 0:
P[X ≥ c] = P[ exp(tX) ≥ exp(tc) ] ≤ E[exp(tX)] / exp(tc)   (by Markov’s Inequality)
Optimizing the right-hand side over t > 0 gives the sharpest such bound.
This is called Chernoff’s bounding method.
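To see why the exponential moment gives sharper estimates, compare the two bounds on the deviation of a Bernoulli(1/2) sample mean; the form 2·exp(−2mε²) used here is the Hoeffding-style bound derived on the next slide.

import numpy as np

p, eps = 0.5, 0.1
for m in (10, 100, 1000, 10000):
    chebyshev = min(p * (1 - p) / (m * eps**2), 1.0)      # Var(sample mean) / eps^2
    exponential = min(2 * np.exp(-2 * m * eps**2), 1.0)   # Chernoff-style (Hoeffding) bound
    print(f"m={m:5d}  Chebyshev={chebyshev:.6f}  exponential={exponential:.6f}")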

Concentration of Measure
Now, given X_1, …, X_m i.i.d. random variables (assume 0 ≤ X_i ≤ 1),
define Y_i := X_i − E X_i.
By Chernoff’s bounding technique, for any t > 0:
P[ (1/m) Σ_i Y_i ≥ ε ] ≤ exp(−tmε) E[ exp(t Σ_i Y_i) ] = exp(−tmε) Π_i E[ exp(t Y_i) ]   (Y_i i.i.d.)
Each factor satisfies E[exp(t Y_i)] ≤ exp(t²/8) (Hoeffding’s lemma, since 0 ≤ X_i ≤ 1), so choosing the optimal t = 4ε gives
P[ (1/m) Σ_i Y_i ≥ ε ] ≤ exp(−2mε²).
Applying the same argument to −Y_i and taking a union bound gives the two-sided bound 2·exp(−2mε²).
This implies the Chernoff-Hoeffding bound!

Back to Learning Theory!
Theorem (Occam’s Razor):
Pick any tolerance level ε > 0 and any confidence level δ > 0.
Let (x_1, y_1), ..., (x_m, y_m) be m examples drawn i.i.d. from an unknown distribution D.
If m ≥ (1 / 2ε²) ln(2|F| / δ), then with probability at least 1 − δ:
err(f_ERM) ≤ min_{f ∈ F} err(f) + 2ε
⟹ F is efficiently PAC learnable

Learning general concepts
Consider linear classification: F = { x ↦ sign(w·x + b) : w ∈ R^d, b ∈ R }
Since |F| is infinite, the Occam’s Razor bound (which scales with ln|F|) is ineffective.

VC Theory
Need to capture the true richness of F.
Definition (Vapnik-Chervonenkis or VC dimension):
We say that a model class F has VC dimension d if d is the size of the largest set of points x_1, ..., x_d such that, for all possible labelings of x_1, ..., x_d,
there exists some f ∈ F that achieves that labeling.
Example: F = linear classifiers in R²  (VC dimension = 3)
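This example can be verified by brute force (a sketch, not part of the lecture): linear separability of a labeled point set is a small feasibility LP (find w, b with y_i(w·x_i + b) ≥ 1 for all i), and shattering means every labeling is separable.

import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Is there (w, b) with y_i (w . x_i + b) >= 1 for all i?  (feasibility LP)"""
    n, d = X.shape
    s = np.where(y == 1, 1.0, -1.0)
    A_ub = -s[:, None] * np.hstack([X, np.ones((n, 1))])   # -s_i * [x_i, 1] . (w, b) <= -1
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1), method="highs")
    return res.status == 0          # 0 = feasible, 2 = infeasible

def shattered_by_lines(X):
    """Do linear classifiers realize every labeling of the points in X?"""
    return all(linearly_separable(X, np.array(y))
               for y in itertools.product([0, 1], repeat=len(X)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # 3 points in general position
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])    # the XOR labeling fails here
print("3 points shattered:", shattered_by_lines(three))   # True  -> VC dimension >= 3
print("4 points shattered:", shattered_by_lines(four))    # False for this set; in fact no 4 points work, so VC dim = 3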

VC Dimension
Another example:
F = axis-aligned rectangles in R²  (VC dimension = 4)
VC dimension:
• A combinatorial concept to capture the true richness of F.
• Often (but not always!) proportional to the degrees of freedom or the number of independent parameters in F.
The class of rectangles cannot realize a labeling in which the bounding box of the positive points contains a negatively labeled point.
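The figure with the unrealizable labeling is not reproduced here, but the standard example is easy to check in code: a labeling is realizable by an axis-aligned rectangle exactly when the bounding box of the positive points contains no negative point. The sketch below (points chosen purely for illustration) confirms that a 4-point "diamond" is shattered, while labeling the four extreme points of a 5-point set positive and the interior point negative is not realizable.

import itertools
import numpy as np

def rectangle_realizable(X, y):
    """Axis-aligned rectangles realize a labeling iff the bounding box of the
    positive points contains no negatively labeled point."""
    pos, neg = X[y == 1], X[y == 0]
    if len(pos) == 0:
        return True                                  # an empty rectangle labels everything 0
    lo, hi = pos.min(axis=0), pos.max(axis=0)
    return not np.all((neg >= lo) & (neg <= hi), axis=1).any()

def shattered_by_rectangles(X):
    return all(rectangle_realizable(X, np.array(y))
               for y in itertools.product([0, 1], repeat=len(X)))

diamond = np.array([[0, 1], [0, -1], [1, 0], [-1, 0]], dtype=float)          # 4 "diamond" points
five = np.array([[0, 1], [0, -1], [1, 0], [-1, 0], [0, 0]], dtype=float)     # add an interior point
print("4 diamond points shattered:", shattered_by_rectangles(diamond))  # True  -> VC dimension >= 4
print("5 points shattered:", shattered_by_rectangles(five))             # False for this set; no 5 points work, so VC dim = 4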

VC Theorem
Theorem (Vapnik-Chervonenkis ’71):
Pick any tolerance level ε > 0 and any confidence level δ > 0.
Let (x_1, y_1), ..., (x_m, y_m) be m examples drawn i.i.d. from an unknown distribution D, and let d be the VC dimension of F.
If m ≥ C·(d + ln(1/δ)) / ε²  (for a universal constant C), then with probability at least 1 − δ:
err(f_ERM) ≤ min_{f ∈ F} err(f) + ε
⟹ F is efficiently PAC learnable
VC Theorem ⟹ Occam’s Razor Theorem (for a finite class, d ≤ log₂|F|)

Tightness of VC bound
Theorem (VC lower bound):
Let A be any model selection algorithm that, given m samples, returns a model from F, that is, A: (x_1, y_1), ..., (x_m, y_m) ↦ f_m ∈ F.
For all tolerance levels ε > 0 and all confidence levels δ > 0, there exists a distribution D such that if m ≤ c·(d + ln(1/δ)) / ε²  (for a universal constant c), then with probability at least δ:  err(f_m) > min_{f ∈ F} err(f) + ε.

Some implications
• VC dimension of a model class fully characterizes its learning ability!
• Results are agnostic to the underlying distribution.

One algorithm to rule them all?
From our discussion it may seem that the ERM algorithm is universally consistent.
This is not the case!
Theorem (no free lunch, Devroye ‘82):
Pick any sample size m, any algorithm A, and any ε > 0. There exists a distribution D such that
err(f_m) > 1/2 − ε,
while the Bayes optimal error is min_f err(f) = 0.

Further refinements and extensions
• How to do model class selection? Structural risk minimization.
• Dealing with kernels: fat-margin theory.
• Incorporating priors over the models: PAC-Bayes theory.
• Is it possible to get distribution-dependent bounds? Rademacher complexity.
• How about regression? Similar results can be derived for nonparametric regression.

What We Learned…
• Formalizing learning
• PAC learnability
• Occam’s razor Theorem
• VC dimension and VC theorem
• No Free-lunch theorem

Questions?