Outline
From structured risk minimization to prediction error
Error bounds: Hoeffding, Union, Vapnik-Chervonenkis
VC dimension of typical models
Classification in (high-dimensional) feature space, and kernels
1/25
Recap: Structured Risk Minimization
Structured risk minimization is a method for model selection that structures the space of solutions into a sequence of increasingly large sets and chooses among competing solutions the one that is included in the smallest set.
Example: Concentric spheres Sn = {θ ∈ Rd : ∥θ∥2 < Cn} with Cn < Cn+1. We recover common regularization schemes for linear models, e.g. ridge regression, large-margin, etc.
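As an illustration of this idea (an addition to the slides), the sketch below treats decreasing ridge penalties as a sequence of growing sets Sn and selects the most constrained model whose validation error is acceptable; the synthetic data and the 0.05 threshold are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Hypothetical data; the split and the acceptance threshold are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# Larger alpha corresponds to a smaller sphere S_n; iterate from most to least constrained.
for alpha in [100.0, 10.0, 1.0, 0.1, 0.01]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    val_error = np.mean((model.predict(X_va) - y_va) ** 2)
    if val_error < 0.05:  # accept the first (most constrained) adequate model
        print(f"selected alpha={alpha}, validation MSE={val_error:.4f}")
        break
```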
Question: Can we link model complexity and prediction strength directly?
2/25
Formalizing the Learning Problem
Given some ground-truth joint distribution P (x, y) over the input x ∈ Rd and the output y ∈ {−1, 1} (or R for regression), we measure the error of some classifier f : Rd → R using the risk function:
R[f] = ∫ l(f(x), y) dP(x, y)

where l(f(x), y) is some loss function.

Examples:
0/1 loss (classification):  l(f(x), y) = 0 if f(x)·y > 0, and 1 otherwise
MSE (regression):  l(f(x), y) = ½ (f(x) − y)²
3/25
Formalizing the Learning Problem
We would like to minimize the risk function
R[f] = ∫ l(f(x), y) dP(x, y)

Problem: P is unknown → we need an inductive principle.

A related quantity that can be computed from some training sample
D = (x1, y1), …, (xN, yN)
is the empirical risk, which replaces P(x, y) by an average over the training sample:

Remp[f] = (1/N) Σ_{i=1}^{N} l(f(xi), yi)
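As a small added illustration, the sketch below computes Remp[f] for the two example losses on toy data; the fixed scoring function and the dataset are assumptions made for the example.

```python
import numpy as np

def zero_one_loss(f_x, y):
    """0/1 loss: 0 if the prediction has the correct sign, 1 otherwise."""
    return np.where(f_x * y > 0, 0.0, 1.0)

def squared_loss(f_x, y):
    """MSE loss: 1/2 (f(x) - y)^2."""
    return 0.5 * (f_x - y) ** 2

def empirical_risk(f, X, y, loss):
    """Remp[f] = (1/N) sum_i loss(f(x_i), y_i)."""
    return np.mean(loss(f(X), y))

# Toy example: a fixed linear scoring function on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + X[:, 1])            # labels in {-1, +1}
f = lambda X: X[:, 0] + 0.5 * X[:, 1]     # some fixed classifier score
print("Remp (0/1):", empirical_risk(f, X, y, zero_one_loss))
```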
4/25
Predicting R[f]
[Figure: estimating a parameter (law of large numbers) vs. learning a ML model (?)]
Questions:
Given a dataset of N points, how well will my model generalize? How quickly do R[f] and Remp[f] converge as N grows?
Do they actually converge?
5/25
Predicting R[f]
Cross-validation (last week)
Simulate the true distribution by randomly splitting the available data between training and test. This approach provides an estimator of the risk R[f].
Error bounds (this week)
Bound R[f] based on model complexity. This can be achieved for
specific measures of complexity, e.g. the VC dimension.
In the next few slides, we give a brief overview of error bounds, based on Bousquet et al. (2003) Introduction to Statistical Learning Theory.
6/25
Error Bounds: Hoeffding’s Inequality
Hoeffding’s inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. (Wikipedia)
Theorem (Hoeffding). Let Z1, …, ZN be N i.i.d. random variables with g(Z) ∈ [a, b]. Then, for all ε > 0, we have:

Pr( | (1/N) Σ_{i=1}^{N} g(Zi) − E[g(Z)] | ≥ ε )  ≤  2 exp( −2Nε² / (b − a)² )
When applied to the variables composing the risk, i.e. Zi = (f(xi), yi) are the prediction/label pairs, g = l, and also observing that l(·) ∈ [0, 1], we get:
P( |Remp[f] − R[f]| > ε ) ≤ 2 exp(−2Nε²)

which, setting δ = 2 exp(−2Nε²) and solving for ε, further reduces to:

P( R[f] ≤ Remp[f] + √( log(2/δ) / (2N) ) ) ≥ 1 − δ
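As a quick numerical illustration (an addition, not from the slides), the snippet below evaluates the width of the Hoeffding confidence term √(log(2/δ)/(2N)) for a few sample sizes; the chosen δ and values of N are arbitrary.

```python
import numpy as np

def hoeffding_gap(N, delta):
    """Width of the Hoeffding confidence term sqrt(log(2/delta) / (2N))."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * N))

for N in [100, 1_000, 10_000, 100_000]:
    print(f"N={N:>7d}  R[f] <= Remp[f] + {hoeffding_gap(N, delta=0.05):.4f}")
```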
7/25
Error Bounds: Hoeffding’s Inequality
Problem: When training a machine learning model, we make the function f depend on the data. This causes the random variables Z1, …, ZN representing the predictions and labels to no longer be i.i.d. Therefore, Hoeffding's inequality is not applicable in that case.

Idea: Avoid the dependency on a particular function f by considering instead a supremum over all functions in F:

R[f] − Remp[f]  ≤  max_{g∈F} ( R[g] − Remp[g] )
i.e. the worst-case gap between true and empirical risk.
8/25
Error Bounds: Union Bound
We build a union bound (of the type P(A ∪ B) ≤ P(A) + P(B)) for the expression above, and then apply Hoeffding's inequality to each individual function:
P( max_{g∈F} (R[g] − Remp[g]) > ε )  =  P( ∪_{g∈F} { R[g] − Remp[g] > ε } )
                                     ≤  Σ_{g∈F} P( R[g] − Remp[g] > ε )
                                     ≤  |F| exp(−2Nε²).

After some manipulations, we get:

R[f]  ≤  Remp[f] + √( (log|F| + log(1/δ)) / (2N) )

with probability at least 1 − δ.
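The following added sketch shows that the union-bound term grows only logarithmically with the number of candidate functions |F|; the values of N, δ, and |F| are arbitrary.

```python
import numpy as np

def union_bound_gap(N, size_F, delta):
    """Confidence term sqrt((log|F| + log(1/delta)) / (2N))."""
    return np.sqrt((np.log(size_F) + np.log(1.0 / delta)) / (2.0 * N))

N, delta = 10_000, 0.05
for size_F in [10, 1_000, 10**6, 10**9]:
    print(f"|F|={size_F:>12d}  gap={union_bound_gap(N, size_F, delta):.4f}")
```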
9/25
Error Bounds: Vapnik-Chervonenkis
Refinement of the union bound in the previous slide.
Spread the δs unevenly across points, so as to only consider those that are likely to be misclassified.
Make the approach applicable to large F by measuring the size of F ‘projected on the data’:
F_{x1,…,xN} = { (f(x1), …, f(xN)) : f ∈ F }

S_F(N) = max_{x1,…,xN} |F_{x1,…,xN}|    (aka the growth function)

Applying these two steps, we can arrive at the bound

R[f]  ≤  Remp[f] + 2 √( 2 (log S_F(2N) + log(2/δ)) / N )
with probability at least 1 − δ (cf. Bousquet 2003).
10/25
Error bounds (Summary)
Hoeffding:             R[f] ≤ Remp[f] + √( log(2/δ) / (2N) )

Union Bound:           R[f] ≤ Remp[f] + √( (log|F| + log(1/δ)) / (2N) )

Vapnik-Chervonenkis:   R[f] ≤ Remp[f] + 2 √( 2 (log S_F(2N) + log(2/δ)) / N )

Notes:
All three bounds hold with probability at least 1 − δ.
Hoeffding’s bound only holds for fixed functions (i.e. not applicable to a trained classifier).
VC bounds are also applicable to large (infinite) function spaces.
11/25
Error Bounds: Vapnik-Chervonenkis
The VC dimension is the largest number of data points for which some configuration of points can be classified by the function class in every possible way:

h_F = max{ N : S_F(N) = 2^N }
Another VC bound, where the VC dimension is stated explicitly:
R[f]  ≤  Remp[f] + 2 √( 2 (h_F (log(2N/h_F) + 1) + log(2/δ)) / N )
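The added sketch below evaluates this explicit VC bound as a function of N; the empirical risk of 0.05 and the VC dimension h_F = 11 (e.g. a linear classifier in d = 10, see the next slides) are hypothetical values.

```python
import numpy as np

def vc_bound(remp, N, h, delta=0.05):
    """R[f] <= Remp[f] + 2*sqrt(2*(h*(log(2N/h) + 1) + log(2/delta)) / N)."""
    return remp + 2.0 * np.sqrt(2.0 * (h * (np.log(2.0 * N / h) + 1.0)
                                       + np.log(2.0 / delta)) / N)

# Hypothetical numbers: empirical risk 0.05, VC dimension h = 11
# (e.g. a linear classifier in d = 10 dimensions, where h = d + 1).
for N in [1_000, 10_000, 100_000]:
    print(f"N={N:>7d}  bound on R[f]: {vc_bound(0.05, N, h=11):.3f}")
```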
12/25
Model Complexity: Sinusoid Function
f_θ : R → {−1, 1},  f_θ(x) = sign(sin(θx))
F = { f_θ : θ ∈ R }

log|F| = ∞  →  the union bound doesn't provide a guarantee
h_F = ∞  →  the VC bound doesn't provide a guarantee either
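A classic construction (not spelled out on the slide) makes the second statement concrete: for the points x_i = 10^(−i), any labeling can be realized by a suitable θ. The sketch below checks this numerically for a small N.

```python
import numpy as np

# sign(sin(theta * x)) shatters the points x_i = 10^{-i}: for any labeling y
# in {-1, +1}^N a suitable theta exists. Small N only, to stay within
# floating-point precision.
def shattering_theta(y):
    c = (1 - np.asarray(y)) / 2                      # map +1 -> 0, -1 -> 1
    i = np.arange(1, len(y) + 1)
    return np.pi * (1 + np.sum(c * 10.0 ** i))

y = np.array([1, -1, -1, 1, -1])                     # an arbitrary labeling
x = 10.0 ** -np.arange(1, len(y) + 1)
theta = shattering_theta(y)
print(np.sign(np.sin(theta * x)))                    # reproduces y
```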
13/25
Linear Classifier
f_{w,b} : R^d → R,  f_{w,b}(x) = sign(w⊤x + b)
F = { f_{w,b} : w ∈ R^d, b ∈ R }

log|F| = ∞  →  the union bound doesn't provide a guarantee
h_F = d + 1  →  the VC bound guarantees convergence

Convergence is especially fast when d is low.
14/25
Linear Classifier with Margin
f_{w,b} : R^d → R,  f_{w,b}(x) = sign(w⊤x + b)
F = { f_{w,b} : w ∈ R^d, b ∈ R,  ∀ i = 1, …, N : |w⊤xi + b| ≥ 1,  1/∥w∥ = M/2 }

h_F  ≤  min( d + 1,  4R²/M² + 1 )

(Burges (1998) A Tutorial on Support Vector Machines for Pattern Recognition)

The VC bound guarantees convergence. Convergence is especially fast when d is low or when the margin is large.

[Figure: linearly separable data inside a ball of radius R, separated with margin M]
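As an added illustration, the snippet below compares the two terms of this bound for hypothetical values of the dimension d, data radius R, and margin M; it shows how a large margin keeps the capacity small even when d is huge.

```python
import numpy as np

def margin_vc_bound(d, R, M):
    """h_F <= min(d + 1, 4 R^2 / M^2 + 1) (cf. Burges, 1998)."""
    return min(d + 1, 4.0 * R**2 / M**2 + 1.0)

# Hypothetical numbers: high-dimensional data inside a ball of radius 1.
d, R = 10_000, 1.0
for M in [0.05, 0.2, 0.5]:
    print(f"margin M={M:.2f}  h_F <= {margin_vc_bound(d, R, M):.1f}")
```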
15/25
From Linear to Nonlinear Models
Linear models of the type f(x) = w⊤x + b are easy to regularize (e.g. via dimensionality reduction or a large margin). However, the class of linear models is too restrictive (it cannot solve nonlinear problems).
Approach 1: Consider an arbitrary nonlinear function f(x) (e.g. a quadratic model f(x) = x⊤Ax + b⊤x + c), and learn the parameters of this model from the data.
Approach 2: Apply a fixed nonlinear feature map φ : Rd → RD to the data, and learn a linear model in the image domain:
f(x) = β⊤φ(x) + b
16/25
Linear Model in Feature Space
Procedure:
1. Consider a fixed feature map:
φ : R^d → R^D,  x ↦ φ(x)
where d ≪ D (typically)
2. Learn β ∈ RD and b ∈ R such that the function f(x) = β⊤φ(x) + b
correctly predicts the data.
17/25
Linear Model in Feature Space
Example: All second-order monomials
φ : R² → R³
(x1, x2) ↦ (z1, z2, z3) := (x1², √2·x1·x2, x2²)
18/25
Linear Model in Feature Space
Remarks:
Classical statistics: harder as the data become high-dimensional.
VC-dimension viewpoint: not harder as we can control complexity via other means (e.g. by enforcing a large margin in the feature space).
⇒ complexity matters, not dimensionality.
19/25
From Feature Spaces to Kernels
Example:
x = (x1,x2)
φ(x) = (x1², √2·x1·x2, x2²)

The dot product in the (high-dimensional) feature space can be computed in R²:

⟨φ(x), φ(x′)⟩ = ⟨x, x′⟩²  =:  k(x, x′)
This is known as the kernel trick (cf. Boser, Guyon & Vapnik 1992).
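A small numerical check (added here for illustration) that the explicit second-order feature map reproduces the squared dot product; the two input vectors are arbitrary.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the 2D example: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, xp):
    """Degree-2 polynomial kernel without offset: <x, x'>^2."""
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(phi(x), phi(xp)), k(x, xp))   # both give the same value
```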
20/25
The Kernel Trick in Practice
Example:
We have a dataset D = {x1, . . . , xN } and would like to compute the (squared) distance to the mean in feature space, i.e.
f(x) = ∥ φ(x) − (1/N) Σ_{i=1}^{N} φ(xi) ∥²

Applying the identity

∥a − b∥² = ⟨a, a⟩ − 2⟨a, b⟩ + ⟨b, b⟩

we arrive at the formulation

f(x) = ⟨φ(x), φ(x)⟩ − (2/N) Σ_{i=1}^{N} ⟨φ(x), φ(xi)⟩ + (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N} ⟨φ(xi), φ(xj)⟩
     =     k(x, x)   − (2/N) Σ_{i=1}^{N}    k(x, xi)   + (1/N²) Σ_{i=1}^{N} Σ_{j=1}^{N}    k(xi, xj)
where the function (initially defined in feature space) can be expressed solely in terms of the kernel function.
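A minimal added sketch of this computation, using a Gaussian kernel as an arbitrary choice of Mercer kernel and random toy data:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def dist_to_mean_sq(x, X, kernel):
    """||phi(x) - mean_i phi(x_i)||^2 expressed with kernel evaluations only."""
    x = x[None, :]
    return (kernel(x, x)[0, 0]
            - 2.0 * kernel(x, X).mean()
            + kernel(X, X).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # toy dataset
print(dist_to_mean_sq(rng.normal(size=3), X, gaussian_kernel))
```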
21/25
From Kernels to Feature Spaces
[Mercer] If k is continuous, symmetric, and for all square-integrable functions g(x) satisfies the condition:

∫∫ g(x) k(x, x′) g(x′) dx dx′  ≥  0

then the kernel can be expanded as:

k(x, x′) = Σ_{i=1}^{N_F} λ_i ψ_i(x) ψ_i(x′)

with λ_i > 0, and N_F ∈ N or N_F = ∞, and we can construct the feature map

φ(x) := ( √λ_1 ψ_1(x), √λ_2 ψ_2(x), … )

satisfying ⟨φ(x), φ(x′)⟩ = k(x, x′)
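As an added finite-sample analogue of this condition, one can check that the Gram matrix of a Mercer kernel on any dataset has no negative eigenvalues (up to round-off); the degree-2 polynomial kernel and the random data below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))

# Gram matrix of the degree-2 polynomial kernel k(x, x') = <x, x'>^2
K = (X @ X.T) ** 2
eigvals = np.linalg.eigvalsh(K)
print("smallest eigenvalue:", eigvals.min())   # >= 0 up to round-off
```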
22/25
Example
Show that k : R^d × R^d → R with k(x, x′) = c + x1·x1′, and c > 0, is a Mercer kernel.

Find a feature map φ(x) associated to this kernel.
23/25
Example of Kernels
Examples of commonly used kernels satisfying the Mercer property
Linear:       k(x, x′) = ⟨x, x′⟩
Polynomial:   k(x, x′) = (⟨x, x′⟩ + β)^γ          β ∈ R≥0, γ ∈ N
Gaussian:     k(x, x′) = exp(−γ ∥x − x′∥²)        γ ∈ R>0
Student:      k(x, x′) = 1 / (α + ∥x − x′∥²)      α ∈ R>0
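A compact added sketch implementing these four kernels for a single pair of inputs; the parameter values are arbitrary but respect the constraints above.

```python
import numpy as np

def linear(x, xp):
    return np.dot(x, xp)

def polynomial(x, xp, beta=1.0, gamma=3):
    return (np.dot(x, xp) + beta) ** gamma

def gaussian(x, xp, gamma=0.5):
    return np.exp(-gamma * np.sum((x - xp) ** 2))

def student(x, xp, alpha=1.0):
    return 1.0 / (alpha + np.sum((x - xp) ** 2))

x, xp = np.array([1.0, 2.0]), np.array([0.0, -1.0])
for kern in (linear, polynomial, gaussian, student):
    print(kern.__name__, kern(x, xp))
```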
24/25
Final remarks
Cross-validation vs. VC dimension
Last week: Since it was unclear how to measure complexity, we tried to directly predict the generalization error via cross-validation, and to choose parameters that minimize this error.
This week: Model complexity can be reliably measured (VC dimension). This allows us to do model selection (e.g. prioritize large-margin models) without having to consume data.
Nonlinear representation and kernels
Linearly unsolvable problems can become solvable in high-dimensional feature spaces.
High dimensionality is not a big problem, because other techniques (e.g. large margin) can be used to control capacity.
25/25