
Outline
• From structured risk minimization to prediction error
• Error bounds: Hoeffding, Union, Vapnik-Chervonenkis
• VC dimension of typical models
• Classification in (high-dimensional) feature space, and kernels
1/25

Recap: Structured Risk Minimization
Structured risk minimization is a method for model selection that structures the space of solutions into a sequence of increasingly large sets and chooses, among competing solutions, the one contained in the smallest set.

Example: Concentric spheres S_n = {θ ∈ R^d : ∥θ∥² < C_n} with C_n < C_{n+1}. We recover common regularization schemes for linear models, e.g. ridge regression, large margin, etc.

Question: Can we link model complexity and prediction strength directly?
2/25

Formalizing the Learning Problem
Given some ground-truth joint distribution P(x, y) over the input x ∈ R^d and the output y ∈ {−1, 1} (or R for regression), we measure the error of some classifier f : R^d → R using the risk function

    R[f] = ∫ l(f(x), y) dP(x, y)

where l(f(x), y) is some loss function. Examples:

0/1 loss (classification):   l(f(x), y) = 0 if f(x)·y > 0, and 1 otherwise
MSE (regression):            l(f(x), y) = ½ (f(x) − y)²
3/25

Formalizing the Learning Problem
We would like to minimize the risk function

    R[f] = ∫ l(f(x), y) dP(x, y)

Problem: P is unknown → we need an inductive principle.

A related quantity that can be computed from some training sample

    D = (x₁, y₁), …, (x_N, y_N)

is the empirical risk, which replaces P(x, y) by an average over the training sample:

    Remp[f] = (1/N) Σᵢ l(f(xᵢ), yᵢ)
4/25
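To make the definitions concrete, here is a minimal Python sketch (my own illustration, not part of the slides) of the empirical risk with the 0/1 loss; the toy data and the fixed classifier are made-up placeholders.

```python
import numpy as np

def zero_one_loss(f_x, y):
    """0/1 loss: 0 if the prediction has the correct sign, 1 otherwise."""
    return np.where(f_x * y > 0, 0.0, 1.0)

def empirical_risk(predict, X, y, loss=zero_one_loss):
    """Remp = (1/N) * sum_i loss(f(x_i), y_i) over the training sample."""
    return float(np.mean(loss(predict(X), y)))

# Hypothetical toy data and a fixed (not trained) linear classifier.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])                  # labels in {-1, +1}
predict = lambda X: X @ np.array([1.0, 0.3]) - 0.1    # f(x) = w^T x + b
print(empirical_risk(predict, X, y))
```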

Predicting R[f]
[Figure: estimating a parameter (the law of large numbers applies) vs. learning an ML model (?)]
Questions:
• Given a dataset of N points, how well will my model generalize?
• How quickly do R[f] and Remp[f] converge with N?
• Does it actually converge?
5/25

Predicting R[f]
Cross-validation (last week)
• Simulate the true distribution by randomly splitting the available data into training and test sets. This approach provides an estimator of the risk R[f].
Error bounds (this week)
• Bound R[f] based on model complexity. This can be achieved for specific measures of complexity, e.g. the VC dimension.
In the next few slides, we give a brief overview of error bounds, based on Bousquet et al. (2003) Introduction to Statistical Learning Theory.
6/25

Error Bounds: Hoeffding’s Inequality
Hoeffding’s inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. (Wikipedia)
Theorem (Hoeffding). Let Z₁, …, Z_N be N i.i.d. random variables with g(Z) ∈ [a, b]. Then, for all ε > 0, we have:

    Pr( |(1/N) Σᵢ g(Zᵢ) − E[g(Z)]| ≥ ε ) ≤ 2 exp( −2Nε² / (b − a)² )

When applied to the variables composing the risk, i.e. Zᵢ = (f(xᵢ), yᵢ) the prediction/label pairs, with g = l, and observing that l(·) ∈ [0, 1], we get:

    P( |Remp[f] − R[f]| > ε ) ≤ 2 exp(−2Nε²)

which further reduces to:

    P( R[f] ≤ Remp[f] + √( log(2/δ) / (2N) ) ) ≥ 1 − δ
7/25
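The step "which further reduces to" is just an inversion of the tail bound; a short sketch of that manipulation (my own working):

```latex
% Set the right-hand side of the two-sided tail bound equal to \delta and solve for \varepsilon:
2\exp(-2N\varepsilon^2) = \delta
\quad\Longleftrightarrow\quad
\varepsilon = \sqrt{\frac{\log(2/\delta)}{2N}} .
% Hence, with probability at least 1-\delta,
%   |R_\mathrm{emp}[f] - R[f]| \le \sqrt{\log(2/\delta)/(2N)},
% which in particular gives  R[f] \le R_\mathrm{emp}[f] + \sqrt{\log(2/\delta)/(2N)}.
```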

Error Bounds: Hoeffding’s Inequality
Problem: When training a machine learning model, we make the function f depend on the data. This causes the random variables Z₁, …, Z_N representing the predictions and labels to no longer be i.i.d. Therefore, Hoeffding's inequality is not applicable in that case.
Idea: Avoid the dependency on a particular function f by considering instead a supremum over all functions in F:

    R[f] − Remp[f] ≤ max_{g∈F} ( R[g] − Remp[g] )
i.e. the worst-case gap between true and empirical risk.
8/25

Error Bounds: Union Bound
We build a union bound (of the type P(A ∪ B) ≤ P(A) + P(B)) for the expression above, and then apply Hoeffding's inequality to each individual function:
    P( max_{g∈F} ( R[g] − Remp[g] ) > ε )
        = P( ∪_{g∈F} { R[g] − Remp[g] > ε } )
        ≤ Σ_{g∈F} P( R[g] − Remp[g] > ε )
        ≤ |F| exp(−2Nε²).

After some manipulations, we get:

    R[f] ≤ Remp[f] + √( (log|F| + log(1/δ)) / (2N) )

with probability at least 1 − δ.
9/25
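To get a feel for the magnitude of the complexity term, a hedged numerical illustration with made-up values |F| = 10⁶, N = 1000, δ = 0.05:

```latex
\sqrt{\frac{\log|\mathcal{F}| + \log(1/\delta)}{2N}}
= \sqrt{\frac{\log(10^{6}) + \log(20)}{2000}}
\approx \sqrt{\frac{13.8 + 3.0}{2000}}
\approx 0.09 ,
```

so even a very large finite function class only contributes a logarithmic penalty to the bound.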

Error Bounds: Vapnik-Chervonenkis
Refinement of the union bound in the previous slide.
• Spread the δs unevenly across points, so as to focus on those that are likely to be misclassified.
• Make the approach applicable to large F by measuring the size of F 'projected on the data':

    F_{x₁,…,x_N} = { (f(x₁), …, f(x_N)) : f ∈ F }

    S_F(N) = max_{x₁,…,x_N} |F_{x₁,…,x_N}|    (aka the growth function)

Applying these two steps, we can arrive at the bound

    R[f] ≤ Remp[f] + 2 √( 2 (log S_F(2N) + log(2/δ)) / N )
with probability at least 1 − δ (cf. Bousquet 2003).
10/25

Error bounds (Summary)
Hoeffding:              R[f] ≤ Remp[f] + √( log(2/δ) / (2N) )
Union Bound:            R[f] ≤ Remp[f] + √( (log|F| + log(1/δ)) / (2N) )
Vapnik-Chervonenkis:    R[f] ≤ Remp[f] + 2 √( 2 (log S_F(2N) + log(2/δ)) / N )

Notes:
• All three bounds hold with probability at least 1 − δ.
• Hoeffding's bound only holds for fixed functions (i.e. it is not applicable to a trained classifier).
• The VC bound is also applicable to large (infinite) function spaces.
11/25
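A small Python sketch (illustrative only, with made-up values for |F|, h_F and δ) comparing how the complexity terms of the three bounds shrink with N; the surrogate log S_F(2N) ≤ h_F (log(2N/h_F) + 1) used in `vc_term` is taken from the explicit VC bound on the next slide.

```python
import numpy as np

def hoeffding_term(N, delta=0.05):
    return np.sqrt(np.log(2 / delta) / (2 * N))

def union_term(N, card_F, delta=0.05):
    return np.sqrt((np.log(card_F) + np.log(1 / delta)) / (2 * N))

def vc_term(N, h, delta=0.05):
    # Surrogate for the growth function: log S_F(2N) <= h * (log(2N/h) + 1)
    log_growth = h * (np.log(2 * N / h) + 1)
    return 2 * np.sqrt(2 * (log_growth + np.log(2 / delta)) / N)

for N in [100, 1_000, 10_000, 100_000]:
    print(N, hoeffding_term(N), union_term(N, card_F=10**6), vc_term(N, h=10))
```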

Error Bounds: Vapnik-Chervonenkis
The VC dimension is the maximum number of data points that the function class can classify in every possible way (i.e. shatter), for some placement of those points:

    h_F = max{ N : S_F(N) = 2^N }
Another VC bound, where the VC dimension is stated explicitly:
    R[f] ≤ Remp[f] + 2 √( 2 (h_F (log(2N/h_F) + 1) + log(2/δ)) / N )
12/25
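To make the definition concrete, here is a small brute-force Python sketch (my own, not from the slides) that estimates the growth function of 1D threshold classifiers f_θ(x) = sign(x − θ); it suggests S_F(N) = N + 1, so only N = 1 point can be shattered and h_F = 1. The random point configurations are a heuristic stand-in for the max over x₁, …, x_N.

```python
import numpy as np

def sign(v):
    # treat 0 as +1 so labels are always in {-1, +1}
    return np.where(v >= 0, 1, -1)

def achievable_labelings(points, thetas):
    """Distinct labelings of `points` produced by f_theta(x) = sign(x - theta)."""
    return {tuple(sign(points - theta)) for theta in thetas}

rng = np.random.default_rng(0)
for N in range(1, 5):
    best = 0
    for _ in range(200):                      # random configurations as a proxy for the max
        points = np.sort(rng.uniform(-1, 1, size=N))
        # candidate thresholds: below, between, and above the points
        thetas = np.concatenate(([-2.0], (points[:-1] + points[1:]) / 2, [2.0]))
        best = max(best, len(achievable_labelings(points, thetas)))
    print(f"N={N}: S_F(N) ~ {best}, 2^N = {2**N}")
```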

Model Complexity: Sinusoid Function
• f_θ : R → {−1, 1}, f_θ(x) = sign(sin(θx))
• F = { f_θ : θ ∈ R }
• log|F| = ∞ → the union bound doesn't provide a guarantee
• h_F = ∞ → the VC bound doesn't provide a guarantee either (intuitively, a sufficiently high-frequency sinusoid can realize any labeling of suitably placed points)
13/25

Linear Classifier
• f_{w,b} : R^d → R, f_{w,b}(x) = sign(w⊤x + b)
• F = { f_{w,b} : w ∈ R^d, b ∈ R }
• log|F| = ∞ → the union bound doesn't provide a guarantee
• h_F = d + 1 → the VC bound guarantees convergence
• Convergence is especially fast when d is low.
14/25
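A quick brute-force illustration (not from the slides) of h_F = d + 1 for d = 2: three points in general position can be shattered by linear classifiers, while the four XOR points cannot. Separability is tested with a small feasibility LP; the use of scipy's linprog and the particular point sets are my assumptions.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Check via a feasibility LP whether some w, b satisfy y_i * (w.x_i + b) >= 1."""
    N, d = X.shape
    # variables: (w_1, ..., w_d, b); constraints: -y_i * (w.x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((N, 1))])
    b_ub = -np.ones(N)
    c = np.zeros(d + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """True if linear classifiers realize every labeling of the rows of X."""
    N = X.shape[0]
    return all(separable(X, np.array(labels))
               for labels in itertools.product([-1, 1], repeat=N))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])            # general position
xor   = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
print(shattered(three))  # expected: True  (so h_F >= 3 for d = 2)
print(shattered(xor))    # expected: False (this 4-point set is not shattered)
```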

Linear Classifier with Margin
• f_{w,b} : R^d → R, f_{w,b}(x) = sign(w⊤x + b)
• F = { f_{w,b} : w ∈ R^d, b ∈ R, ∀i = 1, …, N : |w⊤xᵢ + b| ≥ 1, 1/∥w∥ = M }
• h_F ≤ min{ d + 1, ⌈4R²/M²⌉ + 1 }
  (Burges (1998), A Tutorial on Support Vector Machines for Pattern Recognition)
• The VC bound guarantees convergence
• Convergence is especially fast when d is low or when the margin M is large.

[Figure: data enclosed in a ball of radius R, separated by a hyperplane with margin M]
15/25
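A hedged numerical illustration with made-up values: suppose the data live in d = 1000 dimensions, fit inside a sphere of radius R = 1, and are separated with margin M = 0.2. Then

```latex
h_F \;\le\; \min\!\left\{\, d + 1,\; \left\lceil \tfrac{4R^2}{M^2} \right\rceil + 1 \right\}
     \;=\; \min\{\, 1001,\; 100 + 1 \,\}
     \;=\; 101 ,
```

so a large margin can keep the effective capacity far below the ambient dimension d.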

From Linear to Nonlinear Models
Linear models of the type f(x) = w⊤x + b are easy to regularize (e.g. dimensionality reduction, large margin). However, the class of linear models is too restrictive (it cannot solve nonlinear problems).

Approach 1: Consider an arbitrary nonlinear function f(x) (e.g. a quadratic model f(x) = x⊤Ax + b⊤x + c), and learn the parameters of this model from the data.

Approach 2: Apply a fixed nonlinear feature map φ : R^d → R^D to the data, and learn a linear model in the image domain:

    f(x) = β⊤φ(x) + b
16/25

Linear Model in Feature Space
Procedure:
1. Consider a fixed feature map:
    φ : R^d → R^D,  x ↦ φ(x)

where d ≪ D (typically).

2. Learn β ∈ R^D and b ∈ R such that the function f(x) = β⊤φ(x) + b correctly predicts the data.
17/25
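A minimal Python sketch of the two-step procedure (my own illustration, not the slides' method): apply the fixed quadratic feature map from the next slide to toy circle data, then fit β and b by regularized least squares; the data, the regularization constant, and the least-squares fit are all assumptions.

```python
import numpy as np

def phi(X):
    """Fixed quadratic feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([x1**2, np.sqrt(2) * x1 * x2, x2**2], axis=1)

# Toy data: label +1 inside the unit circle, -1 outside (not linearly separable in R^2).
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = np.where((X**2).sum(axis=1) < 1.0, 1.0, -1.0)

# Fit a linear model beta^T phi(x) + b by regularized least squares on the +/-1 labels.
Z = np.hstack([phi(X), np.ones((len(X), 1))])            # append a constant feature for b
lam = 1e-3
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
pred = np.sign(Z @ theta)
print("training accuracy:", np.mean(pred == y))
```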

Linear Model in Feature Space
Example: All second-order monomials
φ : R² → R³
(x₁, x₂) ↦ (z₁, z₂, z₃) := (x₁², √2 x₁x₂, x₂²)
18/25

Linear Model in Feature Space
Remarks:
• Classical statistics: harder as the data become high-dimensional.
• VC-dimension viewpoint: not harder, as we can control complexity via other means (e.g. by enforcing a large margin in the feature space).
⇒ complexity matters, not dimensionality.
19/25

From Feature Spaces to Kernels
Example:
x = (x₁, x₂)
φ(x) = (x₁², √2 x₁x₂, x₂²)

The dot product in the (high-dimensional) feature space can be computed in R²:

    ⟨φ(x), φ(x′)⟩ = ⟨x, x′⟩² =: k(x, x′)
This is known as the kernel trick (cf. Boser, Guyon & Vapnik 1992).
20/25
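Writing out the identity for this particular feature map (the derivation is mine, but it is the standard check):

```latex
\langle \varphi(x), \varphi(x') \rangle
  = x_1^2 x_1'^2 + 2\, x_1 x_2\, x_1' x_2' + x_2^2 x_2'^2
  = (x_1 x_1' + x_2 x_2')^2
  = \langle x, x' \rangle^2 .
```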

The Kernel Trick in Practice
Example:
We have a dataset D = {x₁, …, x_N} and would like to compute the (squared) distance to the mean in feature space, i.e.

    f(x) = ∥ φ(x) − (1/N) Σᵢ φ(xᵢ) ∥²

Applying the identity

    ∥a − b∥² = ⟨a, a⟩ − 2⟨a, b⟩ + ⟨b, b⟩

we arrive at the formulation

    f(x) = ⟨φ(x), φ(x)⟩ − (2/N) Σᵢ ⟨φ(x), φ(xᵢ)⟩ + (1/N²) Σᵢ Σⱼ ⟨φ(xᵢ), φ(xⱼ)⟩
         =    k(x, x)   − (2/N) Σᵢ k(x, xᵢ)      + (1/N²) Σᵢ Σⱼ k(xᵢ, xⱼ)

where the function (initially defined in feature space) can be expressed solely in terms of the kernel function.
21/25
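A short Python sketch (my own) that evaluates f(x) using only kernel evaluations; the Gaussian kernel and the toy data are assumptions chosen for illustration:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.5):
    """k(a, b) = exp(-gamma * ||a - b||^2), evaluated for all pairs of rows."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def dist_to_mean_sq(x, X, kernel=gaussian_kernel):
    """Squared feature-space distance ||phi(x) - (1/N) sum_i phi(x_i)||^2, via kernels only."""
    x = np.atleast_2d(x)
    N = len(X)
    k_xx = kernel(x, x)[0, 0]                 # k(x, x)
    k_xi = kernel(x, X)[0]                    # k(x, x_i), shape (N,)
    k_ij = kernel(X, X)                       # k(x_i, x_j), shape (N, N)
    return k_xx - 2.0 / N * k_xi.sum() + k_ij.sum() / N**2

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
print(dist_to_mean_sq(np.array([0.0, 0.0]), X))   # a point near the data mean
print(dist_to_mean_sq(np.array([5.0, 5.0]), X))   # an outlier: larger distance expected
```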

From Kernels to Feature Spaces
[Mercer] If k is continuous, symmetric, and for all square-integrable functions g(x) satisfies the condition

    ∫∫ g(x) k(x, x′) g(x′) dx dx′ ≥ 0,

then the kernel can be expanded as

    k(x, x′) = Σᵢ λᵢ ψᵢ(x) ψᵢ(x′)     (sum over i = 1, …, N_F)

with λᵢ > 0 and N_F ∈ N or N_F = ∞, and we can construct the feature map

    φ(x) := ( √λ₁ ψ₁(x), √λ₂ ψ₂(x), … )

satisfying ⟨φ(x), φ(x′)⟩ = k(x, x′).
22/25

Example
Show that k : R^d × R^d → R with k(x, x′) = c + x₁x₁′ and c > 0 is a Mercer kernel.
Find a feature map φ(x) associated to this kernel.
23/25
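A sketch of one possible answer (my own working, intended as a hint rather than the official solution): the feature map φ(x) = (√c, x₁) reproduces the kernel, and the Mercer condition holds because the quadratic form splits into a sum of squares.

```latex
\langle \varphi(x), \varphi(x') \rangle = c + x_1 x_1' = k(x, x'),
\qquad
\iint g(x)\, k(x, x')\, g(x')\, dx\, dx'
  = c \left( \int g(x)\, dx \right)^{2} + \left( \int x_1\, g(x)\, dx \right)^{2} \;\ge\; 0 .
```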

Example of Kernels
Examples of commonly used kernels satisfying the Mercer property
Linear        k(x, x′) = ⟨x, x′⟩
Polynomial    k(x, x′) = (⟨x, x′⟩ + β)^γ          β ∈ R≥0, γ ∈ N
Gaussian      k(x, x′) = exp(−γ ∥x − x′∥²)        γ ∈ R>0
Student       k(x, x′) = 1 / (α + ∥x − x′∥²)      α ∈ R>0
24/25
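As a quick sanity check (my own sketch, not from the slides), one can build the Gram matrix of, say, the polynomial kernel on random data and verify numerically that it is positive semi-definite, as the Mercer property implies; the data and parameter values are arbitrary.

```python
import numpy as np

def polynomial_kernel(A, B, beta=1.0, gamma=3):
    """k(x, x') = (<x, x'> + beta)^gamma, evaluated for all pairs of rows."""
    return (A @ B.T + beta) ** gamma

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
K = polynomial_kernel(X, X)

eigvals = np.linalg.eigvalsh(K)               # K is symmetric, so eigvalsh applies
print("smallest eigenvalue:", eigvals.min())  # should be >= 0 up to numerical error
```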

Final remarks
Cross-validation vs. VC dimension
• Last week: Since it was unclear how to measure complexity, we tried to directly predict the generalization error via cross-validation, and chose parameters to minimize this error.
• This week: Model complexity can be measured reliably (via the VC dimension). This allows us to do model selection (e.g. prioritize large-margin models) without having to consume data.
Nonlinear representation and kernels
• Linearly unsolvable problems can become solvable in high-dimensional feature spaces.
• High dimensionality is not a big problem, because other techniques (e.g. large margin) can be used to control capacity.
25/25