
Spring 2022
Recitation 5: Kernels
Feb 23


Motivation
Consider applying linear regression to the data set on the left and an SVM to the data set on the right (figures omitted). What is the issue?

Regression Solution
Using features (1, x, x^2) and w = (−0.1, 0, 1) gives us
f_w(x) = −0.1 + 0·x + 1·x^2 = x^2 − 0.1. Our prediction function is quadratic, but we obtained it through standard linear methods.
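To make this concrete, here is a minimal NumPy sketch (not from the original slides) that fits the quadratic model by ordinary least squares on the expanded features; the toy data below is an assumption made purely for illustration.

import numpy as np

# Toy 1-D data (assumed for illustration): y is roughly x^2 - 0.1 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = x**2 - 0.1 + 0.05 * rng.standard_normal(50)

# Feature map (1, x, x^2): the model is linear in w even though it is quadratic in x
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares on the expanded features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # expect roughly (-0.1, 0, 1)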

SVM Solution
For the SVM we expand our feature vector from (1, x1, x2) to (1, x1, x2, x1x2, x1^2, x2^2). Using w = (−1.875, 2.5, −2.5, 0, 1, 1) gives
−1.875 + 2.5 x1 − 2.5 x2 + x1^2 + x2^2 = (x1 + 1.25)^2 + (x2 − 1.25)^2 − 5 = 0
as our decision boundary: a circle of radius √5 centered at (−1.25, 1.25).
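As a quick algebra check (an illustration, not from the slides), the following SymPy snippet verifies that the quadratic form above really is the shifted circle; SymPy is assumed to be available.

import sympy as sp

x1, x2 = sp.symbols('x1 x2')
boundary = -1.875 + 2.5*x1 - 2.5*x2 + x1**2 + x2**2
circle = (x1 + 1.25)**2 + (x2 - 1.25)**2 - 5
print(sp.simplify(boundary - circle))  # prints 0, so the two expressions agree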

Kernels
A linear model is clearly insufficient to represent these problems. The most intuitive solution is to expand the input space by adding features.
We can define a feature map φ : X → H with dim(H) > dim(X).
For ridge regression, φ(1, x) = [1, x, x^2].
For the SVM, φ(1, x1, x2) = [1, x1, x2, x1x2, x1^2, x2^2].
We then find a linear separator in the feature space H.

Adding Features
From undergraduate calculus (Taylor's theorem), we learned that polynomials can approximate any sufficiently smooth function.
So we can model essentially any problem with a linear model if we add enough polynomial features, but adding features comes with a cost.
The cost grows exponentially as we increase the degree.

Adding Features
Suppose we begin with d-dimensional inputs x = (x1, …, xd). We add all features up to degree M; more precisely, all terms of the form
x1^p1 ··· xd^pd,   where pi ≥ 0 and p1 + ··· + pd ≤ M.
How many features will we have in total?
There will be C(M+d, M) = (M+d choose M) terms in total. If M is fixed and we let d grow, this behaves like d^M.
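As a quick sanity check on this count (not from the slides), the sketch below uses Python's math.comb to print C(M+d, M) for a few arbitrary choices of d and M; the specific values are assumptions for illustration only.

from math import comb

# Number of monomials x1^p1 ... xd^pd with p1 + ... + pd <= M is C(M + d, M)
for d in (10, 100, 1000):
    for M in (2, 3, 5):
        print(f"d={d:5d}  M={M}  number of features = {comb(M + d, M):,}")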
Both M and d impact the cost of adding features.
If we stick with polynomial features up to order M, it takes O(d^M) time to compute all the features, which is exponential in the degree M.
What if we don’t want to reduce the model complexity? How do we make the computation feasible?

Representer Theorem (Baby Version)
Theorem ((Baby) Representer Theorem)
Suppose you have a loss function of the form
J(w) = L(w^T φ(x1), …, w^T φ(xn)) + R(∥w∥_2)
where xi ∈ R^d, w ∈ R^{d′}, and φ : R^d → R^{d′}.
L : R^n → R is an arbitrary function (the loss term). R : R≥0 → R is increasing (the regularization term).
Assume J has at least one minimizer. Then J has a minimizer w* of the form w* = Σ_{i=1}^n αi φ(xi) for some α ∈ R^n. If R is strictly increasing, then all minimizers have this form.

Representer Theorem: Proof
Let w* ∈ R^{d′} be a minimizer of J, and let S = span(φ(x1), …, φ(xn)).
Write w* = u + v where u ∈ S and v ∈ S⊥. (Here u is the orthogonal projection of w* onto S, and S⊥ is the subspace of all vectors orthogonal to S.)
Then (w*)^T φ(xi) = (u + v)^T φ(xi) = u^T φ(xi) + v^T φ(xi) = u^T φ(xi), since v ⊥ φ(xi). So the loss term depends on w* only through u^T φ(xi).
But ∥w*∥^2 = ∥u + v∥^2 = ∥u∥^2 + ∥v∥^2 + 2u^T v = ∥u∥^2 + ∥v∥^2 ≥ ∥u∥^2, so ∥w*∥_2 ≥ ∥u∥_2.
Since R is increasing, R(∥w*∥_2) ≥ R(∥u∥_2), and hence J(w*) ≥ J(u). Thus u, which lies in S and can be written as Σ_i αi φ(xi), is also a minimizer.

Representer Theorem
If your loss function depends on w only via its inner products with the featurized inputs, and the regularization is an increasing function of the ℓ2 norm, then we can write w* as a linear combination of the featurized training data.

The Kernel Function
Definition (Kernel)
Given a feature map φ : X → Z, the kernel function corresponding to φ is
k(x, x′) = ⟨φ(x), φ(x′)⟩,
where ⟨·, ·⟩ is an inner product on Z.
So a kernel function computes the inner product between the feature vectors φ(x) and φ(x′) of two inputs x, x′ ∈ X.
We only need the values of the kernel to find the parameters. The prediction function is:
f(x*) = Σ_{i=1}^n αi k(xi, x*)
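As an illustration (not from the slides), a minimal sketch of such a kernelized predictor; the training points, coefficients, and the linear kernel in the usage lines are made up for the example.

import numpy as np

def predict(x_star, xs, alpha, k):
    # f(x*) = sum_i alpha_i * k(x_i, x*)
    return sum(a * k(x_i, x_star) for a, x_i in zip(alpha, xs))

# Example usage with a linear kernel k(x, y) = x . y and made-up data
xs = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]
alpha = [0.5, -0.25]
linear_kernel = lambda x, y: float(np.dot(x, y))
print(predict(np.array([1.0, 1.0]), xs, alpha, linear_kernel))  # 0.5*1 - 0.25*2 = 0.0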

Efficiency of Kernel
Consider the polynomial kernel k(x, y) = ⟨φ(x), φ(y)⟩ = (1 + x^T y)^M where x, y ∈ R^d. For example, if M = 2 we have
(1 + x^T y)^2 = 1 + 2 x^T y + (x^T y)(x^T y)
             = 1 + 2 Σ_{i=1}^d xi yi + Σ_{i,j=1}^d xi yi xj yj.
Option 1: First explicitly evaluate φ(x) and φ(y), and then compute ⟨φ(x), φ(y)⟩.
φ(x) = (1, √2 x1, …, √2 xd, x1^2, …, xd^2, √2 x1x2, √2 x1x3, …, √2 x_{d−1}x_d)
It takes O(d^M) time to evaluate φ(x) and φ(y).
It takes another O(d^M) time to compute the inner product, so the total time complexity is O(d^M).

Efficiency of Kernel
Consider the polynomial kernel k(x, y) = ⟨φ(x), φ(y)⟩ = (1 + x^T y)^M where x, y ∈ R^d. This computes the inner product of all monomials up to degree M in time O(d). For example, if M = 2 we have
(1 + x^T y)^2 = 1 + 2 x^T y + (x^T y)(x^T y)
             = 1 + 2 Σ_{i=1}^d xi yi + Σ_{i,j=1}^d xi yi xj yj.
Option 2: First calculate 1 + x^T y, then calculate (1 + x^T y)^M.
It takes O(d) time to evaluate 1 + x^T y.
It takes O(1) additional time to calculate (1 + x^T y)^M, so the total time complexity is O(d).
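A small NumPy comparison of the two options for M = 2 (not from the slides): phi2 below is the standard explicit feature map for the degree-2 polynomial kernel, and the random vectors are arbitrary.

import numpy as np

def phi2(x):
    # Explicit feature map for (1 + x.y)^2:
    # (1, sqrt(2)*x_i, x_i^2, sqrt(2)*x_i*x_j for i < j) -- O(d^2) features
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate([[1.0], np.sqrt(2) * x, x**2, cross])

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

explicit = phi2(x) @ phi2(y)        # Option 1: build features, O(d^2) work
kernel_trick = (1.0 + x @ y) ** 2   # Option 2: kernel trick, O(d) work
print(np.isclose(explicit, kernel_trick))  # True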

Recap on what we achieved
We started with a low-dimensional model (due to the limited input data size); the number of parameters is d.
We want to increase the model capacity by adding features xi → φ(xi), but the cost grows too quickly as we increase the degree; the number of parameters becomes d′ with d′ >> d.
We realize the optimal parameter is a linear combination of the φ(xi) (Representer Theorem); the number of parameters becomes N, with d′ >> N > d.
We realize we only need the inner product of two feature vectors, k(·,·), so we never need to compute φ(·) itself; this greatly reduces the computational cost.
The rephrased problem is still a linear problem, but the solution has high-dimensional expressive power!

Mercer’s Theorem
Not all functions f(x, y) are valid kernels. Why? A kernel must satisfy k(x, y) = ⟨φ(x), φ(y)⟩ for some feature map φ.
How can we tell whether a given k(x, y) is a valid kernel or not?
Theorem (Mercer’s Theorem)
Fix a kernel k : X × X → R. There is a Hilbert space H and a feature map φ : X → H such that k(x, y) = ⟨φ(x), φ(y)⟩_H if and only if, for any x1, …, xn ∈ X, the associated Gram matrix K is positive semi-definite:

K = [ k(x1, x1)  ···  k(x1, xn) ]
    [     ⋮       ⋱       ⋮     ]
    [ k(xn, x1)  ···  k(xn, xn) ]
Such a kernel k is called positive semi-definite.
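As an illustration (not part of the slides), a small NumPy sketch that builds the Gram matrix for an RBF kernel on made-up points and checks that its eigenvalues are numerically non-negative; the data, sigma, and tolerance are arbitrary.

import numpy as np

def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))  # 20 made-up points in R^3

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[rbf(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh applies
print(eigvals.min() >= -1e-10)    # True: K is PSD up to round-off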

Positive Semi-Definite
Definition (Positive Semi-Definite)
A matrix A ∈ R^{n×n} is positive semi-definite if it is symmetric and x^T A x ≥ 0 for all x ∈ R^n.
Equivalent to saying the matrix is symmetric with non-negative eigenvalues.

Valid Kernels
A function k(x, y) is a valid kernel if it is itself an inner product on X, i.e. it satisfies the properties of an inner product:
Symmetry: k(x, y) = k(y, x).
Positive-definiteness: k(x, x) ≥ 0, with equality only when x = 0.
Bilinearity: k(ax, by) = ab·k(x, y).
Alternatively (and more generally, by Mercer's theorem), k is a valid kernel if and only if every Gram matrix K it produces is positive semi-definite.

Kernel Examples
Dot product: k(xi, xj) = xi^T xj
M-th degree polynomial kernel: k(xi, xj) = (1 + xi^T xj)^M
RBF kernel: k(xi, xj) = exp(−∥xi − xj∥^2 / (2σ^2))
Sigmoid kernel: k(xi, xj) = tanh(α xi^T xj + c)
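A minimal sketch (not from the slides) implementing the four kernels above with NumPy; the parameter defaults and the vectors in the usage lines are arbitrary choices.

import numpy as np

def dot_kernel(x, y):
    return x @ y

def poly_kernel(x, y, M=3):
    return (1.0 + x @ y) ** M

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return np.tanh(alpha * (x @ y) + c)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (dot_kernel, poly_kernel, rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x, y))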

Going to infinite dimension
What is the polynomial expression of φ(·) for the RBF and sigmoid kernels?
There is no finite expression; they correspond to infinite sums of polynomial features. For the 1-D RBF kernel,
φ(x) = e^(−x^2/(2σ^2)) [ 1, √(1/(1!·σ^2)) x, √(1/(2!·σ^4)) x^2, √(1/(3!·σ^6)) x^3, … ]
This implies we have essentially modeled the problem using an infinite-degree polynomial!
At this point, the factor limiting our model capacity is the amount of training data.
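As a numerical sanity check of this claim (not from the slides), the sketch below truncates the expansion after N terms and compares the resulting inner product to the closed-form RBF kernel value; the inputs, sigma, and N are arbitrary.

import numpy as np
from math import factorial

def rbf(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def phi(x, sigma=1.0, N=30):
    # First N coordinates of the infinite feature map for the 1-D RBF kernel
    return np.exp(-x ** 2 / (2 * sigma ** 2)) * np.array(
        [x ** n / np.sqrt(factorial(n) * sigma ** (2 * n)) for n in range(N)]
    )

x, y = 0.7, -0.3
print(rbf(x, y))        # exact kernel value
print(phi(x) @ phi(y))  # truncated infinite expansion, converges to the same value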

What are Kernels doing
f(x) = sin(x) e^(cos(x))  [figure omitted]
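To make this concrete (an illustration, not from the slides), here is a sketch using scikit-learn's KernelRidge with an RBF kernel to fit noisy samples of this function; the sample size, noise level, and hyperparameters are arbitrary assumptions.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
x = rng.uniform(-4.0, 4.0, size=200)
y = np.sin(x) * np.exp(np.cos(x)) + 0.1 * rng.standard_normal(200)

model = KernelRidge(alpha=0.1, kernel="rbf", gamma=1.0)
model.fit(x.reshape(-1, 1), y)

x_test = np.linspace(-4.0, 4.0, 5)
print(model.predict(x_test.reshape(-1, 1)))     # fitted values
print(np.sin(x_test) * np.exp(np.cos(x_test)))  # true values for comparison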

Representer Theorem: Ridge Regression
By adding features to ridge regression we had
J(w̃) = (1/n) Σ_{i=1}^n (w̃^T φ(xi) − yi)^2 + λ∥w̃∥^2
      = (1/n) ∥Φw̃ − y∥_2^2 + λ w̃^T w̃,
where Φ ∈ R^{n×d′} is the matrix with φ(xi)^T as its i-th row. The Representer Theorem applies, giving w̃ = Σ_{j=1}^n αj φ(xj) = Φ^T α. Plugging in gives
J(α) = (1/n) ∥ΦΦ^T α − y∥^2 + λ α^T ΦΦ^T α.
Define K = ΦΦ^T, so that J(α) = (1/n) ∥Kα − y∥^2 + λ α^T K α.
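Setting the gradient of J(α) to zero gives the standard closed-form solution α = (K + nλI)^{-1} y (this last step is not spelled out on the slide, so treat it as a supplementary note). A minimal NumPy sketch with an RBF kernel and made-up data:

import numpy as np

def rbf_gram(A, B, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of A and rows of B
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)

n, lam = len(X), 0.1
K = rbf_gram(X, X)
alpha = np.linalg.solve(K + n * lam * np.eye(n), y)  # alpha = (K + n*lambda*I)^{-1} y

X_new = np.array([[0.5], [1.5]])
print(rbf_gram(X_new, X) @ alpha)  # predictions f(x*) = sum_i alpha_i k(x_i, x*)
print(np.sin(X_new[:, 0]))         # true values for comparison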

Representer Theorem: Primal SVM
For a general linear model, the same derivation as above shows that
J(w) = L(Φw) + R(∥w∥_2)    becomes    J(α) = L(Kα) + R(√(α^T K α)).
Here φ(xi)^T w became (Kα)_i.
The primal SVM has loss function
J(w) = (c/n) Σ_{i=1}^n (1 − yi φ(xi)^T w)_+ + ∥w∥^2.
This is kernelized to
J(α) = (c/n) Σ_{i=1}^n (1 − yi (Kα)_i)_+ + α^T K α.
A positive decision is made if (w*)^T φ(x) = Σ_{i=1}^n αi k(xi, x) > 0.
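As an illustration of using a kernelized SVM in practice (not from the slides), a sketch with scikit-learn's SVC and a precomputed RBF Gram matrix; the data-generating rule and hyperparameters are made up.

import numpy as np
from sklearn.svm import SVC

def rbf_gram(A, B, sigma=1.0):
    sq = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # circular decision boundary

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(rbf_gram(X, X), y)                 # pass the n x n training Gram matrix

X_test = rng.standard_normal((50, 2))
y_true = np.where(X_test[:, 0] ** 2 + X_test[:, 1] ** 2 > 1.0, 1, -1)
y_pred = clf.predict(rbf_gram(X_test, X))  # kernel between test and training points
print(np.mean(y_pred == y_true))           # test accuracy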

The dual SVM problem (with features) is given by
maximize_α   Σ_{i=1}^n αi − (1/2) Σ_{i,j=1}^n αi αj yi yj φ(xi)^T φ(xj)
subject to   Σ_{i=1}^n αi yi = 0,
             αi ∈ [0, c/n]  for i = 1, …, n.
We can immediately kernelize (no representer theorem needed) by replacing φ(xi)^T φ(xj) with k(xi, xj).
Recall that we were able to derive the conclusion of the representer theorem using strong duality for SVMs.

It is much easier to compute the kernel k(x, y) directly than to form φ(x) and φ(y) explicitly and take their inner product.
The kernel k(x, y), to some extent, represents a similarity score between two data points.
The prediction function essentially assigns a value to a new point based on the values of the training points near it.
We are almost guaranteed to overfit the training data (we have N data points and N parameters), so regularization is very important.

Some Math that was skipped
Pre-Hilbert Spaces and Orthogonality

Inner Product Space (or “Pre-Hilbert” Spaces)
An inner product space (over the reals) is a vector space V with an inner product, which is a mapping
⟨·, ·⟩ : V × V → R
with the following properties for all x, y, z ∈ V and a, b ∈ R:
Symmetry: ⟨x, y⟩ = ⟨y, x⟩
Linearity: ⟨ax + by, z⟩ = a⟨x, z⟩ + b⟨y, z⟩
Positive-definiteness: ⟨x, x⟩ ≥ 0, and ⟨x, x⟩ = 0 ⟺ x = 0.
To show a function ⟨·, ·⟩ is an inner product, we need to check the above conditions.

A pre-Hilbert space is a vector space equipped with an inner product. For a Hilbert space we need an additional technical condition: completeness.
A space is complete if every Cauchy sequence in the space converges to a point in the space.
Definition
A Hilbert space is a complete inner product space.
Any finite-dimensional inner product space is a Hilbert space.

Orthogonality (Definitions)
Definition
Two vectors x and x′ are orthogonal if ⟨x, x′⟩ = 0. We denote this by x ⊥ x′.
Definition
x is orthogonal to a set S, i.e. x ⊥ S, if x ⊥ s for all s ∈ S.

Pythagorean Theorem
Theorem (Pythagorean Theorem)
If x ⊥ x′, then ∥x + x′∥^2 = ∥x∥^2 + ∥x′∥^2.
Proof:
∥x + x′∥^2 = ⟨x + x′, x + x′⟩
           = ⟨x, x⟩ + ⟨x, x′⟩ + ⟨x′, x⟩ + ⟨x′, x′⟩
           = ∥x∥^2 + ∥x′∥^2,
since ⟨x, x′⟩ = ⟨x′, x⟩ = 0.
