
Kernel Methods
February 22, 2022


Logistics:
Please fill out the course survey so that we can better help you
Feel free to ask questions during the lab
Today’s lecture: wrap-up of SVM
Motivation for kernel methods: our data is typically not linearly separable, but we would like to work with linear models.
Adding features (going to a high-dimensional space) allows us to use linear models for complex data.
Kernels allow us to think about similarities rather than feature engineering.

Feature Maps

The Input Space X
Our general learning theory setup: no assumptions about X
But X = Rd for the specific methods we’ve developed:
Ridge regression
Lasso regression
Support Vector Machines
Our hypothesis space for these was all affine functions on Rd :
F = {x ↦ wTx + b | w ∈ Rd, b ∈ R}.
What if we want to do prediction on inputs not natively in Rd ?

The Input Space X
Often want to use inputs not natively in Rd :
Text documents
Image files
Sound recordings
DNA sequences
But everything in a computer is a sequence of numbers
The ith entry of each sequence should have the same “meaning”
All the sequences should have the same length

Feature Extraction
Definition
Mapping an input from X to a vector in Rd is called feature extraction or featurization.

Linear Models with Explicit Feature Map
Input space: X (no assumptions)
Introduce feature map φ : X → Rd
The feature map maps into the feature space Rd.
Hypothesis space of affine functions on the feature space:
F = {x ↦ wTφ(x) + b | w ∈ Rd, b ∈ R}.

Geometric Example: Two class problem, nonlinear boundary
With the identity feature map φ(x) = (x1, x2) and linear models, we can’t separate the regions.
With an appropriate featurization φ(x) = (x1, x2, x1² + x2²), the problem becomes linearly separable.
Video: http://youtu.be/3liCbRZPrZA
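
To see this concretely, here is a small sketch (synthetic data and hand-picked weights, not from the lecture) showing that a class boundary given by a circle becomes linearly separable once the x1² + x2² feature is added:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data: class +1 inside the unit circle, class -1 outside it.
X = rng.uniform(-2, 2, size=(500, 2))
y = np.where(X[:, 0]**2 + X[:, 1]**2 < 1.0, 1, -1)

def phi(x):
    # Featurization from the slide: (x1, x2, x1^2 + x2^2).
    return np.array([x[0], x[1], x[0]**2 + x[1]**2])

# In the feature space, the affine function f(x) = w^T phi(x) + b with
# w = (0, 0, -1) and b = 1 separates the two classes perfectly.
w, b = np.array([0.0, 0.0, -1.0]), 1.0
preds = np.sign([w @ phi(x) + b for x in X])
print("training accuracy:", np.mean(preds == y))  # 1.0
```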

Expressivity of Hypothesis Space
For linear models, to grow the hypothesis spaces, we must add features.
Sometimes we say a larger hypothesis space is more expressive (it can fit more relationships between inputs and actions).
Many ways to create new features.

Handling Nonlinearity with Linear Methods

Example Task: Predicting Health
General Philosophy: Extract every feature that might be relevant
Features for medical diagnosis:
height
body temperature
blood pressure
etc.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Feature Issues for Linear Predictors
For linear predictors, it’s important how features are added
The relation between a feature and the label may not be linear
There may be complex dependence among features
Three types of nonlinearities can cause problems:
Non-monotonicity
Saturation
Interactions between features
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Non-monotonicity: The Issue
Feature Map: φ(x) = [1,temperature(x)]
Action: Predict health score y ∈ R (positive is good)
Hypothesis Space F={affine functions of temperature}
Health is not an affine function of temperature.
An affine function can either say:
very high is bad and very low is good, or
very low is bad and very high is good.
But here, both extremes are bad.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Non-monotonicity: Solution 1
Transform the input:
φ(x) = [1, (temperature(x) − 37)²], where 37 is the “normal” temperature in Celsius.
OK, but this requires manually specified domain knowledge. Do we really need that?
What does wTφ(x) look like?
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Non-monotonicity: Solution 2
Think less, put in more:
φ(x) = [1, temperature(x), temperature(x)²].
More expressive than Solution 1.
General Rule
Features should be simple building blocks that can be pieced together.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.
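
As an illustration of the general rule, here is a minimal sketch (synthetic temperatures and a made-up quadratic ground truth, purely hypothetical) comparing the affine feature map with the “think less, put in more” map φ(x) = [1, t, t²]:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: health peaks at 37 C and degrades on both sides (made-up model).
t = rng.uniform(34, 42, size=200)                 # body temperature in Celsius
health = -(t - 37)**2 + rng.normal(0, 0.3, 200)   # noisy, non-monotonic target

Phi_affine = np.column_stack([np.ones_like(t), t])         # phi(x) = [1, t]
Phi_quad   = np.column_stack([np.ones_like(t), t, t**2])   # phi(x) = [1, t, t^2]

for name, Phi in [("affine", Phi_affine), ("quadratic", Phi_quad)]:
    w, *_ = np.linalg.lstsq(Phi, health, rcond=None)       # ordinary least squares
    mse = np.mean((Phi @ w - health)**2)
    print(f"{name:9s} features -> training MSE {mse:.3f}")
# The quadratic features drive the MSE down to roughly the noise level,
# while the affine model cannot capture "both extremes are bad".
```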

Saturation: The Issue
Setting: Find products relevant to the user’s query
Input: Product x
Action: Score the relevance of x to the user’s query
Feature Map:
φ(x) = [1,N(x)], where N(x) = number of people who bought x.
We expect a monotonic relationship between N(x) and relevance, but we also expect diminishing returns.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Saturation: Solve with nonlinear transform
Smooth nonlinear transformation:
φ(x) = [1, log(1 + N(x))]
log(·) is good for values with large dynamic ranges.
Discretization (a discontinuous transformation):
φ(x) = (1(0 ≤ N(x) < 10), 1(10 ≤ N(x) < 100), ...)
Small buckets allow a quite flexible relationship.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Interactions: The Issue
Input: Patient information x
Action: Health score y ∈ R (higher is better)
Feature Map: φ(x) = [height(x), weight(x)]
Issue: It’s the weight relative to the height that’s important.
Impossible to get with these features and a linear classifier.
Need some interaction between height and weight.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Interactions: Approach 1
Google “ideal weight from height”.
J. D. Robinson’s “ideal weight” formula (for a male): weight(kg) = 52 + 1.9[height(in) − 60]
Make the score the squared deviation between height h(x) and ideal weight w(x):
f(x) = (52 + 1.9[h(x) − 60] − w(x))²
WolframAlpha for the complicated mathematics:
f(x) = 3.61h(x)² − 3.8h(x)w(x) − 235.6h(x) + w(x)² + 124w(x) + 3844
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Interactions: Approach 2
Just include all second order features:
φ(x) = (1, h(x), w(x), h(x)², w(x)², h(x)w(x)), where h(x)w(x) is the cross term.
More flexible, no Google, no WolframAlpha.
General Principle: Simpler building blocks replace a single “smart” feature.
From Percy Liang’s “Lecture 3” slides from Stanford’s CS221, Autumn 2014.

Monomial Interaction Terms
Interaction terms are useful building blocks to model non-linearities in features.
Suppose we start with x = (1, x1, ..., xd) ∈ Rd+1 = X.
Consider adding all monomials of degree M: x1^p1 ··· xd^pd, with p1 + ··· + pd = M.
Monomials with degree 2 in 2D space: x1², x2², x1x2
How many features will we end up with? (M+d−1 choose M) (“stars and bars”)
This leads to extremely large data matrices.
For d = 40 and M = 8, we get 314,457,495 features.

Big Feature Spaces
Very large feature spaces have two potential issues:
Overfitting
Memory and computational costs
Solutions:
Overfitting we handle with regularization.
Kernel methods can help with memory and computational costs when we go to high (or infinite) dimensional spaces.

The Kernel Trick

SVM with Explicit Feature Map
Let ψ : X → Rd be a feature map.
The SVM objective (with explicit feature map):
min_{w} (1/2)∥w∥² + (c/n) Σi=1..n max(0, 1 − yi wTψ(xi)).
Computation is costly if d is large (e.g. with high-degree monomials).
Last time we mentioned an equivalent optimization problem from Lagrangian duality.

SVM Dual Problem
By Lagrangian duality, it is equivalent to solve the following dual problem:
maximize over α:  Σi=1..n αi − (1/2) Σi,j=1..n αiαjyiyj ψ(xj)Tψ(xi)
subject to:  Σi=1..n αiyi = 0 and αi ∈ [0, c/n] for all i.
If α∗ is an optimal solution, then w∗ = Σi=1..n α∗i yi ψ(xi) and f̂(x) = Σi=1..n α∗i yi ψ(xi)Tψ(x).
Key observation: ψ(x) only shows up in inner products with another ψ(x′), for both training and inference.

Compute the Inner Products
Consider 2D data. Let’s introduce degree-2 monomials using ψ : R2 → R3:
(x1, x2) ↦ (x1², √2 x1x2, x2²).
The inner product is
ψ(x)Tψ(x′) = x1² x1′² + (√2 x1x2)(√2 x1′x2′) + x2² x2′²
           = (x1x1′)² + 2(x1x1′)(x2x2′) + (x2x2′)²
           = (x1x1′ + x2x2′)²
           = (xTx′)²
We can calculate the inner product ψ(x)Tψ(x′) in the original input space without accessing the features ψ(x)!

Compute the Inner Products
Now, consider monomials up to degree 2:
(x1, x2) ↦ (1, √2 x1, √2 x2, x1², √2 x1x2, x2²).
The inner product can be computed by ψ(x)Tψ(x′) = (1 + xTx′)² (check).
More generally, for feature maps producing monomials up to degree p, we have ψ(x)Tψ(x′) = (1 + xTx′)^p.
(Note that the coefficients of each monomial in ψ may not be 1.)
Kernel trick: we do not need explicit features to calculate inner products.
Using explicit features: O(d^p). Using implicit computation: O(d).
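
A quick numerical check of this identity (random 2D points; the helper names are ours, not from the slides): the explicit degree-≤2 feature map and the implicit computation (1 + xTx′)² agree.

```python
import numpy as np

def psi(x):
    # Explicit feature map for monomials up to degree 2 in R^2 (as on the slide).
    x1, x2 = x
    return np.array([1.0, np.sqrt(2)*x1, np.sqrt(2)*x2,
                     x1**2, np.sqrt(2)*x1*x2, x2**2])

def k_poly(x, xp, p=2):
    # Implicit computation: (1 + x^T x')^p, no features needed.
    return (1.0 + x @ xp) ** p

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

print(psi(x) @ psi(xp))   # explicit inner product in feature space
print(k_poly(x, xp))      # same value, computed in the input space
assert np.isclose(psi(x) @ psi(xp), k_poly(x, xp))
```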

Kernel Function

The Kernel Function
Input space: X
Feature space: H (a Hilbert space, e.g. Rd)
Feature map: ψ : X → H
The kernel function corresponding to ψ is k(x,x′) = ⟨ψ(x), ψ(x′)⟩, where ⟨·,·⟩ is the inner product associated with H.
Why introduce this new notation k(x,x′)? We can often evaluate k(x,x′) without explicitly computing ψ(x) and ψ(x′).
When can we use the kernel trick?

Some Methods Can Be “Kernelized”
Definition
A method is kernelized if every feature vector ψ(x) only appears inside an inner product with another feature vector ψ(x′). This applies to both the optimization problem and the prediction function.
The SVM Dual is a kernelization of the original SVM formulation.
Optimization:
maximize over α:  Σi=1..n αi − (1/2) Σi,j=1..n αiαjyiyj ψ(xj)Tψ(xi)
subject to:  Σi=1..n αiyi = 0 and αi ∈ [0, c/n] for all i.
Prediction: f̂(x) = Σi=1..n α∗i yi ψ(xi)Tψ(x).

The Kernel Matrix
Definition
The kernel matrix for a kernel k on x1,...,xn ∈ X is the n×n matrix K = (k(xi,xj))i,j ∈ Rn×n, with entries Kij = k(xi,xj) for i,j = 1,...,n.
In ML this is also called a Gram matrix, but traditionally (in linear algebra), Gram matrices are defined without reference to a kernel or feature map.

The Kernel Matrix
The kernel matrix summarizes all the information we need about the training inputs x1,...,xn to solve a kernelized optimization problem.
In the kernelized SVM, we can replace ψ(xi)Tψ(xj) with Kij:
maximize over α:  Σi=1..n αi − (1/2) Σi,j=1..n αiαjyiyj Kij
subject to:  Σi=1..n αiyi = 0 and αi ∈ [0, c/n] for i = 1,...,n.

Kernel Methods
Given a kernelized ML algorithm (i.e. all ψ(x)’s show up as ⟨ψ(x),ψ(x′)⟩):
We can swap out the inner product for a new kernel function.
The new kernel may correspond to a very high-dimensional feature space.
Once the kernel matrix is computed, the computational cost depends on the number of data points n, rather than the dimension of the feature space d. Useful when d ≫ n.
Computing the kernel matrix itself may still depend on d; the essence of the trick is getting around this O(d) dependence.
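
To make the bookkeeping concrete, here is a sketch (hypothetical helper names; the dual coefficients alpha below are random placeholders standing in for the output of a dual solver) of how a kernelized method touches the training inputs only through the kernel matrix:

```python
import numpy as np

def k_poly(x, xp, p=2):
    # Polynomial kernel from the previous slides: (1 + x^T x')^p.
    return (1.0 + x @ xp) ** p

def gram_matrix(kernel, X):
    # n x n kernel (Gram) matrix K with K[i, j] = k(x_i, x_j).
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 training inputs in R^3
y = rng.choice([-1.0, 1.0], size=20)  # labels
K = gram_matrix(k_poly, X)            # all the optimizer needs about the inputs

# Kernelized prediction f(x) = sum_i alpha_i y_i k(x_i, x); alpha would come
# from solving the dual problem -- random placeholder values here.
alpha = rng.uniform(0, 0.1, size=20)
def f_hat(x):
    return sum(alpha[i] * y[i] * k_poly(X[i], x) for i in range(20))

print(K.shape, f_hat(rng.normal(size=3)))
```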

Example Kernels

Kernels as Similarity Scores
It is often useful to think of k(x,x′) as a similarity score for x and x′.
We can design similarity functions without thinking about the explicit feature map, e.g.
“string kernels”, “graph kernels”.
How do we know that our kernel functions actually correspond to inner products in some feature space?

How to Get Kernels?
Explicitly construct ψ : X → Rd (e.g. monomials) and define k(x,x′) = ψ(x)Tψ(x′).
Or directly define the kernel function k(x,x′) (“similarity score”), and verify that it corresponds to ⟨ψ(x),ψ(x′)⟩ for some ψ.
There are many theorems to help us with the second approach.

Linear Algebra Review: Positive Semidefinite Matrices
Definition
A real, symmetric matrix M ∈ Rn×n is positive semidefinite (psd) if for any x ∈ Rn, xTMx ≥ 0.
The following conditions are each necessary and sufficient for a symmetric matrix M to be positive semidefinite:
M can be factorized as M = RTR, for some matrix R.
All eigenvalues of M are greater than or equal to 0.
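
A quick numerical illustration of these two conditions (a random matrix, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(3, 5))
M = R.T @ R          # M = R^T R is symmetric and positive semidefinite by construction

eigvals = np.linalg.eigvalsh(M)     # eigvalsh: eigenvalues of a symmetric matrix
print(eigvals)                      # all >= 0 (up to floating-point error)
assert np.all(eigvals >= -1e-10)

x = rng.normal(size=5)
print(x @ M @ x)                    # x^T M x >= 0 for any x
```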

Positive Definite Kernel
Definition
A symmetric function k : X × X → R is a positive definite (pd) kernel on X if for any finite set {x1,...,xn} ⊂ X (n ∈ N), the kernel matrix on this set,
K = (k(xi,xj))i,j ∈ Rn×n with entries Kij = k(xi,xj),
is a positive semidefinite matrix.
Symmetric: k(x,x′)=k(x′,x)
The kernel matrix needs to be positive semidefinite for any finite set of points.
Equivalent definition: Σi=1..n Σj=1..n αiαj k(xi,xj) ≥ 0 for any αi ∈ R, i = 1,...,n.
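
A sketch of the definition in action (random points and the polynomial kernel from earlier in the lecture; not part of the slides): the kernel matrix of a pd kernel has no negative eigenvalues.

```python
import numpy as np

def k_poly(x, xp, p=2):
    # Polynomial kernel (1 + x^T x')^p -- a pd kernel on R^d.
    return (1.0 + x @ xp) ** p

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))  # any finite set of points in R^4
K = np.array([[k_poly(xi, xj) for xj in X] for xi in X])

eigvals = np.linalg.eigvalsh(K)   # K is symmetric, so use eigvalsh
print(eigvals.min())              # >= 0 up to floating-point error
assert np.all(eigvals >= -1e-8)
```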

Mercer’s Theorem
A symmetric function k(x,x′) can be expressed as an inner product k(x,x′) = ⟨ψ(x), ψ(x′)⟩
for some ψ if and only if k(x,x′) is positive definite.
Proving a kernel function is positive definite is typically not easy. But we can construct new kernels from valid kernels.

Generating New Kernels from Old
Suppose k,k1,k2 : X×X → R are pd kernels. Then so are the following:
knew(x,x′) = αk(x,x′) for α ≥ 0 (non-negative scaling)
knew(x,x′) = k1(x,x′) + k2(x,x′) (sum)
knew(x,x′) = k1(x,x′)k2(x,x′) (product)
knew(x,x′) = k(ψ(x), ψ(x′)) for any function ψ(·) (recursion)
knew(x,x′) = f(x)f(x′) for any function f(·) (f as a 1D feature map)
Lots more theorems to help you construct new kernels from old.
Based on Mark Schmidt’s slides: https://www.cs.ubc.ca/~schmidtm/Courses/540-W19/L12.5.pdf
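
A small sketch of the closure rules above (hypothetical base kernels, not from the slides): kernels built by scaling, summing, and multiplying valid kernels still produce positive semidefinite kernel matrices.

```python
import numpy as np

k_lin  = lambda x, xp: x @ xp                          # linear kernel
k_poly = lambda x, xp: (1.0 + x @ xp) ** 2             # quadratic polynomial kernel

# New kernels built from old ones via the rules above.
k_sum   = lambda x, xp: k_lin(x, xp) + k_poly(x, xp)   # sum rule
k_prod  = lambda x, xp: k_lin(x, xp) * k_poly(x, xp)   # product rule
k_scale = lambda x, xp: 3.0 * k_poly(x, xp)            # non-negative scaling

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))

for name, k in [("sum", k_sum), ("product", k_prod), ("scaled", k_scale)]:
    K = np.array([[k(xi, xj) for xj in X] for xi in X])
    print(name, np.linalg.eigvalsh(K).min())  # minimum eigenvalue >= 0 (up to round-off)
```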

Linear Kernel
Input space: X = Rd
Feature space: H = Rd, with the standard inner product
Feature map: ψ(x) = x
Kernel: k(x,x′) = xTx′

Quadratic Kernel in Rd
Input space: X = Rd
Feature space: H = RD, where D = 2d + (d choose 2) ≈ d²/2.
Feature map:
ψ(x) = (x1,…,xd, x1²,…,xd², √2x1x2,…,√2xixj,…,√2xd−1xd)T
Then for all x, x′ ∈ Rd:
k(x,x′) = ⟨ψ(x), ψ(x′)⟩ = ⟨x,x′⟩ + ⟨x,x′⟩².
Computation of the inner product with the explicit mapping: O(d²).
Computation with the implicit kernel calculation: O(d).

Polynomial Kernel in Rd
Input space: X = Rd
Kernel function:
k(x,x′) = (1 + ⟨x,x′⟩)^M
Corresponds to a feature map with all monomials up to degree M.
For any M, computing the kernel has the same computational cost.
The cost of the explicit inner product computation grows rapidly with M.

Radial Basis Function (RBF) / Gaussian Kernel
Input space: X = Rd
k(x,x′) = exp(−∥x − x′∥² / (2σ²)),
where σ² is known as the bandwidth parameter.
Probably the most common nonlinear kernel.
Does it act like a similarity score?
Have we departed from our “inner product of feature vectors” recipe?
Yes and no: it corresponds to an infinite-dimensional feature vector.
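
A sketch of the RBF kernel as a similarity score (arbitrary points and bandwidth, chosen only for illustration): nearby points score close to 1, distant points close to 0.

```python
import numpy as np

def k_rbf(x, xp, sigma=1.0):
    # Gaussian / RBF kernel: exp(-||x - x'||^2 / (2 sigma^2)).
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))

x = np.array([0.0, 0.0])
print(k_rbf(x, np.array([0.1, 0.0])))   # ~0.995: very similar
print(k_rbf(x, np.array([1.0, 1.0])))   # ~0.37: somewhat similar
print(k_rbf(x, np.array([5.0, 5.0])))   # ~1.4e-11: essentially dissimilar
```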

Remaining Questions
Our current recipe:
Recognize a kernelized problem: ψ(x) only occurs in inner products ψ(x)Tψ(x′)
Pick a kernel function (“similarity score”)
Compute the kernel matrix (n by n, where n is the dataset size)
Optimize the model and make predictions by accessing the kernel matrix
Next: When can we apply kernelization?

Representer Theorem

SVM solution is in the “span of the data”
We found the SVM dual problem can be written as:
maximize over α:  Σi=1..n αi − (1/2) Σi,j=1..n αiαjyiyj xjTxi
subject to:  Σi=1..n αiyi = 0 and αi ∈ [0, c/n] for i = 1,...,n.
Given the dual solution α∗, the primal solution is w∗ = Σi=1..n α∗i yi xi.
Notice: w∗ is a linear combination of the training inputs x1,...,xn.
We refer to this phenomenon by saying “w∗ is in the span of the data”, or in math, w∗ ∈ span(x1,...,xn).

Ridge regression solution is in the “span of the data”
The ridge regression solution for regularization parameter λ > 0 is
w∗ = argmin_{w∈Rd} (1/n) Σi=1..n (wTxi − yi)² + λ∥w∥².
This has a closed form solution (Homework #3):
w∗ = (XTX + λI)⁻¹XTy,
where X is the design matrix, with x1,...,xn as rows.

Ridge regression solution is in the “span of the data”
Rearranging w∗ = (XTX + λI)⁻¹XTy, we can show that (also Homework #3):
w∗ = XT((1/λ)y − (1/λ)Xw∗) = XTα∗ = Σi=1..n α∗i xi.
So w∗ is in the span of the data, i.e. w∗ ∈ span(x1,...,xn).
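
A numerical sketch of this fact (synthetic data; it uses the closed form w∗ = (XTX + λI)⁻¹XTy exactly as stated above, from which α∗ = (XXT + λI)⁻¹y follows by combining α∗ = (1/λ)(y − Xw∗) with w∗ = XTα∗):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X = rng.normal(size=(n, d))              # design matrix, rows are x_1, ..., x_n
y = rng.normal(size=n)
lam = 0.5

# Primal closed form from the slide: w* = (X^T X + lam I)^{-1} X^T y.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual coefficients: alpha* = (1/lam)(y - X w*) = (X X^T + lam I)^{-1} y.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha                     # w* as a linear combination of the inputs

print(np.max(np.abs(w_primal - w_dual)))  # ~1e-15: the two agree
assert np.allclose(w_primal, w_dual)
```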

If solution is in the span of the data, we can reparameterize
The ridge regression solution for regularization parameter λ > 0 is
w∗ = argmin_{w∈Rd} (1/n) Σi=1..n (wTxi − yi)² + λ∥w∥².
We now know that w∗ ∈ span(x1,…,xn) ⊂ Rd.
So rather than minimizing over all of Rd, we can minimize over span(x1,…,xn).
w∗ = argmin_{w∈span(x1,...,xn)} (1/n) Σi=1..n (wTxi − yi)² + λ∥w∥².
Let’s reparameterize the objective by writing w as a linear combination of the inputs.

If solution is in the span of the data, we can reparameterize
Note that for any w ∈ span(x1,...,xn), we have w = XTα for some α ∈ Rn. So let’s replace w with XTα in our optimization problem:
[original]          w∗ = argmin_{w∈Rd} (1/n) Σi=1..n (wTxi − yi)² + λ∥w∥²
[reparameterized]   α∗ = argmin_{α∈Rn} (1/n) Σi=1..n ((XTα)Txi − yi)² + λ∥XTα∥².
To get w∗ from the reparameterized optimization problem, we just take w∗ =XTα∗.
We changed the dimension of our optimization variable from d to n. Is this useful?

Consider very large feature spaces
Suppose we have a 300-million dimension feature space [very large]
(e.g. using high order monomial interaction terms as features, as described last lecture)
Suppose we have a training set of 300,000 examples [fairly large]
In the original formulation, we solve a 300-million dimension optimization problem.
In the reparameterized formulation, we solve a 300,000-dimension optimization problem. This is why we care about when the solution is in the span of the data.
This reparameterization is interesting when we have more features than data (d ≫ n).

What’s next?
For SVM and ridge regression, we found that the solution is in the span of the data, derived in two rather ad-hoc ways.
Up next: The Representer Theorem, which shows that this “span of the data” result occurs far more generally, and we prove it using basic linear algebra.

Math Review: Inner Product Spaces and Hilbert Spaces

Hypothesis spaces we’ve seen so far
Finite-dimensional vector space (linear functions):
H = {f : X → R | f(x) = wTx, w, x ∈ Rd}.
To consider more complex input spaces (e.g. text, images), we use a feature map φ : X → F:
H = {f : X → R | f(x) = wTφ(x)}.
φ does not have to be linear.
The feature space F can be Rd (Euclidean space) or an infinite-dimensional vector space. We would like more structure on F.

Inner Product Space (or “Pre-Hilbert” Spaces)
An inner product space (over reals) is a vector space V with an inner product, which is a mapping
⟨·,·⟩:V×V→R that has the following properties: ∀x,y,z ∈ V and a,b ∈ R:
Symmetry: ⟨x,y⟩ = ⟨y,x⟩
Linearity: ⟨ax + by, z⟩ = a⟨x,z⟩ + b⟨y,z⟩
Positive-definiteness: ⟨x,x⟩ ≥ 0, and ⟨x,x⟩ = 0 ⟺ x = 0 (the zero vector).
To show a function ⟨·,·⟩ is an inner product, we need to check the above conditions.
Exercise: show that ⟨x,y⟩ := xTy is an inner product on Rd.

Norm from Inner Product
The inner product is nice because it gives us notions of “size”, “distance”, and “angle” in the vector space.
For an inner product space, we can define a norm as
∥x∥ = √⟨x,x⟩.
Rd with standard Euclidean inner product is an inner product space:
⟨x,y⟩:=xTy

Orthogonality (Definitions)
Definition
Two vectors are orthogonal if ⟨x,x ′⟩ = 0. We denote this by x ⊥ x ′.
Definition
x is orthogonal to a set S, i.e. x ⊥ S, if x ⊥ s for all s ∈ S.

Pythagorean Theorem
Theorem (Pythagorean Theorem)
If x ⊥ x′, then ∥x + x′∥² = ∥x∥² + ∥x′∥².
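
A one-line proof from the inner product axioms (a standard argument, sketched here): expand the squared norm and drop the cross term using orthogonality.

```latex
\[
\|x + x'\|^2
  = \langle x + x',\, x + x' \rangle
  = \langle x, x \rangle + 2\langle x, x' \rangle + \langle x', x' \rangle
  = \|x\|^2 + \|x'\|^2,
\qquad \text{since } \langle x, x' \rangle = 0 \text{ when } x \perp x'.
\]
```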
