Machine learning lecture slides
COMS 4771 Fall 2020
Regression III: Kernels
Outline
- Dual form of ridge regression
- Examples of kernel trick
- Kernel methods
Linear algebraic identity
Let
$$A = \frac{1}{\sqrt{n}} \begin{bmatrix} \leftarrow x_1^T \rightarrow \\ \vdots \\ \leftarrow x_n^T \rightarrow \end{bmatrix} \in \mathbb{R}^{n \times d} \quad \text{and} \quad b = \frac{1}{\sqrt{n}} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} \in \mathbb{R}^n.$$
Linear algebraic identity: for any $A \in \mathbb{R}^{n \times d}$ and any $\lambda > 0$,
$$(A^T A + \lambda I_{d \times d})^{-1} A^T = A^T (A A^T + \lambda I_{n \times n})^{-1}.$$
Check: multiply both sides on the left by $A^T A + \lambda I$ and "factor": $(A^T A + \lambda I) A^T = A^T (A A^T + \lambda I)$, so both sides reduce to $A^T$.
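A quick numerical sanity check of the identity (a minimal sketch assuming NumPy; the dimensions and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
A = rng.standard_normal((n, d))

lhs = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T)   # (A^T A + lam I)^{-1} A^T
rhs = A.T @ np.linalg.inv(A @ A.T + lam * np.eye(n))    # A^T (A A^T + lam I)^{-1}
print(np.allclose(lhs, rhs))  # True
```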
Alternative (dual) form for ridge regression (1)
Implications for ridge regression
$$\hat{w} = A^T \underbrace{(A A^T + \lambda I)^{-1} b}_{=:\ \sqrt{n}\,\hat{\alpha}} = \sqrt{n}\, A^T \hat{\alpha} = \sum_{i=1}^n \hat{\alpha}_i x_i.$$

Matrix $A A^T = \frac{1}{n} K$, where $K \in \mathbb{R}^{n \times n}$ is the Gram matrix with $K_{i,j} = x_i^T x_j$.

Prediction with $\hat{w}$ on a new point $x$:
$$x^T \hat{w} = \sum_{i=1}^n \hat{\alpha}_i \cdot x^T x_i.$$
Alternative (dual) form for ridge regression (2)
- Therefore, we can "represent" the predictor via the data points $x_1, \ldots, x_n$ and $\hat{\alpha}$.
- Similar to the nearest neighbor classifier, except we also have $\hat{\alpha}$.
- To get $\hat{\alpha}$: solve a linear system involving $K$ (and not $A$ directly).
- To make a prediction on $x$: iterate through the $x_i$ to compute inner products with $x$, then take the appropriate weighted sum of the results (see the sketch below).
- When is this a good idea?
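A minimal sketch of this dual computation, assuming NumPy and the identity feature map (the function names are ours):

```python
import numpy as np

def dual_ridge_fit(X, y, lam):
    """Solve for alpha_hat using the n x n Gram matrix K = X X^T (never forming X^T X)."""
    n = X.shape[0]
    K = X @ X.T                                    # K[i, j] = x_i^T x_j
    # With A = X/sqrt(n) and b = y/sqrt(n): (K/n + lam I) alpha = y/n
    return np.linalg.solve(K / n + lam * np.eye(n), y / n)

def dual_ridge_predict(X, alpha, x_new):
    """x_new^T w_hat = sum_i alpha_i * (x_new^T x_i)."""
    return (X @ x_new) @ alpha
```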
Quadratic expansion
Suppose we want to do feature expansion to get all quadratic terms in $\varphi(x)$:
$$\varphi(x) = \bigl(1,\ \underbrace{\sqrt{2} x_1, \ldots, \sqrt{2} x_d}_{\text{linear terms}},\ \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}},\ \underbrace{\sqrt{2} x_1 x_2, \ldots, \sqrt{2} x_1 x_d, \ldots, \sqrt{2} x_{d-1} x_d}_{\text{cross terms}}\bigr).$$

This feature expansion has $1 + 2d + \binom{d}{2} = \Theta(d^2)$ terms.
Explicitly computing $\varphi(x)$, $\varphi(x')$, and then $\varphi(x)^T \varphi(x')$ would take $\Theta(d^2)$ time.
"Kernel trick": can compute $\varphi(x)^T \varphi(x')$ in $O(d)$ time:
$$\varphi(x)^T \varphi(x') = (1 + x^T x')^2.$$
Similar trick for cubic expansion, quartic expansion, etc.
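A quick numerical check of the quadratic case (a sketch assuming NumPy; the helper name phi_quadratic is ours):

```python
import numpy as np
from itertools import combinations

def phi_quadratic(x):
    """Explicit quadratic feature expansion with Theta(d^2) entries."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(4), rng.standard_normal(4)
print(np.isclose(phi_quadratic(x) @ phi_quadratic(xp), (1 + x @ xp) ** 2))  # True
```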
Gaussian kernel
For any $\sigma > 0$, there is an infinite-dimensional feature expansion $\varphi \colon \mathbb{R}^d \to \mathbb{R}^\infty$ such that
$$\varphi(x)^T \varphi(x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right),$$
which can be computed in $O(d)$ time.

Called the Gaussian kernel or Radial Basis Function (RBF) kernel (with bandwidth $\sigma$).

Feature expansion for the $d = 1$ and $\sigma = 1$ case:
$$\varphi(x) = e^{-x^2/2} \left(1,\ x,\ \frac{x^2}{\sqrt{2!}},\ \frac{x^3}{\sqrt{3!}},\ \ldots\right).$$
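A small sketch of both facts, assuming NumPy: the kernel value is computed in $O(d)$ time, and for $d = 1$, $\sigma = 1$ a truncation of the infinite expansion agrees numerically (helper names are ours):

```python
import math
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian/RBF kernel value, computed in O(d) time from the squared distance."""
    diff = x - xp
    return np.exp(-(diff @ diff) / (2 * sigma**2))

def phi_truncated(x, num_terms=25):
    """First num_terms coordinates of the d = 1, sigma = 1 feature expansion."""
    ks = np.arange(num_terms)
    return np.exp(-x**2 / 2) * x**ks / np.sqrt([math.factorial(int(k)) for k in ks])

x, xp = 0.7, -0.3
print(np.isclose(phi_truncated(x) @ phi_truncated(xp),
                 rbf_kernel(np.array([x]), np.array([xp]))))  # True
```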
Kernels
A positive definite kernel $k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function satisfying the following property: for any $n$ and any $x_1, \ldots, x_n \in \mathcal{X}$, the $n \times n$ matrix whose $(i,j)$-th entry is $k(x_i, x_j)$ is positive semidefinite.
Theorem: For any positive definite kernel $k$, there exists a feature map $\varphi \colon \mathcal{X} \to \mathcal{H}$ such that $\varphi(x)^T \varphi(x') = k(x, x')$ for all $x, x' \in \mathcal{X}$.
Here, $\mathcal{H}$ is a special kind of inner product space called the Reproducing Kernel Hilbert Space (RKHS) corresponding to $k$.
Algorithmically, we don’t have to worry about what φ is. Instead, just use k.
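For intuition, a numerical sanity check (a sketch assuming NumPy): the kernel matrix of the Gaussian kernel on random points has no negative eigenvalues beyond round-off.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq_dists / 2)                        # Gaussian kernel with sigma = 1
print(np.linalg.eigvalsh(K).min() >= -1e-10)     # True (PSD up to numerical error)
```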
Kernel ridge regression (1)
Training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$.

Ridge regression with feature map $\varphi$: minimize
$$\frac{1}{n} \sum_{i=1}^n \left(\varphi(x_i)^T w - y_i\right)^2 + \lambda \|w\|_2^2.$$

Compute the $n \times n$ kernel matrix $K$ where $K_{i,j} = k(x_i, x_j)$.

Letting $w = \sum_{i=1}^n \alpha_i \varphi(x_i)$ for $\alpha = (\alpha_1, \ldots, \alpha_n)$, the ridge regression objective is equivalent to
$$\frac{1}{n} \|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha,$$
where $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$.
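A numerical check of this equivalence (a sketch assuming NumPy, using the identity feature map $\varphi(x) = x$ so that $K$ is the plain Gram matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 6, 3, 0.5
X, y, alpha = rng.standard_normal((n, d)), rng.standard_normal(n), rng.standard_normal(n)
K = X @ X.T
w = X.T @ alpha                                   # w = sum_i alpha_i x_i

primal = ((X @ w - y) ** 2).mean() + lam * (w @ w)
dual = ((K @ alpha - y) ** 2).mean() + lam * (alpha @ K @ alpha)
print(np.isclose(primal, dual))  # True
```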
Kernel ridge regression (2)
- Minimizer with respect to $\alpha$ is the solution $\hat{\alpha}$ to the linear system of equations $(K + n\lambda I)\alpha = y$.
- Return the predictor represented by $\hat{\alpha} \in \mathbb{R}^n$ and $x_1, \ldots, x_n$.
- To make a prediction on a new $x \in \mathcal{X}$: output
$$\sum_{i=1}^n \hat{\alpha}_i \cdot k(x, x_i).$$
- Inductive bias:
$$\left|\hat{w}^T \varphi(x) - \hat{w}^T \varphi(x')\right| \leq \|\hat{w}\|_2 \cdot \|\varphi(x) - \varphi(x')\|_2 = \sqrt{\hat{\alpha}^T K \hat{\alpha}} \cdot \|\varphi(x) - \varphi(x')\|_2.$$
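Putting the pieces together, a minimal sketch of kernel ridge regression with the RBF kernel (assuming NumPy; all function names are ours):

```python
import numpy as np

def rbf_gram(X, Z, sigma=1.0):
    """Matrix of kernel values k(x_i, z_j) = exp(-||x_i - z_j||^2 / (2 sigma^2))."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    """Solve (K + n lam I) alpha = y."""
    n = X.shape[0]
    K = rbf_gram(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """For each new x, output sum_i alpha_i * k(x, x_i)."""
    return rbf_gram(X_new, X_train, sigma) @ alpha

# Toy usage on noisy 1-d data:
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = krr_fit(X, y, lam=0.01)
print(krr_predict(X, alpha, np.array([[0.0], [1.5]])))
```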
Kernel methods
Many methods / algorithms can be “kernelized” into kernel methods
E.g., nearest neighbor, PCA, SVM, gradient descent, . . .
“Spectral regularization” with kernels: solve g(K/n)α = y/n for α.
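A hedged sketch of what this can look like in code, assuming NumPy and interpreting $g(K/n)$ as applying $g$ to the eigenvalues of $K/n$ (the function name and the choice of $g$ below are our illustrative assumptions; with $g(s) = s + \lambda$ this recovers kernel ridge regression):

```python
import numpy as np

def spectral_solve(K, y, g):
    """Solve g(K/n) alpha = y/n, where g acts on the eigenvalues of K/n."""
    n = K.shape[0]
    evals, evecs = np.linalg.eigh(K / n)         # K/n = V diag(evals) V^T
    gK = (evecs * g(evals)) @ evecs.T            # g(K/n) = V diag(g(evals)) V^T
    return np.linalg.solve(gK, y / n)

# Kernel ridge regression as a special case:
# alpha = spectral_solve(K, y, lambda s: s + 0.01)
```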
[Figure 1: Polynomial kernel with Kernel Ridge Regression and Kernel PCR. Legend: krr, kpcr; axes: x, y.]
[Figure 2: RBF kernel with Kernel Ridge Regression and Kernel PCR. Legend: krr, kpcr; axes: x, y.]
[Figure 3: RBF kernel with Kernel PCR. Legend: lambda=0.0001, lambda=0.001, lambda=0.01; axes: x, y.]
New kernels from old kernels
Suppose k1 and k2 are positive definite kernel functions.
- Is $k(x, x') = k_1(x, x') + k_2(x, x')$ a positive definite kernel function?
- Is $k(x, x') = a\, k_1(x, x')$ (for $a \geq 0$) a positive definite kernel function?
- Is $k(x, x') = k_1(x, x')\, k_2(x, x')$ a positive definite kernel function?
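For intuition (not a proof), a numerical sanity check of the three constructions: a sketch assuming NumPy, with a linear kernel as $k_1$ and a Gaussian kernel as $k_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K1 = X @ X.T              # linear kernel k_1
K2 = np.exp(-sq / 2)      # Gaussian kernel k_2

# Sum, nonnegative scaling, and entrywise product of the kernel matrices:
for K in (K1 + K2, 3.0 * K1, K1 * K2):
    print(np.linalg.eigvalsh(K).min() >= -1e-8)   # no eigenvalue below round-off
```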
Postscript
- Problem with kernel methods when $n$ is large:
  - Kernel matrix $K$ is of size $n^2$
  - Time for prediction generally $\propto n$
- Some possible solutions:
  - Nyström approximations
  - Find other ways to make $\hat{\alpha}$ sparse
  - Random Fourier features (see the sketch below)
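A hedged sketch of the random Fourier features idea for the Gaussian/RBF kernel (assuming NumPy; the function name and parameters are ours): random features $z(x)$ are constructed so that $z(x)^T z(x') \approx \exp(-\|x - x'\|_2^2 / (2\sigma^2))$, after which ordinary (primal) ridge regression on $z(x)$ avoids forming the $n \times n$ kernel matrix.

```python
import numpy as np

def random_fourier_features(X, num_features, sigma=1.0, seed=0):
    """Map each row x to z(x) with z(x)^T z(x') ~= exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / sigma   # frequencies ~ N(0, sigma^{-2} I)
    b = rng.uniform(0, 2 * np.pi, size=num_features)     # random phases
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)
```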