
Machine learning lecture slides
COMS 4771 Fall 2020

Regression III: Kernels

Outline
- Dual form of ridge regression
- Examples of the kernel trick
- Kernel methods

Linear algebraic identity
←xT → y 11
1 .  n×d 1. n 􏰛LetA=√ . ∈Randb=√.∈R
n  .  n  .  ← xTn → yn
􏰛 Linear algebraic identity: for any A ∈ Rn×d and any λ > 0, (ATA + λI)−1AT = AT(AAT + λI)−1.
􏰱 􏰰􏰯 􏰲 􏰱 􏰰􏰯 􏰲
d×d n×n
􏰛 Check: multiply both sides by ATA + λI and “factor”.
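A quick numerical sanity check of the identity (a sketch with numpy; the dimensions and λ used here are arbitrary, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 20, 0.1

A = rng.standard_normal((n, d))
I_d, I_n = np.eye(d), np.eye(n)

# Left side inverts a d x d matrix; right side inverts an n x n matrix.
left = np.linalg.solve(A.T @ A + lam * I_d, A.T)           # (A^T A + λI)^{-1} A^T
right = A.T @ np.linalg.solve(A @ A.T + lam * I_n, I_n)    # A^T (A A^T + λI)^{-1}

print(np.allclose(left, right))  # expected: True
```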

Alternative (dual) form for ridge regression (1)
- Implications for ridge regression:
  $$\hat{w} = A^\top \underbrace{(A A^\top + \lambda I)^{-1} b}_{=:\ \sqrt{n}\,\hat{\alpha}} = \sqrt{n}\, A^\top \hat{\alpha} = \sum_{i=1}^n \hat{\alpha}_i x_i.$$
- The matrix $A A^\top = \frac{1}{n} K$, where $K \in \mathbb{R}^{n \times n}$ is the Gram matrix with entries $K_{i,j} = x_i^\top x_j$.
- Prediction with $\hat{w}$ on a new point $x$ (see the sketch below):
  $$x^\top \hat{w} = \sum_{i=1}^n \hat{\alpha}_i \cdot x^\top x_i.$$
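A sketch (not from the slides) checking that the primal and dual forms give the same $\hat{w}$, and that predictions only need the inner products $x^\top x_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 20, 0.1

X = rng.standard_normal((n, d))   # rows are x_1, ..., x_n
y = rng.standard_normal(n)
A, b = X / np.sqrt(n), y / np.sqrt(n)

# Primal solution: (A^T A + λI)^{-1} A^T b
w_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

# Dual solution: α̂ = (1/√n)(A A^T + λI)^{-1} b, then ŵ = Σ_i α̂_i x_i
alpha = np.linalg.solve(A @ A.T + lam * np.eye(n), b) / np.sqrt(n)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))  # expected: True

# Prediction on a new point uses only inner products x^T x_i
x_new = rng.standard_normal(d)
print(np.allclose(x_new @ w_dual, (X @ x_new) @ alpha))  # expected: True
```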

Alternative (dual) form for ridge regression (2)
- Therefore, can "represent" the predictor via the data points $x_1, \ldots, x_n$ and $\hat{\alpha}$.
- Similar to the nearest neighbor classifier, except we also have $\hat{\alpha}$.
- To get $\hat{\alpha}$: solve a linear system involving $K$ (and not $A$ directly).
- To make a prediction on $x$: iterate through the $x_i$ to compute inner products with $x$; take the appropriate weighted sum of the results.
- When is this a good idea?

Quadratic expansion
- Suppose we want to do a feature expansion to get all quadratic terms in $\phi(x)$:
  $$\phi(x) = \big(1,\ \underbrace{\sqrt{2}x_1, \ldots, \sqrt{2}x_d}_{\text{linear terms}},\ \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}},\ \underbrace{\sqrt{2}x_1x_2, \ldots, \sqrt{2}x_1x_d, \ldots, \sqrt{2}x_{d-1}x_d}_{\text{cross terms}}\big).$$
- This feature expansion has $1 + 2d + \binom{d}{2} = \Theta(d^2)$ terms.
- Explicitly computing $\phi(x)$, $\phi(x')$, and then $\phi(x)^\top \phi(x')$ would take $\Theta(d^2)$ time.
- "Kernel trick": can compute $\phi(x)^\top \phi(x')$ in $O(d)$ time (verified numerically below):
  $$\phi(x)^\top \phi(x') = (1 + x^\top x')^2.$$
- Similar trick for cubic expansion, quartic expansion, etc.
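A small check of the kernel trick, using an explicit quadratic feature map (the ordering of features in `quad_features` is an assumed convention; only the inner product matters):

```python
import numpy as np
from itertools import combinations

def quad_features(x):
    """Explicit quadratic feature expansion: Θ(d^2) entries."""
    d = len(x)
    linear = np.sqrt(2) * x
    squared = x ** 2
    cross = np.array([np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)])
    return np.concatenate(([1.0], linear, squared, cross))

rng = np.random.default_rng(0)
x, xp = rng.standard_normal(5), rng.standard_normal(5)

explicit = quad_features(x) @ quad_features(xp)   # Θ(d^2) work
kernel = (1.0 + x @ xp) ** 2                      # O(d) work

print(np.allclose(explicit, kernel))  # expected: True
```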

Gaussian kernel
- For any $\sigma > 0$, there is an infinite-dimensional feature expansion $\phi: \mathbb{R}^d \to \mathbb{R}^\infty$ such that
  $$\phi(x)^\top \phi(x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right),$$
  which can be computed in $O(d)$ time.
- Called the Gaussian kernel or Radial Basis Function (RBF) kernel (with bandwidth $\sigma$).
- Feature expansion for the $d = 1$ and $\sigma = 1$ case (see the sketch below):
  $$\phi(x) = e^{-x^2/2}\left(1,\ x,\ \frac{x^2}{\sqrt{2!}},\ \frac{x^3}{\sqrt{3!}},\ \ldots\right).$$
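A sketch of the RBF kernel and, for the $d = 1$, $\sigma = 1$ case, a truncated version of the infinite feature expansion (the truncation length is an arbitrary choice):

```python
import numpy as np
from math import factorial

def rbf_kernel(x, xp, sigma=1.0):
    """Gaussian / RBF kernel, computable in O(d) time."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def phi_truncated(x, num_terms=20):
    """First few coordinates of the infinite feature map for d = 1, sigma = 1."""
    return np.exp(-x ** 2 / 2) * np.array(
        [x ** k / np.sqrt(factorial(k)) for k in range(num_terms)]
    )

x, xp = 0.7, -0.3
print(rbf_kernel(np.array([x]), np.array([xp])))   # exact kernel value
print(phi_truncated(x) @ phi_truncated(xp))        # ≈ same value, via the truncated expansion
```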

Kernels
- A positive definite kernel $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a symmetric function satisfying the following property: for any $n$ and any $x_1, \ldots, x_n \in \mathcal{X}$, the $n \times n$ matrix whose $(i,j)$-th entry is $k(x_i, x_j)$ is positive semidefinite (illustrated below).
- Theorem: For any positive definite kernel $k$, there exists a feature map $\phi: \mathcal{X} \to \mathcal{H}$ such that $\phi(x)^\top \phi(x') = k(x, x')$ for all $x, x' \in \mathcal{X}$.
- Here, $\mathcal{H}$ is a special kind of inner product space called the Reproducing Kernel Hilbert Space (RKHS) corresponding to $k$.
- Algorithmically, we don't have to worry about what $\phi$ is. Instead, just use $k$.
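A numerical illustration of the positive semidefiniteness property, using the RBF kernel on one random sample of points (a spot-check, not a proof of the definition):

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Kernel matrix K with K[i, j] = k(x_i, x_j) for the RBF kernel."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
K = rbf_kernel_matrix(X)

# Positive semidefiniteness: all eigenvalues are (numerically) nonnegative.
print(np.linalg.eigvalsh(K).min() >= -1e-8)  # expected: True
```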

Kernel ridge regression (1)
- Training data $(x_1, y_1), \ldots, (x_n, y_n) \in \mathcal{X} \times \mathbb{R}$.
- Ridge regression with feature map $\phi$: minimize
  $$\frac{1}{n} \sum_{i=1}^n \big(\phi(x_i)^\top w - y_i\big)^2 + \lambda \|w\|_2^2.$$
- Compute the $n \times n$ kernel matrix $K$ where $K_{i,j} = k(x_i, x_j)$.
- Letting $w = \sum_{i=1}^n \alpha_i \phi(x_i)$ for $\alpha = (\alpha_1, \ldots, \alpha_n)$, the ridge regression objective is equivalent to
  $$\frac{1}{n} \|K\alpha - y\|_2^2 + \lambda\, \alpha^\top K \alpha,$$
  where $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$ (see the sketch below).
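A sketch verifying the claimed equivalence of the two objectives, using the quadratic kernel $k(x, x') = (1 + x^\top x')^2$ and its explicit feature map (the helper `phi` below is made up for illustration; the data and λ are arbitrary):

```python
import numpy as np
from itertools import combinations

def phi(x):
    """Explicit quadratic feature map matching k(x, x') = (1 + x^T x')^2."""
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(0)
n, d, lam = 20, 3, 0.1
X, y = rng.standard_normal((n, d)), rng.standard_normal(n)

Phi = np.array([phi(x) for x in X])     # explicit feature matrix
K = (1.0 + X @ X.T) ** 2                # kernel matrix K[i,j] = k(x_i, x_j) = (Φ Φ^T)[i,j]

alpha = rng.standard_normal(n)
w = Phi.T @ alpha                       # w = Σ_i α_i φ(x_i)

primal = np.mean((Phi @ w - y) ** 2) + lam * w @ w
dual = np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha
print(np.allclose(primal, dual))        # expected: True
```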

Kernel ridge regression (2)
- Minimizer with respect to $\alpha$ is the solution $\hat{\alpha}$ to the linear system of equations
  $$(K + n\lambda I)\alpha = y.$$
- Return the predictor represented by $\hat{\alpha} \in \mathbb{R}^n$ and $x_1, \ldots, x_n$.
- To make a prediction on a new $x \in \mathcal{X}$: output (see the sketch below)
  $$\sum_{i=1}^n \hat{\alpha}_i \cdot k(x, x_i).$$
- Inductive bias:
  $$|\hat{w}^\top \phi(x) - \hat{w}^\top \phi(x')| \le \|\hat{w}\|_2 \cdot \|\phi(x) - \phi(x')\|_2 = \sqrt{\hat{\alpha}^\top K \hat{\alpha}} \cdot \|\phi(x) - \phi(x')\|_2.$$
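Putting the pieces together, a minimal kernel ridge regression sketch (the RBF bandwidth, λ, and toy data below are arbitrary choices, not from the slides):

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    """Gaussian/RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

def fit_krr(X, y, lam, kernel):
    """Solve (K + n*lam*I) alpha = y for the dual coefficients."""
    n = len(X)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def predict_krr(X_train, alpha, kernel, x):
    """Prediction on a new point: sum_i alpha_i * k(x, x_i)."""
    return sum(a * kernel(x, xi) for a, xi in zip(alpha, X_train))

# Toy 1-d regression example
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)

alpha = fit_krr(X, y, lam=0.01, kernel=rbf)
print(predict_krr(X, alpha, rbf, np.array([0.5])))  # roughly sin(0.5) ≈ 0.48
```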

Kernel methods
- Many methods/algorithms can be "kernelized" into kernel methods.
- E.g., nearest neighbor, PCA, SVM, gradient descent, ...
- "Spectral regularization" with kernels: solve $g(K/n)\alpha = y/n$ for $\alpha$ (a sketch follows below).
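One possible reading of the spectral-regularization viewpoint, sketched via an eigendecomposition of $K/n$: kernel ridge regression corresponds to $g(s) = s + \lambda$, and a PCR-style filter (assumed form, not spelled out on the slide) inverts only the top eigendirections:

```python
import numpy as np

def spectral_solve(K, y, inv_g):
    """Solve g(K/n) alpha = y/n by applying 1/g to the eigenvalues of K/n."""
    n = len(y)
    s, V = np.linalg.eigh(K / n)              # K/n = V diag(s) V^T
    return V @ (inv_g(s) * (V.T @ (y / n)))

# Kernel ridge regression corresponds to g(s) = s + lambda.
ridge_inv = lambda s, lam=0.01: 1.0 / (s + lam)

def pcr_inv(s, k=5):
    """PCR-style filter (assumed form): invert only the top-k eigendirections."""
    out = np.zeros_like(s)
    top = np.argsort(s)[-k:]
    out[top] = 1.0 / s[top]
    return out

# Example with an RBF kernel matrix on random data
rng = np.random.default_rng(0)
X, y = rng.standard_normal((30, 4)), rng.standard_normal(30)
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq / 2)

alpha_ridge = spectral_solve(K, y, ridge_inv)
alpha_pcr = spectral_solve(K, y, pcr_inv)

# Sanity check: the ridge filter reproduces the kernel ridge regression solution.
alpha_direct = np.linalg.solve(K + 30 * 0.01 * np.eye(30), y)
print(np.allclose(alpha_ridge, alpha_direct))  # expected: True
```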

[Figure 1: Polynomial kernel with Kernel Ridge Regression (krr) and Kernel PCR (kpcr); plot of y vs. x.]

[Figure 2: RBF kernel with Kernel Ridge Regression (krr) and Kernel PCR (kpcr); plot of y vs. x.]

[Figure 3: RBF kernel with Kernel PCR, for lambda = 0.0001, 0.001, 0.01; plot of y vs. x.]

New kernels from old kernels
- Suppose $k_1$ and $k_2$ are positive definite kernel functions (see the numerical spot-check below).
- Is $k(x, x') = k_1(x, x') + k_2(x, x')$ a positive definite kernel function?
- Is $k(x, x') = a\, k_1(x, x')$ (for $a \ge 0$) a positive definite kernel function?
- Is $k(x, x') = k_1(x, x')\, k_2(x, x')$ a positive definite kernel function?
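A numerical spot-check of these constructions on one random sample (not a proof; a sketch only, using the linear and RBF kernels as $k_1$ and $k_2$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((25, 3))

# Two positive definite kernels on R^3: linear and RBF.
K1 = X @ X.T
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq / 2)

def is_psd(K, tol=1e-8):
    """Check (numerically) that all eigenvalues are nonnegative."""
    return np.linalg.eigvalsh(K).min() >= -tol

print(is_psd(K1 + K2))      # sum of kernels
print(is_psd(3.0 * K1))     # nonnegative scaling
print(is_psd(K1 * K2))      # entrywise (Hadamard) product
```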

Postscript
- Problem with kernel methods when $n$ is large:
  - Kernel matrix $K$ is of size $n \times n$.
  - Time for prediction is generally $\propto n$.
- Some possible solutions:
  - Nyström approximations
  - Find other ways to make $\hat{\alpha}$ sparse
  - Random Fourier features (sketched below)
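A minimal sketch of random Fourier features for the RBF kernel, one standard construction (due to Rahimi and Recht) rather than something defined on these slides; the feature dimension and bandwidth below are arbitrary:

```python
import numpy as np

def random_fourier_features(X, num_features=500, sigma=1.0, seed=0):
    """Map X (n x d) to Z (n x D) so that Z Z^T ≈ the RBF kernel matrix with bandwidth sigma."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / sigma     # columns ~ N(0, I / sigma^2)
    b = rng.uniform(0, 2 * np.pi, num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
Z = random_fourier_features(X, num_features=20000)

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq / 2)
print(np.max(np.abs(Z @ Z.T - K_exact)))  # small; shrinks like 1/sqrt(num_features)
```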