Machine learning lecture slides
COMS 4771 Fall 2020
Regression III: Kernels
Outline
I Dual form of ridge regression
I Examples of kernel trick
I Kernel methods
Linear algebraic identity
I Let $A = \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^{\mathsf{T}} \rightarrow \\ \vdots \\ \leftarrow x_n^{\mathsf{T}} \rightarrow \end{pmatrix} \in \mathbb{R}^{n \times d}$ and $b = \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n$
I Linear algebraic identity: for any $A \in \mathbb{R}^{n \times d}$ and any $\lambda > 0$,
  $$(\underbrace{A^{\mathsf{T}} A + \lambda I}_{d \times d})^{-1} A^{\mathsf{T}} = A^{\mathsf{T}} (\underbrace{A A^{\mathsf{T}} + \lambda I}_{n \times n})^{-1}.$$
I Check: multiply both sides on the left by $A^{\mathsf{T}} A + \lambda I$ and "factor" using $(A^{\mathsf{T}} A + \lambda I)\, A^{\mathsf{T}} = A^{\mathsf{T}} (A A^{\mathsf{T}} + \lambda I)$.
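A minimal NumPy check of the identity on random data; the matrix sizes and λ below are arbitrary choices for illustration.

```python
# Sanity check of (A^T A + lam I)^{-1} A^T = A^T (A A^T + lam I)^{-1} on random data.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
A = rng.standard_normal((n, d))

lhs = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T)              # solve d x d system
rhs = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), np.eye(n))  # apply n x n inverse
print(np.allclose(lhs, rhs))  # True (up to floating-point error)
```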
Alternative (dual) form for ridge regression (1)
I Implications for ridge regression:
  $$\hat{w} = A^{\mathsf{T}} \underbrace{(A A^{\mathsf{T}} + \lambda I)^{-1} b}_{=:\,\sqrt{n}\,\hat{\alpha}} = \sqrt{n}\, A^{\mathsf{T}} \hat{\alpha} = \sum_{i=1}^{n} \hat{\alpha}_i x_i.$$
I Matrix $A A^{\mathsf{T}} = \frac{1}{n} K$, where $K \in \mathbb{R}^{n \times n}$ is the Gram matrix $K_{i,j} = x_i^{\mathsf{T}} x_j$.
I Prediction with $\hat{w}$ on new point $x$:
  $$x^{\mathsf{T}} \hat{w} = \sum_{i=1}^{n} \hat{\alpha}_i \cdot x^{\mathsf{T}} x_i.$$
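A short sketch comparing the primal and dual ridge solutions on random data; the variables X, y, lam are mine, with A = X/√n and b = y/√n as defined earlier.

```python
# Primal vs. dual form of ridge regression: both recover the same weight vector.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 20, 4, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
A, b = X / np.sqrt(n), y / np.sqrt(n)

w_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

alpha = np.linalg.solve(A @ A.T + lam * np.eye(n), b) / np.sqrt(n)  # sqrt(n)*alpha = (AA^T + lam I)^{-1} b
w_dual = X.T @ alpha                                                # w = sum_i alpha_i x_i
print(np.allclose(w_primal, w_dual))  # True
```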
Alternative (dual) form for ridge regression (2)
I Therefore, can “represent” predictor via data points x1, . . . , xn
and α̂.
I Similar to nearest neighbor classifier, except also have α̂
I To get α̂: solve linear system involving K (and not A directly)
I To make prediction on x: iterate through the xi to compute
inner products with x; take appropriate weighted sum of results
I When is this a good idea?
Quadratic expansion
I Suppose we want to do feature expansion to get all quadratic terms in $\varphi(x)$:
  $$\varphi(x) = \big(1, \underbrace{\sqrt{2}\, x_1, \ldots, \sqrt{2}\, x_d}_{\text{linear terms}}, \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}}, \underbrace{\sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_1 x_d, \ldots, \sqrt{2}\, x_{d-1} x_d}_{\text{cross terms}}\big)$$
I This feature expansion has $1 + 2d + \binom{d}{2} = \Theta(d^2)$ terms
I Explicitly computing $\varphi(x)$, $\varphi(x')$, and then $\varphi(x)^{\mathsf{T}} \varphi(x')$ would take $\Theta(d^2)$ time.
I "Kernel trick": can compute $\varphi(x)^{\mathsf{T}} \varphi(x')$ in $O(d)$ time:
  $$\varphi(x)^{\mathsf{T}} \varphi(x') = (1 + x^{\mathsf{T}} x')^2.$$
I Similar trick for cubic expansion, quartic expansion, etc.
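A quick check of the quadratic kernel trick; the helper phi below is my own spelling-out of the feature expansion on this slide.

```python
# Verify that the explicit quadratic feature map agrees with (1 + x.x')^2.
import numpy as np
from itertools import combinations

def phi(x):
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(6), rng.standard_normal(6)
print(np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2))  # True
```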
Gaussian kernel
I For any $\sigma > 0$, there is an infinite-dimensional feature expansion $\varphi : \mathbb{R}^d \to \mathbb{R}^\infty$ such that
  $$\varphi(x)^{\mathsf{T}} \varphi(x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right),$$
  which can be computed in $O(d)$ time.
I Called Gaussian kernel or Radial Basis Function (RBF) kernel
(with bandwidth σ).
I Feature expansion for the $d = 1$ and $\sigma = 1$ case:
  $$\varphi(x) = e^{-x^2/2} \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \ldots\right).$$
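A sketch for the d = 1, σ = 1 case: truncating the expansion above after a finite number of terms (20 is my arbitrary choice) already matches exp(−(x − x′)²/2) closely.

```python
# Truncated Gaussian-kernel feature expansion for d = 1, sigma = 1.
import numpy as np
from math import factorial

def phi_truncated(x, num_terms=20):
    return np.exp(-x**2 / 2) * np.array([x**k / np.sqrt(factorial(k)) for k in range(num_terms)])

x, xp = 0.7, -1.2
approx = phi_truncated(x) @ phi_truncated(xp)
exact = np.exp(-(x - xp)**2 / 2)
print(approx, exact)  # nearly identical
```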
Kernels
I A positive definite kernel k : X × X → R is a symmetric
function satisfying the following property: for any n and any
x1, . . . , xn ∈ X, the n × n matrix whose (i, j)-th entry is
k(xi, xj) is positive semidefinite.
I Theorem: For any positive definite kernel k, there exists a
feature map ϕ : X → H such that ϕ(x)Tϕ(x′) = k(x, x′) for
all x, x′ ∈ X .
I Here, H is a special kind of inner product space called the
Reproducing Kernel Hilbert Space (RKHS) corresponding to k.
I Algorithmically, we don’t have to worry about what ϕ is.
Instead, just use k.
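An empirical illustration (not a proof) of the definition above: build the kernel matrix for the Gaussian kernel on a few random points and check symmetry and positive semidefiniteness. The helper names are mine.

```python
# Check the positive-definite-kernel property for the Gaussian kernel on a random sample.
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))

rng = np.random.default_rng(3)
pts = rng.standard_normal((8, 2))
K = np.array([[rbf_kernel(xi, xj) for xj in pts] for xi in pts])

print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues >= 0 (up to numerical error)
```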
Kernel ridge regression (1)
I Training data (x1, y1), . . . , (xn, yn) ∈ X × R
I Ridge regression with feature map $\varphi$: minimize
  $$\frac{1}{n} \sum_{i=1}^{n} (\varphi(x_i)^{\mathsf{T}} w - y_i)^2 + \lambda \|w\|_2^2$$
I Compute the $n \times n$ kernel matrix $K$ where $K_{i,j} = k(x_i, x_j)$.
I Letting $w = \sum_{i=1}^{n} \alpha_i \varphi(x_i)$ for $\alpha = (\alpha_1, \ldots, \alpha_n)$, ridge regression objective is equivalent to
  $$\frac{1}{n} \|K\alpha - y\|_2^2 + \lambda\, \alpha^{\mathsf{T}} K \alpha$$
  where $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$.
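A numerical confirmation of the objective rewrite, using the quadratic feature map and kernel k(x, x′) = (1 + xᵀx′)² from the earlier slide; the test setup is my own.

```python
# For w = sum_i alpha_i phi(x_i), the featurized and kernelized objectives coincide.
import numpy as np
from itertools import combinations

def phi(x):  # explicit quadratic feature map matching k(x, x') = (1 + x.x')^2
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(8)
n, d, lam = 10, 3, 0.2
X, y, alpha = rng.standard_normal((n, d)), rng.standard_normal(n), rng.standard_normal(n)

Phi = np.array([phi(x) for x in X])
K = (1 + X @ X.T) ** 2
w = Phi.T @ alpha                                     # w = sum_i alpha_i phi(x_i)

featurized = np.mean((Phi @ w - y) ** 2) + lam * w @ w
kernelized = np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha
print(np.isclose(featurized, kernelized))  # True
```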
Kernel ridge regression (2)
I Minimizer wrt $\alpha$ is solution $\hat{\alpha}$ to linear system of equations
  $$(K + n\lambda I)\alpha = y.$$
I Return predictor that is represented by $\hat{\alpha} \in \mathbb{R}^n$ and $x_1, \ldots, x_n$
I To make prediction on new $x \in \mathcal{X}$: output
  $$\sum_{i=1}^{n} \hat{\alpha}_i \cdot k(x, x_i).$$
I Inductive bias:
  $$|\hat{w}^{\mathsf{T}} \varphi(x) - \hat{w}^{\mathsf{T}} \varphi(x')| \le \|\hat{w}\|_2 \cdot \|\varphi(x) - \varphi(x')\|_2 = \sqrt{\hat{\alpha}^{\mathsf{T}} K \hat{\alpha}} \cdot \|\varphi(x) - \varphi(x')\|_2$$
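A minimal sketch of kernel ridge regression with the Gaussian kernel, following the two slides above: fit solves (K + nλI)α = y, and predict outputs the weighted sum of kernel evaluations. Function and variable names are mine.

```python
# Kernel ridge regression with the Gaussian (RBF) kernel.
import numpy as np

def rbf(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * sigma**2))

def fit(X, y, lam=0.1, sigma=1.0):
    n = len(y)
    K = rbf(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)   # (K + n*lam*I) alpha = y

def predict(X_train, alpha, X_new, sigma=1.0):
    return rbf(X_new, X_train, sigma) @ alpha            # sum_i alpha_i k(x, x_i)

# Toy usage on a 1-d regression problem.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit(X, y, lam=0.01)
print(predict(X, alpha, np.array([[0.0], [1.5]])))  # predictions near sin(0) and sin(1.5)
```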
Kernel methods
I Many methods / algorithms can be “kernelized” into
kernel methods
I E.g., nearest neighbor, PCA, SVM, gradient descent, . . .
I “Spectral regularization” with kernels: solve g(K/n)α = y/n
for α.
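One possible reading of the spectral regularization recipe: solve g(K/n)α = y/n in the eigenbasis of K/n. The specific filter g(t) = t + λ below recovers kernel ridge regression; other filters (e.g., truncating small eigenvalues, as in kernel PCR) fit the same template. This interpretation is mine rather than spelled out on the slide.

```python
# Spectral regularization: apply the filter g to the eigenvalues of K/n.
import numpy as np

def spectral_fit(K, y, g):
    n = len(y)
    evals, evecs = np.linalg.eigh(K / n)   # K/n = V diag(evals) V^T
    coeffs = evecs.T @ y / n               # y/n expressed in the eigenbasis
    return evecs @ (coeffs / g(evals))     # alpha solving g(K/n) alpha = y/n

# Example: with an RBF kernel matrix, the ridge filter matches (K + n*lam*I) alpha = y.
rng = np.random.default_rng(7)
X = rng.standard_normal((15, 2))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K, y, lam = np.exp(-sq / 2), rng.standard_normal(15), 0.1

alpha = spectral_fit(K, y, lambda t: t + lam)
print(np.allclose(alpha, np.linalg.solve(K + 15 * lam * np.eye(15), y)))  # True
```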
[Figure 1: Polynomial kernel with Kernel Ridge Regression and Kernel PCR (predictions y vs. x; curves labeled krr and kpcr).]
[Figure 2: RBF kernel with Kernel Ridge Regression and Kernel PCR (predictions y vs. x; curves labeled krr and kpcr).]
[Figure 3: RBF kernel with Kernel PCR (predictions y vs. x; curves for lambda = 0.0001, 0.001, 0.01).]
New kernels from old kernels
I Suppose k1 and k2 are positive definite kernel functions.
I Is k(x, x′) = k1(x, x′) + k2(x, x′) a positive definite kernel
function?
I Is k(x, x′) = a k1(x, x′) (for a ≥ 0) a positive definite kernel
function?
I Is k(x, x′) = k1(x, x′) k2(x, x′) a positive definite kernel
function?
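A numerical spot-check (not a proof) of these three constructions, using a linear kernel and a Gaussian kernel as k1 and k2 on random points; the setup is my own.

```python
# Empirically check that sum, nonnegative scaling, and elementwise product of
# kernel matrices remain positive semidefinite on a random sample.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 3))
K1 = X @ X.T                                              # linear kernel matrix
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * K1
K2 = np.exp(-sq / 2)                                      # Gaussian kernel matrix

for K in (K1 + K2, 3.0 * K1, K1 * K2):                    # sum, scaling, product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)           # all print True here
```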
Postscript
I Problem with kernel methods when n is large
I Kernel matrix K is of size n × n (n² entries)
I Time for prediction generally ∝ n
I Some possible solutions:
I Nyström approximations
I Find other ways to make α̂ sparse
I Random Fourier features
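A sketch of random Fourier features for the Gaussian kernel, using the standard construction where random cosine features approximate the kernel; the dimensions and σ below are my choices.

```python
# Random Fourier features: z(x).z(x') approximates exp(-||x - x'||^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(6)
d, D, sigma = 5, 2000, 1.0
W = rng.standard_normal((D, d)) / sigma        # random frequencies ~ N(0, I / sigma^2)
b = rng.uniform(0, 2 * np.pi, D)               # random phases

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(z(x) @ z(xp), np.exp(-np.sum((x - xp)**2) / (2 * sigma**2)))  # approximately equal
```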