Machine learning lecture slides
COMS 4771 Fall 2020
Regression III: Kernels
Outline
I Dual form of ridge regression
I Examples of kernel trick
I Kernel methods
Linear algebraic identity
I Let $A = \frac{1}{\sqrt{n}} \begin{pmatrix} \leftarrow x_1^{\mathsf{T}} \rightarrow \\ \vdots \\ \leftarrow x_n^{\mathsf{T}} \rightarrow \end{pmatrix} \in \mathbb{R}^{n \times d}$ and $b = \frac{1}{\sqrt{n}} \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} \in \mathbb{R}^n$
I Linear algebraic identity: for any $A \in \mathbb{R}^{n \times d}$ and any $\lambda > 0$,
  $$(\underbrace{A^{\mathsf{T}} A + \lambda I}_{d \times d})^{-1} A^{\mathsf{T}} = A^{\mathsf{T}} (\underbrace{A A^{\mathsf{T}} + \lambda I}_{n \times n})^{-1}.$$
I Check: multiply both sides on the left by $A^{\mathsf{T}} A + \lambda I$ and "factor" using $(A^{\mathsf{T}} A + \lambda I)\, A^{\mathsf{T}} = A^{\mathsf{T}} (A A^{\mathsf{T}} + \lambda I)$.
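A minimal NumPy check of the identity on random data; the matrix sizes and λ below are arbitrary choices for illustration.

```python
# Sanity check of (A^T A + lam I)^{-1} A^T = A^T (A A^T + lam I)^{-1} on random data.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 5, 3, 0.1
A = rng.standard_normal((n, d))

lhs = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T)              # solve d x d system
rhs = A.T @ np.linalg.solve(A @ A.T + lam * np.eye(n), np.eye(n))  # apply n x n inverse
print(np.allclose(lhs, rhs))  # True (up to floating-point error)
```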
Alternative (dual) form for ridge regression (1)
I Implications for ridge regression:
  $$\hat{w} = A^{\mathsf{T}} \underbrace{(A A^{\mathsf{T}} + \lambda I)^{-1} b}_{=:\,\sqrt{n}\,\hat{\alpha}} = \sqrt{n}\, A^{\mathsf{T}} \hat{\alpha} = \sum_{i=1}^{n} \hat{\alpha}_i x_i.$$
I Matrix $A A^{\mathsf{T}} = \frac{1}{n} K$, where $K \in \mathbb{R}^{n \times n}$ is the Gram matrix $K_{i,j} = x_i^{\mathsf{T}} x_j$.
I Prediction with $\hat{w}$ on new point $x$:
  $$x^{\mathsf{T}} \hat{w} = \sum_{i=1}^{n} \hat{\alpha}_i \cdot x^{\mathsf{T}} x_i.$$
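A short sketch comparing the primal and dual ridge solutions on random data; the variables X, y, lam are mine, with A = X/√n and b = y/√n as defined earlier.

```python
# Primal vs. dual form of ridge regression: both recover the same weight vector.
import numpy as np

rng = np.random.default_rng(1)
n, d, lam = 20, 4, 0.5
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
A, b = X / np.sqrt(n), y / np.sqrt(n)

w_primal = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

alpha = np.linalg.solve(A @ A.T + lam * np.eye(n), b) / np.sqrt(n)  # sqrt(n)*alpha = (AA^T + lam I)^{-1} b
w_dual = X.T @ alpha                                                # w = sum_i alpha_i x_i
print(np.allclose(w_primal, w_dual))  # True
```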
Alternative (dual) form for ridge regression (2)
I Therefore, can “represent” predictor via data points x1, . . . , xn
and α̂.
I Similar to nearest neighbor classifier, except also have α̂
I To get α̂: solve linear system involving K (and not A directly)
I To make prediction on x: iterate through the xi to compute
inner products with x; take appropriate weighted sum of results
I When is this a good idea?
Quadratic expansion
I Suppose we want to do feature expansion to get all quadratic terms in $\varphi(x)$:
  $$\varphi(x) = \big(1, \underbrace{\sqrt{2}\, x_1, \ldots, \sqrt{2}\, x_d}_{\text{linear terms}}, \underbrace{x_1^2, \ldots, x_d^2}_{\text{squared terms}}, \underbrace{\sqrt{2}\, x_1 x_2, \ldots, \sqrt{2}\, x_1 x_d, \ldots, \sqrt{2}\, x_{d-1} x_d}_{\text{cross terms}}\big)$$
I This feature expansion has $1 + 2d + \binom{d}{2} = \Theta(d^2)$ terms
I Explicitly computing $\varphi(x)$, $\varphi(x')$, and then $\varphi(x)^{\mathsf{T}} \varphi(x')$ would take $\Theta(d^2)$ time.
I "Kernel trick": can compute $\varphi(x)^{\mathsf{T}} \varphi(x')$ in $O(d)$ time:
  $$\varphi(x)^{\mathsf{T}} \varphi(x') = (1 + x^{\mathsf{T}} x')^2.$$
I Similar trick for cubic expansion, quartic expansion, etc.
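A quick check of the quadratic kernel trick; the helper phi below is my own spelling-out of the feature expansion on this slide.

```python
# Verify that the explicit quadratic feature map agrees with (1 + x.x')^2.
import numpy as np
from itertools import combinations

def phi(x):
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(d), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(2)
x, xp = rng.standard_normal(6), rng.standard_normal(6)
print(np.isclose(phi(x) @ phi(xp), (1 + x @ xp) ** 2))  # True
```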
Gaussian kernel
I For any $\sigma > 0$, there is an infinite-dimensional feature expansion $\varphi : \mathbb{R}^d \to \mathbb{R}^\infty$ such that
  $$\varphi(x)^{\mathsf{T}} \varphi(x') = \exp\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right),$$
  which can be computed in $O(d)$ time.
I Called Gaussian kernel or Radial Basis Function (RBF) kernel
(with bandwidth σ).
I Feature expansion for the $d = 1$ and $\sigma = 1$ case:
  $$\varphi(x) = e^{-x^2/2} \left(1, x, \frac{x^2}{\sqrt{2!}}, \frac{x^3}{\sqrt{3!}}, \ldots\right).$$
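A sketch for the d = 1, σ = 1 case: truncating the expansion above after a finite number of terms (20 is my arbitrary choice) already matches exp(−(x − x′)²/2) closely.

```python
# Truncated Gaussian-kernel feature expansion for d = 1, sigma = 1.
import numpy as np
from math import factorial

def phi_truncated(x, num_terms=20):
    return np.exp(-x**2 / 2) * np.array([x**k / np.sqrt(factorial(k)) for k in range(num_terms)])

x, xp = 0.7, -1.2
approx = phi_truncated(x) @ phi_truncated(xp)
exact = np.exp(-(x - xp)**2 / 2)
print(approx, exact)  # nearly identical
```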
Kernels
I A positive definite kernel k : X × X → R is a symmetric
function satisfying the following property: for any n and any
x1, . . . , xn ∈ X, the n × n matrix whose (i, j)-th entry is
k(xi, xj) is positive semidefinite.
I Theorem: For any positive definite kernel k, there exists a
feature map ϕ : X → H such that ϕ(x)Tϕ(x′) = k(x, x′) for
all x, x′ ∈ X .
I Here, H is a special kind of inner product space called the
Reproducing Kernel Hilbert Space (RKHS) corresponding to k.
I Algorithmically, we don’t have to worry about what ϕ is.
Instead, just use k.
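An empirical illustration (not a proof) of the definition above: build the kernel matrix for the Gaussian kernel on a few random points and check symmetry and positive semidefiniteness. The helper names are mine.

```python
# Check the positive-definite-kernel property for the Gaussian kernel on a random sample.
import numpy as np

def rbf_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp)**2) / (2 * sigma**2))

rng = np.random.default_rng(3)
pts = rng.standard_normal((8, 2))
K = np.array([[rbf_kernel(xi, xj) for xj in pts] for xi in pts])

print(np.allclose(K, K.T))                    # symmetric
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues >= 0 (up to numerical error)
```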
Kernel ridge regression (1)
I Training data (x1, y1), . . . , (xn, yn) ∈ X × R
I Ridge regression with feature map $\varphi$: minimize
  $$\frac{1}{n} \sum_{i=1}^{n} (\varphi(x_i)^{\mathsf{T}} w - y_i)^2 + \lambda \|w\|_2^2$$
I Compute the $n \times n$ kernel matrix $K$ where $K_{i,j} = k(x_i, x_j)$.
I Letting $w = \sum_{i=1}^{n} \alpha_i \varphi(x_i)$ for $\alpha = (\alpha_1, \ldots, \alpha_n)$, ridge regression objective is equivalent to
  $$\frac{1}{n} \|K\alpha - y\|_2^2 + \lambda\, \alpha^{\mathsf{T}} K \alpha$$
  where $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$.
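A numerical confirmation of the objective rewrite, using the quadratic feature map and kernel k(x, x′) = (1 + xᵀx′)² from the earlier slide; the test setup is my own.

```python
# For w = sum_i alpha_i phi(x_i), the featurized and kernelized objectives coincide.
import numpy as np
from itertools import combinations

def phi(x):  # explicit quadratic feature map matching k(x, x') = (1 + x.x')^2
    cross = [np.sqrt(2) * x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x**2, cross))

rng = np.random.default_rng(8)
n, d, lam = 10, 3, 0.2
X, y, alpha = rng.standard_normal((n, d)), rng.standard_normal(n), rng.standard_normal(n)

Phi = np.array([phi(x) for x in X])
K = (1 + X @ X.T) ** 2
w = Phi.T @ alpha                                     # w = sum_i alpha_i phi(x_i)

featurized = np.mean((Phi @ w - y) ** 2) + lam * w @ w
kernelized = np.mean((K @ alpha - y) ** 2) + lam * alpha @ K @ alpha
print(np.isclose(featurized, kernelized))  # True
```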
Kernel ridge regression (2)
I Minimizer wrt $\alpha$ is solution $\hat{\alpha}$ to linear system of equations
  $$(K + n\lambda I)\alpha = y.$$
I Return predictor that is represented by $\hat{\alpha} \in \mathbb{R}^n$ and $x_1, \ldots, x_n$
I To make prediction on new $x \in \mathcal{X}$: output
  $$\sum_{i=1}^{n} \hat{\alpha}_i \cdot k(x, x_i).$$
I Inductive bias:
  $$|\hat{w}^{\mathsf{T}} \varphi(x) - \hat{w}^{\mathsf{T}} \varphi(x')| \le \|\hat{w}\|_2 \cdot \|\varphi(x) - \varphi(x')\|_2 = \sqrt{\hat{\alpha}^{\mathsf{T}} K \hat{\alpha}} \cdot \|\varphi(x) - \varphi(x')\|_2$$
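A minimal sketch of kernel ridge regression with the Gaussian kernel, following the two slides above: fit solves (K + nλI)α = y, and predict outputs the weighted sum of kernel evaluations. Function and variable names are mine.

```python
# Kernel ridge regression with the Gaussian (RBF) kernel.
import numpy as np

def rbf(X1, X2, sigma=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-sq / (2 * sigma**2))

def fit(X, y, lam=0.1, sigma=1.0):
    n = len(y)
    K = rbf(X, X, sigma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)   # (K + n*lam*I) alpha = y

def predict(X_train, alpha, X_new, sigma=1.0):
    return rbf(X_new, X_train, sigma) @ alpha            # sum_i alpha_i k(x, x_i)

# Toy usage on a 1-d regression problem.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit(X, y, lam=0.01)
print(predict(X, alpha, np.array([[0.0], [1.5]])))  # predictions near sin(0) and sin(1.5)
```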
Kernel methods
I Many methods / algorithms can be “kernelized” into
kernel methods
I E.g., nearest neighbor, PCA, SVM, gradient descent, . . .
I “Spectral regularization” with kernels: solve g(K/n)α = y/n
for α.
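One possible reading of the spectral regularization recipe: solve g(K/n)α = y/n in the eigenbasis of K/n. The specific filter g(t) = t + λ below recovers kernel ridge regression; other filters (e.g., truncating small eigenvalues, as in kernel PCR) fit the same template. This interpretation is mine rather than spelled out on the slide.

```python
# Spectral regularization: apply the filter g to the eigenvalues of K/n.
import numpy as np

def spectral_fit(K, y, g):
    n = len(y)
    evals, evecs = np.linalg.eigh(K / n)   # K/n = V diag(evals) V^T
    coeffs = evecs.T @ y / n               # y/n expressed in the eigenbasis
    return evecs @ (coeffs / g(evals))     # alpha solving g(K/n) alpha = y/n

# Example: with an RBF kernel matrix, the ridge filter matches (K + n*lam*I) alpha = y.
rng = np.random.default_rng(7)
X = rng.standard_normal((15, 2))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K, y, lam = np.exp(-sq / 2), rng.standard_normal(15), 0.1

alpha = spectral_fit(K, y, lambda t: t + lam)
print(np.allclose(alpha, np.linalg.solve(K + 15 * lam * np.eye(15), y)))  # True
```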
[Figure 1: Polynomial kernel with Kernel Ridge Regression and Kernel PCR (predictions y vs. x; curves labeled krr and kpcr).]
[Figure 2: RBF kernel with Kernel Ridge Regression and Kernel PCR (predictions y vs. x; curves labeled krr and kpcr).]
[Figure 3: RBF kernel with Kernel PCR (predictions y vs. x; curves for lambda = 0.0001, 0.001, 0.01).]
New kernels from old kernels
I Suppose k1 and k2 are positive definite kernel functions.
I Is k(x, x′) = k1(x, x′) + k2(x, x′) a positive definite kernel
function?
I Is k(x, x′) = a k1(x, x′) (for a ≥ 0) a positive definite kernel
function?
I Is k(x, x′) = k1(x, x′) k2(x, x′) a positive definite kernel
function?
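A numerical spot-check (not a proof) of these three constructions, using a linear kernel and a Gaussian kernel as k1 and k2 on random points; the setup is my own.

```python
# Empirically check that sum, nonnegative scaling, and elementwise product of
# kernel matrices remain positive semidefinite on a random sample.
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((10, 3))
K1 = X @ X.T                                              # linear kernel matrix
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * K1
K2 = np.exp(-sq / 2)                                      # Gaussian kernel matrix

for K in (K1 + K2, 3.0 * K1, K1 * K2):                    # sum, scaling, product
    print(np.linalg.eigvalsh(K).min() >= -1e-8)           # all print True here
```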
Postscript
I Problem with kernel methods when n is large
I Kernel matrix K is of size n × n (n² entries)
I Time for prediction generally ∝ n
I Some possible solutions:
I Nyström approximations
I Find other ways to make α̂ sparse
I Random Fourier features
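A sketch of random Fourier features for the Gaussian kernel, using the standard construction where random cosine features approximate the kernel; the dimensions and σ below are my choices.

```python
# Random Fourier features: z(x).z(x') approximates exp(-||x - x'||^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(6)
d, D, sigma = 5, 2000, 1.0
W = rng.standard_normal((D, d)) / sigma        # random frequencies ~ N(0, I / sigma^2)
b = rng.uniform(0, 2 * np.pi, D)               # random phases

def z(x):
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x, xp = rng.standard_normal(d), rng.standard_normal(d)
print(z(x) @ z(xp), np.exp(-np.sum((x - xp)**2) / (2 * sigma**2)))  # approximately equal
```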