
Kernel Methods in Machine Learning
• Perceptron. Geometric Margins.
• Support Vector Machines (SVMs).
Maria-Florina Balcan 03/23/2015

Quick Recap about Perceptron and Margins

The Online Learning Model
• Examples arrive sequentially. We need to make a prediction; afterwards we observe the outcome.
• For i = 1, 2, …: Phase i:
  • The online algorithm receives example 𝑥𝑖.
  • It makes a prediction h(𝑥𝑖).
  • It then observes c∗(𝑥𝑖).
Mistake bound model
• Analysis-wise, make no distributional assumptions.
• Goal: minimize the number of mistakes.

Perceptron Algorithm in Online Model
WLOG homogeneous linear separators [w0 = 0].
• Set t=1, start with the all zero vector 𝑤1.
• Given example 𝑥, predict + iff 𝑤𝑡 ⋅ 𝑥 ≥ 0.
• On a mistake, update as follows:
  • Mistake on positive: 𝑤𝑡+1 ← 𝑤𝑡 + 𝑥
  • Mistake on negative: 𝑤𝑡+1 ← 𝑤𝑡 − 𝑥
Note 1: 𝑤𝑡 is a weighted sum of the incorrectly classified examples, 𝑤𝑡 = 𝑎𝑖1𝑥𝑖1 + ⋯ + 𝑎𝑖𝑘𝑥𝑖𝑘, so 𝑤𝑡 ⋅ 𝑥 = 𝑎𝑖1𝑥𝑖1 ⋅ 𝑥 + ⋯ + 𝑎𝑖𝑘𝑥𝑖𝑘 ⋅ 𝑥.
Note 2: No matter how long the sequence is or how high the dimension n is, the number of mistakes ever made depends only on the geometric margin of the examples seen.
(Figure: X = R^n; linearly separable X's and O's with separator w.)
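For concreteness, a minimal sketch of this online update in Python/NumPy (the stream of (x, y) examples with y ∈ {+1, −1} and the function name are illustrative, not from the slides):

import numpy as np

def online_perceptron(stream, n):
    """Run the online Perceptron on a stream of (x, y) pairs, with y in {+1, -1}."""
    w = np.zeros(n)                                    # start with the all-zero vector w_1
    mistakes = 0
    for x, y in stream:
        prediction = 1 if np.dot(w, x) >= 0 else -1    # predict + iff w_t . x >= 0
        if prediction != y:                            # on a mistake, update
            w = w + y * x                              # +x on a missed positive, -x on a missed negative
            mistakes += 1
    return w, mistakes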

Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
If ‖𝑤‖ = 1, the margin of 𝑥 w.r.t. 𝑤 is |𝑥 ⋅ 𝑤|.
(Figure: margins of examples 𝑥1 and 𝑥2 with respect to the separator 𝑤.)

Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
Definition: The margin 𝛾𝑤 of a set of examples 𝑆 wrt a linear separator 𝑤 is the smallest margin over points 𝑥 ∈ 𝑆.
Definition: The margin 𝛾 of a set of examples 𝑆 is the maximum 𝛾𝑤 over all linear separators 𝑤.
(Figure: positive and negative examples with margin 𝛾 on each side of the separator 𝑤.)

Perceptron: Mistake Bound
• No matter how long the sequence is or how high the dimension n is!
Theorem: If the data is linearly separable by margin 𝛾 and all points lie inside a ball of radius 𝑅, then Perceptron makes ≤ (𝑅/𝛾)² mistakes.
Margin: the amount of wiggle-room available for a solution.
(Figure: points inside a ball of radius 𝑅, separated with margin 𝛾 by 𝑤∗.)

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn’t change the number of mistakes; algo is invariant to scaling.)
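For completeness, the standard two-step argument behind this bound (not spelled out on the slides; M denotes the number of mistakes and w∗ the unit-norm separator achieving margin γ), as a LaTeX sketch:

\begin{align*}
\text{Progress: } & w_{t+1}\cdot w^* = w_t\cdot w^* + y\,(x\cdot w^*) \ge w_t\cdot w^* + \gamma
  \;\Rightarrow\; w_{\mathrm{final}}\cdot w^* \ge M\gamma,\\
\text{Growth: } & \|w_{t+1}\|^2 = \|w_t\|^2 + 2y\,(w_t\cdot x) + \|x\|^2 \le \|w_t\|^2 + R^2
  \quad(\text{since } y\,(w_t\cdot x)\le 0 \text{ on a mistake})
  \;\Rightarrow\; \|w_{\mathrm{final}}\| \le R\sqrt{M},\\
\text{Combine: } & M\gamma \le w_{\mathrm{final}}\cdot w^* \le \|w_{\mathrm{final}}\| \le R\sqrt{M}
  \;\Rightarrow\; M \le (R/\gamma)^2.
\end{align*}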

Perceptron Extensions
• Can use it to find a consistent separator with a given set S linearly separable by margin 𝛾 (by cycling through the data).
• Can convert the mistake bound guarantee into a distributional guarantee too (for the case where the 𝑥𝑖s come from a fixed distribution).
• Can be adapted to the case where there is no perfect separator, as long as the so-called hinge loss (i.e., the total distance the points would need to be moved to be classified correctly by a large margin) is small.
• Can be kernelized to handle non-linear decision boundaries!

Theorem: If the data is linearly separable by margin 𝛾 and all points lie inside a ball of radius 𝑅, then Perceptron makes ≤ (𝑅/𝛾)² mistakes.
Implies that large margin classifiers have smaller complexity!


Complexity of Large Margin Linear Sep.
Know that in R^n we can shatter n+1 points with linear separators, but not n+2 points (the VC-dimension of linear separators is n+1).
What if we require that the points be linearly separated by margin 𝛾?
Can have at most 𝑅²/𝛾² points inside a ball of radius 𝑅 that can be shattered at margin 𝛾 (meaning that every labeling is achievable by a separator of margin 𝛾).
So, large margin classifiers have smaller complexity!
• Nice implications for the usual distributional learning setting.
• Fewer classifiers to worry about that will look good over the sample but bad overall…
Less prone to overfitting!!!!

Margin Important Theme in ML.
Both sample complexity and algorithmic implications. Sample/Mistake Bound complexity:
• If large margin, the # of mistakes Perceptron makes is small (independent of the dimension of the space)!
• If large margin 𝛾 and the alg. produces a large margin classifier, then the amount of data needed depends only on R/𝛾 [Bartlett & Shawe-Taylor ’99].
• Suggests searching for a large margin classifier…
Algorithmic Implications:
• Perceptron, Kernels, SVMs…
(Figure: positive and negative examples with margin 𝛾 around the separator 𝑤.)

So far, talked about margins in the context of (nearly) linearly separable datasets

What if Not Linearly Separable
Problem: data not linearly separable in the most natural feature representation.
Example: no good linear separator in pixel representation.
Solutions:
• “Learn a more complex class of functions” (e.g., decision trees, neural networks, boosting).
• “Use a Deep Network”
• “Combine Kernels and Deep Networks”
• “Use a Kernel” (a neat solution that attracted a lot of attention)

Overview of Kernel Methods
What is a Kernel?
A kernel K is a legal definition of a dot-product: i.e., there exists an implicit mapping Φ s.t. K(x, y) = Φ(x) ⋅ Φ(y).
E.g., K(x, y) = (x ⋅ y + 1)^d
Φ: (n-dimensional space) → n^d-dimensional space
Why do kernels matter?
• Many algorithms interact with data only via dot-products. So, if we replace x ⋅ z with K(x, z), they act implicitly as if the data were in the higher-dimensional Φ-space.
• If the data is linearly separable by a large margin in the Φ-space, then good sample complexity.
[Or other regularity properties for controlling the capacity.]

Kernels
Definition
K(⋅,⋅) is a kernel if it can be viewed as a legal definition of an inner product:
• ∃ φ: X → R^N s.t. K(x, z) = φ(x) ⋅ φ(z)
• Range of φ is called the Φ-space.
• N can be very large.
• But think of φ as implicit, not explicit!!!!

Example
For n = 2, d = 2, the kernel K(x, z) = (x ⋅ z)^d corresponds to
(𝑥1, 𝑥2) → Φ(𝑥) = (𝑥1², 𝑥2², √2 𝑥1𝑥2)
(Figure: the same data shown in the original space (x1, x2) and in the Φ-space (z1, z2, z3).)

Example
φ: R² → R³, (x1, x2) → Φ(x) = (x1², x2², √2 x1x2)
φ(x) ⋅ φ(z) = (x1², x2², √2 x1x2) ⋅ (z1², z2², √2 z1z2)
= (x1z1 + x2z2)² = (x ⋅ z)² = K(x, z)
(Figure: the same data shown in the original space and in the Φ-space.)
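A quick numerical sanity check of this identity (a sketch in Python/NumPy; the two test vectors are arbitrary):

import numpy as np

def phi(x):
    # explicit feature map for n = 2, d = 2: (x1^2, x2^2, sqrt(2) x1 x2)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])
assert np.isclose(phi(x).dot(phi(z)), (x.dot(z)) ** 2)   # phi(x).phi(z) == (x.z)^2 == K(x, z)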

Kernels
Definition
K(⋅,⋅) is a kernel if it can be viewed as a legal definition of an inner product:
• ∃ φ: X → R^N s.t. K(x, z) = φ(x) ⋅ φ(z)
• Range of φ is called the Φ-space.
• N can be very large.
• But think of φ as implicit, not explicit!!!!

Example
Note: the feature space might not be unique.
φ: R² → R³, (x1, x2) → Φ(x) = (x1², x2², √2 x1x2)
φ(x) ⋅ φ(z) = (x1², x2², √2 x1x2) ⋅ (z1², z2², √2 z1z2) = (x1z1 + x2z2)² = (x ⋅ z)² = K(x, z)
φ: R² → R⁴, (x1, x2) → Φ(x) = (x1², x2², x1x2, x2x1)
φ(x) ⋅ φ(z) = (x1², x2², x1x2, x2x1) ⋅ (z1², z2², z1z2, z2z1) = (x ⋅ z)² = K(x, z)

Avoid explicitly expanding the features
Feature space can grow really large and really quickly…
Crucial to think of φ as implicit, not explicit!!!!
• Polynomial kernel of degree 𝑑: 𝑘(𝑥, 𝑧) = (𝑥⊤𝑧)^𝑑 = 𝜙(𝑥) ⋅ 𝜙(𝑧)
– The features are the degree-𝑑 monomials: 𝑥1^𝑑, 𝑥1𝑥2⋯𝑥𝑑, 𝑥1²𝑥2⋯𝑥𝑑−1, …
– The total number of such features is (𝑑+𝑛−1 choose 𝑑) = (𝑑+𝑛−1)! / (𝑑! (𝑛−1)!)
– For 𝑑 = 6, 𝑛 = 100, there are about 1.6 billion terms.
• Yet 𝑘(𝑥, 𝑧) = (𝑥⊤𝑧)^𝑑 = 𝜙(𝑥) ⋅ 𝜙(𝑧) takes only 𝑂(𝑛) computation!
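To make the contrast concrete, a small sketch (Python; math.comb gives the feature count, and the d = 6, n = 100 figures are those on the slide):

import math
import numpy as np

n, d = 100, 6
print(math.comb(d + n - 1, d))   # number of degree-d monomial features: 1,609,344,100 (~1.6 billion)

x, z = np.random.randn(n), np.random.randn(n)
k = np.dot(x, z) ** d            # the same kernel value via a single O(n) dot product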

Kernelizing a learning algorithm
• If all computations involving instances are in terms of inner products then:

 Conceptually, work in a very high-dimensional space and the algorithm’s performance depends only on linear separability in that extended space.
 Computationally, only need to modify the algorithm by replacing each x ⋅ z with K(x, z).
Examples of kernelizable algorithms:
• classification: Perceptron, SVM.
• regression: linear, ridge regression.
• clustering: k-means.

Kernelizing the Perceptron Algorithm
• Set t=1, start with the all zero vector 𝑤1.
• Given example 𝑥, predict + iff 𝑤𝑡 ⋅𝑥 ≥ 0
• On a mistake, update as follows:
• Mistake on positive, 𝑤𝑡+1 ← 𝑤𝑡 + 𝑥
• Mistake on negative, 𝑤𝑡+1 ← 𝑤𝑡 − 𝑥
(Figure: X's and O's with separator w.)
Easy to kernelize since 𝑤𝑡 is a weighted sum of the incorrectly classified examples: 𝑤𝑡 = 𝑎𝑖1𝑥𝑖1 + ⋯ + 𝑎𝑖𝑘𝑥𝑖𝑘.
Replace 𝑤𝑡 ⋅ 𝑥 = 𝑎𝑖1𝑥𝑖1 ⋅ 𝑥 + ⋯ + 𝑎𝑖𝑘𝑥𝑖𝑘 ⋅ 𝑥 with 𝑎𝑖1𝐾(𝑥𝑖1, 𝑥) + ⋯ + 𝑎𝑖𝑘𝐾(𝑥𝑖𝑘, 𝑥).
Note: need to store all the mistakes so far.

Kernelizing the Perceptron Algorithm (Φ-space)
• Given 𝑥, predict + iff 𝑎𝑖1𝜙(𝑥𝑖1) ⋅ 𝜙(𝑥) + ⋯ + 𝑎𝑖𝑡−1𝜙(𝑥𝑖𝑡−1) ⋅ 𝜙(𝑥) ≥ 0, i.e.,
𝑎𝑖1𝐾(𝑥𝑖1, 𝑥) + ⋯ + 𝑎𝑖𝑡−1𝐾(𝑥𝑖𝑡−1, 𝑥) ≥ 0
• On the 𝑡-th mistake, update as follows:
  • Mistake on positive: set 𝑎𝑖𝑡 ← 1; store 𝑥𝑖𝑡
  • Mistake on negative: set 𝑎𝑖𝑡 ← −1; store 𝑥𝑖𝑡
Perceptron: 𝑤𝑡 = 𝑎𝑖1𝑥𝑖1 + ⋯ + 𝑎𝑖𝑘𝑥𝑖𝑘
𝑤𝑡 ⋅ 𝑥 = 𝑎𝑖1𝑥𝑖1 ⋅ 𝑥 + ⋯ + 𝑎𝑖𝑘𝑥𝑖𝑘 ⋅ 𝑥 → 𝑎𝑖1𝐾(𝑥𝑖1, 𝑥) + ⋯ + 𝑎𝑖𝑘𝐾(𝑥𝑖𝑘, 𝑥)
Exact same behavior/prediction rule as if we mapped the data into the 𝜙-space and ran Perceptron there!
Do this implicitly, so computational savings!!!!!
(Figure: X's and O's with separator w in the Φ-space.)
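A minimal sketch of this kernelized Perceptron in Python, assuming a stream of (x, y) examples with y ∈ {+1, −1} and an arbitrary kernel function K; the names are illustrative:

import numpy as np

def kernel_perceptron(stream, K):
    """Online kernel Perceptron: store mistakes and their signs; never form w explicitly."""
    support, alphas = [], []          # the stored x_{i_t} values and their signs a_{i_t}
    for x, y in stream:
        # predict + iff sum_j a_j K(x_j, x) >= 0
        score = sum(a * K(s, x) for a, s in zip(alphas, support))
        prediction = 1 if score >= 0 else -1
        if prediction != y:           # the t-th mistake
            support.append(x)         # store x_{i_t}
            alphas.append(y)          # a_{i_t} = +1 on a missed positive, -1 on a missed negative
    return support, alphas

# e.g., with the quadratic kernel K(x, z) = (x . z)^2:
K = lambda x, z: float(np.dot(x, z)) ** 2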

Generalize Well if Good Margin
• If the data is linearly separable by a large margin in the 𝜙-space, then small mistake bound.
• If margin 𝛾 in the 𝜙-space, then Perceptron makes ≤ (𝑅/𝛾)² mistakes.
(Figure: Φ-space; positive and negative points inside a ball of radius R, separated with margin 𝛾 by w∗.)

Kernels: More Examples
• Linear: K(x, z) = x ⋅ 𝑧
• Polynomial: K(x, 𝑧) = (x ⋅ 𝑧)^d or K(x, 𝑧) = (1 + x ⋅ 𝑧)^d
• Gaussian: K(x, 𝑧) = exp(−‖𝑥 − 𝑧‖² / (2𝜎²))
• Laplace kernel: K(x, 𝑧) = exp(−‖𝑥 − 𝑧‖ / (2𝜎²))
• Kernel for non-vectorial data, e.g., measuring similarity between sequences.
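Written out directly (a sketch in Python/NumPy; σ, d, and the offset c are free parameters, and the Laplace kernel follows the form on the slide):

import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, d=2, c=1.0):
    return (np.dot(x, z) + c) ** d                          # c = 0 gives (x . z)^d

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def laplace_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / (2 * sigma ** 2))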

Properties of Kernels
Theorem (Mercer)
K is a kernel if and only if:
• K is symmetric
• For any set of training points 𝑥1, 𝑥2, … , 𝑥𝑚 and for any 𝑎1,𝑎2,…,𝑎𝑚 ∈ 𝑅, we have:
Σ𝑖,𝑗 𝑎𝑖𝑎𝑗𝐾(𝑥𝑖, 𝑥𝑗) ≥ 0, i.e., 𝑎^𝑇𝐾𝑎 ≥ 0.
That is, the Gram matrix 𝐾 = (𝐾(𝑥𝑖, 𝑥𝑗))𝑖,𝑗=1,…,𝑚 is positive semi-definite.
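One can check this condition numerically on any finite sample: build the Gram matrix and test symmetry and nonnegativity of its eigenvalues. A sketch, assuming a kernel function K and a list of points xs:

import numpy as np

def satisfies_mercer_on_sample(K, xs, tol=1e-8):
    """Check symmetry and positive semi-definiteness of the Gram matrix on the sample xs."""
    m = len(xs)
    G = np.array([[K(xs[i], xs[j]) for j in range(m)] for i in range(m)])
    return np.allclose(G, G.T) and np.all(np.linalg.eigvalsh(G) >= -tol)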

Kernel Methods
• Offer great modularity.
• No need to change the underlying learning algorithm to accommodate a particular choice of kernel function.
• Also, we can substitute a different algorithm while maintaining the same kernel.

Kernel Closure Properties
Easily create new kernels using basic ones!
Fact: If K1(⋅,⋅) and K2(⋅,⋅) are kernels and c1 ≥ 0, 𝑐2 ≥ 0, then K(x, z) = c1K1(x, z) + c2K2(x, z) is a kernel.
Key idea: concatenate the 𝜙-spaces: φ(x) = (√c1 φ1(x), √c2 φ2(x)).
φ(x) ⋅ φ(z) = c1 φ1(x) ⋅ φ1(z) + c2 φ2(x) ⋅ φ2(z) = c1𝐾1(𝑥, 𝑧) + c2𝐾2(𝑥, 𝑧)

Kernel Closure Properties
Easily create new kernels using basic ones!
Fact: If K1(⋅,⋅) and K2(⋅,⋅) are kernels, then K(x, z) = K1(x, z) K2(x, z) is a kernel.
Key idea: take the product feature space, φ(x) = (φ1,i(x) φ2,j(x)) for 𝑖 ∈ {1, …, 𝑛}, 𝑗 ∈ {1, …, 𝑚}.
φ(x) ⋅ φ(z) = Σ𝑖,𝑗 φ1,i(x) φ2,j(x) φ1,i(z) φ2,j(z)
= Σ𝑖 φ1,i(x) φ1,𝑖(z) Σ𝑗 φ2,𝑗(x) φ2,j(z)
= Σ𝑖 φ1,i(x) φ1,𝑖(z) K2(x, z) = K1(x, z) K2(x, z)
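Both closure facts can be illustrated numerically: on any finite sample, a nonnegative combination and an entrywise product of two kernels' Gram matrices remain positive semi-definite. A small sketch (Python/NumPy, random points, linear and Gaussian base kernels):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))

G1 = X @ X.T                                                  # linear kernel Gram matrix
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, -1)
G2 = np.exp(-sq_dists / 2.0)                                  # Gaussian kernel Gram matrix, sigma = 1

for G in (3.0 * G1 + 0.5 * G2,    # c1 K1 + c2 K2 with c1, c2 >= 0
          G1 * G2):               # entrywise (Schur) product, i.e., K1 * K2
    assert np.all(np.linalg.eigvalsh(G) >= -1e-8)             # still positive semi-definite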

Kernels, Discussion
• If all computations involving instances are in terms of inner products then:
 Conceptually, work in a very high-dimensional space and the algorithm’s performance depends only on linear separability in that extended space.
 Computationally, only need to modify the algorithm by replacing each x ⋅ z with K(x, z).
• Lots of Machine Learning algorithms are kernelizable:
• classification: Perceptron, SVM.
• regression: linear regression.
• clustering: k-means.

Kernels, Discussion
• If all computations involving instances are in terms of inner products then:
 Conceptually, work in a very high-dimensional space and the algorithm’s performance depends only on linear separability in that extended space.
 Computationally, only need to modify the algorithm by replacing each x ⋅ z with K(x, z).
How to choose a kernel:
• Kernels often encode domain knowledge (e.g., string kernels)
• Use cross-validation to choose the parameters, e.g., 𝜎 for the Gaussian kernel K(x, 𝑧) = exp(−‖𝑥 − 𝑧‖² / (2𝜎²)) (see the sketch below).
• Learn a good kernel; e.g., [Lanckriet-Cristianini-Bartlett-El Ghaoui-Jordan ’04].
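As a concrete version of the cross-validation bullet above, a sketch using scikit-learn (an assumption: its RBF kernel is parameterized by gamma, which plays the role of 1/(2σ²)); X and y below are placeholder data standing in for any labeled dataset:

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# placeholder data; in practice X, y come from the learning problem at hand
X = np.random.randn(200, 5)
y = np.sign(X[:, 0] + 0.3 * np.random.randn(200))

sigmas = [0.1, 0.5, 1.0, 2.0, 5.0]
grid = {"gamma": [1.0 / (2 * s ** 2) for s in sigmas]}   # gamma = 1 / (2 sigma^2)
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)     # 5-fold cross-validation over sigma
search.fit(X, y)
print(search.best_params_)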