Kernel Method
Lingqiao Liu
University of Adelaide
Outline
• Why Kernel methods?
– Motivation
– Benefits
• Kernel Method
– Criteria for kernel functions
– Kernel SVM
– Commonly used kernels
• Kernelizing machine learning algorithms
– Simple practice
– Kernel k-means
– Kernel Regression
– Kernel PCA
Motivation
• Limitation of linear classifiers: a linear decision boundary cannot separate classes that are not linearly separable
Feature transform
• Idea: transform the features into a higher-dimensional space in which the classes become (closer to) linearly separable
Feature transform
• Importance of the feature transform
– Mapping into a higher-dimensional feature space usually makes the data easier to separate
• The transform function
– Can be a simple arithmetic operation
– Can be anything! (a small example follows below)
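As a concrete illustration (a standard textbook example, not taken from the slides), a quadratic map sends 2-D points to 3-D:

    \phi(x_1, x_2) = \left( x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2 \right)

Under this map, a circle x_1^2 + x_2^2 = r^2 in the original space becomes a plane in the mapped space, so a linear classifier there corresponds to a non-linear boundary in the original space.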
Issue
• The dimensionality of the mapped feature
– can be very high or even infinite
– making explicit computation costly or even infeasible
– so it is more convenient to define the mapping implicitly
Kernel trick
• In many cases, we are only interested in the inner
product of the mapped features
• In that case, we can define the feature map implicitly
• Kernel function (defined below)
– can be more concise than computing the feature map explicitly
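Making this explicit (standard notation, with φ the feature map discussed above): the kernel function returns the inner product of the mapped features directly from the original features,

    k(x, x') = \langle \phi(x), \phi(x') \rangle = \phi(x)^T \phi(x')

so φ never has to be evaluated explicitly.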
Polynomial kernel
• Definition
• Can be expanded as (when d = 2):
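The slide's equations are not reproduced in this text; the standard definition, and the d = 2 expansion for 2-D inputs that presumably matches the slide, are:

    k(x, y) = (x^T y + c)^d, \qquad c \ge 0

For d = 2, c = 0 and x, y in R^2:

    (x^T y)^2 = (x_1 y_1 + x_2 y_2)^2 = x_1^2 y_1^2 + 2\, x_1 x_2\, y_1 y_2 + x_2^2 y_2^2 = \phi(x)^T \phi(y)

with φ(x) = (x_1^2, √2 x_1 x_2, x_2^2), the same map used in the earlier example.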
Recall that the kernel equals the inner product of the mapped features, so this expansion identifies the implicit feature map of the polynomial kernel.
Gaussian RBF Kernel
• Definition
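The definition is not reproduced from the slide; the standard form is:

    k(x, x') = \exp\!\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right)

(often written with γ = 1/(2σ²)). Its implicit feature map is infinite-dimensional, yet the kernel itself is cheap to evaluate.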
Understanding kernels
• Intuitively, a kernel models the similarity between two feature points
– This is a useful guideline for designing kernels
– However, not every similarity measure is a valid kernel function
Outline
• Why Kernel methods?
– Motivation
– Benefits
• Kernel Method
– Criteria for kernel functions
– Kernel SVM
– Commonly used kernels
• Kernelizing machine learning algorithms
– Simple practice
– Kernel k-means
– Kernel Regression
– Kernel PCA
Criterion for a kernel function
• By definition, check whether the kernel function can be written in the form k(x, x') = φ(x)^T φ(x'), i.e. as an inner product of mapped features
• Equivalently, for an arbitrary set of m samples, check whether the m × m kernel matrix K, with K_ij = k(x_i, x_j), is positive semi-definite
Criterion for a kernel function
• This implies that the kernel matrix K is positive semi-definite
• Definition (given below)
• We can extend this to an infinite number of samples (a functional space)
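The definition referred to above, in its standard form: the kernel matrix K with K_ij = k(x_i, x_j) is positive semi-definite when

    c^T K c = \sum_{i=1}^{m} \sum_{j=1}^{m} c_i c_j\, k(x_i, x_j) \ge 0 \quad \text{for all } c \in \mathbb{R}^m.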
Criterion for a kernel function
• Mercer condition
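A standard statement of the Mercer condition (the slide's exact wording is not reproduced): a symmetric function k is a valid kernel if, for every square-integrable function f,

    \int\!\!\int k(x, x')\, f(x)\, f(x')\, dx\, dx' \ge 0.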
Summary so far
• Kernel function
– Sometimes we are only interested in the inner products of the transformed features
– We define a function that computes this inner product directly from a pair of original features; this is called the kernel function
– Defining a kernel function is equivalent to implicitly defining a feature transform
• Application?
Kernel SVM
• Linear SVM (Primal form)
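For reference, the standard soft-margin primal, which is presumably the form shown on the slide:

    \min_{w, b, \xi}\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.}\; y_i (w^T x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0.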
Kernel SVM
• Dual form
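The standard dual, in which the data appear only through inner products:

    \max_{\alpha}\; \sum_{i} \alpha_i - \frac{1}{2} \sum_{i, j} \alpha_i \alpha_j\, y_i y_j\, x_i^T x_j \quad \text{s.t.}\; 0 \le \alpha_i \le C,\; \sum_i \alpha_i y_i = 0.

Replacing x_i^T x_j with k(x_i, x_j) = φ(x_i)^T φ(x_j) gives the kernel SVM.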
Kernel SVM
• Decision function for kernel SVM
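The standard kernelized decision function (presumably what the slide shows):

    f(x) = \operatorname{sign}\!\left( \sum_{i} \alpha_i y_i\, k(x_i, x) + b \right)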
No need to access the exact values of the transformed features; only their inner products are required.
Kernel SVM
• Decision boundary of a Kernel SVM
Lesson learned
• Kernel SVM produces more flexible decision boundaries and thus can lead to better classification performance
• Choosing a good kernel is the key to success (a brief usage sketch follows below)
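A minimal usage sketch with scikit-learn (not part of the slides; the dataset and hyperparameters are illustrative):

```python
# Compare kernels on a toy non-linearly-separable dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    clf = SVC(kernel=kernel, C=1.0, gamma="scale")  # gamma is ignored for the linear kernel
    clf.fit(X_tr, y_tr)
    print(kernel, clf.score(X_te, y_te))  # the RBF kernel usually separates the two moons best
```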
Commonly used kernels
• Linear Kernel
• Polynomial kernel
• Gaussian RBF kernel
Commonly used kernels
• Histogram-like features: the bag-of-words model
Commonly used kernels
• Histogram-like features: the bag-of-features model
Commonly used kernels for histogram-like features (definitions sketched below)
• Hellinger kernel
• Histogram Intersection Kernel
• RBF Kernel
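The definitions are not reproduced from the slide; standard forms for two histograms x and y are given below (the exact RBF variant intended on the slide is an assumption here):

    \text{Hellinger:}\quad k(x, y) = \sum_i \sqrt{x_i\, y_i}
    \text{Histogram intersection:}\quad k(x, y) = \sum_i \min(x_i, y_i)
    \text{RBF on histograms:}\quad k(x, y) = \exp\!\left( -\gamma \|x - y\|^2 \right)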
Lesson learned (from kernel SVM)
• Kernel SVM produces more flexible decision boundaries and thus can lead to better classification performance
• Choosing a good kernel is the key to success
• We need to reformulate the original problem so that it depends only on the inner products of the transformed features
Outline
• Why Kernel methods?
– Motivation
– Benefits
• Kernel Method
– Criteria for kernel functions
– Kernel SVM
– Commonly used kernels
• Kernelizing machine learning algorithms
– Simple practice
– Kernel k-means
– Kernel Regression
– Kernel PCA
Kernelizing Machine Learning Algorithms
• How to kernelize an ML algorithm
– Assume we already know the kernel function k(·, ·)
– Reformulate the computation so that the algorithm depends only on inner products of transformed features, i.e. on kernel evaluations
• Revisit ML algorithms
– Kernel k-means
– Kernel regression
– Kernel PCA
Warm-up Practice
• Euclidean distance in the transformed feature space
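The identity this practice works toward (a standard result): the squared Euclidean distance between two mapped points can be written purely in terms of kernel evaluations,

    \|\phi(x) - \phi(y)\|^2 = \phi(x)^T\phi(x) - 2\,\phi(x)^T\phi(y) + \phi(y)^T\phi(y) = k(x, x) - 2\,k(x, y) + k(y, y).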
K-means algorithm
• Recap: alternately assign each point to its nearest centre, then recompute each centre as the mean of the points assigned to it; repeat until the assignments stop changing
How to kernelize k-means
• Key step: assigning each point to the nearest centre, i.e. calculating the distance between a feature and a centre
• Challenges
– We cannot explicitly access the transformed features φ(x)
– Consequently, we cannot explicitly access the cluster centres in the transformed space
• How to represent the centre? (see below)
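A standard way to do this, presumably matching the slides' derivation: represent the centre of cluster c implicitly as the mean of the transformed features assigned to it (the set S_c), and expand the squared distance so that only kernel values appear,

    \mu_c = \frac{1}{|S_c|} \sum_{i \in S_c} \phi(x_i)
    \|\phi(x) - \mu_c\|^2 = k(x, x) - \frac{2}{|S_c|} \sum_{i \in S_c} k(x, x_i) + \frac{1}{|S_c|^2} \sum_{i, j \in S_c} k(x_i, x_j).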
Putting them together: Kernel k-means
• We can combine the E-step and the M-step into a single step:
– Calculate the cluster assignments using the kernelized distance computation above
– Iterate this step until the assignments no longer change between iterations (a numpy sketch follows below)
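A minimal numpy sketch of this procedure (not from the slides; the k(x, x) term is dropped when comparing clusters because it is the same for every cluster):

```python
import numpy as np

def kernel_kmeans(K, n_clusters, n_iters=100, seed=0):
    """Kernel k-means on a precomputed m x m kernel matrix K."""
    m = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=m)           # random initial assignment
    for _ in range(n_iters):
        dist = np.full((m, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            n_c = members.sum()
            if n_c == 0:
                continue
            # ||phi(x) - mu_c||^2 up to the constant k(x, x)
            dist[:, c] = (-2.0 / n_c) * K[:, members].sum(axis=1) \
                         + K[np.ix_(members, members)].sum() / n_c**2
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):           # assignments stopped changing
            break
        labels = new_labels
    return labels
```

K can be built from any valid kernel, for example the Gaussian RBF kernel defined earlier.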
Kernel Regression
• Linear regression problem
• However, directly expanding this objective does not kernelize the expression: the weight vector w still appears explicitly
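The exact objective on the slide is not reproduced; as a working assumption for the rest of this section, take a ridge-regularized least-squares objective on the transformed features:

    \min_{w}\; \sum_{i=1}^{m} \left( w^T \phi(x_i) - y_i \right)^2 + \lambda \|w\|^2

Expanding this directly still leaves w, whose dimensionality matches that of φ, which is why a reformulation is needed.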
Represent w using the training data
• The transformed training features span a linear subspace
• You can picture this subspace as a plane in the high-dimensional space
• Any vector in that space can be decomposed into a component lying in the subspace and a component orthogonal to the subspace
Kernel Regression
• Rewrite w
Component in the subspace: because it lies in the subspace, we can find combination weights that reconstruct it from the training samples.
Component orthogonal to the subspace: it is orthogonal to every training feature, so it does not affect the predictions on the training data and can be dropped.
So instead of solving for w, we solve for the combination weights α, and then use the kernel expansion of w to calculate the prediction.
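Under the ridge-regularized objective assumed above, substituting w = Σ_i α_i φ(x_i) gives the standard closed form (the slide's derivation may differ in detail):

    \alpha = (K + \lambda I)^{-1} y, \qquad f(x) = \sum_{i=1}^{m} \alpha_i\, k(x_i, x), \qquad K_{ij} = k(x_i, x_j).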
Lesson Learned
• Represent the model parameters as a linear combination of the training features, and learn the combination weights instead
• Representer theorem:
https://en.wikipedia.org/wiki/Representer_theorem
PCA and kernel PCA
• Basic concept
– The dimensionality-reduction function is a mapping that maps the original features to low-dimensional features
– In general, this mapping can be learned from training data and then applied to other data
– The parameters of the mapping function in PCA (and KPCA):
• Mean vector
• Projection matrix P
Key steps in PCA
• Learning the parameters:
– Centralize the data (subtract the mean vector)
– Calculate the covariance matrix
– Calculate its eigenvectors and eigenvalues
– Sort the eigenvalues and their corresponding eigenvectors; keep the top-k eigenvectors as the projection matrix P
• Applying the transform:
– Centralize all data with the mean vector calculated from the training data
– Apply the learned projection matrix to the centralized data (a numpy sketch follows below)
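A minimal numpy sketch of these steps (illustrative, not taken from the slides):

```python
import numpy as np

def pca_fit(X, k):
    """Learn PCA parameters from training data X (n_samples x n_features)."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # centralize with the training mean
    cov = Xc.T @ Xc / X.shape[0]           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]      # eigenvalues in descending order
    P = eigvecs[:, order[:k]]              # top-k eigenvectors form the projection matrix
    return mean, P

def pca_transform(X, mean, P):
    """Apply the learned transform to (possibly new) data."""
    return (X - mean) @ P
```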
Kernel PCA
• Main differences:
– The inner product between two samples is measured by the kernel function
– Each sample is essentially represented by its transformed feature φ(x)
– We do not know the actual (possibly infinite-dimensional) representation φ(x)
• Challenges
– Challenge 1: we cannot directly centralize the data (for training and testing)
– Challenge 2: we cannot directly calculate the covariance matrix
– Challenge 3: we cannot directly calculate the projection matrix P
Overcome Challenge 2
Note that, in this derivation, it is the transformed features (not the original features) that are assumed to be uncentred; centring is handled under Challenge 1.
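A standard way to obtain the required eigenvectors without ever forming the covariance matrix explicitly, which is presumably what the original slides derive: every eigenvector with a non-zero eigenvalue lies in the span of the transformed training features,

    C = \frac{1}{m} \sum_{i=1}^{m} \phi(x_i)\phi(x_i)^T, \qquad v = \sum_{i=1}^{m} \alpha_i\, \phi(x_i).

Substituting v into Cv = λv and taking inner products with each φ(x_k) gives (for components in the span of the data)

    K\alpha = m\lambda\, \alpha,

so α is an eigenvector of the kernel matrix K, and v is represented implicitly through α.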
Overcome Challenge 1 (at training stage)
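The standard centring of the kernel matrix in feature space (presumably the formula on the slide): with 1_m denoting the m × m matrix whose entries are all 1/m,

    \tilde{K} = K - \mathbf{1}_m K - K \mathbf{1}_m + \mathbf{1}_m K \mathbf{1}_m,

which is the kernel matrix of the centred features φ(x_i) − (1/m) Σ_j φ(x_j), computed without ever forming them.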
Overcome Challenge 3 and Challenge 1 (at testing time)
• Recall that the projection matrix P in PCA is obtained by stacking the eigenvectors of the covariance matrix, sorted by their corresponding eigenvalues in descending order
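Combining the pieces, the j-th reduced coordinate of a (test) point x can be computed without P (a standard result; the normalization of α so that each v_j has unit length is a detail the slides may treat separately): with v_j = Σ_i α_i^{(j)} \tilde{\phi}(x_i),

    z_j(x) = v_j^T \tilde{\phi}(x) = \sum_{i=1}^{m} \alpha_i^{(j)}\, \tilde{k}(x, x_i),
    \tilde{k}(x, x_i) = k(x, x_i) - \frac{1}{m}\sum_{l} k(x, x_l) - \frac{1}{m}\sum_{l} k(x_l, x_i) + \frac{1}{m^2}\sum_{l, l'} k(x_l, x_{l'}).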
Putting it together: KPCA
• Training
– Compute the kernel matrix on the training data and centralize it as described above
– Compute the eigenvectors and eigenvalues of the centralized kernel matrix, obtaining the coefficient vectors α
– Sort the eigenvectors by their corresponding eigenvalues and keep the top k, as in standard PCA
• Testing
– Use the projection formula above to calculate the dimension-reduced features (a numpy sketch follows below)
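A compact numpy sketch of the whole procedure (illustrative; it mirrors the steps above and uses a Gaussian RBF kernel as an example):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kpca_fit(X, k, gamma=1.0):
    """Learn the KPCA coefficient vectors (alphas) from training data X."""
    m = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    ones = np.full((m, m), 1.0 / m)
    Kc = K - ones @ K - K @ ones + ones @ K @ ones       # centred kernel matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]                 # keep the top-k components
    alphas = eigvecs[:, order] / np.sqrt(np.maximum(eigvals[order], 1e-12))  # unit-norm v_j
    return X, K, alphas, gamma

def kpca_transform(Xnew, model):
    """Project new data onto the learned kernel principal components."""
    Xtrain, K, alphas, gamma = model
    Kx = rbf_kernel(Xnew, Xtrain, gamma)
    # centre the test kernel values using the training statistics
    Kx_c = Kx - Kx.mean(axis=1, keepdims=True) - K.mean(axis=0)[None, :] + K.mean()
    return Kx_c @ alphas
```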
Summary
• Kernel Method: theory
– Motivation
– Criterion for kernel functions.
– Use of Kernels: Kernel SVM
– Commonly used kernels
• Kernel Method: kernelizing machine learning algorithms
– Kernel k-means
– Kernel Regression
– Kernel PCA