
Outline
- Large margin classifiers
- Primal/Dual formulation of the SVM
- Kernel SVMs
- Hard-margin / soft-margin
- SVM and Hinge Loss
- SVMs beyond classification
- Applications

Linear Classifiers with Margin
- Let φ(x) be some feature map mapping the data in R^d to some feature space R^D.
- We consider functions of the type f_θ : R^d → R with f_θ(x) = sign(w⊤φ(x) + b), where θ = (w, b) denotes the parameters.
- We consider functions that classify with some margin M:

  F = { f_θ : θ ∈ R^{D+1},  ∀_{i=1}^N : |w⊤φ(x_i) + b| ≥ 1,  2/∥w∥ = M }

- The VC-dimension can be bounded as

  h_F ≤ min{ D + 1, ⌈4R²/M²⌉ + 1 }

  where R is the radius of the smallest sphere enclosing the mapped data (cf. Burges 1998).

Support Vector Machine (SVM)
[Figure: maximum-margin separating hyperplane with support vectors; increasing the margin reduces complexity]

SVM Primal
The classifier with the largest margin between the positive and negative data points can be obtained by solving the minimization problem:
  min_{w,b} (1/2)∥w∥²        (minimizing ∥w∥²/2 is equivalent to maximizing the margin 2/∥w∥)

subject to the constraints

  ∀_{i=1}^N : y_i · (w⊤φ(x_i) + b) ≥ 1.
Once the parameters have been optimized, the decision function is given by f(x) = sign(w⊤φ(x) + b)
and it can be applied to classify new data.
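The primal is a standard quadratic program and can be handed to a generic convex solver. Below is a minimal sketch using cvxpy on hypothetical toy data, with the identity feature map φ(x) = x assumed; the solver reports an infeasible status if the data is not linearly separable.

```python
import numpy as np
import cvxpy as cp

# hypothetical separable toy data, with phi(x) = x
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))      # maximizes the margin 2/||w||
constraints = [cp.multiply(y, X @ w + b) >= 1]        # y_i (w^T x_i + b) >= 1
cp.Problem(objective, constraints).solve()

predict = lambda X_new: np.sign(X_new @ w.value + b.value)
print(predict(X[:3]), y[:3])
```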

SVM Primal
[Figure: the SVM primal from the input view (margin and constraints) and from the SRM view]

Slater’s Condition
Consider the optimization problem

  min_θ f(θ)

with f convex, subject to the inequality constraints ∀_{i=1}^m : g_i(θ) ≤ 0 with g_i also convex. If there exists a point θ in the input domain that satisfies all constraints with strict inequalities, then the solution of the optimization problem is also given by its Lagrange dual formulation, the constrained optimization problem:
  max_α min_θ { f(θ) + Σ_{i=1}^m α_i g_i(θ) }   s.t. ∀_{i=1}^m : α_i ≥ 0,    (1)

where the inner minimization problem is typically solved analytically.

SVM: from the Primal to the Dual
Slater’s condition for SVM:
  ∃ w, b  s.t.  ∀_{i=1}^N : y_i · (w⊤φ(x_i) + b) > 1
In other words, the two classes must be linearly separable. When φ can be computed explicitly, this can be tested by running e.g. the perceptron algorithm.
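A minimal sketch of such a check (assuming the feature map has already been applied to produce a matrix Phi; the epoch budget is an arbitrary choice): reaching zero training errors proves separability, while not converging within the budget is inconclusive.

```python
import numpy as np

def looks_separable(Phi, y, epochs=1000):
    """Perceptron run: returns True if a separating hyperplane is found."""
    w, b = np.zeros(Phi.shape[1]), 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(Phi, y):
            if y_i * (w @ x_i + b) <= 0:        # misclassified (or on the boundary)
                w, b = w + y_i * x_i, b + y_i
                errors += 1
        if errors == 0:
            return True                         # zero training errors: separable
    return False                                # inconclusive within the epoch budget
```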

SVM: from the Primal to the Dual
If Slater's condition is satisfied, we insert the SVM objective into Eq. (1):
  max_α min_{w,b} { (1/2)∥w∥² + Σ_{i=1}^N α_i · (1 − y_i · (w⊤φ(x_i) + b)) }   s.t. α ≽ 0,

and after solving min{·} analytically by setting the gradient to zero, we get:

  max_α Σ_{i=1}^N α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩   s.t. α ≽ 0 and Σ_{i=1}^N α_i y_i = 0,

where ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j), and where we have also found in the process w = Σ_i α_i y_i φ(x_i).

Dual Optimization of the SVM (derivation)
Starting from the Lagrangian (the primal inserted into Eq. (1)):

  max_α min_{w,b} { (1/2)∥w∥² + Σ_{i=1}^N α_i · (1 − y_i · (w⊤φ(x_i) + b)) }   s.t. α ≽ 0

setting the gradient w.r.t. w to zero gives w = Σ_i α_i y_i φ(x_i), and the derivative w.r.t. b gives Σ_i α_i y_i = 0. Substituting these back into the Lagrangian yields the dual:

  max_α Σ_{i=1}^N α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j ⟨φ(x_i), φ(x_j)⟩   s.t. α ≽ 0 and Σ_{i=1}^N α_i y_i = 0

with ⟨φ(x_i), φ(x_j)⟩ = k(x_i, x_j).

Predicting with the Dual Solution
The prediction function can be rewritten as:

  f(x) = sign(w⊤φ(x) + b) = sign(Σ_i α_i y_i φ(x_i)⊤φ(x) + b) = sign(Σ_i α_i y_i k(x_i, x) + b)

Question: how do we find the bias b?
Answer: We recall that because the model has maximized the margin between the two classes, the nearest point to the decision boundary for each class is exactly on the margin, i.e.

  min_i { y_i · (w⊤φ(x_i) + b) } = 1

After some manipulations, we get:

  b = 1 − min_{i | y_i = 1} { Σ_j α_j y_j k(x_j, x_i) }.
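Putting the dual and the bias formula together, here is a minimal sketch that solves the hard-margin dual with cvxpy for a Gaussian kernel and predicts on new points; the toy data, the kernel parameter and the small diagonal jitter added for numerical PSD checks are assumptions of the sketch.

```python
import numpy as np
import cvxpy as cp

def gauss_kernel(X1, X2, gamma=0.5):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# hypothetical separable toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N = len(y)

K = gauss_kernel(X, X)
Q = np.outer(y, y) * K + 1e-8 * np.eye(N)       # Q_ij = y_i y_j k(x_i, x_j)

alpha = cp.Variable(N)
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q))
cp.Problem(objective, [alpha >= 0, y @ alpha == 0]).solve()

a = alpha.value
scores = (a * y) @ K                             # w^T phi(x_i) for every training point
b = 1 - scores[y == 1].min()                     # bias from the closest positive point

predict = lambda X_new: np.sign((a * y) @ gauss_kernel(X, X_new) + b)
print(predict(X[:3]), y[:3])
```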

Examples of Kernels
Observation: In the SVM dual form, we never need to access the feature map φ(·) explicitly. Instead, we can always express computations in terms of the kernel function.
Examples of commonly used kernels satisfying the Mercer property are:
  Linear:      k(x, x′) = ⟨x, x′⟩
  Polynomial:  k(x, x′) = (⟨x, x′⟩ + β)^γ,    β ∈ R≥0, γ ∈ N
  Gaussian:    k(x, x′) = exp(−γ ∥x − x′∥²),  γ ∈ R>0
Note: The feature map associated with the Gaussian kernel is infinite-dimensional. However, in the dual form, we never need to access it for training and prediction; we only need the kernel function.
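To make this concrete, the following small check (a sketch assuming 2-d inputs, β = 1 and γ = 2) verifies that the polynomial kernel value equals an ordinary inner product under an explicit 6-dimensional feature map, which is exactly what the kernel trick lets us avoid computing.

```python
import numpy as np

def poly_kernel(x, xp, beta=1.0, gamma=2):
    return (x @ xp + beta) ** gamma

def phi(x):
    # explicit feature map of the degree-2 polynomial kernel (beta=1) for 2-d inputs
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(poly_kernel(x, xp))    # kernel evaluation, no feature map needed
print(phi(x) @ phi(xp))      # the same value via the explicit feature map
```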

SVM with Polynomial Kernel
The SVM decision function

  g(x) = Σ_{i=1}^N α_i y_i · k(x_i, x) + b = Σ_{i=1}^N α_i y_i (⟨x_i, x⟩ + β)^γ + b

is a polynomial of degree γ.
[Figure: decision boundaries for degree = 1, 3, 12]
Observation: The polynomial decision function takes very large values in the extrapolation regime (far away from the training data).
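This behaviour can be reproduced with scikit-learn's SVC (a sketch on hypothetical toy data): the decision function evaluated far outside the training range grows quickly with the polynomial degree.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])

far_point = np.array([[50.0, 50.0]])    # far outside the training range
for degree in (1, 3, 12):
    clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=10.0).fit(X, y)
    print(degree, clf.decision_function(far_point))
```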

SVM with Gaussian Kernel
The SVM decision function

  g(x) = Σ_{i=1}^N α_i y_i · k(x_i, x) + b = Σ_{i=1}^N α_i y_i exp(−γ ∥x_i − x∥²) + b

is a superposition of ‘bump’ functions centered at the support vectors. Intermediate values of γ usually give the best results.
[Figure: decision functions for gamma = 0.01, 0.30, 10.00]
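A standard way to find such an intermediate γ is cross-validation; a minimal scikit-learn sketch (the data and the candidate grid are assumptions):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.hstack([-np.ones(50), np.ones(50)])

grid = {"gamma": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]}
search = GridSearchCV(SVC(kernel="rbf", C=1.0), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```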

More Advanced Kernels
More advanced kernels can be used to incorporate prior knowledge into the classifier, and to let the algorithm operate on non-vector data.
Examples:
- Bag-of-words kernels (text classification, image recognition)
- Tree kernels (text classification)
- Weighted degree kernel (bioinformatics)
- Graph kernels
Some of these kernels will be studied in Machine Learning 2.
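As a small illustration of operating on non-vector data, a simple bag-of-words kernel can be built from word-count vectors; the sketch below (hypothetical documents, scikit-learn's CountVectorizer) computes such a kernel matrix, which could then be passed to an SVM with kernel='precomputed'.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock markets fell sharply today"]

counts = CountVectorizer().fit_transform(docs)   # sparse word-count vectors
K = (counts @ counts.T).toarray()                # K_ij = <counts_i, counts_j>
print(K)
```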

Computational Requirements of SVMs
Computation required by the primal
- Optimization of a vector (w, b) in R^{D+1} subject to N constraints.
- The primal becomes computationally infeasible when D is very large (e.g. infinite-dimensional).
Computation required by the dual
- Computation of a kernel matrix K of size N × N with K_ij = k(x_i, x_j).
- The dual becomes computationally infeasible when N is large (e.g. 1 million data points).
What if both N and D are large?
- Random features (approximate infinite-dimensional feature maps through randomization; see the sketch below).
- Neural networks (learn an efficient feature map of size ≪ D).
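A minimal sketch of the random-features route with scikit-learn, where RBFSampler approximates the Gaussian kernel's infinite-dimensional feature map with random Fourier features and a linear (primal) SVM is trained on top; the data, gamma and the number of components are assumptions.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (500, 2)), rng.normal(1, 1.0, (500, 2))])
y = np.hstack([-np.ones(500), np.ones(500)])

Z = RBFSampler(gamma=0.5, n_components=300, random_state=0).fit_transform(X)
clf = LinearSVC(C=1.0).fit(Z, y)      # linear SVM in the random feature space
print(clf.score(Z, y))
```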

Soft-Margin SVM
To account for cases where the data is not separable (e.g. due to noise), we introduce slack variables (ξ_i)_i that allow points to violate the margin constraints, at the cost of some penalty.
  min_{w,b,ξ} (1/2)∥w∥² + C Σ_{i=1}^N ξ_i

subject to the constraints

  ∀_{i=1}^N : y_i · (w⊤φ(x_i) + b) ≥ 1 − ξ_i   and   ξ_i ≥ 0.

Note:
- Slater's condition is now always satisfied (by choosing the slack variables large enough for difficult examples), and the Soft-Margin SVM also has a dual formulation.
- The bias can be recovered from the dual using another set of conditions, the KKT conditions.

KKT Conditions (simplified)
Consider the optimization problem min_θ f(θ) subject to the inequality constraints ∀_{i=1}^m : g_i(θ) ≤ 0. For this problem, the Karush–Kuhn–Tucker (KKT) conditions are:
  ∇f(θ) + Σ_{i=1}^m λ_i ∇g_i(θ) = 0     (stationarity)
  ∀_{i=1}^m : g_i(θ) ≤ 0                 (primal feasibility)
  ∀_{i=1}^m : λ_i ≥ 0                    (dual feasibility)
  ∀_{i=1}^m : λ_i g_i(θ) = 0             (complementary slackness)

- If Slater's condition is satisfied, then the KKT conditions are satisfied at the optimum.
- The bias of the primal can be recovered from the ‘complementary slackness’ equation.

Effect of the parameter C
The larger the parameter C, the more the decision boundary is forced to correctly classify every data point. Best results are usually obtained with intermediate values of C.
[Figure: decision boundaries for C = 0.1, 10.0, 10000.0]
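This effect can be observed directly with scikit-learn (a sketch on hypothetical overlapping data): as C grows, the training accuracy typically increases while fewer points remain support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.5, (100, 2)), rng.normal(1, 1.5, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

for C in (0.1, 10.0, 10000.0):
    clf = SVC(kernel="rbf", gamma=0.3, C=C).fit(X, y)
    print(f"C={C:<8} train acc={clf.score(X, y):.2f} "
          f"support vectors={clf.n_support_.sum()}")
```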

SVM and Hinge Loss
Recall that we have defined g(x) = w⊤φ(x) + b, and therefore we can write the constraint as:

  ξ_i ≥ 1 − y_i · g(x_i)

We can consider instead a positive slack function

  ξ_i′ = max(0, 1 − y_i · g(x_i))

and inject it directly into the objective:

  min_{w,b}  (1/2)∥w∥² + C Σ_{i=1}^N max(0, 1 − y_i · g(x_i))

where max(0, 1 − y_i · g(x_i)) = l_hinge(y_i, g(x_i)) is the hinge loss, the first term is the regularizer E_reg, and the second term is the empirical error E_emp,
i.e. we optimize the sum of a fitting term and a regularization term (as we do e.g. for logistic regression but with a different loss function).
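The reformulated objective can be minimized directly, for instance by subgradient descent; below is a minimal sketch for the linear feature map φ(x) = x (learning rate and epoch count are untuned assumptions).

```python
import numpy as np

def svm_hinge_subgradient(X, y, C=1.0, lr=0.01, epochs=500):
    """Minimize 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w^T x_i + b))."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        v = margins < 1                                      # points with non-zero hinge loss
        grad_w = w - C * (y[v, None] * X[v]).sum(axis=0)     # subgradient of E_reg + E_emp w.r.t. w
        grad_b = -C * y[v].sum()
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b
```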

SVM Beyond Classification
Support vector data description (SVDD):
Consider an unlabeled dataset D = {x_1, . . . , x_N}. We would like to find the smallest enclosing sphere of the data, i.e.

  min_{R,c} R²

subject to the constraints

  ∀_{i=1}^N : ∥φ(x_i) − c∥² ≤ R²
and we would like to detect a point as anomalous if it exceeds a certain distance τ from the center c, i.e.
f(x) = sign(∥φ(x) − c∥ − τ)
Note: Like the SVM, the SVDD algorithm also has a dual form, it can be expressed in terms of kernels, and we can add slack variables to it.
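A minimal sketch of the SVDD primal with cvxpy, in input space (φ(x) = x) and on hypothetical unlabeled data; setting the threshold τ to the learned radius is an assumption of the sketch.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (50, 2))               # unlabeled data

c = cp.Variable(2)                               # center of the enclosing sphere
R2 = cp.Variable(nonneg=True)                    # squared radius
constraints = [cp.sum_squares(x_i - c) <= R2 for x_i in X]
cp.Problem(cp.Minimize(R2), constraints).solve()

center, tau = c.value, np.sqrt(R2.value)
f = lambda x: np.sign(np.linalg.norm(x - center) - tau)   # +1 flags an anomaly
print(f(np.array([5.0, 5.0])), f(np.array([0.2, -0.1])))
```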

SVM Beyond Classification
Support vector regression (SVR):
Consider a labeled regression dataset D = {(x_1, y_1), . . . , (x_N, y_N)} with y_i ∈ R. We would like to build the regression model

  min_{w,b} ∥w∥²

subject to the constraints

  ∀_{i=1}^N : |w⊤φ(x_i) + b − y_i| ≤ ε
Note: Like the SVM, support vector regression has a dual form, it can be kernelized, and we can add slack variables to it. With the slack variables, the constraints above can also be seen as applying an ε-insensitive loss.
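A minimal sketch with scikit-learn's SVR, which implements the soft-margin, kernelized version of the problem above (the data, ε and C values are assumptions):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, (80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# epsilon sets the width of the tube, C penalizes points that fall outside it
model = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=0.5).fit(X, y)
print(model.predict(X[:5]), y[:5])
```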

Visual Summary
[Diagram: SVMs connect large margin / VC dimension, quadratic programming, the kernel trick, kernel choice / prior knowledge, slack variables / noise modeling, and extensions to classification / regression / ...]

SVM for ECG-based Arrhythmia Detection
Wang et al. (2014), Hardware Specialization in Low-power Sensing Applications to Address Energy and Resilience

SVM for Text Categorization
Precision/recall-breakeven point on the ten most frequent Reuters categories and microaveraged performance over all Reuters categories.
Source: Joachims 1998, Text categorization with Support Vector Machines: Learning with many relevant features

SVM/HOG for Pedestrian Detection
Source: Dutta 2015, Pedestrian Detection using HOG and SVM in Automotives