SVM and SMO
Support Vector Machine and SMO in Python¶
Noted by Jiang Hao on 2015/6/3
Chapter 0: Introduction¶
This is a note taken during my study on Support Vector Machine. It mainly includes:
How I understand SVM, SMO and Pegasos
Derivation details of some equations and formulas
Chapter 1: The mathematics of the margin¶
Hypothesis¶
The hypothesis of SVM is as follows.
$${h_{w,b}}(x) = g({w^T}x + b) = \left\{ {\begin{array}{*{20}{c}}{1,\quad {w^T}x + b \ge 0}\\{ - 1,\quad {w^T}x + b < 0}\end{array}} \right.$$
Notice the difference between this and logistic regression: $g$ outputs a hard label rather than a probability.
Two kinds of margins¶
We define the functional margin of $(w,b)$ with respect to the training sample $({x^{(i)}},{y^{(i)}})$:
$${{\hat \gamma }^{(i)}} = {y^{(i)}}({w^T}{x^{(i)}} + b)$$
Given a training set $S = \{ ({x^{(i)}},{y^{(i)}}),i = 1,...,m\}$, we define the functional margin of $(w,b)$ with respect to $S$:
$$\hat \gamma = \mathop {\min }\limits_{i = 1,...,m} {{\hat \gamma }^{(i)}}$$
Notice that the functional margin is not a good measure of confidence: we can multiply both $w$ and $b$ by any positive number without changing $h_{w,b}(x)$, while scaling the functional margin by the same factor. Intuitively, it makes sense to impose some normalization on the margin. We define the geometric margin of $(w,b)$ with respect to the training sample $(x^{(i)},y^{(i)})$:
$${\gamma ^{(i)}} = {y^{(i)}}\left(\frac{{{w^T}}}{{||w||}} \cdot x^{(i)} + \frac{b}{{||w||}}\right)$$
We also define the geometric margin with respect to the training set $S$:
$$\gamma = \mathop {\min }\limits_{i = 1,...,m} {\gamma ^{(i)}}$$
The relationship between the functional margin and the geometric margin is as follows; when $||w||=1$ the two coincide.
$${\gamma ^{(i)}} = \frac{{{{\hat \gamma }^{(i)}}}}{{||w||}}$$
$$\gamma = \frac{{\hat \gamma }}{{||w||}}$$
Largest margin classifier¶
Assuming that the training samples are linearly separable, SVM seeks the hyperplane that separates the positive and negative points with the maximum 'gap', i.e. the geometric margin. The optimization problem can be expressed as follows:
$$\mathop {\max }\limits_{\hat \gamma ,w,b} \frac{{\hat \gamma }}{{||w||}}$$
$$s.t. \ {y^{(i)}}({w^T}{x^{(i)}} + b) \ge \hat \gamma ,\;i = 1,...,m$$
Now we add a constraint on $w$ and $b$ to make ${\hat \gamma}=1$.
As discussed above, this doesn't change anything. Maximizing $\frac{{\hat \gamma }}{{||w||}} = \frac{1}{{||w||}}$ is then the same as minimizing $||w||^2$. We obtain the optimization problem:
$$\mathop {\min }\limits_{w,b} \frac{1}{2}||w|{|^2}$$
$$s.t. \ {y^{(i)}}({w^T}{x^{(i)}} + b) \ge 1,\;i = 1,...,m$$
We transform $\frac{1}{{||w||}}$ into $||w||^2$ mainly because maximizing $\frac{1}{{||w||}}$ is a non-convex problem, which no off-the-shelf software can solve directly. The reason we can arbitrarily set $\hat \gamma$ is that however we rescale the functional margin, the geometric margin never changes. I wonder whether it's necessary to introduce the functional margin at all; if we jumped directly to the geometric margin, it would be simpler and more intuitive.
Chapter 2: The Lagrange Duality¶
The primal problem¶
To solve the constrained optimization problem, we use a method called Lagrange duality. Let's start by considering the primal problem:
$$\mathop {\min }\limits_w f(w)$$
$$s.t.\;{g_i}(w) \le 0,\;i = 1,...,k$$
$${h_i}(w) = 0,\;i = 1,...,l$$
To solve this problem, we define the generalized Lagrangian:
$$L(w,\alpha ,\beta ) = f(w) + \sum\limits_{i = 1}^k {{\alpha _i}{g_i}(w)} + \sum\limits_{i = 1}^l {{\beta _i}{h_i}(w)} $$
Here the $\alpha_i$'s and $\beta_i$'s are Lagrange multipliers. Consider the quantity
$${\theta _P}(w) = \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} L(w,\alpha ,\beta )$$
Hence we get
$${\theta _P}(w) = \left\{ {\begin{array}{*{20}{l}}{f(w)\quad if\;w\;satisfies\;the\;primal\;constraints}\\{\infty \quad otherwise}\end{array}} \right.$$
Thus the following minimization problem is the same as the primal problem.
$$\mathop {\min }\limits_w f(w) = \mathop {\min }\limits_w {\theta _P}(w) = \mathop {\min }\limits_w \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} L(w,\alpha ,\beta )$$
We define the optimal value of the primal problem to be ${p^ * } = \mathop {\min }\limits_w {\theta _P}(w)$.
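As a quick sanity check, the behaviour of $\theta_P$ can be seen numerically on a toy one-variable problem of my own choosing (not from these notes): on the feasible set $\theta_P$ reproduces $f(w)$, off it the inner maximization blows up.

```python
import numpy as np

# Toy primal: minimize f(w) = w^2  subject to  g(w) = 1 - w <= 0  (i.e. w >= 1)
f = lambda w: w ** 2
g = lambda w: 1.0 - w

def theta_P(w, alphas=np.linspace(0.0, 1e4, 1001)):
    # theta_P(w) = max over alpha >= 0 of L(w, alpha) = f(w) + alpha * g(w),
    # approximated here by maximizing over a finite grid of alpha values
    return np.max(f(w) + alphas * g(w))

assert np.isclose(theta_P(2.0), f(2.0))  # feasible point: theta_P = f
assert theta_P(0.0) >= 1e3               # infeasible: grows with the alpha grid
```

On a computer the "$\infty$" branch shows up as a value that grows without bound as the $\alpha$ grid is extended.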
The dual problem¶
Now let's look at a different problem called the dual problem. We define
$${\theta _D}(\alpha ,\beta ) = \mathop {\min }\limits_w L(w,\alpha ,\beta )$$
The dual problem can be written
$$\mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} {\theta _D}(\alpha ,\beta ) = \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} \mathop {\min }\limits_w L(w,\alpha ,\beta )$$
The dual problem is similar to the primal problem, except that the order of the max and min is swapped. We define the optimal value of the dual problem ${d^ * } = \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} {\theta _D}(\alpha ,\beta )$.
Relationship between the primal and the dual¶
How are the two problems related?
$${d^ * } = \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} \mathop {\min }\limits_w L(w,\alpha ,\beta ) \le \mathop {\min }\limits_w \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} L(w,\alpha ,\beta ) = {p^ * }$$
Proof: For all $w,\alpha ,\beta$, we have $\mathop {\min }\limits_w L(w,\alpha ,\beta ) \le L(w,\alpha ,\beta ) \le \mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} L(w,\alpha ,\beta )$; thus ${\theta _D}(\alpha ,\beta ) \le {\theta _P}(w)$; so we get $\mathop {\max }\limits_{\alpha ,\beta :{\alpha _i} \ge 0} {\theta _D}(\alpha ,\beta ) \le \mathop {\min }\limits_w {\theta _P}(w)$.
Under certain conditions, we have ${d^ * } = {p^ * }$, in which case we can solve the dual problem instead of the primal problem. The conditions are:
$f$ and the $g_i$'s are convex
the $h_i$'s are affine
the $g_i$'s are strictly feasible, meaning there exists some $w$ such that $g_i(w) < 0$ for all $i$
Under these conditions, there must exist ${w^ * }$, ${\alpha ^ * }$, ${\beta ^ * }$ such that ${w^ * }$ is the solution of the primal problem and ${\alpha ^ * }$, ${\beta ^ * }$ are solutions of the dual problem. Moreover, ${d^ * } = {p^ * } = L({w^ * },{\alpha ^ * },{\beta ^ * })$.
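The weak-duality inequality $d^* \le p^*$ can also be checked numerically. A small grid experiment on the same toy Lagrangian as before ($f(w)=w^2$, $g(w)=1-w$, my own example), where this convex problem also satisfies the strong-duality conditions, so both values coincide at $1$:

```python
import numpy as np

# L(w, a) = w^2 + a * (1 - w), evaluated on a (alpha, w) grid
ws = np.linspace(-3, 3, 601)       # w grid, includes the optimum w = 1
als = np.linspace(0, 10, 501)      # alpha grid, includes the dual optimum a = 2
Wg, Ag = np.meshgrid(ws, als)
Lv = Wg ** 2 + Ag * (1 - Wg)

d_star = Lv.min(axis=1).max()      # max over alpha of (min over w)  -> d*
p_star = Lv.max(axis=0).min()      # min over w of (max over alpha)  -> p*

assert d_star <= p_star + 1e-9     # weak duality always holds
assert np.isclose(d_star, 1.0) and np.isclose(p_star, 1.0)  # strong duality here
```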
Moreover, we have the Karush-Kuhn-Tucker (KKT) conditions: given the conditions above, ${w^ * }$, ${\alpha ^ * }$, ${\beta ^ * }$ are solutions of the primal and dual problems if and only if they satisfy the KKT conditions:
$$\frac{\partial }{{\partial {w_i}}}L({w^*},{\alpha ^*},{\beta ^*}) = 0,\;i = 1,...,n$$
$$\frac{\partial }{{\partial {\beta _i}}}L({w^*},{\alpha ^*},{\beta ^*}) = 0,\;i = 1,...,l$$
$$\alpha _i^*{g_i}({w^*}) = 0,\;i = 1,...,k$$
$${g_i}({w^*}) \le 0,\;i = 1,...,k$$
$$\alpha _i^* \ge 0,\;i = 1,...,k$$
$${h_i}({w^*}) = 0,\;i = 1,...,l$$
The equation $\alpha _i^*{g_i}({w^*}) = 0$ is called the KKT dual complementarity condition. It implies that if $\alpha _i^* > 0$, then ${g_i}({w^*}) = 0$. This is why SVM relies on only a small number of support vectors; it also gives us the convergence test used in SMO.
Chapter 3: Optimizing the margin¶
The linearly separable case¶
For the linearly separable case, the primal problem of SVM is as follows:
$$\mathop {\min }\limits_{w,b} \frac{1}{2}||w|{|^2}$$$$s.t.\; - {y^{(i)}}({w^T}{x^{(i)}} + b) + 1 \le 0,\;i = 1,...,m$$It is clearly a convex QP problem, so we can solve it through its dual form. We construct its Lagrangian:
$$L(w,b,\alpha ) = \frac{1}{2}||w|{|^2} - \sum\limits_{i = 1}^m {{\alpha _i}[{y^{(i)}}({w^T}{x^{(i)}} + b) - 1]} $$The dual form of this problem is
$$\mathop {\max }\limits_\alpha {\theta _D}(\alpha ) = \mathop {\max }\limits_\alpha \mathop {\min }\limits_{w,b} L(w,b,\alpha )$$First, we get $\mathop {\min }\limits_{w,b} L(w,b,\alpha )$ by setting the derivatives of $L$ with respect to $w$ and $b$ to $0$:
$${\nabla _w}L(w,b,\alpha ) = w - \sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}{x^{(i)}}} = 0$$$$\frac{\partial }{{\partial b}}L(w,b,\alpha ) = - \sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}} = 0$$From the first equation we get $w = \sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}{x^{(i)}}}$. Substituting it into $L(w,b,\alpha )$:
$$L(w,b,\alpha ) = \sum\limits_{i = 1}^m {{\alpha _i}} - \frac{1}{2}\sum\limits_{i,j = 1}^m {{y^{(i)}}{y^{(j)}}{\alpha _i}{\alpha _j}\left\langle {{x^{(i)}},{x^{(j)}}} \right\rangle } $$Now the dual problem can be written as:
$$\mathop {\max }\limits_\alpha \;W(\alpha ) = \sum\limits_{i = 1}^m {{\alpha _i}} - \frac{1}{2}\sum\limits_{i,j = 1}^m {{y^{(i)}}{y^{(j)}}{\alpha _i}{\alpha _j}\left\langle {{x^{(i)}},{x^{(j)}}} \right\rangle } $$$$s.t.\quad {\alpha _i} \ge 0,\;i = 1,...,m$$$$\sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}} = 0$$
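For a two-point toy set (my own worked example) the dual can be solved by hand: with $y^{(1)}=+1$, $y^{(2)}=-1$ the constraint $\alpha_1 y^{(1)} + \alpha_2 y^{(2)} = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, so $W(\alpha) = 2\alpha - \frac{1}{2}\alpha^2 ||x^{(1)} - x^{(2)}||^2$, which peaks at $\alpha^* = 2/||x^{(1)} - x^{(2)}||^2$.

```python
import numpy as np

x1, x2 = np.array([2.0, 0.0]), np.array([0.0, 0.0])   # y1 = +1, y2 = -1
alpha = 2.0 / np.dot(x1 - x2, x1 - x2)                # maximizer of W(alpha)

# Recover the primal solution: w = sum_i alpha_i * y_i * x_i
w = alpha * x1 - alpha * x2
# b from the support-vector condition y_1 * (w.x1 + b) = 1
b = 1.0 - np.dot(w, x1)

# Both points are support vectors and sit exactly on the margin
assert np.isclose(np.dot(w, x1) + b, 1.0)
assert np.isclose(-(np.dot(w, x2) + b), 1.0)
```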
The non-linearly separable case¶
To make SVM work better for (1) non-linearly separable datasets and (2) datasets with outliers, we add $\ell_1$ regularization (on the slack variables) to the optimization problem:
$$\mathop {\min }\limits_{w,b} \frac{1}{2}||w|{|^2} + C\sum\limits_{i = 1}^m {{\xi _i}} $$$$s.t.\;{y^{(i)}}({w^T}{x^{(i)}} + b) \ge 1 - {\xi _i},\;i = 1,...,m$$$${\xi _i} \ge 0,\;i = 1,...,m$$ Note that $\xi$ is pronounced [ksi] ('xi').
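At the optimum each slack takes the value ${\xi _i} = \max (0,\;1 - {y^{(i)}}({w^T}{x^{(i)}} + b))$, i.e. the hinge loss (a standard fact, not derived in these notes), so the objective can be evaluated directly for any candidate $(w, b)$. A minimal sketch with made-up data:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i xi_i, with xi_i = max(0, 1 - y_i*(w.x_i + b))."""
    slacks = np.maximum(0.0, 1.0 - y * (X @ w + b))
    return 0.5 * np.dot(w, w) + C * slacks.sum()

X = np.array([[2.0, 0.0], [0.0, 0.0]])
y = np.array([1.0, -1.0])
w, b = np.array([1.0, 0.0]), -1.0   # separates both points with margin 1
assert np.isclose(soft_margin_objective(w, b, X, y, C=1.0), 0.5)  # all slacks 0
```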
The new Lagrangian can be written as:
$$L(w,b,\xi ,\alpha ,\gamma ) = \frac{1}{2}||w|{|^2} + C\sum\limits_{i = 1}^m {{\xi _i}} - \sum\limits_{i = 1}^m {{\alpha _i}[{y^{(i)}}({w^T}{x^{(i)}} + b) - 1 + {\xi _i}]} - \sum\limits_{i = 1}^m {{\gamma _i}{\xi _i}} $$Similar to the separable case, we solve this problem through its dual form. First we set the first derivatives to zero.
$${\nabla _w}L(w,b,\xi ,\alpha ,\gamma ) = w - \sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}{x^{(i)}}} = 0$$$$\frac{\partial }{{\partial b}}L(w,b,\xi ,\alpha ,\gamma ) = - \sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}} = 0$$$$\frac{\partial }{{\partial {\xi _i}}}L(w,b,\xi ,\alpha ,\gamma ) = C - {\alpha _i} - {\gamma _i} = 0$$Substituting these back into the Lagrangian, we get the dual form:
$$\mathop {\max }\limits_\alpha \;W(\alpha ) = \sum\limits_{i = 1}^m {{\alpha _i}} - \frac{1}{2}\sum\limits_{i,j = 1}^m {{y^{(i)}}{y^{(j)}}{\alpha _i}{\alpha _j}\left\langle {{x^{(i)}},{x^{(j)}}} \right\rangle } $$$$s.t.\quad 0 \le {\alpha _i} \le C,\;i = 1,...,m$$$$\sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}} = 0$$
Chapter 4: Sequential minimal optimization¶
Sequential minimal optimization, or SMO, consists of two processes:
an analytic method to solve the QP subproblem for two Lagrange multipliers
a heuristic way to choose which multipliers to optimize
Solution to the two-Lagrange-multiplier QP problem¶
We use the coordinate ascent method to solve the optimization problem. The simple idea behind coordinate ascent is: at each step, optimize the objective with respect to only one variable, keeping the remaining variables fixed. This converges here because each iteration never decreases the objective. To converge faster, a heuristic is used to choose the variable that yields the largest increase in the objective.
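The idea can be seen on a tiny concave quadratic of my own choosing, where each coordinate update has a closed form (exactly the situation SMO exploits):

```python
# Toy concave objective: W(a1, a2) = -(a1^2 + a2^2 + a1*a2) + a1 + 2*a2
W = lambda a1, a2: -(a1 ** 2 + a2 ** 2 + a1 * a2) + a1 + 2 * a2

a1, a2 = 0.0, 0.0
for _ in range(50):
    a1 = (1 - a2) / 2   # exact argmax over a1 with a2 fixed (set dW/da1 = 0)
    a2 = (2 - a1) / 2   # exact argmax over a2 with a1 fixed

# Alternating updates converge to the global maximum (0, 1), where W = 1
assert abs(a1) < 1e-8 and abs(a2 - 1.0) < 1e-8
assert abs(W(a1, a2) - 1.0) < 1e-8
```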
We use coordinate ascent to solve the dual problem for two reasons:
this is a QP problem, which has a global maximum
for each iteration, the optimization is done quite efficiently
Due to the constraint $\sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}} = 0$, one $\alpha _i$ cannot be updated with all the other $\alpha$'s fixed. So on each iteration we choose a pair of $\alpha$'s (below, $\alpha _1$ and $\alpha _2$) to update simultaneously. The constraint can be written:
$${\alpha _1}{y^{(1)}} + {\alpha _2}{y^{(2)}} = - \sum\limits_{i = 3}^m {{\alpha _i}{y^{(i)}}} = \varsigma $$${\alpha _1}$ can then be expressed as:
$${\alpha _1} = {y^{(1)}}\varsigma - {\alpha _2}{y^{(1)}}{y^{(2)}}$$Thus we can write the objective in terms of ${\alpha _2}$ alone. For convenience, we first introduce some notation.
$$f(x) = \sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}K({x^{(i)}},x)} + b$$$${E_i} = f({x^{(i)}}) - {y^{(i)}}$$$${v_i} = \sum\limits_{j = 3}^m {{\alpha _j}{y^{(j)}}{K_{ij}}} = f({x^{(i)}}) - \sum\limits_{j = 1}^2 {{\alpha _j}{y^{(j)}}{K_{ij}}} - b$$$$\eta = 2{K_{12}} - {K_{11}} - {K_{22}}$$The dual objective can then be written:
$$W({\alpha _2}) = \sum\limits_{i = 1}^m {{\alpha _i}} - \frac{1}{2}\sum\limits_{i,j = 1}^m {{y^{(i)}}{y^{(j)}}{\alpha _i}{\alpha _j}{K_{ij}}} $$$$ = {\alpha _1} + {\alpha _2} - \frac{1}{2}{K_{11}}\alpha _1^2 - \frac{1}{2}{K_{22}}\alpha _2^2 - {y^{(1)}}{y^{(2)}}{K_{12}}{\alpha _1}{\alpha _2} - {y^{(1)}}{\alpha _1}\sum\limits_{i = 3}^m {{\alpha _i}{y^{(i)}}{K_{i1}}} - {y^{(2)}}{\alpha _2}\sum\limits_{i = 3}^m {{\alpha _i}{y^{(i)}}{K_{i2}}} + c$$$$ = {\alpha _1} + {\alpha _2} - \frac{1}{2}{K_{11}}\alpha _1^2 - \frac{1}{2}{K_{22}}\alpha _2^2 - {y^{(1)}}{y^{(2)}}{K_{12}}{\alpha _1}{\alpha _2} - {y^{(1)}}{\alpha _1}{v_1} - {y^{(2)}}{\alpha _2}{v_2}$$$$ = {y^{(1)}}\varsigma - {\alpha _2}{y^{(1)}}{y^{(2)}} + {\alpha _2} - \frac{1}{2}{K_{11}}{(\varsigma - {\alpha _2}{y^{(2)}})^2} - \frac{1}{2}{K_{22}}\alpha _2^2 - {y^{(2)}}{K_{12}}(\varsigma - {\alpha _2}{y^{(2)}}){\alpha _2} - {v_1}(\varsigma - {\alpha _2}{y^{(2)}}) - {y^{(2)}}{\alpha _2}{v_2}$$$$ = \frac{1}{2}\eta \alpha _2^2 + ({y^{(2)}}(E_1^{old} - E_2^{old}) - \eta \alpha _2^{old}){\alpha _2} + const.$$The first and second derivatives are:
$$\frac{{dW({\alpha _2})}}{{d{\alpha _2}}} = \eta {\alpha _2} + ({y^{(2)}}(E_1^{old} - E_2^{old}) - \eta \alpha _2^{old})$$
$$\frac{{{d^2}W({\alpha _2})}}{{d\alpha _2^2}} = \eta $$Notice that $\eta \le 0$.
Proof: $\eta = 2{K_{12}} - {K_{11}} - {K_{22}} = 2\phi {({x^{(1)}})^T}\phi ({x^{(2)}}) - \phi {({x^{(1)}})^T}\phi ({x^{(1)}}) - \phi {({x^{(2)}})^T}\phi ({x^{(2)}}) = - ||\phi ({x^{(1)}}) - \phi ({x^{(2)}})||_2^2 \le 0$
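The sign of $\eta$ is easy to verify numerically; a quick randomized check with a linear kernel (my own test harness, not part of SMO itself):

```python
import numpy as np

rng = np.random.default_rng(0)
K = lambda u, v: np.dot(u, v)        # linear kernel; any valid kernel works

for _ in range(1000):
    x1, x2 = rng.normal(size=5), rng.normal(size=5)
    eta = 2 * K(x1, x2) - K(x1, x1) - K(x2, x2)
    assert eta <= 1e-12              # eta = -||x1 - x2||^2 <= 0
```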
If $\eta < 0$, the function is concave in $\alpha_2$ and attains its maximum where the first derivative equals $0$:
$$\frac{{dW({\alpha _2})}}{{d{\alpha _2}}} = \eta {\alpha _2} + ({y^{(2)}}(E_1^{old} - E_2^{old}) - \eta \alpha _2^{old}) = 0$$$${\alpha _2} = \alpha _2^{old} - \frac{{{y^{(2)}}(E_1^{old} - E_2^{old})}}{\eta }$$If $\eta = 0$, the function is linear in $\alpha_2$ and attains its maximum at one of the endpoints. The endpoints (i.e. the $\alpha$ constraints) are introduced in the next paragraph; we simply compare which endpoint gives the larger objective.
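The unconstrained update is just a one-dimensional Newton step; a sketch with made-up numbers (the clipping to the feasible range comes next):

```python
def alpha2_unclipped(a2_old, E1, E2, y2, eta):
    """Unconstrained maximizer of W(alpha_2), valid when eta < 0."""
    return a2_old - y2 * (E1 - E2) / eta

# Made-up values just to exercise the formula
a2_new = alpha2_unclipped(a2_old=0.4, E1=0.5, E2=-0.3, y2=1.0, eta=-2.0)
assert abs(a2_new - 0.8) < 1e-12   # 0.4 - 0.8 / (-2.0) = 0.8
```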
On the other hand, since this is a constrained optimization problem, the change of the $\alpha$'s must obey the following constraints:
$$0 \le {\alpha _2} \le C$$$$0 \le {\alpha _1} \le C$$$${\alpha _1}{y^{(1)}} + {\alpha _2}{y^{(2)}} = \varsigma $$Due to the linear relationship between $\alpha _1$ and $\alpha _2$, the pair is restricted to a diagonal line segment inside the square box $[0,C] \times [0,C]$.
if $y_1 = y_2$, $L \le \alpha _2^{new} \le H$ where $L = \max (0,\alpha _1^{old} + \alpha _2^{old} - C)$ and $H = \min (C,\alpha _1^{old} + \alpha _2^{old})$
if $y_1 \ne y_2$, $L \le \alpha _2^{new} \le H$ where $L = \max (0,\alpha _2^{old} - \alpha _1^{old})$ and $H = \min (C,C + \alpha _2^{old} - \alpha _1^{old})$
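The two cases translate directly into code; a small helper of my own (the names `a1`, `a2` are the old multiplier values):

```python
def bounds(a1, a2, y1, y2, C):
    """Endpoints [L, H] of the feasible segment for alpha_2."""
    if y1 == y2:
        return max(0.0, a1 + a2 - C), min(C, a1 + a2)
    return max(0.0, a2 - a1), min(C, C + a2 - a1)

L, H = bounds(0.25, 0.5, 1, -1, C=1.0)
assert (L, H) == (0.25, 1.0)                      # y1 != y2 case
assert bounds(0.25, 0.5, 1, 1, C=1.0) == (0.0, 0.75)  # y1 == y2 case
```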
We need to clip $\alpha _2$ to this range; the clipped value is the $\alpha _2$ we want.
$$\alpha _2^{new} = \left\{ {\begin{array}{*{20}{c}}{H,\quad \alpha _2^{new,unc} > H}\\{\alpha _2^{new,unc},\quad L \le \alpha _2^{new,unc} \le H}\\{L,\quad \alpha _2^{new,unc} < L}\end{array}} \right.$$Then we can update $\alpha _1$ using $\alpha _2^{new}$:
$$\alpha _1^{new} = \alpha _1^{old} + {y^{(1)}}{y^{(2)}}(\alpha _2^{old} - \alpha _2^{new})$$
Heuristic way of choosing two $\alpha$'s¶
According to Osuna's theorem, if at least one of the two Lagrange multipliers violates the KKT conditions, then each optimization step increases the objective function, so convergence is guaranteed. For faster convergence, Platt introduced a heuristic way to pick the two $\alpha$'s. It consists of two steps: first choose $\alpha _2$, then choose $\alpha _1$.
Step 1. Choose an $\alpha _2$ that violates the KKT conditions. This is done by alternating between two kinds of passes.
Pass 1: one scan over all the $\alpha$'s to find violating $\alpha$'s. Then go to pass 2.
Pass 2: repeatedly scan over the non-border $\alpha$'s to find violating $\alpha$'s, until all non-border $\alpha$'s obey the KKT conditions within $\epsilon$. Then go back to pass 1.
The algorithm terminates when a scan over all the $\alpha$'s finds that every $\alpha$ obeys the KKT conditions within $\epsilon$.
Clearly, this heuristic concentrates on the non-border $\alpha$'s, which are the most likely to violate the KKT conditions. During optimization, the border $\alpha$'s tend to stay at the border, while the non-border ones move as the other $\alpha$'s are optimized. So we spend most of the time scanning and adjusting the non-border $\alpha$'s. After all the non-border $\alpha$'s become self-consistent, we scan over the entire vector to search for border $\alpha$'s that became KKT-violated while the non-border $\alpha$'s were being adjusted. $\epsilon$ is typically set to $10^{-3}$; it doesn't need to be any more accurate.
Question: what are the KKT conditions here? Consider the primal problem, which minimizes the objective function under multiple inequality constraints. The solution to the primal problem must obey these KKT conditions:
The gradient of $L(w,b,\xi ,\alpha ,\gamma )$ with respect to the primal variables $w$, $b$, $\xi$ vanishes. (This is implicitly always satisfied in the dual problem.) The formula below is derived from this condition and will be used shortly:
$$C - {\alpha _i} - {\gamma _i} = 0,\;i = 1,...,m$$
The KKT dual complementarity conditions, the primal constraints, and the ${\alpha _i}$, ${\gamma _i}$ constraints must be satisfied. That is:
$${\alpha _i}( - {y^{(i)}}({w^T}{x^{(i)}} + b) + 1 - {\xi _i}) = 0,\;i = 1,...,m$$$${\gamma _i}{\xi _i} = 0,\;i = 1,...,m$$$${g_i} = - {y^{(i)}}({w^T}{x^{(i)}} + b) + 1 - {\xi _i} \le 0,\;i = 1,...,m$$$${h_i} = - {\xi _i} \le 0,\;i = 1,...,m$$$${\alpha _i} \ge 0,\;i = 1,...,m$$$${\gamma _i} \ge 0,\;i = 1,...,m$$
Notice: the KKT dual complementarity conditions cover two situations: the optimum lies on a constraint boundary, or strictly inside the feasible region. When the optimum lies on the boundary of ${g_i}$ or ${h_i}$, the corresponding ${\alpha _i}$ or ${\gamma _i}$ may be non-zero, because that constraint actively shapes the optimum. When the optimum lies strictly inside a constraint, the corresponding ${\alpha _i}$ or ${\gamma _i}$ must be zero, because the optimum doesn't depend on that constraint. From $C - {\alpha _i} - {\gamma _i} = 0$, ${\alpha _i} \ge 0$ and ${\gamma _i} \ge 0$, we get $0 \le {\alpha _i} \le C$.
If ${\alpha _i} = 0$, then ${\gamma _i}=C$, so ${\xi _i}=0$, and we get ${y^{(i)}}({w^T}{x^{(i)}} + b) \ge 1$.
If $0 < {\alpha _i} < C$, then ${\xi _i}=0$, and we have ${y^{(i)}}({w^T}{x^{(i)}} + b) = 1$.
If ${\alpha _i} = C$, then $ - {y^{(i)}}({w^T}{x^{(i)}} + b) + 1 - {\xi _i} = 0$ and ${\xi _i} \ge 0$, so we have ${y^{(i)}}({w^T}{x^{(i)}} + b) \le 1$.
These are the KKT conditions for this problem. So the KKT conditions are violated if either of the following holds:
${\alpha _i} < C$ and ${y^{(i)}}({w^T}{x^{(i)}} + b) < 1$
${\alpha _i} > 0$ and ${y^{(i)}}({w^T}{x^{(i)}} + b) > 1$
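The two violation cases translate into a small predicate; a sketch of my own, where `margin_i` denotes $y^{(i)}({w^T}{x^{(i)}} + b)$ and `eps` is the tolerance $\epsilon$:

```python
def violates_kkt(alpha_i, margin_i, C, eps=1e-3):
    """True if this multiplier violates the KKT conditions within eps."""
    return (alpha_i < C and margin_i < 1 - eps) or \
           (alpha_i > 0 and margin_i > 1 + eps)

assert violates_kkt(0.0, 0.7, C=1.0)        # alpha < C but point inside the margin
assert violates_kkt(1.0, 1.5, C=1.0)        # alpha > 0 but point beyond the margin
assert not violates_kkt(0.5, 1.0, C=1.0)    # 0 < alpha < C, exactly on the margin
```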
Step 2. Choosing $\alpha _1$ to maximize the step taken during joint optimization
The step size is determined by $\Delta {\alpha _2} = {\alpha _2} - \alpha _2^{old} = - \frac{{{y^{(2)}}(E_1^{old} - E_2^{old})}}{\eta }$. Since evaluating the kernel function $K$ (needed for $\eta$) is time-consuming, SMO approximates the step size by $|E_1^{old} - E_2^{old}|$. Step 2 then proceeds as follows:
SMO looks for a non-border $\alpha _1$ that maximizes $|E_1^{old} - E_2^{old}|$.
If this does not make enough progress, it starts a sequential scan through the non-border examples, starting at a random position.
If this fails too, it starts a sequential scan through all the examples, also starting at a random position.
Because this algorithm spends most of the time adjusting the non-border $\alpha$'s, the corresponding $E_i$'s are cached.
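The first of the three attempts above can be sketched with the cached errors (names of my own; `i2` is the index already chosen in step 1):

```python
import numpy as np

def pick_second(i2, E_cache, non_border):
    """Among cached non-border errors, pick i1 maximizing |E_i1 - E_i2|."""
    candidates = [i for i in non_border if i != i2]
    return max(candidates, key=lambda i: abs(E_cache[i] - E_cache[i2]))

E = np.array([0.1, -0.8, 0.3, 0.05, 0.6])
assert pick_second(0, E, non_border=[0, 1, 3, 4]) == 1   # |-0.8 - 0.1| is largest
```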
After a successful optimization step, we need to update $b$ and the non-border $E$ cache.
If $0 < {\alpha _2} < C$, the KKT conditions give $\sum\limits_{i = 1}^m {{\alpha _i}{y^{(i)}}{K_{i2}}} + b = {y^{(2)}}$. So
$$b_2^{new} = {y^{(2)}} - \sum\limits_{i = 1}^m {\alpha _i^{new}{y^{(i)}}{K_{i2}}} $$$$ = {y^{(2)}} - \sum\limits_{i = 3}^m {{\alpha _i}{y^{(i)}}{K_{i2}}} - \alpha _1^{new}{y^{(1)}}{K_{12}} - \alpha _2^{new}{y^{(2)}}{K_{22}}$$$$ = - {E_2} - {y^{(2)}}{K_{22}}(\alpha _2^{new} - \alpha _2^{old}) - {y^{(1)}}{K_{12}}(\alpha _1^{new} - \alpha _1^{old}) + {b^{old}}$$
Similarly for $b_1^{new}$, if $0 < {\alpha _1} < C$:
$$b_1^{new} = - {E_1} - {y^{(1)}}{K_{11}}(\alpha _1^{new} - \alpha _1^{old}) - {y^{(2)}}{K_{12}}(\alpha _2^{new} - \alpha _2^{old}) + {b^{old}}$$
$b^{new}$ is then chosen as follows:
If $0 < {\alpha _2} < C$, $b^{new} = b_2^{new}$;
Else if $0 < {\alpha _1} < C$, $b^{new} = b_1^{new}$;
Else $b^{new} = (b_1^{new} + b_2^{new}) / 2$.
For each non-border $\alpha _i$, the corresponding $E_i$ is recalculated:
$$E_i^{new} = \sum\limits_{j = 1}^m {{\alpha _j}{y^{(j)}}{K_{ij}}} + {b^{new}} - {y^{(i)}}$$
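Putting the pieces together, here is a minimal trainer in the spirit of the above. It is a simplified sketch, not Platt's full algorithm: linear kernel only, the second multiplier is chosen at random instead of by the $|E_1 - E_2|$ heuristic, and there is no error cache; all names and the toy dataset are my own.

```python
import numpy as np

def smo_train(X, y, C=1.0, tol=1e-3, max_passes=20, seed=0):
    """Simplified SMO for a linear-kernel SVM. Returns (w, b, alpha)."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    K = X @ X.T                           # Gram matrix of the linear kernel
    alpha, b, passes = np.zeros(m), 0.0, 0
    while passes < max_passes:
        changed = 0
        for i in range(m):
            Ei = (alpha * y) @ K[:, i] + b - y[i]
            # KKT-violation test from the chapter (within tolerance tol)
            if (y[i] * Ei < -tol and alpha[i] < C) or (y[i] * Ei > tol and alpha[i] > 0):
                j = rng.integers(m - 1)
                if j >= i:
                    j += 1                # pick j != i uniformly at random
                Ej = (alpha * y) @ K[:, j] + b - y[j]
                ai_old, aj_old = alpha[i], alpha[j]
                if y[i] == y[j]:          # feasible segment [L, H] for alpha_j
                    L, H = max(0.0, ai_old + aj_old - C), min(C, ai_old + aj_old)
                else:
                    L, H = max(0.0, aj_old - ai_old), min(C, C + aj_old - ai_old)
                eta = 2 * K[i, j] - K[i, i] - K[j, j]
                if L == H or eta >= 0:    # skip degenerate pairs
                    continue
                aj = np.clip(aj_old - y[j] * (Ei - Ej) / eta, L, H)
                if abs(aj - aj_old) < 1e-5:
                    continue
                ai = ai_old + y[i] * y[j] * (aj_old - aj)
                alpha[i], alpha[j] = ai, aj
                # Threshold update, as derived above
                b1 = b - Ei - y[i] * K[i, i] * (ai - ai_old) - y[j] * K[i, j] * (aj - aj_old)
                b2 = b - Ej - y[i] * K[i, j] * (ai - ai_old) - y[j] * K[j, j] * (aj - aj_old)
                b = b1 if 0 < ai < C else (b2 if 0 < aj < C else (b1 + b2) / 2)
                changed += 1
        passes = passes + 1 if changed == 0 else 0
    w = (alpha * y) @ X                   # recover w = sum_i alpha_i y_i x_i
    return w, b, alpha

# Toy linearly separable data (made up for this sketch)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = smo_train(X, y, C=10.0)
assert np.all(np.sign(X @ w + b) == y)    # all training points classified correctly
```

Note that the equality constraint $\sum_i \alpha_i y^{(i)} = 0$ is preserved exactly by every pairwise update, since it holds at the zero initialization and the $\alpha_1$ update compensates the $\alpha_2$ change.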