Margins, Perceptron, and Online Perceptron
Daniel Hsu (COMS 4771)
Let S be a collection of labeled examples from Rd × {−1, +1}. We say S is linearly separable if there exists w ∈ Rd such that
min_{(x,y)∈S} y⟨w, x⟩ > 0,
and we call w a linear separator for S.
The (minimum) margin of a linear separator w for S is the minimum distance from x to the hyperplane through the origin orthogonal to w, among all (x, y) ∈ S. Note that this notion of margin is invariant to positive scaling of w. If we rescale w so that
min_{(x,y)∈S} y⟨w, x⟩ = 1,
then this minimum distance is 1/∥w∥₂. Therefore, the linear separator with the largest minimum margin is described by the following mathematical optimization problem:
min_{w∈Rd}  (1/2)∥w∥₂²
s.t.  y⟨w, x⟩ ≥ 1 for all (x, y) ∈ S.
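To make these definitions concrete, here is a minimal sketch in Python with NumPy (the function name margin and the toy dataset are ours, not part of the notes). It computes the minimum margin min_{(x,y)∈S} y⟨w, x⟩/∥w∥₂ of a separator w, and checks that the value is invariant to positive scaling of w:

```python
import numpy as np

def margin(S, w):
    """Minimum distance from the points in S to the hyperplane
    through the origin orthogonal to w (positive iff w separates S)."""
    return min(y * np.dot(w, x) for x, y in S) / np.linalg.norm(w)

# Toy dataset in R^2, linearly separated by w = (1, 0).
S = [(np.array([2.0, 1.0]), +1),
     (np.array([0.5, -3.0]), +1),
     (np.array([-1.0, 0.0]), -1)]
w = np.array([1.0, 0.0])

print(margin(S, w))          # 0.5, attained by the second example
print(margin(S, 10.0 * w))   # same value: margin is scale-invariant
```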
Perceptron algorithm
The Perceptron algorithm is given as follows. The input to the algorithm is a collection S of labeled examples from Rd × {−1, +1}.
• Begin with wˆ1 := 0 ∈ Rd.
• For t = 1, 2, …:
  – If there is a labeled example in S (call it (xt, yt)) such that yt⟨wˆt, xt⟩ ≤ 0, then set wˆt+1 := wˆt + ytxt.
  – Else, return wˆt.
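As a concrete illustration, the loop above can be transcribed directly into Python (a minimal sketch assuming NumPy arrays; the name perceptron is ours):

```python
import numpy as np

def perceptron(S):
    """Perceptron: while some example has y <w, x> <= 0, add y*x
    to the weights; return the weights once no such example remains."""
    d = len(S[0][0])
    w = np.zeros(d)                      # w-hat_1 := 0
    while True:
        for x, y in S:
            if y * np.dot(w, x) <= 0:    # a violated example
                w = w + y * x            # w-hat_{t+1} := w-hat_t + y*x
                break
        else:                            # no violated example: halt
            return w
```

By the theorem below, this loop terminates whenever S is linearly separable.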
Theorem. Let S be a collection of labeled examples from Rd × {−1, +1}. Suppose there exists a vector w⋆ ∈ Rd such that
min_{(x,y)∈S} y⟨w⋆, x⟩ = 1.
Then Perceptron on input S halts after at most ∥w⋆∥₂²L² loop iterations, where L := max_{(x,y)∈S} ∥x∥₂.
Proof. Suppose Perceptron does not exit the loop in the t-th iteration. Then there is a labeled example (xt, yt) ∈ S such that
yt⟨w⋆, xt⟩ ≥ 1, yt⟨wˆt, xt⟩ ≤ 0.
We bound ⟨w⋆, wˆt+1⟩ from above and below to deduce a bound on the number of loop iterations. First, we bound ⟨w⋆, wˆt+1⟩ from below:
⟨w⋆, wˆt+1⟩ = ⟨w⋆, wˆt⟩ + yt⟨w⋆, xt⟩ ≥ ⟨w⋆, wˆt⟩ + 1.
Since wˆ1 = 0, we have
⟨w⋆, wˆt+1⟩ ≥ t.
We now bound ⟨w⋆, wˆt+1⟩ from above. By Cauchy-Schwarz,
⟨w⋆, wˆt+1⟩ ≤ ∥w⋆∥₂∥wˆt+1∥₂.
Also, since yt⟨wˆt, xt⟩ ≤ 0 and ∥xt∥₂ ≤ L,
∥wˆt+1∥₂² = ∥wˆt∥₂² + 2yt⟨wˆt, xt⟩ + yt²∥xt∥₂² ≤ ∥wˆt∥₂² + L².
Since ∥wˆ1∥₂ = 0, we have
∥wˆt+1∥₂² ≤ L²t,
so
⟨w⋆, wˆt+1⟩ ≤ ∥w⋆∥₂L√t.
Combining the upper and lower bounds on ⟨w⋆, wˆt+1⟩ shows that
t ≤ ⟨w⋆, wˆt+1⟩ ≤ ∥w⋆∥₂L√t,
which in turn implies the inequality t ≤ ∥w⋆∥₂²L². ∎
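As a numerical sanity check of this iteration bound (our construction, not part of the notes), the following sketch builds a linearly separable sample rescaled so that min_{(x,y)∈S} y⟨w⋆, x⟩ = 1 for a known w⋆, runs the Perceptron loop, and compares the iteration count with ∥w⋆∥₂²L²:

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = rng.standard_normal(5)
X = rng.standard_normal((100, 5))
y = np.sign(X @ w_star)
X /= np.abs(X @ w_star).min()     # now min_t y_t <w_star, x_t> = 1

w, iterations = np.zeros(5), 0
while True:
    margins = y * (X @ w)
    if (margins > 0).all():       # no violated example: Perceptron halts
        break
    i = int(np.argmin(margins))   # pick a violated example
    w, iterations = w + y[i] * X[i], iterations + 1

L = np.linalg.norm(X, axis=1).max()
print(iterations, "<=", np.linalg.norm(w_star)**2 * L**2)
```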
Online Perceptron algorithm
The Online Perceptron algorithm is given as follows. The input to the algorithm is a sequence (x1, y1), (x2, y2), . . . of labeled examples from Rd × {−1, +1}.
• Begin with wˆ1 := 0 ∈ Rd.
• For t = 1, 2, …:
  – If yt⟨wˆt, xt⟩ ≤ 0, then set wˆt+1 := wˆt + ytxt.
  – Else, set wˆt+1 := wˆt.
We say that Online Perceptron makes a mistake in round t if yt⟨wˆt, xt⟩ ≤ 0.
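A minimal single-pass sketch in Python (the name online_perceptron is ours; NumPy arrays assumed) that also counts mistakes:

```python
import numpy as np

def online_perceptron(stream):
    """One pass over a sequence of (x, y) pairs, updating only on
    rounds where y <w-hat_t, x_t> <= 0; returns (weights, mistakes)."""
    w, mistakes = None, 0
    for x, y in stream:
        if w is None:
            w = np.zeros(len(x))        # w-hat_1 := 0
        if y * np.dot(w, x) <= 0:       # mistake in round t
            w = w + y * x               # w-hat_{t+1} := w-hat_t + y_t x_t
            mistakes += 1
        # else: w-hat_{t+1} := w-hat_t
    return w, mistakes
```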
Theorem. Let (x1, y1), (x2, y2), . . . be a sequence of labeled examples from Rd × {−1, +1} such that there exists a vector w⋆ ∈ Rd satisfying
min_{t=1,2,…} yt⟨w⋆, xt⟩ = 1.
Then Online Perceptron on input (x1, y1), (x2, y2), . . . makes at most ∥w⋆∥₂²L² mistakes, where L := max_{t=1,2,…} ∥xt∥₂.
Proof. The proof of this theorem is essentially the same as the proof of the iteration bound for Perceptron.
Online Perceptron may be applied to a collection of labeled examples S by considering the labeled examples in S in any (e.g., random) order. If S is linearly separable, then the number of mistakes made by Online Perceptron can be bounded using the theorem.
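For instance, using the (hypothetical) online_perceptron sketch above and a list S of labeled examples, one pass in random order is simply:

```python
import random

examples = list(S)         # S: any list of (x, y) pairs, e.g. the toy set above
random.shuffle(examples)   # consider the examples in random order
w, mistakes = online_perceptron(examples)
print("mistakes during one pass:", mistakes)
```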
However, Online Perceptron is also useful when S is not linearly separable. This is especially notable in comparison to Perceptron, which never terminates if S is not linearly separable.
Theorem. Let (x1, y1), (x2, y2), . . . be a sequence of labeled examples from Rd ×{−1, +1}. Online Perceptron on input (x1, y1), (x2, y2), . . . makes at most
min_{w⋆∈Rd} [ ∥w⋆∥₂²L² + ∥w⋆∥₂L √(Σ_{t∈M} ℓ(⟨w⋆, xt⟩, yt)) + Σ_{t∈M} ℓ(⟨w⋆, xt⟩, yt) ]
mistakes, where L := max_{t=1,2,…} ∥xt∥₂, M is the set of rounds on which Online Perceptron makes a mistake, and ℓ(ŷ, y) := [1 − ŷy]₊ = max{0, 1 − ŷy} is the hinge loss of ŷ when y is the correct label.
Proof. Fix any w⋆ ∈ Rd. Consider any round t in which Online Perceptron makes a mistake. Let Mt := {1, …, t} ∩ M and mt := |Mt|. We will bound ⟨w⋆, wˆt+1⟩ from above and below to deduce a bound on mt, the number of mistakes made by Online Perceptron through the first t rounds. First we bound ⟨w⋆, wˆt+1⟩ from above. By Cauchy-Schwarz,
⟨w⋆, wˆt+1⟩ ≤ ∥w⋆∥₂∥wˆt+1∥₂.
Moreover,
∥wˆt+1∥₂² = ∥wˆt∥₂² + 2yt⟨wˆt, xt⟩ + yt²∥xt∥₂² ≤ ∥wˆt∥₂² + L².
Since ∥wˆ1∥₂ = 0 and wˆ changes only in mistake rounds, we have
∥wˆt+1∥₂² ≤ L²mt,
and thus
⟨w⋆, wˆt+1⟩ ≤ ∥w⋆∥₂L√mt.
We now bound ⟨w⋆, wˆt+1⟩ from below:
⟨w⋆, wˆt+1⟩ = ⟨w⋆, wˆt⟩ + 1 − [1 − yt⟨w⋆, xt⟩]
            ≥ ⟨w⋆, wˆt⟩ + 1 − [1 − yt⟨w⋆, xt⟩]₊
            = ⟨w⋆, wˆt⟩ + 1 − ℓ(⟨w⋆, xt⟩, yt).
Since wˆ1 = 0,
⟨w⋆, wˆt+1⟩ ≥ mt − Ht,
where
Ht := Σ_{s∈Mt} ℓ(⟨w⋆, xs⟩, ys).
Combining the upper and lower bounds on ⟨w⋆, wˆt+1⟩ shows that
mt − Ht ≤ ⟨w⋆, wˆt+1⟩ ≤ ∥w⋆∥₂L√mt,
i.e.,
mt − ∥w⋆∥₂L√mt − Ht ≤ 0.
This inequality is quadratic in √mt. By solving it¹, we deduce the bound
mt ≤ (1/2)∥w⋆∥₂²L² + (1/2)∥w⋆∥₂L√(∥w⋆∥₂²L² + 4Ht) + Ht,
which can be further loosened to the following (slightly more interpretable) bound:
mt ≤ ∥w⋆∥₂²L² + ∥w⋆∥₂L√Ht + Ht.
The claim follows. ∎
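The following sketch (our construction) checks this bound numerically on a stream that is not linearly separable. Since the theorem takes a minimum over w⋆, plugging in any fixed w⋆ (here, the unit vector used to generate the labels before noise) must give a valid upper bound on the number of mistakes:

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = rng.standard_normal(5)
w_star /= np.linalg.norm(w_star)

# Noisy stream: 5% of labels flipped, so no linear separator in general.
stream = []
for _ in range(500):
    x = rng.standard_normal(5)
    y = float(np.sign(x @ w_star)) or 1.0   # tie-break a zero sign
    if rng.random() < 0.05:
        y = -y
    stream.append((x, y))

# Run Online Perceptron, recording the mistake rounds M.
w, M = np.zeros(5), []
for t, (x, y) in enumerate(stream):
    if y * np.dot(w, x) <= 0:
        w, M = w + y * x, M + [t]

L = max(np.linalg.norm(x) for x, _ in stream)
H = sum(max(0.0, 1.0 - stream[t][1] * np.dot(w_star, stream[t][0])) for t in M)
bound = np.linalg.norm(w_star)**2 * L**2 + np.linalg.norm(w_star) * L * np.sqrt(H) + H
print(len(M), "<=", bound)
```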
¹The inequality is of the form x² − bx − c ≤ 0 for some non-negative b and c. This implies that x ≤ (b + √(b² + 4c))/2, and hence x² ≤ (b² + 2b√(b² + 4c) + b² + 4c)/4. We can then use the fact that √(A + B) ≤ √A + √B for any non-negative A and B to deduce x² ≤ b² + b√c + c.
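A quick numerical spot-check of the footnote's algebra (illustrative only): the largest x with x² − bx − c ≤ 0 is x = (b + √(b² + 4c))/2, and for random non-negative b and c its square never exceeds b² + b√c + c:

```python
import math, random

for _ in range(10_000):
    b, c = random.uniform(0, 10), random.uniform(0, 10)
    x = (b + math.sqrt(b * b + 4 * c)) / 2   # largest feasible x
    assert x * x <= b * b + b * math.sqrt(c) + c + 1e-9
```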