Perceptron and Online Perceptron
Daniel Hsu (COMS 4771)
Margins
Let S be a collection of labeled examples from Rd × {−1, +1}. We say S is linearly separable if there exists
w ∈ Rd such that
min_{(x,y)∈S} y〈w, x〉 > 0,
and we call w a linear separator for S.
The (minimum) margin of a linear separator w for S is the minimum distance from x to the hyperplane
orthogonal to w, among all (x, y) ∈ S. Note that this notion of margin is invariant to positive scaling of w. If
we rescale w so that
min_{(x,y)∈S} y〈w, x〉 = 1,
then this minimum distance is 1/‖w‖₂. Therefore, the linear separator with the largest minimum margin is
described by the following mathematical optimization problem:
min_{w∈Rd} (1/2)‖w‖₂²   s.t.   y〈w, x〉 ≥ 1 for all (x, y) ∈ S.
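As a concrete check of these definitions, here is a minimal numpy sketch (the function and variable names are illustrative, not from these notes) that computes the minimum margin min_{(x,y)∈S} y〈w, x〉/‖w‖₂ of a candidate w; the result is positive exactly when w is a linear separator for S.

    import numpy as np

    def minimum_margin(w, X, y):
        # Minimum of y_i <w, x_i> / ||w||_2 over all examples: the distance
        # from the closest example to the hyperplane orthogonal to w.
        # Positive if and only if w linearly separates the examples.
        scores = y * (X @ w)                # y_i <w, x_i> for each example i
        return scores.min() / np.linalg.norm(w)

    # Two points on the real line, separated by w = (1):
    X = np.array([[2.0], [-3.0]])
    y = np.array([1.0, -1.0])
    print(minimum_margin(np.array([1.0]), X, y))   # prints 2.0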
Perceptron algorithm
The Perceptron algorithm is given as follows; a minimal Python sketch appears after the description. The
input to the algorithm is a collection S of labeled examples from Rd × {−1, +1}.
• Begin with ŵ1 := 0 ∈ Rd.
• For t = 1, 2, . . .:
– If there is a labeled example in S (call it (xt, yt)) such that yt〈ŵt, xt〉 ≤ 0, then set ŵt+1 := ŵt + ytxt.
– Else, return ŵt.
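Here is a minimal Python sketch of this procedure, assuming the examples are the rows of a numpy array. The max_iters cap is an addition for safety: Perceptron as stated never terminates when S is not linearly separable.

    import numpy as np

    def perceptron(X, y, max_iters=100_000):
        # X: (n, d) array of examples; y: (n,) array of labels in {-1, +1}.
        n, d = X.shape
        w = np.zeros(d)                          # w_hat_1 := 0
        for _ in range(max_iters):
            violated = np.flatnonzero(y * (X @ w) <= 0)
            if violated.size == 0:
                return w                         # no violated example: return w_hat_t
            i = violated[0]                      # any violated example will do
            w = w + y[i] * X[i]                  # w_hat_{t+1} := w_hat_t + y_t x_t
        raise RuntimeError("no separator found within max_iters iterations")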
Theorem. Let S be a collection of labeled examples from Rd × {−1, +1}. Suppose there exists a vector
w⋆ ∈ Rd such that
min_{(x,y)∈S} y〈w⋆, x〉 = 1.
Then Perceptron on input S halts after at most ‖w⋆‖₂²L² loop iterations, where L := max_{(x,y)∈S} ‖x‖₂.
Proof. Suppose Perceptron does not exit the loop in the t-th iteration. Then there is a labeled example
(xt, yt) ∈ S such that
yt〈w⋆, xt〉 ≥ 1,
yt〈ŵt, xt〉 ≤ 0.
We bound 〈w⋆, ŵt+1〉 from above and below to deduce a bound on the number of loop iterations. First, we
bound 〈w⋆, ŵt+1〉 from below:
〈w⋆, ŵt+1〉 = 〈w⋆, ŵt〉 + yt〈w⋆, xt〉 ≥ 〈w⋆, ŵt〉 + 1.
Since ŵ1 = 0, we have
〈w⋆, ŵt+1〉 ≥ t.
We now bound 〈w⋆, ŵt+1〉 from above. By Cauchy-Schwarz,
〈w⋆, ŵt+1〉 ≤ ‖w⋆‖₂‖ŵt+1‖₂.
Also,
‖ŵt+1‖₂² = ‖ŵt‖₂² + 2yt〈ŵt, xt〉 + yt²‖xt‖₂² ≤ ‖ŵt‖₂² + L²,
since yt〈ŵt, xt〉 ≤ 0, yt² = 1, and ‖xt‖₂ ≤ L.
Since ‖ŵ1‖₂ = 0, we have
‖ŵt+1‖₂² ≤ L²t,
so
〈w⋆, ŵt+1〉 ≤ ‖w⋆‖₂L√t.
Combining the upper and lower bounds on 〈w⋆, ŵt+1〉 shows that
t ≤ 〈w⋆, ŵt+1〉 ≤ ‖w⋆‖₂L√t,
which in turn implies the inequality t ≤ ‖w⋆‖₂²L².
Online Perceptron algorithm
The Online Perceptron algorithm is given as follows; a Python sketch appears after the description. The
input to the algorithm is a sequence (x1, y1), (x2, y2), . . . of labeled examples from Rd × {−1, +1}.
• Begin with ŵ1 := 0 ∈ Rd.
• For t = 1, 2, . . .:
– If yt〈ŵt, xt〉 ≤ 0, then set ŵt+1 := ŵt + ytxt.
– Else, ŵt+1 := ŵt.
We say that Online Perceptron makes a mistake in round t if yt〈ŵt, xt〉 ≤ 0.
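Here is a minimal Python sketch, assuming the examples arrive as an iterable of (x, y) pairs with x a numpy vector of known dimension d (the names are illustrative).

    import numpy as np

    def online_perceptron(examples, d):
        # examples: iterable of (x, y), x a length-d vector, y in {-1, +1}.
        # Returns the final weight vector and the number of mistakes made.
        w, mistakes = np.zeros(d), 0             # w_hat_1 := 0
        for x, y in examples:
            if y * (w @ x) <= 0:                 # a mistake in this round
                w = w + y * x                    # w_hat_{t+1} := w_hat_t + y_t x_t
                mistakes += 1
            # otherwise w_hat_{t+1} := w_hat_t
        return w, mistakes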
Theorem. Let (x1, y1), (x2, y2), . . . be a sequence of labeled examples from Rd × {−1, +1} such that there
exists a vector w⋆ ∈ Rd satisfying
min_{t=1,2,…} yt〈w⋆, xt〉 = 1.
Then Online Perceptron on input (x1, y1), (x2, y2), . . . makes at most ‖w⋆‖₂²L² mistakes, where L :=
max_{t=1,2,…} ‖xt‖₂.
Proof. The proof of this theorem is essentially the same as the proof of the iteration bound for Perceptron.
Online Perceptron may be applied to a collection of labeled examples S by considering the labeled examples
in S in any (e.g., random) order. If S is linearly separable, then the number of mistakes made by Online
Perceptron can be bounded using the theorem.
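For example, using the online_perceptron sketch above on a linearly separable collection presented in a random order (the data here are synthetic and purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    w_star = rng.normal(size=5)
    y = np.sign(X @ w_star)                      # linearly separable by construction

    order = rng.permutation(len(X))              # any (e.g., random) order over S
    w_hat, mistakes = online_perceptron(((X[i], y[i]) for i in order), d=5)
    print(mistakes)    # at most ||w*||_2^2 L^2 once w* is rescaled so that
                       # the minimum margin over S equals 1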
However, Online Perceptron is also useful when S is not linearly separable. This is especially notable in
comparison to Perceptron, which never terminates if S is not linearly separable.
Theorem. Let (x1, y1), (x2, y2), . . . be a sequence of labeled examples from Rd × {−1, +1}. Online Perceptron
on input (x1, y1), (x2, y2), . . . makes at most
min_{w⋆∈Rd} { ‖w⋆‖₂²L² + ‖w⋆‖₂L √(∑_{t∈M} ℓ(〈w⋆, xt〉, yt)) + ∑_{t∈M} ℓ(〈w⋆, xt〉, yt) }
mistakes, where L := max_{t=1,2,…} ‖xt‖₂, M is the set of rounds on which Online Perceptron makes a mistake,
and ℓ(ŷ, y) := [1 − ŷy]₊ = max{0, 1 − ŷy} is the hinge loss of ŷ when y is the correct label.
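To make the quantities in this bound concrete, here is a small sketch (hinge_loss and mistake_bound are hypothetical helper names) that evaluates the right-hand side for a candidate w⋆, given the examples and the set M of mistake rounds:

    import numpy as np

    def hinge_loss(y_hat, y):
        # Hinge loss [1 - y_hat * y]_+ of the score y_hat against label y.
        return max(0.0, 1.0 - y_hat * y)

    def mistake_bound(w_star, examples, mistake_rounds):
        # Evaluates ||w*||_2^2 L^2 + ||w*||_2 L sqrt(H) + H, where H is the
        # total hinge loss of w_star over the mistake rounds (0-indexed).
        R = np.linalg.norm(w_star)
        L = max(np.linalg.norm(x) for x, _ in examples)   # L := max_t ||x_t||_2
        H = sum(hinge_loss(w_star @ examples[t][0], examples[t][1])
                for t in mistake_rounds)
        return R**2 * L**2 + R * L * np.sqrt(H) + H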
Proof. Fix any w⋆ ∈ Rd. Consider any round t in which Online Perceptron makes a mistake. Let
Mt := |{1, . . . , t} ∩ M| denote the number of mistakes made by Online Perceptron through the first t
rounds. We will bound 〈w⋆, ŵt+1〉 from above and below to deduce a bound on Mt. First we bound
〈w⋆, ŵt+1〉 from above. By Cauchy-Schwarz,
〈w⋆, ŵt+1〉 ≤ ‖w⋆‖₂‖ŵt+1‖₂.
Moreover,
‖ŵt+1‖₂² = ‖ŵt‖₂² + 2yt〈ŵt, xt〉 + yt²‖xt‖₂² ≤ ‖ŵt‖₂² + L².
Since ŵ1 = 0, we have
‖ŵt+1‖₂² ≤ L²Mt,
and thus
〈w⋆, ŵt+1〉 ≤ ‖w⋆‖₂L√Mt.
We now bound 〈w⋆, ŵt+1〉 from below:
〈w⋆, ŵt+1〉 = 〈w⋆, ŵt〉 + 1 − (1 − yt〈w⋆, xt〉)
≥ 〈w⋆, ŵt〉 + 1 − [1 − yt〈w⋆, xt〉]₊
= 〈w⋆, ŵt〉 + 1 − ℓ(〈w⋆, xt〉, yt).
Since ŵ1 = 0,
〈w⋆, ŵt+1〉 ≥ Mt − Ht,
where
Ht := ∑_{s∈{1,...,t}∩M} ℓ(〈w⋆, xs〉, ys).
Combining the upper and lower bounds on 〈w?, ŵt+1〉 shows that
Mt − Ht ≤ 〈w⋆, ŵt+1〉 ≤ ‖w⋆‖₂L√Mt,
i.e.,
Mt − ‖w⋆‖₂L√Mt − Ht ≤ 0.
This inequality is quadratic in √Mt. By solving it,¹ we deduce the bound
Mt ≤ (1/2)‖w⋆‖₂²L² + (1/2)‖w⋆‖₂L √(‖w⋆‖₂²L² + 4Ht) + Ht,
which can be further loosened to the following (slightly more interpretable) bound:
Mt ≤ ‖w⋆‖₂²L² + ‖w⋆‖₂L√Ht + Ht.
The claim follows.
¹The inequality is of the form x² − bx − c ≤ 0 for some non-negative b and c. This implies that x ≤ (b + √(b² + 4c))/2, and
hence x² ≤ (b² + 2b√(b² + 4c) + b² + 4c)/4. We can then use the fact that √(A + B) ≤ √A + √B for any non-negative A and B to
deduce x² ≤ b² + b√c + c.