Data Mining: Learning From Large Data Sets
Lecture 10: Online convex programming (continued)
Hamed Hassani
SGD for SVM
• Online SVM:
    min_w  Σ_{t=1}^T  max(0, 1 − y_t w^T x_t)
    subject to:  ||w||_2 ≤ 1/√λ
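A small sketch of evaluating this objective and its constraint for a given w, assuming NumPy; the names svm_objective, is_feasible, and lam are illustrative, not part of the slides.

import numpy as np

def svm_objective(w, X, Y):
    """Sum of hinge losses: sum_t max(0, 1 - y_t * w^T x_t)."""
    return np.maximum(0.0, 1.0 - Y * (X @ w)).sum()

def is_feasible(w, lam):
    """Constraint check: ||w||_2 <= 1 / sqrt(lam)."""
    return np.linalg.norm(w) <= 1.0 / np.sqrt(lam)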
SGD for SVM
• Projection:
    min_w  Σ_{t=1}^T  f_t(w)   s.t.  w ∈ S
    f_t(w) = max(0, 1 − y_t w^T x_t)
    S = {w : ||w||_2 ≤ 1/√λ}
• SGD scheme:
    w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t))
  First take the gradient step w_t − η_t ∇f_t(w_t), then project the result back onto S. In practice the (sub)gradient can be computed on a small random batch B of data points, i.e., the step becomes w_t − η_t ∇f_B(w_t). (A sketch of a single projected step follows below.)
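A minimal sketch of one projected (sub)gradient step, assuming NumPy; the hinge-loss subgradient used here is derived later in the lecture, and the function and argument names (sgd_svm_step, lam) are illustrative.

import numpy as np

def sgd_svm_step(w_t, x_t, y_t, eta_t, lam):
    """One projected (sub)gradient step for f_t(w) = max(0, 1 - y_t * w^T x_t)."""
    # (sub)gradient of the hinge loss at w_t
    grad = -y_t * x_t if y_t * np.dot(w_t, x_t) < 1 else np.zeros_like(w_t)
    w_next = w_t - eta_t * grad              # first: gradient step
    radius = 1.0 / np.sqrt(lam)              # then: project onto S = {w : ||w||_2 <= 1/sqrt(lam)}
    norm = np.linalg.norm(w_next)
    if norm > radius:
        w_next = w_next * (radius / norm)
    return w_next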
Projection
• The projection of a point y onto a set S is defined as
    Proj_S(y) = argmin_{y' ∈ S} ||y − y'||_2
• In our case S = {w : ||w||_2 ≤ 1/√λ}, so
    Proj_S(y) = y                    if ||y||_2 ≤ 1/√λ
    Proj_S(y) = y / (√λ ||y||_2)     if ||y||_2 > 1/√λ
  i.e., points inside the ball are left unchanged and points outside are rescaled onto its boundary.
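A small sketch of this projection, assuming NumPy; the name project_l2_ball and the example values are illustrative.

import numpy as np

def project_l2_ball(y, lam):
    """Project y onto S = {w : ||w||_2 <= 1/sqrt(lam)}."""
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(y)
    if norm <= radius:
        return y                       # already feasible: leave unchanged
    return y * (radius / norm)         # rescale onto the boundary of the ball

# With lam = 1 the feasible ball has radius 1, so a vector of norm 2
# is scaled down to norm 1:
project_l2_ball(np.array([2.0, 0.0]), lam=1.0)   # -> array([1., 0.])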
SGD for SVM
• Gradient:
    min_w  Σ_{t=1}^T  f_t(w)   s.t.  w ∈ S
    f_t(w) = max(0, 1 − y_t w^T x_t)
    S = {w : ||w||_2 ≤ 1/√λ}
• SGD scheme:
    w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t))
• Computing ∇f_t(w): write f_t = g ∘ h with g(x) = max(0, 1 − x) and h(w) = y_t w^T x_t, so that by the chain rule
    ∇f_t(w) = g'(h(w)) ∇h(w),   with  ∇h(w) = y_t x_t
    g'(x) = 0 if x > 1,   g'(x) = −1 if x < 1   (g is not differentiable at x = 1)
• Therefore:
    ∇f_t(w) = 0          if y_t w^T x_t > 1
    ∇f_t(w) = −y_t x_t    if y_t w^T x_t < 1
  At y_t w^T x_t = 1 the hinge loss is not differentiable; there we use a subgradient (e.g., 0).
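A minimal sketch of this case distinction, assuming NumPy; hinge_subgradient is an illustrative name.

import numpy as np

def hinge_subgradient(w, x, y):
    """(Sub)gradient of f(w) = max(0, 1 - y * w^T x) at w."""
    if y * np.dot(w, x) < 1:
        return -y * x               # hinge is active: gradient is -y * x
    return np.zeros_like(w)         # margin >= 1: 0 is a valid subgradient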
Subgradient
• We only require to linearize properly.
• Given a convex (not necessarily differentiable) function f, a vector g_x is called a subgradient of f at x if:
    ∀ x' ∈ S :  f(x') ≥ f(x) + g_x^T (x' − x)
  (figure: the linear function f(x) + g_x^T (x' − x) lies below the graph of f everywhere)
Subgradient for SVM
• Hinge loss:  f(w) = max(0, 1 − y w^T x)
• Subgradient:
    g = 0       if y w^T x ≥ 1
    g = −y x    if y w^T x < 1
  (A numerical check of the subgradient inequality follows below.)
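A quick numerical check, assuming NumPy, that this choice of g satisfies the subgradient inequality f(w') ≥ f(w) + g^T (w' − w) at randomly drawn points; all names here are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), 1.0      # a fixed data point

def hinge(w):
    return max(0.0, 1.0 - y * np.dot(w, x))

def subgrad(w):
    return -y * x if y * np.dot(w, x) < 1 else np.zeros_like(w)

for _ in range(1000):
    w, w_prime = rng.normal(size=5), rng.normal(size=5)
    g = subgrad(w)
    # subgradient inequality: f(w') >= f(w) + g^T (w' - w)
    assert hinge(w_prime) >= hinge(w) + np.dot(g, w_prime - w) - 1e-12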
SGD for SVM
• Initialize:
    w_0 ∈ S (e.g., w_0 = 0), so that ||w_0||_2 ≤ 1/√λ
    step size η_t, e.g., η_t = 1/√t or a small constant such as 0.01
• Each round do:
    – Receive a new batch of data points: (x_{i_1}, y_{i_1}), …
    – Compute the (sub)gradient of the batch loss at w_t
    – Update:  w_{t+1} = Proj_S(w_t − η_t ∇f_B(w_t))
  (A sketch of the full loop follows below.)
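A compact sketch of the whole procedure, assuming NumPy and synthetic toy data; the batch size, step-size schedule, and all names (sgd_svm, batch_size) are illustrative choices rather than values fixed by the slides.

import numpy as np

def sgd_svm(X, Y, lam=0.1, batch_size=10, T=1000):
    """Projected mini-batch SGD for the hinge-loss SVM objective."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)                                   # w_0 = 0 is feasible
    radius = 1.0 / np.sqrt(lam)
    for t in range(1, T + 1):
        eta = 1.0 / np.sqrt(t)                        # eta_t = 1/sqrt(t)
        idx = rng.integers(0, n, size=batch_size)     # random batch B
        Xb, Yb = X[idx], Y[idx]
        active = Yb * (Xb @ w) < 1                    # points with positive hinge loss
        grad = -(Yb[active][:, None] * Xb[active]).sum(axis=0)   # subgradient of the batch loss
        w = w - eta * grad                            # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                             # project onto ||w||_2 <= 1/sqrt(lam)
            w *= radius / norm
    return w

# Toy usage on synthetic, nearly separable data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
Y = np.where(X[:, 0] + 0.1 * rng.normal(size=200) > 0, 1.0, -1.0)
w_hat = sgd_svm(X, Y)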
Large-Scale supervised learning
• Recall that our goal was to discuss how to solve problems of the form:
    min_w  Σ_{i=1}^n  f_i(w)   s.t.  w ∈ S
  for convex problems very efficiently via:
    – Online convex programming
    – Stochastic convex optimization (we have access to all the data points, and in each iteration we choose a small random subset and update)
Online learning
• Data arrives sequentially: (x_1, y_1), (x_2, y_2), …
• Need to classify one data point at a time
• Use a different decision rule (linear classifier) each time
• Can't remember all the data points (streaming setting)
  sequential data → learning algorithm → sequential (online) decision rule
  (figure: data points arriving over time, with the hyperplanes w_1, w_2, w_3, … updated at each time step)
Online SVM optimisation
• Keep track of hyperplane parameters w_t online
• For each time t = 1, 2, ···, T:
    – New data point x_t arrives
    – Classify according to sign(w_t^T x_t)
    – Incur loss ℓ_t = max(0, 1 − y_t w_t^T x_t)
    – Update w_t based on (x_t, y_t)
• Goal: minimize the (cumulative) loss  Σ_{t=1}^T ℓ_t
  (One round of this protocol is sketched below.)
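A minimal sketch of one round of this protocol, assuming NumPy; the update rule is left abstract here (it is specified by online convex programming below), and all names are illustrative.

import numpy as np

def online_svm_round(w_t, x_t, y_t, update):
    """One round: classify, incur the hinge loss, then update w_t."""
    prediction = np.sign(np.dot(w_t, x_t))           # classify according to sign(w_t^T x_t)
    loss = max(0.0, 1.0 - y_t * np.dot(w_t, x_t))    # incur loss ell_t
    w_next = update(w_t, x_t, y_t)                   # update based on (x_t, y_t)
    return prediction, loss, w_next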
Generally: Online convex programming
• Input:
    – Feasible set S ⊆ R^D
    – Initial point w_0 ∈ S
• Each round t do:
    – Pick new feasible point w_t ∈ S
    – Receive convex function f_t : S → R
    – Incur loss ℓ_t = f_t(w_t)
• Goal: minimize the cumulative loss
• How do we evaluate performance?
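A generic skeleton of this interaction, assuming the convex functions f_t arrive as Python callables; pick_next is the algorithm-specific part (e.g., online gradient descent, introduced below), and all names are illustrative.

def online_convex_programming(w0, rounds, pick_next):
    """Generic OCP protocol: play w_t, receive f_t, incur loss f_t(w_t)."""
    w, losses = w0, []
    for f_t in rounds:            # rounds: an iterable of convex functions f_t
        losses.append(f_t(w))     # incur loss ell_t = f_t(w_t)
        w = pick_next(w, f_t)     # pick the next feasible point w_{t+1} in S
    return losses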
Regret
• We will have to pay a price as we do not process the whole data at once
• Loss in the online setting:  Σ_{t=1}^T f_t(w_t)
• Loss in the setting where we can process the whole data at once:  min_{w ∈ S} Σ_{t=1}^T f_t(w)
Special case: online SVM optimisation
• Keep track of hyperplane parameters w_t online
• For each time t = 1, 2, ···, T:
    – New data point x_t arrives
    – Classify according to sign(w_t^T x_t)
    – Incur loss ℓ_t = max(0, 1 − y_t w_t^T x_t)
    – Update w_t based on (x_t, y_t)
• Best we could have done:
    L* = min_{w : ||w||_2 ≤ 1/√λ}  Σ_{t=1}^T max(0, 1 − y_t w^T x_t)
• Our regret:
    R_T = Σ_{t=1}^T ℓ_t − L*
• For SVMs, having a no-regret algorithm means:
    – The average excess error compared to the (expensive) quadratic program goes to zero
    – This is independent of how we process the data set
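A small sketch of how this regret could be measured empirically, assuming NumPy; here L* is approximated by many projected subgradient passes over the full data set as a stand-in for the exact batch (quadratic program) solution, so both that approximation and the function names are illustrative.

import numpy as np

def cumulative_hinge(w, X, Y):
    return np.maximum(0.0, 1.0 - Y * (X @ w)).sum()

def approx_L_star(X, Y, lam, epochs=200):
    """Approximate L* = min_{||w||_2 <= 1/sqrt(lam)} sum_t max(0, 1 - y_t w^T x_t)."""
    n, d = X.shape
    w, radius = np.zeros(d), 1.0 / np.sqrt(lam)
    for k in range(1, epochs + 1):
        active = Y * (X @ w) < 1
        grad = -(Y[active][:, None] * X[active]).sum(axis=0)
        w = w - (1.0 / np.sqrt(k)) * grad / n          # full-batch subgradient step
        norm = np.linalg.norm(w)
        if norm > radius:
            w *= radius / norm                          # project back onto the ball
    return cumulative_hinge(w, X, Y)

def regret(online_losses, X, Y, lam):
    """R_T = sum_t ell_t - L*, with L* approximated as above."""
    return sum(online_losses) - approx_L_star(X, Y, lam)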
No regret algorithms
• Recall: an online algorithm is called no-regret if
    R_T / T → 0   for any sequence of (convex) functions f_1, f_2, ···, f_T
  where
    R_T = Σ_{t=1}^T ℓ_t − min_{w ∈ S} Σ_{t=1}^T f_t(w)
• How can we design a no-regret algorithm? Three approaches:
    – Follow the Leader (FTL)
    – Follow the Regularized Leader (FoReL)
    – Online gradient descent
How can we design a no-regret algorithm? Online gradient descent
• f_t is convex for t = 1, ···, T, and S = R^D
• We use an important property of convex functions:
    f convex  ⇒  ∀ w, u :  f(u) ≥ f(w) + ∇f(w)^T (u − w)
  (figure: the tangent f(w) + ∇f(w)^T (u − w) at w lower-bounds f(u) everywhere)
• Linearize!
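A quick numerical illustration of this first-order condition, assuming NumPy and taking f(w) = ||w||_2^2 as a simple differentiable convex example; the choice of f and all names are illustrative.

import numpy as np

f = lambda w: np.dot(w, w)          # f(w) = ||w||^2 is convex and differentiable
grad_f = lambda w: 2.0 * w          # its gradient

rng = np.random.default_rng(0)
for _ in range(1000):
    w, u = rng.normal(size=3), rng.normal(size=3)
    # first-order condition: f(u) >= f(w) + grad f(w)^T (u - w)
    assert f(u) >= f(w) + np.dot(grad_f(w), u - w) - 1e-12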
Online gradient descent: Intuition
• R_T = Σ_{t=1}^T f_t(w_t) − Σ_{t=1}^T f_t(w*)
• Online gradient descent (OCP). Simple update rule:
    w_{t+1} = w_t − η_t ∇f_t(w_t)
• We may go out of the set S:
    Proj_S(w) = argmin_{w' ∈ S} ||w' − w||_2
• Online gradient descent update rule:
    w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t))
• How well does this simple algorithm do?
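A minimal sketch of this update, assuming NumPy and using a Euclidean ball as a stand-in for a general feasible set S; ogd_step and project are illustrative names.

import numpy as np

def project(w, radius):
    """Projection onto S = {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ogd_step(w_t, grad_t, eta_t, radius):
    """Online gradient descent update: w_{t+1} = Proj_S(w_t - eta_t * grad f_t(w_t))."""
    return project(w_t - eta_t * grad_t, radius)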
Regret for OCP
• Theorem [Zinkevich '03]: Let f_1, f_2, ···, f_T be an arbitrary sequence of convex functions with bounded convex set S. Set η_t = 1/√t. Then:
    R_T ≤ ½ √T [ ||w_0 − w*||_2² + ||∇f||_2² ]
  In particular R_T / T → 0, so online gradient descent is a no-regret algorithm.
OCP for SVM
• Online SVM:
    min_w  Σ_{t=1}^T  max(0, 1 − y_t w^T x_t)
    subject to:  ||w||_2 ≤ 1/√λ
OCP for SVM
• Projection:
    min_w  Σ_{t=1}^T  f_t(w)   s.t.  w ∈ S
    f_t(w) = max(0, 1 − y_t w^T x_t)
    S = {w : ||w||_2 ≤ 1/√λ}
• OCP scheme:
    w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t))
OCP for SVM
• Gradient:
    min_w  Σ_{t=1}^T  f_t(w)   s.t.  w ∈ S
    f_t(w) = max(0, 1 − y_t w^T x_t)
    S = {w : ||w||_2 ≤ 1/√λ}
• OCP scheme:
    w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t)),   with
    ∇f_t(w) = 0 if y_t w^T x_t ≥ 1,   ∇f_t(w) = −y_t x_t if y_t w^T x_t < 1
OCP for SVM
• Initialize:  w_0 ∈ S, step size η_t (e.g., η_t = 1/√t)
• Each round do:
    – Receive a new point (x_t, y_t)
    – Update:  w_{t+1} = Proj_S(w_t − η_t ∇f_t(w_t))
  (A complete sketch of this loop follows below.)
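A self-contained sketch of the full OCP-for-SVM loop on a stream of labeled points, assuming NumPy; the synthetic stream and all names (ocp_svm, lam) are illustrative.

import numpy as np

def ocp_svm(stream, d, lam=0.1):
    """Online projected gradient descent for the SVM hinge loss."""
    w = np.zeros(d)                                   # w_0 = 0 lies in S
    radius = 1.0 / np.sqrt(lam)
    cumulative_loss = 0.0
    for t, (x_t, y_t) in enumerate(stream, start=1):
        eta_t = 1.0 / np.sqrt(t)                      # step size from the regret theorem
        margin = y_t * np.dot(w, x_t)
        cumulative_loss += max(0.0, 1.0 - margin)     # incur loss ell_t
        grad = -y_t * x_t if margin < 1 else np.zeros(d)   # hinge (sub)gradient
        w = w - eta_t * grad                          # gradient step
        norm = np.linalg.norm(w)
        if norm > radius:                             # project back onto S
            w *= radius / norm
    return w, cumulative_loss

# Toy usage: a synthetic stream of T labeled points.
rng = np.random.default_rng(0)
T, d = 1000, 2
stream = ((x, 1.0 if x[0] > 0 else -1.0) for x in rng.normal(size=(T, d)))
w_T, total_loss = ocp_svm(stream, d)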