
Data Mining: Learning From Large Data Sets
Lecture 10: Online convex programming (continued)
Hamed Hassani

SGD for SVM

Online SVM:

    min_w  Σ_{t=1}^T max(0, 1 - y_t w^T x_t)

subject to:

    ||w||_2 ≤ 1/√λ
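
As a concrete illustration (not part of the slides), the constrained objective above can be evaluated as follows; the function name online_svm_objective and the parameter lam (for λ) are my own naming choices.

```python
import numpy as np

def online_svm_objective(w, X, y, lam):
    """Total hinge loss of w on the points (x_t, y_t); w must satisfy ||w||_2 <= 1/sqrt(lam).

    X has shape (T, d); the labels y are in {-1, +1}.
    """
    assert np.linalg.norm(w) <= 1.0 / np.sqrt(lam) + 1e-9, "w lies outside the feasible set"
    margins = y * (X @ w)                        # y_t * w^T x_t for every t
    return np.maximum(0.0, 1.0 - margins).sum()  # sum_t max(0, 1 - y_t w^T x_t)
```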

SGD for SVM

    min_w  Σ_{i=1}^n f_i(w)    s.t.  w ∈ S

    f_i(w) = max(0, 1 - y_i w^T x_i),      S = {w : ||w||_2 ≤ 1/√λ}

SGD scheme:

    w_{t+1} = Proj_S(w_t - η_t ∇f_{i_t}(w_t))

where at each iteration t we choose a small random mini-batch (in the simplest case a single data point i_t), first take the gradient step w_t - η_t ∇f_{i_t}(w_t), and then project the result back onto S.

Projection:

    Proj_S(w) = argmin_{w' ∈ S} ||w' - w||_2

Projection

The projection of a point y onto a set S is defined as:

    Proj_S(y) = argmin_{y' ∈ S} ||y - y'||_2

In our case S = {w : ||w||_2 ≤ 1/√λ}, so

    Proj_S(y) = y                   if ||y||_2 ≤ 1/√λ
    Proj_S(y) = y / (√λ ||y||_2)    if ||y||_2 > 1/√λ
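
A minimal NumPy sketch of this projection, assuming S is the ℓ2 ball of radius 1/√λ as above; the function name project_onto_S and the argument lam (for λ) are illustrative choices, not from the lecture.

```python
import numpy as np

def project_onto_S(y, lam):
    """Euclidean projection of y onto S = {w : ||w||_2 <= 1/sqrt(lam)}."""
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(y)
    if norm <= radius:              # y is already feasible: Proj_S(y) = y
        return y
    return (radius / norm) * y      # otherwise rescale onto the boundary: y / (sqrt(lam) * ||y||_2)
```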

SGD for SVM

    min_w  Σ_{i=1}^n f_i(w)    s.t.  w ∈ S

    f_i(w) = max(0, 1 - y_i w^T x_i),      S = {w : ||w||_2 ≤ 1/√λ}

SGD scheme:

    w_{t+1} = Proj_S(w_t - η_t ∇f_{i_t}(w_t))

Gradient: write f_i(w) = g(h(w)) as a composition of h(w) = y_i w^T x_i and g(z) = max(0, 1 - z). By the chain rule,

    ∇f_i(w) = ∇(g∘h)(w) = g'(h(w)) ∇h(w) = g'(y_i w^T x_i) · y_i x_i

with g'(z) = 0 for z > 1 and g'(z) = -1 for z < 1. Hence

    ∇f_i(w) = 0            if y_i w^T x_i > 1
    ∇f_i(w) = -y_i x_i     if y_i w^T x_i < 1

At y_i w^T x_i = 1 the hinge loss is not differentiable, so we use a subgradient there (e.g. 0).
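
A short sketch of this (sub)gradient for a single data point; the function name and the choice of returning the zero vector at the kink are my own, consistent with the derivation above.

```python
import numpy as np

def hinge_subgradient(w, x_i, y_i):
    """A subgradient of f_i(w) = max(0, 1 - y_i * w^T x_i)."""
    if y_i * np.dot(w, x_i) >= 1.0:    # loss is zero and flat here (at the kink we pick 0)
        return np.zeros_like(w)
    return -y_i * x_i                  # active hinge: the gradient is -y_i * x_i
```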


Subgradient
We only require to linearize properly:

Subgradient
• Given a convex (not necessarily differentiable) function f, a vector g_x is called a subgradient of f at x if:

    ∀ x' ∈ S :  f(x') ≥ f(x) + g_x^T (x' - x)

[Figure: the graph of f lies above the linear lower bound f(x) + g_x^T (x' - x) for every x'.]
Subgradient for SVM

Hinge loss:

    f_i(w) = max(0, 1 - y_i w^T x_i)

Subgradient:

    g_i(w) = 0           if y_i w^T x_i ≥ 1
    g_i(w) = -y_i x_i    if y_i w^T x_i < 1
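
To make the definition concrete, here is a small numerical check (an illustration of mine, not from the slides) that the subgradient above satisfies the linearization property f(w') ≥ f(w) + g^T (w' - w) for the hinge loss at randomly sampled points.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
x_i, y_i = rng.normal(size=d), 1.0
f = lambda w: max(0.0, 1.0 - y_i * np.dot(w, x_i))                # hinge loss f_i(w)

w = rng.normal(size=d)
g = np.zeros(d) if y_i * np.dot(w, x_i) >= 1.0 else -y_i * x_i    # subgradient at w

# The subgradient inequality must hold for every w': f(w') >= f(w) + g^T (w' - w).
for _ in range(1000):
    w_prime = rng.normal(size=d)
    assert f(w_prime) >= f(w) + g @ (w_prime - w) - 1e-12
print("The linear lower bound never exceeds the hinge loss at the sampled points.")
```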



SGD for SVM

Initialize: w_0 ∈ S, e.g. w_0 = 0; step size η_t, e.g. a small constant η = 0.01 or η_t = 1/√t

Each round do:
    Receive a new batch of data points (x_{i_1}, y_{i_1}), (x_{i_2}, y_{i_2}), ...
    Update: w_{t+1} = Proj_S(w_t - η_t g_t), where g_t is a (sub)gradient of the hinge loss on the received batch
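
Putting the pieces together, a minimal sketch of this procedure under the stated assumptions (single randomly drawn data point per step, η_t = 1/√t, S the ball of radius 1/√λ); the function name sgd_svm and its arguments are illustrative, not the lecture's reference implementation.

```python
import numpy as np

def sgd_svm(X, y, lam, n_steps=1000, seed=0):
    """Projected SGD on sum_i max(0, 1 - y_i w^T x_i), s.t. ||w||_2 <= 1/sqrt(lam).

    X has shape (n, d); the labels y are in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    radius = 1.0 / np.sqrt(lam)
    w = np.zeros(d)                           # w_0 = 0 lies in S
    for t in range(1, n_steps + 1):
        i = rng.integers(n)                   # pick one data point uniformly at random
        eta = 1.0 / np.sqrt(t)                # step size eta_t = 1/sqrt(t)
        if y[i] * np.dot(w, X[i]) < 1.0:      # subgradient of the hinge loss is -y_i x_i
            w = w + eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > radius:                     # project back onto S
            w *= radius / norm
    return w
```

For a mini-batch variant one would average the subgradients of the points in the batch before the projected step.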


Large-Scale supervised learning

Recall that our goal was to discuss how to solve problems of the form:

    min_w  Σ_{i=1}^n f_i(w)    s.t.  w ∈ S

for convex problems very efficiently via:
• Online convex programming
• Stochastic convex optimization

In the stochastic setting we have access to all the data points, and in each iteration we choose a small random subset and update.

Online learning

• Data arrives sequentially (streaming): (x_1, y_1), (x_2, y_2), (x_3, y_3), ...
• Need to classify one data point at a time
• Use a different decision rule (linear classifier) each time
• Can't remember all the data points

    sequential data  →  learning algorithm  →  sequential rule (online)

[Figure: as time progresses, new points x_t arrive and the hyperplane is updated: w_1, w_2, w_3, ...]



Online SVM optimisation

Keep track of hyperplane parameters w_t online. For each time t = 1, 2, ···, T:
    New data point x_t arrives
    Classify according to sign(w_t^T x_t)
    Incur loss ℓ_t = max(0, 1 - y_t w_t^T x_t)
    Update w_t based on (x_t, y_t)

Goal: minimize the cumulative loss  Σ_{t=1}^T ℓ_t
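
A sketch of this online loop in NumPy (classify, incur the hinge loss, then a projected gradient update as introduced in the following slides); the step size η_t = 1/√t, the function name online_svm, and the radius 1/√λ are assumptions carried over from the earlier slides.

```python
import numpy as np

def online_svm(stream, d, lam):
    """Process a stream of (x_t, y_t) pairs; return the final w, predictions, cumulative loss."""
    radius = 1.0 / np.sqrt(lam)
    w = np.zeros(d)
    predictions, cumulative_loss = [], 0.0
    for t, (x_t, y_t) in enumerate(stream, start=1):
        predictions.append(np.sign(w @ x_t))                 # classify according to sign(w_t^T x_t)
        cumulative_loss += max(0.0, 1.0 - y_t * (w @ x_t))   # incur loss l_t
        if y_t * (w @ x_t) < 1.0:                            # update w_t based on (x_t, y_t)
            w = w + (1.0 / np.sqrt(t)) * y_t * x_t
        norm = np.linalg.norm(w)
        if norm > radius:                                    # keep w inside S
            w *= radius / norm
    return w, predictions, cumulative_loss
```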


Generally: Online convex programming

Input:
    Feasible set S ⊆ R^D
    Initial point w_0 ∈ S

Each round t do:
    Pick new feasible point w_t ∈ S
    Receive convex function f_t : S → R
    Incur loss ℓ_t = f_t(w_t)

Goal: minimize the cumulative loss. How do we evaluate performance?

Regret

We will have to pay a price as we do not process the whole data at once.

Loss in the online setting:

    Σ_{t=1}^T ℓ_t = Σ_{t=1}^T f_t(w_t)

Loss in the setting where we can process the whole data at once:

    min_{w ∈ S}  Σ_{t=1}^T f_t(w)




Special case: online SVM optimisation

Keep track of hyperplane parameters w_t online. For each time t = 1, 2, ···, T:
    New data point x_t arrives
    Classify according to sign(w_t^T x_t)
    Incur loss ℓ_t = max(0, 1 - y_t w_t^T x_t)
    Update w_t based on (x_t, y_t)

Best we could have done:

    L* = min_{w : ||w||_2 ≤ 1/√λ}  Σ_{t=1}^T max(0, 1 - y_t w^T x_t)

Our regret:

    R_T = Σ_{t=1}^T ℓ_t - L*

For SVMs, having a no-regret algorithm means:
    The average excess error compared to the (expensive) quadratic program goes to zero
    This is independent of how we process the data set
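
One way to estimate this regret empirically (a sketch of mine, not from the slides): run the online updates to accumulate Σ_t ℓ_t, and approximate L* by full-batch projected subgradient descent on the whole data set, since the exact minimizer would otherwise come from the quadratic program.

```python
import numpy as np

def hinge_losses(w, X, y):
    """Vector of hinge losses max(0, 1 - y_t w^T x_t) for all points."""
    return np.maximum(0.0, 1.0 - y * (X @ w))

def project(w, radius):
    """Euclidean projection onto the ball of the given radius."""
    n = np.linalg.norm(w)
    return w if n <= radius else (radius / n) * w

def estimate_regret(X, y, lam, eta=0.1, epochs=200):
    """R_T ~= online cumulative loss minus an approximation of L* (best fixed w in hindsight)."""
    T, d = X.shape
    radius = 1.0 / np.sqrt(lam)

    # Online pass: one projected subgradient step per arriving point, eta_t = 1/sqrt(t).
    w, online_loss = np.zeros(d), 0.0
    for t in range(T):
        online_loss += max(0.0, 1.0 - y[t] * (X[t] @ w))
        if y[t] * (X[t] @ w) < 1.0:
            w = project(w + (1.0 / np.sqrt(t + 1)) * y[t] * X[t], radius)

    # Approximate L*: projected subgradient descent on the full batch objective.
    w_star = np.zeros(d)
    for _ in range(epochs):
        active = hinge_losses(w_star, X, y) > 0.0              # points with positive loss
        grad = -(y[active, None] * X[active]).sum(axis=0)      # subgradient of the sum
        w_star = project(w_star - eta * grad, radius)
    L_star = hinge_losses(w_star, X, y).sum()

    return online_loss - L_star
```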

No regret algorithms

    R_T = Σ_{t=1}^T ℓ_t - min_{w ∈ S} Σ_{t=1}^T f_t(w)

Recall:
An online algorithm is called no regret if:

    R_T / T → 0

for any sequence of (convex) functions f_1, f_2, ···, f_T.

How can we design a no-regret algorithm?
Three approaches:
Follow the Leader (FTL)
Follow the Regularized Leader (FoReL)
Online gradient descent

How can we design a no-regret algorithm?

Online gradient descent

• f_t is convex for t = 1, ···, T, and S = R^D

We use an important property of convex functions:

    f convex  ⇒  ∀ w, u :  f(u) ≥ f(w) + ∇f(w)^T (u - w)

[Figure: the graph of f lies above its tangent f(w) + ∇f(w)^T (u - w) at every u.]

Linearize!

Online gradient descent: Intuition

• R_T = Σ_{t=1}^T f_t(w_t) - Σ_{t=1}^T f_t(w*)

Online gradient descent (OCP)

Simple update rule:  w_{t+1} = w_t - η_t ∇f_t(w_t)

We may go out of the set S:

    Proj_S(w) = argmin_{w' ∈ S} ||w' - w||_2

Online gradient descent update rule:

    w_{t+1} = Proj_S(w_t - η_t ∇f_t(w_t))

How well does this simple algorithm do?
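
A generic sketch of this update rule for an arbitrary sequence of convex f_t, given a (sub)gradient oracle per round and a projection onto S; the interface below (grads as callables, the default identity projection, η_t = 1/√t) is an illustrative choice that matches the theorem on the next slide.

```python
import numpy as np

def online_gradient_descent(w0, grads, project=lambda w: w):
    """Online (projected) gradient descent for OCP.

    w0      : initial point in S
    grads   : iterable of callables; the t-th element, evaluated at w_t, returns a (sub)gradient of f_t
    project : Euclidean projection onto S (identity if S = R^D)
    """
    w = np.asarray(w0, dtype=float)
    iterates = [w.copy()]
    for t, grad_t in enumerate(grads, start=1):
        eta_t = 1.0 / np.sqrt(t)               # step size suggested by the regret theorem
        w = project(w - eta_t * grad_t(w))     # w_{t+1} = Proj_S(w_t - eta_t * grad f_t(w_t))
        iterates.append(w.copy())
    return iterates
```

For the SVM case, grad_t(w) would return 0 or -y_t x_t as derived earlier, and project would be the projection onto the ball of radius 1/√λ.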

Regret for OCP

• Theorem [Zinkevich'03]: Let f_1, f_2, ···, f_T be an arbitrary sequence of convex functions with bounded convex set S. Set η_t = 1/√t. Then:

    R_T / T  ≤  (1 / (2√T)) [ ||w_0 - w*||_2^2 + ||∇f||_2^2 ]

OCP for SVM

Online SVM:

    min_w  Σ_{t=1}^T max(0, 1 - y_t w^T x_t)

subject to:

    ||w||_2 ≤ 1/√λ

OCP for SVM

    min_w  Σ_{t=1}^T f_t(w)    s.t.  w ∈ S

    f_t(w) = max(0, 1 - y_t w^T x_t),      S = {w : ||w||_2 ≤ 1/√λ}

OCP scheme:

    w_{t+1} = Proj_S(w_t - η_t ∇f_t(w_t))

Projection:

    Proj_S(y) = y  if ||y||_2 ≤ 1/√λ,   Proj_S(y) = y / (√λ ||y||_2)  otherwise

OCP for SVM

    min_w  Σ_{t=1}^T f_t(w)    s.t.  w ∈ S

    f_t(w) = max(0, 1 - y_t w^T x_t),      S = {w : ||w||_2 ≤ 1/√λ}

OCP scheme:

    w_{t+1} = Proj_S(w_t - η_t ∇f_t(w_t))

Gradient:

    ∇f_t(w) = 0  if y_t w^T x_t ≥ 1,   ∇f_t(w) = -y_t x_t  otherwise



OCP for SVM

Initialize: w_0 ∈ S, step size η_t = 1/√t

Each round do:
    Receive a new point (x_t, y_t)