Data Mining: Learning From Large Data Sets
Lecture 11: Stochastic convex programming
Hamed Hassani
Generally: Online convex programming
• Input:
– Feasible set $S \subseteq \mathbb{R}^D$
– Initial point $w_0 \in S$
• Each round $t$ do:
– Pick new feasible point $w_t \in S$
– Receive convex function $f_t : S \to \mathbb{R}$
– Incur loss $\ell_t = f_t(w_t)$
• Goal: minimize the cumulative loss. How do we evaluate performance?
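To make the protocol concrete, here is a minimal sketch of the interaction loop (not from the lecture); the `pick_point` callable and the stream of per-round losses are illustrative assumptions standing in for the learner's update rule and the round-$t$ convex losses.

```python
# A minimal sketch of the online convex programming protocol described above.
def run_ocp(pick_point, losses, w0):
    """losses is an iterable of convex functions f_t; returns the cumulative loss."""
    history, w, cumulative = [], w0, 0.0
    for t, f_t in enumerate(losses, start=1):
        w = pick_point(t, history, w)   # learner commits to w_t in S
        loss = f_t(w)                   # only then is the loss f_t revealed
        cumulative += loss              # incur loss ell_t = f_t(w_t)
        history.append((f_t, w, loss))
    return cumulative
```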
• Typically, each $f_t$ corresponds to one data point; e.g. for the SVM, $f_t(w) = \max(0, 1 - y_t w^T x_t)$, and we want to minimize the cumulative loss $\sum_t \ell_t$.
• In ML, performance corresponds to the following notion:
Regret
• We will have to pay a price, as we do not process the whole data set at once.
• Loss in the online setting:
$$\sum_{t=1}^{T} \ell_t = \sum_{t=1}^{T} f_t(w_t)$$
• Loss in the setting where we can process the whole data at once:
$$\min_{w \in S} \sum_{t=1}^{T} f_t(w)$$
• This is the challenge of online learning: on day $t$ we only see the single new data point $(x_t, y_t)$, whereas offline we could solve, e.g., $\min_w \sum_{t=1}^{T} \max(0, 1 - y_t w^T x_t)$ over the whole data set.
• Example (online learning for an ML problem such as the SVM): on day $t$ we have seen the history $(x_1, y_1), \dots, (x_{t-1}, y_{t-1})$ and keep a hyperplane $w_t$; a new data point $x_t$ arrives, we predict its label as $\mathrm{sign}(w_t^T x_t)$, then observe $y_t$ and incur the hinge loss $\max(0, 1 - y_t w_t^T x_t)$.
Special case: online SVM optimisation
• Keep track of hyperplane parameters $w_t$ online
• For each time $t = 1, 2, \dots, T$:
– New data point $x_t$ arrives
– Classify according to $\mathrm{sign}(w_t^T x_t)$
– Incur loss $\ell_t = \max(0, 1 - y_t w_t^T x_t)$
– Update $w_t$ based on $(x_t, y_t)$
• Best we could have done (assumes access to the whole data set at once):
$$L^* = \min_{w : \|w\|_2 \le \frac{1}{\sqrt{\lambda}}} \; \sum_{t=1}^{T} \max(0, 1 - y_t w^T x_t)$$
• Our regret:
$$R_T = \sum_{t=1}^{T} \ell_t - L^*$$
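As a small illustration of the regret just defined (a sketch assuming toy arrays `X`, `y` and iterates `ws` produced by some online learner; not the lecture's code), the offline comparator $L^*$ is approximated here with a generic constrained solver:

```python
# Illustrative sketch: cumulative hinge loss of the online iterates minus the
# loss of the best fixed hyperplane in hindsight (L*).
import numpy as np
from scipy.optimize import minimize

def hinge(w, x, y):
    return max(0.0, 1.0 - y * (w @ x))

def regret(ws, X, y, lam=0.1):
    # Cumulative online loss: sum_t max(0, 1 - y_t w_t^T x_t)
    online_loss = sum(hinge(w, x, yt) for w, x, yt in zip(ws, X, y))

    # Best fixed w in hindsight, constrained to ||w||_2 <= 1/sqrt(lam)
    def total_loss(w):
        return sum(hinge(w, x, yt) for x, yt in zip(X, y))

    radius = 1.0 / np.sqrt(lam)
    cons = {"type": "ineq", "fun": lambda w: radius - np.linalg.norm(w)}
    best = minimize(total_loss, np.zeros(X.shape[1]), constraints=cons)
    return online_loss - best.fun
```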
No regret algorithms
• Recall:
$$R_T = \sum_{t=1}^{T} \ell_t - \min_{w \in S} \sum_{t=1}^{T} f_t(w)$$
• An online algorithm is called no regret if
$$\frac{R_T}{T} \to 0$$
for any sequence of (convex) functions $f_1, f_2, \dots, f_T$.
• For SVMs, having a no regret algorithm means:
– The average excess error compared to the (expensive) quadratic program goes to zero
– This is independent of how we process the data set
• How can we design a no-regret algorithm? Three approaches:
– Follow the Leader (FTL)
– Follow the Regularized Leader (FoReL)
– Online gradient descent
How can we design a no-regret algorithm?
• Idea (as in SGD): each day / iteration, update $w_t$ using only the newly received loss $f_t$.
Online gradient descent (OCP)
• Simple update rule: $w_{t+1} = w_t - \eta_t \nabla f_t(w_t)$
• We may go out of the set $S$; projection:
$$\mathrm{Project}_S(w) = \arg\min_{w' \in S} \|w' - w\|_2$$
• Online gradient descent update rule:
$$w_{t+1} = \mathrm{Project}_S\!\left(w_t - \eta_t \nabla f_t(w_t)\right)$$
• How well does this simple algorithm do?
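A minimal sketch of this projected update (not the lecture's code); `grad_fs` and `project_S` are assumed callables supplying the per-round gradients and the Euclidean projection onto $S$:

```python
# Projected online gradient descent with step size eta_t = 1/sqrt(t).
import numpy as np

def online_gradient_descent(w0, grad_fs, project_S):
    w = np.asarray(w0, dtype=float)
    iterates = [w]
    for t, grad_f_t in enumerate(grad_fs, start=1):
        eta_t = 1.0 / np.sqrt(t)
        w = project_S(w - eta_t * grad_f_t(w))   # gradient step, then project back onto S
        iterates.append(w)
    return iterates
```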
Regret for OCP
• Theorem [Zinkevich '03]: Let $f_1, f_2, \dots, f_T$ be an arbitrary sequence of convex functions with bounded convex set $S$. Set $\eta_t = \frac{1}{\sqrt{t}}$. Then:
$$\frac{R_T}{T} \le \frac{1}{\sqrt{T}}\left[\|w_0 - w^*\|^2 + \|\nabla f\|^2\right] = O\!\left(\frac{1}{\sqrt{T}}\right)$$
where $w_0$ is the initial point, $w^*$ the minimizer of the cumulative loss, and $\|\nabla f\|$ bounds the gradient norms.
OCP for SVM
• Online SVM (each term corresponds to the data point seen on day $t$; the optimal value is $L^*$):
$$\min_{w} \sum_{t=1}^{T} \max(0, 1 - y_t w^T x_t) \quad \text{subject to: } \|w\|_2 \le \frac{1}{\sqrt{\lambda}}$$
OCP for SVM
$$\min_{w} \sum_{t=1}^{T} f_t(w) \;\; \text{s.t. } w \in S, \qquad f_t(w) = \max(0, 1 - y_t w^T x_t), \qquad S = \left\{w : \|w\|_2 \le \tfrac{1}{\sqrt{\lambda}}\right\}$$
• OCP scheme: $w_{t+1} = \mathrm{Proj}_S\!\left(w_t - \eta_t \nabla f_t(w_t)\right)$
• Projection:
$$\mathrm{Proj}_S(w) = w \cdot \min\!\left(1, \frac{1}{\sqrt{\lambda}\,\|w\|_2}\right)$$
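The projection has the closed form above; a minimal sketch (illustrative names, not the lecture's code):

```python
# Euclidean projection onto the ball S = {w : ||w||_2 <= 1/sqrt(lam)}:
# points inside the ball are unchanged, points outside are rescaled onto
# the boundary, i.e. w * min(1, 1/(sqrt(lam)*||w||)).
import numpy as np

def project_onto_ball(w, lam):
    radius = 1.0 / np.sqrt(lam)
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)
```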
OCP for SVM
$$\min_{w} \sum_{t=1}^{T} f_t(w) \;\; \text{s.t. } w \in S, \qquad f_t(w) = \max(0, 1 - y_t w^T x_t), \qquad S = \left\{w : \|w\|_2 \le \tfrac{1}{\sqrt{\lambda}}\right\}$$
• OCP scheme: $w_{t+1} = \mathrm{Proj}_S\!\left(w_t - \eta_t \nabla f_t(w_t)\right)$
• Gradient:
$$\nabla f_t(w_t) = \begin{cases} 0 & \text{if } y_t w_t^T x_t \ge 1 \\ -\,y_t x_t & \text{otherwise} \end{cases}$$
• Initialize: $w_0 = 0$, or an arbitrary point in $S$
OCP for SVM
• Each round do:
– Receive a new point $(x_t, y_t)$
– If $y_t w_t^T x_t \ge 1$: the gradient is zero, so do nothing, $w_{t+1} = w_t$
– Else: take a gradient step and project back onto $S$:
$$w_{t+1} = \mathrm{Proj}_S\!\left(w_t + \eta_t\, y_t x_t\right)$$
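Putting the gradient and the projection together, here is a minimal sketch of the whole OCP-for-SVM loop (the stream interface and the step size $\eta_t = 1/\sqrt{t}$ are illustrative assumptions consistent with the slides; the ball projection from the previous slide is inlined):

```python
# Online SVM via projected online gradient descent.
import numpy as np

def ocp_svm(stream, dim, lam=0.1):
    """stream yields labelled points (x_t, y_t) with y_t in {-1, +1}."""
    radius = 1.0 / np.sqrt(lam)              # S = {w : ||w|| <= 1/sqrt(lam)}
    w = np.zeros(dim)                        # initialize w_0 = 0 (a point in S)
    for t, (x, y) in enumerate(stream, start=1):
        eta_t = 1.0 / np.sqrt(t)
        if y * (w @ x) < 1:                  # hinge loss active: gradient is -y * x
            w = w + eta_t * y * x            # gradient step
            norm = np.linalg.norm(w)
            if norm > radius:                # project back onto the ball S
                w *= radius / norm
        # else: gradient is zero, keep w unchanged
        yield w
```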
Large-Scale supervised learning
• Recall that our goal was to discuss how to solve problems of the form
$$\min_{w} \sum_{i=1}^{n} f_i(w) \quad \text{s.t. } w \in S$$
for convex problems very efficiently via:
• Online convex programming (OCP — ✓ covered above)
• Stochastic convex optimization
• Here $f_i$ corresponds to the $i$-th data point, e.g. $f_i(w) = \max(0, 1 - y_i w^T x_i)$; the issue is that $n$ is very large!
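For the stochastic view, a minimal sketch (not the lecture's code): sample one data-point index uniformly per iteration and apply the same projected gradient step; `grad_fi(w, i)` and `project_S` are assumed callables.

```python
# Stochastic convex optimization sketch: at each step pick one term f_i of the
# sum uniformly at random and take a projected gradient step on it alone, so
# the per-iteration cost is independent of n.
import numpy as np

def projected_sgd(w0, n, grad_fi, project_S, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for t in range(1, T + 1):
        i = rng.integers(n)                       # uniformly random data point
        w = project_S(w - (1.0 / np.sqrt(t)) * grad_fi(w, i))
    return w
```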
What have we learned so far?
• OCP achieves average regret $\frac{R_T}{T} = O\!\left(\frac{1}{\sqrt{T}}\right)$ for any sequence of convex losses.
• The same analysis links online learning (OCP) to stochastic optimization: applying the update to one randomly sampled data point per iteration gives SGD, as opposed to full gradient descent (GD) over the whole data set.
Further questions
• Can we have faster convergence if we have more structure?
– Strongly convex objectives (a structural property of the loss function)
– Adapting to geometry (a structural property of the data, e.g. sparsity)
• How can we exploit parallelism?
Can we achieve faster convergence?
• Without any assumptions on the objectives, the convergence rate / avg. regret is $O\!\left(\frac{1}{\sqrt{T}}\right)$
• Under specific assumptions, we can achieve faster rates: $O\!\left(\frac{\log T}{T}\right)$
• A function $f$ is called $H$-strongly convex if:
$$f(u) \ge f(w) + \nabla f(w)^T (u - w) + \frac{H}{2}\,\|u - w\|^2$$
• Intuition: at each point $w$ we can fit a quadratic lower bound to the function.
Are these functions strongly convex?
• $f(x) = x^2$: strongly convex ($f''(x) = 2$ everywhere)
• $f(x) = x^4$: convex, but not strongly convex ($f''(0) = 0$)
• $f(x) = |x|$ and $f(x) = x$: convex, but not strongly convex
• $f(x) = x^2 + x^4$: strongly convex (the $x^2$ term supplies the quadratic lower bound)
• Criterion: a twice-differentiable function $f$ is $H$-strongly convex if and only if all eigenvalues of $\nabla^2 f(x)$ are at least $H$, for all $x$.
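A small numerical illustration of the curvature criterion (a sketch, not from the slides): estimate the smallest second derivative of a one-dimensional function on a grid and check whether it is bounded away from zero.

```python
# Finite-difference check of the "curvature at least H" criterion in 1-D.
import numpy as np

def min_curvature(f, xs, h=1e-4):
    xs = np.asarray(xs, dtype=float)
    second = (f(xs + h) - 2 * f(xs) + f(xs - h)) / h**2
    return second.min()

xs = np.linspace(-2, 2, 401)
print(min_curvature(lambda x: x**2, xs))          # ~2  -> strongly convex
print(min_curvature(lambda x: x**4, xs))          # ~0  -> not strongly convex
print(min_curvature(lambda x: x**2 + x**4, xs))   # ~2  -> strongly convex
```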
OCP for strongly convex functions
• Theorem [Hazan, Kalai, Kale, Agarwal '06]: Let $f_1, f_2, \dots, f_T$ be an arbitrary sequence of strongly convex functions with bounded convex set $S$. Set $\eta_t = \frac{1}{Ht}$. Then, the regret of online convex programming is bounded by:
$$R_T \le \frac{\|\nabla f\|^2}{2H}\,(1 + \log T)$$
• The update rule is unchanged, $w_{t+1} = \mathrm{Proj}_S\!\left(w_t - \eta_t \nabla f_t(w_t)\right)$; only the step size $\eta_t = \frac{1}{Ht}$ differs, and the average regret now converges at rate $O\!\left(\frac{\log T}{T}\right)$.
Strongly convex formulation for SVM
• $f_t(w) = \max(0, 1 - y_t w^T x_t)$ is convex, but not strongly convex.
• However, we can use an equivalent formulation of the SVM which is strongly convex (next slide).
Further questions
• Can we have faster convergence if we have more structure?
– Strongly convex objectives
– Adapting to geometry
• How can we exploit parallelism?
Strongly convex formulation for SVM
• $f_t(w) = \max(0, 1 - y_t w^T x_t)$ is not strongly convex, but the equivalent regularized formulation helps here!
• SVM:
$$\min_{w} \; \frac{\lambda}{2}\|w\|^2 + \frac{1}{n}\sum_{i=1}^{n} f_i(w) \;=\; \min_{w} \; \frac{1}{n}\sum_{i=1}^{n} g_i(w), \qquad g_i(w) = \frac{\lambda}{2}\|w\|^2 + \max(0, 1 - y_i w^T x_i)$$
• Each $g_i$ is $\lambda$-strongly convex, since the $\frac{\lambda}{2}\|w\|^2$ term contributes curvature $\lambda$ everywhere.
• SGD with this $\lambda$-strongly convex formulation of the SVM is the state-of-the-art: PEGASOS (next slide).
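A minimal sketch (illustrative, not the lecture's code) of a subgradient of the per-example objective $g_i$ defined above:

```python
# Subgradient of g_i(w) = (lam/2)*||w||^2 + max(0, 1 - y_i * w^T x_i).
import numpy as np

def subgrad_g(w, x, y, lam):
    g = lam * w                  # gradient of the (lam/2)*||w||^2 regularizer
    if y * (w @ x) < 1:          # hinge term is active
        g = g - y * x
    return g
```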
PEGASOS [Shalev-Shwartz et al. '07]
• SGD with the strongly convex formulation of the SVM, using mini-batches.
• At each step $t$, pick a random subset $A_t \subseteq \{1, \dots, n\}$ of the data, of mini-batch size $|A_t| = B$.
• Approximate the objective on the mini-batch:
$$f_t(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{B}\sum_{i \in A_t} \max(0, 1 - y_i w^T x_i)$$
• Take a gradient step and project back onto $S = \{w : \|w\|_2 \le \frac{1}{\sqrt{\lambda}}\}$:
$$w_{t+1} = \mathrm{Proj}_S\!\left(w_t - \eta_t \nabla f_t(w_t)\right)$$
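A minimal PEGASOS-style sketch of the scheme above (assumptions: an in-memory data matrix `X` and labels `y`; the step size $\eta_t = \frac{1}{\lambda t}$ follows the strongly convex analysis; this is an illustration, not the authors' implementation):

```python
import numpy as np

def pegasos(X, y, lam=0.1, B=32, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, D = X.shape
    w = np.zeros(D)
    radius = 1.0 / np.sqrt(lam)
    for t in range(1, T + 1):
        A = rng.choice(n, size=B, replace=False)       # random mini-batch A_t
        margins = y[A] * (X[A] @ w)
        active = A[margins < 1]                        # points with active hinge loss
        grad = lam * w - (y[active, None] * X[active]).sum(axis=0) / B
        w = w - (1.0 / (lam * t)) * grad               # gradient step, eta_t = 1/(lam*t)
        norm = np.linalg.norm(w)
        if norm > radius:                              # projection onto S
            w *= radius / norm
    return w
```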
Statistical performance of PEGASOS
• Theorem [Shalev-Shwartz et al. '07]: The runtime required to find an $\epsilon$-accurate solution of $\min_{w \in S} \mathbb{E}[f(w)]$ with probability at least $1 - \delta$ is
$$T = O^*\!\left(\frac{D}{\lambda\,\epsilon}\log\frac{1}{\delta}\right)$$
• Performance depends on the number of dimensions $D$ and on the "difficulty" of the problem ($\lambda$ and $\epsilon$).
• It does not depend on the number of examples $n$; the solution generalizes to unseen data (from optimization to generalization), which makes large-scale learning with PEGASOS feasible.
Differences between PEGASOS and standard OCP / OGD
• Uses a mini-batch of training examples
– Empirically more efficient
– Not theoretically required
• Uses strongly convex loss functions (SVM)
– Improved convergence rate and better empirical performance
Further questions
• Can we have faster convergence if we have more structure?
– Strongly convex objectives
– Adapting to geometry (AdaGrad, Adam)
• How can we exploit parallelism?
Adapting to geometry (Duchi et al. '12)
• SGD update: $w_{t+1} = w_t - \eta_t \nabla f_j(w_t)$
• SGD uses the same step size for all the features. Is this right?
– We often have high-dimensional feature vectors
– Many features are irrelevant / redundant
– Rare features are very informative
• Example: designing an email classifier with word features, rare words such as "account" or "credit" are very informative, while frequent words such as "Dear" or "University" are largely irrelevant.
Adapting to geometry (Duchi et al.’12)
[Figure: handwritten illustration of feature-specific updates, e.g. for the rare feature "credit".]
Example (Duchi et al.’12)
High-dimensional image features
Ada-grad (Duchi et al. '12)
• Ada-grad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations.
Changing geometry: The main idea
• Recall OCP on the sequence $f_1, f_2, \dots$: $w_{t+1} = \mathrm{Proj}_S\!\left(w_t - \eta_t \nabla f_t(w_t)\right)$
• Idea: rescale the space with a matrix $G$ (i.e. work in the coordinates $z = Gw$) and run the updates there.
• Questions:
– How do we design $G$?
– What are the update rules for $w_t$?
Changing geometry in projection
• To run OCP in the rescaled coordinates $z = Gw$ (so $w = G^{-1}z$), we need to specify two things: (1) the update rule and the domain it lives in, and (2) the projection in the new domain.
• The "hard" world and the "easy" world solve equivalent problems:
$$\min_{w \in S} f(w) \;=\; \min_{z \,:\, G^{-1}z \in S} g(z), \qquad g(z) = f(G^{-1}z)$$
• Idea: use SGD / OCP in the $z$-domain and then transfer back:
– Gradient step (gradient taken in the original $w$-domain, step taken in the $z$-domain): $z_{t+\frac{1}{2}} = z_t - \eta\,\nabla f_t(w_t)$, with $z_t = G w_t$
– Projection: $z_{t+1} = \arg\min_{z \,:\, G^{-1}z \in S} \|z - z_{t+\frac{1}{2}}\|$
– Map back: $w_{t+1} = G^{-1} z_{t+1}$
Changing geometry in projection
• Substituting $z = Gw$ (so $w_{t+1} = G^{-1}z_{t+1}$), the $z$-domain projection becomes a projection in the $w$-domain:
$$z_{t+1} = \arg\min_{z \,:\, G^{-1}z \in S} \|z - z_{t+\frac{1}{2}}\|_2 \quad\Longleftrightarrow\quad w_{t+1} = \arg\min_{w \in S} \|Gw - z_{t+\frac{1}{2}}\|_2 = \arg\min_{w \in S} \|G(w - w'_{t+1})\|_2$$
where $w'_{t+1} = G^{-1} z_{t+\frac{1}{2}}$.
Gradient descent with Mahalanobis projection
• Rescaling coordinates is equivalent to the following algorithm:
$$w'_{t+1} = w_t - \eta\, G^{-1} \nabla f_t(w_t), \qquad w_{t+1} = \arg\min_{w \in S} \|w - w'_{t+1}\|_G$$
• Hereby, for a vector $w \in \mathbb{R}^D$ the Euclidean norm is replaced by the Mahalanobis norm $\|w\|_G^2 = \|Gw\|_2^2$.
• Main question: Which matrix $G$ should we choose?
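The choice of $G$ is left open here. As a small illustration of the update itself (assumptions: a diagonal $G$ and a box-shaped feasible set, both illustrative and not part of the slides), the scaled step and the Mahalanobis projection become cheap per-coordinate operations:

```python
# Gradient step with a diagonal G, followed by the Mahalanobis projection
# onto a box S = [lo, hi]^D. With diagonal G the objective
# ||w - w'||_G^2 = sum_i G_ii^2 (w_i - w'_i)^2 decomposes per coordinate,
# so the projection reduces to clipping each coordinate into [lo, hi].
import numpy as np

def mahalanobis_step(w, grad, G_diag, eta, lo=-1.0, hi=1.0):
    w_half = w - eta * grad / G_diag     # w'_{t+1} = w_t - eta * G^{-1} grad
    return np.clip(w_half, lo, hi)       # argmin_{w in box} ||w - w'_{t+1}||_G
```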