
Data Mining: Learning From Large Data Sets
Lecture 11: Stochastic convex programming
Hamed Hassani


Generally: Online convex programming
Input:
  Feasible set S ⊆ R^D
  Initial point w_0 ∈ S
Each round t do:
  Pick new feasible point w_t ∈ S
  Receive convex function f_t : S → R
  Incur loss ℓ_t = f_t(w_t)

Goal: minimize the cumulative loss. How do we evaluate performance?

(Board notes: typically f_t is the loss of the data point arriving in round t; in ML, e.g. the SVM hinge loss f_t(w) = max(0, 1 − y_t wᵀx_t). Performance corresponds to the regret, defined next.)
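To make the protocol concrete, here is a minimal Python sketch of the OCP loop (my own illustration, not from the slides); `pick`, `get_loss`, and `update` stand for whatever learner and loss stream are plugged in.

```python
import numpy as np
from typing import Callable, List

def run_ocp(pick: Callable[[], np.ndarray],
            get_loss: Callable[[int], Callable[[np.ndarray], float]],
            update: Callable[[np.ndarray, Callable[[np.ndarray], float]], None],
            T: int) -> List[float]:
    """Abstract online convex programming protocol: returns the losses ell_t = f_t(w_t)."""
    losses = []
    for t in range(T):
        w_t = pick()             # learner picks a feasible point w_t in S
        f_t = get_loss(t)        # environment reveals the convex loss f_t
        losses.append(f_t(w_t))  # incur loss ell_t = f_t(w_t)
        update(w_t, f_t)         # learner updates its state for round t + 1
    return losses
```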

Regret



We will have to pay a price because we do not process the whole data set at once.

Loss in the online setting (at day t we only see the history up to t and the new point (x_t, y_t)):
  Σ_{t=1}^T f_t(w_t)

Loss in the setting where we can process the whole data at once:
  min_{w∈S} Σ_{t=1}^T f_t(w)

The challenge of online learning is to keep the gap between the two small; for the SVM, the offline comparator is min_w Σ_{t=1}^T max(0, 1 − y_t wᵀx_t).
Special case: online SVM optimisation
Keep track of hyperplane parameters w_t online.
For each time t = 1, 2, ..., T:
  New data point x_t arrives
  Classify according to sign(w_tᵀ x_t)  (predict the label of the new point)
  Incur loss ℓ_t = max(0, 1 − y_t w_tᵀ x_t)
  Update w_t based on (x_t, y_t)
Best we could have done (assumes access to the whole data set):
  L* = min_{w : ||w||₂ ≤ 1/√λ} Σ_{t=1}^T max(0, 1 − y_t wᵀ x_t)    (λ is the SVM regularization parameter)
Our regret:
  R_T = Σ_{t=1}^T ℓ_t − L*
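A minimal sketch of this online SVM protocol (my own scaffolding, not the lecture's code); the `update` rule is left abstract on purpose, since it is exactly what online gradient descent supplies below.

```python
import numpy as np

def online_svm(stream, update, w0):
    """Online SVM protocol.

    stream: iterable of (x_t, y_t) pairs with labels y_t in {-1, +1}
    update: function (w_t, x_t, y_t) -> w_{t+1}, e.g. a projected gradient step
    """
    w = w0
    predictions, cumulative_loss = [], 0.0
    for x_t, y_t in stream:
        predictions.append(np.sign(w @ x_t))                 # classify by sign(w_t^T x_t)
        cumulative_loss += max(0.0, 1.0 - y_t * (w @ x_t))   # hinge loss ell_t
        w = update(w, x_t, y_t)                              # update w_t based on (x_t, y_t)
    return w, predictions, cumulative_loss
```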


No regret algorithms

Recall:
  R_T = Σ_{t=1}^T ℓ_t − min_{w∈S} Σ_{t=1}^T f_t(w)
An online algorithm is called no regret if
  R_T / T → 0
for any sequence of (convex) functions f_1, f_2, ..., f_T.

For SVMs, having a no regret algorithm means:
  The average excess error compared to the (expensive) quadratic program goes to zero
  This is independent of how we process the data set

How can we design a no-regret algorithm?
Three approaches:
  Follow the Leader (FTL)
  Follow the Regularized Leader (FoReL)
  Online gradient descent

(Board notes: online gradient descent is essentially SGD run online: each day / iteration t we update w_t using the gradient of the newly revealed loss f_t.)



Online gradient descent (OCP)

Simple update rule:  w_{t+1} = w_t − η_t ∇f_t(w_t)
We may go out of the set S, so we project back:
  Project_S(w) = argmin_{w′∈S} ||w′ − w||₂

Online gradient descent update rule:
  w_{t+1} = Project_S(w_t − η_t ∇f_t(w_t))

How well does this simple algorithm do?
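A compact Python sketch of the projected online gradient descent rule just stated (the 1/√t step size anticipates the theorem below; `grad` and `project` are assumed to be supplied by the caller).

```python
import numpy as np

def online_gradient_descent(grad, project, w0, T):
    """Projected online gradient descent.

    grad:    function (t, w) -> (sub)gradient of f_t at w
    project: function w -> closest point to w in the feasible set S
    """
    w = project(w0)
    iterates = [w]
    for t in range(1, T + 1):
        eta_t = 1.0 / np.sqrt(t)             # step size used in Zinkevich's analysis
        w = project(w - eta_t * grad(t, w))  # gradient step, then project back onto S
        iterates.append(w)
    return iterates
```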

Regret for OCP
• Theorem [Zinkevich '03]: Let f_1, f_2, ..., f_T be an arbitrary sequence of convex functions on a bounded convex set S. Set the step size to η_t = 1/√t. Then the average regret satisfies
  R_T / T  ≤  (1/√T) · [ ||w_0 − w*||² + ||∇f||² ]  =  O(1/√T),
where w_0 is the initial point, w* a minimizer of the cumulative loss, and ||∇f|| bounds the gradient norms.

OCP for SVM
min_w Σ_{t=1}^T max(0, 1 − y_t wᵀ x_t)
subject to:  ||w||₂ ≤ 1/√λ

Online SVM: at each day t, incur the loss ℓ_t = f_t(w_t) = max(0, 1 − y_t w_tᵀ x_t) and update w_t with one OCP step.

OCP for SVM

min_w Σ_{t=1}^T f_t(w)  s.t.  w ∈ S
f_t(w) = max(0, 1 − y_t wᵀ x_t)
S = {w : ||w||₂ ≤ 1/√λ}

OCP scheme:  w_{t+1} = Project_S(w_t − η_t ∇f_t(w_t))

Projection onto the ball S:
  Project_S(w) = w · min(1, (1/√λ) / ||w||₂)

OCP for SVM
min_w Σ_{t=1}^T f_t(w)  s.t.  w ∈ S
f_t(w) = max(0, 1 − y_t wᵀ x_t)
S = {w : ||w||₂ ≤ 1/√λ}
OCP scheme:  w_{t+1} = Project_S(w_t − η_t ∇f_t(w_t))

• (Sub)gradient of the hinge loss:
  ∇f_t(w_t) = 0          if  y_t w_tᵀ x_t ≥ 1
  ∇f_t(w_t) = −y_t x_t   if  y_t w_tᵀ x_t < 1



OCP for SVM
Initialize:  w_0 = 0 (or any arbitrary point in S = {w : ||w||₂ ≤ 1/√λ})
Each round t do:
  Receive a new point (x_t, y_t)
  Gradient:   if y_t w_tᵀ x_t ≥ 1, ∇f_t(w_t) = 0 (do nothing); else ∇f_t(w_t) = −y_t x_t
  Update:     w̃_{t+1} = w_t − η_t ∇f_t(w_t)
  Projection: w_{t+1} = Project_S(w̃_{t+1}) = w̃_{t+1} · min(1, (1/√λ) / ||w̃_{t+1}||₂)
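Putting the pieces together, a hedged sketch of one round of OCP for the SVM; the hinge subgradient, the η_t = 1/√t schedule and the 1/√λ projection radius are the choices assumed above.

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """(Sub)gradient of f_t(w) = max(0, 1 - y * w^T x)."""
    return np.zeros_like(w) if y * (w @ x) >= 1.0 else -y * x

def project_ball(w, radius):
    """Euclidean projection onto {w : ||w||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def ocp_svm_step(w, x, y, t, lam):
    """One OCP round for the SVM: gradient step with eta_t = 1/sqrt(t), then projection."""
    eta_t = 1.0 / np.sqrt(t)
    w_half = w - eta_t * hinge_subgradient(w, x, y)
    return project_ball(w_half, radius=1.0 / np.sqrt(lam))
```

Each incoming example triggers one call to ocp_svm_step, so the cost per round is O(D).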


Large-Scale supervised learning
Recall that our goal was to discuss how to solve problems of the form
  min_w Σ_{i=1}^n f_i(w)
for convex problems very efficiently via:
• Online convex programming (OCP)
• Stochastic convex optimization

Stochastic convex optimization


min_w Σ_{i=1}^n f_i(w)  s.t.  w ∈ S
where f_i(w) = max(0, 1 − y_i wᵀ x_i) corresponds to the i-th data point.

Issue: n is very large!
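A minimal SGD sketch for this finite-sum problem (my own illustration): each iteration touches a single randomly chosen f_i, so the per-iteration cost does not grow with n.

```python
import numpy as np

def sgd(grad_fi, project, w0, n, T, step_size, seed=0):
    """SGD for min_w sum_i f_i(w) subject to w in S (up to the 1/n normalization).

    grad_fi:   function (i, w) -> (sub)gradient of f_i at w
    project:   projection onto the feasible set S
    step_size: function t -> eta_t (e.g. 1/sqrt(t))
    """
    rng = np.random.default_rng(seed)
    w = project(w0)
    for t in range(1, T + 1):
        i = rng.integers(n)                            # sample one data point uniformly at random
        w = project(w - step_size(t) * grad_fi(i, w))  # unbiased stochastic gradient step
    return w
```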
What have we learned so far?

(Board summary: OCP / online learning achieves average regret R_T/T = O(1/√T) for any sequence of convex losses; a rate of the same order, O(1/√T), holds for SGD in stochastic optimization. GD and SGD were compared on the board.)



Further questions
Can we have faster convergence if we have more structure?
  Strongly convex objectives  (a structural property of the loss function)
  Adapting to geometry  (a structural property of the data, e.g. sparsity)
How can we exploit parallelism?


Can we achieve faster convergence?
Without any assumptions on the objectives, the convergence rate / average regret is O(1/√T).
Under specific assumptions, we can achieve faster rates:  O(log T / T).

A function f is called H-strongly convex if:
  f(u) ≥ f(w) + ∇f(w)ᵀ(u − w) + (H/2) ||u − w||²
i.e., at each point w we can fit a quadratic lower bound to the function (not just the linear lower bound that plain convexity gives).

Are these functions strongly convex?
  f(x) = x²        strongly convex
  f(x) = x⁴        convex, but not strongly convex
  f(x) = |x|       convex, but not strongly convex
  f(x) = x         convex, but not strongly convex
  f(x) = x² + x⁴   strongly convex

For a twice-differentiable function f: f is H-strongly convex if and only if all the eigenvalues of the Hessian ∇²f(w) are at least H, for every w.
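A quick worked check (my own, using the second-derivative criterion above; in one dimension "all Hessian eigenvalues ≥ H" is simply f''(x) ≥ H):

```latex
\begin{aligned}
f(x) &= x^2:       & f''(x) &= 2 \ge 2             && \Rightarrow\ \text{2-strongly convex,}\\
f(x) &= x^4:       & f''(0) &= 0                   && \Rightarrow\ \text{convex, not strongly convex,}\\
f(x) &= |x|,\ x:   &        & \text{no curvature}  && \Rightarrow\ \text{not strongly convex,}\\
f(x) &= x^2 + x^4: & f''(x) &= 2 + 12x^2 \ge 2     && \Rightarrow\ \text{2-strongly convex.}
\end{aligned}
```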


OCP for strongly convex functions
Theorem [Hazan, Kalai, Kale, Agarwal '06]: Let f_1, f_2, ..., f_T be an arbitrary sequence of H-strongly convex functions on a bounded convex set S. Set η_t = 1/(H·t) in the OCP update w_{t+1} = Project_S(w_t − η_t ∇f_t(w_t)). Then the regret of online convex programming is bounded by:
  R_T ≤ (||∇f||² / (2H)) · (1 + log T)
so the average regret converges at rate O(log T / T).
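Relative to the earlier projected OGD sketch, only the step-size schedule changes; a brief sketch assuming the strong-convexity parameter H is known:

```python
def ogd_strongly_convex(grad, project, w0, T, H):
    """Projected online gradient descent with eta_t = 1/(H*t) for H-strongly convex losses."""
    w = project(w0)
    for t in range(1, T + 1):
        eta_t = 1.0 / (H * t)                # decaying step size from the theorem above
        w = project(w - eta_t * grad(t, w))
    return w
```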

Strongly convex formulation for SVM
f_t(w) = max(0, 1 − y_t wᵀ x_t)
The hinge loss is convex, but not strongly convex (it is piecewise linear).
But we can use an equivalent formulation of the SVM which is strongly convex (next slide).

Further questions
Can we have faster convergence if we have more structure?
Strongly convex objectives

Adapting to geometry


How can we exploit parallelism?

Strongly convex formulation for SVM
f_t(w) = max(0, 1 − y_t wᵀ x_t)   (an equivalent formulation helps here!)

SVM:  min_w  (λ/2) ||w||₂² + (1/n) Σ_{i=1}^n max(0, 1 − y_i wᵀ x_i)
    = min_w  (1/n) Σ_{i=1}^n g_i(w),   where   g_i(w) = (λ/2) ||w||₂² + f_i(w)

Each f_i is convex but not strongly convex; each g_i is λ-strongly convex, since it is the sum of a λ-strongly convex quadratic and a convex function.

Minimizing (1/n) Σ_i g_i(w) can therefore be done by SGD on the g_i: pick i at random and update w_{t+1} = w_t − η_t ∇g_i(w_t).
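A one-line check of the strong-convexity claim (my own, using the definition of H-strong convexity from earlier; s_i denotes a subgradient of the hinge loss where it is not differentiable): the quadratic term alone supplies the curvature, and adding the convex hinge loss preserves it.

```latex
\frac{\lambda}{2}\|u\|^2 = \frac{\lambda}{2}\|w\|^2 + \lambda w^\top(u-w) + \frac{\lambda}{2}\|u-w\|^2 ,
\qquad
f_i(u) \ \ge\ f_i(w) + s_i^\top(u-w) \quad (s_i \in \partial f_i(w));
% adding the two inequalities:
g_i(u) \ \ge\ g_i(w) + (\lambda w + s_i)^\top(u-w) + \frac{\lambda}{2}\|u-w\|^2
\quad\Longrightarrow\quad g_i \ \text{is } \lambda\text{-strongly convex.}
```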
State of the art for SVM classification: SGD applied to the λ-strongly convex formulation of the SVM:
PEGASOS [Shalev-Shwartz et al. ‘07]
Idea: stochastic (sub)gradient descent on the strongly convex objective, using mini-batches.
Initialize w_0 (e.g. w_0 = 0).
At each iteration t:
  Pick a subset A_t of the data uniformly at random, with mini-batch size |A_t| = B (e.g. B = 1 recovers plain SGD)
  Take a gradient step on the mini-batch objective  (λ/2) ||w||² + (1/B) Σ_{i∈A_t} max(0, 1 − y_i wᵀ x_i)
  Projection:  w_{t+1} = Project_S(w̃_{t+1}),  with  S = {w : ||w||₂ ≤ 1/√λ}
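A hedged Python sketch of the scheme above; the step size η_t = 1/(λt) follows the Pegasos paper, while the data layout (X as an n×D array) and the random seed are my own conventions.

```python
import numpy as np

def pegasos(X, y, lam, T, B, seed=0):
    """Mini-batch Pegasos for min_w (lam/2)*||w||^2 + (1/n)*sum_i max(0, 1 - y_i * w^T x_i)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        A_t = rng.choice(n, size=B, replace=False)     # mini-batch of size B, uniformly at random
        margins = y[A_t] * (X[A_t] @ w)
        viol = A_t[margins < 1.0]                      # examples in A_t with a margin violation
        eta_t = 1.0 / (lam * t)                        # step size from the Pegasos paper
        grad = lam * w - (1.0 / B) * (y[viol, None] * X[viol]).sum(axis=0)
        w = w - eta_t * grad
        radius = 1.0 / np.sqrt(lam)                    # projection onto the ball S
        norm = np.linalg.norm(w)
        if norm > radius:
            w = w * (radius / norm)
    return w
```

With B = 1 this reduces to plain stochastic subgradient descent on the strongly convex objective; larger B mainly reduces the variance of the gradient estimate.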

Statistical performance of PEGASOS
• Theorem [Shalev-Shwartz et al. '07]: The runtime required to find an ε-accurate solution of min_{w∈S} f(w), where f(w) = E[f(w; (x, y))] is the expected (regularized hinge) loss, with probability at least 1 − δ is
  T = O*( D / (λ δ ε) )
where O* hides logarithmic factors and D is the dimension of the feature vectors.
• Performance depends on the number of dimensions D
• Does not depend on the number of examples n
• Depends on the "difficulty" of the problem (λ and ε)
• Because the guarantee is stated for the expected loss, an ε-accurate solution also generalizes to unseen data: optimization accuracy translates into generalization.
• Hence PEGASOS is feasible for large-scale data.


Differences between PEGASOS and standard OCP / OGD
Uses mini-batches of B training examples per iteration
  Empirically more efficient; not theoretically required
Uses a strongly convex loss function (the regularized SVM objective)
  Improved convergence rate and better empirical performance

Further questions
Can we have faster convergence if we have more structure?
  Strongly convex objectives
  Adapting to geometry  →  AdaGrad, Adam
How can we exploit parallelism?

Adapting to geometry (Duchi et al.’12)

SGD update:  w_{t+1} = w_t − η_t ∇f_j(w_t)

SGD uses the same step size for all the features. Is this right?

We often have high-dimensional feature vectors
Many features are irrelevant / redundant
Rare features are very informative

(Board example: designing a text classifier, e.g. for e-mails: rare words such as "credit" or "account" are very informative, while common tokens such as "Dear" or "University" are largely irrelevant.)

Adapting to geometry (Duchi et al.’12)
[Figure: feature frequencies; rare features such as "credit" appear in only a few examples but are highly informative.]

Example (Duchi et al.’12)
High-dimensional image features


Ada-grad (Duchi et al.’12)
Ada-grad provides a feature-specific adaptive learning rate by incorporating knowledge of the geometry of past observations.
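A minimal sketch of the diagonal variant of AdaGrad for the unconstrained case (based on Duchi et al.'s recipe; the global scale η and the ε added for numerical stability are conventional choices, not values from the lecture): every coordinate is scaled by the inverse square root of its accumulated squared gradients, so rarely active but informative features keep a comparatively large step size.

```python
import numpy as np

def adagrad(grad, w0, T, eta=0.1, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate adaptive step sizes.

    grad: function (t, w) -> (sub)gradient of the loss at round t
    """
    w = np.array(w0, dtype=float)
    G_diag = np.zeros_like(w)                    # running sum of squared gradients, per feature
    for t in range(1, T + 1):
        g = grad(t, w)
        G_diag += g ** 2
        w -= eta * g / (np.sqrt(G_diag) + eps)   # rare features -> small G_diag -> larger step
    return w
```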

Changing geometry: The main idea


Recall OCP: given convex losses f_1, f_2, ..., we minimize the cumulative loss with the update
  w_{t+1} = Project_S(w_t − η_t ∇f_t(w_t))
Idea: change the geometry, i.e. rescale the coordinates with a matrix G before taking the step.
Questions:
  How do we design G?
  What are the update rules for w_t?

Changing geometry in projection
To change the geometry we need to specify two things:
(1) What is the update rule?
(2) What is the projection in the new domain?

Idea: use SGD / OCP in a transformed domain z = Gw.
  "hard" world:  min_{w∈S} f(w)
  "easy" world:  min_{z : G⁻¹z ∈ S} g(z),   where   g(z) = f(G⁻¹ z)
The two problems are equivalent. Take the gradient step in the z domain,
  z_{t+1} = z_t − η_t ∇g(z_t),   with   z_t = G w_t,
project z_{t+1} back onto the transformed feasible set, and map back to the w domain via w_{t+1} = G⁻¹ z_{t+1}.

Changing geometry in projection
Writing the projection in the z domain in terms of w (using z = Gw and w′_{t+1} = G⁻¹ z_{t+1}):
  w_{t+1} = argmin_{w∈S} ||Gw − z_{t+1}||₂
          = argmin_{w∈S} ||G(w − w′_{t+1})||₂
          = argmin_{w∈S} ||w − w′_{t+1}||_G


Gradient descent with Mahalanobis projection
Rescaling coordinates is equivalent to the following algorithm:
  w′_{t+1} = w_t − η G⁻¹ ∇f_t(w_t)
  w_{t+1} = argmin_{w∈S} ||w − w′_{t+1}||_G
Hereby, for a vector w ∈ R^D the Euclidean norm is replaced by the Mahalanobis norm
  ||w||_G² = ||Gw||₂²
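A small sketch of this preconditioned update for the special case of a diagonal G and a box-shaped feasible set (my own simplification): because the Mahalanobis norm is then separable across coordinates, the G-norm projection onto the box coincides with ordinary coordinate-wise clipping.

```python
import numpy as np

def preconditioned_gd_box(grad, w0, lo, hi, G_diag, T, eta=0.1):
    """Gradient descent with update w' = w - eta * G^{-1} grad f(w) and
    Mahalanobis projection onto the box S = {w : lo <= w <= hi}.

    For diagonal G, the G-norm projection onto a box is coordinate-wise clipping.
    """
    w = np.clip(np.array(w0, dtype=float), lo, hi)
    for t in range(1, T + 1):
        w_prime = w - eta * grad(t, w) / G_diag   # preconditioned step; G^{-1} is elementwise here
        w = np.clip(w_prime, lo, hi)              # Mahalanobis projection = clipping for diagonal G
    return w
```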


Main question: Which matrix G should we choose?