COMP3221: Distributed Systems
Distributed Optimization
Dr Nguyen Tran
School of Computer Science
Outline
1. Linear regression
2. Distributed linear regression
3. Gradient descent
Linear Regression Optimization
Goal: Find w* that minimizes f(w) = ||Xw − y||_2^2
• Closed-form solution exists
• Gradient Descent is iterative (intuition: go downhill!)
Scalar objective: f(w) = ||Xw − y||_2^2 = ∑_{j=1}^n (w x^{(j)} − y^{(j)})^2
[Figure: f(w) plotted against w, with the minimizer w*]
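To make the closed-form option concrete, here is a minimal NumPy sketch (an illustration rather than course code; the toy data, variable names, and the use of np.linalg.solve are my own assumptions):

```python
import numpy as np

# Toy data: n = 100 points, k = 3 features (arbitrary sizes chosen for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Closed-form least squares: w* solves the normal equations (X^T X) w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Objective value f(w) = ||Xw - y||_2^2 at the solution
f_at_w_star = np.sum((X @ w_star - y) ** 2)
```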
Gradient descent
Start at a random point
1. Determine a descent direction
2. Choose a step size
3. Update
[Figure: f(w) vs w, with the first iterate w_1 moving from w_0 toward w*]
Gradient descent
Start at a random point
Repeat
  Determine a descent direction
  Choose a step size
  Update
Until stopping criterion is satisfied
[Figure: f(w) vs w, with the iterates w_0, w_1, w_2, … approaching w*]
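A minimal Python sketch of the loop described above (illustrative only; the example function, step size, tolerance, and iteration cap are assumptions, not values from the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Start at w0, then repeat: pick the descent direction (negative gradient),
    scale it by the step size, update, and stop once the update is tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = -alpha * grad_f(w)        # descent direction scaled by step size
        w = w + step                     # update
        if np.linalg.norm(step) < tol:   # stopping criterion
            break
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3); converges to ~3
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=[0.0])
```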
Where will we converge?
• Convex f(w): any local minimum is a global minimum
• Non-convex g(w): multiple local minima may exist
Least Squares, Ridge Regression, and Logistic Regression are all convex!
[Figure: a convex f(w) with a single minimum at w*, next to a non-convex g(w) with a local minimum ẃ and a global minimum w*]
Choosing a descent direction (1D)
We can only move in two directions, and the negative slope is the direction of descent!
• Slope positive ⇒ go left!
• Slope negative ⇒ go right!
• Slope zero ⇒ done!
Update rule: w_{i+1} = w_i − α_i (df/dw)(w_i)
(α_i is the step size; −(df/dw)(w_i) is the negative slope)
[Figure: f(w) vs w, with a starting point w_0 on either side of the minimum w*]
Choosing a descent direction
We can move anywhere in R^d, and the negative gradient is the direction of steepest descent!
2D Example:
• Function values are in black/white, and black represents higher values
• Arrows are gradients
"Gradient2" by Sarang. Licensed under CC BY-SA 2.5 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Gradient2.svg#/media/File:Gradient2.svg
Update rule: w_{i+1} = w_i − α_i ∇f(w_i)
(α_i is the step size; −∇f(w_i) is the negative gradient)
Gradient descent for least squares
Update rule: w_{i+1} = w_i − α_i (df/dw)(w_i)
Scalar objective: f(w) = ||Xw − y||_2^2 = ∑_{j=1}^n (w x^{(j)} − y^{(j)})^2
Derivative (chain rule): df/dw = ∑_{j=1}^n 2 (w x^{(j)} − y^{(j)}) x^{(j)}
Scalar update (the 2 is absorbed into α): w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i x^{(j)} − y^{(j)}) x^{(j)}
Vector update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
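A direct NumPy translation of the vector update above (a sketch; the zero initialization, constant step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def least_squares_gd(X, y, alpha=0.01, num_iters=500):
    """Gradient descent for f(w) = ||Xw - y||_2^2 using the vector update
    w_{i+1} = w_i - alpha * sum_j (w_i^T x^(j) - y^(j)) x^(j)."""
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(num_iters):
        residuals = X @ w - y      # entries w_i^T x^(j) - y^(j)
        grad = X.T @ residuals     # sum_j (w_i^T x^(j) - y^(j)) x^(j)
        w = w - alpha * grad
    return w
```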
Choosing a step size
Theoretical convergence results exist for various step sizes:
• Too small: converges very slowly
• Too big: overshoots and can diverge
A common choice is to reduce the step size over time:
α_i = α / (n √i)
where α is a constant, n is the number of training points, and i is the iteration number.
[Figure: three panels (a)–(c) of f(w) vs w, showing a step size that is too small, one that is too big, and one that is reduced over time so the iterates converge to w*]
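In code, the decaying schedule simply replaces the constant step size inside the loop. A hedged sketch, assuming the step-size rule as reconstructed above (α_i = α / (n √i)):

```python
import numpy as np

def least_squares_gd_decaying(X, y, alpha=0.1, num_iters=500):
    """Least-squares gradient descent with a step size that shrinks over time:
    alpha_i = alpha / (n * sqrt(i))."""
    n, k = X.shape
    w = np.zeros(k)
    for i in range(1, num_iters + 1):
        grad = X.T @ (X @ w - y)
        alpha_i = alpha / (n * np.sqrt(i))  # constant / (# training points * sqrt(iteration #))
        w = w - alpha_i * grad
    return w
```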
Parallel gradient descent for least squares
Vector update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
Compute the summands in parallel! (note: workers must all have w_i)
• workers (store the data): O(nk) distributed storage
• map (compute local summands): O(nk) distributed computation, O(k) local storage
• reduce (sum and update): O(k) local computation, O(k) local storage
Example: n = 6; k = 100, 3 workers
workers: the points x^{(1)}, …, x^{(6)} are partitioned across the 3 workers
map: each worker computes its local summands (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
reduce: sum the summands, ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}, and update w_{i+1}
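The same map/reduce pattern can be simulated in plain Python (a sketch only; the partition layout and the helper names local_gradient / parallel_gd_step are illustrative, not a real cluster API):

```python
import numpy as np

def local_gradient(w, X_part, y_part):
    """map: a worker computes and sums its own summands (w^T x^(j) - y^(j)) x^(j)."""
    return X_part.T @ (X_part @ w - y_part)

def parallel_gd_step(w, partitions, alpha):
    """reduce: add up the workers' partial gradients, then update w."""
    total_grad = sum(local_gradient(w, Xp, yp) for Xp, yp in partitions)
    return w - alpha * total_grad

# Example: n = 6, k = 100, 3 workers (2 points per worker)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(6, 100)), rng.normal(size=6)
partitions = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]
w_next = parallel_gd_step(np.zeros(100), partitions, alpha=0.01)
```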
Gradient descent summary
Pros:
• Easily parallelized
• Cheap at each iteration
• Stochastic variants can make things even cheaper
Cons:
• Slow convergence (especially compared with closed-form)
• Requires communication across nodes!
DISTRIBUTED MACHINE LEARNING
Communication Principles
Communication hierarchy (RECALL)
Access rates fall sharply with distance: >50× gap between memory and network!
• CPU ↔ RAM: 50 GB/s
• CPU ↔ local disks: 1 GB/s
• Within a rack: 1 GB/s
• Between different racks: 0.3 GB/s
Be mindful of this hierarchy when developing parallel/distributed algorithms!
Communication hierarchy (RECALL)
Access rates fall sharply with distance (RAM: 50 GB/s; local disks: 1 GB/s; within a rack: 1 GB/s; between different racks: 0.3 GB/s)
• Parallelism makes computation fast
• Network makes communication slow
Be mindful of this hierarchy when developing parallel/distributed algorithms!
2nd Rule of thumb
Perform parallel and in-memory computation
Persisting in memory reduces communication
• Especially for iterative computation (e.g., gradient descent)
Scale-up (powerful multicore machine)
✓ No network communication
✗ Expensive hardware, eventually hit a wall
[Figure: a single machine with CPU, RAM, and disk]
2nd Rule of thumb
Perform parallel and in-memory computation
Persisting in memory reduces communication
• Especially for iterative computation (e.g., gradient descent)
Scale-out (distributed, e.g., cloud-based)
✓ Commodity hardware, scales to massive problems
✗ Need to deal with network communication
[Figure: many machines, each with CPU, RAM, and disk, connected by a network]
3rd Rule of thumb
Minimize network communication
Q: How should we leverage distributed computing while mitigating network communication?
First Observation: We need to store and potentially communicate Data, Model, and Intermediate objects
• A: Keep large objects local
3rd Rule of thumb
Minimize network communication — stay local
Example: Linear regression, big n and small k
• Solve via closed form (not iterative!)
• Communicate O(k^2) intermediate data
• Compute locally on data (Data Parallel)
workers: the points x^{(1)}, …, x^{(6)} are partitioned across the workers
map: each worker computes its local outer products x^{(i)} x^{(i)T}
reduce: sum and invert, (∑_i x^{(i)} x^{(i)T})^{−1}, to form the closed-form solution
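A sketch of this "stay local" closed-form computation, assuming the standard normal equations w = (∑_i x^{(i)} x^{(i)T})^{−1} ∑_i y^{(i)} x^{(i)} (the partitioning and helper names are illustrative):

```python
import numpy as np

def local_stats(X_part, y_part):
    """map: each worker reduces its data to a k x k matrix and a k-vector,
    so only O(k^2) intermediate data needs to be communicated."""
    return X_part.T @ X_part, X_part.T @ y_part

def closed_form_from_partitions(partitions):
    """reduce: sum the local statistics and solve the normal equations."""
    stats = [local_stats(Xp, yp) for Xp, yp in partitions]
    gram = sum(g for g, _ in stats)    # sum_i x^(i) x^(i)^T  (k x k)
    xty = sum(v for _, v in stats)     # sum_i y^(i) x^(i)    (k,)
    return np.linalg.solve(gram, xty)

# Example: big n, small k -- 3 workers, 2 points each, k = 3
rng = np.random.default_rng(1)
X, y = rng.normal(size=(6, 3)), rng.normal(size=6)
partitions = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]
w = closed_form_from_partitions(partitions)
```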
3rd Rule of thumb
Minimize network communication — stay local
Example: Linear regression, big n and big k
• Gradient descent, communicate w_i
• O(k) communication: OK for fairly large k
• Compute locally on data (Data Parallel)
workers: the points x^{(1)}, …, x^{(6)} are partitioned across the workers
map: each worker computes its local summands (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
reduce: sum the summands, ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}, and update w_{i+1}
3rd Rule of thumb
Minimize network communication — stay local
Example: Linear regression, big n and huge k
• Gradient descent
• O(k) communication becomes slow with hundreds of millions of parameters
• Distribute data and model (Data and Model Parallel)
• Often rely on sparsity to reduce communication
3rd Rule of thumb
Minimize network communication
Q: How should we leverage distributed computing while mitigating network communication?
First Observation: We need to store and potentially communicate Data, Model, and Intermediate objects
• A: Keep large objects local
Second Observation: ML methods are typically iterative
• A: Reduce # iterations
3rd Rule of thumb
Minimize network communication — reduce iterations
Distributed iterative algorithms must compute and communicate
• In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark, we strictly alternate between the two
Distributed Computing Properties
• Parallelism makes computation fast
• Network makes communication slow
Idea: Design algorithms that compute more, communicate less
• Do more computation at each iteration
• Reduce total number of iterations
3rd Rule of thumb
Minimize network communication
2nd Rule of thumb
Perform parallel and in-memory computation
1st Rule of thumb
Computation and storage should be linear in (n, k)
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Regularized Empirical Risk Minimization
• How to solve ERM objectives with optimization methods?
• What techniques exist for large-scale optimization?
θ̂ = argmin_θ (1/n) ∑_{i=1}^n ℓ(f_θ(x_i), y_i) + λ R(θ)
Techniques for large-scale optimization
Reduce computation:
• Distribute computation
• Use first-order methods (i.e., no second- or higher-order derivative information)
• Use stochastic methods
Reduce communication:
• Keep large objects local
• Reduce iterations
Note: these techniques may be at odds with one another!
Methods for distributed optimization
[Spectrum, from less local computation / more communication to more local computation / less communication:]
stochastic gradient descent → mini-batch SGD → gradient descent → local-updating methods → divide-and-conquer / one-shot averaging
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Stochastic gradient descent [FOR LINEAR REGRESSION]
How can we reduce computation?
Idea: approximate the full gradient with just one observation
Gradient Descent update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
Stochastic Gradient Descent (SGD) update: w_{i+1} = w_i − α_i (w_i^T x^{(j)} − y^{(j)}) x^{(j)}   (the sum is gone!)
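A minimal SGD sketch for least squares (the uniform sampling of j, the constant step size, and the iteration count are illustrative assumptions):

```python
import numpy as np

def least_squares_sgd(X, y, alpha=0.01, num_iters=1000, seed=0):
    """SGD: each update uses a single randomly chosen observation,
    w_{i+1} = w_i - alpha * (w_i^T x^(j) - y^(j)) x^(j) -- the sum is gone."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(num_iters):
        j = rng.integers(n)                        # sample one observation
        w = w - alpha * (X[j] @ w - y[j]) * X[j]   # update from that observation only
    return w
```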
Stochastic gradient descent [MORE GENERALLY]
Objective we want to solve: min_w f(w), where f(w) := ∑_{j=1}^n f_j(w)
Gradient Descent update: w_{i+1} = w_i − α_i ∇f(w_i)
Stochastic Gradient Descent (SGD) update: w_{i+1} = w_i − α_i ∇f_j(w_i), with j sampled at random
Stochastic gradient descent
What could go wrong?
• Gradient with respect to one random example might point in the wrong direction
Why would we expect it to work?
• On average, gradient of random examples will approximate the full gradient
• Method should converge in expectation
Stochastic gradient descent
Pros
• Less computation
• n times cheaper than gradient descent at each iteration
• This faster per-iteration cost might lead to faster overall convergence
Cons
• Less stable convergence than gradient descent
• In terms of iterations: slower convergence than gradient descent
• More communication!
Parallel stochastic gradient descent for least squares
Example: n = 6; k = 100, 3 workers
Vector update: w_{i+1} = w_i − α_i (w_i^T x^{(j)} − y^{(j)}) x^{(j)}   (the sum is gone!)
• workers (store the data): O(nk) distributed storage
• map (the single summand (w_i^T x^{(j)} − y^{(j)}) x^{(j)}, computed on the worker holding x^{(j)}): O(k) distributed computation, O(k) local storage
• reduce (update w_{i+1}): O(k) local computation, O(k) local storage
Is this a good idea for distributed computing?
Issue:
• While we’ve reduced computation, we may require more iterations (i.e., more
communication) to converge
• Additionally, if we only process one observation at each iteration, some
machines will be idle
Another idea:
• Compute an approximate gradient with respect to a few (more than one)
observations [mini-batch methods]
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Mini-batch SGD/GD [FOR LINEAR REGRESSION]
Gradient Descent update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
SGD update (use a single observation): w_{i+1} = w_i − α_i (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
Mini-batch SGD/GD update (sum over a mini-batch of observations): w_{i+1} = w_i − α_i ∑_{j∈B_i} (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
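A mini-batch SGD sketch following the update above (the batch size, sampling without replacement, and constant step size are illustrative choices):

```python
import numpy as np

def least_squares_minibatch_sgd(X, y, batch_size=32, alpha=0.01, num_iters=500, seed=0):
    """Mini-batch SGD: each update sums the summands over a random mini-batch B_i,
    w_{i+1} = w_i - alpha * sum_{j in B_i} (w_i^T x^(j) - y^(j)) x^(j)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(num_iters):
        B = rng.choice(n, size=min(batch_size, n), replace=False)  # mini-batch indices
        grad = X[B].T @ (X[B] @ w - y[B])   # sum of summands over the mini-batch
        w = w - alpha * grad
    return w
```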
Mini-batch SGD [MORE GENERALLY]
Objective we want to solve: min_w f(w), where f(w) := ∑_{j=1}^n f_j(w)
Gradient Descent update: w_{i+1} = w_i − α_i ∇f(w_i)
Stochastic Gradient Descent (SGD) update: w_{i+1} = w_i − α_i ∇f_j(w_i), with j sampled at random
Mini-batch SGD update: w_{i+1} = w_i − α_i ∇f_{B_i}(w_i), with mini-batch B_i ⊆ {1, …, n} sampled at random
Mini-batch SGD
In each case the reduce step has the same form, w = w + ∑_{m=1}^M Δw_m, but the local updates Δw_m differ:
• SGD: computed from one observation
• gradient descent: computed from all observations
• mini-batch SGD: computed from some observations
Mini-batch SGD
https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
Mini-batch SGD
Pros
• In contrast to SGD: parallelizable
• Less computation than GD (might lead to faster overall convergence)
• Can tune computation vs. communication depending on batch size
Cons
• In terms of iterations: slower convergence than gradient descent
• Another parameter to tune (batch size)
• Still might be too much communication …
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Communication hierarchy (RECALL)
Access rates fall sharply with distance (RAM: 50 GB/s; local disks: 1 GB/s; within a rack: 1 GB/s; between different racks: 0.3 GB/s)
• Parallelism makes computation fast
• Network makes communication slow
Be mindful of this hierarchy when developing parallel/distributed algorithms!
One-shot averaging
map: each worker finds the optimal local model w*_m on its own data
average: w := (1/M) ∑_{m=1}^M w*_m
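A one-shot averaging sketch for least squares (the choice of np.linalg.lstsq as the local solver and the equal-weight average are illustrative assumptions):

```python
import numpy as np

def one_shot_average(partitions):
    """map: each worker fits the optimal local model w*_m on its own data;
    average: w := (1/M) * sum_m w*_m, using a single round of communication."""
    local_models = [np.linalg.lstsq(Xm, ym, rcond=None)[0] for Xm, ym in partitions]
    return np.mean(local_models, axis=0)

# Example: 3 workers, each holding its own slice of the data
rng = np.random.default_rng(2)
X, y = rng.normal(size=(60, 5)), rng.normal(size=60)
partitions = [(X[0:20], y[0:20]), (X[20:40], y[20:40]), (X[40:60], y[40:60])]
w_avg = one_shot_average(partitions)
```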
Challenge: One-shot averaging might not work
[Figure: the average of the locally optimal models can fall far from the ideal (globally optimal) solution]
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Local-updating methods
[Figure: one-shot averaging contrasted with local-updating methods]
Methods for distributed optimization
[Spectrum, from less local computation / more communication to more local computation / less communication:]
stochastic gradient descent → mini-batch SGD → gradient descent → local-updating methods → divide-and-conquer / one-shot averaging
Local-updating methods
Pros
• Fully flexible in terms of computation vs. communication
• Allows re-use of local solvers
Cons
• Another parameter to tune
• Not well-understood for non-convex objectives [current active area of research!]
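One well-known local-updating scheme is local SGD (the idea underlying FedAvg-style methods): each worker takes several local steps between communication rounds, and the models are then averaged. A hedged sketch, with the number of rounds, local steps, and step size all chosen for illustration:

```python
import numpy as np

def local_sgd(partitions, k, rounds=10, local_steps=20, alpha=0.01, seed=0):
    """Local SGD for least squares: in each communication round, every worker takes
    several SGD steps on its own data, then the resulting models are averaged."""
    rng = np.random.default_rng(seed)
    w = np.zeros(k)
    for _ in range(rounds):                       # one communication per round
        local_models = []
        for Xm, ym in partitions:                 # local updates on each worker
            w_m = w.copy()
            for _ in range(local_steps):
                j = rng.integers(len(ym))
                w_m -= alpha * (Xm[j] @ w_m - ym[j]) * Xm[j]
            local_models.append(w_m)
        w = np.mean(local_models, axis=0)         # average the local models
    return w
```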
Federated Learning
Additional Reading
• Optimization for large-scale machine learning: https://leon.bottou.org/publications/pdf/tr-optml-2016.pdf
Next week
• Federated Learning