COMP3221: Distributed Systems
Distributed Optimization
Dr Nguyen Tran
School of Computer Science
Outline
1. Linear regression
2. Distributed linear regression
3. Gradient descent
Linear Regression Optimization
Goal: Find w* that minimizes f(w) = ||Xw − y||_2^2
• Closed-form solution exists
• Gradient Descent is iterative (intuition: go downhill!)
Scalar objective: f(w) = ||Xw − y||_2^2 = ∑_{j=1}^n (w x^{(j)} − y^{(j)})^2
[Figure: f(w) plotted against w, with the minimizer w*]
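To make the closed-form option concrete, here is a minimal NumPy sketch (an illustration rather than course code; the toy data, variable names, and the use of np.linalg.solve are my own assumptions):

```python
import numpy as np

# Toy data: n = 100 points, k = 3 features (arbitrary sizes chosen for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Closed-form least squares: w* solves the normal equations (X^T X) w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# Objective value f(w) = ||Xw - y||_2^2 at the solution
f_at_w_star = np.sum((X @ w_star - y) ** 2)
```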
Gradient descent
Start at a random point
1. Determine a descent direction
2. Choose a step size
3. Update
[Figure: f(w) vs w, with the first iterate w_1 moving from w_0 toward w*]
Gradient descent
Start at a random point
Repeat
  Determine a descent direction
  Choose a step size
  Update
Until stopping criterion is satisfied
[Figure: f(w) vs w, with the iterates w_0, w_1, w_2, … approaching w*]
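A minimal Python sketch of the loop described above (illustrative only; the example function, step size, tolerance, and iteration cap are assumptions, not values from the slides):

```python
import numpy as np

def gradient_descent(grad_f, w0, alpha=0.1, tol=1e-6, max_iter=1000):
    """Start at w0, then repeat: pick the descent direction (negative gradient),
    scale it by the step size, update, and stop once the update is tiny."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iter):
        step = -alpha * grad_f(w)        # descent direction scaled by step size
        w = w + step                     # update
        if np.linalg.norm(step) < tol:   # stopping criterion
            break
    return w

# Example: minimize f(w) = (w - 3)^2, whose gradient is 2(w - 3); converges to ~3
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=[0.0])
```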
Where will we converge?
• Convex f(w): any local minimum is a global minimum
• Non-convex g(w): multiple local minima may exist
Least Squares, Ridge Regression, and Logistic Regression are all convex!
[Figure: a convex f(w) with a single minimum at w*, next to a non-convex g(w) with a local minimum ẃ and a global minimum w*]
Choosing a descent direction (1D)
We can only move in two directions, and the negative slope is the direction of descent!
• Slope positive ⇒ go left!
• Slope negative ⇒ go right!
• Slope zero ⇒ done!
Update rule: w_{i+1} = w_i − α_i (df/dw)(w_i)
(α_i is the step size; −(df/dw)(w_i) is the negative slope)
[Figure: f(w) vs w, with a starting point w_0 on either side of the minimum w*]
Choosing a descent direction
We can move anywhere in R^d, and the negative gradient is the direction of steepest descent!
2D Example:
• Function values are in black/white, and black represents higher values
• Arrows are gradients
"Gradient2" by Sarang. Licensed under CC BY-SA 2.5 via Wikimedia Commons
http://commons.wikimedia.org/wiki/File:Gradient2.svg#/media/File:Gradient2.svg
Update rule: w_{i+1} = w_i − α_i ∇f(w_i)
(α_i is the step size; −∇f(w_i) is the negative gradient)
Gradient descent for least squares
Update rule: w_{i+1} = w_i − α_i (df/dw)(w_i)
Scalar objective: f(w) = ||Xw − y||_2^2 = ∑_{j=1}^n (w x^{(j)} − y^{(j)})^2
Derivative (chain rule): df/dw = ∑_{j=1}^n 2 (w x^{(j)} − y^{(j)}) x^{(j)}
Scalar update (the 2 is absorbed into α): w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i x^{(j)} − y^{(j)}) x^{(j)}
Vector update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
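A direct NumPy translation of the vector update above (a sketch; the zero initialization, constant step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def least_squares_gd(X, y, alpha=0.01, num_iters=500):
    """Gradient descent for f(w) = ||Xw - y||_2^2 using the vector update
    w_{i+1} = w_i - alpha * sum_j (w_i^T x^(j) - y^(j)) x^(j)."""
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(num_iters):
        residuals = X @ w - y      # entries w_i^T x^(j) - y^(j)
        grad = X.T @ residuals     # sum_j (w_i^T x^(j) - y^(j)) x^(j)
        w = w - alpha * grad
    return w
```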
Choosing a step size
Theoretical convergence results exist for various step sizes:
• Too small: converges very slowly
• Too big: overshoots and can diverge
A common choice is to reduce the step size over time:
α_i = α / (n √i)
where α is a constant, n is the number of training points, and i is the iteration number.
[Figure: three panels (a)–(c) of f(w) vs w, showing a step size that is too small, one that is too big, and one that is reduced over time so the iterates converge to w*]
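In code, the decaying schedule simply replaces the constant step size inside the loop. A hedged sketch, assuming the step-size rule as reconstructed above (α_i = α / (n √i)):

```python
import numpy as np

def least_squares_gd_decaying(X, y, alpha=0.1, num_iters=500):
    """Least-squares gradient descent with a step size that shrinks over time:
    alpha_i = alpha / (n * sqrt(i))."""
    n, k = X.shape
    w = np.zeros(k)
    for i in range(1, num_iters + 1):
        grad = X.T @ (X @ w - y)
        alpha_i = alpha / (n * np.sqrt(i))  # constant / (# training points * sqrt(iteration #))
        w = w - alpha_i * grad
    return w
```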
Parallel gradient descent for least squares
Vector update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
Compute the summands in parallel! (note: workers must all have w_i)
• workers (store the data): O(nk) distributed storage
• map (compute local summands): O(nk) distributed computation, O(k) local storage
• reduce (sum and update): O(k) local computation, O(k) local storage
Example: n = 6; k = 100, 3 workers
workers: the points x^{(1)}, …, x^{(6)} are partitioned across the 3 workers
map: each worker computes its local summands (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
reduce: sum the summands, ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}, and update w_{i+1}
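The same map/reduce pattern can be simulated in plain Python (a sketch only; the partition layout and the helper names local_gradient / parallel_gd_step are illustrative, not a real cluster API):

```python
import numpy as np

def local_gradient(w, X_part, y_part):
    """map: a worker computes and sums its own summands (w^T x^(j) - y^(j)) x^(j)."""
    return X_part.T @ (X_part @ w - y_part)

def parallel_gd_step(w, partitions, alpha):
    """reduce: add up the workers' partial gradients, then update w."""
    total_grad = sum(local_gradient(w, Xp, yp) for Xp, yp in partitions)
    return w - alpha * total_grad

# Example: n = 6, k = 100, 3 workers (2 points per worker)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(6, 100)), rng.normal(size=6)
partitions = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]
w_next = parallel_gd_step(np.zeros(100), partitions, alpha=0.01)
```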
Gradient descent summary
Pros:
• Easily parallelized
• Cheap at each iteration
• Stochastic variants can make things even cheaper
Cons:
• Slow convergence (especially compared with closed-form)
• Requires communication across nodes!
DISTRIBUTED MACHINE LEARNING
Communication Principles
Communication hierarchy (RECALL)
Access rates fall sharply with distance: >50× gap between memory and network!
• CPU ↔ RAM: 50 GB/s
• CPU ↔ local disks: 1 GB/s
• Within a rack: 1 GB/s
• Between different racks: 0.3 GB/s
Be mindful of this hierarchy when developing parallel/distributed algorithms!
Communication hierarchy (RECALL)
Access rates fall sharply with distance (RAM: 50 GB/s; local disks: 1 GB/s; within a rack: 1 GB/s; between different racks: 0.3 GB/s)
• Parallelism makes computation fast
• Network makes communication slow
Be mindful of this hierarchy when developing parallel/distributed algorithms!
2nd Rule of thumb
Perform parallel and in-memory computation
Persisting in memory reduces communication
• Especially for iterative computation (e.g., gradient descent)
Scale-up (powerful multicore machine)
✓ No network communication
✗ Expensive hardware, eventually hit a wall
[Figure: a single machine with CPU, RAM, and disk]
2nd Rule of thumb
Perform parallel and in-memory computation
Persisting in memory reduces communication
• Especially for iterative computation (e.g., gradient descent)
Scale-out (distributed, e.g., cloud-based)
✓ Commodity hardware, scales to massive problems
✗ Need to deal with network communication
[Figure: many machines, each with CPU, RAM, and disk, connected by a network]
3rd Rule of thumb
Minimize network communication
Q: How should we leverage distributed computing while mitigating network communication?
First Observation: We need to store and potentially communicate Data, Model, and Intermediate objects
• A: Keep large objects local
3rd Rule of thumb
Minimize network communication — stay local
Example: Linear regression, big n and small k
• Solve via closed form (not iterative!)
• Communicate O(k^2) intermediate data
• Compute locally on data (Data Parallel)
workers: the points x^{(1)}, …, x^{(6)} are partitioned across the workers
map: each worker computes its local outer products x^{(i)} x^{(i)T}
reduce: sum and invert, (∑_i x^{(i)} x^{(i)T})^{−1}, to form the closed-form solution
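A sketch of this "stay local" closed-form computation, assuming the standard normal equations w = (∑_i x^{(i)} x^{(i)T})^{−1} ∑_i y^{(i)} x^{(i)} (the partitioning and helper names are illustrative):

```python
import numpy as np

def local_stats(X_part, y_part):
    """map: each worker reduces its data to a k x k matrix and a k-vector,
    so only O(k^2) intermediate data needs to be communicated."""
    return X_part.T @ X_part, X_part.T @ y_part

def closed_form_from_partitions(partitions):
    """reduce: sum the local statistics and solve the normal equations."""
    stats = [local_stats(Xp, yp) for Xp, yp in partitions]
    gram = sum(g for g, _ in stats)    # sum_i x^(i) x^(i)^T  (k x k)
    xty = sum(v for _, v in stats)     # sum_i y^(i) x^(i)    (k,)
    return np.linalg.solve(gram, xty)

# Example: big n, small k -- 3 workers, 2 points each, k = 3
rng = np.random.default_rng(1)
X, y = rng.normal(size=(6, 3)), rng.normal(size=6)
partitions = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]
w = closed_form_from_partitions(partitions)
```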
3rd Rule of thumb
Minimize network communication — stay local
Example: Linear regression, big n and big k
• Gradient descent, communicate w_i
• O(k) communication: OK for fairly large k
• Compute locally on data (Data Parallel)
workers: the points x^{(1)}, …, x^{(6)} are partitioned across the workers
map: each worker computes its local summands (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
reduce: sum the summands, ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}, and update w_{i+1}
3rd Rule of thumb
Minimize network communication — stay local
Example: Linear regression, big n and huge k
• Gradient descent
• O(k) communication becomes slow with hundreds of millions of parameters
• Distribute data and model (Data and Model Parallel)
• Often rely on sparsity to reduce communication
3rd Rule of thumb
Minimize network communication
Q: How should we leverage distributed computing while mitigating network communication?
First Observation: We need to store and potentially communicate Data, Model, and Intermediate objects
• A: Keep large objects local
Second Observation: ML methods are typically iterative
• A: Reduce # iterations
3rd Rule of thumb
Minimize network communication — reduce iterations
Distributed iterative algorithms must compute and communicate
• In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark, we strictly alternate between the two
Distributed Computing Properties
• Parallelism makes computation fast
• Network makes communication slow
Idea: Design algorithms that compute more, communicate less
• Do more computation at each iteration
• Reduce total number of iterations
3rd Rule of thumb
Minimize network communication
2nd Rule of thumb
Perform parallel and in-memory computation
1st Rule of thumb
Computation and storage should be linear in (n, k)
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Regularized Empirical Risk Minimization
• How to solve ERM objectives with optimization methods?
• What techniques exist for large-scale optimization?
θ̂ = argmin_θ (1/n) ∑_{i=1}^n ℓ(f_θ(x_i), y_i) + λ R(θ)
Techniques for large-scale optimization
Reduce computation:
• Distribute computation
• Use first-order methods (i.e., no second- or higher-order derivative information)
• Use stochastic methods
Reduce communication:
• Keep large objects local
• Reduce iterations
Note: these techniques may be at odds with one another!
Methods for distributed optimization
[Spectrum, from less local computation / more communication to more local computation / less communication:]
stochastic gradient descent → mini-batch SGD → gradient descent → local-updating methods → divide-and-conquer / one-shot averaging
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Stochastic gradient descent [FOR LINEAR REGRESSION]
How can we reduce computation?
Idea: approximate the full gradient with just one observation
Gradient Descent update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
Stochastic Gradient Descent (SGD) update: w_{i+1} = w_i − α_i (w_i^T x^{(j)} − y^{(j)}) x^{(j)}   (the sum is gone!)
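A minimal SGD sketch for least squares (the uniform sampling of j, the constant step size, and the iteration count are illustrative assumptions):

```python
import numpy as np

def least_squares_sgd(X, y, alpha=0.01, num_iters=1000, seed=0):
    """SGD: each update uses a single randomly chosen observation,
    w_{i+1} = w_i - alpha * (w_i^T x^(j) - y^(j)) x^(j) -- the sum is gone."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(num_iters):
        j = rng.integers(n)                        # sample one observation
        w = w - alpha * (X[j] @ w - y[j]) * X[j]   # update from that observation only
    return w
```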
Stochastic gradient descent [MORE GENERALLY]
Objective we want to solve: min_w f(w), where f(w) := ∑_{j=1}^n f_j(w)
Gradient Descent update: w_{i+1} = w_i − α_i ∇f(w_i)
Stochastic Gradient Descent (SGD) update: w_{i+1} = w_i − α_i ∇f_j(w_i), with j sampled at random
Stochastic gradient descent
What could go wrong?
• Gradient with respect to one random example might point in the wrong direction
Why would we expect it to work?
• On average, gradient of random examples will approximate the full gradient
• Method should converge in expectation
Stochastic gradient descent
Pros
• Less computation
• n times cheaper than gradient descent at each iteration
• This faster per-iteration cost might lead to faster overall convergence
Cons
• Less stable convergence than gradient descent
• In terms of iterations: slower convergence than gradient descent
• More communication!
Parallel stochastic gradient descent for least squares
Example: n = 6; k = 100, 3 workers
Vector update: w_{i+1} = w_i − α_i (w_i^T x^{(j)} − y^{(j)}) x^{(j)}   (the sum is gone!)
• workers (store the data): O(nk) distributed storage
• map (the single summand (w_i^T x^{(j)} − y^{(j)}) x^{(j)}, computed on the worker holding x^{(j)}): O(k) distributed computation, O(k) local storage
• reduce (update w_{i+1}): O(k) local computation, O(k) local storage
Is this a good idea for distributed computing?
Issue:
• While we’ve reduced computation, we may require more iterations (i.e., more
communication) to converge
• Additionally, if we only process one observation at each iteration, some
machines will be idle
Another idea:
• Compute an approximate gradient with respect to a few (more than one)
observations [mini-batch methods]
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Mini-batch SGD/GD [FOR LINEAR REGRESSION]
Gradient Descent update: w_{i+1} = w_i − α_i ∑_{j=1}^n (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
SGD update (use a single observation): w_{i+1} = w_i − α_i (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
Mini-batch SGD/GD update (sum over a mini-batch of observations): w_{i+1} = w_i − α_i ∑_{j∈B_i} (w_i^T x^{(j)} − y^{(j)}) x^{(j)}
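A mini-batch SGD sketch following the update above (the batch size, sampling without replacement, and constant step size are illustrative choices):

```python
import numpy as np

def least_squares_minibatch_sgd(X, y, batch_size=32, alpha=0.01, num_iters=500, seed=0):
    """Mini-batch SGD: each update sums the summands over a random mini-batch B_i,
    w_{i+1} = w_i - alpha * sum_{j in B_i} (w_i^T x^(j) - y^(j)) x^(j)."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(num_iters):
        B = rng.choice(n, size=min(batch_size, n), replace=False)  # mini-batch indices
        grad = X[B].T @ (X[B] @ w - y[B])   # sum of summands over the mini-batch
        w = w - alpha * grad
    return w
```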
Mini-batch SGD [MORE GENERALLY]
Objective we want to solve: min_w f(w), where f(w) := ∑_{j=1}^n f_j(w)
Gradient Descent update: w_{i+1} = w_i − α_i ∇f(w_i)
Stochastic Gradient Descent (SGD) update: w_{i+1} = w_i − α_i ∇f_j(w_i), with j sampled at random
Mini-batch SGD update: w_{i+1} = w_i − α_i ∇f_{B_i}(w_i), with mini-batch B_i ⊆ {1, …, n} sampled at random
Mini-batch SGD
In each case the reduce step has the same form, w = w + ∑_{m=1}^M Δw_m, but the local updates Δw_m differ:
• SGD: computed from one observation
• gradient descent: computed from all observations
• mini-batch SGD: computed from some observations
Mini-batch SGD
https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
Mini-batch SGD
Pros
• In contrast to SGD: parallelizable
• Less computation than GD (might lead to faster overall convergence)
• Can tune computation vs. communication depending on batch size
Cons
• In terms of iterations: slower convergence than gradient descent
• Another parameter to tune (batch size)
• Still might be too much communication …
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Communication hierarchy (RECALL)
Access rates fall sharply with distance (RAM: 50 GB/s; local disks: 1 GB/s; within a rack: 1 GB/s; between different racks: 0.3 GB/s)
• Parallelism makes computation fast
• Network makes communication slow
Be mindful of this hierarchy when developing parallel/distributed algorithms!
One-shot averaging
map: each worker finds the optimal local model w*_m on its own data
average: w := (1/M) ∑_{m=1}^M w*_m
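A one-shot averaging sketch for least squares (the choice of np.linalg.lstsq as the local solver and the equal-weight average are illustrative assumptions):

```python
import numpy as np

def one_shot_average(partitions):
    """map: each worker fits the optimal local model w*_m on its own data;
    average: w := (1/M) * sum_m w*_m, using a single round of communication."""
    local_models = [np.linalg.lstsq(Xm, ym, rcond=None)[0] for Xm, ym in partitions]
    return np.mean(local_models, axis=0)

# Example: 3 workers, each holding its own slice of the data
rng = np.random.default_rng(2)
X, y = rng.normal(size=(60, 5)), rng.normal(size=60)
partitions = [(X[0:20], y[0:20]), (X[20:40], y[20:40]), (X[40:60], y[40:60])]
w_avg = one_shot_average(partitions)
```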
Challenge: One-shot averaging might not work
[Figure: the average of the locally optimal models can fall far from the ideal (globally optimal) solution]
Outline
1. Stochastic gradient descent (SGD)
2. Mini-batch SGD
3. Divide & conquer methods
4. Local-updating methods
Local-updating methods
[Figure: one-shot averaging contrasted with local-updating methods]
Methods for distributed optimization
[Spectrum, from less local computation / more communication to more local computation / less communication:]
stochastic gradient descent → mini-batch SGD → gradient descent → local-updating methods → divide-and-conquer / one-shot averaging
Local-updating methods
Pros
• Fully flexible in terms of computation vs. communication
• Allows re-use of local solvers
Cons
• Another parameter to tune
• Not well-understood for non-convex objectives [current active area of research!]
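One well-known local-updating scheme is local SGD (the idea underlying FedAvg-style methods): each worker takes several local steps between communication rounds, and the models are then averaged. A hedged sketch, with the number of rounds, local steps, and step size all chosen for illustration:

```python
import numpy as np

def local_sgd(partitions, k, rounds=10, local_steps=20, alpha=0.01, seed=0):
    """Local SGD for least squares: in each communication round, every worker takes
    several SGD steps on its own data, then the resulting models are averaged."""
    rng = np.random.default_rng(seed)
    w = np.zeros(k)
    for _ in range(rounds):                       # one communication per round
        local_models = []
        for Xm, ym in partitions:                 # local updates on each worker
            w_m = w.copy()
            for _ in range(local_steps):
                j = rng.integers(len(ym))
                w_m -= alpha * (Xm[j] @ w_m - ym[j]) * Xm[j]
            local_models.append(w_m)
        w = np.mean(local_models, axis=0)         # average the local models
    return w
```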
Federated Learning
Additional Reading
• Optimization for large-scale machine learning: https://leon.bottou.org/publications/pdf/tr-optml-2016.pdf
Next week
• Federated Learning