COMP3221: Distributed Systems
Distributed Machine Learning
Dr Nguyen Tran
School of Computer Science
Outline
1. ML Pipeline
2. Empirical Risk Minimization (ERM)
3. Cross-validation
Machine Learning Pipelines
Machine learning is the study of methods that improve their performance on some task with experience, using data.
Machine Learning: Regression
How much should you sell your house for?
• data (input): houses & their features
• learn: the x → y relationship
• predict (intelligence): y (continuous)
[Figure: house price ($) vs. house size, with a fitted regression line]
Classification
Cat or dog?
• data (input): pictures of cats & dogs
• learn: the x → y relationship
• predict (intelligence): y (categorical)
Machine learning pipeline
data → ML method → intelligence
The ML method comprises feature extraction, the model & its parameters, optimization, and evaluation.
Terminology
Observations (n). Items or entities used for learning or evaluation
Features (k). Attributes (typically numeric) used to represent an observation
Labels. Values / categories assigned to observations
Training and Test Data. Observations used to train and evaluate a learning algorithm
• Training data is given to the algorithm for training
• Test data is withheld at train time
Terminology — EXAMPLE: SPAM DETECTION
Observations (n). Items or entities used for learning or evaluation, e.g., emails
Features (k). Attributes (typically numeric) used to represent an observation, e.g., length, date, presence of keywords
Labels. Values / categories assigned to observations, e.g., {spam, not-spam}
Training and Test Data. Observations used to train and evaluate a learning algorithm, e.g., a set of emails along with their labels
• Training data is given to the algorithm for training
• Test data is withheld at train time
Two common learning settings
Supervised learning. Learning from labeled observations
• Labels ‘teach’ algorithm to learn mapping from observations to labels
Unsupervised learning. Learning from unlabeled observations
• Learning algorithm must find latent structure from features alone
• Can be goal in itself (discover hidden patterns, exploratory data analysis)
• Can be means to an end (preprocessing for supervised task)
Sample classification pipeline
Goal: Learn a mapping from observations to discrete labels given a set of training examples (supervised learning)
Example: Spam Classification
• Observations are emails
• Labels are {spam, not-spam} (binary classification)
• Given a set of labeled emails, we want to predict whether a new email is spam or not-spam
Classification pipeline (stage: Obtain Raw Data)
Pipeline stages: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict
Raw data consists of a set of labeled training observations (the training set)
E.g., spam classification (stage: Obtain Raw Data)
Observation: From: … “Eliminate your debt by giving us your money…” → Label: spam
Observation: From: … “Hi, it’s been a while! How are you? …” → Label: not-spam
Classification pipeline (stage: Feature Extraction)
Feature extraction typically transforms each observation into a vector of real numbers (features)
Success or failure of a classifier often depends on choosing good descriptions of observations!
Classification pipeline (stage: Supervised Learning)
Supervised Learning: Train a classifier using the training data
• Common classifiers include Logistic Regression, SVMs, Decision Trees, Random Forests, etc.
Training (especially at scale) often involves iterative computations, e.g., gradient descent
E.g., logistic regression (stage: Supervised Learning)
Goal: Find a linear decision boundary
• Parameters to learn are the feature weights and offset
• Nice probabilistic interpretation
• Discussed in more detail later in the course
Classification pipeline (stage: Evaluation)
How can we evaluate the quality of our classifier?
We want good predictions on unobserved data
• ‘Generalization’ ability
Accuracy on training data is overly optimistic, since the classifier has already learned from it
• We might be ‘overfitting’
Overfitting and generalization (stage: Evaluation)
Fitting training data does not guarantee generalization
[Figure: two decision boundaries over the same training points]
• Left: perfectly fits the training samples, but is complex / may be overfitting
• Right: misclassifies a few points, but is simple / generalizes
Classification pipeline (stage: Evaluation)
How can we evaluate the quality of our classifier?
Idea: Create a test set to simulate unobserved data
Classification pipeline (stage: Evaluation)
Evaluation: Split the full dataset into training / test datasets
• Train on the training set (don’t expose the test set to the classifier)
• Make predictions on the test set (ignoring the test labels)
• Compare test predictions with the underlying test labels to obtain test set accuracy
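To make the split concrete, here is a minimal sketch of this split-train-evaluate loop, assuming scikit-learn is available; the synthetic X and y stand in for extracted email features and spam / not-spam labels.

```python
# A minimal sketch of the split-train-evaluate loop (assumes scikit-learn).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))           # n = 1000 observations, k = 20 features
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic binary labels

# Split: the test set is withheld at train time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)   # train on the training set only
y_pred = clf.predict(X_test)                       # predict, ignoring test labels
print("test set accuracy:", accuracy_score(y_test, y_pred))
```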
Classification pipeline (stage: Predict)
• The evaluation criterion is called a ‘loss’ function
• Accuracy (or 0/1 loss) is common for classification
Predict: The final classifier can then be used to make predictions on future observations (new entity → prediction), e.g., new emails we receive
Outline
1. Revisiting the ML Pipeline
2. Supervised learning: ERM
3. Cross-validation
Supervised learning
In supervised learning, you have access to input variables, X, and outputs, Y, and the goal is to predict an output given an input
• Examples:
• Cat vs. dog (classification): predict whether a picture is a cat or a dog
• Housing prices (regression): predict the price of a house based on features (size, location, etc.)
• For supervised learning problems, it is natural to talk about the performance measure first, and the model second
• A decision-theoretic loss function is often part of the problem specification in supervised learning — the goal is to minimize ℓ(f(x), y)
Loss function
Supervised learning: Given x ∊ X, predict y ∊ Y by constructing a prediction rule f: X → Y
• What’s the ‘penalty’ we suffer by making an incorrect decision with f?
• Loss function: a performance metric that measures the closeness between the true label y and the prediction f(x)
Loss function: Classification
EXAMPLE
Classification: Given x ∊ X, predict y ∊ Y (discrete) with prediction rule f: X → Y
0/1 Loss: ℓ(f(x), y) = 𝟙[f(x) ≠ y]   (1 if f(x) ≠ y, else 0)

x                  y     f(x)   0/1 Loss
(image of a dog)   dog   dog    0
(image of a dog)   dog   cat    1
Loss function: Regression
EXAMPLE
Regression: Given x ∊ X, predict y ∊ Y (continuous) with prediction rule f: X → Y
0/1 Loss: ℓ(f(x), y) = 𝟙[f(x) ≠ y]

x         y          f(x)       0/1 Loss
(house)   $300,000   $300,000   0
(house)   $300,000   $300,500   1
(house)   $300,000   $400,000   1 (?)
Loss function: Regression
EXAMPLE
Regression: Given x ∊ X, predict y ∊ Y (continuous) with prediction rule f: X → Y
Square Loss: ℓ(f(x), y) = (f(x) − y)²

x         y          f(x)       0/1 Loss   Square Loss
(house)   $300,000   $300,000   0          0
(house)   $300,000   $300,500   1          250,000
(house)   $300,000   $400,000   1 (?)      1E+10
Loss function: Regression
EXAMPLE
Regression: Given x ∊ X, predict y ∊ Y (continuous) with prediction rule f: X → Y
Absolute Loss: ℓ(f(x), y) = |f(x) − y|

x         y          f(x)       0/1 Loss   Square Loss   Absolute Loss
(house)   $300,000   $300,000   0          0             0
(house)   $300,000   $300,500   1          250,000       500
(house)   $300,000   $400,000   1 (?)      1E+10         100,000
Loss function
Supervised learning: Given x ∊ X, predict y ∊ Y by constructing a prediction rule f: X → Y
• What’s the ‘penalty’ we suffer by making an incorrect decision with f?
• Loss function: a performance metric that measures the closeness between the true label y and the prediction f(x)
• Examples:
• 0/1 loss: ℓ(f(x), y) = 𝟙[f(x) ≠ y]
• Squared error loss: ℓ(f(x), y) = (f(x) − y)²
• Absolute error loss: ℓ(f(x), y) = |f(x) − y|
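A quick sketch of these three losses in Python, written directly from the definitions above; the values match the house-price tables.

```python
# The three example losses, computed directly from the definitions above.
def zero_one_loss(pred, y):
    return 0 if pred == y else 1

def squared_loss(pred, y):
    return (pred - y) ** 2

def absolute_loss(pred, y):
    return abs(pred - y)

# House-price row from the earlier tables: y = $300,000, f(x) = $300,500
print(zero_one_loss(300_500, 300_000))   # 1 (any miss counts the same)
print(squared_loss(300_500, 300_000))    # 250000
print(absolute_loss(300_500, 300_000))   # 500
```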
Empirical risk minimization
• A popular approach in supervised ML
• Given a loss ℓ and data (x₁, y₁), …, (xₙ, yₙ), we estimate a predictor f by minimizing the empirical risk:
f̂ = argmin_{f ∊ F} (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
Computational Considerations — EMPIRICAL RISK MINIMIZATION
f̂ = argmin_{f ∊ F} (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
• Computational considerations: How can we solve the resulting optimization problem in a computationally tractable manner?
• Statistical considerations: What guarantees do we have for the resulting empirical risk minimizer?
• Even when the class of functions, F, is simple (e.g., linear functions), the above optimization problem might be non-convex and thus difficult to solve
• For example, this (non-convexity) is the case for the 0/1 loss
Computational Considerations — EMPIRICAL RISK MINIMIZATION
f̂ = argmin_{f ∊ F} (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
• We will also consider how to solve this objective when:
• the number of features (k) is large
• the number of observations (n) is large
• Techniques: parallel and distributed learning, large-scale optimization, efficient data structures, dimensionality reduction, etc.
• We will first focus on convex objectives (linear regression, logistic regression), and then discuss non-convex objectives (deep neural networks) later in the course
Overfitting and generalization
Fitting training data does not guarantee generalization
[Figure: two decision boundaries over the same training points]
• Left: perfectly fits the training samples, but is complex / may be overfitting
• Right: misclassifies a few points, but is simple / generalizes
A more complex hypothesis class → greater risk of overfitting
Regularization — HOW TO PREVENT OVERFITTING?
• Key idea: modify the ERM objective to penalize complex models, e.g.,
ŵ = argmin_w (1/n) Σᵢ₌₁ⁿ ℓ(f_w(xᵢ), yᵢ) + λ R(w)
• Helps to prevent overfitting
• Note: we have re-parameterized our hypothesis, f, in terms of w
• Larger λ:
• more regularization, reduced likelihood of overfitting
• more bias, less variance
• Challenge: How do we tune λ?
Outline
1. Revisiting the ML Pipeline
2. Empirical Risk Minimization
3. Cross-validation
Hyperparameter tuning
• Challenge: We want to pick a value for λ that effectively regularizes the model and ensures we generalize well to new data
• However, we can’t use the test data to tune λ!
• Otherwise we will not have a reliable measure of generalization
• The test data is only used to evaluate model performance (and estimate the true risk) after training — it should NOT be used to inform the model
Tuning with a validation set — A FIRST APPROACH
Training data: Dtrain = {(x₁, y₁), …, (xₙ, yₙ)}
• Used to learn the ERM predictor
Test data: Dtest = {(x₁, y₁), …, (x_m, y_m)}
• Used to assess the performance of the ERM predictor
Validation data: Dval = {(x₁, y₁), …, (x_L, y_L)}
• Used to optimize hyperparameter(s)
Training, validation, and test data should not overlap!
Tuning with a validation set — A FIRST APPROACH
full dataset (D) → split into training data (Dtrain), validation data (Dval), and test data (Dtest)
Tuning with a validation set — A FIRST APPROACH
• For each possible value of the hyperparameter (e.g., λ ∊ {0.001, 0.01, 0.1, 1, 10}):
• Train a model using Dtrain
• Evaluate the resulting model using Dval
• Choose the model with the best performance on Dval (see the sketch below)
• Evaluate the final model on Dtest
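A minimal sketch of this tuning loop; train_ridge and mse are illustrative stand-ins (ridge regression, introduced later in this lecture, as the model, and mean squared error on Dval as the selection criterion), not a fixed API.

```python
# Sketch of the validation-set tuning loop (helpers are illustrative).
import numpy as np

def train_ridge(X, y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

def tune(X_train, y_train, X_val, y_val, lams=(0.001, 0.01, 0.1, 1, 10)):
    best_lam, best_w, best_err = None, None, np.inf
    for lam in lams:
        w = train_ridge(X_train, y_train, lam)   # train a model using D_train
        err = mse(w, X_val, y_val)               # evaluate it using D_val
        if err < best_err:
            best_lam, best_w, best_err = lam, w, err
    return best_lam, best_w   # evaluate best_w once, at the end, on D_test
```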
What if we don’t have separate validation data?
Cross-validation
• Split the training data into K equal parts
• Use each part in turn as a validation dataset, with the remainder used for training
• Choose the hyperparameter such that the model performs best (e.g., on average) across the K validation datasets (see the sketch below)
• Special case: when K = n, leave-one-out (LOO) cross-validation
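A sketch of K-fold cross-validation, reusing the illustrative train_ridge and mse helpers from the previous sketch.

```python
# Sketch of K-fold cross-validation: each fold serves once as the validation set.
import numpy as np

def cv_error(X, y, lam, K=5):
    folds = np.array_split(np.random.permutation(len(y)), K)
    errs = []
    for i in range(K):
        val = folds[i]                                   # fold i validates...
        tr = np.concatenate(folds[:i] + folds[i + 1:])   # ...the rest train
        w = train_ridge(X[tr], y[tr], lam)
        errs.append(mse(w, X[val], y[val]))
    return np.mean(errs)   # pick the lambda with the smallest average error
```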
Linear regression
Example: Predicting house price from size, location, age
We can augment the feature vector to incorporate the offset:
xᵀ = [1 x₁ x₂ x₃]
We can then rewrite this linear mapping as a scalar product:
y ≈ ŷ = Σᵢ₌₀³ wᵢ xᵢ = wᵀx
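In code, the augmentation is just prepending a constant 1, after which prediction is a single dot product; the feature values and weights below are made up for illustration.

```python
# Offset via feature augmentation: prepend a 1, then predict with a dot product.
import numpy as np

x_raw = np.array([120.0, 3.0, 25.0])     # e.g., size, location score, age (made up)
x = np.concatenate(([1.0], x_raw))       # x^T = [1  x1  x2  x3]
w = np.array([50.0, 2.0, 10.0, -1.0])    # w[0] plays the role of the offset
y_hat = w @ x                            # ŷ = w^T x
```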
1D example
Goal: find the line of best fit
x coordinate: features; y coordinate: labels
y ≈ ŷ = w₀ + w₁x   (w₀: intercept / offset, w₁: slope)
Empirical risk minimization — LINEAR REGRESSION VIA ERM
• A popular approach in supervised ML
• Given a loss ℓ and data (x₁, y₁), …, (xₙ, yₙ), we estimate a predictor f by minimizing the empirical risk:
f̂ = argmin_{f ∊ F} (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience
Question: how should we select our function class, F (→ linear functions), and our loss function, ℓ?
Evaluating predictions
Can measure ‘closeness’ between label and prediction:
• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years
What is an appropriate evaluation metric or ‘loss’ function?
• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)²
Loss function: Regression
Absolute Loss: ℓ(f(x), y) = |f(x) − y|
Square Loss: ℓ(f(x), y) = (f(x) − y)²

x         y          f(x)       Square Loss   Absolute Loss
(house)   $300,000   $300,000   0             0
(house)   $300,000   $300,500   250,000       500
(house)   $300,000   $400,000   1E+10         100,000
Evaluating predictions
Can measure ‘closeness’ between label and prediction:
• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years
What is an appropriate evaluation metric or ‘loss’ function?
• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)² ← has nice mathematical properties
Empirical risk minimization — LINEAR REGRESSION VIA ERM
• A popular approach in supervised ML
• Given a loss ℓ and data (x₁, y₁), …, (xₙ, yₙ), we estimate a predictor f by minimizing the empirical risk:
f̂ = argmin_{f ∊ F} (1/n) Σᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience
Question: how should we select our function class, F (→ linear functions), and our loss function, ℓ (→ square loss)?
How can we learn model (w)?
Assume we have n training points, where x⁽ⁱ⁾ denotes the iᵗʰ point
Recall two earlier points:
• Linear assumption: ŷ = wᵀx
• We use square loss: (y − ŷ)²
Idea: Find the w that minimizes the square loss over the training points:
min_w Σᵢ₌₁ⁿ (wᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²
Least Squares Regression: Learn a mapping (w) from features to labels that minimizes the residual sum of squares:
min_w Σᵢ₌₁ⁿ (wᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²  =  min_w ‖Xw − y‖₂²   (equivalent by the definition of the Euclidean norm)
Given n training points with k features, we define:
• X ∊ ℝⁿˣᵏ : matrix storing the points
• y ∊ ℝⁿ : real-valued labels
• ŷ ∊ ℝⁿ : predicted labels, where ŷ = Xw
• w ∊ ℝᵏ : regression parameters / model to learn
How can we learn model (w)?
Least Squares Regression: Learn a mapping from features to labels that minimizes the residual sum of squares:
min_w ‖Xw − y‖₂²
Closed-form solution: w = (XᵀX)⁻¹Xᵀy (if the inverse exists)
Find the solution by setting the derivative to zero:
f(w) = ‖Xw − y‖₂²,  ∇f(w) = 2Xᵀ(Xw − y) = 0  →  XᵀXw − Xᵀy = 0  →  w = (XᵀX)⁻¹Xᵀy
1D case: f(w) = Σᵢ₌₁ⁿ (w x⁽ⁱ⁾ − y⁽ⁱ⁾)²,  df/dw = Σᵢ₌₁ⁿ 2 x⁽ⁱ⁾(w x⁽ⁱ⁾ − y⁽ⁱ⁾) = 0
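A sketch of the closed-form solution in NumPy; np.linalg.solve is used here instead of forming the inverse explicitly, which computes the same w = (XᵀX)⁻¹Xᵀy when XᵀX is invertible.

```python
# Closed-form least squares via the normal equations (assumes X^T X invertible).
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, k))
w_true = rng.normal(size=k)
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy linear data

w = np.linalg.solve(X.T @ X, X.T @ y)       # normal equations: X^T X w = X^T y
print(np.allclose(w, w_true, atol=0.1))     # True: recovers w up to noise
```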
Overfitting and generalization
We want good predictions on new data, i.e., ‘generalization’
Least squares regression minimizes the training error, and could overfit
• Simpler models are more likely to generalize (Occam’s razor)
Can we change the problem to penalize for model complexity?
• Intuitively, models with smaller weights are simpler
Regularization — HOW TO PREVENT OVERFITTING?
• Key idea: modify the ERM objective to penalize complex models
• Helps to prevent overfitting
• Note: we have re-parameterized our hypothesis, f, in terms of w
• Larger λ:
• more regularization, reduced likelihood of overfitting
• more bias, less variance
Question: how do we select our regularizer, R()?
Ridge Regression: Learn a mapping (w) that minimizes the residual sum of squares along with a regularization term:
min_w ‖Xw − y‖₂² + λ‖w‖₂²
(first term: training error; second term: model complexity; the free parameter λ trades off between them)
Closed-form solution: w = (XᵀX + λIₖ)⁻¹Xᵀy
Given n training points with k features, we define:
• X ∊ ℝⁿˣᵏ : matrix storing the points
• y ∊ ℝⁿ : real-valued labels
• ŷ ∊ ℝⁿ : predicted labels, where ŷ = Xw
• w ∊ ℝᵏ : regression parameters / model to learn
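A sketch of the ridge closed form in NumPy; note how larger λ shrinks the learned weights (more bias, less variance).

```python
# Ridge closed form: the only change from least squares is the lam * I term,
# which also guarantees invertibility for lam > 0.
import numpy as np

def ridge(X, y, lam):
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X @ rng.normal(size=10) + rng.normal(size=50)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.linalg.norm(ridge(X, y, lam)))   # weight norm shrinks as lam grows
```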
Computing the closed-form solution — LEAST SQUARES REGRESSION
w = (XᵀX)⁻¹Xᵀy   (for min_w ‖Xw − y‖₂²)
Consider the number of arithmetic operations (+, −, ×, /)
Computational bottlenecks:
• Matrix multiply XᵀX: O(nk²) operations
• Matrix inverse: O(k³) operations
Computation: O(nk² + k³) operations
Storage requirements — LEAST SQUARES REGRESSION
w = (XᵀX)⁻¹Xᵀy
Computation: O(nk² + k³) operations
Consider storing values as floats (8 bytes)
Storage bottlenecks:
• XᵀX and its inverse: O(k²) floats
• X: O(nk) floats
Storage: O(nk + k²) floats
Large n and small k — LEAST SQUARES REGRESSION
w = (XᵀX)⁻¹Xᵀy
Computation: O(nk² + k³) operations; Storage: O(nk + k²) floats
Assume O(k³) computation and O(k²) storage are feasible on a single machine.
Storing X and computing XᵀX are the bottlenecks
Can distribute storage and computation!
• Store data points (rows of X) across machines
• Compute XᵀX as a sum of outer products
Matrix multiplication via inner products
Each entry of the output matrix is the result of an inner product of the input matrices:
[9 3 5; 4 1 2] × [1 2; 3 −5; 2 3]:  entry (1,1) = 9×1 + 3×3 + 5×2 = 28
Matrix multiplication via outer products
The output matrix is the sum of outer products between corresponding columns of the first matrix and rows of the second:
[9 3 5; 4 1 2] × [1 2; 3 −5; 2 3] = [9; 4][1 2] + [3; 1][3 −5] + [5; 2][2 3] = [28 18; 11 9]
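Both views of the same product, checked in NumPy on the 2×3 and 3×2 matrices above:

```python
# The same product two ways: entry-by-entry inner products, and a sum of
# outer products of A's columns with B's rows.
import numpy as np

A = np.array([[9, 3, 5],
              [4, 1, 2]])
B = np.array([[1, 2],
              [3, -5],
              [2, 3]])

inner = np.array([[A[i] @ B[:, j] for j in range(B.shape[1])]
                  for i in range(A.shape[0])])
outer = sum(np.outer(A[:, i], B[i, :]) for i in range(A.shape[1]))

print(inner[0, 0])                                                # 9*1 + 3*3 + 5*2 = 28
print(np.array_equal(inner, A @ B), np.array_equal(outer, A @ B))  # True True
```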
Distributed computation of XᵀX — O(nk) Distributed Storage
XᵀX = Σᵢ₌₁ⁿ x⁽ⁱ⁾ x⁽ⁱ⁾ᵀ, where x⁽ⁱ⁾ ∊ ℝᵏ is the iᵗʰ data point (the iᵗʰ row of X)
Example: n = 6; 3 workers
workers: each worker stores two of the six points x⁽ⁱ⁾   — O(nk) distributed storage
map: each worker computes x⁽ⁱ⁾x⁽ⁱ⁾ᵀ for its points   — O(nk²) distributed computation, O(k²) local storage
reduce: (Σᵢ x⁽ⁱ⁾x⁽ⁱ⁾ᵀ)⁻¹   — O(k³) local computation, O(k²) local storage
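A sketch that simulates this map/reduce computation of XᵀX; the "workers" here are plain Python lists on one machine, whereas a real deployment would use Spark, MPI, or similar.

```python
# Simulated map/reduce computation of X^T X as a sum of outer products.
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 3
X = rng.normal(size=(n, k))
workers = np.array_split(X, 3)                    # 3 workers, 2 rows of X each

def map_step(rows):
    # Each worker's local sum of x^(i) x^(i)^T: a k x k matrix, O(k^2) storage
    return sum(np.outer(x, x) for x in rows)

partials = [map_step(rows) for rows in workers]   # map
XtX_inv = np.linalg.inv(sum(partials))            # reduce: sum, then invert

print(np.allclose(XtX_inv, np.linalg.inv(X.T @ X)))   # True
```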
Large n and large k — LEAST SQUARES REGRESSION
w = (XᵀX)⁻¹Xᵀy
Computation: O(nk² + k³) operations; Storage: O(nk + k²) floats
As before, storing X and computing XᵀX are bottlenecks.
Now, storing and operating on XᵀX is also a bottleneck
• Can’t easily be distributed!
Large n and large k — LEAST SQUARES REGRESSION
w = (XᵀX)⁻¹Xᵀy: storing X and computing XᵀX, as well as storing and operating on XᵀX, are bottlenecks that can’t easily be distributed.
1st rule of thumb: Computation and storage should be linear (in n, k)
Large n and large k
We need methods that are linear in time and space
One idea: Exploit sparsity or reduce dimension
• Explicit sparsity can provide orders-of-magnitude storage and computational gains
• A latent sparsity assumption can be used to reduce dimension, e.g., PCA, low-rank approximation (unsupervised learning)
[Figure: X (n×k) ≈ product of an n×r and an r×k matrix, r ≪ k — ‘low-rank’ approximation]
Large n and large k
Another idea: Use different algorithms
• Gradient descent is an iterative algorithm that requires O(nk) computation and O(k) local storage per iteration (see the sketch below)
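A sketch of gradient descent for the least squares objective: each iteration touches X once (O(nk) work) and keeps only O(k) extra state, never forming the k×k matrix XᵀX. The step size and iteration count below are illustrative, not tuned.

```python
# Gradient descent for least squares: O(nk) work and O(k) extra storage per
# iteration; the k x k matrix X^T X is never formed.
import numpy as np

def gd_least_squares(X, y, lr=0.1, iters=1000):
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(iters):
        grad = (2 / n) * X.T @ (X @ w - y)   # O(nk) per iteration
        w -= lr * grad
    return w
```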
Closed form solution for large n and large k
Example: n = 6; 3 workers
workers: each worker stores two of the six points x⁽ⁱ⁾   — O(nk) distributed storage
map: compute x⁽ⁱ⁾x⁽ⁱ⁾ᵀ   — O(nk²) distributed computation, O(k²) local storage
reduce: (Σᵢ x⁽ⁱ⁾x⁽ⁱ⁾ᵀ)⁻¹   — O(k³) local computation, O(k²) local storage
Gradient descent for large n and large k
Example: n = 6; 3 workers
workers: each worker stores two of the six points   — O(nk) distributed storage
map: ?   — O(nk) (down from O(nk²)) distributed computation, O(k) (down from O(k²)) local storage
reduce: ?   — O(k) (down from O(k³)) local computation, O(k) (down from O(k²)) local storage
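One plausible way to fill in the question marks, sketched ahead of the next lecture: map computes each worker's local gradient contribution (a k-vector), and reduce sums those vectors and takes a step, so every message and every reduce-side object is O(k).

```python
# Hypothetical single iteration of distributed gradient descent for least
# squares. Each shard is the list of (x, y) pairs held by one worker.
import numpy as np

def distributed_gd_step(shards, w, lr, n):
    # map: each worker sums the gradients of its own points — an O(k) vector
    local_grads = [sum(2 * (x @ w - y) * x for x, y in shard) for shard in shards]
    # reduce: sum the O(k) vectors and take a step — O(k) compute and storage
    return w - (lr / n) * sum(local_grads)
```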
Conclusion
• ML pipeline
• Linear Regression
• Distributed Linear Regression
What’s Next?
• Tutorial: Review Linear Algebra and big-O complexity
• Distributed Optimization for Large-scale Machine Learning