
COMP3221: Distributed
Systems

Distributed Machine Learning

Dr Nguyen Tran
School of Computer Science

Outline

1. ML Pipeline

2. Empirical Risk Minimization (ERM)

3. Cross-validation

2

Machine Learning Pipelines

Machine learning is the study of methods that improve their performance on some task with experience, i.e., using data

4

Machine Learning

Regression

How much should you sell your house for?

data → input: houses & features

regression (the ML method) → learn: x → y relationship

(figure: scatter plot of house size vs. price ($), with a fitted line)

intelligence → predict: y (continuous)
5

Classification

Cat or dog?

data → input: cats & dogs

classification (the ML method) → learn: x → y relationship

intelligence → predict: y (categorical)
6

Machine learning pipeline

data → ML method → intelligence

The ML method itself involves: feature extraction, model & parameters, optimization, and evaluation

7

Terminology

Observations (n). Items or entities used for learning or evaluation

Features (k). Attributes (typically numeric) used to represent an observation

Labels. Values / categories assigned to observations

Training and Test Data. Observations used to train and evaluate a learning algorithm
• Training data is given to the algorithm for training
• Test data is withheld at train time

8

Terminology

Observations (n). Items or entities used for learning or evaluation, e.g., emails

Features (k). Attributes (typically numeric) used to represent an observation, e.g., length,
date, presence of keywords

Labels. Values / categories assigned to observations, e.g., {spam, not-spam}

Training and Test Data. Observations used to train and evaluate a learning algorithm,
e.g., a set of emails along with their labels

• Training data is given to the algorithm for training
• Test data is withheld at train time

EXAMPLE : SPAM DETECTION

9

Two common learning settings

Supervised learning. Learning from labeled observations
• Labels ‘teach’ algorithm to learn mapping from observations to labels

Unsupervised learning. Learning from unlabeled observations
• Learning algorithm must find latent structure from features alone
• Can be goal in itself (discover hidden patterns, exploratory data analysis)
• Can be means to an end (preprocessing for supervised task)

10

Goal: Learn a mapping from observations to discrete labels given a set of training
examples (supervised learning)

Example: Spam Classification
• Observations are emails
• Labels are {spam, not-spam} (Binary Classification)
• Given a set of labeled emails, we want to predict whether a new email is spam or not-spam

Sample classification pipeline

11

Raw data consists of a set of labeled training observations (the training set)

(pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

12

Observation → Label

“Eliminate your debt by giving us your money…” → spam
“Hi, it’s been a while! How are you? …” → not-spam

(pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

E.g., spam classification

13

Feature extraction typically transforms each observation into a vector of real numbers (features)

Success or failure of a classifier often depends on choosing good descriptions of observations!

(pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

14

Supervised Learning: Train classifier using training data
• Common classifiers include Logistic Regression, SVMs, Decision Trees, Random Forests, etc.

Training (especially at scale) often involves iterative computations, e.g., gradient descent

(diagram: training set → classifier; pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

15

Goal: Find linear decision boundary

• Parameters to learn are feature weights and offset
• Nice probabilistic interpretation
• Discussed in more detail later in course

(pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

E.g., logistic regression

16

How can we evaluate the quality of our classifier?

We want good predictions on unobserved data
• ’Generalization’ ability

Accuracy on training data is overly optimistic since classifier
has already learned from it
• We might be ‘overfitting’

(diagram: training set → classifier; pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

17

Fitting training data does not guarantee generalization

Left: perfectly fits training samples, but it is complex / may be overfitting

Right: misclassifies a few points, but is simple / generalizes

(pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Overfitting and generalization

18

How can we evaluate the quality of our classifier?
Idea: Create test set to simulate unobserved data

(diagram: training set → classifier; pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

19

Evaluation: Split dataset into training / testing datasets

• Train on training set (don’t expose test set to classifier)
• Make predictions on test set (ignoring test labels)
• Compare test predictions with underlying test labels

(diagram: full dataset → training set → classifier; test set → accuracy; pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

20
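As a concrete illustration of this split / train / predict / compare loop, here is a minimal Python sketch, assuming scikit-learn is available; the random feature vectors and labels stand in for real featurized emails and are not from the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))            # featurized observations (placeholder data)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # labels, e.g., 1 = spam, 0 = not-spam

# Split the full dataset; the test set is withheld at train time
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)   # train on the training set only
test_preds = clf.predict(X_test)                   # predict, ignoring the test labels
print("test accuracy:", accuracy_score(y_test, test_preds))
```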

• Evaluation criterion is called a ‘loss’ function
• Accuracy (or 0-1 loss) is common for classification

Predict: Final classifier can then be used to make predictions on future observations, e.g., new emails we receive

(diagram: new entity → classifier → prediction; full dataset → training set / test set; pipeline: Obtain Raw Data → Feature Extraction → Supervised Learning → Evaluation → Predict)

Classification pipeline

21

Outline

1. Revisiting the ML Pipeline

2. Supervised learning: ERM

3. Cross-validation

22

Supervised learning

• Examples:
  • Cat vs. dog (classification): predict whether a picture is a cat or a dog
  • Housing prices (regression): predict the price of a house based on features (size, location, etc.)
• For supervised learning problems, it is natural to talk about the performance measure first, and the model second
• A decision-theoretic loss function is often part of the problem specification in supervised learning: the goal is to minimize ℓ(f(x), y)

In supervised learning, you have access to input variables, X, and outputs, Y, and the goal is to predict an output given an input

23

Loss function

Supervised learning: Given x ∊X, predict y ∊Y by constructing a prediction rule
f: X → Y

• What’s the ‘penalty’ we suffer by making an incorrect decision with f?
• Loss function: a performance metric that measures the closeness between the true

label y and the prediction f(x)

24

Loss function: Classification
EXAMPLE

x          y     f(x)   0/1 Loss
(image)    dog   dog    0
(image)    dog   cat    1

Classification: Given x ∊X, predict y ∊Y (discrete) with prediction rule f: X → Y

0/1 Loss

ℓ(f(x), y) = 1{f(x) ≠ y}   (indicator: 1 if f(x) ≠ y, else 0)

25

Loss function: Regression
EXA MPLE

x          y          f(x)        0/1 Loss
(house)    $300,000   $300,000    0
(house)    $300,000   $300,500    1
(house)    $300,000   $400,000    1?

Regression: Given x ∊ X, predict y ∊ Y (continuous) with prediction rule f: X → Y

0/1 Loss

ℓ(f(x), y) = 1{f(x) ≠ y}

26

Loss function: Regression
EXA MPLE

Regression: Given x ∊X, predict y ∊Y (continuous) with prediction rule f: X → Y

Square Loss

ℓ( f(x), y) = ( f(x) − y)2

x y f(x) 0/1 Loss Square Loss

$300,000 $300,000 0 0
$300,000 $300,500 1 250,000
$300,000 $400,000 1? 1E+10

27

Loss function: Regression
EXAMPLE

Regression: Given x ∊X, predict y ∊Y (continuous) with prediction rule f: X → Y

Absolute Loss

ℓ( f(x), y) = | f(x) − y |

x y f(x) 0/1 Loss Square Loss Absolute Loss

$300,000 $300,000 0 0 0

$300,000 $300,500 1 250,000 500

$300,000 $400,000 1? 1E+10 100,000

28

Loss function

Supervised learning: Given x ∊X, predict y ∊Y by constructing a prediction rule
f: X → Y

• What’s the ‘penalty’ we suffer by making an incorrect decision with f ?
• Loss function: a performance metric that measures the closeness between the true

label y and the prediction f(x)
• Examples:

• 0/1 Loss: ℓ(f(x), y) = 1{f(x) ≠ y}
• Squared error loss: ℓ( f(x),y) = ( f(x) − y)2
• Absolute error loss: ℓ( f(x),y) = | f(x) − y |

29
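To make the three losses concrete, here is a small plain-Python sketch that evaluates each of them on the house-price example from the earlier tables (y = $300,000); nothing here is specific to the course code base.

```python
def zero_one_loss(pred, y):
    """0/1 loss: 1 if the prediction differs from the label, else 0."""
    return 0 if pred == y else 1

def square_loss(pred, y):
    """Squared error loss."""
    return (pred - y) ** 2

def absolute_loss(pred, y):
    """Absolute error loss."""
    return abs(pred - y)

y = 300_000
for pred in (300_000, 300_500, 400_000):
    print(pred, zero_one_loss(pred, y), square_loss(pred, y), absolute_loss(pred, y))
# 300000 -> 0, 0, 0
# 300500 -> 1, 250000, 500
# 400000 -> 1, 10000000000, 100000
```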

Empirical risk minimization

• A popular approach in supervised ML
• Given a loss ℓ and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing the empirical risk:

f̂ = arg min_{f ∈ F} (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi)

30
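A minimal numpy sketch of what the empirical risk computes for one fixed predictor f; the toy data and the arbitrary linear predictor are made up for illustration, and ERM would then search over f ∈ F for the minimizer.

```python
import numpy as np

def empirical_risk(f, X, y, loss):
    """(1/n) * sum_i loss(f(x_i), y_i) for a fixed predictor f."""
    return np.mean([loss(f(x), yi) for x, yi in zip(X, y)])

square_loss = lambda pred, y: (pred - y) ** 2
f = lambda x: 2.0 * x[0] + 1.0            # an arbitrary (not optimized) linear predictor

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([3.1, 4.9, 7.2])
print(empirical_risk(f, X, y, square_loss))
```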

• Computational considerations: How can we solve the resulting optimization
problem in a computationally tractable manner?

• Statistical considerations: What guarantees do we have for the resulting
empirical risk minimizer?

Computational Considerations

• Even when the class of functions, F, is simple (e.g., linear functions), the above
optimization problem might be non-convex and thus difficult to solve

• For example, this (non-convexity) is the case for the 0/1 loss

EMPIRICAL RISK MINIMIZATION

f̂ = arg min_{f ∈ F} (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi)

31

Computational Considerations

• We will also consider how to solve this objective when:
• the number of features (k) is large
• the number of observations (n) is large

• Techniques:
  • parallel and distributed learning, large-scale optimization, efficient data structures, dimensionality reduction, etc.
• Will first focus on convex objectives (linear regression, logistic regression), and then discuss non-convex objectives (deep neural networks) later in the course

EMPIRICAL RISK MINIMIZATION

f̂ = arg min_{f ∈ F} (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi)

32

Overfitting and generalization

33

Overfitting and generalization

34

Fitting training data does not guarantee generalization

Left: perfectly fits training samples, but it is complex / may be overfitting.

Right: misclassifies a few points, but is simple / generalizes

Overfitting and generalization

More complex hypothesis class → greater risk of overfitting
35

Regularization

• Key idea: modify ERM objective to penalize complex models
• Helps to prevent overfitting
• Note: have re-parameterized our hypothesis, f, in terms of w
• Larger 𝝀

• more regularization, reduce likelihood of overfitting
• more bias, less variance

• Challenge: How to tune 𝝀 ?
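The regularized objective itself did not survive extraction from the slide; a standard form, consistent with the ridge regression slides later in the deck (where R(w) = ||w||₂²), would be:

```latex
\hat{w} \;=\; \arg\min_{w} \;\; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_w(x_i),\, y_i\big) \;+\; \lambda \, R(w)
```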

HOW TO PREVENT OVERFITTING?

36

Outline

1. Revisiting the ML Pipeline

2. Empirical Risk Minimization

3. Cross-validation

37

Hyperparameter tuning

• Challenge: We want to pick a value for 𝝀 that effectively regularizes the model and
ensures we generalize well to new data

• However, we can’t use the test data to tune 𝝀
• Otherwise we will not have a reliable measure of generalization
• The test data is only used to evaluate model performance (and estimate the true risk) after training; it should NOT be used to inform the model

38

Tuning with a validation set
A FIRST APPROACH

Training data: Dtrain = {(x1, y1), … (xn,yn)}
• Used to learn the ERM predictor

Test data: Dtest = {(x1, y1), … (xm,ym)}
• Used to assess the performance of the ERM predictor

Validation data: Dval = {(x1, y1), … (xL,yL)}
• Used to optimize hyperparameter(s)

training, validation, and test data should not overlap!
39

Tuning with a validation set
A FIRST APPROACH

full dataset (D)

training data: Dtrain validation data: Dval test data: Dtest

40

Tuning with a validation set
A FIRST APPROACH

• For each possible value of the hyperparameter (e.g., 𝝀 = {0.001, 0.01, 0.1, 1, 10})

• Train a model using Dtrain
• Evaluate the resulting model using Dval

• Choose the model with the best performance on Dval
• Evaluate final model on Dtest
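A minimal sketch of this procedure, assuming scikit-learn is available; the synthetic data, the 60/20/20 split, the candidate 𝝀 values, and the choice of Ridge as the model are illustrative, not prescribed by the slides.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

# 60% train / 20% validation / 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

best_lam, best_err = None, float("inf")
for lam in (0.001, 0.01, 0.1, 1, 10):
    model = Ridge(alpha=lam).fit(X_train, y_train)          # train on D_train
    err = mean_squared_error(y_val, model.predict(X_val))   # evaluate on D_val
    if err < best_err:
        best_lam, best_err = lam, err

final = Ridge(alpha=best_lam).fit(X_train, y_train)
print("chosen lambda:", best_lam)
print("test MSE:", mean_squared_error(y_test, final.predict(X_test)))  # only now touch D_test
```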

what if we don’t have validation data?

41

Cross-validation
• Split the training data into K equal parts

• Use each part in turn as a validation dataset, with the remaining for training

• Choose the hyperparameter such that the model performs the best (e.g., on average) across the K
validation datasets

• Special case: when K=n, leave-one-out (LOO) cross-validation

42
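A minimal numpy sketch of K-fold cross-validation; the ridge closed-form fit used inside each fold is the one introduced later in the deck, and the toy data is synthetic.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge solution (see the ridge regression slide): w = (X^T X + lam*I)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

def kfold_cv_error(X, y, lam, K=5, seed=0):
    """Average validation MSE of ridge regression across K folds."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    errors = []
    for val_idx in np.array_split(idx, K):       # each part is the validation set in turn
        train_idx = np.setdiff1d(idx, val_idx)   # remaining points are used for training
        w = ridge_fit(X[train_idx], y[train_idx], lam)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

# Choose the hyperparameter with the best average performance across the K folds
best = min((0.001, 0.01, 0.1, 1, 10), key=lambda lam: kfold_cv_error(X, y, lam))
print("best lambda:", best)
```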

Example: Predicting house price from size, location, age

y ≈ ŷ = w0 + w1 x1 + w2 x2 + w3 x3

We can augment the feature vector to incorporate the offset:

x⊤ = [1  x1  x2  x3]

We can then rewrite this linear mapping as a scalar product:

Linear regression

y ≈ ŷ = ∑_{i=0}^{3} wi xi = w⊤x
43
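A small numpy sketch of this augmentation step; the numbers for size, location, and age and the weights are made up, and the point is only that prepending a constant 1 folds the offset w0 into w.

```python
import numpy as np

x = np.array([120.0, 3.0, 25.0])        # (size, location, age) -- made-up values
w = np.array([1.5, 200.0, 50.0, -2.0])  # (w0, w1, w2, w3): offset plus one weight per feature

x_aug = np.concatenate(([1.0], x))      # x^T = [1, x1, x2, x3]
y_hat = w @ x_aug                       # scalar product w^T x == w0 + w1*x1 + w2*x2 + w3*x3
print(y_hat)
```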

Goal: find the line of best fit
• x coordinate: features
• y coordinate: labels

1D example

y ≈ ŷ = w0 + w1 x     (w0: intercept / offset, w1: slope)

44

Empirical risk minimization

• A popular approach in supervised ML
• Given a loss ℓ and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing the empirical risk:

f̂ = arg min_{f ∈ F} (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi)

• We typically restrict this predictor to lie in some class, F
  • Could reflect our prior knowledge about the task
  • Or may be for computational convenience

LINEAR REGRESSION VIA ERM

Question: how should we select our function class, F, and our loss function, ℓ?
(annotation on slide: F → linear functions)

45

Can measure ‘closeness’ between label and prediction
• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or ‘loss’ function?
• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)²

Evaluating predictions

46

Loss function: Regression

Absolute Loss

ℓ( f(x), y) = | f(x) − y |

x y f(x) Square Loss Absolute Loss

$300,000 $300,000 0 0
$300,000 $300,500 250,000 500
$300,000 $400,000 1E+10 100,000

Square Loss

ℓ( f(x), y) = ( f(x) − y)2

47

Can measure ‘closeness’ between label and prediction
• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or ‘loss’ function?
• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)²   ← Has nice mathematical properties

Evaluating predictions

48

Empirical risk minimization

• A popular approach in supervised ML
• Given a loss ℓ and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing the empirical risk:

f̂ = arg min_{f ∈ F} (1/n) ∑_{i=1}^{n} ℓ(f(xi), yi)

• We typically restrict this predictor to lie in some class, F
  • Could reflect our prior knowledge about the task
  • Or may be for computational convenience

LINEAR REGRESSION VIA ERM

Question: how should we select our function class, F, and our loss function, ℓ?
(annotations on slide: F → linear functions, ℓ → square loss)

49

Assume we have n training points, where x(i) denotes the ith point

Recall two earlier points:
• Linear assumption: ŷ = w⊤x
• We use square loss: (y − ŷ)²

Idea: Find the w that minimizes the square loss over the training points:

min_w ∑_{i=1}^{n} (w⊤x(i) − y(i))²     where ŷ(i) = w⊤x(i)

How can we learn model (w)?
50

min_w ||Xw − y||₂²

Equivalent by definition of the Euclidean norm

Given n training points with k features, we define:
• X ∈ ℝn×k : matrix storing points
• y ∈ ℝn : real-valued labels
• ŷ ∈ ℝn : predicted labels, where ŷ = Xw
• w ∈ ℝk : regression parameters / model to learn

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ∑_{i=1}^{n} (w⊤x(i) − y(i))²

51

How can we learn model (w)?

Least Squares Regression: Learn mapping from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Closed form solution: w = (X⊤X)⁻¹X⊤y (if inverse exists)

Find solution by setting the derivative to zero (1D case):

f(w) = ||xw − y||₂² = ∑_{i=1}^{n} (w x(i) − y(i))²

df/dw = ∑_{i=1}^{n} 2 x(i) (w x(i) − y(i)) = 0
⇒ w x⊤x − x⊤y = 0
⇒ w = (x⊤x)⁻¹ x⊤y

52
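A minimal numpy sketch of the closed-form solution on synthetic data; np.linalg.solve is used instead of an explicit inverse, but it computes the same w = (X⊤X)⁻¹X⊤y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5
X = rng.normal(size=(n, k))
w_true = rng.normal(size=k)
y = X @ w_true + 0.01 * rng.normal(size=n)   # labels with a little noise

# w = (X^T X)^{-1} X^T y, via a linear solve rather than an explicit inverse
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true, atol=0.01))  # recovers the true weights up to noise
```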

How can we learn model (w)?

We want good predictions on new data, i.e., ’generalization’

Least squares regression minimizes training error, and could overfit
• Simpler models are more likely to generalize (Occam’s razor)

Overfitting and generalization

Can we change the problem to penalize for model complexity?
• Intuitively, models with smaller weights are simpler

53

Regularization

• Key idea: modify ERM objective to penalize complex models
• Helps to prevent overfitting
• Note: have re-parameterized our hypothesis, f, in terms of w
• Larger 𝝀

• more regularization, reduce likelihood of overfitting
• more bias, less variance

HOW TO PREVENT OVERFITTING?

Question: how to select our regularizer, R() ??
54

Given n training points with k features, we define:
• X ∈ ℝn×k : matrix storing points
• y ∈ ℝn : real-valued labels
• ŷ ∈ ℝn : predicted labels, where ŷ = Xw
• w ∈ ℝk : regression parameters / model to learn

Ridge Regression: Learn mapping (w) that minimizes residual sum of squares along with a regularization term:

min_w ||Xw − y||₂² + λ||w||₂²
(training error)      (model complexity)

λ is a free parameter that trades off between training error and model complexity

Closed-form solution: w = (X⊤X + λIk)⁻¹X⊤y

55
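A minimal numpy sketch of the ridge closed form; the data is synthetic and λ = 0.1 is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 10
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=n)

lam = 0.1
# w = (X^T X + lam * I_k)^{-1} X^T y
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# Larger lambda shrinks the weights (more regularization, more bias, less variance)
w_heavy = np.linalg.solve(X.T @ X + 1000.0 * np.eye(k), X.T @ y)
print(np.linalg.norm(w_ridge), np.linalg.norm(w_heavy))
```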

min_w ||Xw − y||₂²    ⇒    w = (X⊤X)⁻¹X⊤y

Consider number of arithmetic operations ( +, −, ×, / )

Computational bottlenecks:
• Matrix multiply X⊤X : O(nk²) operations
• Matrix inverse: O(k³) operations

Computation: O(nk² + k³) operations

Computing closed form solution
LEAST SQUARES REGRESSION

56

Storage requirements

w = (X⊤X)⁻¹X⊤y
Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

Consider storing values as floats (8 bytes)

Storage bottlenecks:

• XTX and its inverse: O(k2) floats
• X: O(nk) floats

LEAST SQUARES REGRESSION

57


Store data points (rows of X ) across machines
Compute XTX as a sum of outer products

w = (XTX)-1XTy

Computation: O(nk2 + k3) operations
Storage: O(nk + k2) floats

Assume O(k3) computation and O(k2) storage feasible on single machine.

Storing X and computing X⊤X are the bottlenecks

Can distribute storage and computation!

Large n and small k
LEAST SQUARES REGRESSION

58

[ 9  3  5 ]     [ 1   2 ]
[ 4  1  2 ]  ×  [ 3  -5 ]
                [ 2   3 ]

Entry (1,1) of the output: 9×1 + 3×3 + 5×2 = 28

Each entry of the output matrix is the result of an inner product between a row of the first input matrix and a column of the second

Matrix multiplication via inner products

59

[ 9  3  5 ]     [ 1   2 ]
[ 4  1  2 ]  ×  [ 3  -5 ]
                [ 2   3 ]

The output matrix is the sum of outer products between corresponding columns of the first input matrix and rows of the second

Matrix multiplication via outer products

60
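A small numpy check that the sum of outer products over the data points reproduces X⊤X, which is what lets each worker contribute a partial sum; the 6 × 3 matrix is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # n = 6 points, k = 3 features

# Sum of outer products x(i) x(i)^T over the rows of X ...
outer_sum = sum(np.outer(x_i, x_i) for x_i in X)

# ... equals the matrix product X^T X
print(np.allclose(outer_sum, X.T @ X))   # True
```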

Example: n = 6; 3 workers

(figure: X⊤X = ∑_{i=1}^{n} x(i) x(i)⊤ computed via map/reduce, with the 6 points spread over 3 workers.
 Storing the points: O(nk) distributed storage.
 map: each worker computes the outer products x(i) x(i)⊤ of its local points: O(nk²) distributed computation, O(k²) local storage.
 reduce: sum the partial results and invert: O(k³) local computation, O(k²) local storage.)
61


Store data points (rows of X ) across machines
Compute XTX as a sum of outer products

Large n and small k

w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

Assume O(k3) computation and O(k2) storage feasible on single machine

LEAST SQUARES REGRESSION

Can distribute storage and computation!

62

w = (X⊤X)⁻¹X⊤y
Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

As before, storing X and computing XTX are bottlenecks.
Now, storing and operating on XTX is also a bottleneck
• Can’t easily be distributed!

Large n and large k

63

Example: n = 6; 3 workers

(figure: the same map/reduce computation of X⊤X = ∑_{i=1}^{n} x(i) x(i)⊤ as before: O(nk) distributed storage; map: O(nk²) distributed computation, O(k²) local storage; reduce: O(k³) local computation, O(k²) local storage.)

64

Large n and large k
w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

As before, storing X and computing X⊤X are bottlenecks.
Now, storing and operating on XTX is also a bottleneck

• Can’t easily be distributed!

1st Rule of thumb
Computation and storage should be linear (in n, k)

LEAST SQUARES REGRESSION

65

(figure: ‘Low-rank’ approximation of an n × k matrix as the product of an n × r and an r × k matrix, with r ≪ k)

We need methods that are linear in time and space

One idea: Exploit sparsity or reduce dimension

Explicit sparsity can provide orders of magnitude storage and
computational gains
Latent sparsity assumption can be used to reduce dimension, e.g., PCA, low-rank
approximation (unsupervised learning)

Large n and large k

66

Another idea: Use different algorithms
• Gradient descent is an iterative algorithm that requires O(nk) computation and O(k) local storage per iteration
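A minimal numpy sketch of gradient descent for the least squares objective; the step size and iteration count are illustrative, and each iteration only needs the O(nk) products Xw and X⊤(residual) plus O(k) storage for w.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1000, 5
X = rng.normal(size=(n, k))
y = X @ rng.normal(size=k) + 0.01 * rng.normal(size=n)

w = np.zeros(k)                       # O(k) storage for the model
step = 0.1 / n                        # illustrative fixed step size
for _ in range(500):
    residual = X @ w - y              # O(nk)
    grad = 2 * X.T @ residual         # gradient of ||Xw - y||^2, also O(nk)
    w = w - step * grad

print(np.linalg.norm(X @ w - y) ** 2 / n)   # training MSE after 500 iterations
```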

We need methods that are linear in time and space

One idea: Exploit sparsity or reduce dimension

Explicit sparsity can provide orders of magnitude storage and
computational gains
Latent sparsity assumption can be used to reduce dimension, e.g., PCA, low-rank
approximation (unsupervised learning)

Large n and large k

67

Example: n = 6; 3 workers

(figure: closed-form solution via map/reduce, as before: O(nk) distributed storage; map: outer products x(i) x(i)⊤, O(nk²) distributed computation, O(k²) local storage; reduce: sum and invert, O(k³) local computation, O(k²) local storage.)

Closed form solution for large n and large k

68

Example: n = 6; 3 workers

(figure: gradient descent via map/reduce: O(nk) distributed storage; map: O(nk) distributed computation and O(k) local storage per iteration, replacing the O(nk²) and O(k²) costs of the closed form; reduce: O(k) local computation and O(k) local storage, replacing O(k³) and O(k²).)

Gradient descent for large n and large k

69

Conclusion

ML pipeline
Linear Regression
Distributed Linear Regression

What’s Next?

Tutorial: Review Linear Algebra and big O complexity
Distributed Optimization for Large-scale Machine Learning