COMP3221: Distributed Systems
Distributed Logistic Regression
Dr Nguyen Tran
School of Computer Science
Regression
How much should you sell your house for?
• Data (input): houses & their features
• Learn: x → y relationship
• Predict: y (continuous), e.g. price ($) from house size
[Figure: regression fit of house price ($) against house size]
Classification
Cat or dog?
• Data (input): cats & dogs
• Learn: x → y relationship
• Predict: y (categorical)
Classification: example
Goal: Learn a mapping from observations to discrete labels given a set of training examples (supervised learning)
Example: Spam Classification
• Observations are emails
• Labels are {spam, not-spam} (Binary Classification)
• Given a set of labeled emails, we want to predict whether a new email is spam or not-spam
Classification: example
Goal: Learn a mapping from observations to discrete labels given a set of training examples (supervised learning)
Example: Click-through Rate Prediction
• Observations are user-ad-publisher triples
• Labels are {not-click, click} (Binary Classification)
• Given a set of labeled observations, we want to predict whether a new user-ad-publisher triple will result in a click
Linear least squares regression (RECALL)
Example: Predicting house price from size, location, age
For each observation we have a feature vector, x, and label, y:
xᵀ = [ x₁ x₂ x₃ ]
We assume a linear mapping between features and label:
y ≈ w₀ + w₁x₁ + w₂x₂ + w₃x₃
Linear least squares regression (RECALL)
Example: Predicting house price from size, location, age
We can augment the feature vector to incorporate the offset:
xᵀ = [ 1 x₁ x₂ x₃ ]
We can then rewrite the linear mapping as a scalar product:
y ≈ wᵀx
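To make the augmented scalar product concrete, here is a minimal sketch; the weight values and feature scales below are invented for illustration, not taken from the lecture:

import numpy as np

# Hypothetical weights: [offset w0, size, location score, age]
w = np.array([50_000.0, 300.0, 10_000.0, -1_500.0])

# Raw features of one house: size (m²), location score, age (years)
x_raw = np.array([120.0, 8.0, 15.0])

# Prepend a 1 so the offset w0 is absorbed into the dot product
x = np.concatenate(([1.0], x_raw))

price = w @ x   # y ≈ wᵀx = w0 + w1·x1 + w2·x2 + w3·x3
print(f"predicted price: ${price:,.0f}")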
Why a linear mapping?
• Simple
• Often works well in practice
• Can introduce complexity via feature extraction
Can we do something similar for classification?
Linear regression ⇒ linear classification
Example: Predicting rain from temperature, cloudiness, and humidity
Use the same feature representation: xᵀ = [ 1 x₁ x₂ x₃ ]
How can we make class predictions?
• {not-rain, rain}, {not-spam, spam}, {not-click, click}
• Idea: threshold by sign
ŷ = sign(wᵀx) = sign( ∑_{i=0}^3 wᵢxᵢ )
Linear classifier decision boundary
Example: wᵀ = [ −1 3 −4 ]
• xᵀ = [1 2 3]   : wᵀx = −7
• xᵀ = [1 2 1]   : wᵀx = 1
• xᵀ = [1 5 0.5] : wᵀx = 12
• xᵀ = [1 3 2.5] : wᵀx = −2
Decision boundary: wᵀx = 0, i.e. 3x₁ − 4x₂ − 1 = 0
[Figure: the four points in the (x₁, x₂) plane, separated by the line 3x₁ − 4x₂ − 1 = 0]
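The following short sketch (not part of the slides) reproduces the wᵀx scores and sign predictions for the four example points above:

import numpy as np

w = np.array([-1.0, 3.0, -4.0])     # [w0, w1, w2] from the example above
X = np.array([[1.0, 2.0, 3.0],      # augmented feature vectors [1, x1, x2]
              [1.0, 2.0, 1.0],
              [1.0, 5.0, 0.5],
              [1.0, 3.0, 2.5]])

scores = X @ w                      # wᵀx for each point: [-7, 1, 12, -2]
preds = np.sign(scores)             # ŷ = sign(wᵀx)
for s, p in zip(scores, preds):
    print(f"wᵀx = {s:5.1f}  ->  ŷ = {int(p):+d}")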
Evaluating predictions
Regression: can measure 'closeness' between label and prediction
• House price prediction: better to be off by $10K than by $1M
• Squared loss: (y − ŷ)²
Classification: class predictions are discrete
• 0-1 loss: penalty is 0 for a correct prediction, and 1 otherwise
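A quick sketch of the two losses just defined, using made-up numbers:

import numpy as np

# Regression: squared loss (y − ŷ)²
y_true, y_pred = 430_000.0, 440_000.0
squared_loss = (y_true - y_pred) ** 2        # off by $10K costs 1e8; off by $1M would cost 1e12

# Classification: 0-1 loss is 0 for a correct prediction and 1 otherwise
labels = np.array([1, -1, 1, -1])
preds = np.array([1, 1, 1, -1])
zero_one = (labels != preds).astype(int)     # [0, 1, 0, 0]
print(squared_loss, zero_one, zero_one.mean())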
0/1 loss minimization
Let y ∈ {−1, 1} and define z = y · wᵀx
ℓ_0/1(z) = 1 if z < 0, and 0 otherwise
[Figure: 0-1 loss ℓ_0/1(z) plotted against z = y · wᵀx]

How can we learn the model (w)?
Assume we have n training points, where x(i) denotes the ith point
Recall two earlier points:
• Linear assumption: ŷ = sign(wᵀx)
• We use 0-1 loss
Idea: Find w that minimizes the average 0-1 loss over the training points:
min_w ∑_{i=1}^n ℓ_0/1( y(i) · wᵀx(i) )
Issue: hard optimization problem, not convex!

Approximate 0/1 loss
Solution: Approximate the 0/1 loss with a convex ("surrogate") loss
• Logistic loss (logloss): ℓ_log(z) = log(1 + e^(−z))
• Other surrogates: SVM (hinge), Adaboost (exponential)
[Figure: 0-1, SVM, Logistic Regression and Adaboost losses ℓ(z) plotted against z = y · wᵀx]

Logistic regression optimization
Logistic Regression: Learn the mapping (w) that minimizes the logistic loss on the training data:
min_w ∑_{i=1}^n ℓ_log( y(i) · wᵀx(i) ) + λ‖w‖₂²    (training logloss + model complexity)
• Convex
• Closed-form solution doesn't exist

Logistic regression via empirical risk minimization
• A popular approach in supervised ML
• Given a loss ℓ and data (x₁, y₁), …, (xₙ, yₙ), we estimate a predictor f by minimizing the empirical risk:
  f̂ = argmin_{f ∈ F} (1/n) ∑_{i=1}^n ℓ( f(xᵢ), yᵢ )
• We typically restrict this predictor to lie in some class, F
  - Could reflect our prior knowledge about the task
  - Or may be for computational convenience
Question: how should we select our function class, F, and our loss function, ℓ? For logistic regression: the logistic loss
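To make the surrogate-loss idea concrete, here is a small sketch (not from the slides) evaluating the 0-1 loss and the logistic loss on a few margin values z = y · wᵀx:

import numpy as np

def zero_one_loss(z):
    """0-1 loss: 1 if the margin z = y·wᵀx is negative (misclassified), else 0."""
    return (z < 0).astype(float)

def logistic_loss(z):
    """Convex surrogate: ℓ_log(z) = log(1 + e^(−z))."""
    return np.log1p(np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(zero_one_loss(z))    # [1. 1. 0. 0. 0.]
print(logistic_loss(z))    # smooth, convex approximation that is easy to optimize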
Approximate 0/1 loss: logistic regression with a probabilistic interpretation

Probabilistic interpretation
Goal: Model the conditional probability P[ y = 1 | x ]
Example: Predict rain from temperature, cloudiness, humidity
• P[ y = rain | t = 14°F, c = LOW, h = 2% ] = 0.05
• P[ y = rain | t = 70°F, c = HIGH, h = 95% ] = 0.9
Example: Predict click from ad's historical performance, user's click frequency, and publisher page's relevance
• P[ y = click | h = GOOD, f = HIGH, r = HIGH ] = 0.6
• P[ y = click | h = BAD, f = LOW, r = LOW ] = 0.001

First thought: P[ y = 1 | x ] = wᵀx ?
• Linear regression returns any real number, but probabilities lie in the range [0, 1]
How can we transform its output?
• Use the logistic (or sigmoid) function: P[ y = 1 | x ] = σ(wᵀx)

Logistic function
σ(z) = 1 / (1 + exp(−z))
• Maps real numbers to [0, 1]
• Large positive inputs ⇒ 1
• Large negative inputs ⇒ 0
[Figure: sigmoid curve σ(z) plotted against z]

Logistic regression uses the logistic function to model this conditional probability:
• P[ y = 1 | x ] = σ(wᵀx)
• P[ y = 0 | x ] = 1 − σ(wᵀx)

How do we use probabilities?
To make class predictions, we need to convert probabilities to values in {0, 1}
We can do this by setting a threshold on the probabilities
• Default threshold is 0.5: P[ y = 1 | x ] > 0.5 ⇒ ŷ = 1, otherwise ŷ = 0
Example: Predict rain from temperature, cloudiness, humidity
• P[ y = rain | t = 14°F, c = LOW, h = 2% ] = 0.05 ⇒ ŷ = 0
• P[ y = rain | t = 70°F, c = HIGH, h = 95% ] = 0.9 ⇒ ŷ = 1
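A brief sketch of turning σ(wᵀx) into a class prediction with the default 0.5 threshold; the weights and the feature encoding (cloudiness and humidity mapped to [0, 1]) are invented for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights for [1, temperature (°F), cloudiness, humidity]
w = np.array([-12.0, 0.08, 3.0, 6.0])

x = np.array([1.0, 70.0, 1.0, 0.95])   # augmented features: warm, high cloud, high humidity
p_rain = sigmoid(w @ x)                # P[y = 1 | x] = σ(wᵀx) ≈ 0.91
y_hat = 1 if p_rain > 0.5 else 0       # threshold at 0.5  ->  ŷ = 1 (rain)
print(p_rain, y_hat)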
Connection with decision boundary?
Threshold by sign to make class predictions: ŷ = sign(wᵀx)
• ŷ = 1 : wᵀx > 0
• ŷ = 0 : wᵀx < 0
• Decision boundary: wᵀx = 0
[Figure: decision boundary separating the two classes in the (x₁, x₂) plane]
How does this compare with thresholding the probability?
• P[ y = 1 | x ] = σ(wᵀx) > 0.5 ⇒ ŷ = 1
• σ(0) = 0.5, so σ(wᵀx) = 0.5 ⇔ wᵀx = 0
• With a threshold of 0.5, the decision boundaries are identical!
[Figure: sigmoid σ(z) with σ(0) = 0.5 marked]
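A quick numerical check (random data, not from the slides) that thresholding σ(wᵀx) at 0.5 yields exactly the same predictions as thresholding wᵀx at 0:

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X = rng.normal(size=(100, 3))       # 100 random augmented feature vectors

scores = X @ w
by_sign = (scores > 0).astype(int)                         # ŷ from the sign of wᵀx
by_prob = (1 / (1 + np.exp(-scores)) > 0.5).astype(int)    # ŷ from σ(wᵀx) > 0.5
print(np.array_equal(by_sign, by_prob))                    # True: identical decision boundaries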
Working directly with probabilities
Probabilities provide more granular information
• Confidence of prediction
• Useful when combining predictions with other information
Example: Predict click from ad's historical performance, user's click frequency, and publisher page's relevance
• P[ y = click | h = GOOD, f = HIGH, r = HIGH ] = 0.6 ⇒ ŷ = 1
• P[ y = click | h = BAD, f = LOW, r = LOW ] = 0.001 ⇒ ŷ = 0
In such cases, we want to evaluate probabilities directly
• Logistic loss makes sense for evaluation!
Example: Spam classification
Observations are emails with features x; labels y ∈ {0, 1} (1 = spam); model P[ y = 1 | x ] = σ(wᵀx)

Likelihood function
Writing pᵢ = P[ yᵢ = 1 | xᵢ ] = σ(wᵀxᵢ), the likelihood of the training labels is
L(w) = ∏_{i=1}^n pᵢ^yᵢ (1 − pᵢ)^(1 − yᵢ)

Log likelihood or cross-entropy error
E(w) = −ln L(w) = −∑_{i=1}^n [ yᵢ ln pᵢ + (1 − yᵢ) ln(1 − pᵢ) ]
Its gradient:
∇E(w) = ∑_{i=1}^n (pᵢ − yᵢ) xᵢ
Log-likelihood loss
Per-example loss on the predicted probability p = P[ y = 1 | x ]:
ℓ_log(p, y) = −log(p) if y = 1,  −log(1 − p) if y = 0
When y = 1:
• Cost is 0 when the prediction is correct (p = 1)
• Penalty increases as p moves away from 1
Similar when y = 0
Captures the intuition that larger mistakes should get larger penalties
[Figure: −log(p) and −log(1 − p) plotted against p]
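A minimal sketch (synthetic data, written against the formulas above rather than any particular library) of the cross-entropy error and its gradient for a batch of training points:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_and_grad(w, X, y):
    """E(w) = −Σ [y·ln p + (1−y)·ln(1−p)] with p = σ(Xw); ∇E(w) = Xᵀ(p − y)."""
    p = sigmoid(X @ w)
    eps = 1e-12                               # avoid log(0)
    loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    grad = X.T @ (p - y)                      # Σ (p_i − y_i) x_i
    return loss, grad

# Tiny synthetic example: 4 points, 3 features (first column is the offset term)
X = np.array([[1, 0.5, 1.2], [1, -1.0, 0.3], [1, 2.0, -0.5], [1, 0.1, 0.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
print(cross_entropy_and_grad(np.zeros(3), X, y))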
Multinomial Logistic Regression
Log likelihood
For K classes with softmax probabilities p_k(x) = exp(w_kᵀx) / ∑_j exp(w_jᵀx) and 1-of-K encoded labels yᵢₖ:
ln L(W) = ∑_{i=1}^n ∑_{k=1}^K yᵢₖ ln p_k(xᵢ)

Cross-entropy error function
E(W) = −ln L(W) = −∑_{i=1}^n ∑_{k=1}^K yᵢₖ ln p_k(xᵢ)
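An illustrative sketch (synthetic data, not from the slides) of the softmax probabilities and the multinomial cross-entropy error just described:

import numpy as np

def softmax(scores):
    """Row-wise softmax: p_k(x) = exp(w_kᵀx) / Σ_j exp(w_jᵀx)."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def multinomial_cross_entropy(W, X, Y):
    """E(W) = −Σ_i Σ_k y_ik ln p_k(x_i) with 1-of-K encoded labels Y."""
    P = softmax(X @ W.T)                       # shape (n, K)
    return -np.sum(Y * np.log(P + 1e-12))

# 3 points, 2 features plus offset, K = 3 classes
X = np.array([[1, 0.2, 1.0], [1, -1.5, 0.3], [1, 0.7, -0.8]])
Y = np.eye(3)                                  # labels 0, 1, 2 in 1-of-K form
W = np.zeros((3, 3))                           # one weight vector per class
print(multinomial_cross_entropy(W, X, Y))      # = 3·ln 3 when all probabilities are uniform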
Logistic regression optimization
Goal: Find w* that minimizes the training objective f(w)
• Can solve via gradient descent
Update rule:
w_{t+1} = w_t − α_t ∑_{j=1}^n ( σ(w_tᵀx(j)) − y(j) ) x(j)
where σ(z) = 1 / (1 + exp(−z)), α_t is the step size and the sum is the gradient
[Figure: f(w) against w, with iterates w₀, w₁, w₂, … converging to w*]
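A compact sketch of this gradient-descent loop on synthetic data; the step size and iteration count are arbitrary illustrative choices:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_gd(X, y, step_size=0.01, iters=500):
    """Gradient descent: w ← w − α · Σ_j (σ(wᵀx(j)) − y(j)) x(j)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # sum of the per-point summands
        w -= step_size * grad
    return w

# Synthetic data: 50 points, first column is the offset feature
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])
y = (X[:, 1] + X[:, 2] > 0).astype(float)
print("learned w:", train_logistic_gd(X, y))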
Parallel gradient descent for logistic regression
Vector update: w_{t+1} = w_t − α_t ∑_{j=1}^n ( σ(w_tᵀx(j)) − y(j) ) x(j)
Compute the summands in parallel!
Example: n = 6 training points x(1), …, x(6) distributed across 3 workers (two points per worker)
• Training data: O(nk) distributed storage
• map: each worker computes the summands for its local points: O(nk) distributed computation, O(k) local storage
• reduce: sum the summands and update w_{t+1}: O(n) local computation, O(k) local storage
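A schematic sketch in plain Python (standing in for a real distributed framework) of the map/reduce pattern above: each "worker" partition computes its local summands, and the reduce step sums them to update w. The partitioning and the data are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_summands(w, X_part, y_part):
    """map step: Σ_j (σ(wᵀx(j)) − y(j)) x(j) over one worker's partition."""
    return X_part.T @ (sigmoid(X_part @ w) - y_part)

# n = 6 points, k = 3 features, split across 3 "workers" (2 points each)
X = np.column_stack([np.ones(6), np.arange(6), np.arange(6) % 2]).astype(float)
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
partitions = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]

w = np.zeros(3)
step_size = 0.01
for _ in range(100):                                               # gradient-descent iterations
    grads = [local_summands(w, Xp, yp) for Xp, yp in partitions]   # map: O(k) result per worker
    w = w - step_size * np.sum(grads, axis=0)                      # reduce: sum and update w
print("w after map/reduce-style updates:", w)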
Next week
• Security
• Releasing Assignment 1 marks