COMP3221: Distributed
Systems

Distributed Logistic Regression

Dr Nguyen Tran
School of Computer Science

Regression

How much should you sell your house for?

data (input: houses & features) → regression (learn: x → y relationship) → intelligence (predict: y, continuous)

[Figure: house price ($) plotted against house size, with a fitted regression line.]

2

Classification

Cat or dog?

data (input: cats & dogs) → classification (learn: x → y relationship) → intelligence (predict: y, categorical)

3

Classification: example

Goal: Learn a mapping from observations to discrete labels given a set of
training examples (supervised learning)

Example: Spam Classification
• Observations are emails
• Labels are {spam, not-spam} (Binary Classification)
• Given a set of labeled emails, we want to predict whether a new email is
  spam or not-spam

4

Classification: example

Goal: Learn a mapping from observations to discrete labels given a set of
training examples (supervised learning)

Example: Click-through Rate Prediction
• Observations are user-ad-publisher triples
• Labels are {not-click, click} (Binary Classification)
• Given a set of labeled observations, we want to predict whether a new
  user-ad-publisher triple will result in a click

5

Linear least squares regression (RECALL)

Example: Predicting house price from size, location, age

For each observation we have a feature vector, x, and label, y

  xT = [ x1 x2 x3 ]

We assume a linear mapping between features and label:

  y ≈ w0 + w1x1 + w2x2 + w3x3

6

Linear least squares regression (RECALL)

Example: Predicting house price from size, location, age

We can augment the feature vector to incorporate the offset:

  xT = [ 1 x1 x2 x3 ]

We can then rewrite this linear mapping as a scalar product:

  ŷ = wTx = Σi=0..3 wixi

7
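As a quick illustration of the augmented feature vector and the scalar-product form, here is a minimal NumPy sketch; the weights and feature values below are made up purely for illustration.

import numpy as np

# Hypothetical weights: offset w0 plus one weight per feature (size, location, age).
w = np.array([50_000.0, 300.0, 10_000.0, -1_500.0])

# Raw features for one house: [size, location score, age].
x_raw = np.array([120.0, 8.0, 25.0])

# Augment with a leading 1 so the offset w0 is absorbed into the scalar product.
x = np.concatenate(([1.0], x_raw))

# y_hat = w^T x = w0 + w1*x1 + w2*x2 + w3*x3
y_hat = w @ x
print(y_hat)  # same as 50_000 + 300*120 + 10_000*8 - 1_500*25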

Why a linear mapping?

• Simple
• Often works well in practice
• Can introduce complexity via feature extraction

Can we do something similar for classification?

8

Linear regression → linear classification

Example: Predicting rain from temperature, cloudiness, and humidity

Use the same feature representation: xT = [ 1 x1 x2 x3 ]

How can we make class predictions?
• {not-rain, rain}, {not-spam, spam}, {not-click, click}
• Idea: threshold by sign

  ŷ = wTx = Σi=0..3 wixi  ⟹  ŷ = sign(wTx)

9

Linear classifier decision boundary

Example: wT = [ −1 3 −4 ]

  xT = [1 2 3]   : wTx = −7
  xT = [1 2 1]   : wTx = 1
  xT = [1 5 .5]  : wTx = 12
  xT = [1 3 2.5] : wTx = −2

Decision boundary: 3x1 − 4x2 − 1 = 0, i.e. wTx = 0, a line in the (x1, x2) plane.

10
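The slide's numbers can be checked directly. A minimal NumPy sketch using the same w and example points:

import numpy as np

w = np.array([-1.0, 3.0, -4.0])           # w^T = [-1 3 -4]

X = np.array([
    [1.0, 2.0, 3.0],                      # w^T x = -7  -> predict -1
    [1.0, 2.0, 1.0],                      # w^T x =  1  -> predict +1
    [1.0, 5.0, 0.5],                      # w^T x = 12  -> predict +1
    [1.0, 3.0, 2.5],                      # w^T x = -2  -> predict -1
])

scores = X @ w                            # [-7., 1., 12., -2.]
y_hat = np.sign(scores)                   # threshold by sign
print(scores, y_hat)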

Evaluating predictions

Regression: can measure 'closeness' between label and prediction
• House price prediction: better to be off by 10K than by 1M
• Squared loss: (ŷ − y)²

Classification: class predictions are discrete
• 0-1 loss: penalty is 0 for a correct prediction, and 1 otherwise

11
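A minimal sketch of the two evaluation losses; the labels and predictions below are made up.

import numpy as np

# Regression: squared loss (y_hat - y)^2
y, y_hat = 520_000.0, 530_000.0
squared_loss = (y_hat - y) ** 2                   # 1e8: prediction off by 10K

# Classification: 0-1 loss, 0 for a correct prediction and 1 otherwise
labels = np.array([1, -1, 1, 1])
preds  = np.array([1, -1, -1, 1])
zero_one_loss = (labels != preds).astype(float)   # [0., 0., 1., 0.]
print(squared_loss, zero_one_loss.mean())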

0/1 loss minimization

Let y ∈ {−1, 1} and define z = y · wTx

  ℓ0/1(z) = 1 if z < 0, and 0 otherwise

[Figure: the 0-1 loss ℓ0/1(z) plotted against z = y · wTx.]

12

How can we learn the model (w)?

Assume we have n training points, where x(i) denotes the ith point

Recall two earlier points:
• Linear assumption: ŷ = sign(wTx)
• We use the 0-1 loss

Idea: Find w that minimizes the average 0-1 loss over the training points:

  min_w Σi=1..n ℓ0/1( y(i) · wTx(i) )

13

0/1 loss minimization

Let y ∈ {−1, 1} and define z = y · wTx

  ℓ0/1(z) = 1 if z < 0, and 0 otherwise

Issue: hard optimization problem, not convex!

14

Approximate 0/1 loss

Solution: Approximate the 0/1 loss with a convex loss ("surrogate" loss)

Logistic loss (logloss): ℓlog(z) = log(1 + e−z)

[Figure: 0-1, Adaboost, Logistic Regression, and SVM losses ℓ(z) plotted against z = y · wTx.]

15

Logistic regression optimization

Logistic Regression: Learn the mapping (w) that minimizes the logistic loss on the training data:

  min_w Σi=1..n ℓlog( y(i) · wTx(i) ) + λ‖w‖2²
        [training logloss]            [model complexity]

• Convex
• Closed-form solution doesn't exist

16

Empirical risk minimization

• A popular approach in supervised ML
• Given a loss ℓ and data (x1, y1), …, (xn, yn), we estimate a predictor f by minimizing the empirical risk:

  f̂n = argmin over f ∈ F of (1/n) Σi=1..n ℓ( f(xi), yi )

• We typically restrict this predictor to lie in some class, F
  • Could reflect our prior knowledge about the task
  • Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
Here: logistic regression via the logistic loss.

17

Approximate 0/1 loss: Logistic regression with probabilistic interpretation

18

Approximate 0/1 loss

Solution: Approximate the 0/1 loss with a convex loss ("surrogate" loss)
• SVM (hinge), Logistic regression (logistic), Adaboost (exponential)

Logistic loss (logloss): ℓlog(z) = log(1 + e−z)

19

Probabilistic interpretation

Goal: Model the conditional probability: P[ y = 1 | x ]

Example: Predict rain from temperature, cloudiness, humidity
• P[ y = rain | t = 14°F, c = LOW, h = 2% ] = .05
• P[ y = rain | t = 70°F, c = HIGH, h = 95% ] = .9

Example: Predict click from ad's historical performance, user's click frequency, and publisher page's relevance
• P[ y = click | h = GOOD, f = HIGH, r = HIGH ] = .6
• P[ y = click | h = BAD, f = LOW, r = LOW ] = .001

20

Probabilistic interpretation

Goal: Model the conditional probability: P[ y = 1 | x ]

First thought: P[ y = 1 | x ] = wTx ?
• Linear regression returns any real number, but probabilities lie in the range [0,1]

How can we transform its output?
• Use the logistic (or sigmoid) function: P[ y = 1 | x ] = σ(wTx)

21

Logistic function

  σ(z) = 1 / (1 + exp(−z))

• Maps real numbers to [0,1]
• Large positive inputs ⟹ 1
• Large negative inputs ⟹ 0

22

Probabilistic interpretation

Goal: Model the conditional probability: P[ y = 1 | x ]

Logistic regression uses the logistic function to model this conditional probability
• P[ y = 1 | x ] = σ(wTx)
• P[ y = 0 | x ] = 1 − σ(wTx)

For notational convenience we now define p = P[ y = 1 | x ] = σ(wTx)

23

How do we use probabilities?

To make class predictions, we need to convert probabilities into values in {0,1}

We can do this by setting a threshold on the probabilities
• Default threshold is 0.5
• P[ y = 1 | x ] > 0.5  ⟹  ŷ = 1

Example: Predict rain from temperature, cloudiness, humidity
• P[ y = rain | t = 14°F, c = LOW, h = 2% ] = .05  ⟹  ŷ = 0
• P[ y = rain | t = 70°F, c = HIGH, h = 95% ] = .9  ⟹  ŷ = 1

24
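To make the sigmoid and the logistic loss concrete, here is a minimal NumPy sketch under the slides' conventions (y ∈ {−1, 1} for the surrogate-loss view, σ(wTx) for the probabilistic view); the weights and data are made up.

import numpy as np

def sigmoid(z):
    """Logistic function: maps reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logloss(z):
    """Surrogate loss l_log(z) = log(1 + e^{-z}), with z = y * w^T x and y in {-1, 1}."""
    return np.log1p(np.exp(-z))

w = np.array([-1.0, 3.0, -4.0])                    # made-up weights
X = np.array([[1.0, 2.0, 1.0], [1.0, 2.0, 3.0]])   # augmented feature vectors
y = np.array([1.0, -1.0])                          # labels in {-1, 1}

scores = X @ w                                     # w^T x for each point: [1, -7]
print(logloss(y * scores))                         # per-point logistic loss
print(sigmoid(scores))                             # P[y = 1 | x] for each point
print((sigmoid(scores) > 0.5).astype(int))         # threshold at 0.5 -> class predictions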

Connection with decision boundary?

Threshold by sign to make class predictions: ŷ = sign(wTx)
• ŷ = 1 : wTx > 0
• ŷ = 0 : wTx < 0
• Decision boundary: wTx = 0 (a line in the (x1, x2) plane)

How does this compare with thresholding the probability?
• P[ y = 1 | x ] = σ(wTx) > 0.5  ⟹  ŷ = 1

25

Connection with decision boundary?

Threshold by sign to make class predictions: ŷ = sign(wTx)
• ŷ = 1 : wTx > 0
• ŷ = 0 : wTx < 0
• Decision boundary: wTx = 0

How does this compare with thresholding the probability?
• P[ y = 1 | x ] = σ(wTx) > 0.5  ⟹  ŷ = 1
• With a threshold of 0.5, the decision boundaries are identical!

  σ(0) = 0.5, so  wTx = 0  ⟺  σ(wTx) = 0.5

26
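A quick numeric check of the equivalence claimed above, on randomly generated (made-up) data:

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)
X = np.column_stack([np.ones(1000), rng.normal(size=(1000, 3))])  # augmented features

scores = X @ w
by_sign = scores > 0                                 # y_hat = 1 iff w^T x > 0
by_prob = 1.0 / (1.0 + np.exp(-scores)) > 0.5        # y_hat = 1 iff sigma(w^T x) > 0.5
print(np.array_equal(by_sign, by_prob))              # True: identical decision boundary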


Working directly with probabilities

Probabilities provide more granular information
• Confidence of the prediction
• Useful when combining predictions with other information

In such cases, we want to evaluate the probabilities directly
• The logistic loss makes sense for evaluation!

Example: Predict click from ad's historical performance, user's click frequency, and
publisher page's relevance
• P[ y = click | h = GOOD, f = HIGH, r = HIGH ] = .6
• P[ y = click | h = BAD, f = LOW, r = LOW ] = .001

27


Example: Spam classification

29

Example: Spam classification

Likelihood function

30

Log likelihood or cross-entropy error — and its gradient

31

Log-likelihood loss

  ℓlog(p, y) = −log(p)      if y = 1
             = −log(1 − p)  if y = 0

When y = 1:
• At p = 1 the cost is 0 (the prediction is correct)
• The penalty increases as p moves away from 1

Similar when y = 0.

[Figure: −log(p) and −log(1 − p) plotted against p.]

Captures the intuition that larger mistakes should get larger penalties.

32
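A minimal sketch of the log-likelihood loss in its piecewise form; the probability values are made up.

import numpy as np

def log_likelihood_loss(p, y):
    """l_log(p, y) = -log(p) if y == 1 else -log(1 - p), with p = P[y = 1 | x]."""
    return -np.log(p) if y == 1 else -np.log(1.0 - p)

# The closer p is to the true label, the smaller the penalty.
for p in (0.99, 0.7, 0.1):
    print(p, log_likelihood_loss(p, y=1), log_likelihood_loss(p, y=0))
# p = 0.99: tiny loss when y = 1, large loss when y = 0 (larger mistakes, larger penalties)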

Multinomial Logistic Regression

33

Log likelihood

34

Cross-entropy error function

35
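The multinomial slides above are image-only in this transcript. As an assumed standard formulation (a softmax over K classes trained with the cross-entropy error, not necessarily the exact notation used on the slides), here is a minimal sketch:

import numpy as np

def softmax(scores):
    """Row-wise softmax: turns K scores per point into K class probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean of -log P[true class]; labels are integer class indices."""
    n = labels.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

# Made-up example: n = 3 points, d = 4 features (incl. offset), K = 3 classes.
W = np.zeros((4, 3))                       # one weight vector per class
X = np.random.default_rng(1).normal(size=(3, 4))
y = np.array([0, 2, 1])

probs = softmax(X @ W)                     # uniform (1/3 each) since W is all zeros
print(cross_entropy(probs, y))             # log(3) ~ 1.0986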

Logistic regression optimization

Goal: Find w* that minimizes the (regularized) logistic loss objective
• Can solve via Gradient Descent

Update Rule (step size αt times the gradient; for labels y ∈ {0,1} this is the standard gradient of the logloss):

  wt+1 = wt − αt Σi=1..n ( σ(wtTx(i)) − y(i) ) x(i)

  where σ(z) = 1 / (1 + exp(−z))

[Figure: f(w) decreasing along the gradient-descent iterates w0, w1, w2, … towards the minimizer w*.]

36
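A gradient-descent sketch following the update rule above (labels y ∈ {0,1}, constant step size); the dataset and hyperparameters are made up.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y):
    """Sum over points of (sigma(w^T x_i) - y_i) * x_i: gradient of the logloss, y in {0,1}."""
    return X.T @ (sigmoid(X @ w) - y)

def gradient_descent(X, y, step_size=0.1, iterations=100):
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        w = w - step_size * gradient(w, X, y)   # w_{t+1} = w_t - alpha_t * gradient
    return w

# Tiny made-up dataset: augmented features [1, x1], labels in {0, 1}.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = gradient_descent(X, y)
print(w, (sigmoid(X @ w) > 0.5).astype(int))    # should recover [0, 0, 1, 1]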

Parallel gradient descent for logistic regression

Example: n = 6 points, 3 workers

  workers:  [ x(1) x(2) ]  [ x(3) x(4) ]  [ x(5) x(6) ]   — O(nk) Distributed Storage

Vector update:  wt+1 = wt − αt Σi=1..n ( σ(wtTx(i)) − y(i) ) x(i)

Compute the summands in parallel!

map:    each worker computes the summands for its local points
        — O(nk) Distributed Computation, O(k) Local Storage
reduce: sum the summands and update wt+1
        — O(n) Local Computation, O(k) Local Storage

37
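A minimal sketch of the map/reduce pattern above: the data is partitioned across workers, each map step computes the summands for its local points, and the reduce step sums them so the driver can update w. The worker count, dataset, and step size are made up, and a real deployment would place each partition on a separate machine (e.g. via Spark or MPI); this sketch only simulates the partitions with a Python list.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_gradient(w, X_part, y_part):
    """map: one worker's contribution, summing (sigma(w^T x) - y) x over its local points."""
    return X_part.T @ (sigmoid(X_part @ w) - y_part)

def distributed_gradient_descent(partitions, dim, step_size=0.1, iterations=100):
    """Driver loop: broadcast w, map local gradients, reduce by summing, update w."""
    w = np.zeros(dim)
    for _ in range(iterations):
        # map: each worker computes its partial gradient from local data only.
        partial_grads = [local_gradient(w, X_p, y_p) for X_p, y_p in partitions]
        # reduce: sum the k-dimensional summands (only O(k) values sent per worker).
        grad = np.sum(partial_grads, axis=0)
        w = w - step_size * grad                 # w_{t+1} = w_t - alpha_t * gradient
    return w

# Example from the slide: n = 6 points split across 3 workers (2 points each).
X = np.array([[1.0, -3.0], [1.0, -2.0], [1.0, -1.0],
              [1.0,  1.0], [1.0,  2.0], [1.0,  3.0]])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
partitions = [(X[0:2], y[0:2]), (X[2:4], y[2:4]), (X[4:6], y[4:6])]

w = distributed_gradient_descent(partitions, dim=2)
print(w, (sigmoid(X @ w) > 0.5).astype(int))     # predictions should match y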


Next week

• Security
• Releasing Assignment 1 marks