
Lecture 7 (part 2): Logistic Regression

COMP90049
Introduction to Machine Learning
Semester 1, 2020

Lea Frermann, CIS

1

Roadmap

So far …

• Naive Bayes

• Optimization (closed-form and iterative)

• Evaluation

Today: more classification!

• Logistic Regression

2

Logistic Regression

Quick Refresher

Recall Naive Bayes

P(x, y) = P(y) P(x|y) = ∏_{i=1}^{N} P(y^i) ∏_{m=1}^{M} P(x^i_m | y^i)

• a probabilistic generative model of the joint probability P(x , y)
• optimized to maximize the likelihood of the observed data
• naive due to unrealistic feature independence assumptions

For prediction, we apply Bayes Rule to obtain the conditional distribution

P(x, y) = P(y) P(x|y) = P(y|x) P(x)

P(y|x) = P(y) P(x|y) / P(x)

ŷ = argmax_y P(y|x) = argmax_y P(y) P(x|y)    (since P(x) does not depend on y)

How about we model P(y |x) directly? → Logistic Regression

3

Introduction to Logistic Regression

Logistic Regression on a high level

• Is a binary classification model

• Is a probabilistic discriminative model because it optimizes P(y |x)
directly

• Learns to optimally discriminate between inputs which belong to
different classes

• No model of P(x|y) → no conditional feature independence assumption

4

Aside: Linear Regression

• Regression: predict a real-valued quantity y given features x , e.g.,

housing price given {location, size, age, …}
success of movie ($) given {cast, genre, budget, …}
air quality given {temperature, timeOfDay, CO2, …}

• linear regression is the simplest regression model

• a real-valued ŷ is predicted as a linear combination of weighted feature
values

ŷ = θ0 + θ1 x1 + θ2 x2 + … = θ0 + ∑_i θi xi

• The weights θ0, θ1, . . . are model parameters, and need to be optimized
during training

• Loss (error) is the sum of squared errors (SSE): L = ∑_{i=1}^{N} (ŷ^i − y^i)²

5
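As a minimal sketch of these two formulas (plain Python; the weights, inputs, and targets below are invented purely for illustration), a prediction ŷ = θ0 + ∑_i θi xi and the SSE loss could be computed as follows:

# Minimal sketch of linear regression prediction and SSE loss.
# All numbers below are invented for illustration only.

def predict(theta, x):
    """ŷ = θ0 + θ1*x1 + ... + θF*xF, where theta[0] is the bias θ0."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

theta = [1.0, 0.5, -2.0]          # [θ0, θ1, θ2]
X = [[3.0, 1.0], [1.0, 0.0]]      # two training inputs
y = [0.4, 1.6]                    # their true target values

y_hat = [predict(theta, x) for x in X]
sse = sum((yh - yt) ** 2 for yh, yt in zip(y_hat, y))
print(y_hat, sse)                 # SSE = Σ_i (ŷ^i − y^i)²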

Logistic Regression: Derivation I

• Let's assume a binary classification task, y is true (1) or false (0).

• We model probabilities P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ]

• We want to use a regression approach

• How about: p(x) as a linear function of x. Problem: probabilities are bounded in 0 and 1, linear functions are not.

p(x) = θ0 + θ1 x1 + … + θF xF

• How about: log p(x) as a linear function of x. Problem: log is bounded in one direction, linear functions are not.

log p(x) = θ0 + θ1 x1 + … + θF xF

• How about: minimally modifying log p(x) such that it is unbounded, by applying the logistic transformation

log [ p(x) / (1 − p(x)) ] = θ0 + θ1 x1 + … + θF xF

6

Logistic Regression: Derivation II

log [ p(x) / (1 − p(x)) ] = θ0 + θ1 x1 + … + θF xF

• also called the log odds

• the odds are defined as the probability of success over the probability of failure

odds = P(success) / P(failure) = P(success) / (1 − P(success))

• e.g., the odds of rolling a 6 with a fair die are:

(1/6) / (1 − 1/6) = 0.17 / 0.83 = 0.2

7

Logistic Regression: Derivation III

log [ P(x) / (1 − P(x)) ] = θ0 + θ1 x1 + … + θF xF

If we rearrange and solve for P(x), we get

P(x) = exp(θ0 + ∑_{f=1}^{F} θf xf) / (1 + exp(θ0 + ∑_{f=1}^{F} θf xf))

     = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf)))

• where the RHS is the inverse logit (or logistic function)

• we pass a regression model through the logistic function to obtain a valid probability prediction

8
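A quick numerical check of this rearrangement (a plain-Python sketch; the probability 0.8 is an arbitrary value chosen for illustration): applying the logistic function to the log odds recovers the original probability.

import math

def logit(p):
    # log odds: log(p / (1 - p)), an unbounded real number
    return math.log(p / (1 - p))

def logistic(z):
    # inverse logit: 1 / (1 + exp(-z)), always between 0 and 1
    return 1 / (1 + math.exp(-z))

p = 0.8                  # arbitrary probability for illustration
z = logit(p)             # ≈ 1.386
print(logistic(z))       # ≈ 0.8, the original probability is recovered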

Logistic Regression: Interpretation

P(y|x; θ) = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf)))

A closer look at the logistic function

Most inputs lead to P(y|x) close to 0 or close to 1. That is intended, because all true labels are either 0 or 1.

• (θ0 + ∑_{f=1}^{F} θf xf) > 0 means y = 1

• (θ0 + ∑_{f=1}^{F} θf xf) ≈ 0 means most uncertainty

• (θ0 + ∑_{f=1}^{F} θf xf) < 0 means y = 0

9

Logistic Regression: Prediction

• The logistic function returns the probability P(y = 1) given an input x

P(y = 1|x1, x2, ..., xF; θ) = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf))) = σ(x; θ)

• We define a decision boundary, e.g., predict y = 1 if P(y = 1|x1, x2, ..., xF; θ) > 0.5 and y = 0 otherwise

10
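To make the prediction rule concrete, here is a minimal plain-Python sketch; the parameter vector and feature values are invented for illustration, and the only assumption beyond the slide is that x0 = 1 is included as a bias feature.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def predict_proba(theta, x):
    # x[0] is assumed to be the bias feature x0 = 1
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)            # P(y = 1 | x; θ)

theta = [0.2, 1.5, -0.8]         # invented parameters [θ0, θ1, θ2]
x = [1, 2.0, 1.0]                # x0 = 1 (bias), then two feature values
p = predict_proba(theta, x)
y_hat = 1 if p > 0.5 else 0      # decision boundary at 0.5
print(p, y_hat)                  # ≈ 0.917 → predict y = 1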

Example!

P(y = 1|x1, x2, ..., xF; θ) = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf))) = 1 / (1 + exp(−θᵀx)) = σ(θᵀx)

Model parameters
θ = [0.1, −3.5, 0.7, 2.1]

(Small) Test Data set

Outlook  Temp  Humidity  Class
rainy    cool  normal    0
sunny    hot   high      1

Feature Function
x0 = 1 (bias term)

x1 = 1 if outlook=sunny, 2 if outlook=overcast, 3 if outlook=rainy

x2 = 1 if temp=hot, 2 if temp=mild, 3 if temp=cool

x3 = 1 if humidity=normal, 2 if humidity=high

11
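One way this example could be worked through programmatically is sketched below in plain Python. The parameters and feature encoding are the ones on the slide; the dictionaries and the encode() helper are just my own naming, and the probabilities in the comments simply spell out the arithmetic they imply.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

theta = [0.1, -3.5, 0.7, 2.1]               # [θ0, θ1, θ2, θ3] from the slide

outlook  = {"sunny": 1, "overcast": 2, "rainy": 3}
temp     = {"hot": 1, "mild": 2, "cool": 3}
humidity = {"normal": 1, "high": 2}

def encode(o, t, h):
    # x0 = 1 is the bias term, followed by the three ordinal features
    return [1, outlook[o], temp[t], humidity[h]]

for o, t, h, label in [("rainy", "cool", "normal", 0),
                       ("sunny", "hot", "high", 1)]:
    x = encode(o, t, h)
    z = sum(tj * xj for tj, xj in zip(theta, x))
    p = sigmoid(z)                          # P(y = 1 | x; θ)
    pred = 1 if p > 0.5 else 0
    print(o, t, h, "true:", label, "p:", round(p, 3), "pred:", pred)
    # rainy/cool/normal: z = 0.1 - 10.5 + 2.1 + 2.1 = -6.2 → p ≈ 0.002 → predict 0
    # sunny/hot/high:    z = 0.1 - 3.5 + 0.7 + 4.2  =  1.5 → p ≈ 0.818 → predict 1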

Parameter Estimation

What are the four steps we would follow in finding the optimal parameters?

13

Objective Function

Minimize the negative conditional log likelihood

L(θ) = −P(Y|X; θ) = −∏_{i=1}^{N} P(y^i | x^i; θ)

note that

P(y = 1|x; θ) = σ(θᵀx)
P(y = 0|x; θ) = 1 − σ(θᵀx)

so

L(θ) = −P(Y|X; θ) = −∏_{i=1}^{N} P(y^i | x^i; θ)

     = −∏_{i=1}^{N} (σ(θᵀx^i))^{y^i} · (1 − σ(θᵀx^i))^{1 − y^i}

take the log of this function

log L(θ) = −∑_{i=1}^{N} [ y^i log σ(θᵀx^i) + (1 − y^i) log(1 − σ(θᵀx^i)) ]

14
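The plain-Python sketch below evaluates this negative log likelihood for a given θ; as stand-in data it reuses the parameters and the two encoded test instances from the earlier example slide, which is purely an illustrative choice.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def neg_log_likelihood(theta, X, y):
    """-log P(Y|X; θ) = -Σ_i [ y_i log σ(θᵀx_i) + (1 - y_i) log(1 - σ(θᵀx_i)) ]"""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return -total

theta = [0.1, -3.5, 0.7, 2.1]
X = [[1, 3, 3, 1],    # rainy, cool, normal (x0 = 1 is the bias term)
     [1, 1, 1, 2]]    # sunny, hot, high
y = [0, 1]
print(neg_log_likelihood(theta, X, y))   # lower is better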

Take 1st Derivative of the Objective Function

log L(θ) = −∑_{i=1}^{N} [ y^i log σ(θᵀx^i) + (1 − y^i) log(1 − σ(θᵀx^i)) ]

Preliminaries

• The derivative of the logistic (sigmoid) function is ∂σ(z)/∂z = σ(z)[1 − σ(z)]

• The chain rule tells us that ∂A/∂D = ∂A/∂B × ∂B/∂C × ∂C/∂D

Also

• Derivative of a sum = sum of the derivatives → focus on 1 training input

• Compute ∂L/∂θj for each θj individually, so focus on 1 θj

∂ log L(θ)/∂θj = ∂ log L(θ)/∂p × ∂p/∂z × ∂z/∂θj    where p = σ(θᵀx) and z = θᵀx

The three factors are

∂ log L(θ)/∂p = −[ y/p − (1 − y)/(1 − p) ]    (because L(θ) = −[ y log p + (1 − y) log(1 − p) ])

∂p/∂z = ∂σ(z)/∂z = σ(z)[1 − σ(z)]

∂z/∂θj = ∂(θᵀx)/∂θj = xj

Putting them together:

∂ log L(θ)/∂θj = −[ y/p − (1 − y)/(1 − p) ] × σ(z)[1 − σ(z)] × xj = [ σ(θᵀx) − y ] × xj

15

Logistic Regression: Parameter Estimation III

The derivative of the log likelihood wrt. a single parameter θj for all training examples

∂ log L(θ)/∂θj = ∑_{i=1}^{N} ( σ(θᵀx^i) − y^i ) x^i_j

• Now, we would set the derivatives to zero (Step 3) and solve for θ (Step 4)

• Unfortunately, that's not straightforward here (unlike for Naive Bayes)

• Instead, we will use an iterative method: Gradient Descent

θj^(new) ← θj^(old) − η ∂ log L(θ)/∂θj

θj^(new) ← θj^(old) − η ∑_{i=1}^{N} ( σ(θᵀx^i) − y^i ) x^i_j

16
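A minimal sketch of this update rule in plain Python (batch gradient descent over the negative log likelihood). The tiny weather example is reused as training data, and the learning rate η = 0.1 and the iteration count are invented values for illustration, not tuned settings.

import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def train(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent: θ_j ← θ_j − η Σ_i (σ(θᵀx_i) − y_i) x_ij."""
    theta = [0.0] * len(X[0])                      # initialise all θ_j to 0
    for _ in range(n_iters):
        grad = [0.0] * len(theta)
        for x_i, y_i in zip(X, y):
            p = sigmoid(sum(t * xj for t, xj in zip(theta, x_i)))
            for j, x_ij in enumerate(x_i):
                grad[j] += (p - y_i) * x_ij        # (σ(θᵀx_i) − y_i) x_ij
        theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta

X = [[1, 3, 3, 1],    # rainy, cool, normal (x0 = 1 is the bias term)
     [1, 1, 1, 2]]    # sunny, hot, high
y = [0, 1]
print(train(X, y))    # learned parameters; with two examples they will overfit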

Multinomial Logistic Regression

• So far we looked at problems where either y = 0 or y = 1 (e.g., play prediction: y ∈ {play, not play})

P(y = 1|x; θ) = σ(θᵀx) = exp(θᵀx) / (1 + exp(θᵀx))

P(y = 0|x; θ) = 1 − σ(θᵀx) = 1 − exp(θᵀx) / (1 + exp(θᵀx))

• But what if we have more than 2 classes, e.g., y ∈ {positive, negative, neutral}?

• we predict the probability of each class c by passing the input representation through the softmax function, a generalization of the sigmoid

P(y = c|x; θ) = exp(θcᵀx) / ∑_k exp(θkᵀx)

• we learn a parameter vector θc for each class c

17
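A minimal plain-Python sketch of the softmax computation (the class scores θcᵀx below are invented), showing that the resulting class probabilities are non-negative and sum to 1. Subtracting the maximum score before exponentiating is a common numerical-stability choice and does not change the result.

import math

def softmax(scores):
    # subtract the max score for numerical stability; the output is unchanged
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 0.5, -1.0]       # invented scores θ_cᵀx for three classes
probs = softmax(scores)
print(probs, sum(probs))        # probabilities over the 3 classes, summing to 1.0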

Example!

P(y = c|x; θ) = exp(θcᵀx) / ∑_k exp(θkᵀx)

Model parameters
θc0 = [0.1, −3.5, 0.7, 2.1]
θc1 = [0.6, 2.5, 2.7, −2.1]
θc2 = [3.1, 1.5, 0.07, 3.6]

(Small) Test Data set

Outlook  Temp  Humidity  Class
rainy    cool  normal    0 (don't play)
sunny    cool  normal    1 (maybe play)
sunny    hot   high      2 (play)

Feature Function
x0 = 1 (bias term)

x1 = 1 if outlook=sunny, 2 if outlook=overcast, 3 if outlook=rainy

x2 = 1 if temp=hot, 2 if temp=mild, 3 if temp=cool

x3 = 1 if humidity=normal, 2 if humidity=high
18
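A sketch of how the class probabilities for these test instances could be computed (plain Python). The per-class parameter vectors and the feature encoding are taken from the slide; the dictionaries are my own naming, and the predicted class is simply the argmax of the softmax output.

import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# one parameter vector θ_c per class, from the slide
thetas = [[0.1, -3.5, 0.7,  2.1],    # class 0: don't play
          [0.6,  2.5, 2.7, -2.1],    # class 1: maybe play
          [3.1,  1.5, 0.07, 3.6]]    # class 2: play

outlook  = {"sunny": 1, "overcast": 2, "rainy": 3}
temp     = {"hot": 1, "mild": 2, "cool": 3}
humidity = {"normal": 1, "high": 2}

for o, t, h, label in [("rainy", "cool", "normal", 0),
                       ("sunny", "cool", "normal", 1),
                       ("sunny", "hot", "high", 2)]:
    x = [1, outlook[o], temp[t], humidity[h]]           # x0 = 1 is the bias term
    scores = [sum(tj * xj for tj, xj in zip(theta_c, x)) for theta_c in thetas]
    probs = softmax(scores)
    pred = probs.index(max(probs))                      # argmax_c P(y = c | x; θ)
    print(o, t, h, "true:", label, [round(p, 3) for p in probs], "pred:", pred)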

Logistic Regression: Final Thoughts

Pros

• Probabilistic interpretation

• No restrictive assumptions on features

• Often outperforms Naive Bayes

• Particularly suited to frequency-based features (so, popular in NLP)

Cons

• Can only learn linear feature-data relationships

• Some feature scaling issues

• Often needs a lot of data to work well

• Regularisation a nuisance, but important since overfitting can be a big
problem

19

Summary

• Derivation of logistic regression

• Prediction

• Derivation of maximum likelihood

20

References

Cosma Shalizi. Advanced Data Analysis from an Elementary Point of View.
Chapters 11.1 and 11.2. Online Draft.
https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf

Dan Jurafsky and James H. Martin. Speech and Language Processing.
Chapter 5. Online Draft V3.0.
https://web.stanford.edu/~jurafsky/slp3/

21

Optional: Detailed Parameter Estimation

Step 2 Differentiate the log likelihood wrt. the parameters

log L(θ) = −∑_{i=1}^{N} [ y^i log σ(θᵀx^i) + (1 − y^i) log(1 − σ(θᵀx^i)) ]

Preliminaries

• The derivative of the logistic (sigmoid) function is ∂σ(z)/∂z = σ(z)[1 − σ(z)]

• The chain rule tells us that ∂A/∂C = ∂A/∂B × ∂B/∂C

Also

• Derivative of a sum = sum of the derivatives → focus on 1 training input

• Compute ∂L/∂θj for each θj individually, so focus on 1 θj

∂ log L(θ)/∂θj = ∂ log L(θ)/∂p × ∂p/∂z × ∂z/∂θj    where p = σ(θᵀx) and z = θᵀx

The three factors are

∂ log L(θ)/∂p = −[ y/p − (1 − y)/(1 − p) ]    (because L(θ) = −[ y log p + (1 − y) log(1 − p) ])

∂p/∂z = ∂σ(z)/∂z = σ(z)[1 − σ(z)]

∂z/∂θj = ∂(θᵀx)/∂θj = xj

Combining them:

∂ log L(θ)/∂θj
= −[ y/p − (1 − y)/(1 − p) ] × σ(z)[1 − σ(z)] × xj                    [[ combine 3 derivatives ]]
= −[ y/p − (1 − y)/(1 − p) ] × p[1 − p] × xj                          [[ σ(z) = p ]]
= −[ y(1 − p)/(p(1 − p)) − p(1 − y)/(p(1 − p)) ] × p[1 − p] × xj      [[ × (1 − p)/(1 − p) and p/p ]]
= −[ y(1 − p) − p(1 − y) ] × xj                                       [[ cancel terms ]]
= −[ y − yp − p + yp ] × xj                                           [[ multiply out ]]
= −[ y − p ] × xj                                                     [[ −yp + yp = 0 ]]
= [ p − y ] × xj                                                      [[ −[y − p] = −y + p = p − y ]]
= [ σ(θᵀx) − y ] × xj                                                 [[ p = σ(z), z = θᵀx ]]

22
