Lecture 7 (part 2): Logistic Regression
COMP90049
Introduction to Machine Learning
Semester 1, 2020
Lea Frermann, CIS
Roadmap
So far …
• Naive Bayes
• Optimization (closed-form and iterative)
• Evaluation
Today: more classification!
• Logistic Regression
Logistic Regression
Quick Refresher
Recall Naive Bayes
P(x, y) = P(y)P(x|y) = ∏_{i=1}^{N} P(y^i) ∏_{m=1}^{M} P(x^i_m | y^i)
• a probabilistic generative model of the joint probability P(x , y)
• optimized to maximize the likelihood of the observed data
• naive due to unrealistic feature independence assumptions
For prediction, we apply Bayes Rule to obtain the conditional distribution
P(x, y) = P(y)P(x|y) = P(y|x)P(x)

P(y|x) = P(y)P(x|y) / P(x)

ŷ = argmax_y P(y|x) = argmax_y P(y)P(x|y)
How about we model P(y |x) directly? → Logistic Regression
Introduction to Logistic Regression
Logistic Regression on a high level
• Is a binary classification model
• Is a probabilistic discriminative model, because it optimizes P(y|x) directly
• Learns to optimally discriminate between inputs which belong to different classes
• No model of P(x|y) → no conditional feature independence assumption
Aside: Linear Regression
• Regression: predict a real-valued quantity y given features x , e.g.,
housing price given {location, size, age, …}
success of movie ($) given {cast, genre, budget, …}
air quality given {temperature, timeOfDay, CO2, …}
• linear regression is the simplest regression model
• a real-valued ŷ is predicted as a linear combination of weighted feature
values
ŷ = θ0 + θ1 x1 + θ2 x2 + … = θ0 + ∑_i θi xi
• The weights θ0, θ1, . . . are model parameters, and need to be optimized
during training
• Loss (error) is the sum of squared errors (SSE): L = ∑_{i=1}^{N} (ŷ^i − y^i)^2
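For concreteness, a minimal NumPy sketch of this prediction and loss (my own illustration, not from the slides; the data values and weights below are made up):

```python
import numpy as np

# Hypothetical toy data: 3 instances, 2 features each (values made up for illustration)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0]])
y = np.array([3.0, 2.5, 4.0])            # real-valued targets

theta_0 = 0.5                            # bias / intercept
theta = np.array([1.0, 0.2])             # one weight per feature

y_hat = theta_0 + X @ theta              # y_hat = theta_0 + sum_i theta_i * x_i
sse = np.sum((y_hat - y) ** 2)           # L = sum_i (y_hat^i - y^i)^2
print(y_hat, sse)
```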
Logistic Regression: Derivation I
• Let's assume a binary classification task: y is true (1) or false (0).
• We model the probability P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [ What about P(y = 0|x; θ)? ]
• We want to use a regression approach
• How about: p(x) as a linear function of x? Problem: probabilities are bounded in 0 and 1, linear functions are not.

  p(x) = θ0 + θ1 x1 + … + θF xF

• How about: log p(x) as a linear function of x? Problem: log is bounded in one direction, linear functions are not.

  log p(x) = θ0 + θ1 x1 + … + θF xF

• How about: minimally modifying log p(x) such that it is unbounded, by applying the logistic transformation

  log [p(x) / (1 − p(x))] = θ0 + θ1 x1 + … + θF xF
Logistic Regression: Derivation II
log [p(x) / (1 − p(x))] = θ0 + θ1 x1 + … + θF xF

• also called the log odds
• the odds are defined as the probability of success over the probability of failure

  odds = P(success) / P(failure) = P(success) / (1 − P(success))

• e.g., the odds of rolling a 6 with a fair die are:

  (1/6) / (1 − 1/6) = 0.17 / 0.83 = 0.2
Logistic Regression: Derivation III
log [p(x) / (1 − p(x))] = θ0 + θ1 x1 + … + θF xF

If we rearrange and solve for p(x), we get

p(x) = exp(θ0 + ∑_{f=1}^{F} θf xf) / (1 + exp(θ0 + ∑_{f=1}^{F} θf xf))
     = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf)))

• where the RHS is the inverse logit (or logistic function)
• we pass a regression model through the logistic function to obtain a valid probability prediction
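As a small aside (my own sketch, not from the slides), the code below illustrates that the logistic function maps any real-valued score into (0, 1), and that the log odds transformation inverts it; the score values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    """Logistic (inverse logit) function: maps any real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_odds(p):
    """The logit: log(p / (1 - p)), the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

scores = np.array([-6.0, -1.0, 0.0, 1.5, 4.0])   # arbitrary linear scores theta^T x
probs = sigmoid(scores)
print(probs)                  # every value lies strictly between 0 and 1
print(log_odds(probs))        # recovers the original scores (up to floating point error)
```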
Logistic Regression: Interpretation
P(y|x; θ) = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf)))

A closer look at the logistic function
Most inputs lead to P(y|x) close to 0 or close to 1. That is intended, because all true labels are either 0 or 1.
• (θ0 + ∑_{f=1}^{F} θf xf) > 0 means y = 1
• (θ0 + ∑_{f=1}^{F} θf xf) ≈ 0 means most uncertainty
• (θ0 + ∑_{f=1}^{F} θf xf) < 0 means y = 0

Logistic Regression: Prediction
• The logistic function returns the probability P(y = 1) given an input x

  P(y = 1|x1, x2, ..., xF; θ) = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf))) = σ(x; θ)

• We define a decision boundary, e.g., predict y = 1 if P(y = 1|x1, x2, ..., xF; θ) > 0.5 and y = 0 otherwise
Example!
P(y = 1|x1, x2, …, xF; θ) = 1 / (1 + exp(−(θ0 + ∑_{f=1}^{F} θf xf))) = 1 / (1 + exp(−θ^T x)) = σ(θ^T x)

Model parameters
θ = [0.1, −3.5, 0.7, 2.1]

(Small) Test Data Set
Outlook   Temp   Humidity   Class
rainy     cool   normal     0
sunny     hot    high       1

Feature Function
x0 = 1 (bias term)
x1 = 1 if outlook=sunny, 2 if outlook=overcast, 3 if outlook=rainy
x2 = 1 if temp=hot, 2 if temp=mild, 3 if temp=cool
x3 = 1 if humidity=normal, 2 if humidity=high
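A possible way to compute the predictions for this test set (my own sketch; it only reuses the θ and the feature encoding given on the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.1, -3.5, 0.7, 2.1])        # [theta_0, theta_1, theta_2, theta_3]

# Encoded test instances: [x0 (bias), x1 (outlook), x2 (temp), x3 (humidity)]
x_rainy_cool_normal = np.array([1, 3, 3, 1])   # true class 0
x_sunny_hot_high    = np.array([1, 1, 1, 2])   # true class 1

for x in (x_rainy_cool_normal, x_sunny_hot_high):
    p = sigmoid(theta @ x)                     # P(y = 1 | x; theta)
    y_hat = int(p > 0.5)                       # decision boundary at 0.5
    print(x, round(float(p), 3), y_hat)
```

With this encoding the first instance gets θ^T x = −6.2 (probability near 0) and the second gets θ^T x = 1.5 (probability above 0.5), so both test labels are recovered.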
Parameter Estimation
What are the four steps we would follow in finding the optimal parameters?
Objective Function
Minimize the negative conditional log likelihood

L(θ) = −P(Y|X; θ) = −∏_{i=1}^{N} P(y^i | x^i; θ)

note that
P(y = 1|x; θ) = σ(θ^T x)
P(y = 0|x; θ) = 1 − σ(θ^T x)

so
L(θ) = −∏_{i=1}^{N} (σ(θ^T x^i))^{y^i} · (1 − σ(θ^T x^i))^{1−y^i}

take the log of this function
log L(θ) = −∑_{i=1}^{N} [ y^i log σ(θ^T x^i) + (1 − y^i) log(1 − σ(θ^T x^i)) ]
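A minimal sketch of evaluating this objective numerically (my own illustration; it reuses the θ and the encoded test instances from the earlier example, so the feature encoding is an assumption carried over from that slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_likelihood(theta, X, y):
    """-sum_i [ y^i log sigma(theta^T x^i) + (1 - y^i) log(1 - sigma(theta^T x^i)) ]"""
    p = sigmoid(X @ theta)                     # P(y = 1 | x^i; theta) for every instance
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

theta = np.array([0.1, -3.5, 0.7, 2.1])
X = np.array([[1, 3, 3, 1],                    # rainy, cool, normal
              [1, 1, 1, 2]])                   # sunny, hot, high
y = np.array([0, 1])
print(neg_log_likelihood(theta, X, y))         # lower is better
```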
Take 1st Derivative of the Objective Function

log L(θ) = −∑_{i=1}^{N} [ y^i log σ(θ^T x^i) + (1 − y^i) log(1 − σ(θ^T x^i)) ]

Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z)/∂z = σ(z)[1 − σ(z)]
• The chain rule tells us that ∂A/∂D = ∂A/∂B × ∂B/∂C × ∂C/∂D

Also
• Derivative of a sum = sum of derivatives → focus on 1 training input
• Compute ∂L/∂θj for each θj individually, so focus on 1 θj

Decompose the derivative with the chain rule, where p = σ(θ^T x) and z = θ^T x:

∂ log L(θ)/∂θj = ∂ log L(θ)/∂p × ∂p/∂z × ∂z/∂θj

The three factors are

∂ log L(θ)/∂p = −[ y/p − (1 − y)/(1 − p) ]   (because L(θ) = −[y log p + (1 − y) log(1 − p)])
∂p/∂z = ∂σ(z)/∂z = σ(z)[1 − σ(z)]
∂z/∂θj = ∂(θ^T x)/∂θj = xj

Combining them:

∂ log L(θ)/∂θj = −[ y/p − (1 − y)/(1 − p) ] × σ(z)[1 − σ(z)] × xj = [ σ(θ^T x) − y ] × xj
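As a sanity check (my own sketch, not part of the slides), the analytic gradient [σ(θ^T x) − y] xj can be compared against a finite-difference approximation of the negative log likelihood; the data below reuses the encoded example instances:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y):
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(theta, X, y):
    # d log L / d theta_j = sum_i ( sigma(theta^T x^i) - y^i ) * x^i_j
    return X.T @ (sigmoid(X @ theta) - y)

theta = np.array([0.1, -3.5, 0.7, 2.1])
X = np.array([[1, 3, 3, 1], [1, 1, 1, 2]], dtype=float)
y = np.array([0.0, 1.0])

eps = 1e-6
numeric = np.array([
    (nll(theta + eps * np.eye(4)[j], X, y) - nll(theta - eps * np.eye(4)[j], X, y)) / (2 * eps)
    for j in range(4)
])
print(analytic_grad(theta, X, y))
print(numeric)                                 # should match to several decimal places
```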
Logistic Regression: Parameter Estimation III
The derivative of the log likelihood wrt. a single parameter θj over all training examples:

∂ log L(θ)/∂θj = ∑_{i=1}^{N} ( σ(θ^T x^i) − y^i ) x^i_j

• Now, we would set the derivatives to zero (Step 3) and solve for θ (Step 4)
• Unfortunately, that's not straightforward here (unlike for Naive Bayes)
• Instead, we will use an iterative method: Gradient Descent

θ_j^(new) ← θ_j^(old) − η ∂ log L(θ)/∂θj
θ_j^(new) ← θ_j^(old) − η ∑_{i=1}^{N} ( σ(θ^T x^i) − y^i ) x^i_j
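A minimal batch gradient descent sketch for this update rule (my own illustration; the learning rate, iteration count and toy data are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, n_iters=1000):
    """Batch gradient descent on the negative conditional log likelihood.
    X: (N, F+1) design matrix with a leading bias column of ones; y: labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y)  # sum_i (sigma(theta^T x^i) - y^i) * x^i_j
        theta = theta - eta * grad             # theta_j <- theta_j - eta * d log L / d theta_j
    return theta

# Toy usage with the two encoded weather instances from the earlier example
X = np.array([[1, 3, 3, 1],
              [1, 1, 1, 2]], dtype=float)
y = np.array([0.0, 1.0])
theta = train_logistic_regression(X, y)
print(theta)
print(sigmoid(X @ theta))                      # predicted probabilities move towards 0 and 1
```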
Multinomial Logistic Regression
• So far we looked at problems where either y = 0 or y = 1 (e.g., y ∈ {play, not play} in our weather example)

  P(y = 1|x; θ) = σ(θ^T x) = exp(θ^T x) / (1 + exp(θ^T x))
  P(y = 0|x; θ) = 1 − σ(θ^T x) = 1 − exp(θ^T x) / (1 + exp(θ^T x))

• But what if we have more than 2 classes, e.g., y ∈ {positive, negative, neutral}?
• we predict the probability of each class c by passing the input representation through the softmax function, a generalization of the sigmoid

  p(y = c|x; θ) = exp(θ_c^T x) / ∑_k exp(θ_k^T x)

• we learn a parameter vector θ_c for each class c
Example!
p(y = c|x; θ) = exp(θ_c^T x) / ∑_k exp(θ_k^T x)

Model parameters
θ_c0 = [0.1, −3.5, 0.7, 2.1]
θ_c1 = [0.6, 2.5, 2.7, −2.1]
θ_c2 = [3.1, 1.5, 0.07, 3.6]

(Small) Test Data Set
Outlook   Temp   Humidity   Class
rainy     cool   normal     0 (don't play)
sunny     cool   normal     1 (maybe play)
sunny     hot    high       2 (play)

Feature Function
x0 = 1 (bias term)
x1 = 1 if outlook=sunny, 2 if outlook=overcast, 3 if outlook=rainy
x2 = 1 if temp=hot, 2 if temp=mild, 3 if temp=cool
x3 = 1 if humidity=normal, 2 if humidity=high
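A sketch of how the softmax probabilities for this test set could be computed (my own code; it assumes θ_c x denotes the dot product of θ_c with the encoded feature vector, including the bias feature x0):

```python
import numpy as np

def softmax(scores):
    """exp(score_c) / sum_k exp(score_k), with max-subtraction for numerical stability."""
    scores = scores - np.max(scores)
    exps = np.exp(scores)
    return exps / np.sum(exps)

# One parameter vector per class, as on the slide
Theta = np.array([[0.1, -3.5, 0.7,  2.1],      # class 0 (don't play)
                  [0.6,  2.5, 2.7, -2.1],      # class 1 (maybe play)
                  [3.1,  1.5, 0.07, 3.6]])     # class 2 (play)

# Encoded test instances: [x0 (bias), x1 (outlook), x2 (temp), x3 (humidity)]
X = np.array([[1, 3, 3, 1],                    # rainy, cool, normal  (true class 0)
              [1, 1, 3, 1],                    # sunny, cool, normal  (true class 1)
              [1, 1, 1, 2]])                   # sunny, hot, high     (true class 2)

for x in X:
    probs = softmax(Theta @ x)                 # p(y = c | x; theta) for each class c
    print(np.round(probs, 3), "argmax class:", int(np.argmax(probs)))
```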
Logistic Regression: Final Thoughts
Pros
• Probabilistic interpretation
• No restrictive assumptions on features
• Often outperforms Naive Bayes
• Particularly suited to frequency-based features (so, popular in NLP)
Cons
• Can only learn linear feature-data relationships
• Some feature scaling issues
• Often needs a lot of data to work well
• Regularisation a nuisance, but important since overfitting can be a big
problem
Summary
• Derivation of logistic regression
• Prediction
• Derivation of maximum likelihood
References
Cosma Shalizi. Advanced Data Analysis from an Elementary Point of View.
Chapters 11.1 and 11.2. Online Draft.
https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
Dan Jurafsky and James H. Martin. Speech and Language Processing.
Chapter 5. Online Draft V3.0.
https://web.stanford.edu/~jurafsky/slp3/
Optional: Detailed Parameter Estimation
Step 2: Differentiate the log likelihood wrt. the parameters

log L(θ) = −∑_{i=1}^{N} [ y^i log σ(θ^T x^i) + (1 − y^i) log(1 − σ(θ^T x^i)) ]

Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z)/∂z = σ(z)[1 − σ(z)]
• The chain rule tells us that ∂A/∂C = ∂A/∂B × ∂B/∂C

Also
• Derivative of a sum = sum of derivatives → focus on 1 training input
• Compute ∂L/∂θj for each θj individually, so focus on 1 θj

Decompose with the chain rule, where p = σ(θ^T x) and z = θ^T x:

∂ log L(θ)/∂θj = ∂ log L(θ)/∂p × ∂p/∂z × ∂z/∂θj

The three factors:

∂ log L(θ)/∂p = −[ y/p − (1 − y)/(1 − p) ]   (because L(θ) = −[y log p + (1 − y) log(1 − p)])
∂p/∂z = ∂σ(z)/∂z = σ(z)[1 − σ(z)]
∂z/∂θj = ∂(θ^T x)/∂θj = xj

Combine and simplify:

∂ log L(θ)/∂θj
= −[ y/p − (1 − y)/(1 − p) ] × σ(z)[1 − σ(z)] × xj                      [[ combine 3 derivatives ]]
= −[ y/p − (1 − y)/(1 − p) ] × p[1 − p] × xj                            [[ σ(z) = p ]]
= −[ y(1 − p)/(p(1 − p)) − p(1 − y)/(p(1 − p)) ] × p[1 − p] × xj        [[ × (1 − p)/(1 − p) and p/p ]]
= −[ y(1 − p) − p(1 − y) ] × xj                                         [[ cancel terms ]]
= −[ y − yp − p + yp ] × xj                                             [[ multiply out ]]
= −[ y − p ] × xj                                                       [[ −yp + yp = 0 ]]
= [ p − y ] × xj                                                        [[ −[y − p] = −y + p = p − y ]]
= [ σ(θ^T x) − y ] × xj                                                 [[ p = σ(z), z = θ^T x ]]