Lecture 7 (part 2): Logistic Regression
COMP90049
Introduction to Machine Learning Semester 1, 2020
Lea Frermann, CIS
Roadmap
So far…
• Naive Bayes
• Optimization (closed-form and iterative)
• Evaluation

Today: more classification!
• Logistic Regression
Logistic Regression
Quick Refresher
Recall Naive Bayes
P(x, y) = P(y) P(x|y) = ∏_{i=1}^{N} P(y_i) ∏_{m=1}^{M} P(x_{mi} | y_i)

• a probabilistic generative model of the joint probability P(x, y)
• optimized to maximize the likelihood of the observed data
• naive due to unrealistic feature independence assumptions

For prediction, we apply Bayes' rule to obtain the conditional distribution: P(x, y) = P(y)P(x|y) = P(y|x)P(x), so

P(y|x) = P(y) P(x|y) / P(x)

ŷ = argmax_y P(y|x) = argmax_y P(y) P(x|y)
How about we model P(y|x) directly? → Logistic Regression
Introduction to Logistic Regression
Logistic Regression on a high level
• Is a binary classification model
• Is a probabilistic discriminative model because it optimizes P(y|x) directly
• Learns to optimally discriminate between inputs which belong to different classes
• No model of P(x|y) → no conditional feature independence assumption
Aside: Linear Regression
• Regression: predict a real-valued quantity y given features x, e.g.,
housing price          given   {location, size, age, …}
success of movie ($)   given   {cast, genre, budget, …}
air quality            given   {temperature, timeOfDay, CO2, …}
• linear regression is the simplest regression model
• a real-valued ŷ is predicted as a linear combination of weighted feature values:

ŷ = θ_0 + θ_1 x_1 + θ_2 x_2 + … = θ_0 + ∑_i θ_i x_i

• The weights θ_0, θ_1, … are model parameters, and need to be optimized during training
• The loss (error) is the sum of squared errors (SSE): L = ∑_{i=1}^{N} (ŷ_i − y_i)² (see the sketch below)
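The following is a small illustration (my own sketch, not from the slides); the toy data and weights are made up, only the prediction and SSE formulas come from above:

```python
import numpy as np

# Hypothetical toy data (not from the slides): 3 examples with 2 features each.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5]])
y = np.array([3.0, 2.5, 5.0])

theta_0 = 0.5                     # bias / intercept
theta = np.array([1.0, 0.5])      # feature weights

y_hat = theta_0 + X @ theta       # y^ = theta_0 + sum_i theta_i * x_i
sse = np.sum((y_hat - y) ** 2)    # L = sum_i (y^_i - y_i)^2
print(y_hat, sse)
```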
Logistic Regression: Derivation I
• Let's assume a binary classification task, where y is true (1) or false (0).
• We model the probability P(y = 1|x; θ) = p(x) as a function of observations x under parameters θ. [What about P(y = 0|x; θ)? It is simply 1 − p(x).]
• We want to use a regression approach.
• How about: p(x) as a linear function of x?
      p(x) = θ_0 + θ_1 x_1 + … + θ_F x_F
  Problem: probabilities are bounded between 0 and 1, linear functions are not.
• How about: log p(x) as a linear function of x?
      log p(x) = θ_0 + θ_1 x_1 + … + θ_F x_F
  Problem: log is bounded in one direction, linear functions are not.
• How about: minimally modifying log p(x) so that it is unbounded, by applying the logistic (log-odds) transformation?
      log [ p(x) / (1 − p(x)) ] = θ_0 + θ_1 x_1 + … + θ_F x_F
Logistic Regression: Derivation II
log [ p(x) / (1 − p(x)) ] = θ_0 + θ_1 x_1 + … + θ_F x_F

• the left-hand side is also called the log odds
• the odds are defined as the fraction of successes over the fraction of failures

odds = P(success) / P(failure) = P(success) / (1 − P(success))

• e.g., the odds of rolling a 6 with a fair die are:

(1/6) / (1 − 1/6) = 0.17 / 0.83 ≈ 0.2
Logistic Regression: Derivation III
log [ p(x) / (1 − p(x)) ] = θ_0 + θ_1 x_1 + … + θ_F x_F

If we rearrange and solve for p(x), we get

p(x) = exp(θ_0 + θ_1 x_1 + … + θ_F x_F) / (1 + exp(θ_0 + θ_1 x_1 + … + θ_F x_F))
     = exp(θ_0 + ∑_{f=1}^{F} θ_f x_f) / (1 + exp(θ_0 + ∑_{f=1}^{F} θ_f x_f))
     = 1 / (1 + exp(−(θ_0 + ∑_{f=1}^{F} θ_f x_f)))

• where the RHS is the inverse logit (or logistic function)
• we pass a regression model through the logistic function to obtain a valid probability prediction (illustrated below)
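A minimal sketch (my own, not part of the slides) of the logistic function and its inverse, showing how an unbounded log-odds score maps to a probability and back:

```python
import numpy as np

def logistic(z):
    """Inverse logit: maps a real-valued score z (the log odds) into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log odds: maps a probability p in (0, 1) back to a real-valued score."""
    return np.log(p / (1.0 - p))

z = np.array([-5.0, 0.0, 5.0])
p = logistic(z)
print(p)          # roughly [0.007, 0.5, 0.993]
print(logit(p))   # recovers [-5, 0, 5]
```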
Logistic Regression: Interpretation
P(y = 1|x; θ) = 1 / (1 + exp(−(θ_0 + ∑_{f=1}^{F} θ_f x_f)))

A closer look at the logistic function

Most inputs lead to P(y|x) close to 0 or close to 1. That is intended, because all true labels are either 0 or 1.

• (θ_0 + ∑_{f=1}^{F} θ_f x_f) > 0 means y = 1
• (θ_0 + ∑_{f=1}^{F} θ_f x_f) ≈ 0 means most uncertainty
• (θ_0 + ∑_{f=1}^{F} θ_f x_f) < 0 means y = 0
Logistic Regression: Prediction
• The logistic function returns the probability P(y = 1) given an input x:

P(y = 1|x_1, x_2, ..., x_F; θ) = 1 / (1 + exp(−(θ_0 + ∑_{f=1}^{F} θ_f x_f))) = σ(x; θ)
• We define a decision boundary, e.g., predict y = 1 if P(y = 1|x1,x2,...,xF;θ) > 0.5 and y = 0 otherwise
Example!
P(y = 1|x_1, x_2, …, x_F; θ) = 1 / (1 + exp(−(θ_0 + ∑_{f=1}^{F} θ_f x_f))) = 1 / (1 + exp(−θᵀx)) = σ(θᵀx)
Model parameters
θ = [0.1, −3.5, 0.7, 2.1]
Feature Function

x_0 = 1 (bias term)
x_1 = 1 if outlook=sunny, 2 if outlook=overcast, 3 if outlook=rainy
x_2 = 1 if temp=hot, 2 if temp=mild, 3 if temp=cool
x_3 = 1 if humidity=normal, 2 if humidity=high

(Small) Test Data Set

Outlook   Temp   Humidity   Class
rainy     cool   normal     0
sunny     hot    high       1
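The slides leave the actual computation to the reader; the sketch below is my own working, using the θ and feature encoding above to score the two test rows:

```python
import math

theta = [0.1, -3.5, 0.7, 2.1]        # [theta_0, theta_1, theta_2, theta_3]

# Test instances encoded with the feature function above (x_0 is the bias term).
x_rainy_cool_normal = [1, 3, 3, 1]   # outlook=rainy, temp=cool, humidity=normal
x_sunny_hot_high    = [1, 1, 1, 2]   # outlook=sunny, temp=hot, humidity=high

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for x in (x_rainy_cool_normal, x_sunny_hot_high):
    z = sum(t * xi for t, xi in zip(theta, x))   # theta^T x
    p = sigmoid(z)
    print(z, round(p, 3), int(p > 0.5))
# z = -6.2 -> P(y=1|x) ~ 0.002 -> predict 0 (matches class 0)
# z =  1.5 -> P(y=1|x) ~ 0.818 -> predict 1 (matches class 1)
```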
Parameter Estimation
What are the four steps we would follow in finding the optimal parameters?
Objective Function
Minimize the negative conditional log likelihood

L(θ) = −P(Y|X; θ) = −∏_{i=1}^{N} P(y_i | x_i; θ)

note that

P(y = 1|x; θ) = σ(θᵀx)
P(y = 0|x; θ) = 1 − σ(θᵀx)

so

L(θ) = −P(Y|X; θ) = −∏_{i=1}^{N} P(y_i | x_i; θ)
     = −∏_{i=1}^{N} σ(θᵀx_i)^{y_i} (1 − σ(θᵀx_i))^{1−y_i}

take the log of this function

log L(θ) = −∑_{i=1}^{N} [ y_i log σ(θᵀx_i) + (1 − y_i) log(1 − σ(θᵀx_i)) ]
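A minimal sketch (my own, not from the slides) of evaluating this objective for a given θ; it reuses the encoded weather instances from the earlier example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_log_likelihood(theta, X, y):
    """-sum_i [ y_i log sigma(theta^T x_i) + (1 - y_i) log(1 - sigma(theta^T x_i)) ]"""
    p = sigmoid(X @ theta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# The encoded weather instances and parameters from the earlier example.
X = np.array([[1.0, 3.0, 3.0, 1.0],
              [1.0, 1.0, 1.0, 2.0]])
y = np.array([0.0, 1.0])
theta = np.array([0.1, -3.5, 0.7, 2.1])
print(negative_log_likelihood(theta, X, y))   # roughly 0.2
```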
Take 1st Derivative of the Objective Function
log L(θ) = −∑_{i=1}^{N} [ y_i log σ(θᵀx_i) + (1 − y_i) log(1 − σ(θᵀx_i)) ]

Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z)/∂z = σ(z)[1 − σ(z)]
• The chain rule tells us that ∂A/∂D = ∂A/∂B × ∂B/∂C × ∂C/∂D

Also
• The derivative of a sum is the sum of the derivatives → focus on one training input
• We compute ∂L/∂θ_j for each θ_j individually → focus on one θ_j

∂ log L(θ)/∂θ_j = ∂ log L(θ)/∂p × ∂p/∂z × ∂z/∂θ_j    where p = σ(θᵀx) and z = θᵀx

with

∂ log L(θ)/∂p = −[ y/p − (1 − y)/(1 − p) ]    (because log L(θ) = −[y log p + (1 − y) log(1 − p)] for a single example)
∂p/∂z = ∂σ(z)/∂z = σ(z)[1 − σ(z)]
∂z/∂θ_j = ∂(θᵀx)/∂θ_j = x_j

so

∂ log L(θ)/∂θ_j = −[ y/p − (1 − y)/(1 − p) ] × σ(z)[1 − σ(z)] × x_j = (σ(θᵀx) − y) × x_j
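As a sanity check (my own addition, not part of the slides), the analytic per-example gradient (σ(θᵀx) − y) x_j can be compared against a finite-difference approximation of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    """Per-example loss -[y log p + (1 - y) log(1 - p)] with p = sigma(theta^T x)."""
    p = sigmoid(x @ theta)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x = np.array([1.0, 3.0, 3.0, 1.0])      # one encoded example, bias feature first
y = 0.0
theta = np.array([0.1, -3.5, 0.7, 2.1])

analytic = (sigmoid(x @ theta) - y) * x  # (sigma(theta^T x) - y) * x_j for every j

eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(len(theta)):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (loss(theta + step, x, y) - loss(theta - step, x, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be very small, e.g. < 1e-6
```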
Logistic Regression: Parameter Estimation III
The derivative of the log likelihood w.r.t. a single parameter θ_j, summed over all training examples:

∂ log L(θ)/∂θ_j = ∑_{i=1}^{N} (σ(θᵀx_i) − y_i) x_{ij}

• Now, we would set the derivatives to zero (Step 3) and solve for θ (Step 4)
• Unfortunately, that's not straightforward here (unlike for Naive Bayes)
• Instead, we will use an iterative method: Gradient Descent

θ_j^(new) ← θ_j^(old) − η ∂ log L(θ)/∂θ_j

θ_j^(new) ← θ_j^(old) − η ∑_{i=1}^{N} (σ(θᵀx_i) − y_i) x_{ij}
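A minimal gradient-descent training loop following these update equations (my own sketch; the learning rate, iteration count, and toy data are assumptions, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent on the negative log likelihood.
    X: (N, F+1) design matrix whose first column is the bias feature x_0 = 1.
    y: (N,) vector of 0/1 labels."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ theta) - y)  # sum_i (sigma(theta^T x_i) - y_i) * x_i
        theta = theta - lr * grad              # theta <- theta - eta * gradient
    return theta

# Toy usage with the weather-style encoding from the earlier example.
X = np.array([[1.0, 3.0, 3.0, 1.0],
              [1.0, 1.0, 1.0, 2.0]])
y = np.array([0.0, 1.0])
theta = fit_logistic_regression(X, y)
print(sigmoid(X @ theta))   # predicted P(y=1|x), should move towards [0, 1]
```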
Multinomial Logistic Regression
• So far we looked at problems where either y = 0 or y = 1 (e.g., spam classification: y ∈ {spam, not spam})

P(y = 1|x; θ) = σ(θᵀx) = exp(θᵀx) / (1 + exp(θᵀx))
P(y = 0|x; θ) = 1 − σ(θᵀx) = 1 − exp(θᵀx) / (1 + exp(θᵀx))

• But what if we have more than 2 classes, e.g., y ∈ {positive, negative, neutral}?
• We predict the probability of each class c by passing the input representation through the softmax function, a generalization of the sigmoid:

P(y = c|x; θ) = exp(θ_cᵀx) / ∑_k exp(θ_kᵀx)
• we learn a parameter vector θc for each class c
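A minimal sketch of softmax prediction with one parameter vector per class, following the notation above (my own illustration):

```python
import numpy as np

def softmax(scores):
    """exp(score_c) / sum_k exp(score_k), shifted by the max for numerical stability."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def predict_proba(Theta, x):
    """Theta: (C, F+1) matrix with one parameter vector theta_c per class c.
    x: (F+1,) encoded input with the bias feature first.
    Returns P(y = c | x; theta) for every class c."""
    return softmax(Theta @ x)
```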
Example!
P(y = c|x; θ) = exp(θ_cᵀx) / ∑_k exp(θ_kᵀx)

Model parameters

θ_c0 = [0.1, −3.5, 0.7, 2.1]
θ_c1 = [0.6, 2.5, 2.7, −2.1]
θ_c2 = [3.1, 1.5, 0.07, 3.6]

Feature Function

x_0 = 1 (bias term)
x_1 = 1 if outlook=sunny, 2 if outlook=overcast, 3 if outlook=rainy
x_2 = 1 if temp=hot, 2 if temp=mild, 3 if temp=cool
x_3 = 1 if humidity=normal, 2 if humidity=high

(Small) Test Data Set

Outlook   Temp   Humidity   Class
rainy     cool   normal     0 (don't play)
sunny     cool   normal     1 (maybe play)
sunny     hot    high       2 (play)
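The worked computation is again left to the reader; the sketch below is my own, scoring the three test rows with the given parameter vectors (the predicted classes are simply whatever these particular θ values produce):

```python
import numpy as np

Theta = np.array([[0.1, -3.5, 0.7,  2.1],    # theta_c0
                  [0.6,  2.5, 2.7, -2.1],    # theta_c1
                  [3.1,  1.5, 0.07, 3.6]])   # theta_c2

# Test instances encoded with the feature function above (bias, outlook, temp, humidity).
X = np.array([[1, 3, 3, 1],   # rainy, cool, normal
              [1, 1, 3, 1],   # sunny, cool, normal
              [1, 1, 1, 2]])  # sunny, hot, high

def softmax(scores):
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

for x in X:
    p = softmax(Theta @ x)
    print(np.round(p, 3), "-> predicted class", int(np.argmax(p)))
```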
Logistic Regression: Final Thoughts
Pros
• Probabilistic interpretation
• No restrictive assumptions on features
• Often outperforms Naive Bayes
• Particularly suited to frequency-based features (so, popular in NLP)
Cons
• Can only learn linear feature-data relationships
• Some feature scaling issues
• Often needs a lot of data to work well
• Regularisation a nuisance, but important since overfitting can be a big problem
Summary
• Derivation of logistic regression
• Prediction
• Maximum likelihood parameter estimation
References
Cosma Shalizi. Advanced Data Analysis from an Elementary Point of View. Chapters 11.1 and 11.2. Online Draft. https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/ADAfaEPoV.pdf
Dan Jurafsky and James H. Martin. Speech and Language Processing. Chapter 5. Online Draft V3.0. https://web.stanford.edu/~jurafsky/slp3/
Optional: Detailed Parameter Estimation

Step 2: Differentiate the log likelihood w.r.t. the parameters

log L(θ) = −∑_{i=1}^{N} [ y_i log σ(θᵀx_i) + (1 − y_i) log(1 − σ(θᵀx_i)) ]

Preliminaries
• The derivative of the logistic (sigmoid) function is ∂σ(z)/∂z = σ(z)[1 − σ(z)]
• The chain rule tells us that ∂A/∂C = ∂A/∂B × ∂B/∂C

Focusing on a single training example and a single parameter θ_j, and writing p = σ(θᵀx) and z = θᵀx:

∂ log L(θ)/∂θ_j = ∂ log L(θ)/∂p × ∂p/∂z × ∂z/∂θ_j

∂ log L(θ)/∂p = −[ y/p − (1 − y)/(1 − p) ]    (because log L(θ) = −[y log p + (1 − y) log(1 − p)] for one example)
∂p/∂z = ∂σ(z)/∂z = σ(z)[1 − σ(z)]
∂z/∂θ_j = ∂(θᵀx)/∂θ_j = x_j

Putting the pieces together:

∂ log L(θ)/∂θ_j
  = −[ y/p − (1 − y)/(1 − p) ] × σ(z)[1 − σ(z)] × x_j                 [combine the 3 derivatives]
  = −[ y/p − (1 − y)/(1 − p) ] × p(1 − p) × x_j                       [σ(z) = p]
  = −[ y(1 − p)/(p(1 − p)) − p(1 − y)/(p(1 − p)) ] × p(1 − p) × x_j   [multiply by (1 − p)/(1 − p) and p/p]
  = −[ y(1 − p) − p(1 − y) ] × x_j                                    [cancel terms]
  = −[ y − yp − p + yp ] × x_j                                        [multiply out]
  = −[ y − p ] × x_j                                                  [−yp + yp = 0]
  = (p − y) × x_j                                                     [−(y − p) = p − y]
  = (σ(θᵀx) − y) × x_j                                                [p = σ(z), z = θᵀx]
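For readers who want to double-check the algebra, a short symbolic verification (my own sketch, using sympy, for a single feature and a single example); the derivative of the per-example loss should match (p − y)·x:

```python
import sympy as sp

theta, x, y = sp.symbols('theta x y', real=True)

p = 1 / (1 + sp.exp(-theta * x))                    # p = sigma(theta * x)
logL = -(y * sp.log(p) + (1 - y) * sp.log(1 - p))   # per-example negative log likelihood

grad = sp.diff(logL, theta)
# The difference to the closed form (p - y) * x should simplify to 0.
print(sp.simplify(grad - (p - y) * x))
```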