Today
• Maximum Likelihood (cont’d)
• Classification
Reminder: ps0 due at midnight
Maximum Likelihood for Linear Regression
Maximum likelihood: a general way of estimating model parameters 𝜃
In general, assume data is generated by some distribution
$u \sim p(u \mid \theta)$,   $D = \{u^{(1)}, u^{(2)}, \ldots, u^{(m)}\}$   (i.i.d. observations)
Likelihood:
  $\mathcal{L}(D) = \prod_{i=1}^{m} p(u^{(i)} \mid \theta)$
Maximum likelihood estimate:
  $\theta_{ML} = \arg\max_\theta \mathcal{L}(D) = \arg\max_\theta \prod_{i=1}^{m} p(u^{(i)} \mid \theta)$
Log likelihood:
  $= \arg\max_\theta \sum_{i=1}^{m} \log p(u^{(i)} \mid \theta)$
Note: $p$ replaces $h$!
$\log f(x)$ is monotonically increasing, so it has the same argmax as $f(x)$
i.i.d. observations
• independent, identically distributed random variables
• If 𝑢𝑖 are i.i.d. r.v.s, then
$p(u_1, u_2, \ldots, u_m) = p(u_1)\, p(u_2) \cdots p(u_m)$
• A reasonable assumption about many datasets, but not always
ML: Another example
• Observe a dataset of points $D = \{x_i\}_{i=1:10}$
• Assume $x$ is generated by a Normal distribution, $x \sim N(x \mid \mu, \sigma)$
• Find parameters $\theta_{ML} = [\mu, \sigma]$ that maximize $\prod_{i=1}^{10} N(x_i \mid \mu, \sigma)$
[Figure: four candidate distributions, f1 ∼ N(10, 2.25) (marked as the solution), f2 ∼ N(10, 9), f3 ∼ N(10, 0.25), f4 ∼ N(8, 2.25)]
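As a concrete check, here is a minimal NumPy/SciPy sketch of this example. The 10-point sample is made up for illustration (it is not the lecture's data), and N(μ, 2.25) etc. are read as mean and variance:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 10-point sample (illustrative only, not the lecture's data)
x = np.array([9.1, 11.2, 10.4, 8.7, 10.9, 9.8, 11.5, 8.9, 10.2, 9.6])

# Closed-form ML estimates for a Gaussian: sample mean and (biased) std
mu_ml = x.mean()
sigma_ml = x.std()          # ddof=0 gives the ML estimate of sigma
print(f"mu_ML = {mu_ml:.2f}, sigma_ML = {sigma_ml:.2f}")

# Log-likelihood of each candidate distribution from the slide, N(mean, variance)
candidates = {"f1": (10, 2.25), "f2": (10, 9), "f3": (10, 0.25), "f4": (8, 2.25)}
for name, (mu, var) in candidates.items():
    ll = norm.logpdf(x, loc=mu, scale=np.sqrt(var)).sum()
    print(f"{name}: sum log N(x_i | mu, sigma) = {ll:.2f}")
```

The candidate whose parameters are closest to the ML estimates will have the largest summed log-likelihood.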
ML for Linear Regression
Assume:  $t = y + \epsilon = h(x) + \epsilon$,  with noise $\epsilon \sim N(\epsilon \mid 0, \beta^{-1})$,  where $\beta = \frac{1}{\sigma^2}$
(we don't get to see $y$, only $t$)
[Figure: the curve $h(x)$ with the Gaussian conditional $p(t \mid x, \theta, \beta)$ centered at $h(x^{(i)})$ for one input $x^{(i)}$ and its observed target $t^{(i)}$]
ML for Linear Regression
Assume:  $t = y + \epsilon = h(x) + \epsilon$,  with noise $\epsilon \sim N(\epsilon \mid 0, \beta^{-1})$,  where $\beta = \frac{1}{\sigma^2}$
Probability of one data point:  $p(t^{(i)} \mid x^{(i)}, \theta, \beta) = N(t^{(i)} \mid h(x^{(i)}), \beta^{-1})$
Likelihood function:  $p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = \prod_{i=1}^{m} N(t^{(i)} \mid h(x^{(i)}), \beta^{-1})$
Max. likelihood solution (want to maximize w.r.t. $\theta$ and $\beta$):
  $\theta_{ML} = \arg\max_\theta \, p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$,   $\beta_{ML} = \arg\max_\beta \, p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$
Log likelihood:
  $\ln p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = -\frac{\beta}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) - t^{(i)} \right)^2 + \frac{m}{2} \ln \beta - \frac{m}{2} \ln(2\pi)$
… but maximizing this is the same as minimizing the sum-of-squares cost¹
  $\frac{1}{2m} \sum_{i=1}^{m} \left( h(x^{(i)}) - t^{(i)} \right)^2$
… which is the same as our SSE cost from before!
¹ multiply by $-\frac{1}{m\beta}$, changing max to min, and omit the last two terms (they don't depend on $\theta$)
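A quick numerical sanity check of this claim, under the assumptions above (the synthetic data, parameter values, and variable names below are mine): the Gaussian log-likelihood is an affine function of the sum-of-squares cost, $\ln p = -m\beta\, J_{SSE} + \frac{m}{2}\ln\beta - \frac{m}{2}\ln(2\pi)$, so both have the same optimum in $\theta$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, beta = 50, 4.0                                 # beta = 1/sigma^2
x = rng.uniform(-1, 1, size=m)
theta = np.array([0.5, -2.0])                     # arbitrary [intercept, slope]
h = theta[0] + theta[1] * x                       # h_theta(x) for a linear hypothesis
t = h + rng.normal(0, 1 / np.sqrt(beta), size=m)  # targets t = h(x) + noise

# Gaussian log-likelihood of the targets
log_lik = norm.logpdf(t, loc=h, scale=1 / np.sqrt(beta)).sum()

# Sum-of-squares cost J = (1/2m) * sum (h - t)^2
J_sse = 0.5 / m * np.sum((h - t) ** 2)

# Affine relationship from the footnote above
reconstructed = -m * beta * J_sse + 0.5 * m * np.log(beta) - 0.5 * m * np.log(2 * np.pi)
print(np.isclose(log_lik, reconstructed))         # True
```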
Summary: Maximum Likelihood Solution for Linear Regression
Hypothesis:  $h_\theta(x) = \theta^T x$
$\theta$: parameters;   $D = \{ (x^{(i)}, t^{(i)}) \}$: data
Likelihood:  $p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = \prod_{i=1}^{m} N(t^{(i)} \mid h_\theta(x^{(i)}), \beta^{-1})$
Goal: maximize the likelihood, equivalent to $\arg\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - t^{(i)} \right)^2$  (same as minimizing SSE)
[Figure: scatter plot of the training data with the fitted regression line]
Probabilistic Motivation for SSE
• Under the Gaussian noise assumption, maximizing the probability of the data points is the same as minimizing a sum-of-squares cost function
• Also known as the least squares method
• ML can be used for other hypotheses!
– But linear regression has a closed-form solution (see the sketch below)
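Here is a hedged sketch of that closed-form (normal-equation) fit on synthetic data of my own, plus the ML noise precision, whose inverse is the mean squared residual:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
x = rng.uniform(0, 5, size=m)
t = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=m)   # synthetic targets

X = np.column_stack([np.ones(m), x])             # design matrix with a bias column

# Closed-form ML / least-squares solution: theta = (X^T X)^{-1} X^T t
theta_ml = np.linalg.solve(X.T @ X, X.T @ t)

# ML estimate of the noise precision: 1/beta_ML = mean squared residual
residuals = X @ theta_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)

print("theta_ML =", theta_ml)                    # roughly [2, 3]
print("sigma_ML =", 1 / np.sqrt(beta_ml))        # roughly 0.5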
Supervised Learning: Classification
Classification
0: “Negative Class” (e.g., benign tumor) 1: “Positive Class” (e.g., malignant tumor)
Tumor: Malignant / Benign? Email: Spam / Not Spam? Video: Viral / Not Viral?
$y \in \{0, 1\}$
Classification
0: “Negative Class” (e.g., benign tumor) 1: “Positive Class” (e.g., malignant tumor)
Why not use least squares regression?
$\arg\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$,   $y \in \{0, 1\}$
Why not use least squares regression?
$\arg\min_\theta \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$
[Figure: binary labels (0/1) plotted against the input, the least-squares fit, and the resulting "decision boundary" where $h_\theta(x) = 0.5$]
Classification
0: “Negative Class” (e.g., benign tumor) 1: “Positive Class” (e.g., malignant tumor)
• Indeed, this is possible!
– Predict 1 if $h_\theta(x) > 0.5$, 0 otherwise
• However, outliers lead to problems…
• Instead, use logistic regression
Least Squares vs. Logistic Regression for Classification
Figure 4.4 from Bishop. The left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve). The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.
(see Bishop 4.1.3 for more details)
Logistic Regression
Hypothesis: map $\theta^T x$ to $(0, 1)$ with the "sigmoid" function $g(z) = \frac{1}{1 + e^{-z}}$:
  $h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$
$h_\theta(x) = p(y = 1 \mid x)$:  "probability of class 1 given input"
[Figure: the sigmoid $g(z)$, rising from 0 through 0.5 at $z = 0$ toward 1]
Predict "$y = 1$" if $h_\theta(x) \ge 0.5$;  predict "$y = 0$" if $h_\theta(x) < 0.5$
[Figure: binary-labeled data with the resulting "decision boundary" where $h_\theta(x) = 0.5$]
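A minimal sketch of this hypothesis and decision rule (the helper names are my own):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}); maps the reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, X):
    """h_theta(x) = g(theta^T x): predicted probability that y = 1."""
    return sigmoid(X @ theta)

def predict(theta, X):
    """Predict y = 1 when h_theta(x) >= 0.5, i.e. when theta^T x >= 0."""
    return (h(theta, X) >= 0.5).astype(int)
```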
Logistic Regression Cost
Hypothesis:  $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$
$\theta$: parameters;   $D = \{ (x^{(i)}, y^{(i)}) \}$: data
Cost function (cross entropy):
  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Goal: minimize the cost
[Figure: binary-labeled data with the "decision boundary" where $h_\theta(x) = 0.5$]
Cross Entropy Cost
• Cross entropy compares distribution q to reference p
• Here q is the predicted probability of y = 1 given x; the reference distribution p puts all of its mass on the observed label y(i), which is either 1 or 0
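A short sketch of this cost, assuming the averaged (1/m) form written above; the small epsilon clip is my own guard against log(0):

```python
import numpy as np

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y*log q + (1-y)*log(1-q) ], with q = h_theta(x)."""
    q = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted p(y = 1 | x)
    q = np.clip(q, eps, 1 - eps)             # avoid log(0)
    return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))
```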
Gradient of Cross Entropy Cost
• Cross entropy cost:  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
• Its gradient w.r.t. $\theta$ is:  $\nabla_\theta J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}$  (derivation left as exercise)
• No direct closed-form solution
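A vectorized sketch of that gradient under the same conventions (function name is my own):

```python
import numpy as np

def cross_entropy_grad(theta, X, y):
    """Gradient of the cross-entropy cost: (1/m) * sum_i (h_theta(x_i) - y_i) * x_i."""
    q = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for every example
    return X.T @ (q - y) / len(y)
```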
Gradient descent for Logistic Regression
Cost:  $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$
Want:  $\min_\theta J(\theta)$
Repeat {  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$  }   (simultaneously update all $\theta_j$)
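A minimal NumPy sketch of this loop; the learning rate, iteration count, and synthetic data are arbitrary choices of mine, and the gradient expression matches the sketch above:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """Repeat: theta := theta - alpha * grad J(theta), updating all components at once."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        q = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x)
        grad = X.T @ (q - y) / len(y)            # gradient of the cross-entropy cost
        theta = theta - alpha * grad             # simultaneous update of all theta_j
    return theta

# Tiny usage example on synthetic 1-D data (bias column + one feature)
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = (x + 0.3 * rng.normal(size=200) > 0).astype(float)
X = np.column_stack([np.ones_like(x), x])
print(gradient_descent(X, y))
```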
Maximum Likelihood Derivation of Logistic Regression Cost
We can derive the Logistic Regression cost using Maximum Likelihood, by writing down the likelihood function as
  $p(\boldsymbol{y} \mid \boldsymbol{x}, \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)$
where
  $p(y \mid x, \theta) = h_\theta(x)^{y} \left(1 - h_\theta(x)\right)^{1 - y}$
then taking the log.
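For completeness, a short worked version of that step in the same notation (this spelling-out is mine, not copied from the slide):

```latex
\begin{aligned}
\ln p(\boldsymbol{y} \mid \boldsymbol{x}, \theta)
  &= \sum_{i=1}^{m} \ln\!\left[ h_\theta(x^{(i)})^{\,y^{(i)}} \bigl(1 - h_\theta(x^{(i)})\bigr)^{1 - y^{(i)}} \right] \\
  &= \sum_{i=1}^{m} \Bigl[ y^{(i)} \ln h_\theta(x^{(i)}) + \bigl(1 - y^{(i)}\bigr) \ln\bigl(1 - h_\theta(x^{(i)})\bigr) \Bigr].
\end{aligned}
```

Negating this sum and averaging over the m examples gives exactly the cross-entropy cost J(θ) above, so maximizing the likelihood is the same as minimizing that cost.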
Decision boundary
Predict "$y = 1$" if $h_\theta(x) \ge 0.5$, i.e. if $\theta^T x \ge 0$
[Figure: two classes plotted in the $(x_1, x_2)$ plane (axis ticks 1–3) with a linear decision boundary separating them]
Non-linear decision boundaries
Predict "$y = 1$" if $h_\theta(x) \ge 0.5$
[Figure: two classes plotted in the $(x_1, x_2)$ plane (axis ticks at ±1) separated by a non-linear decision boundary]
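As an illustration of how such a boundary can come from the same logistic-regression machinery, here is a hedged sketch: the quadratic feature map and hand-picked parameters are my own, not the lecture's, and they make the linear-in-parameters rule $\theta^T \phi(x) \ge 0$ carve out a circular region.

```python
import numpy as np

def phi(x1, x2):
    """Quadratic feature map: [1, x1, x2, x1^2, x2^2]."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2])

# Hand-picked parameters: theta^T phi(x) = -1 + x1^2 + x2^2,
# so we predict y = 1 exactly outside the unit circle.
theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

x1, x2 = np.array([0.2, 1.5]), np.array([0.1, -1.2])
probs = 1.0 / (1.0 + np.exp(-(phi(x1, x2) @ theta)))   # h_theta via the sigmoid
print((probs >= 0.5).astype(int))                      # [0, 1]
```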