
Today
• Maximum Likelihood (cont’d)
• Classification
Reminder: ps0 due at midnight

Maximum Likelihood for Linear Regression

Maximum likelihood is a way of estimating model parameters 𝜃.
In general, assume the data is generated by some distribution:
𝑈 ~ 𝑝(𝑈|𝜃),   with i.i.d. observations D = {𝑢^(1), 𝑢^(2), …, 𝑢^(m)}
Likelihood:  L(D) = ∏_{i=1}^{m} 𝑝(𝑢^(i)|𝜃)
Log likelihood:  ln L(D) = Σ_{i=1}^{m} log 𝑝(𝑢^(i)|𝜃)
Maximum likelihood estimate:
𝜃_ML = argmax_𝜃 L(D) = argmax_𝜃 ∏_{i=1}^{m} 𝑝(𝑢^(i)|𝜃) = argmax_𝜃 Σ_{i=1}^{m} log 𝑝(𝑢^(i)|𝜃)
Note: p replaces h!
log(f(x)) is monotonic (increasing), so it has the same argmax as f(x)
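A quick numerical sanity check of that last point (illustrative only; the Bernoulli coin-flip data and the parameter grid below are made up, not from the slides): maximizing the product of likelihoods and maximizing the sum of log-likelihoods pick the same 𝜃.

```python
import numpy as np

# Illustrative i.i.d. Bernoulli observations (7 ones, 3 zeros) -- made-up data
u = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])

thetas = np.linspace(0.01, 0.99, 99)   # candidate parameter values
lik    = np.array([np.prod(th**u * (1 - th)**(1 - u)) for th in thetas])              # prod_i p(u_i | theta)
loglik = np.array([np.sum(u*np.log(th) + (1 - u)*np.log(1 - th)) for th in thetas])   # sum_i log p(u_i | theta)

# log is monotonic, so both criteria agree on the argmax (here theta_ML = 0.7, the sample mean)
assert np.argmax(lik) == np.argmax(loglik)
print("theta_ML =", thetas[np.argmax(loglik)])
```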

i.i.d. observations
• independent, identically distributed random variables
• If 𝑢^(i) are i.i.d. r.v.s, then
  𝑝(𝑢^(1), 𝑢^(2), …, 𝑢^(m)) = 𝑝(𝑢^(1)) 𝑝(𝑢^(2)) ⋯ 𝑝(𝑢^(m))
• A reasonable assumption about many datasets, but not always

ML: Another example
• Observe a dataset of points 𝐷 = {𝑥_i}, i = 1:10
• Assume 𝑥 is generated by a Normal distribution, 𝑥 ~ 𝑁(𝑥|𝜇, 𝜎)
• Find parameters 𝜃_ML = [𝜇, 𝜎] that maximize ∏_{i=1}^{10} 𝑁(𝑥_i|𝜇, 𝜎)
Candidate distributions:
f1 ∼ N(10, 2.25)   ← ML solution
f2 ∼ N(10, 9)
f3 ∼ N(10, 0.25)
f4 ∼ N(8, 2.25)
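The 10 points behind this example live in the slide's figure and are not reproduced here, so the sketch below draws illustrative data from N(10, 1.5²) (an assumption), just to show how the four candidates would be compared; the second argument of N(·, ·) is read as the variance (2.25 = 1.5²).

```python
import numpy as np
from scipy.stats import norm

# The 10 points from the slide's figure aren't available here, so draw
# illustrative data from N(mu=10, sigma=1.5) -- an assumption, not the slide's data.
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=1.5, size=10)

# Candidate distributions from the slide; the second argument is read as the variance.
candidates = {"f1": (10, 2.25), "f2": (10, 9), "f3": (10, 0.25), "f4": (8, 2.25)}

for name, (mu, var) in candidates.items():
    loglik = norm.logpdf(x, loc=mu, scale=np.sqrt(var)).sum()   # sum_i log N(x_i | mu, var)
    print(f"{name}: log-likelihood = {loglik:.2f}")

# The exact ML estimate over all (mu, sigma) is the sample mean and the (biased) sample std:
print("mu_ML =", x.mean(), " sigma_ML =", x.std())
```

With data actually drawn near 𝜇 = 10, 𝜎 = 1.5, f1 should come out with the highest log-likelihood, consistent with it being marked as the solution.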

ML for Linear Regression
[Figure: fitted curve h(𝑥) with a Gaussian 𝑝(𝑡|𝑥, 𝜃, 𝛽) centered at h(𝑥^(i)) around each target 𝑡^(i)]
Assume:
𝑡 = 𝑦 + 𝜖 = h(𝑥) + 𝜖,   noise 𝜖 ∼ 𝑁(𝜖|0, 𝛽⁻¹),   where 𝛽 = 1/𝜎²
(we don't get to see 𝑦, only 𝑡)

ML for Linear Regression
Assume:
𝑡 = 𝑦 + 𝜖 = h(𝑥) + 𝜖,   noise 𝜖 ∼ 𝑁(𝜖|0, 𝛽⁻¹),   where 𝛽 = 1/𝜎²
Probability of one data point:
𝑝(𝑡^(i)|𝑥^(i), 𝜃, 𝛽) = 𝑁(𝑡^(i)|h(𝑥^(i)), 𝛽⁻¹)
Likelihood function:
𝑝(𝒕|𝒙, 𝜃, 𝛽) = ∏_{i=1}^{m} 𝑁(𝑡^(i)|h(𝑥^(i)), 𝛽⁻¹)
Max. likelihood solution:
𝜃_ML = argmax_𝜃 𝑝(𝒕|𝒙, 𝜃, 𝛽)
𝛽_ML = argmax_𝛽 𝑝(𝒕|𝒙, 𝜃, 𝛽)

Want to maximize w.r.t. 𝜃 and 𝛽:
ln 𝑝(𝒕|𝒙, 𝜃, 𝛽) = −(𝛽/2) Σ_{i=1}^{m} (h(𝑥^(i)) − 𝑡^(i))² + (𝑚/2) ln 𝛽 − (𝑚/2) ln(2𝜋)
… but this is the same as minimizing the sum-of-squares cost¹
(1/2𝑚) Σ_{i=1}^{m} (h(𝑥^(i)) − 𝑡^(i))²
… which is the same as our SSE cost from before!!
¹ multiply by −1/(𝑚𝛽), changing max to min, and omit the last two terms (they don't depend on 𝜃)
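A small numerical check of this equivalence (a sketch with made-up data and a hypothetical linear h; 𝛽 is fixed to an assumed value): whichever of two parameter settings has the higher log-likelihood also has the lower SSE cost, because the two quantities differ only by a negative scaling and terms that don't involve 𝜃.

```python
import numpy as np

rng = np.random.default_rng(1)
m, beta = 50, 4.0                          # beta = 1/sigma^2 (assumed value)
x = rng.uniform(0, 3, size=m)
t = 2.0 * x + 1.0 + rng.normal(0, 1/np.sqrt(beta), size=m)   # t = h(x) + eps

def h(x, theta):                           # hypothetical linear hypothesis h_theta(x) = theta0 + theta1*x
    return theta[0] + theta[1] * x

def log_likelihood(theta):                 # ln p(t|x, theta, beta)
    sq = np.sum((h(x, theta) - t) ** 2)
    return -beta / 2 * sq + m / 2 * np.log(beta) - m / 2 * np.log(2 * np.pi)

def sse_cost(theta):                       # (1/2m) * sum of squared errors
    return np.sum((h(x, theta) - t) ** 2) / (2 * m)

theta_a, theta_b = np.array([1.0, 2.0]), np.array([0.0, 1.0])
# Higher log-likelihood <=> lower SSE cost (beta and the constant terms don't change the ordering).
print(log_likelihood(theta_a) > log_likelihood(theta_b), sse_cost(theta_a) < sse_cost(theta_b))
```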

Summary: Maximum Likelihood Solution for Linear Regression
Hypothesis:
h_𝜃(𝑥) = 𝜃ᵀ𝑥
𝜃: parameters    𝐷 = {(𝑥^(i), 𝑡^(i))}: data
Likelihood:
𝑝(𝒕|𝒙, 𝜃, 𝛽) = ∏_{i=1}^{m} 𝑁(𝑡^(i)|h_𝜃(𝑥^(i)), 𝛽⁻¹)
[Plot: training data with the fitted regression line]
Goal: maximize the likelihood, equivalent to
argmin_𝜃 (1/2𝑚) Σ_{i=1}^{m} (h_𝜃(𝑥^(i)) − 𝑡^(i))²   (same as minimizing SSE)

Probabilistic Motivation for SSE
• Under the Gaussian noise assumption, maximizing the probability of the data points is the same as minimizing a sum-of-squares cost function
• Also known as the least squares method
• ML can be used for other hypotheses!
– But linear regression has closed-form solution
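Since the closed-form solution is only mentioned in passing here, this is a minimal sketch of it (the normal equations), assuming a design matrix X whose first column is all ones; the β_ML line is the standard ML estimate of the noise precision obtained by setting the derivative of the log-likelihood w.r.t. 𝛽 to zero.

```python
import numpy as np

def fit_linear_regression_ml(X, t):
    """Closed-form ML fit: theta = (X^T X)^{-1} X^T t (the normal equations).

    X: (m, d) design matrix, assumed to already include a bias column of ones.
    t: (m,) targets.
    """
    theta = np.linalg.solve(X.T @ X, X.T @ t)     # solve instead of forming an explicit inverse
    residuals = X @ theta - t
    beta_ml = len(t) / np.sum(residuals ** 2)     # standard ML estimate of the noise precision
    return theta, beta_ml

# Illustrative usage with made-up data (not the slide's dataset)
rng = np.random.default_rng(2)
x = rng.uniform(0, 3, size=100)
t = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=100)
X = np.column_stack([np.ones_like(x), x])         # bias column + feature
theta, beta_ml = fit_linear_regression_ml(X, t)
print("theta_ML =", theta, " beta_ML =", beta_ml)
```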

Supervised Learning: Classification

Classification
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)
Tumor: Malignant / Benign?   Email: Spam / Not Spam?   Video: Viral / Not Viral?
𝑦 ∈ {0, 1}

Classification
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)
𝑦 ∈ {0, 1}
Why not use least squares regression?
argmin_𝜃 (1/2𝑚) Σ_{i=1}^{m} (h_𝜃(𝑥^(i)) − 𝑦^(i))²

Classification
0: “Negative Class” (e.g., benign tumor)
1: “Positive Class” (e.g., malignant tumor)
𝑦 ∈ {0, 1}
Why not use least squares regression?
argmin_𝜃 (1/2𝑚) Σ_{i=1}^{m} (h_𝜃(𝑥^(i)) − 𝑦^(i))²
[Plot: 0/1 labels vs. input, with the fitted line and the “decision boundary” where h_𝜃 = 0.5]
• Indeed, this is possible!
  – Predict 1 if h_𝜃(𝑥) > 0.5, 0 otherwise
• However, outliers lead to problems… (see the sketch below)
• Instead, use logistic regression
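A toy illustration of the outlier problem (my own construction with made-up 1-D “tumor size” data, not the course's example): thresholding a least-squares fit at 0.5 gives a sensible boundary until one extreme, correctly-labeled positive example drags the fitted line and shifts the boundary past other positives.

```python
import numpy as np

def lsq_boundary(x, y):
    """Fit y ~ theta0 + theta1*x by least squares and return the input value
    where the fitted line crosses 0.5 (the implied decision boundary)."""
    X = np.column_stack([np.ones_like(x), x])
    theta0, theta1 = np.linalg.lstsq(X, y, rcond=None)[0]
    return (0.5 - theta0) / theta1

# Made-up 1-D "tumor size" data: small sizes labeled 0, large sizes labeled 1
x = np.array([1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5])
y = np.array([0,   0,   0,   0,   1,   1,   1,   1  ], dtype=float)
print("boundary without outlier:", lsq_boundary(x, y))        # about 3.25, between the classes

# One very large, correctly-labeled positive example (an "outlier" in x)
x2 = np.append(x, 50.0)
y2 = np.append(y, 1.0)
print("boundary with outlier:   ", lsq_boundary(x2, y2))      # shifts right, past some true positives
```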

Least Squares vs. Logistic Regression for Classification
Figure 4.4 from Bishop. The left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve). The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.
(see Bishop 4.1.3 for more details)

Logistic Regression
Map to (0, 1) with the “sigmoid” (logistic) function:
g(z) = 1 / (1 + e^(−z)),   h_𝜃(𝑥) = g(𝜃ᵀ𝑥)
[Plot: sigmoid curve rising from 0 to 1, crossing 0.5 at z = 0]
h_𝜃(𝑥) = 𝑝(𝑦 = 1|𝑥)   “probability of class 1 given input”
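A minimal sketch of this hypothesis in code, assuming the standard form h_𝜃(x) = g(𝜃ᵀx) with a bias term folded into x (the example 𝜃 and x are made up):

```python
import numpy as np

def sigmoid(z):
    """g(z) = 1 / (1 + e^{-z}), maps any real z to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """h_theta(x) = g(theta^T x), read as p(y = 1 | x).
    x is assumed to include a leading 1 for the bias term."""
    return sigmoid(x @ theta)

# Example with hypothetical parameters and input
theta = np.array([-3.0, 1.0, 1.0])    # [bias weight, weight for x1, weight for x2]
x = np.array([1.0, 2.0, 2.5])         # [1, x1, x2]
print(predict_proba(theta, x))        # about 0.82, so predict y = 1
```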

Hypothesis: h_𝜃(𝑥) = g(𝜃ᵀ𝑥)
Predict “𝑦 = 1” if h_𝜃(𝑥) ≥ 0.5
Predict “𝑦 = 0” if h_𝜃(𝑥) < 0.5
[Plot: sigmoid output from 0 to 1, with the “decision boundary” where h_𝜃 = 0.5]

Logistic Regression Cost
Hypothesis: h_𝜃(𝑥) = g(𝜃ᵀ𝑥)
𝜃: parameters    𝐷 = {(𝑥^(i), 𝑦^(i))}: data
Cost Function: cross entropy
J(𝜃) = −(1/𝑚) Σ_{i=1}^{m} [𝑦^(i) log h_𝜃(𝑥^(i)) + (1 − 𝑦^(i)) log(1 − h_𝜃(𝑥^(i)))]
Goal: minimize cost
[Plot: “decision boundary” where h_𝜃 = 0.5]

Cross Entropy Cost
• Cross entropy compares a distribution q to a reference distribution p
• Here q is the predicted probability of 𝑦 = 1 given 𝑥; the reference distribution is p = 𝑦^(i), which is either 1 or 0

Gradient of Cross Entropy Cost
• Cross entropy cost: J(𝜃) as above
• Its gradient w.r.t. 𝜃 is: ∂J(𝜃)/∂𝜃_j = (1/𝑚) Σ_{i=1}^{m} (h_𝜃(𝑥^(i)) − 𝑦^(i)) 𝑥_j^(i)
• No direct closed-form solution (derivation left as exercise)

Gradient descent for Logistic Regression
Cost: J(𝜃)
Want: min_𝜃 J(𝜃)
Repeat {
  𝜃_j := 𝜃_j − 𝛼 ∂J(𝜃)/∂𝜃_j
} (simultaneously update all 𝜃_j; a code sketch appears at the end of this section)

Maximum Likelihood Derivation of Logistic Regression Cost
We can derive the Logistic Regression cost using Maximum Likelihood, by writing down the likelihood function as
𝑝(𝒚|𝒙, 𝜃) = ∏_{i=1}^{m} h_𝜃(𝑥^(i))^{𝑦^(i)} (1 − h_𝜃(𝑥^(i)))^{1−𝑦^(i)},
where h_𝜃(𝑥^(i)) = 𝑝(𝑦 = 1|𝑥^(i)), then taking the log (its negative gives the cross-entropy cost above).

Decision boundary
[Plot: two classes in the (𝑥₁, 𝑥₂) plane separated by a linear decision boundary]
Predict “𝑦 = 1” if … (the linear condition shown in the slide's figure)

Non-linear decision boundaries
[Plot: circular decision boundary around the origin in the (𝑥₁, 𝑥₂) plane]
Predict “𝑦 = 1” if … (the non-linear condition shown in the slide's figure)
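Tying the cost, gradient, and update rule together, here is a compact sketch (not the course's reference implementation) assuming h_𝜃(x) = g(𝜃ᵀx), a design matrix whose first column is ones, and a made-up 2-D dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ], with h = g(X @ theta)."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) / m

def gradient(theta, X, y):
    """dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij, written as a matrix product."""
    m = len(y)
    return X.T @ (sigmoid(X @ theta) - y) / m

def gradient_descent(X, y, alpha=0.1, iters=5000):
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        theta -= alpha * gradient(theta, X, y)   # simultaneous update of all theta_j
    return theta

# Illustrative usage with made-up 2-D data (not the slide's dataset)
rng = np.random.default_rng(3)
pos = rng.normal([2.0, 2.0], 0.5, size=(20, 2))
neg = rng.normal([0.0, 0.0], 0.5, size=(20, 2))
X = np.column_stack([np.ones(40), np.vstack([pos, neg])])   # bias column + features x1, x2
y = np.concatenate([np.ones(20), np.zeros(20)])
theta = gradient_descent(X, y)
print("theta =", theta, " cost =", cross_entropy_cost(theta, X, y))
```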