
Today
• Maximum Likelihood (cont’d)
• Classification
Reminder: ps1 due at midnight

Maximum Likelihood for Linear Regression

Maximum likelihood is a way of estimating model parameters $\theta$.
In general, assume the data is generated by some distribution:
$u \sim p(u \mid \theta)$, with observations (i.i.d.) $D = \{u^{(1)}, u^{(2)}, \dots, u^{(m)}\}$
Likelihood: $\mathcal{L}(D) = \prod_{i=1}^{m} p(u^{(i)} \mid \theta)$
Maximum likelihood estimate:
$\theta_{ML} = \operatorname*{argmax}_{\theta} \mathcal{L}(D) = \operatorname*{argmax}_{\theta} \prod_{i=1}^{m} p(u^{(i)} \mid \theta) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p(u^{(i)} \mid \theta)$ (log likelihood)
Note: $p$ replaces $h$!
The last equality holds because $\log(f(x))$ is monotonically increasing, so it has the same argmax as $f(x)$.
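
Not from the slides: a small numerical check, with a made-up Gaussian sample, that the likelihood (product of densities) and the log-likelihood (sum of log densities) pick the same parameter.

```python
import numpy as np
from scipy.stats import norm

# Toy i.i.d. sample (made-up values), assumed to come from N(mu, 1)
u = np.array([4.8, 5.1, 5.4, 4.9, 5.3])

# Evaluate the likelihood and the log-likelihood on a grid of candidate mu values
mus = np.linspace(3.0, 7.0, 401)
likelihood = np.array([np.prod(norm.pdf(u, loc=m, scale=1.0)) for m in mus])
log_likelihood = np.array([np.sum(norm.logpdf(u, loc=m, scale=1.0)) for m in mus])

# Because log is monotonically increasing, both criteria pick the same mu
print(mus[np.argmax(likelihood)])      # ~5.1 (the sample mean)
print(mus[np.argmax(log_likelihood)])  # same value
```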

i.i.d. observations
• independent and identically distributed random variables
• If $u_i$ are i.i.d. r.v.s, then $p(u_1, u_2, \dots, u_m) = p(u_1)\, p(u_2) \cdots p(u_m)$
• A reasonable assumption for many datasets, but not always

ML: Another example
• Observe a dataset of points $D = \{x_i\}_{i=1:10}$
• Assume $x$ is generated by a Normal distribution, $x \sim N(x \mid \mu, \sigma)$
• Find the parameters $\theta_{ML} = [\mu, \sigma]$ that maximize $\prod_{i=1}^{10} N(x_i \mid \mu, \sigma)$
[Figure: the data points together with candidate Gaussians f1 ∼ N(10, 2.25), f2 ∼ N(10, 9), f3 ∼ N(10, 0.25), f4 ∼ N(8, 2.25)]
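
A minimal sketch of this example, assuming a hypothetical 10-point sample (the actual points appear only in the figure) and reading the second argument of N(·,·) as a variance: the closed-form ML estimates are the sample mean and standard deviation, and the candidate nearest those estimates attains the highest log-likelihood.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 10-point sample standing in for the dataset from the slide
x = np.array([8.2, 11.9, 9.4, 12.1, 10.3, 7.8, 11.2, 8.9, 10.8, 9.5])

# Closed-form ML estimates for a Gaussian: sample mean and (biased) sample std
mu_ml = x.mean()
sigma_ml = np.sqrt(((x - mu_ml) ** 2).mean())
print(mu_ml, sigma_ml)

# Score the four candidate distributions from the figure (second argument = variance)
candidates = {"f1": (10, 2.25), "f2": (10, 9), "f3": (10, 0.25), "f4": (8, 2.25)}
for name, (mu, var) in candidates.items():
    ll = norm.logpdf(x, loc=mu, scale=np.sqrt(var)).sum()
    print(name, ll)  # f1, the candidate nearest the ML estimates, scores highest here
```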

ML for Linear Regression
[Figure: regression curve $h(x)$ with a Gaussian $p(t \mid x, \theta, \beta)$ centered at $h(x^{(i)})$ around each target $t^{(i)}$]
Assume:
$t = y + \epsilon = h(x) + \epsilon$, with noise $\epsilon \sim N(\epsilon \mid 0, \beta^{-1})$, where $\beta = \frac{1}{\sigma^2}$
(we don't get to see $y$, only $t$)

ML for Linear Regression
Assume:
$t = y + \epsilon = h(x) + \epsilon$, with noise $\epsilon \sim N(\epsilon \mid 0, \beta^{-1})$, where $\beta = \frac{1}{\sigma^2}$
Probability of one data point: $p(t^{(i)} \mid x^{(i)}, \theta, \beta) = N(t^{(i)} \mid h(x^{(i)}), \beta^{-1})$
Likelihood function (i.i.d. data): $p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = \prod_{i=1}^{m} N(t^{(i)} \mid h(x^{(i)}), \beta^{-1})$
Max. likelihood solution:
$\theta_{ML} = \operatorname*{argmax}_{\theta}\, p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$, $\qquad \beta_{ML} = \operatorname*{argmax}_{\beta}\, p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta)$

Want to maximize w.r.t. $\theta$:
$\ln p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = -\frac{\beta}{2} \sum_{i=1}^{m} \big(h(x^{(i)}) - t^{(i)}\big)^2 + \frac{m}{2} \ln \beta - \frac{m}{2} \ln(2\pi)$
… but this is the same as minimizing the sum-of-squares cost¹
$\frac{1}{2m} \sum_{i=1}^{m} \big(h(x^{(i)}) - t^{(i)}\big)^2$
… which is the same as our SSE cost from before!!
¹ Multiply by $-\frac{1}{m\beta}$ (changing max to min) and omit the last two terms (they don't depend on $\theta$).
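
A quick numerical check of this equivalence, with made-up data and β treated as known: minimizing the negative Gaussian log-likelihood and minimizing the sum-of-squares cost recover the same θ.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data (assumed for illustration): t = 2 + 3x + Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=50)
t = 2.0 + 3.0 * x + rng.normal(0, 0.5, size=50)
X = np.column_stack([np.ones_like(x), x])    # add a bias feature

beta = 1.0 / 0.5**2                          # noise precision, treated as known here

def neg_log_likelihood(theta):
    r = X @ theta - t
    return 0.5 * beta * np.sum(r**2) - 0.5 * len(t) * np.log(beta) + 0.5 * len(t) * np.log(2 * np.pi)

def sse_cost(theta):
    r = X @ theta - t
    return np.sum(r**2) / (2 * len(t))

theta_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x
theta_sse = minimize(sse_cost, x0=np.zeros(2)).x
print(theta_ml, theta_sse)   # both recover (approximately) the same parameters
```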

Summary: Maximum Likelihood Solution for Linear Regression
Hypothesis: $h_\theta(x) = \theta^T x$
$\theta$: parameters; $D = \{(x^{(i)}, t^{(i)})\}$: data
Likelihood: $p(\boldsymbol{t} \mid \boldsymbol{x}, \theta, \beta) = \prod_{i=1}^{m} N(t^{(i)} \mid h_\theta(x^{(i)}), \beta^{-1})$
[Figure: scatter plot of the data with the fitted regression line]
Goal: maximize the likelihood, equivalent to $\operatorname*{argmin}_{\theta} \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - t^{(i)}\big)^2$ (same as minimizing SSE)

Probabilistic Motivation for SSE
• Under the Gaussian noise assumption, maximizing the probability of the data points is the same as minimizing a sum-of-squares cost function
• Also known as the least squares method
• ML can be used for other hypotheses!
– But linear regression has a closed-form solution (see the sketch below)
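
As referenced above, a minimal sketch of that closed-form solution, the normal equations $\theta = (X^T X)^{-1} X^T t$, on made-up housing-style data.

```python
import numpy as np

# Made-up data: t = 50 + 0.12*x + Gaussian noise (e.g., x = house size, t = price)
rng = np.random.default_rng(1)
x = rng.uniform(0, 3000, size=100)
t = 50 + 0.12 * x + rng.normal(0, 20, size=100)

X = np.column_stack([np.ones_like(x), x])   # design matrix with a bias column
theta = np.linalg.solve(X.T @ X, X.T @ t)   # normal equations; solve/lstsq instead of an explicit inverse
print(theta)                                # ~[50, 0.12]
```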

Supervised Learning: Classification

Classification
0: “Negative Class” (e.g., benign tumor); 1: “Positive Class” (e.g., malignant tumor)
Tumor: Malignant / Benign? Email: Spam / Not Spam? Video: Viral / Not Viral?
$y \in \{0, 1\}$

Classification
0: “Negative Class” (e.g., benign tumor); 1: “Positive Class” (e.g., malignant tumor)
$y \in \{0, 1\}$
Why not use least squares regression?
$\operatorname*{argmin}_{\theta} \frac{1}{2m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)^2$
• Indeed, this is possible!
– Predict 1 if $h_\theta(x) > 0.5$, 0 otherwise (“decision boundary” where $h_\theta = 0.5$)
• However, outliers lead to problems… (see the sketch below)
• Instead, use logistic regression
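
The sketch referenced above: a made-up 1-D example of classifying by least squares regression with the 0.5 threshold, showing how a single far-away (but correctly labelled) point shifts the decision boundary.

```python
import numpy as np

# Fit h_theta(x) = theta0 + theta1*x to labels in {0,1}; predict 1 if h > 0.5.
def lstsq_boundary(x, y):
    X = np.column_stack([np.ones_like(x), x])
    theta0, theta1 = np.linalg.lstsq(X, y, rcond=None)[0]
    return (0.5 - theta0) / theta1          # x where h_theta(x) = 0.5

x = np.array([1., 2., 3., 4., 6., 7., 8., 9.])
y = np.array([0., 0., 0., 0., 1., 1., 1., 1.])
print(lstsq_boundary(x, y))                 # boundary at x = 5, as expected

# Add a far-away, correctly labelled point: the boundary shifts to ~6.4,
# so x = 6 (a positive example) now falls on the wrong side.
x2 = np.append(x, 40.0)
y2 = np.append(y, 1.0)
print(lstsq_boundary(x2, y2))
```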

Least Squares vs. Logistic Regression for Classification
Figure 4.4 from Bishop. The left plot shows data from two classes, denoted by red crosses and blue circles, together with the decision boundary found by least squares (magenta curve) and also by the logistic regression model (green curve). The right-hand plot shows the corresponding results obtained when extra data points are added at the bottom left of the diagram, showing that least squares is highly sensitive to outliers, unlike logistic regression.
(see Bishop 4.1.3 for more details)

Logistic Regression
Map to (0, 1) with the “sigmoid” function $g(z) = \frac{1}{1 + e^{-z}}$, so $h_\theta(x) = g(\theta^T x)$.
[Figure: sigmoid curve vs. $z$, rising from 0 through 0.5 at $z = 0$ toward 1]
$h_\theta(x) = p(y = 1 \mid x)$, “probability of class 1 given input”
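
A minimal sketch of the sigmoid and the resulting hypothesis (the names `sigmoid` and `h` are my own).

```python
import numpy as np

def sigmoid(z):
    """Map any real z to (0, 1); sigmoid(0) = 0.5."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # Logistic-regression hypothesis: predicted probability of class 1 given input x
    return sigmoid(x @ theta)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))    # ~[0.0067, 0.5, 0.9933]

theta = np.array([-5.0, 1.0])                 # hypothetical parameters
x = np.array([[1.0, 3.0], [1.0, 7.0]])        # two inputs with a bias feature
print(h(theta, x))                            # ~[0.12, 0.88]
```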

Logistic Regression
Hypothesis: $h_\theta(x) = g(\theta^T x)$, the sigmoid applied to $\theta^T x$
Predict “$y = 1$” if $h_\theta(x) \geq 0.5$; predict “$y = 0$” if $h_\theta(x) < 0.5$
“decision boundary” where $h_\theta = 0.5$

Logistic Regression Cost
Hypothesis: $h_\theta(x) = g(\theta^T x)$
$\theta$: parameters; $D = \{(x^{(i)}, y^{(i)})\}$: data
Cost function (cross entropy): $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \big[\, y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h_\theta(x^{(i)})\big) \,\big]$
Goal: minimize cost

Cross Entropy Cost
• Cross entropy compares a distribution q to a reference distribution p
• Here q is the predicted probability of y = 1 given x; the reference distribution is p = $y^{(i)}$, which is either 1 or 0

Gradient of Cross Entropy Cost
• Cross entropy cost: $J(\theta)$ as above
• Its gradient w.r.t. $\theta$ is: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\, x_j^{(i)}$
• No direct closed-form solution (left as exercise)

Gradient descent for Logistic Regression
Cost: $J(\theta)$. Want: $\min_\theta J(\theta)$
Repeat { $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$ } (simultaneously update all $\theta_j$)

Maximum Likelihood Derivation of Logistic Regression Cost
We can derive the logistic regression cost using maximum likelihood, by writing down the likelihood function as $p(\boldsymbol{y} \mid \boldsymbol{x}, \theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}} \big(1 - h_\theta(x^{(i)})\big)^{1 - y^{(i)}}$, where $h_\theta(x^{(i)}) = p(y^{(i)} = 1 \mid x^{(i)})$, and then taking the log.

Decision boundary
[Figure: 2-D example with features $x_1$, $x_2$; predict “$y = 1$” on one side of a linear decision boundary]

Non-linear decision boundaries
[Figure: 2-D example with features $x_1$, $x_2$ where separating the classes requires a non-linear decision boundary]

Supervised Learning II: Non-linear features
What to do if the data is nonlinear?

Nonlinear basis functions
Transform the input/feature: $\phi(x) : x \in \mathbb{R}^2 \rightarrow z = x_1 \cdot x_2$

Another example
How to transform the input/feature?
$\phi(x) : x \in \mathbb{R}^2 \rightarrow z = [\,x_1^2,\; x_1 \cdot x_2,\; x_2^2\,]^T$
Transformed training data: linearly separable.
Intuition: suppose $\theta = [\,1,\; 0,\; 1\,]^T$. Then $\theta^T z = x_1^2 + x_2^2$, i.e., the squared distance to the origin!

Non-linear basis functions
• We can use a nonlinear mapping, or basis function, $\phi(x) : x \in \mathbb{R}^N \rightarrow z \in \mathbb{R}^M$
• where M is the dimensionality of the new feature/input $z$ (or $\phi(x)$)
• Note that M could be greater than N, less than N, or the same

Example with regression
Add more polynomial basis functions.
Being too adaptive leads to better results on the training data, but not-so-great results on data that has not been seen!
[Figure: polynomial fits of increasing order; an intermediate order gives a good fit]

Supervised Learning II: Overfitting

Overfitting
Parameters for higher-order polynomials are very large:
            M = 0    M = 1     M = 3         M = 9
θ0           0.19     0.82      0.31          0.35
θ1                   -1.27      7.99        232.37
θ2                            -25.43      -5321.83
θ3                             17.37      48568.31
θ4                                      -231639.30
θ5                                       640042.26
θ6                                     -1061800.52
θ7                                      1042400.18
θ8                                      -557682.99
θ9                                       125201.43

Overfitting disaster
Fitting the housing price data with M = 3.
Note that the predicted price goes to zero (or negative) for bigger houses! This is called poor generalization/overfitting.

Detecting overfitting
Plot model complexity versus the objective function on the training and test data.
As the model becomes more complex, the error on the training data keeps decreasing, while the error on the test data eventually increases.
Horizontal axis: a measure of model complexity. In this example, we use the maximum order of the polynomial basis functions.
Vertical axis: for regression, SSE or mean squared error (MSE); for classification, the classification error rate or the cross-entropy error function.

Overcoming overfitting
• Basic ideas
– Use more training data
– Regularization methods
– Cross-validation

Solution: use more data
M = 9, increase N.
What if we do not have a lot of data?

Next Class
Supervised Learning 3: Regularization
More logistic regression, regularization; bias-variance
Reading: Bishop 3.1, 3.2
Discussion/Lab this week: Intro to Numpy
PSet 2 out on Thursday
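
As a small appendix to the logistic-regression slides above: a minimal sketch of batch gradient descent on the cross-entropy cost, using made-up 1-D data and a hypothetical learning rate and iteration count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(theta, X, y):
    # J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

# Made-up 1-D data: label is 1 when x plus noise exceeds 5; bias column added
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = (x + rng.normal(0, 1, size=100) > 5).astype(float)
X = np.column_stack([np.ones_like(x), x])

# Batch gradient descent; alpha and the iteration count are hypothetical choices
theta = np.zeros(2)
alpha = 0.1
for _ in range(5000):
    theta -= alpha * gradient(theta, X, y)   # simultaneously updates all theta_j

print(theta, cross_entropy(theta, X, y))
# Decision boundary where h_theta(x) = 0.5, i.e. theta0 + theta1*x = 0
print(-theta[0] / theta[1])                  # approximately 5
```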