Supervised Learning I: Regression
Today
• Multivariate linear regression
• Solutions for the SSD cost
  – Indirect
  – Direct
• Maximum likelihood cost
Linear Regression
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
θ's: Parameters
[Figure: scatter plot of housing data, price ($1000) vs. size (feet²)]
Cost Function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
SSD = sum of squared differences, also called SSE = sum of squared errors
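To make the cost concrete, here is a minimal numpy sketch; the function name `ssd_cost` and the toy numbers are illustrative assumptions, not from the lecture.

```python
import numpy as np

def ssd_cost(theta, X, y):
    """SSD/SSE cost J(theta) = (1/2m) * sum_i (h_theta(x_i) - y_i)^2.

    X: (m, n+1) design matrix with a leading column of ones,
    y: (m,) targets, theta: (n+1,) parameters.
    """
    m = len(y)
    residuals = X @ theta - y          # h_theta(x_i) - y_i for every example
    return (residuals @ residuals) / (2 * m)

# Toy usage (illustrative numbers):
X = np.array([[1.0, 2104.0], [1.0, 1416.0], [1.0, 852.0]])
y = np.array([460.0, 232.0, 178.0])
print(ssd_cost(np.zeros(2), X, y))
```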
Multidimensional inputs

Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178
… | … | … | … | …
Notation:
$m$ = number of training examples
$n$ = number of features
$x^{(i)}$ = input (features) of the $i$-th training example
$x_j^{(i)}$ = value of feature $j$ in the $i$-th training example
Multivariate Linear Regression
Hypothesis: $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \cdots + \theta_n x_n = \theta^T x$
For convenience of notation, define $x_0 = 1$.
θ's: Parameters
Cost Function: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: minimize $J(\theta)$
How??
Two potential solutions
$\min_\theta J\left(\theta;\, x^{(1)}, y^{(1)}, \ldots, x^{(m)}, y^{(m)}\right)$
Gradient descent (or other iterative algorithm)
– Start with a guess for θ
– Change θ to decrease $J(\theta)$
– Repeat until a minimum is reached
Direct minimization
– Take derivative, set to zero
– Necessary condition for a minimum (sufficient here, since the SSD cost is convex)
– Not possible for most “interesting” cost functions
Solving Linear Regression
Gradient Descent
Gradient Descent Algorithm
Set θ = 0
Repeat {
  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$   (simultaneously for all $j$)
} until convergence
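As a sketch of this iterate-until-convergence idea (before specializing to least squares), here is a generic gradient descent loop in numpy; the argument name `grad_J`, the step size `alpha`, and the tolerance `tol` are illustrative assumptions.

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.01, tol=1e-6, max_iters=10_000):
    """Generic gradient descent: start at theta0 and step against the
    gradient of J until the update is smaller than tol (a simple
    convergence test)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad_J(theta)   # all coordinates updated simultaneously
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Usage on a toy cost J(theta) = theta^2, whose gradient is 2*theta:
print(gradient_descent(lambda t: 2 * t, theta0=np.array([3.0])))  # ~0
```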
Gradient Descent: Intuition
(left) For fixed θ, $h_\theta(x)$ is a function of x. (right) $J(\theta)$ is a function of the parameter θ.
[Figure: left panel plots the data and the line $h_\theta(x)$ against x; right panel plots the cost $J(\theta)$ against θ]
[Figure: gradient descent illustration (credit: https://towardsdatascience.com/)]
Gradient for Least Squares Cost
For one example $(x, y)$:
$\frac{\partial}{\partial \theta_j} \frac{1}{2}\left(h_\theta(x) - y\right)^2 = \left(h_\theta(x) - y\right) \cdot \frac{\partial}{\partial \theta_j} h_\theta(x) = \left(h_\theta(x) - y\right) x_j$
What is this? $(h_\theta(x) - y)$ is the prediction error on the example.
Gradient Descent Algorithm
Set θ = 0
Repeat {
  $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$   (simultaneously for all $j$)
} until convergence

In vector form:
$\boldsymbol{\theta} := \boldsymbol{\theta} - \alpha \frac{1}{m} \boldsymbol{X}^T \boldsymbol{e}$, where $e_i = h_\theta(\boldsymbol{x}^{(i)}) - y^{(i)}$ and
$\boldsymbol{X} = \begin{pmatrix} x_0^{(1)} & \cdots & x_n^{(1)} \\ \vdots & \ddots & \vdots \\ x_0^{(m)} & \cdots & x_n^{(m)} \end{pmatrix}$   (each row is one training example, with $x_0 = 1$)
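A minimal numpy sketch of this loop, assuming X already includes the column of ones; `alpha` and `num_iters` are illustrative choices, not values from the lecture.

```python
import numpy as np

def gd_linear_regression(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix (first column all ones), y: (m,) targets.
    Implements theta := theta - alpha * (1/m) * X^T e with e = X theta - y.
    """
    m, d = X.shape
    theta = np.zeros(d)                 # "Set theta = 0"
    for _ in range(num_iters):
        e = X @ theta - y               # errors e_i = h_theta(x_i) - y_i
        theta -= alpha * (X.T @ e) / m  # simultaneous update of all theta_j
    return theta

# Toy usage on y = 1 + 2x (illustrative data):
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
print(gd_linear_regression(X, y, alpha=0.5, num_iters=5000))  # ~[1, 2]
```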
Feature normalization
• If features have very different scales, GD can get "stuck", since the scale of $x_j$ affects the size of the gradient in the $j$-th dimension
• Normalizing features to zero mean ($\mu_j$) and the same variance ($\sigma_j$) helps gradient descent converge faster:

$x_j^{(i)} := \frac{x_j^{(i)} - \mu_j}{\sigma_j}$
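A numpy sketch of this standardization; it assumes X holds only the raw features (the column of ones is added afterwards), which is a layout choice of this example rather than something from the slides.

```python
import numpy as np

def normalize_features(X):
    """Standardize each feature column to zero mean and unit variance.

    X: (m, n) raw feature matrix (without the column of ones).
    Returns the normalized X along with mu and sigma, which must be
    reused to transform any future inputs the same way.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma

# Toy usage with two very differently scaled features (size, bedrooms):
X = np.array([[2104.0, 5.0], [1416.0, 3.0], [852.0, 2.0]])
X_norm, mu, sigma = normalize_features(X)
print(X_norm.mean(axis=0), X_norm.std(axis=0))  # ~[0, 0], [1, 1]
```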
Solving Linear Regression
Direct Solution
Direct solution
Want to minimize the SSD: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Find the minima of the function: set $\frac{\partial J}{\partial \theta_j} = 0$ (for every $j$)
Solve for $\theta$
Direct solution
Re-write the SSD using vector-matrix notation:
$J(\theta) = \frac{1}{2m}\left(X\theta - y\right)^T \left(X\theta - y\right)$
where $X$ is the $m \times (n+1)$ design matrix whose $i$-th row is $(x^{(i)})^T$, and $y$ is the vector of training outputs.
Solution: the Normal Equation
$\theta = \left(X^T X\right)^{-1} X^T y$
Derivation of Normal Equations
• SSE in matrix form:
  $J(\theta) = \frac{1}{2m}\left(X\theta - y\right)^T \left(X\theta - y\right) = \frac{1}{2m}\left[\theta^T \left(X^T X\right) \theta - 2\left(X^T y\right)^T \theta + \mathrm{const}\right]$
• Take the derivative with respect to $\theta$ (a vector), set it to 0:
  $\frac{\partial J}{\partial \theta} \propto X^T X \theta - X^T y = 0$   (ignoring the constant multiplier)
  $\theta = \left(X^T X\right)^{-1} X^T y$
• Also known as the least-mean-squares, or least-squares, solution
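A numpy sketch of the closed form; using `np.linalg.solve` instead of explicitly inverting $X^T X$ is a standard numerical choice of this example, not something from the slides.

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y without forming the inverse.

    X: (m, n+1) design matrix with a leading column of ones, y: (m,) targets.
    Solving the linear system X^T X theta = X^T y is more stable than inv().
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy usage on y = 1 + 2x (same illustrative data as before):
x = np.linspace(0, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
print(normal_equation(X, y))  # ~[1, 2], matching gradient descent
```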
Example:

Size (feet²) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
2104 | 5 | 1 | 45 | 460
1416 | 3 | 2 | 40 | 232
1534 | 3 | 2 | 30 | 315
852 | 2 | 1 | 36 | 178

Design Matrix (with a leading column of ones for $x_0$):
$X = \begin{pmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{pmatrix}, \quad y = \begin{pmatrix} 460 \\ 232 \\ 315 \\ 178 \end{pmatrix}$

Normal Equation: $\theta = \left(X^T X\right)^{-1} X^T y$
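Plugging these numbers in with numpy; note that with only four examples and five parameters, $X^T X$ is singular, so this sketch uses `np.linalg.lstsq` (which returns a minimum-norm least-squares solution) rather than a plain inverse. That caveat is an implementation detail of this example, not something from the slides.

```python
import numpy as np

# Design matrix and targets from the example above.
X = np.array([
    [1.0, 2104.0, 5.0, 1.0, 45.0],
    [1.0, 1416.0, 3.0, 2.0, 40.0],
    [1.0, 1534.0, 3.0, 2.0, 30.0],
    [1.0,  852.0, 2.0, 1.0, 36.0],
])
y = np.array([460.0, 232.0, 315.0, 178.0])

# With m=4 examples and 5 parameters, X^T X is singular, so we use
# lstsq, which returns the minimum-norm least-squares solution.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
print(X @ theta)  # should reproduce y (up to numerical error),
                  # since the system is underdetermined
```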
Trade-offs (m training examples, n features)
Gradient Descent
• Need to choose α.
• Needs many iterations.
• Works well even when n is large.
Normal Equations
• No need to choose α.
• Don't need to iterate.
• Need to compute $\left(X^T X\right)^{-1}$.
• Slow if n is very large.
Maximum Likelihood for Linear Regression
So far, we have treated outputs as noiseless
• Defined cost function as “distance to true output”
• An alternate view:
– data (x,y) are generated by unknown process
– however, we only observe a noisy version
– how can we model this uncertainty?
• Alternative cost function?
How to model uncertainty in data?
Hypothesis: $h_\theta(x) = \theta^T x$
θ: parameters
$D = \left\{\left(x^{(i)}, y^{(i)}\right)\right\}$: data
[Figure: scatter plot of price ($1000) vs. size (feet²), as before]
New cost function: maximize the probability of the data given the model:
$p\left(\left\{\left(x^{(i)}, y^{(i)}\right)\right\} \mid \theta\right)$
Recall: Cost Function
Training set $\{x^{(i)}, y^{(i)}\}$ → cost function $J$ → find θ with minimal cost on the training set
(At prediction time: input x → $h_\theta$ → output y)

Alternative View: "Maximum Likelihood"
Training set $\{x^{(i)}, y^{(i)}\}$ → find θ with maximum probability of the training set, $p\left(\{x^{(i)}, y^{(i)}\} \mid \theta\right)$
(At prediction time: input x → $h_\theta$ → output y)
Maximum Likelihood: Example
• Intuitive example: Estimate a coin toss
Having seen 3 heads and 2 tails, what is the chance of heads (or tails) on my next flip?
• Model:
Each flip is a Bernoulli random variable 𝑋
$X$ can take only two values: 1 (heads), 0 (tails)
$p(X = 1) = \theta, \quad p(X = 0) = 1 - \theta$
• 𝜃 is a parameter to be identified from data
Maximum Likelihood: Example
• 5 (independent) trials
• Likelihood of all 5 observations:
$p\left(X_1, \ldots, X_5 \mid \theta\right) = \theta^3 \left(1 - \theta\right)^2$
• Intuition: ML chooses θ such that the likelihood is maximized
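To see this numerically, here is a small numpy sketch that evaluates the likelihood $\theta^3(1-\theta)^2$ on a grid and picks the maximizer; the grid resolution is an arbitrary illustrative choice.

```python
import numpy as np

# Likelihood of 3 heads and 2 tails as a function of theta.
theta = np.linspace(0.0, 1.0, 1001)
likelihood = theta**3 * (1 - theta)**2

# The maximizing theta matches the analytic answer 3/5.
print(theta[np.argmax(likelihood)])  # 0.6
```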
Maximum Likelihood: Example
• 5 (independent) trials
• Likelihood of all 5 observations:
$p\left(X_1, \ldots, X_5 \mid \theta\right) = \theta^3 \left(1 - \theta\right)^2$
• Solution (left as an exercise):
  $\theta_{ML} = \frac{3}{3 + 2}$
  i.e., the fraction of heads in the total number of trials
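For completeness, a sketch of the exercise in LaTeX, maximizing the log-likelihood (this derivation is not spelled out in the slides):

```latex
% Maximize the log-likelihood instead of the likelihood (same maximizer):
\log p(X_1,\ldots,X_5 \mid \theta) = 3\log\theta + 2\log(1-\theta)
% Set the derivative with respect to theta to zero:
\frac{d}{d\theta}\left[3\log\theta + 2\log(1-\theta)\right]
  = \frac{3}{\theta} - \frac{2}{1-\theta} = 0
% Solve: 3(1-\theta) = 2\theta \;\Rightarrow\; \theta_{ML} = \frac{3}{5}
```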
PSet 1 Out
• Due Tuesday 9/15, 11:59pm GMT-5 (Boston time)
• Diagnostic homework on topics covered in the prerequisites
Next Class
Supervised Learning II: Classification
Topics: classification; the sigmoid function; logistic regression
Reading: Bishop 4.3.1–4.3.2; 4.3.4 (overview of logistic regression)