Regression
Regression & Least squares
Setup
• Regression: training data $\mathcal{D}_n = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})\}$, with $x^{(i)} \in \mathbb{R}^d$, $y^{(i)} \in \mathbb{R}$; hypothesis $h : \mathbb{R}^d \to \mathbb{R}$
• Linear regression: $\mathcal{H} = \{\, h(x; \theta, \theta_0) = \theta^T x + \theta_0 \mid \theta \in \mathbb{R}^d,\ \theta_0 \in \mathbb{R} \,\}$
• Squared loss: $\mathrm{Loss}(\mathrm{guess}, \mathrm{actual}) = (\mathrm{guess} - \mathrm{actual})^2$
• Minimising the average squared loss for a linear hypothesis is ordinary least squares (OLS)
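As a concrete illustration, a minimal NumPy sketch of the linear hypothesis and the squared loss on one example (all names and values below are assumptions for illustration, not from the slides):

```python
import numpy as np

def h(x, theta, theta0):
    """Linear hypothesis h(x; theta, theta0) = theta^T x + theta0."""
    return theta @ x + theta0

def squared_loss(guess, actual):
    """Squared loss: Loss(guess, actual) = (guess - actual)^2."""
    return (guess - actual) ** 2

# One training example with d = 3 (assumed values)
x, y = np.array([1.0, 2.0, -1.0]), 0.5
theta, theta0 = np.array([0.2, -0.3, 0.1]), 0.05
print(squared_loss(h(x, theta, theta0), y))
```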
OLS solution using optimisation
$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2$

We want:
$(\theta^*, \theta_0^*) = \arg\min_{\theta, \theta_0} J(\theta, \theta_0)$
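A minimal NumPy sketch of this objective as a function (the array names X, Y, theta, theta0 are assumptions, not from the slides):

```python
import numpy as np

def ols_objective(theta, theta0, X, Y):
    """J(theta, theta0) = (1/n) * sum_i (theta^T x^(i) + theta0 - y^(i))^2."""
    residuals = X @ theta + theta0 - Y   # guess - actual for every example
    return np.mean(residuals ** 2)
```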
OLS analytical solution
$J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2$

Stack the data into matrices:
$X = \begin{pmatrix} x^{(1)}_1 & \cdots & x^{(1)}_d \\ \vdots & \ddots & \vdots \\ x^{(n)}_1 & \cdots & x^{(n)}_d \end{pmatrix}, \qquad Y = \begin{pmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{pmatrix}$

Assume one feature is made to be all 1's, so that $\theta_0$ is absorbed into $\theta$. Then:
$J(\theta) = \frac{1}{n} (X\theta - Y)^T (X\theta - Y)$
OLS analytical solution
$J(\theta) = \frac{1}{n} (X\theta - Y)^T (X\theta - Y)$

$\nabla_\theta J = \frac{2}{n} X^T (X\theta - Y)$
($\nabla_\theta J$ has the same shape as $\theta$, i.e. $d \times 1$: $(\partial J / \partial \theta_1, \ldots, \partial J / \partial \theta_d)^T$)

Setting the gradient to zero:
$X^T X \theta - X^T Y = 0$
$X^T X \theta = X^T Y$
$(X^T X)^{-1} X^T X \theta = (X^T X)^{-1} X^T Y$
$\theta^* = (X^T X)^{-1} X^T Y$
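A minimal NumPy sketch of this closed-form solution on synthetic data (the data, names, and "true" parameters are assumptions; np.linalg.solve is used instead of an explicit matrix inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
X = np.hstack([X, np.ones((n, 1))])              # append an all-1's feature for theta_0
true_theta = np.array([2.0, -1.0, 0.5, 3.0])     # assumed ground-truth parameters
Y = X @ true_theta + 0.1 * rng.normal(size=n)

# Normal equations: theta* = (X^T X)^{-1} X^T Y
theta_star = np.linalg.solve(X.T @ X, X.T @ Y)
print(theta_star)                                 # should be close to true_theta
```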
Regularisation by Ridge Regression
$J_{\mathrm{ridge}}(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( \theta^T x^{(i)} + \theta_0 - y^{(i)} \right)^2 \right] + \lambda \|\theta\|^2$
Ridge Regression analytical solution
• We consider the case with $\theta$ only, without $\theta_0$:
$J_{\mathrm{ridge}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( \theta^T x^{(i)} - y^{(i)} \right)^2 \right] + \lambda \|\theta\|^2$

$\nabla_\theta J_{\mathrm{ridge}} = \frac{2}{n} X^T (X\theta - Y) + 2\lambda\theta$

Setting the gradient to zero and solving:
$\theta^*_{\mathrm{ridge}} = \left( X^T X + n\lambda I \right)^{-1} X^T Y$
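A minimal NumPy sketch of the ridge closed form (the synthetic data and the value of lam, standing for $\lambda$, are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 4
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.1                                         # regularisation strength lambda (assumed)

# Ridge normal equations: theta* = (X^T X + n*lambda*I)^{-1} X^T Y
theta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ Y)
print(theta_ridge)
```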
Ridge Regression using Gradient Descent
gradient_descent(…):
    $\theta^{(0)} = \theta_{\mathrm{init}}$; t = 0
    loop
        t = t + 1
        $\theta^{(t)} = \theta^{(t-1)} - \eta \, \nabla_\theta J(\theta^{(t-1)})$
    until …
    return $\theta^{(t)}$

Here the gradient is
$\nabla_\theta J_{\mathrm{ridge}} = \frac{2}{n} X^T (X\theta - Y) + 2\lambda\theta$
and the minimiser it seeks is the closed-form solution
$\theta^*_{\mathrm{ridge}} = \left( X^T X + n\lambda I \right)^{-1} X^T Y$
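A minimal, runnable NumPy sketch of this loop for the ridge objective (the step size eta, the initial point, and the stopping criterion are assumptions filling in the slide's "until …"):

```python
import numpy as np

def ridge_gradient(theta, X, Y, lam):
    """Gradient of J_ridge: (2/n) X^T (X theta - Y) + 2*lambda*theta."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ theta - Y) + 2.0 * lam * theta

def gradient_descent(X, Y, lam, eta=0.01, tol=1e-8, max_iter=10_000):
    theta = np.zeros(X.shape[1])                  # theta_init = 0 (assumed)
    for _ in range(max_iter):
        step = eta * ridge_gradient(theta, X, Y, lam)
        theta = theta - step
        if np.linalg.norm(step) < tol:            # stopping test standing in for "until ..." (assumed)
            break
    return theta
```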
Stochastic Gradient Descent
• The objective has the form of a sum
• Sample, randomly, one element of the summation
• Take a small step in that direction

$\nabla_\theta J_{\mathrm{ridge}} = \frac{2}{n} \sum_{i=1}^{n} \left[ \left( \theta^T x^{(i)} - y^{(i)} \right) x^{(i)} \right] + 2\lambda\theta$
Stochastic Gradient Descent
SGD(…):
    $\theta^{(0)} = \theta_{\mathrm{init}}$; t = 0
    loop
        randomly select $i \in \{1, \ldots, n\}$
        t = t + 1
        $\theta^{(t)} = \theta^{(t-1)} - \eta(t) \, \nabla_\theta f_i(\theta^{(t-1)})$
    until …
    return $\theta^{(t)}$

where the objective decomposes as $f(\theta) = \sum_{i=1}^{n} f_i(\theta)$
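A minimal NumPy sketch of SGD applied to the ridge objective (the decaying step size eta(t) = eta0/t, the iteration budget, and the initialisation are assumptions, not from the slides):

```python
import numpy as np

def sgd_ridge(X, Y, lam, eta0=0.01, max_iter=10_000, seed=0):
    """SGD: at each step pick one random example and step along an unbiased gradient estimate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)                           # theta_init = 0 (assumed)
    for t in range(1, max_iter + 1):
        i = rng.integers(n)                       # randomly select i in {1..n}
        # Unbiased estimate of grad J_ridge built from the i-th summand
        grad_i = 2.0 * (theta @ X[i] - Y[i]) * X[i] + 2.0 * lam * theta
        theta = theta - (eta0 / t) * grad_i       # decaying step size eta(t) (assumed)
    return theta
```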