Lecturer: Ben Rubinstein
Lecture 3. Linear Regression
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
• Linear regression
∗ Simple model (convenient maths at expense of flexibility)
∗ Often needs less data, “interpretable”, lifts to non-linear
∗ Derivable under all Statistical Schools: Lect 2 case study
• This week: Frequentist + Decision theory derivations
Later in semester: Bayesian approach
∗ Convenient optimisation: Training by “analytic” (exact) solution
• Basis expansion: Data transform for more expressive models
Linear Regression
via Decision Theory
A warm-up example
Example: Predict humidity from temperature
TRAINING DATA
Temperature  Humidity
85           85
80           90
83           86
70           96
68           80
65           70
64           65
72           95
69           70
75           80

TEST DATA
Temperature  Humidity
75           70
In regression, the task is to predict numeric
response (aka dependent variable) from features
(aka predictors or independent variables)
Assume a linear relation: $H = a + bT$
($H$: humidity; $T$: temperature; $a$, $b$: parameters)
Example: Problem statement

• The model is $H = a + bT$
• Fitting the model = finding "best" $a$, $b$ values for the data at hand
• Important criterion: minimise the sum of squared errors (aka residual sum of squares)
Example: Minimise Sum Squared Errors

To find $a$, $b$ that minimise $L = \sum_{i=1}^{10} \left(H_i - (a + bT_i)\right)^2$, set derivatives to zero:

$$\frac{\partial L}{\partial a} = -2\sum_{i=1}^{10} (H_i - a - bT_i) = 0$$

if we know $b$, then $\hat{a} = \frac{1}{10}\sum_{i=1}^{10} (H_i - bT_i)$

$$\frac{\partial L}{\partial b} = -2\sum_{i=1}^{10} T_i (H_i - a - bT_i) = 0$$

if we know $a$, then $\hat{b} = \frac{\sum_{i=1}^{10} T_i (H_i - a)}{\sum_{i=1}^{10} T_i^2}$

High-school optimisation:
• Write derivative
• Set to zero
• Solve for model
• (Check 2nd derivatives)

Can we be more systematic?
Example: Analytic solution

• We have two equations and two unknowns $a$, $b$
• Rewrite as a system of linear equations
$$\begin{pmatrix} 10 & \sum_{i=1}^{10} T_i \\ \sum_{i=1}^{10} T_i & \sum_{i=1}^{10} T_i^2 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{10} H_i \\ \sum_{i=1}^{10} T_i H_i \end{pmatrix}$$
• Analytic solution: $a = 25.3$, $b = 0.77$
• (Solve using numpy.linalg.solve or similar.)
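A minimal numpy sketch of that solve, using the training data from the humidity example (the array names are illustrative, not from the slides):

```python
import numpy as np

# Training data from the humidity example: (temperature, humidity) pairs
T = np.array([85, 80, 83, 70, 68, 65, 64, 72, 69, 75], dtype=float)
H = np.array([85, 90, 86, 96, 80, 70, 65, 95, 70, 80], dtype=float)

# Coefficient matrix and right-hand side of the 2x2 system above
A = np.array([[len(T), T.sum()],
              [T.sum(), (T ** 2).sum()]])
rhs = np.array([H.sum(), (T * H).sum()])

a, b = np.linalg.solve(A, rhs)
print(f"a = {a:.2f}, b = {b:.2f}")                # approximately a = 25.3, b = 0.77
print(f"prediction at T = 75: {a + b * 75:.1f}")  # the test instance
```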
More general decision rule

• Adopt a linear relationship between response $y \in \mathbb{R}$ and an instance with features $x_1, \ldots, x_m \in \mathbb{R}$
$$\hat{y} = w_0 + \sum_{i=1}^{m} x_i w_i$$
Here $w_0, \ldots, w_m \in \mathbb{R}$ denote weights (model parameters)
• Trick: add a dummy feature $x_0 = 1$ and use vector notation
$$\hat{y} = \sum_{i=0}^{m} x_i w_i = \boldsymbol{x}'\boldsymbol{w}$$

A lowercase symbol in bold face indicates a vector; $\boldsymbol{x}'$ denotes transpose
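A quick sketch of the dummy-feature trick in numpy (function and variable names are made up for illustration):

```python
import numpy as np

def predict(X, w):
    """Decision rule x'w for each row of X, after prepending the dummy feature x0 = 1."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])  # n x (m+1)
    return X1 @ w                                   # vector of predictions y_hat

# Two instances with m = 2 features; weights w = (w0, w1, w2)
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
w = np.array([0.5, 1.0, -1.0])
print(predict(X, w))   # [0.5 + 1 - 2, 0.5 + 3 - 4] = [-0.5, -0.5]
```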
Mini Summary
• Linear regression
∗ Simple, effective, “interpretable”, basis for many
approaches
∗ Decision-theoretic frequentist derivation
Next:
Frequentist derivation; Solution/training approach
Linear Regression
via Frequentist Probabilistic Model
Max-Likelihood Estimation
Data is noisy!

Example: predict mark for Statistical Machine Learning (SML) from mark for Intro ML (IML aka KT)

[Scatter plots: SML mark (vertical axis) vs IML mark (horizontal axis), showing noisy training data*. * synthetic data 🙂]
Regression as a probabilistic model

• Assume a probabilistic model: $Y = \boldsymbol{X}'\boldsymbol{w} + \varepsilon$
  ∗ Here $\boldsymbol{X}$, $Y$ and $\varepsilon$ are random variables
  ∗ Variable $\varepsilon$ encodes noise
• Next, assume Gaussian noise (independent of $\boldsymbol{X}$): $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
• Recall that $\mathcal{N}(x; \mu, \sigma^2) \equiv \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
• Therefore
$$p_{\boldsymbol{w},\sigma^2}(y|\boldsymbol{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \boldsymbol{x}'\boldsymbol{w})^2}{2\sigma^2}\right)$$
(the term $(y - \boldsymbol{x}'\boldsymbol{w})^2$ in the exponent is a squared error!)
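A short simulation sketch of this generative view (the weights, noise level and inputs below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 2
w_true = np.array([1.0, 2.0, -0.5])   # (w0, w1, w2)
sigma = 0.3                            # noise standard deviation

# Dummy feature plus m random features, then Y = X'w + eps with eps ~ N(0, sigma^2)
X = np.hstack([np.ones((n, 1)), rng.uniform(-1, 1, size=(n, m))])
y = X @ w_true + rng.normal(0.0, sigma, size=n)

# Gaussian density p(y|x) of each observation under the true parameters
p = np.exp(-(y - X @ w_true) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
```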
Parametric probabilistic model

• Using simplified notation, the discriminative model is:
$$p(y|\boldsymbol{x}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \boldsymbol{x}'\boldsymbol{w})^2}{2\sigma^2}\right)$$
• Unknown parameters: $\boldsymbol{w}$, $\sigma^2$
• Given observed data $\{(\boldsymbol{X}_1, Y_1), \ldots, (\boldsymbol{X}_n, Y_n)\}$, we want to find parameter values that "best" explain the data
• Maximum-likelihood estimation: choose parameter values that maximise the probability of observed data
Maximum likelihood estimation

• Assuming independence of data points, the probability of the data is
$$p(y_1, \ldots, y_n | \boldsymbol{x}_1, \ldots, \boldsymbol{x}_n) = \prod_{i=1}^{n} p(y_i|\boldsymbol{x}_i)$$
• For $p(y_i|\boldsymbol{x}_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \boldsymbol{x}_i'\boldsymbol{w})^2}{2\sigma^2}\right)$
• "Log trick": instead of maximising this quantity, we can maximise its logarithm (Why? Explained soon)
$$\sum_{i=1}^{n} \log p(y_i|\boldsymbol{x}_i) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \boldsymbol{x}_i'\boldsymbol{w})^2 + C$$
here $C$ doesn't depend on $\boldsymbol{w}$ (it's a constant), and the sum is the sum of squared errors!
• Under this model, maximising log-likelihood as a function of $\boldsymbol{w}$ is equivalent to minimising the sum of squared errors
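A small numeric check of this identity (synthetic data; any choice of w would do):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 0.5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
w = np.array([0.2, 1.0, -1.5])
y = X @ w + rng.normal(0.0, sigma, size=n)

resid = y - X @ w
log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2) - resid ** 2 / (2 * sigma ** 2))
sse = np.sum(resid ** 2)
C = -0.5 * n * np.log(2 * np.pi * sigma ** 2)   # constant that does not depend on w

# log-likelihood  =  -SSE / (2 sigma^2)  +  C
assert np.isclose(log_lik, -sse / (2 * sigma ** 2) + C)
```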
Method of least squares

• Training data: $\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\}$. Note bold face in $\boldsymbol{x}_i$
• For convenience, place instances in rows (so attributes go in columns), representing training data as an $n \times (m+1)$ matrix $\boldsymbol{X}$, and an $n$-vector $\boldsymbol{y}$
• Probabilistic model/decision rule assumes $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$
• To find $\boldsymbol{w}$, minimise the sum of squared errors
$$L = \sum_{i=1}^{n} \left(y_i - \sum_{j=0}^{m} X_{ij} w_j\right)^2$$
• Analytic solution (write the derivative, set it to zero, solve for the model): setting the derivative to zero and solving for $\boldsymbol{w}$ yields
$$\hat{\boldsymbol{w}} = (\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}'\boldsymbol{y}$$
  ∗ This system of equations is called the normal equations
  ∗ The system is well defined only if the inverse exists

In this slide: an UPPERCASE symbol in bold face means a matrix; $\boldsymbol{X}'$ denotes transpose; $\boldsymbol{X}^{-1}$ denotes matrix inverse
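A minimal numpy sketch of the normal-equations solution; solving the linear system is preferable to forming the inverse explicitly (the function name is illustrative):

```python
import numpy as np

def fit_least_squares(X, y):
    """Least-squares weights w_hat = (X'X)^{-1} X'y via the normal equations.

    X is an n x (m+1) matrix with the dummy all-ones column already included."""
    return np.linalg.solve(X.T @ X, X.T @ y)   # solves (X'X) w = X'y

# The humidity warm-up example again, now in matrix form
T = np.array([85, 80, 83, 70, 68, 65, 64, 72, 69, 75], dtype=float)
H = np.array([85, 90, 86, 96, 80, 70, 65, 95, 70, 80], dtype=float)
X = np.column_stack([np.ones_like(T), T])

print(fit_least_squares(X, H))   # approximately [25.3, 0.77]
```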
Wherefore art thou: Bayesian derivation?
• Later in the semester: return of linear regression
• Fully Bayesian, with a posterior:
∗ Bayesian linear regression
• Bayesian (MAP) point estimate of weight vector:
∗ Adds a penalty term to sum of squared losses
∗ Equivalent to L2 “regularisation” to be covered next week
∗ Called: ridge regression
Mini Summary
• Linear regression
∗ Simple, effective, “interpretable”, basis for many
approaches
∗ Probabilistic frequentist derivation
∗ Solution by normal equations
Later in semester: Bayesian approaches
Next: Basis expansion for non-linear regression
Basis Expansion
Extending the utility of models via
data transformation
Basis expansion for linear regression
• Real data is likely to be non-linear
• What if we still wanted to use a
linear regression?
∗ Simple, easy to understand,
computationally efficient, etc.
• How to marry non-linear data to a
linear method?
[Scatter plot of non-linear data: $y$ vs $x$]

If you can't beat 'em, join 'em
Transform the data
• The trick is to transform the data: map the data into another feature space, s.t. the data is linear in that space
• Denote this transformation $\varphi: \mathbb{R}^m \to \mathbb{R}^k$. If $\boldsymbol{x}$ is the original set of features, $\varphi(\boldsymbol{x})$ denotes the new feature set
• Example: suppose there is just one feature $x$, and the data is scattered around a parabola rather than a straight line

[Scatter plot: parabola-shaped data, $y$ vs $x$]
Example: Polynomial regression
• No worries, mate: define
  $\varphi_1(x) = x$
  $\varphi_2(x) = x^2$
• Next, apply linear regression to $\varphi_1, \varphi_2$
$$y = w_0 + w_1 \varphi_1(x) + w_2 \varphi_2(x) = w_0 + w_1 x + w_2 x^2$$
and here you have quadratic regression
• More generally, obtain polynomial regression if the new set of attributes are powers of $x$
• A similar idea is the basis of autoregression for time series

[Plot: quadratic curve fitted to the parabola-shaped data, $y$ vs $x$]
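A short sketch of quadratic regression via this basis expansion (synthetic parabola data, made up for illustration; the fit itself is still ordinary least squares):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(0, 0.3, size=x.size)  # noisy parabola

# Basis expansion: dummy feature, phi_1(x) = x, phi_2(x) = x^2
Phi = np.column_stack([np.ones_like(x), x, x ** 2])

# Linear regression on the transformed features
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)   # roughly [1.0, 0.5, 2.0]
```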
Example: linear classification
• Example binary classification problem: the dataset is not linearly separable

  x1  x2  y
  0   0   Class A
  0   1   Class B
  1   0   Class B
  1   1   Class A

• Define the transformation as $\varphi_i(\boldsymbol{x}) = \|\boldsymbol{x} - \boldsymbol{z}_i\|$, where the $\boldsymbol{z}_i$ are some pre-defined constants
• Choose $\boldsymbol{z}_1 = (0,0)'$, $\boldsymbol{z}_2 = (0,1)'$, $\boldsymbol{z}_3 = (1,0)'$, $\boldsymbol{z}_4 = (1,1)'$

  φ1   φ2   φ3   φ4
  0    1    1    √2
  1    0    √2   1
  1    √2   0    1
  √2   1    1    0

• The transformed data is linearly separable! There exist weights that make the new data separable, e.g. $(w_1, w_2, w_3, w_4) = (1, 0, 0, 1)$:

  φ'w   y
  √2    Class A
  2     Class B
  2     Class B
  √2    Class A
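A tiny sketch reproducing the tables above (Euclidean distances to the four centres; the weight vector (1, 0, 0, 1) is the one from the slide):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # original features
y = np.array(["A", "B", "B", "A"])
Z = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # the centres z_i

# phi_i(x) = ||x - z_i||: one distance feature per centre
Phi = np.linalg.norm(X[:, None, :] - Z[None, :, :], axis=2)   # shape (4, 4)

w = np.array([1.0, 0.0, 0.0, 1.0])
print(Phi @ w)   # [sqrt(2), 2, 2, sqrt(2)]: class A scores ~1.41, class B scores 2, so separable
```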
Radial basis functions
• Previous example: motivated by approximation theory, where sums of RBFs approximate functions
• A radial basis function is a function of the form $\varphi(\boldsymbol{x}) = \psi(\|\boldsymbol{x} - \boldsymbol{z}\|)$, where $\boldsymbol{z}$ is a constant
• Examples:
  ∗ $\varphi(\boldsymbol{x}) = \|\boldsymbol{x} - \boldsymbol{z}\|$
  ∗ $\varphi(\boldsymbol{x}) = \exp\left(-\frac{1}{\sigma}\|\boldsymbol{x} - \boldsymbol{z}\|^2\right)$

[Plot: a bell-shaped radial basis function, $\varphi(x)$ vs $x$]
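A small sketch of a Gaussian RBF feature map (the centres Z and the value of sigma below are assumptions for illustration):

```python
import numpy as np

def rbf_features(X, Z, sigma=1.0):
    """Map each row x of X to (exp(-||x - z_1||^2 / sigma), ..., exp(-||x - z_k||^2 / sigma))."""
    sq_dists = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)   # n x k squared distances
    return np.exp(-sq_dists / sigma)

X = np.array([[0.0, 0.0], [1.0, 1.0]])
Z = np.array([[0.0, 0.0], [0.5, 0.5]])   # two RBF centres
print(rbf_features(X, Z, sigma=1.0))     # features shrink as points move away from each centre
```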
Challenges of basis expansion
• Basis expansion can significantly increase the utility of methods, especially linear methods
• In the above examples, one limitation is that the transformation needs to be defined beforehand
  ∗ Need to choose the size of the new feature set
  ∗ If using RBFs, need to choose the $\boldsymbol{z}_i$
• Regarding the $\boldsymbol{z}_i$, one can choose uniformly spaced points, or cluster the training data and use the cluster centroids
• Another popular idea is to use the training data itself: $\boldsymbol{z}_i \equiv \boldsymbol{x}_i$
  ∗ E.g., $\varphi_i(\boldsymbol{x}) = \psi(\|\boldsymbol{x} - \boldsymbol{x}_i\|)$
  ∗ However, for large datasets, this results in a large number of features (a computational hurdle)
Further directions
• There are several avenues for taking the idea of basis
expansion to the next level
∗ Will be covered later in this subject
• One idea is to learn the transformation 𝜑𝜑 from data
∗ E.g., Artificial Neural Networks
• Another powerful extension is the use of the kernel trick
∗ “Kernelised” methods, e.g., kernelised perceptron
• Finally, in sparse kernel machines, training depends only
on a few data points
∗ E.g., SVM
Mini Summary
• Basis expansion
∗ Extending model expressiveness via data transformation
∗ Examples for linear regression and linear classification
∗ Theoretical notes
Next time:
First/second-order iterative optimisation;
Logistic regression – linear probabilistic model for
classification.