Lecturer: Ben Rubinstein
Lecture 5. Regularisation
COMP90051 Statistical Machine Learning
Copyright: University of Melbourne
This lecture
• How irrelevant features make optimisation ill-posed
• Regularising linear regression
∗ Ridge regression
∗ The lasso
∗ Connections to Bayesian MAP
• Regularising non-linear regression
• Bias-variance (again)
Regularisation
Process of introducing additional information in order to
solve an ill-posed problem or to prevent overfitting
• Major technique & theme, throughout ML
• Addresses one or more of the following related problems
∗ Avoids ill-conditioning (a computational problem)
∗ Avoids overfitting (a statistical problem)
∗ Introduces prior knowledge into modelling
• This is achieved by augmenting the objective function
• In this lecture: we cover the first two aspects. We will
cover more of regularisation throughout the subject
The Problem with
Irrelevant Features
Linear regression on rank-deficient data.
Example 1: Feature importance
• Linear model on three features
∗ $\mathbf{X}$ is a matrix of $n = 4$ instances (rows)
∗ Model: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$
[Figure: the fitted weights $w_1, w_2, w_3$, shown on a scale from −5 to 5]
Question: Which feature is
more important?
Example 1: Irrelevant features
• Linear model on three features, first two the same
∗ $\mathbf{X}$ is a matrix of $n = 4$ instances (rows)
∗ Model: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$
∗ First two columns of $\mathbf{X}$ are identical
∗ Feature 2 (or 1) is irrelevant
$$\mathbf{X} = \begin{bmatrix} 3 & 3 & 7 \\ 6 & 6 & 9 \\ 21 & 21 & 79 \\ 34 & 34 & 2 \end{bmatrix}$$
• Effect of perturbations on model predictions?
∗ Add $\Delta$ to $w_1$
∗ Subtract $\Delta$ from $w_2$
[Figure: the fitted weights $w_1, w_2, w_3$, shown on a scale from −5 to 5]
Problems with irrelevant features
• In the example, suppose $(\hat{w}_0, \hat{w}_1, \hat{w}_2, \hat{w}_3)'$ is “optimal”
• For any $\delta$, the new $(\hat{w}_0, \hat{w}_1 + \delta, \hat{w}_2 - \delta, \hat{w}_3)'$ gets
∗ Same predictions!
∗ Same sum of squared errors!
• Problems this highlights
∗ The solution is not unique
∗ Lack of interpretability
∗ Optimising to learn the parameters is an ill-posed problem (see the sketch below)
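To make this concrete, here is a minimal numpy sketch using the design matrix from the example; the target vector y, the candidate weights and the perturbation size are hypothetical values chosen only for illustration, and the bias term $w_0$ is omitted for brevity.

```python
import numpy as np

# Design matrix from the example: the first two columns are identical
X = np.array([[3., 3., 7.],
              [6., 6., 9.],
              [21., 21., 79.],
              [34., 34., 2.]])
y = np.array([10., 15., 100., 36.])        # hypothetical targets

w = np.array([1.0, 2.0, 0.5])              # some candidate weights (bias omitted)
delta = 5.0
w_perturbed = w + np.array([delta, -delta, 0.0])   # add delta to w1, subtract it from w2

pred, pred_p = X @ w, X @ w_perturbed
sse, sse_p = np.sum((y - pred) ** 2), np.sum((y - pred_p) ** 2)

print(np.allclose(pred, pred_p))           # True: identical predictions
print(np.isclose(sse, sse_p))              # True: identical sum of squared errors
```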
Irrelevant (co-linear) features in general
• Extreme case: features are complete clones
• For linear models, more generally
∗ Feature $\mathbf{X}_{\cdot j}$ is irrelevant if $\mathbf{X}_{\cdot j}$ is a linear combination of the other columns
$$\mathbf{X}_{\cdot j} = \sum_{l \neq j} \alpha_l \mathbf{X}_{\cdot l}$$
… for some scalars $\alpha_l$. Also called multicollinearity
∗ Equivalently: some eigenvalue of $\mathbf{X}'\mathbf{X}$ is zero
• Even near-irrelevance/collinearity can be problematic
∗ Very small eigenvalues of $\mathbf{X}'\mathbf{X}$
• Not just a pathological extreme; easy to happen!
($\mathbf{X}_{\cdot j}$ denotes the $j$-th column of $\mathbf{X}$)
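As a quick numerical check of the eigenvalue characterisation (a sketch, not part of the original slides), the duplicated column of the example design matrix shows up as a numerically zero eigenvalue of $\mathbf{X}'\mathbf{X}$:

```python
import numpy as np

X = np.array([[3., 3., 7.],
              [6., 6., 9.],
              [21., 21., 79.],
              [34., 34., 2.]])

eigvals = np.linalg.eigvalsh(X.T @ X)      # X'X is symmetric, so eigvalsh applies
print(eigvals)                             # smallest eigenvalue is (numerically) zero
print(np.isclose(eigvals.min(), 0.0))      # True: X'X is singular, i.e. multicollinearity
```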
Example 2: Lack of data
• Extreme example:
∗ Model has two parameters (slope and intercept)
∗ Only one data point
• Underdetermined system
[Figure: a single data point in the $x$–$y$ plane]
Ill-posed problems
• In both examples, finding the best parameters becomes an ill-posed problem
• This means that the problem solution is not uniquely defined
∗ In our case $w_1$ and $w_2$ cannot be uniquely identified
• Remember the normal-equations solution of linear regression:
$$\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
• With irrelevant/multicollinear features, the matrix $\mathbf{X}'\mathbf{X}$ has no inverse
[Figure: the sum of squared errors as a surface over $(w_1, w_2)$: convex, but not strictly convex]
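The broken normal equations can be illustrated with the same collinear design matrix (targets again hypothetical): inverting $\mathbf{X}'\mathbf{X}$ typically raises an error or returns numerically meaningless values, while least squares has infinitely many minimisers, of which lstsq returns just one (the minimum-norm solution). A sketch:

```python
import numpy as np

X = np.array([[3., 3., 7.],
              [6., 6., 9.],
              [21., 21., 79.],
              [34., 34., 2.]])
y = np.array([10., 15., 100., 36.])

try:
    w_hat = np.linalg.inv(X.T @ X) @ X.T @ y      # normal equations
except np.linalg.LinAlgError as err:
    print("X'X is not invertible:", err)

# lstsq still returns a solution, but it is only one of infinitely many minimisers
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_min_norm)
```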
Mini Summary
• Irrelevant features as collinearity
• Leads to
∗ Ill-posed optimisation for linear regression
∗ Broken interpretability
• Multiple intuitions: algebraic, geometric
• Next: Regularisation to the rescue!
Regularisation in
Linear Models
Ridge regression and the Lasso
Re-conditioning the problem
• Regularisation: introduce an additional condition into the system
• The original problem is to minimise $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$
• The regularised problem is to minimise $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2$ for $\lambda > 0$
• The solution is now
$$\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$$
• This formulation is called ridge regression
∗ Turns the ridge into a deep, singular valley
∗ Adds $\lambda$ to the eigenvalues of $\mathbf{X}'\mathbf{X}$: makes it invertible (see the sketch below)
[Figure: the regularised sum of squared errors surface over $(w_1, w_2)$: now strictly convex]
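A minimal sketch of the ridge solution on the same collinear example (the value of $\lambda$ and the targets are arbitrary choices for illustration): adding $\lambda\mathbf{I}$ shifts every eigenvalue of $\mathbf{X}'\mathbf{X}$ up by $\lambda$, so the system becomes invertible and the solution is unique.

```python
import numpy as np

X = np.array([[3., 3., 7.],
              [6., 6., 9.],
              [21., 21., 79.],
              [34., 34., 2.]])
y = np.array([10., 15., 100., 36.])
lam = 1.0                                  # regularisation strength, arbitrary here

d = X.shape[1]
A = X.T @ X + lam * np.eye(d)
w_ridge = np.linalg.solve(A, X.T @ y)      # (X'X + lambda*I)^{-1} X'y
print(w_ridge)

# All eigenvalues of the regularised matrix are at least lambda, hence strictly positive
print(np.linalg.eigvalsh(A).min())
```

Solving the linear system directly (np.linalg.solve) rather than forming the inverse is the usual numerically stable way to evaluate this closed form.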
Regulariser as a prior
• Without regularisation, parameters are found based entirely on the information contained in the training set $\mathbf{X}$
∗ Regularisation introduces additional information
• Recall our probabilistic model $Y = \mathbf{x}'\mathbf{w} + \varepsilon$
∗ Here $Y$ and $\varepsilon$ are random variables, where $\varepsilon$ denotes noise
• Now suppose that $\mathbf{w}$ is also a random variable (denoted $\mathbf{W}$) with a Normal prior distribution
$$\mathbf{W} \sim \mathcal{N}(\mathbf{0}, 1/\lambda)$$
∗ I.e. we expect small weights and that no one feature dominates
∗ Is this always appropriate? E.g. data centring and scaling
∗ We could encode much more elaborate problem knowledge
Computing posterior using Bayes rule
• The prior is then used to compute the posterior via Bayes rule:
$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})} \qquad \left(\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}\right)$$
• Instead of the maximum likelihood estimate (MLE), take the maximum a posteriori estimate (MAP)
• Apply the log trick, so that
$$\log(\text{posterior}) = \log(\text{likelihood}) + \log(\text{prior}) - \log(\text{marginal})$$
… where the marginal-likelihood term does not affect the optimisation
• Arrive at the problem of minimising
$$\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2$$
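To spell out the step from the Gaussian prior to the ridge objective, here is a short derivation sketch (my own filling-in, assuming Gaussian noise with variance $\sigma^2 = 1$; for general $\sigma^2$ the penalty weight becomes $\sigma^2\lambda$, which can simply be absorbed into $\lambda$):

```latex
\begin{align*}
\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})
  &= \log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) + \log p(\mathbf{w}) - \log p(\mathbf{y} \mid \mathbf{X}) \\
  &= -\tfrac{1}{2}\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2
     - \tfrac{\lambda}{2}\lVert \mathbf{w} \rVert_2^2 + \text{const.}
\end{align*}
% Maximising the log posterior over w is therefore equivalent to minimising
% ||y - Xw||_2^2 + lambda ||w||_2^2, the ridge objective above.
```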
Regulariser as a constraint
• For illustrative purposes, consider a modified problem: minimise $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$ subject to $\lVert \mathbf{w} \rVert_2^2 \leq \lambda$ for $\lambda > 0$
[Figure: contour lines of the objective function and the feasible regions defined by the regularisers, for ridge regression ($\lVert \mathbf{w} \rVert_2^2$) and the lasso ($\lVert \mathbf{w} \rVert_1$), in the $(w_1, w_2)$ plane; $\mathbf{w}^*$ marks the constrained solution and a separate point marks the unregularised linear-regression solution]
• Lasso (L1 regularisation) encourages solutions to sit on the axes: some of the weights are set to zero, so the solution is sparse
Regularised linear regression
| Algorithm | Minimises | Regulariser | Solution |
|---|---|---|---|
| Linear regression | $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2$ | None | $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (if the inverse exists) |
| Ridge regression | $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2$ | L2 norm | $(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$ |
| Lasso | $\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_1$ | L1 norm | No closed form, but solutions are sparse and suitable for high-dimensional data |
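The lasso row above has no closed form; in practice it is solved iteratively (e.g. by coordinate descent). The following scikit-learn sketch on synthetic data (sizes, seed and penalty strengths are all illustrative choices) shows the qualitative difference: lasso tends to drive the coefficients of irrelevant features exactly to zero, while ridge only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -3.0, 1.5]              # only 3 of the 10 features are relevant
y = X @ true_w + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty

print(np.round(ridge.coef_, 3))            # typically small but non-zero everywhere
print(np.round(lasso.coef_, 3))            # typically exact zeros on the irrelevant features
print("lasso zeros:", np.sum(lasso.coef_ == 0.0))
```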
Mini Summary
• L2 regularisation: Ridge regression
∗ Re-conditions the optimisation
∗ Equivalent to MAP with Gaussian prior on weights
• L1 regularisation: The Lasso
∗ Particularly favoured in high-dim, low-example regimes
• Next: Regularisation and non-linear regression
Regularisation in
Non-Linear Models
Model selection in ML
Example regression problem
[Figure: scatter plot of the example data, $Y$ against $X$]
How complex a model should we use?
Underfitting (linear regression)
[Figure: the example data with a linear-regression fit]
Model class Θ can be too simple to possibly fit true model.
Overfitting (non-parametric smoothing)
[Figure: the example data with a non-parametric smoothing fit]
Model class Θ can be so complex it can fit true model + noise
Actual model ($x \sin x$)
[Figure: the example data with the true model $x \sin x$ overlaid]
The right model class Θ will sacrifice some training error in exchange for lower test error.
Approach: Explicit model selection
• Try different classes of models. For example, try polynomial models of various degree $d$ (linear, quadratic, cubic, …)
• Use held-out validation (cross validation) to select the model:
1. Split the training data into $D_{\text{train}}$ and $D_{\text{validate}}$ sets
2. For each degree $d$ we have a model $f_d$
   1. Train $f_d$ on $D_{\text{train}}$
   2. Test $f_d$ on $D_{\text{validate}}$
3. Pick the degree $\hat{d}$ that gives the best validation score
4. Re-train the model $f_{\hat{d}}$ using all the data (see the sketch below)
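A minimal sketch of this procedure, where the data are generated from the $x \sin x$ example used later in the lecture; the sample size, split and candidate degrees are made-up illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=60)
y = x * np.sin(x) + rng.normal(scale=1.0, size=x.size)   # noisy x*sin(x) data

# 1. Split the data into training and validation sets
idx = rng.permutation(x.size)
train, val = idx[:40], idx[40:]

def val_error(degree):
    # 2. Train f_d on D_train, then test it on D_validate
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x[val])
    return np.mean((y[val] - pred) ** 2)

degrees = range(1, 10)
best_d = min(degrees, key=val_error)       # 3. Pick the degree with the best validation score
final_model = np.polyfit(x, y, best_d)     # 4. Re-train using all the data
print(best_d, final_model)
```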
Approach: Regularisation
• Augment the problem:
$$\hat{\boldsymbol{\theta}} \in \operatorname*{argmin}_{\boldsymbol{\theta} \in \Theta} \left( L(\text{data}, \boldsymbol{\theta}) + \lambda R(\boldsymbol{\theta}) \right)$$
• E.g., ridge regression:
$$\hat{\mathbf{w}} \in \operatorname*{argmin}_{\mathbf{w} \in W} \lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2 + \lambda \lVert \mathbf{w} \rVert_2^2$$
• Note that the regulariser $R(\boldsymbol{\theta})$ does not depend on the data
• Use held-out validation / cross validation to choose $\lambda$ (see the sketch below)
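Correspondingly, a sketch of choosing $\lambda$ by held-out validation for ridge regression; the synthetic data and the $\lambda$ grid are illustrative only (scikit-learn's RidgeCV packages the same idea).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=60)

idx = rng.permutation(len(y))
train, val = idx[:40], idx[40:]

def ridge_fit(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def val_error(lam):
    w = ridge_fit(X[train], y[train], lam)
    return np.mean((y[val] - X[val] @ w) ** 2)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]    # candidate regularisation strengths
best_lam = min(lambdas, key=val_error)
print(best_lam)
```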
Example: Polynomial regression
• 9th-order polynomial regression
∗ model of the form $f = w_0 + w_1 x + \dots + w_9 x^9$
∗ regularised with a $\lambda \lVert \mathbf{w} \rVert_2^2$ term
[Figure: polynomial fits, shown here for $\lambda = 0$ (no regularisation); figures from Bishop pp. 7–9, 150]
Mini Summary
• Overfitting vs underfitting
• Effect of regularisation on nonlinear regression
∗ Controls balance of over- vs underfitting
∗ Controlled in this case by the penalty hyperparameter
• Next: Bias-variance view for regression
Bias-variance trade-off
Train error, test error and
model complexity in
supervised regression
Assessing generalisation
• Supervised learning: train the model on existing data,
then make predictions on new data
• Training the model: ERM / minimisation of training error
• Generalisation capacity is captured by risk / test error
• Model complexity is a major factor that influences the ability of the model to generalise (still a vague notion)
• In this section, our aim is to explore error in the context of supervised regression, and one way to decompose it
Training error and model complexity
• More complex model → training error goes down
• With a finite number of points, we can usually reduce the training error to 0 (is it always possible?)
[Figure: training error decreasing with model complexity, alongside an example dataset in the $x$–$y$ plane ("What is this?")]
(Another) Bias-variance decomposition
• Squared loss for supervised-regression predictions:
$$l(Y, f(\mathbf{X}_0)) = (Y - f(\mathbf{X}_0))^2$$
• Lemma: Bias-variance decomposition of the risk / test error for $\mathbf{x}_0$:
$$\mathbb{E}\left[ l(Y, f(\mathbf{X}_0)) \right] = \left( \mathbb{E}[Y] - \mathbb{E}[f] \right)^2 + \mathrm{Var}[f] + \mathrm{Var}[Y]$$
… that is, (bias)² + variance + irreducible error
* Prediction randomness comes from randomness in the test features AND the training data
(Classification is treated later on)
Decomposition proof sketch
• Here (𝒙𝒙) is omitted to de-clutter notation
• 𝔼𝔼 𝑌𝑌 − 𝑓𝑓
2
= 𝔼𝔼 𝑌𝑌2 + 𝑓𝑓2 − 2𝑌𝑌𝑓𝑓
• = 𝔼𝔼 𝑌𝑌2 + 𝔼𝔼 𝑓𝑓2 − 𝔼𝔼 2𝑌𝑌𝑓𝑓
• = 𝑉𝑉𝑚𝑚𝑝𝑝 𝑌𝑌 + 𝔼𝔼 𝑌𝑌 2 + 𝑉𝑉𝑚𝑚𝑝𝑝 𝑓𝑓 + 𝔼𝔼 𝑓𝑓
2
− 2𝔼𝔼 𝑌𝑌 𝔼𝔼 𝑓𝑓
• = 𝑉𝑉𝑚𝑚𝑝𝑝 𝑌𝑌 + 𝑉𝑉𝑚𝑚𝑝𝑝 𝑓𝑓 + 𝔼𝔼 𝑌𝑌 2 − 2𝔼𝔼 𝑌𝑌 𝔼𝔼 𝑓𝑓 + 𝔼𝔼 𝑓𝑓
2
• = 𝑉𝑉𝑚𝑚𝑝𝑝 𝑌𝑌 + 𝑉𝑉𝑚𝑚𝑝𝑝 𝑓𝑓 + 𝔼𝔼 𝑌𝑌 − 𝔼𝔼 𝑓𝑓
2
34
Training data as a random variable
[Figure: three training sets $D_1$, $D_2$, $D_3$ for the same problem, each plotted in its own $x$–$y$ panel]
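To connect the picture of training data as a random variable with the decomposition above, here is a hedged Monte Carlo sketch: it repeatedly samples training sets from the earlier $x \sin x$ model, refits a simple and a complex polynomial, and estimates (bias)² and variance of the prediction at a fixed test point $x_0$. The sample sizes, seed and degrees are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
x0, sigma = 5.0, 1.0

def true_f(x):
    return x * np.sin(x)

def sample_fit_predict(degree):
    x = rng.uniform(1, 10, size=30)
    y = true_f(x) + rng.normal(scale=sigma, size=x.size)   # a fresh training set D
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x0)                          # prediction f_D(x0)

for degree in (1, 9):                                      # simple vs complex model
    preds = np.array([sample_fit_predict(degree) for _ in range(500)])
    bias2 = (true_f(x0) - preds.mean()) ** 2
    variance = preds.var()
    print(degree, round(bias2, 3), round(variance, 3))
# Expected pattern: degree 1 gives higher bias^2 and lower variance, degree 9 the reverse;
# Var[Y] = sigma^2 is the irreducible error in either case.
```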
Intuition: Model complexity and variance
• simple model → low variance
• complex model → high variance
[Figure: fits from different training sets evaluated at a test point $x_0$, for a simple model (left) and a complex model (right)]
Intuition: Model complexity and bias
• simple model → high bias
• complex model → low bias
[Figure: fits around a test point $x_0$ for a simple model (left) and a complex model (right)]
Bias-variance trade-off
• simple model → high bias, low variance
• complex model → low bias, high variance
[Figure: test error decomposed into (bias)² and variance, plotted from simple to complex models]
Test error and training error
[Figure: training error and test error against model complexity, with underfitting at the simple end and overfitting at the complex end]
But how to measure model family complexity?
Mini Summary
• Supervised regression: square-loss risk decomposes into bias, variance and irreducible terms
• This trade-off mirrors under/overfitting
• Controlled by “model complexity”
∗ But we’ve been vague about what this means!?
• Next lectures: Bounding generalisation error in ML