
Lecturer: Ben Rubinstein

Lecture 5. Regularisation

COMP90051 Statistical Machine Learning

Copyright: University of Melbourne


This lecture

• How irrelevant features make optimisation ill-posed

• Regularising linear regression
∗ Ridge regression
∗ The lasso
∗ Connections to Bayesian MAP

• Regularising non-linear regression

• Bias-variance (again)


Regularisation

Process of introducing additional information in order to
solve an ill-posed problem or to prevent overfitting

• Major technique & theme, throughout ML
• Addresses one or more of the following related problems

∗ Avoids ill-conditioning (a computational problem)
∗ Avoids overfitting (a statistical problem)
∗ Introduces prior knowledge into modelling

• This is achieved by augmenting the objective function
• In this lecture we cover the first two aspects; regularisation will recur throughout the subject


The Problem with Irrelevant Features

Linear regression on rank-deficient data.


Example 1: Feature importance
• Linear model on three features

∗ $\mathbf{X}$ is a matrix of $n = 4$ instances (rows)
∗ Model: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$

[Figure: bar chart of the fitted weights $w_1$, $w_2$, $w_3$.]

Question: Which feature is more important?


Example 1: Irrelevant features
• Linear model on three features, first two same

∗ $\mathbf{X}$ is a matrix of $n = 4$ instances (rows)
∗ Model: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$
∗ First two columns of $\mathbf{X}$ are identical
∗ Feature 2 (or 1) is irrelevant

$$\mathbf{X} = \begin{bmatrix} 3 & 3 & 7 \\ 6 & 6 & 9 \\ 21 & 21 & 79 \\ 34 & 34 & 2 \end{bmatrix}$$

• Effect of perturbations on model predictions?
∗ Add $\Delta$ to $w_1$
∗ Subtract $\Delta$ from $w_2$

[Figure: bar chart of the fitted weights $w_1$, $w_2$, $w_3$ for this data.]


Problems with irrelevant features

• In the example, suppose $(\hat{w}_0, \hat{w}_1, \hat{w}_2, \hat{w}_3)'$ is "optimal"

• For any $\delta$, the new weights $(\hat{w}_0, \hat{w}_1 + \delta, \hat{w}_2 - \delta, \hat{w}_3)'$ get
∗ Same predictions! (verified in the sketch below)
∗ Same sum of squared errors!

• Problems this highlights
∗ The solution is not unique
∗ Lack of interpretability
∗ Optimising to learn parameters is an ill-posed problem
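
A minimal numpy sketch (illustrative only; the candidate weights and the value of $\Delta$ are made up, and the bias is omitted for brevity) confirming that shifting weight between two identical columns leaves the predictions, and hence the sum of squared errors, unchanged:

```python
import numpy as np

# Data matrix from the example: the first two feature columns are identical
X = np.array([[3, 3, 7],
              [6, 6, 9],
              [21, 21, 79],
              [34, 34, 2]], dtype=float)

w = np.array([1.0, 2.0, 0.5])                     # some candidate weights (bias omitted)
delta = 10.0                                      # arbitrary perturbation

w_perturbed = w + np.array([delta, -delta, 0.0])  # add delta to w1, subtract it from w2

# Identical predictions, hence identical sum of squared errors for any targets
print(np.allclose(X @ w, X @ w_perturbed))        # True
```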


Irrelevant (co-linear) features in general
• Extreme case: features are complete clones
• For linear models, more generally
∗ Feature $\mathbf{X}_{\cdot j}$ is irrelevant if $\mathbf{X}_{\cdot j}$ is a linear combination of the other columns

$$\mathbf{X}_{\cdot j} = \sum_{l \neq j} \alpha_l \mathbf{X}_{\cdot l}$$

… for some scalars $\alpha_l$. Also called multicollinearity
∗ Equivalently: some eigenvalue of $\mathbf{X}'\mathbf{X}$ is zero (checked numerically below)

• Even near-irrelevance/collinearity can be problematic
∗ Very small eigenvalues of $\mathbf{X}'\mathbf{X}$

• Not just a pathological extreme; easy to happen!

($\mathbf{X}_{\cdot j}$ denotes the $j$-th column of $\mathbf{X}$)
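
A quick check (illustrative only, reusing the toy matrix above) that duplicate columns give $\mathbf{X}'\mathbf{X}$ a zero eigenvalue:

```python
import numpy as np

X = np.array([[3, 3, 7],
              [6, 6, 9],
              [21, 21, 79],
              [34, 34, 2]], dtype=float)

# Because column 1 equals column 2, X'X is rank-deficient:
# its smallest eigenvalue is (numerically) zero, so X'X has no inverse
print(np.linalg.eigvalsh(X.T @ X))
```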


Example 2: Lack of data

• Extreme example:
∗ Model has two parameters (slope and intercept)
∗ Only one data point

• Underdetermined system

[Figure: a single data point in the $(x, y)$ plane.]


Ill-posed problems
• In both examples, finding the best parameters becomes an ill-posed problem

• This means that the problem solution is not defined
∗ In our case, $w_1$ and $w_2$ cannot be uniquely identified

• Remember the normal-equations solution of linear regression:
$$\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$

• With irrelevant/multicollinear features, the matrix $\mathbf{X}'\mathbf{X}$ has no inverse

[Figure: sum of squared errors surface over $(w_1, w_2)$; convex, but not strictly convex.]


Mini Summary

• Irrelevant features as collinearity

• Leads to
∗ Ill-posed optimisation for linear regression
∗ Broken interpretability

• Multiple intuitions: algebraic, geometric

• Next: Regularisation to the rescue!


Regularisation in Linear Models

Ridge regression and the Lasso


Re-conditioning the problem

• Regularisation: introduce an additional
condition into the system

• The original problem is to minimise
$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$$

• The regularised problem is to minimise
$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2 \quad \text{for } \lambda > 0$$

• The solution is now
$$\hat{\mathbf{w}} = (\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$$

• This formulation is called ridge regression (a numpy sketch follows below)
∗ Turns the ridge into a deep, singular valley
∗ Adds $\lambda$ to the eigenvalues of $\mathbf{X}'\mathbf{X}$: makes it invertible

[Figure: regularised sum of squared errors surface over $(w_1, w_2)$; now strictly convex.]
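
A small numpy sketch (the targets and the value of $\lambda$ here are made up for illustration) contrasting the normal equations with the ridge solution on the duplicate-column data above:

```python
import numpy as np

X = np.array([[3, 3, 7],
              [6, 6, 9],
              [21, 21, 79],
              [34, 34, 2]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0])   # made-up targets
lam = 1.0                            # regularisation strength

# Plain least squares: X'X is singular here, so this would raise LinAlgError
# w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge regression: adding lam * I shifts every eigenvalue up by lam,
# making the system invertible and the solution unique
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)
```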


Regulariser as a prior

• Without regularisation, parameters are found based entirely on the information contained in the training set $\mathbf{X}$
∗ Regularisation introduces additional information

• Recall our probabilistic model $Y = \mathbf{x}'\mathbf{w} + \varepsilon$
∗ Here $Y$ and $\varepsilon$ are random variables, where $\varepsilon$ denotes noise

• Now suppose that $\mathbf{w}$ is also a random variable (denoted $\mathbf{W}$) with a Normal prior distribution
$$\mathbf{W} \sim \mathcal{N}(\mathbf{0}, 1/\lambda)$$
∗ I.e. we expect small weights and that no one feature dominates
∗ Is this always appropriate? E.g. consider data centring and scaling
∗ We could encode much more elaborate problem knowledge


Computing posterior using Bayes rule
• The prior is then used to compute the posterior

$$\underbrace{p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})}_{\text{posterior}}
= \frac{\overbrace{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w})}^{\text{likelihood}}\;\overbrace{p(\mathbf{w})}^{\text{prior}}}{\underbrace{p(\mathbf{y} \mid \mathbf{X})}_{\text{marginal likelihood}}}$$

• Instead of the maximum likelihood estimate (MLE), take the maximum a posteriori estimate (MAP)

• Apply the log trick, so that
$$\log(\text{posterior}) = \log(\text{likelihood}) + \log(\text{prior}) - \log(\text{marginal likelihood})$$
where the marginal likelihood term does not depend on $\mathbf{w}$ and so does not affect the optimisation

• Arrive at the problem of minimising
$$\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$$
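
To spell out the step from the log posterior to the ridge objective, here is a sketch of the derivation (assuming, as a simplification not stated on the slide, Gaussian noise with unit variance; additive constants are dropped):

```latex
% Assume y | X, w ~ N(Xw, I) (unit noise variance) and prior w ~ N(0, (1/lambda) I).
\begin{align*}
-\log p(\mathbf{w} \mid \mathbf{X}, \mathbf{y})
  &= -\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) - \log p(\mathbf{w}) + \mathrm{const} \\
  &= \tfrac{1}{2}\lVert \mathbf{y} - \mathbf{X}\mathbf{w} \rVert_2^2
     + \tfrac{\lambda}{2}\lVert \mathbf{w} \rVert_2^2 + \mathrm{const}.
\end{align*}
% Minimising over w is therefore equivalent to minimising
% ||y - Xw||_2^2 + lambda ||w||_2^2, i.e. the ridge regression objective.
```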


Regulariser as a constraint
• For illustrative purposes, consider a modified problem:

minimise $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$ subject to $\|\mathbf{w}\|_2^2 \leq \lambda$ for $\lambda > 0$

[Figure: two panels in the $(w_1, w_2)$ plane: ridge regression ($\|\mathbf{w}\|_2^2$, circular feasible region) on the left and the lasso ($\|\mathbf{w}\|_1$, diamond-shaped feasible region) on the right. The regulariser defines the feasible region, contour lines show the objective function, and $\mathbf{w}^*$ marks the solution to unregularised linear regression.]

• Lasso (L1 regularisation) encourages solutions to sit on the axes
∗ Some of the weights are set to zero → the solution is sparse (a quick comparison with ridge follows below)
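
An illustrative comparison (not part of the lecture; it assumes scikit-learn is available, and the synthetic data and penalty values are arbitrary choices) of ridge shrinkage versus lasso sparsity:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
true_w = np.zeros(d)
true_w[:3] = [2.0, -1.0, 0.5]                  # only 3 of the 10 features matter
y = X @ true_w + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))  # all weights non-zero, just shrunk towards 0
print(np.round(lasso.coef_, 3))  # many weights exactly zero: a sparse solution
```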


Regularised linear regression

Algorithm | Minimises | Regulariser | Solution
Linear regression | $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2$ | None | $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$ (if the inverse exists)
Ridge regression | $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2$ | L2 norm | $(\mathbf{X}'\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}'\mathbf{y}$
Lasso | $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_1$ | L1 norm | No closed form, but solutions are sparse and suited to high-dimensional data


Mini Summary

• L2 regularisation: Ridge regression
∗ Re-conditions the optimisation
∗ Equivalent to MAP with Gaussian prior on weights

• L1 regularisation: The Lasso
∗ Particularly favoured in high-dim, low-example regimes

• Next: Regularisation and non-linear regression


Regularisation in Non-Linear Models

Model selection in ML


Example regression problem

[Figure: scatter plot of noisy $(x, y)$ data, with $x$ ranging over roughly 2 to 10 and $y$ over roughly -5 to 10.]

How complex a model should we use?


Underfitting (linear regression)

[Figure: the same data with a fitted straight line.]

Model class $\Theta$ can be too simple to possibly fit the true model.


Overfitting (non-parametric smoothing)

[Figure: the same data with a highly flexible non-parametric smooth fit.]

Model class $\Theta$ can be so complex it can fit the true model plus noise.


Actual model ($x \sin x$)

[Figure: the same data with the true curve $y = x \sin x$ overlaid.]

The right model class $\Theta$ will sacrifice some training error in exchange for lower test error.


Approach: Explicit model selection
• Try different classes of models. Example: try polynomial models of various degree $d$ (linear, quadratic, cubic, …)
• Use held-out validation (cross-validation) to select the model (a sketch follows after the steps below)

1. Split training data into $D_{\text{train}}$ and $D_{\text{validate}}$ sets
2. For each degree $d$ we have a model $f_d$
   1. Train $f_d$ on $D_{\text{train}}$
   2. Test $f_d$ on $D_{\text{validate}}$
3. Pick the degree $\hat{d}$ that gives the best score on $D_{\text{validate}}$
4. Re-train the model $f_{\hat{d}}$ using all data
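
A minimal sketch of this procedure (illustrative only: the synthetic data, the split sizes and the candidate degrees are my own choices) using numpy's polynomial fitting:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(2, 10, 50)
y = x * np.sin(x) + rng.normal(scale=1.0, size=x.size)   # noisy x*sin(x), as in the example

# 1. Split the data into training and validation sets
idx = rng.permutation(x.size)
train, val = idx[:35], idx[35:]

# 2.-3. Fit each candidate degree on D_train, score on D_validate, keep the best
best_degree, best_err = None, np.inf
for d in range(1, 10):
    coeffs = np.polyfit(x[train], y[train], deg=d)
    val_err = np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2)
    if val_err < best_err:
        best_degree, best_err = d, val_err

# 4. Re-train the chosen model on all the data
final_coeffs = np.polyfit(x, y, deg=best_degree)
print("chosen degree:", best_degree)
```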


Approach: Regularisation

• Augment the problem:
$$\hat{\boldsymbol{\theta}} \in \underset{\boldsymbol{\theta} \in \Theta}{\operatorname{argmin}} \left( L(\text{data}, \boldsymbol{\theta}) + \lambda R(\boldsymbol{\theta}) \right)$$

• E.g., ridge regression:
$$\hat{\mathbf{w}} \in \underset{\mathbf{w} \in W}{\operatorname{argmin}} \left( \|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2 \right)$$

• Note that the regulariser $R(\boldsymbol{\theta})$ does not depend on the data

• Use held-out validation / cross-validation to choose $\lambda$ (one possible sketch follows below)
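
One possible way to do this in practice (my own suggestion, not prescribed by the slides; it uses scikit-learn's cross-validated ridge with an arbitrary grid of penalties):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.0]) + 0.3 * rng.normal(size=100)

# Try several candidate penalties and keep the best one by cross-validation
model = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0]).fit(X, y)
print("chosen penalty:", model.alpha_)
```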


Example: Polynomial regression

• 9th-order polynomial regression
∗ model of the form $f(x) = w_0 + w_1 x + \ldots + w_9 x^9$
∗ regularised with a $\lambda\|\mathbf{w}\|_2^2$ term

[Figures from Bishop pp. 7-9, 150; the unregularised case corresponds to $\lambda = 0$.]


Mini Summary

• Overfitting vs underfitting

• Effect of regularisation on nonlinear regression
∗ Controls balance of over- vs underfitting
∗ Controlled in this case by the penalty hyperparameter

• Next: Bias-variance view for regression


Bias-variance trade-off

Train error, test error and model complexity in supervised regression


Assessing generalisation
• Supervised learning: train the model on existing data,

then make predictions on new data

• Training the model: ERM / minimisation of training error

• Generalisation capacity is captured by risk / test error

• Model complexity is a major factor that influences the ability of the model to generalise (still a vague notion for now)

• In this section, our aim is to explore test error in the context of supervised regression, and one way to decompose it


Training error and model complexity
• More complex model → training error goes down

• Finite number of points → usually we can reduce training error to 0 (is it always possible?)

[Figures: training error decreasing with model complexity; and an $(x, y)$ data fit captioned "What is this?".]


(Another) Bias-variance decomposition
• Squared loss for supervised-regression predictions:
$$l\big(Y, f(\mathbf{x}_0)\big) = \big(Y - f(\mathbf{x}_0)\big)^2$$

• Lemma: Bias-variance decomposition
$$\underbrace{\mathbb{E}\big[l\big(Y, f(\mathbf{x}_0)\big)\big]}_{\text{risk / test error for } \mathbf{x}_0}
= \underbrace{\big(\mathbb{E}[Y] - \mathbb{E}[f]\big)^2}_{(\text{bias})^2}
+ \underbrace{\operatorname{Var}[f]}_{\text{variance}}
+ \underbrace{\operatorname{Var}[Y]}_{\text{irreducible error}}$$

* Prediction randomness comes from randomness in the test features AND the training data
(The analogous decomposition for classification comes later in the subject.)


Decomposition proof sketch
• Here $(\mathbf{x}_0)$ is omitted to de-clutter notation

$$\begin{aligned}
\mathbb{E}\big[(Y - f)^2\big] &= \mathbb{E}\big[Y^2 + f^2 - 2Yf\big] \\
&= \mathbb{E}[Y^2] + \mathbb{E}[f^2] - \mathbb{E}[2Yf] \\
&= \operatorname{Var}[Y] + \mathbb{E}[Y]^2 + \operatorname{Var}[f] + \mathbb{E}[f]^2 - 2\mathbb{E}[Y]\,\mathbb{E}[f] \\
&= \operatorname{Var}[Y] + \operatorname{Var}[f] + \big(\mathbb{E}[Y]^2 - 2\mathbb{E}[Y]\,\mathbb{E}[f] + \mathbb{E}[f]^2\big) \\
&= \operatorname{Var}[Y] + \operatorname{Var}[f] + \big(\mathbb{E}[Y] - \mathbb{E}[f]\big)^2
\end{aligned}$$

(The third line uses $\mathbb{E}[Z^2] = \operatorname{Var}[Z] + \mathbb{E}[Z]^2$ and $\mathbb{E}[Yf] = \mathbb{E}[Y]\,\mathbb{E}[f]$, which holds because the trained $f$ is independent of the noise in the test label $Y$.)
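
A small simulation (entirely illustrative: the data-generating process, sample sizes and model degrees are my own choices) that estimates each term at a fixed test point and checks that they add up to the risk:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, sigma = 5.0, 1.0                        # fixed test point and noise level

def true_f(x):
    return x * np.sin(x)                    # true regression function (as in the earlier example)

def fit_and_predict(degree):
    """Draw a fresh training set, fit a polynomial, predict at x0."""
    x = rng.uniform(2, 10, size=20)
    y = true_f(x) + sigma * rng.normal(size=20)
    return np.polyval(np.polyfit(x, y, deg=degree), x0)

for degree in (1, 3, 9):                    # simple vs complex model classes
    preds = np.array([fit_and_predict(degree) for _ in range(2000)])
    bias2 = (true_f(x0) - preds.mean()) ** 2
    variance = preds.var()
    test_y = true_f(x0) + sigma * rng.normal(size=preds.size)
    risk = np.mean((test_y - preds) ** 2)   # Monte Carlo estimate of E[(Y - f)^2]
    print(f"degree {degree}: bias^2={bias2:.2f} var={variance:.2f} "
          f"noise={sigma**2:.2f} sum={bias2 + variance + sigma**2:.2f} risk={risk:.2f}")
```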


Training data as a random variable

[Figure: three training datasets $D_1$, $D_2$, $D_3$ drawn from the same distribution, each shown as an $(x, y)$ scatter.]



Intuition: Model complexity and variance

• simple model → low variance

• complex model → high variance

[Figure: fits from different training sets evaluated at a test point $x_0$, for a simple model (left) and a complex model (right).]


Intuition: Model complexity and bias

• simple model → high bias

• complex model → low bias

[Figure: the average fit versus the true value at a test point $x_0$, for a simple model (left) and a complex model (right).]


Bias-variance trade-off
• simple model → high bias, low variance

• complex model → low bias, high variance

[Figure: (bias)², variance and test error plotted against model complexity, from simple to complex models.]


Test error and training error

[Figure: training error and test error against model complexity, from simple (underfitting) to complex (overfitting) models.]

But how to measure model family complexity?


Mini Summary
• Supervised regression: square-loss risk decomposes into bias, variance and irreducible error terms

• This trade-off mirrors under/overfitting

• Controlled by “model complexity”
∗ But we’ve been vague about what this means!?

• Next lectures: Bounding generalisation error in ML
