
Lecture 5. Regularisation
COMP90051 Statistical Machine Learning
Semester 2, 2019 Lecturer: Ben Rubinstein
Copyright: University of Melbourne

This lecture: Regularisation
Process of introducing additional information in order to solve an ill-posed problem or to prevent overfitting
• Major technique & theme, throughout ML
• Addresses one or more of the following related problems
∗ Avoids ill-conditioning (a computational problem)
∗ Avoids overfitting (a statistical problem)
∗ Introduces prior knowledge into modelling
• This is achieved by augmenting the objective function
• In this lecture we cover the first two aspects; regularisation will recur throughout the subject

Example 1: Feature importance
• Linear model on three features
∗ $X$ is a matrix of $n = 4$ instances (rows)
∗ Model: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$
[Figure: bar chart of the fitted weights $w_1$, $w_2$, $w_3$ (vertical axis from −5 to 5)]
Question: Which feature is more important?



Example 1: Irrelevant features
• Linear model on three features, first two same
∗ $X$ is a matrix of $n = 4$ instances (rows)
∗ Model: $y = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_0$
∗ First two columns of $X$ identical
∗ Feature 2 (or 1) is irrelevant
$X = \begin{pmatrix} 3 & 3 & 7 \\ 6 & 6 & 9 \\ 21 & 21 & 79 \\ 34 & 34 & 2 \end{pmatrix}$
[Figure: bar chart of the fitted weights $w_1$, $w_2$, $w_3$]
∗ Add $\Delta$ to $w_1$, subtract $\Delta$ from $w_2$
• Effect of perturbations on model predictions?



Problems with irrelevant features
• In the example, suppose $[\hat{w}_0, \hat{w}_1, \hat{w}_2, \hat{w}_3]'$ is "optimal"
• For any $\delta$, the new weights $[\hat{w}_0, \hat{w}_1 + \delta, \hat{w}_2 - \delta, \hat{w}_3]'$ give
∗ Same predictions!
∗ Same sum of squared errors!
• Problems this highlights
∗ The solution is not unique (see the sketch below)
∗ Lack of interpretability
∗ Optimising to learn parameters is an ill-posed problem
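A quick numerical check of this non-uniqueness on the $4 \times 3$ matrix from the example; the particular weights and the value of $\Delta$ below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Design matrix from the example: first two columns identical
X = np.array([[3, 3, 7],
              [6, 6, 9],
              [21, 21, 79],
              [34, 34, 2]], dtype=float)

w = np.array([1.0, 2.0, 0.5])                        # hypothetical weights (bias omitted)
delta = 5.0
w_shifted = w + delta * np.array([1.0, -1.0, 0.0])   # add delta to w1, subtract from w2

print(X @ w)                                 # predictions under w
print(X @ w_shifted)                         # identical predictions under shifted weights
print(np.allclose(X @ w, X @ w_shifted))     # True, so the sum of squared errors is unchanged too
```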

Irrelevant (co-linear) features in general
• Extreme case: features are complete clones
• For linear models, more generally
∗ Feature $X_{\cdot j}$ is irrelevant if $X_{\cdot j}$ is a linear combination of the other columns,
  $X_{\cdot j} = \sum_{l \neq j} \alpha_l X_{\cdot l}$
  for some scalars $\alpha_l$. Also called multicollinearity
∗ Equivalently: some eigenvalue of $X'X$ is zero (see the sketch below)
• Even near-irrelevance/collinearity can be problematic
∗ Very small eigenvalues of $X'X$
• Not just a pathological extreme; easy to happen!
$X_{\cdot j}$ denotes the $j$-th column of $X$
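A minimal sketch of the eigenvalue claim on the same duplicated-column matrix (numpy only; nothing beyond the matrix itself is taken from the slides):

```python
import numpy as np

X = np.array([[3, 3, 7],
              [6, 6, 9],
              [21, 21, 79],
              [34, 34, 2]], dtype=float)

gram = X.T @ X
print(np.linalg.eigvalsh(gram))      # smallest eigenvalue is (numerically) zero
print(np.linalg.matrix_rank(gram))   # rank 2 < 3, so X'X has no inverse
```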

Example 2: Lack of data
• Extreme example:
∗ Model has two parameters (slope and intercept)
∗ Only one data point
• Underdetermined system
[Figure: a single data point in the $x$–$y$ plane]

Ill-posed problems
• In both examples, finding the best parameters becomes an ill-posed problem
• This means that the problem solution is not defined
∗ In our case $w_1$ and $w_2$ cannot be uniquely identified
• Remember the normal-equations solution of linear regression: $\hat{w} = (X'X)^{-1} X'y$
• With irrelevant/multicollinear features, matrix $X'X$ has no inverse
[Figure: sum of squared errors as a function of $(w_1, w_2)$: convex, but not strictly convex]

Re-conditioning the problem
• Regularisation: introduce an additional condition into the system
• The original problem is to minimise $\|y - Xw\|_2^2$
• The regularised problem is to minimise $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$ for $\lambda > 0$
• The solution is now $\hat{w} = (X'X + \lambda I)^{-1} X'y$ (see the sketch below)
• This formulation is called ridge regression
∗ Turns the ridge into a peak
∗ Adds $\lambda$ to the eigenvalues of $X'X$: makes it invertible
[Figure: regularised sum of squared errors as a function of $(w_1, w_2)$: strictly convex]
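A small numpy sketch of the re-conditioned solution $(X'X + \lambda I)^{-1} X'y$; the response vector $y$ and the value of $\lambda$ are invented for illustration:

```python
import numpy as np

# Same collinear design matrix as before, so X'X is singular
X = np.array([[3, 3, 7],
              [6, 6, 9],
              [21, 21, 79],
              [34, 34, 2]], dtype=float)
y = np.array([1.0, 2.0, 3.0, 4.0])
lam = 0.1

# np.linalg.inv(X.T @ X) would fail (singular matrix), but adding lam * I
# shifts every eigenvalue up by lam, so the system becomes solvable
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(w_ridge)
```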

Regulariser as a prior
• Without regularisation, parameters found based entirely on the information contained in the training set $X$
∗ Regularisation introduces additional information
• Recall our probabilistic model $Y = x'w + \varepsilon$
∗ Here $Y$ and $\varepsilon$ are random variables, where $\varepsilon$ denotes noise
• Now suppose that $w$ is also a random variable (denoted as $W$) with a Normal prior distribution $W \sim \mathcal{N}(0, 1/\lambda)$
∗ I.e. we expect small weights and that no one feature dominates
∗ Is this always appropriate? E.g. data centring and scaling
∗ We could encode much more elaborate problem knowledge

Computing posterior using Bayes rule
• The prior is then used to compute the posterior
  $p(w \mid X, y) = \dfrac{p(y \mid X, w)\, p(w)}{p(y \mid X)}$
  (posterior = likelihood × prior / marginal likelihood)
• Instead of the maximum likelihood estimate (MLE), take the maximum a posteriori estimate (MAP)
• Apply the log trick, so that
  $\log(\text{posterior}) = \log(\text{likelihood}) + \log(\text{prior}) - \log(\text{marg})$
  (the last term does not affect optimisation)
• Arrive at the problem of minimising $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$ (derivation sketched below)

Regularisation in Non-Linear Models
Model selection in ML

Example regression problem
[Figure: scatter of training data; $X$ ranges roughly 2–10, $Y$ roughly −5 to 10]
How complex a model should we use?

Underfitting (linear regression)
[Figure: linear fit to the same data]
Model class Θ can be too simple to possibly fit true model.

Overfitting (non-parametric smoothing)
[Figure: highly flexible non-parametric fit to the same data]
Model class Θ can be so complex it can fit true model + noise

Actual model ($x \sin x$)
[Figure: the true curve $x \sin x$ over the same data]
The right model class Θ will sacrifice some training error in exchange for better test error.

How to “vary” model complexity
• Method 1: Explicit model selection
• Method 2: Regularisation
• Usually, method 1 can be viewed as a special case of method 2



1. Explicit model selection
• Try different classes of models. Example: try polynomial models of various degree $d$ (linear, quadratic, cubic, …)
• Use held-out validation (cross-validation) to select the model (see the sketch below):
  1. Split training data into $D_{\text{train}}$ and $D_{\text{validate}}$ sets
  2. For each degree $d$ we have a model $f_d$
     1. Train $f_d$ on $D_{\text{train}}$
     2. Test $f_d$ on $D_{\text{validate}}$
  3. Pick degree $\hat{d}$ that gives the best test score
  4. Re-train model $f_{\hat{d}}$ using all data
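A minimal sketch of this procedure with a single train/validation split; the data, candidate degrees, and split size are invented purely for illustration, and numpy's polynomial fitting stands in for "model of degree $d$":

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=60)
y = x * np.sin(x) + rng.normal(scale=1.0, size=x.size)   # true curve from the slides, plus noise

# Split into D_train and D_validate
idx = rng.permutation(x.size)
train, val = idx[:40], idx[40:]

best_degree, best_err = None, np.inf
for d in range(1, 10):                                    # candidate model classes f_d
    coef = np.polyfit(x[train], y[train], deg=d)          # train f_d on D_train
    err = np.mean((np.polyval(coef, x[val]) - y[val]) ** 2)  # test f_d on D_validate
    if err < best_err:
        best_degree, best_err = d, err

coef = np.polyfit(x, y, deg=best_degree)                  # re-train f_d_hat on all data
print(best_degree, best_err)
```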

2. Vary complexity by regularisation
• Augment the problem: $\hat{\theta} \in \operatorname{argmin}_{\theta \in \Theta} \left( L(\text{data}, \theta) + \lambda R(\theta) \right)$
• E.g., ridge regression: $\hat{w} \in \operatorname{argmin}_{w \in W} \|y - Xw\|_2^2 + \lambda \|w\|_2^2$
• Note that the regulariser $R(\theta)$ does not depend on data
• Use held-out validation / cross-validation to choose $\lambda$ (see the sketch below)
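An analogous sketch for Method 2: fix one flexible model class and tune $\lambda$ on held-out data. It uses scikit-learn's Ridge and PolynomialFeatures for convenience; the data and the candidate $\lambda$ grid are made up:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(80, 1))
y = (x * np.sin(x)).ravel() + rng.normal(scale=1.0, size=80)

x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.25, random_state=0)

best_lam, best_err = None, np.inf
for lam in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:               # candidate lambdas (illustrative)
    model = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=lam))
    model.fit(x_tr, y_tr)                               # same flexible model class every time
    err = np.mean((model.predict(x_val) - y_val) ** 2)  # held-out validation error
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam, best_err)
```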

Example: Polynomial regression
• 9th-order polynomial regression
∗ model of form $\hat{f} = w_0 + w_1 x + \ldots + w_9 x^9$
∗ regularised with a $\lambda \|w\|_2^2$ term
[Figures: fitted 9th-order polynomials for different values of $\lambda$ (e.g. $\lambda = 0$); from Bishop pp. 7–9, 150]

Regulariser as a constraint
• For illustrative purposes, consider a modified problem: minimise $\|y - Xw\|_2^2$ subject to $\|w\|_2^2 \le \lambda$ for $\lambda > 0$
[Figure: two panels in the $(w_1, w_2)$ plane, for ridge regression ($\|w\|_2^2$) and lasso ($\|w\|_1$); contour lines of the objective function are shown, the regulariser defines the feasible region, and $w^*$ marks the solution to linear regression]
• Lasso (L1 regularisation) encourages solutions to sit on the axes
∗ Some of the weights are set to zero → solution is sparse

Regularised linear regression
Algorithm | Minimises | Regulariser | Solution
Linear regression | $\|y - Xw\|_2^2$ | None | $(X'X)^{-1} X'y$ (if inverse exists)
Ridge regression | $\|y - Xw\|_2^2 + \lambda \|w\|_2^2$ | L2 norm | $(X'X + \lambda I)^{-1} X'y$
Lasso | $\|y - Xw\|_2^2 + \lambda \|w\|_1$ | L1 norm | No closed form, but solutions are sparse and suitable for high-dim data
(A short sketch contrasting the ridge and lasso solutions follows.)
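A sketch contrasting the two regularisers on synthetic data, illustrating the sparsity claim in the last row; the data, true weights, and regularisation strengths are invented:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 0])   # only two relevant features
y = X @ true_w + rng.normal(scale=0.5, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 3))   # all coefficients shrunk, but typically none exactly zero
print(np.round(lasso.coef_, 3))   # most irrelevant coefficients driven exactly to zero: sparse
```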

Bias-variance trade-off
Analysis of relations between train error, test error and model complexity

Assessing generalisation capacity
• Supervised learning: train the model on existing data, then make predictions on new data
• Training the model: ERM / minimisation of training error
• Generalisation capacity is captured by risk / test error
• Model complexity is a major factor that influences the ability of the model to generalise

In this section, our aim is to explore relations between training error, test error and model complexity

Training error and model complexity
• More complex model → training error goes down
• Finite number of points → usually can reduce training error to 0 (is it always possible?)
[Figure: training error vs. model complexity]

(Another) Bias-variance decomposition
• Squared loss for supervised-regression predictions: $l\left(Y, \hat{f}(X_0)\right) = \left(Y - \hat{f}(X_0)\right)^2$
• Lemma: Bias-variance decomposition
  $\mathbb{E}\left[l\left(Y, \hat{f}(X_0)\right)\right] = \left(\mathbb{E}[Y] - \mathbb{E}[\hat{f}]\right)^2 + \mathrm{Var}[\hat{f}] + \mathrm{Var}[Y]$
  The left-hand side is the risk / test error for $x_0$; the right-hand terms are the (bias)², the variance, and the irreducible error
* Prediction randomness comes from randomness in test features AND training data

Decomposition proof sketch
• Here $(x_0)$ is omitted to de-clutter notation
• $\mathbb{E}\left[(Y - \hat{f})^2\right] = \mathbb{E}\left[Y^2 + \hat{f}^2 - 2Y\hat{f}\right]$
• $= \mathbb{E}[Y^2] + \mathbb{E}[\hat{f}^2] - \mathbb{E}[2Y\hat{f}]$
• $= \mathrm{Var}[Y] + \mathbb{E}[Y]^2 + \mathrm{Var}[\hat{f}] + \mathbb{E}[\hat{f}]^2 - 2\mathbb{E}[Y]\mathbb{E}[\hat{f}]$ (using $\mathbb{E}[Y\hat{f}] = \mathbb{E}[Y]\mathbb{E}[\hat{f}]$, since $\hat{f}$ depends only on the training data, which is independent of the test label $Y$)
• $= \mathrm{Var}[Y] + \mathrm{Var}[\hat{f}] + \left(\mathbb{E}[Y]^2 - 2\mathbb{E}[Y]\mathbb{E}[\hat{f}] + \mathbb{E}[\hat{f}]^2\right)$
• $= \mathrm{Var}[Y] + \mathrm{Var}[\hat{f}] + \left(\mathbb{E}[Y] - \mathbb{E}[\hat{f}]\right)^2$
* Green slides are non-examinable
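A Monte Carlo sketch of the decomposition at a single test point: train many predictors on fresh training sets, then compare the average squared error against (bias)² + variance + irreducible error. The true function, noise level, and model degree below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
x0, sigma = 5.0, 1.0
f_true = lambda x: x * np.sin(x)

def fit_and_predict(degree):
    x = rng.uniform(0, 10, size=30)
    y = f_true(x) + rng.normal(scale=sigma, size=30)   # a fresh training set each call
    coef = np.polyfit(x, y, deg=degree)
    return np.polyval(coef, x0)                        # f_hat(x0) for this training set

preds = np.array([fit_and_predict(degree=3) for _ in range(5000)])   # many f_hat(x0)
y0 = f_true(x0) + rng.normal(scale=sigma, size=5000)                 # fresh test labels Y at x0

risk = np.mean((y0 - preds) ** 2)                       # E[(Y - f_hat)^2], estimated
bias2 = (np.mean(y0) - np.mean(preds)) ** 2
decomposed = bias2 + np.var(preds) + np.var(y0)         # (bias)^2 + Var[f_hat] + Var[Y]
print(risk, decomposed)                                 # approximately equal
```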

Training data as a random variable
[Figure: three different training sets $D_1$, $D_2$, $D_3$ scattered in the $x$–$y$ plane]


Model complexity and variance
• simple model → low variance
• complex model → high variance
[Figure: predictions at $x_0$ across training sets, for a simple model (left) and a complex model (right)]

Model complexity and bias
• simple model → high bias
• complex model → low bias
[Figure: average prediction vs. true value at $x_0$, for a simple model (left) and a complex model (right)]

Bias-variance trade-off
• simple model → high bias, low variance
• complex model → low bias, high variance
[Figure: test error decomposed into (bias)² and variance, ranging from simple to complex models]

Test error and training error
[Figure: training error and test error vs. model complexity; underfitting on the simple-model side, overfitting on the complex-model side]

Summary
• Regularisation
∗ Irrelevant/multicollinear features → ill-posed problems
∗ Model complexity
∗ Bias-variance trade-off
• Next lecture: Towards neural nets with the perceptron