Regularisation
COSC 2673-2793 | Semester 1 2021 (Computational) Machine Learning
Regression Assumptions/Issues
[Figure: three plots of power consumption vs. temperature, fitted with polynomials of increasing order]
$\theta_0 + \theta_1 x_1$
$\theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
$\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 + \theta_4 x_1^4 + \dots$
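To make the picture concrete, here is a minimal sketch (with synthetic data standing in for the slide's temperature/power-consumption example) that fits polynomials of increasing order and reports the training error:

```python
import numpy as np

# Synthetic stand-in for the slide's data: a quadratic trend plus noise.
rng = np.random.default_rng(0)
temperature = np.sort(rng.uniform(0, 2.5, 20))
power = 2.0 + 1.5 * (temperature - 1.2) ** 2 + rng.normal(0, 0.3, 20)

# Fit polynomials of increasing order, as in the three plots above.
for degree in (1, 2, 9):
    coeffs = np.polyfit(temperature, power, degree)
    mse = np.mean((np.polyval(coeffs, temperature) - power) ** 2)
    print(f"degree {degree}: training MSE = {mse:.3f}")
# The linear fit underfits; the degree-9 fit drives the training error towards
# zero but oscillates between the data points (overfitting).
```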
Bias and Variance
What we have discussed here are two related concepts.
They describe how an algorithm produces a hypothesis for different training data.
Bias is the set of assumptions an algorithm makes in order to generalise to unseen examples:
• A high-bias hypothesis makes the wrong assumptions and predicts incorrectly
• A low-bias hypothesis consistently makes correct predictions
Variance describes how the hypothesis's predictions change based on the training data:
• A high-variance hypothesis/algorithm predicts significantly different values for different training sets
• A low-variance hypothesis consistently predicts similar values
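Both quantities can be estimated empirically, as in this minimal sketch (the ground-truth function, model degrees, and single test point are illustrative assumptions): refit a model on many freshly sampled training sets and examine the spread of its predictions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_fn = np.sin  # hypothetical ground truth

x_test = 1.5
for degree in (1, 7):
    preds = []
    for _ in range(200):  # 200 different training sets
        x = rng.uniform(0, np.pi, 15)
        y = true_fn(x) + rng.normal(0, 0.2, 15)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias = preds.mean() - true_fn(x_test)  # systematic error of the average fit
    print(f"degree {degree}: bias = {bias:+.3f}, variance = {preds.var():.3f}")
# The rigid degree-1 model: high bias, low variance.
# The flexible degree-7 model: low bias, high variance.
```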
Classification Assumptions/Issues
[Figure: three scatter plots of two classes in the $(x_1, x_2)$ plane, with decision boundaries of increasing complexity]
$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$
$g(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n^2)$
$g(\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 + \theta_4 x_1^4 + \dots)$
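The same effect can be reproduced with scikit-learn by feeding higher-order polynomial features into logistic regression (the dataset here is a synthetic stand-in for the slide's two-class plots):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Two classes with a non-linear boundary.
X, y = make_moons(n_samples=100, noise=0.25, random_state=0)

# Raise the order of the features inside g(.) and the boundary grows more complex.
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=10_000))
    model.fit(X, y)
    print(f"degree {degree}: training accuracy = {model.score(X, y):.2f}")
# High-order boundaries can separate the training set almost perfectly,
# but such a boundary is unlikely to generalise.
```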
Another form of complexity
$g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$ vs. $g(\theta_0 + \theta_1 x_1)$
[Figure: two plots over the same input range contrasting the two hypotheses — complexity also grows with the number of features, not just the polynomial order]
Regularisation: Intuition
[Figure: two fits of power consumption vs. temperature]
$\theta_0 + \theta_1 x_1 + \theta_2 x_1^2$
$\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 + \theta_4 x_1^4$
Suppose we penalise $\theta_3$ and $\theta_4$ to make them very small. That way, we don't have to "decide" which form of the hypothesis to use.
Regularisation
Regularisation is the concept of "regulating" the weights: find the parameters that "matter the most".
Encourage small values for the parameters $\theta_0, \theta_1, \dots, \theta_m$.
Find a simpler hypothesis:
• For polynomial regression, generally lower order
• Lower weights "across the board" (see the sketch below)
Less prone to overfitting.
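A minimal sketch of this "across the board" shrinkage, using scikit-learn's Ridge (its alpha parameter plays the role of the regularisation weight; the data here is synthetic):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, (30, 1))
y = 1 + 2 * x[:, 0] + rng.normal(0, 0.1, 30)

# Degree-9 polynomial features give the unregularised fit room to overfit.
X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

plain = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("largest |weight| without regularisation:", np.abs(plain.coef_).max())
print("largest |weight| with regularisation:   ", np.abs(ridge.coef_).max())
# The regularised weights are dramatically smaller, giving a simpler hypothesis.
```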
Regularisation of the Loss Function
Introduce a regularisation term into the loss function:

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

Still find: $\min_\theta J(\theta)$
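As a sanity check, the regularised loss can be written out directly in NumPy (a minimal sketch; the function name and the design-matrix convention with a leading column of ones are assumptions for illustration):

```python
import numpy as np

def regularised_loss(theta, X, y, lam):
    """J(theta) as defined above, for linear regression.

    X: (n, m+1) design matrix whose first column is all ones.
    theta: (m+1,) parameter vector. lam: regularisation weight lambda.
    """
    n = X.shape[0]
    residuals = X @ theta - y                  # h_theta(x^(i)) - y^(i)
    data_term = residuals @ residuals / (2 * n)
    reg_term = lam * np.sum(theta[1:] ** 2)    # theta_0 is not penalised
    return data_term + reg_term
```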
Intuitive effect of Regularisation
$$\min_\theta \; \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

[Figure: power consumption vs. temperature — the penalty shrinks the higher-order weights, smoothing the fitted curve]
Effect of 𝜆 (Regularisation Weight)

$$\min_\theta \; \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$
What happens if 𝜆 is set to an extremely large value (perhaps too large for our problem), say $\lambda = 10^{10}$?
Drives all the parameter values towards zero.
Likely to underfit (fails to fit even the training data well).
[Figure: power consumption vs. temperature — under a very large 𝜆, the hypothesis $\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 + \theta_4 x_1^4$ flattens towards a constant]
Effect of 𝜆 (Regularisation Weight)
$$\min_\theta \; \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

What happens if 𝜆 is set to an extremely small value (perhaps too small for our problem), say $\lambda = 10^{-10}$?
Regularisation has almost no effect
Likely to overfit (fits training data well, but perhaps too well)
[Figure: power consumption vs. temperature — under a very small 𝜆, the hypothesis $\theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 + \theta_4 x_1^4$ wiggles through every training point]
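Both extremes are easy to reproduce with scikit-learn's Ridge (alpha corresponds to 𝜆; the data and polynomial degree are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, (15, 1)), axis=0)
y_train = np.sin(2 * np.pi * x_train[:, 0]) + rng.normal(0, 0.1, 15)
x_test = np.linspace(0, 1, 100)[:, None]
y_test = np.sin(2 * np.pi * x_test[:, 0])  # noise-free ground truth

poly = PolynomialFeatures(degree=9, include_bias=False)
X_train, X_test = poly.fit_transform(x_train), poly.transform(x_test)

for lam in (1e-10, 1.0, 1e10):
    model = Ridge(alpha=lam).fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse = np.mean((model.predict(X_test) - y_test) ** 2)
    print(f"lambda={lam:g}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
# Tiny lambda: near-zero training error, poor test error (overfit).
# Huge lambda: both errors large (underfit). A moderate lambda balances the two.
```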
Setting the regularisation term
𝜆 relates to the complexity of the model:
• Large 𝜆: lots of regularisation, less complex model (possibly underfit)
• Small 𝜆: little regularisation, more complex model (possibly overfit)
𝜆 is a tuneable parameter. Finding the best 𝜆 is non-trivial (“the art” of ML)
Next week we will discuss evaluation strategies to determine the “best” 𝜆
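As a preview of those evaluation strategies, one simple approach is to hold out a validation set and sweep 𝜆 over a logarithmic grid (a minimal sketch on synthetic data; alpha is scikit-learn's name for 𝜆):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ rng.normal(size=10) + rng.normal(0, 0.5, 200)

# Hold out 30% of the data and pick the lambda with the lowest validation error.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_lam, best_err = None, np.inf
for lam in 10.0 ** np.arange(-4, 5):  # 1e-4 ... 1e4
    model = Ridge(alpha=lam).fit(X_tr, y_tr)
    err = np.mean((model.predict(X_val) - y_val) ** 2)
    if err < best_err:
        best_lam, best_err = lam, err
print(f"best lambda = {best_lam:g} (validation MSE = {best_err:.3f})")
```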
Alternative Methods
There are many methods for regularisation.
The one presented in this lecture is known as Ridge Regression; this is also what it is called in the scikit-learn tools.
Another very popular method is Lasso (Least Absolute Shrinkage and Selection Operator) Regression.
The main differences:
• In ridge regression, all weights will be non-zero, but many will be very small
• In lasso regression, some weights may be zero, effectively conducting feature selection
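The feature-selection behaviour of lasso is easy to observe by counting zero weights (a minimal sketch; the alpha values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(0, 0.1, 100)  # only 3 features matter

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("zero weights (ridge):", np.sum(ridge.coef_ == 0))  # typically none
print("zero weights (lasso):", np.sum(lasso.coef_ == 0))  # typically most of the 17 irrelevant ones
```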
Alternative Methods
There are many methods for regularisation.
The one presented in this lecture is known as Ridge Regression (also its name in the scikit-learn tools):

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

Another very popular method is Lasso Regression:

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$$
Regularised Linear Regression
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

Gradient Descent: Repeat {

$$\theta_0 := \theta_0 - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right) x_0^{(i)}$$

$$\theta_j := \theta_j - \alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)} + 2\lambda\,\theta_j \right] \qquad j = 1, \dots, m$$

}

Simplified:

$$\theta_j := \theta_j \left( 1 - 2\alpha\lambda \right) - \alpha \, \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(\mathbf{x}^{(i)}) - y^{(i)} \right) x_j^{(i)}$$
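The update rule above translates directly into NumPy (a minimal sketch; the function name, step size, and iteration count are illustrative assumptions):

```python
import numpy as np

def ridge_gradient_descent(X, y, lam=0.01, alpha=0.1, iters=5000):
    """Gradient descent on the regularised loss above.

    X: (n, m+1) design matrix with a leading column of ones; y: (n,) targets.
    Applies theta_j := theta_j*(1 - 2*alpha*lam) - (alpha/n) * sum(residual * x_j),
    leaving theta_0 unpenalised.
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        residuals = X @ theta - y          # h_theta(x^(i)) - y^(i)
        grad = X.T @ residuals / n         # data-fit part of the gradient
        grad[1:] += 2 * lam * theta[1:]    # penalty part (theta_0 excluded)
        theta -= alpha * grad
    return theta

# Example: recover y = 2 + 3x from noisy data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
X = np.column_stack([np.ones(50), x])
y = 2 + 3 * x + rng.normal(0, 0.1, 50)
print(ridge_gradient_descent(X, y))  # approximately [2, 3], slightly shrunk
```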