Slide 1
BUSINESS SCHOOL
Discipline of Business Analytics
QBUS6850 Team
2
Topics covered
Machine learning model representation
Cost/loss function
Linear regression with single and multiple features
Optimisation algorithm: gradient descent
Model and feature selection techniques
Learning curves
Training, validation and test sets
Cross validation
Topics
3
References
Bishop (2006), Chapters 1.3–1.5; 3.2
Friedman et al. (2001), Chapters 2.3.1, 3.1–3.2, 7.1–7.6, 7.10
James et al. (2014), Chapters 2.1–2.2
James et al. (2014), Chapter 5.1 (Cross-Validation)
4
Learning Objectives
Understand model representation and cost function
Understand the matrix representation of linear regression with single and
multiple features
Understand how gradient descent algorithm works
Understand overfitting and underfitting
Understand bias and variance decomposition and be able to diagnose
them
Be able to interpret learning curves
Be able to perform polynomial order selection
Understand the rationale and process of cross-validation
5
6
Supervised learning:
Input/Feature:
An object is usually characterised by a feature scalar or vector, denoted x = (x_1, x_2, …, x_d)^T, where each component x_i is a specified feature of the object.
Each component x_i may be a quantitative value from ℝ (the set of all real numbers) or one of finitely many categorical values.
Outcome/Target:
An unknown system (to be learnt) generates an output/outcome/target, denoted t = (t_1, t_2, …, t_m)^T, for each object feature x.
Each component t_j may be a quantitative value from ℝ or one of finitely many categorical values.
In this course we assume m = 1, so t is a scalar.
As a measured value, t always carries some noise ε, i.e. the measurement is t = y + ε, where y is the true target.
Terminology in ML
Andy Sun
7
Training/Test Dataset:
A pair of observed values (x, t) is called a training/test datum. (In the unsupervised learning case, there is no target observation t.)
All the available data are collected into a set, denoted 𝒟, with or without target observations:
𝒟 = \{(\mathbf{x}_n, t_n)\}_{n=1}^{N} or 𝒟 = \{\mathbf{x}_n\}_{n=1}^{N}
Learner or Model:
Use the dataset 𝒟 to build a prediction model, or learner, which enables us to predict the outcome for new, unseen objects, or to characterise them when there are no outcomes.
A good learner is one that accurately predicts such an outcome or makes a correct characterisation.
Terminology in ML
8
Machine learning algorithms are built upon data, and different types of data exist. Although numeric data are widely seen in the scientific world, categorical data are more common in the business world.
When we have a data set with N observations {x_1, x_2, …, x_N}, we organise them into a matrix of size N × d such that each row corresponds to an observation (or a case). If we also have the targets/outputs {t_1, t_2, …, t_N} for the N observations, we organise them into a column vector, each entry corresponding to a case or an observation, denoted as
Data Representation
\mathbf{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Nd} \end{pmatrix} \in \mathbb{R}^{N \times d}, \quad \mathbf{t} = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix} \in \mathbb{R}^{N}
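As a concrete illustration of this layout (with made-up numbers), the matrix X and target vector t can be held as NumPy arrays, rows as observations and columns as features:

```python
import numpy as np

# N = 3 observations, d = 2 features: rows are observations.
X = np.array([[37.4, 1.2],
              [44.8, 0.7],
              [45.8, 2.3]])          # X in R^{N x d}
t = np.array([16.0, 15.2, 15.0])     # t in R^N

print(X.shape)  # (3, 2)
print(t.shape)  # (3,)
```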
9
Learning is the process of estimating an unknown dependency
between the input and output or structure of a system using a limited
number of observations.
A typical learning procedure consists of the following steps:
1. Statement of the Problem
2. Data Generation/Experiment Design
3. Data Collection and Pre-processing
4. Hypothesis Formulation and Objectives
5. Model Estimation and Assessment
6. Interpretation of the Model and Drawing Conclusions
Many recent application studies tend to focus on the learning
method used (e.g., a neural network).
Machine Learning Flow
10
[Scatter plot: Price against Odometer]
Supervised learning:
Step 1: Regression
‘Problem‘: Predict car price
Steps 2&3: Completed
(Data collected and
labelled)
Example: Linear Regression
11
Step 4a: Linear Model Hypothesis
y = f(\mathbf{x}; \boldsymbol{\beta}) = \beta_0 + \beta_1 x
f(): simple (univariate) linear regression model;
β = (β0, β1)^T: model parameters.
Example: Linear Regression
Odometer (x) Price (t)
37.4 16.0
44.8 15.2
45.8 15.0
30.9 17.4
31.7 17.4
34.0 16.1
45.9 15.7
… …
41.4 15.3
Training set
N: number of training examples
x: "input" variable; feature
t: "output"/"target" variable, a noisy version of the model output y, i.e. t = y + ε
The ith training example is (x_i, t_i), e.g. (x_1, t_1) = (37.4, 16.0)
(x_3, t_3) = ?
12
f() representation?
[Scatter plot: Price against Odometer, with candidate regression lines]
How to choose 𝛃𝛃?
Or how can we estimate 𝛃𝛃?
This is the task in Step 5
𝑦𝑦 = 𝑓𝑓 𝒙𝒙;𝜷𝜷 = 𝛽𝛽0 + 𝛽𝛽1𝑥𝑥
L(\beta_0, \beta_1) = \frac{1}{2N} \sum_{n=1}^{N} \bigl(f(\mathbf{x}_n; \boldsymbol{\beta}) - t_n\bigr)^2
13
Step 4b: Define an appropriate objective for the task at hand.
Purpose: to measure the error between the observed data and the model. A loss function, also called a cost function, is "a single, overall measure of loss incurred in taking any of the available decisions or actions" (Bishop, 2006).
Loss/cost function
[Scatter plot: Price against Odometer with the fitted regression line]
\min_{\beta_0, \beta_1} L(\beta_0, \beta_1)
Choose the parameters so that the estimated linear regression line is close to our training examples; the minimising argument itself is written \arg\min_{\beta_0, \beta_1} L(\beta_0, \beta_1).
Observed t values vs. predicted y values, where f(x; β) = β0 + β1 x.
The factor 1/2 is included for computational convenience.
14
Optimisation Algorithm
Step 5: Find the model parameters.
Sometimes the factor 1/2 is added for computational convenience; it does not change the estimation results of the parameters.
For this demo, we assume β0 = 0, so the loss function becomes
L(\beta_1) = \frac{1}{2N} \sum_{n=1}^{N} \bigl(f(\mathbf{x}_n; \boldsymbol{\beta}) - t_n\bigr)^2
[Plot: L(β1) against β1; quadratic. Why? β0 not included for simplicity.]
15
[Left plot: fitted lines y = β1 x for β1 = 0.5, 1 and 1.6. Right plot: the loss L(β1) against β1; quadratic, with β0 not included for simplicity.]
If β0 is included, what will the plot of the loss function L(β0, β1) look like?
16
Have a random starting point for β1;
Keep updating β1 to decrease the loss function value L(β1);
Repeat until achieving the minimum (convergence).
α > 0 is called the learning rate; in empirical studies we can try many α values and select the one that yields the smallest L(β1).
Gradient descent can converge to a local minimum.
Gradient descent
[Plot: L(β1) against β1; β0 not included for simplicity]
The partial derivative becomes smaller and smaller when moving towards the local minimum.
Update rule (":=" is the assignment notation: keep updating β1 with the value computed on the right-hand side):
\beta_1 := \beta_1 - \alpha \frac{\partial L(\beta_1)}{\partial \beta_1}
17
If the starting point of β1 is to the right of the local minimum, the derivative is positive:
\frac{\partial L(\beta_1)}{\partial \beta_1} > 0, so the update \beta_1 := \beta_1 - \alpha \frac{\partial L(\beta_1)}{\partial \beta_1} makes β1 smaller and smaller.
If the starting point of β1 is to the left of the local minimum, the derivative is negative:
\frac{\partial L(\beta_1)}{\partial \beta_1} < 0, so the update \beta_1 := \beta_1 - \alpha \frac{\partial L(\beta_1)}{\partial \beta_1} makes β1 larger and larger.
[Two plots of L(β1) against β1 illustrating the two cases]
18
• Have random starting points for β0 and β1;
• Keep updating β0 and β1 (simultaneously) to decrease the loss function value L(β0, β1);
• Repeat until achieving the minimum (convergence).
Gradient descent of linear regression
Update simultaneously:
L(\beta_0, \beta_1) = \frac{1}{2N} \sum_{n=1}^{N} \bigl(f(\mathbf{x}_n; \boldsymbol{\beta}) - t_n\bigr)^2 = \frac{1}{2N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} - t_n)^2

\beta_0 := \beta_0 - \alpha \frac{\partial L(\beta_0, \beta_1)}{\partial \beta_0} = \beta_0 - \alpha \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} - t_n)

\beta_1 := \beta_1 - \alpha \frac{\partial L(\beta_0, \beta_1)}{\partial \beta_1} = \beta_1 - \alpha \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} - t_n)\, x_{n1}
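A minimal sketch of these two update rules in Python; the data and the learning rate here are made up for illustration, and α would normally be tuned (unscaled features make convergence slow, which motivates the feature scaling discussed later):

```python
import numpy as np

# Toy odometer/price-style data (made up for illustration).
x = np.array([37.4, 44.8, 45.8, 30.9, 31.7, 34.0, 45.9, 41.4])
t = np.array([16.0, 15.2, 15.0, 17.4, 17.4, 16.1, 15.7, 15.3])
N = len(x)

beta0, beta1 = 0.0, 0.0   # starting point
alpha = 0.0005            # learning rate (small, chosen by trial)

for _ in range(200_000):
    residual = beta0 + beta1 * x - t       # f(x_n; beta) - t_n
    grad0 = residual.mean()                # dL/dbeta0
    grad1 = (residual * x).mean()          # dL/dbeta1
    # simultaneous update of both parameters
    beta0, beta1 = beta0 - alpha * grad0, beta1 - alpha * grad1

print(beta0, beta1)  # the slope comes out negative: price falls with odometer
```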
19
Why local minimum?
[Plot: a loss curve with stationary points A, B, C, D and E, plus a red and a blue starting point]
• If the starting point is the red dot, then gradient descent can only converge to local minimum B
• If the starting point is the blue dot, then gradient descent can only converge to local minimum C
• The derivatives at D and E are 0
• The derivatives at A, B or C are also 0
20
21
Multiple features
Number (x1) Nearest (x2) Office (x3) Enrolment (x4) Income (x5) Distance (x6) Margin (y)
3203 4.2 54.9 8.0 40 4.3 55.5
2810 2.8 49.6 17.5 38 23.0 33.8
2890 2.4 25.4 20.0 38 4.2 49.0
3422 3.3 43.4 15.5 41 19.4 31.9
2687 0.9 67.8 15.5 46 11.0 57.4
3759 2.9 63.5 19.0 36 17.3 49.0
2341 2.3 58 23.0 31 11.8 46.0
3021 1.7 57.2 8.5 45 8.8 50.2
N: number of training examples
d: number of features (here d = 6)
x: "input" variable with d features, x = (x_1, x_2, …, x_d)^T ∈ ℝ^d
y: "output"/"target" variable; we consider a single output
x_{nj}: jth feature of the nth training example, e.g. x_{32} = 2.4
For this example, the linear model is
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6
Define a special feature x_0, always taking the value 1; the new feature vector then has d + 1 dimensions.
22
Matrix Representation
In general, with d features:
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_d x_d
Define the special feature x_0 = 1, so the new feature vector is x = (x_0, x_1, x_2, …, x_d)^T ∈ ℝ^{d+1}.
Collect all parameters into a vector β = (β_0, β_1, β_2, …, β_d)^T. Then
f(\mathbf{x}, \boldsymbol{\beta}) = \mathbf{x}^T \boldsymbol{\beta}
(Think about why the transpose appears here.)
Consider an input feature vector x_n = (x_{n0}, x_{n1}, x_{n2}, …, x_{nd})^T and its corresponding output (or target) t_n. The squared model error is
e_n^2 = \bigl(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\bigr)^2 = (t_n - \mathbf{x}_n^T \boldsymbol{\beta})^2
The overall "mean" error is
L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} e_n^2 = \frac{1}{2N} \sum_{n=1}^{N} (t_n - \mathbf{x}_n^T \boldsymbol{\beta})^2
23
Multiple features loss function
Note: the factor 1/2 here is for mathematical convenience.
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots + \beta_d x_d = \mathbf{x}^T \boldsymbol{\beta}
\mathbf{X} = \begin{pmatrix} \mathbf{x}_1^T \\ \mathbf{x}_2^T \\ \vdots \\ \mathbf{x}_N^T \end{pmatrix} = \begin{pmatrix} x_{10} & x_{11} & x_{12} & \cdots & x_{1d} \\ x_{20} & x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{N0} & x_{N1} & x_{N2} & \cdots & x_{Nd} \end{pmatrix} \in \mathbb{R}^{N \times (d+1)}, \quad \mathbf{t} = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix} \in \mathbb{R}^{N}
24
Multiple features loss function
Another way to write the loss function
Question: What is the meaning of, for example, the second column vector of data
matrix X, called the design matrix? What is the 10th row of X?
It is easy to prove that
𝐿𝐿 𝜷𝜷 =
1
2𝑁𝑁
𝐭𝐭 − 𝐗𝐗𝜷𝜷 𝑇𝑇(𝐭𝐭 − 𝐗𝐗𝜷𝜷)
Denote
Closed-Form Solution: Normal equation
Normal equation is an analytical solution:
• Compared with the gradient descent: no need to choose learning rate 𝛼𝛼
and do not have to run a loop
• Can be slow when d is large
𝑁𝑁 × (𝑑𝑑 + 1)(𝑑𝑑 + 1) × 𝑁𝑁
(𝑑𝑑 + 1) × 𝑁𝑁
𝑁𝑁 × 1(𝑑𝑑 + 1) × 1
( )xxT Non-invertable?
𝜷𝜷 = 𝐗𝐗𝑇𝑇𝐗𝐗 −1𝐗𝐗𝑇𝑇𝐭𝐭
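A sketch of the normal equation in NumPy (numbers made up; `np.linalg.solve` is used instead of forming the inverse explicitly, which is numerically safer):

```python
import numpy as np

# Design matrix with the special x_0 = 1 column, plus one real feature.
X = np.array([[1.0, 37.4],
              [1.0, 44.8],
              [1.0, 45.8],
              [1.0, 30.9],
              [1.0, 31.7]])
t = np.array([16.0, 15.2, 15.0, 17.4, 17.4])

# Solve (X^T X) beta = X^T t rather than computing (X^T X)^{-1} directly.
beta = np.linalg.solve(X.T @ X, X.T @ t)
print(beta)  # [beta_0, beta_1]
```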
26
Reason: multicollinearity or redundant features. (Rank and determinant of X^T X = ?)
Solution: drop one or more highly correlated features from the model, or collect more data.
Reason: the number of features is too large (e.g. N ≪ d). (Rank and determinant of X^T X = ?)
Solution: drop some features or collect more data;
add a "regularization" term to the model.
X^T X non-invertible?
For real matrices X: rank(X^T X) = rank(X X^T) = rank(X) = rank(X^T)
\beta_0 := \beta_0 - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_0} = \beta_0 - \alpha \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_d x_{nd} - t_n)
27
• Have random starting points for all β_i;
• Keep updating all β_i (simultaneously) to decrease the loss function value L(β);
• Repeat until achieving the minimum (convergence).
Gradient descent of linear regression
with multiple features
Update simultaneously...
L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \bigl(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\bigr)^2 = \frac{1}{2N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \dots + \beta_d x_{nd} - t_n)^2

\beta_1 := \beta_1 - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_1} = \beta_1 - \alpha \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \dots + \beta_d x_{nd} - t_n)\, x_{n1}

\vdots

\beta_d := \beta_d - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_d} = \beta_d - \alpha \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \dots + \beta_d x_{nd} - t_n)\, x_{nd}
• We can write gradient descent for linear regression with multiple features in matrix form.
• The matrix form is much simpler. First define
\mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) := \begin{pmatrix} f(\mathbf{x}_1, \boldsymbol{\beta}) \\ f(\mathbf{x}_2, \boldsymbol{\beta}) \\ \vdots \\ f(\mathbf{x}_N, \boldsymbol{\beta}) \end{pmatrix}; \quad \mathbf{t} = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix}; \quad \frac{\partial L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} := \begin{pmatrix} \partial L(\boldsymbol{\beta})/\partial \beta_0 \\ \partial L(\boldsymbol{\beta})/\partial \beta_1 \\ \vdots \\ \partial L(\boldsymbol{\beta})/\partial \beta_d \end{pmatrix}
28
Gradient Descent in Matrix Form
Then it can be proved that
\frac{\partial L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{1}{N} \mathbf{X}^T \bigl(\mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) - \mathbf{t}\bigr)
• Hence gradient descent is
\boldsymbol{\beta} := \boldsymbol{\beta} - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta} - \frac{\alpha}{N} \mathbf{X}^T \bigl(\mathbf{f}(\mathbf{X}, \boldsymbol{\beta}) - \mathbf{t}\bigr)
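The matrix-form update can be sketched in a few lines of NumPy. The data here are synthetic (roughly unit-scale features, so a single learning rate works well); the true coefficients are an assumption of the example:

```python
import numpy as np

# Synthetic data: t = 2.0 + x.[1.0, -0.5, 0.3] + small noise.
rng = np.random.default_rng(0)
N, d = 100, 3
X_raw = rng.normal(size=(N, d))
t = 2.0 + X_raw @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=N)

X = np.hstack([np.ones((N, 1)), X_raw])   # prepend the x_0 = 1 column
beta = np.zeros(d + 1)                    # starting point
alpha = 0.1

for _ in range(2000):
    f = X @ beta                                   # f(X, beta)
    beta = beta - (alpha / N) * (X.T @ (f - t))    # matrix-form update

print(beta)  # close to [2.0, 1.0, -0.5, 0.3]
```

After enough iterations this agrees with the normal-equation solution, since the loss is convex.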
29
Target: transform features to be on a similar scale.
Results: faster convergence of the optimisation algorithm
Feature Scaling
Number (x1): 1613 to 4214
Nearest (x2): 0.1 to 4.2
x_{nj} := \frac{x_{nj} - \bar{x}_j}{s_j}
where \bar{x}_j and s_j are the mean and standard deviation of the jth feature.
Mean Normalization
Goal: have all the features to be
approximately 0 mean and 1 variance
Number (x1) Nearest (x2) Office (x3) Enrolment (x4) Income (x5) Distance (x6) Margin (y)
3203 4.2 54.9 8.0 40 4.3 55.5
2810 2.8 49.6 17.5 38 23.0 33.8
2890 2.4 25.4 20.0 38 4.2 49.0
3422 3.3 43.4 15.5 41 19.4 31.9
2687 0.9 67.8 15.5 46 11.0 57.4
3759 2.9 63.5 19.0 36 17.3 49.0
2341 2.3 58 23.0 31 11.8 46.0
3021 1.7 57.2 8.5 45 8.8 50.2
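A sketch of mean normalisation on two columns like Number (x1) and Nearest (x2) above (scikit-learn's `StandardScaler` implements the same transform):

```python
import numpy as np

# Two features on very different scales, as in the table above.
X = np.array([[3203.0, 4.2],
              [2810.0, 2.8],
              [2890.0, 2.4],
              [3422.0, 3.3],
              [2687.0, 0.9]])

# x_j := (x_j - mean_j) / std_j, column by column.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```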
30
Polynomial Regression
[Scatter plot: Revenue against Age]
Question:
Which model is the best model?
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2
We added a quadratic feature x_2 = x_1^2 constructed from the first feature; similarly
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 x_1^4
31
# fit a 4th order polynomial regression model
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(4)
poly_4 = poly.fit_transform(np.reshape(x_data, (100, 1)))
# (np.asmatrix is unnecessary here; fit accepts plain arrays)
lr_obj_4 = LinearRegression()
lr_obj_4.fit(poly_4, y_data)
print(lr_obj_4.intercept_)  # the intercept beta_0 in our notation
print(lr_obj_4.coef_)
32
# plot the fitted 4th order polynomial regression line
import matplotlib.pyplot as plt

x_temp = np.reshape(np.linspace(np.min(x_data), np.max(x_data), 50), (50, 1))
poly_temp0_4 = poly.fit_transform(x_temp)
y_temp = lr_obj_4.predict(poly_temp0_4)  # predict with the fitted model lr_obj_4
plt.plot(x_temp, y_temp)
plt.scatter(x_data, y_data, label="Observed Points", color="red")
Is this a better
model?
33
Model selection
34
Model Selection and Assessment
Step 5
Model Selection:
estimate the performance of different models in order to choose the (approximately) best one.
Model Assessment:
after choosing the "best" model, estimate its prediction error (generalisation error) on new data (Friedman et al., 2001).
In general, we divide the given dataset into three parts:
• Training dataset (60%): used to estimate a model (or models)
• Validation dataset (20%): used to select an appropriate model
• Test dataset (20%): used to assess the performance of the selected model. This set of data should be hidden from the training and validation process.
Some academics use a 50% / 25% / 25% split.
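The 60/20/20 split above can be sketched with two calls to scikit-learn's `train_test_split` (the data and variable names here are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100, dtype=float).reshape(100, 1)  # made-up data
t = np.arange(100, dtype=float)

# First peel off 40% for validation + test, then split that half-and-half.
X_train, X_rest, t_train, t_rest = train_test_split(
    X, t, test_size=0.4, random_state=0)
X_val, X_test, t_val, t_test = train_test_split(
    X_rest, t_rest, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```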
Underfitting & Overfitting
35
[Three plots of Revenue against Age showing three fitted models]
Underfitting: High bias, low variance Overfitting: Low bias, high variance
Best model
36
Underfitting and Overfitting
How to address overfitting:
• Drop some features
  - via a model selection algorithm, or
  - by manually selecting features to keep
• Regularization
  - keep all features, but reduce the magnitude/values of the parameters
• Low training error, high generalization
error
• Poor predictive performance
• Overreacts to minor fluctuations in
training data
37
Odometer (x) Price (t)
37.4 16.0
44.8 15.2
45.8 15.0
30.9 17.4
31.7 17.4
34.0 16.1
45.9 15.7
19.1 17.5
40.1 16.6
40.2 16.2
Training set: 60%
Validation set: 20%
Test set: 20%
Cost function
Training, validation and test sets
The same loss function is used for the training, validation and test sets respectively:
L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \bigl(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\bigr)^2
evaluated as L_{train}(β), L_v(β) and L_{test}(β).
We cannot use the training set to select the best model; that is not a fair competition, since the parameters were chosen to minimise the loss on the training data, which does not necessarily minimise the loss on new data sets.
1. Optimise the parameters β on the training set for each polynomial degree.
2. Find the polynomial degree d with the smallest error on the validation set.
3. Estimate the generalisation error using the test set.
Training set: estimate the parameters. Validation set: select the best model. Test set: estimate the generalisation error.
Which one to use?
Polynomial Order Selection
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3
f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 x_1^4
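The degree-selection procedure can be sketched as follows. The data are synthetic with a quadratic ground truth (an assumption of the example), so degree 1 should clearly underfit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: quadratic truth plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 15.0, size=80)
t = 1000.0 + 60.0 * x - 4.0 * x**2 + rng.normal(scale=20.0, size=80)

x_train, t_train = x[:60].reshape(-1, 1), t[:60]
x_val, t_val = x[60:].reshape(-1, 1), t[60:]

val_loss = {}
for degree in (1, 2, 3, 4):
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(x_train), t_train)
    pred = model.predict(poly.transform(x_val))
    val_loss[degree] = np.mean((t_val - pred) ** 2) / 2  # validation loss L_v

best_degree = min(val_loss, key=val_loss.get)
print(best_degree, val_loss)
```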
39
Suppose your training error is low, while
validation/test error is high.
Underfitting or overfitting problem?
Diagnosing Learning Curve
Learning Curve
Suppose your training error is high, while
validation/test error is also high.
Underfitting or overfitting problem?
Underfitting: training error is high, validation error is slightly > training error;
Overfitting: training error is low, validation error is significantly > training error.
[Plot: L_{train}(β) and L_v(β) against model complexity]
40
Learning curve
Diagnosing Learning Curve
The impact of training set size N on the loss function.
[Plot: L_{train}(β) and L_v(β) against training set size N. Why is this?]
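A sketch of how these curves can be computed by hand on synthetic linear data (scikit-learn also offers a ready-made `sklearn.model_selection.learning_curve` helper):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data with noise.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 10.0, size=(300, 1))
t = 3.0 + 2.0 * x.ravel() + rng.normal(scale=1.0, size=300)

x_val, t_val = x[200:], t[200:]          # held-out validation set
train_loss, val_loss = {}, {}
for N in (5, 20, 100):
    model = LinearRegression().fit(x[:N], t[:N])
    train_loss[N] = np.mean((t[:N] - model.predict(x[:N])) ** 2) / 2
    val_loss[N] = np.mean((t_val - model.predict(x_val)) ** 2) / 2
    print(N, train_loss[N], val_loss[N])
```

Typically the training loss rises and the validation loss falls as N grows, until both level off near the noise floor.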
41
Cross Validation
42
K-Fold Cross-Validation
If we had enough data, we would set aside a validation set and use it to
assess the performance of our prediction model
However, data are often scarce, so this is usually not possible,
particularly when we do not have sufficient labelled data.
K-fold cross-validation (CV) uses part of the available data to fit the model,
and a different part to test it, then iterate/repeat this process
43
5-fold Cross-Validation
The original sample is randomly partitioned into K equal-sized subsamples (folds).
At each iteration, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data.
K = 5
[Fold diagram: folds 1 to 5; fold 1 = validation, folds 2 to 5 = training]
Specifically, at iteration 1:
Fold 1 is chosen as the validation set, and folds 2, 3, 4, 5 are used as the training set. Estimate the parameters β(1) using the training folds and calculate the validation error L_v(β(1)).
44
At iteration 2:
Fold 2 is chosen as the validation set, and folds 1, 3, 4, 5 (K − 1 folds) are used as the training set. Estimate the parameters β(2) using the training folds and calculate the validation error L_v(β(2)).
[Fold diagram: fold 2 = validation, the remaining folds = training]
Repeat until iteration K = 5: estimate the parameters β(5) using the training folds and calculate the validation error L_v(β(5)).
Output the mean validation error (L_v(β(1)) + L_v(β(2)) + … + L_v(β(5)))/5 and select the model that yields the smallest error.
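The 5-fold loop above can be sketched with scikit-learn's `KFold` (synthetic data for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic linear data.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=(50, 1))
t = 1.0 + 0.5 * X.ravel() + rng.normal(scale=0.2, size=50)

fold_errors = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Each fold serves once as the validation set.
    model = LinearRegression().fit(X[train_idx], t[train_idx])
    pred = model.predict(X[val_idx])
    fold_errors.append(np.mean((t[val_idx] - pred) ** 2) / 2)

mean_cv_error = np.mean(fold_errors)
print(len(fold_errors), mean_cv_error)
```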
45
Cross-Validation: potential issues
• Computational cost: you must train each model K times.
• The K training sets (and hence the trained models) are highly correlated (see, e.g., Bengio & Grandvalet, 2004).
46
Scikit-learn Workflow
See Lecture02_Example01.py and Lecture02_Example02.py
The Python scikit-learn package provides implementations of most popular machine learning algorithms.
The best way to learn scikit-learn's functionality is to work through examples and read the user guide.
Workflow:
A typical machine learning task starts with data preparation: For example,
loading data from database/files (e.g. using pandas); data cleaning;
feature extraction, feature scaling and dimensionality reduction etc; some
of these can be done with scikit-learn, some rely on other packages
Following data preparation there will be a step to define a machine
learning model, for example, linear regression etc.
Scikit-learn introduces the concept of a pipeline that chains all the steps in a linear sequence and automates cross-validation.
Read the examples here: https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
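A minimal sketch of the pipeline idea (synthetic data; the step names in the pipeline are made up): chain feature construction, scaling and the model, then cross-validate the whole chain in one call.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data with a quadratic relationship.
rng = np.random.default_rng(4)
X = rng.uniform(0.0, 10.0, size=(60, 1))
t = 2.0 + X.ravel() - 0.3 * X.ravel() ** 2 + rng.normal(scale=0.5, size=60)

pipe = Pipeline([
    ("poly", PolynomialFeatures(2)),   # feature construction
    ("scale", StandardScaler()),       # feature scaling
    ("lr", LinearRegression()),        # the model
])

# 5-fold CV applied to the whole pipeline, so scaling is refit per fold.
scores = cross_val_score(pipe, X, t, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())
```

Fitting the scaler inside the pipeline, rather than on the full dataset beforehand, keeps validation folds truly unseen during training.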