
Slide 1

BUSINESS SCHOOL

 Discipline of Business Analytics

QBUS6850 Team

2

 Topics covered
 Machine learning model representation
 Cost/loss function
 Linear regression with single and multiple features
 Optimisation algorithm: gradient descent
 Model and feature selection techniques
 Learning curves
 Training, validation and test sets

 Cross validation

Topics

3

 References
 Bishop (2006), Chapters 1.3 – 1.5; 3.2
 Friedman et al. (2001), Chapters 2.3.1, 3.1 – 3.2, 7.1 – 7.6, 7.10
 James et al., (2014), Chapters 2.1 – 2.2
 James et al., (2014) Chapter 5.1 (Cross Validation)

4

Learning Objectives

 Understand model representation and cost function

 Understand the matrix representation of linear regression with single and multiple features

 Understand how gradient descent algorithm works

 Understand overfitting and underfitting

 Understand bias and variance decomposition and be able to diagnose them

 Be able to interpret learning curves

 Be able to select the polynomial order

 Understand the rationale for and the process of Cross Validation

5

6

 Input/Feature (supervised learning):
 • An object is usually characterised by a feature scalar or vector
 • Denote it by x = (x_1, x_2, …, x_d)^T, where each component x_i is a
specified feature of the object
 • Each component x_i may be a quantitative value from ℝ (the set of all
real numbers) or one of finitely many categorical values.
 Outcome/Target:
 • An unknown system (to be learnt) generates an
output/outcome/target, denoted t = (t_1, t_2, …, t_m)^T, for each object
feature x
 • Each component t_j may be a quantitative value from ℝ (the set of all
real numbers) or one of finitely many categorical values
 • In most cases m = 1, and we assume m = 1 throughout this course,
so t is a scalar
 • As a measured value, we always suppose there is some noise ε in t,
i.e., the measurement is t = y + ε where y is the true target.


Terminology in ML

Andy Sun

7

 Training/Test Dataset:

 • A pair of observations (x, t) is called a training/test datum.
 • In the unsupervised learning case, there is no target observation t.
 • All the available training data are collected together into a set of
training/test data, denoted 𝒟, with or without target observations:
   𝒟 = {(x_n, t_n)}_{n=1}^{N} or {x_n}_{n=1}^{N}

 Learner or Model:

 • Use this dataset 𝒟 to build a prediction model, or learner, which will
enable us to predict the outcome for new unseen objects, or to characterise
them if no outcomes are available.
 • A good learner is one that accurately predicts such an outcome or makes
the right characterisation.

Terminology in ML

8

Data Representation

 Machine learning algorithms are built upon data, and different types of data
exist. Although numeric data are widespread in the scientific world, categorical
data are more common in the business world.

 When we have a data set with N observations {x_1, x_2, …, x_N}, we organise
them into a matrix of size N × d such that each row corresponds to an
observation (or a case). If we have the targets/outputs {t_1, t_2, …, t_N} for
the N observations, we also organise them into a column vector, each row
corresponding to a case (or observation), denoted as

X = [x_1^T]   [x_11  x_12  …  x_1d]
    [x_2^T] = [x_21  x_22  …  x_2d]  ∈ ℝ^{N×d},   t = (t_1, t_2, …, t_N)^T ∈ ℝ^N
    [  ⋮  ]   [  ⋮     ⋮   ⋱    ⋮ ]
    [x_N^T]   [x_N1  x_N2  …  x_Nd]
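As a concrete sketch of this layout (the numbers below are illustrative, not the course data), the design matrix X and target column t can be stored as NumPy arrays, one row per observation:

```python
import numpy as np

# N = 4 observations, d = 2 features: row n of X is the feature vector x_n^T
X = np.array([[37.4, 1.2],
              [44.8, 0.9],
              [45.8, 2.1],
              [30.9, 1.7]])

# Targets collected into a column: entry n of t pairs with row n of X
t = np.array([16.0, 15.2, 15.0, 17.4])

N, d = X.shape
print(N, d)          # 4 2
print(X[2], t[2])    # the third observation (x_3^T, t_3)
```

Indexing a row of X together with the matching entry of t recovers one (x_n, t_n) datum.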

9

 Learning is the process of estimating an unknown dependency
between the input and output or structure of a system using a limited
number of observations.

 A typical learning procedure consists of the following steps:
1. Statement of the Problem
2. Data Generation/Experiment Design
3. Data Collection and Pre-processing
4. Hypothesis Formulation and Objectives
5. Model Estimation and Assessment
6. Interpretation of the Model and Drawing Conclusions

 Many recent application studies tend to focus on the learning
methods used (e.g., a neural network).

Machine Learning Flow

10

Example: Linear Regression

Supervised learning:

[Figure: scatter plot of Price (14.0 to 18.0) against Odometer (0.0 to 60.0)]

Step 1: Regression
'Problem': Predict car price
Steps 2 & 3: Completed (data collected and labelled)

11

Example: Linear Regression

Step 4a: Linear Model Hypothesis
y = f(x; β) = β_0 + β_1 x

f(): simple (univariate) linear regression model;
β = (β_0, β_1)^T: model parameters.

Odometer (x)  Price (t)
37.4          16.0
44.8          15.2
45.8          15.0
30.9          17.4
31.7          17.4
34.0          16.1
45.9          15.7
…             …
41.4          15.3

Training set
N: number of training examples
x: "input" variable; features
t: "output" variable; "target" variable, a noisy version of the model
output y, i.e., t = y + ε

ith training example: (x_i, t_i)
(x_1, t_1) = (37.4, 16.0)
(x_3, t_3) = ??

12

How should f() be represented?

[Figure: scatter plot of Price against Odometer]

y = f(x; β) = β_0 + β_1 x

How to choose β? Or, how can we estimate β?
This is the task in Step 5.

L(β_0, β_1) = (1/2N) Σ_{n=1}^{N} (f(x_n, β) − t_n)²
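As a sanity check, the loss above can be evaluated directly; a minimal sketch with made-up numbers (x, t and the trial parameters below are illustrative, not the car-price data):

```python
import numpy as np

def loss(beta0, beta1, x, t):
    # L(beta0, beta1) = (1/2N) * sum_n (f(x_n; beta) - t_n)^2, f(x) = beta0 + beta1*x
    residuals = (beta0 + beta1 * x) - t
    return np.sum(residuals ** 2) / (2 * len(x))

x = np.array([1.0, 2.0, 3.0])
t = np.array([2.0, 4.0, 6.0])   # exactly t = 2x, so (0, 2) is the minimiser

print(loss(0.0, 2.0, x, t))     # 0.0: a perfect fit gives zero loss
print(loss(0.0, 0.0, x, t))     # a worse parameter choice gives a larger loss
```

Any other (β_0, β_1) pair gives a strictly larger value here, which is what the optimisation step will exploit.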

13

Loss/cost function

Step 4b: Define an appropriate objective for the task at hand.

Purpose: to measure the error between the observations and the model. The loss
function, also called a cost function, is a single, overall measure of the loss
incurred in taking any of the available decisions or actions (Bishop, C.M., 2006).

[Figure: scatter plot of Price against Odometer with the fitted regression line,
showing the gaps between observed t values and predicted y values]

min_{β_0, β_1} L(β_0, β_1)

Choose the parameters so that the estimated linear regression line is close to
our training examples, where f(x; β) = β_0 + β_1 x. The minimising argument is
written argmin. The factor 1/2 in the loss is included for computational
convenience.

14

Optimisation Algorithm

Step 5: Find the model parameters.

Sometimes the factor 1/2 is added for computational convenience; it does not
change the estimated parameter values.

For the demo, we assume β_0 = 0, hence the loss function becomes

L(β_1) = (1/2N) Σ_{n=1}^{N} (f(x_n, β) − t_n)²

[Figure: the loss function L(β_1) plotted against β_1; it is quadratic
(β_0 not included, for simplicity). Why?]

15

[Figure: left, the data with candidate lines y = β_1 x for β_1 = 0.5, 1 and 1.6;
right, the quadratic loss L(β_1), evaluated at β_1 = 0.5, 1 and 1.6, with its
minimum near β_1 = 1. β_0 not included for simplicity.]

If β_0 is included, what will the plot of the loss function L(β_0, β_1) look like?
16

Gradient descent

 • Start from some random initial value for β_1;
 • Keep updating β_1 to decrease the loss function value L(β_1);
 • Repeat until reaching a minimum (convergence).
 • α > 0 is called the learning rate: in empirical studies, we can try
many α values and select the one that generates the smallest L(β_1)
 • Gradient descent can converge to a local minimum

The partial derivative becomes smaller and smaller as we move towards the
local minimum (β_0 not included for simplicity).

Update rule (":=" is the assignment notation: keep updating β_1 with the value
computed on the right-hand side):

β_1 := β_1 − α ∂L(β_1)/∂β_1
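The update rule can be sketched as a simple loop; the data, starting point and learning rate below are illustrative choices, not prescribed values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
t = 1.5 * x                # noiseless targets, so the best beta_1 is 1.5

beta1 = 0.0                # starting point
alpha = 0.05               # learning rate

for _ in range(200):
    # dL/dbeta1 = (1/N) * sum_n (beta1*x_n - t_n) * x_n   (with beta_0 fixed at 0)
    grad = np.mean((beta1 * x - t) * x)
    beta1 -= alpha * grad  # the ":=" assignment update
```

For this quadratic loss each step shrinks the gap to the minimiser by a constant factor, so beta1 approaches 1.5; with too large an alpha the iterates would overshoot and diverge instead.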

17

[Figure: the quadratic loss L(β_1) with starting points on either side of its
minimum near β_1 = 1]

If the starting point of β_1 is to the right of the local minimum:
∂L(β_1)/∂β_1 > 0 (positive derivative), so the update
β_1 := β_1 − α ∂L(β_1)/∂β_1 makes β_1 smaller and smaller.

If the starting point of β_1 is to the left of the local minimum:
∂L(β_1)/∂β_1 < 0 (negative derivative), so the update
β_1 := β_1 − α ∂L(β_1)/∂β_1 makes β_1 larger and larger.

18

Gradient descent of linear regression

• Start from some random initial values for β_0 and β_1;
• Keep updating β_0 and β_1 (simultaneously) to decrease the loss function
value L(β_0, β_1);
• Repeat until reaching a minimum (convergence).

L(β_0, β_1) = (1/2N) Σ_{n=1}^{N} (f(x_n, β) − t_n)²
            = (1/2N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} − t_n)²

Update simultaneously:

β_0 := β_0 − α ∂L(β_0, β_1)/∂β_0 = β_0 − α (1/N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} − t_n)
β_1 := β_1 − α ∂L(β_0, β_1)/∂β_1 = β_1 − α (1/N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} − t_n) x_{n1}

19

Why a local minimum?

[Figure: a non-convex loss curve with points A to E marked, local minima at
B and C, and two coloured starting points]

• If the starting point is the red dot, then gradient descent can only
converge to the local minimum B
• If the starting point is the blue dot, then gradient descent can only
converge to the local minimum C
• The derivatives at D and E are 0
• The derivatives at A, B and C are also 0

20

21

Multiple features

Number (x1)  Nearest (x2)  Office (x3)  Enrolment (x4)  Income (x5)  Distance (x6)  Margin (y)
3203         4.2           54.9         8.0             40           4.3            55.5
2810         2.8           49.6         17.5            38           23.0           33.8
2890         2.4           25.4         20.0            38           4.2            49.0
3422         3.3           43.4         15.5            41           19.4           31.9
2687         0.9           67.8         15.5            46           11.0           57.4
3759         2.9           63.5         19.0            36           17.3           49.0
2341         2.3           58.0         23.0            31           11.8           46.0
3021         1.7           57.2         8.5             45           8.8            50.2

N: number of training examples
d: number of features
x: "input" variable; d features, x = (x_1, x_2, …, x_d)^T ∈ ℝ^d
y: "output" variable; "target" variable.
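The simultaneous update of (β_0, β_1) described above can be sketched as follows (illustrative data; note that both gradients are computed from the old parameter values before either is overwritten):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
t = 2.0 + 0.5 * x          # noiseless targets: true beta_0 = 2.0, beta_1 = 0.5

beta0, beta1 = 0.0, 0.0    # random-ish starting points
alpha = 0.1

for _ in range(5000):
    err = (beta0 + beta1 * x) - t
    g0 = np.mean(err)        # dL/dbeta0
    g1 = np.mean(err * x)    # dL/dbeta1
    # simultaneous update: both right-hand sides use the old beta0, beta1
    beta0, beta1 = beta0 - alpha * g0, beta1 - alpha * g1

print(round(beta0, 3), round(beta1, 3))
```

Updating β_0 first and then reusing the new β_0 inside the β_1 gradient would be a different (non-simultaneous) algorithm; the tuple assignment above avoids that.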
We consider a single output. For this dataset, we use the notation d = 6.
x_nj denotes the nth training example of the jth feature, e.g. x_32 = 2.4.

For this example, the linear model is
f(x, β) = β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3 + β_4 x_4 + β_5 x_5 + β_6 x_6

Define a special feature x_0, always taking the value 1, so the new feature
variable is a vector of dimension d + 1.

22

Matrix Representation

In general, with d features,
f(x, β) = β_0 + β_1 x_1 + β_2 x_2 + ⋯ + β_d x_d

With x_0 = 1 and x = (x_0, x_1, x_2, …, x_d)^T ∈ ℝ^{d+1}, collect all the
parameters into a vector β = (β_0, β_1, β_2, …, β_d)^T. Then
f(x, β) = x^T β
(think about why the transpose appears here).

Consider an input feature vector x_n = (x_n0, x_n1, x_n2, …, x_nd)^T and its
corresponding output (or target) t_n. The squared model error is
e_n² = (t_n − f(x_n, β))² = (t_n − x_n^T β)²
and the overall "mean" error is
L(β) = (1/2N) Σ_{n=1}^{N} e_n² = (1/2N) Σ_{n=1}^{N} (t_n − x_n^T β)²
(the factor 1/2 here is for mathematical convenience).

23

Multiple features loss function

X = [x_1^T]   [x_10  x_11  x_12  …  x_1d]
    [x_2^T] = [x_20  x_21  x_22  …  x_2d]  ∈ ℝ^{N×(d+1)},   t = (t_1, t_2, …, t_N)^T ∈ ℝ^N
    [  ⋮  ]   [  ⋮     ⋮     ⋮   ⋱    ⋮ ]
    [x_N^T]   [x_N0  x_N1  x_N2  …  x_Nd]

24

Another way to write the loss function. Question: what is the meaning of, for
example, the second column vector of the data matrix X (called the design
matrix)? What is the 10th row of X?

It is easy to prove that
L(β) = (1/2N) (t − Xβ)^T (t − Xβ)

Closed-Form Solution: the normal equation
β = (X^T X)^{−1} X^T t
(dimension check: X^T X is (d+1) × (d+1), and X^T t is (d+1) × 1.)

The normal equation is an analytical solution:
• Compared with gradient descent: no need to choose a learning rate α, and no
loop to run
• Can be slow when d is large

What if X^T X is non-invertible?

26

Reason: multicollinearity or redundant features. What are the rank and
determinant of X^T X then?
Solution: Drop one or more highly correlated features from the model, or
collect more data.

Reason: the number of features is too large, e.g. N ≪ d. What are the rank and
determinant of X^T X then?
Solution: Drop some features or collect more data; or add a "regularization"
term to the model.

(For real matrices X, rank(X^T X) = rank(X X^T) = rank(X) = rank(X^T).)

27

Gradient descent of linear regression with multiple features

• Start from some random initial values for all β_i;
• Keep updating all β_i (simultaneously) to decrease the loss function value L(β);
• Repeat until reaching a minimum (convergence).

L(β) = (1/2N) Σ_{n=1}^{N} (t_n − f(x_n, β))²
     = (1/2N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} + β_2 x_{n2} + ⋯ + β_d x_{nd} − t_n)²

Update simultaneously:
β_0 := β_0 − α ∂L(β)/∂β_0 = β_0 − α (1/N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} + ⋯ + β_d x_{nd} − t_n)
β_1 := β_1 − α ∂L(β)/∂β_1 = β_1 − α (1/N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} + ⋯ + β_d x_{nd} − t_n) x_{n1}
⋮
β_d := β_d − α ∂L(β)/∂β_d = β_d − α (1/N) Σ_{n=1}^{N} (β_0 + β_1 x_{n1} + ⋯ + β_d x_{nd} − t_n) x_{nd}

28

Gradient Descent in Matrix Form

• We can write the gradient descent for linear regression with multiple
features in matrix form
• The matrix form looks much simpler. First define
f(X, β) := (f(x_1, β), f(x_2, β), …, f(x_N, β))^T;
t = (t_1, t_2, …, t_N)^T;
∂L(β)/∂β := (∂L(β)/∂β_0, ∂L(β)/∂β_1, …, ∂L(β)/∂β_d)^T

Then it can be proved that
∂L(β)/∂β = (1/N) X^T (f(X, β) − t)

• Hence gradient descent is
β := β − α ∂L(β)/∂β = β − (α/N) X^T (f(X, β) − t)

29

Feature Scaling

Target: transform features to be on a similar scale.
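A sketch of such a transformation (standardising each column to mean 0 and variance 1; the two columns below reuse the Number and Nearest features from the franchise table):

```python
import numpy as np

# Number (x1) and Nearest (x2): two features on very different scales
X = np.array([[3203.0, 4.2],
              [2810.0, 2.8],
              [2890.0, 2.4],
              [3422.0, 3.3],
              [2687.0, 0.9],
              [3759.0, 2.9],
              [2341.0, 2.3],
              [3021.0, 1.7]])

# Mean normalisation: x_ij := (x_ij - mean_j) / s_j, applied column by column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))   # approximately zero in each column
print(X_scaled.std(axis=0))    # one in each column
```

The same column means and standard deviations computed on the training set should be reused to scale validation and test data.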
Results: faster convergence of the optimisation algorithm.

Example ranges: Number (x1): 1613 to 4214; Nearest (x2): 0.1 to 4.2; …

Mean Normalization
x_ij := (x_ij − x̄_j) / s_j
Goal: make all the features approximately mean 0 and variance 1.

30

Polynomial Regression

[Figure: scatter plot of Revenue (700 to 1400) against Age (0.0 to 16.0) with
fitted polynomial curves]

Question: Which model is the best model?

f(x, β) = β_0 + β_1 x_1 + β_2 x_1²
Here we added a quadratic feature x_2 = x_1², constructed from the first
feature. Similarly,
f(x, β) = β_0 + β_1 x_1 + β_2 x_1² + β_3 x_1³
f(x, β) = β_0 + β_1 x_1 + β_2 x_1² + β_3 x_1³ + β_4 x_1⁴

31

# fit a 4th order polynomial regression model
# (assumes numpy as np, matplotlib.pyplot as plt and
#  sklearn.linear_model.LinearRegression are already imported)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(4)
poly_4 = poly.fit_transform(np.reshape(x_data, (100, 1)))
lr_obj_4 = LinearRegression()
lr_obj_4.fit(poly_4, y_data)
print(lr_obj_4.intercept_)  # This is the intercept \beta_0 in our notation
print(lr_obj_4.coef_)

32

# plot the fitted 4th order polynomial regression line
x_temp = np.reshape(np.linspace(np.min(x_data), np.max(x_data), 50), (50, 1))
poly_temp0_4 = poly.fit_transform(x_temp)
y_temp = lr_obj_4.predict(poly_temp0_4)  # predict with lr_obj_4, the 4th order model
plt.plot(x_temp, y_temp)
plt.scatter(x_data, y_data, label="Observed Points", color="red")

Is this a better model?

33

Model selection

34

Model Selection and Assessment (Step 5)

Model Selection: estimate the performance of different models in order to
choose the (approximately) best one.
Model Assessment: having chosen the "best" model, estimate its prediction
error (generalization error) on new data
(Friedman et al., 2001).

In general, we divide the given dataset into three parts:
• Training dataset (60%): used to estimate a model (or models)
• Validation dataset (20%): used to select an appropriate model
• Test dataset (20%): used to assess the performance of the selected model.
This set of data should be hidden from the training and validation process.
Some academics use a 50%, 25%, 25% split.

35

Underfitting & Overfitting

[Figure: three fits of Revenue against Age. Left: underfitting (high bias, low
variance). Middle: the best model. Right: overfitting (low bias, high variance).]

36

Underfitting and Overfitting

Overfitting:
• Low training error, high generalization error
• Poor predictive performance
• Overreacts to minor fluctuations in the training data

How to address overfitting:
• Drop some features
  • Model selection algorithms
  • Manually select features to keep
• Regularization
  • Keep all features, but reduce the magnitude/values of the parameters

37

Training, validation and test sets

Odometer (x)  Price (t)
37.4          16.0
44.8          15.2
45.8          15.0
30.9          17.4
31.7          17.4
34.0          16.1
45.9          15.7
19.1          17.5
40.1          16.6
40.2          16.2

Training set: 60%; Validation set: 20%; Test set: 20%.

Cost function: the loss function
L(β) = (1/2N) Σ_{n=1}^{N} (t_n − f(x_n, β))²
is used for the training, validation and test sets respectively, giving
L_train(β), L_v(β) and L_test(β).

We cannot use the training set to select the best model: that is not a fair
competition, since the model minimises the loss on the training data, which
does not necessarily minimise it on new data sets.

• Optimise the parameters β employing the training set for each polynomial degree
• Find the polynomial degree d with the smallest error using the validation set
• Estimate the generalization error using the test set.
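A sketch of the 60/20/20 split using scikit-learn (the arrays below are random placeholders; `train_test_split` is applied twice because it only produces two pieces at a time):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # placeholder features
t = rng.normal(size=100)        # placeholder targets

# First carve off the 20% test set, which stays hidden from training/validation
X_rest, X_test, t_rest, t_test = train_test_split(X, t, test_size=0.2,
                                                  random_state=0)

# Then split the remaining 80% as 75/25, i.e. 60%/20% of the original data
X_train, X_val, t_train, t_val = train_test_split(X_rest, t_rest,
                                                  test_size=0.25,
                                                  random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 60 20 20
```

Fixing `random_state` makes the split reproducible, which matters when comparing models later.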
Polynomial Order Selection

Which set to use for which task?
• Training set: estimate the parameters
• Validation set: select the best model
• Test set: estimate the generalization error

Candidate models:
f(x, β) = β_0 + β_1 x_1 + β_2 x_1²
f(x, β) = β_0 + β_1 x_1 + β_2 x_1² + β_3 x_1³
f(x, β) = β_0 + β_1 x_1 + β_2 x_1² + β_3 x_1³ + β_4 x_1⁴

39

Diagnosing the Learning Curve

• Suppose your training error is low, while your validation/test error is
high. Is this an underfitting or an overfitting problem?
• Suppose your training error is high, and your validation/test error is also
high. Is this an underfitting or an overfitting problem?

Underfitting: training error is high, validation error is slightly > training error;
Overfitting: training error is low, validation error is significantly > training error.
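The order-selection procedure can be sketched with NumPy's polynomial fit standing in for the least-squares estimation (the data and splits below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a quadratic ground truth plus a little noise
x = np.linspace(0.0, 1.0, 30)
t = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.05, size=x.size)

x_train, t_train = x[::2], t[::2]   # toy training split
x_val, t_val = x[1::2], t[1::2]     # toy validation split

def validation_loss(degree):
    # Estimate the parameters on the training set for this polynomial degree...
    coeffs = np.polyfit(x_train, t_train, degree)
    # ...then score the fitted model on the validation set
    resid = np.polyval(coeffs, x_val) - t_val
    return np.mean(resid ** 2) / 2

losses = {deg: validation_loss(deg) for deg in range(1, 6)}
best_degree = min(losses, key=losses.get)
print(best_degree, losses[best_degree])
```

Degree 1 underfits the curved data (its validation loss stays high), while the quadratic and higher degrees track it closely; the generalization error of the chosen degree would then be estimated on a held-out test set.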

[Figure: training loss L_train(β) and validation loss L_v(β) plotted against
model complexity]

40

Diagnosing the Learning Curve

The impact of training set size on the loss function:

[Figure: learning curve; training loss L_train(β) and validation loss L_v(β)
plotted against training set size N]

Why is this?

41

Cross Validation

42

K-Fold Cross-Validation

 If we had enough data, we would set aside a validation set and use it to
assess the performance of our prediction model

 However, data are often scarce, so this is usually not possible,
particularly when we do not have sufficient labelled data

 K-fold cross-validation (CV) uses part of the available data to fit the
model and a different part to test it, then iterates/repeats this process

43

5-fold Cross-Validation

The original sample is randomly partitioned into K equal-sized subsamples.
At each iteration, one of the K subsamples is retained as the validation data
for testing the model, and the remaining K−1 subsamples are used as training
data.

K = 5

Fold:        1          2        3        4        5
Iteration 1: Validation Training Training Training Training

Specifically, at iteration 1:
Data set 1 is chosen as the validation set, and sets 2, 3, 4, 5 are chosen as
training sets. Estimate the parameters β_1 using the training sets and
calculate the validation error Lv(β_1).

44

At iteration 2:
Data set 2 is chosen as the validation set, and sets 1, 3, 4, 5 (K−1 sets) are
chosen as training sets. Estimate the parameters β_2 using the training sets
and calculate the validation error Lv(β_2).

Fold:        1        2          3        4        5
Iteration 2: Training Validation Training Training Training

Repeat until iteration K = 5. Estimate the parameters β_5 using the training
sets and calculate the validation error Lv(β_5).

Output the mean validation error (Lv(β_1) + Lv(β_2) + … + Lv(β_5))/5 over the
validation sets, and select the model that generates the smallest error.
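The whole procedure can be sketched directly in NumPy, with a least-squares linear model standing in for the learner (data below are illustrative):

```python
import numpy as np

def k_fold_cv_error(X, t, K=5, seed=0):
    # Randomly partition the N indices into K (roughly) equal-sized subsamples
    idx = np.random.default_rng(seed).permutation(len(t))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        val = folds[k]                                 # fold k held out for validation
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Estimate beta_k on the K-1 training folds by least squares
        beta, *_ = np.linalg.lstsq(X[train], t[train], rcond=None)
        resid = X[val] @ beta - t[val]
        errors.append(np.mean(resid ** 2) / 2)         # Lv(beta_k)
    return np.mean(errors)                             # (Lv(beta_1)+...+Lv(beta_K))/K

# Illustrative data: t = 1 + 2x plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 50)
X = np.column_stack([np.ones_like(x), x])   # column of ones for the intercept
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, 50)

print(k_fold_cv_error(X, t, K=5))
```

Running this for each candidate model and keeping the one with the smallest mean validation error is exactly the selection rule above.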

45

Cross-Validation potential issues:
 • Computational cost: you must train each model K times.
 • The K training sets (and hence the trained models) are highly
correlated (see, e.g., Bengio Y & Grandvalet Y (2004)).

46

Scikit-learn Workflow

See Lecture02_Example01.py and Lecture02_Example02.py

 Python scikit-learn package provides facilities for most popular machine
learning algorithms

 The best way to learn how to use scikit-learn functionalities is to learn
from examples and to read the user guide

 Workflow:
 A typical machine learning task starts with data preparation: For example,

loading data from databases/files (e.g. using pandas), data cleaning,
feature extraction, feature scaling, dimensionality reduction, etc.; some
of these can be done with scikit-learn, while some rely on other packages

 Following data preparation there will be a step to define a machine
learning model, for example, linear regression etc.

 Scikit-learn introduces the concept of pipeline that chains all the steps in a
linear sequence and automates the cross-validation

 Read examples here: https://machinelearningmastery.com/automate-machine-learning-workflows-pipelines-python-scikit-learn/
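A minimal sketch of such a pipeline, chaining feature scaling and a linear model and cross-validating the whole chain (the data below are synthetic placeholders):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: one feature, a linear relationship plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 50.0, size=(60, 1))
t = 17.0 - 0.07 * X[:, 0] + rng.normal(0.0, 0.3, size=60)

# The pipeline chains the steps; during CV the scaler is re-fitted inside each
# training fold, so no information leaks from the validation fold
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LinearRegression())])

scores = cross_val_score(pipe, X, t, cv=5)   # default regression scoring: R^2
print(scores.mean())
```

Scaling inside the pipeline, rather than once on the full dataset, is what keeps the cross-validation honest.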
