Slide 1
BUSINESS SCHOOL
Discipline of Business Analytics
QBUS6850 Team
2
Topics covered
Logistic regression
Intuition of regularization
Regularized Linear Regressions: Ridge, LASSO, Elastic Net
Feature Extraction
References
Bishop (2006), Chapters 3.1.4 and 4.3.2
Hastie et al. (2001), Chapters 3.4 and 7.7-7.10
James et al. (2014), Chapters 4.3 and 6.2
3
Learning Objectives
Understand the intuition of regularization
Review how Ridge regression, LASSO regression and Elastic Net work
Understand the differences between various regularized regressions
Understand the difference between regression and classification
Be able to calculate the predicted probability with logistic regression
Understand different types of features and feature extraction
Be able to extract features from text data
Understand cross validation and be able to conduct cross validation
4
Classification
Linear regression was used as an example to show the machine learning workflow.
The major ingredients are data, a model, and a criterion (objective). What are they in linear regression?
In regression, the target $t$ was a continuous numeric value; however, in many applications we wish to predict a class instead of an amount.
When the target is categorical, we call the task classification.
For classification, how shall we choose a model? How shall we design a criterion to measure the "error" between the observation and the model prediction?
We will look at logistic regression as an example.
5
Supervised learning with categorical response: classification
Two classes. Can linear regression be used for classification?
[Figure: a linear fit of "default or not" ($t = 1$ or $0$) against annual income. Higher income: less likely to default; lower income: more likely to default.]
If $f(\mathbf{x}, \boldsymbol{\beta}) \ge 0.5$, predict $t = 1$; if $f(\mathbf{x}, \boldsymbol{\beta}) < 0.5$, predict $t = 0$.
The income threshold is known: $50,000.
Presenter note: can we change the target values "0 and 1", e.g., to "1 and -1"?
6
Supervised learning with categorical response
Two classes. Can linear regression be used for classification?
[Figure: the same "default or not" vs. annual income plot with one added outlier. The single outlier significantly changes the income threshold (from $50k to $65k).]
Do we really need to change the classification rule/boundary?
7
Logistic Regression
8
Re-coding Target
In fact, there is no numeric target value in classification. We manually encode it as 1 (class A) or 0 (class B).
Can we re-code "Class A" as (1, 0) and "Class B" as (0, 1)? That is, for each case we have two target values, i.e., the target is a vector $\mathbf{t} = (t_1, t_2)$ [recall our notation from Lecture 2, where $m = 2$].
Hence we shall have two models, $f_A(\mathbf{x}, \boldsymbol{\beta}) \to t_1$ and $f_B(\mathbf{x}, \boldsymbol{\beta}) \to t_2$.
How shall we measure the error between them? [We need a learning criterion, or objective.]
It seems that, for our coding (1, 0) and (0, 1) for classes A and B, we have $t_1 + t_2 = 1$ and $0 \le t_1 \le 1$, $0 \le t_2 \le 1$.
Can we say $\mathbf{t} = (t_1, t_2)$ is a Bernoulli distribution with parameter $t_1$?
Hence, each training target (class A or class B) becomes an "extreme" Bernoulli distribution, either (1, 0) or (0, 1).
9
Developing New Objective
Hence, each training target (class A or class B) becomes an "extreme" Bernoulli distribution, either (1, 0) or (0, 1).
The two models $f_A(\mathbf{x}, \boldsymbol{\beta}) \to t_1$ and $f_B(\mathbf{x}, \boldsymbol{\beta}) \to t_2$ shall aim to predict the Bernoulli parameter; that is, $f_A(\mathbf{x}, \boldsymbol{\beta})$ should be the probability that case $\mathbf{x}$ is class A, and $f_B(\mathbf{x}, \boldsymbol{\beta})$ should be the probability that case $\mathbf{x}$ is class B.
Three conditions: $0 \le f_A(\mathbf{x}, \boldsymbol{\beta}) \le 1$, $0 \le f_B(\mathbf{x}, \boldsymbol{\beta}) \le 1$, and $f_A(\mathbf{x}, \boldsymbol{\beta}) + f_B(\mathbf{x}, \boldsymbol{\beta}) = 1$.
That is, $(f_A(\mathbf{x}, \boldsymbol{\beta}), f_B(\mathbf{x}, \boldsymbol{\beta}))$ is a Bernoulli distribution for each case $\mathbf{x}$ too.
Suppose we already have models satisfying the above conditions; the question now becomes how we tell whether the Bernoulli $(f_A(\mathbf{x}, \boldsymbol{\beta}), f_B(\mathbf{x}, \boldsymbol{\beta}))$ is close to the Bernoulli (1, 0) [if $\mathbf{x}$ is class A] or the Bernoulli (0, 1) [if $\mathbf{x}$ is class B].
10
Cross Entropy Objective
Our simple example has demonstrated that simply measuring the squared error,
$$\big(f_A(\mathbf{x}, \boldsymbol{\beta}) - t_1\big)^2 + \big(f_B(\mathbf{x}, \boldsymbol{\beta}) - t_2\big)^2,$$
is not a good way.
Although a Bernoulli distribution is represented by a 2D vector, such vectors are special: both components are between 0 and 1, and they sum to 1.
To measure the "distance" between distributions, we use either the so-called Kullback-Leibler divergence or the so-called cross entropy. For the two Bernoulli distributions $(f_A(\mathbf{x}, \boldsymbol{\beta}), f_B(\mathbf{x}, \boldsymbol{\beta}))$ and $\mathbf{t} = (t_1, t_2)$, the cross entropy is defined as
$$-t_1 \log\big(f_A(\mathbf{x}, \boldsymbol{\beta})\big) - t_2 \log\big(f_B(\mathbf{x}, \boldsymbol{\beta})\big).$$
For all the data we have
$$L(\boldsymbol{\beta}) = -\frac{1}{N} \sum_{n=1}^{N} \big[\, t_{n1} \log f_A(\mathbf{x}_n, \boldsymbol{\beta}) + t_{n2} \log f_B(\mathbf{x}_n, \boldsymbol{\beta}) \,\big]$$
11
Clarification
Don't confuse a number of things here.
The two classes A and B:
- can be labelled as 1 and 0 respectively, so we use the target value t = 1 or 0;
- can be encoded as (1, 0) and (0, 1) respectively, regarded as a one-hot code or a Bernoulli distribution. We can then focus on the first component, 1 or 0 respectively [this is not a label, but a probability].
In both cases, we can simply use one target variable (not a vector), $t$, which takes the value 1 (for class A) or 0 (for class B).
Similarly, we only need to focus on $f_A(\mathbf{x}, \boldsymbol{\beta})$, because $f_B(\mathbf{x}, \boldsymbol{\beta}) = 1 - f_A(\mathbf{x}, \boldsymbol{\beta})$. We simply write $f(\mathbf{x}, \boldsymbol{\beta})$ for $f_A(\mathbf{x}, \boldsymbol{\beta})$, i.e., we need only one model.
Finally, the loss is defined as
$$L(\boldsymbol{\beta}) = -\frac{1}{N} \sum_{n=1}^{N} \big[\, t_n \log f(\mathbf{x}_n, \boldsymbol{\beta}) + (1 - t_n) \log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big) \,\big]$$
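As a quick sanity check on this formula, here is a minimal NumPy sketch of the cross-entropy loss; the target and prediction values below are made up purely for illustration.

import numpy as np

# Hypothetical targets t_n (0/1) and model outputs f(x_n, beta).
t = np.array([1, 0, 1, 1])
f_x = np.array([0.9, 0.2, 0.6, 0.8])

# L(beta) = -(1/N) * sum_n [ t_n*log(f) + (1 - t_n)*log(1 - f) ]
loss = -np.mean(t * np.log(f_x) + (1 - t) * np.log(1 - f_x))
print(loss)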
12
Logistic Function (also called the sigmoid function)
How can we make a model satisfy $0 \le f(\mathbf{x}, \boldsymbol{\beta}) \le 1$?
import numpy as np
import matplotlib.pyplot as plt
x_list = np.linspace(-6, 6, 100)
y_list = 1 / (1 + np.exp(-x_list))   # sigmoid: 1 / (1 + e^(-z))
plt.plot(x_list, y_list)
plt.show()
Logistic function:
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}, \qquad z \in (-\infty, +\infty), \quad \sigma(z) \in (0, 1)$$
13
Logistic Regression
Regression + Logistic Function
If $\mathbf{x}^T\boldsymbol{\beta} \ge 0$, then $f(\mathbf{x}, \boldsymbol{\beta}) \ge 0.5$: predict class A.
If $\mathbf{x}^T\boldsymbol{\beta} < 0$, then $f(\mathbf{x}, \boldsymbol{\beta}) < 0.5$: predict class B.
Question: why is this function better? Think about the "outlier" from a few slides before.
$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{e^z + 1}, \qquad f(\mathbf{x}, \boldsymbol{\beta}) = \sigma(\mathbf{x}^T\boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^T\boldsymbol{\beta}}}$$
14
Output Interpretation
$f(\mathbf{x}, \boldsymbol{\beta})$ tells us the estimated probability that a given input $\mathbf{x}$ belongs to class A, parameterized by $\boldsymbol{\beta}$:
$$P(t = A \mid \mathbf{x}, \boldsymbol{\beta}) := f(\mathbf{x}, \boldsymbol{\beta}), \qquad P(t = B \mid \mathbf{x}, \boldsymbol{\beta}) := 1 - P(t = A \mid \mathbf{x}, \boldsymbol{\beta}) = 1 - f(\mathbf{x}, \boldsymbol{\beta})$$
Suppose one customer $\mathbf{x}_i$ has an annual income of $120,000 and $f(\mathbf{x}_i, \boldsymbol{\beta}) = 0.1 = 10\%$.
The risk management team of the bank can then say that this customer has a 10% probability of default (class A).
This information is crucial for the bank's decision making!
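A minimal scikit-learn sketch of this interpretation; the income figures and default labels below are made up for illustration and are not the lecture's dataset.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: annual income (in $1000s) and default indicator (1 = default).
X = np.array([[20], [35], [50], [65], [80], [120]])
t = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression()
model.fit(X, t)

# Estimated probability of default for a customer earning $120,000.
x_new = np.array([[120]])
print(model.predict_proba(x_new))   # columns: P(t=0 | x), P(t=1 | x)
print(model.predict(x_new))         # class prediction using the 0.5 threshold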
15
Decision Boundary
If $x_1 + x_2 - 5 \ge 0$, predict class A ($t = 1$); if $x_1 + x_2 - 5 < 0$, predict class B ($t = 0$).
How many features do we have? Are the data labelled or not?
[Figure: scatter of the two classes in the $(x_1, x_2)$ plane, where $x_1$ and $x_2$ are the 1st and 2nd features; the $t = 1$ and $t = 0$ points are separated by the line through $(5, 0)$ and $(0, 5)$.]
The decision boundary is where
$$f(\mathbf{x}, \boldsymbol{\beta}) = \sigma(\mathbf{x}^T\boldsymbol{\beta}) = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2) = \sigma(-5 + x_1 + x_2) = 0.5$$
16
Non-linear Decision Boundary
If $x_1^2 + x_2^2 - 5 \ge 0$, predict $t = 1$; if $x_1^2 + x_2^2 - 5 < 0$, predict $t = 0$.
What does the decision boundary look like?
[Figure: scatter of the two classes in the $(x_1, x_2)$ plane; the boundary is the circle $x_1^2 + x_2^2 = 5$.]
$$f(\mathbf{x}, \boldsymbol{\beta}) = \sigma(\mathbf{x}^T\boldsymbol{\beta}) = \sigma(\beta_0 + \beta_1 x_1^2 + \beta_2 x_2^2) = \sigma(-5 + x_1^2 + x_2^2) = 0.5$$
17
Loss Function (Formal)
$N$: number of training examples
$d$: number of features
$\mathbf{x}$: "input" variable (features)
$t$: "output" or "target" variable, $t \in \{0, 1\}$
$$\mathcal{D} = \{(\mathbf{x}_1, t_1), (\mathbf{x}_2, t_2), (\mathbf{x}_3, t_3), \ldots, (\mathbf{x}_N, t_N)\}$$
Once again, collect all the inputs into a matrix $\mathbf{X}$ of size $N \times (d+1)$, and define the parameter vector $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_d)^T$. [Note: unlike $t$, the model's linear output $\mathbf{x}^T\boldsymbol{\beta}$ is not between 0 and 1.]
$$\mathrm{Loss}\big(f(\mathbf{x}_n, \boldsymbol{\beta}), t_n\big) = \begin{cases} -\log f(\mathbf{x}_n, \boldsymbol{\beta}), & t_n = 1 \\ -\log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big), & t_n = 0 \end{cases}$$
$$L(\boldsymbol{\beta}) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Loss}\big(f(\mathbf{x}_n, \boldsymbol{\beta}), t_n\big)$$
18
Logistic Regression Loss Function
$$f(\mathbf{x}, \boldsymbol{\beta}) = \sigma(\mathbf{x}^T\boldsymbol{\beta}) = \frac{1}{1 + e^{-\mathbf{x}^T\boldsymbol{\beta}}}$$
21
Loss Function: Compact Representation
This loss function can be derived statistically using a methodology called Maximum Likelihood Estimation (MLE).
$$\mathrm{Loss}\big(f(\mathbf{x}_n, \boldsymbol{\beta}), t_n\big) = \begin{cases} -\log f(\mathbf{x}_n, \boldsymbol{\beta}), & t_n = 1 \\ -\log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big), & t_n = 0 \end{cases}$$
$$L(\boldsymbol{\beta}) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Loss}\big(f(\mathbf{x}_n, \boldsymbol{\beta}), t_n\big) = -\frac{1}{N} \sum_{n=1}^{N} \big[\, t_n \log f(\mathbf{x}_n, \boldsymbol{\beta}) + (1 - t_n) \log\big(1 - f(\mathbf{x}_n, \boldsymbol{\beta})\big) \,\big]$$
22
Logistic Regression Summary
Logistic regression is a special case of Generalized Linear Models (GLM); the logit function (the inverse of the sigmoid) is the link function.
Many respectable numerical packages, e.g., sklearn.linear_model, contain GLM implementations which include logistic regression.
23
Regularization Intuition
(QBUS6810)
24
What Have We Learnt?
Supervised learning with continuous response: regression with single or multiple features.
[Figure: Revenue vs. Age with two fitted polynomial curves.]
Just right: $f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2$
Overfitting: $f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 x_1^4$
25
Regularization
Can we have a way to penalize the parameters $\beta_3$ and $\beta_4$ so that they are close to 0, and the model
$$f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3 + \beta_4 x_1^4$$
is approximately
$$f(\mathbf{x}, \boldsymbol{\beta}) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2\,?$$
One way is to add penalty terms to the loss:
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2 + \lambda \beta_3^2 + \lambda \beta_4^2$$
If $\lambda$ were very large, e.g., 10,000, then the parameters $\beta_3$ and $\beta_4$ would be heavily penalized, i.e., driven close to 0.
26
Regularized Linear Regressions
(QBUS6810)
27
Ridge Regression
The ridge regression estimator is the minimiser of the cost function with a quadratic regularization term:
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2 + \frac{\lambda}{2N} \sum_{j=1}^{d} \beta_j^2$$
$\lambda \ge 0$ is a regularization parameter which regulates the tradeoff (i.e., regulates model complexity).
The penalty term does not include the intercept $\beta_0$.
When $\lambda = 0$, we have the ordinary linear regression cost function; $\lambda = 0$ corresponds to the greatest complexity (bias is at a minimum, but variance is high).
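A short scikit-learn sketch of ridge regression; note that sklearn's Ridge calls the regularization parameter alpha (playing the role of $\lambda$ here), and the data below are made up.

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical data: a noisy quadratic relationship between x and t.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
t = 3.0 + 2.0 * x - 0.2 * x**2 + rng.normal(0, 1.0, size=x.shape)

# Polynomial features up to degree 4 (candidates for shrinkage).
X = np.column_stack([x, x**2, x**3, x**4])

ridge = Ridge(alpha=1.0)   # alpha plays the role of lambda
ridge.fit(X, t)
print(ridge.intercept_, ridge.coef_)   # the intercept is fitted but not penalised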
28
The penalty term penalises departures of the regression parameters from zero, i.e., it shrinks them toward zero.
Ridge regression cannot zero out a specific coefficient: the model either ends up including all the coefficients, or none of them.
Small $\lambda$: no or low regularization; the fit can follow a high-order polynomial or otherwise complex model.
Large $\lambda$: high regularization; $\boldsymbol{\beta}$ will be small, and if $\lambda$ is very, very large, the model becomes a horizontal line through the data.
[Figures: Revenue vs. Age fits, a flexible curve under small $\lambda$ and a nearly flat line under large $\lambda$.]
29
Learning Curve
[Figure: training loss $L_{\text{train}}(\boldsymbol{\beta})$ and validation loss $L_v(\boldsymbol{\beta})$ plotted against $\lambda$. Small $\lambda$ tends towards overfitting (high variance); large $\lambda$ tends towards underfitting (high bias); the best model is at the $\lambda$ with the smallest validation loss.]
This loss function is used to estimate the model:
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2 + \frac{\lambda}{2N} \sum_{j=1}^{d} \beta_j^2$$
The training and validation losses are
$$L_{\text{train}}(\boldsymbol{\beta}) = \frac{1}{2N_{tr}} \sum_{n=1}^{N_{tr}} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2, \qquad L_v(\boldsymbol{\beta}) = \frac{1}{2N_v} \sum_{n=1}^{N_v} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2$$
Presenter note: when $\lambda$ is large, $\boldsymbol{\beta}$ becomes smaller, so the training loss gets larger. The validation loss, however, first goes down and then gets larger. The best model is the one with the smallest validation error.
30
How to Choose λ?
Test a large number of different $\lambda$ values, e.g., 10,000 values between 0.0001 and 100, denoted by $\lambda_j$ ($j = 1, 2, 3, \ldots, 10000$). For each $\lambda_j$, minimise $L(\boldsymbol{\beta})$ and record the validation loss:

λ        Validation loss
0.0001   estimated
0.001    estimated
0.01     estimated
0.02     estimated
0.04     estimated (smallest)
...
100      estimated

Pick the $\lambda$ with the smallest validation loss and use that model for the test set.
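A sketch of this search using a single train/validation split; a shorter np.logspace grid stands in for the 10,000 values, and the data are made up for illustration.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical data.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
t = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(0, 0.5, size=150)

X_tr, X_v, t_tr, t_v = train_test_split(X, t, test_size=0.3, random_state=0)

# Grid of candidate lambda values (sklearn calls the parameter alpha).
lambdas = np.logspace(-4, 2, 100)
val_losses = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_tr, t_tr)
    val_losses.append(mean_squared_error(t_v, model.predict(X_v)))

best_lambda = lambdas[int(np.argmin(val_losses))]
print(best_lambda)   # use the model with this lambda on the test set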
31
Ridge Regression Gradient Descent
• Have some random starting points for all $\beta_i$;
• Keep updating all $\beta_i$ (simultaneously) to decrease the loss function value $L(\boldsymbol{\beta})$;
• Repeat until achieving the minimum (convergence).
Update simultaneously (partial derivative calculation omitted):
$$\beta_0 := \beta_0 - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_0} = \beta_0 - \alpha \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_d x_{nd} - t_n)$$
$$\beta_1 := \beta_1 - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_1} = \beta_1 - \alpha \left[ \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_d x_{nd} - t_n)\, x_{n1} + \frac{\lambda}{N} \beta_1 \right]$$
$$\vdots$$
$$\beta_d := \beta_d - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \beta_d} = \beta_d - \alpha \left[ \frac{1}{N} \sum_{n=1}^{N} (\beta_0 + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_d x_{nd} - t_n)\, x_{nd} + \frac{\lambda}{N} \beta_d \right]$$
In vector form:
$$\boldsymbol{\beta} := \boldsymbol{\beta} - \alpha \frac{\partial L(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \boldsymbol{\beta} - \alpha \left[ \frac{1}{N} \mathbf{X}^T \big(f(\mathbf{X}, \boldsymbol{\beta}) - \mathbf{t}\big) + \frac{\lambda}{N} \boldsymbol{\beta} \right]$$
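A minimal NumPy sketch of the vectorized update above (a sketch, not the course's reference implementation); following the component-wise updates, the intercept $\beta_0$ is left unpenalised, and the data are made up.

import numpy as np

def ridge_gradient_descent(X, t, lam, alpha=0.01, n_iters=1000):
    """Gradient descent for ridge regression.
    X is N x (d+1) with a leading column of ones; the intercept is not penalised."""
    N, p = X.shape
    beta = np.zeros(p)                      # starting point
    for _ in range(n_iters):
        residual = X @ beta - t             # f(X, beta) - t
        grad = X.T @ residual / N           # (1/N) X^T (f(X, beta) - t)
        penalty = (lam / N) * beta
        penalty[0] = 0.0                    # do not penalise beta_0
        beta = beta - alpha * (grad + penalty)
    return beta

# Hypothetical data for illustration.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 3))
t = 2.0 + X_raw @ np.array([1.0, -0.5, 0.3]) + rng.normal(0, 0.1, size=100)
X = np.column_stack([np.ones(100), X_raw])  # add the column of ones
print(ridge_gradient_descent(X, t, lam=1.0))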
32
The LASSO
Least Absolute Shrinkage and Selection Operator.
LASSO does both parameter shrinkage and variable selection automatically: some coefficients are forced to zero as $\lambda$ increases (effectively a subset selection).
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2 + \frac{\lambda}{2N} \sum_{j=1}^{d} |\beta_j|$$
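A scikit-learn sketch showing the selection effect; the data are made up so that only two of the five features matter, and sklearn's alpha again plays the role of $\lambda$.

import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical data where only features 0 and 3 are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
t = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.1)
lasso.fit(X, t)
print(lasso.coef_)   # coefficients of irrelevant features are driven to exactly 0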
33
Elastic Net
Elastic net is a regularized regression method that linearly combines the penalties of the lasso and ridge methods:
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2 + \frac{\lambda_1}{2N} \sum_{j=1}^{d} |\beta_j| + \frac{\lambda_2}{2N} \sum_{j=1}^{d} \beta_j^2$$
Note that, due to the shared L1/L2 regularisation, Elastic Net does not prune features as aggressively as LASSO. In practice it often performs well when used for regression prediction.
See Lecture03_Example01.py.
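A corresponding scikit-learn sketch; ElasticNet parameterises the combined penalty through alpha (overall strength) and l1_ratio (the L1/L2 mix) rather than through $\lambda_1$ and $\lambda_2$ directly, and the data below are made up.

import numpy as np
from sklearn.linear_model import ElasticNet

# Hypothetical data of the same shape as the LASSO example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
t = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(0, 0.1, size=200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5)   # 0.5 = equal mix of L1 and L2
enet.fit(X, t)
print(enet.coef_)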
34
CV with Regularization
• Suppose we have 150 data points and wish to find a better ridge regression, that is, to find an appropriate $\lambda$ for the ridge regression model.
• Under K-fold CV, we are going to test a large number of different $\lambda$ values, e.g., 10,000 values between 0.0001 and 100, denoted by $\lambda_j$ ($j = 1, 2, 3, \ldots, 10000$).
• Divide (randomly) the 150 data points into 5 groups, each with 30 data points.
• Run 5-fold cross validation for each $\lambda_j$.
• For each $\lambda_j$, output the mean validation error $\big(L_v(\boldsymbol{\beta}_1) + L_v(\boldsymbol{\beta}_2) + \cdots + L_v(\boldsymbol{\beta}_5)\big)/5$ over the validation sets, and select the model with the $\lambda$ that generates the least error, say $\lambda_{151}$.
• Then build the final model with $\lambda_{151}$ by minimising
$$L(\boldsymbol{\beta}) = \frac{1}{2N} \sum_{n=1}^{N} \big(t_n - f(\mathbf{x}_n, \boldsymbol{\beta})\big)^2 + \frac{\lambda_{151}}{2N} \sum_{j=1}^{d} \beta_j^2$$
This process can be used with LASSO and Elastic Net as well.
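A sketch of this procedure with scikit-learn's cross_val_score (150 made-up data points, 5 folds; the $\lambda$ grid is shortened for readability).

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical 150 data points.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
t = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=150)

lambdas = np.logspace(-4, 2, 50)
mean_val_errors = []
for lam in lambdas:
    # 5-fold CV: sklearn returns negative MSE, so negate it to get the validation error.
    scores = cross_val_score(Ridge(alpha=lam), X, t,
                             cv=5, scoring='neg_mean_squared_error')
    mean_val_errors.append(-scores.mean())

best_lam = lambdas[int(np.argmin(mean_val_errors))]
final_model = Ridge(alpha=best_lam).fit(X, t)   # rebuild on all data with the chosen lambda
print(best_lam)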
35
Appropriate K in CV?
The special case K = N is known as leave-one-out (LOO) cross-validation.
With K = N, the cross-validation estimator is approximately unbiased for the true (expected) prediction error, but it can have high variance because the N "training sets" are so similar to one another.
The computational burden is also considerable: LOO requires N applications of the learning method.
On the other hand, with K = 5, say, cross-validation has lower variance, but bias could be a problem, depending on how the performance of the learning method varies with the size of the training set.
Overall, five-fold or ten-fold cross-validation is recommended as a good compromise.
36
Feature Extraction and Representation
Processing Features
37
So far we have assumed that data come to us in good shape: most are in numeric format, possibly with some categorical variables.
In Python machine learning, we normally organise data into a matrix (or multidimensional array).
However, data coming from application domains can arrive in many forms.
For business applications, we may have data in the form of text (natural language), or media such as audio and video.
Numeric data are easy to deal with and can be sent to a machine learning algorithm straight away.
What do real data look like?
38
Categorical Features
39
In raw data, categorical features are represented by strings, for example "Red", "Green", "Blue", "Yellow" and "White".
These are not directly suitable for machine learning; we need to engineer them into numbers.
Label representation: for example, map "Red" to 4, "Green" to 3, "Blue" to 2, "Yellow" to 1 and "White" to 0.
One-hot encoding: represent each value as a 0/1 indicator vector.
Use scikit-learn to make this transform, or pandas' get_dummies: see Lecture03_Example02.py.
Transforming Categorical Features in scikit-learn
40
Encoding categorical features: a categorical feature can be converted to a one-hot coding with OneHotEncoder.
Features can also be loaded from dicts (see the scikit-learn documentation).
The number of features increases after the transform (e.g., from 3 to 9, or from 2 to 4, in the scikit-learn documentation examples).
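A small sketch of one-hot encoding with scikit-learn and pandas; the colour values reuse the examples above.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colours = pd.DataFrame({'colour': ['Red', 'Green', 'Blue', 'Yellow', 'White', 'Red']})

# scikit-learn: one column per category, a single 1 per row.
enc = OneHotEncoder()
X_onehot = enc.fit_transform(colours[['colour']])   # sparse matrix by default
print(X_onehot.toarray())
print(enc.categories_)

# pandas equivalent.
print(pd.get_dummies(colours['colour']))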
Ordinal Features
41
In raw data, ordinal features are represented by strings, for example "Strongly Agree", "Agree", "Neutral", "Disagree" and "Strongly Disagree".
The order information is important for modelling. Most of the time we can encode such a feature as integers, e.g., "Strongly Agree" as 5, "Agree" as 4, "Neutral" as 3, "Disagree" as 2 and "Strongly Disagree" as 1.
LabelEncoder in scikit-learn can be used to transform such non-numerical labels into numerical labels (see the scikit-learn documentation).
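A sketch of encoding an ordinal feature. Note that LabelEncoder assigns integers in alphabetical order of the labels, so when the semantic order matters an explicit mapping is often clearer; the mapping below is an assumption for illustration.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

responses = pd.Series(['Agree', 'Strongly Agree', 'Neutral', 'Disagree', 'Strongly Disagree'])

# LabelEncoder: integer codes, but in alphabetical order of the labels.
le = LabelEncoder()
print(le.fit_transform(responses))

# Explicit mapping preserving the intended order (5 = Strongly Agree, ..., 1 = Strongly Disagree).
order = {'Strongly Agree': 5, 'Agree': 4, 'Neutral': 3, 'Disagree': 2, 'Strongly Disagree': 1}
print(responses.map(order))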
Bucketized Feature (in Tensorflow)
42
Sometimes it is more meaningful to convert numbers into numerical ranges.
Thus we engineer some numeric features into categorical features, and then convert them into one-hot codings or labels.
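The slide refers to TensorFlow's bucketized columns; the same idea can be sketched in pandas with pd.cut (the age bins below are arbitrary, chosen only for illustration).

import pandas as pd

ages = pd.Series([5, 17, 23, 35, 46, 61, 78])

# Convert a numeric feature into categorical ranges (buckets), then one-hot encode the buckets.
buckets = pd.cut(ages, bins=[0, 18, 35, 60, 120],
                 labels=['child', 'young', 'middle', 'senior'])
print(buckets)
print(pd.get_dummies(buckets))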
Feature Hashing
43
If a categorical feature has a huge number of different values, then its one-hot coding is a long (sparse) vector.
Instead of using a long 0-1 vector, we use a hash function which calculates a hash code, forcing the different input values into a smaller set of categories.
Collision: both "kitchenware" and "sports" may be mapped to the same value.
from sklearn.feature_extraction import FeatureHasher
raw_X = [['kitchenware'], ['sports']]    # hypothetical raw categorical values
hasher = FeatureHasher(input_type='string')
X = hasher.transform(raw_X)              # sparse matrix of hashed features
44
Bag-of-Words (BoW)
In business intelligence, we analyse texts such as business plans, business reports, and even news.
Machine learning algorithms cannot work with raw text directly; the text must be converted into numbers.
The BoW representation of text describes the occurrence of words within a document.
For example, suppose we have a vocabulary of 1000 words, and a document in which "ours" appears 3 times, "competition" appears 1 time and "managers" appears 10 times.
We represent this document as a vector of dimension 1000 (each component of the vector corresponds to a word in the vocabulary), such that 3 is in the position corresponding to "ours", 1 in the position corresponding to "competition" and 10 in the position corresponding to "managers". All other positions have value 0, so the vector looks like
$$\mathbf{x} = (0, \ldots, 0, 3, 0, \ldots, 0, 1, 0, \ldots, 10, 0, \ldots, 0)^T \in \mathbb{R}^{1000},$$
which is sparse. Why?
45
Text Feature Extraction
Scikit-learn can extract numerical features from text content, for example by counting the occurrences of tokens in each document.
A corpus of documents can thus be represented by a matrix with one row per document and one column per token (e.g., word) occurring in the corpus.
See Lecture03_Example03.py.
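A sketch of the count-based (BoW) representation with scikit-learn's CountVectorizer; the three tiny documents are made up (Lecture03_Example03.py is the course's own example).

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "our managers met the competition",
    "managers report to managers",
    "the competition is ours",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)      # one row per document, one column per token
print(vectorizer.get_feature_names_out())
print(X.toarray())                        # word counts; most entries are 0 (sparse)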
46
tf-idf Term Weighting
tf-idf(t, d) is defined as
$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
tf(t, d) (term frequency) is the frequency of term t in document d, i.e., the number of occurrences of t in d.
idf(t) (inverse document frequency) is a measure of how much information the word t provides, that is, whether the term is common or rare across all documents:
$$\text{idf}(t) = \log \frac{\text{total number of documents in the corpus}}{\text{number of documents containing term } t}$$
There are other definitions for these two "frequencies".
How is this done in scikit-learn? See Lecture03_Example04.py.
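The corresponding weighting in scikit-learn is a short sketch below; note that sklearn's TfidfVectorizer uses a smoothed variant of the idf formula above by default, and the documents are made up.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "our managers met the competition",
    "managers report to managers",
    "the competition is ours",
]

tfidf = TfidfVectorizer()                 # smoothed idf by default
X = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(X.toarray().round(2))               # tf-idf weights instead of raw counts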
47
Embedding Representation
Both one-hot encoding and BoW may produce feature representations that are high-dimensional and sparse.
High dimensionality results in the so-called curse of dimensionality in many machine learning algorithms.
As those representations are actually sparse, a natural question is whether we can find a compact format for them.
So-called embedding learning, or dimensionality reduction, can achieve this goal.
48
Principal Component Analysis (PCA)
Principal component analysis seeks a space of lower dimensionality, known as the principal subspace (shown as the magenta line in the figure), such that the orthogonal projection of the data points (red dots) onto this subspace maximizes the variance of the projected points (green dots).
How can we find this line? Can we do this by linear regression?
49
Principal Component Analysis (PCA)
Objective: given a set of $d$ measurements on $N$ individuals, we aim to determine $r \le d$ orthogonal (uncorrelated) variables, called principal components, defined as linear combinations of the original ones.
The PCs are uncorrelated and have decreasing variance.
Synthesis: dimensionality reduction of the information.
Interpretation: express the original data in terms of a reduced number of underlying variables (factors).
Score the individual profiles with a summary score.
Obtain multivariate displays (scatterplots) of the units in two or three dimensions.
The first component is designed to capture as much of the variability in the data as possible, and each succeeding component in turn extracts as much of the residual variability as possible.
50
PCA: The Algorithm
Given a set of data $\mathcal{D} = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, suppose the data have been centralised, i.e., the mean has been removed. Collect them in a data matrix $\mathbf{X}$ of size $N \times d$.
Calculate the (co)variance matrix, of size $d \times d$:
$$\mathbf{S} = \frac{1}{N} \mathbf{X}^T \mathbf{X}$$
Conduct the eigen-decomposition of $\mathbf{S}$:
$$\mathbf{S} = \mathbf{A} \Lambda \mathbf{A}^T, \qquad \mathbf{A}^T \mathbf{A} = \mathbf{I}_d, \quad \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$$
The first $r$ ($r \le d$) principal components of $\mathbf{X}$, of size $N \times r$, are given by
$$\mathbf{Z}_r = \mathbf{X} \mathbf{A}_r$$
where $\mathbf{A}_r$ is the matrix of the first $r$ columns of $\mathbf{A}$.
Each row of $\mathbf{Z}_r$ (in the $r$ new factors/features) is a new representation of the given data point, i.e., of the corresponding row of $\mathbf{X}$ (in the $d$ original attributes/features).
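A NumPy sketch of this algorithm on made-up data (eigh returns eigenvalues in ascending order, so they are re-sorted to decreasing variance).

import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # hypothetical correlated data

# Step 1: centralise the data.
X = X_raw - X_raw.mean(axis=0)
N, d = X.shape

# Step 2: (co)variance matrix S = (1/N) X^T X, size d x d.
S = X.T @ X / N

# Step 3: eigen-decomposition S = A Lambda A^T.
eigvals, A = np.linalg.eigh(S)
idx = np.argsort(eigvals)[::-1]           # sort by decreasing variance
eigvals, A = eigvals[idx], A[:, idx]

# Step 4: first r principal components Z_r = X A_r, size N x r.
r = 2
Z_r = X @ A[:, :r]
print(Z_r.shape, eigvals)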
51
PCA: Selecting r
Consider the share of the total variance absorbed by the first $r$ components $\mathbf{Z}_r$:
$$Q_r = \frac{\sum_{h=1}^{r} \lambda_h}{\sum_{h=1}^{d} \lambda_h}$$
Select $r$ so that, for example, $Q_r \ge 0.95$.
Kaiser criterion: compute the average eigenvalue
$$\bar{\lambda} = \frac{1}{d} \sum_{h=1}^{d} \lambda_h$$
and select the first $r$ components for which $\lambda_h > \bar{\lambda}$. Note: if the variables are standardised, $\bar{\lambda} = 1$.
Python example: Lecture03_Example05.py
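scikit-learn's PCA reports the eigenvalue shares directly, which makes the $Q_r$ rule easy to apply; this is a sketch on made-up data, while Lecture03_Example05.py is the course's own example.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))   # hypothetical data

pca = PCA()
pca.fit(X)

# Cumulative share of total variance absorbed by the first r components (Q_r).
Q = np.cumsum(pca.explained_variance_ratio_)
r = int(np.argmax(Q >= 0.95)) + 1          # smallest r with Q_r >= 0.95
print(Q, r)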