Machine Learning for Financial Data
December 2020
FEATURE ENGINEERING (CONCEPTS – PART 3)
Contents
◦ Feature Selection
◦ Filter-based Feature Selection
◦ Feature Selection using Pearson’s Correlation
◦ Feature Selection using Hypothesis Testing
◦ Feature Transformation
◦ Feature Transformation using Principal Component Analysis
◦ Feature Learning
Feature Selection
Feature selection selects the most relevant features and eliminates redundant, irrelevant, and noisy features
▪ Feature relevance is classified into three types: strong relevance, weak relevance, and irrelevance
▪ A feature that influences the output and whose role cannot be taken over by the remaining features is a strongly relevant feature and therefore cannot be removed
▪ A feature is said to be weakly relevant if it is necessary for an optimal subset only under certain conditions
▪ An irrelevant feature is one that is not necessary at all because it contributes no information about the target, and hence it should be removed
▪ A feature that takes the role of another is said to be redundant
▪ Removing irrelevant and redundant features can give better generalization, understanding, and visualization with less training and testing time
Identifying which features are most relevant is particularly useful when there are only a few samples

| Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud? |
|--------|-----------|-------------|-----------|---------|------|--------------|--------|
| Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No     |
| Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes    |
| Adrian | $1,003.30 | 3-Oct-2020  | HK        | HK      | 25   | Graduate     | Yes    |
| Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | JAP     | 64   | Graduate     | No     |
| Adams  | ¥20000    | 7-Oct-2020  | AUS       |         | 58   | Primary      | No     |
| …      | …         | …           | …         | …       | …    | …            | …      |
| Jones  | ₽3,250.11 | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No     |
| Mary   | ₽8,156.20 | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes    |
| Max    | €7475,11  | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No     |
| Peter  | ₽500.00   | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No     |
| Anson  | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes    |

Rows are observations; the columns Name to Education are features and Fraud? is the target. Feature selection keeps only the most relevant of the feature columns.
Feature selection is the process of selecting a subset of relevant features for use in model construction
▪ Reasons for doing feature selection include
◦ to simplify models to make them easier to interpret by researchers / users
◦ to shorten training time
◦ to reduce the dimensionality of data involved
◦ to enhance generalization by reducing overfitting
◦ to reduce model scoring time (after model deployment)
The three main categories of supervised feature selection algorithms are filter, wrapper, and embedded methods
Filter Methods
▫ A proxy measure, often statistical, instead of the error rate is used to score a subset
▫ Computationally less expensive
▫ Selection is more general and with lower predictive performance
Wrapper Methods
▫ Each subset is used to train a model and the model error rate provides the score for the subset
▫ Computationally very expensive
▫ Selection is usually good
Embedded Methods
▫ A catch-all group of techniques being part of the model construction process
▫ Computational complexity is between filters and wrappers
Filter Methods
▪ Apply a statistical measure (e.g. correlation with the target) to assign a score to each variable regardless of the ML model
▪ Variables are ranked by the score and either kept or removed from the dataset
▪ Often univariate, considering each feature independently
▪ Tend to select redundant variables as the relationships between variables are not considered
▪ No consideration is given to the ML model during the filtering process; hence, the selected features may not be the right ones for the model
Process: all features → score each feature against the target with a statistical measure → select features individually using a threshold on the score → train the ML algorithm → assess performance
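A minimal sketch of a filter method with scikit-learn, assuming a feature matrix X (a DataFrame) and target y such as the ones built in the Python examples later in this deck; the scoring function and k are illustrative choices:

# filter method: score each feature independently against the target, keep the top k
from sklearn.feature_selection import SelectKBest, f_classif

selector = SelectKBest(score_func=f_classif, k=5)
X_filtered = selector.fit_transform(X, y)

# inspect the univariate scores behind the selection
print(sorted(zip(selector.scores_, X.columns), reverse=True))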
Wrapper Methods
▪ Consider the selection as a search problem where different combinations are prepared, evaluated and compared to other combinations
▪ A predictive model is used to assign scores based on model accuracy
▪ Can detect possible interactions between variables
▪ Increase the overfitting risk when the number of observations is insufficient
Process: all features → generate a subset → train the ML algorithm → assess performance → select the best subset
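A minimal wrapper-method sketch using recursive feature elimination; X, y, the estimator, and the subset size are assumptions for illustration, and the repeated model fits are what make wrappers expensive:

# wrapper method: search for a good subset by repeatedly training the wrapped model
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

rfe = RFE(estimator=DecisionTreeClassifier(random_state=0), n_features_to_select=5)
rfe.fit(X, y)

# the subset judged best by the wrapped model
print(X.columns[rfe.support_])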
Embedded Methods
▪ Try to combine the advantages of both filter and wrapper methods
▪ A learning algorithm takes advantage of its own variable selection process and performs feature selection and assessment simultaneously
Process: all features → generate a subset → train the ML algorithm and assess performance in one combined step → select the best subset
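A minimal embedded-method sketch: an L1-regularised logistic regression drops features as part of its own fitting; X, y, and the strength C are assumptions for illustration:

# embedded method: the model's own L1 penalty zeroes out unhelpful coefficients
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

sparse_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)
selector = SelectFromModel(sparse_model)
selector.fit(X, y)

# features with non-zero coefficients survive the selection
print(X.columns[selector.get_support()])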
Filter-based Feature Selection
The choice of feature selection algorithm depends on the nature of the input features and output target
|                      | Numerical target | Categorical target |
|----------------------|------------------|--------------------|
| Numerical features   | Pearson's Correlation Coefficient (linear), Spearman's Rank Correlation Coefficient (non-linear) | ANOVA Correlation Coefficient (linear), Kendall's Rank Coefficient (non-linear) |
| Categorical features | ANOVA Correlation Coefficient (linear), Kendall's Rank Coefficient (non-linear) | Chi-Squared Test (contingency table), Mutual Information |
◦ Pearson’s can be used on quantitative continuous variables
◦ Spearman’s can be used on ordinal data when the ordered categories are replaced by their ranks
◦ Actually, mutual information is agnostic to data types
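Since mutual information is agnostic to data types, a hedged sketch of using it as a filter score (X and y as in the later Python examples; categorical features would need to be integer-encoded first):

# mutual information works for numerical and encoded categorical features alike
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X, y, discrete_features='auto', random_state=0)
print(sorted(zip(mi_scores, X.columns), reverse=True))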
Feature Selection using Pearson’s Correlation
Pearson’s Correlation
◦ The Pearson’s Correlation is a measure of the strength and direction of association that exists between two variables measured on at least an interval scale
◦ The coefficient measures the linear relationship between columns
◦ The coefficient value varies between -1 and +1
◦ The value 0 implies no correlation between columns
◦ Values closer to -1 or +1 imply an extremely strong linear relationship
◦ Pearson’s correlation coefficient generally requires that each column be normally distributed
Pearson’s correlation calculates the effect of change in one variable when the other variable changes
$$r = \frac{N\sum xy - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[N\sum x^2 - \left(\sum x\right)^2\right]\left[N\sum y^2 - \left(\sum y\right)^2\right]}}$$
where
N = the number of pairs of scores
Σxy = the sum of the products of paired scores
Σx = the sum of x scores
Σy = the sum of y scores
Σx² = the sum of squared x scores
Σy² = the sum of squared y scores
[Example scatter plots illustrate correlation strength: 0.5 < r < 1.0, 0.3 < r < 0.5, 0.1 < r < 0.3, and r ≅ 0.0]
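As a quick check, the formula above can be evaluated on made-up vectors and compared with numpy's built-in result:

# evaluate the Pearson formula directly and cross-check with numpy
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
N = len(x)

r = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (N * np.sum(x**2) - np.sum(x)**2) * (N * np.sum(y**2) - np.sum(y)**2))

print(r, np.corrcoef(x, y)[0, 1])   # the two values agree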
Default Credit Card Payments
◦ A dataset about customer default payments in Taiwan
◦ Number of observations = 30,000
◦ Number of features = 24
◦ From a risk-management perspective, an accurate estimate of the probability of default is more valuable than a binary classification of clients as credible or not credible
The credit card default payment dataset
| #     | Feature                    | Description |
|-------|----------------------------|-------------|
| 1     | LIMIT_BAL                  | Credit amount in NT dollars |
| 2     | SEX                        | Gender: 1=male, 2=female |
| 3     | EDUCATION                  | Education: 1=postgraduate, 2=graduate, 3=secondary, 4=others |
| 4     | MARRIAGE                   | Marital status: 1=married, 2=single, 3=others |
| 5     | AGE                        | Age in years |
| 6-11  | PAY_0, PAY_2 ... PAY_6     | Repayment status from September to April 2005: -1=paid duly, 1=1 month delay, ..., 8=8 months' delay, 9=9 months or longer delay |
| 12-17 | BILL_AMT1 ... BILL_AMT6    | Bill statement amount in NT dollars from September to April 2005 |
| 18-23 | PAY_AMT1 ... PAY_AMT6      | Amount of previous payment in NT dollars from September to April 2005 |
| 24    | default payment next month | Default payment: 1=yes, 0=no |
Python: Correlation-based Feature Selection (1)
# load relevant packages
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# load the credit card default dataset
data = pd.read_csv('FIN7790-02-3-credit_card_default.csv', header=1, index_col=0)

# confirm the entire dataset is indeed loaded
data.shape
Python: Correlation-based Feature Selection (2)
# examine the first 5 rows
data.head().T
Python: Correlation-based Feature Selection (3)
# examine the statistics about the dataset
data.describe().T
Python: Correlation-based Feature Selection (4)
# check if there are any null values
data.isnull().sum()
# assumes no preprocessing is required
# partition the dataset into features & target
target = 'default payment next month'
X = data.drop(target, axis = 1)
y = data[target]
# show the normalized value counts
y.value_counts(normalize=True)
Python: Correlation-based Feature Selection (5)
# show the Pearson's correlation coefficients
data.corr()
Python: Correlation-based Feature Selection (6)
# show the Pearson's correlation
# coefficients as a heatmap
sns.heatmap(data.corr())
The heatmap shows the correlation matrix of all the columns; the most strongly correlated pairs stand out visually.
For simplicity no normalization is performed before computing the Pearson's coefficients.
Python: Correlation-based Feature Selection (7)
# list the coefficient against the target
data.corr()[target]
# list the coefficient against the target
# only if the absolute value > 0.2
data.corr()[target].abs() > 0.2

Python: Correlation-based Feature Selection (8)

# retain the most correlated features
key_features = data.columns[data.corr()[target].abs() > 0.2]
key_features

# display the retained features of the dataset
data_trimmed = data[key_features]
data_trimmed
Prediction accuracy may suffer or improve as a result of feature selection, depending on the choice of parameters

Before Feature Selection

| Model Name    | Accuracy | Fit Time (sec) | Predict Time (sec) |
|---------------|----------|----------------|--------------------|
| Decision Tree | 0.8203   | 0.158          | 0.002              |

After Feature Selection

| Model Name    | # of Features | Threshold | Accuracy | Fit Time (sec) | Predict Time (sec) |
|---------------|---------------|-----------|----------|----------------|--------------------|
| Decision Tree | 7             | 0.1       | 0.8206   | 0.105          | 0.003              |
| Decision Tree | 5             | 0.2       | 0.8197   | 0.010          | 0.002              |
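A comparison like the one above could be produced along the following lines; this is a sketch that assumes X, y, target, and key_features from the earlier steps, and the exact numbers will differ with the split and the tree settings:

# time and score a decision tree before and after feature selection
import time
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def evaluate(features):
    X_tr, X_te, y_tr, y_te = train_test_split(X[features], y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(random_state=0)
    start = time.time()
    model.fit(X_tr, y_tr)
    fit_time = time.time() - start
    start = time.time()
    predictions = model.predict(X_te)
    predict_time = time.time() - start
    return accuracy_score(y_te, predictions), fit_time, predict_time

print(evaluate(X.columns))                                   # before feature selection
print(evaluate([f for f in key_features if f != target]))    # after feature selection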
Feature Selection using Hypothesis Testing
Hypothesis Testing
◦ Hypothesis testing is a method for testing a claim about a parameter in a population, using data measured in a sample
1) State the hypothesis
2) Set the criteria for a decision
3) Compute the test statistic
4) Make a decision
◦ The null hypothesis (H0) is a statement about the population parameter that is assumed to be true
◦ The reason for testing H0 is that we suspect it is wrong
◦ An alternative hypothesis (H1) is a statement that directly contradicts H0 by stating that the population parameter is different from what is stated in H0
Presumption of Innocence
▪ The presumption of innocence is the legal principle that one is considered "innocent until proven guilty". Under the presumption of innocence, the legal burden of proof is on the prosecution, which must present compelling evidence to the trier of fact (a judge or a jury)
▪ In other words, a person brought before a court is presumed innocent unless proven and adjudged guilty; the prosecution bears the burden of collecting sufficient reliable evidence to establish that the defendant is in fact guilty, the evidence must satisfy the legal requirements, and guilt must be established beyond reasonable doubt
▪ In hypothesis-testing terms, "innocent" plays the role of the null hypothesis H0, which is rejected only when the evidence against it is compelling
Chi-Squared Test
◦ The Chi-Squared test is used to determine whether a relationship between 2 categorical variables in a sample is likely to reflect a real association between these 2 variables in the population
◦ In the case of 2 variables being compared, the test can be interpreted as determining if there is a difference between the 2 variables
◦ The sample data is used to calculate a single number, the test statistic
◦ The size of the test statistic reflects the probability that the observed relationship between 2 variables has occurred by chance
After rolling a die 36 times, how can we determine whether the die is fair or unfair?
How would you draw the conclusion?
The Chi2 test is used for categorical variables to reveal discrepancies between observed and expected frequencies
Chi-Squared Score:
$$\chi^2 = \sum \frac{\left(\text{Observed Frequency} - \text{Expected Frequency}\right)^2}{\text{Expected Frequency}}$$
where
Observed Frequency = number of observations of a class
Expected Frequency = number of observations of a class expected if there were no relationship between the feature and the target
The Chi2 test sums the scaled squared differences between observed and expected frequencies and compares the sum with the Chi2 distribution
The chosen level of significance and the degrees of freedom determine the threshold in the Chi2 distribution, which in turn determines whether H0 is accepted or rejected; here the threshold 9.236 is the critical value for 5 degrees of freedom (six faces minus one) at the 0.10 significance level
Conclusion
◦ test statistic (9.6) > threshold (9.236)
◦ suggests the die is unbalanced
◦ reject the null hypothesis H0
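A goodness-of-fit test of this kind can be computed with scipy; the observed counts below are hypothetical (not the ones behind the 9.6 on the slide), but the 9.236 threshold for 5 degrees of freedom is reproduced:

# chi-squared goodness-of-fit for a six-sided die (hypothetical observed counts)
import numpy as np
from scipy import stats

observed = np.array([2, 3, 4, 6, 9, 12])      # hypothetical counts from 36 rolls
expected = np.full(6, observed.sum() / 6)     # a fair die: 6 expected in each cell

statistic, p_value = stats.chisquare(observed, expected)
threshold = stats.chi2.ppf(0.90, df=5)        # about 9.236, the threshold on the slide

print(statistic, p_value, threshold)
print("reject H0: the die looks unfair" if statistic > threshold else "fail to reject H0")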
Chi-Squared based Feature Selection
◦ Chi2 measures the distance between observed and expected frequencies
◦ The null hypothesis (H0) is that the observed frequencies for a categorical variable match the expected frequencies for that variable
◦ If the score >= threshold: the target depends on the feature, the result is significant, reject the null hypothesis (H0), and the feature is to be retained
◦ If the score < threshold: the target does not depend on the feature, the result is not significant, fail to reject the null hypothesis (H0), and the feature should be removed
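A minimal sketch of chi-squared feature selection with scikit-learn, assuming X holds non-negative, integer-encoded categorical features and y is the categorical target; k is an illustrative choice:

# chi-squared scores drive the selection; low-scoring features are dropped
from sklearn.feature_selection import SelectKBest, chi2

selector = SelectKBest(score_func=chi2, k=4)
X_selected = selector.fit_transform(X, y)

# the retained features (score above the implied threshold)
print(X.columns[selector.get_support()])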
Feature Transformation
Feature transformation creates new columns that are fundamentally different from the original dataset
(The same credit-card transaction table shown earlier in the feature-selection example, with the features Name, Amount, Date, Issued In, Used In, Age, Education and the target Fraud?, is transformed into an entirely new set of columns.)
Feature transformation creates an entirely new, structurally different dataset from the original dataset
▪ Feature selection processes are limited to only being able to select features from the original set of columns
▪ Feature transformation uses the original columns and combines them in useful ways to create new columns that are better at describing the data than any single column from the original dataset
▪ These algorithms create brand new columns that are so powerful that we only need a few of them to explain the entire dataset accurately
Feature transformation relies on matrix algorithms whereas feature learning relies on deep learning
▪ Feature transformation deploys a suite of algorithms designed to alter the internal structure of data to produce mathematically superior columns
▪ Feature learning will focus on using non-parametric algorithms (those that do not depend on the shape of the data) to automatically learn new features
▪ Feature transformation uses a set of matrix algorithms that will structurally alter the dataset and produce what is essentially a brand new matrix of data
◦ The basic idea is that the original features of a dataset describe the data points, and it should be possible to create a new set of features that explains the data points just as well, perhaps even better, with fewer columns
Feature Transformation using Principal Component Analysis
Principal Component Analysis (PCA)
◦ PCA is used to extract the important information from a multivariate dataset and to express this information as a small set of new variables called principal components
◦ The principal components explain most of the patterns & latent structures observed in the original dataset
◦ Often possible with only a few principal components
◦ An unsupervised dimension reduction technique providing a new lower-dimensional variable space to project the dataset on
◦ A linear static transformation using matrix multiplication
Graphically, PCA finds new orthogonal dimensions that capture the largest variances
◦ The dataset is represented in the X-Y coordinate system
◦ The PC1 axis is the first principal direction, giving the largest sample variation
◦ The PC2 axis is the second most important direction, orthogonal to the PC1 axis
The original data in a 2-dimensional space can be effectively represented in a 1-dimensional space
◦ The dimension reduction is achieved by identifying the principal directions, called principal components
◦ PCA assumes that the directions with the largest variances are the most important
◦ In this example, the two-dimensional data can be reduced to a single dimension by projecting each data point onto the first principal component (PC1)
The computation of PCA is typically done using linear algebra and the identification of eigenvectors & eigenvalues

(1) Start with correlated data of n dimensions
(2) Center (don't scale) the data; we are after the directions of the highest variance
(3) Compute the covariance matrix, e.g. for two dimensions h and u:
$$\begin{bmatrix} cov(h,h) & cov(h,u) \\ cov(u,h) & cov(u,u) \end{bmatrix} = \begin{bmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{bmatrix}$$
(4) Compute the eigenvectors and eigenvalues of the covariance matrix:
$$\begin{bmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{bmatrix} \begin{bmatrix} e_h \\ e_u \end{bmatrix} = \lambda_e \begin{bmatrix} e_h \\ e_u \end{bmatrix} \qquad \begin{bmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{bmatrix} \begin{bmatrix} f_h \\ f_u \end{bmatrix} = \lambda_f \begin{bmatrix} f_h \\ f_u \end{bmatrix}$$
There are n orthogonal eigenvectors for data of n dimensions
(5) Keep the top k eigenvalues (sorted in descending order) and their eigenvectors
(6) Project the data by multiplying it with the retained eigenvectors, $x_e = x^{\top} e$, giving uncorrelated data of lower dimensionality
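Steps (3) to (6) can be reproduced with numpy; a sketch using the example covariance matrix above and hypothetical centred observations:

# eigen-decomposition of the example covariance matrix and projection onto PC1
import numpy as np

cov = np.array([[2.0, 0.8],
                [0.8, 0.6]])

# (4) eigenvalues and (column) eigenvectors of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# (5) sort by descending eigenvalue and keep the top k = 1 component
order = np.argsort(eigenvalues)[::-1]
top_vector = eigenvectors[:, order[0]]

# (6) project centred observations onto the first principal component
x = np.array([[1.5, 0.7], [-0.5, -0.2]])      # hypothetical centred data points
projected = x @ top_vector

print(eigenvalues[order], projected)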
Some reminders on linear algebra
▪ Variance measures the variation of the data along a dimension:
$$var(x) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$$
▪ Covariance identifies the dependencies and relationships between the dimensions of a dataset:
$$cov(x, y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$$
▪ The sign of cov(x, y) is the key
  ▪ Positive: both dimensions increase together
  ▪ Negative: one dimension increases while the other decreases
  ▪ 0: the two dimensions are independent of each other
▪ An eigenvector (v) of a linear transformation (A) is a non-zero vector (typically a unit vector) that changes only by a scalar factor (λ) when that linear transformation is applied:
$$A\boldsymbol{v} = \lambda\boldsymbol{v}, \qquad \begin{bmatrix} t_{11} & \cdots & t_{1n} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mn} \end{bmatrix} \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix} = \lambda \begin{bmatrix} v_1 \\ \vdots \\ v_n \end{bmatrix}$$
▪ The corresponding eigenvalue is the factor by which the eigenvector is scaled; eigenvectors and eigenvalues come in pairs for a given linear transformation
Majority of the variance in the original dataset can be effectively explained by a few principal components
Understanding the Mathematics behind Principal Component Analysis (https://heartbeat.fritz.ai/understanding-the-mathematics-behind-principal-component-analysis-efd7c9ff0bb3)
Python: Using Principal Component Analysis (1)
# load relevant packages and data
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

boston_dataset = load_boston()
data = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
data['MEDV'] = boston_dataset.target

# separate the features from the target
X = data.drop('MEDV', axis = 1)
y = data['MEDV']
Python: Using Principal Component Analysis (2)
# show the first 5 observations
data.head()
# show the number of rows and number of columns of the dataset
data.shape
Python: Using Principal Component Analysis (3)
# split the dataset into training dataset (70%) and testing dataset (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape

# set up the PCA
# n_components=None will keep all features in the original dataset
# features will be ranked and selected in subsequent steps
pca = PCA(n_components = None)
# train the PCA with the training dataset
pca.fit(X_train)
Python: Using Principal Component Analysis (4)
# a few of the components will capture most of the variance of the original dataset
# to identify how many components capture most of the variability,
# we can plot the percentage of variance explained (by each component)
# versus the component number

# plot the percentage of the total variance
# explained by each component
plt.plot(pca.explained_variance_ratio_, linewidth = 2)
plt.title('Percentage of Variance Explained')
plt.xlabel('Number of Components')
plt.ylabel('Percentage of Variance Explained')

# the plot indicates that we can use the first
# two components to train our machine learning
# models using a linear model
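A common complement (not on the original slide) is the cumulative explained-variance ratio, which gives the smallest number of components reaching a chosen coverage:

# cumulative explained variance as an alternative way to choose the number of components
import numpy as np

cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= 0.95) + 1    # smallest k explaining 95% of the variance
print(cumulative, n_components)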
Python: Using Principal Component Analysis (5)
# transform the training and testing datasets
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)
print(X_train_transformed)
Python: Using Principal Component Analysis (6)
# reduce the dimensionality of the dataset based on the result of PCA
X_train_trimmed = pd.DataFrame(X_train_transformed[:,0:2])
X_train_trimmed.head()
# show the number of rows and columns of the reduced dataset
X_train_trimmed.shape
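The earlier comment about training a linear model on the first two components could be carried out along these lines; a sketch whose scores depend on the data and the split:

# fit a linear model on the two retained components and, for comparison, on all features
from sklearn.linear_model import LinearRegression

X_test_trimmed = pd.DataFrame(X_test_transformed[:, 0:2])

model_pca = LinearRegression().fit(X_train_trimmed, y_train)
model_all = LinearRegression().fit(X_train, y_train)

print(model_pca.score(X_test_trimmed, y_test))   # R^2 using the first two components
print(model_all.score(X_test, y_test))           # R^2 using all original features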
Feature Learning
Feature learning relieves the restriction on the original dataset and uses deep learning to create new columns
(The same credit-card transaction table as before, with observations described by the features Name, Amount, Date, Issued In, Used In, Age, Education and the target Fraud?; feature learning derives a brand-new set of columns from these observations rather than selecting from or transforming the existing columns.)
Feature Learning
◦ Creates brand-new features from existing features, making no assumption about the shape of the data
  ∙ Feature learning algorithms are not parametric
◦ Relies on stochastic learning
  ∙ Instead of applying the same equation to the data every time, the algorithms discover the best features by looking at the data over and over again (in epochs) and converging on a solution (potentially a different one each run)
◦ Can learn fewer or more features than in the original dataset; the exact number of features to learn depends on the problem and can be grid-searched
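A minimal feature-learning sketch using a restricted Boltzmann machine from scikit-learn as one possible stochastic feature learner (the deck does not prescribe a specific algorithm); it assumes the features in X have been scaled to the [0, 1] range:

# learn 10 brand-new features by repeatedly passing over the data (n_iter epochs)
from sklearn.neural_network import BernoulliRBM

rbm = BernoulliRBM(n_components=10, n_iter=20, learning_rate=0.01, random_state=0)
X_learned = rbm.fit_transform(X)

print(X_learned.shape)   # (n_samples, 10): a learned feature set, not a subset of X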
Feature Transformation vs Feature Learning
|                 | Feature Transformation Algorithms | Feature Learning Algorithms |
|-----------------|-----------------------------------|-----------------------------|
| Parametric      | Yes                               | No                          |
| Simple to use   | Yes                               | No                          |
| New feature set | Yes                               | Yes                         |
| Deep learning   | No                                | Yes                         |
| Algorithms      | PCA, LDA                          | Deep learning               |
▪ A model being non-parametric does not mean that no assumptions are made at all by the model during training
▪ Feature learning algorithms forgo the assumption on the shape of the data but they still may make assumptions on other aspects of the data (e.g., the variable values)
References
"Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists", Alice Zhang & Amanda Casari, O'Reilly Media, April 2018, ISBN-13: 978-1-491-95324-2
"Python Feature Engineering Cookbook", Soledad Galli, Packt Publishing, January 2020, ISBN-13: 978-1-789-80631-1
"Feature Engineering Made Simple", Susan Ozdemir & Divya Susarla, Packt Publishing, January 2018, ISBN-13: 978-1- 787-28760-0
▪ "HowtoChooseaFeatureSelectionMethodforMachineLearning",JasonBrowniee,November2019 (https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
▪ "CorrelationCoefficientCalculator"(https://byjus.com/correlation-coefficient-calculator/)
▪ "HowtoPerformFeatureSelectionwithCategoricalData",JasonBrowniee,November2019
(https://machinelearningmastery.com/feature-selection-with-categorical-data/)
▪ "AGentleIntroductiontotheChi-SquaredTestforMachineLearning",JasonBrowniee,June2018
(https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)
▪ "Chi-SquaredTestforFeatureSelectioninMachineLearning",SampathKumarGajawada,October2019
(https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223)
▪ "Chi-SquaredTestCalculator"(https://www.socscistatistics.com/tests/chisquare2/default2.aspx)
▪ "ATutorialonPrincipalComponentsAnalysis",LindsaySmith,February2002 (http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)
THANK YOU