Machine Learning for Financial Data
December 2020
FEATURE ENGINEERING (CONCEPTS – PART 2)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
Feature Engineering
Contents
◦ Feature Improvement
◦ Feature Construction
Feature Improvement
Data Probability Distribution
Variable Transformations
▪ Linear and logistic regression assume that the variables are normally distributed
▪ If they are not, a mathematical transformation can be applied to change them into normal distribution, and sometimes even unmask linear relationships between variables and their targets
▪ Transforming variables may improve the performance of linear ML models
▪ Commonly used mathematical transformations include
◦ Logarithm, Reciprocal, Square Root, Cube Root, Power, Box-Cox and Yeo-Johnson
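A minimal sketch (not from the slides) of how these transformations might be applied and compared; the synthetic lognormal variable and the specific NumPy/SciPy functions used are illustrative assumptions.

# compare common transformations on a right-skewed (synthetic) variable
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed(29)
x = pd.Series(np.random.lognormal(mean=0.0, sigma=1.0, size=1000))  # skewed demo data

transformed = pd.DataFrame({
    'original': x,
    'log': np.log(x),                      # requires strictly positive values
    'reciprocal': 1 / x,                   # requires non-zero values
    'square_root': np.sqrt(x),             # requires non-negative values
    'cube_root': np.cbrt(x),
    'box_cox': stats.boxcox(x)[0],         # requires strictly positive values
    'yeo_johnson': stats.yeojohnson(x)[0]  # also handles zero and negative values
})

# skewness closer to 0 indicates a more symmetric, more normal-looking distribution
print(transformed.skew().round(3))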
Variable Distribution
◦ A probability distribution is a function that describes the likelihood of obtaining the possible values of a variable
◦ There are many well-described variable distributions
∙ Normal distribution for continuous variables
∙ Binomial distribution for discrete variables
∙ Poisson distribution for discrete variables
◦ A better spread of values may improve model performance
The Boston Housing Dataset
Index | Variable | Definition
0 | AGE | proportion of owner-occupied units built prior to 1940
1 | B | 1000*(Bk - 0.63)^2, where Bk is the proportion of blacks by town
2 | CHAS | Charles River dummy variable (1 if tract bounds river, 0 otherwise)
3 | CRIM | per capita crime rate by town
4 | DIS | weighted distances to five Boston employment centres
5 | INDUS | proportion of non-retail business acres per town
6 | LSTAT | % lower status of the population
7 | NOX | nitric oxides concentration (parts per 10 million)
8 | PTRATIO | pupil-teacher ratio by town
9 | RAD | index of accessibility to radial highways
10 | RM | average number of rooms per dwelling
11 | TAX | full-value property-tax rate per US$10,000
12 | ZN | proportion of residential land zoned for lots over 25,000 sq.ft.
The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA.
Source: https://www.kaggle.com/prasadperera/the-boston-housing-dataset
Python: Examining Variable Distribution (1)
# load the relevant packages
import pandas as pd
import matplotlib.pyplot as plt
# load the Boston House Prices dataset from scikit-learn
from sklearn.datasets import load_boston
data = load_boston()
data = pd.DataFrame(data.data,columns=data.feature_names)
Python: Examining Variable Distribution (2)
# visualize the variable distribution with histograms
data.hist(bins = 30, figsize = (12,12), density = True)
plt.show()
Normal Distribution
◦ Linear models assume that the independent variables are normally distributed
◦ Failure to meet this assumption may produce algorithms that perform poorly
◦ To check for normal distribution, use histograms and Q-Q plots
∙ In a Q-Q plot, the quantiles of the independent variable are plotted against the expected quantiles of the normal distribution
∙ If the variable is normally distributed, the dots in the Q-Q plot should fall along a 45-degree diagonal
Most raw data, as a whole, are not normally distributed
▪ Normal / Gaussian distribution is a probability distribution that is symmetric about the mean
▪ Data near the mean are more frequent in occurrence than data far from the mean – a bell curve
▪ The mean, median & mode are all equal
▪ It is a common misconception that most data follow a normal distribution (i.e. that it is the normal thing)
▪ Many statistics are normally distributed in their sampling distribution
▪ But errors, averages, and totals often are
▪ Assumptions of normality are generally a last resort
▪ Used when empirical probability distributions are not available
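A minimal sketch (not from the slides) of a numerical normality check that can complement histograms and Q-Q plots; the choice of the Shapiro-Wilk test and the synthetic samples are illustrative assumptions.

# compare a normal and a clearly non-normal sample with the Shapiro-Wilk test
import numpy as np
from scipy import stats

np.random.seed(29)
normal_sample = np.random.randn(200)              # drawn from a normal distribution
skewed_sample = np.random.exponential(size=200)   # right-skewed, non-normal

for name, sample in [('normal', normal_sample), ('skewed', skewed_sample)]:
    stat, p_value = stats.shapiro(sample)
    # a small p-value (e.g. < 0.05) suggests the sample is not normally distributed
    print(f"{name}: W = {stat:.3f}, p-value = {p_value:.4f}")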
Q-Q plots help to identify the type of distribution of a random variable, typically whether it follows a normal distribution
▪ A Q-Q (Quantile-Quantile) Plot plots the quantiles of two probability distributions against each other
◦ Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities
▪ QQ Plots are used to graphically analyze and compare two probability distributions to see if they are exactly equal
◦ If the two distributions are exactly equal, the points on the Q-Q Plot will perfectly lie on the straight line y = x
Skewed Q-Q Plots
▪ Q-Q plots can find the skewness (a measure of asymmetry) of a distribution
▪ If the bottom end deviates from the straight line but the upper end does not, the distribution has a longer tail to its left
◦ left-skewed or negatively skewed
▪ If the upper end deviates from the straight line and the lower end follows the straight line, the distribution has a longer tail to its right
◦ right-skewed or positively skewed
Tailed Q-Q Plots
▪ Q-Q plots can find the Kurtosis (a measure of tailedness) of a distribution
▪ A distribution with a fat tail will have both ends of the plot deviating from the straight line and its centre following the straight line
▪ A thin-tailed distribution will form a Q-Q plot with a less or negligible deviation at both ends of the plot
◦ a perfect fit for the Normal Distribution
▪ Kurtosis measures how heavily the tails differ from a normal distribution
▪ It identifies whether the tails of a distribution contain extreme values
▪ In finance, it is used as a measure of financial risk
▪ A large kurtosis is associated with a high level of risk
▪ A small kurtosis signals a moderate level of risk because the probabilities of extreme returns are relatively low
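A minimal sketch (not from the slides) of measuring skewness and kurtosis with pandas; the simulated heavy-tailed and normal "returns" series are illustrative assumptions.

# compare skewness and (excess) kurtosis of a fat-tailed and a thin-tailed series
import numpy as np
import pandas as pd

np.random.seed(29)
returns_fat = pd.Series(np.random.standard_t(df=3, size=1000))   # heavy-tailed "returns"
returns_thin = pd.Series(np.random.randn(1000))                  # normal "returns"

# pandas reports excess kurtosis (0 for a normal distribution); a large positive
# value indicates fat tails, i.e. a higher probability of extreme returns
print("fat tails :", round(returns_fat.skew(), 3), round(returns_fat.kurtosis(), 3))
print("thin tails:", round(returns_thin.skew(), 3), round(returns_thin.kurtosis(), 3))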
Python: Identifying Normal Distribution (1)
# load the relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
# generate an array containing 200 observations that are normally distributed
np.random.seed(29)
x = np.random.randn(200)
# create a dataframe after transposing the generated array
data = pd.DataFrame([x]).T
data.columns = ['x']
Python: Identifying Normal Distribution (2)
# display a Q-Q plot to assess a normal distribution
stats.probplot(data['x'], dist = "norm", plot = plt)
plt.show()
Python: Identifying Normal Distribution (3)
# make a histogram and a density plot of the variable distribution
sns.distplot(data['x'], bins = 30)
Data Normalization
Normalization ensures that all rows and columns are treated equally under the eyes of machine learning
▪ Many ML algorithms are sensitive to the scale and magnitude of the features
∙ models involving distance calculations (e.g., clustering, principal component analysis) are particularly sensitive to feature scale
∙ features with bigger value ranges tend to dominate over features with smaller ranges
▪ Normalization is applicable to numerical variables and will align/transform both columns and rows so as to satisfy a consistent set of rules
∙ e.g., to transform all quantitative columns to a value range between 0 and 1
∙ e.g., to make all columns have the same mean and standard deviation so that all variable values appear nicely on the same histogram
▪ Normalization is meant to level the playing field of data by ensuring that all rows and columns are treated equally under the eyes of machine learning
Some ML algorithms are affected greatly by data scales and diversity of scales might result in suboptimal learning
# use the Boston Housing dataset
# make a histogram for each variable
data.hist(figsize=(15,15))
# redraw the histograms using one and the same scale for the X-axis
data.hist(figsize=(15,15), sharex=True)
Column values can be normalized so that different columns will have similar data value distribution
Name | Amount | Date | Issued In | Used In | Age | Education | Fraud?
Daniel | $2,600.45 | 1-Jul-2020 | HK | HK | 22 | Secondary | No
Alex | $2,294.58 | 1-Oct-2020 | HK | RUS | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020 | HK | | 25 | Graduate | Yes
Vicky | $8,488.32 | 4-Oct-2020 | JAPAN | HK | 64 | Graduate | No
Adams | ¥20000 | 7-Oct-2020 | AUS | JAP | 58 | Primary | No
… | … | … | … | … | … | … | …
Jones | ₽3,250.11 | Nov 1, 2020 | HK | RUS | 43 | Graduate | No
Mary | ₽8,156.20 | Nov 1, 2020 | HK | N/A | 27 | Graduate | Yes
Max | €7475,11 | Nov 8, 2020 | UK | GER | 32 | Primary | No
Peter | ₽500.00 | Nov 9, 2020 | Hong Kong | RUS | 0 | Postgraduate | No
Anson | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS | 20 | Postgraduate | Yes
(Rows are observations; the columns Name through Education are features and Fraud? is the target.)
Standardization / Z-score Normalization
◦ Standardization is the process of centering the variable at 0 and standardizing the variance (square of standard deviation) to 1
◦ To standardize features, we subtract the mean from each observation and then divide the result by the standard deviation
z = (x − mean(X)) / standard_deviation(X)
◦ The z-score represents how many standard deviations a given observation deviates from the mean
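A minimal sketch (not from the slides) of standardization with scikit-learn's StandardScaler, applied to the Boston Housing features used earlier in this deck; the variable names are assumptions.

# z-score standardization: subtract the mean, divide by the standard deviation
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import StandardScaler

raw = load_boston()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# after standardization every column has mean ~0 and standard deviation ~1
print(X_std.mean().round(2))
print(X_std.std().round(2))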
Z-score provides a standard scale to compare data having different means & standard deviations
▪ The standard score or Z-score is the number of standard deviations by which a data point is above or below the mean of the population
◦ Scores above the mean have positive standard scores, while those below the mean have negative standard scores
Z-score = (data point − population mean) / population standard deviation
▪ This process of converting a data point into a standard score is called standardizing or normalizing
Mean Normalization
◦ Center the variable mean at zero and rescale the distribution to the value range
◦ This procedure involves subtracting the mean from each observation and then dividing the result by the difference between the maximum and minimum values
x_scaled = (x − mean(X)) / (max(X) − min(X))
◦ This transformation results in a distribution centered at 0, with its minimum and maximum values within the range of -1 to 1
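A minimal sketch (not from the slides) of mean normalization done directly with pandas, since scikit-learn offers no dedicated transformer for it; reusing the Boston Housing features is an assumption.

# x_scaled = (x - mean(X)) / (max(X) - min(X)), computed column by column
import pandas as pd
from sklearn.datasets import load_boston

raw = load_boston()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

X_mean_norm = (X - X.mean()) / (X.max() - X.min())

# each column is now centred at 0 with values roughly within [-1, 1]
print(X_mean_norm.agg(['mean', 'min', 'max']).round(2))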
Min-Max Normalization
◦ Scaling to the minimum and maximum values squeezes the values of the variables between 0 and 1
◦ To implement this scaling technique, we need to subtract the minimum value from all the observations and divide the result by the value range, that is, the difference between the maximum and minimum values
x_scaled = (x − min(X)) / (max(X) − min(X))
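A minimal sketch (not from the slides) of min-max scaling with scikit-learn's MinMaxScaler; reusing the Boston Housing features is an assumption.

# (x - min) / (max - min) per column, squeezing every feature into [0, 1]
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import MinMaxScaler

raw = load_boston()
X = pd.DataFrame(raw.data, columns=raw.feature_names)

scaler = MinMaxScaler()
X_minmax = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# every column now ranges from 0 to 1
print(X_minmax.agg(['min', 'max']).round(2))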
An observation can be represented as a vector in a multi-dimensional vector space
◦ Each column value can be considered a scalar value that can be captured using one dimension in a multi-dimensional space
◦ An observation can therefore be captured as a feature vector
◦ The direction and magnitude of the feature vector is dictated by the value along each dimension, i.e. the feature values
◦ The angle between the vectors indicates similarity between them (e.g., cosine similarity)
Scaling Feature Vector to Unit Vector
◦ Scales the feature vector, as opposed to each individual variable
∙ A feature vector contains the values of several variables for a single observation
◦ Dividing each feature vector by its norm
∙ The Manhattan distance (l1 norm): the sum of the absolute values of the variables of the vector
∙ l1(X) = |x1| + |x2| + ⋯ + |xn|
∙ The Euclidean distance (l2 norm): the square root of the sum of the squares of the variables of the vector
∙ l2(X) = sqrt(x1² + x2² + ⋯ + xn²)
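A minimal sketch (not from the slides) of scaling each observation (row) to a unit vector with scikit-learn's Normalizer; the toy feature vectors are assumptions.

# divide each feature vector by its l1 or l2 norm
import numpy as np
import pandas as pd
from sklearn.preprocessing import Normalizer

X = pd.DataFrame({'x1': [1.0, 3.0, 0.5], 'x2': [2.0, 4.0, 0.5]})  # toy feature vectors

X_l1 = Normalizer(norm='l1').fit_transform(X)  # each row sums to 1 in absolute value
X_l2 = Normalizer(norm='l2').fit_transform(X)  # each row has Euclidean length 1

print(np.abs(X_l1).sum(axis=1))            # -> [1. 1. 1.]
print(np.sqrt((X_l2 ** 2).sum(axis=1)))    # -> [1. 1. 1.]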
Manhattan Distance (l1 norm)
• Also referred to as Taxicab or City Block Distance
• The distance between two points is measured along axes at right angles
• The sum of differences across dimensions
• More appropriate if columns are not similar in type
• Less sensitive to outliers

Euclidean Distance (l2 norm)
• Most commonly used distance
• Corresponds to the geometric distance in the multi-dimensional space
• If columns have values with differing scales, it is common to first normalize or standardize the numerical columns
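A minimal sketch (not from the slides) of computing both distances with NumPy; the two hypothetical, already-scaled feature vectors are assumptions.

# Manhattan (l1) and Euclidean (l2) distance between two observations
import numpy as np

a = np.array([0.22, 0.26, 1.0])   # hypothetical feature vectors
b = np.array([0.64, 0.85, 0.0])

manhattan = np.abs(a - b).sum()            # sum of absolute differences per dimension
euclidean = np.sqrt(((a - b) ** 2).sum())  # straight-line (geometric) distance

print(manhattan, euclidean)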
Vector normalization takes a vector of any length and changes its length to 1 while keeping the direction unchanged
(The same credit-card transactions table shown earlier: rows are observations, the columns Name through Education are features, and Fraud? is the target.)
The choice among removal, imputation, and normalization is determined by which yields the best model accuracy
Imputation / Normalization Technique | # of rows in the training dataset | Accuracy
1. Dropping rows with missing values | 392 | 0.74489
2. Imputing missing values with zero | 768 | 0.7304
3. Imputing missing values with the mean | 768 | 0.7318
4. Imputing missing values with the median | 768 | 0.7357
5. z-Score normalization with median imputation | 768 | 0.7422
6. Min-max normalization with mean imputation | 768 | 0.7461
7. Row normalization with mean imputation | 768 | 0.6823
Feature Construction
Feature construction is a form of data enrichment that adds derived features to data
▪ Feature construction involves transforming a given set of input features to generate a new set of more powerful features which are then used for prediction
▪ This may be done either to compress the dataset by reducing the number of features or to improve the prediction performance
▪ The new features will ideally hold new information and generate new patterns that ML models will be able to exploit and use to increase performance
New features may be constructed based on existing features to enable and enhance machine learning
∙ e.g., categorical variable encoding of the Issued In column produces the new dummy column Issued In_HK shown below
Name | Amount | Date | Issued In | Used In | Age | Education | Fraud? | Issued In_HK | …
Daniel | $2,600.45 | 1-Jul-2020 | HK | HK | 22 | Secondary | No | 1 | …
Alex | $2,294.58 | 1-Oct-2020 | HK | RUS | None | Postgraduate | Yes | 1 | …
Adrian | $1,003.30 | 3-Oct-2020 | HK | | 25 | Graduate | Yes | 1 | …
Vicky | $8,488.32 | 4-Oct-2020 | JAPAN | HK | 64 | Graduate | No | 0 | …
Adams | ¥20000 | 7-Oct-2020 | AUS | JAP | 58 | Primary | No | 0 | …
… | … | … | … | … | … | … | … | … | …
Jones | ₽3,250.11 | Nov 1, 2020 | HK | RUS | 43 | Graduate | No | 1 | …
Mary | ₽8,156.20 | Nov 1, 2020 | HK | N/A | 27 | Graduate | Yes | 1 | …
Max | €7475,11 | Nov 8, 2020 | UK | GER | 32 | Primary | No | 0 | …
Peter | ₽500.00 | Nov 9, 2020 | Hong Kong | RUS | 0 | Postgraduate | No | 1 | …
Anson | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS | 20 | Postgraduate | Yes | 1 | …
(Rows are observations; Name through Education together with the new Issued In_HK column are features, and Fraud? is the target.)
Encoding Nominal Qualitative Data
Categorical Encoding
◦ The values of categorical variables are often encoded as strings
◦ Scikit-learn does not support strings as values, therefore, we need to transform those strings into numbers
◦ The act of replacing strings with numbers is called categorical encoding
Dummy Variables
◦ Dummy variables take the value 0 or 1 to indicate the absence or presence of a category
◦ They are proxy variables, or numerical stand-ins, for qualitative variables
◦ Consider a simple regression analysis for wage determination
∙ Say we are given gender, which is qualitative, and years of education, which is quantitative
∙ In order to see if gender has an effect on wages, we would dummy code female = 1 when the person is female, and female = 0 when the person is male
◦ In one-hot encoding, we represent a categorical variable as a group of dummy variables, where each dummy variable represents one category
One-Hot Encoding

Gender | Gender_Female | Gender_Male
Female | 1 | 0
Male | 0 | 1
Male | 0 | 1
Female | 1 | 0
◦ One-hot encoding is applicable to nominal variables
∙ i.e., categorical variables not having a natural rank ordering
Dummy Variable Traps
◦ When working with dummy variables, it is important to avoid the dummy variable trap
◦ The trap occurs when independent variables are multicollinear, or highly correlated
◦ To avoid the dummy variable trap, simply drop one of the dummy variables
Gender | Gender_Female | Gender_Male
Female | 1 | 0
Male | 0 | 1
Male | 0 | 1
Female | 1 | 0
A categorical variable with k categories can be captured using k-1 dummy variables but sometimes still with k variables
▪ A categorical variable with k categories can be encoded in k-1 dummy variables
◦ For Gender, k is 2 (male and female), therefore, only one dummy variable (k - 1 = 1) is needed to capture all of the information
◦ For a color variable that has three categories (red, blue, and green), two (k – 1 = 2) dummy variables are needed
∙ red (red=1, blue=0), blue (red=0, blue=1), green (red=0, blue=0)
▪ There are a few occasions when categorical variables are encoded with k dummy variables
▪ When training decision trees, as they do not evaluate the entire feature space at the same time
▪ When selecting features recursively
▪ When determining the importance of each category within a variable
Python: One-Hot Encoding (1)
# load the relevant packages
import pandas as pd
# load the dataset from the current working directory
data = pd.read_csv('FIN7790-02-2-feature_construction.csv')
# show the dataset, which serves purely as a demo dataset
data
Python: One-Hot Encoding (2)
# list the nominal categorical variables to encode
cols = ['city', 'boolean']
# use pandas get_dummies() to dummify the nominal categorical variables
# drop_first=True avoids the dummy variable trap by removing the first category
encoding = pd.get_dummies(data[cols], drop_first=True)
# show the one-hot encoded dataset
encoding
Python: One-Hot Encoding (3)
# combine the original dataframe with the one-hot encoding dataframe
# drop the ordinal categorical column first to avoid the dummy variable trap
data_enc = pd.concat([data.drop(columns=cols), encoding], axis=1)
# show the encoded dataset
data_enc
get_dummies() will create one binary variable per found category. Hence, if there are more categories in the training dataset than in the testing dataset, get_dummies() will return more columns in the transformed training dataset than in the transformed testing dataset.
Encoding Ordinal Qualitative Data
Ordinal Encoding
◦ Ordinal encoding consists of
∙ replacing the categories with digits from 1 to k (or 0 to k-1, depending on the implementation)
∙ k is the number of distinct categories of the variable
◦ The numbers are assigned arbitrarily
◦ Ordinal encoding is better suited for non-linear machine learning models
∙ ML models can navigate through the arbitrarily assigned digits to try and find patterns that relate to the target
Python: Ordinal Encoding (1)
# load the relevant packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
# load the dataset from the current working directory
data = pd.read_csv('FIN7790-02-2-feature_construction.csv')
# list the columns to encode
cols = ['ordinal_column']
data
Python: Ordinal Encoding (2)
# capture the encoding as an array of arrays
# each inner array applies to one column
# list the categories in each inner array
# the order of the categories determines the encoded values
mapping = [['dislike', 'like', 'somewhat like']]
# instantiate the encoder
encoder = OrdinalEncoder(categories=mapping, dtype=np.int32)
# fit the data to the encoder
encoder.fit(data[cols])
# list the categories
encoder.categories_
# build the encoding
encoding = pd.DataFrame(encoder.transform(data[cols]), columns=cols)
# show the encoding
encoding
Python: Ordinal Encoding (3)
# build the encoded dataset
data_enc = pd.concat([data.drop(columns=cols), encoding], axis=1)
# show the encoded dataset
data_enc
Encoding Quantitative Data
Discretisation
▪ Discretization / Binning transforms continuous variables into discrete variables by creating a set of contiguous intervals (bins) spanning the value range
◦ Places outliers into the lower or higher intervals together with the remaining inlier values of the distribution
◦ Hence, these outliers no longer differ from the rest of the values at the tails of the distribution, as they are now all together in the same interval / bin
▪ Used to change the distribution of skewed variables, to minimize the influence of outliers, and hence to improve the performance of some ML models
▪ Binning can be achieved using supervised or unsupervised approaches
Equal-Width Discretization
◦ The variable values are sorted into intervals of the same width
◦ The number of intervals is decided arbitrarily
Width = (Max(X) − Min(X)) / Bins
∙ Values in the training dataset range from 0 to 100 and to create 5 bins, bin width = (100 – 0) / 5 = 20
∙ The bins will be 0-20, 20-40, 40-60, 60-80, 80-100
∙ The first bin (0-20) and final bin (80-100) can be expanded to accommodate outliers found in other datasets, i.e., values < 0 or > 100 would be placed in those bins by extending the limits to minus and plus infinity
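A minimal sketch (not from the slides) of equal-width binning with pandas where the first and last bins are extended to minus and plus infinity; the example values and bin labels are assumptions.

# 5 bins of width 20 over the expected 0-100 range, with open-ended extreme bins
import numpy as np
import pandas as pd

x = pd.Series([3, 12, 47, 55, 61, 78, 94, 105])   # 105 falls outside the 0-100 range
edges = [-np.inf, 20, 40, 60, 80, np.inf]

binned = pd.cut(x, bins=edges, labels=[0, 1, 2, 3, 4])
print(binned.tolist())   # out-of-range values still land in the first or last bin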
Python: Equal-Width Discretization (1)
# load the relevant packages
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
# load the dataset from the current working directory
data = pd.read_csv('FIN7790-02-2-feature_construction.csv')
# list the columns to encode
cols = ['quantitative_column']
data
Python: Equal-Width Discretization (2)
# instantiate a discretizer that labels the equal-width bins with ordinal integers
disc = KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')
# fit the data to the discretizer
disc.fit(data[cols])
# list the learnt bins
disc.bin_edges_
# build the discretization for the quantitative variable
discretization = pd.DataFrame(disc.transform(data[cols]), columns=cols)
# show the discretization
discretization
Bin interval | Ordinal code
[-∞, 1.55) | 0.0
[1.55, 3.6) | 1.0
[3.6, 5.65) | 2.0
[5.65, 7.7) | 3.0
[7.7, 9.75) | 4.0
[9.75, 11.8) | 5.0
[11.8, 13.85) | 6.0
[13.85, 15.9) | 7.0
[15.9, 17.95) | 8.0
[17.95, ∞) | 9.0
Python: Equal-Width Discretization (3)
# build the discretized dataset
data_disc = pd.concat([data.drop(columns=cols), discretization], axis=1)
# show the discretized dataset
data_disc
After one-hot encoding, ordinal encoding, and discretization, the original dataset becomes a purely numerical dataset
index | ordinal_column | quantitative_column | boolean_yes | city_san francisco | city_seattle | city_tokyo
0 | 2 | 0.0 | 1 | 0 | 0 | 1
1 | 1 | 5.0 | 0 | 0 | 0 | 1
2 | 2 | 0.0 | 0 | 0 | 0 | 0
3 | 1 | 5.0 | 0 | 0 | 1 | 0
4 | 2 | 4.0 | 0 | 1 | 0 | 0
5 | 0 | 9.0 | 1 | 0 | 0 | 1
Extending Quantitative Data with Polynomial Features
Polynomial Expansion
◦ A combination of one feature with itself (i.e. a polynomial combination of the same feature) can also be quite informative and can increase the power of predictive algorithms
∙ e.g., if the target follows a quadratic relationship with a variable, creating a second-degree polynomial of the feature allows us to use it in a linear model
◦ With similar logic, polynomial combinations of the same or different variables can return new variables that convey additional information and capture feature interaction
◦ Can be better inputs for our ML algorithms, particularly for linear models
A linear relationship can be created for polynomial features using a polynomial combination
▪ In the plot on the left, due to the quadratic relationship between the target (y) and the variable (x), there is a poor linear fit
▪ In the plot on the right, the x2 variable (a quadratic combination of x) shows a linear relationship with the target (y) and therefore improves the performance of the linear model, which predicts y from x2
Polynomial features may result in improved modeling performance at the cost of adding thousands of variables
▪ Often, the input features for a predictive modeling task interact in unexpected and often nonlinear ways
▪ These interactions can be identified and modeled by a learning algorithm
▪ Another approach is to engineer new features that expose these interactions
and see if they improve model performance
▪ Transforms like raising input variables to a power can help to better expose the important relationships between input variables and the target variable
▪ These features are called interaction/polynomial features and allow the use of simpler modeling algorithms as some of the complexity of interpreting the input variables and their relationships is pushed back to the data preparation stage
A set of new polynomial features is created based on the degree of the polynomial combination
▪ The degree of the polynomial is used to control the number of features added, e.g. a degree of 3 will add two new variables for each input variable
▪ Typically a small degree, such as 2 or 3, is used
◦ 2nd degree polynomial combinations return the following new features
(a, b, c)² = (1, a, b, c, ab, ac, bc, a², b², c²)
including all possible interactions of degree 1 and degree 2 plus the bias term 1
◦ 3rd degree polynomial combinations return the following new features
(a, b, c)³ = (1, a, b, c, ab, ac, bc, abc, a²b, a²c, b²a, b²c, c²a, c²b, a³, b³, c³)
including all possible interactions of degree 1, degree 2, and degree 3 plus the bias term 1
The Accelerometer Dataset
▪ The dataset collects data from a wearable accelerometer mounted on the chest intended for activity recognition research
▪ Data are collected from 15 participants performing 7 activities
▪ It provides challenges for identification and authentication of people using motion patterns
▪ Sampling frequency: 52 Hz
▪ 15 datasets, one for each participant
Index | Variable | Definition | Values
0 | ID | Identifier | Numerical
1 | Xacc | X acceleration | Numerical
2 | Yacc | Y acceleration | Numerical
3 | Zacc | Z acceleration | Numerical
4 | Label | Activity | 1 = working at computer; 2 = standing up, walking and going up/down stairs; 3 = standing; 4 = walking; 5 = going up/down stairs; 6 = walking and talking with someone; 7 = talking while standing
▪ Data calibration: no
Source: https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer
Python: Polynomial Combinations (1)
# load relevant packages and dataset with proper feature variables
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
data = pd.read_csv('FIN7790-02-2-accelerometer.csv', header=None)
data.columns = ['ID', 'x', 'y', 'z', 'activity']
data = data.astype({'ID': 'int'})
data.head()
Python: Polynomial Combinations (2)
# show information summary
data.info()
# show descriptive statistics
data.describe()
Python: Polynomial Combinations (3)
# split the dataset into features and targets
X = data[['x', 'y', 'z']]
y = data['activity']
# set up a polynomial expansion transformer of degree less than or equal to 2
# interaction_only=False retains all of the combinations
# include_bias=False avoids returning the bias term column of all 1's
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
# fit the transformer to the dataset
# let the transformer learn all of the possible polynomial combinations of the three variables
X_poly = poly.fit_transform(X)
data_X_poly = pd.DataFrame(X_poly, columns=poly.get_feature_names())
# show combinations covered by the transformer
poly.get_feature_names()
Python: Polynomial Combinations (4)
# calculate the correlation matrix between feature pairs
data_X_poly.corr()
Python: Polynomial Combinations (5)
# show the correlation matrix between feature pairs
# the darker the color, the greater the correlation of the features
sns.heatmap(data_X_poly.corr(), cmap=sns.diverging_palette(20, 220, n=200))
References
"Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists", Alice Zheng & Amanda Casari, O'Reilly Media, April 2018, ISBN-13: 978-1-491-95324-2
"Feature Engineering Made Easy", Sinan Ozdemir & Divya Susarla, Packt Publishing, January 2018, ISBN-13: 978-1-78728-760-0
Understanding Machine Learning
THANK YOU