
Machine Learning for Financial Data
December 2020
FEATURE ENGINEERING (CONCEPTS – PART 3)

Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
Feature Engineering
Contents
◦ Feature Selection
◦ Filter-based Feature Selection
◦ Feature Selection using Pearson’s Correlation
◦ Feature Selection using Hypothesis Testing
◦ Feature Transformation
◦ Feature Transformation using
Principal Component Analysis
◦ Feature Learning

Feature Selection

Feature selection selects the more relevant features and eliminates redundant, irrelevant, and noisy features
▪ Feature relevance is classified into three types: strong relevance, weak relevance, and irrelevance
▪ A feature that influences the output and whose role cannot be replaced by the remaining features is a (strongly) relevant feature and therefore cannot be removed
▪ A feature is said to be weakly relevant if it is necessary for an optimal subset only under certain conditions
▪ An irrelevant feature is one that is not necessary at all because it contributes no information to the target, and hence it should be removed
▪ A feature that takes the role of another feature is said to be redundant
▪ Removing irrelevant and redundant features will potentially give better generalization, understanding, and visualization with less training and testing time

Identifying which features are most relevant is particularly useful when there are only a few samples

| Name | Amount | Date | Issued In | Used In | Age | Education | Fraud? |
| Daniel | $2,600.45 | 1-Jul-2020 | HK | HK | 22 | Secondary | No |
| Alex | $2,294.58 | 1-Oct-2020 | HK | RUS | None | Postgraduate | Yes |
| Adrian | $1,003.30 | 3-Oct-2020 | HK | HK | 25 | Graduate | Yes |
| Vicky | $8,488.32 | 4-Oct-2020 | JAPAN | JAP | 64 | Graduate | No |
| Adams | ¥20000 | 7-Oct-2020 | AUS | | 58 | Primary | No |
| ... | ... | ... | ... | ... | ... | ... | ... |
| Jones | ₽3,250.11 | Nov 1, 2020 | HK | RUS | 43 | Graduate | No |
| Mary | ₽8,156.20 | Nov 1, 2020 | HK | N/A | 27 | Graduate | Yes |
| Max | €7475,11 | Nov 8, 2020 | UK | GER | 32 | Primary | No |
| Peter | ₽500.00 | Nov 9, 2020 | Hong Kong | RUS | 0 | Postgraduate | No |
| Anson | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS | 20 | Postgraduate | Yes |

Each row is an observation; the first seven columns are features and "Fraud?" is the target. Feature selection keeps only a subset of the feature columns.

Feature selection is the process of selecting a subset of relevant features for use in model construction
▪ Reasons for doing feature selection include
◦ to simplify models to make them easier to interpret by researchers / users
◦ to shorten training time
◦ to reduce the dimensionality of data involved
◦ to enhance generalization by reducing overfitting
◦ to reduce model scoring time (after model deployment)

The three main categories of supervised feature selection algorithms are filter, wrapper, and embedded methods

Filter Methods
▫ A proxy measure, often statistical, is used instead of the error rate to score a subset
▫ Computationally less expensive
▫ Selection is more general and has lower predictive performance

Wrapper Methods
▫ Each subset is used to train a model, and the model error rate provides the score for the subset
▫ Computationally very expensive
▫ Selection is usually good

Embedded Methods
▫ A catch-all group of techniques that are part of the model construction process
▫ Computational complexity is between filters and wrappers

Filter Methods
▪ Apply a statistical measure (e.g. correlation with the target) to assign a score to each variable, regardless of the ML model
▪ Variables are ranked by the score and are either kept or removed from the dataset
▪ Often univariate, considering each feature independently
▪ Tend to select redundant variables because the relationships between variables are not considered
▪ No consideration is given to the ML model during the filtering process; hence, the method may not select the right features for the model

Flow: all features → perform a statistical measure between each feature and the target → select features individually using some threshold on the measure → train the ML algorithm → assess performance (a minimal sketch follows)
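The flow above can be sketched with scikit-learn's SelectKBest using one filter-style score (the ANOVA F-test); the synthetic dataset, the choice of k, and the decision tree are illustrative assumptions rather than part of the original slide:

# filter-based selection: score each feature against the target, keep the top-k,
# then train and assess a model on the reduced feature set
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# synthetic stand-in dataset (the credit-card dataset used later would work the same way)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# score each feature independently with the ANOVA F-test and keep the 5 best
selector = SelectKBest(score_func=f_classif, k=5).fit(X_train, y_train)
X_train_sel, X_test_sel = selector.transform(X_train), selector.transform(X_test)

# train the ML algorithm on the selected features and assess performance
model = DecisionTreeClassifier(random_state=0).fit(X_train_sel, y_train)
print(accuracy_score(y_test, model.predict(X_test_sel)))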

Wrapper Methods
▪ Consider the selection as a search problem in which different feature combinations are prepared, evaluated, and compared against one another
▪ A predictive model is used to assign scores based on model accuracy
▪ Can detect possible interactions between variables
▪ Increase the overfitting risk when the number of observations is insufficient

Flow: all features → generate a subset → train the ML algorithm → assess performance → select the best subset (a minimal sketch follows)
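One wrapper-style search can be sketched with Recursive Feature Elimination (RFE) from scikit-learn, which repeatedly fits a model and drops the weakest features so that the model itself scores the candidate subsets; the synthetic dataset, the logistic regression estimator, and the subset size are illustrative assumptions:

# wrapper-based selection: the model guides the search for the best feature subset
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator=estimator, n_features_to_select=4)   # search for a 4-feature subset
X_reduced = rfe.fit_transform(X, y)

print(rfe.support_)                                           # mask of the selected features
print(cross_val_score(estimator, X_reduced, y, cv=5).mean())  # assess the chosen subset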

Embedded Methods
▪ Try to combine the advantages of both filter and wrapper methods
▪ A learning algorithm takes advantage of its own variable selection process and performs feature selection and assessment simultaneously

Flow: all features → generate a subset → train the ML algorithm + assess performance → select the best subset (a minimal sketch follows)
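As a hedged illustration of an embedded method, the sketch below uses L1-regularized (Lasso-style) logistic regression, whose penalty drives uninformative coefficients to zero during training itself, together with scikit-learn's SelectFromModel; the dataset and regularization strength are illustrative assumptions:

# embedded selection: the model's own training performs the variable selection
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=15, n_informative=4, random_state=0)

# the L1 penalty zeroes out uninformative coefficients while the model is fitted
lasso = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)

print(selector.get_support())          # which features the embedded method kept
print(selector.transform(X).shape)     # reduced feature matrix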

Filter-based Feature Selection

The choice of feature selection algorithm depends on the nature of the input features and the output target

| Features \ Target | Numerical target | Categorical target |
| Numerical features | Pearson's Correlation Coefficient (linear); Spearman's Rank Correlation Coefficient (non-linear) | ANOVA Correlation Coefficient (linear); Kendall's Rank Coefficient (non-linear) |
| Categorical features | ANOVA Correlation Coefficient (linear); Kendall's Rank Coefficient (non-linear) | Chi-Squared Test (contingency table); Mutual Information |

◦ Pearson's can be used on quantitative continuous variables
◦ Spearman's can be used on ordinal data when the ordered categories are replaced by their ranks
◦ Actually, mutual information is agnostic to data types (see the sketch below)
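As a small illustration of choosing a measure from the table above, the sketch below scores numerical features against a categorical target with both the ANOVA F-test (f_classif) and mutual information (mutual_info_classif, which also captures non-linear dependence); the synthetic dataset and feature names are hypothetical:

# compare two scoring functions for numerical features vs a categorical target
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif, mutual_info_classif

X, y = make_classification(n_samples=400, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f'feat_{i}' for i in range(6)])

f_scores, _ = f_classif(X, y)                             # linear association with the class label
mi_scores = mutual_info_classif(X, y, random_state=0)     # captures non-linear dependence too

print(pd.DataFrame({'anova_f': f_scores, 'mutual_info': mi_scores}, index=X.columns))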

Feature Selection using Pearson’s Correlation

Pearson’s Correlation
◦ The Pearson’s Correlation is a measure of the strength and direction of association that exists between two variables measured on at least an interval scale
◦ The coefficient measures the linear relationship between columns
◦ The coefficient value varies between -1 and +1
◦ The value 0 implies no correlation between columns
◦ Values closer to -1 or +1 imply an extremely strong linear relationship
◦ Pearson’s correlation coefficient generally requires that each column be normally distributed

Pearson’s correlation calculates the effect of change in one variable when the other variable changes
$$ r = \frac{N\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[N\sum x^{2} - \left(\sum x\right)^{2}\right]\left[N\sum y^{2} - \left(\sum y\right)^{2}\right]}} $$

where
◦ $N$ = the number of pairs of scores
◦ $\sum xy$ = the sum of the products of paired scores
◦ $\sum x$ = the sum of $x$ scores
◦ $\sum y$ = the sum of $y$ scores
◦ $\sum x^{2}$ = the sum of squared $x$ scores
◦ $\sum y^{2}$ = the sum of squared $y$ scores
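A small sketch (with made-up x and y values) that evaluates the formula above directly and checks the result against numpy's built-in coefficient:

# manual evaluation of Pearson's r versus numpy's corrcoef
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical paired scores
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
N = len(x)

# r = [N*sum(xy) - sum(x)*sum(y)] / sqrt{[N*sum(x^2) - (sum x)^2][N*sum(y^2) - (sum y)^2]}
numerator = N * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt((N * np.sum(x**2) - np.sum(x)**2) * (N * np.sum(y**2) - np.sum(y)**2))
r = numerator / denominator

print(r)                        # manual computation of the formula
print(np.corrcoef(x, y)[0, 1])  # should agree with the manual value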
As captions to the illustrative scatter plots on the original slide: 0.5 < |r| < 1.0 indicates a strong correlation, 0.3 < |r| < 0.5 a moderate correlation, 0.1 < |r| < 0.3 a weak correlation, and r ≅ 0.0 no correlation.

Default Credit Card Payments
◦ A dataset about customer default payments in Taiwan
◦ Number of observations = 30,000
◦ Number of features = 24
◦ From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible vs not credible clients)

The credit card default payment dataset

| # | Feature | Description |
| 1 | LIMIT_BAL | Credit amount in NT dollars |
| 2 | SEX | Gender: 1=male, 2=female |
| 3 | EDUCATION | Education: 1=postgraduate, 2=graduate, 3=secondary, 4=others |
| 4 | MARRIAGE | Marital status: 1=married, 2=single, 3=others |
| 5 | AGE | Age in years |
| 6-11 | PAY_0, PAY_2 - PAY_6 | Repayment status from September to April 2005: -1=paid duly, 1=1 month's delay, ..., 8=8 months' delay, 9=9 months' delay or longer |
| 12-17 | BILL_AMT1 - BILL_AMT6 | Bill statement amount in NT dollars from September to April 2005 |
| 18-23 | PAY_AMT1 - PAY_AMT6 | Amount of previous payment in NT dollars from September to April 2005 |
| 24 | default payment next month | Default payment: yes=1, no=0 |

Python: Correlation-based Feature Selection (1)

# load relevant packages
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# load the credit card default dataset
data = pd.read_csv('FIN7790-02-3-credit_card_default.csv', header=1, index_col=0)

# confirm the entire dataset is indeed loaded
data.shape

Python: Correlation-based Feature Selection (2)

# examine the first 5 rows
data.head().T

Python: Correlation-based Feature Selection (3)

# examine the statistics about the dataset
data.describe().T

Python: Correlation-based Feature Selection (4)

# check if there are any null values
# (assumes no preprocessing is required)
data.isnull().sum()

# partition the dataset into features & target
target = 'default payment next month'
X = data.drop(target, axis = 1)
y = data[target]

# show the normalized value counts
y.value_counts(normalize=True)

Python: Correlation-based Feature Selection (5)

# show the Pearson's correlation coefficients
data.corr()

Python: Correlation-based Feature Selection (6)

# show Pearson's correlation coefficient as a heatmap
sns.heatmap(data.corr())

Note that, for simplicity, no normalization is performed before computing the Pearson's coefficients.

Python: Correlation-based Feature Selection (7)

# list the coefficient against the target
data.corr()[target]

# list, for each feature, whether the absolute value of its
# coefficient against the target exceeds 0.2
data.corr()[target].abs() > 0.2

Python: Correlation-based Feature Selection (8)
# retain the most correlated features
key_features = data.columns[data.corr()[target].abs() > 0.2]
key_features

# display the retained features of the dataset
data_trimmed = data[key_features]
data_trimmed

Prediction accuracy may suffer or improve as a result of feature selection, depending on the choice of parameters

Before Feature Selection
| Model Name | Accuracy | Fit Time (sec) | Predict Time (sec) |
| Decision Tree | 0.8203 | 0.158 | 0.002 |

After Feature Selection
| Model Name | # of Features | Threshold | Accuracy | Fit Time (sec) | Predict Time (sec) |
| Decision Tree | 7 | 0.1 | 0.8206 | 0.105 | 0.003 |
| Decision Tree | 5 | 0.2 | 0.8197 | 0.010 | 0.002 |

(A hedged sketch of how such a comparison can be produced follows.)
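The sketch below shows one way such a before/after comparison could be produced; it assumes the data and target variables from the earlier correlation-selection snippets, the tree hyperparameters are not specified on the slide, and the resulting numbers will not exactly match the table:

# time a decision tree on all features, then on only the features whose absolute
# correlation with the target exceeds a threshold
import time
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def fit_and_score(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = DecisionTreeClassifier(random_state=0)
    start = time.time(); model.fit(X_tr, y_tr); fit_time = time.time() - start
    start = time.time(); pred = model.predict(X_te); predict_time = time.time() - start
    return accuracy_score(y_te, pred), fit_time, predict_time

# before feature selection: all features
print(fit_and_score(data.drop(target, axis=1), data[target]))

# after feature selection: correlation threshold of 0.1 and 0.2
for threshold in (0.1, 0.2):
    keep = data.columns[data.corr()[target].abs() > threshold].drop(target)
    print(threshold, len(keep), fit_and_score(data[keep], data[target]))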

Feature Selection using Hypothesis Testing

Hypothesis Testing
◦ Hypothesis testing is a method for testing a claim about a parameter in a population, using data measured in a sample
1) State the hypothesis
2) Set the criteria for a decision
3) Compute the test statistic
4) Make a decision
◦ The null hypothesis (H0) is a statement about the population parameter that is assumed to be true
◦ The reason for testing H0 is that we think it is wrong!
◦ An alternative hypothesis (H1) is a statement that directly contradicts H0 by stating that the population parameter is different from what is stated in H0 (an illustrative sketch of the four steps follows)
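An illustrative sketch of the four steps, using a one-sample t-test from scipy on made-up sample values; the same logic applies to the Chi-squared test introduced later:

# the four steps of hypothesis testing on a hypothetical sample
import numpy as np
from scipy import stats

sample = np.array([102.3, 98.7, 101.2, 105.4, 99.8, 103.6, 100.9, 104.1])

# 1) state the hypotheses: H0: population mean = 100, H1: population mean != 100
# 2) set the criteria for a decision: significance level alpha = 0.05
alpha = 0.05

# 3) compute the test statistic (and its p-value)
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

# 4) make a decision
if p_value < alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}: reject H0")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.3f}: fail to reject H0")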

Presumption of Innocence 無罪推定原則
▪ The presumption of innocence is the legal principle that one is considered "innocent until proven guilty". Under the presumption of innocence, the legal burden of proof is on the prosecution, which must present compelling evidence to the trier of fact (a judge or a jury)
▪ The presumption of innocence means that a person brought before a court should first be presumed innocent unless proven and adjudged guilty. Under this principle, the prosecutor bringing the charge bears the burden of proof and must collect sufficient reliable evidence to show that the accused is in fact guilty; and if the court is to convict, the evidence used must comply with legal restrictions and must establish guilt beyond reasonable doubt.

Chi-Squared Test
◦ The Chi-Squared test is used to determine whether a relationship between 2 categorical variables in a sample is likely to reflect a real association between these 2 variables in the population
◦ In the case of 2 variables being compared, the test can be interpreted as determining if there is a difference between the 2 variables
◦ The sample data is used to calculate a single number, the test statistic
◦ The size of the test statistic reflects the probability that the observed relationship between 2 variables has occurred by chance

After rolling a die 36 times, how can we determine whether the die is fair or unfair?
How would you draw the conclusion?

The Chi-squared test is used for categorical variables to reveal the divergence between observed and expected frequencies

Chi-Squared Score:

$$ \chi^{2} = \sum \frac{(\text{Observed Frequency} - \text{Expected Frequency})^{2}}{\text{Expected Frequency}} $$

where
◦ Observed Frequency = the number of observations of a class
◦ Expected Frequency = the number of expected observations of a class if there were no relationship between the feature and the target

The Chi-squared test sums the scaled squared deviations between observed and expected frequencies and compares the sum with the Chi-squared distribution
(Figure annotation: the rejection threshold is set by the chosen level of significance; a 95% confidence level corresponds to a 5% significance level.)

The threshold in the Chi-squared distribution for the corresponding degrees of freedom determines whether H0 is accepted or rejected

Conclusion
◦ test statistic (9.6) > threshold (9.236)
◦ this suggests the die is unbalanced
◦ reject the null hypothesis H0 (a sketch of this test in Python follows)
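A hedged sketch of this test with scipy: the observed counts below are hypothetical (the slide's own rolls produced a statistic of 9.6), while 9.236 is the Chi-squared critical value for 5 degrees of freedom at the 10% significance level, matching the threshold quoted above:

# Chi-squared test of a die for fairness
import numpy as np
from scipy import stats

observed = np.array([8, 10, 4, 3, 5, 6])        # hypothetical counts from 36 rolls
expected = np.full(6, observed.sum() / 6)       # fair-die expectation: 6 per face

chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
critical = stats.chi2.ppf(0.90, df=5)           # threshold at the 10% significance level, df = 6 - 1

print(chi2_stat, p_value, critical)
if chi2_stat > critical:
    print("reject H0: the die does not look fair")
else:
    print("fail to reject H0: no evidence the die is unfair")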

(Figures omitted: the original slides showed a worked Chi-squared example as images; the surviving annotation gives the Chi-squared distribution threshold as 2.706.)

Chi-Squared based Feature Selection
◦ Chi2 measures the distance between observed and expected frequencies
◦ The null hypothesis (H0) is that the observed frequencies for a categorical variable match the expected frequencies for that variable
◦ If the test statistic >= the threshold: the target depends on the feature; the result is significant; reject the null hypothesis (H0); the feature is to be retained
◦ If the test statistic < the threshold: the target does not depend on the feature; the result is not significant; fail to reject the null hypothesis (H0); the feature should be removed

Feature Transformation

Feature transformation creates new columns that are fundamentally different from the original dataset
▪ The original slide applies feature transformation to the same card-transaction table shown earlier: every observation and all of the feature columns (Name, Amount, Date, Issued In, Used In, Age, Education), together with the target (Fraud?), are mapped to an entirely new set of columns

Feature transformation creates an entirely new, structurally different dataset from the original dataset
▪ Feature selection processes are limited to selecting features from the original set of columns
▪ Feature transformation uses the original columns and combines them in useful ways to create new columns that are better at describing the data than any single column from the original dataset
▪ These algorithms create brand-new columns that are so powerful that only a few of them are needed to explain the entire dataset accurately

Feature transformation relies on matrix algorithms whereas feature learning relies on deep learning
▪ Feature transformation deploys a suite of algorithms designed to alter the internal structure of data to produce mathematically superior columns
▪ Feature learning focuses on using non-parametric algorithms (those that do not depend on the shape of the data) to automatically learn new features
▪ Feature transformation uses a set of matrix algorithms that structurally alter the dataset and produce what is essentially a brand-new matrix of data
◦ The basic idea is that the original features of a dataset are the descriptors / characteristics of the data points, and it should be possible to create a new set of features that explains the data points just as well, perhaps even better, with fewer columns

Feature Transformation using Principal Component Analysis

Principal Component Analysis (PCA)
◦ PCA is used to extract the important information from a multivariate dataset and to express this information as a set of a few new variables called principal components
◦ The principal components explain most of the patterns & latent structures observed in the original dataset
◦ This is often possible with only a few principal components
◦ An unsupervised dimension reduction technique providing a new lower-dimensional variable space onto which the dataset is projected
◦ A linear static transformation applied using matrix multiplication
Graphically, PCA finds new orthogonal dimensions that capture the largest variances
◦ The dataset is represented in the X-Y coordinate system
◦ The PC1 axis is the first principal direction, giving the largest sample variation
◦ The PC2 axis is the second most important direction, orthogonal to the PC1 axis

The original data in a 2-dimensional space can be effectively represented in a 1-dimensional space
◦ The dimension reduction is achieved by identifying the principal directions, called principal components
◦ PCA assumes that the directions with the largest variances are the most important
◦ In this example, the two-dimensional data can be reduced to a single dimension by projecting each data point onto the first principal component

The computation of PCA is typically done using linear algebra and the identification of eigenvectors & eigenvalues (a minimal numpy sketch follows this slide)
(1) Start with correlated data of n dimensions
(2) Center (don't scale) the data: we want the dimensions of the highest variance
(3) Compute the covariance matrix, e.g. $\begin{pmatrix} cov(h,h) & cov(h,u) \\ cov(u,h) & cov(u,u) \end{pmatrix} = \begin{pmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{pmatrix}$
(4) Compute the eigenvectors and eigenvalues of the covariance matrix: $\begin{pmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{pmatrix}\begin{pmatrix} e_h \\ e_u \end{pmatrix} = \lambda_e \begin{pmatrix} e_h \\ e_u \end{pmatrix}$ and $\begin{pmatrix} 2.0 & 0.8 \\ 0.8 & 0.6 \end{pmatrix}\begin{pmatrix} f_h \\ f_u \end{pmatrix} = \lambda_f \begin{pmatrix} f_h \\ f_u \end{pmatrix}$; there are n orthogonal eigenvectors for data of n dimensions
(5) Keep the top k eigenvalues (sorted in descending order)
(6) Project the data by multiplying the transpose of the feature vector with the eigenvectors corresponding to the top eigenvalues, $x_e = x^{\tau} e$, giving uncorrelated data of lower dimensionality

Some reminders on linear algebra
▪ Variance measures how the data is spread along a dimension: $var(x) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^{2}}{n}$
▪ Covariance identifies the dependencies and relationships between the characteristics of a dataset: $cov_{x,y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n}$
▪ The sign of $cov(x,y)$ is the key: positive means both dimensions increase together; negative means one dimension increases while the other decreases; 0 means the two dimensions are independent of each other
▪ An eigenvector ($\mathbf{v}$) of a linear transformation ($A$) is a non-zero vector (typically a unit vector) that changes only by a scalar factor ($\lambda$) when that linear transformation is applied: $A\mathbf{v} = \lambda\mathbf{v}$, i.e. $\begin{pmatrix} t_{11} & \cdots & t_{1n} \\ \vdots & \ddots & \vdots \\ t_{m1} & \cdots & t_{mn} \end{pmatrix}\begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix} = \lambda \begin{pmatrix} v_1 \\ \vdots \\ v_n \end{pmatrix}$; the corresponding eigenvalue is the factor by which the eigenvector is scaled
▪ Eigenvectors and eigenvalues come in pairs for a given linear transformation
▪ The majority of the variance in the original dataset can be effectively explained by a few principal components

Reference: "Understanding the Mathematics behind Principal Component Analysis" (https://heartbeat.fritz.ai/understanding-the-mathematics-behind-principal-component-analysis-efd7c9ff0bb3)
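A minimal numpy sketch of the six steps above, using randomly generated correlated data (with a covariance close to the 2x2 example matrix) as a stand-in for a real dataset:

# PCA via the covariance matrix and its eigendecomposition
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[5.0, 3.0], cov=[[2.0, 0.8], [0.8, 0.6]], size=200)  # (1) correlated data

X_centered = X - X.mean(axis=0)                   # (2) center, don't scale
cov = np.cov(X_centered, rowvar=False)            # (3) covariance matrix (close to [[2.0, 0.8], [0.8, 0.6]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)   # (4) eigenvalues/eigenvectors (returned in ascending order)
order = np.argsort(eigenvalues)[::-1]             # (5) sort descending and keep the top k = 1
top_k = eigenvectors[:, order[:1]]

X_projected = X_centered @ top_k                  # (6) project onto the retained principal component(s)
print(eigenvalues[order])                         # variance explained by each component
print(X_projected.shape)                          # 2-dimensional data reduced to 1 dimension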
Python: Using Principal Component Analysis (1)

# load relevant packages and data
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

boston_dataset = load_boston()
data = pd.DataFrame(boston_dataset.data, columns = boston_dataset.feature_names)
data['MEDV'] = boston_dataset.target

# separate the features from the target
X = data.drop('MEDV', axis = 1)
y = data['MEDV']

Python: Using Principal Component Analysis (2)

# show the first 5 observations
data.head()

# show the number of rows and number of columns of the dataset
data.shape

Python: Using Principal Component Analysis (3)

# split the dataset into training dataset (70%) and testing dataset (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
X_train.shape

# set up the PCA
# n_components=None will keep as many components as there are features
# components will be ranked and selected in subsequent steps
pca = PCA(n_components = None)

# train the PCA with the training dataset
pca.fit(X_train)

Python: Using Principal Component Analysis (4)

# a few of the components will capture most of the variance of the original dataset
# to identify how many components capture most of the variability,
# plot the percentage of variance explained (by each component)
# versus the component number
plt.plot(pca.explained_variance_ratio_, linewidth = 2)
plt.title('Percentage of Variance Explained')
plt.xlabel('Number of Components')
plt.ylabel('Percentage of Variance Explained')

# the plot indicates that the first two components can be used
# to train machine learning models using a linear model

Python: Using Principal Component Analysis (5)

# transform the training and testing datasets
X_train_transformed = pca.transform(X_train)
X_test_transformed = pca.transform(X_test)
print(X_train_transformed)

Python: Using Principal Component Analysis (6)

# reduce the dimensionality of the dataset based on the result of PCA
X_train_trimmed = pd.DataFrame(X_train_transformed[:, 0:2])
X_train_trimmed.head()

# show the number of rows and columns of the reduced dataset
X_train_trimmed.shape

Feature Learning

Feature learning relieves the restriction on the original dataset and uses deep learning to create new columns
▪ The original slide applies feature learning to the same card-transaction table shown earlier: every observation and all of the feature columns (Name, Amount, Date, Issued In, Used In, Age, Education), together with the target (Fraud?), are used to learn an entirely new set of columns
Feature Learning
◦ Creates brand-new features from existing features, making no assumption about the shape of the data
∙ Feature learning algorithms are not parametric
◦ Relies on stochastic learning
∙ Instead of applying the same equation to the data every time, the algorithms discover the best features by looking at the data over and over again (in epochs) and converging onto a solution (potentially a different one at each run)
◦ Can learn fewer or more features than are in the original dataset; the exact number of features to learn depends on the problem and can be grid-searched

Feature Transformation vs Feature Learning

| | Feature Transformation Algorithms | Feature Learning Algorithms |
| Parametric | Yes | No |
| Simple to use | Yes | No |
| Creates a new feature set | Yes | Yes |
| Example algorithms | PCA, LDA | Deep learning |

▪ A model being non-parametric does not mean that no assumptions at all are made by the model during training
▪ Feature learning algorithms forgo the assumption on the shape of the data, but they may still make assumptions about other aspects of the data (e.g., the variable values)

References

▪ "Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists", Alice Zheng & Amanda Casari, O'Reilly Media, April 2018, ISBN-13: 978-1-491-95324-2
▪ "Python Feature Engineering Cookbook", Soledad Galli, Packt Publishing, January 2020, ISBN-13: 978-1-789-80631-1
▪ "Feature Engineering Made Easy", Sinan Ozdemir & Divya Susarla, Packt Publishing, January 2018, ISBN-13: 978-1-787-28760-0
▪ "How to Choose a Feature Selection Method for Machine Learning", Jason Brownlee, November 2019 (https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
▪ "Correlation Coefficient Calculator" (https://byjus.com/correlation-coefficient-calculator/)
▪ "How to Perform Feature Selection with Categorical Data", Jason Brownlee, November 2019 (https://machinelearningmastery.com/feature-selection-with-categorical-data/)
▪ "A Gentle Introduction to the Chi-Squared Test for Machine Learning", Jason Brownlee, June 2018 (https://machinelearningmastery.com/chi-squared-test-for-machine-learning/)
▪ "Chi-Squared Test for Feature Selection in Machine Learning", Sampath Kumar Gajawada, October 2019 (https://towardsdatascience.com/chi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223)
▪ "Chi-Squared Test Calculator" (https://www.socscistatistics.com/tests/chisquare2/default2.aspx)
▪ "A Tutorial on Principal Components Analysis", Lindsay Smith, February 2002 (http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf)

THANK YOU