Machine Learning for Financial Data
December 2020
FEATURE ENGINEERING (CONCEPTS – PART 1)

Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 2
Feature Engineering
Contents
◦ Financial Data Sources
◦ What is Feature Engineering
◦ Feature Understanding
◦ Feature Improvement

Financial Data Source Yahoo Finance

Yahoo Finance is one of the reliable sources of stock market data
▪ Yahoo Finance (hk.finance.yahoo.com) provides market summaries, historical & current quotes, and news feeds about companies
▪ Historical & current stock prices in different frequencies (daily, weekly, monthly)
▪ Calculated metrics
▪ e.g., the beta, a measure of the volatility of an individual asset in comparison to the volatility of the entire market
▪ Financial data of a company since its listing in the stock market
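The beta mentioned above is commonly estimated as the covariance between asset and market returns divided by the variance of the market returns. The sketch below illustrates that ratio, assuming NumPy is available. The function name `estimate_beta` and the return series are made up for illustration and are not real quotes.

```python
import numpy as np

def estimate_beta(asset_returns, market_returns):
    """Estimate beta as Cov(asset, market) / Var(market) using sample statistics."""
    asset = np.asarray(asset_returns, dtype=float)
    market = np.asarray(market_returns, dtype=float)
    covariance = np.cov(asset, market, ddof=1)[0, 1]
    market_variance = np.var(market, ddof=1)
    return covariance / market_variance

# hypothetical daily returns, for illustration only
market = [0.010, -0.005, 0.007, -0.012, 0.003]
asset = [2 * r for r in market]   # an asset that moves twice as much as the market
beta = estimate_beta(asset, market)
print(round(beta, 4))
```

A beta above 1 suggests the asset has been more volatile than the market over the sample, and a beta below 1 suggests it has been less volatile.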
Adjusted closing price: the closing price adjusted for both dividends and splits

Python: Programmatic Access to Financial Data
# display the output of plotting commands inline
%matplotlib inline
# use the "retina" display mode, i.e. render higher-resolution images
%config InlineBackend.figure_format = 'retina'
# import the plotting module of the matplotlib package and bind it to the name "plt"
import matplotlib.pyplot as plt
# import the warnings module to control which warnings are displayed
import warnings
# customize the display style
# (in Matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
# set the dots per inch (dpi) from the default 100 to 300
plt.rcParams['figure.dpi'] = 300
# suppress warnings related to future versions
warnings.simplefilter(action='ignore', category=FutureWarning)

Python: Downloading Data as DataFrame
# import the relevant packages
import pandas as pd
import yfinance as yf
# download the data from 2010-01-01 to 2020-12-31
# use "MLCO" as the ticker of "Melco Resorts & Entertainment"
# disable the progress bar using "progress=False"
data = yf.download('MLCO', start='2010-01-01',
                   end='2020-12-31', progress=False)
# inspect the data using formatted print
print(f'Downloaded {data.shape[0]} rows of data.')
data.head()

Financial Data Source Quandl

Quandl is a provider of alternative data products for investment professionals
▪ Quandl delivers market data from hundreds of sources via API, or directly into Python, R, Excel, and many other tools
▪ Featured data includes
▪ End of Day US Stock Prices, Core US Fundamentals Data, US Equity Historical & Option Implied Volatilities, Continuous Futures, Trading Economics, BNC Digital Currency Indexed EOD, Global Fundamentals Data, Global Index Prices
▪ Before downloading data, create an account (https://www.quandl.com)
▪ Obtain the API key in the profile (https://www.quandl.com/account/profile)
▪ Search data function (https://www.quandl.com/search)

Create a Quandl account

Python: Programmatic Data Access
# import the relevant packages
import pandas as pd
import quandl as qd
# use the API key generated during account creation
qd.ApiConfig.api_key = 'GEM…………..xwB'
# provide the DATASET/TICKER of the dataset and the date range
data = qd.get('HKEX/06883',
              start_date='2010-01-01', end_date='2020-12-31')
# inspect the data
print(f'Downloaded {data.shape[0]} rows of data.')
data.head()

What is Feature Engineering

Feature Engineering
Feature engineering is the process of transforming data into features that better represent the underlying problem, resulting in improved machine learning performance.

Feature engineering is about making data meaningful to the machine learning model
Raw and Partially Processed Data
Feature engineering can be applied to data at any stage and deals with raw & partially processed data typically in the form of observations (rows) and attributes (columns).
Meaningful Features
A feature is an attribute of data that is meaningful to the machine learning process. Some attributes can be unhelpful or even harmful to the machine learning process.
Better Representation
Data always serve to represent a specific problem in a specific domain.
The rationale is to transform data so that it better represents the bigger problem at hand.
Model Performance Improvement
The eventual goal of feature engineering is to obtain data from which the learning algorithms can extract patterns and thereby obtain better results.

Raw data are often in a state that cannot be directly consumed by machine learning algorithms
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud? | Membership | …
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No     | Silver     | …
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes    | Silver     | …
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes    | Bronze     | …
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No     | Gold       | …
Adams  | ¥20000    | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No     | Silver     | …
…      | …         | …           | …         | …       | …    | …            | …      | …          | …
Jones  | ₽3,250.11 | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No     | Silver     | …
Mary   | ₽8,156.20 | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes    | Gold       | …
Max    | €7475,11  | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No     | Premium    | …
Peter  | ₽500.00   | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No     | Bronze     | …
Anson  | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes    | Gold       | …
Figure labels: Observations (rows), Feature and Target (columns)

Feature engineering can be carried out in steps but different schools have different thoughts on structuring the steps
▪ Feature Understanding: interpreting the data and identifying its qualitative and quantitative states
▪ Feature Improvement: cleaning data values and filling in missing data values
▪ Feature Selection: selecting features to reduce the noise in the dataset
▪ Feature Construction: encoding features and creating new features via feature interactions
▪ Feature Transformation: changing the fundamental structure of the dataset to prioritize impactful features
▪ Feature Learning: deploying deep learning to help identify the key features in the dataset

Feature Understanding

Correctly identifying numerical & categorical variables involves looking at the data types & inspecting their values
ALL DATA
▪ UNSTRUCTURED, which can be transformed into STRUCTURED
▪ STRUCTURED
▪ QUALITATIVE / CATEGORICAL: NOMINAL, ORDINAL
▪ QUANTITATIVE / NUMERICAL: INTERVAL, RATIO
Categorical variables
◦ Values are selected from a group of categories, also called labels
◦ Usually of type object or string
◦ The set of possible values is finite
◦ Nominal example: gender (i.e. male and female)
◦ Ordinal example: school grades (e.g. A+, A, …)
Discrete numerical variables
◦ Generally whole numbers (e.g. 1, 2, …)
◦ Usually of type int
◦ Examples: number of children, number of pets
Continuous numerical variables
◦ Values may take any number within a range
◦ Usually of type float
◦ Examples: price of a product, income, house price, or interest rate

Understanding the Structure of Data

Structured Data
▪ Any data that can be stored, accessed, and processed in a fixed format
▪ The format is known in advance
▪ e.g. data stored in a relational database
▪ Also referred to as Schema-on-Write
Semi-Structured Data
▪ Semi-structured data lacks a fixed, rigid structure
▪ There is no separation between the data and the schema, i.e. the structure is self-describing
▪ e.g. XML files, JSON files, web pages in HTML, RDF files
Unstructured Data
▪ Any data with unknown form
▪ e.g. heterogeneous data sources containing simple text files, images & videos
▪ Also referred to as Schema-on-Read

Data in a relational database or an Excel spreadsheet are considered structured data
Figure panels: Data in a Relational Database; Data in an Excel Spreadsheet

Semi-structured data embeds information about the data structure and the data contents in the same document

Web Page in HTML format
[Rendered sample page: a title, a link to another site, a header, a medium header, a mail link to support@yourcompany.com, two paragraphs, and a sentence in bold italics]

Data-Value Pairs in JSON format
{
  "quiz": {
    "sport": {
      "q1": {
        "question": "Which one is correct team name in NBA?",
        "options": [
          "New York Bulls",
          "Huston Rocket"
        ],
        "answer": "Huston Rocket"
      }
    }
  }
}
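Because a JSON document carries its own field names, it can be parsed without a separately declared schema and then flattened into rows. The sketch below does this for the quiz example above using only the Python standard library. The record layout (`topic`, `id`, `n_options`, and so on) is an illustrative choice, not something prescribed by the slides.

```python
import json

# the quiz document from the slide, stored as a JSON string
document = """
{
  "quiz": {
    "sport": {
      "q1": {
        "question": "Which one is correct team name in NBA?",
        "options": ["New York Bulls", "Huston Rocket"],
        "answer": "Huston Rocket"
      }
    }
  }
}
"""

parsed = json.loads(document)

# flatten the nested structure into one record (row) per question
records = []
for topic, questions in parsed["quiz"].items():
    for question_id, body in questions.items():
        records.append({
            "topic": topic,
            "id": question_id,
            "question": body["question"],
            "n_options": len(body["options"]),
            "answer": body["answer"],
        })

print(records)
```

Each dictionary in `records` is one observation with named attributes, which is the structured form that later feature engineering steps expect.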

Unstructured data are not really unstructured but structured in a way that is less convenient to manipulate
Image in JPEG format
Audio in WAVE

Most unstructured data can be transformed into structured data through a few manipulations
▪ Images are considered unstructured data
▪ Images can be decomposed into 3 color channels: red, green, blue
▪ Images are represented as structured data in the form of layers of matrices containing the color intensity value for each pixel
Figure labels: transformation, matrix, row
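The transformation from image to matrices to a row can be sketched with NumPy. The tiny 2x3 image below is made up for illustration. A real image would first be loaded with an imaging library, which is outside the scope of this sketch.

```python
import numpy as np

# a hypothetical 2x3 RGB image: shape is (height, width, channels)
image = np.array([
    [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
    [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
], dtype=np.uint8)

# decompose into the three color channels, each a height x width matrix
red, green, blue = image[:, :, 0], image[:, :, 1], image[:, :, 2]

# flatten all pixel intensities into a single row of features
row = image.reshape(-1)

print(red.shape, row.shape)
```

Flattening discards the spatial arrangement of the pixels, so it is only one of several possible ways to turn an image into a structured observation.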

Training autonomous cars using semantic segmentation of road scenes

CSV (Comma-Separated Values) Format
Text File
Used to represent tabular structure
Each line is a record
Each record has multiple columns separated by comma
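The line-per-record, comma-per-column layout can be read with Python's standard `csv` module. The sketch below uses a small made-up CSV text with a header line. The column names are borrowed from the sample table in these slides.

```python
import csv
import io

# a small CSV text: the first line is the header, each later line is a record
text = "Name,Age,Education\nDaniel,22,Secondary\nVicky,64,Graduate\n"

reader = csv.DictReader(io.StringIO(text))
records = list(reader)

print(reader.fieldnames)
print(records[0]["Name"], records[0]["Age"])
```

Every value comes back as a string, so numerical columns such as Age still need a type conversion before modelling.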

Feature Improvement

Feature improvement is about altering data values and removing dataset columns/rows
▪ Feature improvement involves both feature cleaning and removal
▪ Cleaning alters columns and rows in the dataset
▪ Removal takes columns and rows away from the dataset
▪ Possible actions include
∙ Identifying missing values
∙ Removing harmful data
∙ Imputing (filling in) missing values
∙ Normalizing/standardizing data
∙ Z-score normalization, min-max scaling, L1 & L2 normalization
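The four normalization schemes listed above differ only in what they subtract and what they divide by. The sketch below applies each to a single made-up vector with NumPy. In practice, z-score and min-max scaling are usually applied per feature (column), while L1 and L2 normalization are usually applied per observation (row), so using one vector for all four is a simplification.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# z-score normalization: zero mean and unit (population) standard deviation
z_score = (x - x.mean()) / x.std()

# min-max scaling: rescale values into the range [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# L1 normalization: divide by the sum of absolute values
l1 = x / np.abs(x).sum()

# L2 normalization: divide by the Euclidean length
l2 = x / np.sqrt((x ** 2).sum())

print(z_score, min_max, l1, l2, sep="\n")
```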

Feature understanding focuses on data values and data value induced structural changes
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud?
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No
Adams  | ¥20000    | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No
…      | …         | …           | …         | …       | …    | …            | …
Jones  | ₽3,250.11 | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No
Mary   | ₽8,156.20 | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes
Max    | €7475,11  | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No
Peter  | ₽500.00   | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No
Anson  | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes
Figure labels: Observations (rows), Feature and Target (columns)

Missing data is a common problem in datasets and needs to be dealt with before applying any ML model
▪ Missing data refers to the absence of values for certain observations and is an unavoidable problem in most data sources
▪ e.g. with survey data, some observations may not have been recorded
▪ Most scikit-learn estimators do not support missing values as input, so it is necessary to take one of the following actions
▪ remove observations with missing data
▪ transform them into permitted values
▪ The goal of any imputation technique is to clean the data to produce a complete dataset that can be used to train ML models
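Before choosing between removal and imputation, it helps to measure how much data is actually missing. The sketch below counts missing values per column with pandas. The small DataFrame is a hypothetical slice of the sample table, with `None` and `NaN` standing in for the missing cells.

```python
import numpy as np
import pandas as pd

# a hypothetical slice of the sample table; NaN / None mark missing entries
df = pd.DataFrame({
    "Name": ["Daniel", "Alex", "Adrian", "Vicky"],
    "Used In": ["HK", "RUS", None, "HK"],
    "Age": [22, np.nan, 25, 64],
})

# count and share of missing values per column
missing_count = df.isna().sum()
missing_share = df.isna().mean()

print(missing_count)
print(missing_share)
```

Placeholder strings such as "N/A" are ordinary text to `isna()`, so they need to be converted to real missing values first if they were not already handled when the data was loaded.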

Data Type Rectification

Data Type Rectification
◦ When processing data using Python dataframes, some type mismatch might occur during the loading of data
◦ Normal practice is to store
∙ Discrete variables as the int type
∙ Continuous variables as the float type
∙ Categorical variables as the object type
◦ However, discrete variables can sometimes be cast (loaded) as float, e.g. pandas stores an integer column that contains missing values as float
◦ To correctly identify variable types, both data types and data values need to be inspected
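Inspecting both the data types and the values can be sketched as follows with pandas. The DataFrame and its column names are made up for illustration. The conversion to the nullable `Int64` type is one possible rectification, not the only one.

```python
import numpy as np
import pandas as pd

# hypothetical data: "children" is discrete but holds a missing value
df = pd.DataFrame({
    "children": [0, 2, np.nan, 1],
    "income": [2600.45, 2294.58, 1003.30, 8488.32],
    "education": ["Secondary", "Postgraduate", "Graduate", "Graduate"],
})

# the missing value forces pandas to store "children" as float64
original_dtype = str(df["children"].dtype)
print(df.dtypes)

# inspecting the values shows every non-missing entry is a whole number
non_missing = df["children"].dropna()
is_discrete = bool((non_missing % 1 == 0).all())

# pandas' nullable integer type keeps whole numbers alongside missing values
df["children"] = df["children"].astype("Int64")
print(df.dtypes)
```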

The “Date” values might be loaded as the “string” type, but the “datetime” type would be more appropriate
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud?
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No
Adams  | ¥20000    | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No
…      | …         | …           | …         | …       | …    | …            | …
Jones  | ₽3,250.11 | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No
Mary   | ₽8,156.20 | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes
Max    | €7475,11  | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No
Peter  | ₽500.00   | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No
Anson  | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes
Figure labels: Observations (rows), Feature and Target (columns)
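The "Date" column in the sample table mixes two textual formats ("1-Jul-2020" and "Nov 1, 2020"), so a conversion has to cope with both. The sketch below tries each known format in turn. The helper `parse_date` and the list of formats are assumptions made for this illustration, and the month abbreviations assume an English locale.

```python
from datetime import datetime

import pandas as pd

# the two date formats that appear in the sample table
KNOWN_FORMATS = ("%d-%b-%Y", "%b %d, %Y")

def parse_date(text):
    """Try each known format in turn; raise if none of them matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")

df = pd.DataFrame({"Date": ["1-Jul-2020", "7-Oct-2020", "Nov 1, 2020"]})

# replace the string column with a datetime column
df["Date"] = pd.to_datetime(df["Date"].map(parse_date))

print(df["Date"].dt.month.tolist())
```

Once the column holds datetime values, parts such as the month or the day of the week can be extracted as candidate features.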

Missing Value Removal

Complete Case Analysis (CCA)
◦ Discarding those observations where the values in any of the variables are missing
◦ Can be applied to categorical and numerical variables
◦ Preserves the distribution of the variables, provided the data is missing at random and only a small proportion of the data is missing
◦ However, if data is missing across many variables, CCA may lead to the removal of a big portion of the dataset
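Complete case analysis can be sketched with pandas' `dropna`. The DataFrame below is a made-up slice of the sample table. Checking the share of retained rows is a simple way to see whether CCA would discard too much of the dataset.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Used In": ["HK", "RUS", None, "HK", "JAP"],
    "Age": [22, np.nan, 25, 64, 58],
    "Fraud?": ["No", "Yes", "Yes", "No", "No"],
})

# complete case analysis: keep only the rows with no missing value in any column
complete_cases = df.dropna(axis=0, how="any")

# share of observations that survive, worth checking before committing to CCA
retained = len(complete_cases) / len(df)
print(len(complete_cases), retained)
```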

Missing value removal involves the removal of an entire observation containing the missing value from the dataset
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud?
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No
Adams  | ¥20000    | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No
…      | …         | …           | …         | …       | …    | …            | …
Jones  | ₽3,250.11 | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No
Mary   | ₽8,156.20 | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes
Max    | €7475,11  | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No
Peter  | ₽500.00   | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No
Anson  | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes
Figure labels: Observations (rows), Feature and Target (columns)

Data Imputation

Imputation
◦ Imputation is the replacement of missing values with statistical estimates of the missing values
◦ There are multiple imputation techniques that can be deployed
◦ The choice of imputation technique will depend on
∙ whether the data is missing at random
∙ the number of missing values
∙ the machine learning model to use

Imputation techniques vary between numerical variables and categorical variables
Numerical Variable
◦ Mean / Median imputation
◦ Arbitrary number imputation
◦ End of distribution imputation
◦ Random sampling imputation
◦ Missing value indicator augmentation
◦ Multivariate imputation by chained equations (MICE)
Categorical Variable
◦ Mode imputation
◦ Random sampling imputation
◦ Bespoke category imputation
◦ Missing value indicator augmentation
◦ Multivariate imputation by chained equations (MICE)
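Three of the categorical techniques listed above can be sketched in a few lines of pandas. The column values are made up, and the label "Missing" used for the bespoke category is an arbitrary choice for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"Used In": ["HK", "RUS", None, "HK", None]})

# missing value indicator: a binary column flagging where the value was absent
df["Used In_missing"] = df["Used In"].isna().astype(int)

# mode imputation: fill with the most frequent category
mode_value = df["Used In"].mode()[0]
df["Used In_mode"] = df["Used In"].fillna(mode_value)

# bespoke category imputation: fill with a dedicated label
df["Used In_bespoke"] = df["Used In"].fillna("Missing")

print(df)
```

The indicator column is typically kept alongside an imputed column, so that the model can still see which values were originally absent.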

Mean or Median Imputation
▪ Replacing missing values with variable mean or median
▪ Can only be performed on numerical variables
▪ The mean / median is calculated using the training dataset and is then used to impute missing data in the training, testing, and future datasets
▪ Use mean imputation if variables are normally distributed and median imputation otherwise
▪ Mean and median imputation may distort the distribution of the original variables if there is a high percentage of missing data
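The point about computing the statistic on the training dataset only can be sketched as follows with pandas. The age values are made up, and the median is used here purely as an example; the mean would be applied in the same way.

```python
import numpy as np
import pandas as pd

# hypothetical training and testing sets with missing ages
train = pd.DataFrame({"Age": [22, np.nan, 25, 64, 58]})
test = pd.DataFrame({"Age": [43, np.nan, 27]})

# learn the statistics from the training set only (missing values are skipped)
train_mean = train["Age"].mean()
train_median = train["Age"].median()

# apply the training statistic to both sets
train_imputed = train["Age"].fillna(train_median)
test_imputed = test["Age"].fillna(train_median)

print(train_mean, train_median)
print(test_imputed.tolist())
```

Reusing the training statistic on the testing set avoids leaking information from the testing data into the preprocessing step.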

Missing values can be replaced with the mean or median of the non-missing values of the feature
Name   | Amount    | Date        | Issued In | Used In | Age  | Education    | Fraud?
Daniel | $2,600.45 | 1-Jul-2020  | HK        | HK      | 22   | Secondary    | No
Alex   | $2,294.58 | 1-Oct-2020  | HK        | RUS     | None | Postgraduate | Yes
Adrian | $1,003.30 | 3-Oct-2020  | HK        |         | 25   | Graduate     | Yes
Vicky  | $8,488.32 | 4-Oct-2020  | JAPAN     | HK      | 64   | Graduate     | No
Adams  | ¥20000    | 7-Oct-2020  | AUS       | JAP     | 58   | Primary      | No
…      | …         | …           | …         | …       | …    | …            | …
Jones  | ₽3,250.11 | Nov 1, 2020 | HK        | RUS     | 43   | Graduate     | No
Mary   | ₽8,156.20 | Nov 1, 2020 | HK        | N/A     | 27   | Graduate     | Yes
Max    | €7475,11  | Nov 8, 2020 | UK        | GER     | 32   | Primary      | No
Peter  | ₽500.00   | Nov 9, 2020 | Hong Kong | RUS     | 0    | Postgraduate | No
Anson  | ₽7,475.11 | Nov 9, 2020 | Hong Kong | RUS     | 20   | Postgraduate | Yes
Figure labels: Observations (rows), Feature and Target (columns), mean / median (replacement for a missing value)

The choice between removal and a particular imputation technique is decided by which one yields the better model accuracy
# | Imputation Technique                    | # of rows in the training dataset | Accuracy
1 | Dropping rows with missing values       | 392                               | 0.74489
2 | Imputing missing values with zero       | 768                               | 0.7304
3 | Imputing missing values with the mean   | 768                               | 0.7318
4 | Imputing missing values with the median | 769                               | 0.7357

References

References
Understanding Machine Learning
▪ “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists”, Alice Zheng & Amanda Casari, O’Reilly Media, April 2018, ISBN-13: 978-1-491-95324-2

Feature Engineering
▪ “Data Types in Statistics” (https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/)
▪ “Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio” (https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/)
▪ “Measures of Central Tendency” (https://statistics.laerd.com/statistical-guides/measures-central-tendency-mean-mode-median.php)
▪ “Scales of Measurement and Presentation of Statistical Data” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6206790/)

Financial Datasets
▪ “yfinance 0.1.54”, 2019 (https://pypi.org/project/yfinance/)
▪ “quandl/quandl-python”, 2019 (https://github.com/quandl/quandl-python)

Public Datasets
▪ Academic Torrents (https://academictorrents.com/browse.php?cat=6)
▪ Awesome Public Datasets (https://github.com/awesomedata/awesome-public-datasets)
▪ Awesome JSON Datasets (https://github.com/jdorfman/awesome-json-datasets)
▪ Common Crawl (http://commoncrawl.org/the-data/)
▪ DataHub Datasets (https://datahub.io/search)
▪ Kaggle Datasets (https://www.kaggle.com/datasets)
▪ GitHub Archive (http://www.gharchive.org/)
▪ GitHub COCO-Stuff Datasets (https://github.com/nightrome/cocostuff)
▪ Harvard Resources for COVID-19 (https://dataverse.harvard.edu/dataverse/2019ncov)
▪ GitHub COVID-19 Data (https://github.com/owid/covid-19-data/tree/master/public/data/)
▪ Coronavirus Source Data (https://ourworldindata.org/coronavirus-source-data)
▪ OCHA Novel Coronavirus (COVID-19) Cases Data (https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases)
▪ World Bank Open Data (https://data.worldbank.org/)
▪ Hong Kong Government Open Data (https://data.gov.hk/en/)
▪ US Government Open Data (https://www.data.gov/open-gov/)
▪ Taiwan Government Open Data (https://data.gov.tw/)
▪ Dataquest – 18 Places to Find Free Data Sets for Data Science Projects (https://www.dataquest.io/blog/free-datasets-for-projects/)
▪ Google Dataset Search (https://datasetsearch.research.google.com/)
▪ UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php)
▪ KDnuggets Datasets for Data Mining and Data Science (https://www.kdnuggets.com/datasets/index.html)
▪ Google BigQuery Public Datasets (https://cloud.google.com/bigquery/public-data/)
▪ Google Research Datasets (https://research.google/tools/datasets/)
▪ Microsoft Public Data Sets for Testing and Prototyping (https://docs.microsoft.com/en-us/azure/sql-database/sql-database-public-data-sets)
▪ Amazon Web Services Registry of Open Data (https://registry.opendata.aws/)
▪ Pathmind Open Datasets (https://pathmind.com/wiki/open-datasets)
▪ Lionbridge 15 Best Audio Datasets for Machine Learning (https://lionbridge.ai/datasets/12-best-audio-datasets-for-machine-learning/)
▪ Google COVID-19 Public Datasets (https://console.cloud.google.com/marketplace/details/bigquery-public-datasets/covid19-dataset-list?preview=bigquery-public-datasets)
▪ Google Public Data (https://www.google.com/publicdata/directory?hl=en_US&dl=en_US#!)
▪ Tableau – Coronavirus (COVID-19) Global Data Tracker (https://www.tableau.com/covid-19-coronavirus-data-resources)

THANK YOU
