MFIN 290 Application of Machine Learning in Finance: Lecture 1
Edward Sheng
6/26/2021
Background of lecturer
Portfolio Manager and Director of Quantitative Research at Pacific Life
CFA and CAIA charterholder
Manage over $30 billion in Pacific Life asset allocation funds (Pacific Funds and Pacific Select Funds)
Apply machine learning models to real investments
Previous experience
Senior Researcher, Research Affiliates
Summer Associate, Citigroup
Master, Financial Engineering, UCLA
PhD, Engineering, Arizona State University
BS, Nanjing University
A guide to decipher the maze of machine learning
The goal of this course
A fundamental understanding of machine learning that is good enough for your
job interview
A good foundation for future exploration in machine learning
Agenda
1. Introduction
2. Machine learning work flow – an example with linear regression (OLS)
3. Logistic regression
Section 1: Introduction
A world with big data
Unprecedented amount of data
Tick-by-tick price data from different markets around the world
Digital access to financial reports
Satellite data, weather data and forecast, logistics
Customer database, transactions, browsing and tracking history
Financial media, Twitter, Facebook, blogs
Problem: drowning in information, starving for knowledge
Types of big data
Too large
Data so large that even calculating simple statistics such as means is challenging, even with powerful computers
Example: Google tracks 30 trillion URLs, crawls over 20 billion a day, and answers 100 billion search queries a month
These are problems for computer scientists
Too complicated (focus of this course)
Data with very high dimensions and complicated relationships
Example: a product recommendation system for a retailer with 1 million UPCs, 20 million
customers, and 30-50 variables (customer location, browsing history, transaction history, etc.)
This is a problem that requires a model structure to extract information from data
What is machine learning?
Arthur Samuel (1959). Machine Learning: Field of study that gives computers the
ability to learn without being explicitly programmed
Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to
learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience E
Key components
Superior computational power of machines
Large or high-dimensional data
Structured model to extract useful knowledge from data and adapt as new data arrive
[Figure: from data to knowledge]
The key is NOT “machine”, but “human” as decision maker
Model is a tool, not a decision maker
Every model is wrong but good models can be useful
A good quant knows what model to use and how to use it in the right way
The key is NOT “machine”, but “human” as decision maker
Structure the model
Select features (predictor variables)
Assess and select model
Develop better algorithms
Avoid pitfalls
Violation of algorithm assumptions
Overfitting and data mining
Lookahead bias, survivorship bias, etc.
Interpret and apply model
Intuition of model results
Strengths and weaknesses of the model
Model deployment, monitoring, and enhancement
Machine learning work flow
Step 1: Data preparation
Step 2: Feature selection
Step 3: Algorithm selection
Step 4: Model assessment & selection
Step 5: Model application
Types of machine learning problems
Machine learning
Supervised learning (this lecture): regression, classification
Unsupervised learning: clustering
Supervised learning
A response variable (y) is associated with predictor variables (x)
\[ y = f(x) + \varepsilon \]
Key terms
x: predictor, feature, independent variable, explanatory variable, regressor
y: response, target, dependent variable, explained variable, predicted variable, regressand
f: model of the relationship between x and y
ε: error term
Goal: build a model f(x) for
Prediction of new y
Investigating how changes in x affect y
Example of supervised learning
Forecast recession (y) by using economic indicators (x)
Forecast stock market return (y) by using fundamental and technical indicators
(x)
Forecast high yield bond default probability (y) by using market and company
data (x)
Forecast credit card transaction fraud (y) by using customer records and other
tracking information (x)
Regression vs. classification
Regression
Quantitative y
Numerical and continuous, e.g., stock price level
Classification
Qualitative y
Categorical, e.g., stock price direction, email spam, credit score, speech recognition
Some algorithms are mainly used for regression (e.g., OLS), some mainly for
classification (e.g., logistic regression)
Unsupervised learning
No corresponding responses (y)
Goal: discover hidden patterns or intrinsic structures among x (e.g., clustering)
Example: a recommendation system by assigning customers to different
groups with different shopping behaviors
No free lunch theorem
No one method dominates all others over all possible data sets
Some methods perform well on certain types of data, while other methods
perform better on other types of data
Tips for entering the machine learning world
Do
Focus on the big picture, key concepts, and key methods
Understand each step of the machine learning work flow
Start from the basics and common sense
Leverage existing code packages whenever possible
Don’t
Get distracted by too many technical details (save them for the next level)
Get distracted by unpopular techniques (why are they unpopular?)
Code models from scratch (reinventing the wheel)
Be the fanciest and live in a black box (career risk)
Attempt to invent new algorithms (great if you want to pursue a Ph.D. in mathematics, statistics, or computer science)
Self-learning after lecture is important
James, et al (2013) An Introduction to Statistical Learning
Python scikit-learn: https://scikit-learn.org/stable/
The easiest programming tool – MATLAB machine learning apps (regression learner and classification learner)
https://www.oit.uci.edu/help/matlab/
Python scikit-learn
https://scikit-learn.org/stable/
Section 2: Machine Learning Work Flow – An Example with Linear Regression (OLS)
Revisit ordinary least squares (OLS) regression
A linear relationship should always be considered first; OLS is the most commonly used linear regression method
\[ y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i \]
OLS estimates the linear regression of y on x by minimizing the sum of squared differences between observed y and predicted y
\[ \min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \]
OLS provides the best linear unbiased estimator (BLUE) when strict assumptions are met
(elaborate later)
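The minimization above can be sketched with NumPy's least-squares solver. This is a minimal illustration on simulated data, not the course's own code; the sample size, the coefficient vector `true_beta`, and the noise level are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))

# Hypothetical true coefficients [beta_0, beta_1, beta_2, beta_3].
true_beta = np.array([1.0, 0.5, -2.0, 0.0])
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=n)

# Add a column of ones so that beta_0 is estimated together with the slopes.
X1 = np.column_stack([np.ones(n), X])

# Least-squares solution: the beta that minimizes the sum of squared residuals.
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

rss = np.sum((y - X1 @ beta_hat) ** 2)
print("estimated coefficients:", np.round(beta_hat, 3))
print("RSS:", round(float(rss), 3))
```

Any other coefficient vector gives a residual sum of squares at least as large as `rss`, which is what "least squares" means.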
Simple linear regression
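\[ y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \]
Coefficients
\[ \hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \]
Standard error of coefficients
\[ \mathrm{SE}(\hat{\beta}_0) = \sigma_\varepsilon \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \quad \mathrm{SE}(\hat{\beta}_1) = \frac{\sigma_\varepsilon}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} \]
Hypothesis test
\[ H_0: \beta_1 = 0, \quad H_a: \beta_1 \neq 0 \]
t-test
\[ t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)} \]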
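The simple-regression formulas can be checked numerically. The sketch below uses simulated data (the intercept 0.5, slope 2.0, and noise level are assumptions), and because the error standard deviation is unknown in practice, it substitutes the residual standard error with n - 2 degrees of freedom.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.5 + 2.0 * x + rng.normal(scale=1.0, size=n)  # hypothetical data

x_bar, y_bar = x.mean(), y.mean()
sxx = np.sum((x - x_bar) ** 2)

# Coefficient estimates.
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Estimate of the error standard deviation from the residuals.
resid = y - beta0_hat - beta1_hat * x
sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Standard errors and the t-statistic for H0: beta_1 = 0.
se_beta0 = sigma_hat * np.sqrt(1.0 / n + x_bar ** 2 / sxx)
se_beta1 = sigma_hat / np.sqrt(sxx)
t_stat = (beta1_hat - 0.0) / se_beta1

print("beta0_hat:", beta0_hat, "beta1_hat:", beta1_hat)
print("SE(beta1_hat):", se_beta1, "t:", t_stat)
```

A large absolute t-statistic is evidence against the null hypothesis that the slope is zero.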
Multiple linear regression
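\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i \]
Coefficient and SE formulas are complicated
Hypothesis test
\[ H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0 \]
\[ H_a: \text{at least one } \beta_j \text{ is nonzero} \]
F-test
\[ F = \frac{\left[ \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right] / p}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)} \]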
Machine learning work flow
Let’s show a typical workflow of building a model by using OLS
Focus on Step 1: data, Step 2: feature, and Step 4: model
We will expand this framework by learning other machine learning algorithms
(Step 3: algorithm)
We will also expand our toolkit for feature selection and model
assessment/selection while learning new algorithms
Step 1: Data preparation – typical problems
Missing values
Outliers
Different scales ($, weight, age, etc.)
Collinearity (redundant information)
Look-ahead bias (data leakage)
Survivorship bias
Irrelevant and uninformative data
Step 1: Data preparation – descriptive statistics table
Basic descriptive statistics provide insight into the quality and distribution of
the data
Count, mean, standard deviation, min, max, quantiles, percentage of missing
values
Step 1: Data preparation – distribution plot
Step 1: Data preparation – missing values
Reasons for missing data?
Different time history
Different time frequency
Recording error or difficulty
If the missing percentage is too high, consider removing the feature (use with
caution!)
Is the feature conceptually important to the target?
Is there an alternative feature carrying similar information?
There is no ideal way of handling missing data; the rule of thumb is to
minimize disturbance to the data
Step 1: Data preparation – outliers
Trimming/deleting: for outliers that are infrequent and due to error
Retaining: for legitimate outliers that contain information
Winsorizing/winsorization: retain outliers but mitigate their impact
Limit extreme values to a threshold
Example: values more than three standard deviations from the mean are pulled back to the three-standard-deviation level
Example: 95% winsorization sets all data below the 2.5th percentile to the 2.5th percentile and all data above the 97.5th percentile to the 97.5th percentile
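The 95% winsorization example can be sketched in a few lines of NumPy. The helper name `winsorize` and the simulated return series with three injected outliers are assumptions for illustration only.

```python
import numpy as np

def winsorize(values, lower_pct=2.5, upper_pct=97.5):
    """Clip values to the given lower and upper percentiles (95% winsorization by default)."""
    values = np.asarray(values, dtype=float)
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

rng = np.random.default_rng(2)
returns = rng.normal(0.0, 0.02, size=1000)   # hypothetical daily returns
returns[:3] = [0.9, -0.8, 1.5]               # inject a few extreme values

clean = winsorize(returns)
print("raw min/max:       ", returns.min(), returns.max())
print("winsorized min/max:", clean.min(), clean.max())
```

The observations are all retained; only their magnitude is capped. In a backtest, the percentile thresholds would normally be estimated on the training sample only, to stay clear of look-ahead bias.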
Step 1: Data preparation – transformation
Stationary: distribution statistics such as mean and variance do not change over time
Non-stationary data such as stock price levels need to be detrended (e.g., converted to returns) to obtain stationary data
Popular detrending methods: log difference, % difference, difference
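The three detrending methods can be compared on a short price series. The prices below are made-up numbers for illustration.

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0, 110.0])  # hypothetical price levels

diff = np.diff(prices)                      # difference: P_t - P_{t-1}
pct_diff = prices[1:] / prices[:-1] - 1.0   # % difference: simple return
log_diff = np.diff(np.log(prices))          # log difference: log return

print("difference:    ", diff)
print("% difference:  ", np.round(pct_diff, 4))
print("log difference:", np.round(log_diff, 4))
```

Each transformed series is one observation shorter than the price series. Log returns add up over time, which is one reason they are a common choice.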
Step 1: Data preparation – transformation
Standardization
Transform data to zero mean and unit standard deviation (N(0, 1) if the data are normally distributed)
\[ \tilde{X} = \frac{X - \hat{\mu}}{\hat{\sigma}} \]
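A minimal sketch of standardization with NumPy, applied column by column. The two simulated features (one on a dollar scale, one on an age scale) are assumptions chosen to show different scales.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical features on very different scales.
X = np.column_stack([
    rng.normal(50_000, 15_000, size=500),   # e.g. income in $
    rng.normal(40, 10, size=500),           # e.g. age in years
])

mu_hat = X.mean(axis=0)
sigma_hat = X.std(axis=0, ddof=1)
X_std = (X - mu_hat) / sigma_hat

print("means after standardization:", np.round(X_std.mean(axis=0), 6))
print("stds after standardization: ", np.round(X_std.std(axis=0, ddof=1), 6))
```

When the model is later applied to new data, the same `mu_hat` and `sigma_hat` estimated on the training sample would be reused.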
Step 1: Data preparation – transformation
[Figure: correlation on original data vs. correlation on transformed data]
Step 1: Data preparation – common mistakes in finance
Look-ahead bias (data leakage)
Using information that was not yet available at the time of the test
e.g., a monthly return that is only available at the end of the month but is used at the beginning of the month
e.g., using financial statements or GDP figures before they were actually released
Survivorship bias
Using a dataset that excludes discontinued time series
e.g., backtesting a category of mutual funds without the funds that were discontinued
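One common guard against look-ahead bias is to lag every feature so that it only uses values already known when the prediction is made. The sketch below assumes a made-up monthly return series, ordered oldest first, where each value becomes known only at the end of its month.

```python
import numpy as np

# Hypothetical monthly returns, oldest first.
monthly_return = np.array([0.01, -0.02, 0.03, 0.015, -0.005])

# Leaky: the feature for month t is the return of month t itself,
# which is not known at the beginning of month t.
leaky_feature = monthly_return.copy()

# Safe: the feature for month t is the return of month t-1.
lagged_feature = np.full_like(monthly_return, np.nan)
lagged_feature[1:] = monthly_return[:-1]

# Drop the first month, which has no lagged value available.
feature = lagged_feature[1:]
target = monthly_return[1:]

for f, t in zip(feature, target):
    print(f"known at start of month: {f:+.3f} -> return to predict: {t:+.3f}")
```

The same idea applies to data with a publication delay, such as financial statements: the lag should reflect the release date, not the period the data describes.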
Step 2: Feature selection – why?
Large set of predictors – curse of dimensionality
Irrelevant noise features increase the complexity/dimensionality of the problem and exacerbate the risk of overfitting
If the number of predictors p approaches the number of data points n, the variance of the model increases significantly; keep p << n for a robust model
Makes searching for the optimal solution difficult
Makes the model harder to interpret
Feature selection is a bias-variance trade-off
Number of predictors p ↓, bias ↑, variance ↓
Step 2: Feature selection – curse of dimensionality
A problem with n = 100 observations and 20 features truly related to the response
Model error will be higher when p ↑, especially if noise features are added
[Figure: model error for different p; p – number of features; degrees of freedom – number of selected features]
Step 2: Feature selection – common methods
Subset selection
Identifying a subset of predictors related to response
Shrinkage/regularization
Regress on all predictors but shrink coefficients towards 0 to identify the truly influential
predictors
Dimension reduction
Map p predictors into an M-dimensional subspace with M < p
Step 2: Feature selection – subset selection
Stepwise
Forward selection
Start with the null model M0 (intercept only)
For k = 0, …, p – 1: add to Mk the one variable that gives the best improvement, yielding Mk+1
(improvement criteria are covered in Step 4)
Select the best model from (M0, …, Mp)
Backward selection
Start with the full model (with all the features)
Similar iteration to forward selection, but remove one variable at a time
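Forward selection can be sketched as a short loop around OLS. This is an illustrative version under stated assumptions: the helper names are hypothetical, the variable added at each step is the one with the lowest training MSE, and the final choice among M0, …, Mp uses the MSE on a held-out validation set (one of several possible Step 4 criteria).

```python
import numpy as np

def fit_ols(X, y):
    """OLS with intercept; X may have zero columns (null model)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def mse(X, y, beta):
    X1 = np.column_stack([np.ones(len(y)), X])
    return float(np.mean((y - X1 @ beta) ** 2))

def forward_selection(X_train, y_train, X_val, y_val):
    selected, remaining = [], list(range(X_train.shape[1]))
    beta = fit_ols(X_train[:, selected], y_train)          # M0: intercept only
    path = [(list(selected), mse(X_val[:, selected], y_val, beta))]
    while remaining:
        # Try each remaining variable and keep the one with the lowest training MSE.
        scores = []
        for j in remaining:
            cols = selected + [j]
            b = fit_ols(X_train[:, cols], y_train)
            scores.append((mse(X_train[:, cols], y_train, b), j))
        _, best_j = min(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        beta = fit_ols(X_train[:, selected], y_train)
        path.append((list(selected), mse(X_val[:, selected], y_val, beta)))
    best = min(path, key=lambda item: item[1])             # best of M0, ..., Mp
    return best, path

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)  # hypothetical data
best, path = forward_selection(X[:200], y[:200], X[200:], y[200:])
print("selection order:", path[-1][0])
print("best subset:", best[0], "validation MSE:", round(best[1], 4))
```

Backward selection would follow the same pattern, starting from all features and removing one at a time.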
Step 2: Feature selection – subset selection
Hybrid stepwise
A combination of forward and backward selection
After adding one variable (forward), re-evaluate and remove variables that no longer provide improvement (backward)
The starting point of the stepwise iteration can lead to a different selection of models
Step 2: Feature selection – shrinkage/regularization
Ridge regression
OLS is modified to also minimize the sum of squared coefficients (L2 regularization)
Cost function/loss function
\[ \min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \]
Lambda (λ): tuning parameter controlling shrinkage penalty
Bias-variance trade-off
Coefficients will shrink towards 0 but not exactly to 0
The term λ Σ βj² is the shrinkage penalty; as λ increases, bias ↑ and variance ↓
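For ridge regression the minimizer has a closed form, which makes the shrinkage easy to see. The sketch below is an illustration, not a production implementation: it centers the data so that the intercept is left unpenalized, and the function name `ridge_fit` and the simulated data are assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize RSS + lam * sum(beta_j^2); the intercept beta_0 is not penalized."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    beta0 = y_mean - x_mean @ beta
    return beta0, beta

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)  # hypothetical data

for lam in [0.0, 10.0, 1000.0]:
    beta0, beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:7.1f}  beta={np.round(beta, 3)}")
```

With `lam = 0` the result is the OLS fit; as `lam` grows, the coefficients shrink towards 0 but do not become exactly 0.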
Step 2: Feature selection – shrinkage/regularization
Ridge regression
[Figure: bias, variance, and combined error for ridge regression]
Step 2: Feature selection – shrinkage/regularization
Lasso regression
OLS is modified to also minimize the sum of the absolute values of the coefficients (L1 regularization)
Cost function/loss function
\[ \min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]
Forces some coefficients to be exactly 0; better for feature selection
The term λ Σ |βj| is the shrinkage penalty
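Lasso has no closed-form solution, but a simple numerical scheme is coordinate descent with soft-thresholding. The sketch below is one possible solver for the loss function as written on the slide (RSS plus λ times the L1 norm), not necessarily what a given package does internally; library implementations may scale the loss differently, so their tuning parameter is not directly comparable to this λ. The function names and data are assumptions.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards 0 by t, and set it to exactly 0 if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min RSS + lam * sum(|beta_j|); intercept unpenalized."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.zeros(p)
    z = (Xc ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            r_partial = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ r_partial
            beta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    beta0 = y.mean() - X.mean(axis=0) @ beta
    return beta0, beta

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # 4 pure noise features

beta0, beta = lasso_cd(X, y, lam=100.0)
print("lasso coefficients:", np.round(beta, 3))
```

In this setup the coefficients on the noise features come out as exact zeros, while the two informative coefficients are shrunk but kept.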
Step 2: Feature selection – shrinkage/regularization
Lasso regression
[Figure: bias, variance, and combined error for lasso regression]
Step 2: Feature selection – shrinkage/regularization
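Why does lasso force some coefficients to be exactly 0?
[Figure: contours of RSS and the constrained space of β, for lasso and for ridge]
The lasso constraint region has corners on the axes, so the RSS contours often first touch it where a coefficient is exactly 0; the ridge constraint region is round, with no corners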
Step 2: Feature selection – shrinkage/regularization
Lambda (λ)
Tuning parameter or hyperparameter
Can be found by cross validation (elaborate later)
Elastic net
Limitations of lasso
When p > n, lasso selects at most n variables
Lasso tends to select only one variable from a highly correlated group
A combination of ridge and lasso
Step 2: Feature selection – dimension reduction
Transform X1, X2, …, Xp to Z1, Z2, …, ZM with M < p (reduce dimension of features)
\[ Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \]
\[ y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \varepsilon_i \]
\[ \beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm} \]
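The three equations can be traced in code. The sketch below picks the loadings φ from the leading principal directions of the centered predictors (one possible choice, obtained here from an SVD), regresses y on the M new variables, and maps θ back to coefficients on the original predictors. The data and the choice M = 2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, M = 200, 5, 2

# Hypothetical correlated predictors driven by two latent factors.
factors = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, p))
X = factors @ loadings + 0.1 * rng.normal(size=(n, p))
y = 1.0 + factors[:, 0] - 0.5 * factors[:, 1] + 0.1 * rng.normal(size=n)

Xc = X - X.mean(axis=0)                       # center the predictors first

# Rows of Vt are principal directions; phi[j, m] plays the role of phi_{jm}.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
phi = Vt[:M].T                                # shape (p, M)
Z = Xc @ phi                                  # z_{im} = sum_j phi_{jm} x_{ij}

# Regress y on the M transformed features.
Z1 = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(Z1, y, rcond=None)   # theta[0] is theta_0

# Implied coefficients on the original (centered) predictors.
beta = phi @ theta[1:]                        # beta_j = sum_m theta_m phi_{jm}
print("theta:", np.round(theta, 3))
print("implied beta:", np.round(beta, 3))
```

The regression is fitted on only M = 2 variables, yet it still implies a coefficient for each of the p = 5 original predictors.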
Step 2: Feature selection – dimension reduction
Principal components analysis (PCA) (elaborate in Lecture 4)
[Figure: principal component directions Z1 and Z2 plotted against the original axes X1 and X2]
Step 4: Model assessment – algorithm assumption
OLS is a fairly restrictive algorithm: strict assumptions are needed for it to produce BLUE
Linear relationship between x and y
Use a residual plot to identify any pattern (a well-specified model should show no pattern in the residual plot)
Consider transforming the data (log X, X², √X, etc.)
Switch to non-linear methods
Uncorrelated error terms
If the error terms are correlated, the standard errors of the coefficients will be underestimated, which distorts model interpretation such as hypothesis tests and confidence intervals
Example: autocorrelation in time series data
Switch to time series regression methods (AR, MA, ARMA, etc.)
Step 4: Model assessment – algorithm assumption
OLS assumptions (cont.)
Constant variance of error term
Non-constant variance of the error term is called heteroscedasticity
Consider transforming the response (log Y, √Y)
Switch to methods that model heteroscedasticity (ARCH, GARCH, etc.)
Low correlation among predictors
High correlation among predictors is called collinearity/multicollinearity
Collinearity may give the model a false "arbitrage" opportunity, resulting in large offsetting positive and negative coefficients on highly correlated predictors
Consider dropping some predictors
Consider combining some predictors (by averaging) or using the difference of two predictors as a new variable
Ridge regression, lasso, PCA
Step 4: Model assessment – model fit
R-squared
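\[ R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]
TSS – total sum of squares; RSS – residual sum of squares
Mean squared error (MSE)
\[ \mathrm{MSE} = \frac{\mathrm{RSS}}{n} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n} \]
Root Mean Squared Error (RMSE)
\[ \mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \]
Residual Standard Error (RSE)
\[ \mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}} \]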
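The four fit measures follow directly from the residuals. The helper `fit_metrics` and the five observations below are assumptions made up to keep the arithmetic easy to follow.

```python
import numpy as np

def fit_metrics(y, y_hat, p):
    """R-squared, MSE, RMSE and RSE for a model with p predictors."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares
    rss = np.sum((y - y_hat) ** 2)        # residual sum of squares
    mse = rss / n
    return {
        "R2": (tss - rss) / tss,
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "RSE": np.sqrt(rss / (n - p - 1)),
    }

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical observations
y_hat = np.array([1.1, 1.9, 3.2, 3.9, 4.9])    # hypothetical fitted values
metrics = fit_metrics(y, y_hat, p=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here TSS = 10 and RSS = 0.08, so R² = 0.992 and MSE = 0.016; RSE is larger than RMSE because it divides by n - p - 1 instead of n.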
Step 4: Model assessment – loss function
The evaluation metrics on the previous slides are also called loss functions
A loss function typically takes the form
Loss = error (deviation from true value) + penalty (optional)
Recall the loss functions in shrinkage/regularization
Step 4: Model assessment – training error vs. test error
Training error: error between the training data and the fitted model (in sample)
Test error: error between the test data and the model prediction (out of sample)
Minimizing training error does not necessarily mean minimizing test error
[Figure: model fit on training data vs. test data]
Step 4: Model assessment – underfitting and overfitting
The goal is to choose a model with low test error
Underfitting
Easier to identify: high training error
Overfitting
More common and harder to identify
Also called data mining
Spurious relationships in the training data that will not materialize out of sample
Sign: much larger test error than training error
Solutions
Indirectly: adjust the training error to account for the bias due to overfitting
Directly: cross-validation
Step 4: Model assessment – adjust training error and penalize complicated models
Cp
\[ C_p = \frac{1}{n} \left( \mathrm{RSS} + 2 p \hat{\sigma}^2 \right) = \mathrm{MSE} + \text{penalty} \]
p – number of predictors; $\hat{\sigma}^2$ – estimate of the variance of the error
Akaike information criterion (AIC)
\[ \mathrm{AIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + 2 p \hat{\sigma}^2 \right) \]
Bayesian information criterion (BIC)
\[ \mathrm{BIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + \ln(n)\, p\, \hat{\sigma}^2 \right) \]
Adjusted R-squared
\[ R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1} \]
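The adjusted criteria are simple functions of RSS, n, p and the error-variance estimate. The sketch below implements the formulas as written above; the two candidate models (their RSS values and the variance estimate of 1.0) are made-up numbers for illustration.

```python
import math

def c_p(rss, n, p, sigma2_hat):
    return (rss + 2 * p * sigma2_hat) / n

def aic(rss, n, p, sigma2_hat):
    return (rss + 2 * p * sigma2_hat) / (n * sigma2_hat)

def bic(rss, n, p, sigma2_hat):
    return (rss + math.log(n) * p * sigma2_hat) / (n * sigma2_hat)

def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

n, sigma2_hat = 100, 1.0
small = {"p": 3, "rss": 110.0}    # hypothetical model with 3 predictors
large = {"p": 10, "rss": 105.0}   # hypothetical model with 10 predictors

for name, m in [("small", small), ("large", large)]:
    print(name,
          "Cp =", round(c_p(m["rss"], n, m["p"], sigma2_hat), 4),
          "AIC =", round(aic(m["rss"], n, m["p"], sigma2_hat), 4),
          "BIC =", round(bic(m["rss"], n, m["p"], sigma2_hat), 4))
```

The larger model has the lower RSS, but the penalty outweighs the gain, so all three criteria prefer the smaller model. Because ln(n) > 2 once n exceeds about 7, BIC penalizes extra predictors more heavily than AIC.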
Step 4: Model assessment – cross-validation
• Split the dataset into a training set and a testing set (aka validation set or hold-out set)
• Train the model on the training set
• Use the trained model to predict on the testing set
• Assess baseline model accuracy on the testing set
• Tune and choose the final model based on performance on the testing set
• Train on all data and make predictions on future observations
Step 4: Model assessment – k-fold cross-validation
Randomly divide the observations into k equally sized, non-overlapping folds
(typically k = 5 or 10, depending on the size of the data)
Train on k – 1 folds and test (validate) on the remaining fold; repeat k times, using each fold once as the test fold (k out-of-sample tests)
Choose the model with the lowest average test error
\[ \mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i \]
[Figure: 5-fold cross-validation; all data are split into five folds, and each of the five runs uses a different fold as the test set and the remaining folds as the training set]
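k-fold cross-validation can be written in a few lines around OLS. This is a sketch under assumptions: the function name `k_fold_cv_mse` is hypothetical, the data are simulated with 2 informative and 40 noise features, and folds are formed by random shuffling, which suits independent observations but would need a time-aware split for time series.

```python
import numpy as np

def k_fold_cv_mse(X, y, k=5, seed=0):
    """Average out-of-sample MSE of OLS over k folds."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # k non-overlapping folds
    fold_mse = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        X_tr = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        X_te = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        beta, *_ = np.linalg.lstsq(X_tr, y[train_idx], rcond=None)
        fold_mse.append(np.mean((y[test_idx] - X_te @ beta) ** 2))
    return float(np.mean(fold_mse))

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 42))
y = X[:, 0] - X[:, 1] + rng.normal(size=100)   # only the first two features matter

cv_small = k_fold_cv_mse(X[:, :2], y)   # model with the 2 informative features
cv_large = k_fold_cv_mse(X, y)          # model with all 42 features
print("CV MSE, 2 features: ", round(cv_small, 3))
print("CV MSE, 42 features:", round(cv_large, 3))
```

The larger model fits the training folds more closely but has the higher cross-validated error, which is the overfitting pattern cross-validation is meant to reveal.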
Section 3: Logistic Regression
Logistic regression
Logistic regression is the simplest algorithm for binary classification in supervised
learning
Examples
Purchase of a product (Y = 1 if purchase, Y = 0 if not)
Click on display ad (Y = 1 if click, Y = 0 if not)
Default on loan (Y = 1 if default, Y = 0 if not)
Limitation of linear regression on classification
Linear regression cannot guarantee a valid probability $Y \in [0, 1]$
For classification with more than two levels, there is in general no natural way to convert qualitative levels into a quantitative response
[Figure: linear regression on a binary response gives invalid probability predictions outside [0, 1]]
Logistic function
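The conditional expectation of Y equals the probability of Y = 1
\[ E(Y \mid X) = \Pr(Y = 1 \mid X) \times 1 + [1 - \Pr(Y = 1 \mid X)] \times 0 = \Pr(Y = 1 \mid X) \]
The logistic function is an S-shaped curve between 0 and 1
\[ p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}} \]
Convert to linear form
\[ \ln \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \]
The left-hand side is the logit (log-odds)
Logistic regression is a member of the generalized linear model (GLM) family
\[ f(\mu_Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p \]
f is the link function; for logistic regression the link is the logit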
Difference and similarity with linear regression
In linear regression, a one-unit change in X leads to a β-unit change in Y
In logistic regression, there is no linear relationship between p(X) and X. The rate of change in p(X) per unit change in X depends on the level of X
\[ \frac{\partial p(X)}{\partial X_j} = \beta_j \, p(X) \, [1 - p(X)] \]
Statistical inference (standard error, t-stat, p-value, hypothesis test) is similar
[Figure: logistic curve with a different slope at different levels of X]
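The logistic function, its inverse (the logit), and the slope formula can be verified numerically for a single predictor. The coefficients `beta0` and `beta1` below are assumptions chosen only for illustration.

```python
import numpy as np

def logistic(eta):
    """e^eta / (1 + e^eta), written in the equivalent form 1 / (1 + e^-eta)."""
    return 1.0 / (1.0 + np.exp(-eta))

def logit(prob):
    """Log-odds, the inverse of the logistic function."""
    return np.log(prob / (1.0 - prob))

beta0, beta1 = -1.0, 2.0            # hypothetical coefficients
x = np.linspace(-3.0, 3.0, 7)
p = logistic(beta0 + beta1 * x)

# Slope of p(x) from the formula beta_1 * p * (1 - p) ...
slope_formula = beta1 * p * (1.0 - p)
# ... compared with a central finite difference.
h = 1e-6
slope_numeric = (logistic(beta0 + beta1 * (x + h)) - logistic(beta0 + beta1 * (x - h))) / (2 * h)

for xi, pi, si in zip(x, p, slope_formula):
    print(f"x={xi:+.1f}  p(x)={pi:.4f}  slope={si:.4f}")
```

The probabilities stay strictly between 0 and 1, and the slope is largest where p(x) is near 0.5 and flattens out in both tails.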
How to solve logistic regression – maximum likelihood estimation (MLE)
Likelihood function
How likely particular values of the parameters $\theta = (\theta_1, \theta_2, \ldots, \theta_p)$ (treated as variables) are for a given set of observations $Y = (y_1, y_2, \ldots, y_n)$: $p(\theta \mid Y)$
Re-expression of the joint probability of observations Y conditioned on parameters θ; the product form holds if the y are independent and identically distributed (i.i.d.)
\[ \mathcal{L}(\theta \mid Y) = f(Y \mid \theta) = f(y_1 \mid \theta)\, f(y_2 \mid \theta) \cdots f(y_n \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta) \]
Under Bayes’ theorem with a uniform prior distribution
\[ p(\theta \mid Y) = \frac{f(Y \mid \theta)\, p(\theta)}{p(Y)} \propto f(Y \mid \theta) \]
$f(Y \mid \theta)$: likelihood; $p(\theta)$: prior distribution (uniform, non-informative)
$p(Y)$: probability of the data averaged over all parameters (independent of θ)
$p(\theta \mid Y)$: posterior distribution, in this case proportional to the likelihood
How to solve logistic regression – maximum likelihood estimation (MLE)
An example with coin flips
Observation: two heads (H) and one tail (T) in three flips; what is the likelihood of the parameter $p_H$ (fairness of the coin, $p_H = 0.5$ for a fair coin)?
If $p_H = 0.5$: $\mathcal{L}(p_H \mid \mathrm{HHT}) = p(\mathrm{HHT} \mid p_H = 0.5) = 0.5^2 \times 0.5 = 0.125$
If $p_H = 0.3$: $\mathcal{L}(p_H \mid \mathrm{HHT}) = p(\mathrm{HHT} \mid p_H = 0.3) = 0.3^2 \times 0.7 = 0.063$
[Figure: likelihood as a function of $p_H$, with the maximum-likelihood parameter marked]
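The coin-flip likelihood can be evaluated over a grid of candidate values to find the parameter with maximum likelihood. The grid search is just a simple illustrative device; for this example the maximum can also be found analytically at 2/3, the observed share of heads.

```python
def likelihood(p_h, heads=2, tails=1):
    """Likelihood of p_h for a given sequence with `heads` heads and `tails` tails."""
    return p_h ** heads * (1.0 - p_h) ** tails

print("L(0.5) =", likelihood(0.5))
print("L(0.3) =", round(likelihood(0.3), 3))

# Grid search over candidate values of p_h between 0 and 1.
grid = [i / 1000 for i in range(1001)]
p_mle = max(grid, key=likelihood)
print("maximum-likelihood p_h on the grid:", p_mle)
```

Both values on the slide (0.125 and 0.063) are lower than the likelihood at the maximum, which is about 0.148.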
How to solve logistic regression – maximum likelihood estimation (MLE)
MLE tries to find the parameter set θ that maximizes the likelihood function
For logistic regression
\[ \mathcal{L}(\beta \mid Y) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} [1 - p(x_{i'})] \]
MLE is one of the most widely used methods for fitting models
MLE can be solved either analytically (with a closed-form solution) or numerically
OLS is a special case of MLE (under normally distributed errors)
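Logistic regression has no closed-form MLE, so it is solved numerically. The sketch below uses Newton's method on the log-likelihood, which is one standard numerical approach and not necessarily the one used in later lectures; the function names, the simulated data, and the true coefficients (-0.5, 1.5) are assumptions.

```python
import numpy as np

def fit_logistic_mle(X, y, n_iter=25):
    """Maximize the logistic log-likelihood with Newton's method."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))
        gradient = X1.T @ (y - p)                          # first derivative of ln L
        hessian = -(X1 * (p * (1.0 - p))[:, None]).T @ X1  # second derivative of ln L
        beta = beta - np.linalg.solve(hessian, gradient)   # Newton update
    return beta

def log_likelihood(beta, X, y):
    X1 = np.column_stack([np.ones(len(y)), X])
    p = 1.0 / (1.0 + np.exp(-X1 @ beta))
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

rng = np.random.default_rng(9)
n = 1000
x = rng.normal(size=n)
p_true = 1.0 / (1.0 + np.exp(-(-0.5 + 1.5 * x)))   # hypothetical true model
y = (rng.random(n) < p_true).astype(float)

beta_hat = fit_logistic_mle(x, y)
print("estimated (beta_0, beta_1):", np.round(beta_hat, 3))
print("ln L at estimate:", round(log_likelihood(beta_hat, x, y), 2))
print("ln L at zero:    ", round(log_likelihood(np.zeros(2), x, y), 2))
```

The log-likelihood is the log of the product on the slide: observations with y = 1 contribute ln p(x) and observations with y = 0 contribute ln(1 - p(x)).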
Evaluate logistic regression model – deviance
Deviance is used to measure goodness-of-fit for MLE
\[ D = -2 \left[ \ln \mathcal{L}(M_c) - \ln \mathcal{L}(M_s) \right] \]
$M_s$: saturated model with n parameters fitting n observations; when it fits perfectly, $\mathcal{L}(M_s) = 1$ and $\ln \mathcal{L}(M_s) = 0$
$M_c$: candidate model
D ↓, better fit
Both deviance and log likelihood can be used to evaluate feature selection
Evaluate logistic regression model – threshold
A threshold is used to convert a predicted probability into a predicted class
FICO    Predicted default prob.    Prediction
800     0.1                        No Default
750     0.2                        No Default
700     0.4                        No Default
650     0.6                        Default
600     0.7                        Default
550     0.8                        Default
Threshold (here between 0.4 and 0.6): when the threshold moves, the class prediction changes
Evaluate logistic regression model – Type I and Type II errors
Type I (false positive, α) and Type II (false negative, β) errors from hypothesis testing
                 H0 true             H0 false
Don’t reject     Correct             Type II error (β)
Reject           Type I error (α)    Correct
Evaluate logistic regression model – confusion matrix
False: prediction differs from actual
Positive/negative: refers to the prediction
                          Actual – or null          Actual + or non-null      Sum
Predict – or null         TN (true negative)        FN (false negative, β)    N*
Predict + or non-null     FP (false positive, α)    TP (true positive)        P*
Sum                       N                         P
FP/N: Type I error rate, false positive rate
FN/P: Type II error rate
TP/P: true positive rate, hit rate, sensitivity, recall
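The confusion matrix and its rates can be computed by counting. The sketch below reuses the six predicted default probabilities from the FICO threshold example with a threshold of 0.5; the `actual` outcomes are made up for illustration and are not from the lecture.

```python
def confusion_counts(actual, predicted_prob, threshold=0.5):
    """Return (TP, FP, TN, FN) for binary labels and predicted probabilities."""
    tp = fp = tn = fn = 0
    for a, prob in zip(actual, predicted_prob):
        pred = 1 if prob >= threshold else 0
        if pred == 1 and a == 1:
            tp += 1
        elif pred == 1 and a == 0:
            fp += 1
        elif pred == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

predicted_prob = [0.1, 0.2, 0.4, 0.6, 0.7, 0.8]   # predicted default probabilities
actual = [0, 0, 1, 0, 1, 1]                        # hypothetical outcomes (1 = default)

tp, fp, tn, fn = confusion_counts(actual, predicted_prob)
P, N = tp + fn, fp + tn           # actual positives and actual negatives

false_positive_rate = fp / N      # Type I error rate
false_negative_rate = fn / P      # Type II error rate
true_positive_rate = tp / P       # hit rate, sensitivity, recall

print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("FPR:", false_positive_rate, "FNR:", false_negative_rate, "TPR:", true_positive_rate)
```

Moving the `threshold` argument changes the predicted classes and therefore all four counts.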
Evaluate logistic regression model – ROC curve
The ROC curve traces TPR (true positive rate) against FPR (false positive rate) as the threshold moves
x-axis: FPR = false positive / actual – (Type I)
y-axis: TPR = true + / actual + (1 – Type II, hit rate)
[Figure: ROC curve, showing a completely random classifier and the ideal ROC]
AUC: area under the curve
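An ROC curve can be traced by sweeping the threshold over the predicted probabilities and recording (FPR, TPR) at each step; the AUC then follows from the trapezoid rule. The helper names and the six labelled observations are assumptions for illustration.

```python
def roc_points(actual, score):
    """(FPR, TPR) pairs as the threshold moves from high to low."""
    P = sum(actual)
    N = len(actual) - P
    points = [(0.0, 0.0)]
    for t in sorted(set(score), reverse=True):
        tp = sum(1 for a, s in zip(actual, score) if s >= t and a == 1)
        fp = sum(1 for a, s in zip(actual, score) if s >= t and a == 0)
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

score = [0.1, 0.2, 0.4, 0.6, 0.7, 0.8]   # hypothetical predicted probabilities
actual = [0, 0, 1, 0, 1, 1]               # hypothetical outcomes

points = roc_points(actual, score)
print("ROC points:", [(round(x, 3), round(y, 3)) for x, y in points])
print("AUC:", round(auc(points), 4))
```

A completely random classifier has an AUC of about 0.5 and an ideal one has an AUC of 1; this example falls in between at 8/9.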
Revisit logistic regression in Lecture 3
Advanced topics in logistic regression will be further elaborated in Lecture 3
Gradient descent
Unbalanced data
Loss function with misclassification penalty
Lift table
F1 scores
Precision-recall curve
https://wrds-www.wharton.upenn.edu/