MFIN 290 Application of Machine Learning in Finance: Lecture 1

Edward Sheng

6/26/2021

Background of lecturer
Portfolio Manager and Director of Quantitative Research at Pacific Life
CFA and CAIA charterholder
Manage over $30 billion of Pacific Life asset allocation funds (Pacific Funds and Pacific Select Funds)
Apply machine learning models to real investments
Previous experience

Senior Researcher, Research Affiliates
Summer Associate, Citigroup
Master, Financial Engineering, UCLA
PhD, Engineering, Arizona State University
BS, Nanjing University

2

A guide to decipher the maze of machine learning

3

The goal of this course
A fundamental understanding of machine learning that is good enough for your
job interview

A good foundation for future exploration in machine learning

4

Agenda

1 Introduction
2 Machine learning work flow – an example with linear regression (OLS)
3 Logistic regression

5

Section 1: Introduction

6

A world with big data
Unprecedented amount of data

Tick-by-tick price data from different markets around the world
Digital access to financial reports
Satellite data, weather data and forecast, logistics
Customer database, transactions, browsing and tracking history
Financial media, Twitter, Facebook, blogs

Problem: drowning in information, starving for knowledge

7

Types of big data
Too large

Data so large that even calculating simple statistics such as means is challenging, even with a powerful computer
Example: Google tracks 30 trillion URLs, crawls over 20 billion of them a day, and answers 100 billion search queries a month

These are problems for computer scientists

Too complicated (focus of this course)
Data with very high dimensions and complicated relationships

Example: a product recommendation system for a retailer with 1 million UPCs, 20 million
customers, and 30-50 variables (customer location, browsing history, transaction history, etc.)
This is a problem that requires a model structure to extract information from data

8

What is machine learning?
Arthur Samuel (1959). Machine Learning: Field of study that gives computers the
ability to learn without being explicitly programmed

Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to
learn from experience E with respect to some task T and some performance measure
P, if its performance on T, as measured by P, improves with experience E

Key components
Superior computation power of machines
Large or high-dimensional data
A structured model to extract useful knowledge from data and adapt when new data comes in

9

[Diagram: Data → Knowledge]

10

The key is NOT “machine”, but “human” as decision maker
Model is a tool, not a decision maker

Every model is wrong but good models can be useful

A good quant knows what model to use and how to use it in the right way

11

The key is NOT “machine”, but “human” as decision maker
Structure the model

Select features (predictor variables)
Assess and select the model
Develop better algorithms

Avoid pitfalls
Violation of algorithm assumptions
Overfitting and data mining
Look-ahead bias, survivorship bias, etc.

Interpret and apply the model
Intuition of model results
Strengths and weaknesses of the model
Model deployment, monitoring, and enhancement

12

Machine learning work flow

13

Step 1: Data preparation → Step 2: Feature selection → Step 3: Algorithm selection → Step 4: Model assessment & selection → Step 5: Model application

Types of machine learning problems

14

Machine Learning
Supervised Learning (this lecture): Regression, Classification
Unsupervised Learning: Clustering

Supervised learning
A response variable (y) is associated with predictor variables (x)

y = f(x) + ε

Key terms
x: predictor, feature, independent variable, explanatory variable, regressor
y: response, target, dependent variable, explained variable, predicted variable, regressand
f: model of the relationship between x and y
ε: error term

Goal: build a model f(x) for
Prediction of new y
Investigating how changes in x affect y

15

Example of supervised learning
Forecast recession (y) by using economic indicators (x)

Forecast stock market return (y) by using fundamental and technical indicators
(x)

Forecast high yield bond default probability (y) by using market and company
data (x)

Forecast credit card transaction fraud (y) by using customer records and other
tracking information (x)

16

Regression vs. classification
Regression

Quantitative y
Numerical and continuous, e.g., stock price level

Classification
Qualitative y
Categorical, e.g., stock price direction, email spam, credit score, speech recognition

Some algorithms are mainly used for regression (e.g., OLS), some mainly for
classification (e.g., logistic regression)

17

Unsupervised learning
No corresponding responses (y)

Goal: discover hidden patterns or intrinsic structures among x (e.g., clustering)

Example: a recommendation system that assigns customers to different groups with different shopping behaviors

18

No free lunch theorem
No one method dominates all others over all possible data sets

Some methods may perform well on certain types of data, while other methods perform better on different types of data

19

Tips for entering the machine learning world
Do

Focus on the big picture, key concepts, and key methods
Understand each step of the machine learning work flow
Start from the basics and common sense
Leverage existing code packages whenever possible

Don't
Get distracted by too many technical details (save them for the next level)
Get distracted by unpopular techniques (ask why they are unpopular)
Code models from scratch (reinvent the wheel)
Be the fanciest and live in a black box (career risk)
Attempt to invent new algorithms (great if you want to pursue a mathematics, statistics, or computer science Ph.D.)

20

Self-learning after lecture is important
James et al. (2013), An Introduction to Statistical Learning

Python scikit-learn: https://scikit-learn.org/stable/

21


The easiest programming tool – MATLAB machine learning
apps (regression learner and classification learner)

22

https://www.oit.uci.edu/help/matlab/

Python scikit-learn

23

https://scikit-learn.org/stable/

Section 2: Machine Learning Work
Flow – An Example with Linear
Regression (OLS)

24

Revisit ordinary least squares (OLS) regression
A linear relationship should always be considered first; OLS is the most commonly used linear regression method

y_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i

OLS estimates the linear regression of y on x by minimizing the sum of squared differences between the observed and predicted y

\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2

OLS provides the best linear unbiased estimator (BLUE) when strict assumptions are met
(elaborate later)

25

Simple linear regression
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

Coefficients

\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}

Standard error of coefficients

\mathrm{SE}(\hat{\beta}_0) = \sigma_{\varepsilon} \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \quad \mathrm{SE}(\hat{\beta}_1) = \frac{\sigma_{\varepsilon}}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}

Hypothesis test

H_0: \beta_1 = 0, \quad H_a: \beta_1 \neq 0

t-test

t = \frac{\hat{\beta}_1 - 0}{\mathrm{SE}(\hat{\beta}_1)}

26
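The formulas above can be checked numerically. The following is a minimal sketch in Python with numpy; the toy x and y arrays are made up for illustration and are not from the lecture.

```python
# Minimal sketch of the simple-regression formulas above; toy data, illustrative only.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
n = len(x)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

resid = y - (beta0 + beta1 * x)
sigma_eps = np.sqrt(np.sum(resid ** 2) / (n - 2))            # residual standard error
se_beta1 = sigma_eps / np.sqrt(np.sum((x - x.mean()) ** 2))  # SE of the slope
t_stat = (beta1 - 0.0) / se_beta1                            # t-test of H0: beta1 = 0
print(beta1, beta0, se_beta1, t_stat)
```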

Multiple linear regression
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \varepsilon_i

The coefficient and SE formulas are more complicated than in the simple case

Hypothesis test
H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0
H_a: at least one \beta_j is non-zero

F-test

F = \frac{\left( \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right) / p}{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n - p - 1)}

27
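For multiple regression, a library handles the matrix algebra and reports the F-test directly. Below is a minimal sketch using statsmodels (a common Python package, assumed to be available); the data is randomly generated purely for illustration.

```python
# Minimal sketch: multiple linear regression with an overall F-test; synthetic data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                   # n = 100, p = 3 predictors
y = 1.0 + X @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=0.5, size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.fvalue, model.f_pvalue)   # F-statistic for H0: all slopes are zero
print(model.params)                   # intercept and coefficient estimates
```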

Machine learning work flow

28

Step 1: Data preparation → Step 2: Feature selection → Step 3: Algorithm selection → Step 4: Model assessment & selection → Step 5: Model application

Machine learning work flow
Let’s show a typical workflow of building a model by using OLS

Focus on Step 1 (data preparation), Step 2 (feature selection), and Step 4 (model assessment and selection)

We will expand this framework by learning other machine learning algorithms
(Step 3: algorithm)

We will also expand our toolkit for feature selection and model
assessment/selection while learning new algorithms

29

Step 1: Data preparation – typical problems
Missing values

Outliers

Different scales ($, weight, age, etc.)

Collinearity (redundant information)

Look-ahead bias (data leakage)

Survivorship bias

Irrelevant and uninformative data

30

Step 1: Data preparation – descriptive statistics table
Basic descriptive statistics provide insight into the quality and distribution of the data

Count, mean, standard deviation, min, max, quantiles, percentage of missing values
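As a concrete illustration, here is a minimal pandas sketch, assuming the raw features have already been loaded into a DataFrame called df (a hypothetical variable name).

```python
# Minimal sketch: descriptive statistics plus percentage of missing values per feature.
import pandas as pd

def describe_with_missing(df: pd.DataFrame) -> pd.DataFrame:
    stats = df.describe().T                        # count, mean, std, min, quantiles, max
    stats["pct_missing"] = df.isna().mean() * 100  # share of missing values per column
    return stats
```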

31

Step 1: Data preparation – distribution plot

32

Step 1: Data preparation – missing values
Reasons for missing data?

Different time history
Different time frequency
Recording error or difficulty

If the missing percentage is too high, consider removing the feature (use with
caution!)

Is the feature conceptually important to the target?
Is there an alternative feature carrying similar information?

There is no ideal way of handling missing data; the rule of thumb is to minimize disturbance of the data
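One possible (not the only) treatment consistent with the guidance above is sketched below for a hypothetical pandas DataFrame df; the 50% cutoff and the median fill are illustrative choices.

```python
# Minimal sketch: drop features with too many gaps, fill the rest with column medians.
import pandas as pd

def handle_missing(df: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    keep = df.columns[df.isna().mean() <= max_missing]   # features below the missing cutoff
    out = df[keep].copy()
    return out.fillna(out.median(numeric_only=True))     # median imputation per column
```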

33

Step 1: Data preparation – missing values

34

Step 1: Data preparation – outliers
Trimming/deleting: infrequent outliers due to error
Retaining: legitimate outliers that contain information
Winsorizing/winsorization: retain outliers but mitigate their impact

Limit extreme values to a threshold
Example: values exceeding three standard deviations are shrunk to three standard deviations
Example: 95% winsorization sets all data below the 2.5th and above the 97.5th percentile to the 2.5th and 97.5th percentiles
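A minimal numpy sketch of the 95% winsorization described above; the percentile cutoffs are parameters and could be replaced, for example, by a three-standard-deviation rule.

```python
# Minimal sketch: cap values at the 2.5th and 97.5th percentiles (95% winsorization).
import numpy as np

def winsorize(x: np.ndarray, lower_pct: float = 2.5, upper_pct: float = 97.5) -> np.ndarray:
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    return np.clip(x, lo, hi)   # values beyond the cutoffs are set to the cutoffs
```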

35

Step 1: Data preparation – transformation
Stationarity: distribution statistics such as the mean and variance do not change over time

Non-stationary data such as stock price levels need to be detrended (e.g., converted to returns) to become stationary
Popular detrending methods: log difference, % difference, difference

36

Step 1: Data preparation – transformation
Standardization

Transform data to a N(0, 1) distribution

\tilde{X} = \frac{X - \hat{\mu}}{\hat{\sigma}}

37
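A minimal sketch of the standardization above with scikit-learn; the toy X array is for illustration only.

```python
# Minimal sketch: transform each feature column to roughly N(0, 1).
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 260.0]])  # toy feature matrix
X_std = StandardScaler().fit_transform(X)                  # column means 0, stds 1
print(X_std.mean(axis=0), X_std.std(axis=0))
```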

Step 1: Data preparation – transformation

[Figure: correlation matrix on the original data vs. correlation matrix on the transformed data]

38

Step 1: Data preparation – common mistakes in finance
Look-ahead bias (data leakage)

Use information that is not yet available at the time of the test
e.g., a monthly return that is only available at the end of the month but is used at the beginning of the month
e.g., using financial statements or GDP releases before they were historically released

Survivorship bias
Use a dataset that excludes discontinued time series
e.g., backtest a category of mutual funds without discontinued mutual funds in history

39

Step 2: Feature selection – why?
Large set of predictors – curse of dimensionality

Irrelevant noise features increase the complexity/dimensionality of the problem and exacerbate the risk of overfitting
If the number of predictors p approaches the number of data points n, the variance of the model increases significantly; keep p << n for a robust model
Makes searching for the optimal solution difficult
Adds difficulty to interpreting the model

Feature selection is a bias-variance trade-off
Number of predictors p ↓, bias ↑, variance ↓

40

Step 2: Feature selection – curse of dimensionality
A problem with number of observations n = 100 and 20 features truly related to the response

Model error will be higher when p ↑, especially if noise features are added

41

[Figure: model error vs. degrees of freedom; p – number of features; degrees of freedom – number of selected features; y-axis – model error]

Step 2: Feature selection – common methods
Subset selection

Identify a subset of predictors related to the response

Shrinkage/regularization
Regress on all predictors but shrink coefficients towards 0 to identify the truly influential predictors

Dimension reduction
Map the p predictors into an M-dimensional subspace with M < p

42

Step 2: Feature selection – subset selection
Stepwise

Forward selection
Start with the null model (with only an intercept)
For k = 0, …, p – 1, add one variable and select the model with the best improvement (M_k)
Iterate based on M_k by adding one variable at a time and find the best improvement (M_{k+1}) (improvement criteria in Step 4)
Select the best model from (M_0, …, M_p)

Backward selection
Start with the full model (with all the features)
Similar iteration as forward selection but remove one variable at a time

43

Step 2: Feature selection – subset selection
Hybrid stepwise

A combination of forward and backward selection
After adding one variable (forward), re-evaluate and remove variables that no longer provide improvement (backward)

The starting point of the stepwise iteration can lead to different selections of models

44

Step 2: Feature selection – shrinkage/regularization
Ridge regression

OLS is modified to also minimize the squared sum of the coefficients (L2 regularization)
Cost function/loss function (the second term is the shrinkage penalty)

\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

Lambda (λ): tuning parameter controlling the shrinkage penalty
Bias-variance trade-off: as λ ↑, bias ↑, variance ↓
Coefficients will shrink towards 0 but not exactly to 0

45

Step 2: Feature selection – shrinkage/regularization
Ridge regression

[Figure: bias, variance, and combined error as λ varies]

46

Step 2: Feature selection – shrinkage/regularization
Lasso regression

OLS is modified to also minimize the absolute sum of the coefficients (L1 regularization)
Cost function/loss function (the second term is the shrinkage penalty)

\min_{\beta} \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} \left| \beta_j \right|

Forces some coefficients to be exactly 0; better for feature selection

47

Step 2: Feature selection – shrinkage/regularization
Lasso regression

[Figure: bias, variance, and combined error as λ varies]

48

Step 2: Feature selection – shrinkage/regularization
Why does lasso force some coefficients to be exactly 0?

[Figure: contours of the RSS and the constrained space of β for lasso vs. ridge]

49

Step 2: Feature selection – shrinkage/regularization
Lambda (λ)

Tuning parameter or hyperparameter
Can be found by cross-validation (elaborated later)

Elastic net
Limitations of lasso
When p > n, lasso selects at most n variables
It selects only one variable from a highly correlated group

A combination of ridge and lasso

50
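To make the ridge and lasso discussion above concrete, here is a minimal scikit-learn sketch; alpha plays the role of the tuning parameter λ, and the data is synthetic, generated only for illustration.

```python
# Minimal sketch: L2 (ridge) vs. L1 (lasso) shrinkage on synthetic data.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)  # only 2 true predictors

ridge = Ridge(alpha=1.0).fit(X, y)   # coefficients shrink toward 0 but stay non-zero
lasso = Lasso(alpha=0.1).fit(X, y)   # many coefficients become exactly 0
print(np.round(ridge.coef_, 2))
print(np.round(lasso.coef_, 2))
```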

Step 2: Feature selection – dimension reduction
Transform X_1, X_2, …, X_p to Z_1, Z_2, …, Z_M with M < p (reduce the dimension of the features)

Z_m = \sum_{j=1}^{p} \phi_{jm} X_j

y_i = \theta_0 + \sum_{m=1}^{M} \theta_m z_{im} + \varepsilon_i

\beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}

51

Step 2: Feature selection – dimension reduction
Principal components analysis (PCA) (elaborated in Lecture 4)

52

[Figure: principal components Z1 and Z2 in the (X1, X2) plane]

Step 4: Model assessment – algorithm assumption
OLS is a quite restrictive algorithm with strict assumptions required to produce BLUE

Linear relationship between x and y
Use a residual plot to identify any pattern (a well-conditioned model should not have any pattern in the residual plot)
May consider transforming the data (log X, X², √X, etc.)
Switch to non-linear methods

Uncorrelated error terms
If the error terms are correlated, the standard errors of the coefficients will be underestimated, which will distort model interpretation such as hypothesis tests and confidence intervals
Example: autocorrelation in time series data
Switch to time series regression methods (AR, MA, ARMA, etc.)

53

Step 4: Model assessment – algorithm assumption
OLS assumptions (cont.)

Constant variance of the error term
Non-constant variance in the error is called heteroscedasticity
May consider transforming the response (log Y, √Y)
Switch to heteroscedasticity methods (ARCH, GARCH, etc.)

Low correlation among predictors
High correlation among predictors is called collinearity/multicollinearity
Collinearity may give the model a false opportunity of arbitrage and result in big positive and negative coefficients on highly correlated predictors
May consider dropping some predictors
May consider combining some predictors (by averaging) or setting the difference of two predictors as a new variable
Ridge regression, lasso, PCA

54

Step 4: Model assessment – model fit
R-squared

R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

TSS – total sum of squares; RSS – residual sum of squares

Mean squared error (MSE)

\mathrm{MSE} = \frac{\mathrm{RSS}}{n} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}

Root mean squared error (RMSE)

\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}

Residual standard error (RSE)

\mathrm{RSE} = \sqrt{\frac{\mathrm{RSS}}{n - p - 1}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}

55

Step 4: Model assessment – loss function
The evaluation metrics on the previous pages are also called loss functions

A loss function typically takes the form

\mathrm{Loss} = \text{error deviation from the true value} + \text{penalty (optional)}

Recall the loss function in shrinkage/regularization

56

Step 4: Model assessment – training error vs. test error
Training error: errors between the training data and the fitted model (in sample)

Test error: errors between the test data and the model prediction (out of sample)

Minimizing the training error does not necessarily mean minimizing the test error

57

[Figure: training error vs. test error]

Step 4: Model assessment – underfitting and overfitting

58

Step 4: Model assessment – underfitting and overfitting
The goal is to choose a model with low test error

Underfitting
Easier to identify: high training error

Overfitting
More common and harder to identify
Also called data mining
A fake relationship in the training data that will not materialize out of sample
Sign: test error much larger than training error

Solutions
Indirectly: adjust the training error to account for the bias due to overfitting
Directly: cross-validation

59

Step 4: Model assessment – adjust training error and penalize complicated models
Cp

C_p = \frac{1}{n} \left( \mathrm{RSS} + 2 p \hat{\sigma}^2 \right) = \mathrm{MSE} + \text{penalty}

p – number of predictors; \hat{\sigma}^2 – estimate of the variance of the error

Akaike information criterion (AIC)

\mathrm{AIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + 2 p \hat{\sigma}^2 \right)

Bayesian information criterion (BIC)

\mathrm{BIC} = \frac{1}{n \hat{\sigma}^2} \left( \mathrm{RSS} + \ln(n)\, p \hat{\sigma}^2 \right)

Adjusted R-squared

R^2_{adj} = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}

60

Step 4: Model assessment – adjust training error and penalize complicated models

61

Step 4: Model assessment – cross-validation

62

• Split the dataset into a training set and a testing set (aka validation set or hold-out set)
• Train the model on the training set
• Use the trained model to predict on the testing set
• Assess baseline model accuracy on the testing set
• Tune and choose the final model based on the performance on the testing set
• Train on all data and make predictions on future observations

Step 4: Model assessment – k-fold cross-validation
Randomly divide the observations into equally sized, non-overlapping k folds (typically 5 or 10, depending on the size of the data)

Train on k – 1 folds and test (validate) on the remaining fold; repeat k times by using each fold as the test fold (k out-of-sample tests)

Choose the model with the lowest average test error

\mathrm{CV}_{(k)} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{MSE}_i

63

Step 4: Model assessment – k-fold cross-validation

64

[Figure: all data split into a training set and a test set across five runs (5-fold)]

Section 3: Logistic Regression

65

Logistic regression
Logistic regression is the simplest algorithm for binary classification in supervised learning

Examples
Purchase of a product (Y = 1 if purchase, Y = 0 if not)
Click on a display ad (Y = 1 if click, Y = 0 if not)
Default on a loan (Y = 1 if default, Y = 0 if not)

66

Limitation of linear regression on classification
Linear regression cannot guarantee a valid probability Y ∈ [0, 1]

For classification with more than 2 levels, in general there is no natural way to convert qualitative levels to a quantitative response

67

[Figure: linear regression fit producing invalid probability predictions outside [0, 1]]
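A minimal scikit-learn sketch of the point above, contrasting the two fits on a synthetic binary target (illustrative only).

```python
# Minimal sketch: linear regression can produce "probabilities" outside [0, 1];
# logistic regression cannot.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 1))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 0).astype(int)

lin = LinearRegression().fit(X, y)
logit = LogisticRegression().fit(X, y)

x_new = np.array([[-4.0], [0.0], [4.0]])
print(lin.predict(x_new))                # can fall below 0 or above 1
print(logit.predict_proba(x_new)[:, 1])  # always a valid probability in [0, 1]
```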

Logistic function
Conditional expectation of Y equals the probability of Y = 1

E(Y \mid X) = \Pr(Y = 1 \mid X) \times 1 + \left( 1 - \Pr(Y = 1 \mid X) \right) \times 0 = \Pr(Y = 1 \mid X)

The logistic function is an S-shaped curve between 0 and 1

p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p}}

Convert to linear form (the left-hand side is the logit)

\ln \frac{p(X)}{1 - p(X)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p

Logistic regression is part of the generalized linear model (GLM) family, with link function f

f(\mu_Y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p

68

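A minimal numpy sketch of the logistic (sigmoid) function and its inverse, the logit; the inputs are toy values chosen only for illustration.

```python
# Minimal sketch: the sigmoid maps any linear predictor z to (0, 1); the logit inverts it.
import numpy as np

def sigmoid(z):
    return np.exp(z) / (1.0 + np.exp(z))

def logit(p):
    return np.log(p / (1.0 - p))

z = np.array([-3.0, 0.0, 3.0])
p = sigmoid(z)                    # values strictly between 0 and 1
print(p)
print(np.allclose(logit(p), z))   # True: the logit recovers the linear form
```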

Difference and similarity with linear regression
In linear regression, a one-unit change in X leads to a β-unit change in Y

In logistic regression, there is no linear relationship between p(X) and X. The rate of change in p(X) per unit change in X depends on the level of X

\frac{\partial p(X)}{\partial X_j} = \beta_j p(X) \left( 1 - p(X) \right)

Statistical inference (standard errors, t-stats, p-values, hypothesis tests) is similar

69

[Figure: different slope of p(X) at different levels of X]

How to solve logistic regression – maximum likelihood
estimation (MLE)

Likelihood function

How likely particular values of the parameters \theta = (\theta_1, \theta_2, \ldots, \theta_p) (treated as variables) are for a given set of observations Y = (y_1, y_2, \ldots, y_n): p(\theta \mid Y)
Re-expression of the joint probability of the observations Y conditioned on the parameters θ (assuming the y_i are independent and identically distributed, i.i.d.)

\mathcal{L}(\theta \mid Y) = f(Y \mid \theta) = f(y_1 \mid \theta) f(y_2 \mid \theta) \cdots f(y_n \mid \theta) = \prod_{i=1}^{n} f(y_i \mid \theta)

Under Bayes' theorem with a uniform prior distribution

p(\theta \mid Y) = \frac{f(Y \mid \theta)\, p(\theta)}{p(Y)} \propto f(Y \mid \theta)

f(Y \mid \theta): likelihood; p(\theta): prior distribution (uniform, non-informative)
p(Y): probability of the data averaged over all parameters (independent of θ)
p(\theta \mid Y): posterior distribution, in this case proportional to the likelihood

70

How to solve logistic regression – maximum likelihood
estimation (MLE)

An example with coin flip
Observation: two heads (H) one tail (T) in three flips, what is likelihood of parameter pH
(fairness of coin, pH = 0.5 for fair coin)
If p_H = 0.5: \mathcal{L}(p_H \mid HHT) = p(HHT \mid p_H = 0.5) = 0.5^2 \times 0.5 = 0.125
If p_H = 0.3: \mathcal{L}(p_H \mid HHT) = p(HHT \mid p_H = 0.3) = 0.3^2 \times 0.7 = 0.063

71

[Figure: likelihood as a function of p_H, with the maximum-likelihood parameter marked]
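The coin-flip likelihood above can be evaluated over a grid of p_H values. Here is a minimal numpy sketch; the grid search is purely illustrative, and the maximum lands near 2/3.

```python
# Minimal sketch: likelihood of p_H given the observation HHT, L(p) = p^2 * (1 - p).
import numpy as np

for p in (0.5, 0.3):
    print(p, round(p ** 2 * (1 - p), 3))        # 0.125 and 0.063, matching the slide

p_grid = np.linspace(0.01, 0.99, 981)
likelihood = p_grid ** 2 * (1 - p_grid)
print(round(p_grid[np.argmax(likelihood)], 2))  # about 0.67, the maximum-likelihood estimate
```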

How to solve logistic regression – maximum likelihood
estimation (MLE)

MLE tries to find parameter set θ that maximizes likelihood function

For logistic regression

\mathcal{L}(\beta \mid Y) = \prod_{i:\, y_i = 1} p(x_i) \prod_{i':\, y_{i'} = 0} \left( 1 - p(x_{i'}) \right)

MLE is the most widely used method for fitting models

MLE can be solved either analytically (with closed-form solution) or numerically

OLS is a special case of MLE
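As an illustration of solving MLE numerically, here is a minimal sketch that fits a one-predictor logistic regression by minimizing the negative log-likelihood with scipy; the data is synthetic and the clipping is only a numerical safeguard.

```python
# Minimal sketch: numerical MLE for logistic regression via scipy.optimize.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(size=500)
p_true = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)

def neg_log_likelihood(beta):
    z = beta[0] + beta[1] * x
    p = np.clip(1.0 / (1.0 + np.exp(-z)), 1e-9, 1 - 1e-9)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(2))
print(result.x)   # estimates should be close to the true (0.5, 1.5)
```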

72

Evaluate logistic regression model – deviance
Deviance is used to measure
goodness-of-fit for MLE

D = -2 \left( \ln \mathcal{L}(M_c) - \ln \mathcal{L}(M_s) \right)

M_s: saturated model with n parameters fitting n observations; when fitted perfectly, \mathcal{L}(M_s) = 1 and \ln \mathcal{L}(M_s) = 0

M_c: candidate model

D ↓, better fit

Both deviance and log likelihood can
be used to evaluate feature selection

73

Evaluate logistic regression model – threshold
Threshold is used to convert a probability prediction into a class prediction

74

FICO    Predicted default prob.    Prediction
800     0.1                        No default
750     0.2                        No default
700     0.4                        No default
650     0.6                        Default
600     0.7                        Default
550     0.8                        Default

(Threshold: when the threshold moves, the predicted class changes)
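A minimal sketch of applying a threshold to predicted probabilities; a cutoff of 0.5 is illustrative and consistent with the table above.

```python
# Minimal sketch: convert predicted default probabilities into class predictions.
import numpy as np

default_prob = np.array([0.1, 0.2, 0.4, 0.6, 0.7, 0.8])
threshold = 0.5
prediction = np.where(default_prob >= threshold, "Default", "No default")
print(prediction)
```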

Evaluate logistic regression model – Type I and Type II
errors

Type I (false positive, α) and Type II (false negative, β) errors from hypothesis
test

75

                 H0 is true               H0 is false
Don't reject     –                        Type II error (β)
Reject           Type I error (α)         –

Evaluate logistic regression model – confusion matrix

76

False: the prediction differs from the actual; positive/negative refers to the prediction

                          Actual – (Null)           Actual + (Non-null)        Sum
Predict – (Null)          TN (true negative)        FN (false negative, β)     N*
Predict + (Non-null)      FP (false positive, α)    TP (true positive)         P*
Sum                       N                         P

FP / N: Type I error, false positive rate
FN / P: Type II error
TP / P: true positive rate, hit rate, sensitivity, recall
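A minimal scikit-learn sketch computing the confusion matrix and the three rates above; the label vectors are made up for illustration.

```python
# Minimal sketch: confusion matrix and derived error rates for a binary classifier.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("FP/N (Type I, false positive rate):", fp / (fp + tn))
print("FN/P (Type II):                    ", fn / (fn + tp))
print("TP/P (true positive rate, recall): ", tp / (tp + fn))
```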

Evaluate logistic regression model – ROC curve
The ROC curve plots TPR (true positive rate) vs. FPR (false positive rate) as the threshold moves

77

[ROC figure: x-axis – FPR (false positives / actual negatives, Type I error); y-axis – TPR (true positives / actual positives, 1 – Type II error, hit rate); the diagonal is a completely random classifier; the ideal ROC hugs the upper-left corner; AUC: area under the curve]
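A minimal scikit-learn sketch of tracing the ROC curve and computing AUC; the scores are made-up predicted probabilities.

```python
# Minimal sketch: ROC curve points and area under the curve for a binary classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9])  # predicted P(Y = 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)   # TPR vs. FPR as the threshold moves
print(roc_auc_score(y_true, scores))               # AUC: area under the ROC curve
```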

Revisit logistic regression in Lecture 3
Advanced topics in logistic regression will be further elaborated in Lecture 3

Gradient descent
Unbalanced data
Loss function with misclassification penalty
Lift table
F1 scores
Precision-recall curve

78

79

https://wrds-www.wharton.upenn.edu/
