MFIN 290 Application of Machine Learning in Finance: Lecture 2
Edward Sheng
7/5/2021
Agenda
1. Basic Decision Tree
2. Bagging and Boosting Tree
3. Support Vector Machine (SVM)
Section 1: Basic Decision Tree
Moving beyond linearity
From linear to non-linear
Linear models are restrictive: they require a linear relationship, either directly or after a transformation
However, non-linear relationships are very common
Trade-offs are made when moving to more complicated models
Flexibility-interpretation trade-off
As a model becomes more complex, flexibility ↑, interpretation ↓
A more complex model may track the true data relationship better, but it makes inference more difficult
A more complex model may also be a black box that is hard to interpret and communicate, adding career risk
Moving beyond linearity – bias-variance trade-off
Bias-variance trade-off
Bias: deviation of the model's estimated relationship $\hat{f}(X)$ from the true relationship $f(X)$
Variance: how much $\hat{f}(X)$ changes when estimated on a different dataset
Test error is a combination of bias and variance
As a model becomes more complex, bias ↓, variance ↑
$E\big[(Y - \hat{f}(X))^2\big] = \mathrm{Var}\big(\hat{f}(X)\big) + \big[\mathrm{Bias}\big(\hat{f}(X)\big)\big]^2 + \mathrm{Var}(\varepsilon)$
Expected test MSE = variance of $\hat{f}(X)$ + squared bias of $\hat{f}(X)$ + variance of the irreducible error
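A minimal simulation sketch of this decomposition (assuming a synthetic sine-wave data-generating process and scikit-learn's polynomial regression tools, neither of which appears in the slides): fit a rigid and a flexible model on many resampled training sets and compare the squared bias and variance of their predictions at fixed test points.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)              # assumed "true" relationship f(X)
x_test = np.linspace(0, 2, 50)           # fixed test points
sigma = 0.3                              # noise standard deviation

def simulate(degree, n_sims=200, n_train=40):
    """Fit a polynomial of the given degree on many training sets and
    return (squared bias, variance) of its predictions, averaged over x_test."""
    preds = np.empty((n_sims, x_test.size))
    for s in range(n_sims):
        x = rng.uniform(0, 2, n_train)
        y = f(x) + rng.normal(0, sigma, n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(x.reshape(-1, 1), y)
        preds[s] = model.predict(x_test.reshape(-1, 1))
    bias2 = np.mean((preds.mean(axis=0) - f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

for degree in (1, 3, 10):
    b2, var = simulate(degree)
    # expected test MSE ≈ bias² + variance + σ²
    print(f"degree={degree:2d}  bias²={b2:.3f}  variance={var:.3f}  "
          f"bias²+var+σ²={b2 + var + sigma**2:.3f}")
```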
Moving beyond linearity – bias-variance trade-off
How bias and variance trade off depends on the data; that is why there is no holy-grail method that fits all problems (the "no free lunch" theorem)
[Figure: test error, variance, and bias vs. model flexibility — one panel where the complex model performs better, one where the linear model performs better]
Moving beyond linearity – a glance
Linear: less flexibility, easier to interpret, higher bias, lower variance
Non-linear: more flexibility, harder to interpret, lower bias, higher variance
Decision trees
Tree-based methods
Simple yet non-linear supervised learning models
Flexible yet easier to explain compared with some more advanced algorithms
Can be used for both regression and classification
General procedure
Divide the predictor space X1, X2, …, Xp into J distinct and non-overlapping regions R1, R2, …, RJ
Make the same prediction for every observation within region Rj: the mean (for regression) or the mode (for classification) of the training observations in that region
Decision trees – an example
Predict salary of a baseball player
from years in the Leagues and hits
from last year
Salary is color-coded with high salary
in yellow-red and low salary in purple-
blue
Visually, for junior players with only a
few years in the Leagues, number of
hits is not that important
For players with more than 5 years in
the Leagues, number of hits starts to
explain salary dispersion
Decision trees – what is a tree?
Tree terminology (see the annotated tree diagram): root, branch, internal node (split), terminal node (leaf), region (box)
Decision trees – what is a tree?
A tree is easy to interpret
Years is the most important variable for salary
In early years, the number of hits does not matter much for salary
Later in a career, the number of hits becomes important
A tree is easier to explain than a regression
Nonlinearity
The number of hits does not have a linear relationship with salary
Decision trees – regression trees
The objective is to find the set of boxes that minimizes the RSS (equivalently MSE or RMSE):
$\min_{R_1,\dots,R_J} \sum_{j=1}^{J} \sum_{i \in R_j} \big(y_i - \hat{y}_{R_j}\big)^2$
$\hat{y}_{R_j}$: mean response of the training observations in box $R_j$
How do we find the partition of the data that achieves the minimum RSS?
When J = 2 (two boxes) and there are N observations, there are on the order of N possible splits
When J = 3, there are on the order of N² possible splits, etc.
The complexity of the problem grows exponentially
Decision trees – recursive binary splitting
Recursive binary splitting – a top-down, greedy approach
Top-down: start at the top of the tree and successively add splits
Greedy: at each step, choose the best split for that particular box, without looking ahead to the globally optimal tree
• For each predictor Xj, find the cutpoint s that minimizes RSS
• Choose the predictor and cutpoint that lead to the lowest RSS
• Find the next split in the same way; this split can be in any of the boxes already created
• Create more splits/boxes until a stopping criterion is met (e.g., no region contains more than 5 observations, or the number of leaves is below 30)
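A minimal sketch of one greedy split (assuming a small synthetic NumPy dataset; names such as `best_split` are illustrative, not from the slides): for each predictor, scan candidate cutpoints and keep the (predictor, cutpoint) pair with the lowest RSS.

```python
import numpy as np

def best_split(X, y):
    """Greedy search for the single split that minimizes RSS.
    X: (n, p) array of predictors, y: (n,) response."""
    best = (None, None, np.inf)          # (feature index, cutpoint, RSS)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = np.where(X[:, 0] > 5, 10.0, 2.0) + rng.normal(0, 1, 100)  # true split on X0 near 5
print(best_split(X, y))   # expect feature 0, cutpoint near 5
```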
Decision trees – recursive binary splitting
It's like cutting a cake: each split cuts one existing region into two (see the numbered partition diagram).
Decision trees – pruning
A full-grown tree can easily overfit data (again, bias-variance trade-off)
Solution: grow a very large tree and prune it back to a subtree
Cost complexity pruning / weakest link pruning
$\min_{T} \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \big(y_i - \hat{y}_{R_m}\big)^2 + \alpha |T|$
$|T|$: number of terminal nodes of tree T
α: tuning parameter (a hyperparameter, elaborated later)
The α|T| term is a penalty, similar to the Lasso penalty
Decision trees – tuning
Tuning
• Use k-fold cross validation to select α
• Apply recursive binary splitting and cost complexity pruning to first grow and then prune trees on each fold
• Evaluate the out-of-sample MSE as a function of α
• Pick the α that minimizes the average MSE
Pruning
• Use recursive binary splitting to grow a large tree on all training data
• Use cost complexity pruning with the tuned α to obtain the best subtree
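A hedged scikit-learn sketch of this workflow (the slides do not prescribe a library; `ccp_alpha` is scikit-learn's name for the cost-complexity parameter α): compute candidate α values from the pruning path of a fully grown tree, then cross-validate over them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas from the cost-complexity pruning path of a fully grown tree
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(path.ccp_alphas)

# k-fold CV (here k = 5) to pick the alpha with the lowest out-of-sample MSE
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)
print("best alpha:", search.best_params_["ccp_alpha"])
print("test MSE:", -search.score(X_test, y_test))
```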
Decision trees – model assessment
[Figure: cross-validation error vs. tree size — choose the number of leaves that minimizes it]
Decision trees – classification
The decision tree algorithm introduced here is also called CART (classification and regression trees), which is the most popular decision tree algorithm
For classification
The predicted response is the most commonly occurring class among the training observations in each region (vs. the mean in regression trees)
Error measures (vs. RSS in regression trees)
Classification error rate: $E = 1 - \max_k \hat{p}_{mk}$
Gini index: $G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$
Entropy: $D = -\sum_{k=1}^{K} \hat{p}_{mk} \ln \hat{p}_{mk}$
$\hat{p}_{mk}$: proportion of observations from class k in region m
If the classification is pure, $\hat{p}_{mk}$ is close to 0 or 1, and E, G, and D are all small
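A small NumPy sketch (purely illustrative) computing the three impurity measures from the class proportions in one region.

```python
import numpy as np

def node_impurity(p):
    """p: array of class proportions p_mk in one region (must sum to 1)."""
    p = np.asarray(p, dtype=float)
    error = 1.0 - p.max()                            # classification error rate
    gini = np.sum(p * (1.0 - p))                     # Gini index
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy (0·log 0 treated as 0)
    return error, gini, entropy

print(node_impurity([0.5, 0.5]))    # impure node: all measures large
print(node_impurity([0.95, 0.05]))  # nearly pure node: all measures small
```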
Decision trees vs. linear model
A 2-class classification (green or
yellow)
If true boundary is linear (top), tree
will do worse than linear model
If true boundary is non-linear
(bottom), tree will do better than
linear model
Decision trees – pros and cons
Pros
Easy: decision trees (CART) are the easiest tree method
Automatically rank indicators by predictive power
Cons
Easily overfit: the model may not be robust when the data change
May be biased when the tree is too simple
Overweights the predictive power of variables near the root while underweighting variables deeper in the branches and other indicators
Weak learner
Improvement
Bagging, random forest, boosting
They are ensemble methods: combine base models (weak learners) to improve overall model performance (a strong learner)
Section 2: Bagging and Boosting Tree
Bagging
BAGGing (Bootstrap AGGregating): a combination of bootstrap and
aggregating
Bootstrap
From a population, take n samples Z1, Z2, …, Zn, each with variance σ²
Typically sample with replacement each time
The variance of the sample average is reduced: $\sigma_{\bar{Z}}^2 = \sigma^2 / n$
Bagging
Bootstrap
• Repeat sampling from the training set B times to get B different bootstrapped training sets
• Train a decision tree on each bootstrapped training set (for the b-th set, the prediction is $\hat{f}^{*b}(x)$)
Aggregating
• For regression: average the predictions from the B trees, $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$
• For classification: take a majority vote or average the predicted probabilities
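A hedged sketch of bagging by hand in scikit-learn (an assumed library; scikit-learn also ships `BaggingRegressor`, which wraps the same idea): bootstrap the training set B times, fit a tree on each resample, and average the B predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.utils import resample
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

B = 100
trees = []
for b in range(B):
    Xb, yb = resample(X_train, y_train, random_state=b)   # bootstrap sample (with replacement)
    trees.append(DecisionTreeRegressor(random_state=b).fit(Xb, yb))

# Aggregate: average the B tree predictions
y_bag = np.mean([t.predict(X_test) for t in trees], axis=0)

single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("single tree MSE:", mean_squared_error(y_test, single.predict(X_test)))
print("bagged MSE:     ", mean_squared_error(y_test, y_bag))
```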
Random forest
An improvement on bagged trees
Bagging: bootstrap the training data, use all predictors
Random forest: bootstrap the training data, and randomly select a subset of predictors for each tree
Typically choose the number of predictors $m = \sqrt{p}$ for each tree
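A brief scikit-learn sketch (assumed library; `max_features="sqrt"` corresponds to m = √p) comparing a random forest with bagged trees on the same synthetic data.

```python
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=600, n_features=10, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: every tree considers all p predictors
bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200, random_state=0)
# Random forest: each split considers only m = sqrt(p) randomly chosen predictors
rf = RandomForestRegressor(n_estimators=200, max_features="sqrt", random_state=0)

for name, model in [("bagging", bag), ("random forest", rf)]:
    model.fit(X_train, y_train)
    print(name, mean_squared_error(y_test, model.predict(X_test)))
```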
Random forest – benefits
It is like a collective vote from each
sub-committee (how democratic!)
Bagged trees can be correlated (e.g., the CEO, the most influential person, will always be there)
A random forest decorrelates the trees by forming sub-committees, so other members can speak up
It is very easy to tune: only the number of predictors in each tree (typically $\sqrt{p}$) and the number of trees need to be chosen
Random forest – benefits
Random forest typically has much lower error than bagged trees
Random forest – variable importance
We can record the reduction of RSS (regression) or Gini index (classification) at each split of each tree
Aggregated over all trees, this provides a variable importance for each predictor
Variable importance shows the total reduction of error attributable to each predictor
Variable importance is another tool for feature selection
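A short scikit-learn sketch (assumed library): `feature_importances_` reports each predictor's total impurity reduction, normalized to sum to 1.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# 10 predictors, only the first 4 are informative
X, y = make_regression(n_samples=600, n_features=10, n_informative=4, noise=10.0, random_state=0)
rf = RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0).fit(X, y)

ranked = np.argsort(rf.feature_importances_)[::-1]   # sort predictors by importance
for j in ranked:
    print(f"X{j}: importance = {rf.feature_importances_[j]:.3f}")
```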
Boosting
Boosting: to increase or improve
Bagging vs. boosting:
• Bagging first overfits each tree (bias ↓, variance ↑) and then averages (variance ↓); boosting first fits small trees (bias ↑, variance ↓) and then learns slowly by adding new small trees fit to the residuals of the current model (bias ↓)
• In bagging, each tree grows independently; in boosting, trees grow sequentially, with each tree grown based on the previously grown trees
• Bagging bootstraps the training data; boosting fits on modified versions of the original dataset
• Bagging is easier to tune; boosting has more hyperparameters to tune
Boosting – boosting vs. bagging
Bagging and random forest focus on reducing variance; boosting focuses on reducing bias
Boosting
Boosting – AdaBoost (adaptive boosting)
AdaBoost increases the observation weights of misclassified observations, so later trees focus on the hard cases (see the sketch below)
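A minimal scikit-learn sketch (assumed library) of AdaBoost on a toy classification problem; each stump is fit to reweighted data in which misclassified points receive larger weights.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base learner: a stump (one split); AdaBoost reweights observations after each round
ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=0,
)
ada.fit(X_train, y_train)
print("test accuracy:", ada.score(X_test, y_test))
```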
Boosting – gradient boosting
Initialize $\hat{f}(x) = 0$ and the residuals $r_i = y_i$
Fit a small tree $\hat{f}^1$ to the residuals
Update the model: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^1(x)$
Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^1(x_i)$
Repeat: fit the next small tree to the updated residuals, and so on (see the sketch below)
A good walkthrough: https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4
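A from-scratch sketch of these steps (assuming scikit-learn regression trees and a synthetic dataset; in practice one would use `GradientBoostingRegressor` or XGBoost): repeatedly fit a small tree to the current residuals and add a shrunken copy of it to the model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

B, lam, depth = 500, 0.05, 2      # number of trees B, shrinkage λ, small-tree depth
trees = []
residual = y_train.copy()         # r_i = y_i
for b in range(B):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=b).fit(X_train, residual)
    residual -= lam * tree.predict(X_train)   # r_i <- r_i - λ f_b(x_i)
    trees.append(tree)

# Boosted prediction: sum of the shrunken trees
def predict(X):
    return lam * np.sum([t.predict(X) for t in trees], axis=0)

print("test MSE:", mean_squared_error(y_test, predict(X_test)))
```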
Boosting – XGBoost (extreme gradient boosting)
A high-speed, high-accuracy method/package that has dominated recent research
Supports gradient boosting, stochastic gradient boosting, and regularized gradient boosting
Regularization, which controls overfitting, is the main source of its high accuracy
Boosting – regularization in XGBoost
The objective adds a complexity penalty Ω(f) to the loss L(f):
Ω(f): complexity of the tree (L2 regularization plus a penalty for the size of the tree)
L(f): loss function on the prediction errors (e.g., squared error)
Boosting – XGBoost performance
Boosting – hyperparameters
Number of trees B
Unlike bagging, boosting can overfit if B is too large
Use cross validation to select B
Shrinkage parameter λ
A small positive number that controls the rate of learning (λ ↓, learning rate ↓)
Typically 0.01 or 0.001, depending on the problem
A very small λ may require a larger B
Number of splits in each tree, d (d + 1 terminal nodes)
Controls the complexity of each tree and the interaction depth of the boosted model
With d = 1, each tree is a single split (a stump), and the boosted model is additive with no interactions
XGBoost has additional hyperparameters for regularization (a sketch with typical settings follows)
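A hedged XGBoost sketch (parameter names are those of xgboost's scikit-learn wrapper: `n_estimators` ≈ B, `learning_rate` ≈ λ, `max_depth` relates to d, `reg_lambda` is the L2 regularization); the values shown are illustrative, not recommendations.

```python
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=2000,      # B: number of trees (watch for overfitting)
    learning_rate=0.01,     # λ: shrinkage
    max_depth=2,            # controls tree size / interaction depth
    subsample=0.8,          # stochastic gradient boosting
    reg_lambda=1.0,         # L2 regularization (part of Ω(f))
    random_state=0,
)
model.fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```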
Section 3: Support Vector Machine (SVM)
Hyperplane
Hyperplane: a flat affine subspace of
dimension p – 1 in a space of
dimension p
p = 2, hyperplane is a line
p = 3, hyperplane is a plane
Mathematically
$f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0$
Points on the hyperplane: $f(X) = 0$
Points above the hyperplane: $f(X) > 0$
Points below the hyperplane: $f(X) < 0$
Separating hyperplane
Separating hyperplane: a
hyperplane that separates training
observations perfectly
For example
Hyperplane A: $f(X) = x_1 + x_2 + c$
Hyperplane B: $f(X) = x_1 + c$
[Figure: two separating hyperplanes, A and B, in the (x1, x2) plane]
Maximal margin hyperplane
$d_i$: distance from the i-th sample to the hyperplane
Margin: $\min_i d_i$, the distance of the closest samples from the hyperplane
Maximal margin hyperplane: the linear hyperplane (classifier) with the maximum margin
[Figure: hyperplanes A and B in the (x1, x2) plane with their margins and the distance $d_i$]
Maximal margin hyperplane – mathematical representation
Hyperplane: $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = \beta_0 + W^T X$
Distance: $d_i = |f(X_i)| / \lVert W \rVert$
Margin: $m = \min_i d_i = a / \lVert W \rVert$
[Figure: hyperplane with margin m; the margin boundaries are $f(X) = a$ and $f(X) = -a$]
Maximal margin hyperplane – mathematical representation
Goal: maximize margin when samples are
perfectly separated by the hyperplane
$\max_{W,\,\beta_0} m \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge a, \ \forall i = 1, \dots, n$
Note $y_i = \pm 1$
However, this has an infinite number of solutions (rescalings); for example:
$f(X) = x_1 + x_2 + c$
$f(X) = 2x_1 + 2x_2 + 2c$
……
[Figure: maximal margin hyperplane with margin m; all points lie on the correct side, outside the margin boundaries $f(X) = \pm a$]
Maximal margin hyperplane – mathematical representation
Divide the constraint by a (rescale W and β0 so the margin boundaries become $f(X) = \pm 1$):
$\max_{W,\,\beta_0} m/a \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge 1, \ \forall i = 1, \dots, n$
Recall that $m = \min_i d_i = a / \lVert W \rVert$
Further transform to
$\max_{W,\,\beta_0} 1/\lVert W \rVert \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge 1, \ \forall i = 1, \dots, n$
This has a unique solution
[Figure: maximal margin hyperplane with margin boundaries $f(X) = 1$ and $f(X) = -1$]
Support vectors
The maximal margin hyperplane is controlled entirely by the support vectors (the points on the margin)
Moving the other points a little does not affect the decision boundary
[Figure: support vectors lying on the margin boundaries in the (x1, x2) plane]
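A brief scikit-learn sketch (assumed library) showing that a fitted linear SVM exposes its support vectors; removing any non-support-vector point would leave the fitted hyperplane unchanged.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated clusters, so a separating hyperplane exists
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)

clf = SVC(kernel="linear", C=1e6)   # very large C ≈ hard margin
clf.fit(X, y)

print("hyperplane: f(X) = %.2f + %.2f*x1 + %.2f*x2"
      % (clf.intercept_[0], clf.coef_[0, 0], clf.coef_[0, 1]))
print("support vectors per class:", clf.n_support_)
print("support vectors:\n", clf.support_vectors_)
```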
Maximal margin hyperplane – mathematical representation
Optimization goal
$\max_{W,\,\beta_0} 1/\lVert W \rVert \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge 1, \ \forall i$
Convert to the primal form
$\min_{W,\,\beta_0} W^T W \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge 1, \ \forall i$
Then convert to the dual form
$\max_{c_1,\dots,c_n} \sum_{i=1}^{n} c_i - \frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n} c_j c_k y_j y_k \,(X_j \cdot X_k) \quad \text{s.t.}\quad c_i \ge 0 \ \forall i \ \text{ and } \ \sum_{i=1}^{n} c_i y_i = 0$
$W = \sum_{i=1}^{n} c_i y_i X_i$; $y_i = \pm 1$
Kernel
Kernel functions
After transforming to the dual form, the objective function depends on the data only through the inner products $X_j \cdot X_k$; this is where the kernel function enters
The general form of a kernel function is $K(X_i, X_j) = \phi(X_i) \cdot \phi(X_j)$
In the linearly separable case above, the kernel is simply the inner product of $X$, i.e., $\phi(X) = X$; for non-linearly separable cases, we can change the kernel to map the data to another feature space
Kernel transformation
For non-linearly separable cases, the kernel can map the data to a higher dimension where a separating hyperplane gives a better cut
$X = (X_1, X_2, \dots, X_n) \;\longrightarrow\; \phi(X) = \big(\phi(X_1), \phi(X_2), \dots, \phi(X_n)\big)$
Kernel transformation
It's like playing Fruit Ninja: the kernel throws the fruit into the air (maps the data to a higher dimension), and the hyperplane is your blade.
Kernel functions
Linear kernel: $X_i \cdot X_j$
Polynomial kernel: $(\gamma\, X_i \cdot X_j + r)^d$
Gaussian (radial basis function, RBF) kernel: $\exp\!\big(-\gamma \lVert X_i - X_j \rVert^2\big)$
Sigmoid kernel: $\tanh(\gamma\, X_i \cdot X_j + r)$
Caution: flexible kernels can overfit!
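A scikit-learn sketch (assumed library; its `SVC` exposes γ, d, r as `gamma`, `degree`, `coef0`) comparing the four kernels on a non-linearly separable toy problem.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original space
X, y = make_circles(n_samples=500, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

kernels = {
    "linear": SVC(kernel="linear"),
    "poly": SVC(kernel="poly", degree=3, gamma="scale", coef0=1.0),   # (γ x·x' + r)^d
    "rbf": SVC(kernel="rbf", gamma="scale"),                          # exp(-γ||x - x'||²)
    "sigmoid": SVC(kernel="sigmoid", gamma="scale", coef0=0.0),       # tanh(γ x·x' + r)
}
for name, clf in kernels.items():
    clf.fit(X_train, y_train)
    print(f"{name:8s} test accuracy: {clf.score(X_test, y_test):.3f}")
```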
Data still not separable?
Previous objective function (primal form):
$\min_{W,\,\beta_0} W^T W \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge 1, \ \forall i$
Now $(\beta_0 + W^T X_i)\, y_i \ge 1$ cannot be satisfied for every observation
We still prefer it to be as close to 1 as possible
[Figure: overlapping classes in the (x1, x2) plane with no perfectly separating hyperplane]
Soft margin
Soft margin: allow some observations to be on the incorrect side of the margin or of the hyperplane
We introduce a slack (penalty) term $\varepsilon_i$ and change the objective function to
$\min_{W,\,\beta_0} \; \lambda\, W^T W + \sum_{i=1}^{n} \varepsilon_i \quad \text{s.t.}\quad (\beta_0 + W^T X_i)\, y_i \ge 1 - \varepsilon_i, \ \ \varepsilon_i \ge 0$
λ is a hyperparameter related to the total budget for violations of the margin
λ ↑, budget ↑
[Figure: soft-margin hyperplane with slack ε for observations inside or beyond the margin]
Loss function with soft margin
The minimal value of $\varepsilon_i$ that satisfies the soft-margin constraints is $\max\{0,\ 1 - (\beta_0 + W^T X_i)\, y_i\}$
We can then plug it into the previous objective function:
$\mathrm{loss} = \sum_{i=1}^{n} \max\{0,\ 1 - (\beta_0 + W^T X_i)\, y_i\} + \lambda\, W^T W$
(first term: minimal value of $\varepsilon_i$; second term: penalty)
If λ is small
Fewer violations of the margin will occur
This leads to high variance and low bias
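A small NumPy sketch (illustrative; `w`, `b0`, and `lam` are made-up values, not from the slides) evaluating this soft-margin loss for a given hyperplane.

```python
import numpy as np

def soft_margin_loss(w, b0, X, y, lam):
    """Hinge loss plus L2 penalty: sum_i max(0, 1 - (b0 + w·x_i) y_i) + lam * w·w."""
    margins = (X @ w + b0) * y                 # (β0 + Wᵀ X_i) · y_i
    hinge = np.maximum(0.0, 1.0 - margins)     # minimal ε_i for each observation
    return hinge.sum() + lam * w @ w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)     # labels in {+1, -1}

w, b0, lam = np.array([1.0, 1.0]), 0.0, 0.1    # a candidate hyperplane
print("loss:", soft_margin_loss(w, b0, X, y, lam))
```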
SVM vs. logistic regression
SVM: hinge loss
$L = \sum_{i=1}^{n} \max\{0,\ 1 - (\beta_0 + W^T X_i)\, y_i\}$
Logistic regression: log loss ($-\ln(\text{likelihood})$)
$L = \sum_{i=1}^{n} \log\big(1 + \exp(-(\beta_0 + W^T X_i)\, y_i)\big)$
SVM vs. logistic regression
SVM
If $(\beta_0 + W^T X_i)\, y_i \ge 1$, the hinge loss is 0 (observations that are correctly classified and outside the margin do not contribute to the hinge loss)
Otherwise, the hinge loss is $1 - (\beta_0 + W^T X_i)\, y_i$
Logistic regression
The log loss is always > 0 (observations that are correctly classified still add to the log loss, but their contribution is small)
SVM for multilevel classification and regression
SVM is mainly used for binary classification, as in the previous slides
SVM can also handle classification with more than two classes
One-versus-one: break the problem down into pairwise binary classifications
One-versus-all: focus on one class K and combine all other observations into a single non-K class
SVM can also be adapted for regression, called support vector regression (SVR)
SVR uses the same concepts of margin and kernel as SVM
SVR
[Figure: SVR illustration showing the margin and a kernel transformation]
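A brief scikit-learn sketch of SVR (assumed library; `epsilon` controls the width of the margin tube and `C` the penalty for points outside it), fit on a synthetic non-linear relationship.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)   # non-linear relationship plus noise

# RBF kernel handles the non-linearity; points inside the epsilon tube incur no loss
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
svr.fit(X, y)
print("in-sample MSE:", mean_squared_error(y, svr.predict(X)))
```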
Hyperparameter tuning
Hyperparameters are the tuning parameters of an algorithm; choose the right hyperparameter values to optimize model performance
Ridge
λ: penalty for L2 regularization (sum of squared coefficients)
LASSO
λ: penalty for L1 regularization (sum of absolute coefficients)
Decision tree
α: penalty for pruning
Random forest
m: number of features for each tree
Number of trees
Hyperparameter tuning
Gradient boosting
B: number of trees
λ: shrinkage parameter (learning rate)
d: number of splits for each tree (d + 1 terminal nodes)
SVM
λ: penalty related to the total budget for margin violations
γ, d, r in the kernel
Rule of thumb: don't be creative; stick to commonly accepted ranges of hyperparameters
Hyperparameter tuning – grid search
Set up a grid in the value space of the hyperparameters
Use cross validation to evaluate each grid point
Recall α in the pruning of decision trees
For hyperparameters with exponential ranges, set the grid on a log scale (e.g., [10, 100, 1000])
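A hedged scikit-learn sketch of a grid search with log-scale grids (parameter names follow scikit-learn's `SVC`: `C` plays the role of the margin-violation budget, `gamma` the kernel γ).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Log-scale grids for exponentially ranged hyperparameters
param_grid = {
    "C": np.logspace(-2, 3, 6),       # 0.01, 0.1, 1, 10, 100, 1000
    "gamma": np.logspace(-4, 0, 5),   # 0.0001 ... 1
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("best hyperparameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```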
Hyperparameter tuning – k-fold cross validation
Traditionally, data are split into three sets: training, validation, and test; the training and validation sets are used to tune hyperparameters
A better way: keep a training set and a test set, perform k-fold cross validation for hyperparameter tuning on the training set, and then evaluate the tuned model on the test set (see the sketch below)
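A sketch of that workflow in scikit-learn (assumed library and model choice): hold out a test set, tune with k-fold cross validation on the training set only, then report performance once on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 5-fold cross validation for tuning, using the training set only
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_features": ["sqrt", 0.5], "n_estimators": [100, 300]},
    cv=5,
)
search.fit(X_train, y_train)

# The test set is touched exactly once, by the final tuned model
print("best params:", search.best_params_)
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```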
Next step
Homework 1: see the detailed instructions
Lectures 3 and 4: Prof. Yujie He will teach Lectures 3 and 4, covering classification, unsupervised learning, and labs
Lecture 3 will return to our normal schedule, on Saturday morning, July 10
https://fred.stlouisfed.org/