
MFIN 290 Application of Machine Learning in Finance: Lecture 2


Edward Sheng

7/5/2021

Agenda

1. Basic Decision Tree

2. Bagging and Boosting Tree

3. Support Vector Machine (SVM)

Section 1: Basic Decision Tree


Moving beyond linearity
From linear to non-linear

Linear models are restrictive: they require a linear relationship, either directly or after transformation
However, non-linear relationships are very common
Trade-offs are made when moving to more complicated models

Flexibility-interpretation trade-off
As a model becomes more complex, flexibility ↑, interpretation ↓
A more complex model may track the true data relationship better, but it makes inference more difficult
A more complex model may also be a black box that is difficult to interpret and communicate, adding career risk


Moving beyond linearity – bias-variance trade-off
Bias-variance trade-off

Bias: deviation of the model's estimated relationship $\hat{f}(X)$ from the true relationship $f(X)$

Variance: how much $\hat{f}(X)$ changes when it is estimated on a different dataset

Test error reflects a combination of bias and variance
As the model becomes more complex, bias ↓, variance ↑

$E\big[(Y - \hat{f}(X))^2\big] = \mathrm{Var}\big(\hat{f}(X)\big) + \big[\mathrm{Bias}\big(\hat{f}(X)\big)\big]^2 + \mathrm{Var}(\varepsilon)$

Expected test MSE = variance of $\hat{f}(X)$ + squared bias of $\hat{f}(X)$ + variance of the irreducible error
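To make the decomposition concrete, here is a small simulation sketch (not from the slides): we repeatedly redraw training sets from a known data-generating process, fit a rigid and a flexible model, and estimate each model's squared bias and variance at a grid of test points. The true function, noise level, and polynomial degrees are arbitrary illustrative choices.

```python
# Small simulation of the bias-variance decomposition (all settings illustrative).
import numpy as np

rng = np.random.default_rng(0)
noise_sd = 0.3
x_test = np.linspace(0, 1, 50)

def true_f(x):                        # the "true" relationship f(X)
    return np.sin(2 * np.pi * x)

def fit_predict(degree, n_train=30):
    """Fit a polynomial of the given degree on one fresh training sample."""
    x = rng.uniform(0, 1, n_train)
    y = true_f(x) + rng.normal(0, noise_sd, n_train)
    return np.polyval(np.polyfit(x, y, degree), x_test)

for degree in (1, 6):                 # a rigid model vs. a flexible model
    preds = np.array([fit_predict(degree) for _ in range(500)])
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    var = preds.var(axis=0).mean()
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {var:.3f}, "
          f"expected test MSE ~ {bias2 + var + noise_sd ** 2:.3f}")
```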

Moving beyond linearity – bias-variance trade-off
How bias and variance trade off depends on the data; that is why there is no holy-grail method that fits every problem (the no-free-lunch theorem)

[Figure: two panels plotting test error, variance, and bias against model flexibility — one case where the complex model performs better, and one where the linear model performs better.]

Moving beyond linearity – a glance

Linear models: less flexibility, easy to interpret, high bias, low variance
Non-linear models: more flexibility, hard to interpret, low bias, high variance

Decision trees
Tree-based methods

Simple yet non-linear supervised learning models
Flexible yet easier to explain compared with some more advanced algorithms
Can be used for both regression and classification

General procedure
Divide the predictor space X1, X2, …, Xp into J distinct and non-overlapping regions R1, R2, …, RJ
Make the same prediction for every observation within region Rj: the mean (regression) or the mode (classification) of the training observations in that region

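As a minimal illustration of this procedure, the sketch below fits a small regression tree with scikit-learn on a made-up Hitters-style table (the column names and salary values are assumptions, not the real data). Every observation that lands in the same leaf receives the same predicted value: the leaf mean.

```python
# A minimal sketch (made-up Hitters-style numbers, not the real data).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    "Years":  [1, 2, 3, 5, 7, 10, 12, 15],
    "Hits":   [50, 80, 120, 90, 140, 160, 100, 170],
    "Salary": [90, 110, 150, 300, 650, 900, 950, 1000],   # in $000s, illustrative
})

tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
tree.fit(df[["Years", "Hits"]], df["Salary"])

# Every observation that falls in the same leaf (region R_j) gets the same
# prediction: the mean Salary of the training observations in that leaf.
new_players = pd.DataFrame({"Years": [4, 8], "Hits": [100, 150]})
print(tree.predict(new_players))
```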

Decision trees – an example
Predict the salary of a baseball player from years in the league and hits from last year
Salary is color-coded, with high salaries in yellow-red and low salaries in purple-blue
Visually, for junior players with only a few years in the league, the number of hits is not that important
For players with more than 5 years in the league, the number of hits starts to explain salary dispersion


Decision trees – what is a tree?

Tree anatomy: the root, internal nodes (splits), branches, and terminal nodes (leaves); each leaf corresponds to a region (box) of the predictor space

Decision trees – what is a tree?
A tree is easy to interpret

Years is the most important variable for salary
In the early years, the number of hits does not matter much for salary
Later in a career, the number of hits becomes important
A tree is easier to explain than a regression

Nonlinearity
The number of hits does not have a linear relationship with salary
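One reason trees are easy to explain is that the fitted splits can be printed as plain rules. The sketch below does this with scikit-learn's export_text on the same toy table as before (again, made-up numbers); with these values the root split happens to fall on Years, echoing the slide.

```python
# Print the fitted splits as plain-text rules (same toy data as above).
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

df = pd.DataFrame({
    "Years":  [1, 2, 3, 5, 7, 10, 12, 15],
    "Hits":   [50, 80, 120, 90, 140, 160, 100, 170],
    "Salary": [90, 110, 150, 300, 650, 900, 950, 1000],
})
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0).fit(
    df[["Years", "Hits"]], df["Salary"])

# With these toy numbers the root split falls on Years:
# early-career salary is low regardless of Hits.
print(export_text(tree, feature_names=["Years", "Hits"]))
```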

Decision trees – regression trees
Objective: find the set of boxes that minimizes RSS (or equivalently MSE/RMSE)

$\min_{R_1, \dots, R_J} \; \sum_{j=1}^{J} \sum_{i \in R_j} \big(y_i - \hat{y}_{R_j}\big)^2$

$\hat{y}_{R_j}$: mean response of the training observations in box Rj

How do we find the partition of the data that achieves the minimum RSS?

With N observations and J = 2 boxes, there are on the order of N possible splits
With J = 3, there are on the order of N² possible splits, and so on
The complexity of the problem grows exponentially

Decision trees – recursive binary splitting
Recursive binary splitting – A top‐down, greedy approach

Top-down: start at the top of the tree and successively add splits
Greedy: at each step, choose the best split for that particular box, without looking ahead to the globally optimal tree

• First split: for each predictor Xj, find the cutpoint s that minimizes RSS, then choose the predictor and cutpoint that lead to the lowest RSS
• Subsequent splits: find the next split (again iterating over predictors Xj and cutpoints s) that minimizes RSS; this split can be made in any of the boxes already created
• Create more splits/boxes until a stopping criterion is met (e.g., no region contains more than 5 observations, or the number of leaves reaches 30)

Decision trees – recursive binary splitting

It's like cutting a cake!

[Figure: the predictor space is cut into boxes by successive binary splits ❶–❹.]
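To make the greedy step concrete, here is a minimal sketch (not code from the course) that scans every predictor Xj and candidate cutpoint s and returns the single split with the lowest RSS; the function name and synthetic data are illustrative assumptions.

```python
# Sketch of the greedy step: scan every predictor and cutpoint, keep the lowest RSS.
import numpy as np

def best_split(X, y):
    """Return (feature index j, cutpoint s, RSS) of the best single split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 2))
y = np.where(X[:, 0] > 5, 10.0, 0.0) + rng.normal(0, 1, 100)   # true split: X_0 at 5
print(best_split(X, y))
# Recursive binary splitting repeats this search inside each box already created.
```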
Decision trees – pruning

A full-grown tree can easily overfit the data (again, the bias-variance trade-off)
Solution: grow a very large tree and prune it back to a subtree
Cost complexity pruning (weakest link pruning):

$\min_{T} \; \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} \big(y_i - \hat{y}_{R_m}\big)^2 + \alpha\,|T|$

$|T|$ – number of terminal nodes of tree T
α – tuning parameter (a hyperparameter, elaborated on later)
The α|T| term is a penalty, similar to the Lasso penalty

Decision trees – tuning

Tuning
• Use k-fold cross validation to select α
• On each fold, apply recursive binary splitting and cost complexity pruning to first grow and then prune the tree
• Evaluate out-of-sample MSE as a function of α
• Pick the α that minimizes the average MSE

Pruning
• Use recursive binary splitting to grow a large tree on all training data
• Use cost complexity pruning with the tuned α to obtain the best subtree

Decision trees – model assessment

[Figure: error versus tree size; choose the number of leaves at which cross-validated error is lowest.]

Decision trees – classification

The decision tree algorithm introduced here is also called CART (classification and regression trees), the most popular decision tree algorithm
For classification
The predicted response is the most commonly occurring class among the training observations in each region (vs. the mean in regression trees)
Error measures (vs. RSS in regression trees)
Classification error rate: $E = 1 - \max_k \hat{p}_{mk}$
Gini index: $G = \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk})$
Entropy: $D = -\sum_{k=1}^{K} \hat{p}_{mk} \ln \hat{p}_{mk}$
$\hat{p}_{mk}$: proportion of observations from class k in region m
If the classification is pure, $\hat{p}_{mk}$ is close to 0 or 1, and E, G, and D are all small

Decision trees vs. linear model

A 2-class classification problem (green or yellow)
If the true boundary is linear (top), a tree will do worse than a linear model
If the true boundary is non-linear (bottom), a tree will do better than a linear model

Decision trees – pros and cons

Pros
Easy: the decision tree (CART) is the easiest tree method
Automatically ranks predictors by predictive power
Cons
Easily overfits: the model may not be robust when the data change
May be biased when the tree is too simple
Overweights the predictive power of variables near the root while underweighting lower branches and other predictors
Weak learner
Improvement
Bagging, random forest, boosting
These are ensemble methods: combine base models (weak learners) to improve overall model performance (a strong learner)

Section 2: Bagging and Boosting Tree

Bagging

BAGGing (Bootstrap AGGregating): a combination of bootstrap and aggregating
Bootstrap
From a population, take n samples Z1, Z2, …, Zn, each with variance σ²
Typically sample with replacement each time
The variance of the sample average is reduced: $\sigma^2_{\bar{Z}} = \sigma^2 / n$

Bagging

Bootstrap
• Repeat sampling from the training set B times to get B different bootstrapped training sets
• Train a decision tree on each bootstrapped training set (for the b-th set, the prediction is $\hat{f}^{*b}(x)$)
Aggregating
• For regression: average the predictions from the B trees, $\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B} \hat{f}^{*b}(x)$
• For classification: take a majority vote or produce a probability

Random forest

An improvement over bagged trees
Bagging: bootstrap the training data, use all predictors
Random forest: bootstrap the training data, randomly select a subset of m predictors as candidates at each split
Typically choose the number of predictors $m = \sqrt{p}$

Random forest – benefits

It is like a collective vote from sub-committees (how democratic!)
Bagged trees can be highly correlated (e.g., the CEO, the most influential person, will always be there)
Random forest decorrelates the trees by forming sub-committees, so the other members can speak up
Very easy to tune: only need to choose the number of predictors considered per split (typically $\sqrt{p}$) and the number of trees

Random forest – benefits

Random forest typically has much lower error than bagged trees

Random forest – variable importance

We can record the reduction of RSS or Gini index from each split of each tree
Collectively, this provides a variable importance measure for each predictor
Variable importance shows the total reduction of error attributable to each predictor
Variable importance is another way to do feature selection
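A quick sketch of bagging versus a random forest in scikit-learn: setting max_features=None lets every split consider all p predictors (plain bagging), while "sqrt" gives the m = √p rule from the slides. The synthetic dataset and the number of trees are illustrative assumptions, not course settings.

```python
# Bagging vs. random forest in scikit-learn (synthetic data, illustrative settings).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=10, n_informative=4,
                       noise=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    # max_features=None: every split may use all p predictors -> plain bagging
    "bagging":       RandomForestRegressor(n_estimators=300, max_features=None, random_state=0),
    # max_features="sqrt": m = sqrt(p) candidate predictors -> random forest
    "random forest": RandomForestRegressor(n_estimators=300, max_features="sqrt", random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test MSE:", mean_squared_error(y_te, model.predict(X_te)))

# Variable importance: total impurity (RSS) reduction per predictor, normalized.
forest = models["random forest"]
print("most important predictors:", np.argsort(forest.feature_importances_)[::-1][:4])
```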
Boosting

Boosting: to increase or improve

Bagging vs. boosting
• Bagging: first overfits each tree (bias ↓, variance ↑), then averages them (variance ↓). Boosting: first fits smaller trees (bias ↑, variance ↓), then learns slowly by adding new small trees fit to the residuals of the existing fit (bias ↓)
• Bagging: each tree grows independently. Boosting: trees grow sequentially, with each tree grown based on the previously grown trees
• Bagging: bootstraps the training data. Boosting: fits on a modified version of the original dataset
• Bagging: easier to tune. Boosting: more hyperparameters to tune

Boosting – boosting vs. bagging

Bagging and random forest focus on reducing variance; boosting focuses on reducing bias

Boosting – AdaBoost (adaptive boosting)

Increase the observation weights of misclassified observations, so that subsequent learners focus on the hard cases

Boosting – gradient boosting

Initialize $\hat{f}(x) = 0$
Initialize the residuals $r_i = y_i$
Fit a tree $\hat{f}^1$ to the residuals
Update $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^1(x)$
Update $r_i \leftarrow r_i - \lambda \hat{f}^1(x_i)$
Repeat: fit the next tree to the updated residuals
A good example: https://towardsdatascience.com/machine-learning-part-18-boosting-algorithms-gradient-boosting-in-python-ef5ae6965be4
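The update rule above can be written out directly. The sketch below is an illustration under stated assumptions, not the course's implementation: it uses depth-1 regression trees as the weak learners and follows the initialize / fit-to-residuals / shrink-and-add loop.

```python
# The gradient-boosting loop, written out with depth-1 trees (illustrative settings).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, noise=10, random_state=0)
lam, B = 0.1, 200                      # shrinkage lambda and number of trees B

f_pred = np.zeros(len(y))              # initialize f(x) = 0
r = y.astype(float).copy()             # initialize residuals r_i = y_i
trees = []
for _ in range(B):
    tree = DecisionTreeRegressor(max_depth=1).fit(X, r)   # fit a small tree to r
    update = lam * tree.predict(X)
    f_pred += update                   # f(x) <- f(x) + lambda * f_b(x)
    r -= update                        # r_i  <- r_i  - lambda * f_b(x_i)
    trees.append(tree)

def boosted_predict(X_new):
    """Prediction of the boosted ensemble: sum of the shrunken trees."""
    return lam * sum(t.predict(X_new) for t in trees)

print("training MSE:", np.mean((y - f_pred) ** 2))
```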
Boosting – XGBoost (extreme gradient boosting)

A high-speed, high-accuracy method/package that has dominated recent research
Supports gradient boosting, stochastic gradient boosting, and regularized gradient boosting
Regularization, which controls overfitting, is the main source of its high accuracy

Boosting – regularization in XGBoost

The XGBoost objective adds a complexity penalty to the loss
Ω(f): complexity of the tree (L2 regularization of the leaf weights + a penalty for the size of the tree)
L(f): loss function (variance of the prediction errors)

Boosting – XGBoost performance

[Figure: XGBoost performance comparison.]

Boosting – hyperparameters

Number of trees B
Unlike bagging, boosting can overfit if B is too large
Use cross validation to select B
Shrinkage parameter λ
A small positive number that controls the rate of learning (λ ↓, learning rate ↓)
Typically 0.01 or 0.001, depending on the problem
A very small λ may require a larger B
Number of splits in each tree, d (number of terminal nodes = d + 1)
Controls the complexity of each tree and the interaction depth of the boosted model
With d = 1, each tree is a single split (a stump), and boosting fits an additive model with no interactions
XGBoost has additional hyperparameters for regularization
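A hedged sketch using the xgboost package (assumed to be installed via `pip install xgboost`); the hyperparameter values are illustrative only. As I understand the API, reg_lambda and reg_alpha correspond to the penalty terms inside Ω(f), learning_rate to the shrinkage λ, and subsample to stochastic gradient boosting.

```python
# Sketch: regularized gradient boosting with the xgboost package (values illustrative).
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = make_regression(n_samples=1000, n_features=20, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=300,     # B: number of trees
    learning_rate=0.05,   # shrinkage lambda
    max_depth=2,          # small trees, limited interaction depth
    subsample=0.8,        # stochastic gradient boosting
    reg_lambda=1.0,       # L2 penalty on leaf weights (part of Omega(f))
    reg_alpha=0.0,        # L1 penalty on leaf weights
    random_state=0,
)
model.fit(X_tr, y_tr)
print("test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```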
Section 3: Support Vector Machine (SVM)

Hyperplane

Hyperplane: a flat affine subspace of dimension p − 1 in a space of dimension p
p = 2: the hyperplane is a line
p = 3: the hyperplane is a plane
Mathematically: $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0$
Points on the hyperplane: $f(X) = 0$
Points above the hyperplane: $f(X) > 0$
Points below the hyperplane: $f(X) < 0$
[Figure: a hyperplane (line) in the (x1, x2) plane.]

Separating hyperplane

Separating hyperplane: a hyperplane that separates the training observations perfectly
For example
Hyperplane A: $f(X) = x_1 + x_2 + c$
Hyperplane B: $f(X) = x_1 + c$
[Figure: two separating hyperplanes A and B in the (x1, x2) plane.]

Maximal margin hyperplane

$d_i$: distance from the i-th sample to the hyperplane
Margin: $\min_i d_i$, the distance of the closest samples to the hyperplane
Maximal margin hyperplane: the linear hyperplane (classifier) with the maximum margin
[Figure: hyperplanes A and B with margins A and B; the maximal margin hyperplane is the one with the larger margin.]

Maximal margin hyperplane – mathematical representation

Hyperplane: $f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = \beta_0 + W^T X$
Distance: $d_i = f(X_i) / \lVert W \rVert$
Margin: $m = \min_i d_i = a / \lVert W \rVert$
[Figure: the hyperplane with margin boundaries $f(X) = a$ and $f(X) = -a$.]

Maximal margin hyperplane – mathematical representation

Goal: maximize the margin while the samples are perfectly separated by the hyperplane
$\max_{W, \beta_0} m \quad \text{s.t.} \quad (\beta_0 + W^T X_i)\, y_i \ge a \;\; \forall i = 1, \dots, n$
Note $y_i = \pm 1$, so the constraint puts every point on the correct side and outside the margin
However, this has an infinite number of solutions; for example:
$f(X) = x_1 + x_2 + c$
$f(X) = 2x_1 + 2x_2 + 2c$
……

Maximal margin hyperplane – mathematical representation

Divide the formulation by a:
$\max_{W, \beta_0} m/a \quad \text{s.t.} \quad (\beta_0 + W^T X_i)\, y_i \ge 1 \;\; \forall i = 1, \dots, n$
Recall that $m = \min_i d_i = a / \lVert W \rVert$
Further transform to
$\max_{W, \beta_0} 1/\lVert W \rVert \quad \text{s.t.} \quad (\beta_0 + W^T X_i)\, y_i \ge 1 \;\; \forall i = 1, \dots, n$
This has only one solution
[Figure: the margin boundaries become $f(X) = 1$ and $f(X) = -1$.]

Support vectors

The maximal margin hyperplane is controlled by the support vectors (the points lying on the margin)
Moving the other points a little does not affect the decision boundary
[Figure: support vectors highlighted on the margin boundaries in the (x1, x2) plane.]

Maximal margin hyperplane – mathematical representation

Optimization goal: $\max_{W, \beta_0} 1/\lVert W \rVert \quad \text{s.t.} \quad (\beta_0 + W^T X_i)\, y_i \ge 1 \;\; \forall i$
Convert to the primal form: $\min_{W, \beta_0} W^T W \quad \text{s.t.} \quad (\beta_0 + W^T X_i)\, y_i \ge 1 \;\; \forall i$
Then convert to the dual form:
$\max_{c_1, \dots, c_n} \; \sum_{i=1}^{n} c_i - \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} c_j c_k y_j y_k \, (X_j \cdot X_k)$
$\text{s.t.} \;\; c_i \ge 0 \;\text{and}\; \sum_{i=1}^{n} c_i y_i = 0 \;\; \forall i$
where $W = \sum_{i=1}^{n} c_i y_i X_i$ and $y_i = \pm 1$

Kernel functions

After transforming to the dual form, the objective depends on the data only through the inner products $X_j \cdot X_k$; this is where the kernel function enters
The general form of a kernel function is $K(X_i, X_j) = \phi(X_i) \cdot \phi(X_j)$
In the linearly separable case above, the kernel is simply the inner product of X, i.e., $\phi(X) = X$; for non-linearly separable cases, we can modify the kernel function to map the data into another feature space

Kernel transformation

For non-linearly separable cases, a kernel can map the data into a higher dimension where a hyperplane gives a better cut
$X = (X_1, X_2, \dots, X_n) \;\rightarrow\; \phi(X) = (\phi(X_1), \phi(X_2), \dots, \phi(X_n))$
[Figure: data that are not separable in the original space become separable after the kernel transformation.]

Kernel transformation

It's like playing Fruit Ninja! The kernel throws the fruit (data) into the air (a higher-dimensional space), and the hyperplane is your blade

Kernel functions

Linear kernel: $X_i \cdot X_j$
Polynomial kernel: $(\gamma\, X_i \cdot X_j + r)^d$
Gaussian (radial basis function, RBF) kernel: $\exp(-\gamma \lVert X_i - X_j \rVert^2)$
Sigmoid kernel: $\tanh(\gamma\, X_i \cdot X_j + r)$
Caution: flexible kernels can overfit!

Data still not separable?

The previous objective function (primal form): $\min_{W, \beta_0} W^T W \;\; \text{s.t.} \;\; (\beta_0 + W^T X_i)\, y_i \ge 1 \;\; \forall i$
Now $(\beta_0 + W^T X_i)\, y_i \ge 1$ cannot be satisfied for every observation
We still prefer it to be as close to 1 as possible
[Figure: two overlapping classes that no hyperplane can separate perfectly.]

Soft margin

Soft margin: allow some observations to be on the incorrect side of the margin or the hyperplane
Introduce a penalty term $\varepsilon_i$ and change the objective function to
$\min_{W, \beta_0} \; \lambda W^T W + \sum_{i=1}^{n} \varepsilon_i \quad \text{s.t.} \quad (\beta_0 + W^T X_i)\, y_i \ge 1 - \varepsilon_i, \;\; \varepsilon_i \ge 0$
λ is a hyperparameter related to the total budget for violations of the margin (λ ↑, budget ↑)

Loss function with soft margin

The minimal value of $\varepsilon_i$ that satisfies the soft-margin constraint is $1 - (\beta_0 + W^T X_i)\, y_i$ (or 0 if the constraint already holds)
Plugging it into the previous objective function gives
$\text{loss} = \sum_{i=1}^{n} \max\{0,\, 1 - (\beta_0 + W^T X_i)\, y_i\} + \lambda W^T W$
The first term is the minimal $\varepsilon_i$; the second term is the penalty
If λ is small, fewer violations of the margin will occur, leading to high variance and low bias

SVM vs. logistic regression

SVM: hinge loss
$L = \sum_{i=1}^{n} \max\{0,\, 1 - (\beta_0 + W^T X_i)\, y_i\}$
Logistic regression: log loss [$-\ln(\text{Likelihood})$]
$L = \sum_{i=1}^{n} \log\big(1 + \exp(-(\beta_0 + W^T X_i)\, y_i)\big)$

SVM vs. logistic regression

SVM
If $(\beta_0 + W^T X_i)\, y_i \ge 1$, the hinge loss is 0 (observations that are correctly classified and outside the margin do not contribute to the hinge loss)
Otherwise, the hinge loss is $1 - (\beta_0 + W^T X_i)\, y_i$
Logistic regression
The log loss is always > 0 (observations that are correctly classified still add to the log loss, but their contribution is small)
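A sketch of a soft-margin SVM with the kernels listed earlier, using scikit-learn's SVC. Note that sklearn parameterizes the violation penalty through C rather than the λ used on the slides (in sklearn, a larger C penalizes violations more heavily); the dataset and the C, γ, and degree values are illustrative assumptions.

```python
# Soft-margin SVM with different kernels (illustrative data and hyperparameters).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)   # non-linear boundary
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# In sklearn's parameterization, larger C -> fewer margin violations
# -> lower bias, higher variance (the opposite direction of the slides' lambda).
for kernel, kwargs in [("linear", {}), ("poly", {"degree": 3}), ("rbf", {"gamma": 1.0})]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, **kwargs))
    clf.fit(X_tr, y_tr)
    print(kernel, "test accuracy:", clf.score(X_te, y_te))
```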


SVM for multilevel classification and regression
SVM is mainly used for binary classification, as in the previous slides

SVM can handle classification with more than two classes
One-versus-one: break the problem down into pairwise binary classifications
One-versus-all: focus on one class (K) and combine all other observations into a non-K class

SVM can also be adapted for regression, called support vector regression (SVR)

SVR uses the concepts of margin and kernel from SVM


SVR

[Figure: support vector regression fits a margin (tube) around the data; a kernel transformation allows non-linear fits.]
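A minimal SVR sketch in scikit-learn, which uses the ε-insensitive "tube" version of the margin; the kernel choice and the C and epsilon values are illustrative, not recommendations.

```python
# Support vector regression: an epsilon-tube margin plus a kernel (values illustrative).
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, 200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # points inside the tube carry no loss
svr.fit(X, y)
print("support vectors used:", len(svr.support_), "of", len(y))
```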

Hyperparameter tuning
Hyperparameters are the tuning parameters of an algorithm; choose the right hyperparameter values to optimize model performance

Ridge
λ: penalty for L2 regularization (sum of squared coefficients)

LASSO
λ: penalty for L1 regularization (sum of absolute coefficients)

Decision tree
α: penalty for pruning

Random forest
m: number of features considered at each split
Number of trees

Hyperparameter tuning
Gradient boosting

B: number of trees

λ: shrinkage parameter (learning rate)

d: number of splits in each tree (number of terminal nodes = d + 1)

SVM
λ: penalty related to the total budget for violations of the margin

γ, d, r in the kernel function

Rule of thumb: don't be creative; stick to commonly accepted ranges of hyperparameters

Hyperparameter tuning – grid search
Set up a grid over the value space of the hyperparameters

Use cross-validation to evaluate each point on the grid

Recall α in the pruning of decision trees

For exponentially ranged hyperparameters, set the grid on a log scale (e.g., [10, 100, 1000])
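For instance, a ridge-regression grid search with a log-spaced grid might look like the sketch below (the model, grid values, and scoring choice are illustrative assumptions).

```python
# Grid search over a log-spaced ridge penalty with 5-fold CV (illustrative choices).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=20, noise=10, random_state=0)

param_grid = {"alpha": np.logspace(-2, 3, 6)}        # 0.01, 0.1, 1, 10, 100, 1000
search = GridSearchCV(Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("best alpha:", search.best_params_, "CV MSE:", -search.best_score_)
```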


Hyperparameter tuning – k-fold cross validation
Traditionally, data are split into three sets: training, validation, and test; the training and validation sets are used to tune hyperparameters

A better way: keep a training set and a test set, perform k-fold cross-validation for hyperparameter tuning on the training set, and then evaluate the tuned model on the test set
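A sketch of that workflow in scikit-learn: hold out a test set, tune with k-fold cross-validation on the training set only, then score the tuned model once on the test set. The estimator and parameter grid are illustrative assumptions.

```python
# Hold out a test set, tune with k-fold CV on the training set, evaluate once on test.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"max_features": ["sqrt", None], "n_estimators": [100, 300]},
    cv=5,                               # k-fold cross-validation on the training set only
)
search.fit(X_tr, y_tr)
print("best CV accuracy:", search.best_score_, search.best_params_)
print("test accuracy:", search.score(X_te, y_te))    # the test set is touched only once
```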


Next step
Homework 1: see the detailed instructions

Lectures 3 and 4: Prof. Yujie He will teach Lectures 3 and 4 on classification, unsupervised learning, and labs

Lecture 3 will return to our normal schedule on Saturday morning, July 10


https://fred.stlouisfed.org/
