MFIN 290 Application of Machine Learning in
Finance: Lecture 3
Yujie He
7/10/2021
Background of lecturer
Tech Lead/Senior Applied Scientist at Microsoft
Multiple patents and conference papers in knowledge graph and natural language processing (NLP)
Expertise in deep learning, NLP and building end-to-end AI systems
Work with applied machine learning in various scenarios
Search relevance, semantic search
Email search features in Outlook, people search, Smart Look-Up in Word
Text/knowledge mining
Enterprise Knowledge Graph and Microsoft Viva Topics
Multi-lingual applications
Few-shot and meta-learning-based applications
Previous experience
Worked at KPMG as a Senior Data Scientist (Lead DS in Advanced Analytics at Capital Group)
MS, Computer Science, Georgia Institute of Technology
PhD, Science, Purdue University
BS, Engineering, Fudan University
2
Agenda
Python and Colab
Recap of last lecture
Classification (Supervised approach)
Introduction
K Nearest Neighbor
Logistic Regression
Multi-class classification
Evaluations
Gradient Descent
Python Example of Portfolio Optimization
3
Basic Python
Rate your comfort level on a scale of 0-5
http://cs231n.github.io/python-numpy-tutorial/
Learn by doing; Bias for action
4
Colab Pros and Cons
Google's free, Jupyter-notebook-like, cloud-based Python runtime
Pros
Prebuilt with many commonly used Python libraries
No infra setup required
Free GPU usage*
Up to 12 hours of continuous running time (browser must stay open)
Notebooks are saved to Google Drive
Cons
Sessions are non-persistent and preemptible (custom packages must be reinstalled each session, etc.)
Storage is session-based; files must be downloaded locally if needed later
"Idle" notebook instances are recycled after 90 minutes
Maximum of 2 notebooks running concurrently
5
Goal
Build intuition for several commonly used models
Understand key math components of models
Know industry use cases of different models
Know when to use which model
Know the pros and cons of different models
Know how to improve when modeling result is unsatisfactory
Excellent resource:
https://scikit-learn.org/stable/tutorial/index.html
6
Recap
7
What is Machine Learning (supervised)
Algorithms that can improve their own performance using training data
Typically the algorithm has a (large) number of parameters whose values are learned from data
Models are "learned" from data and "fit" to the data by adjusting the decision boundary
The optimization is achieved through carefully designed loss functions
Can be applied in situations where it is challenging (or impossible) to define rules by hand
Face detection
Speech recognition
Stock price prediction
8
Last Lecture
Basic Decision Tree
Classification: feature split that minimizes Gini impurity (maximizes information gain)
Regression: feature split that minimizes MSE/RMSE; pruning via Lagrangian duality to limit the number of leaves
Bagging (bootstrap aggregating) and Boosting
Parallel vs. sequential methods
Random forest vs. boosted trees (Adaboost, GBDT, XGBoost)
Stacking: heterogeneous weak learners; multi-level stacking
SVM
Kernel trick: represents the data X through a pairwise similarity matrix
Support vectors!
9
Machine Learning Workflow
10
Step 1: Data preparation
Step 2: Feature selection
Step 3: Algorithm selection
Step 4: Model evaluation
Step 5: Model application
Machine Learning Use Cases in Finance
Process automation
Chatbots
Call-center automation
Paperwork automation
Security
Fraud detection
Finance and risk modeling
Underwriting and credit scoring
Algorithmic trading
11
Data Leakage
Leaky predictors
Predictors include data that will not be available at prediction time
Predict who got sick ~ f(age, weight, took_antibiotics_medicine)
Predict whether credit card application will be accepted ~ f(income, expenditure, share, number_of_active_accounts)
Happens more frequently than you think because data are messy
Leaky validation strategies
Not carefully distinguishing the training vs. validation set (see the sketch below)
Normalization
Imputation
Time series
How to prevent
Screen possible leaky predictors by examining correlations with the target
Be suspicious of results that look too good to be true
Model interpretation
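A minimal sketch of the prevention idea above, assuming scikit-learn and a hypothetical toy dataset: imputation and normalization are kept inside a Pipeline, so their statistics are fit only on the training folds during cross-validation rather than on all the data.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)  # hypothetical target
X[rng.random(X.shape) < 0.1] = np.nan                       # inject missing values

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fit on training folds only
    ("scale", StandardScaler()),                 # fit on training folds only
    ("clf", LogisticRegression()),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```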
12
Classification
13
Three Canonical Learning Problems
Regression – supervised
Estimate continuous variables, e.g. house sqft vs. house price
Classification – supervised
Estimate discrete variables or classes, e.g. hand-written digit recognition
14
Unsupervised Learning – model the data
Clustering
Dimension reduction
15
Example 1: Hand-written digit recognition
Represent each input image as a vector x ∈ ℝ^784
Learn a classifier f(x) such that
f: x → {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
16
Example 1: Hand-written digit recognition
17
Example 2: Facial Recognition
Again, a supervised classification problem
Need to classify an image window into three classes
Non-face
Frontal-face
Profile-face
18
Classifier is learnt from labeled data
Training data for frontal faces
5000 faces
All near frontal
Age, race, gender, lighting
10^8 non-faces
Faces are normalized
Scaled and translated
Data distribution
The model learns from both positive and negative samples (just like humans!)
Contrastive learning
19
Example 3: Spam Detection
Task is to classify message/email to spam/non-spam
X can be a vector of content-related features, e.g. title, sender, length
Requires a learning system, as spammers keep innovating
20
Example 4: Stock Price Prediction
Task is to predict stock price in the future
This is a regression task as output is continuous
However, can also be formulated as a classification problem
Binning the price change
Can outperform regression under some scenarios
21
How to get the labels?
Human Relevance System
Amazon Mechanical Turk
Internal systems
Things to watch out for:
Label quality
Gold hits, spam detection
Works well only with simple data
Categorize images, rank articles
The model is straightforward; data is the key differentiator that moves the needle
22
Classification
23
Why not use regression for classification problems?
Class encoding implies ordering of outcomes AND the distance between classes
Maybe ok for binary classification
Regression predictions outside of [0, 1] can be hard to interpret
Regression representation does not make sense for classification problems
24
Common classification models
K-Nearest Neighbor
Naïve Bayes (will cover in NLP course on spam detection)
Logistic Regression (very popular in industry)
Tree-based methods
Decision tree
Many variants (e.g. CHAID tree)
Random forest
Gradient Boosted Decision Tree
Many variants (e.g. ranking, ads recommendation)
XGBoost (Kaggle favorite, popular in industry)
25
K Nearest Neighbor (KNN) Classifier
26
27
Impact of K
28
29
30
31
32
KNN Properties and Training
As K increases
Classification boundary becomes smoother
Training error can increase
Choose K by cross-validation
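A minimal sketch, assuming scikit-learn and the Iris toy dataset, of choosing K by cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 11, 15, 21]},  # candidate K values
    cv=5,                                                   # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # K with the best CV accuracy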
33
Key Questions
How to define “nearest”?
Distance metrics: Euclidean, Mahalanobis, Manhattan, Cosine…
34
Characteristics of Different Distance Metrics
Euclidean:
most often used
Feature magnitude matters
Sensitive to outliers
Mahalanobis:
Gaussian assumptions
E.g. Gaussian mixture models
Manhattan:
Less emphasis on outliers
Preferable for the case of high dimensional data
Cosine:
Feature magnitude does not matter, but direction matters
E.g. product preference, word-embedding
Use the dot product when magnitude matters (see the sketch below)
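The sketch below (illustration only, SciPy assumed) computes the metrics above on toy vectors; for Mahalanobis, the inverse covariance is estimated from a small random sample.

```python
import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 4.0])

print(distance.euclidean(x, y))   # magnitude matters, sensitive to outliers
print(distance.cityblock(x, y))   # Manhattan: less emphasis on outliers
print(distance.cosine(x, y))      # 1 - cosine similarity: direction only

# Mahalanobis needs the inverse covariance (Gaussian assumption); estimated
# here from a toy sample of the data distribution.
data = np.random.default_rng(0).normal(size=(100, 3))
VI = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(x, y, VI))
```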
35
36
37
38
Approximate Nearest Neighbor
Return neighbors whose distances are ≤ C × (nearest-neighbor distance)
In many cases, approximate NN is almost as good as the exact one
N is at most the size of the training set; often N should be chosen comparable to K (e.g. 10×K, 100×K, etc.)
Used in practice through data structures/techniques like trees or LSH (locality-sensitive hashing); a minimal sketch follows
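A minimal sketch (names are illustrative, not a library API) of random-hyperplane LSH for approximate nearest neighbors under cosine similarity; production systems typically use several hash tables and tuned bit widths.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 64))      # database vectors
planes = rng.normal(size=(8, 64))      # 8 random hyperplanes -> 8-bit hash

def lsh_hash(v):
    # Sign pattern of projections onto the random hyperplanes.
    return tuple((planes @ v > 0).astype(int))

buckets = defaultdict(list)
for i, v in enumerate(X):
    buckets[lsh_hash(v)].append(i)

def approx_nearest(q, k=5):
    # Only score candidates that fall in the query's hash bucket.
    cands = buckets.get(lsh_hash(q), [])
    sims = [(i, q @ X[i] / (np.linalg.norm(q) * np.linalg.norm(X[i]))) for i in cands]
    return sorted(sims, key=lambda t: -t[1])[:k]

print(approx_nearest(rng.normal(size=64)))
```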
39
40
K-NN Industry Use Cases
K-NN search (via ANN)
“People also search/buy”
Semantically similar documents/web pages relevant to a query; query expansion
Recommender systems
Exploration (ε-greedy) and exploitation (K-NN)
K-NN Classification
In cases where it is impractical to train a classifier for each sample, e.g. identifying suspects against a watchlist
In general, often used as an intermediate step rather than an end-to-end solution
L1 ranker vs. L2 ranker in search relevance
Query expansion
41
Logistic Regression
It is called regression…but is actually used for classification
Think of it as a projection of the linear model output y ∈ (−∞, +∞) into [0, 1]
Closely related to neural networks
42
Logistic Function and Interpretations
Weights do not influence the probability linearly; rather, a one-unit increase in a feature multiplies the odds by exp(β) (see the small example below)
Categorical variable (one-hot encoding) with L levels: use L−1 columns to represent the feature; the reference category is the L-th
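A small worked example (with hypothetical coefficient values) of the odds interpretation: raising x by one unit multiplies the odds p/(1−p) by exp(β).

```python
import numpy as np

beta0, beta1 = -1.0, 0.7                  # hypothetical intercept and weight

def prob(x):
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))   # logistic function

def odds(p):
    return p / (1.0 - p)

p_before, p_after = prob(2.0), prob(3.0)  # x increased by one unit
print(odds(p_after) / odds(p_before))     # equals exp(beta1) ≈ 2.01
print(np.exp(beta1))
```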
43
Cross Entropy Loss
Maximizing the likelihood (MLE) is equivalent to minimizing the cross-entropy (H) in classification problems
Trick to memorize p/q: the observation p can be 0, whereas q inside log q cannot be 0
Gradient descent is used to minimize the loss function J(w); a minimal sketch follows
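A minimal NumPy sketch (toy data, no scikit-learn) of the cross-entropy loss J(w) for logistic regression and the gradient-descent updates that minimize it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)  # hypothetical labels p

w = np.zeros(3)      # weights to learn
lr = 0.1             # learning rate (alpha)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    q = sigmoid(X @ w)                                   # predicted probabilities q
    # Cross entropy H(p, q): observed p can be 0 or 1; predicted q stays in (0, 1).
    loss = -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))
    grad = X.T @ (q - y) / len(y)                        # gradient of J(w)
    w -= lr * grad                                       # gradient-descent step

print(loss, w)
```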
44
Logistic Regression Industry Use Cases
Very commonly used
Spam detection, credit card fraud, underwriting approval, marketing (ads recommendation) etc.
Pros:
Gives you probabilities (vs. SVM)
Interpretability
Scalability
Can add regularization (L1, L2 etc.)
Cons:
Model can be too simple and lead to under-fitting
Interactions need to be added manually (linear nature)
Linear assumption
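A minimal sketch, assuming scikit-learn and its breast-cancer toy dataset, of an L2-regularized logistic regression that outputs probabilities; exponentiating the standardized coefficients gives odds ratios for interpretation.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),  # L2 regularization
)
clf.fit(X_tr, y_tr)

print(clf.score(X_te, y_te))                 # accuracy
print(clf.predict_proba(X_te[:3]))           # probabilities, not just labels
print(np.exp(clf[-1].coef_.ravel())[:5])     # odds ratios per standardized unit
```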
45
Other classification methods
LDA
Generative model, vs. discriminative (e.g. logistic regression)
Assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes.
High bias
QDA
More complexity, higher variance
46
[Figures: linear set vs. non-linear set]
47
Multi-class Classification
Decompose trivially into a set of unlinked binary problems
One-vs-All (OVA)
All samples in class i are positive; samples in all other classes are negative
Choose a properly tuned, regularized classifier as your underlying binary classifier
Some models come with built-in multi-class capability
Tree-based models (majority vote at leaf nodes)
Neural networks (output layer has number-of-classes nodes + softmax)
48
Multi-class Classification
All-vs-All (AVA, also called all-pairs, or one-vs-one)
Build N(N−1) classifiers, one for each pair of classes
f_ij is the classifier where class i examples are positive and class j examples are negative. Note f_ij = 1 − f_ji
49
Multi-class Classification
OVA is more commonly used, in cases where the dataset size and N are moderate
AVA can be faster and more memory-efficient when the dataset is large, because usually N is not very large (dozens)
O(N²) classifiers instead of O(N), but each classifier is much smaller and built on a smaller dataset
If the time it takes to build a classifier is superlinear in the number of data points (e.g. SVM), then AVA is a good choice (see the sketch below)
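A minimal sketch, assuming scikit-learn and the Iris toy dataset, of wrapping the same binary classifier in one-vs-rest (OVA) and one-vs-one (AVA) schemes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)
base = LogisticRegression(max_iter=1000)    # underlying binary classifier

ova = OneVsRestClassifier(base)             # N classifiers, one per class
ava = OneVsOneClassifier(base)              # one classifier per pair of classes

print(cross_val_score(ova, X, y, cv=5).mean())
print(cross_val_score(ava, X, y, cv=5).mean())
```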
50
Binary or Multiclass Classification?
Detect noise in a named entity knowledge-mining scenario
Multiclass (8 types + noise) vs. binary (noise or not)
Multiclass provides a 7%+ lift in accuracy
In practice, multi-task joint training / multiple objectives may help the model learn
51
Evaluation of Classification Model
Confusion matrix
Generalize to multi-class
Overall accuracy
Per-class F1 metric and aggregate
Class-weighted aggregation
52
                 Predicted True    Predicted False
Actual True           TP                 FN
Actual False          FP                 TN
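A minimal sketch, assuming scikit-learn and its digits toy dataset, of the multi-class confusion matrix, per-class F1, and class-weighted aggregation described above.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

pred = LogisticRegression(max_iter=2000).fit(X_tr, y_tr).predict(X_te)

print(confusion_matrix(y_te, pred))              # generalizes the 2x2 matrix above
print(classification_report(y_te, pred))         # per-class precision/recall/F1
print(f1_score(y_te, pred, average="weighted"))  # class-weighted aggregation
```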
Unbalanced data in classification
The model is straightforward; getting to know your data is key
Distribution of samples for different classes is highly skewed
Very common (churn, fraud)
Metric bias
Accuracy trap
Model training
Class weight (preferable)
Up-sampling (SMOTE, imbalanced-learn)
Down-sampling
Meta-learning to reweight examples (e.g. Ren et al.)
Cross-validate and test on the original data, but you can verify on re-sampled sets (see the sketch below)
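A minimal sketch (synthetic imbalanced data) of two of the training-time options above: class weights versus SMOTE up-sampling from imbalanced-learn, both evaluated on the untouched original test split.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Option 1 (often preferable): re-weight the loss by inverse class frequency.
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Option 2: synthesize minority samples, on the training split only.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
print(Counter(y_tr), Counter(y_res))
smoted = LogisticRegression().fit(X_res, y_res)

for name, model in [("class_weight", weighted), ("SMOTE", smoted)]:
    print(name, f1_score(y_te, model.predict(X_te)))  # evaluated on original data
```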
54
https://github.com/scikit-learn-contrib/imbalanced-learn
https://github.com/uber-research/learning-to-reweight-examples
Unbalanced data in classification
Evaluation
ROC curves are appropriate when the observations are balanced across classes
Precision-recall curves are appropriate for imbalanced datasets
In both cases the area under the curve (AUC) can be used as a summary of model performance
Associate costs to mis-classifications
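A minimal sketch, assuming scikit-learn and synthetic imbalanced data, computing both AUC summaries.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("ROC AUC:", roc_auc_score(y_te, scores))
precision, recall, _ = precision_recall_curve(y_te, scores)
print("PR AUC:", auc(recall, precision))  # more informative on imbalanced data
```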
55
[Figures: ROC curve (TPR vs. FPR), AUC 0.92; precision-recall curve, AUC 0.58]
How does an ML model learn?
Definition of loss function or cost function
Minimize the loss so that the model's predicted value is closer to the ground-truth value
This now turns into an optimization problem!
Tune model parameters/weights in a way that minimizes the loss
How?
Gradient descent comes to the rescue
56
Gradient Descent
57
• An optimization algorithm to find the minimum of a loss function J
• m: the parameter to optimize; α: the learning rate
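A minimal sketch of the standard gradient-descent update m := m − α·dJ/dm, applied to a simple quadratic loss J(m) = (m − 3)² to illustrate the roles of m and α above.

```python
alpha = 0.1     # learning rate
m = 0.0         # initial parameter value

for step in range(50):
    grad = 2 * (m - 3)     # dJ/dm for J(m) = (m - 3)^2
    m -= alpha * grad      # gradient-descent update

print(m)                   # approaches the minimizer m = 3
```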
Intuitively…
58
Intuitively…
What if 𝛼 too large?
What if 𝛼 too small?
What if function is not convex?
59
Gradient Descent Variants
62
Stochastic gradient descent
One sample per gradient calculation
Mini-batch version is better
Momentum (momentum-based)
Adagrad (scales gradient/step size)
More stable
Accelerated SGD
Adadelta (scales gradient/step size)
RMSprop (scales gradient/step size)
NAG (Nesterov accelerated gradient; momentum-based)
Adam (adaptive moment estimation)
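A minimal NumPy sketch contrasting two of the update rules listed above, SGD with momentum and Adam, on a toy quadratic loss; the hyperparameter values are common defaults, not prescriptions.

```python
import numpy as np

def grad(w):
    return 2 * (w - 3.0)                  # gradient of a toy quadratic loss

w_mom, v = np.zeros(2), np.zeros(2)       # parameters and velocity for momentum
w_adam, m1, m2 = np.zeros(2), np.zeros(2), np.zeros(2)   # Adam state
lr, mu, b1, b2, eps = 0.1, 0.9, 0.9, 0.999, 1e-8

for t in range(1, 201):
    # SGD with momentum: build up velocity in the descent direction.
    v = mu * v - lr * grad(w_mom)
    w_mom += v

    # Adam: bias-corrected first/second moment estimates, per-parameter step size.
    g = grad(w_adam)
    m1 = b1 * m1 + (1 - b1) * g
    m2 = b2 * m2 + (1 - b2) * g ** 2
    m1_hat, m2_hat = m1 / (1 - b1 ** t), m2 / (1 - b2 ** t)
    w_adam -= lr * m1_hat / (np.sqrt(m2_hat) + eps)

print(w_mom, w_adam)                      # both approach the minimizer [3, 3]
```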
63
Adam
Standard for neural network models
Robust and effective
Large memory footprint
64
Saddle point performance
65
Algorithms without gradient-based scaling struggle to break symmetry
Momentum oscillates and builds up velocity in the optimal direction
Most of the time SGD can converge, eventually…
Generalization (Learning Theory)
The real aim of supervised learning is to do well on test data that is not known during training
Choosing the values for parameters that minimize the loss function on the training data is not necessarily the best policy
We want the learning machine to model the true regularities in the data and to ignore the noise in the data
66
67
Can the model fit your training set well?
Can the model fit your validation set well?
68
Under-fitting
69
• Rarely happens
• Fundamental assumption issue (linear vs. polynomial)
• Data carries little signal
Over-fitting
70
• Happens more frequently than you think
• Data distribution drift
Best Practice
1. Start with the simplest model that can solve the problem (no free lunch theorem)
2. Define metrics. Is the performance good enough? If so, done. If not, this simple model becomes your baseline/benchmark model
3. Build a slightly more complicated model, maybe a tree-based model
4. Collect the same metrics. Is it much better? If yes, done. If slightly better, repeat (3); if barely better at all, rethink your feature engineering and examine your data
71
Python Hands-on Session: Portfolio Optimization
72
Next Step
Lecture 4: Unsupervised learning, neural networks, Python example continued
73
74