MFIN 290 Application of Machine Learning in
Finance: Lecture 3

Yujie He

7/10/2021

Background of lecturer

Tech Lead/Senior Applied Scientist in Microsoft

Multiple patents and conference papers in knowledge graph and natural language processing (NLP)

Expertise in deep learning, NLP and building end-to-end AI systems

Work with applied machine learning in various scenarios (NLP)

Search relevance, semantic search

Email search features in Outlook, people search, Smart Look-Up in Word

Text/knowledge mining

Enterprise Knowledge Graph and Microsoft Viva Topics

Multi-lingual applications

Few-shot and meta-learning-based applications

Previous experience

Worked at KPMG as a Senior Data Scientist (Lead DS in Advanced Analytics at Capital Group)

MS, Computer Science, Georgia Institute of Technology

PhD, Science, Purdue University

BS, Engineering, Fudan University

2

Agenda

Python and Colab

Recap of last lecture

Classification (Supervised approach)

Introduction

K Nearest Neighbor

Logistic Regression

Multi-class classification

Evaluations

Gradient Descent

Python Example of Portfolio Optimization

3

Basic Python

Your comfort level on a scale of 0-5

http://cs231n.github.io/python-numpy-tutorial/

Learn by doing; Bias for action

4


Colab Pros and Cons

Google’s free, Jupyter-Notebook-like, cloud-based Python runtime

Pros

Prebuilt with many commonly used Python libraries

No infra setup required

Free GPU usage*

Up to 12h of continuous running time (with the browser open)

Notebooks saved to Google Drive

Cons

Sessions are non-persistent and preemptible (custom packages must be reinstalled per session, etc.; see the setup sketch below)

Storage is session-based; files must be downloaded locally if needed later

“Idle” notebook instances are recycled after 90 minutes

Maximum of 2 notebooks running concurrently
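
Because sessions are wiped, a typical Colab notebook starts with a setup cell like the one below; a minimal sketch, where the package name and the Drive mount are just examples:

```python
# Typical first cell of a Colab notebook: non-preinstalled packages must be
# reinstalled every session because the runtime is non-persistent.
!pip install --quiet imbalanced-learn   # example package, not preinstalled

# Mount Google Drive so files outlive the session's temporary storage.
from google.colab import drive
drive.mount('/content/drive')
```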

5

Goal

Build intuition for multiple commonly used models

Understand the key math components of models

Know industry use cases of different models

Know when to use which model

Know the pros and cons of different models

Know how to improve when modeling results are unsatisfactory

Excellent resource:

https://scikit-learn.org/stable/tutorial/index.html

6


Recap

7

What is Machine Learning (supervised)

Algorithms that can improve their own performance using training data

Typically the algorithm has a (large) number of parameters whose values are learned from the data

Models are “learned” from data and “fit” to the data by adjusting the decision boundary

The optimization process is driven by carefully designed loss functions

Can be applied in situations where it is challenging (or impossible) to define rules by hand

Face detection

Speech recognition

Stock price prediction

8

Last Lecture

Basic Decision Tree

Classification: feature split that minimizes Gini impurity (maximizes information gain)

Regression: feature split that minimizes MSE/RMSE; pruning via Lagrangian duality to limit the number of leaves

Bagging (bootstrap aggregating) and Boosting

Parallel vs. sequential methods

Random forest vs. boosted trees (Adaboost, GBDT, XGBoost)

Stacking: heterogeneous weak learners; multi-level stacking

SVM

Kernel trick: represents the data X through a pairwise similarity matrix

Support vectors!

9

Machine Learning Workflow

10

Step 1: Data preparation

Step 2: Feature selection

Step 3: Algorithm selection

Step 4: Model evaluation

Step 5: Model application

Machine Learning Use Cases in Finance

Process automation

Chatbots

Call-center automation

Paperwork automation

Security

Fraud detection

Finance and risk modeling

Underwriting and credit scoring

Algorithmic trading

11

Data Leakage

Leaky predictors

Predictors include data that will not be available at prediction time

Predict who got sick ~ f(age, weight, took_antibiotics_medicine): taking antibiotics happens after getting sick, so this predictor leaks the label

Predict whether a credit card application will be accepted ~ f(income, expenditure, share, number_of_active_accounts): expenditure on the card is only observed after approval

Happens more frequently than you think because data are messy

Leaky validation strategies

Failing to carefully separate the training and validation sets

Normalization (statistics computed over the full dataset, including validation)

Imputation (same issue)

Time series (future information leaking into training)

How to prevent

Screen possible leaky predictors by examining correlations with target

Be suspicious of results that look too good to be true

Model interpretation
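
One concrete way to avoid the normalization/imputation leakage above is to keep all preprocessing inside a scikit-learn Pipeline, so it is re-fit on each training fold only; a minimal sketch on synthetic data (all names illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan          # inject some missing values
y = rng.integers(0, 2, size=200)

# Imputer and scaler live inside the pipeline, so their statistics are
# computed on the training folds only -- never on the validation fold.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```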

12

Classification

13

Three Canonical Learning Problems

Regression – supervised

Estimate continuous variables, e.g. house sqft vs. house price

Classification – supervised

Estimate discrete variables or classes, e.g. hand-written digit recognition

14

Unsupervised Learning – model the data

Clustering

Dimension reduction

15

Example 1: Hand-written digit recognition

Represent each input image (28 × 28 pixels) as a vector x ∈ ℝ^784

Learn a classifier f(x) such that

𝑓: 𝑥 → {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}

16

Example 1: Hand-written digit recognition

17

Example 2: Facial Recognition

Again, a supervised classification problem

Need to classify an image window into three classes

Non-face

Frontal-face

Profile-face

18

Classifier is learnt from labeled data

Training data for frontal faces

5000 faces

All near frontal

Age, race, gender, lighting

10^8 non-faces

Faces are normalized

Scaled, color-corrected, translated

Data distribution

Model learns from both positive and negative samples (just like humans!)

Contrastive learning

19

Example 3: Spam Detection

Task is to classify message/email to spam/non-spam

X can be a vector of content-related features, e.g. title, sender, length

Requires a learning system, as spammers keep innovating

20

Example 4: Stock Price Prediction

Task is to predict a stock's price in the future

This is a regression task, as the output is continuous

However, it can also be formulated as a classification problem

Binning the price change (see the sketch below)

Can outperform regression in some scenarios
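
A minimal sketch of the binning idea with pandas; the returns are random and the ±0.5% thresholds are purely illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical daily returns; in practice these come from price data.
returns = pd.Series(np.random.default_rng(1).normal(0, 0.01, 500))

# Turn the continuous regression target into three classes.
labels = pd.cut(returns,
                bins=[-np.inf, -0.005, 0.005, np.inf],
                labels=["down", "flat", "up"])
print(labels.value_counts())
```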

21

How to get the labels?

Human Relevance System

Amazon Mechanical Turk

Internal systems

Things to watch out for:

Label quality

Gold hits, spam detection

Works well only with simple, well-defined tasks

Categorizing images, ranking articles

The model is straightforward; data is the key differentiator that moves the needle

22

Classification

23

Why not use regression for classification problems?

Class encoding implies ordering of outcomes AND the distance between classes

Maybe ok for binary classification

Regression predictions outside of [0, 1] can be hard to interpret

Regression representation does not make sense for classification problems

24

Common classification models

K-Nearest Neighbor

Naïve Bayes (will be covered in the NLP course on spam detection)

Logistic Regression (very popular in industry)

Tree-based methods

Decision tree

Many variants (e.g. CHAID)

Random forest

Gradient Boosted Decision Tree

Many variants (e.g. for ranking, ads recommendation)

XGBoost (Kaggle favorite, popular in industry)

25

K Nearest Neighbor (KNN) Classifier

26


Impact of K

28

[Figures: K-NN decision boundaries for increasing values of K]

KNN Properties and Training

As K increases

Classification boundary becomes smoother

Training error can increase

Choose K by cross-validation (see the sketch below)
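
A minimal sketch of choosing K by cross-validation with scikit-learn; the Iris dataset is just a stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Larger K -> smoother boundary (higher bias); smaller K -> closer fit to
# the training data. Cross-validation picks the trade-off empirically.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": [1, 3, 5, 11, 21]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```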

33

Key Questions

How to define “nearest”?

Distance metrics: Euclidean, Mahalanobis, Manhattan, Cosine…

34

Characteristics of Different Distance Metrics

Euclidean:

Most often used

Feature magnitude matters

Sensitive to outliers

Mahalanobis:

Gaussian assumptions

E.g. Gaussian mixture models

Manhattan:

Less emphasis on outliers

Preferable for high-dimensional data

Cosine:

Feature magnitude does not matter, but direction matters

E.g. product preference, word-embedding

Use the dot product when magnitude matters

35


Approximate Nearest Neighbor

Return neighbors whose distances are ≤ C × (nearest-neighbor distance)

In many cases, approximate NN is almost as good as the exact one

N is set to the size of the training set; often one should choose N comparable with K (e.g. 10×K, 100×K, etc.)

Used in practice through data structures/techniques like trees or LSH (locality-sensitive hashing)

39


K-NN Industry Use Cases

K-NN search (via ANN)

“People also search/buy”

Semantically similar documents/web page relevant to a query, query expansion

Recommender systems

Exploration (ε-greedy) and exploitation (K-NN)

K-NN Classification

In cases where it is impractical to train a classifier per class, e.g. identifying suspects against a watchlist

In general, often used as an intermediate step rather than an end-to-end solution

L1 ranker vs. L2 ranker in search relevance

Query expansion

41

Logistic Regression

It is called regression…but is actually used for classification

Think of it as a projection of the linear model output y ∈ (-∞, +∞) into [0, 1]

Closely related to neural networks

42

Logistic Function and Interpretations

Weights do not influence the probability linearly; rather, a one-unit increase in a feature multiplies the odds by exp(𝛽) (see the formulas below)

Categorical variable (one-hot encoding) with L levels: use L-1 columns to represent the feature; the reference category is the L-th
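
For reference, the standard logistic model behind this odds interpretation:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta^\top x)}}, \qquad \log\frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = \beta_0 + \beta^\top x$$

so increasing feature $x_j$ by one unit multiplies the odds $P/(1-P)$ by $e^{\beta_j}$.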

43

Cross Entropy Loss

Maximizing the likelihood (MLE) is the same as minimizing the cross entropy (H) in classification problems

Trick to memorize p/q: the observation p can be 0, whereas q inside log q cannot be 0

Gradient descent is used to minimize the loss function J(w) (see below)
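
For reference, the cross entropy between the observed label distribution p and the predicted distribution q, and the resulting binary logistic loss:

$$H(p, q) = -\sum_{k} p_k \log q_k, \qquad J(w) = -\frac{1}{N}\sum_{n=1}^{N}\big[\,y_n \log \hat{y}_n + (1 - y_n)\log(1 - \hat{y}_n)\,\big]$$

where $\hat{y}_n$ is the model's predicted probability for sample $n$; the observed $y_n$ can be 0, but $\hat{y}_n$ inside the log cannot.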

44

Logistic Regression Industry Use Cases

Very commonly used

Spam detection, credit card fraud, underwriting approval, marketing (ads recommendation) etc.

Pros:

Gives you probabilities (vs. SVM)

Interpretability

Scalability

Can add regularization (L1, L2 etc.)

Cons:

Model can be too simple and lead to under-fitting

Interactions need to be added manually (linear nature)

Linear assumption

45

Other classification methods

LDA

Generative model, vs. discriminative models (e.g. logistic regression)

Assumes that the observations within each class are drawn from a multivariate Gaussian distribution with a class-specific mean vector and a covariance matrix that is common to all K classes

High bias

QDA

More complexity, higher variance

46

[Figures: decision boundaries on a linear set vs. a non-linear set]

47

Multi-class Classification

Decompose trivially into a set of unlinked binary problems

One-vs-All (OVA)

All samples in class i are positive; samples in all other classes are negative

Choose a properly tuned, regularized classifier as your underlying binary classifier

Some models come with built-in multi-class capability

Tree based model (majority vote at leaf node)

Neural network (output layer has number-of-classes nodes + softmax)

48

Multi-class Classification

All-vs-All (AVA, also called all-pairs, or one-vs-one)

Build a classifier for each ordered pair of classes, N(N-1) in total

𝑓𝑖𝑗 is the classifier where class i examples are positive and class j examples are negative. Note 𝑓𝑖𝑗 = 1 − 𝑓𝑗𝑖, so only N(N-1)/2 distinct classifiers need to be trained

49

Multi-class Classification

OVA is more commonly used, in cases where the dataset size and N are moderate

AVA can be faster and more memory efficient when the dataset is large, because usually N is not very large (dozens)

O(N²) classifiers instead of O(N), but each classifier is much smaller and is built on a smaller dataset

If the time it takes to build a classifier is superlinear in the number of data points (e.g. SVM), then AVA is a good choice (see the sketch below)
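
Both decompositions are available off the shelf in scikit-learn; a minimal sketch, with Iris as a stand-in dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

# OVA: one classifier per class.
ova = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# AVA/OVO: one classifier per pair of classes; attractive when the base
# learner (e.g. an SVM) scales superlinearly with the number of samples.
ovo = OneVsOneClassifier(LinearSVC())

print(cross_val_score(ova, X, y, cv=5).mean())
print(cross_val_score(ovo, X, y, cv=5).mean())
```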

50

Binary or Multiclass Classification?

Detect noise in a named entity knowledge-mining scenario

Multiclass (8 types + noise) vs. binary (noise or not)

Multiclass provides a 7%+ lift in accuracy

In practice, multi-task joint training/multi-objective learning may help the model learn

51

Evaluation of Classification Model

Confusion matrix

Generalize to multi-class

Overall accuracy

Per-class F1 metric and aggregate

Class-weighted aggregation

52

                Predicted True    Predicted False
Actual True     TP                FN
Actual False    FP                TN
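
These quantities and the per-class/aggregated F1 can be computed directly in scikit-learn; a minimal sketch with toy labels:

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))

# Per-class precision/recall/F1 plus weighted aggregates.
print(classification_report(y_true, y_pred))
print(f1_score(y_true, y_pred, average="weighted"))
```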


Unbalanced data in classification

The model is straightforward; getting to know your data is key

Distribution of samples across classes is highly skewed

Very common (churn, fraud)

Metric bias

Accuracy trap

Model training (see the sketch below)

Class weight (preferable)

Up-sampling (SMOTE, imbalanced-learn)

Down-sampling

Meta-learning to reweight examples (e.g. Ren et al.)

Cross-validate and test on the original data, but you can verify on re-sampled sets
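
A minimal sketch of the two training-side options on synthetic 95/5 data (all settings illustrative):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE            # imbalanced-learn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=0)
print(Counter(y))                                   # highly skewed classes

# Option 1 (often preferable): reweight classes inside the loss.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: up-sample the minority class with SMOTE; train on the
# re-sampled data, but evaluate on the original distribution.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_res))
```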

54

https://github.com/scikit-learn-contrib/imbalanced-learn
https://github.com/uber-research/learning-to-reweight-examples

Unbalanced data in classification

Evaluation

ROC curves are appropriate when the observations are balanced between classes

Precision-recall curves are appropriate for imbalanced datasets

In both cases, the area under the curve (AUC) can be used as a summary of model performance

Associate costs with misclassifications

55

[Figures: ROC curve (TPR vs. FPR), AUC = 0.92; precision-recall curve, AUC = 0.58]

How does an ML model learn?

Define a loss function (cost function)

Minimize the loss so that the model's predicted values get closer to the ground-truth values

This now turns into an optimization problem!

Tune the model parameters/weights in a way that minimizes the loss

How?

Gradient descent comes to the rescue

56

Gradient Descent

57

• An optimization algorithm to find the minimum of a loss function (J)

• m: parameter to optimize; 𝛼: learning rate (update rule below)
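
With m the parameter and 𝛼 the learning rate defined above, the standard update, repeated until convergence, is

$$m \leftarrow m - \alpha \,\frac{\partial J(m)}{\partial m}$$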


Intuitively…

What if 𝛼 is too large? (steps can overshoot the minimum and diverge)

What if 𝛼 is too small? (convergence becomes very slow)

What if the function is not convex? (gradient descent can get stuck in local minima or saddle points)

59


Gradient Descent Variants

62

Stochastic gradient descent (SGD): one sample per gradient calculation; the mini-batch version is better (more stable)

Momentum-based: Momentum, NAG (Nesterov accelerated gradient), accelerated SGD

Scaled gradient/step size: Adagrad, Adadelta, RMSProp

Adam (adaptive moment estimation): combines momentum with adaptive scaling (see the sketch below)
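
A minimal sketch contrasting the plain gradient descent update with the momentum update on the toy loss J(w) = w²; SGD uses the same update form with mini-batch gradients:

```python
def grad(w):
    return 2 * w            # gradient of the toy loss J(w) = w**2

alpha, beta = 0.1, 0.9      # learning rate, momentum coefficient

# Plain gradient descent: step straight against the current gradient.
w = 5.0
for _ in range(50):
    w -= alpha * grad(w)

# Momentum: accumulate a velocity that damps oscillation and builds up
# speed along directions where gradients consistently agree.
w_m, v = 5.0, 0.0
for _ in range(50):
    v = beta * v - alpha * grad(w_m)
    w_m += v

print(w, w_m)               # both approach the minimum at w = 0
```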

63

Adam

Standard for neural network models

Robust and effective

Large memory footprint

64

Saddle point performance

65

Algorithms without gradient-based scaling struggle to break symmetry

Momentum oscillates and builds up velocity in the optimal direction

Most of the time SGD can converge, eventually…

Generalization (Learning Theory)

The real aim of supervised learning is to do well on test data that is not known during training

Choosing the parameter values that minimize the loss function on the training data is not necessarily the best policy

We want the learning machine to model the true regularities in the data and to ignore the noise in the data

66


Can the model fit your training set well?

Can the model fit your validation set well?

68

Under-fitting

69

• Rarely happens

• Fundamental assumption issue (linear vs. polynomial)

• Data carries little signal

Over-fitting

70

• Happens more frequently than you think

• Data distribution drift

Best Practice

1. Start with the simplest model that can solve the problem (no free lunch theorem)

2. Define metrics. Is the performance good enough? If so, done. If not, this simple model becomes your baseline/benchmark model

3. Build a slightly more complicated model, maybe a tree-based model

4. Collect the same metrics. Is it much better? If yes, done. If only slightly better, repeat (3); if barely better at all, rethink your feature engineering and examine your data (see the sketch below)
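
A minimal sketch of this escalation loop in scikit-learn (synthetic data; the model ladder is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Start with the simplest baseline, escalate only while the shared
# metric improves enough to justify the added complexity.
for model in [DummyClassifier(strategy="most_frequent"),
              LogisticRegression(max_iter=1000),
              GradientBoostingClassifier()]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(score, 3))
```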

71

Python Hands-on Session: Portfolio
Optimization

72

Next Step

Lecture 4: Unsupervised learning, neural networks, Python example continued

73
