
Machine Learning for Financial Data
February 2021
CLASSIFICATION (CONCEPTS – PART 2)

Contents
◦ Naive Bayes Classifier
◦ Support Vector Machine (SVM)
◦ Random Forest
◦ Logistic Regression
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 2
Classification

Naive Bayes Classifier

The Naïve Bayes classifier relies on the probability function from the petal & sepal measurements to species over the sample space
(Figure: in some regions of the measurement space the prediction is obvious; in overlapping regions the prediction is not so obvious.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 4
Classification

Bayes’ Theorem is about updating the belief based on evidence
ALL POSSIBILITIES: No. of setosa in the sample + No. of non-setosa in the sample
ALL POSSIBILITIES FITTING THE EVIDENCE: No. of setosa meeting some condition on the measurements + No. of non-setosa meeting some condition on the measurements
PROBABILITY OF A HYPOTHESIS GIVEN THE EVIDENCE IS TRUE:
Probability of a sample being a setosa given some condition on the measurements
= No. of setosa meeting the condition / (No. of setosa meeting the condition + No. of non-setosa meeting the condition)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
5
Classification

Bayes’ Theorem (1)
P(H|E) = P(H) · P(E|H) / P(E)

P(H): probability of a hypothesis being true (before any evidence)
P(E|H): probability of seeing the evidence if the hypothesis is true
P(E): probability of seeing the evidence
P(H|E): probability of a hypothesis being true given seeing the evidence

P(H|E) · P(E) = P(E|H) · P(H) = P(H ∩ E) = P(E ∩ H)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 6
Classification

Bayes’ Theorem (2)
Background proposition B against proposition A:

                   B                                   ¬B (not B)                              Total
A                  P(B|A)·P(A) = P(A|B)·P(B)           P(¬B|A)·P(A) = P(A|¬B)·P(¬B)            P(A)
¬A (not A)         P(B|¬A)·P(¬A) = P(¬A|B)·P(B)        P(¬B|¬A)·P(¬A) = P(¬A|¬B)·P(¬B)         P(¬A) = 1 − P(A)
Total              P(B)                                P(¬B) = 1 − P(B)                        1
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 7
Classification

Bayes’ Theorem (3)
PRIOR: No. of setosa = 10; No. of versicolor & virginica = 100; P(S) = 10/110 = 1/11 ≈ 0.0909
(Figure: the 110 Iris samples shown as a grid of 10 "S" markers for setosa and 100 "V" markers for versicolor/virginica.)

LIKELIHOOD: P(M|S) = 0.4; P(M|¬S) = 0.1

POSTERIOR: P(S|M)
= #Iris · P(S) · P(M|S) / ( #Iris · P(S) · P(M|S) + #Iris · P(¬S) · P(M|¬S) )
= P(S) · P(M|S) / ( P(S) · P(M|S) + P(¬S) · P(M|¬S) )
= 0.0909 · 0.4 / ( 0.0909 · 0.4 + 0.9091 · 0.1 )
= 0.0364 / ( 0.0364 + 0.0909 )
= 0.2857
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 8
Classification
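The arithmetic above can be checked with a few lines of Python. This is a minimal sketch added for illustration; the counts and the likelihood values are the ones assumed in the slide, not computed from real data.

# posterior probability of setosa (S) given a measurement condition M
n_setosa, n_other = 10, 100
p_s = n_setosa / (n_setosa + n_other)        # prior P(S) = 1/11 ≈ 0.0909
p_not_s = 1 - p_s                            # prior P(¬S)
p_m_given_s, p_m_given_not_s = 0.4, 0.1      # likelihoods assumed in the slide
posterior = (p_s * p_m_given_s) / (p_s * p_m_given_s + p_not_s * p_m_given_not_s)
print(round(posterior, 4))                   # 0.2857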

Gaussian Naive Bayes classifier relies on probability of each feature value within a class and the class probability
▪ A Naive Bayes classifier is a probabilistic ML model that is used for classification
▪ The crux of the classifier is based on Bayes' theorem
P(A|B) = P(B|A) · P(A) / P(B)
▪ The theorem provides the probability of A happening, given that B has occurred
▪ B is the evidence and A is the hypothesis
▪ Features are assumed to be independent; hence, the classifier is called naïve
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 9
Classification

Iris classification is based on the maximum probability value of the 3 species classes given the 4 sepal & petal measurements
▪ Question: which species has the highest probability given the 4 measurements?
▪ The hypothesis (y) is the Iris being one of the three species
▪ The evidence (x1, x2, x3, x4) is the 4 sepal and petal measurements
P(y|x1, x2, x3, x4) = P(x1|y) · P(x2|y) · P(x3|y) · P(x4|y) · P(y) / ( P(x1) · P(x2) · P(x3) · P(x4) )
▪ Given that the denominator is a constant, the probability of an Iris being a particular species (y) given the 4 measurements (xi) can be expressed as
P(y|x1, x2, x3, x4) ∝ P(y) · ∏_{i=1..4} P(xi|y)
▪ The initial estimation of P(y) is simply the proportion of y among the samples
▪ The species with the largest probability will be taken as the prediction
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 10
Classification

Python: Fitting a Naive Bayes Model to Make Prediction
# load relevant modules
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# instantiate a Naive Bayes classifier
# https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
nb = GaussianNB()

# fit/train the classifier to the training dataset
model = nb.fit(X_train, y_train)

# predict the targets for the test features
test_t = model.predict(X_test)

# calculate the accuracy score for the predicted targets using the known targets
print("NB accuracy:", accuracy_score(y_test, test_t))
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 11
Classification

Naïve Bayes Classifier in a Nutshell
1. Feature Data Types – Categorical or numerical.
2. Target Data Types – Categorical (with probability).
3. Key Principles – Uses Bayes' theorem of conditional probabilities. For each feature, it calculates the probability of a class depending on the value of the feature.
4. Hyperparameters – None.
5. Data Assumptions – Features are assumed to be independent. Numerical features are assumed to be normally distributed.
6. Performance – Low computation cost. Fast and accurate. Efficient on large datasets.
7. Accuracy – When the assumption of independence holds, it can outperform even highly sophisticated classification methods. It also performs well in multi-class prediction and is therefore widely used in text classification, e.g. spam filtering and sentiment analysis. Classifier combination techniques like ensembling, bagging and boosting do not improve its performance, since their purpose is to reduce variance and Naive Bayes has no variance to minimize.
8. Explainability – How much each feature contributes to a class prediction is interpretable in the form of conditional probabilities.
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 12
Classification

Support Vector Machine (SVM)

Support Vector Machine
An SVM is a powerful and versatile ML model, capable of performing linear or non-linear classification, regression, and even outlier detection. It is one of the more complex but accurate families of models, making it one of the most popular models in ML despite being a black-box technique. SVMs are particularly well suited for classification of complex and small- or medium-size datasets.
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 14
Classification

Linear SVM Classification

An SVM classifier tries to fit the widest possible street between the data points – large margin classification
▪ Using the Iris dataset, the scatterplot of petal length vs petal width can clearly be separated with a straight line – the classes are linearly separable
(Figure, left: 3 possible linear classifiers – the green one is bad, and the other two are too close to the data points and may not perform well on new data. Right: an SVM classifier – the line not only separates the two classes but also stays as far away from the closest training data points as possible.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 16
Classification

Hard margin classification may not generalize well
(Figure: the separating hyperplane with its negative and positive boundaries; the closest data points are the support vectors, and the width between the boundaries is the maximum margin.)
▪ Strictly imposing that all instances must be off the street is called hard margin classification
▪ It only works if the data is linearly separable
▪ It is sensitive to outliers
▪ Sometimes it is impossible to find a hard margin that will generalize well
HARD MARGIN CONSTRAINT: no data point is allowed to appear in the street, implying that misclassification is not allowed
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
17
Classification

Soft margin classification trades margin violations for better generalization
(Figure: a soft-margin classifier with one misclassified sample inside the street.)
▪ To avoid the issues with hard margin classification, a more flexible soft margin classification is introduced
▪ The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations
▪ Samples may end up in the middle of the street or even on the wrong side, allowing misclassification
SOFT MARGIN CONSTRAINT: data points are allowed to appear in the street, implying that some misclassification is allowed
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
18
Classification

The C hyperparameter is used to control error by specifying a mis-classification penalty
(Figure: decision boundaries of a linear-kernel SVM for different values of C.)
▪ C is a hyperparameter for SVM
◦ Setting it to a low value, we might end up with a lot of margin violations but the model will probably generalize better
◦ Setting it to a high value, we might get fewer margin violations but the model may not generalize well
▪ Reducing C can regularize the model to avoid overfitting
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 19
Classification
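As an illustration only, the sketch below trains a linear SVM on two Iris features with different C values; the feature choice and the particular C values are assumptions made for the example, not prescribed by the slide.

# explore the effect of the C hyperparameter with a linear SVM
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data[:, 2:4]                        # petal length & petal width
y = (iris.target == 2).astype(int)           # Iris virginica vs the rest

for C in (0.01, 1, 100):                     # low C: wider street, more violations; high C: fewer violations
    svm_clf = make_pipeline(StandardScaler(), LinearSVC(C=C, max_iter=10000))
    svm_clf.fit(X, y)
    print("C =", C, "training accuracy =", svm_clf.score(X, y))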

Nonlinear SVM Classification

Features can be added to make a dataset linearly separable
(Figure, left: data points on a single feature x1 are not linearly separable. Right: adding the feature x2 = x1², the square of x1, makes the data points linearly separable.)
▪ Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable
▪ One approach to handling nonlinear datasets is to add more features, such as polynomial features; in some cases this can result in a linearly separable dataset
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 21
Classification

A kernel function “adds” features by using a similarity function over a landmark and each existing data point
▪ Adding polynomial features significantly increases the complexity of ML algorithms (SVM & others), which hurts model performance
▪ When using SVM, kernel functions can be applied to get the same result as if many polynomial features were added to the model, even with very high-degree polynomials, without actually having to add them and therefore avoiding the combinatorial explosion of features
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 22
Classification

The Radial Basis Function (RBF) introduces a new feature having values between 0 and 1
x2 is a new feature obtained by applying φγ(x, l1) over the existing data points
x3 is a new feature obtained by applying φγ(x, l2) over the existing data points

φγ(x, l) = exp( −γ ‖x − l‖² ), where γ = 1 / (2σ²)

▪ The RBF is a bell-shaped function measuring the similarity between a landmark point (i.e. l) and any existing data point (e.g. x)
▪ φγ(x, l) = 0 indicates the data point x is far from the landmark point l
▪ φγ(x, l) = 1 indicates the data point x is at the landmark point l
▪ γ is a hyperparameter and can be seen as the inverse of the radius of influence of the data points selected by the model as support vectors
▪ It can be perceived as deciding how much curvature we want in a decision boundary (i.e. high γ means more curvature)
(Figure: the input x1 has a 1D feature space with landmarks l1 and l2.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 23
Classification
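A minimal sketch of the RBF similarity feature defined above; the data point and the landmark locations are hypothetical values chosen only to show the calculation.

import numpy as np

def rbf(x, landmark, gamma=0.3):
    # Gaussian RBF: similarity between a data point and a landmark, in (0, 1]
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

x = np.array([-1.0])                         # a 1D data point (hypothetical)
l1, l2 = np.array([-2.0]), np.array([1.0])   # hypothetical landmarks
x2 = rbf(x, l1)                              # new feature from landmark l1
x3 = rbf(x, l2)                              # new feature from landmark l2
print(x2, x3)                                # 1 at the landmark, approaching 0 far away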

The transformed dataset, dropping the original feature, is linearly separable
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 24
Classification

Setting the centroid of the data points as the landmark and then uplifting the data points around the landmark
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 25
Classification

The hyperplane is chosen in the 3D space
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 26
Classification

The hyperplane therefore provides a decision boundary for the original dataset
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 27
Classification

Transforming the training dataset into a linear separable dataset is the objective of the kernel trick
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 28
Classification

When a model is overfitting/underfitting, 𝛾 should be reduced/increased
▪ Increasing gamma makes the bell-shaped curve narrower
▪ Each sample's range of influence is smaller
▪ The decision boundary ends up being more irregular, wiggling around individual samples
▪ A small gamma value makes the bell-shaped curve wider
▪ Samples have a larger range of influence, and the decision boundary ends up smoother
▪ So 𝛾 acts like a regularization hyperparameter
▪ When overfitting, it should be reduced
▪ When underfitting, it should be increased
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 29
Classification
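The sketch below, included as an illustration, fits an RBF-kernel SVM with a few gamma values on the Iris data; the particular gamma values and the use of the full Iris dataset are arbitrary choices for the demonstration.

# gamma as a regularization hyperparameter of the RBF kernel
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

for gamma in (0.01, 1, 100):                 # low gamma: smoother boundary; high gamma: wigglier boundary
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1.0))
    clf.fit(X, y)
    print("gamma =", gamma, "training accuracy =", clf.score(X, y))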

With so many kernel functions to choose from, how can you decide which one to use?
▪ As a rule of thumb, you should always try the linear kernel first
▪ LinearSVC is much faster than SVC(kernel="linear"), especially if the training set is very large or if it has plenty of features
▪ If the training set is not too large, you should also try the RBF kernel – it works well in most cases
▪ Then if you have spare time and computing power, you can experiment with a few other kernels, using cross-validation and grid search
▪ You would want to experiment like that especially if there are kernels specialized for your training set's data structure
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 30
Classification
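The kernel comparison described above can be automated with cross-validation and grid search. The sketch below is one possible way to do it with scikit-learn; the parameter grid and the dataset are illustrative assumptions.

from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_iris(return_X_y=True)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]},
]
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold cross-validation over the grid
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))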

Support Vector Machine (SVM) in a Nutshell
1. Feature Data Types – Requires feature scaling.
2. Target Data Types – Categorical or numerical.
3. Key Principles – Find the maximum separation between classes while minimizing the classification error. Kernel tricks are used to turn the data into linearly separable data.
4. Hyperparameters – Linear and non-linear kernel functions. The C hyperparameter, specifying the penalty for misclassification, is always needed. The gamma hyperparameter, specifying the degree of curvature of the decision boundary, is not always needed. With the RBF kernel, both gamma and C are needed.
5. Data Assumptions – No data distributional requirement.
6. Performance – Fairly robust against overfitting, especially in higher dimensional space. Handles non-linear relationships quite well. Can be inefficient to train as well as memory-intensive to run and tune. Does not perform well with large datasets.
7. Accuracy – SVM is known as one of the most accurate and robust machine learning algorithms.
8. Explainability – Support vectors provide some information about how the classification decision is determined.
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 31
Classification

Random Forest

Decision trees work great with the data used to create them but are not flexible when it comes to classifying new samples
(Figure: a single decision tree vs a random forest.)
▪ Decision trees are easy to build, easy to use, and easy to interpret
▪ Inaccuracy prevents them from being the ideal tool for predictive learning
▪ They work great with the data used to create them
▪ However, they are not flexible when it comes to classifying new samples
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
33
Classification

Random Forest
A random forest comprises multiple decision trees. It is said that the more trees it has, the more robust the forest is. A random forest creates decision trees on randomly selected data samples, gets a prediction from each tree and selects the best solution by means of voting. It also provides a pretty good indicator of feature importance.
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 34
Classification

Random forests combine the simplicity of decision trees with flexibility resulting in a vast improvement in accuracy
Stage 1 – Bootstrap Sampling: the original samples are resampled into Training Subset 1, Training Subset 2, ⋯, Training Subset n
Stage 2 – Model Training: a decision tree is trained on each training subset
Stage 3 – Model Forecasting: each trained tree produces Forecast 1, Forecast 2, ⋯, Forecast n
Stage 4 – Result Aggregating: the individual forecasts are aggregated into the final Forecast
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 35
Classification
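The four stages above map directly onto scikit-learn's RandomForestClassifier, which performs the bootstrap sampling, per-tree training and vote aggregation internally. This is a minimal sketch for illustration; the dataset, split and hyperparameter values are assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# each tree is trained on a bootstrap sample and considers a random subset of features at each split
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
print("RF accuracy:", accuracy_score(y_test, rf.predict(X_test)))
print("feature importances:", rf.feature_importances_)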

Bootstrapping is a resampling technique used to estimate population statistics by sampling a dataset with replacement
The basic idea of bootstrapping is that inference about a population from sample data can be modelled by resampling the sample data and performing inference about a sample from the resampled data
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 36
Classification

Data subset is created by randomly selecting samples from the sample dataset – bootstrapping with replacement
original sample dataset
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
No           No                       No                 125      No
Yes          Yes                      Yes                180      Yes
Yes          Yes                      No                 210      No
Yes          No                       Yes                167      Yes
▪ A bootstrapped data subset is created by randomly selecting samples from the original sample dataset
▪ The bootstrapped data subset is of the same size as the original dataset
▪ The important detail is that it is allowed to pick the same sample more than once
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 37
Classification
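Sampling with replacement is one line of NumPy. The sketch below draws a bootstrapped index set of the same size as a four-row dataset, so some rows may appear more than once and others not at all; the seed and dataset size are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 4                                       # size of the original sample dataset
indices = rng.integers(0, n, size=n)        # with replacement: duplicates are allowed
print(indices)                              # rows never drawn form the Out-Of-Bag samples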

Data subset is created by randomly selecting samples from the sample dataset – bootstrapping with replacement
original sample dataset – the same four samples as shown above
bootstrapped data subset (after drawing the 1st sample, with replacement)
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      Yes                180      Yes
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 38
Classification

Data subset is created by randomly selecting samples from the sample dataset – bootstrapping with replacement
original sample dataset – the same four samples as shown above
bootstrapped data subset (after drawing the 2nd sample)
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      Yes                180      Yes
No           No                       No                 125      No
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 39
Classification

Data subset is created by randomly selecting samples from the sample dataset – bootstrapping with replacement
original sample dataset – the same four samples as shown above
bootstrapped data subset (after drawing the 3rd sample)
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      Yes                180      Yes
No           No                       No                 125      No
Yes          No                       Yes                167      Yes
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 40
Classification

Data subset is created by randomly selecting samples from the sample dataset – bootstrapping with replacement
original sample dataset – the same four samples as shown above
bootstrapped data subset (after drawing the 4th sample)
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      Yes                180      Yes
No           No                       No                 125      No
Yes          No                       Yes                167      Yes
Yes          No                       Yes                167      Yes
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 41
Classification
The 4th selected sample is the same as the 3rd one – sampling with replacement is at work here

A decision tree is constructed using a randomly selected subset of the features at each step
bootstrapped data subset
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      Yes                180      Yes
No           No                       No                 125      No
Yes          No                       Yes                167      Yes
Yes          No                       Yes                167      Yes

▪ Two features are randomly selected as candidates for the root node of the new decision tree
▪ Assuming Good Blood Circulation does the best job separating the samples, it becomes the root node of the new decision tree
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
42
Classification

The candidate feature with the best separating power is selected as the decision feature
bootstrapped data subset – the same four samples as above
▪ Two features are randomly selected as candidates for the next internal node under the root node Good Blood Circulation
▪ Assuming Weight does the best job separating the samples, Weight becomes the next internal node
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 43
Classification
the new decision tree with the root node and one internal node

A decision tree is built as usual but only considering a randomly selected subset of features at each step
bootstrapped data subset – the same four samples as above
(Figure: the new decision tree built from the bootstrapped data subset.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 44
Classification

Repeatedly make a new bootstrapped dataset and build a tree considering a subset of features at each step
▪ After building hundreds of decision trees, it results in a wide variety of trees
▪ The variety is the fundamental element that makes random forests more effective than individual decision trees
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 45
Classification

New data will be run through the decision trees one by one and the result of each decision tree is recorded
a new data sample
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          No                       No                 168      ?

run the data down the 1st tree
Vote tally – Heart Disease YES: 1, Heart Disease NO: 0
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 46
Classification
the 1st tree says YES

Each decision tree result is tracked against the prediction classes
a new data sample – the same as above (Chest Pain: Yes, Good Blood Circulation: No, Blocked Arteries: No, Weight: 168, Heart Disease: ?)
run the data down the 2nd tree
Vote tally – Heart Disease YES: 2, Heart Disease NO: 0
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 47
Classification
the 2nd tree says YES

The prediction outcome is determined by the votes of all decision trees in the forest
▪ In this case, “YES” received the most votes, so the conclusion is that the patient does have heart disease
a new data sample
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          No                       No                 168      YES

Final vote tally – Heart Disease YES: 5, Heart Disease NO: 1
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 48
Classification

Ensemble Method
◦ Random forest is technically an ensemble method based on the divide-and-conquer approach
◦ Each decision tree in the forest is generated from a random sample of the training dataset, with node splits selected using measures such as information gain, gain ratio, or the Gini index for each feature
◦ In a classification problem, each tree votes and the most popular class is chosen as the final result
◦ In the case of regression, the average of all the tree outputs is considered as the final result
◦ It is simpler and more powerful compared to the other non-linear classification algorithms
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
49
Classification

Bagging uses the same algorithm for every predictor but using different random subsets of the training dataset
▪ Bagging / Bootstrap aggregating uses the same algorithm for each predictor but using different random subsets of the training dataset to allow for a more generalised result
▪ Subsets can be created with or without replacement
▪ With replacement, some samples may be present & repeated in more than one subset
▪ Without replacement, all samples in each subset are unique with no repeated sample
▪ Once all the predictors are trained, the ensemble can make a prediction for a new instance by aggregating the predicted values of all trained predictors
▪ Although each individual predictor has a higher bias than if it were trained on the original dataset, the aggregation allows the reduction of both bias & variance
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 50
Classification
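One possible way to express bagging in scikit-learn is sketched below; the base estimator, the number of predictors and the dataset are illustrative choices, not taken from the slides.

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 100 decision trees, each trained on a bootstrap sample of the training data
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, bootstrap=True, random_state=0)
bag.fit(X, y)
print("training accuracy:", bag.score(X, y))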

Typically, about 1/3 of the original data does not end up in the bootstrapped dataset – the Out-of-Bag dataset
original sample dataset
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
No           No                       No                 125      No
Yes          Yes                      Yes                180      Yes
Yes          Yes                      No                 210      No
Yes          No                       Yes                167      Yes

bootstrapped dataset
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      Yes                180      Yes
No           No                       No                 125      No
Yes          No                       Yes                167      Yes
Yes          No                       Yes                167      Yes

The sample (Yes, Yes, No, 210, No) is not included in the bootstrapped dataset, so it will be considered as a sample in the Out-Of-Bag dataset
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 51
Classification

The OOB dataset was not used to create this decision tree so it can be run through the decision tree for validation
out-of-bag dataset
Chest Pain   Good Blood Circulation   Blocked Arteries   Weight   Heart Disease
Yes          Yes                      No                 210      No

run the out-of-bag data down the tree created without using the out-of-bag data – the tree predicts NO, which is correct
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 52
Classification

Continue running this out-of-bag sample through all of the other trees that were built without it and aggregate the results
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 53
Classification

Accuracy of the model can be determined by running the out-of-bag dataset against all applicable decision trees
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 54
Classification
The proportion of Out-Of-Bag samples that are incorrectly classified is the Out-Of-Bag Error
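scikit-learn can report this out-of-bag estimate directly. The sketch below is an illustration; the dataset and the number of trees are assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# each tree is scored on the samples it never saw during training (its out-of-bag samples)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print("OOB accuracy:", rf.oob_score_)       # Out-Of-Bag error = 1 - oob_score_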

Random Forest Models in a Nutshell
1. Feature Data Types – Numerical.
2. Target Data Types – Categorical or numerical.
3. Key Principles – Extremely flexible & easy to use. Can be used for both classification & regression problems. Can handle missing values in training and prediction by imputing continuous features with median values and categorical features using the proximity-weighted average of missing values.
4. Hyperparameters – Number of trees in the forest. Quality function for internal node splits. Minimum number of samples required to split an internal node. Minimum number of samples required at a leaf node. Maximum number of leaf nodes. Maximum depth of the tree.
5. Data Assumptions – Data scaling is not required.
6. Performance – Less prone to overfitting because the predictions of many trees are averaged, which reduces variance. Slow in generating predictions due to the number of decision trees involved.
7. Accuracy – Considered a very accurate and robust method because of the number of decision trees taking part in the prediction. Simpler and more powerful than other non-linear classification algorithms.
8. Explainability – Relative feature contribution to the prediction can be reported. Less interpretable than a single decision tree.
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 55
Classification

Logistic Regression

Applying Logistic Regression

Linear regression does not always predict a value that falls within the expected range
(Figure: infection outcome plotted against AGE – some samples are INFECTED, others are NOT INFECTED, and a straight regression line is fitted; there is large variability in the outcome at all ages, and a young person would be predicted to have a negative value!)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 58
Classification

Logistic Regression
Unlike linear regression, logistic regression does not try to predict the value of a numeric variable given a set of inputs. Instead, the output of logistic regression is the probability of a given input point belonging to a specific class. The output of logistic regression always lies in [0,1].
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 59
Classification

The sample contains people of different ages and each person is either infected or not infected
(Figure: INFECTED samples plotted at 1 and NOT INFECTED samples at 0 against AGE; logistic regression fits an S-curve to the data to model the probability of infection.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 60
Classification

The logistic regression predicts the probability of a person being infected based on the person’s age
When doing logistic regression, the y-axis is converted to the probability that a person is infected
To do classification, it is necessary to turn probability into classification
(Figure: PROBABILITY OF INFECTION, from 0 to 1, plotted against AGE.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 61
Classification

People with a probability greater than the threshold will be classified as infected; otherwise, not infected
One way to classify people is to set a threshold at 0.5
(Figure: the S-curve of PROBABILITY OF INFECTION against AGE, with a threshold line drawn at 0.5.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 62
Classification
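A minimal sketch of fitting a logistic regression on age and applying the 0.5 threshold; the ages and infection labels are hypothetical data made up for the example.

import numpy as np
from sklearn.linear_model import LogisticRegression

age = np.array([[20], [25], [30], [35], [40], [50], [60], [70]])   # hypothetical ages
infected = np.array([0, 0, 0, 0, 1, 1, 1, 1])                      # hypothetical labels

model = LogisticRegression()
model.fit(age, infected)

proba = model.predict_proba([[45]])[0, 1]   # probability of infection at age 45
label = int(proba > 0.5)                    # apply the 0.5 threshold
print(round(proba, 3), label)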

Logistic regression is generalised to predict using multiple variables
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 63
Classification

Logistic Regression S-Curve

The logistic function belongs to a class of functions called the sigmoid function

Probability = σ(z) = e^z / (1 + e^z) = 1 / (1 + e^(−z)), where z = β0 + β1·x1 + ⋯ + βn·xn

◦ σ(z) is close to 1 when z is big
◦ σ(z) is close to 0 when z is small
◦ The change in σ(z) per unit change in z becomes progressively smaller as σ(z) gets close to 0 and 1
(Figure: SIGMOID – PROBABILITY from 0.0 to 1.0 plotted against z from −6 to 6.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 65
Classification
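The sigmoid itself is a one-liner; the sketch below simply evaluates it at a few z values to show the behaviour described above.

import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-6, -2, 0, 2, 6])
print(sigmoid(z))                           # close to 0 for large negative z, close to 1 for large positive z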

Transformations make likelihood measure symmetrical (easy to interpret), more succinct & with unrestricted range
Odds(p) = chances of something happening / chances of something not happening

Log-Odds(p) = ln( Odds(p) )

(Figure: two monotonic transformations – ODDS (0 to 10) plotted against PROBABILITY (0.0 to 1.0), and LOG-ODDS (−2 to 2) plotted against ODDS (0 to 10).)

▪ A change in a feature by one unit changes the odds by a factor of e^βi (i.e. e raised to a constant power that equals the coefficient of that feature)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 66
Classification

The logistic sigmoid function can be obtained by taking the inverse of the logit function

logit(p) = ln( p(y = 1) / (1 − p(y = 1)) )

logit⁻¹(z) = 1 / (1 + e^(−z)), where z = β0 + β1·x1 + ⋯ + βn·xn

▪ Flipping the axes, the logit curve becomes the sigmoid curve
▪ The sigmoid function is the inverse of the logit function
(Figure: LOGIT plotted against PROBABILITY (0.0 to 1.0), and the SIGMOID – PROBABILITY against z from −6 to 6 – obtained by flipping the axes.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 67
Classification

Logistic regression can be perceived as regressing against the log of the odds that the class is 1
odds = chances of something happening / chances of something not happening

p(y = 0|z) = 1 − p(y = 1|z) = 1 − 1/(1 + e^(−z)) = e^(−z)/(1 + e^(−z))

ln( p(y = 1|z) / p(y = 0|z) ) = ln( (1/(1 + e^(−z))) / (e^(−z)/(1 + e^(−z))) ) = ln( 1/e^(−z) ) = ln(e^z) = z

the logit transformation is central to logistic regression
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 68
Classification

Finding the Best S-Curve

Likelihood measures the goodness of fit of a model to a sample of data for given values of the unknown parameters
▪ Likelihood is formed from the joint probability distribution of the sample data, but viewed and used as a function of the unknown parameters only, thus treating the independent variables as fixed at the observed values
▪ The likelihood function describes a hypersurface whose peak, if it exists, represents the combination of model parameter values that maximize the probability of drawing the sample obtained
Likelihood = p(data | parameters) = p(y | z) = ∏_{i=1..N} p(y_i = 1|z)^(y_i) · p(y_i = 0|z)^(1−y_i)

best fit means maximum likelihood
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 70
Classification

Performing gradient descent on the negative log-likelihood will get us the optimal β values that minimize the total loss

For computational convenience, the maximization of the likelihood is usually done by minimizing the negative of the natural logarithm of the likelihood, known as the negative log-likelihood

Negative Log-Likelihood
= −ln ∏_{i=1..N} p(y_i = 1|z)^(y_i) · p(y_i = 0|z)^(1−y_i)
= −Σ_{i=1..N} [ y_i·ln p(y_i = 1|z) + (1 − y_i)·ln p(y_i = 0|z) ]
= −Σ_{i=1..N} [ y_i·ln( 1/(1 + e^(−z)) ) + (1 − y_i)·ln( e^(−z)/(1 + e^(−z)) ) ]
= −Σ_{i=1..N} [ −z − ln(1 + e^(−z)) + y_i·z ]

Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 71
Classification
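A minimal sketch of that gradient descent for a single feature; the age/infection data are hypothetical, and the learning rate and iteration count are arbitrary choices.

import numpy as np

age = np.array([20, 25, 30, 35, 40, 50, 60, 70], dtype=float)   # hypothetical feature
y = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)             # hypothetical labels
x = (age - age.mean()) / age.std()                              # standardize for stable steps

b0, b1, lr = 0.0, 0.0, 0.1                                      # intercept, slope, learning rate
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))                    # p(y=1|z) for current coefficients
    b0 -= lr * np.mean(p - y)                                   # gradient of the negative log-likelihood
    b1 -= lr * np.mean((p - y) * x)

print(round(b0, 3), round(b1, 3))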

Logistic regression uses maximum likelihood to obtain the curve that fits the sample data best
calculate the likelihood of infection for each age value and then multiply all of those likelihoods together
(Figure: a candidate S-curve over the INFECTED (1) and NOT INFECTED (0) samples plotted against AGE.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 72
Classification

Logistic regression uses maximum likelihood to obtain the curve that fits the sample data best
shift the curve and calculate a new likelihood of the sample data
(Figure: a shifted S-curve over the same samples.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 73
Classification

Logistic regression uses maximum likelihood to obtain the curve that fits the sample data best
finally, the curve with the maximum likelihood is selected
(Figure: the selected S-curve over the same samples.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 74
Classification

Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 75
Classification

Receiver Operating Characteristic (ROC) Curve

The classification will change as the threshold value changes giving a different confusion matrix each time
Threshold @ 0.75
                          Actual Infected    Actual Not Infected
Predicted Infected               1                   1
Predicted Not Infected           2                   2

Threshold @ 0.5
                          Actual Infected    Actual Not Infected
Predicted Infected               2                   1
Predicted Not Infected           1                   2

(Figure: the S-curve of PROBABILITY OF INFECTION against AGE, with threshold lines at 0.75 and 0.5.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 77
Classification

A confusion matrix can be characterised by the True Positive Rate and False Positive Rate
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 78
Source: https://en.wikipedia.org/wiki/Confusion_matrix
Classification

Therefore, changing the threshold will generate possibly infinite number of TPR and FPR pairs
Threshold @ 0.75
                          Actual Infected    Actual Not Infected
Predicted Infected               1                   1
Predicted Not Infected           2                   2
TPR = 0.33, FPR = 0.33

Threshold @ 0.5
                          Actual Infected    Actual Not Infected
Predicted Infected               2                   1
Predicted Not Infected           1                   2
TPR = 0.67, FPR = 0.33

(Figure: the S-curve of PROBABILITY OF INFECTION against AGE, with the two threshold lines.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 79
Classification

What is the Receiver Operating Characteristic (ROC) curve?
▪ An ROC curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied
So instead of being overwhelmed with confusion matrices, the ROC curve provides a simple way to summarize all of the information
(Figure: an ROC curve plotting TRUE POSITIVE RATE (0 to 1) against FALSE POSITIVE RATE (0 to 1).)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 80
Classification

Classification Metric: AUC (Area Under Curve) A balanced measure of precision and sensitivity
AUC varies between 0 and 1
▪ The ROC curve can be used to compare model predictive power based on TPR and FPR
▪ The decision will be based on how much area is under the curve
▪ The ideal curve fills in 100% of the area and will be able to tell negative from positive results 100% of the time
▪ The ROC curve at the bottom does a worse job than chance, mixing up the negatives and positives
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 81
Classification
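One way to compute the curve and its area with scikit-learn is sketched below; the breast-cancer dataset and the logistic regression model are illustrative stand-ins for any binary classifier.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]          # predicted probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)    # one (FPR, TPR) pair per threshold
print("AUC:", round(roc_auc_score(y_test, scores), 3))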

http://arogozhnikov.github.io/2015/10/05/roc-curve.html
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 82
Classification

The Log Loss Function

There is no consensus on how to calculate R² for logistic regression – there are more than 10 different ways to do it
Logistic regression does not have the concept of a residual, so it can use neither RSS nor R² to compare models
(Figure: the INFECTED (1) vs NOT INFECTED (0) samples over AGE with the fitted S-curve.)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 84
Classification

The Log Loss function represents the price paid for inaccuracy of predictions in classification problems
LogLoss = −(1/N) Σ_{i=1..N} [ y_i·log( p(y_i) ) + (1 − y_i)·log( 1 − p(y_i) ) ]

▪ For each row i in a dataset with N rows
• y is the outcome (dependent variable), which can be either 0 or 1
• p is the predicted probability obtained by applying the logistic regression function
▪ The objective is to minimize the total log loss over the whole dataset by adjusting the estimates in the logistic regression equation
▪ If y is 1, log loss is minimized with a high value of p
▪ If y is 0, log loss is minimized with a low value of p
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 85
Classification
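scikit-learn implements this measure as log_loss. The sketch below compares confident, mostly correct predictions with confident but wrong ones; the labels and probabilities are made up for the example.

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]
p_good = [0.9, 0.1, 0.8, 0.2]               # confident and mostly correct
p_bad = [0.1, 0.9, 0.2, 0.8]                # confident but wrong
print("good:", round(log_loss(y_true, p_good), 3))
print("bad:", round(log_loss(y_true, p_bad), 3))    # much larger penalty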

For all the points belonging to the positive class (green), what are the predicted probabilities given by the classifier? The green bars represent the probability of a given point being green
Fitting a logistic regression to predict the probability of a point being green for any given value of x, which can take on either negative or positive value
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 86
Classification
What is the probability of a given point being red?
The red bars above the curve represent the probability of the negative class

The loss function aims to penalize bad predictions
▪ If the probability associated with the true class is 1.0, we need its loss to be 0
▪ Conversely, if that probability is low, say, 0.01, we need its loss to be HUGE
▪ Taking the negative log of the probability suits well enough for this purpose
▪ the log of values between 0.0 and 1.0 is negative
▪ taking the negative log provides a positive value for the loss
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
87
Classification

The Log Loss function penalizes heavily the predictions that are confident but wrong
(Figure: the two cost curves, one for y = 1 and one for y = 0.)
Cost increases to infinity when the predicted probability is very far away from the actual value 1
Cost increases to infinity when the predicted probability is very far away from the actual value 0
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
88
Classification

Logistic Regression Models in a Nutshell
1. Feature Data Types – Any data type. Encoding is expected for categorical features.
2. Target Data Types – Binary.
3. Key Principles – Predicts the probability of an event occurring (y = 1) given certain values of the input variables x. The output is a value between 0 and 1. A threshold probability determines to which class the output belongs.
4. Hyperparameters – None.
5. Data Assumptions – Does not require scaling of features.
6. Performance – Regularization is applied by default. Can handle both dense and sparse input. Not able to handle a large number of categorical features. Vulnerable to overfitting. Cannot solve non-linear problems.
7. Accuracy – Restrictive expressiveness (e.g. interactions must be added manually), and other models may have better predictive performance.
8. Explainability – Provides the probability associated with the classification. Interpretation is more difficult because the interpretation of the weights is multiplicative and not additive.
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 89
Classification

References

References
“Hands-On Machine Learning with Scikit-Learn and TensorFlow”, Aurelien Geron, O’Reilly Media, Inc., 2017
“Machine Learning & Data Science Blueprints for Finance”, Hariom Tatsat, Sahil Puri, and Brad Lookabaugh, O’Reilly Media, Inc., 2020
“Applied Logistic Regression”, David W. Hosmer Jr., Wiley, 2013
Copyright (c) by Daniel K.C. Chan. All Rights Reserved.
91
Classification

References
▪ “Chapter 14 Support Vector Machine” in “Hands-On Machine Learning with R” (https://bradleyboehmke.github.io/HOML/svm.html)
▪ “The Gaussian RBF Kernel in Nonlinear SVM”, Suvigya Saxena, 2020 (https://medium.com/@suvigya2001/the-gaussian-rbf-kernel-in-non-linear-svm-2fb1c822aae0)
▪ “C and Gamma in SVM”, A Man Kumar, 2018 (https://medium.com/@myselfaman12345/c-and-gamma-in-svm-e6cee48626be)
▪ “Using Random Forests in Python with Scikit-Learn”, Fergus Boyles, 2017 (https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/)
▪ “Ensemble Learning: 5 Main Approaches”, Diogo Menezes Borges (https://www.kdnuggets.com/2019/01/ensemble-learning-5-main-approaches.html)
▪ “Logistic Regression: A Concise Technical Overview”, Reena Shaw, KDnuggets (https://www.kdnuggets.com/2018/02/logistic-regression-concise-technical-overview.html)
Copyright (c) by Daniel K.C. Chan. All Rights Reserved. 92
Classification

THANK YOU