Machine Learning in Finance
Lecture 1
Machine Learning – Setting Up the scenes
Arnaud de Servigny & Hachem Madmoun
Outline:
• The various typologies of Machine Learning approaches
• Best practices in terms of Data Handling
• Getting the best fit – A short review of optimisation concepts
• Evaluation Metrics for Classification
• Programming Session
The various typologies of Machine Learning approaches
Supervised Learning – Unsupervised Learning
• Supervised Learning is the process of learning a function which maps features to an output based on several input-output pairs.
[Figure: MNIST labeled dataset, where each image lies in $X = \mathbb{R}^{28 \times 28}$ and each label (e.g. 7) lies in $Y = \{0, \dots, 9\}$.]
• Unsupervised Learning is the process of identifying meaningful patterns in unlabeled data.
[Figure: MNIST unlabeled dataset, the same images without their labels.]
[Figure: 2-dimensional representation of the MNIST dataset using a variational autoencoder (an example of an unsupervised algorithm). Each color represents a digit.]
Supervised Learning – Unsupervised Learning
Typology of machine-learning problems / Defining the problem
Generally, there are two main types of machine learning problems: supervised and unsupervised.
Supervised machine learning problems
• There are problems where we want to make predictions using data (features) based on pre-defined targets (labels).
• In supervised machine learning, you feed the features and their corresponding labels into an algorithm in a process called training. During training, the algorithm gradually determines the relationship between features and their corresponding labels. This relationship is called the model.
• A useful phrase to capture what is actually going on in practice is “Pattern Recognition”.
• There are in fact two main types of supervised learning problems: classification, which involves predicting a class label, and regression, which involves predicting a numerical value.
✓ Classification: supervised learning problem that involves predicting a class label.
✓ Regression: supervised learning problem that involves predicting a numerical label.
The wording “supervised learning” originates from the perspective of the target being provided by an “instructor” who teaches the machine learning algorithm what to do.
Supervised Learning – Unsupervised Learning
Unsupervised machine learning problems
• There are problems where the data does not have a defined set of categories, but instead we are looking for the machine-learning algorithms to help us organize it.
• In unsupervised machine learning, the goal is to identify meaningful patterns, i.e. to describe or extract relationships in the data. To accomplish this, the machine must learn from an unlabelled dataset. In other words, the model has no hints about how to categorize each piece of data and must infer its own rules for doing so. It is all about making sense of the data.
In practice, we are talking about a variety of possible outcomes:
• Clustering: a problem which involves finding groups in the data.
• Density Estimation: a problem which involves summarizing the distribution of the data.
• Visualization: a problem which involves creating plots of data.
• Projection: a problem which involves creating lower-dimensional representations of the data.
This being said, it is more accurate to describe ML problems as falling along a spectrum of supervision between supervised and unsupervised learning. Let us not be too dependent on the above segmentation!
An example of a mixed approach: Reinforcement Learning
Reinforcement learning describes a class of problems where an agent operates in a game-like environment and must learn from a trial and error iterative process, using some reward feedback.
Reinforcement learning differs from the supervised learning approach in the sense that in supervised learning the training dataset provides the rule for the decision, i.e. a model is trained with the correct answer before being applied to unknown data. In reinforcement learning, by contrast, there is no modelled answer: the reinforcement agent has to decide what to do to perform the given task. In the absence of a training dataset, it has to learn sequentially from experience.
Financial problems typically solved using Machine Learning
1. Classification: e.g. good / bad credit.
2. Risk prediction: likelihood-of-default scoring.
3. Learning Associations: which type of client tends to feel comfortable with a kind of financial product (customer analytics).
4. Digital assistant: answering questions from clients
5. Extracting signals from a large spectrum of data repositories: removing noise (feature engineering)
6. Fraud detection: identifying anomalous behaviours
7. Investment prediction: inferring the dynamics of asset prices
8. Portfolio Management: making sensible asset allocation decisions
9. Document Analysis: extracting meaningful words / sentences; getting information on sentiment.
10. Model assessment and stress testing (Reinforcement Learning)
Best practices in terms of Data Handling
Best practices in terms of Data Handling – Preprocessing –
• Before diving into the algorithms of supervised and unsupervised learning, it is important to understand the best practices of building a good machine learning algorithm.
• The roadmap for building machine learning systems can be summarized as follows:
• Preprocessing: Machine Learning algorithms can only be applied to numbers, but data can come in different forms: text, audio signals, images, etc. Moreover, it is usually important to transform the data (using unsupervised learning, for instance) before feeding it to a supervised algorithm. We will illustrate this with several examples in future lectures; a first sketch follows the figure below.
[Figure: examples of raw data modalities to be converted into numbers: audio, images, texts, social networks.]
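A minimal sketch of this idea, assuming scikit-learn is available (the numbers are toy values, not a real dataset): numerical features are standardized before being fed to a supervised algorithm.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical feature matrix: rows = samples, columns = features.
X_raw = np.array([[60., 18., 30.],
                  [10., 10., 10.],
                  [ 7.,  5.,  8.]])

# Standardize each feature to zero mean and unit variance, a common
# preprocessing step before feeding the data to a supervised algorithm.
scaler = StandardScaler()
X_processed = scaler.fit_transform(X_raw)
print(X_processed.mean(axis=0))  # approximately 0 for each column
print(X_processed.std(axis=0))   # approximately 1 for each column
```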
Best practices in terms of Data Handling – Splitting the dataset (The holdout method) –
• One of the key steps in building a machine learning model is to estimate its performance on data that the model hasn’t seen before.
• To that end, a classic approach is to split the original dataset into a training set (used for training the model) and a test set (used to evaluate the generalization performance of the model), as sketched below.
[Figure: the original dataset, consisting of the input data matrix and the target vector, is split into a train set (used for training the model) and a test set (used for estimating its performance).]
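A minimal sketch of the holdout method, assuming scikit-learn (the data is synthetic and the 70/30 proportion is just an illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic dataset: 100 samples, 3 features, binary targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Holdout method: keep a test set aside to estimate generalization performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
# Train the model on (X_train, y_train); evaluate it once on (X_test, y_test).
```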
Best practices in terms of Data Handling – Splitting the dataset into Train / Validation / Test sets –
• This time, instead of training the model on the train set and testing it on the test set, the original dataset is partitioned into three sets: training, validation and test sets.
• The reason for adding a validation set is that deploying a model always involves tuning its configuration. For instance, we will see that creating a neural network requires choosing the number of layers, the number of neurons for each layer, etc. These choices are called hyperparameters.
• As a result, using the performance on the validation set to change the configuration results in information leaks: information about the validation data leaks into the model.
• As we care about the performance on completely new data, we use the test set as a never-before-seen dataset to evaluate the model.
[Figure: workflow with three sets. For each choice of configuration (a set of hyperparameters), the model is learned on the training set and evaluated on the validation set; the model with the “best” hyperparameters is then assessed on the test set to obtain the final model performance.]
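One way to realize the three-way split is two successive holdout splits; a sketch under the same assumptions as above (synthetic data, illustrative 60/20/20 proportions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# First carve out the test set, then split the rest into training/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# Hyperparameters are tuned on (X_val, y_val); the test set is used only once,
# at the very end, to measure the final model performance.
```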
Best practices in terms of Data Handling – Splitting the dataset (K-fold cross-validation for small non-sequential datasets) –
• The previous splitting strategy suffers from one major flaw: if little data is available, then the validation and test sets contain very few data points.
• As a result, for the same trained model, shuffling the data before splitting it ends up yielding different performance measures. K-fold cross-validation is a way of addressing this issue.
K-fold cross-validation:
• The training set is divided into K folds.
• During K iterations, we use one fold for testing and train the model on the K−1 other folds.
• We end up with a distribution of performance measures (Perf 1, …, Perf K).
• By averaging the estimated performances on each test fold, we end up with the average performance of the model (see the sketch below).
[Figure: at each iteration, one fold serves as the test fold and the remaining folds as training folds; the K performance measures are averaged.]
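A sketch of K-fold cross-validation with scikit-learn (synthetic data; K = 5 and the logistic-regression model are arbitrary choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# K = 5: each fold serves once as the test fold, the model is trained on the
# other K-1 folds, and the K performance measures are averaged.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores)         # distribution of performance measures (Perf 1, ..., Perf K)
print(scores.mean())  # average performance of the model
```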
The Machine Learning Workflow – Learning and Predicting –
• Learning and Model Evaluation:
• During this step, we first need to decide upon the metric used to measure performance.
• The choice of the most suitable evaluation metric will depend on the nature of the problem (classification or regression) and also on the nature of the dataset (balanced or imbalanced).
• The next section will detail the different evaluation metrics for different contexts.
• But how do we know which model to learn? As David Wolpert’s famous no-free-lunch theorem suggests, there is no model that works best for every problem; it is therefore essential to compare a handful of different models with different sets of hyperparameters (see the sketch below).
• Predicting new data:
• After we have selected the « best » model that has been fitted on the training set, we can use the test set to estimate how well it performs on unseen data.
• If we are satisfied with the generalization error, we can use the model to predict new data.
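A sketch of comparing configurations on held-out data, assuming scikit-learn (the grid over the regularization parameter C is a made-up example of a set of hyperparameters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Compare one model family over a small hyperparameter grid via cross-validation;
# in practice several different model families would be compared the same way.
grid = GridSearchCV(LogisticRegression(),
                    param_grid={"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```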
Getting the best fit
A short review of optimisation concepts
Maximum Likelihood Estimation
Position of the problem:
Tossing a Biased Coin
Given a biased coin that comes up heads with some probability greater than 1/2, how can we determine the bias parameter (i.e. the probability of getting heads) using n flips?
The Coronavirus Curve
Given the number of infection cases for n different days in a particular geography, how can we assess whether the distribution of the number of cases per day is flat enough to cope with the healthcare system’s capacity?
Understanding the problem assuming the Statistical Model is known
• In the case of tossing a coin, we assume that this process is well characterised using a probabilistic model.
• Examples:
• Bernoulli model: $X \sim \mathcal{B}(\theta)$ where $\theta \in [0, 1]$:
$$\forall x \in \{0, 1\}, \quad p_\theta(x) = P(X = x) = \theta^x (1-\theta)^{1-x}$$
The Bernoulli distribution models the outcome of a single binary trial (success or failure); it typically models whether flipping a coin one time will result in heads or tails.
• Binomial model:
$$\forall x \in \{0, \dots, n\}, \quad p_\theta(x) = P(X = x) = \binom{n}{x} \theta^x (1-\theta)^{n-x} = \frac{n!}{x!\,(n-x)!}\, \theta^x (1-\theta)^{n-x}$$
The Binomial distribution models the outcome of n trials. More specifically, it models the number of times the output is a success from performing n independent identically distributed Bernoulli trials.
Estimating the fitting parameters using a Maximum Likelihood approach
• We use a collection of observations to learn (or estimate) the parameters of a statistical model.
• These observations $X_1, \dots, X_n$ are called samples in statistics or the training set in Machine Learning.
• Usually, we make the assumption that the variables are i.i.d., which means independent and identically distributed.
• Let $\mathcal{P}_\Theta = \{\, p(x;\theta) \mid \theta \in \Theta \,\}$ be the model and $x$ the observation. We wish to estimate $\theta \in [0, 1]$ from the training set.
• We should note here that, again, this measure of likelihood is not unique, as it deals with the specific case where a good fit carries the same weight as a bad fit.
• In order to get there, we define a likelihood function $L$, which gives an idea of the ability of the retained model to account for the observations:
$$L : \Theta \to \mathbb{R}_+, \quad \theta \mapsto p(x;\theta)$$
Maximum Likelihood Estimation
• The maximum likelihood estimator (MLE) is defined as the $\theta^*$ which maximizes the likelihood:
$$\theta^* = \underset{\theta \in \Theta}{\arg\max}\; p(x;\theta)$$
• When the training set consists of $n$ i.i.d. samples, we optimize the following log-likelihood:
$$\theta^* = \underset{\theta \in \Theta}{\arg\max} \prod_{i=1}^n p(x_i;\theta) = \underset{\theta \in \Theta}{\arg\max} \sum_{i=1}^n \log(p(x_i;\theta))$$
• The MLE:
• does not always exist.
• is not necessarily unique.
• assumes a certain form of utility function.
• We will see the example of the Bernoulli statistical model, where the MLE can be determined in closed form (a numerical sketch follows).
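When no closed form is available, the MLE can be approximated numerically by minimizing the negative log-likelihood. A sketch for the Bernoulli model, assuming NumPy and SciPy (the sample is made up):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up sample of n i.i.d. Bernoulli(theta) draws (1 = success).
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_likelihood(theta):
    # -sum_i log p(x_i; theta) for the Bernoulli model
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)     # numerical MLE
print(x.mean())  # closed-form MLE derived below: theta* = (1/n) sum_i x_i
```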
Maximum Likelihood Estimation for the Bernoulli Model
Tossing a Biased Coin
Given a biased coin that comes up heads with some probability greater than one-half, how can we determine the bias parameter (i.e. the probability of getting heads) using n flips?
• Our training dataset consists of $n$ observations $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{B}(\theta)$.
• We consider getting heads as the success, which means $X_i = 1$ if the $i$-th coin toss results in heads: $P(X_i = 1) = \theta$.
• The log-likelihood is expressed as follows:
$$L(\theta) = \sum_{i=1}^n \log(p(x_i;\theta))$$
Maximum Likelihood Estimation for the Bernoulli Model
• The log-likelihood can be written as follows:
$$L(\theta) = \sum_{i=1}^n \log(p(x_i;\theta)) = \sum_{i=1}^n \log\!\left(\theta^{x_i} (1-\theta)^{1-x_i}\right) = \sum_{i=1}^n \left[ x_i \log(\theta) + (1-x_i)\log(1-\theta) \right] = \left(\sum_{i=1}^n x_i\right)\log(\theta) + \left(n - \sum_{i=1}^n x_i\right)\log(1-\theta)$$
• The log-likelihood is strongly concave, which implies the existence and uniqueness of the MLE.
Maximum Likelihood Estimation for the Bernoulli Model
• Since the log-likelihood is differentiable and strongly concave, its maximizer is its unique stationary point:
$$\frac{dL}{d\theta}(\theta) = 0 \iff \frac{\sum_{i=1}^n x_i}{\theta} = \frac{n - \sum_{i=1}^n x_i}{1-\theta} \iff \theta = \frac{\sum_{i=1}^n x_i}{n}$$
• The optimal parameter is then:
$$\theta^* = \frac{\sum_{i=1}^n x_i}{n}$$
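A quick numerical check of the closed-form estimator with NumPy (the true parameter 0.7 and the sample size are arbitrary):

```python
import numpy as np

# Simulate n flips of a biased coin with true parameter theta = 0.7.
rng = np.random.default_rng(42)
flips = rng.binomial(n=1, p=0.7, size=10_000)

# Closed-form Bernoulli MLE: theta* = (1/n) * sum_i x_i.
theta_mle = flips.mean()
print(theta_mle)  # close to 0.7 for large n, as expected
```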
Maximum Likelihood Estimation for a Categorical Model:
Interactive Session
Evaluation Metrics for classification
• During this lecture, we will focus on binary classification, which consists in classifying the input data into two possible categories.
• For instance, predicting whether the market is going up or down is a binary classification task. So is predicting whether a student is going to pass or fail an exam. In order to make a decision, we have to introduce a “cut-off point” to discriminate between the two predicted classes.
• By convention, one of the two classes is called the positive class (output = 1) and the other one is called the negative class (output = 0).
Classification Metrics:
• In the next section, we will introduce some classification metrics used to assess the performance of a classifier:
• The Accuracy Score.
• The Confusion Matrix.
• The Receiver Operator Characteristic (ROC) graph and the Area Under the Curve (AUC).
Classification Metrics for Binary Classification:
• The easiest way to evaluate a classifier is to determine the Accuracy Score, which is the fraction of the test set correctly classified:
$$\text{Accuracy Score} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
• If the dataset contains 99% positive labels and 1% negative labels, a naive classifier which always predicts the positive label will have a very high accuracy score, which is problematic!
• So, the accuracy score should not be used for imbalanced datasets. Thus, we introduce the following confusion matrix (a code sketch follows the table):
| | Predicted 1 | Predicted 0 |
|---|---|---|
| Actually 1 | True Positive (TP) | False Negative (FN) (Type II error) |
| Actually 0 | False Positive (FP) (Type I error) | True Negative (TN) |
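A sketch with scikit-learn (the targets and predictions are made up):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true targets and predictions for a binary classifier.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

print(accuracy_score(y_true, y_pred))   # fraction of correct predictions
# Note: scikit-learn lays the matrix out as [[TN, FP], [FN, TP]].
print(confusion_matrix(y_true, y_pred))
```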
Classification Metrics for Binary Classification:
Recall
• Given a class, will the classifier detect it?
$$\text{Recall} = \frac{TP}{\text{Actual Positives}} = \frac{TP}{TP + FN}$$
• Recall considers FN as the worst errors.
Precision
• Given a class prediction from the classifier, how likely is it to be correct?
$$\text{Precision} = \frac{TP}{\text{Predicted Positives}} = \frac{TP}{TP + FP}$$
• Precision considers FP as the worst errors.
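Continuing the same made-up example, precision and recall with scikit-learn:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 1])

print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
```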
Classification Metrics for Binary Classification – Part 2 –
• The F1 Score is just the harmonic mean of recall and precision:
$$F1 = 2\,\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
• As shown in the figure below, the F1 score punishes extreme values: Model 2 is better than Model 1 (a small numerical illustration follows).
• The F1 score should be used for imbalanced datasets instead of the accuracy score.
[Figure: bar charts of precision and recall for two models. Model 1 pairs a high precision with a very low recall; Model 2 has balanced precision and recall. In each chart, h is half the harmonic mean of recall and precision, i.e. h = F1/2, and Model 2 achieves the higher h.]
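A tiny numerical illustration of how the harmonic mean punishes extreme precision/recall pairs (the values 0.9/0.1 and 0.5/0.5 are illustrative, not taken from the figure):

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Model 1: extreme values; Model 2: balanced values.
# Both pairs have the same arithmetic mean (0.5), but very different F1.
print(f1(0.9, 0.1))  # ~0.18: the extreme pair is punished
print(f1(0.5, 0.5))  # 0.50: the balanced model wins
```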
An example of an Imbalanced dataset – Part 1 –
• The objective is to create a predictor for the final exam of a class full of exceptional pupils. The grading system consists of 4 different labels: A (the best), B (less good), C (bad), D (the worst).
• As we are dealing with gifted students, the distribution of the actual results is the following:

| A | B | C | D |
|---|---|---|---|
| 200 students | 10 students | 10 students | 10 students |

• We want to compare two models:
• The first model is a very optimistic model. It predicts a lot of A labels.
• The second model has a more spread-out distribution over the different labels.
• The first model obviously beats the second one in terms of accuracy (since it predicts a lot of A labels and the dataset contains 87% A labels).
• But we know that the accuracy score is not a good evaluation metric for this kind of problem.
• Thus, we will compare the two models based on their confusion matrices and F1 scores.
An example of an Imbalanced dataset – Part 2 – Confusion Matrix for each Model
In both matrices, rows are the actual values and columns are the predictions.
First Model (Optimistic one):

| Actual \ Predicted | A | B | C | D |
|---|---|---|---|---|
| A | 198 | 2 | 0 | 0 |
| B | 7 | 1 | 0 | 2 |
| C | 0 | 8 | 1 | 1 |
| D | 2 | 3 | 4 | 1 |

Accuracy Score = Correct Predictions / Total Predictions = 0.87

Second Model (Pessimistic one):

| Actual \ Predicted | A | B | C | D |
|---|---|---|---|---|
| A | 100 | 80 | 10 | 10 |
| B | 0 | 9 | 0 | 1 |
| C | 0 | 1 | 8 | 1 |
| D | 0 | 1 | 0 | 9 |

Accuracy Score = Correct Predictions / Total Predictions = 0.54
An example of an Imbalanced dataset – Part 3 – Precision for each Model
$$\text{Precision} = \frac{TP}{TP + FP}$$
Per class, reading TP from the diagonal and FP from the rest of the corresponding prediction column:
• First Model: A: TP = 198, FP = 9; B: TP = 1, FP = 13; C: TP = 1, FP = 4; D: TP = 1, FP = 3.
• Second Model: A: TP = 100, FP = 0; B: TP = 9, FP = 82; C: TP = 8, FP = 10; D: TP = 9, FP = 12.
$$\text{Average Precision} = \frac{\text{Precision A} + \text{Precision B} + \text{Precision C} + \text{Precision D}}{4}$$
Average Precision (First Model) = 0.36
Average Precision (Second Model) = 0.49
An example of an Imbalanced dataset – Part 4 – Recall for each Model
$$\text{Recall} = \frac{TP}{TP + FN}$$
Per class, reading TP from the diagonal and FN from the rest of the corresponding actual-value row:
• First Model: A: TP = 198, FN = 2; B: TP = 1, FN = 9; C: TP = 1, FN = 9; D: TP = 1, FN = 9.
• Second Model: A: TP = 100, FN = 100; B: TP = 9, FN = 1; C: TP = 8, FN = 2; D: TP = 9, FN = 1.
Average Recall (First Model) = 0.32
Average Recall (Second Model) = 0.77
An example of an Imbalanced dataset – Part 5 – Comparing the two Models
Reminder:
$$F1 = 2\,\frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad h = \frac{F1}{2}$$
[Figure: precision and recall bars for each model, with h marking half their harmonic mean; the First Model’s bars are very unbalanced, while the Second Model’s sit around 0.5.]

| | Accuracy Score | F1 Score |
|---|---|---|
| First Model | 0.87 | 0.33 |
| Second Model | 0.54 | 0.60 |

By calculating the F1 score, we realize that the second model (the pessimistic one) has a better performance than the first one (the optimistic one) on this imbalanced dataset (a code sketch reproducing these numbers follows).
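A sketch reproducing the example’s numbers (up to rounding) directly from the two confusion matrices with NumPy:

```python
import numpy as np

def macro_metrics(cm):
    """Average precision and recall over the classes of a confusion matrix
    whose rows are actual values and whose columns are predictions."""
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)  # column sums = predicted counts per class
    recall = tp / cm.sum(axis=1)     # row sums = actual counts per class
    return precision.mean(), recall.mean()

first  = np.array([[198, 2, 0, 0], [7, 1, 0, 2], [0, 8, 1, 1], [2, 3, 4, 1]])
second = np.array([[100, 80, 10, 10], [0, 9, 0, 1], [0, 1, 8, 1], [0, 1, 0, 9]])

for name, cm in [("First (optimistic)", first), ("Second (pessimistic)", second)]:
    p, r = macro_metrics(cm)
    accuracy = np.diag(cm).sum() / cm.sum()
    f1 = 2 * p * r / (p + r)
    print(name, round(accuracy, 2), round(p, 2), round(r, 2), round(f1, 2))
# Reproduces the slide's accuracy / average precision / average recall / F1
# values up to rounding.
```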
The Receiver Operator Characteristic (ROC) graph
• During the prediction phase (after the training process), most classifiers output more precise information than just the class label for each new data point $x^*$: they also output the probability $p$ that this data point belongs to the positive class and, a fortiori, the probability of belonging to the negative class (which is $1 - p$).
[Figure: the test-set features are fed to the trained classifier, which outputs continuous predictions (e.g. 0.99, 0.23, …, 0.47) to be compared with the true targets (1, 0, …, 1).]
• How do we go from continuous predictions to discrete ones?
• We will explain this using the example of Logistic Regression.
The Receiver Operator Characteristic (ROC) graph
• Logistic regression predicts the conditional distribution of the label $Y^* \in \{0, 1\}$ given the feature vector $x^* \in \mathbb{R}^D$ as follows:
$$P(Y^* = 1 \mid X^* = x^*) = \sigma(w^{*T} x^*)$$
where $\sigma$ refers to the sigmoid function $z \mapsto \frac{1}{1+e^{-z}}$, and $w^*$ is determined by training the model on the training set.
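A sketch of the sigmoid and of turning probabilities into labels (the weights and test points are made up; in practice $w^*$ comes from training):

```python
import numpy as np

def sigmoid(z):
    # Sigmoid function: maps a score z = w*^T x to a probability in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Made-up trained weights (D = 3) and two test feature vectors.
w_star = np.array([0.8, -0.4, 0.2])
X_test = np.array([[2.0, 0.5, 1.0],
                   [-1.0, 2.0, 0.0]])

p = sigmoid(X_test @ w_star)   # P(Y* = 1 | X* = x*) for each test point
print(p)
print((p >= 0.5).astype(int))  # class labels at an example 0.5 threshold
```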
The Receiver Operator Characteristic (ROC) graph
• Let’s consider 9 pairs $(x_i, y_i)_{i \in \{1,\dots,9\}}$ in our test set (the blue points are associated with the positive class and the red ones with the negative class).
• As the model has been trained, we can generate the predictions: $\forall i \in \{1,\dots,9\},\; p_i = \sigma(w^{*T} x_i)$.
• To turn the predictions into class labels, we need to define a threshold above which we consider the points as belonging to the positive class.
[Figure: the 9 test points plotted along the score axis $z_i = w^{*T} x_i$, with an example threshold separating the predicted positive class from the predicted negative class.]
The Receiver Operator Characteristic (ROC) graph
[Figure: the 2×2 confusion matrix (actual vs. predicted) recomputed for each of the five thresholds; as the threshold is lowered, more points are predicted positive, so the TP and FP counts grow.]
The Receiver Operator Characteristic (ROC) graph
• In order to plot the ROC graph, we first need to define the True Positive Rate and the False Positive Rate:
$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$
• We can then plot the True Positive Rate (TPR) vs the False Positive Rate (FPR) over all the thresholds. The different thresholds are denoted $\tau_1, \dots, \tau_5$:

| | τ1 | τ2 | τ3 | τ4 | τ5 |
|---|---|---|---|---|---|
| TPR | 0.6 | 0.6 | 0.8 | 1 | 1 |
| FPR | 0 | 0.25 | 0.25 | 0.25 | 0.5 |

[Figure: ROC graph of TPR against FPR. The origin corresponds to always predicting the negative class, the point (1, 1) to always predicting the positive class, and curves towards the top-left indicate better performance.]
The Area Under the Curve (AUC)
• The area under the ROC curve (called AUC) evaluates the model at all possible cut-off points.
• The AUC, by summarizing the ROC curve in one measure, gives better insights about how well the classifier is able to separate the positive and negative classes.
• If the blue ROC curve represents another model (Random Forest, for instance), we can see from the figure that the red ROC curve (representing Logistic Regression) is better.
• The higher the AUC, the better the model.
[Figure: two ROC curves in the TPR/FPR plane; the red curve dominates the blue one, so its AUC is larger.]
The Gini coefficient (Gini)
Let us consider the case of a default model, where we aim to identify defaulting firms from features.
• So far we have considered the AUC. A subset of it is useful to compute the Gini coefficient.
• Gini = (Area F) / (Area E); equivalently, Gini = 2 × AUC − 1.
• The upper bound is 100%.
• Models with better ranking produce higher Gini coefficients.
• NOTE: these Gini coefficients, like the ROC, are dataset-specific and cannot be compared across datasets.
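A sketch with scikit-learn; the nine scores below are assumptions chosen to be consistent with the slide’s threshold table (5 positives, 4 negatives), and the Gini coefficient is derived from the AUC:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Assumed scores for 9 test points (5 positives, 4 negatives), consistent
# with the TPR/FPR table above.
y_true  = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0])
y_score = np.array([0.95, 0.90, 0.85, 0.70, 0.65, 0.55, 0.45, 0.30, 0.20])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1  # Gini coefficient derived from the AUC
print(list(zip(np.round(fpr, 2), np.round(tpr, 2))))  # points of the ROC curve
print(round(auc, 2), round(gini, 2))                  # 0.9 and 0.8 here
```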
Going beyond simple Logit models
• Logistic regression predicts the conditional distribution of the label $Y^* \in \{0, 1\}$ given the feature vector $x^* \in \mathbb{R}^D$ as follows:
$$P(Y^* = 1 \mid X^* = x^*) = \sigma(w^{*T} x^*)$$
• Note that instead of looking for simple weights $w^*$, it is possible to think of an unknown function we wish to uncover; given the size of our dataset, we limit ourselves to the second order of its Taylor expansion. While still considering a logistic transformation, we then have to estimate the parameters of a quadratic function, hence the denomination “Quadratic Logit”.
• We could alternatively approximate this unknown function using Gaussian kernels with spread centres. In this case, we talk about a “Kernel Logit”.
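One way to realize a Quadratic Logit with scikit-learn is to expand the features to second order before the logistic regression; a sketch on synthetic data (not the lecture’s dataset):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # non-linear ground truth

# "Quadratic Logit": logistic transformation of a second-order expansion
# of the features (all degree-2 monomials).
quad_logit = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
quad_logit.fit(X, y)
print(quad_logit.score(X, y))  # a plain (linear) logit would do poorly here
```

A Kernel Logit could be sketched the same way by swapping the polynomial expansion for Gaussian (RBF) feature maps.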
Comparing their Gini coefficients on an identical dataset
The Quadratic Logit model looks better, but…
Empirically choosing between models: the lending bank case
We have here 4 competing classification models. A bank has built its own model (SME model) and wishes to compare it with various logit models. They all look to identify defaulting firms considering the same sample and the same features. We would like to choose, at the same time, an optimal model and a relevant cut-off to identify defaulting firms.
In this case, a firm labelled 1 is a firm in default.
Type 2 error (false negative) = the firm is in default (1) but is predicted as a non-defaulter (0).
Caveat: missing a “true defaulter” is much more important from a P&L perspective than assuming a firm will default whereas it will not. The choice of the cut-off point will depend on the risk aversion of the lending bank! It is likely that the logit model will be preferred, although it is not the best.
Bearing in mind the implicit assumptions made to find the ‘best model’
When we look to estimate and fit a model using a traditional Maximum Likelihood approach, looking to optimize
$$\theta^* = \underset{\theta \in \Theta}{\arg\max} \sum_{i=1}^n \log(p(x_i;\theta)),$$
we in fact make two assumptions:
1. The utility function of the user of the model is logarithmic. It is not a bad assumption, but far from a perfect one.
2. Successes and errors carry the same weight, which is not always the case, as seen on the previous slide.
The consequence is that different fitted models optimised through an MLE process may not lead to an optimal decision process for people having specific preferences.
A model may be preferable to another due to the size of the dataset available
Generalizing to Multiclass Classification
• The generalization to the multiclass setting (i.e. classifying into one of K categories, with K greater than 2) can easily be achieved by the One Over All (OOA) approach.
• The OOA approach consists in turning the multiclass classification problem into K binary classification ones as follows:
• For each class k among the K possible classes, we can create K new datasets by keeping the same input data X and turning the target data Y into a binary one: positive for the class k and negative for all the other classes.
• We can then train K binary classifiers $(C_k)_{k \in \{1,\dots,K\}}$, where each $C_k$ is associated with the class k.
• At prediction time, for each new sample, the prediction is given by selecting randomly one of the classes k for which the classifier $C_k$ predicted a positive output.
• Let’s take an example with K = 3:
[Figure: Training: the original training targets Y, with values in {1, 2, 3}, are converted into three binary label vectors, one per classifier C1, C2 and C3. Prediction: for a new point x*, C1 predicts 1, C2 predicts 0 and C3 predicts 1; a random selection among the positive classes yields the final prediction, Class 3. A code sketch follows.]
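A sketch of the OOA scheme with scikit-learn, including the random tie-break described above (synthetic data; the fallback when no classifier fires is an added convention, not from the slide):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = rng.integers(1, 4, size=150)  # classes 1, 2, 3

# One binary classifier per class: positive = that class, negative = the rest.
classifiers = {k: LogisticRegression().fit(X, (y == k).astype(int))
               for k in (1, 2, 3)}

def predict_ooa(x_new):
    x_new = x_new.reshape(1, -1)
    # Classes whose binary classifier predicts a positive output.
    positives = [k for k, clf in classifiers.items() if clf.predict(x_new)[0] == 1]
    if positives:
        return rng.choice(positives)  # random selection among positive classes
    # Added convention: if no classifier fires, take the most confident one.
    return max(classifiers, key=lambda k: classifiers[k].predict_proba(x_new)[0, 1])

print(predict_ooa(np.array([0.3, -1.2])))
```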
Take away
In the end, in the case of a supervised learning approach, fitting a model always boils down to optimising an objective function.
There are many ways to define an objective function, and it will always depend on the error measure used.
Below, we list some of them in a non-exhaustive manner. It is important to realise that there is no unique relevant measure.
• Quality of fit / loss measures: there are in fact many different loss functions, such as mean square error (MSE), mean absolute error (MAE), R², Kullback-Leibler divergence (KL), Signal-to-Noise Ratio (SNR).
⇒ They assume an implicit utility function (for instance, the difference between the MSE and the MAE is related to the relative importance of small vs. large losses).
⇒ Note that some measures are related to a group of observations irrespective of their underlying distribution (MSE, MAE, R², SNR) while other measures are related to distributions (KL).
• Classification measures: ROC/AUC, Accuracy, Precision/Recall, Hit Ratio.
⇒ These measures depend on the dataset used and are not comparable across datasets.
⇒ Some of these measures enable the modeller to use differentiated rewards/penalties for successes and failures with differentiated cost functions. This is again equivalent to assuming a utility function.
Go to the following link and take Quiz 1: https://mlfbg.github.io/MachineLearningInFinance/
Programming Session