
Machine Learning in Finance
Lecture 2
Introduction to Supervised Learning
Arnaud de Servigny & Hachem Madmoun

Outline
• Introduction to the principles of Supervised Learning
• Linear Regression
• Logistic Regression
• Trees and Random Forests
• Programming Session
• Review of Python Programming
• Introducing the libraries: Numpy – Matplotlib – Pandas – TensorFlow

Introduction to the principles of Supervised Learning

Categorisation of Supervised models
• In some instances, all explanatory features are considered on the same footing. This is typically the case with regressions, and related ones such as Logit (parametric models).
• Alternatively, the explanatory features may be ordered in a successive manner in order to refine the selection effort (non-parametric models).
• In the first instance, fitting a model means finding the optimal weights applied to each feature. In the second instance, finding the optimal model means ordering the most relevant features and finding the best cut-offs at each step.

Supervised Learning – Parametric Models

Setting up the Objective
• Supervised Learning is the process of learning a function which maps input data to an output based on several input-output pairs. Let's detail the process:
• First, we have a dataset of pairs {features, target}: $\mathcal{D} = \{(X_i, Y_i)\}_{1 \le i \le n}$ over $\mathcal{X} \times \mathcal{Y}$.
• The pairs $(X_i, Y_i)_{1 \le i \le n}$ are assumed to be independent and identically distributed (i.i.d.) following an unknown distribution. It is important to mention here that we assume no sequentiality in the data.
• Example:
  • Let's consider this small dataset: we try to predict whether a student will fail or pass the final exam based on some feature values.
  • $Y_i = 1$ if the student passes, $Y_i = 0$ if he fails.
  • For each $X_i$, the first coordinate represents the number of hours spent on the course, the second coordinate is the average intermediary quiz mark and the third coordinate is the number of hours spent on the coursework.
  • Typically: $\mathcal{X} = \mathbb{R}^3$ and $\mathcal{Y} = \{0, 1\}$.

  $X_1 = [60, 18, 30]$, $Y_1 = 1$
  $X_2 = [10, 10, 10]$, $Y_2 = 1$
  $X_3 = [07, 05, 08]$, $Y_3 = 0$

Setting up the Objective
• A Supervised Algorithm is an algorithm that aims at building a predictor (i.e. a function $\phi : \mathcal{X} \to \mathcal{Y}$) which minimizes an error, based on the dataset.

  $X = \begin{pmatrix} 60 & 18 & 30 \\ 10 & 10 & 10 \\ 07 & 05 & 08 \\ 45 & 17 & 20 \\ 80 & 20 & 30 \\ 03 & 04 & 12 \\ 18 & 09 & 20 \\ \vdots & \vdots & \vdots \end{pmatrix} \qquad Y = \begin{pmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \\ 0 \\ 1 \\ \vdots \end{pmatrix} \qquad \mathcal{X} = \mathbb{R}^3$

• In the previous example, our objective was to predict a discrete value: pass or fail (1 or 0). This supervised task is called classification.
• We can also try to predict a continuous value: the final exam mark for instance. In that case, the task is called regression.
• $\mathcal{Y}$ discrete (typically $\{0, 1\}$): classification. $\mathcal{Y}$ continuous (typically $\mathbb{R}$): regression.

Setting up the Objective
• To define the error, we first need to define a loss function, i.e. a function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ which measures the "distance" between the output of the predictor and the true labels $(Y_i)_{1 \le i \le n}$.
• For the loss function, we usually choose, for all pairs (output, true label) $(y, y') \in \mathcal{Y} \times \mathcal{Y}$:

  $\ell_{\mathrm{Regression}}(y, y') = \|y - y'\|^2 \qquad \ell_{\mathrm{Classification}}(y, y') = \mathbb{1}_{y \neq y'} = \begin{cases} 1 & \text{if } y \neq y' \\ 0 & \text{if } y = y' \end{cases}$

• We then define the error / risk associated with the predictor $\phi$ as the expected loss under the unknown data distribution $P$:

  $R_P(\phi) = \mathbb{E}_P[\ell(Y, \phi(X))]$

• Our objective is to find the optimal predictor among all the possible functions $\mathcal{F}(\mathcal{X}, \mathcal{Y})$ defined by the modeler (e.g. Logit, linear regression, etc.):

  Finding $\phi_P^* = \arg\min_{\phi \in \mathcal{F}(\mathcal{X},\mathcal{Y})} \mathbb{E}_P[\ell(Y, \phi(X))] = \arg\min_{\phi \in \mathcal{F}(\mathcal{X},\mathcal{Y})} R_P(\phi)$

• Since $P$ is unknown, we optimize (minimize) the empirical risk $R_n(\phi)$:

  $R_n(\phi) = \frac{1}{n} \sum_{i=1}^{n} \ell(Y_i, \phi(X_i))$

Linear Regression

Introduction:
• Let us start with the simplest regression model: Linear Regression.
• Consider the following (fake) dataset representing the salary (on the x axis) of some (fake) employees according to the number of years of experience (on the y axis).
• From the plot, we can see that the salary and the experience variables exhibit a clear linear dependence.

Setting up the Objective:
• We would like to find a way to define the red line from the pairs (experience, salary) represented by the blue points.
• In that way, we could assign an estimated salary to each value of the experience variable.

Linear Regression: a Mathematical Perspective
• Mathematically speaking, it consists in modeling the conditional distribution of $Y_i \in \mathbb{R}$ given $X_i = x_i \in \mathbb{R}^D$, parametrized by the set of parameters $\theta = (w, b)$, as follows:

  $\forall i \in \{1,\dots,N\} \quad Y_i \mid X_i = x_i \sim \mathcal{N}(w^T x_i + b,\, \sigma^2)$ with $w \in \mathbb{R}^D$ and $b \in \mathbb{R}$

• In other words,

  $\forall i \in \{1,\dots,N\} \quad Y_i = w^T X_i + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$

• Now that the model is described, we will have to maximize the likelihood of the dataset in order to obtain the optimal parameters.

The training process :
• Let $\{(X_i, Y_i)\}_{1 \le i \le N}$ be the dataset used for training.
• The model:

  $\forall i \in \{1,\dots,N\} \quad Y_i = w^T X_i + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$

• The parameter $b$ can be absorbed by appending a constant 1 to each input of the training set:

  $\forall i \in \{1,\dots,N\} \quad Y_i = \hat{w}^T \hat{X}_i + \epsilon$ with $\hat{X}_i = \begin{pmatrix} X_i \\ 1 \end{pmatrix}$, $\hat{w} \in \mathbb{R}^{D+1}$ and $\epsilon \sim \mathcal{N}(0, \sigma^2)$

• Since the dataset is i.i.d., the likelihood of the training set can be expressed as follows; the training process consists in maximizing the log-likelihood w.r.t. the parameters $\hat{w} \in \mathbb{R}^{D+1}$:

  $L(\hat{w}) = \prod_{i=1}^{N} p(y_i \mid x_i; \hat{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(y_i - \hat{w}^T x_i)^2}{2\sigma^2}\right)$

The training process :
• In optimization matters, we usually prefer to minimize functions instead of maximizing them.
• Thus, we transform the likelihood maximization problem into the equivalent cost minimization problem, where the cost is the following negative log-likelihood:

  $-\log L(\hat{w}) = -\sum_{i=1}^{N} \log p(y_i \mid x_i) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - \hat{w}^T x_i)^2$

• The training problem can then be written as the following equivalent minimization problem:

  $\min_{\hat{w} \in \mathbb{R}^{D+1}} \underbrace{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{w}^T x_i)^2}_{J(\hat{w})}$

Matrix Notation and Optimization :
• It is computationally more efficient to write the optimization problem in matrix notation.
• To that end, let's introduce the following matrix notation for the training data $\{(X_i, Y_i)\}_{1 \le i \le N}$:

  $\hat{X} = \begin{pmatrix} x_1^T & 1 \\ x_2^T & 1 \\ \vdots & \vdots \\ x_N^T & 1 \end{pmatrix} \in \mathbb{R}^{N \times (D+1)} \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \in \mathbb{R}^N$

• Then, we deduce the following prediction vector $P$:

  $P = \hat{X}\hat{w} = \begin{pmatrix} \hat{w}^T x_1 \\ \hat{w}^T x_2 \\ \vdots \\ \hat{w}^T x_N \end{pmatrix} \in \mathbb{R}^N$

• If we denote by $\|\cdot\|^2$ the squared $L_2$ norm on $\mathbb{R}^N$ (i.e. $\forall z \in \mathbb{R}^N$, $\|z\|^2 = z^T z$), we can then express the cost function $J$ which we wish to minimize as follows:

  $J(\hat{w}) = \frac{1}{N}\|\hat{X}\hat{w} - Y\|^2$

Using a Gradient Descent for Optimization
• We will use a Gradient Descent to find $\hat{w}^*$ as follows:
  • Initialize $\hat{w}_0$ randomly.
  • Fix a number of iterations $K$ and a learning rate $\eta$, and repeat $K$ times:

    $\hat{w}_{k+1} \leftarrow \hat{w}_k - \eta\, \nabla_{\hat{w}} J(\hat{w}_k)$
Code for Gradient Descent using Numpy
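The slide's original code is not reproduced here; below is a minimal NumPy sketch of what such an implementation might look like (function and variable names are illustrative, not the course's actual code):

```python
import numpy as np

def gradient_descent_linreg(X, y, K=1000, lr=0.01):
    """Minimize J(w) = (1/N) * ||X_hat w - y||^2 by gradient descent."""
    N = X.shape[0]
    X_hat = np.hstack([X, np.ones((N, 1))])      # append a constant column for the bias b
    w = np.random.randn(X_hat.shape[1])          # random initialization w_0
    for _ in range(K):
        grad = (2.0 / N) * X_hat.T @ (X_hat @ w - y)  # gradient of J at w_k
        w -= lr * grad                           # w_{k+1} <- w_k - eta * grad
    return w
```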

Logistic Regression

Introduction:
• The Logistic Regression is one of the easiest classification models to implement. It also performs very well on linearly separable classes.
• We call decision boundary the hypersurface separating the space of input data into two subsets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class.
• In the case of a Logistic Regression, the decision boundary is a hyperplane.
• The following scatter plot of the public Iris dataset shows a linear decision boundary associated with Logistic Regression.

Presenting the logit function :
• Before introducing the Logistic Regression as a probabilistic model, we should first introduce the odds ratio.
• The odds ratio (in favor of a particular event) can be written as $\frac{\theta}{1-\theta}$, where $\theta$ stands for the probability of the positive event. A high odds ratio means a very probable outcome.
• We can further define the log-odds, also called the logit function $[0, 1] \to \mathbb{R}$, as the logarithm of the odds ratio:

  $\mathrm{logit}(\theta) = \log\!\left(\frac{\theta}{1-\theta}\right)$

• The fact that we consider a log outcome means that poor outcomes matter more to us than positive outcomes. It is a strong preference choice. It also means that when the logit function outputs a high level, the underlying probability is even higher in reality.
• In the context of a Logistic Regression, we assume a linear relationship between the feature values and the log-odds of the conditional probability that a particular sample belongs to the positive class given its features.
• Mathematically speaking:

  $\mathrm{logit}(P(Y_i = 1 \mid X_i = x_i)) = w^T x_i$

The sigmoid function and the Logistic Regression Model
• By inverting the logit function, we obtain:

  $P(Y_i = 1 \mid X_i = x_i) = \sigma(w^T x_i)$

  where $\sigma$ refers to the sigmoid function $z \mapsto \frac{1}{1 + e^{-z}}$.

• In other words, a Logistic Regression consists in modeling the conditional distribution of $Y_i \in \{0, 1\}$ given $X_i = x_i \in \mathbb{R}^D$, parametrized by $w$. As the probability of the outcome $Y$ being equal to 1 is the sigmoid function, while that of $Y$ being equal to 0 is (1 - sigmoid function), we are in the case of a Bernoulli distribution applied to the sigmoid function:

  $\forall i \in \{1,\dots,N\} \quad Y_i \mid X_i = x_i \sim \mathcal{B}(\sigma(w^T x_i))$

  where $\mathcal{B}$ stands for the Bernoulli distribution.

The prediction phase after training:
• The training process consists in finding the parameter $w^*$ that maximizes the likelihood of the dataset.
• After training the model, given a new data point $x^* \in \mathbb{R}^D$, we can determine the probability of it being in the positive class:

  $P(Y^* = 1 \mid X^* = x^*) = \sigma(w^{*T} x^*)$

The Training Process: finding the optimal w
• Let $\{(X_i, Y_i)\}_{1 \le i \le N}$ be the dataset used for training.
• The model: $\forall i \in \{1,\dots,N\} \quad Y_i \mid X_i = x_i \sim \mathcal{B}(\sigma(w^T x_i))$
• The likelihood of the dataset can be expressed as follows:

  $L(w) = \prod_{i=1}^{N} p(Y_i = y_i \mid X_i = x_i; w) = \prod_{i=1}^{N} \sigma(w^T x_i)^{y_i}\,(1 - \sigma(w^T x_i))^{1 - y_i}$

• Hence, the normalized negative log-likelihood (or cost function) is:

  $J(w) = -\frac{1}{N}\log L(w) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i)\log(1 - \sigma(w^T x_i)) \right]$

• The training problem can then be written as the following minimization problem:

  $\min_{w \in \mathbb{R}^D} \underbrace{-\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i)\log(1 - \sigma(w^T x_i)) \right]}_{J(w)}$

Using a Gradient Descent for Optimization
• We will use Gradient Descent to find $w^*$ as follows:
  • Initialize $w_0$ randomly.
  • Fix a number of iterations $K$ and a learning rate $\eta$, and repeat $K$ times:

    $w_{k+1} \leftarrow w_k - \eta\, \nabla_w J(w_k)$
Code for Gradient Descent using Numpy
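Again, the slide's original code is not reproduced; a minimal NumPy sketch under the same caveats (illustrative names, using the gradient expression derived in the appendix):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logreg(X, y, K=1000, lr=0.1):
    """Minimize the logistic negative log-likelihood J(w) by gradient descent."""
    N, D = X.shape
    w = np.random.randn(D)                       # random initialization w_0
    for _ in range(K):
        grad = X.T @ (sigmoid(X @ w) - y) / N    # (1/N) sum_i (sigma(w^T x_i) - y_i) x_i
        w -= lr * grad
    return w
```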

Supervised Learning – Non-Parametric Models

Decision Trees and Random Forest

Introduction:
• The Decision Tree (DT) algorithm is an attractive model if we care about interpretability.
• As the name decision tree suggests, we can think of this model as breaking down our data by making a decision based on asking a series of questions.
[Figure: decision regions in the $(X_1, X_2)$ plane, with cut-offs at $X_1 = 0.5$ and $X_2 = 0.7$, alongside the corresponding decision tree graph: first split on $X_1 > 0.5$ (Yes/No), then on $X_2 > 0.7$ (Yes/No).]

A high level description of the algorithm
• The DT algorithm is basically just a bunch of nested if-statements on the input features (also called attributes) in the training dataset (see the sketch after this list).
• The decision algorithm:
  • We start at the tree root (with the whole dataset).
  • Then we split the dataset on the attribute that results in the largest Information Gain (IG).
  • We iterate the splitting procedure at each child node until the leaves are pure (which means that the samples at the leaves belong to the same class).
• A very deep tree is prone to overfitting. To avoid that, we set a limit for the maximal depth of the tree.
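As a toy illustration of the "nested if-statements" view, the tree from the earlier figure could be written as follows (the class returned at each leaf is an assumption, since the figure does not show the leaf labels):

```python
def predict(x1, x2):
    """The toy tree from the earlier figure as nested if-statements."""
    if x1 > 0.5:
        return 1          # hypothetical leaf label
    elif x2 > 0.7:
        return 1          # hypothetical leaf label
    else:
        return 0          # hypothetical leaf label
```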

Building Decision Trees
• First, we need to define an objective function that we want to optimize (Information Gain).
• Then, at each iteration, two challenges arise when trying to choose the best split:
  • How do we choose the best attribute responsible for the split?
  • How do we choose the threshold when splitting based on the "best attribute"?
[Figure: the decision tree from the previous slide, highlighting the thresholds (0.5 on $X_1$, 0.7 on $X_2$) chosen at each split.]

Information Theory: Entropy – Part 1 –
• Let $Y$ be a random variable taking values in the finite set $\mathcal{Y}$.
• Let's denote $p(y) = P(Y = y)$.
• In information theory, the quantity $I(y) = \log_2\!\left(\frac{1}{p(y)}\right)$ can be interpreted as a quantity of information carried by the occurrence of $y$ (sometimes called self-information).
• So, the entropy is defined as the expected amount of information of the random variable $Y$:

  $H(Y) = \mathbb{E}_{p(y)}[I(Y)] = -\sum_{y \in \mathcal{Y}} p(y)\log_2(p(y))$

Information Theory: Entropy – Part 2 –
• Let's take an example of a Bernoulli distribution $Y \sim \mathcal{B}(p)$.
• The entropy of $Y$ as a function of $p$ is:

  $H(p) = -p\log_2(p) - (1 - p)\log_2(1 - p)$

• $p = 0.5$ yields maximum entropy (a quick numerical check follows below).
• So, the entropy is a measure of how much information we get from finding out the value of the random variable.
• We have the following inequalities:

  $H(Y) \ge 0$, with equality if $Y$ is constant a.s.
  $H(Y) \le \log_2(\mathrm{Card}(\mathcal{Y}))$
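A small NumPy sketch checking these formulas numerically (the helper name is mine, not a library's):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given by its probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

# Bernoulli entropy H(p): maximal (= 1 bit) at p = 0.5
for p in [0.1, 0.5, 0.9]:
    print(p, entropy([p, 1 - p]))
```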

Information Gain – Definition –
• We define the information gain of a split on the attribute $c$ as follows (a code sketch follows this list):

  $IG(D_p, c) = I(D_p) - \frac{N_{\mathrm{left}}}{N_p} I(D_{\mathrm{left}}) - \frac{N_{\mathrm{right}}}{N_p} I(D_{\mathrm{right}})$

  • $D_p$ refers to the dataset of the parent node.
  • $D_{\mathrm{left}}$ and $D_{\mathrm{right}}$ are the datasets of the left and right child nodes. (For simplicity, most libraries only implement binary decision trees.)
  • $I$ is the impurity measure.
  • $N_p$ is the total number of samples at the parent node.
  • $N_{\mathrm{left}}$ and $N_{\mathrm{right}}$ are the total numbers of samples at the left and right child nodes.
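A minimal sketch of this formula in code, assuming an entropy impurity computed from class labels (the helper names are mine, not a library's):

```python
import numpy as np

def entropy_impurity(y):
    """Entropy of the empirical class distribution of the label array y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(y_parent, y_left, y_right, impurity=entropy_impurity):
    """IG(D_p, c) = I(D_p) - N_left/N_p * I(D_left) - N_right/N_p * I(D_right)."""
    n_p = len(y_parent)
    return (impurity(y_parent)
            - len(y_left) / n_p * impurity(y_left)
            - len(y_right) / n_p * impurity(y_right))
```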

Different impurity measures
• The three impurity measures or splitting criteria that are commonly used in binary decision trees are Gini impurity ($I_G$), entropy ($I_H$), and the classification error ($I_E$).
• Let's denote $p_{mk}$ the proportion of the samples that belong to class $m \in \{1, \dots, M\}$ at a particular node $k$. We define the above-stated impurities as follows:

  $I_G(k) = \sum_{m=1}^{M} p_{mk}(1 - p_{mk})$
  $I_H(k) = -\sum_{m=1}^{M} p_{mk}\log_2(p_{mk})$
  $I_E(k) = 1 - \max_{1 \le m \le M} p_{mk}$

A simple example – Description –
• Let's consider the following example:
• We want to predict whether a person is sick or not ($Y = 1$ if it's the case and $Y = 0$ if not) based on the following 4 features:
  • $X_1 \in \{0, 1\}$: whether the person was in contact with someone or not during the last month ($X_1 = 1$ if it's the case and $X_1 = 0$ if not).
  • $X_2 \in [0, 100]$: the number of times the person has touched their face per day.
  • $X_3 \in [0, 10]$: the number of times the person has washed their hands per day.
  • $X_4 \in \{0, 1\}$: whether the person is a man ($X_4 = 0$) or a woman ($X_4 = 1$).

A simple example – Dataset –
• We end up with the following small dataset of 10 samples.
In contact (X1) | Touching face (X2) | Washing hands (X3) | Man/Woman (X4) | Y
0 | 98 | 0 | 0 | 0
0 | 87 | 1 | 0 | 0
1 | 93 | 9 | 0 | 1
1 | 97 | 4 | 1 | 1
1 | 81 | 7 | 0 | 1
1 | 42 | 8 | 1 | 0
1 | 28 | 3 | 0 | 0
1 | 12 | 1 | 0 | 1
1 |  8 | 0 | 0 | 1
1 |  2 | 9 | 0 | 0
• Let’susetheentropyastheimpuritymeasure.
• We have at the root:
I(Dp) = H(Y ) = 1 log2(1) 1 log2(1) = 1

A simple example – split on « man vs woman » column –
• Let's split on the attribute « man/woman » ($X_4$).

$X_4 = 0$:

X1 | X2 | X3 | X4 | Y
0 | 98 | 0 | 0 | 0
0 | 87 | 1 | 0 | 0
1 | 93 | 9 | 0 | 1
1 | 81 | 7 | 0 | 1
1 | 28 | 3 | 0 | 0
1 | 12 | 1 | 0 | 1
1 |  8 | 0 | 0 | 1
1 |  2 | 9 | 0 | 0

$I(D_{\mathrm{left}}) = H(Y \mid X_4 = 0) = 1$

$X_4 = 1$:

X1 | X2 | X3 | X4 | Y
1 | 97 | 4 | 1 | 1
1 | 42 | 8 | 1 | 0

$I(D_{\mathrm{right}}) = H(Y \mid X_4 = 1) = 1$

A simple example – split on « man vs woman » column –
• If we split on the attribute « man/woman » ($X_4$), we obtain the following information gain:

  $IG(D_p, X_4) = I(D_p) - \frac{N_{\mathrm{left}}}{N_p} I(D_{\mathrm{left}}) - \frac{N_{\mathrm{right}}}{N_p} I(D_{\mathrm{right}}) = 1 - \frac{8}{10} \times 1 - \frac{2}{10} \times 1 = 0$

• We conclude that there is absolutely nothing to gain from splitting on the « man/woman » attribute. Men and women seem to have the same likelihood of being infected by the disease.

A simple example – split on « in contact with a person » column –
• Let's split on the attribute « in contact with a person » ($X_1$).
• We then obtain the following information gain:

  $IG(D_p, X_1) = 1 - \frac{2}{10} \times 0 - \frac{8}{10} \times 0.95 = 0.24$

X1 | X2 | X3 | X4 | Y
0 | 98 | 0 | 0 | 0
0 | 87 | 1 | 0 | 0
1 | 93 | 9 | 0 | 1
1 | 97 | 4 | 1 | 1
1 | 81 | 7 | 0 | 1
1 | 42 | 8 | 1 | 0
1 | 28 | 3 | 0 | 0
1 | 12 | 1 | 0 | 1
1 |  8 | 0 | 0 | 1
1 |  2 | 9 | 0 | 0

Finding the best split for a particular column
• In the last example, we made it easy: the « man/woman » and « in contact with a person » columns have only two possible values each.
• For continuous data, we can find some rules that lead to a smaller set of possible values.
• So, to find the best split for a continuous attribute $X$ (a code sketch follows this list):
  • We sort the values of the attribute in order, and sort the target $Y$ in the corresponding way.
  • We find all the boundary points (i.e. where $Y$ changes from one value to another).
  • We calculate the information gain when splitting at each boundary.
  • We keep the split which gives the highest information gain.
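A sketch of this procedure, reusing the information_gain helper sketched earlier (it assumes distinct feature values around each boundary, for simplicity):

```python
import numpy as np

def best_split(x, y):
    """Return the threshold on a continuous attribute x with the highest information gain.
    Assumes information_gain from the earlier sketch is in scope."""
    order = np.argsort(x)
    x_sorted, y_sorted = np.asarray(x)[order], np.asarray(y)[order]
    best_ig, best_thr = -np.inf, None
    for i in range(1, len(x_sorted)):
        if y_sorted[i] != y_sorted[i - 1]:                 # boundary point: Y changes value
            thr = (x_sorted[i - 1] + x_sorted[i]) / 2.0    # midpoint threshold
            ig = information_gain(y_sorted, y_sorted[:i], y_sorted[i:])
            if ig > best_ig:
                best_ig, best_thr = ig, thr
    return best_thr, best_ig
```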

A simple example – Split on « Number of washing hands » column at the root –
• We can see that the best split (the one with the highest information gain) on the « Number of washing hands » column ($X_3$) is still very low.
• For the first split (at the root), it is better to do it on the « in contact with a person » column ($X_1$), as its information gain is 0.24.
X1 | X2 | X3 | X4 | Y
0 | 98 | 0 | 0 | 0
1 |  8 | 0 | 0 | 1
1 | 12 | 1 | 0 | 1
0 | 87 | 1 | 0 | 0
1 | 28 | 3 | 0 | 0
1 | 97 | 4 | 1 | 1
1 | 81 | 7 | 0 | 1
1 | 42 | 8 | 1 | 0
1 |  2 | 9 | 0 | 0
1 | 93 | 9 | 0 | 1
Candidate thresholds on $X_3$: 0.5, 2, 3.5, 5.5, 7.5, 8.5. The resulting information gains are all small (e.g. 0.03 at threshold 3.5; at best about 0.04).

A simple example – Split on « Number of washing hands » column after some iterations –
• After some iterations, splitting on the « Number of washing hands » column ($X_3$) becomes interesting.
X1 | X2 | X3 | X4 | Y
1 |  8 | 0 | 0 | 1
1 | 12 | 1 | 0 | 1
1 | 28 | 3 | 0 | 0
1 | 42 | 8 | 1 | 0
1 |  2 | 9 | 0 | 0
Candidate thresholds on $X_3$: 2, 5.5, 8.5. The best split is at threshold 2, with $IG = 0.97$.

Basic example after training
• When do we actually stop splitting?
  • If a node is pure (all the samples of this node belong to the same category), we make it a leaf node and predict the class of the samples.
  • If the maximum information gain is 0, we gain nothing from splitting, so we make the node a leaf node and predict the most likely class (majority vote).
  • To avoid overfitting, we set a limit on the depth of the tree.
• By applying the training algorithm on the basic example, we obtain the following tree:
[Figure: the resulting decision tree.]

Decision Trees on the Iris Dataset – Data –
• The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), so 150 total samples.
• Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
[Images: Setosa, Versicolor, Virginica.]

Decision Trees on the Iris Dataset – Model Selection –
• To find the optimal set of hyperparameters, we use a grid search method, which is a brute-force exhaustive search paradigm: we specify a list of values for different hyperparameters, and the algorithm evaluates the model performance (via cross-validation) for each combination of hyperparameters (a code sketch follows this list).
• For the decision tree algorithm, we tuned the following two hyperparameters:
  • Impurity measure, with the two possibilities:
    • Gini
    • Entropy
  • The depth of the tree, with values in [1, 2, 3, 4, 5, 10, 20, 30, 40]
• As a result, we obtain the following best parameters:
  • Best impurity: « gini »
  • Best depth: 5
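What this grid search might look like with scikit-learn (the exact course code and cross-validation settings are not shown in the slides, so this is only a sketch):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

param_grid = {
    "criterion": ["gini", "entropy"],                # impurity measure
    "max_depth": [1, 2, 3, 4, 5, 10, 20, 30, 40],    # depth of the tree
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)   # the slides report {'criterion': 'gini', 'max_depth': 5}
```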

Decision Trees on the Iris Dataset – Bias/Variance Tradeoff –
• By increasing the tree depth, we increase the model complexity.
• If the complexity is too small, it will affect the performance on both the training set and the test set (the model will suffer from high bias).
• If it's too big, we could achieve a perfect score on the training set but the performance will deteriorate on the test set (the model will suffer from high variance).
Underfitting: complexity too small (high bias).
Good compromise: complexity neither too big nor too small.
Overfitting: complexity too big (high variance).

Decision Trees on the Iris Dataset – Overfitting –
• By plotting the model training and validation accuracies as functions of the training set size, we can detect whether the model suffers from high variance or high bias, and whether the collection of more data could help address this problem.
[Figures: accuracy vs training set size; accuracy vs max depth.]

Combining multiple decision trees via Random Forest
• The Random Forest algorithm aims to average multiple decision trees that individually suffer from overfitting, in order to build a model with better generalization performance. It consists in the following steps:
  1. Randomly choose $n$ samples from the training set with replacement (bootstrap sample).
  2. Grow a decision tree from the bootstrap sample, and at each node:
     • Randomly select $d$ attributes without replacement.
     • Split the node using the attribute that provides the best split according to maximizing the information gain.
  3. Repeat the first two steps $K$ times.
  4. Use the $K$ trees to assign the class label by majority voting.

The Random Forest Algorithm on the Iris dataset
• The hyperparameters of the Random Forest algorithm:
  • The sample size $n$ of the bootstrap controls the bias-variance tradeoff of the Random Forest.
  • In most libraries (including scikit-learn), the size of the bootstrap sample is chosen to be equal to the number of samples in the training dataset.
  • For the number of attributes $d$ at each split, a reasonable default value (used in scikit-learn) is $d = \sqrt{D}$, where $D$ is the number of attributes in the dataset.
• So, one big advantage of Random Forest is that it requires very little tuning: the main hyperparameter to tune is the number of trees.
• By tuning the number of trees as a hyperparameter for the Iris dataset, we get 20 as the best value, which is confirmed by the following curve (a code sketch follows below):
[Figure: the corresponding accuracy curve.]
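A sketch of the corresponding scikit-learn model with the best number of trees reported above (the course's actual code and evaluation protocol are not shown in the slides):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# n_estimators=20 is the best value reported in the slides;
# max_features="sqrt" is scikit-learn's d = sqrt(D) default for classification.
forest = RandomForestClassifier(n_estimators=20, max_features="sqrt", random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())
```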

Go to the following link and take Quiz 1 : https://mlfbg.github.io/MachineLearningInFinance/

Programming Session

Appendix
A zoom on the notion of Gradient Descent

Position of the Problem:
• The Gradient Descent algorithm is the backbone of Machine Learning.
• Let us start with a basic example.
• We want to minimize the function $f : \theta \mapsto \theta^2$ by using the gradient descent algorithm.
• As shown in the figure below, finding the minimum can be achieved by finding where the derivative (slope) is zero. To that end, we can do the following steps (sketched in code below):
  • Choose a random initial point $\theta_0$.
  • Choose a learning rate $\eta > 0$ and repeat until convergence:

    $\theta_{k+1} \leftarrow \theta_k - \eta\,\frac{df}{d\theta}(\theta_k)$
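A minimal sketch of these steps for $f(\theta) = \theta^2$, using the exact derivative $f'(\theta) = 2\theta$ (the initial point and learning rate are arbitrary choices):

```python
def gradient_descent_1d(theta0=5.0, lr=0.1, K=100):
    """Minimize f(theta) = theta**2, whose derivative is f'(theta) = 2 * theta."""
    theta = theta0
    for _ in range(K):
        theta -= lr * 2 * theta    # theta_{k+1} <- theta_k - eta * f'(theta_k)
    return theta                   # approaches 0, the minimizer

print(gradient_descent_1d())
```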

Gradient Descent Theorem:
• Before stating the gradient descent theorem, let's recall that the gradient is just the multidimensional equivalent of the derivative.
• For $f : \theta \in \mathbb{R}^d \mapsto \mathbb{R}$:

  $\nabla_\theta f(\theta) = \left(\frac{\partial f}{\partial \theta_1}(\theta), \frac{\partial f}{\partial \theta_2}(\theta), \dots, \frac{\partial f}{\partial \theta_d}(\theta)\right)$

Gradient Descent Algorithm:
• Suppose we have a convex and differentiable function $f : \mathbb{R}^d \to \mathbb{R}$.
• Suppose its gradient is (Lipschitz) continuous with constant $\epsilon$. Let's fix the learning rate $\eta \le \frac{1}{\epsilon}$.
• Then, if we run the gradient descent algorithm for $K$ iterations:
  • Initialize $\theta_0$ randomly.
  • Repeat $K$ times:

    $\theta_{k+1} \leftarrow \theta_k - \eta\, \nabla_\theta f(\theta_k)$

• It will yield a solution which satisfies:

  $f(\theta_K) - f(\theta^*) \le \frac{\|\theta_0 - \theta^*\|^2}{2\eta K}$

  where $\|\cdot\|^2$ is the squared $L_2$ norm on $\mathbb{R}^d$ (i.e. $\forall z \in \mathbb{R}^d$, $\|z\|^2 = z^T z$).

Choice of the learning rate in Gradient Descent:
• Under the assumptions of the gradient descent theorem, the algorithm is guaranteed to converge, and it converges at a rate of $O(\frac{1}{K})$.
• The learning rate $\eta$ should not be too big, so that the objective function does not blow up, but not too small, so that it doesn't take too much time to converge.
• In this case, the learning rate is fixed; in Lecture 4 we will present other versions of the gradient descent algorithm with a learning rate that decreases with iterations, as we get closer to the optimal parameters.

Using the Gradient Descent in the case of a linear regression
The training problem has been written as the following equivalent minimization problem:

  $\min_{\hat{w} \in \mathbb{R}^{D+1}} \underbrace{\frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{w}^T x_i)^2}_{J(\hat{w})}$

Matrix Notation and Optimization :
• It is computationally more efficient to write the optimization problem with matrix notation.
• To that end, let's introduce the following matrix notation for the training data $\{(X_i, Y_i)\}_{1 \le i \le N}$:

  $\hat{X} = \begin{pmatrix} x_1^T & 1 \\ x_2^T & 1 \\ \vdots & \vdots \\ x_N^T & 1 \end{pmatrix} \in \mathbb{R}^{N \times (D+1)} \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix} \in \mathbb{R}^N$

• We then deduce the following prediction vector $P$:

  $P = \hat{X}\hat{w} = \begin{pmatrix} \hat{w}^T x_1 \\ \hat{w}^T x_2 \\ \vdots \\ \hat{w}^T x_N \end{pmatrix} \in \mathbb{R}^N$

• If we denote by $\|\cdot\|^2$ the squared $L_2$ norm on $\mathbb{R}^N$ (i.e. $\forall z \in \mathbb{R}^N$, $\|z\|^2 = z^T z$), we can then express the cost function $J$ we look to minimize as follows:

  $J(\hat{w}) = \frac{1}{N}\|\hat{X}\hat{w} - Y\|^2$

Matrix Notation and Optimization:
• The cost function $J(\hat{w}) = \frac{1}{N}\|\hat{X}\hat{w} - Y\|^2$ is a differentiable convex function whose minima are thus characterized by setting the gradient to zero.
• In order to differentiate the cost function, we will need the following formulas:

  $\forall z \in \mathbb{R}^n,\ \forall A \in \mathbb{R}^{n \times n}: \quad \nabla_z(z^T A z) = (A + A^T)z$
  $\forall z \in \mathbb{R}^n,\ \forall \xi \in \mathbb{R}^n: \quad \nabla_z(z^T \xi) = \xi$

• So,

  $\nabla_{\hat{w}} J(\hat{w}) = 0 \iff \nabla_{\hat{w}}\left(\frac{1}{N}(\hat{X}\hat{w} - Y)^T(\hat{X}\hat{w} - Y)\right) = 0$
  $\iff \frac{1}{N}\left(\nabla_{\hat{w}}(\hat{w}^T \hat{X}^T \hat{X}\hat{w}) - \nabla_{\hat{w}}(\hat{w}^T \hat{X}^T Y) - \nabla_{\hat{w}}(Y^T \hat{X}\hat{w})\right) = 0$
  $\iff \frac{2}{N}(\hat{X}^T \hat{X}\hat{w} - \hat{X}^T Y) = 0$
  $\iff \hat{X}^T \hat{X}\hat{w} = \hat{X}^T Y$

• We deduce the following expression of the gradient:

  $\nabla_{\hat{w}} J(\hat{w}) = \frac{2}{N}(\hat{X}^T \hat{X}\hat{w} - \hat{X}^T Y)$

• If $\hat{X}^T \hat{X}$ is invertible, the optimal $\hat{w}^*$ is given by the following closed form (a code sketch follows below):

  $\hat{w}^* = (\hat{X}^T \hat{X})^{-1}\hat{X}^T Y$
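A sketch of this closed form in NumPy; np.linalg.solve is used instead of an explicit matrix inverse, which is numerically preferable but otherwise equivalent when $\hat{X}^T\hat{X}$ is invertible:

```python
import numpy as np

def linreg_closed_form(X, y):
    """Solve X_hat^T X_hat w = X_hat^T y, i.e. w* = (X_hat^T X_hat)^{-1} X_hat^T y."""
    N = X.shape[0]
    X_hat = np.hstack([X, np.ones((N, 1))])   # append the constant column
    return np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)
```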

Using the Gradient Descent in the case of a logistic regression
• The training problem has been written as the following equivalent minimization problem:

  $\min_{w \in \mathbb{R}^D} \underbrace{-\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i)\log(1 - \sigma(w^T x_i)) \right]}_{J(w)}$

Gradient Descent for Logistic Regression:
• Using these two properties of the sigmoid function:

  $\forall z \in \mathbb{R}: \quad \sigma(-z) = 1 - \sigma(z)$
  $\forall z \in \mathbb{R}: \quad \sigma'(z) = \sigma(z)(1 - \sigma(z)) = \sigma(z)\sigma(-z)$

• We can rewrite the cost function as follows:

  $J(w) = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i \log \sigma(w^T x_i) + (1 - y_i)\log(1 - \sigma(w^T x_i)) \right] = -\frac{1}{N}\sum_{i=1}^{N} \left[ y_i w^T x_i + \log \sigma(-w^T x_i) \right]$

• Let's take the gradient w.r.t. $w$:

  $\nabla_w J(w) = -\frac{1}{N}\sum_{i=1}^{N} \nabla_w\!\left(y_i w^T x_i + \log \sigma(-w^T x_i)\right) = \frac{1}{N}\sum_{i=1}^{N} \left(\sigma(w^T x_i) - y_i\right)x_i$

Matrix Notation for the Gradient :
• let’sintroducethefollowingmatrixnotationforthetrainingdata {(Xi,Yi)1iN}:
26 x1 37 26y137
X =6 x2 72RN⇥D Y =6y272{0,1}N
4. . .5
xN yN
N
26 ( w T x 1 ) 37 64 . 75
• The prediction vector P 2 R
• Andtheerrorvector ✏2RN isdefinedasfollows: ✏=PY =6(wTx2)y2 72RN
is then defined as follows:
4 . 5
P =6(wTx2)72RN
(wTxN)
26 ( w T x 1 ) y 1 37
4 . 5 (wTxN)yN

Matrix Notation for the Gradient:
• The error vector $\epsilon \in \mathbb{R}^N$ can be represented by $\hat{\epsilon}$, a diagonal $\mathbb{R}^{N \times N}$ matrix, as follows:

  $\hat{\epsilon} = \begin{pmatrix} \sigma(w^T x_1) - y_1 & & & \\ & \sigma(w^T x_2) - y_2 & & \\ & & \ddots & \\ & & & \sigma(w^T x_N) - y_N \end{pmatrix} \in \mathbb{R}^{N \times N}$

• As a result, the matrix $\hat{\epsilon}X \in \mathbb{R}^{N \times D}$ contains all the information needed to compute the gradient:

  $\hat{\epsilon}X = \begin{pmatrix} (\sigma(w^T x_1) - y_1)\,x_1^T \\ (\sigma(w^T x_2) - y_2)\,x_2^T \\ \vdots \\ (\sigma(w^T x_N) - y_N)\,x_N^T \end{pmatrix} \in \mathbb{R}^{N \times D}$

• Indeed, if $R_1, \dots, R_N$ are the rows of $\hat{\epsilon}X$, the gradient of $J$ w.r.t. $w$ can be expressed as follows (a vectorized code sketch follows below):

  $\nabla_w J(w) = \frac{1}{N}\sum_{i=1}^{N} R_i$
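In NumPy, the diagonal matrix $\hat{\epsilon}$ never needs to be formed explicitly: computing $X^T \epsilon$ directly gives the same sum of rows. A minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_gradient(w, X, y):
    """Vectorized gradient (1/N) * X^T (sigma(Xw) - y),
    i.e. the average of the rows of eps_hat @ X without building eps_hat."""
    eps = sigmoid(X @ w) - y        # error vector, shape (N,)
    return X.T @ eps / len(y)
```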