Machine Learning in Finance
Lecture 5: Introduction to Deep Learning
Arnaud de Servigny & Hachem Madmoun
Outline:
• From Logistic Regression to Shallow Neural Networks
• Deep Neural Networks and Loss Functions
• Deep Learning Techniques
• Programming Session
Part 1: From Logistic Regression to Shallow Neural Networks
Logistic Regression
• The Logistic Regression model predicts the probability of the positive class by combining a linear decision function with a sigmoid function:
[Figure: logistic regression viewed as a network, with inputs $x_1$, $x_2$, parameters $(W, b)$, and output $p$]

$$p = P(Y = 1 \mid X_1 = x_1, X_2 = x_2) = \sigma(W_1 x_1 + W_2 x_2 + b)$$

where $\sigma : z \mapsto \frac{1}{1 + e^{-z}}$ is the sigmoid function.
• As the decision boundary is a hyperplane, the Logistic Regression model performs well on linearly separable classes.
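As a minimal illustration, here is a sketch of the logistic regression forward pass in NumPy; the feature and weight values are made up for the example:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def logistic_forward(x, W, b):
    # p = sigmoid(W . x + b): linear decision function, then sigmoid.
    return sigmoid(np.dot(W, x) + b)

x = np.array([0.5, -1.2])   # features x1, x2 (illustrative values)
W = np.array([2.0, -1.0])   # weights W1, W2 (illustrative values)
b = 0.1
print(logistic_forward(x, W, b))  # P(Y = 1 | X1 = x1, X2 = x2)
```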
Linearly inseparable data
• The basic idea to deal with linearly inseparable data is to create nonlinear combinations of the original features.
• We can then transform a two-dimensional dataset into a three-dimensional feature space where the classes become linearly separable.
[Figure: the original two-dimensional data in the $(x_1, x_2)$ plane and its image in the transformed feature space with axes $z_1(x)$, $z_2(x)$, $z_3(x)$]

Original features:
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

Transformation of the features:
$$\forall i \in \{1, 2, 3\} : \quad z_i = \sigma\left(b_i^{(1)} + W_{1,i}^{(1)} x_1 + W_{2,i}^{(1)} x_2\right), \qquad z = \begin{pmatrix} z_1 \\ z_2 \\ z_3 \end{pmatrix}$$

where $\sigma : z \mapsto \frac{1}{1 + e^{-z}}$.
Shallow Neural Network
• After transforming the features, we can build a linear model on top of the new features:
$$p = \sigma\left(b^{(2)} + W_1^{(2)} z_1 + W_2^{(2)} z_2 + W_3^{(2)} z_3\right)$$

• This model is a basic example of an artificial neural network with one hidden layer containing 3 neurons.

[Figure: network diagram with inputs $x_1$, $x_2$, a hidden layer $z_1$, $z_2$, $z_3$ with parameters $(W^{(1)}, b^{(1)})$, and output $p$ with parameters $(W^{(2)}, b^{(2)})$]

$$\forall i \in \{1, 2, 3\} : \quad z_i = \sigma\left(b_i^{(1)} + W_{1,i}^{(1)} x_1 + W_{2,i}^{(1)} x_2\right)$$
$$p = \sigma\left(b^{(2)} + W_1^{(2)} z_1 + W_2^{(2)} z_2 + W_3^{(2)} z_3\right)$$
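A minimal NumPy sketch of this two-layer forward pass, with the shapes from the slide (2 inputs, 3 hidden neurons); the parameter values are random placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_forward(x, W1, b1, W2, b2):
    z = sigmoid(x @ W1 + b1)     # hidden layer, shape (3,)
    return sigmoid(z @ W2 + b2)  # output probability p

rng = np.random.default_rng(0)
x  = np.array([0.5, -1.2])     # two input features
W1 = rng.normal(size=(2, 3))   # W^(1) in R^{2x3}
b1 = rng.normal(size=3)        # b^(1) in R^3
W2 = rng.normal(size=3)        # W^(2) in R^3
b2 = 0.0                       # b^(2) in R
print(shallow_forward(x, W1, b1, W2, b2))
```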
The importance of the activation function
Interactive Session
[Figure: the shallow network with inputs $x_1$, $x_2$, hidden neurons $z_1$, $z_2$, $z_3$ with parameters $(W^{(1)}, b^{(1)})$, and output $p$ with parameters $(W^{(2)}, b^{(2)})$]

• This is our model:
$$\forall i \in \{1, 2, 3\} : \quad z_i = \sigma\left(b_i^{(1)} + W_{1,i}^{(1)} x_1 + W_{2,i}^{(1)} x_2\right)$$
$$p = \sigma\left(b^{(2)} + W_1^{(2)} z_1 + W_2^{(2)} z_2 + W_3^{(2)} z_3\right)$$
• Prove that without the activation function, the model becomes a simple linear model:
$$p = U x_1 + V x_2 + b$$
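A sketch of the argument in LaTeX, with $U$, $V$ and $b$ denoting the collapsed parameters:

```latex
% Without the activation, each hidden unit is affine:
%   z_i = b_i^{(1)} + W_{1,i}^{(1)} x_1 + W_{2,i}^{(1)} x_2 .
% Substituting into the (also linear) output layer:
\begin{aligned}
p &= b^{(2)} + \sum_{i=1}^{3} W_i^{(2)} z_i \\
  &= b^{(2)} + \sum_{i=1}^{3} W_i^{(2)}
     \left( b_i^{(1)} + W_{1,i}^{(1)} x_1 + W_{2,i}^{(1)} x_2 \right) \\
  &= \underbrace{\left(\sum_i W_i^{(2)} W_{1,i}^{(1)}\right)}_{U} x_1
   + \underbrace{\left(\sum_i W_i^{(2)} W_{2,i}^{(1)}\right)}_{V} x_2
   + \underbrace{b^{(2)} + \sum_i W_i^{(2)} b_i^{(1)}}_{b} .
\end{aligned}
```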
Training the Shallow Neural Network – Part 1 –
• We used Gradient Descent to learn the parameters of Logistic Regression; we can do the same for this shallow neural network model:
$$\forall i \in \{1, 2, 3\} : \quad z_i = \sigma\left(b_i^{(1)} + W_{1,i}^{(1)} x_1 + W_{2,i}^{(1)} x_2\right)$$
$$p = \sigma\left(b^{(2)} + W_1^{(2)} z_1 + W_2^{(2)} z_2 + W_3^{(2)} z_3\right)$$
• The parameters $\theta$ of the model can be summarized as follows:
$$\left(W^{(1)} \in \mathbb{R}^{2 \times 3},\; b^{(1)} \in \mathbb{R}^3\right) \quad \text{and} \quad \left(W^{(2)} \in \mathbb{R}^{3 \times 1},\; b^{(2)} \in \mathbb{R}\right)$$
• The Gradient Descent algorithm requires the use of backpropagation.
• Backpropagation consists in computing the gradient of the loss function J (that will be detailed later)
with respect to each weight by the chain rule, iterating backward from the last layer.
Backpropagation

[Figure: computational graph $x \to z^{(1)} \to z^{(2)}$, with parameters $(W^{(1)}, b^{(1)})$ and $(W^{(2)}, b^{(2)})$]

Chain rule:
$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(1)}}, \qquad \frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial z^{(2)}} \frac{\partial z^{(2)}}{\partial W^{(2)}}$$
Training the Shallow Neural Network – Part 2 –
• The Gradient Descent algorithm consists in the following steps:
• Initialize $\theta_0$ randomly.
• Fix a number of iterations $K$ and a learning rate $\eta$, and repeat $K$ times (a code sketch follows after this list):
$$\theta_{k+1} \leftarrow \theta_k - \eta \nabla_\theta J(\theta_k)$$

[Figure: a loss surface with several local minima]
• J represents the loss function. We will detail later how to choose the appropriate one.
• We will also see in Lecture 7 several ways to improve the learning process:
• By using Momentum.
• By using an adaptive learning rate.
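A minimal sketch of the update loop in NumPy, assuming a generic grad_J function (hypothetical here) that returns $\nabla_\theta J(\theta_k)$:

```python
import numpy as np

def gradient_descent(theta0, grad_J, lr=0.1, n_iter=100):
    # theta_{k+1} <- theta_k - lr * grad_J(theta_k)
    theta = theta0.copy()
    for _ in range(n_iter):
        theta -= lr * grad_J(theta)
    return theta

# Toy check: minimize J(theta) = ||theta||^2, whose gradient is 2 * theta.
theta_star = gradient_descent(np.array([3.0, -2.0]), lambda t: 2 * t)
print(theta_star)  # close to [0, 0]
```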
The activation functions – Part 1 –
• In the previous example, the nonlinearity results from the sigmoid function.
• The sigmoid function takes values in $[0, 1]$, which is convenient when we want to output a probability.
• However, we can still use other functions to introduce the nonlinearity in the intermediate layers.
• For instance, the hyperbolic tangent can also be used. It is just a scaled and shifted version of the sigmoid, so it has the same shape as the sigmoid, with the advantage of taking values in $[-1, 1]$:
$$\text{sigmoid}: z \mapsto \frac{1}{1 + \exp(-z)} \qquad\qquad \tanh: z \mapsto \frac{\exp(2z) - 1}{\exp(2z) + 1}$$
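A quick numerical check of the "scaled and shifted" claim, using the identity $\tanh(z) = 2\,\sigma(2z) - 1$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 101)
# tanh is a scaled and shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
print(np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1))  # True
```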
The activation functions – Part 2 –
• Although the sigmoid and tanh functions have good theoretical properties (both smooth, both differentiable everywhere), they suffer from the vanishing gradient problem.
• To train the model we use backpropagation. However, the deeper the neural network is, the more terms have to be multiplied due to the chain rule.
• For instance, with $L$ layers:
$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \cdots \frac{\partial z^{(2)}}{\partial z^{(1)}} \frac{\partial z^{(1)}}{\partial W^{(1)}}$$
• The problem with the sigmoid function is that its derivative has a maximum value of 0.25.
• So if we use a deep neural network with sigmoid activation functions, we will end up multiplying several small numbers, which leads to the vanishing gradient problem.
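A toy illustration of the effect: even at the sigmoid derivative's maximum value of 0.25, the chain-rule product shrinks geometrically with depth:

```python
# Each sigmoid layer contributes a factor of at most 0.25 to the
# chain-rule product, so the gradient bound decays geometrically.
for L in (5, 10, 20):
    print(L, 0.25 ** L)
# 5  -> ~9.8e-04
# 10 -> ~9.5e-07
# 20 -> ~9.1e-13
```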
The activation functions – Part 3 –
• A simple solution to the previous vanishing gradient problem is to use the following activation functions:
$$\text{ReLU} : z \mapsto \max(0, z)$$
$$\text{LeakyReLU} : z \mapsto \begin{cases} z & \text{if } z \geq 0 \\ \alpha z & \text{if } z < 0 \end{cases}$$
$$\text{Softplus} : z \mapsto \log(1 + \exp(z))$$

[Figure: the gradient of the ReLU vanishes on the left side ($z < 0$) but not on the right; LeakyReLU and Softplus fix the left side of the ReLU, and Softplus looks linear on the right side]
• Apart from the last activation function, which should be chosen according to the problem (we will see later why we should use a sigmoid for binary classification, a softmax for multiclass classification and no activation function for regression), it is essential to compare a handful of different activation functions for the other layers, as we do for any other hyperparameter.
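A minimal NumPy sketch of these three activations ($\alpha$ is the leaky slope hyperparameter):

```python
import numpy as np

def relu(z):
    # Zero for z < 0, identity (gradient 1) for z > 0.
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # A small slope alpha for z < 0 keeps the gradient from dying.
    return np.where(z >= 0, z, alpha * z)

def softplus(z):
    # Smooth approximation of the ReLU; log1p is numerically safer.
    return np.log1p(np.exp(z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), softplus(z), sep="\n")
```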
Universal Approximation Theorem
• The Universal Approximation Theorem proves that any continuous function can be approximated, under some regularity conditions, as closely as desired by a shallow neural network. But this may require exponentially many neurons in terms of the dimension of the problem.
• Reference: [Pinkus 1999: Approximation theory of the MLP model in neural networks]
• Although multilayer networks are less understood theoretically compared with shallow networks, it turns out that deeper networks perform better for a given number of parameters in practice.

[Figure: MNIST example, a $28 \times 28$ image of the digit 7. Inputs $X = \mathbb{R}^{28 \times 28}$, labels $Y = \{0, \ldots, 9\}$. The classification problem is of dimension $d = 784$; theoretically, it could require on the order of $\exp(d)$ neurons in the hidden layer.]
Part 2: Deep Neural Networks and Loss Functions
Why deeper Networks?
• Each neural network layer is a feature transformation.
• Deeper layers learn increasingly complex features.
• We have introduced a specific type of layer called the Dense layer. We will introduce other types of transformations in future lectures. For instance, the Convolutional layer is typically used for images.
• The first convolution layer will learn small local patterns such as edges.
• Deeper layers will learn larger patterns made of the features of the previous layers, and so on.
Forward Propagation for Binary Classification – Part 1 –
• Let's keep the example of binary classification, but with a deeper model (i.e., with multiple layers and multiple neurons per layer):
[Figure: a 5-layer network $x = z^{(0)} \to z^{(1)} \to z^{(2)} \to z^{(3)} \to z^{(4)} \to p = z^{(5)}$, with parameters $(W^{(1)}, b^{(1)}), \ldots, (W^{(5)}, b^{(5)})$]

Forward Propagation

• On a neuron level:
$$z_2^{(3)} = \sigma\left(W_{12}^{(3)} z_1^{(2)} + W_{22}^{(3)} z_2^{(2)} + W_{32}^{(3)} z_3^{(2)} + b_2^{(3)}\right)$$

• On a layer level:
$$\forall l \in \{1, \ldots, 5\} : \quad z^{(l)} = \sigma\left(W^{(l)T} z^{(l-1)} + b^{(l)}\right)$$

where, for instance,
$$W^{(3)T} z^{(2)} = \begin{pmatrix} W_{11}^{(3)} & W_{21}^{(3)} & W_{31}^{(3)} \\ W_{12}^{(3)} & W_{22}^{(3)} & W_{32}^{(3)} \\ W_{13}^{(3)} & W_{23}^{(3)} & W_{33}^{(3)} \end{pmatrix} \begin{pmatrix} z_1^{(2)} \\ z_2^{(2)} \\ z_3^{(2)} \end{pmatrix}$$
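A layer-level sketch of this forward pass in NumPy; the layer widths below are illustrative assumptions, not the lecture's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params):
    # params is a list of (W, b) pairs; z^(l) = sigmoid(W^(l)T z^(l-1) + b^(l))
    z = x
    for W, b in params:
        z = sigmoid(W.T @ z + b)
    return z

rng = np.random.default_rng(0)
widths = [2, 4, 4, 3, 3, 1]   # input dim 2, four hidden layers, output dim 1
params = [(rng.normal(size=(n_in, n_out)), rng.normal(size=n_out))
          for n_in, n_out in zip(widths[:-1], widths[1:])]
print(forward(np.array([0.5, -1.2]), params))  # probability of positive class
```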
Forward Propagation for Binary Classification – Part 2 –
[Figure: the full network $f_\theta$ maps an input $x_i = \begin{pmatrix} \text{length} \\ \text{width} \end{pmatrix}$ through layers $z^{(1)}, \ldots, z^{(4)}$ with parameters $(W^{(1)}, b^{(1)}), \ldots, (W^{(5)}, b^{(5)})$ to the output $p_i$]
• In the previous example, the forward propagation can be summarized as follows:
$$\forall i \in \{1, \ldots, N\} : \quad p_i = f_\theta(x_i), \qquad \text{where } \theta = \left\{ (W^{(i)}, b^{(i)}) ;\; i \in \{1, \ldots, 5\} \right\}$$
• The forward propagation outputs the probability of the positive class: $p_i = P(Y = 1 \mid X = x_i)$.
• As the forward propagation outputs a probability in the range $[0, 1]$, the last activation function should be a sigmoid function.
Loss function for Binary Classification – Part 1 –
[Figure: each sample $x_i = \begin{pmatrix} \text{length} \\ \text{width} \end{pmatrix}$, $i \in \{1, \ldots, N\}$, is mapped by $f_\theta$ to $p_i = f_\theta(x_i) = P(Y = 1 \mid X = x_i)$]
• From the dataset $(x_i, y_i)_{1 \leq i \leq N}$ and the model $Y \mid X = x_i \sim \mathcal{B}(p_i)$, where $\mathcal{B}$ stands for the Bernoulli distribution, our objective is to maximize the following likelihood:
$$L(\theta) = \prod_{i=1}^{N} P(Y = y_i \mid X = x_i) = \prod_{i=1}^{N} f_\theta(x_i)^{y_i} \left(1 - f_\theta(x_i)\right)^{1 - y_i}$$
Loss function for Binary Classification – Part 2 –
• As usual, instead of maximizing the likelihood, we prefer to minimize the normalized negative log-likelihood (called the loss function $J$):
$$\begin{aligned} J(\theta) &= -\frac{1}{N} \log(L(\theta)) \\ &= -\frac{1}{N} \log\left( \prod_{i=1}^{N} f_\theta(x_i)^{y_i} \left(1 - f_\theta(x_i)\right)^{1 - y_i} \right) \\ &= -\frac{1}{N} \sum_{i=1}^{N} \log\left( f_\theta(x_i)^{y_i} \left(1 - f_\theta(x_i)\right)^{1 - y_i} \right) \\ &= -\frac{1}{N} \sum_{i=1}^{N} \left\{ y_i \log(f_\theta(x_i)) + (1 - y_i) \log(1 - f_\theta(x_i)) \right\} \end{aligned}$$
• Thus, the loss function for binary classification is the following binary cross-entropy:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left\{ y_i \log(f_\theta(x_i)) + (1 - y_i) \log(1 - f_\theta(x_i)) \right\}$$
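A minimal NumPy sketch of this binary cross-entropy; the clipping is an implementation guard against $\log(0)$, not part of the formula:

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    # J = -(1/N) * sum( y*log(p) + (1-y)*log(1-p) )
    p = np.clip(p_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.7, 0.6])
print(binary_cross_entropy(y, p))
```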
Forward Propagation for Multiclass Classification – Part 1 –
• The multiclass classification consists in predicting one of K categories.
• Let's take the example of the MNIST dataset. The objective is to predict a category among the set of numbers $\{0, 1, \ldots, 9\}$ based on an image of shape $(28, 28)$ that we can flatten into a 784-dimensional input vector.
[Figure: a $28 \times 28$ image of the digit 7, flattened to $x \in \mathbb{R}^{784}$, passes through layers $z^{(1)}, z^{(2)}, z^{(3)}$ to the output $p = z^{(4)}$, a discrete probability distribution that should sum to one:]

$$p = \begin{pmatrix} 0.21 \\ 0.04 \\ 0.05 \\ 0.04 \\ 0.03 \\ 0.02 \\ 0.03 \\ 0.46 \\ 0.04 \\ 0.08 \end{pmatrix} \begin{matrix} \text{label 0} \\ \text{label 1} \\ \text{label 2} \\ \text{label 3} \\ \text{label 4} \\ \text{label 5} \\ \text{label 6} \\ \text{label 7} \\ \text{label 8} \\ \text{label 9} \end{matrix}$$
Forward Propagation for Multiclass Classification – Part 2 –
• If we want the output to be a discrete distribution over all the possible categories (10 in our example), the last layer must have 10 neurons and the activation function must be the softmax activation function.
• The softmax activation function transforms a vector of size K into a probability distribution. It first uses the exponential function to turn the real numbers into positive ones, then a classic normalization is performed.
$$\text{Softmax}: \begin{pmatrix} w_1 \\ \vdots \\ w_i \\ \vdots \\ w_K \end{pmatrix} \mapsto \begin{pmatrix} p_1 \\ \vdots \\ p_i \\ \vdots \\ p_K \end{pmatrix}, \qquad \forall i \in \{1, \ldots, K\} : \; p_i = \frac{e^{w_i}}{\sum_{j=1}^{K} e^{w_j}}$$

where $(w_1, \ldots, w_K)^T$ is the output of the last layer before applying the activation function, and the output after applying it satisfies
$$\forall i \in \{1, \ldots, K\} : \; p_i \geq 0 \qquad \text{and} \qquad \sum_{i=1}^{K} p_i = 1$$
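A minimal NumPy sketch; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the mathematical definition:

```python
import numpy as np

def softmax(w):
    # p_i = exp(w_i) / sum_j exp(w_j)
    e = np.exp(w - np.max(w))  # shift for numerical stability
    return e / e.sum()

w = np.array([2.0, 1.0, 0.1])
p = softmax(w)
print(p, p.sum())  # nonnegative entries summing to 1
```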
Categorical Distribution – One-hot encoding –
• As we have seen before, the categorical distribution (also called multinomial distribution) models the outcome of a random variable that can take K possible categories.
• Let $X$ be a random variable that can take K possible values, each value $k$ with probability $\pi_k$:
$$\forall k \in \{1, \ldots, K\} : \; P(X = k) = \pi_k, \qquad \text{where } \forall k \in \{1, \ldots, K\} : \pi_k \geq 0 \;\text{ and }\; \sum_{k=1}^{K} \pi_k = 1$$
• We say that $X$ follows a Multinomial distribution:
$$X \sim \mathcal{M}(1, \pi_1, \ldots, \pi_K)$$
• We usually use one-hot encoding to represent the discrete random variable $X$, which consists in encoding $X$ with a random variable $Y = (Y_1, \ldots, Y_K)^T$ such that
$$\forall k \in \{1, \ldots, K\} : \; Y_k = 1_{\{X = k\}}$$
• Which means:
$$\{X = k\} \iff Y = (0, \ldots, 0, \underbrace{1}_{k\text{-th position}}, 0, \ldots, 0)^T$$
• Thus:
$$\forall k \in \{1, \ldots, K\} : \; P(X = k) = P(Y_k = 1)$$
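A minimal sketch of one-hot encoding in NumPy, using np.eye to build the indicator vectors:

```python
import numpy as np

def one_hot(labels, K):
    # Row i is the indicator vector with a single 1 at position labels[i].
    return np.eye(K, dtype=int)[labels]

y = np.array([2, 0, 9])   # class indices
print(one_hot(y, K=10))   # shape (3, 10), exactly one 1 per row
```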
Loss function for Multiclass Classification – Part 1 –
[Figure: each sample $x_i \in \mathbb{R}^{784}$, $i \in \{1, \ldots, N\}$, is mapped by $f_\theta$ to a probability vector]

$$\forall i \in \{1, \ldots, N\} : \quad p_i = f_\theta(x_i) = \begin{pmatrix} p_i^1 \\ \vdots \\ p_i^k \\ \vdots \\ p_i^K \end{pmatrix} = \begin{pmatrix} P(Y_1 = 1 \mid X = x_i) \\ \vdots \\ P(Y_k = 1 \mid X = x_i) \\ \vdots \\ P(Y_K = 1 \mid X = x_i) \end{pmatrix}$$
Loss function for Multiclass Classification – Part 2 –
• The targets should be one-hot encoded too.
For instance, the labels $[y_i]_{1 \leq i \leq N} = (4, 3, 2, 6, 9, 0)^T$ are one-hot encoded as
$$[\hat{y}_i]_{1 \leq i \leq N} = \begin{pmatrix} 0&0&0&0&1&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0&0&0 \\ 0&0&0&0&0&0&1&0&0&0 \\ 0&0&0&0&0&0&0&0&0&1 \\ 1&0&0&0&0&0&0&0&0&0 \end{pmatrix}$$
• Which means:
$$\forall i \in \{1, \ldots, N\},\; \forall k \in \{1, \ldots, K\} : \quad y_i = k \iff \hat{y}_i^k = 1$$
• From the dataset $(x_i, y_i)_{1 \leq i \leq N}$ and the model $Y \mid X = x_i \sim \mathcal{M}(1, p_i^1, \ldots, p_i^K)$, where $\mathcal{M}$ stands for the Multinomial distribution, our objective is to maximize the following likelihood:
$$L(\theta) = \prod_{i=1}^{N} P(Y = \hat{y}_i \mid X = x_i) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left(p_i^k\right)^{\hat{y}_i^k}$$
Loss function for Multiclass Classification – Part 3 –
• As before, instead of maximizing the likelihood, we prefer to minimize the normalized negative log-likelihood (called the loss function $J$):
$$\begin{aligned} J(\theta) &= -\frac{1}{N} \log(L(\theta)) \\ &= -\frac{1}{N} \log\left( \prod_{i=1}^{N} \prod_{k=1}^{K} \left(p_i^k\right)^{\hat{y}_i^k} \right) \\ &= -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \hat{y}_i^k \log\left(p_i^k\right) \end{aligned}$$
• Thus, the loss function for multiclass classification is the following categorical cross-entropy:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \hat{y}_i^k \log\left(p_i^k\right)$$
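A minimal NumPy sketch of the categorical cross-entropy, with one-hot targets and softmax outputs; the clipping is again just a guard against $\log(0)$:

```python
import numpy as np

def categorical_cross_entropy(y_onehot, p_pred, eps=1e-12):
    # J = -(1/N) * sum_i sum_k yhat_i^k * log(p_i^k)
    p = np.clip(p_pred, eps, 1.0)
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

y = np.array([[0, 0, 1], [1, 0, 0]])              # one-hot targets
p = np.array([[0.1, 0.2, 0.7], [0.8, 0.1, 0.1]])  # softmax outputs
print(categorical_cross_entropy(y, p))
```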
Part 3: Deep Learning Techniques
Optimization Techniques: - Gradient Descent -
[Figure: the full dataset $\mathcal{D}$ of $N$ samples $(X, Y)$, with sample $i$ highlighted]

• Data: the full dataset of $N$ samples.
• Parameters:
$$\theta = \left\{ (W^{(k)}, b^{(k)}) ;\; k \in \{1, \ldots, K\} \right\}$$
• Loss function for the dataset:
$$\underbrace{J_{\text{dataset}}(\theta)}_{\text{loss of the dataset}} = \frac{1}{N} \sum_{i=1}^{N} \underbrace{J(\theta, i)}_{\text{loss for sample } i}$$
• Algorithm:
  • Initialize $\theta_0$ randomly.
  • Fix a number of iterations $N_{\text{iter}}$ and a learning rate $\eta$, and repeat:
$$\theta_{k+1} \leftarrow \theta_k - \eta \nabla_\theta J_{\text{dataset}}(\theta_k)$$
Optimization Techniques: - Stochastic Gradient Descent -

[Figure: the dataset $\mathcal{D}$ of $N$ samples $(X, Y)$ split into batches $1, 2, \ldots, F$, each containing $M$ samples $i_1, \ldots, i_M$]

• Data: the dataset, split into $F = \text{int}(N/M)$ batches of size $M$.
• Parameters:
$$\theta = \left\{ (W^{(k)}, b^{(k)}) ;\; k \in \{1, \ldots, K\} \right\}$$
• Loss function for a batch:
$$\underbrace{J_{\text{batch}}(\theta)}_{\text{loss of the batch}} = \frac{1}{M} \sum_{i=i_1}^{i_M} \underbrace{J(\theta, i)}_{\text{loss for sample } i}$$
• Algorithm (a code sketch follows below):
  • Initialize $\theta_0$ randomly.
  • Shuffle the data and split it into batches of size $M$.
  • Repeat $N_{\text{epochs}}$ times:
    • Update the weights for each batch in $\{1, \ldots, F\}$:
$$\theta_{k+1} \leftarrow \theta_k - \eta \nabla_\theta J_{\text{batch}}(\theta_k)$$
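A minimal sketch of the mini-batch loop in NumPy, assuming a generic grad_J(theta, X_batch, y_batch) function (hypothetical here):

```python
import numpy as np

def sgd(theta0, grad_J, X, y, lr=0.1, batch_size=32, n_epochs=10):
    theta, N = theta0.copy(), len(X)
    rng = np.random.default_rng(0)
    for _ in range(n_epochs):
        idx = rng.permutation(N)               # shuffle each epoch
        for start in range(0, N, batch_size):  # F = int(N / M) batches
            batch = idx[start:start + batch_size]
            theta -= lr * grad_J(theta, X[batch], y[batch])
    return theta

# Toy usage: fit a linear model y = X @ theta by mean squared error.
X = np.random.default_rng(1).normal(size=(100, 2))
y = X @ np.array([1.0, -2.0])
grad = lambda th, Xb, yb: 2 * Xb.T @ (Xb @ th - yb) / len(Xb)
print(sgd(np.zeros(2), grad, X, y))  # close to [1, -2]
```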
Fighting the Overfitting problem – 1 –
• There are a lot of hyperparameters to tune when using Neural Networks:
• The number of hidden layers, the number of neurons per hidden layer.
• The activation functions.
• The number of epochs, the batch size and the learning rate (and other hyperparameters for more sophisticated optimization algorithms), etc.
• The main issue when designing the architecture of a neural network (i.e., choosing the hyperparameters) is to make sure we keep the balance between the optimization task and the generalization purpose.
[Figure: three fits of the same data]
Underfitting: complexity of G too small (high bias).
Good compromise: complexity of G neither too big, nor too small.
Overfitting: complexity of G too big (high variance).
Fighting the Overfitting problem – 2 –
• At the beginning of the training process, optimization and generalization are correlated. Both the training and the validation metrics are improving.
• After some iterations, generalization stops improving, and the validation metrics start degrading because the model is then learning some patterns that are specific to the training data and irrelevant to new data. The model is simply overfitting.
Fighting the Overfitting problem – 3 –
• To overcome the overfitting problem, we can add more samples or reduce the complexity of the network. We can also test several regularization techniques (sketched in code below):
• Dropout, applied to a layer, consists of randomly "dropping out" (i.e., setting to zero) a number of output features of the layer during training. The "dropout rate" is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5.
• Weight regularization consists in adding to the loss function of the network a cost associated with having large weights. For the cost, we can use the L1 norm of the weights or the L2 norm.
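A minimal NumPy sketch of both ideas; the rescaling by $1/(1 - \text{rate})$ is the common "inverted dropout" convention, an assumption here rather than the lecture's exact recipe:

```python
import numpy as np

def dropout(z, rate=0.5, rng=np.random.default_rng(0)):
    # Training-time inverted dropout: zero out a fraction `rate` of the
    # features and rescale the survivors to keep the expected value.
    mask = rng.random(z.shape) >= rate
    return z * mask / (1.0 - rate)

def l2_penalty(weights, lam=0.01):
    # Cost added to the loss: lam * (sum of squared weights).
    return lam * sum(np.sum(W ** 2) for W in weights)

z = np.array([1.0, -2.0, 3.0, 0.5])
print(dropout(z, rate=0.5))
print(l2_penalty([np.array([[1.0, -1.0], [0.5, 2.0]])]))
```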
Programming Session
Go to the following link and take Quiz 5: https://mlfbg.github.io/MachineLearningInFinance/