
Machine Learning in Finance
Lecture 5 Introduction to Deep Learning
Arnaud de Servigny & Hachem Madmoun

Outline:
• From Logistic Regression to Shallow Neural Networks
• Deep Neural Networks and Loss functions
• Deep Learning Techniques
• Programming Session

Part 1 : From Logistic Regression to Shallow Neural Networks

Logistic Regression
• The Logistic Regression model predicts the probability of the positive class using the combination of a linear decision function and a sigmoid function:
[Figure: inputs $x_1$ and $x_2$ feed into a single output $p$ through the parameters $(W, b)$.]

$$p = P(Y = 1 \mid X_1 = x_1, X_2 = x_2) = \sigma(W_1 x_1 + W_2 x_2 + b), \qquad \sigma: z \mapsto \frac{1}{1 + e^{-z}}$$
• As the decision boundary is a hyperplane, the Logistic Regression model performs well on linearly separable classes.
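A minimal numpy sketch of this prediction rule (the function and parameter values are illustrative, not from the lecture code):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def logistic_predict(x, W, b):
    # x: (2,) features, W: (2,) weights, b: scalar bias
    # returns p = P(Y = 1 | X1 = x1, X2 = x2)
    return sigmoid(np.dot(W, x) + b)

p = logistic_predict(np.array([0.5, -1.2]), W=np.array([1.0, 2.0]), b=0.1)
```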

Linearly inseparable data
• The basic idea to deal with linearly inseparable data is to create nonlinear combinations of the original features.
• We can then transform a two-dimensional dataset onto a three-dimensional feature space where the classes become linearly separable.
Original features:
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$

Transformation of the features:
$$\forall i \in \{1,2,3\}: \quad z_i = \sigma\!\left(b^{(1)}_i + W^{(1)}_{1,i} x_1 + W^{(1)}_{2,i} x_2\right), \qquad \sigma: z \mapsto \frac{1}{1 + e^{-z}}$$

[Figure: the original two-dimensional data in the $(x_1, x_2)$ plane and the transformed features $z_1(x), z_2(x), z_3(x)$ in the new feature space.]

Shallow Neural Network
• After transforming the features, we can build a linear model on top of the new features:
$$p = \sigma\!\left(b^{(2)} + W^{(2)}_1 z_1 + W^{(2)}_2 z_2 + W^{(2)}_3 z_3\right)$$
• This model is a basic example of an artificial neural network with one hidden layer containing 3 neurons.
[Figure: network diagram with inputs $x_1, x_2$, a hidden layer $z_1, z_2, z_3$ with parameters $(W^{(1)}, b^{(1)})$, and an output $p$ with parameters $(W^{(2)}, b^{(2)})$.]

$$\forall i \in \{1,2,3\}: \quad z_i = \sigma\!\left(b^{(1)}_i + W^{(1)}_{1,i} x_1 + W^{(1)}_{2,i} x_2\right)$$
$$p = \sigma\!\left(b^{(2)} + W^{(2)}_1 z_1 + W^{(2)}_2 z_2 + W^{(2)}_3 z_3\right)$$
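Below is a minimal numpy sketch of the full forward pass of this one-hidden-layer network (shapes and values are illustrative, not from the lecture code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shallow_forward(x, W1, b1, W2, b2):
    # x: (2,) input; W1: (2, 3), b1: (3,) -> hidden layer with 3 neurons
    # W2: (3,), b2: scalar               -> output layer
    z = sigmoid(x @ W1 + b1)   # z_i = sigma(b1_i + W1[0, i] * x1 + W1[1, i] * x2)
    p = sigmoid(z @ W2 + b2)   # p = sigma(b2 + sum_i W2_i * z_i)
    return z, p

rng = np.random.default_rng(0)
z, p = shallow_forward(np.array([0.5, -1.2]),
                       W1=rng.normal(size=(2, 3)), b1=np.zeros(3),
                       W2=rng.normal(size=3), b2=0.0)
```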

The importance of the activation function
Interactive Session
[Figure: the same network with inputs $x_1, x_2$, hidden units $z_1, z_2, z_3$ (parameters $(W^{(1)}, b^{(1)})$) and output $p$ (parameters $(W^{(2)}, b^{(2)})$).]

• This is our model:
$$\forall i \in \{1,2,3\}: \quad z_i = \sigma\!\left(b^{(1)}_i + W^{(1)}_{1,i} x_1 + W^{(1)}_{2,i} x_2\right)$$
$$p = \sigma\!\left(b^{(2)} + W^{(2)}_1 z_1 + W^{(2)}_2 z_2 + W^{(2)}_3 z_3\right)$$
• Prove that without the activation function, the model becomes a simple linear model:
$$p = U x_1 + V x_2 + b$$
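A sketch of the argument (not on the original slide): without $\sigma$, each $z_i$ is affine in $(x_1, x_2)$, and substituting into the output gives
$$p = b^{(2)} + \sum_{i=1}^{3} W^{(2)}_i \left( b^{(1)}_i + W^{(1)}_{1,i} x_1 + W^{(1)}_{2,i} x_2 \right)
= \underbrace{\left( \sum_{i} W^{(2)}_i W^{(1)}_{1,i} \right)}_{U} x_1
+ \underbrace{\left( \sum_{i} W^{(2)}_i W^{(1)}_{2,i} \right)}_{V} x_2
+ \underbrace{\left( b^{(2)} + \sum_{i} W^{(2)}_i b^{(1)}_i \right)}_{b},$$
i.e. a composition of affine maps is affine, so the hidden layer adds no expressive power without a nonlinearity.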

Training the Shallow Neural Network – Part 1 –
• We used Gradient Descent to learn the parameters of Logistic Regression; we can do the same for this shallow neural network model:
$$\forall i \in \{1,2,3\}: \quad z_i = \sigma\!\left(b^{(1)}_i + W^{(1)}_{1,i} x_1 + W^{(1)}_{2,i} x_2\right)$$
$$p = \sigma\!\left(b^{(2)} + W^{(2)}_1 z_1 + W^{(2)}_2 z_2 + W^{(2)}_3 z_3\right)$$
• The parameters $\theta$ of the model can be summarized as follows:
$$\left(W^{(1)} \in \mathbb{R}^{2 \times 3},\; b^{(1)} \in \mathbb{R}^{3}\right) \quad \text{and} \quad \left(W^{(2)} \in \mathbb{R}^{3 \times 1},\; b^{(2)} \in \mathbb{R}\right)$$
• The Gradient Descent algorithm requires the use of backpropagation.
• Backpropagation consists in computing the gradient of the loss function J (that will be detailed later)
with respect to each weight by the chain rule, iterating backward from the last layer.
Backpropagation

[Figure: forward pass $x \rightarrow z^{(1)} \rightarrow z^{(2)}$ with parameters $(W^{(1)}, b^{(1)})$ and $(W^{(2)}, b^{(2)})$; the gradients flow backward.]

Chain rule:
$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial z^{(2)}} \, \frac{\partial z^{(2)}}{\partial z^{(1)}} \, \frac{\partial z^{(1)}}{\partial W^{(1)}}, \qquad
\frac{\partial J}{\partial W^{(2)}} = \frac{\partial J}{\partial z^{(2)}} \, \frac{\partial z^{(2)}}{\partial W^{(2)}}$$
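As an illustration only, a minimal numpy sketch of these chain-rule computations for the 2-3-1 network above, assuming the binary cross-entropy loss introduced later in this lecture (the names and shapes are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_sample(x, y, W1, b1, W2, b2):
    # Forward pass through the 2-3-1 network.
    z = sigmoid(x @ W1 + b1)        # hidden activations, shape (3,)
    p = sigmoid(z @ W2 + b2)        # predicted probability, scalar
    # Backward pass for J = -[y log p + (1 - y) log(1 - p)].
    # With a sigmoid output, dJ/d(output pre-activation) simplifies to (p - y).
    delta_out = p - y                                # scalar
    dW2 = delta_out * z                              # dJ/dW2, shape (3,)
    db2 = delta_out                                  # dJ/db2, scalar
    delta_hidden = delta_out * W2 * z * (1 - z)      # chain rule through the hidden layer
    dW1 = np.outer(x, delta_hidden)                  # dJ/dW1, shape (2, 3)
    db1 = delta_hidden                               # dJ/db1, shape (3,)
    return dW1, db1, dW2, db2
```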

Training the Shallow Neural Network – Part 2 –
• The Gradient Descent algorithm consists in the following steps:
• Initialize $\theta_0$ randomly.
• Fix a number of iterations $K$ and a learning rate $\eta$, and repeat $K$ times:
$$\theta_{k+1} \leftarrow \theta_k - \eta \, \nabla_\theta J(\theta_k)$$
[Figure: a gradient descent trajectory on the loss surface, which can get trapped in local minima.]
• J represents the loss function. We will detail later how to choose the appropriate one.
• We will also see in Lecture 7 several ways to improve the learning process:
• By using Momentum.
• By using an adaptive learning rate.
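A minimal sketch of the basic update loop above (the toy loss and its gradient are placeholders, not from the lecture):

```python
import numpy as np

def gradient_descent(grad_J, theta0, eta=0.1, K=100):
    # Repeat K times: theta_{k+1} <- theta_k - eta * grad_J(theta_k)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(K):
        theta = theta - eta * grad_J(theta)
    return theta

# Toy example: J(theta) = ||theta - 3||^2 has gradient 2 * (theta - 3),
# so gradient descent converges towards theta = [3, 3].
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=np.zeros(2))
```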

The activation functions – Part 1 –
• In the previous example, the nonlinearity results from the sigmoid function.
• The sigmoid function takes values in [0, 1], which is convenient when we want to output a probability.
• However, we can still use other functions to introduce the nonlinearity in the intermediate layers.
• For instance, the hyperbolic tangent can also be used. It is just a scaled and shifted version of the sigmoid, so it has the same shape as the sigmoid with the advantage of taking values in [−1, 1]:
$$\text{sigmoid}: z \mapsto \frac{1}{1 + \exp(-z)} \qquad \qquad \tanh: z \mapsto \frac{\exp(2z) - 1}{\exp(2z) + 1}$$
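A quick numpy check of the relationship between the two functions (a sketch, not from the slides): since $\tanh(z) = 2\,\mathrm{sigmoid}(2z) - 1$, tanh is indeed a scaled and shifted sigmoid.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
# tanh(z) = 2 * sigmoid(2z) - 1
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)
```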

The activation functions – Part 2 –
• Although the sigmoid and tanh functions have good theoretical properties (both smooth, both differentiable everywhere), they suffer from the vanishing gradient problem.
• To train the model we use backpropagation. However, the deeper the neural network is, the more terms have to be multiplied due to the chain rule.
• For instance, with $L$ layers:
$$\frac{\partial J}{\partial W^{(1)}} = \frac{\partial J}{\partial z^{(L)}} \, \frac{\partial z^{(L)}}{\partial z^{(L-1)}} \cdots \frac{\partial z^{(2)}}{\partial z^{(1)}} \, \frac{\partial z^{(1)}}{\partial W^{(1)}}$$
• The problem with the sigmoid function is that its derivative has a maximum value of 0.25.
• So if we use a deep neural network with sigmoid activation functions, we will end up multiplying several small numbers, which leads to the vanishing gradient problem.
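The 0.25 bound comes from the identity $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$: since $\sigma(z) \in (0, 1)$,
$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right) \le \max_{u \in [0,1]} u(1 - u) = \frac{1}{4},$$
with the maximum attained at $z = 0$, where $\sigma(0) = \tfrac{1}{2}$.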

The activation functions – Part 3 –
• A simple solution to the previous vanishing gradient problem is to use the following activation functions:
$$\text{ReLU}: z \mapsto \max(0, z) \qquad
\text{Leaky ReLU}: z \mapsto \begin{cases} z & \text{if } z \ge 0 \\ \alpha z & \text{if } z < 0 \end{cases} \qquad
\text{Softplus}: z \mapsto \log(1 + \exp(z))$$

[Figure: plots of the three functions. For the ReLU, the gradient vanishes on the left side ($z < 0$) but not on the right side; the Leaky ReLU fixes the left side of the ReLU; the Softplus looks linear on the right side.]

• Apart from the last activation function, which should be chosen according to the problem (we will see later why we should use the sigmoid for binary classification, the softmax for multiclass classification and no activation function for regression), it is essential to compare a handful of different activation functions for the other layers, as we do for any other hyperparameter.

Universal Approximation Theorem

• The Universal Approximation Theorem proves that any continuous function can be approximated, under some regularity conditions, as closely as wanted by a shallow neural network. But this may require exponentially many neurons in terms of the dimension of the problem.
• Reference: [Pinkus 1999: Approximation theory of the MLP model in neural networks]
• Although multilayer networks are less understood theoretically compared with shallow networks, it turns out that deeper networks perform better for a given number of parameters in practice.

[Figure: an MNIST digit (a 28×28 image of a "7"). Here $X = \mathbb{R}^{28 \times 28}$ and $Y = \{0, \dots, 9\}$, so the classification problem is of dimension $d = 784$; theoretically, a shallow network could require of the order of $\exp(d)$ neurons in the hidden layer.]

Part 2 : Deep Neural Networks and Loss functions

Why deeper Networks ?
• Each neural network layer is a feature transformation.
• Deeper layers learn increasingly complex features.
• We have introduced a specific type of layer called the Dense layer. We will introduce other types of transformations in future lectures. For instance, the Convolutional layer is typically used for images.
• The first convolutional layer will learn small local patterns such as edges.
• Deeper layers will learn larger patterns made of the features of the previous layers, and so on.

Forward Propagation for Binary Classification – Part 1 –
• Let's keep the example of binary classification, but with a deeper model (i.e., with multiple layers and multiple neurons per layer):
$$x = z^{(0)} \rightarrow z^{(1)} \rightarrow z^{(2)} \rightarrow z^{(3)} \rightarrow z^{(4)} \rightarrow p = z^{(5)}$$
with parameters $(W^{(1)}, b^{(1)}), (W^{(2)}, b^{(2)}), (W^{(3)}, b^{(3)}), (W^{(4)}, b^{(4)}), (W^{(5)}, b^{(5)})$.

• On a neuron level:
$$z^{(3)}_2 = \sigma\!\left(W^{(3)}_{12} z^{(2)}_1 + W^{(3)}_{22} z^{(2)}_2 + W^{(3)}_{32} z^{(2)}_3 + b^{(3)}_2\right)$$

• On a layer level (forward propagation):
$$\forall l \in \{1,\dots,5\}: \quad z^{(l)} = \sigma\!\left(W^{(l)T} z^{(l-1)} + b^{(l)}\right),
\qquad \text{e.g. } W^{(3)T} = \begin{pmatrix} W^{(3)}_{11} & W^{(3)}_{21} & W^{(3)}_{31} \\ W^{(3)}_{12} & W^{(3)}_{22} & W^{(3)}_{32} \\ W^{(3)}_{13} & W^{(3)}_{23} & W^{(3)}_{33} \end{pmatrix},
\quad z^{(2)} = \begin{pmatrix} z^{(2)}_1 \\ z^{(2)}_2 \\ z^{(2)}_3 \end{pmatrix}$$

Forward Propagation for Binary Classification – Part 2 –

[Figure: an input $x_i = \begin{pmatrix}\text{length} \\ \text{width}\end{pmatrix}$ flows through the layers $z^{(1)}, \dots, z^{(4)}$ with parameters $(W^{(1)}, b^{(1)}), \dots, (W^{(5)}, b^{(5)})$ to produce $p_i$; the whole mapping is denoted $f_\theta$.]

• In the previous example, the forward propagation can be summarized as follows:
$$\forall i \in \{1,\dots,N\}: \quad p_i = f_\theta(x_i), \qquad \text{where } \theta = \left\{ (W^{(i)}, b^{(i)}); \; i \in \{1,\dots,5\} \right\}$$
• The forward propagation outputs the probability of the positive class: $p_i = P(Y = 1 \mid X = x_i)$.
• As the forward propagation outputs a probability in the range [0, 1], the last activation function should be a sigmoid function.

Loss function for Binary Classification – Part 1 –

[Figure: each sample $x_i = \begin{pmatrix}\text{length} \\ \text{width}\end{pmatrix}$, for $i = 1, \dots, N$, is mapped by $f_\theta$ to $p_i = f_\theta(x_i) = P(Y = 1 \mid X = x_i)$.]
• From the dataset $(x_i, y_i)_{1 \le i \le N}$ and the model $Y \mid X = x_i \sim \mathcal{B}(p_i)$, where $\mathcal{B}$ stands for the Bernoulli distribution, our objective is to maximize the following likelihood:
$$L(\theta) = \prod_{i=1}^{N} P(Y = y_i \mid X = x_i) = \prod_{i=1}^{N} f_\theta(x_i)^{y_i} \left(1 - f_\theta(x_i)\right)^{1 - y_i}$$

Loss function for Binary Classification – Part 2 –
• As usual, instead of maximizing the likelihood, we prefer to minimize the normalized negative log-likelihood (called the loss function $J$):
$$\begin{aligned}
J(\theta) &= -\frac{1}{N} \log\left(L(\theta)\right) \\
&= -\frac{1}{N} \log\left( \prod_{i=1}^{N} f_\theta(x_i)^{y_i} \left(1 - f_\theta(x_i)\right)^{1 - y_i} \right) \\
&= -\frac{1}{N} \sum_{i=1}^{N} \log\left( f_\theta(x_i)^{y_i} \left(1 - f_\theta(x_i)\right)^{1 - y_i} \right) \\
&= -\frac{1}{N} \sum_{i=1}^{N} \left\{ y_i \log\left(f_\theta(x_i)\right) + (1 - y_i) \log\left(1 - f_\theta(x_i)\right) \right\}
\end{aligned}$$
• Thus, the loss function for binary classification is the following binary cross-entropy:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left\{ y_i \log\left(f_\theta(x_i)\right) + (1 - y_i) \log\left(1 - f_\theta(x_i)\right) \right\}$$

Forward Propagation for Multiclass Classification – Part 1 –
• The multiclass classification consists in predicting one of K categories.
• Let's take the example of the MNIST dataset. The objective is to predict a category among the set of numbers {0, 1, ..., 9} based on an image of shape (28, 28) that we can flatten into a 784-dimensional input vector.

[Figure: a 28×28 image flattened into $x \in \mathbb{R}^{784}$ flows through layers $z^{(1)}, z^{(2)}, z^{(3)}$ to an output $p = z^{(4)}$, a discrete probability distribution over the 10 labels (e.g. 0.21 for label 0, 0.04 for label 1, ..., 0.46 for label 7, ...), whose entries should sum to one.]

Forward Propagation for Multiclass Classification – Part 2 –
• If we want the output to be a discrete distribution over all the possible categories (10 in our example), the last layer must have 10 neurons and the activation function must be the softmax activation function.
• The softmax activation function transforms a vector of size K into a probability distribution. It first uses the exponential function to turn the real numbers into positive ones, then a classic normalization is performed:
$$\forall i \in \{1,\dots,K\}: \quad p_i = \frac{e^{w_i}}{\sum_{j=1}^{K} e^{w_j}}$$
where $(w_1, \dots, w_K)^T$ is the output of the last layer before applying the activation function and $(p_1, \dots, p_K)^T$ is the output after applying it, with
$$\forall i \in \{1,\dots,K\}: \; p_i \ge 0 \qquad \text{and} \qquad \sum_{i=1}^{K} p_i = 1$$

Categorical Distribution – One hot encoding –
• As we have seen before, the categorical distribution (also called the multinomial distribution) models the outcome of a random variable that can take K possible categories.
• Let $X$ be a random variable that can take K possible values, each value $k$ with probability $\pi_k$:
$$\forall k \in \{1,\dots,K\}: \quad P(X = k) = \pi_k, \qquad \text{where } \forall k \in \{1,\dots,K\}: \pi_k \ge 0 \text{ and } \sum_{k=1}^{K} \pi_k = 1$$
• We say that $X$ follows a Multinomial distribution: $X \sim \mathcal{M}(1, \pi_1, \dots, \pi_K)$.
• We usually use one hot encoding to represent the discrete random variable $X$, which consists in encoding $X$ with a random variable $Y = (Y_1, \dots, Y_K)^T$ such that
$$\forall k \in \{1,\dots,K\}: \quad Y_k = \mathbb{1}_{\{X = k\}}$$
• Which means:
$$\{X = k\} \iff Y = (0, \dots, 0, \underbrace{1}_{k\text{-th position}}, 0, \dots, 0)^T$$
• Thus:
$$\forall k \in \{1,\dots,K\}: \quad P(X = k) = P(Y_k = 1)$$

Loss function for Multiclass Classification – Part 1 –

[Figure: each sample $x_i \in \mathbb{R}^{784}$, for $i = 1, \dots, N$, is mapped by $f_\theta$ to a probability vector]
$$p_i = f_\theta(x_i) = \begin{pmatrix} p_i^1 \\ \vdots \\ p_i^k \\ \vdots \\ p_i^K \end{pmatrix} = \begin{pmatrix} P(Y_1 = 1 \mid X = x_i) \\ \vdots \\ P(Y_k = 1 \mid X = x_i) \\ \vdots \\ P(Y_K = 1 \mid X = x_i) \end{pmatrix}$$
Loss function for Multiclass Classification – Part 2 –
• The targets should be one hot encoded too. For instance, the labels
$$[y_i]_{1 \le i \le N} = \begin{pmatrix} 4 \\ 3 \\ 6 \\ 2 \\ 0 \\ 9 \end{pmatrix} \quad \text{are encoded as} \quad
[\hat{y}_i^k]_{\substack{1 \le i \le N \\ 1 \le k \le K}} = \begin{pmatrix}
0&0&0&0&1&0&0&0&0&0 \\
0&0&0&1&0&0&0&0&0&0 \\
0&0&0&0&0&0&1&0&0&0 \\
0&0&1&0&0&0&0&0&0&0 \\
1&0&0&0&0&0&0&0&0&0 \\
0&0&0&0&0&0&0&0&0&1
\end{pmatrix}$$
• Which means:
$$\forall i \in \{1,\dots,N\}, \; \forall k \in \{1,\dots,K\}: \quad y_i = k \iff \hat{y}_i^k = 1$$
• From the dataset $(x_i, y_i)_{1 \le i \le N}$ and the model $Y \mid X = x_i \sim \mathcal{M}(1, p_i^1, \dots, p_i^K)$, where $\mathcal{M}$ stands for the Multinomial distribution, our objective is to maximize the following likelihood:
$$L(\theta) = \prod_{i=1}^{N} P(Y = y_i \mid X = x_i) = \prod_{i=1}^{N} \prod_{k=1}^{K} \left(p_i^k\right)^{\hat{y}_i^k}$$

Loss function for Multiclass Classification – Part 3 –
• As before, instead of maximizing the likelihood, we prefer to minimize the normalized negative log-likelihood (called the loss function $J$):
$$\begin{aligned}
J(\theta) &= -\frac{1}{N} \log\left(L(\theta)\right) \\
&= -\frac{1}{N} \log\left( \prod_{i=1}^{N} \prod_{k=1}^{K} \left(p_i^k\right)^{\hat{y}_i^k} \right) \\
&= -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \hat{y}_i^k \log\left(p_i^k\right)
\end{aligned}$$
• Thus, the loss function for multiclass classification is the following categorical cross-entropy:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \hat{y}_i^k \log\left(p_i^k\right)$$

Part 3 : Deep Learning Techniques

Optimization Techniques: - Gradient Descent -
• Data: the full dataset $\mathcal{D}$ of $N$ samples $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$.
• Parameters: $\theta = \left\{ (W^{(k)}, b^{(k)}); \; k \in \{1,\dots,K\} \right\}$
• Loss function for the dataset:
$$J_{\text{dataset}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \underbrace{J(\theta, i)}_{\text{loss for sample } i}$$
• Algorithm:
• Initialize $\theta_0$ randomly.
• Fix a number of iterations $N_{\text{iter}}$ and a learning rate $\eta$, and repeat:
$$\theta_{k+1} \leftarrow \theta_k - \eta \, \nabla_\theta J_{\text{dataset}}(\theta_k)$$

Optimization Techniques: - Stochastic Gradient Descent -
• Data: the dataset $\mathcal{D}$ is split into batches (Batch 1, Batch 2, ..., Batch F) of size $M$, where a batch contains the samples $i_1, \dots, i_M$.
• Parameters: $\theta = \left\{ (W^{(k)}, b^{(k)}); \; k \in \{1,\dots,K\} \right\}$
• Loss function for a batch:
$$J_{\text{batch}}(\theta) = \frac{1}{M} \sum_{i=i_1}^{i_M} \underbrace{J(\theta, i)}_{\text{loss for sample } i}$$
• Algorithm:
• Shuffle the data and split it into batches of size $M$ (number of batches: $F = \text{int}(N/M)$).
• Initialize $\theta_0$ randomly.
• Repeat $N_{\text{epochs}}$ times: update the weights for each batch in $\{1,\dots,F\}$:
$$\theta_{k+1} \leftarrow \theta_k - \eta \, \nabla_\theta J_{\text{batch}}(\theta_k)$$

Fighting the Overfitting problem – 1 –
• There are a lot of hyperparameters to tune when using Neural Networks:
• The number of hidden layers, the number of neurons per hidden layer.
• The activation functions.
• The number of epochs, the batch size and the learning rate (and other hyperparameters for more sophisticated optimization algorithms), etc.
• The main issue when designing the architecture of a neural network (i.e., choosing the hyperparameters) is to make sure we keep the balance between the optimization task and the generalization purpose.

[Figure: three fits. Underfitting: complexity of G too small (high bias). Good compromise: complexity of G neither too big nor too small. Overfitting: complexity of G too big (high variance).]

Fighting the Overfitting problem – 2 –
• At the beginning of the training process, optimization and generalization are correlated. Both the training and the validation metrics are improving.
• After some iterations, generalization stops improving; validation metrics start degrading because the model is then learning some patterns that are specific to the training data and irrelevant to new data. The model is simply overfitting.
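As an illustration (assuming TensorFlow/Keras, which the lecture does not prescribe; the data, layer sizes and hyperparameters are placeholders), this behaviour can be observed by tracking the training and validation losses during training:

```python
import numpy as np
import tensorflow as tf

# Placeholder data: 1,000 samples with 2 features and a binary label.
X = np.random.rand(1000, 2).astype("float32")
y = (X[:, 0] + X[:, 1] > 1.0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classification -> sigmoid output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Hold out 20% of the samples as a validation set and train with mini-batches.
history = model.fit(X, y, epochs=50, batch_size=32, validation_split=0.2, verbose=0)

# If val_loss starts increasing while loss keeps decreasing, the model is overfitting.
print(history.history["loss"][-1], history.history["val_loss"][-1])
```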
Fighting the Overfitting problem – 3 –
• To overcome the overfitting problem, we can add more samples or reduce the complexity of the network. We can also test several regularization techniques:
• Dropout, applied to a layer, consists of randomly "dropping out" (i.e. setting to zero) a number of output features of the layer during training. The "dropout rate" is the fraction of the features that are being zeroed out; it is usually set between 0.2 and 0.5.
• Weight regularization: it consists in adding to the loss function of the network a cost associated with having large weights. For the cost, we can use the L1 norm of the weights or the L2 norm.

Programming Session

Go to the following link and take Quiz 5 : https://mlfbg.github.io/MachineLearningInFinance/