Machine Learning in Finance
Lecture 7: Introduction to Sequence Models
Arnaud de Servigny & Hachem Madmoun
Outline:
• Introducing the concept of Memory
• The Embedding Layer
• Recurrent Neural Networks
• Programming Session
Part 1 : Introducing the concept of Memory
Review of the Sentiment Analysis Pipeline
• The Preprocessing steps:
• Raw Data:
• Document 1 : « There were no wolves in the movie. »
• Document 2 : « This movie has one star and that star is Ryan Gosling. Great flick, highly recommend it. »
• Document N : « How many times must Willy be freed before he's freed? »
• Preprocessed Data (sequences of integers):
• Document 1 : [0, 23, 43, 12, …, 2343, 9999]
• Document 2 : [0, 12, 1, 3453, …, 123, 9999]
• Document N : [0, 1, 1252, …, 1232, 9999]
• Vectorized Data: each document becomes a binary vector of length V (1 if the word appears in the document, 0 otherwise), giving the matrix
$$X = \begin{pmatrix} 0 & 1 & 1 & \cdots & 0 & 1 \\ 0 & 0 & 1 & \cdots & 0 & 1 \\ \vdots & & & & & \vdots \\ 0 & 1 & 1 & \cdots & 1 & 1 \end{pmatrix} \in \{0,1\}^{N \times V}$$
• The tensor data is of shape (N, V) after the preprocessing steps.
• Then, we feed the tensor data into a traditional Machine Learning algorithm or a neural network composed of a stack of dense layers.
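To make the vectorization step concrete, here is a minimal sketch (the vocabulary size V = 10000 and the helper name `vectorize` are our own illustrative choices, not the course's code):

```python
# A minimal sketch: turning integer-encoded documents into a binary (N, V) matrix.
import numpy as np

V = 10000  # vocabulary size (assumption)

def vectorize(sequences, dimension=V):
    """One row per document: 1 if the word index appears in it, 0 otherwise."""
    results = np.zeros((len(sequences), dimension))
    for i, seq in enumerate(sequences):
        results[i, seq] = 1.0  # set every index present in the document to 1
    return results

docs = [[0, 23, 43, 12, 2343, 9999], [0, 12, 1, 3453, 123, 9999]]
X = vectorize(docs)
print(X.shape)  # (2, 10000)
```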
Limitations of the approach
1. The first limitation comes from the way we encode the sentence into a V-dimensional vector:
• The encoding is performed regardless of the order in which the words come.
• For instance, these two sentences will be encoded into the same vector:
• « Never quit. Do your best. »
• « Never do your best. Quit. »
2. The other limitation comes from the model itself. We feed the entire sequence (encoded into a V-dimensional vector) all at once.
[Figure: the same (N, V) binary matrix as before, consumed by the model in a single step]
The concept of Memory in Neural Networks
• When we are dealing with sequential data, the main limitation of the neural networks we have seen so far is that they look at each sequence as a single large vector. We call this kind of network a feedforward network.
[Figure: the sentence « What a great movie » is collapsed into a single V-dimensional vector [0, 1, 0, 1, …, 1, 0] before being fed to the network]
• Conversely, a Recurrent Neural Network processes sequences by looking at them element by element and keeping track of a state which contains the information processed so far.
[Figure: after preprocessing, the RNN is unrolled over « What a great movie »: it processes « What », then « What a », then « What a great », then « What a great movie », passing its state forward through a recurrent connection (internal loop) and emitting an output at the end]
A simple example – Part 1 –
• We want to predict whether the stock AAPL is going up or down for the next day, based on D = 4 characteristics of the stock during the last T days.
• At each time step t, let $x_t = [c_t^1, c_t^2, c_t^3, c_t^4]$ denote the four characteristics of the stock AAPL at time t.
• Let $y_t$ denote the stock movement at time t:
• $y_t = 1$ when the stock goes up between t-1 and t.
• $y_t = 0$ when the stock goes down between t-1 and t.
• Let $x_1, \ldots, x_F$ be the whole sequence of characteristics from 2015 to 2020.
• We want to predict the stock movement as follows:
• From the sequence $x_1, \ldots, x_T$, we want to predict $y_{T+1}$
• From the sequence $x_2, \ldots, x_{T+1}$, we want to predict $y_{T+2}$
• From the sequence $x_{F-T}, \ldots, x_{F-1}$, we want to predict $y_F$
A simple example – Part 2 –
[Figure: the AAPL price series, with sliding windows $(x_1, \ldots, x_T)$, $(x_2, \ldots, x_{T+1})$, …, $(x_{F-T}, \ldots, x_{F-1})$ and their respective targets $y_{T+1}, y_{T+2}, \ldots, y_F$]
A simple example – Part 3 –
• The input data tensor is a tensor of shape (N, T, D):
• Each sample i among the N samples is a sequence $s_i$ of length T with elements of dimension D.
• We have N = F − T sequences.
• The target tensor data is of shape (N,):
• Each sample i is associated with a target $t_i$.
• Each target is a binary output: 1 for « up » and 0 for « down ».
Input Data Tensor (N, T, D) and Target Tensor (N,):
• $s_1 = (x_1, \ldots, x_T)$ with target $t_1 = y_{T+1}$
• $s_2 = (x_2, \ldots, x_{T+1})$ with target $t_2 = y_{T+2}$
• $s_N = (x_{F-T}, \ldots, x_{F-1})$ with target $t_N = y_F$
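A numpy sketch of how these two tensors could be built from a feature matrix (the values of F, T, D and the random placeholders are illustrative, not real AAPL data):

```python
# Given F daily feature vectors (F, D) and F labels, build the N = F - T
# overlapping sequences of length T and their targets.
import numpy as np

F, T, D = 1260, 20, 4                       # ~5 years of days, window length, 4 characteristics
features = np.random.randn(F, D)            # placeholder for the real stock characteristics
labels = np.random.randint(0, 2, size=F)    # placeholder for the up/down labels

N = F - T
X = np.stack([features[i:i + T] for i in range(N)])  # input tensor, shape (N, T, D)
y = labels[T:]                                       # target tensor: y[i] is the movement right after window i
print(X.shape, y.shape)  # (1240, 20, 4) (1240,)
```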
A simple example – Part 4 – The Forward Propagation:
[Figure: the sequence $s_1 = (x_1, x_2, \ldots, x_T)$ is processed step by step by the RNN; the final state $h_T$ is passed through dense layers of sizes $d_1$, $d_2$, $d_3$ to produce the prediction $p_1$, compared against the target $t_1$]
• The loss function associated with the N samples is the following binary cross-entropy loss function:
$$J = -\frac{1}{N} \sum_{i=1}^{N} \big( t_i \log(p_i) + (1 - t_i) \log(1 - p_i) \big)$$
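A quick numerical check of this loss in numpy (a sketch, not library code; the clipping constant is our own safeguard):

```python
# Binary cross-entropy over N (target, prediction) pairs, as in the formula above.
import numpy as np

def binary_cross_entropy(t, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

t = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.6, 0.99])
print(binary_cross_entropy(t, p))  # ≈ 0.2123
```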
A simple example – Part 4 – The Forward Propagation:
• Let's focus on the evolution of the tensor shape at each layer transformation for the stock movement prediction problem:
Input Data (N, T, D) → RNN layer → (N, d1) → Dense layer → (N, d2) → Dense layer → (N, d3) → Dense layer → (N,) → Predictions
• In the previous example, the input data is a 3D tensor representing N sequences of length T. Each sequence is composed of D-dimensional continuous vectors.
• In the Sentiment Analysis problem, the input data comes as sequences of integers.
• In order to use RNN models for the Sentiment Analysis problem, we will need an intermediate layer called the Embedding layer.
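One possible Keras realisation of this shape pipeline (the layer sizes d1, d2, d3 are illustrative choices, not the course's official values):

```python
# A sketch of the stock-movement model: RNN over (T, D) windows, then dense layers.
from tensorflow import keras
from tensorflow.keras import layers

T, D = 20, 4
d1, d2, d3 = 32, 16, 8

model = keras.Sequential([
    keras.Input(shape=(T, D)),             # input tensor of shape (N, T, D)
    layers.SimpleRNN(d1),                  # (N, d1): the RNN returns its last state h_T
    layers.Dense(d2, activation="relu"),   # (N, d2)
    layers.Dense(d3, activation="relu"),   # (N, d3)
    layers.Dense(1, activation="sigmoid")  # (N, 1): probability that the stock goes up
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```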
Interactive Session
Part 2 : The Embedding Layer
The Embedding Space
• The embedding layer aims at mapping each word into a geometric space, called the embedding space.
• Each word is encoded into a D-dimensional vector (D is around 50 or 100).
• In the embedding space, words with similar meaning are encoded into similar word vectors.
• Since we use unsupervised learning to obtain these word vectors, they don't necessarily have to make sense to us. They only have to make sense geometrically.
• A common example of a meaningful geometric transformation is « gender »:
$$w_{King} - w_{Man} \approx w_{Queen} - w_{Woman}$$
[Figure: a 2D view of the embedding space, with nearby clusters such as { economy, finance, markets }, { rock, pop, jazz, music } and { environment, earth, nature }]
• Another example is the « capital-of » relation:
$$w_{France} - w_{Paris} \approx w_{Spain} - w_{Madrid}$$
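These analogies can be checked with pretrained vectors, for instance via gensim (the choice of the small `glove-wiki-gigaword-50` model is our own; any pretrained embedding would do):

```python
# A sketch: verify that king - man + woman lands near queen in embedding space.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # downloads ~66 MB of GloVe vectors
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' should appear among the top results for good embeddings
```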
Different Word Embeddings
• We can store all the word vectors in a matrix called the Embedding Matrix.
• The idea of a low-dimensional embedding space for words, computed using unsupervised learning, was initially explored by Bengio et al. in the early 2000s.
• It started to take off in the industry after the release of the Word2vec algorithm, developed by Tomas Mikolov at Google in 2013.
• In the previous lecture, we explored the Word2vec algorithm, which learns word vectors from the contexts in which words appear, e.g. for « economy »:
…dire consequences for the UK economy, even as markets were rocked…
…High pay for bosses hurting economy says senior Bank of England…
…Mervyn King believes the world economy will soon face another crash…
• In addition to the Word2vec vectors, there are various pretrained word embeddings that can be downloaded and used, like GloVe and FastText.
The Embedding Layer
• The Embedding Layer takes as input the sequences of integers. But all the sequences should be of the same length T, so that we can pack them into the same tensor:
• Sequences that are shorter than T are padded with zeros.
• Sequences that are longer than T are truncated.
Raw Data:
• Document 1 : « There were no wolves in the movie. »
• Document 2 : « This movie has one star and that star is Ryan Gosling. Great flick, highly recommend it. »
• Document N : « How many times must Willy be freed before he's freed? »
Preprocessed Data (padded or truncated to length T):
• Document 1 : [23, 43, 12, …, 2343, 0, 0, 0, 0]
• Document 2 : [12, 1, 3453, …, 123, 23, 12, 9]
• Document N : [1, 1252, …, 1232, 0, 0, 0, 0, 0]
• The Embedding Layer transforms the 2D input tensor of shape (N, T) into a tensor of shape (N, T, D).
[Figure: each integer index in a padded sequence, e.g. Document 3 : [1, 4, 1, 0, 0, …, 0], is replaced by the corresponding row of the V × D Embedding Matrix (row $w_1$, then $w_4$, then $w_1$, then $w_0$, …), so each document becomes a T × D matrix]
The New Sentiment Analysis Pipeline – Part 1 –
[Figure: the sentence « What a great movie » goes through the preprocessing and Embedding layer, producing T D-dimensional vectors $x_1, \ldots, x_T$; the RNN processes them step by step, and its final state $h_T$ is passed through dense layers of sizes $d_1$, $d_2$, $d_3$ to produce the prediction $p_1$, compared against the target $t_1$]
The New Sentiment Analysis Pipeline – Part 2 –
• Let's keep track of the evolution of the tensor shape at each layer transformation (the forward propagation):
Input Data (sequences of integers) (N, T) → Embedding layer → (N, T, D) → RNN layer → (N, d1) → Dense layer → (N, d2) → Dense layer → (N, d3) → Dense layer → (N,) → Predictions
• The next sections will detail the RNN transformation:
• We will first start with a simple RNN model.
• But the simple RNN model suffers from the vanishing gradient problem.
• We will then explain how to solve the vanishing gradient problem by using a better transformation called the Long Short Term Memory model.
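Putting it together, one possible Keras model matching this pipeline (hyper-parameters are illustrative, not the course's official values):

```python
# A sketch of the full sentiment pipeline: Embedding -> SimpleRNN -> dense layers.
from tensorflow import keras
from tensorflow.keras import layers

V, T, D = 10000, 100, 50
d1, d2, d3 = 32, 16, 8

model = keras.Sequential([
    keras.Input(shape=(T,)),                      # (N, T) integer sequences
    layers.Embedding(input_dim=V, output_dim=D),  # (N, T, D)
    layers.SimpleRNN(d1),                         # (N, d1)
    layers.Dense(d2, activation="relu"),          # (N, d2)
    layers.Dense(d3, activation="relu"),          # (N, d3)
    layers.Dense(1, activation="sigmoid")         # (N, 1): probability of positive sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```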
Part 3 : Recurrent Neural Networks
A Simple RNN layer – Part 1 –
• The input: a sequence of length T, composed of D-dimensional vectors: $x_1, \ldots, x_T$
• The output: a sequence of length T, composed of d-dimensional vectors: $h_1, \ldots, h_T$
• The weights: $W_{xh} \in \mathbb{R}^{D \times d}$, $W_{hh} \in \mathbb{R}^{d \times d}$ and $b_h \in \mathbb{R}^d$
[Figure: the unrolled RNN chain: starting from $h_0$, each cell combines the previous state (through $W_{hh}$) and the current input (through $W_{xh}$) to produce $h_1, h_2, \ldots, h_T$]
• The Transformation:
$$h_1 = \tanh(W_{hh}^T h_0 + W_{xh}^T x_1 + b_h)$$
$$h_2 = \tanh(W_{hh}^T h_1 + W_{xh}^T x_2 + b_h)$$
$$\vdots$$
$$h_T = \tanh(W_{hh}^T h_{T-1} + W_{xh}^T x_T + b_h)$$
In other words: $\forall t \in \{1, \ldots, T\}, \quad h_t = \tanh(W_{hh}^T h_{t-1} + W_{xh}^T x_t + b_h)$
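To make the recurrence tangible, here is a minimal numpy sketch of the forward pass (dimensions and random initialisation are illustrative):

```python
# Direct transcription of h_t = tanh(W_hh^T h_{t-1} + W_xh^T x_t + b_h),
# following the slide's shape convention W_xh in R^{D x d}, W_hh in R^{d x d}.
import numpy as np

T, D, d = 5, 4, 3
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))      # one input sequence x_1, ..., x_T
W_xh = rng.standard_normal((D, d))
W_hh = rng.standard_normal((d, d))
b_h = np.zeros(d)

h = np.zeros(d)                      # h_0
states = []
for t in range(T):
    h = np.tanh(W_hh.T @ h + W_xh.T @ x[t] + b_h)   # one recurrence step
    states.append(h)
print(np.stack(states).shape)        # (T, d): the output sequence h_1, ..., h_T
```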
Interactive Session
A Simple RNN layer – Part 2–
• As usual, we use the Gradient Descent Algorithm to update the weights.
[Figure: the unrolled RNN chain from $x_1$ to $x_T$: gradients must flow back through every intermediate state $h_{T-1}, \ldots, h_1$ to reach the early inputs]
• Unfortunately, simple RNNs aren't capable of learning « long term dependencies », and it's hard to make $h_T$ influenced by $x_1, x_2, x_3$ due to the vanishing gradient problem, as explained in:
• [Hochreiter 1991]: Untersuchungen zu dynamischen neuronalen Netzen
• [Bengio et al. 1994]: Learning Long-Term Dependencies with Gradient Descent is Difficult
• [Pascanu et al. 2013]: On the difficulty of training recurrent neural networks
LSTM networks – Part 1 –
• Long Short Term Memory networks (LSTMs) are a special type of RNN explicitly designed to avoid the long term dependency problem.
• Like standard RNNs, LSTMs have the form of a chain of repeating modules of neural networks.
• Unlike standard RNNs, the repeating module contains four neural network layers, interacting in a special way.
[Figure: the unrolled LSTM chain from $x_1$ to $x_T$; each module receives the previous cell state $C_{t-1}$ and hidden state $h_{t-1}$ together with the input $x_t$, and emits $C_t$ and $h_t$. Source: colah's blog]
LSTM networks – Part 2 –
• There are two main concepts in LSTMs:
1. The cell state:
• It's represented by the sequence $C_t$ for $t \in \{1, \ldots, T\}$.
• The cell state represents the memory.
• At each step $t-1 \to t$, the LSTM will remove some information from the cell state and add some other information, using the concept of gates.
• The LSTM has 3 gates to protect the cell state.
2. The gates:
• A gate is a way to control the amount of information we want to keep or change.
• It's composed of a sigmoid neural network layer and an element-wise multiplication, as illustrated below.
• For each dimension, the sigmoid layer outputs a value between zero and one, describing how much we want to let through: « close to zero » means « let nothing through! » and « close to one » means « let everything through! ».
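A tiny numerical illustration of the gating idea (the numbers are made up): a sigmoid output acts as a soft, per-dimension mask on a vector.

```python
# Element-wise gating: each sigmoid value scales the corresponding dimension.
import numpy as np

gate = np.array([0.01, 0.5, 0.99])   # sigmoid outputs: ~closed, half-open, ~open
state = np.array([4.0, 4.0, 4.0])
print(gate * state)                   # [0.04 2.   3.96]: per-dimension filtering
```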
LSTM networks – Part 3 –
• To transition from t-1 to t using LSTMs, there are 4 steps:
• STEP 1: The forget gate layer:
• This step concerns the information we want to throw away from the cell state.
• For that, we use the forget gate layer: a sigmoid applied to the concatenation of $h_{t-1}$ and $x_t$ outputs the forget vector $f_t$:
$$f_t = \sigma(W_f^T [h_{t-1}, x_t] + b_f)$$
• The parameters are: $(W_f, b_f)$
• STEP 2: The input gate layer:
• Conversely, this step concerns the information we want to store in the cell state.
• For that, we use the input gate layer: again, a sigmoid applied to the concatenation of $h_{t-1}$ and $x_t$ outputs the input vector $i_t$:
$$i_t = \sigma(W_i^T [h_{t-1}, x_t] + b_i)$$
• Then, we create a new candidate for the cell state, $\tilde{C}_t$, thanks to a tanh layer:
$$\tilde{C}_t = \tanh(W_C^T [h_{t-1}, x_t] + b_C)$$
• The parameters are: $(W_i, b_i)$ and $(W_C, b_C)$
LSTM networks – Part 4 –
• STEP 3: Updating the cell state:
• We know from the previous steps that by multiplying (element-wise) the forget vector $f_t$ by the old cell state $C_{t-1}$, we obtain the first part of the updated cell state, $f_t * C_{t-1}$, corresponding to what we want to forget.
• We also know that $i_t * \tilde{C}_t$ represents the new candidate vector, scaled by how much we want to update each dimension of the cell state:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
• STEP 4: The output gate layer:
• We want to output a filtered version of the updated cell state.
• For that, we use the output gate layer to decide what dimensions of the cell state we want to output: a simple sigmoid function applied to the concatenation of $h_{t-1}$ and $x_t$ gives the output vector $o_t$.
• The output at time t is just the element-wise multiplication of the output vector and the updated cell state (after a tanh transformation):
$$o_t = \sigma(W_o^T [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$
• The parameters are: $(W_o, b_o)$
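As a sanity check, here is a minimal numpy sketch of one LSTM step implementing the four steps above (the helper name `lstm_step` and the dimensions are our own illustrative choices; the weights act on the concatenation $[h_{t-1}, x_t]$, as in the slides):

```python
# One LSTM transition t-1 -> t, following the slide's equations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f.T @ z + b_f)         # STEP 1: forget gate
    i_t = sigmoid(W_i.T @ z + b_i)         # STEP 2: input gate ...
    C_tilde = np.tanh(W_C.T @ z + b_C)     # ... and new candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde     # STEP 3: update the cell state
    o_t = sigmoid(W_o.T @ z + b_o)         # STEP 4: output gate
    h_t = o_t * np.tanh(C_t)               # filtered version of the cell state
    return h_t, C_t

# Illustrative usage with random weights of shape (d + D, d):
D, d = 4, 3
rng = np.random.default_rng(1)
W = lambda: rng.standard_normal((d + D, d))
h, C = lstm_step(rng.standard_normal(D), np.zeros(d), np.zeros(d),
                 W(), np.zeros(d), W(), np.zeros(d), W(), np.zeros(d), W(), np.zeros(d))
print(h.shape, C.shape)  # (3,) (3,)
```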
LSTM networks – Part 5 –
• Summary:
[Figure: three consecutive LSTM modules at times t-1, t, t+1, passing the cell state $C_t$ and hidden state $h_t$ along the chain]
Equations of the LSTM
The gates:
$$f_t = \sigma(W_f^T [h_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i^T [h_{t-1}, x_t] + b_i)$$
$$o_t = \sigma(W_o^T [h_{t-1}, x_t] + b_o)$$
The updates:
$$\tilde{C}_t = \tanh(W_C^T [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
$$h_t = o_t * \tanh(C_t)$$
Part 4 : Programming Session
Programming Session
Go to the following link and take Quiz 7: https://mlfbg.github.io/MachineLearningInFinance/