Machine Learning in Finance
Lecture 4
Markov Models & Hidden Markov Models
Arnaud de Servigny & Hachem Madmoun
Outline:
• Introduction: Position of the problem
• Markov Models
• Hidden Markov Models
• Programming Session
Imperial College Business School Imperial means Intelligent Business 2
Part 1 : Introducing the problem
Markov models in finance:
• When the i.i.d. assumption is relaxed, sequentiality starts to matter! Forget the Brownian motion.
• In this family of models, the main assumption is that we can find a stable direct or indirect link between what happened yesterday and what happens today.
• The above assumption is quite simple. Maybe too simple, but such models are parsimonious on data. This is good when the training set is limited in size.
• Markov models represented a significant improvement over traditional ARMA/ARIMA models, as they can treat different pieces of information at once (e.g. return & volatility).
• This approach accounts particularly well for regimes. In finance, growth and recession regimes tend to be quite well separated.
• The technique was first developed by Hamilton in 1989, but a lot of innovations/refinements have been brought to the table since then. It is therefore not old science!
Markov models apply to many different Sequential Problems
• Sequences are everywhere:
  • Financial data
  • Speech data
  • Texts
• Ignoring sequentiality in the data results in sub-optimal models, since a huge amount of information is contained in the order.
• For instance, let's consider this example:
  • « Never quit. Do your best. »
  • « Never do your best. Quit. »
• A bag of words* model won't make any difference between these two sentences.
Bag of words*: a model that treats a sentence as a bag of its words, disregarding word order (but keeping multiplicity).
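As a minimal sketch (not part of the lecture's own code), the two sentences above collapse to the exact same bag-of-words representation:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    # Lower-case, strip sentence punctuation, split on whitespace:
    # word order is discarded, multiplicity is kept.
    return Counter(w.strip(".").lower() for w in sentence.split())

s1 = "Never quit. Do your best."
s2 = "Never do your best. Quit."

# Same words, same counts: the two representations are identical,
# even though the sentences mean opposite things.
print(bag_of_words(s1) == bag_of_words(s2))  # True
```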
What is our objective today ?
• We will not use Markov models on quantitative data, although this would be possible, but rather on qualitative data. Applications are numerous:
  o Identifying the mood reflected in a sentence.
  o Relating a sentence or a text to a type of individuals.
  o Predicting the word or the sentence to come.
  o Identifying textual anomalies.
• Ultimately, the objective of this lecture is to introduce a graphical model, called the Hidden Markov Model (HMM), in the context of Language Modelling.
• A Language Model is a model used to predict the probability of a sequence of words.
• We want our language model to assign a probability to every sentence according to its level of coherence (with respect to the training data). For instance:
  • « The quick brown fox jumps over the lazy dog » : high probability
  • « Despite the constant negative press covfefe » : average probability
  • « The the the the the the » : low probability
What do we need to calculate these probabilities ?
• First, we need a model, which consists in making some assumptions on the dependencies between words.
• We also need a training corpus: a large number of sentences (called documents).
• For instance, after training a model on a corpus of documents from John Stuart Mill, the model should assign high probabilities to sentences that correspond to JSM's way of thinking and much lower probabilities to those that contradict his ideas:
  • p(« the greatest good for the greatest number ») >> p(« An action, to have moral worth, must be done from duty »)
Common day-to-day applications:
• Language Models can be used for several applications, such as the following:
  • Sentence Completion: a language model can predict the next word given the previous words in a sentence.
  • Classification: using Bayes rule, we can turn a language model into a generative classifier. For instance, it can be used for sentiment analysis, which consists in predicting the sentiment conveyed by a sentence.
  • Automatic essay writing: a language model learns the probability distribution of sentences according to a specific corpus. We can then sample from this distribution to generate artificial sentences.
What is a Generative Classifier in the context of a Sequential Approach? (I)
• Let's consider a training dataset of N samples (typically sentences) used for a classification task. Let's denote the samples (X_i)_{1≤i≤N} and the targets (Y_i)_{1≤i≤N} (typically mood levels). Suppose ∀i, Y_i ∈ {0, 1}.
• We assume that each sample is made of sequential constituents (typically words):
  ∀ 1≤i≤N : X_i = X_i^1, …, X_i^T
• Supervised classification consists in estimating p_θ(Y_i | X_i), where θ represents the parameters of the supervised learning algorithm we are looking to estimate.
• A generative model like the HMM aims to learn the distribution of the training set of sequences:
  p_θ(X_i) = p_θ(X_i^1, …, X_i^T)
• We can train 2 different models. The first model, parameterized by θ_0, is associated to the samples of target 0, and the second model, parameterized by θ_1, is associated to the samples of target 1. As a consequence, we will learn the two following distributions:
  p_{θ0}(X_i^1, …, X_i^T | Y_i = 0) and p_{θ1}(X_i^1, …, X_i^T | Y_i = 1)
• So far, we have built a class-conditional density estimator.
What is a Generative Classifier in the context of a Sequential Approach? (II)
• Using the Bayes rule:
  ∀ A, B ∈ Ω : P(B|A) = P(A, B) / P(A) = P(A|B) P(B) / P(A)
we can turn the previous class-conditional density estimator into a classifier.
• ∀ 1≤i≤N :
  p(Y_i = 0 | X_i^1, …, X_i^T) = p(X_i^1, …, X_i^T, Y_i = 0) / p(X_i^1, …, X_i^T)
  = p_{θ0}(X_i^1, …, X_i^T | Y_i = 0) p(Y_i = 0) / [ p_{θ0}(X_i^1, …, X_i^T | Y_i = 0) p(Y_i = 0) + p_{θ1}(X_i^1, …, X_i^T | Y_i = 1) p(Y_i = 1) ]
• Similarly, ∀ 1≤i≤N :
  p(Y_i = 1 | X_i^1, …, X_i^T) = p_{θ1}(X_i^1, …, X_i^T | Y_i = 1) p(Y_i = 1) / [ p_{θ0}(X_i^1, …, X_i^T | Y_i = 0) p(Y_i = 0) + p_{θ1}(X_i^1, …, X_i^T | Y_i = 1) p(Y_i = 1) ]
• Where p(Y_i = 0) and p(Y_i = 1) are estimated by their frequencies in the dataset.
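The posterior computation above can be sketched in a few lines; the likelihood and prior values below are hypothetical numbers, chosen only to make the arithmetic visible:

```python
def posterior(lik0: float, lik1: float, prior0: float):
    """Bayes rule: turn class-conditional likelihoods p_theta0(x | Y=0)
    and p_theta1(x | Y=1) into posteriors p(Y=0 | x), p(Y=1 | x)."""
    prior1 = 1.0 - prior0
    evidence = lik0 * prior0 + lik1 * prior1  # p(x), the denominator
    return lik0 * prior0 / evidence, lik1 * prior1 / evidence

# Hypothetical numbers: the sequence is 4x as likely under model 1,
# and both classes are equally frequent in the training set.
p0, p1 = posterior(lik0=1e-9, lik1=4e-9, prior0=0.5)
print(round(p0, 3), round(p1, 3))  # 0.2 0.8
```

Note that the evidence p(x) cancels the scale of the likelihoods: only their ratio and the priors matter.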
Constructing a Generative Classifier. An example (I)
• Let's suppose we have an N-sized training sample composed of documents from John Stuart Mill (J) and from Immanuel Kant (K).
• We want our model to classify new sentences like « The greatest good for the greatest number ».
• Let's review the different steps involved in computing the following probabilities:
  • p(Y = K | « The greatest good for the greatest number »)
  • p(Y = J | « The greatest good for the greatest number »)

Training Sample (document → target):
• Document 1 (J) : « The sole evidence it is possible to produce that anything is desirable is that people do actually desire it. »
• Document 2 (K) : « In law a man is guilty when he violates the rights of others. In ethics he is guilty if he only thinks of doing so. »
• Document 3 (K) : « Always recognize that human individuals are ends, and do not use them as means to your end. »
…
• Document N (J) : « Justice is a name for certain moral requirements, which, regarded collectively, stand higher in the scale of social utility and are therefore of more paramount obligation than any others. »
Constructing a Generative Classifier. An example (II)
STEP 1 : Gather the labelled training documents:
• Document 1 (J) : « The sole evidence it is possible to produce that anything is desirable is that people do actually desire it. »
• Document 2 (K) : « In law a man is guilty when he violates the rights of others. In ethics he is guilty if he only thinks of doing so. »
• Document 3 (K) : « Always recognize that human individuals are ends, and do not use them as means to your end. »
…
• Document N (J) : « Justice is a name for certain moral requirements, which, regarded collectively, stand higher in the scale of social utility and are therefore of more paramount obligation than any others. »

STEP 2 : Train density estimator 1 on the documents with target J and density estimator 2 on the documents with target K.

STEP 3 : For each new sentence, calculate the probability of the target being J or K, conditional on this provided sequence, using the Bayes rule and the two density estimators trained in Step 2.

STEP 4 : Make a decision based on a decision rule: the new sentence has been written by J, as
p(Y = J | « The greatest good for the greatest number ») > p(Y = K | « The greatest good for the greatest number »)
Part 2 : Markov Models
A Markov model relates to a process
=> which is capable of being in more than one state,
=> which can make transitions among those states, and
=> in which the available states and transition probabilities only depend upon what state the system is currently in.
Introduction
• In this section, we will discuss a simple model called the Markov Model.
• Notations:
  • We are dealing with sequences (X_i)_{1≤i≤N} of discrete observations (e.g. sentences).
  • Each observation X_i is a sequence of T discrete states: X_i^1, …, X_i^T (e.g. words).
• Example:
  Observation : « The greatest good for the greatest number »
  X_1 : X_1^1 X_1^2 X_1^3 X_1^4 X_1^5 X_1^6 X_1^7
• Each X_1^t in our example represents a word in the English dictionary.
Defining the Probability of a sequential Observation:
• Using simple probability rules, for any observation i, we can write the probability p(X_i) as a product of conditional probabilities:
  p(X_i^1, …, X_i^T) = p(X_i^T | X_i^j, j < T) p(X_i^{T-1} | X_i^j, j < T-1) … p(X_i^2 | X_i^1) p(X_i^1)
• The basic idea behind the Markov Property is to assume that X_i^t captures all the relevant information to account for the next step. The previous equation therefore simplifies to:
  p(X_i^1, …, X_i^T) = p(X_i^T | X_i^{T-1}) p(X_i^{T-1} | X_i^{T-2}) … p(X_i^2 | X_i^1) p(X_i^1)
• Additional assumptions:
  • We assume that the transition function p(X_i^t | X_i^{t-1}) is independent of time t (the chain is time-homogeneous).
  • We assume that the observed variables X_i^t are discrete: X_i^t ∈ {1, …, V}. In our previous example, each X_1^t represents a unique word among the different words of an English dictionary of size V (the number of words listed in it).
• To be clear, the Markov property means:
  p(The greatest good for the greatest number) = p(The) · p(greatest | The) · p(good | greatest) · p(for | good) · p(the | for) · p(greatest | the) · p(number | greatest)
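This factorization can be sketched directly in code. The probability values below are hypothetical, chosen only so the arithmetic is easy to follow:

```python
def sentence_prob(words, initial, transition):
    """p(w1, ..., wT) = p(w1) * prod_t p(w_t | w_{t-1}),
    under the first-order, time-homogeneous Markov assumption."""
    p = initial.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        # Unseen pairs get probability 0 in this unsmoothed sketch.
        p *= transition.get((prev, cur), 0.0)
    return p

# Hypothetical probabilities for illustration only.
initial = {"the": 0.5}
transition = {("the", "greatest"): 0.4, ("greatest", "good"): 0.3}

p = sentence_prob(["the", "greatest", "good"], initial, transition)
print(round(p, 6))  # 0.06 = 0.5 * 0.4 * 0.3
```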
Summarizing the sequential rules learnt in a Transition Matrix
• The conditional distribution p(X^t | X^{t-1}) inferred from all the observations can be written as a V × V matrix, known as the transition matrix Q, where
  ∀ v, w ∈ {1, …, V} : Q_vw = p(X^t = w | X^{t-1} = v)
• Example: let us consider the following 3 observations, containing 4 states in a dictionary of size V = 4:
  • The greatest good
  • The greatest number
  • The good number
• Indexing: the = 1, greatest = 2, good = 3, number = 4.
• Two times out of three, 'the' is followed by 'greatest', hence Q(1,2) = 2/3:

      ( 0     2/3   1/3   0   )
  Q = ( 0     0     1/2   1/2 )
      ( 0     0     0     1   )
      ( 1/4   1/4   1/4   1/4 )

[Figure: the corresponding state-transition diagram, with edges labeled 2/3, 1/3, 1/2, 1/2 and 1.]

Note that the rows of the transition matrix must add up to one. Such a transition matrix is also called a stochastic matrix.
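The count-and-normalize construction behind this example can be sketched in plain Python. This version applies no smoothing, so a word that never has a successor (here 'number') simply gets no row, unlike the uniform last row shown on the slide:

```python
from collections import defaultdict

observations = [
    ["the", "greatest", "good"],
    ["the", "greatest", "number"],
    ["the", "good", "number"],
]

# Count every consecutive pair (v, w) across the observations.
counts = defaultdict(lambda: defaultdict(int))
for obs in observations:
    for v, w in zip(obs, obs[1:]):
        counts[v][w] += 1

# Normalize each row so it sums to one (a stochastic matrix).
Q = {v: {w: c / sum(row.values()) for w, c in row.items()}
     for v, row in counts.items()}

print(Q["the"])  # {'greatest': 0.666..., 'good': 0.333...}
```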
Estimating the Transition Matrix from the Training Observations
Training Observations
• The data is of the form:
  X_1 : X_1^1, …, X_1^T
  X_2 : X_2^1, …, X_2^T
  ⋮
  X_N : X_N^1, …, X_N^T
  X = [X_i^t]_{1≤i≤N ; 1≤t≤T} : we call it a tensor of shape (N, T).
• Each X_i^t represents a word in the English dictionary of dimension V, so X_i^t ∈ {1, …, V}.
• Let's introduce the parameters π and Q, where:
  ∀ v ∈ {1, …, V} : π_v = p(X_i^1 = v)
  ∀ v, w ∈ {1, …, V} : Q_vw = p(X_i^{t+1} = w | X_i^t = v)
• Our objective is to find the optimal π* and Q* (i.e. the ones that maximize the likelihood of the data).
Maximum Likelihood Estimation for the Markov Model:
Interactive Session
Estimating the Transition Matrix from Training Data
• Let's recall the parameters we want to optimize:
  ∀ v ∈ {1, …, V} : π_v = p(X_i^1 = v)
  ∀ v, w ∈ {1, …, V} : Q_vw = p(X_i^{t+1} = w | X_i^t = v)
• We introduce, for all v, w ∈ {1, …, V}, the following counts:
  C_v^1 = Σ_{i=1}^{N} 1{X_i^1 = v} : the number of times the word v appears at the beginning of a sentence.
  C_vw = Σ_{i=1}^{N} Σ_{t=1}^{T-1} 1{X_i^t = v} 1{X_i^{t+1} = w} : the number of times we transition from the word v to the word w.
• After a Maximum Likelihood estimation, we find the following optimal (and very intuitive) parameters:
  π_v* = C_v^1 / Σ_{v'=1}^{V} C_{v'}^1
  Q_vw* = C_vw / Σ_{w'=1}^{V} C_vw'
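These count formulas can be sketched with NumPy on the earlier toy corpus. The word encoding below is an assumption for illustration (0-based indices rather than the slide's 1-based ones: 0 = 'the', 1 = 'greatest', 2 = 'good', 3 = 'number'):

```python
import numpy as np

V = 4
X = np.array([[0, 1, 2],   # the greatest good
              [0, 1, 3],   # the greatest number
              [0, 2, 3]])  # the good number

# C1[v]: how often word v starts a sentence; C[v, w]: transition counts.
C1 = np.bincount(X[:, 0], minlength=V)
C = np.zeros((V, V))
for i in range(X.shape[0]):
    for t in range(X.shape[1] - 1):
        C[X[i, t], X[i, t + 1]] += 1

pi = C1 / C1.sum()
with np.errstate(invalid="ignore"):
    # Rows with no outgoing counts (here 'number') become nan;
    # the slide's uniform last row is one way to handle them instead.
    Q = C / C.sum(axis=1, keepdims=True)

print(pi)    # [1. 0. 0. 0.], every sentence starts with 'the'
print(Q[0])  # [0, 2/3, 1/3, 0] up to rounding
```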
Limits of a Model driven by a Markov Process
• The Markov Model has a lot of limitations:
  • A Memory Structure limit: first, the Markov Property corresponds to a very strong assumption: making each word depend only on the previous one is not realistic.
  • A Dimensionality limit: if we have V ≈ 100,000 words in our dictionary, then a Markov Model will have about 10^10 parameters to train, corresponding to all the possible pairs.
  • A Missing Value limit: it is also very unlikely to find all these pairs in the training dataset.
How could we more realistically use the Markov Approach?
• The next section will introduce a more subtle approach using the Markov scheme. Rather than having the states (the words) in an observation depend on their previous neighbour, we introduce a latent variable (the grammar) which will drive each state and which in turn will follow the Markov property. This is the reason why the word "Hidden" starts to be used, as the latent variable (the grammar) has to be discovered by the model.
• A Hidden Markov Model (HMM) is a powerful discrete-time / discrete-state generative model which aims to solve the issues of the Markov Model via the concept of latent variables. A common metaphor is to think of the HMM as if the Markov Model were a mechanism hidden behind a curtain. The observer doesn't have any knowledge of how the mechanism is operating, and only knows what is going on behind the curtain from periodic reports that are emitted by the mechanism.
• In particular, it's worth noticing that the HMM has the advantage of being able to represent long-range dependencies between observations. The Markov property still holds, but not for the observations themselves.
Part 3 : Hidden Markov Models
Introduction:
• The basic idea behind the HMM is the concept of a hidden state H_i responsible for the observation X_i.
• Examples of HMMs where the hidden states have specific meanings:
  • Automatic Speech Recognition: the observation is the speech signal and the hidden state is the text.
  • POS (Part-Of-Speech) Tagging: the observation is the sentence, the hidden state is the syntax.
    Noun    Verb   Det   Adjective   Noun
    Timothy likes  the   friendly    lady
Parameterization: The Hidden States form a Markov Chain
• Let us consider the example of POS tagging:
  Noun    Verb   Det   Adjective   Noun
  Timothy likes  the   friendly    lady
• We have M discrete syntax tags {Noun (N), Verb (V), Determiner (D), …} and V discrete words in the English dictionary.
• The dynamics of the tags will be parameterized by a Markov Model: π ∈ R^M and Q ∈ R^{M×M}.
• The Markov Property makes much more sense in terms of grammar rules. For instance, we expect p(Noun | Det) to be high and p(Det | Det) to be very low.
• Illustration: with the tags {N, V, D}, a possible transition matrix is

          N     V     D
  N   (  .1    .7    .2  )
  V   (  .3    .05   .65 )
  D   (  .9    .05   .05 )

[Figure: the corresponding three-state transition diagram over N, V and D, with edges labeled by these probabilities.]

• Although we often imbue the hidden states with some specific meaning, they don't necessarily have to represent anything tangible.
Parameterization : Obtaining the Observation Matrix
• Each hidden state is responsible for a discrete probability distribution over all the possible words in the English vocabulary. For instance:

  word      Noun   Verb   Det
  a         0.     0.     0.15
  apple     0.12   0.     0.
  ⋮
  think     0.     0.1    0.
  ⋮
  zoo       0.13   0.     0.

• We group all these probability distributions into a matrix called the Observation Matrix, O ∈ R^{V×M}, whose column m is the word distribution emitted by hidden state m.
What needs to be estimated for an HMM to work:
• To sum up, when we are dealing with discrete observations, the Hidden Markov Model is parametrized by the three following elements:
  • The initial distribution over hidden states, π ∈ R^M (e.g. indexed by Noun, Verb, Det).
  • The transition matrix between hidden states, Q ∈ R^{M×M}.
  • The Observation Matrix, O ∈ R^{V×M}, giving the word distribution of each hidden state (rows indexed by the words a, apple, …, think, …, zoo; columns by Noun, Verb, Det).
• Compared to the Markov Model, we have reduced the dimensionality from V + V² to M + M² + M×V.
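Once π, Q and O are fixed, generating from the model is straightforward ancestral sampling: draw a tag chain from π and Q, then emit one word per tag from O. Below is a sketch; the tag set, the tiny vocabulary, π and O are made up for illustration (Q reuses the N/V/D example above), and O is stored with one row per tag, transposed relative to the slide's V×M layout:

```python
import numpy as np

rng = np.random.default_rng(0)

tags = ["N", "V", "D"]
words = ["timothy", "likes", "the", "lady"]

pi = np.array([0.6, 0.1, 0.3])        # initial tag distribution (assumed)
Q = np.array([[0.1, 0.7, 0.2],        # tag -> tag transitions
              [0.3, 0.05, 0.65],
              [0.9, 0.05, 0.05]])
O = np.array([[0.5, 0.0, 0.0, 0.5],   # word distribution of each tag (one row per tag)
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])

def sample_sentence(T: int):
    """Ancestral sampling: draw the hidden tag chain, emit one word per tag."""
    h = rng.choice(len(tags), p=pi)
    hidden, sent = [], []
    for _ in range(T):
        hidden.append(tags[h])
        sent.append(words[rng.choice(len(words), p=O[h])])
        h = rng.choice(len(tags), p=Q[h])
    return hidden, sent

hidden, sent = sample_sentence(5)
print(hidden, sent)
```

Sampling repeatedly from a trained model is exactly the "automatic essay writing" use case mentioned earlier.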
Maximum Likelihood Estimation for the HMM:
Interactive Session
The limit and interest of the HMM approach
• By looking at a more abstract and synthetic element, the hidden variable, we acknowledge two things:
  o Most of the time, it will be difficult to put a name on, and identify the meaning behind, this variable.
  o We have to lose part of the information in order to reduce dimensionality and handle a tractable model.
• In the previous model related to a sentence, we have for instance given up identifying the precise word to come. The only things we will be able to say are that its most likely syntax is identifiable and that, within such a syntax, the most likely word is also identifiable. So we go for two levels of guesses.
• The HMM is therefore certainly not the best class of model to obtain a precise prediction. It carries the very useful advantage of inventing classes, whose definition is not obvious, but as long as we find these classes generically informative, we have learnt something. In addition, provided we have enough data, we can learn from the HMM the optimal number of classes to consider. As a result it may prove a very effective clustering tool.
• In total, the HMM should be considered as a data-parsimonious semi-supervised model from which we can identify a priori unknown clusters.
Go to the following link and take Quiz 5 : https://mlfbg.github.io/MachineLearningInFinance/
Programming Session