Machine Learning in Finance
Lecture 6
Practical Implementation : Word Vectors
Arnaud de Servigny & Hachem Madmoun
Outline:
• Introducing the Problem
• Word Embedding Methods
  • The GloVe approach (Coursework)
  • The Word2vec approach
• Programming Session : Implementation of the Word2vec approach
Part 1 : Introducing the problem
Why do we need vectors to represent words ?
• We are dealing with data in the form of a corpus of sentences, and we want to perform a classification task, for instance.
• We obviously can't feed words to a model. A model can only handle numbers.
• The question is : how do we represent the words of our corpus in a way that can be fed into a Machine Learning algorithm ?
• It's clearly an Unsupervised Learning task.
DATA → Model ?
• Document 1 : « The sole evidence it is possible to produce that anything is desirable is that people do actually desire it. »
• Document 2 : « In law a man is guilty when he violates the rights of others. In ethics he is guilty if he only thinks of doing so. »
• Document 3 : « Always recognize that human individuals are ends, and do not use them as means to your end. »
…
• Document N : « Justice is a name for certain moral requirements, which, regarded collectively, stand higher in the scale of social utility and are therefore of more paramount obligation than any others. »
Review : Words as discrete symbols:
• What we have seen so far (in Lecture 5) is the possibility to turn each word into a discrete symbol.
• For that, we create a dictionary mapping each word present in our corpus to a unique discrete index.
• Applied to the same DATA (Documents 1 to N) as above, this gives :
Code :
    word_index = {
        'the' : 1,
        'sole' : 2,
        'evidence' : 3,
        ...
        'any' : 934233,
    }
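As a rough illustration, here is a minimal sketch of how such a dictionary could be built (the corpus strings, the lower-casing and the naive whitespace tokenization are all assumptions for illustration, not necessarily the lecture's actual pre-processing) :

    corpus = [
        "The sole evidence it is possible to produce ...",
        "In law a man is guilty when he violates the rights of others ...",
    ]  # the N documents, as plain strings (truncated here)

    word_index = {}
    for document in corpus:
        for word in document.lower().split():            # naive tokenization
            if word not in word_index:
                word_index[word] = len(word_index) + 1   # indices start at 1, as above

    print(word_index)  # e.g. {'the': 1, 'sole': 2, 'evidence': 3, ...}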
From discrete symbols to one hot vectors:
• After this first pre-processing step, we end up with the following lists of integers representing the words (Corpus → discretize via word_index) :
  • Document 1 : [23, 43, 12, …, 2343, 1]
  • Document 2 : [12, 1, 23453, …, 123]
  …
  • Document n : [1234, 1, 23]
  …
  • Document N : [1, 1232, …, 12322]
• Instead of representing a word by its index in the word_index dictionary, it is strictly equivalent to represent it as a vector of size V (where V is the size of the vocabulary, i.e. the number of distinct words in the whole corpus), with a 1 in the index position and zeros in all the other positions.
• Example : let's suppose the word « equity » has index 134 and V = 100000.
• Then, the word « equity » will be represented by the following vector of size V :
    [0, …, 0, 1, 0, …, 0]
              ↑
         position 134
• We call this vector a one hot vector.
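A minimal sketch of this encoding (the helper name one_hot is hypothetical, and positions are counted from 0 here for illustration) :

    import numpy as np

    def one_hot(index, V):
        """Return the one-hot vector of size V with a 1 at the given position."""
        vec = np.zeros(V)
        vec[index] = 1.0
        return vec

    V = 100000
    w_equity = one_hot(134, V)   # 1 at position 134, zeros everywhere else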
Limitations of one hot vectors:
• The one-hot vector is the easiest way to represent words as vectors.
• In this type of encoding, each word is a completely independent entity, and there isn't any notion of similarity between words, even if they have the same meaning.
• One way of measuring the similarity between two vectors is to use the dot product.
• Normalizing the dot product by the norms of the two vectors gives the cosine similarity (where θ is the angle between the two vectors) :

$$\mathrm{similarity}(w_{\text{statistics}}, w_{\text{learning}}) = \cos(\theta) = \frac{\langle w_{\text{statistics}}, w_{\text{learning}} \rangle}{\lVert w_{\text{statistics}} \rVert \, \lVert w_{\text{learning}} \rVert}$$

• In this case, the two words « statistics » and « learning » will have a similarity of zero, even though they are related to each other.
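A quick numerical check of this (the indices 10 and 20 chosen for « statistics » and « learning » are hypothetical) :

    import numpy as np

    V = 100000
    w_statistics = np.zeros(V); w_statistics[10] = 1.0   # hypothetical index 10
    w_learning   = np.zeros(V); w_learning[20]  = 1.0    # hypothetical index 20

    # Cosine similarity : dot product normalized by the two norms.
    cos = w_statistics @ w_learning / (np.linalg.norm(w_statistics) * np.linalg.norm(w_learning))
    print(cos)  # 0.0 : any two distinct one-hot vectors are orthogonal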
Part 2 : Word Embedding Methods
Creating Word Embeddings
• We need to find a subspace that encodes the relationships between words.
• As there are millions of tokens in any language, we can try to reduce the size of this space from R^V (where V is the vocabulary size) to some D-dimensional space (such that D ≪ V) that is sufficient to encode all the semantics of the language.
• Each dimension would encode some meaning (such as tense, count, gender, ...).
• We are going to introduce 2 approaches :
  • GloVe (Global Vectors for Word Representation, Pennington, Socher and Manning, 2014).
  • The Word2vec approach : introduced by Mikolov, Sutskever, et al. (2013).
• Both algorithms take their inspiration from an English linguist, named John Rupert Firth, known for his famous quotation : « You shall know a word by the company it keeps. »
The GloVe approach – Introduction –
• GloVe (Global Vectors for Word Representation) is an unsupervised algorithm, developed at the Stanford NLP lab, that learns embedding vectors from word-word co-occurrence statistics.
• The GloVe algorithm consists in applying Matrix Factorization methods to a matrix summarizing the co-occurrence statistics of the corpus.
• The entry X_ij of the matrix of co-occurrence counts X represents the number of times the word j occurs in the context of word i, which suggests the definition of a context size (or window size).
• Example with i = index of the word « enjoyed » and j = index of the word « I » : with a window of size 1, the count X_ij is incremented twice, once per sentence below.
    « I enjoyed the research project. » : « I » falls in the window of size 1 around « enjoyed »  →  X_ij += 1
    « I enjoyed NLP. »                  : « I » falls in the window of size 1 around « enjoyed »  →  X_ij += 1
The GloVe approach – The co-occurrence matrix –
• Let us create the co-occurrence matrix on a simple corpus composed of 3 documents and a vocabulary of 10 tokens (note that the period « . » counts as a token), so V = 10 and the co-occurrence matrix is of shape (V, V) = (10, 10).
• The corpus :
  • Document 1 : I enjoyed the research project.
  • Document 2 : I like Deep Learning.
  • Document 3 : I enjoyed NLP.
• The final co-occurrence matrix (window size = 1) :

                   I  enjoyed  the  research  project  like  Deep  Learning  NLP   .
        I          0     2      0      0         0       1     0       0      0    0
        enjoyed    2     0      1      0         0       0     0       0      1    0
        the        0     1      0      1         0       0     0       0      0    0
        research   0     0      1      0         1       0     0       0      0    0
    X = project    0     0      0      1         0       0     0       0      0    1
        like       1     0      0      0         0       0     1       0      0    0
        Deep       0     0      0      0         0       1     0       1      0    0
        Learning   0     0      0      0         0       0     1       0      0    1
        NLP        0     1      0      0         0       0     0       0      0    1
        .          0     0      0      0         1       0     0       1      1    0
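A minimal sketch reproducing this matrix (assuming whitespace-tokenized documents, with the period kept as a separate token as on this slide) :

    import numpy as np

    documents = [
        "I enjoyed the research project .",
        "I like Deep Learning .",
        "I enjoyed NLP .",
    ]

    # Vocabulary in the order used on the slide.
    vocab = ["I", "enjoyed", "the", "research", "project",
             "like", "Deep", "Learning", "NLP", "."]
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    window_size = 1
    X = np.zeros((V, V), dtype=int)
    for doc in documents:
        tokens = doc.split()
        for pos, word in enumerate(tokens):
            # every token within `window_size` of `word` is a context word
            lo, hi = max(0, pos - window_size), min(len(tokens), pos + window_size + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    X[idx[word], idx[tokens[ctx_pos]]] += 1

    print(X)   # reproduces the matrix above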
The GloVe approach – SVD based methods –
• To create embedding vectors from the co-occurrence matrix, one approach can be to use a Singular Value Decomposition (SVD) of the co-occurrence matrix :

$$\underbrace{X}_{(V \times V)} = \underbrace{W_1}_{(V \times V)} \; \underbrace{\Omega}_{(V \times V)} \; \underbrace{W_2^T}_{(V \times V)}$$

• Then, we reduce the dimensionality by selecting the first D singular vectors (with D ≪ V).
• Let ω_1, …, ω_V be the singular values (the diagonal entries of Ω), such that ω_1 ≥ ω_2 ≥ … ≥ ω_V.
• We select D so that we can capture the desired amount of variance :

$$\frac{\sum_{i=1}^{D} \omega_i}{\sum_{i=1}^{V} \omega_i}$$
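A minimal sketch of the SVD-based method with numpy (reusing the matrix X built above; the 90% variance threshold and the scaling of the singular vectors by the singular values are illustrative conventions, not prescribed by the slides) :

    import numpy as np

    # SVD of the co-occurrence matrix : X = W1 @ diag(omega) @ W2T,
    # with the singular values omega returned in decreasing order.
    W1, omega, W2T = np.linalg.svd(X.astype(float))

    # Smallest D capturing at least 90% of the total singular-value mass.
    captured = np.cumsum(omega) / np.sum(omega)
    D = int(np.searchsorted(captured, 0.90)) + 1

    # One D-dimensional embedding per word : the first D left singular
    # vectors, scaled by the corresponding singular values.
    embeddings = W1[:, :D] * omega[:D]       # shape (V, D)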
The GloVe approach – Matrix Factorization instead of SVD –
• The SVD approach does not work well in practice, for several reasons :
  • The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).
  • The matrix is extremely sparse (i.e. it contains a lot of zero values), since most words do not co-occur.
  • The matrix is very high dimensional, as the vocabulary size is usually huge.
• We are going to introduce another way of performing the factorization : Matrix Factorization methods are widely used for generating meaningful and low-dimensional word representations.
• In the GloVe approach, since the non-zero values can be very large, we factorize the logarithm of X (denoted log X) instead of factorizing X.
• Remark : obviously, as we can't apply the logarithm function to the entries with a zero value, we add 1 to all the elements of the matrix before applying the logarithm :

$$\forall (i,j) \in V^2, \quad X_{ij} \leftarrow X_{ij} + 1$$

• We want to factorize log X into 2 matrices :

$$\log X \approx W \tilde{W}^T$$

• We want to estimate W, W̃ ∈ R^{V×D}, with D ≪ V.
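A minimal sketch of this factorization (reusing X from the co-occurrence sketch above; plain gradient descent on the squared reconstruction error, with illustrative choices for D, the learning rate and the number of steps; note that the full GloVe objective also involves a weighting function and bias terms, omitted here) :

    import numpy as np

    rng = np.random.default_rng(0)
    V, D = X.shape[0], 2                 # tiny D for the toy corpus
    logX = np.log(X + 1.0)               # add 1 everywhere, then take the log

    # Random initialization of W and W_tilde in R^{V x D}.
    W       = 0.1 * rng.standard_normal((V, D))
    W_tilde = 0.1 * rng.standard_normal((V, D))

    learning_rate = 0.05
    for step in range(2000):
        R = W @ W_tilde.T - logX         # residual of log X ≈ W W̃^T
        # Gradient step on the squared Frobenius norm of the residual.
        W, W_tilde = (W - learning_rate * (R @ W_tilde),
                      W_tilde - learning_rate * (R.T @ W))

    print(np.abs(W @ W_tilde.T - logX).max())   # final reconstruction error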