
Machine Learning in Finance
Lecture 6
Practical Implementation: Word Vectors
Arnaud de Servigny & Hachem Madmoun

Outline:
• Introducing the Problem
• Word Embedding Methods
• The GloVe approach (Coursework)
• The Word2vec approach
• Programming Session: Implementation of the Word2vec approach

Part 1: Introducing the problem

Why do we need vectors to represent words?
• We are dealing with data in the form of a corpus of sentences, and we want to perform a classification task, for instance.
• We obviously can't feed words to a model. A model can only handle numbers.
• The question is: how do we represent the words of our corpus in a way that can be fed into a Machine Learning algorithm?
• It's clearly an Unsupervised Learning task.

DATA → Model ?
• Document 1 : « The sole evidence it is possible to produce that anything is desirable is that people do actually desire it. »
• Document 2 : « In law a man is guilty when he violates the rights of others. In ethics he is guilty if he only thinks of doing so. »
• Document 3 : « Always recognize that human individuals are ends, and do not use them as means to your end. »

• Document N : « Justice is a name for certain moral requirements, which, regarded collectively, stand higher in the scale of social utility and are therefore of more paramount obligation than any others. »

Review: Words as discrete symbols
• What we have seen so far (in Lecture 5) is the possibility to turn each word into a discrete symbol.
• For that, we create a dictionary to map each word present in our corpus to a unique discrete index.
Code:

word_index = {
    'the':      1,
    'sole':     2,
    'evidence': 3,
    ...
    'any':      934233
}
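As an illustration, here is a minimal sketch of how such a dictionary could be built (the small corpus list and the simple regex tokenizer are assumptions for the example; the actual pre-processing pipeline was covered in Lecture 5):

import re

corpus = [
    "The sole evidence it is possible to produce that anything is desirable is that people do actually desire it.",
    "Always recognize that human individuals are ends, and do not use them as means to your end.",
]

word_index = {}
for document in corpus:
    # Lowercase, then split into word and punctuation tokens.
    for token in re.findall(r"\w+|[^\w\s]", document.lower()):
        if token not in word_index:
            word_index[token] = len(word_index) + 1  # indices start at 1

print(word_index['the'])  # -> 1 ('the' is the first token encountered)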

From discrete symbols to one-hot vectors:
• After the first pre-processing step, we end up with the following lists of integers representing the words:

Corpus → Discretize via word_index:
• Document 1 : [23, 43, 12, …, 2343, 1]
• Document 2 : [12, 1, 23453, …, 123]
…
• Document n : [1234, 1, 23]
…
• Document N : [1, 1232, …, 12322]

• Instead of representing a word by its index in the word_index dictionary, it is strictly equivalent to represent it as a vector of size V (where V is the size of the vocabulary, i.e. the number of distinct words in the whole corpus), with a 1 in the index position and zeros in all the other positions.
• Example: let's suppose the word « equity » has index 134 and V = 100000.
• Then, the word « equity » will be represented by the following vector of size V:

[0, …, 0, 1, 0, …, 0]
         ↑ position 134

• We call this vector a one-hot vector.
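A minimal sketch of this one-hot encoding (it assumes the word_index dictionary built in the previous sketch, with indices starting at 1):

import numpy as np

def one_hot(word, word_index, V):
    """Return the one-hot vector of size V for the given word."""
    vector = np.zeros(V)
    vector[word_index[word] - 1] = 1.0  # indices start at 1, arrays at 0
    return vector

# e.g. if word_index['equity'] == 134 and V == 100000,
# one_hot('equity', word_index, 100000) has a single 1 at the 134th position.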

Limitations of one-hot vectors:
• The one-hot vector is the easiest way to represent words as vectors.
• In this type of encoding, each word is a completely independent entity, and there isn't any notion of similarity between words, even if they have the same meaning.
• One way of measuring the similarity between two vectors is to use the dot product.
• For unit-norm vectors such as one-hot vectors, the dot product coincides with the cosine similarity:

$\mathrm{similarity}(w_{\text{statistics}}, w_{\text{learning}}) = \cos(\theta) = \dfrac{\langle w_{\text{statistics}},\, w_{\text{learning}} \rangle}{\|w_{\text{statistics}}\| \, \|w_{\text{learning}}\|} = (w_{\text{statistics}})^T \, w_{\text{learning}} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \end{bmatrix} \begin{bmatrix} 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix} = 0$

• In this case, the two words « statistics » and « learning » have a similarity of zero, even though they are related to each other.
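A quick numerical check of this orthogonality (the vector size and the two positions used here are hypothetical):

import numpy as np

V = 100000
w_statistics = np.zeros(V)
w_statistics[1] = 1.0  # hypothetical position of « statistics »
w_learning = np.zeros(V)
w_learning[2] = 1.0    # hypothetical position of « learning »

# Two distinct one-hot vectors are always orthogonal:
print(w_statistics @ w_learning)  # -> 0.0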

Part 2: Word Embedding Methods

Creating Word Embeddings
• We need to find a subspace that encodes the relationships between words.
• As there are millions of tokens in any language, we can try to reduce the size of this space from R^V (where V is the vocabulary size) to some D-dimensional space (such that D << V) that is sufficient to encode all the semantics of the language.
• Each dimension would encode some meaning (such as tense, count, gender, ...).
• We are going to introduce 2 approaches:
  • GloVe (Global Vectors for Word Representation, Pennington, Socher and Manning, 2014).
  • The Word2vec approach, introduced by Mikolov, Sutskever, et al. (2013).
• Both algorithms take their inspiration from an English linguist named John Rupert Firth, known for his famous quotation: « You shall know a word by the company it keeps. »

The GloVe approach – Introduction –
• GloVe (Global Vectors for Word Representation) is an unsupervised algorithm, developed at the Stanford NLP lab, that learns embedding vectors from word-word co-occurrence statistics.
• The GloVe algorithm consists in applying matrix factorization methods to a matrix summarizing the co-occurrence statistics of the corpus.
• The entry X_ij of the matrix of co-occurrence counts X represents the number of times word j occurs in the context of word i, which suggests the definition of a context size (or window size).
• Example with i = index of the word « enjoyed » and a window size of 1: « enjoyed » occurs in « I enjoyed the research project . » and in « I enjoyed NLP . », and the word « I » falls within its window both times, so for j = index of « I » we apply X_ij += 1 twice.

The GloVe approach – The co-occurrence matrix –
• Let us create the co-occurrence matrix on a simple corpus composed of 3 documents and a vocabulary size of 10 tokens: V = 10. So, the co-occurrence matrix is of shape (V, V).
• The corpus:
  • Document 1: I enjoyed the research project .
  • Document 2: I like Deep Learning .
  • Document 3: I enjoyed NLP .
• The final co-occurrence matrix (window size = 1):

                 I  enjoyed  the  research  project  like  Deep  Learning  NLP  .
    I            0     2      0      0         0       1     0       0      0   0
    enjoyed      2     0      1      0         0       0     0       0      1   0
    the          0     1      0      1         0       0     0       0      0   0
    research     0     0      1      0         1       0     0       0      0   0
X = project      0     0      0      1         0       0     0       0      0   1
    like         1     0      0      0         0       0     1       0      0   0
    Deep         0     0      0      0         0       1     0       1      0   0
    Learning     0     0      0      0         0       0     1       0      0   1
    NLP          0     1      0      0         0       0     0       0      0   1
    .            0     0      0      0         1       0     0       1      1   0
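A minimal sketch reproducing this matrix (assuming the documents are tokenized by simple whitespace splitting and using a symmetric window of size 1; the row/column ordering may differ slightly from the slide):

import numpy as np

docs = [
    "I enjoyed the research project .",
    "I like Deep Learning .",
    "I enjoyed NLP .",
]
window_size = 1

# Build the vocabulary in order of first appearance.
vocab = []
for doc in docs:
    for token in doc.split():
        if token not in vocab:
            vocab.append(token)
index = {token: i for i, token in enumerate(vocab)}

# Count, for each word, how often every other word falls inside its window.
V = len(vocab)
X = np.zeros((V, V), dtype=int)
for doc in docs:
    tokens = doc.split()
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window_size)
        hi = min(len(tokens), pos + window_size + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                X[index[word], index[tokens[ctx]]] += 1

print(X[index['I'], index['enjoyed']])  # -> 2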
The GloVe approach – SVD based methods –
• To create embedding vectors from the co-occurrence matrix, one approach can be to use a Singular Value Decomposition (SVD) of the co-occurrence matrix:

$X = W_1 \, \Omega \, W_2^T$, where $X$, $W_1$, $\Omega$ and $W_2$ are all of shape $(V \times V)$.

• Then, we reduce the dimensionality by selecting the first D singular vectors (with D << V).
• Let $\omega_1 > \omega_2 > \cdots > \omega_V$ be the singular values, stored in decreasing order on the diagonal of $\Omega$.
• We select D so that we capture the desired amount of variance:

$\dfrac{\sum_{i=1}^{D} \omega_i}{\sum_{i=1}^{V} \omega_i}$
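A minimal sketch of this dimensionality reduction (it reuses the matrix X built in the previous sketch; np.linalg.svd returns the singular values in decreasing order, and the 90% threshold is an arbitrary choice for illustration):

import numpy as np

W1, omega, W2t = np.linalg.svd(X)  # X = W1 @ np.diag(omega) @ W2t

# Choose D so that the first D singular values capture, say, 90% of the total.
captured = np.cumsum(omega) / np.sum(omega)
D = int(np.searchsorted(captured, 0.90)) + 1

# Keep the first D singular vectors: each row is a D-dimensional word vector.
embeddings = W1[:, :D] * omega[:D]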

The GloVe approach – Matrix Factorization instead of SVD –
• The SVD approach does not work well in practice, for several reasons:
  • The dimensions of the matrix change very often (new words are added very frequently and the corpus changes in size).
  • The matrix is extremely sparse (i.e. it contains a lot of zero values), since most words do not co-occur.
  • The matrix is very high dimensional, as the vocabulary size is usually huge.
• We are going to introduce another way of performing the factorization: Matrix Factorization methods are widely used for generating meaningful and low-dimensional word representations.
• In the GloVe approach, since the non-zero values can be very large, we factorize the logarithm of X (denoted log X) instead of factorizing X.
• Remark: obviously, as we can't apply the logarithm function to the entries with a zero value, we add 1 to all the elements of the matrix before applying the logarithm:

$\forall (i,j) \in V^2, \quad X_{ij} \leftarrow X_{ij} + 1$

• We want to factorize $\log X$ into 2 matrices: $\log X \approx W \tilde{W}^T$
• We want to estimate $W, \tilde{W} \in \mathbb{R}^{V \times D}$ with $D \ll V$.
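To make the idea concrete, here is a minimal sketch of this factorization using plain full-batch gradient descent on the squared reconstruction error ‖log X − W W̃ᵀ‖² (it reuses the matrix X from the earlier sketch; the embedding dimension, learning rate, and step count are illustrative choices, and the weighting function of the full GloVe objective is deliberately omitted):

import numpy as np

logX = np.log(X + 1)  # add 1 to every entry, then take the logarithm

V = logX.shape[0]
D = 5                 # embedding dimension, D << V
rng = np.random.default_rng(seed=0)
W = rng.normal(scale=0.1, size=(V, D))        # word embedding matrix
W_tilde = rng.normal(scale=0.1, size=(V, D))  # context embedding matrix

learning_rate = 0.01
for step in range(2000):
    residual = W @ W_tilde.T - logX   # current reconstruction error
    grad_W = residual @ W_tilde       # gradient of 0.5 * ||residual||^2 w.r.t. W
    grad_W_tilde = residual.T @ W     # ... and w.r.t. W_tilde
    W -= learning_rate * grad_W
    W_tilde -= learning_rate * grad_W_tilde

# After training, row i of W is the D-dimensional embedding of word i.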