Machine Learning and Finance – Coursework

Notations:
• $\mathcal{M}_{n,p}(\mathbb{R})$ is the space of the matrices composed of $n$ rows and $p$ columns.
• The gradient of a function $f : \theta \in \mathbb{R}^D \mapsto \mathbb{R}$ at $\theta \in \mathbb{R}^D$ is $\nabla_\theta f(\theta) = \left(\frac{\partial f}{\partial \theta_1}(\theta), \dots, \frac{\partial f}{\partial \theta_D}(\theta)\right)$.
• Convention:
– The rows $(A_i)_{1 \le i \le n}$ of a matrix $A = \begin{pmatrix} A_1 \\ \vdots \\ A_n \end{pmatrix} \in \mathcal{M}_{n,p}(\mathbb{R})$ are considered $\mathcal{M}_{p,1}(\mathbb{R})$ matrices.
– The columns $(B_j)_{1 \le j \le p}$ of a matrix $B = \begin{pmatrix} B_1 & \dots & B_p \end{pmatrix} \in \mathcal{M}_{n,p}(\mathbb{R})$ are considered $\mathcal{M}_{n,1}(\mathbb{R})$ matrices.

Presentation of the Coursework:
The objective of this coursework is to implement the Word2vec approach. The model was introduced in Mikolov et al. 2013. It is one of the most successful ideas for learning an embedding matrix from a corpus. The model captures some linguistic patterns between word vectors and performs well on the word analogy task. For instance, the embedding vectors learned using the word2vec approach have the following property: $e_{France} - e_{Paris} \approx e_{England} - e_{London}$ (where $e_X$ stands for the embedding of the word X).
The coursework is subdivided into four parts:
• In section 1, we introduce the concept of word embedding and the word2vec approach.
• In section 2, the objective is to load the dataset and perform the first processing steps in order to get the corpus.
• In section 3, we use the processed corpus to create a dataset for a binary classification problem.
• In section 4, we create an embedding matrix by learning the parameters of a shallow neural network trained for the binary classification task.
1 Introducing the word2vec approach
The objective of the coursework is to train a model on a corpus of training sentences in order to represent words in a D-dimensional space. We would like to encode the similarity between the words in the embedding vectors themselves.
Question 1: Explain why this notion of similarity is not encoded in the one hot vector representation of words.
Several methods have been used to create word embeddings. The most popular ones rely on the intuition that a word’s meaning is given by the words that frequently appear close-by.
For instance, we have introduced in Programming Session 5 the GloVe approach, a popular method used to learn low-dimensional word representations by using matrix factorization methods on a matrix of word-word co-occurrence statistics.
In this programming session, we are going to introduce the word2vec approach, which represents the tokens as parameters of a shallow neural network predicting a word's context given the word itself.
Similarly to what we have done in the GloVe approach, we will rely on the intuition that a word is defined by the context words.
Figure 1 is an illustration of the concept of neighbors. The center word ”economy” is represented in red, and the neighbors for a context size of 3 are boxed. The ”context size” is the number of words we consider on either side of a center word as neighbors.
If there are at least 3 words on the left of a center word, we can extract the first 3 left neighbors. If there are fewer than 3 words, we only extract the ones found within the document. For the right neighbors, the same principle applies.
Figure 1: Context words
2 Preprocessing the data
The data folder contains a csv file named RedditNews.csv1.
The RedditNews.csv file stores historical news headlines from the Reddit WorldNews Channel, ranked by Reddit users' votes; only the top 25 headlines are considered for a single date. You will find two columns:
• The first column is for the ”date”.
• The second column is for the ”News”. As all the news are ranked from top to bottom, there are only 25 lines for each date.
Question 2: Load the data from the csv file and create a list of all the news.
¹Source: Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved 26 May 2020 from https://www.kaggle.com/aaron7sun/stocknews.
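A minimal sketch of one possible way to load the file for Question 2. The relative path "data/RedditNews.csv" is an assumption; the column name "News" comes from the description above.

```python
import pandas as pd

# Load the csv file (the relative path is an assumption about your folder layout).
df = pd.read_csv("data/RedditNews.csv")

# The second column holds the headlines; build a plain Python list of all the news.
news = df["News"].astype(str).tolist()
print(len(news), news[0])
```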
Question 3: Preprocess the data by transforming the list of sentences into a list of sequences of integers, via a dictionary that maps the words to integers.
Question 4: For each sentence, add a specific index for the token ”<sos>” (start of sequence) at the beginning of each sequence and an index for the token ”<eos>” (end of sequence) at the end of each sequence.
The resulting list of lists of integers is called sequences.
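One possible sketch of questions 3 and 4, assuming a naive lowercase/whitespace tokenization (any reasonable tokenizer would do); the helper name build_sequences is illustrative only. Question 5 can then be answered with a one-line filter such as `sequences = [s for s in sequences if len(s) >= 6]`.

```python
def build_sequences(news):
    """Map each headline to a list of integer indices, framed by <sos> and <eos>."""
    word2idx = {"<sos>": 0, "<eos>": 1}
    sequences = []
    for sentence in news:
        tokens = sentence.lower().split()          # naive tokenization (assumption)
        seq = [word2idx["<sos>"]]
        for token in tokens:
            if token not in word2idx:
                word2idx[token] = len(word2idx)    # assign the next free index
            seq.append(word2idx[token])
        seq.append(word2idx["<eos>"])
        sequences.append(seq)
    return sequences, word2idx

sequences, word2idx = build_sequences(news)
V = len(word2idx)                                  # vocabulary size
```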
The processed corpus is represented in figure 2. It is composed of several documents $(D_i)_{1 \le i \le N}$.
Each document $D_i = (w_i^1, \dots, w_i^{n_i})$ is a list of indices of length $n_i$.
Figure 2: Corpus
Question 5: Filter the documents by only keeping the ones such that $n_i \ge 6$.
3 Preparing the training dataset for a binary classification problem from the processed corpus
3.1 Introducing the positive and negative batch associated with a true center word
Let us consider a document $D_i = (w_i^1, \dots, w_i^{n_i})$ such that $n_i \ge 6$.
Although the elements of the documents $D_i$ are integers from $\{0, \dots, V-1\}$, we will refer to them as ”words” in section 3.1.
Let us consider a context size of 3 words.
Figure 3 represents the context words associated with a center word $w_i^t$ in a document $D_i$.
Figure 3: Context words associated with the center word $w_i^t$
Consequently, each center word is associated with 3, 4, 5 or 6 context words (a maximum of 3 on the left and
a maximum of 3 on the right).
Let us suppose that the position of the center word $w_i^t$ allows the existence of 3 context words on the left and
3 context words on the right. We can then define 6 true couples of the form (true center word, context word).
In the example of figure 3, the 6 true couples associated with the center word $w_i^t$ are $(w_i^t, w_i^{t-3})$, $(w_i^t, w_i^{t-2})$, $(w_i^t, w_i^{t-1})$, $(w_i^t, w_i^{t+1})$, $(w_i^t, w_i^{t+2})$ and $(w_i^t, w_i^{t+3})$.
These couples constitute the positive batch associated with the center word $w_i^t$, denoted $B_+^{w_i^t}$.
In the word2vec approach, for each true center word $w_i^t$, we need to artificially create the same number of fake couples (fake center word, context word). This can be done by keeping the context words associated with the center word $w_i^t$ untouched, and randomly selecting a fake center word from the vocabulary.
In the original paper, the fake center word, denoted $f_i^t$, is sampled from the negative sampling distribution. For each word $w \in \{0, \dots, V-1\}$, let $N(w)$ be the number of times the word $w$ appears in the corpus.
We define the negative sampling distribution $n$ as follows:
$$\forall w \in \{0, \dots, V-1\}, \qquad n(w) = \frac{N(w)^{0.75}}{\sum_{w'=0}^{V-1} N(w')^{0.75}}$$
Question 6: Create a function that takes as arguments:
• sequences: the list of lists of integers composing the corpus.
• V: the vocabulary size.
The function should output the list $[n(w) \text{ for } w \in \{0, \dots, V-1\}]$.
Question 7: Use the function defined in the previous question to sample fake center words from
the negative sampling distribution.
(Hint: You can use the following reference Link to random choice using numpy).
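A sketch of the distribution (Question 6) and of the sampling step (Question 7); the function name is illustrative, and the use of numpy.random.choice follows the hint above.

```python
import numpy as np

def negative_sampling_distribution(sequences, V):
    """Return [n(w) for w in {0, ..., V-1}] using 0.75-smoothed counts."""
    counts = np.zeros(V)
    for seq in sequences:
        for w in seq:
            counts[w] += 1              # N(w): number of occurrences of w in the corpus
    probs = counts ** 0.75
    return probs / probs.sum()          # normalize so the probabilities sum to 1

p_neg = negative_sampling_distribution(sequences, V)
fake_center = np.random.choice(V, p=p_neg)   # sample one fake center word index
```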
As a result, we end up with the same number of fake couples. The 6 fake couples associated with the fake center word $f_i^t$ are $(f_i^t, w_i^{t-3})$, $(f_i^t, w_i^{t-2})$, $(f_i^t, w_i^{t-1})$, $(f_i^t, w_i^{t+1})$, $(f_i^t, w_i^{t+2})$ and $(f_i^t, w_i^{t+3})$.
These couples constitute the negative batch associated with the center word $w_i^t$, denoted $B_-^{w_i^t}$.
Figure 4 summarizes all the steps involved in creating the true couples and the fake couples associated with the center word $w_i^t$.
Figure 4: Creating the true couples and the fake couples associated with the true center word $w_i^t$
Question 8: Create a function which takes as arguments:
• position: an integer representing the position of the center word.
• sequence: a list of integers representing the document.
• contextsize: an integer representing the number of neighbors we want to consider on each side.
The function should output the list of integers representing the context words.
An example is represented in figure 5.
Figure 5: Getting the context words
• The position is equal to 4.
• The sequence is [0, 45, 64, 675, 56, 235, 76, 443, 654, 765, 10000].
• The contextsize is 3.
• The function should output the list [45, 64, 675, 235, 76, 443]
Question 9: As a sanity check, test the function defined in Question 8 on the document represented in figure 5.
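A possible sketch of the function of Question 8, checked on the example of figure 5 (the function name get_context_words is illustrative):

```python
def get_context_words(position, sequence, contextsize):
    """Return the neighbors of sequence[position], at most contextsize on each side."""
    left = sequence[max(0, position - contextsize):position]
    right = sequence[position + 1:position + 1 + contextsize]
    return left + right

# Sanity check on the document of figure 5: expected output [45, 64, 675, 235, 76, 443].
sequence = [0, 45, 64, 675, 56, 235, 76, 443, 654, 765, 10000]
print(get_context_words(4, sequence, 3))
```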
3.2 A Small Example
Let us have a concrete example to illustrate the way we create the positive batch and the negative batch associated with a true center word.
Figure 6 represents a fake corpus composed of three documents.
Figure 6: A Fake Corpus of 3 documents
Question 10: Create the fake corpus represented in figure 6.
Question 11: Create the word2idx dictionary (a dictionary mapping each token to a unique index)
without using any libraries.
Question 12: Transform the corpus into a list of lists of integers.
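A sketch of questions 10 to 12. Since the exact sentences of figure 6 are not reproduced here, the three documents below are placeholders (only the second one is loosely modelled on the example of figure 7); only the mechanics matter.

```python
# Placeholder corpus standing in for the three documents of figure 6 (assumption).
fake_corpus = [
    "the markets rallied after the announcement",
    "recession is bad for bosses hurting economy says senior Bank official",
    "music festival draws record crowds",
]

# Question 11: build word2idx by hand, without any library.
word2idx = {}
for document in fake_corpus:
    for token in document.split():
        if token not in word2idx:
            word2idx[token] = len(word2idx)

# Question 12: transform the corpus into a list of lists of integers.
corpus_idx = [[word2idx[token] for token in document.split()] for document in fake_corpus]
```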
We denote $idx_X$ the index of the word X in the word2idx dictionary.
Let us consider the true center word economy in the second document represented in figure 7.
Figure 7: Example of a center word in a document
The context words associated with the word economy are ”for”, ”bosses”, ”hurting”, ”says”, ”senior” and ”Bank”.
Question 13: Use the function defined in Question 8 to get the indices of the context words ”for”, ”bosses”, ”hurting”, ”says”, ”senior” and ”Bank”.
The positive batch associated with the center word economy is composed of the following true couples: $(idx_{economy}, idx_{for})$, $(idx_{economy}, idx_{bosses})$, $(idx_{economy}, idx_{hurting})$, $(idx_{economy}, idx_{says})$, $(idx_{economy}, idx_{senior})$ and $(idx_{economy}, idx_{Bank})$.
Question 14: Use the function defined in Question 6 in order to sample the index of a fake center word.
Question 15: Reverse the word2idx dictionary to determine the fake center word from its index.
Let us suppose that the fake center word is music. The negative batch associated with the center word economy is composed of the following fake couples: $(idx_{music}, idx_{for})$, $(idx_{music}, idx_{bosses})$, $(idx_{music}, idx_{hurting})$, $(idx_{music}, idx_{says})$, $(idx_{music}, idx_{senior})$ and $(idx_{music}, idx_{Bank})$.
As shown in Figure 8, we associate each true couple with a label 1 and each fake couple with a label 0.
Figure 8: The positive and negative batch associated with the center word ”economy” in the second document
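A sketch of how the positive and negative batches of figure 8 can be assembled, reusing the get_context_words and negative_sampling_distribution helpers and the placeholder corpus_idx/word2idx from the earlier sketches (all names are illustrative):

```python
import numpy as np

contextsize = 3
V_fake = len(word2idx)
p_fake = negative_sampling_distribution(corpus_idx, V_fake)   # distribution on the fake corpus

doc = corpus_idx[1]                                   # second document of the fake corpus
position = doc.index(word2idx["economy"])             # position of the true center word
context = get_context_words(position, doc, contextsize)

true_center = word2idx["economy"]
fake_center = np.random.choice(V_fake, p=p_fake)      # Question 14: sample a fake center word

positive_batch = [(true_center, c, 1) for c in context]   # label 1 for the true couples
negative_batch = [(fake_center, c, 0) for c in context]   # label 0 for the fake couples

# Question 15: reverse the dictionary to recover the fake center word from its index.
idx2word = {idx: word for word, idx in word2idx.items()}
print(idx2word[fake_center])
```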
3.3 Creating the training dataset
We have explained the way we create the positive and negative batches associated with a center word $w_i^t$ in section 3.1.
In order to create the whole training dataset, we need to iterate the process by considering each word $w_i^t$ in the corpus, for all $i \in \{1, \dots, N\}$ and for all $t \in \{1, \dots, n_i\}$, as a center word, and determine the positive batch $B_+^{w_i^t}$ composed of the true couples associated with the center word $w_i^t$ and the negative batch $B_-^{w_i^t}$ composed of the fake couples associated with the fake center word $f_i^t$.

As shown in the example of section 3.2, each element of the true couples is considered as a positive sample (with a label 1) and each element of the fake couples is considered as a negative sample (with a label 0). Let $N_T$ be the number of true couples (i.e., couples associated with a label 1).
$$N_T = \sum_{i=1}^{N} \sum_{t=1}^{n_i} \mathrm{Card}\left(B_+^{w_i^t}\right)$$
where $\mathrm{Card}(F)$ stands for the cardinality of $F$.
Question 16: What is the expression of $N_T$ as a function of $(n_i)_{1 \le i \le N}$?
Question 17: What is the number of fake couples (i.e., couples associated with a label 0)?
4 Learning the Word Vectors

4.1 The Forward Propagation
We would like to use the shallow neural network represented in figure 9 in order to predict the context words from the center word. The hidden layer contains D neurons.
Figure 9: A Shallow Neural Network
Let us consider a center word of index $x \in \{0, \dots, V-1\}$ and $\hat{x} \in \{0,1\}^V$ the $V$-dimensional one hot vector associated with it.
A first linear transformation maps $\hat{x}$ to the $D$-dimensional vector $h$ as follows: $h = W_1^T \hat{x}$.
A second transformation maps the hidden vector $h$ to the $V$-dimensional vector $p = (p_1, \dots, p_V)$ as follows: $p = \sigma(W_2^T h)$, where $\sigma$ is the sigmoid activation function.
We use the sigmoid activation function because we would like the $o$-th element of $p$ (denoted $p_o$ for $o \in \{0, \dots, V-1\}$) to represent the probability that the word of index $o$ is in the true context of the word of index $x$. In other words, $p_o$ is the probability that the couple $(x, o)$ is a true couple.

Question 18: What are the shapes of $W_1$ and $W_2$?

Let $W_1[0], W_1[1], \dots, W_1[V-1]$ be the rows of the matrix $W_1$ and $W_2[0], W_2[1], \dots, W_2[V-1]$ be the columns of the matrix $W_2$.
Question 19: Show that:
$$p_o = \sigma\left(W_1[x]^T W_2[o]\right)$$
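In practice, this relation means the forward pass for one couple only touches one row of $W_1$ and one column of $W_2$. A minimal numpy sketch under one consistent choice of shapes ($W_1$ of size $V \times D$, $W_2$ of size $D \times V$, to be justified in Question 18); all names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# V is the vocabulary size from the preprocessing sketch; D is an assumed embedding size.
D = 50
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(V, D))   # one row per word
W2 = rng.normal(scale=0.01, size=(D, V))   # one column per word

def predict_couple(x, o, W1, W2):
    """p_o = sigmoid(W1[x]^T W2[o]) without materialising the one hot vector."""
    return sigmoid(W1[x] @ W2[:, o])
```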
Let us consider a center word $w_i^t$ and the following positive and negative batches associated with $w_i^t$, each containing $K = 6$ couples:
• The positive batch $B_+^{w_i^t} = \{(i_t, c_1), \dots, (i_t, c_K)\}$ composed of the true couples, associated with a label 1.
• The negative batch $B_-^{w_i^t} = \{(i_f, c_1), \dots, (i_f, c_K)\}$ composed of the fake couples, associated with a label 0.
4.1.1 The Forward Propagation for the positive batch associated with the center word $w_i^t$
Figure 10 is a representation of the forward propagation of the one hot vector associated with the true center word $i_t$.
Figure 10: The forward propagation for the positive batch

We need to compare the predictions $p_c$ for all $c \in \{c_1, \dots, c_K\}$ to the true targets 1.

Question 20: Explain why the loss associated with the positive batch $B_+^{w_i^t}$ is the following:
$$J_+^{w_i^t} = -\frac{1}{K} \sum_{k=1}^{K} \log\left(\sigma\left(W_1[i_t]^T W_2[c_k]\right)\right)$$
Hint: Remember that if you have $N$ training data for a binary classification problem, if $p_n$ represents your prediction for the sample $n$ and $t_n$ the associated target, then the loss function is the following:
$$J = -\frac{1}{N} \sum_{n=1}^{N} \left(t_n \log(p_n) + (1 - t_n)\log(1 - p_n)\right)$$
4.1.2 The Forward Propagation for the negative batch associated with the center word $w_i^t$

Figure 11 is a representation of the forward propagation of the one hot vector associated with the fake center word $i_f$.

Figure 11: The forward propagation for the negative batch

We need to compare the predictions $p_c$ for all $c \in \{c_1, \dots, c_K\}$ to the true targets 0.

Question 21: Explain why the loss associated with the negative batch $B_-^{w_i^t}$ is the following:
$$J_-^{w_i^t} = -\frac{1}{K} \sum_{k=1}^{K} \log\left(1 - \sigma\left(W_1[i_f]^T W_2[c_k]\right)\right)$$
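A small numpy sketch of both losses for one center word, building on the predict_couple helper from the earlier sketch (all names are illustrative):

```python
import numpy as np

def positive_loss(i_t, context, W1, W2):
    """J_+ : mean negative log-probability that the true couples are predicted as true."""
    preds = np.array([predict_couple(i_t, c, W1, W2) for c in context])
    return -np.mean(np.log(preds))

def negative_loss(i_f, context, W1, W2):
    """J_- : mean negative log-probability that the fake couples are predicted as fake."""
    preds = np.array([predict_couple(i_f, c, W1, W2) for c in context])
    return -np.mean(np.log(1.0 - preds))
```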
4.2 The Backward Propagation
We would like to update the parameters of the neural network $\theta = (W_1, W_2)$ after each batch using stochastic gradient descent.
4.2.1 Updating the parameters after the forward propagation associated with the positive batch
The forward propagation, calculated from the one hot vector of $i_t$, results in a $V$-dimensional prediction vector $p$.
Question 22: Which rows and columns of $W_1$ and $W_2$ are involved in the computation of $p_{c_k}$ for all $k \in \{1, \dots, K\}$?
(Hint: Use the result of Question 19)
In order to avoid calculating the gradients of the loss function $J_+^{w_i^t}$ with respect to all the parameters $\theta = (W_1, W_2)$, we are only going to update the rows and columns of $W_1$ and $W_2$ involved in the forward propagation.
We get the following gradients:
$$\nabla_{W_1[i_t]} J_+^{w_i^t} = \frac{1}{K} \sum_{k=1}^{K} \left(\sigma\left(W_1[i_t]^T W_2[c_k]\right) - 1\right) W_2[c_k]$$
$$\nabla_{W_2[c_k]} J_+^{w_i^t} = \frac{1}{K} \left(\sigma\left(W_1[i_t]^T W_2[c_k]\right) - 1\right) W_1[i_t] \quad \text{for all } k \in \{1, \dots, K\}$$
Let $\eta$ be the learning rate.
Question 23: What are the update equations of the parameters $\theta = (W_1, W_2)$ after the forward propagation associated with the positive batch $B_+^{w_i^t}$?
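A sketch of one SGD step for the positive batch, applying the gradients given above with the learning rate $\eta$; it reuses the sigmoid helper from the earlier sketch, and the default learning rate is an assumption. The symmetric step for the negative batch (section 4.2.2) replaces $i_t$ by $i_f$ and the target 1 by 0.

```python
import numpy as np

def sgd_step_positive(i_t, context, W1, W2, lr=0.025):
    """Update in place only the rows/columns of W1, W2 touched by the positive batch."""
    K = len(context)
    grad_W1 = np.zeros_like(W1[i_t])
    for c in context:
        err = sigmoid(W1[i_t] @ W2[:, c]) - 1.0        # sigma(.) - 1 for a target of 1
        grad_W1 += err * W2[:, c] / K                  # accumulate the gradient w.r.t. W1[i_t]
        W2[:, c] -= lr * err * W1[i_t] / K             # gradient step on W2[c_k]
    W1[i_t] -= lr * grad_W1                            # gradient step on W1[i_t]
```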
4.2.2 Updating the parameters after the forward propagation associated with the negative batch
The forward propagation, calculated from the one hot vector of $i_f$, results in a $V$-dimensional prediction vector $p$.
Question 24: Which rows and columns of $W_1$ and $W_2$ are involved in the computation of $p_{c_k}$ for all $k \in \{1, \dots, K\}$?
Question 25: What are the update equations of the parameters $\theta = (W_1, W_2)$ after the forward propagation associated with the negative batch $B_-^{w_i^t}$?
4.3 Learning the Embedding Matrix
Question 26: By combining all the previous steps, implement Algorithm 1.

Algorithm 1: The Word2vec algorithm
Input: sequences (list of lists of integers), $N_{epochs}$ (number of epochs)
Output: losses (list of losses associated with each epoch), $W_1$, $W_2$ (the trained parameters of the shallow neural network)

Initialize the matrices $W_1$ and $W_2$ randomly.
Initialize an empty list of losses: losses = []
for epoch in $N_{epochs}$ do
    Initialize the loss associated with the whole corpus: $J = 0$
    for sequence $D_i$ in sequences do
        for word $w_i^t$ in sequence $D_i$ do
            Get the true center word $i_t = w_i^t$
            Get the context words $\{c_1, \dots, c_K\}$
            Get the fake center word $i_f = f_i^t$
            Do one step of Stochastic Gradient Descent associated with the positive batch $B_+^{w_i^t}$
            Calculate the loss $J_+^{w_i^t}$
            Do one step of Stochastic Gradient Descent associated with the negative batch $B_-^{w_i^t}$
            Calculate the loss $J_-^{w_i^t}$
            $J \leftarrow J + J_+^{w_i^t} + J_-^{w_i^t}$
        end for
    end for
    Append the list of losses with the value $J$
end for
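A compact sketch of one possible implementation of Algorithm 1, tying together the helpers sketched in the previous sections (sigmoid, negative_sampling_distribution, get_context_words, sgd_step_positive, positive_loss, negative_loss); the hyper-parameters D, n_epochs and the learning rate are illustrative assumptions.

```python
import numpy as np

def sgd_step_negative(i_f, context, W1, W2, lr=0.025):
    """Symmetric update for the negative batch (target 0)."""
    K = len(context)
    grad_W1 = np.zeros_like(W1[i_f])
    for c in context:
        err = sigmoid(W1[i_f] @ W2[:, c])              # sigma(.) - 0 for a target of 0
        grad_W1 += err * W2[:, c] / K
        W2[:, c] -= lr * err * W1[i_f] / K
    W1[i_f] -= lr * grad_W1

def word2vec(sequences, V, D=50, n_epochs=5, contextsize=3, lr=0.025, seed=0):
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.01, size=(V, D))
    W2 = rng.normal(scale=0.01, size=(D, V))
    p_neg = negative_sampling_distribution(sequences, V)
    losses = []
    for epoch in range(n_epochs):
        J = 0.0
        for seq in sequences:
            for t in range(len(seq)):
                i_t = seq[t]                                   # true center word
                context = get_context_words(t, seq, contextsize)
                i_f = rng.choice(V, p=p_neg)                   # fake center word
                sgd_step_positive(i_t, context, W1, W2, lr)
                J += positive_loss(i_t, context, W1, W2)
                sgd_step_negative(i_f, context, W1, W2, lr)
                J += negative_loss(i_f, context, W1, W2)
        losses.append(J)
    return losses, W1, W2

losses, W1, W2 = word2vec(sequences, V)
```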
Question 27: Plot the list of losses from Algorithm 1
Question 28: Which embedding matrix can you build once the training process is finished?
Question 29: Using an unsupervised learning algorithm of your choice, reduce the dimensionality of your embedding vectors into 2 dimensions and show a scatter plot of the reduced embedding vectors.
Hint: You can use a PCA for instance.
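A sketch of one way to do this with scikit-learn's PCA and matplotlib; taking $W_1$ as the embedding matrix anticipates Question 28 and is an assumption here, and the annotated words are only shown if they appear (lowercased) in the vocabulary.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

embeddings = W1                          # assumed embedding matrix (see Question 28)
reduced = PCA(n_components=2).fit_transform(embeddings)

plt.figure(figsize=(8, 6))
plt.scatter(reduced[:, 0], reduced[:, 1], s=5)
for word in ["france", "paris", "england", "london"]:    # annotate a few words if present
    if word in word2idx:
        x, y = reduced[word2idx[word]]
        plt.annotate(word, (x, y))
plt.title("Word embeddings reduced to 2 dimensions with PCA")
plt.show()
```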
Question 30: Show an example of an analogy like $e_{France} - e_{Paris} \approx e_{England} - e_{London}$ from the corpus using your trained embedding vectors.
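A sketch of the analogy test by cosine similarity, assuming the embedding matrix $W_1$ and the word2idx dictionary from the previous steps; the chosen words are only valid if they actually appear in your corpus vocabulary.

```python
import numpy as np

def most_similar(vector, embeddings, exclude, topn=5):
    """Return the indices of the embeddings closest to `vector` by cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(vector)
    sims = embeddings @ vector / np.maximum(norms, 1e-12)
    sims[list(exclude)] = -np.inf                     # do not return the query words themselves
    return np.argsort(-sims)[:topn]

# e_France - e_Paris + e_London should land close to e_England (if all four words are in the corpus).
a, b, c = word2idx["france"], word2idx["paris"], word2idx["london"]
query = W1[a] - W1[b] + W1[c]
idx2word = {idx: w for w, idx in word2idx.items()}
print([idx2word[i] for i in most_similar(query, W1, exclude={a, b, c})])
```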