7a: Word Vectors
Week 7: Overview
This week, we will look at the statistics of language, and show how word vectors can be assigned in a
systematic way using approximations to singular value decomposition such as word2vec and GloVe.
Weekly learning outcomes
By the end of this module, you will be able to:
describe word frequencies, n-gram models and co-occurrence matrices
describe the components and properties of singular value decomposition
describe the word2vec model, including its architecture, cost function and efficiency issues
Statistics of Language
Definitions, synonyms and antonyms
How do we capture the true meaning of a word, or its relationship to other words?
Words are sometimes arranged into a taxonomy: for example, a sparrow is a bird, a bird is an animal, etc.
We can also look at dictionary de�nitions, or lists of synonyms and antonyms. Here is a list of
synonyms for the word ‘elegant’:
stylish, graceful, tasteful, discerning, refined, sophisticated, dignified, cultivated,
distinguished, classic, smart, fashionable, modish, decorous, beautiful, artistic, aesthetic,
lovely; charming, polished, suave, urbane, cultured, dashing, debonair; luxurious,
sumptuous, opulent, grand, plush, high-class, exquisite
Human effort is required to compile taxonomies, dictionary definitions and lists of synonyms and
antonyms. Some important relationships may be missing, and nuances may be lost. Words like ‘king’ and
‘queen’ are not quite synonyms nor antonyms, because they are similar in some attributes but
opposite in others.
Is it possible to extract statistical properties and relationships between words automatically, without
human involvement?
We will discuss a number of techniques for statistical language processing, using this children’s
rhyme as an example.
There was a Crooked Man
who walked a crooked mile
and found a crooked sixpence
upon a crooked stile.
He bought a crooked cat,
who caught a crooked mouse,
and they all lived together
in a little crooked house.
[Illustration: http://www.kearley.co.uk/images/uploads/JohnPatiencePJ03.gif]
Counting word frequencies
Perhaps the simplest thing we can do to analyse a text is just to count the frequency with which each
word occurs.
This can be useful for document classification. If we build a matrix in which each row represents a
word and each column a document, then each column of the matrix becomes a vector representing the
corresponding document. Some words like ‘a’ and ‘and’ occur frequently in all (or most) documents;
other words occur frequently only in certain types of documents, and so are more distinctive.
For example, words like ‘cat’, ‘mouse’ and ‘house’ tend to occur in children’s books or rhymes, while
other groups of words might be characteristic of legal documents, political news, sporting results, etc.
Words occurring many times in one document may skew the vector. Sometimes it is better if the
matrix entries are just ‘1’ or ‘0’, indicating whether the word occurs at all.
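As a rough illustration of this counting step, here is a minimal Python sketch. The two tiny "documents", the variable names and the use of collections.Counter are choices made for this example, not anything prescribed by the notes above.

```python
from collections import Counter

# Two toy "documents": part of the rhyme, and an invented legal-sounding snippet.
docs = {
    "rhyme": "there was a crooked man who walked a crooked mile "
             "and found a crooked sixpence upon a crooked stile",
    "legal": "the party of the first part shall indemnify the party of the second part",
}

# Count how often each word occurs in each document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Fix a vocabulary, then build one frequency vector per document (one entry per vocabulary word).
vocab = sorted({w for c in counts.values() for w in c})
freq_vectors = {name: [c[w] for w in vocab] for name, c in counts.items()}

# Binary (0/1) variant: record only whether each word occurs at all.
binary_vectors = {name: [int(c[w] > 0) for w in vocab] for name, c in counts.items()}

print(vocab)
print(freq_vectors["rhyme"])
print(binary_vectors["rhyme"])
```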
Counting consecutive words
Another thing we can do is to count the frequency of two words occurring consecutively.
Predictive n-Gram Word Model
If we normalise so that the sum of the entries in each row is equal to 1, we can use this matrix to
estimate the probability $\text{prob}(w_j \mid w_i)$ of word $w_j$ occurring after word $w_i$.
In practice, we need to aggregate over a large corpus, so that unusual words like ‘crooked’ will not
dominate.
The above model captures some common combinations like “there was”, “man who”, “and found”, “he
bought”, “who caught”, “and they”, “they all”, “lived together”, etc.
This unigram model can be generalised to a bi-gram, tri-gram, ..., $n$-gram model by considering the
$n$ preceding words. If the vocabulary is large, we need some tricks to avoid exponential use of
memory.
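The counting and row-normalisation described above can be sketched in a few lines of Python; the nested-dictionary representation and the variable names here are just one convenient choice for this example, not part of the model itself.

```python
from collections import defaultdict

text = ("there was a crooked man who walked a crooked mile "
        "and found a crooked sixpence upon a crooked stile "
        "he bought a crooked cat who caught a crooked mouse "
        "and they all lived together in a little crooked house").split()

# Count how often word j occurs immediately after word i.
counts = defaultdict(lambda: defaultdict(int))
for w_i, w_j in zip(text, text[1:]):
    counts[w_i][w_j] += 1

# Normalise each row so it sums to 1, giving prob(w_j | w_i).
probs = {w_i: {w_j: c / sum(row.values()) for w_j, c in row.items()}
         for w_i, row in counts.items()}

print(probs["a"])   # 'crooked' follows 'a' six times out of seven, 'little' once
```

Aggregating over a large corpus simply means running the same counting loop over much more text, so that unusual words like ‘crooked’ no longer dominate the estimates.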
Unigram Text Generator
Here is an example of text generated from a unigram model based on the frequencies of word pairs
in Shakespeare’s Hamlet.
“Rashly – Good night is very liberal – it is easily said there is – gyved to a sore distraction in wrath and
with my king may choose but none of shapes and editing by this , and shows a sea And what this is
miching malhecho; And gins to me a pass , Transports his wit , Hamlet , my arms against the mind
impatient , by the conditions that would fain know ; which , the wicked deed to get from a deed to
your tutor .”
We see that the local structure looks plausible, but the global structure seems like gibberish.
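For comparison, here is a minimal sketch of such a generator. It is trained on the crooked-man rhyme rather than Hamlet (whose text is not included here), so the output is correspondingly repetitive; the starting word and generation length are arbitrary choices.

```python
import random
from collections import defaultdict

text = ("there was a crooked man who walked a crooked mile "
        "and found a crooked sixpence upon a crooked stile "
        "he bought a crooked cat who caught a crooked mouse "
        "and they all lived together in a little crooked house").split()

# For each word, collect the list of words that follow it
# (duplicates in the list keep the counts implicit).
followers = defaultdict(list)
for w_i, w_j in zip(text, text[1:]):
    followers[w_i].append(w_j)

# Generate by repeatedly sampling a successor of the current word.
random.seed(0)
word, output = "there", ["there"]
for _ in range(20):
    if word not in followers:    # dead end: the final word of the corpus
        break
    word = random.choice(followers[word])
    output.append(word)

print(" ".join(output))
```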
Co-occurrence Matrix
Sometimes, we don’t necessarily predict the next word, but simply a “nearby word” (e.g. a word
occurring within an $n$-word window centered on that word).
For this case, we can build a matrix in which each row represents a word, and each column a nearby
word.
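A sketch of building such a window-based co-occurrence matrix for the rhyme is shown below; storing it as a nested dictionary rather than a dense matrix, and using a window of two words on each side, are illustrative choices for this example.

```python
from collections import defaultdict

text = ("there was a crooked man who walked a crooked mile "
        "and found a crooked sixpence upon a crooked stile "
        "he bought a crooked cat who caught a crooked mouse "
        "and they all lived together in a little crooked house").split()

window = 2   # number of context words on each side of the centre word

# cooc[w][c] counts how often c appears within `window` words of w.
cooc = defaultdict(lambda: defaultdict(int))
for i, w in enumerate(text):
    for j in range(max(0, i - window), min(len(text), i + window + 1)):
        if j != i:
            cooc[w][text[j]] += 1

print(dict(cooc["cat"]))   # words found near 'cat' in this tiny corpus
```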
[Tables: co-occurrence matrices for the rhyme, using a 2-word window and a 10-word window]
By aggregating over many documents, pairs (or groups) of words emerge which tend to occur near
each other (but not necessarily consecutively). For example:
‘cat’, ‘caught’, ‘mouse’
‘walked’, ‘mile’
‘little’, ‘house’
This analysis raises a number of questions:
1. Common words tend to dominate the matrix. Could we sample common words less often, in
order to reveal the relationships of less common words?
2. Each row of this matrix could be considered as a vector representation for the corresponding
word, but the number of dimensions is equal to the size of the vocabulary, which could be very
large ($\sim 10^5$). Is there a way to reduce the dimensionality while still preserving the relationships
between words?
Singular Value Decomposition
Word Vectors
“Words that are used and occur in the same contexts tend to purport similar meanings.” Z. Harris (1954)
“You shall know a word by the company it keeps.” J.R. Firth (1957)
We would like to assign a vector to each word or document, in such a way that words with nearby
representations are likely to occur in similar contexts (or, similar documents have close vectors).
Each column in the co-occurrence matrix could be considered as a vector representation of a word or
document. However, the number of dimensions in this vector would be very large (equal to the size
of the vocabulary, which is typically about 60,000).
Singular Value Decomposition (SVD) is a way of projecting these vectors from 60,000 dimensions to
about 200 dimensions, in such a way that the relationship between words (or documents) is
preserved. SVD itself is too computationally expensive, so methods like word2vec and GloVe are
introduced which provide a good approximation to SVD with reasonable computing resources.
Singular Value Decomposition
Let $X$ be an $L \times M$ (co-)occurrence matrix where each row represents a word, and each column represents
either a word (in which case $L = M$) or a document (in which case $L$ and $M$ might not be equal).
This matrix $X$ ($L \times M$) can be decomposed as $X = USV^T$, where $U$ ($L \times L$) and $V$ ($M \times M$) are unitary (all
columns have unit length) and $S$ ($L \times M$) is diagonal, with diagonal entries $s_1 \ge s_2 \ge \ldots \ge s_M \ge 0$.

We can obtain an approximation to $X$ of rank $N < M$ by truncating $U$ to $\tilde{U}$ ($L \times N$), $S$ to $\tilde{S}$ ($N \times N$) and
$V$ to $\tilde{V}$ ($M \times N$). The $k$th row of $\tilde{U}$ then provides an $N$-dimensional vector representing the $k$th word in
the vocabulary. The $k$th column of $\tilde{V}^T$ also provides an $N$-dimensional vector, which can be seen either as
a representation for documents, or as an alternative representation for words, depending on the origin
of the (co-)occurrence matrix.
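As a concrete sketch, torch.svd() can be used to compute this decomposition and the rank-$N$ truncation just described. The 4-word vocabulary and the counts in the matrix below are invented purely for illustration.

```python
import torch

# A tiny symmetric co-occurrence matrix for an invented 4-word vocabulary
# (rows and columns are both indexed by the same words).
words = ["crooked", "man", "cat", "mouse"]
X = torch.tensor([[0., 3., 2., 2.],
                  [3., 0., 0., 0.],
                  [2., 0., 0., 1.],
                  [2., 0., 1., 0.]])

U, S, V = torch.svd(X)          # X = U @ diag(S) @ V.T
print(S)                        # singular values: real, non-negative, decreasing

# Check the reconstruction.
print(torch.allclose(X, U @ torch.diag(S) @ V.t(), atol=1e-5))

# Rank-N truncation: keep only the first N columns of U, S and V.
N = 2
X_approx = U[:, :N] @ torch.diag(S[:N]) @ V[:, :N].t()

# Each row of U[:, :N] is an N-dimensional vector for the corresponding word.
for w, vec in zip(words, U[:, :N]):
    print(w, vec)
```

In a realistic setting $X$ would be built from a large corpus and $N$ would be chosen in the low hundreds, as discussed below.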
Word2vec and GloVe

Typically, $L$ is the number of words in the vocabulary (about 60,000) and $M$ is either equal to $L$ or, in the
case of document classification, the number of documents in the collection. SVD is computationally
expensive, proportional to $L \times M^2$ if $L \ge M$.

word2vec (Mikolov, 2013) and GloVe (Pennington, 2014) are two common methods for computing an
approximation to SVD incrementally, with less computation. word2vec is a predictive model which aims to
maximise the probability of a word, based on the surrounding words. GloVe is a count-based model which
tries to directly reconstruct a close approximation to the co-occurrence matrix $X$.

Singular Value compared to Eigenvalue Decomposition

Those familiar with Eigenvalue Decomposition will see that it is similar in some ways to Singular Value
Decomposition. But we must note a number of important differences, as illustrated in these examples:

Eigenvalue Decomposition:

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = \Omega D \Omega^{-1}, \text{ where } \Omega = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},\; D = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}$$

$$\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} = \Omega D \Omega^{-1}, \text{ where } \Omega = \tfrac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ i & -i \end{bmatrix},\; D = \begin{bmatrix} i & 0 \\ 0 & -i \end{bmatrix}$$

Singular Value Decomposition:

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = USV^T, \text{ where } U = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\; S = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},\; V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

$$\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} = USV^T, \text{ where } U = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix},\; S = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},\; V = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$$

In general, eigenvalues can be negative or even complex, but singular values are always real and non-negative.
Even if $X$ is a square matrix, singular value decomposition treats the source and target as two entirely
different spaces. The only case where the two decompositions are the same is when $X$ is symmetric and
positive semi-definite. The word co-occurrence matrix is symmetric but not positive semi-definite. For
example, if the text consisted entirely of two alternating letters

..ABABABABABABAB..

then A would be the context for B, and vice-versa.

Word2Vec

[Figure: word2vec network architecture (source: Rong, 2014)]

In this architecture, $W$ is the $V \times N$ matrix of input-to-hidden weights and $W'$ is the $N \times V$ matrix of
hidden-to-output weights, where $V$ is the vocabulary size and $N$ the embedding dimension.

The $k$th row $v_k$ of $W$ is a representation of word $k$.

The $j$th column $v'_j$ of $W'$ is an (alternative) representation of word $j$.

If the (1-hot) input is word $k$, the linear sum at each output $j$ will be $u_j = v'^T_j v_k$.

Note that word2vec is a linear model, in the sense that there is no activation function at the hidden nodes.

Cost function

Softmax can be used to turn these linear sums $u_j$ into a probability distribution estimating the probability
of word $j$ occurring in the context of word $k$:

$$\text{prob}(j \mid k) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} = \frac{\exp(v'^T_j v_k)}{\sum_{j'=1}^{V} \exp(v'^T_{j'} v_k)}$$
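A minimal sketch of this forward pass is given below, using random placeholder weights and a toy vocabulary size; the names W, W_prime and the chosen indices are assumptions made for the example, following the notation above.

```python
import torch

V, N = 10, 4                     # toy vocabulary size and embedding dimension
torch.manual_seed(0)
W = torch.randn(V, N)            # input-to-hidden weights: row k is v_k
W_prime = torch.randn(N, V)      # hidden-to-output weights: column j is v'_j

k = 3                            # index of the (1-hot) input word
h = W[k]                         # hidden activation: just row k (no nonlinearity)
u = W_prime.t() @ h              # u_j = v'_j . v_k for every output word j

prob = torch.softmax(u, dim=0)   # prob(j | k) over the whole vocabulary
print(prob.sum())                # tensor(1.), a proper probability distribution
```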
We can treat the text as a sequence of numbers $w_1, w_2, \ldots, w_T$, where $w_i = j$ means that the $i$th word in
the text is the $j$th word in the vocabulary. We then seek to maximize the log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le r \le c,\, r \ne 0} \log\, \text{prob}(w_{t+r} \mid w_t)$$

where $c$ is the size of the training context (which may depend on $w_t$).

word2vec differentials

If we assume the probabilities are computed by full softmax, and the correct output is $j^*$, then the cost
function is

$$E = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'})$$

The output differentials are

$$e_j = \frac{\partial E}{\partial u_j} = -\delta_{jj^*} + \frac{\partial}{\partial u_j} \log \sum_{j'=1}^{V} \exp(u_{j'})$$

where

$$\delta_{jj^*} = \begin{cases} 1, & \text{if } j = j^* \\ 0, & \text{otherwise} \end{cases}$$

word2vec weight updates

hidden-to-output differentials:

$$\frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial w'_{ij}} = e_j\, h_i$$

hidden unit differentials:

$$\frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial h_i} = \sum_{j=1}^{V} e_j\, w'_{ij}$$

input-to-hidden differentials:

$$\frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i} \frac{\partial h_i}{\partial w_{ki}} = \sum_{j=1}^{V} e_j\, w'_{ij}\, x_k$$

Continuous Bag of Words and Skip-Gram

word2vec can be implemented either as Continuous Bag of Words (CBOW) or Skip-Gram, as illustrated in the
figures from [Rong, 2014]. For CBOW, several context words are each used independently to predict the
centre word, and the hidden activation becomes a sum (or average) over all the context words. Note the
difference between this and NetTalk: in word2vec (CBOW), all context words share the same input-to-hidden
weights.

The Skip-Gram version tries to predict the context words, given the centre word. This is similar to CBOW,
except that in this case a single input word is used to predict multiple context words, and all context
words share the same hidden-to-output weights.

Efficiency issues

Depending on the computational resources available, it may be prohibitively expensive to use the full
softmax, which involves computing output values for all 60,000 words in the vocabulary. Hierarchical
Softmax and negative sampling are two alternatives which are computationally more efficient.

Hierarchical Softmax

For Hierarchical Softmax, the words are arranged in a Huffman-coded tree according to their frequency. One
network output (with sigmoid activation) is assigned to each node in the tree, and the output corresponding
to each node in the path from the root to the actual word $w_j$ is trained to produce either 0 or 1,
depending on whether the left or right branch should be taken. In this way, the total number of outputs is
the same, but if the size of the vocabulary is $L$ then only $\log_2(L)$ outputs need to be evaluated, on
average, for each word in the training text.

Negative sampling

The idea of negative sampling is that we train the network to increase its estimate of the target word
$j^*$ and reduce its estimate not of all the words in the entire vocabulary but instead of just a subset of
them, $W_{\text{neg}}$, drawn from an appropriate distribution:

$$E = -\log \sigma(v'^T_{j^*} h) - \sum_{j \in W_{\text{neg}}} \log \sigma(-v'^T_j h)$$

This is a simplified version of Noise Contrastive Estimation (NCE). It is not guaranteed to produce a
well-defined probability distribution, but in practice it does produce high-quality word embeddings. The
number of negative samples is 5-20 for small datasets, 2-5 for large datasets. Empirically, a good choice
of the distribution from which to draw the negative samples is $P(w) = U(w)^{3/4}/Z$, where $U(w)$ is the
unigram distribution determined by the frequency of each word, and $Z$ is a normalizing constant.

Subsampling of frequent words

In order to diminish the influence of more frequent words, each word in the corpus is discarded with
probability

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $f(w_i)$ is the frequency of word $w_i$ and $t \sim 10^{-5}$ is an empirically determined threshold.
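To make the negative-sampling loss above concrete, here is a rough sketch using random placeholder weights. The vocabulary size, the invented unigram counts, and the choice of 5 negative samples are all illustrative assumptions, and for simplicity the sketch ignores the (unlikely) case where a negative sample coincides with the target word.

```python
import torch

V, N = 10, 4                          # toy vocabulary size and embedding dimension
torch.manual_seed(0)
W_prime = torch.randn(N, V)           # hidden-to-output weights: column j is v'_j
h = torch.randn(N)                    # hidden activation for some centre word

j_star = 2                            # index of the true context word
freqs = torch.rand(V)                 # invented unigram frequencies U(w)
noise = freqs ** 0.75
noise = noise / noise.sum()           # P(w) = U(w)^(3/4) / Z

# Draw a handful of negative samples (5 here, as for a small dataset).
neg = torch.multinomial(noise, 5, replacement=True)

sigma = torch.sigmoid
E = -torch.log(sigma(W_prime[:, j_star] @ h)) \
    - torch.log(sigma(-W_prime[:, neg].t() @ h)).sum()
print(E)                              # the negative-sampling cost for this example
```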
References

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in
vector space. arXiv: 1301.3781.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In
Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).

Rong, X. (2014). word2vec parameter learning explained. arXiv: 1411.2738.

Resources

https://nlp.stanford.edu/projects/glove/

Word Relationships

Sentence Completion Task

One way to test the quality of the word embeddings is to use them to answer sentence completion tasks from
standardised tests such as the Scholastic Aptitude Test (SAT).

Q1. Seeing the pictures of our old home made me feel _______ and nostalgic.
A. fastidious  B. indignant  C. wistful  D. conciliatory

This kind of question can be answered by feeding the surrounding words into the CBOW version of the network
and seeing which of A, B, C, D is assigned the highest probability by the network.

Q2. Because the House had the votes to override a presidential veto, the President has no choice but to
____________ .
A. object  B. abdicate  C. abstain  D. compromise

The network might have some difficulty answering questions like Q2, because of the need for logical
reasoning or culturally specific knowledge.

Word Relationships

Another way to probe the word vectors is to look for linguistic regularities. For example, if we add the
vectors for King and Woman and subtract the vector for Man, we get something very close to the vector for
Queen, i.e.

King + Woman − Man ≃ Queen

This can be used to answer questions of the form: A is to B as C is to What? For example:

Q3. evening is to morning as dinner is to _______ .
A. breakfast  B. soup  C. coffee  D. time

Q4. bow is to arrow as _______ is to bullet.
A. defend  B. lead  C. shoot  D. gun

In each case, we look for the word $X$ in the dictionary whose vector $v_X$ is closest to $(v_C + v_B - v_A)$,
with closeness determined by cosine similarity:

$$X = \arg\max_X \frac{(v_C + v_B - v_A)^T v_X}{\| v_C + v_B - v_A \|}$$

Here are some examples of word relationships from (Mikolov, 2013). The network generally performs very well,
but occasionally makes an understandable mistake. For example, did you notice what the network thinks is the
first and last name of the President of Russia?

[Table: examples of word-pair relationships, from Mikolov et al. (2013)]

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of
words and phrases and their compositionality. NIPS, 3111-3119.

Exercise - Singular Value Decomposition

Question 1. Consider the sentence: "two flowers grew tall on two tall towers". Write the co-occurrence
matrix $X$ for this sentence, using a 4-word context window (i.e. two context words on either side of the
central word).

Question 2. Use torch.svd() to compute the singular value decomposition $X = USV^T$ of this matrix.

Question 3. Extract a word representation from the first two columns of $U$ and use matplotlib to plot the
words on a 2-dimensional graph.

Week 7 Wednesday video