Dense Word Vectors
ANLP: Week 9, Unit 3
Shay Cohen
Based on slides from ANLP 2019
Last class
Represent a word by a context vector
Each word x is represented by a vector v⃗. Each dimension in the vector corresponds to a context word type y
Each vi measures the level of association between the word x and the context word yi
Pointwise Mutual Information
Set each vi to log2 [ p(x, yi) / (p(x) p(yi)) ]
Measures “collocationness”
Vectors have many dimensions and are very sparse (when PMI < 0 is changed to 0)
Similarity metric between v⃗ and another context vector w⃗:
The cosine of the angle between v⃗ and w⃗: (v⃗ · w⃗) / (|v⃗| |w⃗|)
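To make this recap concrete, here is a minimal sketch of PPMI vectors and cosine similarity; the words and co-occurrence counts are invented purely for illustration.

```python
# Sketch: positive PMI (PPMI) vectors and cosine similarity.
# The co-occurrence counts below are made up for illustration.
import numpy as np

vocab = ["apricot", "fruit", "computer"]      # target words
contexts = ["delicious", "sweet", "data"]     # context word types (the dimensions)
counts = np.array([[12.0, 8.0, 0.0],          # counts[i, j] = count(word i, context j)
                   [15.0, 9.0, 1.0],
                   [0.0, 1.0, 20.0]])

total = counts.sum()
p_xy = counts / total                          # joint probabilities p(x, y)
p_x = p_xy.sum(axis=1, keepdims=True)          # marginal p(x)
p_y = p_xy.sum(axis=0, keepdims=True)          # marginal p(y)

with np.errstate(divide="ignore"):
    pmi = np.log2(p_xy / (p_x * p_y))
ppmi = np.maximum(pmi, 0.0)                    # negative PMI values are clipped to 0

def cosine(v, w):
    return v @ w / (np.linalg.norm(v) * np.linalg.norm(w))

print(cosine(ppmi[0], ppmi[1]))                # apricot vs fruit: high
print(cosine(ppmi[0], ppmi[2]))                # apricot vs computer: low
```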
Roadmap for Main Course of Today
Skip-gram models - rely on the idea of pairing words with dense context and target vectors: if a word co-occurs with a context word wc, then its target vector should be similar to the context vector of wc
The computational problem with skip-gram models
An example solution to this problem: negative sampling skip-grams
Today’s Lecture
How to represent a word with vectors that are short (with length of 50 – 1,000) and dense (most values are non-zero)
Why short vectors?
Easier to include as features in machine learning systems
Because they contain fewer parameters, they generalise better
and are less prone to overfitting
Dense vectors are better at capturing synonymy than sparse vectors (in a sparse representation, synonyms correspond to distinct, unrelated dimensions)
Before the Main Course, on PMI and TF-IDF
PMI is one way of trying to detect important co-occurrences based on divergence between observed and predicted (from unigram MLEs) bigram probabilities
A different take: a word that is common in only some contexts carries more information than one that is common everywhere
How to formalise this idea?
TF-IDF: Main Idea
Key Idea: Combine the frequency of a term in a context (such as a document) with its relative frequency overall, across all documents.
This is formalised under the name tf-idf:
tf: term frequency
idf: inverse document frequency
Originally from Information Retrieval, where there are lots of documents, often with lots of words in them
Gives an “importance” level of a term in a specific context
Summary: TF-IDF
Compare two words using the cosine of their tf-idf vectors to see if they are similar
Compare two documents: take the centroid of the vectors of all the terms in each document
The centroid document vector is:
d = (t1 + t2 + · · · + tk) / k
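As a small sketch of the centroid idea, assuming each term already has a vector (the random vectors below are stand-ins for tf-idf vectors):

```python
# Sketch: a document vector as the centroid (mean) of its term vectors.
# term_vectors is a stand-in for tf-idf (or any other) word vectors.
import numpy as np

rng = np.random.default_rng(0)
term_vectors = {t: rng.random(5) for t in ["a", "tablespoon", "of", "apricot", "jam"]}

def centroid(terms, term_vectors):
    vectors = [term_vectors[t] for t in terms if t in term_vectors]
    return np.mean(vectors, axis=0)            # d = (t1 + ... + tk) / k

doc = ["a", "tablespoon", "of", "apricot", "jam"]
print(centroid(doc, term_vectors))
```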
TF-IDF: Combine Two Factors
tf: term frequency of a word t in document d:
tf(t, d) = 1 + log count(t, d) if count(t, d) > 0, and 0 otherwise
where count(t, d) is the frequency count of term t in document d
Idf: inverse document frequency:
idf(t) = log (N / df_t)
N is the total # of docs in the collection
df_t is the # of docs that contain term t
Terms such as the or good have very low idf because df_t ≈ N
tf-idf value for word t in document d: tf-idf(t, d) = tf(t, d) × idf(t)
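A sketch of these two factors on a toy collection; the documents are invented and log base 10 is assumed (any fixed base works):

```python
# Sketch: tf-idf weights over a toy document collection.
import math
from collections import Counter

docs = [
    "a tablespoon of apricot preserves or jam".split(),
    "the apricot is a kind of fruit".split(),
    "a model of the data".split(),
]
N = len(docs)

def tf(t, doc):
    c = Counter(doc)[t]
    return 1 + math.log10(c) if c > 0 else 0.0   # tf(t, d) = 1 + log count(t, d), or 0

def idf(t):
    df = sum(1 for d in docs if t in d)          # df_t: number of docs containing t
    return math.log10(N / df) if df > 0 else 0.0

def tfidf(t, doc):
    return tf(t, doc) * idf(t)

print(tfidf("apricot", docs[0]))   # appears in only some docs: non-zero weight
print(tfidf("of", docs[0]))        # appears in every doc: idf = 0, so weight 0
```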
TF-IDF and PMI are Sparse Representations
TF-IDF and PMI vectors
have many dimensions (as many as the size of the vocabulary)
are sparse (most elements are zero)
Alternative: dense vectors, vectors which are
short (length 50–1000)
dense (most elements are non-zero)
Neural network-inspired dense embeddings
Methods for generating dense embeddings inspired by neural network models
Key idea: Each word in the vocabulary is associated with two vectors: a context vector and a target vector. We try
to push these two types of vectors such that the target vector of a word is close to the context vectors of words with which it co-occurs.
This is the main idea, and what is important to understand. Now to the details to make it operational…
Prediction with Skip-Grams
Each word type w is associated with two dense vectors: v(w) (target vector) and c(w) (context vector)
Skip-gram model predicts each neighbouring word in a context window of ±L words, e.g. for a context window L = 2 the context is [wt−2, wt−1, wt+1, wt+2]
Skip-gram calculates the probability p(wk|wj) by computing dot product between context vector c(wk) of word wk and target vector v(wj) for word wj
The higher the dot product between two vectors, the more similar they are
Skip-gram modelling (or Word2vec)
Instead of counting how often each word occurs near “apricot”:
Train a classifier on a binary prediction task:
Is the word likely to show up near “apricot”?
A by-product of learning this classifier will be the context and
target vectors discussed.
These are the parameters of the classifier, and we will use these parameters as our word embeddings.
No need for hand-labelled supervision – use text with co-occurrence
Prediction with Skip-grams
We use softmax function to normalise the dot product into probabilities:
p(wk | wj) = exp(c(wk) · v(wj)) / Σ_{w∈V} exp(c(w) · v(wj))
where V is our vocabulary.
If both fruit and apricot co-occur with delicious, then v(fruit) and v(apricot) should both be similar to c(delicious), and as such, to each other
Problem: Computing the denominator requires computing dot product between each word in V and the target word wj , which may take a long time
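A minimal sketch of this computation; the vocabulary, dimensionality and random vectors are illustrative. Note that the normalising sum runs over the entire vocabulary, which is exactly the expensive step.

```python
# Sketch: skip-gram softmax p(w_k | w_j) with toy target/context matrices.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apricot", "fruit", "delicious", "jam", "ocean"]
dim = 4
V = {w: i for i, w in enumerate(vocab)}
target = rng.normal(size=(len(vocab), dim))    # v(w): one target vector per word
context = rng.normal(size=(len(vocab), dim))   # c(w): one context vector per word

def p_context_given_target(wk, wj):
    scores = context @ target[V[wj]]           # c(w) . v(w_j) for every w in V
    scores = np.exp(scores - scores.max())     # subtract max for numerical stability
    return scores[V[wk]] / scores.sum()        # the sum is over the whole vocabulary

print(p_context_given_target("delicious", "apricot"))
```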
Skip-gram with Negative Sampling
Problem with skip-grams: Computing the denominator requires computing dot product between each word in V and the target word wj , which may take a long time
Instead:
Given a pair of target and context words, predict + or – (telling whether they co-occur together or not)
This changes the classification into a binary classification problem, no issue with normalisation
It is easy to get examples for the + label (words co-occur)
Where do we get examples for – (words do not co-occur)?
Skip-gram with Negative Sampling
Training sentence for example word apricot:
lemon, a tablespoon of apricot preserves or jam
Select k = 2 noise words for each of the context words:
cement bacon dear coaxial apricot ocean hence never puddle
n1 n2 n3 n4 w n5 n6 n7 n8
We want each noise word wni to have a low dot product with the target embedding of w
We want the context word to have a high dot product with the target embedding of w
Skip-gram with Negative Sampling
Where do we get examples for – (words do not co-occur)?
Solution: randomly sample “negative” examples
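A sketch of how noise words might be sampled; the unigram counts are invented, and the 0.75 exponent is the unigram-distribution smoothing used in the original word2vec implementation.

```python
# Sketch: sampling k noise (negative) words for each context word.
import numpy as np

rng = np.random.default_rng(0)
unigram_counts = {"the": 500, "of": 300, "apricot": 5, "jam": 8,
                  "cement": 2, "bacon": 3, "ocean": 4, "puddle": 1}
words = list(unigram_counts)
weights = np.array([unigram_counts[w] for w in words], dtype=float) ** 0.75
probs = weights / weights.sum()                # smoothed unigram distribution

def sample_noise_words(k, avoid):
    """Draw k noise words, skipping the true target/context pair."""
    noise = []
    while len(noise) < k:
        w = rng.choice(words, p=probs)
        if w not in avoid:
            noise.append(w)
    return noise

# k = 2 noise words for the context word "jam" of target "apricot"
print(sample_noise_words(2, avoid={"apricot", "jam"}))
```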
Skip-Gram Goal
To recap:
Given a pair (wt,wc) = target, context
(apricot, jam)
(apricot, aardvark)
return probability that wc is a real context word: P(+|wt,wc)
P(−|wt,wc)=1−P(+|wt,wc)
Learn from examples (wt,wc,l) where l ∈ {+,−} and the negative examples are obtained through sampling
How to Compute p(+|wt,wc)? Intuition:
Words are likely to appear near similar words
Again use the dot product to indicate the positive/negative label, coupled with logistic regression. This means
P(+ | wt, wc) = 1 / (1 + exp(−v(wt) · c(wc)))
P(− | wt, wc) = 1 − P(+ | wt, wc) = exp(−v(wt) · c(wc)) / (1 + exp(−v(wt) · c(wc)))
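A minimal sketch of this binary prediction; the vectors are random stand-ins, so before training the probabilities are arbitrary.

```python
# Sketch: P(+ | w_t, w_c) as a sigmoid of the dot product v(w_t) . c(w_c).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
v_apricot = rng.normal(size=4)     # target vector v(apricot)
c_jam = rng.normal(size=4)         # context vector c(jam)
c_aardvark = rng.normal(size=4)    # context vector c(aardvark)

p_pos_jam = sigmoid(v_apricot @ c_jam)            # P(+ | apricot, jam)
p_pos_aardvark = sigmoid(v_apricot @ c_aardvark)  # P(+ | apricot, aardvark)
print(p_pos_jam, 1 - p_pos_jam)                   # P(+) and P(-) sum to 1
```

Training should push the first probability up (true co-occurrence) and the second down (sampled negative).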
Skip-gram with Negative Sampling
So, given a positive (target, context) pair and k sampled negative words, the learning objective is to maximise:
log P(+ | wt, wc) + Σ_{i=1}^{k} log P(− | wt, wni)
where we have k negative-sampled words wn1, · · · , wnk
We want to maximise the dot product of the word's target vector with the context vector of the true context word
We want to minimise the dot products of the target vector with the context vectors of all the untrue (sampled) contexts
How do we maximise this learning objective? Using gradient descent
The function σ(x) = 1 / (1 + e^(−x)) used in P(+ | wt, wc) above is also referred to as “the sigmoid”
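A sketch of a single gradient-ascent step on this objective for one (target, context, negatives) example; the dimensionality, learning rate and random initial vectors are illustrative, and the gradient expressions follow from the sigmoid form above.

```python
# Sketch: one gradient-ascent step of skip-gram with negative sampling (SGNS).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, lr = 4, 0.1
v_t = rng.normal(size=dim)                   # target vector v(w_t)
c_pos = rng.normal(size=dim)                 # context vector of the true context word
c_negs = rng.normal(size=(2, dim))           # context vectors of k = 2 noise words

def objective(v_t, c_pos, c_negs):
    # log P(+ | w_t, w_c) + sum_i log P(- | w_t, w_ni)
    return np.log(sigmoid(v_t @ c_pos)) + np.sum(np.log(sigmoid(-c_negs @ v_t)))

before = objective(v_t, c_pos, c_negs)

# Gradients of the objective (gradient *ascent*, since we maximise)
g_pos = 1.0 - sigmoid(v_t @ c_pos)           # scalar for the true context word
g_negs = sigmoid(c_negs @ v_t)               # one scalar per noise word
v_t_new = v_t + lr * (g_pos * c_pos - g_negs @ c_negs)
c_pos = c_pos + lr * g_pos * v_t
c_negs = c_negs - lr * np.outer(g_negs, v_t)
v_t = v_t_new

print(before, objective(v_t, c_pos, c_negs)) # the objective should increase
```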
How to Use the Context and Target Vectors?
After this learning process, use:
v(w) as the word embedding, discarding c(w)
Or the concatenation of c(w) with v(w)
A good example of representation learning: through our classifier setup, we learned how to represent words to fit the classifier model to the data
Food for thought: are c(w) and v(w) going to be similar for each w? Why?
v(fruit) → c(delicious) → v(apricot) → c(fruit)
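A small sketch of the two options; the matrices below are random stand-ins for trained parameters.

```python
# Sketch: turning trained target/context matrices into word embeddings.
import numpy as np

rng = np.random.default_rng(0)
vocab = {"apricot": 0, "fruit": 1, "delicious": 2}
target = rng.normal(size=(len(vocab), 4))     # v(w) for every word (stand-in for trained values)
context = rng.normal(size=(len(vocab), 4))    # c(w) for every word

def embedding(word, concat=False):
    i = vocab[word]
    if concat:
        return np.concatenate([target[i], context[i]])   # [v(w); c(w)]
    return target[i]                                      # just v(w), discarding c(w)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embedding("apricot"), embedding("fruit")))
```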
Properties of Embeddings
Offsets between embeddings can capture relations between words, e.g. vector(king)+ (vector(woman) − vector(man)) is close to vector(queen)
Offsets can also capture grammatical number
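A sketch of the offset computation; the vectors here are random stand-ins, so only with real trained embeddings would the nearest candidate be expected to be queen.

```python
# Sketch: the king - man + woman offset, searched by cosine similarity.
import numpy as np

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "apricot"]
embeddings = {w: rng.normal(size=50) for w in words}   # stand-ins for trained embeddings

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = embeddings["king"] + (embeddings["woman"] - embeddings["man"])
candidates = [w for w in words if w not in {"king", "man", "woman"}]
print(max(candidates, key=lambda w: cosine(embeddings[w], query)))
# With real trained embeddings, "queen" is expected to come out on top.
```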
Some Real Embeddings
Examples of the closest tokens to some target words using a phrase-based extension of the skip-gram algorithm (Mikolov et al. 2013):
Redmond:    Redmond Wash, Redmond Washington, Microsoft
Havel:      Vaclav Havel, President Vaclav Havel, Velvet Revolution
ninjutsu:   ninja, martial arts, swordsmanship
graffiti:   spray paint, graffiti, taggers
capitulate: capitulation, capitulated, capitulating
Summary
skip-grams (and related approaches such as continuous bag of words (CBOW)) are often referred to as word2vec
Code available online – try it! Very fast to train
Idea: predict rather than count
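For example, the gensim library provides a widely used word2vec implementation; the parameter names below assume gensim 4.x and the two-sentence corpus is a toy, so treat this as a sketch to adapt rather than a recipe.

```python
# Sketch: training a skip-gram with negative sampling model with gensim (assumes gensim 4.x).
from gensim.models import Word2Vec

sentences = [
    "a tablespoon of apricot preserves or jam".split(),
    "the apricot is a delicious fruit".split(),
]  # a real corpus would have many more sentences

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the dense vectors
    window=2,         # context window L
    sg=1,             # 1 = skip-gram (0 = CBOW)
    negative=5,       # k negative samples
    min_count=1,      # keep every word in this tiny toy corpus
)

print(model.wv["apricot"])                 # the learned target vector
print(model.wv.most_similar("apricot"))    # nearest neighbours by cosine
```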