COMP90042 Natural Language Processing
Lecture 10: Distributional Semantics
Semester 1 2021, Week 5
Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Lexical Databases – Problems
• Manually constructed
‣ Expensive
‣ Human annotation can be biased and noisy
• Language is dynamic
‣ New words: slang, terminology, etc.
‣ New senses
• The Internet provides us with massive amounts of
text. Can we use that to obtain word meanings?
Distributional Hypothesis
• You shall know a word by the company it keeps
— Firth, 1957
• Document co-occurrence indicative of topic
(document as context)
‣ E.g. voting and politics
• A word's local context reflects its meaning
(word window as context)
‣ E.g. eat a pizza vs. eat a burger
Guessing Meaning from Context
• Learn an unknown word from its usage
  ‣ e.g. tezgüino
• Another way: look at words that share similar contexts!
Excerpt: Jacob Eisenstein, Chapter 14, "Distributional and distributed semantics" (draft of October 15, 2018)

A recurring theme in natural language processing is the complexity of the mapping from words to meaning. In chapter 4, we saw that a single word form, like bank, can have multiple meanings; conversely, a single meaning may be created by multiple surface forms, a lexical semantic relationship known as synonymy. Despite this complex mapping between words and meaning, natural language processing systems usually rely on words as the basic unit of analysis. This is especially true in semantics: the logical and frame semantic methods from the previous two chapters rely on hand-crafted lexicons that map from words to semantic predicates. But how can we analyze texts that contain words that we haven't seen before? This chapter describes methods that learn representations of word meaning by analyzing unlabeled data, vastly improving the generalizability of natural language processing systems. The theory that makes it possible to acquire meaningful representations from unlabeled data is the distributional hypothesis.

14.1 The distributional hypothesis

Here's a word you may not know: tezgüino (the example is from Lin, 1998). If you do not know the meaning of tezgüino, then you are in the same situation as a natural language processing system when it encounters a word that did not appear in its training data. Now suppose you see that tezgüino is used in the following contexts:

(14.1) A bottle of ___ is on the table.
(14.2) Everybody likes ___.
(14.3) Don't have ___ before you drive.
(14.4) We make ___ out of corn.

              (14.1)  (14.2)  (14.3)  (14.4)  …
  tezgüino       1       1       1       1
  loud           0       0       0       0
  motor oil      1       0       0       1
  tortillas      0       1       0       1
  choices        0       1       0       0
  wine           1       1       1       0

Table 14.1: Distributional statistics for tezgüino and five related terms

[Figure 14.1: Lexical semantic relationships have regular linear structures in two-dimensional projections of distributional statistics (Pennington et al., 2014).]

What other words fit into these contexts? How about: loud, motor oil, tortillas, choices, wine? Each row of Table 14.1 is a vector that summarizes the contextual properties for each word, with a value of one for contexts in which the word can appear, and a value of zero for contexts in which it cannot. Based on these vectors, we can conclude: wine is very similar to tezgüino; motor oil and tortillas are fairly similar to tezgüino; loud is completely different.

These vectors, which we will call word representations, describe the distributional properties of each word. Does vector similarity imply semantic similarity? This is the distributional hypothesis, stated by Firth (1957) as: "You shall know a word by the company it keeps." The distributional hypothesis has stood the test of time: distributional statistics are a core part of language technology today, because they make it possible to leverage large amounts of unlabeled data to learn about rare words that do not appear in labeled training data.

Distributional statistics have a striking ability to capture lexical semantic relationships …
Word Vectors
• Each row can be thought of as a word vector
• It describes the distributional properties of a word
  ‣ i.e. it encodes information about the word's context words
• Word vectors capture all sorts of semantic relationships (synonymy, analogy, etc.)
              (14.1)  (14.2)  (14.3)  (14.4)  …
  tezgüino       1       1       1       1
  loud           0       0       0       0
  motor oil      1       0       0       1
  tortillas      0       1       0       1
  choices        0       1       0       0
  wine           1       1       1       0

Table 14.1 (Eisenstein, repeated from above): Distributional statistics for tezgüino and five related terms
Word Embeddings?
• We've seen word vectors before: word embeddings!
• Here we will learn other ways to produce word vectors
  ‣ Count-based methods
  ‣ More efficient neural methods designed just for learning word vectors
Outline
• Count-based methods
• Neural methods
• Evaluation
Count-Based Methods
Learning Count-Based Word Vectors
• Generally two flavours
‣ Use document as context
‣ Use neighbouring words as context
Document as Context: The Vector Space Model
• Core idea: represent word meaning as a vector
• Consider documents as context
• One matrix, two viewpoints (see the sketch below)
  ‣ Documents represented by their words
  ‣ Words represented by their documents

          …   state   fun   heaven   …
  …
  425           0      1       0
  426           3      0       0
  427           0      0       0
  …
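A minimal sketch of the two viewpoints on the same matrix, using the toy counts from the table above (document and word labels are illustrative only):

```python
import numpy as np

# Term-document count matrix from the slide: rows = documents 425-427,
# columns = the words "state", "fun", "heaven".
A = np.array([[0, 1, 0],
              [3, 0, 0],
              [0, 0, 0]])

doc_425 = A[0]       # viewpoint 1: a document represented by its word counts
state   = A[:, 0]    # viewpoint 2: a word represented by the documents it occurs in
print(doc_425, state)
```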
Manipulating the VSM
• Weighting the values (beyond frequency)
• Creating low-dimensional dense vectors
Tf-idf
• Standard weighting scheme for information retrieval
• Discounts common words!

  tf matrix (with document frequencies):
          …   the   country   hell   …
  …
  425          43      5        1
  426          24      1        0
  427          37      0        3
  …
  df          500     14        7

  tf-idf matrix:
          …   the   country   hell   …
  …
  425           0    26.0      6.2
  426           0     5.2      0
  427           0     0       18.6
  …

  idf_w = log( |D| / df_w )
  where |D| = total number of documents, and df_w = number of documents that contain w
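A minimal sketch of the weighting, assuming log base 2 and taking |D| = 500 as the total document count (consistent with df(the) = 500 above); it approximately reproduces the slide's tf-idf matrix, with small differences due to rounding:

```python
import numpy as np

# tf counts from the slide: rows = documents 425-427, columns = the, country, hell
tf = np.array([[43., 5., 1.],
               [24., 1., 0.],
               [37., 0., 3.]])
df = np.array([500., 14., 7.])   # document frequency of each word
D = 500                          # assumed total number of documents

idf = np.log2(D / df)            # idf_w = log(|D| / df_w); common words get idf ~ 0
tfidf = tf * idf                 # weight every document's counts by idf
print(tfidf.round(1))
```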
Dimensionality Reduction
• Term-document matrices are very sparse
• Dimensionality reduction: create shorter, denser vectors
• More practical (fewer features)
• Remove noise (less overfitting)
Singular Value Decomposition

  A = UΣVᵀ

  A:  the term-document matrix (|V| × |D|), sparse
  U:  the new term matrix (|V| × m)
  Σ:  the diagonal matrix of singular values (m × m)
  Vᵀ: the new document matrix (m × |D|)

  where m = Rank(A)
Truncating – Latent Semantic Analysis
• Truncating U, Σ, and Vᵀ to k dimensions produces the best possible rank-k approximation of the original matrix
• Uk is a new low-dimensional representation of words
• Typical values for k are 100–5000

  U (|V| × m)  →  keep only the first k columns  →  Uk (|V| × k)

  Each row of Uk is a dense k-dimensional word vector (see the sketch below).
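A minimal numpy sketch of the SVD plus truncation; the matrix here is a random stand-in for the weighted term-document (or word-context) matrix, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 500))      # stand-in for a |V| x |D| weighted term-document matrix

# Thin SVD: A = U @ np.diag(s) @ Vt, singular values s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 100                          # truncate to k dimensions
Uk = U[:, :k]                    # each row of Uk is a dense k-dimensional word vector
# (another common choice is to scale by the singular values: U[:, :k] * s[:k])
print(Uk.shape)                  # (1000, 100)
```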
Words as Context
• Lists how often words appear with other words
  ‣ In some predefined context (e.g. a window of N words)
• The obvious problem with raw frequency: dominated by common words
  ‣ But we can't use tf-idf!

          …   the    country   hell   …
  …
  state       1973      10       1
  fun           54       2       0
  heaven        55       1       3
  …
Pointwise Mutual Information
• For two events x and y, PMI computes the discrepancy between:
  ‣ Their joint distribution: P(x, y)
  ‣ Their individual distributions (assuming independence): P(x)P(y)

  PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ]
Calculating PMI

          …   the     country   hell   …        Σ
  …
  state       1973       10       1          12786
  fun           54        2       0            633
  heaven        55        1       3            627
  …
  Σ         1047519     3617     780       15871304

  PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ]
  P(x, y) = count(x, y) / Σ      P(x) = Σx / Σ      P(y) = Σy / Σ

  Worked example: x = state, y = country (see the sketch below)
  P(x, y) = 10 / 15871304 = 6.3 × 10⁻⁷
  P(x) = 12786 / 15871304 = 8.0 × 10⁻⁴
  P(y) = 3617 / 15871304 = 2.3 × 10⁻⁴
  PMI(x, y) = log2 [ 6.3 × 10⁻⁷ / ((8.0 × 10⁻⁴)(2.3 × 10⁻⁴)) ] = 1.78
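The same worked example as a minimal Python sketch (counts taken from the table above):

```python
import math

count_xy = 10          # count(state, country)
sum_x    = 12786       # row total for "state"
sum_y    = 3617        # column total for "country"
total    = 15871304    # grand total of all co-occurrence counts

p_xy = count_xy / total
p_x  = sum_x / total
p_y  = sum_y / total

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # ~1.78, as on the slide
```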
PMI(heaven, hell)?
(Use the co-occurrence table and PMI formulas from the previous slide.)

  PollEv.com/jeyhanlau569
PMI(heaven, hell)?

  From the table: count(heaven, hell) = 3, Σheaven = 627, Σhell = 780, Σ = 15871304

  PMI(heaven, hell) = log2 [ (3 / 15871304) / ((627 / 15871304) × (780 / 15871304)) ] = 6.61
PMI Matrix
• PMI does a better job of capturing semantics
  ‣ E.g. heaven and hell
• But very biased towards rare word pairs
• And doesn't handle zeros well

          …   the    country   hell   …
  …
  state       1.22     1.78     0.63
  fun         0.37     3.79     -inf
  heaven      0.41     2.80     6.61
  …
PMI Tricks
• Zero all negative values (Positive PMI)
  ‣ Avoids –inf and unreliable negative values
• Counter bias towards rare events
  ‣ Normalised PMI: PMI(x, y) / ( −log2 P(x, y) ), see the sketch below
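A minimal sketch of both tricks; the heaven/hell numbers below are taken from the earlier slides, and the normalisation PMI(x, y) / −log2 P(x, y) is the standard form of NPMI, stated here as an assumption about what the slide intends:

```python
import numpy as np

def ppmi(pmi):
    """Positive PMI: clamp negative (and -inf) PMI values to zero."""
    return np.maximum(pmi, 0.0)

def npmi(pmi, p_xy):
    """Normalised PMI: PMI(x, y) / -log2 P(x, y), bounded in [-1, 1]."""
    return pmi / -np.log2(p_xy)

pmi_heaven_hell = 6.61
p_heaven_hell = 3 / 15871304
print(ppmi(np.array([6.61, -np.inf])))                   # [6.61, 0.]
print(round(npmi(pmi_heaven_hell, p_heaven_hell), 2))    # ~0.3
```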
SVD (A = UΣVᵀ)
• Regardless of whether we use document or word as context, SVD can be applied to create dense vectors

  tf-idf matrix:
          …   the   country   hell   …
  …
  425           0    26.0      6.2
  426           0     5.2      0
  427           0     0       18.6
  …

  PPMI matrix:
          …   the    country   hell   …
  …
  state       1.22     1.78     0.63
  fun         0.37     3.79     0
  heaven      0.41     2.80     6.61
  …
Neural Methods
Word Embeddings
• We've seen word embeddings used in neural networks (lectures 7 and 8)
• But these models are designed for other tasks:
  ‣ Classification
  ‣ Language modelling
• Word embeddings are just a by-product
Neural Models for Embeddings
• Can we design neural networks whose goal is to
purely learn word embeddings?
• Desiderata:
‣ Unsupervised
‣ Efficient
Word2Vec
• Core idea
‣ You shall know a word by the company it keeps
‣ Predict a word using context words
Word2Vec
• Framed as learning a classifier
  ‣ Skip-gram: predict surrounding words of target word
  ‣ CBOW: predict target word using surrounding words
• Use surrounding words within L positions; L = 2 in the example below

  … Bereft of life he rests in peace! If you hadn't nailed him …

  Skip-gram: P(life | rests), P(he | rests), P(in | rests), P(peace | rests)
  CBOW:      P(rests | {life, he, in, peace})
Skip-gram Model
• Predicts each neighbouring word given the target word

  … Bereft of life he rests in peace! If you hadn't nailed him …
  P(life | rests), P(he | rests), P(in | rests), P(peace | rests)

• Total probability defined as

  ∏_{l ∈ {−L, …, −1, 1, …, L}} P(w_{t+l} | w_t)

• Using a logistic regression model

  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

  where W_rests is the word embedding of rests, C_life is the word embedding of life, and W_rests · C_life is their dot product (see the sketch below)
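A minimal numpy sketch of this softmax over dot products; the embeddings are random and the word indices are made up, purely to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                       # vocabulary size, embedding dimension
W = rng.normal(scale=0.1, size=(V, d))   # target-word embedding matrix
C = rng.normal(scale=0.1, size=(V, d))   # context-word embedding matrix

def skipgram_prob(target_id, context_id):
    """P(context | target) = exp(W[target] . C[context]) / sum_u exp(W[target] . C[u])."""
    scores = C @ W[target_id]            # dot product with every context embedding
    scores -= scores.max()               # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

print(skipgram_prob(42, 7))
```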
Embedding parameterisation
• Two word embedding matrices (W and C)!

  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

• Words are numbered, e.g., by sorting the vocabulary and using each word's position as its index
Skip-gram Model (Fig 19.18, JM3)

[Figure 19.17, JM3: the word matrix W (|Vw| × d) and the context matrix C (|Vc| × d), with embeddings shown as row vectors; row i of W is the target word embedding for word i, and row j of C is the context word embedding for word j.]

[Figure 19.18, JM3: the skip-gram model (Mikolov et al. 2013, Mikolov et al. 2013a). A one-hot input vector (1 × |V|) for the target word wt is multiplied by W (|V| × d) to give the projection layer (1 × d, the embedding of wt), which is then multiplied by C (d × |V|) to give probabilities of the context words wt−1 and wt+1.]

From the accompanying JM3 text: the goal is to compute P(wk | wj). We begin with an input vector x, which is a one-hot vector for the current word wj: a vector with one element equal to 1 and all other elements set to zero, i.e. xj = 1 and xi = 0 for all i ≠ j (Figure 19.19). We then predict the probability of each of the 2C output words (in Fig. 19.18, the two output words wt−1 and wt+1) in 3 steps:

1. Select the embedding from W: x is multiplied by W, the input matrix, to give the hidden or projection layer. Since W stores one embedding per vocabulary word, and the input is a one-hot vector for wj, the projection layer for input x will be h = vj, the input embedding for wj.
Training the skip-gram model
• Train to maximise the likelihood of raw text
• Too slow in practice, due to the normalisation over |V|:

  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

• Reduce the problem to binary classification
  ‣ (life, rests) → real context word
  ‣ (aardvark, rests) → non-context word
  ‣ How to draw non-context words (negative samples)?
  ‣ Randomly from V
Negative Sampling (JM3, §6.8.2: Learning skip-gram embeddings)

Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.

Consider a single piece of the training data:

  … lemon, a [tablespoon of apricot jam, a] pinch …
              c1         c2 t       c3   c4

This example has a target word t (apricot) and 4 context words in the L = ±2 window, resulting in 4 positive training instances (left below). For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter k. So for each (t, c) training instance we create k negative samples, each consisting of the target t plus a "noise word": a random word from the lexicon, constrained not to be the target word t. The right below shows the setting where k = 2, so there are 2 negative examples for each positive example (t, c).

  positive examples (t, c)      negative examples (t, c)
  apricot   tablespoon          apricot   aardvark      apricot   twelve
  apricot   of                  apricot   puddle        apricot   hello
  apricot   preserves           apricot   where         apricot   dear
  apricot   or                  apricot   coaxial       apricot   forever

The noise words are chosen according to their weighted unigram frequency P_α(w), where α is a weight. If we were sampling according to unweighted frequency p(w), we would choose "the" as a noise word with unigram probability p("the"), "aardvark" with probability p("aardvark"), and so on. But in practice it is common to set α = 0.75, i.e. use the weighting p^(3/4)(w):

  P_α(w) = count(w)^α / Σ_{w′} count(w′)^α                              (6.31)

Setting α = 0.75 gives better performance because it gives rare noise words slightly higher probability: for rare words, P_α(w) > P(w). To visualise this intuition, work out the probabilities for an example with two events, P(a) = .99 and P(b) = .01:

  P_α(a) = .99^.75 / (.99^.75 + .01^.75) = .97
  P_α(b) = .01^.75 / (.99^.75 + .01^.75) = .03                          (6.32)

Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings so as to:

• maximise the similarity between the target word and its real context words:
    P(+ | t, c) = 1 / (1 + e^(−t·c)), i.e. the sigmoid σ(x) = 1 / (1 + e^(−x)) applied to t·c
• minimise the similarity between the target word and non-context words:
    P(− | t, c) = 1 − 1 / (1 + e^(−t·c))
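A minimal sketch of the noise distribution and the binary (sigmoid) scoring described above; the two-event counts are chosen to reproduce the .97 / .03 example, and the embeddings are random stand-ins:

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """P_alpha(w) proportional to count(w)**alpha: slightly flattens the unigram
    distribution, giving rare words more chance of being drawn as noise words."""
    weighted = counts ** alpha
    return weighted / weighted.sum()

counts = np.array([990_000.0, 10_000.0])         # toy lexicon with P(a)=.99, P(b)=.01
print(noise_distribution(counts).round(2))       # ~[0.97, 0.03]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
t, c = rng.normal(size=50), rng.normal(size=50)  # stand-ins for embeddings t and c
print(sigmoid(t @ c))                            # P(+ | t, c): pair looks like a real one
```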
Skip-gram Loss
• In practice, use k negative examples (one common form of the objective is sketched below)
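A commonly used form of the skip-gram negative-sampling objective for one positive pair (t, c) with k sampled noise words n_1, …, n_k, consistent with the P(+|t,c) and P(−|t,c) definitions above (stated here as the standard formulation, not necessarily the exact equation on the slide):

```latex
\mathcal{L}(t, c) = -\Big[\, \log \sigma(t \cdot c) + \sum_{i=1}^{k} \log \sigma(-t \cdot n_i) \,\Big],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```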
Desiderata
• Unsupervised
  ‣ Unlabelled corpus
• Efficient
  ‣ Negative sampling (avoids the softmax over the full vocabulary)
  ‣ Scales to very, very large corpora
Problems with word vectors/embeddings (count and neural methods)?

  PollEv.com/jeyhanlau569

• Difficult to quantify the quality of word vectors
• Don't capture polysemous words
• Data sparsity (not enough data to learn them)
• Difficult to incorporate them into downstream tasks
Evaluation
Word Similarity
• Measure similarity of two words using cosine
similarity
• Compare predicted similarity with human intuition
• Datasets
‣ WordSim-353 are pairs of nouns with judged
relatedness
‣ SimLex-999 also covers verbs and adjectives
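A minimal sketch of the cosine similarity used to score a word pair (the vectors here are made up, not learned embeddings):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.2, 0.9, 0.1])    # toy vector for one word
v = np.array([0.25, 0.8, 0.0])   # toy vector for another
print(round(cosine(u, v), 3))    # compare such scores against human ratings (e.g. WordSim-353)
```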
Word Analogy
• Man is to King as Woman is to ???
• v(Man) – v(King) = v(Woman) – v(???)
• v(???) = v(Woman) – v(Man) + v(King)
• Find word whose embedding is closest to
v(Woman) – v(Man) + v(King)
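A minimal sketch of the analogy lookup, using tiny hand-made vectors (not learned embeddings) chosen so the arithmetic works out:

```python
import numpy as np

words = ["man", "woman", "king", "queen"]
emb = np.array([[1., 0., 0.],
                [1., 1., 0.],
                [1., 0., 1.],
                [1., 1., 1.]])
word2id = {w: i for i, w in enumerate(words)}

def analogy(a, b, c):
    """Word whose embedding is closest (by cosine) to v(b) - v(a) + v(c), excluding a, b, c."""
    target = emb[word2id[b]] - emb[word2id[a]] + emb[word2id[c]]
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ (target / np.linalg.norm(target))
    candidates = [i for i in range(len(words)) if words[i] not in {a, b, c}]
    return words[max(candidates, key=lambda i: sims[i])]

print(analogy("man", "woman", "king"))   # -> "queen"
```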
Embedding Space
• Word2Vec embeddings show interesting geometry
• This explains why they are good at the word analogy task
Downstream Tasks
• Best evaluation is in other downstream tasks
  ‣ Use bag-of-word embeddings as a feature representation in a classifier
  ‣ The first layer of most deep learning models is to embed input text
  ‣ Initialise them with pretrained word vectors!
General Findings
• neural > count
• Contextual word representations are shown to work even better
• Dynamic word vectors that change depending on context!
• ELMo & BERT (next lecture!)
Pointers to Software
• Word2Vec
  ‣ C implementation of Skip-gram and CBOW
    https://code.google.com/archive/p/word2vec/
• GenSim
  ‣ Python library with many methods including LSI, topic models and Skip-gram/CBOW
    https://radimrehurek.com/gensim/index.html
• GloVe
  ‣ http://nlp.stanford.edu/projects/glove/
Further Reading
‣ JM3, Ch 6