Distributional Semantics
COMP90042 Natural Language Processing, Lecture 10
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE

Lexical Databases – Problems
• Manually constructed
  ‣ Expensive
  ‣ Human annotation can be biased and noisy
• Language is dynamic
  ‣ New words: slang, terminology, etc.
  ‣ New senses
• The Internet provides us with massive amounts of text. Can we use that to obtain word meanings?

Distributional Hypothesis
• “You shall know a word by the company it keeps” (Firth, 1957)
• Document co-occurrence often indicative of topic (document as context)
  ‣ E.g. voting and politics
• Local context reflects a word’s semantic class (word window as context)
  ‣ E.g. eat a pizza, eat a burger

Guessing Meaning from Context
• Learn unknown word from its usage
  ‣ E.g., tezgüino (the example is from Lin, 1998):
    (14.1) A bottle of ___ is on the table.
    (14.2) Everybody likes ___.
    (14.3) Don’t have ___ before you drive.
    (14.4) We make ___ out of corn.
• Look at other words in same (or similar) contexts

              (14.1)  (14.2)  (14.3)  (14.4)  …
  tezgüino       1       1       1       1
  loud           0       0       0       0
  motor oil      1       0       0       1
  tortillas      0       1       0       1
  choices        0       1       0       0
  wine           1       1       1       0

(Table 14.1, Eisenstein Ch. 14: distributional statistics for tezgüino and five related terms. The surrounding excerpt notes that learning meaningful representations of word meaning from unlabeled data, via the distributional hypothesis, greatly improves the generalisation of NLP systems to unseen words.)

Word Vectors
• Each row can be thought of as a word vector
• It describes the distributional properties of a word
• Captures all sorts of semantic relationships (synonymy, analogy, etc.)

              (14.1)  (14.2)  (14.3)  (14.4)  …
  tezgüino       1       1       1       1
  loud           0       0       0       0
  motor oil      1       0       0       1
  tortillas      0       1       0       1
  choices        0       1       0       0
  wine           1       1       1       0

(Table 14.1, Eisenstein: distributional statistics for tezgüino and five related terms. Figure 14.1, Eisenstein: lexical semantic relationships have regular linear structures in two-dimensional projections of distributional statistics (Pennington et al., 2014).)

Word Embeddings?
• We’ve seen word vectors before: word embeddings!
• Here we will learn other ways to produce word vectors
  ‣ Count-based methods
  ‣ More efficient neural methods designed just for learning word vectors

Count-Based Methods

The Vector Space Model
• Fundamental idea: represent meaning as a vector
• Consider documents as context
• One matrix, two viewpoints
  ‣ Documents represented by their words
  ‣ Words represented by their documents

          …   state   fun   heaven   …
  …
  425           0      1      0
  426           3      0      0
  427           0      0      0
  …

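A minimal sketch of how such a term-document count matrix could be built; the toy documents below are made up purely for illustration:

```python
import numpy as np
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical examples).
docs = [
    "the state votes on the state budget".split(),
    "fun fun fun in the sun".split(),
    "heaven and hell".split(),
]

vocab = sorted({w for doc in docs for w in doc})
word_to_id = {w: i for i, w in enumerate(vocab)}

# Term-document matrix: rows = words, columns = documents.
td = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for w, count in Counter(doc).items():
        td[word_to_id[w], j] = count

print(vocab)
print(td)   # column j = document j as a bag of words
            # row i    = word i represented by its documents
```
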
Manipulating the VSM
• Weighting the values (beyond frequency)
• Creating low-dimensional dense vectors
• Comparing vectors

Tf-idf
• Standard weighting scheme for information retrieval
• Discounts common words:

    idf_w = log( |D| / df_w )

  tf matrix (raw counts; df = document frequency):
          the   country   hell
  …
  425      43      5        1
  426      24      1        0
  427      37      0        3
  …
  df      500     14        7

  tf-idf matrix (tf × idf; a base-2 log and |D| = 500, so that idf("the") = 0, reproduce these values):
          the   country   hell
  …
  425       0    26.0      6.2
  426       0     5.2      0
  427       0     0       18.6
  …

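A sketch of the weighting above on the slide's own counts; the total number of documents (|D| = 500) and the base-2 logarithm are inferred from the table rather than stated explicitly on the slide:

```python
import numpy as np

# Raw counts from the slide: rows = documents 425-427, columns = (the, country, hell).
tf = np.array([[43, 5, 1],
               [24, 1, 0],
               [37, 0, 3]], dtype=float)

df = np.array([500, 14, 7], dtype=float)   # document frequency of each word
D = 500                                    # total documents (implied by idf("the") = 0)

idf = np.log2(D / df)
tfidf = tf * idf                           # broadcast idf across each row

print(np.round(tfidf, 1))
# [[ 0.  25.8  6.2]
#  [ 0.   5.2  0. ]
#  [ 0.   0.  18.5]]
# (the slide rounds idf to one decimal first, hence its 26.0 and 18.6)
```
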
Dimensionality Reduction
• Term-document matrices are very sparse
• Dimensionality reduction: create shorter, denser vectors
  ‣ More practical (fewer features)
  ‣ Remove noise (less overfitting)

Singular Value Decomposition

    A = U Σ Vᵀ

  A:   term-document matrix, |V| × |D|
  U:   new term matrix, |V| × m
  Σ:   singular values (diagonal matrix), m × m
  Vᵀ:  new document matrix, m × |D|
  m = Rank(A)

(The slide's example shows a sparse 0/1 matrix A, a diagonal Σ with decreasing singular values 9.1, 4.4, 2.3, …, 0.1, and dense matrices U and Vᵀ.)

Truncating – Latent Semantic Analysis
• Truncating U, Σ, and Vᵀ to k dimensions produces the best possible rank-k approximation of the original matrix
• The truncated U_k (|V| × k), or V_kᵀ for documents, is a new low-dimensional representation of each word: these rows are our word vectors!
• Typical values for k are 100-5000

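A minimal numpy sketch of the SVD and truncation steps, using an arbitrary toy matrix A; whether to scale the word vectors by the singular values afterwards is a design choice, shown here as an option:

```python
import numpy as np

# A: toy term-document matrix (|V| x |D|); the values are arbitrary.
A = np.array([[0., 1., 0.],
              [3., 0., 0.],
              [1., 2., 4.],
              [0., 0., 1.]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ np.diag(S) @ Vt

k = 2
U_k  = U[:, :k]          # |V| x k  -> low-dimensional word representations
S_k  = S[:k]
Vt_k = Vt[:k, :]         # k x |D|  -> low-dimensional document representations

A_k = U_k @ np.diag(S_k) @ Vt_k   # best rank-k approximation of A
word_vectors = U_k * S_k          # optionally scale each dimension by its singular value
print(word_vectors.shape)         # (4, 2)
```
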
Words as Context
• Lists how often words appear with other words
  ‣ In some predefined context (usually a window)
• The obvious problem with raw frequency: dominated by common words

           the   country   hell
  …
  state   1973     10        1
  fun       54      2        0
  heaven    55      1        3
  …

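A sketch of collecting window-based co-occurrence counts; the tokenisation and window size here are arbitrary illustrative choices:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    """Count how often each word appears within `window` positions of another word."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[w][tokens[j]] += 1
    return counts

tokens = "the state of heaven is fun and the state of hell is not".split()
counts = cooccurrence_counts(tokens, window=2)
print(dict(counts["state"]))
# -> {'the': 2, 'of': 2, 'heaven': 1, 'and': 1, 'hell': 1}
```
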
Pointwise Mutual Information
• For two events x and y, PMI computes the discrepancy between:
  ‣ Their joint distribution
  ‣ Their individual distributions (assuming independence)

    PMI(x, y) = log₂ [ p(x, y) / (p(x) p(y)) ]

Calculating PMI

           the     country   hell   …      Σ
  …
  state   1973       10        1         12786
  fun       54        2        0           633
  heaven    55        1        3           627
  …
  Σ     1047519     3617      780      15871304

• p(x, y) = count(x, y) / Σ
• p(x) = Σ_x / Σ
• p(y) = Σ_y / Σ
• Example: x = state, y = country
    p(x, y) = 10 / 15871304 = 6.3 × 10⁻⁷
    p(x) = 12786 / 15871304 = 8.0 × 10⁻⁴
    p(y) = 3617 / 15871304 = 2.3 × 10⁻⁴
    PMI(x, y) = log₂ [ (6.3 × 10⁻⁷) / ((8.0 × 10⁻⁴)(2.3 × 10⁻⁴)) ] = 1.78

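The worked example above can be checked directly; the counts are taken from the table on this slide:

```python
import math

count_xy = 10          # count(state, country)
sum_x    = 12786       # row total for "state"
sum_y    = 3617        # column total for "country"
total    = 15871304    # grand total

p_xy = count_xy / total
p_x  = sum_x / total
p_y  = sum_y / total

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # 1.78
```
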
PMI Matrix
• PMI does a better job of capturing interesting semantics
  ‣ E.g. heaven and hell
• But it is obviously biased towards rare words
• And doesn’t handle zeros well

           the   country   hell
  …
  state   1.22    1.78     0.63
  fun     0.37    3.79     -inf
  heaven  0.41    2.80     6.60
  …

PMI Tricks
• Zero all negative values (PPMI)
  ‣ Avoid –inf and unreliable negative values
• Counter bias towards rare events
  ‣ Smooth probabilities

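A sketch of PPMI over a word-context count matrix. Raising the context counts to a power below 1 before normalising is one common way to smooth the probabilities and counter the rare-word bias; the exponent 0.75 here is a conventional choice, not something specified on the slide:

```python
import numpy as np

def ppmi(counts, alpha=0.75):
    """Positive PMI from a word-by-context count matrix, with optional
    context-distribution smoothing (alpha < 1 deflates rare contexts)."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    # Smoothed context probabilities: raise context counts to alpha, then normalise.
    context_counts = counts.sum(axis=0) ** alpha
    p_c = context_counts / context_counts.sum()
    with np.errstate(divide="ignore"):            # log(0) -> -inf, silently
        pmi = np.log2(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)                   # zero out negatives (and -inf)

counts = np.array([[1973., 10., 1.],
                   [  54.,  2., 0.],
                   [  55.,  1., 3.]])
print(np.round(ppmi(counts, alpha=1.0), 2))
# alpha=1.0 gives plain PPMI; values differ from the slide, which uses full-corpus marginals.
```
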
SVD
• Regardless of whether we use document or word as context, SVD can be applied to create dense vectors

  tf-idf matrix:                        PPMI matrix:
          the   country   hell                  the   country   hell
  …                                     …
  425       0    26.0      6.2          state   1.22    1.78    0.63
  426       0     5.2      0            fun     0.37    3.79    0
  427       0     0       18.6          heaven  0.41    2.80    6.60
  …                                     …

Similarity
• Word similarity = comparison between word vectors (e.g. cosine similarity)
• Find synonyms, based on proximity in vector space
  ‣ Automatic construction of lexical resources
• Use vectors as features in a classifier (more robust to different inputs: movie vs film)

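A quick sketch of cosine similarity between two word vectors; the vectors below are made up for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (1 = same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

movie = np.array([0.8, 0.1, 0.5])   # made-up vectors for illustration
film  = np.array([0.7, 0.2, 0.5])
print(round(float(cosine_similarity(movie, film)), 3))   # about 0.99 for these vectors
```
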
Neural Methods

Word Embeddings
• We’ve seen word embeddings used in neural networks (feedforward or recurrent)
• But these models are designed for other tasks:
  ‣ Classification
  ‣ Language modelling
• Word embeddings are just one part of the model

Neural Models for Embeddings
• Can we design neural networks whose goal is to learn word embeddings?
• Desiderata:
  ‣ Unsupervised
  ‣ Efficient
  ‣ Useful representation

Word2Vec
• Neural network inspired approaches seek to learn vector representations of words and their contexts
• Key idea
  ‣ Word embeddings should be similar to embeddings of neighbouring words
  ‣ And dissimilar to other words that don’t occur nearby
• Using vector dot product for vector ‘comparison’:
    u · v = Σ_j u_j v_j

Word2Vec
• Framed as learning a classifier
  ‣ Skip-gram: predict words in local context surrounding given word
      P(he | rests)  P(life | rests)  P(in | rests)  P(peace | rests)
      … Bereft of life he rests in peace! If you hadn’t nailed him …
  ‣ CBOW: predict word in centre, given words in the local surrounding context
      P(rests | {he, in, life, peace})
• Local context means words within L positions, L = 2 above

Skip-gram Model
• Generates each word in the context given the centre word
    P(life | rests)  P(he | rests)  P(in | rests)  P(peace | rests)
    … Bereft of life he rests in peace! If you hadn’t nailed him …
• Total probability defined as
    ∏_{l ∈ −L,…,−1,1,…,L} P(w_{t+l} | w_t)
  ‣ where the subscript denotes position in the running text
• Using a logistic regression model, e.g.
    P(life | rests) = exp(c_life · w_rests) / Σ_{u ∈ V} exp(c_u · w_rests)

Embedding parameterisation
• Two parameter matrices, with a d-dimensional embedding for all words
  ‣ W: target word embeddings, |V_w| × d (row i is the target word embedding for word i)
  ‣ C: context word embeddings, |V_c| × d (row j is the context word embedding for word j)
• Words are numbered, e.g., by sorting the vocabulary and using word location as its index
(Fig 19.17, JM3: the word matrix W and context matrix C, with embeddings shown as row vectors, learned by the skip-gram model.)

Skip-gram model
• Input layer: 1-hot input vector x (1 × |V|) for the centre word w_t
• Projection layer: multiplying by W (|V| × d) selects the embedding for w_t (1 × d)
• Output layer: multiplying by C (d × |V|) and normalising gives, for each context position (w_{t−1}, w_{t+1}, …), the probabilities of context words y_1 … y_|V|
(Fig 19.18, JM3: the skip-gram model (Mikolov et al. 2013, Mikolov et al. 2013a); the goal is to compute P(w_k | w_j).)

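A minimal numpy sketch of that forward computation: look up the target embedding in W and score every candidate context word against C with a softmax. The vocabulary size, dimensionality, and random values are toy stand-ins, and both matrices are stored with one row per word (the figure draws C as d × |V|):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension

W = rng.normal(size=(V, d))       # target (centre-word) embeddings, |V| x d
C = rng.normal(size=(V, d))       # context-word embeddings,        |V| x d

def context_probs(t_index):
    """P(context word = k | centre word = t) for every k, via softmax over dot products."""
    t = W[t_index]                        # d-dimensional target embedding
    scores = C @ t                        # |V| dot products
    scores -= scores.max()                # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

probs = context_probs(t_index=3)
print(probs.shape, probs.sum())           # (10,) and approximately 1.0
```
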
Training the skip-gram model
• Train to maximise likelihood of raw text
• Too slow in practice, due to normalisation over |V|
• Reduce problem to binary classification: distinguish real context words from non-context words, aka “negative samples”
  ‣ Words drawn randomly from V

Negative Sampling
Consider a single piece of training data:

    … lemon, a [tablespoon of apricot jam, a] pinch …
                    c1       c2    t    c3  c4

This example has a target word t (apricot) and 4 context words in the L = ±2 window, giving 4 positive training instances. For each (t, c) instance we also create k negative samples, each pairing t with a ‘noise word’: a random word from the lexicon (never t itself), chosen according to its weighted unigram frequency. Below, k = 2:

  positive examples +          negative examples −
  t        c                   t        c            t        c
  apricot  tablespoon          apricot  aardvark     apricot  twelve
  apricot  of                  apricot  puddle       apricot  hello
  apricot  preserves           apricot  where        apricot  dear
  apricot  or                  apricot  coaxial      apricot  forever

• Maximise similarity between the target word and real context words:
    P(+|t, c) = 1 / (1 + e^(−t·c))
• Minimise similarity between the target word and non-context words:
    P(−|t, c) = 1 − 1 / (1 + e^(−t·c))

Skip-gram Loss
• For one positive pair and its negative samples:
    L(t, c_pos) = − [ log σ(c_pos · t) + Σ_{i=1}^{k} log σ(−c_negᵢ · t) ],  where σ(x) = 1 / (1 + e^(−x))
• In practice, use k negative examples

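A sketch of this loss for one positive pair and k negative samples, with target and context embeddings as plain numpy vectors; this mirrors the formula above rather than any particular library's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(t, c_pos, c_negs):
    """Negative-sampling loss for one (target, context) pair plus k noise words."""
    pos_term = np.log(sigmoid(np.dot(c_pos, t)))
    neg_term = sum(np.log(sigmoid(-np.dot(c_neg, t))) for c_neg in c_negs)
    return -(pos_term + neg_term)

rng = np.random.default_rng(1)
d = 4
t      = rng.normal(size=d)                       # target embedding (e.g. "apricot")
c_pos  = rng.normal(size=d)                       # real context word (e.g. "jam")
c_negs = [rng.normal(size=d) for _ in range(2)]   # k = 2 noise words
print(float(sgns_loss(t, c_pos, c_negs)))
```
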
Training Illustration
• Iterative process (stochastic gradient descent)
  ‣ Each step moves embeddings closer for context words: increase similarity(apricot, jam), i.e. w_j · c_k
  ‣ And moves embeddings apart for noise samples: decrease similarity(apricot, aardvark), i.e. w_j · c_n
• The model thus learns two separate embeddings for each word w: a target embedding (a row of the target matrix) and a context embedding (a row of the context matrix C)
(Fig 6.13, JM3 Ch 6: the skip-gram model shifts embeddings so the target embedding for apricot is closer to, i.e. has a higher dot product with, the context embeddings of nearby words like jam, and further from the context embeddings of noise words like aardvark.)

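One way to picture a single SGD step on that objective. This is only a sketch: the update rules follow from differentiating the negative-sampling loss shown two slides back, and the vectors and learning rate are toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(t, c_pos, c_negs, lr=0.1):
    """One gradient step: pull c_pos towards t, push each c_neg away from t,
    and update the target embedding t accordingly."""
    grad_t = (sigmoid(c_pos @ t) - 1.0) * c_pos
    c_pos_new = c_pos - lr * (sigmoid(c_pos @ t) - 1.0) * t
    c_negs_new = []
    for c_neg in c_negs:
        grad_t += sigmoid(c_neg @ t) * c_neg
        c_negs_new.append(c_neg - lr * sigmoid(c_neg @ t) * t)
    t_new = t - lr * grad_t
    return t_new, c_pos_new, c_negs_new

rng = np.random.default_rng(2)
t, c_pos = rng.normal(size=4), rng.normal(size=4)
c_negs = [rng.normal(size=4) for _ in range(2)]
t, c_pos, c_negs = sgd_step(t, c_pos, c_negs)
print(sigmoid(c_pos @ t))   # similarity to the real context word increases over steps
```
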
Desiderata
• Unsupervised
  ‣ Raw, unlabelled corpus
• Efficient
  ‣ Negative sampling (avoid softmax over full vocabulary)
  ‣ Scales to very large corpora
• Useful representation:
  ‣ How do we evaluate word vectors?

Evaluating Word Vectors
• Lexicon-style tasks
  ‣ WordSim-353: pairs of nouns with judged relatedness
  ‣ SimLex-999: also covers verbs and adjectives
  ‣ TOEFL: asks for closest synonym as multiple choice
  ‣ …
• Test compatibility of word pairs using cosine similarity in vector space

Embeddings Exhibit Meaningful Geometry
• Word analogy task
  ‣ Man is to King as Woman is to ???
  ‣ v(Man) – v(King) = v(Woman) – v(???)
  ‣ v(???) = v(Woman) – v(Man) + v(King)

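A sketch of answering the analogy by nearest-neighbour search in the vector space. The vocabulary and embedding matrix below are random stand-ins; with trained embeddings the nearest neighbour of the combined vector is often the expected answer (here, "queen"):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["man", "woman", "king", "queen", "apple"]
E = rng.normal(size=(len(vocab), 8))            # stand-in embedding matrix
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalise rows
word_to_id = {w: i for i, w in enumerate(vocab)}

def analogy(a, b, c):
    """Return the word whose vector is closest (by cosine) to v(b) - v(a) + v(c)."""
    target = E[word_to_id[b]] - E[word_to_id[a]] + E[word_to_id[c]]
    sims = E @ (target / np.linalg.norm(target))
    for w in (a, b, c):
        sims[word_to_id[w]] = -np.inf           # don't return the query words themselves
    return vocab[int(np.argmax(sims))]

print(analogy("man", "woman", "king"))          # "queen" with good embeddings; random here
```
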
Evaluating Word Vectors
• Best evaluation is in other downstream tasks
  ‣ Use bag-of-word embeddings as a feature representation in a classifier
  ‣ First layer of most deep learning models is to embed input text; use pre-trained word vectors as embeddings
• Recently, contextual word vectors have been shown to work even better
  ‣ ELMo & BERT (next lecture!)

Pointers to Software
• Word2Vec
  ‣ C implementation of Skip-gram and CBOW
    https://code.google.com/archive/p/word2vec/
• GloVe
  ‣ http://nlp.stanford.edu/projects/glove/
• Gensim
  ‣ Python library with many methods including LSI, topic models and Skip-gram/CBOW
    https://radimrehurek.com/gensim/index.html

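As a usage sketch, Gensim's Word2Vec trains Skip-gram (or CBOW) embeddings from tokenised sentences. Parameter names below follow Gensim 4.x (older releases use size instead of vector_size), and the toy corpus is purely illustrative:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (replace with a real corpus).
sentences = [
    ["bereft", "of", "life", "he", "rests", "in", "peace"],
    ["we", "make", "tezguino", "out", "of", "corn"],
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality d of the embeddings
    window=2,          # local context size L
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    negative=5,        # number of negative samples k
    min_count=1,
)

vec = model.wv["rests"]                 # the learned word vector
print(model.wv.most_similar("rests"))   # nearest neighbours by cosine similarity
```
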
Further Reading
‣ JM3, Ch 6