Distributional Semantics
COMP90042
Natural Language Processing
Lecture 10
Semester 1 2021 Week 5 Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Lexical Databases – Problems

• Manually constructed
  ‣ Expensive
  ‣ Human annotation can be biased and noisy
• Language is dynamic
  ‣ New words: slang, terminology, etc.
  ‣ New senses
• The Internet provides us with massive amounts of text. Can we use it to obtain word meanings?
Distributional Hypothesis

• "You shall know a word by the company it keeps" (Firth, 1957)
• Document co-occurrence is indicative of topic (document as context)
  ‣ E.g. voting and politics
• Local context reflects a word's meaning (word window as context)
  ‣ E.g. eat a pizza vs. eat a burger
Guessing Meaning from Context

• Learn an unknown word from its usage
• Example (from Chapter 14, "Distributional and Distributed Semantics", of Eisenstein's textbook): suppose you do not know the word tezgüino. You are then in the same situation as a natural language processing system when it encounters a word that did not appear in its training data. Now suppose you see tezgüino used in the following contexts:
  (14.1) A bottle of ___ is on the table.
  (14.2) Everybody likes ___.
  (14.3) Don't have ___ before you drive.
  (14.4) We make ___ out of corn.
• Another way: look at words that share similar contexts!

Table 14.1: Distributional statistics for tezgüino and five related terms

              (14.1)  (14.2)  (14.3)  (14.4)  …
 tezgüino        1       1       1       1
 loud            0       0       0       0
 motor oil       1       0       0       1
 tortillas       0       1       0       1
 choices         0       1       0       0
 wine            1       1       1       0
Word Vectors

• Each row of Table 14.1 can be thought of as a word vector
• It describes the distributional properties of a word
  ‣ i.e. encodes information about its context words
• Capture all sorts of semantic relationships (synonymy, analogy, etc.)

[Figure 14.1: Lexical semantic relationships have regular linear structures in two-dimensional projections of distributional statistics (Pennington et al., 2014).]
Word Embeddings?

• We've seen word vectors before: word embeddings!
• Here we will learn other ways to produce word vectors
  ‣ Count-based methods
  ‣ More efficient neural methods designed just for learning word vectors
Outline

• Count-based methods
• Neural methods
• Evaluation
Count-Based Methods
Learning Count-Based Word Vectors

• Generally two flavours
  ‣ Use document as context
  ‣ Use neighbouring words as context
Document as Context: The Vector Space Model

• Core idea: represent word meaning as a vector
• Consider documents as context
• One matrix, two viewpoints
  ‣ Documents represented by their words
  ‣ Words represented by their documents

Example term-document count matrix (rows: documents, columns: words):

        …   state   fun   heaven   …
 425          0      1       0
 426          3      0       0
 427          0      0       0
  …
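To make the one-matrix-two-viewpoints idea concrete, here is a minimal sketch (the toy corpus and variable names are invented for illustration, not taken from the lecture) that builds a term-document count matrix in plain Python and numpy; each row is then a word vector and each column a document vector.

```python
import numpy as np
from collections import Counter

# Hypothetical toy corpus: one (pre-tokenised) document per list entry.
docs = [
    "the state election was fun".split(),
    "heaven and hell".split(),
    "the state of the state".split(),
]

vocab = sorted({w for doc in docs for w in doc})
word_index = {w: i for i, w in enumerate(vocab)}

# |V| x |D| term-document count matrix.
A = np.zeros((len(vocab), len(docs)), dtype=int)
for j, doc in enumerate(docs):
    for w, count in Counter(doc).items():
        A[word_index[w], j] = count

print(vocab)
print(A)   # row = word vector over documents; column = document vector over words
```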
Manipulating the VSM

• Weighting the values (beyond frequency)
• Creating low-dimensional dense vectors
Tf-idf

• Standard weighting scheme for information retrieval
• Discounts common words!

idf_w = log( |D| / df_w )

where |D| is the total number of documents and df_w is the number of documents that contain w.

tf matrix:

        …   the   country   hell   …
 425         43      5        1
 426         24      1        0
 427         37      0        3
  …
 df         500     14        7

tf-idf matrix:

        …   the   country   hell   …
 425          0    26.0      6.2
 426          0     5.2      0
 427          0     0       18.6
  …
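A small numpy sketch of the weighting above, using the slide's tf and df values for 'the', 'country' and 'hell'. The base-2 logarithm and |D| = 500 are assumptions chosen because they reproduce the slide's tf-idf values (small differences such as 25.8 vs. 26.0 are rounding).

```python
import numpy as np

# tf counts for documents 425-427 over the words 'the', 'country', 'hell' (from the slide).
tf = np.array([[43, 5, 1],
               [24, 1, 0],
               [37, 0, 3]], dtype=float)
df = np.array([500, 14, 7], dtype=float)   # document frequencies (from the slide)
D = 500                                    # assumed total number of documents

idf = np.log2(D / df)       # idf_w = log(|D| / df_w); 'the' appears everywhere, so idf = 0
tfidf = tf * idf            # broadcast idf across each document row

print(np.round(tfidf, 1))
# [[ 0.  25.8  6.2]
#  [ 0.   5.2  0. ]
#  [ 0.   0.  18.5]]
```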
Dimensionality Reduction

• Term-document matrices are very sparse
• Dimensionality reduction: create shorter, denser vectors
  ‣ More practical (fewer features)
  ‣ Remove noise (less overfitting)
Singular Value Decomposition

A = U Σ V^T

• A: the term-document matrix, |V| × |D|
• U: the new term matrix, |V| × m
• Σ: the m × m diagonal matrix of singular values (e.g. diag(9.1, 4.4, 2.3, …, 0.1))
• V^T: the new document matrix, m × |D|
• m = Rank(A)
Truncating – Latent Semantic Analysis

• Truncating U, Σ, and V^T to k dimensions produces the best possible rank-k approximation of the original matrix
• U_k (|V| × k) is a new low-dimensional representation of words: these are the word vectors!
• Typical values for k are 100-5000
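A minimal numpy sketch of truncated SVD for LSA. The matrix here is a random placeholder standing in for a weighted term-document (or PPMI) matrix; whether to use U_k directly, as the slide suggests, or to scale it by the singular values is a design choice, and both appear in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 200))    # placeholder |V| x |D| matrix (e.g. tf-idf or PPMI)

U, S, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(S) Vt

k = 100                        # typical values are 100-5000
U_k = U[:, :k]                 # new low-dimensional word representations
word_vectors = U_k * S[:k]     # optional: weight each dimension by its singular value

print(word_vectors.shape)      # (1000, 100): one k-dimensional vector per word
```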
Words as Context

• Lists how often words appear with other words
  ‣ In some predefined context (e.g. a window of N words)
• The obvious problem with raw frequency: dominated by common words
  ‣ But we can't use tf-idf!

Word-word co-occurrence counts:

           …   the   country   hell   …
 state        1973      10       1
 fun            54       2       0
 heaven         55       1       3
  …
Pointwise Mutual Information

• For two events x and y, PMI computes the discrepancy between:
  ‣ Their joint distribution = P(x, y)
  ‣ Their individual distributions (assuming independence) = P(x)P(y)

PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ]
Calculating PMI

           …    the    country   hell   …       Σ
 state         1973       10       1         12786
 fun             54        2       0           633
 heaven          55        1       3           627
  …
 Σ          1047519     3617     780      15871304

PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ]

P(x, y) = count(x, y) / Σ
P(x) = Σ_x / Σ   (row sum for x over the grand total)
P(y) = Σ_y / Σ   (column sum for y over the grand total)

Example: x = state, y = country

P(x, y) = 10 / 15871304 = 6.3 × 10^-7
P(x) = 12786 / 15871304 = 8.0 × 10^-4
P(y) = 3617 / 15871304 = 2.3 × 10^-4

PMI(x, y) = log2 [ 6.3 × 10^-7 / ((8.0 × 10^-4)(2.3 × 10^-4)) ] = 1.78
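The worked example can be checked with a few lines of Python; the counts are the marginals from the table above.

```python
import math

count_xy = 10          # count(state, country)
row_sum  = 12786       # Σ_x for 'state'
col_sum  = 3617        # Σ_y for 'country'
total    = 15871304    # grand total Σ

p_xy = count_xy / total
p_x  = row_sum / total
p_y  = col_sum / total

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # 1.78, as on the slide
```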
PMI(heaven, hell)?

Using the same co-occurrence table as above, with

PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ],  P(x, y) = count(x, y) / Σ,  P(x) = Σ_x / Σ,  P(y) = Σ_y / Σ

PollEv.com/jeyhanlau569
PMI(heaven, hell)?

P(x, y) = 3 / 15871304
P(x) = 627 / 15871304
P(y) = 780 / 15871304

PMI(heaven, hell) = log2 [ (3 / 15871304) / ((627 / 15871304) × (780 / 15871304)) ] = 6.61
PMI Matrix

• PMI does a better job of capturing semantics
  ‣ E.g. heaven and hell
• But very biased towards rare word pairs
• And doesn't handle zeros well

           …   the   country   hell   …
 state        1.22    1.78     0.63
 fun          0.37    3.79     -inf
 heaven       0.41    2.80     6.61
  …
PMI Tricks

• Zero all negative values (Positive PMI)
  ‣ Avoids -inf and unreliable negative values
• Counter bias towards rare events
  ‣ Normalised PMI: PMI(x, y) / (−log2 P(x, y))
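A minimal numpy sketch of a PPMI matrix using the zero-clamping trick above. The counts are just the three columns shown on the slide, so the marginals (and hence the exact values) differ from the full table; it only illustrates the computation.

```python
import numpy as np

# Word-word co-occurrence counts: rows = state, fun, heaven; columns = the, country, hell.
C = np.array([[1973, 10, 1],
              [  54,  2, 0],
              [  55,  1, 3]], dtype=float)

total = C.sum()
p_xy = C / total
p_x = C.sum(axis=1, keepdims=True) / total    # row marginals
p_y = C.sum(axis=0, keepdims=True) / total    # column marginals

with np.errstate(divide="ignore"):            # zero counts give log2(0) = -inf
    pmi = np.log2(p_xy / (p_x * p_y))

ppmi = np.maximum(pmi, 0)                     # Positive PMI: clamp negatives (and -inf) to 0
print(np.round(ppmi, 2))
```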
SVD (A = UΣV^T)

• Regardless of whether we use document or word as context, SVD can be applied to create dense vectors

tf-idf matrix (document as context):

        …   the   country   hell   …
 425          0    26.0      6.2
 426          0     5.2      0
 427          0     0       18.6
  …

PPMI matrix (word as context):

           …   the   country   hell   …
 state        1.22    1.78     0.63
 fun          0.37    3.79     0
 heaven       0.41    2.80     6.60
  …
Neural Methods
Word Embeddings

• We've seen word embeddings used in neural networks (lectures 7 and 8)
• But these models are designed for other tasks:
  ‣ Classification
  ‣ Language modelling
• Word embeddings are just a by-product
Neural Models for Embeddings

• Can we design neural networks whose goal is purely to learn word embeddings?
• Desiderata:
  ‣ Unsupervised
  ‣ Efficient
Word2Vec

• Core idea
  ‣ You shall know a word by the company it keeps
  ‣ Predict a word using context words
Word2Vec

• Framed as learning a classifier
  ‣ Skip-gram: predict surrounding words of the target word
    P(he | rests), P(life | rests), P(in | rests), P(peace | rests)
  ‣ CBOW: predict the target word using surrounding words
    P(rests | {he, in, life, peace})
• Use surrounding words within L positions, L = 2 above

… Bereft of life he rests in peace! If you hadn't nailed him …
Skip-gram Model

• Predicts each neighbouring word given the target word
  P(life | rests), P(he | rests), P(in | rests), P(peace | rests)
  … Bereft of life he rests in peace! If you hadn't nailed him …
• Total probability defined as
  ∏_{l ∈ {−L, …, −1, 1, …, L}} P(w_{t+l} | w_t)
• Using a logistic regression model (sketch below)
  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)
  where W_rests is the word embedding of rests, C_life is the word embedding of life, and · is a dot product
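A minimal numpy sketch of the probability above. The embedding matrices are random placeholders and the word ids are hypothetical; it just shows the two-matrix parameterisation and the softmax over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5000, 100                  # vocabulary size and embedding dimension (made up)
W = rng.normal(size=(V, d))       # target-word embeddings
C = rng.normal(size=(V, d))       # context-word embeddings

def skipgram_prob(target_id, context_id):
    """P(context | target) = exp(W_t . C_c) / sum_u exp(W_t . C_u)."""
    scores = C @ W[target_id]     # dot product of the target embedding with every context embedding
    scores -= scores.max()        # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

print(skipgram_prob(target_id=42, context_id=7))   # e.g. P(life | rests) with hypothetical ids
```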
Embedding Parameterisation

P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

• Two word embedding matrices (W and C)!
• Words are numbered, e.g. by sorting the vocabulary and using a word's position as its index
Skip-gram Model Structure

[Figure 19.17, JM3: The word matrix W (|Vw| × d) and context matrix C (|Vc| × d), with embeddings shown as row vectors, learned by the skip-gram model.]

[Figure 19.18, JM3: The skip-gram model (Mikolov et al. 2013). A 1-hot input vector x (1 × |V|) for the target word wt is projected through W (|V| × d) to the embedding of wt (1 × d); multiplying by C (d × |V|) gives output probabilities y over the context words wt−1 and wt+1.]
Training the Skip-gram Model

• Train to maximise likelihood of raw text
• Too slow in practice, due to normalisation over |V|
  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)
• Reduce the problem to binary classification
  ‣ (life, rests) → real context word
  ‣ (aardvark, rests) → non-context word
  ‣ How to draw non-context words (negative samples)? Randomly from V
Negative Sampling

• Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby (JM3).
• Consider a single piece of training data, from the sentence above:

  … lemon, a [tablespoon of apricot jam, a] pinch …
               c1        c2  t       c3   c4

• The target word t is apricot, and the 4 context words in the L = ±2 window give 4 positive training instances. For each positive (t, c) pair we also create k negative samples, each pairing t with a 'noise word': a random word from the lexicon (chosen according to its weighted unigram frequency), constrained not to be t itself. With k = 2:

  positive examples (t, c)       negative examples (t, c)
  apricot  tablespoon            apricot  aardvark     apricot  twelve
  apricot  of                    apricot  puddle       apricot  hello
  apricot  jam                   apricot  where        apricot  dear
  apricot  a                     apricot  coaxial      apricot  forever

• P(+ | t, c) = 1 / (1 + e^(−t·c))        → maximise similarity between the target word and real context words
• P(− | t, c) = 1 − 1 / (1 + e^(−t·c))    → minimise similarity between the target word and non-context words
Skip-gram Loss

• In practice, use k negative examples
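The loss itself is not reproduced in the extracted slide text, so the following is only a sketch of the standard skip-gram negative-sampling objective (maximise σ(t·c) for the real context word and σ(−t·c′) for each of the k noise words), with placeholder embeddings and hypothetical word ids.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(W, C, target_id, context_id, negative_ids):
    """Negative log-likelihood for one (target, context) pair plus its k negative samples."""
    t = W[target_id]
    pos = np.log(sigmoid(C[context_id] @ t))             # pull the real context word closer
    neg = np.log(sigmoid(-(C[negative_ids] @ t))).sum()  # push the k noise words away
    return -(pos + neg)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(5000, 100))    # target embeddings (placeholder)
C = rng.normal(scale=0.1, size=(5000, 100))    # context embeddings (placeholder)
neg_ids = rng.integers(0, 5000, size=2)        # k = 2 negative samples drawn from V
print(neg_sampling_loss(W, C, target_id=42, context_id=7, negative_ids=neg_ids))
```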
Desiderata

• Unsupervised
  ‣ Unlabelled corpus
• Efficient
  ‣ Negative sampling (avoids softmax over the full vocabulary)
  ‣ Scales to very, very large corpora
Problems with word vectors/embeddings (count and neural methods)?
• Difficult to quantify the quality of word vectors
• Don’t capture polysemous words
• Data sparsity (not enough data to learn them)
• Difficult to incorporate them in downstream tasks
PollEv.com/jeyhanlau569
Evaluation
Word Similarity

• Measure similarity of two words using cosine similarity (sketch below)
• Compare predicted similarity with human intuition
• Datasets
  ‣ WordSim-353: pairs of nouns with judged relatedness
  ‣ SimLex-999: also covers verbs and adjectives
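Cosine similarity is a one-liner; the vectors below are hypothetical stand-ins for learned embeddings, and in an evaluation the scores would be correlated with the human judgements in WordSim-353 or SimLex-999.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 4-dimensional embeddings, just to show the interface.
v_heaven = np.array([0.4, 2.8, 6.6, 0.1])
v_hell   = np.array([0.6, 2.1, 6.2, 0.3])
print(cosine(v_heaven, v_hell))
```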
Word Analogy

• Man is to King as Woman is to ???
• v(Man) − v(King) = v(Woman) − v(???)
• v(???) = v(Woman) − v(Man) + v(King)
• Find the word whose embedding is closest to v(Woman) − v(Man) + v(King)
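A sketch of analogy retrieval with the formula above. The embedding table here is a random placeholder; with real embeddings the returned word would ideally be 'queen'.

```python
import numpy as np

def analogy(emb, a, b, c):
    """Return the word whose vector is closest (by cosine) to v(b) - v(a) + v(c)."""
    target = emb[b] - emb[a] + emb[c]
    target /= np.linalg.norm(target)
    best_word, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                       # exclude the query words themselves
            continue
        sim = (vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["man", "woman", "king", "queen", "apple"]}
print(analogy(emb, "man", "woman", "king"))   # closest to v(woman) - v(man) + v(king)
```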
Embedding Space

• Word2Vec embeddings show interesting geometry
• This explains why they are good at the word analogy task
Downstream Tasks

• Best evaluation is in other downstream tasks
  ‣ Use bag-of-word embeddings as a feature representation in a classifier
  ‣ The first layer of most deep learning models is to embed input text
  ‣ Initialise it with pretrained word vectors!
General Findings

• neural > count
• Contextual word representations are shown to work even better
  ‣ Dynamic word vectors that change depending on context!
  ‣ ELMo and BERT (next lecture!)
Pointers to Software

• Word2Vec
  ‣ C implementation of Skip-gram and CBOW
  ‣ https://code.google.com/archive/p/word2vec/
• GloVe
  ‣ http://nlp.stanford.edu/projects/glove/
• Gensim
  ‣ Python library with many methods including LSI, topic models and Skip-gram/CBOW
  ‣ https://radimrehurek.com/gensim/index.html
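A minimal Gensim sketch of training Skip-gram embeddings (Gensim 4.x API, where the dimensionality argument is vector_size; the toy corpus is made up and far too small to learn anything meaningful):

```python
from gensim.models import Word2Vec

# Toy tokenised corpus; in practice use a large unlabelled corpus.
sentences = [
    ["bereft", "of", "life", "he", "rests", "in", "peace"],
    ["we", "make", "tezguino", "out", "of", "corn"],
]

# sg=1 selects Skip-gram (sg=0 is CBOW); window=2 is L; negative=5 negative samples per positive.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, negative=5)

print(model.wv["rests"][:5])                   # the learned word vector for 'rests'
print(model.wv.most_similar("rests", topn=3))  # nearest neighbours by cosine similarity
```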
Further Reading

• JM3, Ch 6