COMP90042 Natural Language Processing
Lecture 10: Distributional Semantics
Semester 1 2021, Week 5
Jey Han Lau
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE
Lexical Databases – Problems
• Manually constructed
‣ Expensive
‣ Human annotation can be biased and noisy
• Language is dynamic
‣ New words: slang, terminology, etc.
‣ New senses
• The Internet provides us with massive amounts of
text. Can we use that to obtain word meanings?
Distributional Hypothesis
• You shall know a word by the company it keeps
— Firth, 1957
• Document co-occurrence indicative of topic
(document as context)
‣ E.g. voting and politics
• A word's local context reflects its meaning
(word window as context)
‣ E.g. eat a pizza vs. eat a burger
Guessing Meaning from Context
• Learn an unknown word from its usage
  ‣ e.g. tezgüino
• Another way: look at words that share similar contexts!
Excerpt: Jacob Eisenstein, Chapter 14, "Distributional and distributed semantics" (draft of October 15, 2018)

A recurring theme in natural language processing is the complexity of the mapping from words to meaning. In chapter 4, we saw that a single word form, like bank, can have multiple meanings; conversely, a single meaning may be created by multiple surface forms, a lexical semantic relationship known as synonymy. Despite this complex mapping between words and meaning, natural language processing systems usually rely on words as the basic unit of analysis. This is especially true in semantics: the logical and frame semantic methods from the previous two chapters rely on hand-crafted lexicons that map from words to semantic predicates. But how can we analyze texts that contain words that we haven't seen before? This chapter describes methods that learn representations of word meaning by analyzing unlabeled data, vastly improving the generalizability of natural language processing systems. The theory that makes it possible to acquire meaningful representations from unlabeled data is the distributional hypothesis.

14.1 The distributional hypothesis

Here's a word you may not know: tezgüino (the example is from Lin, 1998). If you do not know the meaning of tezgüino, then you are in the same situation as a natural language processing system when it encounters a word that did not appear in its training data. Now suppose you see that tezgüino is used in the following contexts:

(14.1) A bottle of ___ is on the table.
(14.2) Everybody likes ___.
(14.3) Don't have ___ before you drive.
(14.4) We make ___ out of corn.

              (14.1)  (14.2)  (14.3)  (14.4)  …
  tezgüino       1       1       1       1
  loud           0       0       0       0
  motor oil      1       0       0       1
  tortillas      0       1       0       1
  choices        0       1       0       0
  wine           1       1       1       0

Table 14.1: Distributional statistics for tezgüino and five related terms

[Figure 14.1: Lexical semantic relationships have regular linear structures in two-dimensional projections of distributional statistics (Pennington et al., 2014).]

What other words fit into these contexts? How about: loud, motor oil, tortillas, choices, wine? Each row of Table 14.1 is a vector that summarizes the contextual properties for each word, with a value of one for contexts in which the word can appear, and a value of zero for contexts in which it cannot. Based on these vectors, we can conclude: wine is very similar to tezgüino; motor oil and tortillas are fairly similar to tezgüino; loud is completely different.

These vectors, which we will call word representations, describe the distributional properties of each word. Does vector similarity imply semantic similarity? This is the distributional hypothesis, stated by Firth (1957) as: "You shall know a word by the company it keeps." The distributional hypothesis has stood the test of time: distributional statistics are a core part of language technology today, because they make it possible to leverage large amounts of unlabeled data to learn about rare words that do not appear in labeled training data.

Distributional statistics have a striking ability to capture lexical semantic relationships …
Word Vectors
• Each row can be thought of as a word vector
• It describes the distributional properties of a word
  ‣ i.e. it encodes information about the word's context words
• Word vectors capture all sorts of semantic relationships (synonymy, analogy, etc.)
              (14.1)  (14.2)  (14.3)  (14.4)  …
  tezgüino       1       1       1       1
  loud           0       0       0       0
  motor oil      1       0       0       1
  tortillas      0       1       0       1
  choices        0       1       0       0
  wine           1       1       1       0

Table 14.1 (Eisenstein, repeated from above): Distributional statistics for tezgüino and five related terms
Word Embeddings?
• We've seen word vectors before: word embeddings!
• Here we will learn other ways to produce word vectors
  ‣ Count-based methods
  ‣ More efficient neural methods designed just for learning word vectors
Outline
• Count-based methods
• Neural methods
• Evaluation
Count-Based Methods
Learning Count-Based Word Vectors
• Generally two flavours
‣ Use document as context
‣ Use neighbouring words as context
Document as Context: The Vector Space Model
• Core idea: represent word meaning as a vector
• Consider documents as context
• One matrix, two viewpoints (see the sketch below)
  ‣ Documents represented by their words
  ‣ Words represented by their documents

          …   state   fun   heaven   …
  …
  425           0      1       0
  426           3      0       0
  427           0      0       0
  …
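A minimal sketch of the two viewpoints on the same matrix, using the toy counts from the table above (document and word labels are illustrative only):

```python
import numpy as np

# Term-document count matrix from the slide: rows = documents 425-427,
# columns = the words "state", "fun", "heaven".
A = np.array([[0, 1, 0],
              [3, 0, 0],
              [0, 0, 0]])

doc_425 = A[0]       # viewpoint 1: a document represented by its word counts
state   = A[:, 0]    # viewpoint 2: a word represented by the documents it occurs in
print(doc_425, state)
```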
Manipulating the VSM
• Weighting the values (beyond frequency)
• Creating low-dimensional dense vectors
Tf-idf
• Standard weighting scheme for information retrieval
• Discounts common words!

  tf matrix (with document frequencies):
          …   the   country   hell   …
  …
  425          43      5        1
  426          24      1        0
  427          37      0        3
  …
  df          500     14        7

  tf-idf matrix:
          …   the   country   hell   …
  …
  425           0    26.0      6.2
  426           0     5.2      0
  427           0     0       18.6
  …

  idf_w = log( |D| / df_w )
  where |D| = total number of documents, and df_w = number of documents that contain w
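A minimal sketch of the weighting, assuming log base 2 and taking |D| = 500 as the total document count (consistent with df(the) = 500 above); it approximately reproduces the slide's tf-idf matrix, with small differences due to rounding:

```python
import numpy as np

# tf counts from the slide: rows = documents 425-427, columns = the, country, hell
tf = np.array([[43., 5., 1.],
               [24., 1., 0.],
               [37., 0., 3.]])
df = np.array([500., 14., 7.])   # document frequency of each word
D = 500                          # assumed total number of documents

idf = np.log2(D / df)            # idf_w = log(|D| / df_w); common words get idf ~ 0
tfidf = tf * idf                 # weight every document's counts by idf
print(tfidf.round(1))
```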
Dimensionality Reduction
• Term-document matrices are very sparse
• Dimensionality reduction: create shorter, denser vectors
• More practical (fewer features)
• Remove noise (less overfitting)
Singular Value Decomposition

  A = UΣVᵀ

  A:  the term-document matrix (|V| × |D|), sparse
  U:  the new term matrix (|V| × m)
  Σ:  the diagonal matrix of singular values (m × m)
  Vᵀ: the new document matrix (m × |D|)

  where m = Rank(A)
Truncating – Latent Semantic Analysis
• Truncating U, Σ, and Vᵀ to k dimensions produces the best possible rank-k approximation of the original matrix
• Uk is a new low-dimensional representation of words
• Typical values for k are 100–5000

  U (|V| × m)  →  keep only the first k columns  →  Uk (|V| × k)

  Each row of Uk is a dense k-dimensional word vector (see the sketch below).
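A minimal numpy sketch of the SVD plus truncation; the matrix here is a random stand-in for the weighted term-document (or word-context) matrix, and the dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((1000, 500))      # stand-in for a |V| x |D| weighted term-document matrix

# Thin SVD: A = U @ np.diag(s) @ Vt, singular values s sorted in descending order
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 100                          # truncate to k dimensions
Uk = U[:, :k]                    # each row of Uk is a dense k-dimensional word vector
# (another common choice is to scale by the singular values: U[:, :k] * s[:k])
print(Uk.shape)                  # (1000, 100)
```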
Words as Context
• Lists how often words appear with other words
  ‣ In some predefined context (e.g. a window of N words)
• The obvious problem with raw frequency: dominated by common words
  ‣ But we can't use tf-idf!

          …   the    country   hell   …
  …
  state       1973      10       1
  fun           54       2       0
  heaven        55       1       3
  …
Pointwise Mutual Information
• For two events x and y, PMI computes the discrepancy between:
  ‣ Their joint distribution: P(x, y)
  ‣ Their individual distributions (assuming independence): P(x)P(y)

  PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ]
Calculating PMI

          …   the     country   hell   …        Σ
  …
  state       1973       10       1          12786
  fun           54        2       0            633
  heaven        55        1       3            627
  …
  Σ         1047519     3617     780       15871304

  PMI(x, y) = log2 [ P(x, y) / (P(x)P(y)) ]
  P(x, y) = count(x, y) / Σ      P(x) = Σx / Σ      P(y) = Σy / Σ

  Worked example: x = state, y = country (see the sketch below)
  P(x, y) = 10 / 15871304 = 6.3 × 10⁻⁷
  P(x) = 12786 / 15871304 = 8.0 × 10⁻⁴
  P(y) = 3617 / 15871304 = 2.3 × 10⁻⁴
  PMI(x, y) = log2 [ 6.3 × 10⁻⁷ / ((8.0 × 10⁻⁴)(2.3 × 10⁻⁴)) ] = 1.78
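The same worked example as a minimal Python sketch (counts taken from the table above):

```python
import math

count_xy = 10          # count(state, country)
sum_x    = 12786       # row total for "state"
sum_y    = 3617        # column total for "country"
total    = 15871304    # grand total of all co-occurrence counts

p_xy = count_xy / total
p_x  = sum_x / total
p_y  = sum_y / total

pmi = math.log2(p_xy / (p_x * p_y))
print(round(pmi, 2))   # ~1.78, as on the slide
```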
PMI(heaven, hell)?
(Use the co-occurrence table and PMI formulas from the previous slide.)

  PollEv.com/jeyhanlau569
PMI(heaven, hell)?

  From the table: count(heaven, hell) = 3, Σheaven = 627, Σhell = 780, Σ = 15871304

  PMI(heaven, hell) = log2 [ (3 / 15871304) / ((627 / 15871304) × (780 / 15871304)) ] = 6.61
PMI Matrix
• PMI does a better job of capturing semantics
  ‣ E.g. heaven and hell
• But very biased towards rare word pairs
• And doesn't handle zeros well

          …   the    country   hell   …
  …
  state       1.22     1.78     0.63
  fun         0.37     3.79     -inf
  heaven      0.41     2.80     6.61
  …
PMI Tricks
• Zero all negative values (Positive PMI)
  ‣ Avoids –inf and unreliable negative values
• Counter bias towards rare events
  ‣ Normalised PMI: PMI(x, y) / ( −log2 P(x, y) ), see the sketch below
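A minimal sketch of both tricks; the heaven/hell numbers below are taken from the earlier slides, and the normalisation PMI(x, y) / −log2 P(x, y) is the standard form of NPMI, stated here as an assumption about what the slide intends:

```python
import numpy as np

def ppmi(pmi):
    """Positive PMI: clamp negative (and -inf) PMI values to zero."""
    return np.maximum(pmi, 0.0)

def npmi(pmi, p_xy):
    """Normalised PMI: PMI(x, y) / -log2 P(x, y), bounded in [-1, 1]."""
    return pmi / -np.log2(p_xy)

pmi_heaven_hell = 6.61
p_heaven_hell = 3 / 15871304
print(ppmi(np.array([6.61, -np.inf])))                   # [6.61, 0.]
print(round(npmi(pmi_heaven_hell, p_heaven_hell), 2))    # ~0.3
```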
SVD (A = UΣVᵀ)
• Regardless of whether we use document or word as context, SVD can be applied to create dense vectors

  tf-idf matrix:
          …   the   country   hell   …
  …
  425           0    26.0      6.2
  426           0     5.2      0
  427           0     0       18.6
  …

  PPMI matrix:
          …   the    country   hell   …
  …
  state       1.22     1.78     0.63
  fun         0.37     3.79     0
  heaven      0.41     2.80     6.61
  …
Neural Methods
Word Embeddings
• We've seen word embeddings used in neural networks (lectures 7 and 8)
• But these models are designed for other tasks:
  ‣ Classification
  ‣ Language modelling
• Word embeddings are just a by-product
Neural Models for Embeddings
• Can we design neural networks whose goal is to
purely learn word embeddings?
• Desiderata:
‣ Unsupervised
‣ Efficient
Word2Vec
• Core idea
‣ You shall know a word by the company it keeps
‣ Predict a word using context words
Word2Vec
• Framed as learning a classifier
  ‣ Skip-gram: predict surrounding words of target word
  ‣ CBOW: predict target word using surrounding words
• Use surrounding words within L positions; L = 2 in the example below

  … Bereft of life he rests in peace! If you hadn't nailed him …

  Skip-gram: P(life | rests), P(he | rests), P(in | rests), P(peace | rests)
  CBOW:      P(rests | {life, he, in, peace})
Skip-gram Model
• Predicts each neighbouring word given the target word

  … Bereft of life he rests in peace! If you hadn't nailed him …
  P(life | rests), P(he | rests), P(in | rests), P(peace | rests)

• Total probability defined as

  ∏_{l ∈ {−L, …, −1, 1, …, L}} P(w_{t+l} | w_t)

• Using a logistic regression model

  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

  where W_rests is the word embedding of rests, C_life is the word embedding of life, and W_rests · C_life is their dot product (see the sketch below)
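A minimal numpy sketch of this softmax over dot products; the embeddings are random and the word indices are made up, purely to illustrate the computation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10_000, 300                       # vocabulary size, embedding dimension
W = rng.normal(scale=0.1, size=(V, d))   # target-word embedding matrix
C = rng.normal(scale=0.1, size=(V, d))   # context-word embedding matrix

def skipgram_prob(target_id, context_id):
    """P(context | target) = exp(W[target] . C[context]) / sum_u exp(W[target] . C[u])."""
    scores = C @ W[target_id]            # dot product with every context embedding
    scores -= scores.max()               # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

print(skipgram_prob(42, 7))
```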
Embedding parameterisation
• Two word embedding matrices (W and C)!

  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

• Words are numbered, e.g., by sorting the vocabulary and using each word's position as its index
Skip-gram Model (Fig 19.18, JM3)

[Figure 19.17, JM3: the word matrix W (|Vw| × d) and the context matrix C (|Vc| × d), with embeddings shown as row vectors; row i of W is the target word embedding for word i, and row j of C is the context word embedding for word j.]

[Figure 19.18, JM3: the skip-gram model (Mikolov et al. 2013, Mikolov et al. 2013a). A one-hot input vector (1 × |V|) for the target word wt is multiplied by W (|V| × d) to give the projection layer (1 × d, the embedding of wt), which is then multiplied by C (d × |V|) to give probabilities of the context words wt−1 and wt+1.]

From the accompanying JM3 text: the goal is to compute P(wk | wj). We begin with an input vector x, which is a one-hot vector for the current word wj: a vector with one element equal to 1 and all other elements set to zero, i.e. xj = 1 and xi = 0 for all i ≠ j (Figure 19.19). We then predict the probability of each of the 2C output words (in Fig. 19.18, the two output words wt−1 and wt+1) in 3 steps:

1. Select the embedding from W: x is multiplied by W, the input matrix, to give the hidden or projection layer. Since W stores one embedding per vocabulary word, and the input is a one-hot vector for wj, the projection layer for input x will be h = vj, the input embedding for wj.
Training the skip-gram model
• Train to maximise the likelihood of raw text
• Too slow in practice, due to the normalisation over |V|:

  P(life | rests) = exp(W_rests · C_life) / Σ_{u ∈ V} exp(W_rests · C_u)

• Reduce the problem to binary classification
  ‣ (life, rests) → real context word
  ‣ (aardvark, rests) → non-context word
  ‣ How to draw non-context words (negative samples)?
  ‣ Randomly from V
Negative Sampling (JM3, §6.8.2: Learning skip-gram embeddings)

Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.

Consider a single piece of the training data:

  … lemon, a [tablespoon of apricot jam, a] pinch …
              c1         c2 t       c3   c4

This example has a target word t (apricot) and 4 context words in the L = ±2 window, resulting in 4 positive training instances (left below). For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter k. So for each (t, c) training instance we create k negative samples, each consisting of the target t plus a "noise word": a random word from the lexicon, constrained not to be the target word t. The right below shows the setting where k = 2, so there are 2 negative examples for each positive example (t, c).

  positive examples (t, c)      negative examples (t, c)
  apricot   tablespoon          apricot   aardvark      apricot   twelve
  apricot   of                  apricot   puddle        apricot   hello
  apricot   preserves           apricot   where         apricot   dear
  apricot   or                  apricot   coaxial       apricot   forever

The noise words are chosen according to their weighted unigram frequency P_α(w), where α is a weight. If we were sampling according to unweighted frequency p(w), we would choose "the" as a noise word with unigram probability p("the"), "aardvark" with probability p("aardvark"), and so on. But in practice it is common to set α = 0.75, i.e. use the weighting p^(3/4)(w):

  P_α(w) = count(w)^α / Σ_{w′} count(w′)^α                              (6.31)

Setting α = 0.75 gives better performance because it gives rare noise words slightly higher probability: for rare words, P_α(w) > P(w). To visualise this intuition, work out the probabilities for an example with two events, P(a) = .99 and P(b) = .01:

  P_α(a) = .99^.75 / (.99^.75 + .01^.75) = .97
  P_α(b) = .01^.75 / (.99^.75 + .01^.75) = .03                          (6.32)

Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings so as to:

• maximise the similarity between the target word and its real context words:
    P(+ | t, c) = 1 / (1 + e^(−t·c)), i.e. the sigmoid σ(x) = 1 / (1 + e^(−x)) applied to t·c
• minimise the similarity between the target word and non-context words:
    P(− | t, c) = 1 − 1 / (1 + e^(−t·c))
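A minimal sketch of the noise distribution and the binary (sigmoid) scoring described above; the two-event counts are chosen to reproduce the .97 / .03 example, and the embeddings are random stand-ins:

```python
import numpy as np

def noise_distribution(counts, alpha=0.75):
    """P_alpha(w) proportional to count(w)**alpha: slightly flattens the unigram
    distribution, giving rare words more chance of being drawn as noise words."""
    weighted = counts ** alpha
    return weighted / weighted.sum()

counts = np.array([990_000.0, 10_000.0])         # toy lexicon with P(a)=.99, P(b)=.01
print(noise_distribution(counts).round(2))       # ~[0.97, 0.03]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
t, c = rng.normal(size=50), rng.normal(size=50)  # stand-ins for embeddings t and c
print(sigmoid(t @ c))                            # P(+ | t, c): pair looks like a real one
```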
Skip-gram Loss
• In practice, use k negative examples (one common form of the objective is sketched below)
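A commonly used form of the skip-gram negative-sampling objective for one positive pair (t, c) with k sampled noise words n_1, …, n_k, consistent with the P(+|t,c) and P(−|t,c) definitions above (stated here as the standard formulation, not necessarily the exact equation on the slide):

```latex
\mathcal{L}(t, c) = -\Big[\, \log \sigma(t \cdot c) + \sum_{i=1}^{k} \log \sigma(-t \cdot n_i) \,\Big],
\qquad \sigma(x) = \frac{1}{1 + e^{-x}}
```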
Desiderata
• Unsupervised
  ‣ Unlabelled corpus
• Efficient
  ‣ Negative sampling (avoids the softmax over the full vocabulary)
  ‣ Scales to very, very large corpora
Problems with word vectors/embeddings (count and neural methods)?

  PollEv.com/jeyhanlau569

• Difficult to quantify the quality of word vectors
• Don't capture polysemous words
• Data sparsity (not enough data to learn them)
• Difficult to incorporate them into downstream tasks
Evaluation
Word Similarity
• Measure similarity of two words using cosine
similarity
• Compare predicted similarity with human intuition
• Datasets
‣ WordSim-353 are pairs of nouns with judged
relatedness
‣ SimLex-999 also covers verbs and adjectives
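A minimal sketch of the cosine similarity used to score a word pair (the vectors here are made up, not learned embeddings):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

u = np.array([0.2, 0.9, 0.1])    # toy vector for one word
v = np.array([0.25, 0.8, 0.0])   # toy vector for another
print(round(cosine(u, v), 3))    # compare such scores against human ratings (e.g. WordSim-353)
```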
Word Analogy
• Man is to King as Woman is to ???
• v(Man) – v(King) = v(Woman) – v(???)
• v(???) = v(Woman) – v(Man) + v(King)
• Find word whose embedding is closest to
v(Woman) – v(Man) + v(King)
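A minimal sketch of the analogy lookup, using tiny hand-made vectors (not learned embeddings) chosen so the arithmetic works out:

```python
import numpy as np

words = ["man", "woman", "king", "queen"]
emb = np.array([[1., 0., 0.],
                [1., 1., 0.],
                [1., 0., 1.],
                [1., 1., 1.]])
word2id = {w: i for i, w in enumerate(words)}

def analogy(a, b, c):
    """Word whose embedding is closest (by cosine) to v(b) - v(a) + v(c), excluding a, b, c."""
    target = emb[word2id[b]] - emb[word2id[a]] + emb[word2id[c]]
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ (target / np.linalg.norm(target))
    candidates = [i for i in range(len(words)) if words[i] not in {a, b, c}]
    return words[max(candidates, key=lambda i: sims[i])]

print(analogy("man", "woman", "king"))   # -> "queen"
```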
Embedding Space
• Word2Vec embeddings show interesting geometry
• This explains why they are good at the word analogy task
Downstream Tasks
• Best evaluation is in other downstream tasks
  ‣ Use bag-of-word embeddings as a feature representation in a classifier
  ‣ The first layer of most deep learning models is to embed input text
  ‣ Initialise them with pretrained word vectors!
General Findings
• neural > count
• Contextual word representations are shown to work even better
• Dynamic word vectors that change depending on context!
• ELMo & BERT (next lecture!)
Pointers to Software
• Word2Vec
  ‣ C implementation of Skip-gram and CBOW
    https://code.google.com/archive/p/word2vec/
• GenSim
  ‣ Python library with many methods including LSI, topic models and Skip-gram/CBOW
    https://radimrehurek.com/gensim/index.html
• GloVe
  ‣ http://nlp.stanford.edu/projects/glove/
Further Reading
‣ JM3, Ch 6