Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6
CMPSC 442
Week 14, Meeting 40, Three Segments
Outline
● High-Dimensional Distributional Semantics
● Dimension Reduction: Singular Value Decomposition (SVD)
● Using Word Embeddings
Outline, Wk 14, Mtg 40
2
Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6
CMPSC 442
Week 14, Meeting 40, Segment 1: High-Dimensional Distributional Semantics
Firthian Linguistics
Firth (UK, 1890-1960) is noted for drawing attention to the context-dependent
nature of meaning with his notion of ‘context of situation’, and his work on
collocational meaning is widely acknowledged in the field of distributional
semantics. In particular, he is known for the famous quotation:
You shall know a word by the company it keeps (Firth, J. R. 1957:11)
Distributional Semantics
4
Substitutability & Contexts
● Words that occur in similar contexts have similar meanings
● Many important publications in the early-to-mid 1950s discuss the distributional
hypothesis: Wittgenstein, 1953; Harris, 1954; Weaver, 1955; Firth, 1957
● Substitution sets:
○ eat, consume, ingest, feed on
○ cows, cattle, livestock, heifers
○ hay, grain, corn, cereal crops
4 x 4 x 4 = 64 similar sentences
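A minimal Python sketch of enumerating the substitutions (the "subject verb object" frame, e.g. "cows eat hay", is an assumption; the slide gives only the three word sets):

from itertools import product

subjects = ["cows", "cattle", "livestock", "heifers"]
verbs = ["eat", "consume", "ingest", "feed on"]
objects = ["hay", "grain", "corn", "cereal crops"]

# Every choice from the three sets yields a plausible sentence,
# so the words within each set share contexts.
sentences = [f"{s} {v} {o}" for s, v, o in product(subjects, verbs, objects)]
print(len(sentences))   # 4 x 4 x 4 = 64
print(sentences[0])     # e.g., "cows eat hay"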
Distributional Semantics
5
Vectors for Vector Space (Semantic) Matrices
● VSM: each cell of the matrix is a (weighted) word frequency
○ How often the word occurs in a given context
○ Words with similar meanings should have similar context counts
● Varieties of high-dimensional VSM: |V| x |C|, where C(ontexts) can be defined in various ways
○ Term-document (typical of IR)
○ Word-context (many applications)
○ Pair-pattern (incorporates higher order information about contexts)
Distributional Semantics
6
Notation
● Bold upper case, e.g., A: matrix
● Bold lower case, e.g., b: vector
● Lower case, e.g., c: scalar
Distributional Semantics
7
Term-Document Matrices (Salton, 1971)
● Row vectors: terms, e.g., words, . . .
● Column vectors: documents, e.g., web pages
● Bags, or multisets: unordered collections that allow repeated members
○ Represented as vectors with counts incremented for repetition
○ E.g.: {a, a, b, c, c, c} represented as x = <2, 1, 3>
● Term-document matrix of size n (terms) x m (documents)
○ Columns x_{:j} correspond to bags (of words)
○ Rows x_{i:} correspond to terms
○ Cells x_{ij} = frequency of term x_{i:} in document x_{:j}
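A small Python illustration of the bag/multiset idea and the resulting counts (the two toy documents and their contents are invented for this example):

from collections import Counter

docs = {                          # hypothetical toy documents
    "d1": "a a b c c c".split(),
    "d2": "b b c".split(),
}
terms = sorted({t for words in docs.values() for t in words})   # rows: terms a, b, c
bags = {name: Counter(words) for name, words in docs.items()}   # bags (multisets) with counts

print([bags["d1"][t] for t in terms])   # column for d1 -> [2, 1, 3], i.e., the bag <2, 1, 3>
print(bags["d2"]["c"])                  # cell: frequency of term "c" in document d2 = 1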
Distributional Semantics
8
Document Vectors from Word-Document Matrices
Jurafsky & Martin, Ch. 6, Figs. 6.3 and 6.4
Distributional Semantics
9
Bag-of-words (BOW) for Information Retrieval
● Query represented with the same bag-of-words as the document
● Bag-of-words vectors represent what the document or query is about
● Search engines designed with this VSM model work well
● Term-document matrices are sparse
○ Tens of thousands of words x tens of thousands (or more) documents
○ Most cells are empty or have very low counts
Distributional Semantics
10
Word Vectors from Word-Context Matrices
Jurafsky & Martin, Ch. 6, Figs. 6.5 and 6.6
Distributional Semantics
11
Four Steps for Creation of Word-Context Matrices
1. Tokenize the input into words (or “terms”) and optionally apply other
preprocessing steps
a. Named Entity Recognition and normalization
b. Normalize variant word forms (e.g., stemming, lemmatization)
c. Tag numeric tokens (e.g., dates, measures, clock times, . . . )
2. Count word (term) frequencies
3. Convert raw frequencies to weights using various weighting schemes
4. Smooth the vector space: dimensionality reduction
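A sketch of how steps 1-4 might be strung together in Python with scikit-learn (the library choice and toy corpus are assumptions for illustration, not part of the slides; NER, lemmatization, and numeric tagging would need additional tooling):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD

corpus = ["cows eat hay", "cattle consume grain", "heifers feed on corn"]   # toy corpus

counts = CountVectorizer().fit_transform(corpus)         # steps 1-2: tokenize and count
weighted = TfidfTransformer().fit_transform(counts)      # step 3: reweight (here tf-idf)
smoothed = TruncatedSVD(n_components=2).fit_transform(weighted)   # step 4: dimensionality reduction
print(smoothed.shape)                                    # (3, 2)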
Distributional Semantics
12
Weighting Schemes, etc.
● tf-idf (Spärck Jones, 1972):
○ Term frequency: frequency of word w in document d
○ Inverse document frequency: log of the ratio of the total number of documents to the number of documents word w occurs in
● Document length normalization
○ Equalizes long and short documents
● Pointwise Mutual Information (Church & Hanks, 1989; 1990)
● Feature selection methods
○ Chi-square (χ²)
○ Information gain
○ Odds ratio
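For reference, a standard form of the tf-idf weight (notation: tf_{t,d} = frequency of term t in document d, N = number of documents, df_t = number of documents containing t; exact variants, e.g. log-scaled term frequency, differ across texts):

w_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t, \qquad \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}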
Distributional Semantics
13
Mutual Information, PMI, PPMI
● Mutual information of two random variables
● Pointwise mutual information (PMI) of two events
● Positive PMI (negative PMI unreliable)
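The standard definitions behind these bullets (following Jurafsky & Martin, Ch. 6, with base-2 logs), for random variables X, Y and for a target word w with context word c:

I(X;Y) = \sum_{x}\sum_{y} P(x,y)\,\log_2 \frac{P(x,y)}{P(x)\,P(y)}

\mathrm{PMI}(w,c) = \log_2 \frac{P(w,c)}{P(w)\,P(c)}, \qquad \mathrm{PPMI}(w,c) = \max\big(\mathrm{PMI}(w,c),\,0\big)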
Distributional Semantics
14
Example Matrices with PPMI: Smooths the Counts
Jurafsky & Martin, Ch. 6, Figs. 6.9-6.11: 4 words from Wikipedia corpus
Distributional Semantics
15
Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6
CMPSC 442
Week 14, Meeting 40, Segment 2: Dimension Reduction: Singular Value Decomposition (SVD)
Word-Context Matrices
● Similarity of row x_{i:} and row x_{i+n:} captures how similar the contexts of words w_i and w_{i+n} are, hence the similarity of their meaning
● Latent semantic analysis (LSA) (Deerwester et al. 1990)
○ Inspired by term-document matrices
○ Focus was on meaning similarity of words or contexts
● LSA is one method to capture the intuition of distributional semantics
○ Applies singular value decomposition (SVD) to matrices
○ Factors the original matrix into linearly independent components
○ Reduces dimensionality
○ Reduces sparsity
○ Reduces noise
Dimension Reduction
17
SVD
● Factor a matrix X into the product of three matrices: X = WΣCᵀ
○ An orthogonal matrix W
○ A diagonal matrix Σ (zeroes except on the diagonal) of singular values
○ The transpose of an orthogonal matrix C
● A matrix U is orthogonal if UUᵀ = UᵀU = I
Dimension Reduction
18
Factoring the Word-Context Matrix: X = WΣCᵀ
● W (orthogonal) is of size |V| × m
○ Rows represent words
○ Columns are orthogonal to each other, ordered by the amount of variance each accounts for
○ m is the rank of X (number of linearly independent rows, or nonzero singular values)
● Σ is an m × m diagonal matrix of singular values
● Cᵀ represents contexts
○ Each row is one of the new latent dimensions
○ The m row vectors are orthogonal
Dimension Reduction
19
Factorization of X into WΣCᵀ
Dimension Reduction
20
Reduced (thin) SVD: Top k of m dimensions
Dimension Reduction
21
SVD Embeddings
● Truncate SVD to size |V| × k
● Each row of the truncated W is a word embedding
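A minimal numpy sketch of producing embeddings this way (the word-context counts below are a made-up toy matrix):

import numpy as np

# Toy |V| x |C| word-context count matrix: 5 words x 4 contexts, invented for illustration
X = np.array([[2., 0., 1., 0.],
              [1., 0., 2., 0.],
              [0., 3., 0., 1.],
              [0., 2., 0., 2.],
              [1., 1., 1., 1.]])

W, sigma, Ct = np.linalg.svd(X, full_matrices=False)   # X = W @ np.diag(sigma) @ Ct
k = 2
embeddings = W[:, :k]      # truncate: each row is a k-dimensional word embedding
print(embeddings.shape)    # (5, 2)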
Dimension Reduction
22
Result of Truncated SVD on Word-Context Matrices
● Captures latent meaning: selecting k < rank r forces greater correspondence
between words and contexts (rank ≡ linearly independent columns)
● Reduces noise: the singular values in the diagonal matrix are ranked in
descending order of the amount of variation in X they account for, so taking the
top k retains the signal, while the remainder is treated as noise
● Captures higher-order, or indirect, co-occurrence where two words appear in
similar contexts
● Sparsity reduction from a very sparse matrix to a very dense one; simulates
having enough data to eliminate sparsity
Dimension Reduction
23
Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6
CMPSC 442
Week 14, Meeting 40, Segment 3: Using Word Embeddings (GloVe)
Measuring Similarity of Vectors
● Most widely used method to compare the meanings of two semantic vectors: cosine similarity (the cosine of the angle between them)
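The standard definition, for vectors a and b:

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert\,\lVert \mathbf{b} \rVert} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\;\sqrt{\sum_i b_i^2}}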
Using Word Embeddings
25
Measuring Similarity of Vectors
● Most widely used method to compare two vectors: cosine similarity
● Class exercise
a: 0001011
b: 0001011
c: 0210100
d: 0160400
○ cos(a, b) = ?
○ cos(a, c) = ?
○ cos(c, d) = ?
Using Word Embeddings
26
Measuring Similarity of Vectors
● Most widely used method to compare two vectors: cosine similarity
● Class exercise
a: 0001011
b: 0001011
c: 0210100
d: 0160400
○ cos(a, b) = 3/(sqrt(3) x sqrt(3)) = 3/3 = 1
○ cos(a, c) = 0/(sqrt(3) x sqrt(6)) = 0
○ cos(c, d) = 12/(sqrt(6) x sqrt(53)) ≈ 12/(2.45 x 7.28) ≈ 12/17.8 ≈ 0.67
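A quick Python check of these answers (the vectors are transcribed from the digit strings above):

import math

def cos(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

a = [0, 0, 0, 1, 0, 1, 1]
b = [0, 0, 0, 1, 0, 1, 1]
c = [0, 2, 1, 0, 1, 0, 0]
d = [0, 1, 6, 0, 4, 0, 0]

print(cos(a, b))   # 1.0
print(cos(a, c))   # 0.0
print(cos(c, d))   # ~0.673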
Using Word Embeddings
27
Vector Math and Word2Vec Word Embeddings
Considerations
● How well do 2D pictures represent N-dimensional vectors (100 < N < 1000)?
● What % of all words can be structured into analogical relations?
● What if vector methods differ? (Word2Vec = local; GloVe = global)
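A toy sketch of the vector arithmetic behind such analogies (the words and their 2-D vectors below are invented purely for illustration; real embeddings have hundreds of dimensions, and the offset only lands near the target word, so the answer is taken as the nearest neighbor by cosine similarity):

# Invented 2-D "embeddings" chosen so the offset king - man equals queen - woman
vecs = {
    "king":  [0.9, 0.8], "queen": [0.9, 0.2],
    "man":   [0.1, 0.8], "woman": [0.1, 0.2],
}
# Solve the analogy king - man + woman ≈ ?
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
print(target)   # [0.9, 0.2] -- matches vecs["queen"] in this toy setup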
Using Word Embeddings
28
GloVe: Global Vectors for Word Representation
● Designed to address limitations of LSA and Word2Vec
● LSA: global context matrix factorization
○ Global co-occurrence statistics distinguish large-scale differences, e.g., stop words
○ Performs poorly on word analogy tasks
● Word2Vec: word classification by local contexts
○ Captures analogical meaning well
○ Ignores global statistics
Using Word Embeddings
29
GloVe Intuition
Given a pair of words i, j and any other (probe) word k, consider the ratio of
co-occurrence probabilities P(k|i) / P(k|j):
● The ratio is > 1 if k is more likely given i than given j
● The ratio is < 1 if k is more likely given j than given i
● The ratio is ≈ 1 if k is equally likely in either context
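In the notation of the GloVe paper (Pennington et al., 2014), where X_{ik} counts how often word k appears in the context of word i, this ratio is:

P_{ik} = P(k \mid i) = \frac{X_{ik}}{X_i}, \qquad X_i = \sum_{k'} X_{ik'}, \qquad \text{ratio} = \frac{P_{ik}}{P_{jk}}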
Using Word Embeddings
30
GloVe
● A global log-bilinear regression model
● Source code and trained vectors are available
● Introduces a weighted least squares training objective over the word-word
co-occurrence matrix X (sketched below)
● Very widely used as input representation for neural models
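The objective, as given in the GloVe paper (w_i and \tilde{w}_j are word and context word vectors, b_i and \tilde{b}_j their biases, and f a weighting function that down-weights very rare and very frequent co-occurrences):

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2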
Using Word Embeddings
31
Summary
● Vector space semantics implements the distributional hypothesis
about word meaning
● Can represent words and contexts
● Semantic similarity easily computed as the cosine of the angle between vectors
● Singular Value Decomposition and other smoothing methods improve
performance
Summary, Wk 14, Mtg 40
32