
Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6

CMPSC 442
Week 14, Meeting 40, Three Segments

Outline

● High-Dimensional Distributional Semantics
● Dimension Reduction: Singular Value Decomposition (SVD)
● Using Word Embeddings


Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6

CMPSC 442
Week 14, Meeting 40, Segment 1: High-Dimensional
Distributional Semantics

Firthian Linguistics

Firth (UK, 1890-1960) is noted for drawing attention to the
context-dependent nature of meaning with his notion of ‘context of
situation’, and his work on collocational meaning is widely
acknowledged in the field of distributional semantics. In particular,
he is known for the famous quotation:

You shall know a word by the company it keeps (Firth, J. R. 1957:11)


Substitutability & Contexts

● Words that occur in similar contexts have similar meanings
● Many important publications in the early-mid 50’s discuss the distributional
hypothesis: Wittgenstein, 1953; Harris, 1954; Weaver, 1955; Firth, 1957

Example: three sets of mutually substitutable words
○ {cows, cattle, livestock, heifers}
○ {eat, consume, ingest, feed on}
○ {hay, grain, corn, cereal crops}

4 x 4 x 4 = 64 similar sentences


Vectors for Vector Space (Semantic) Matrices

● VSM: each cell of matrix is a (weighted) word frequency
○ How often the word occurs in a given context
○ Words with similar meanings should have similar context counts

● Varieties of high dimensional VSM: |V| x |C| where C(ontexts) can be
defined in various ways
○ Term-document (typical of IR)
○ Word-context (many applications)
○ Pair-pattern (incorporates higher order information about contexts)


Notation

● Bold upper case, e.g., A: matrix
● Bold lower case, e.g., b: vector
● Lower case, e.g., c: scalar


Term-Document Matrices (Salton, 1971)

● Row vectors: terms, e.g., words, . . .
● Column vectors: documents, e.g., web pages
● Bags, or multisets: unordered set with repetition of members

○ Represented as vectors with counts incremented for repetition
○ EG: {a, a, b, c, c, c} represented as x = <2,1,3>

● Term-document matrix size n (terms) x m (documents)
○ Columns x_:j correspond to bags (of words)
○ Rows x_i: correspond to terms
○ Cells x_ij = frequency of term x_i: in document x_:j (see the sketch below)
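To make the definitions above concrete, here is a minimal sketch in Python with NumPy (the toy documents and whitespace tokenization are illustrative assumptions, not part of the slides) that builds a small term-document count matrix, treating each document as a bag of words:

```python
import numpy as np

# Toy corpus: each document is treated as a bag (multiset) of words.
docs = [
    "cows eat hay",
    "cattle feed on grain",
    "cows eat grain and hay",
]

# Vocabulary: one row per term, in a fixed order.
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Term-document matrix X: n terms x m documents, X[i, j] = count of term i in doc j.
X = np.zeros((len(vocab), len(docs)), dtype=int)
for j, d in enumerate(docs):
    for w in d.split():
        X[row[w], j] += 1

print(vocab)
print(X)
```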


Document Vectors from Word-Document Matrices

Jurafsky & Martin, Ch. 6, Figs. 6.3 and 6.4


Bag-of-words (BOW) for Information Retrieval

● Query represented with the same bag-of-words as the document
● Bag-of-word vectors represent what the document or query is about
● Search engines designed with this VSM model work well
● Term-document matrices are sparse

○ Tens of thousands of words x tens of thousands (or more) documents
○ Most cells are empty or have very low counts
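As a rough sketch of how a bag-of-words query is matched against documents in this VSM setting (toy data; scoring here is raw term overlap via dot product, whereas real engines add tf-idf weighting and length normalization, discussed below):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
vocab = ["cattle", "cows", "eat", "grain", "hay"]
row = {w: i for i, w in enumerate(vocab)}
X = np.array([[0, 1, 0],
              [1, 0, 1],
              [1, 0, 1],
              [0, 1, 1],
              [1, 0, 1]])  # 5 terms x 3 documents

# The query is represented in the same bag-of-words vector space as the documents.
q = np.zeros(len(vocab))
for w in "cows eat hay".split():
    if w in row:
        q[row[w]] += 1

# Score each document by term overlap (dot product) and rank the columns.
scores = q @ X
print(np.argsort(-scores), scores)
```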


Word Vectors from Word-Context Matrices
Jurafsky & Martin, Ch. 6, Figs. 6.5 and 6.6


Four Steps for Creation of Word-Context Matrices

1. Tokenize the input into words (or “terms”) and optionally apply other
preprocessing steps
a. Named Entity Recognition and normalization
b. Normalize variant word forms (e.g., stemming, lemmatization)
c. Tag numeric tokens (e.g., dates, measures, clock times, . . . )

2. Count word (term) frequencies
3. Convert raw frequencies to various weighting schemes
4. Smooth the vector space: dimensionality reduction
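A minimal sketch of steps 1-3 above, assuming whitespace tokenization, lowercasing as the only normalization, and a symmetric window of ±2 words as the definition of "context" (all of these choices are illustrative; step 4, dimensionality reduction, is covered in Segment 2):

```python
from collections import Counter, defaultdict

corpus = ["Cows eat hay .", "Cattle feed on grain and hay ."]
window = 2  # context = +/- 2 words

# Step 1: tokenize and normalize (here: lowercase only).
token_lists = [sent.lower().split() for sent in corpus]

# Step 2: count word-context co-occurrence frequencies.
counts = defaultdict(Counter)
for tokens in token_lists:
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                counts[w][tokens[j]] += 1

# Step 3: convert raw frequencies to a weighting scheme (here: probability of
# context given word; tf-idf or PPMI are common alternatives).
weights = {w: {c: n / sum(ctx.values()) for c, n in ctx.items()}
           for w, ctx in counts.items()}
print(weights["hay"])
```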


Weighting Schemes, etc.

● tf-idf (Spärck Jones, 1972); see the sketch after this list
○ Term frequency: frequency of word w in document d
○ Inverse document frequency: log of the inverse of the fraction of documents that word w occurs in
● Document length normalization
○ Equalizes long and short documents

● Pointwise Mutual Information (Church & Hanks, 1989; 1990)
● Feature selection methods

○ Χ-square
○ Information gain
○ Odds ratio
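A sketch of tf-idf weighting with document-length normalization, using one common variant of the formula (raw term frequency times log(N/df); textbooks differ in the exact smoothing and log base):

```python
import numpy as np

# Term-document count matrix: rows = terms, columns = documents.
X = np.array([[3, 0, 1],
              [0, 2, 0],
              [1, 1, 1]], dtype=float)

N = X.shape[1]                      # number of documents
df = np.count_nonzero(X, axis=1)    # document frequency of each term
idf = np.log(N / df)                # inverse document frequency
tfidf = X * idf[:, None]            # weight each term's counts by its idf

# Document length normalization: make each document (column) a unit vector.
tfidf_norm = tfidf / np.linalg.norm(tfidf, axis=0, keepdims=True)
print(tfidf_norm.round(3))
```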


Mutual Information, PMI, PPMI

● Mutual information of two random variables X and Y:
  I(X; Y) = Σ_x Σ_y P(x, y) log[ P(x, y) / (P(x) P(y)) ]
● Pointwise mutual information (PMI) of two events x and y:
  PMI(x, y) = log[ P(x, y) / (P(x) P(y)) ]
● Positive PMI (negative PMI unreliable): PPMI(x, y) = max(PMI(x, y), 0) (see the sketch below)
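A sketch of PPMI computed from a toy word-context count matrix, following the definitions above (base-2 logs, negative values clipped to zero; the counts are invented for illustration):

```python
import numpy as np

# Word-context co-occurrence counts: rows = words, columns = context words.
C = np.array([[0, 2, 1],
              [3, 0, 1],
              [1, 1, 0]], dtype=float)

total = C.sum()
p_wc = C / total                        # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal P(c)

with np.errstate(divide="ignore"):      # log2(0) -> -inf, clipped below
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)               # positive PMI: negative PMI is unreliable
print(ppmi.round(2))
```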


Example Matrices with PPMI: Smooths the Counts

Jurafsky & Martin, Ch. 6, Figs. 6.9-6.11: 4 words from Wikipedia corpus


Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6

CMPSC 442
Week 14, Meeting 40, Segment 2: Dimension Reduction:
Singular Value Decomposition (SVD)

Word-Context Matrices

● Similarity of row x_i: and row x_(i+n): captures how similar the contexts of words w_i and w_(i+n) are, hence the similarity of their meaning

● Latent semantic analysis (LSA) (Deerwester et al. 1990)
○ Inspired by term-document matrices
○ Focus was on meaning similarity of words or contexts

● LSA is one method to capture the intuition of distributional semantics
○ Applies singular value decomposition (SVD) to matrices
○ Factors the original matrix into linearly independent components
○ Reduces dimensionality
○ Reduces sparsity
○ Reduces noise


SVD

● Factor a matrix X into the product of three matrices: X = WΣCᵀ
○ An orthogonal matrix W
○ A diagonal matrix Σ (zeroes except on the diagonal) of singular values
○ The transpose of an orthogonal matrix C
● A matrix U is orthogonal if UUᵀ = UᵀU = I
● EG: X = (small numeric example worked on the slide; see the sketch below)
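A minimal numerical check of X = WΣCᵀ with NumPy (a sketch; numpy.linalg.svd returns the singular values as a vector, which is expanded here into the diagonal matrix Σ, and the toy matrix X is invented for illustration):

```python
import numpy as np

# A small word-context count matrix X (toy values).
X = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# Economy-size SVD: X = W @ Sigma @ C.T
W, s, Ct = np.linalg.svd(X, full_matrices=False)
Sigma = np.diag(s)                               # singular values on the diagonal, in descending order

print(np.allclose(X, W @ Sigma @ Ct))            # True: the product reconstructs X
print(np.allclose(W.T @ W, np.eye(W.shape[1])))  # True: columns of W are orthonormal
```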

Factoring the Word-Context Matrix: X = WΣCᵀ

● W (orthogonal) is size |V| × m
○ Rows represent words
○ Columns are orthogonal to each other, ordered by the amount of variance each accounts for
○ m is the rank of X (linearly independent rows, or nonzero singular values)
● Σ is an m × m diagonal matrix of singular values
● Cᵀ represents contexts
○ Each row is one of the new latent dimensions
○ The m row vectors are orthogonal


Factorization of X into WΣCᵀ


Reduced (thin) SVD: Top k of m dimensions


SVD Embeddings

● Truncate the SVD to size |V| × k
● Each row of the truncated W is a word embedding (see the sketch below)
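A sketch of turning a word-context matrix into k-dimensional word embeddings by truncated SVD (here k = 2 on a toy matrix; scaling the rows of W by the top-k singular values, as done here, is one common choice but not the only one):

```python
import numpy as np

# Word-context count matrix: one row per word in the vocabulary (toy values).
vocab = ["cows", "cattle", "hay", "grain"]
X = np.array([[2.0, 1.0, 0.0, 1.0],
              [1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 2.0, 1.0],
              [0.0, 1.0, 1.0, 2.0]])

k = 2
W, s, Ct = np.linalg.svd(X, full_matrices=False)
embeddings = W[:, :k] * s[:k]      # keep only the top-k latent dimensions

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Rows of the truncated matrix are the word embeddings.
print(cosine(embeddings[vocab.index("cows")], embeddings[vocab.index("cattle")]))
print(cosine(embeddings[vocab.index("cows")], embeddings[vocab.index("hay")]))
```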


Result of Truncated SVD on Word-Context Matrices

● Captures latent meaning: selecting k < rank r forces greater correspondence between words and contexts (rank ≡ linearly independent columns)
● Reduces noise: the singular values in the diagonal matrix are ranked in descending order of the amount of variation in X they account for, so taking the top k keeps the signal while the remainder is the noise
● Captures higher-order, or indirect, co-occurrence, where two words appear in similar contexts
● Reduces sparsity: turns a very sparse matrix into a very dense one, simulating having enough data to eliminate sparsity

Distributional Semantics
AIMA 24.1; Jurafsky & Martin, Ch. 6

CMPSC 442
Week 14, Meeting 40, Segment 3: GloVe

Measuring Similarity of Vectors

● Most widely used method to compare the meanings of two semantic vectors: cosine similarity (the cosine of the angle between them)
● Class exercise (worked answers below; see also the code sketch that follows)
a: 0001011
b: 0001011
c: 0210100
d: 0160400
○ cos(a, b) = 3/(sqrt(3) x sqrt(3)) = 3/3 = 1
○ cos(a, c) = 0/(sqrt(3) x sqrt(6)) = 0
○ cos(c, d) = 12/(sqrt(6) x sqrt(53)) ≈ 12/(2.45 x 7.28) ≈ 12/17.8 ≈ 0.67
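The class exercise can be checked directly in code (a small sketch; the vectors a-d are read off the slide as 7-dimensional count vectors):

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([0, 0, 0, 1, 0, 1, 1])
b = np.array([0, 0, 0, 1, 0, 1, 1])
c = np.array([0, 2, 1, 0, 1, 0, 0])
d = np.array([0, 1, 6, 0, 4, 0, 0])

print(cosine(a, b))   # ~1.0  (identical vectors)
print(cosine(a, c))   # 0.0   (no shared nonzero dimensions)
print(cosine(c, d))   # ~0.67 (12 / (sqrt(6) * sqrt(53)))
```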
Vector Math and Word2Vec Word Embeddings: Considerations

● How well do 2D pictures represent N-dimensional vectors (100 < N < 1000)?
● What % of all words can be structured into analogical relations?
● What if vector methods differ? (Word2Vec = local; GloVe = global)

GloVe: Global Vectors for Word Representation

● Based on addressing limitations of LSA and Word2Vec
● LSA: global context matrix factorization
○ Global co-occurrence statistics distinguish large-scale differences, e.g., stop words
○ Performs poorly on word analogy tasks
● Word2Vec: word classification by local contexts
○ Captures analogical meaning well
○ Ignores global statistics

GloVe Intuition

Given a pair of words i, j and any other word k, consider the co-occurrence probability ratio P(k|i) / P(k|j):

● P(k|i)/P(k|j) > 1 if k is more likely given i than given j
● P(k|i)/P(k|j) < 1 if k is more likely given j than given i
● P(k|i)/P(k|j) ≈ 1 if k is equally likely in either context

GloVe

● Global log-bilinear regression model
● Source code and trained vectors are available
● Introduces a weighted least squares training objective for the word-word co-occurrence matrix X
● Very widely used as input representation for neural models

Summary

● Vector space semantics implements the distributional hypothesis about word meaning
● Can represent both words and contexts
● Semantic similarity is easily computed as the cosine of two vectors
● Singular value decomposition and other smoothing improves performance