Vector Semantics
Dense Vectors
Dan Jurafsky
Sparse versus dense vectors
• PPMI vectors are
• long (length |V|= 20,000 to 50,000)
• sparse (most elements are zero)
• Alternative: learn vectors which are
• short (length 200-1000)
• dense (most elements are non-zero)
2
Dan Jurafsky
Sparse versus dense vectors
• Why dense vectors?
• Short vectors may be easier to use as features in machine
learning (fewer weights to tune)
• Dense vectors may generalize better than storing explicit counts
• They may do better at capturing synonymy:
• car and automobile are synonyms, but they are represented as
distinct dimensions; this fails to capture the similarity between a
word that has car as a neighbor and a word that has automobile as a
neighbor
3
Dan Jurafsky
Three methods for getting short dense
vectors
• Singular Value Decomposition (SVD)
• A special case of this is called LSA – Latent Semantic Analysis
• “Neural Language Model”-inspired predictive models
• skip-grams and CBOW
• Brown clustering
4
Vector Semantics
Dense Vectors via SVD
Dan Jurafsky
Intuition
• Approximate an N-dimensional dataset using fewer dimensions
• By first rotating the axes into a new space
• In which the highest order dimension captures the most
variance in the original dataset
• And the next dimension captures the next most variance, etc.
• Many such (related) methods:
• PCA – principal components analysis
• Factor Analysis
• SVD
6
Dan Jurafsky
7
Dimensionality reduction
Dan Jurafsky
Singular Value Decomposition
8
Any rectangular w × c matrix X equals the product of 3 matrices:
W: a w × m matrix whose rows correspond to the original rows, but whose m
columns each represent a dimension in a new latent space, such that
• the m column vectors are orthogonal to each other
• the columns are ordered by the amount of variance in the dataset each new
dimension accounts for
S: a diagonal m × m matrix of singular values expressing the
importance of each dimension
C: an m × c matrix whose columns correspond to the original columns, but whose
m rows correspond to the singular values
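To make the factorization concrete, here is a minimal NumPy sketch with a tiny made-up matrix X; the names W, S, C mirror the three matrices above (NumPy returns them as U, the singular values, and Vᵀ).

```python
import numpy as np

# Toy w x c matrix X (made-up counts, purely for illustration).
X = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [4., 0., 0.],
              [1., 1., 1.]])

# Thin SVD: X = W S C, with W (w x m), S (m x m diagonal), C (m x c).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W, S, C = U, np.diag(s), Vt

# Columns of W are orthonormal, singular values are sorted in decreasing
# order, and the product reconstructs X (up to floating-point error).
assert np.allclose(W @ S @ C, X)
assert np.allclose(W.T @ W, np.eye(W.shape[1]))
```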
Dan Jurafsky
Singular Value Decomposition
9 (Landauer and Dumais 1997)
• https://commons.wikimedia.org/wiki/File:Singular-Value-Decomposition.svg
• http://www.ams.org/samplings/feature-column/fcarc-svd
Dan Jurafsky
SVD applied to term-document matrix:
Latent Semantic Analysis
• Instead of keeping all m dimensions, we keep only the top k
singular values. Let's say 300.
• The result is a least-squares approximation to the original X
• But instead of multiplying the three matrices back together,
we'll just make use of W (a code sketch follows below).
• Each row of W is:
• a k-dimensional vector
• representing word w
10
[Figure: truncated SVD – keep only the first k columns of W, the top k × k block of S, and the first k rows of C]
Deerwester et al (1988)
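Continuing the NumPy sketch from the SVD slide, under the same toy assumptions: keep only the top k singular values (k = 2 here only because the toy matrix is tiny; the slide's 300 applies to realistic vocabularies) and take the rows of the truncated W as the word vectors.

```python
import numpy as np

# Toy term-document count matrix (rows = words, columns = documents).
X = np.array([[1., 0., 2.],
              [0., 3., 1.],
              [4., 0., 0.],
              [1., 1., 1.]])

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                     # keep only the top k singular values
W_k = U[:, :k]            # w x k: row i is the k-dimensional vector for word i
S_k = np.diag(s[:k])
C_k = Vt[:k, :]

X_hat = W_k @ S_k @ C_k   # least-squares rank-k approximation of X
word_vectors = W_k        # in LSA we just keep these rows as embeddings
```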
Dan Jurafsky
LSA more details
• 300 dimensions are commonly used
• The cells are commonly weighted by a product of two weights
• Local weight: Log term frequency
• Global weight: either idf or an entropy measure
11
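As a concrete illustration of one common weighting choice (log term frequency as the local weight, idf as the global weight; the entropy-based global weight is analogous but omitted), a short NumPy sketch with made-up counts:

```python
import numpy as np

counts = np.array([[3., 0., 1.],      # toy term-document counts
                   [0., 2., 0.],      # (rows = terms, columns = documents)
                   [1., 1., 4.]])

local = np.log1p(counts)              # local weight: log(1 + term frequency)
df = (counts > 0).sum(axis=1)         # document frequency of each term
idf = np.log(counts.shape[1] / df)    # global weight: inverse document frequency

weighted = local * idf[:, None]       # weighted cells that LSA then feeds to SVD
```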
Dan Jurafsky
Let’s return to PPMI word-word matrices
• Can we apply SVD to them?
12
Dan Jurafsky
SVD applied to term-term matrix
13 (I’m simplifying here by assuming the matrix has rank |V|)
Dan Jurafsky
Truncated SVD on term-term matrix
14
[Figure: truncated SVD of the term-term matrix, X ≈ W S C, keeping only the top k dimensions]
• This is essentially a tri-matrix factorization model, or latent variable model. But why?
• Can you change it to a (bi-)matrix factorization model? (One possibility is sketched below.)
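One possible answer, offered here as my own sketch rather than the lecture's: absorb the diagonal matrix of singular values into the two outer matrices, for example by splitting it symmetrically, which leaves a product of just two matrices.

```latex
X \approx W_k S_k C_k
  = \big(W_k S_k^{1/2}\big)\,\big(S_k^{1/2} C_k\big)
  = \widetilde{W}\,\widetilde{C}
```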
Dan Jurafsky
Truncated SVD produces embeddings
15
• Each row of the W matrix is a k-dimensional
representation of each word w
• k might range from 50 to 1000
• Generally we keep the top k dimensions,
but some experiments suggest that
getting rid of the top 1 dimension or even
the top 50 dimensions is helpful (Lapesa
and Evert 2014).
Dan Jurafsky
Embeddings versus sparse vectors
• Dense SVD embeddings sometimes work better than
sparse PPMI matrices at tasks like word similarity
• Denoising: low-order dimensions may represent unimportant
information
• Truncation may help the models generalize better to unseen data.
• Having a smaller number of dimensions may make it easier for
classifiers to properly weight the dimensions for the task.
• Dense models may do better at capturing higher order co-
occurrence.
16
Vector Semantics
Embeddings inspired by
neural language models:
skip-grams and CBOW
Dan Jurafsky
Prediction-based models:
An alternative way to get dense vectors
• Skip-gram (Mikolov et al. 2013a) and CBOW (Mikolov et al. 2013b)
• Learn embeddings as part of the process of word prediction.
• Train a neural network to predict neighboring words
• Inspired by neural net language models.
• In so doing, learn dense embeddings for the words in the training corpus.
• Advantages:
• Fast, easy to train (much faster than SVD)
• Available online in the word2vec package
• Including sets of pretrained embeddings!
18
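For reference, a minimal sketch of training skip-gram embeddings with the gensim implementation of word2vec (assuming gensim ≥ 4.0; the toy sentences and parameter values are placeholders, not the settings from the original papers):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data).
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

model = Word2Vec(sentences,
                 vector_size=100,   # dimensionality of the dense embeddings
                 window=2,          # context window of +/- 2 words
                 sg=1,              # 1 = skip-gram, 0 = CBOW
                 negative=5,        # negative sampling with 5 noise words
                 min_count=1)       # keep every word in this tiny corpus

vec = model.wv["cat"]                            # learned embedding for "cat"
similar = model.wv.most_similar("cat", topn=3)   # nearest neighbors by cosine
```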
Dan Jurafsky
Skip-grams
• Predict each neighboring word
• in a context window of 2C words
• from the current word.
• So for C=2, we are given word w_t and predict these
4 words: w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}
19
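To make the windowing concrete, here is a small sketch (my own illustration, not code from the lecture) that enumerates the (target, context) pairs a skip-gram model with C = 2 is trained to predict:

```python
def skipgram_pairs(tokens, C=2):
    """Yield (target, context) pairs from a window of +/- C words."""
    pairs = []
    for t, target in enumerate(tokens):
        for offset in range(-C, C + 1):
            if offset == 0:
                continue                      # skip the target word itself
            j = t + offset
            if 0 <= j < len(tokens):
                pairs.append((target, tokens[j]))
    return pairs

# For C=2, the word at position t is paired with w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}.
print(skipgram_pairs(["I", "like", "natural", "language", "processing"], C=2))
```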
Dan Jurafsky
Skip-grams learn 2 embeddings
for each w
input embedding v, in the input matrix W
• Column i of the input matrix W is the d-dimensional
embedding v_i for word i in the vocabulary.
output embedding v′, in the output matrix W′
• Row i of the output matrix W′ is the d-dimensional
embedding v′_i for word i in the
vocabulary.
20
Dan Jurafsky
Setup
• Walking through the corpus, pointing at word w(t), whose index in
the vocabulary is j, so we'll call it w_j (1 < j < |V|).
• Let's predict w(t+1), whose index in the vocabulary is k (1 < k < |V|).
• Hence our task is to compute P(w_k|w_j).
21
Dan Jurafsky
Intuition: similarity as dot product between a target vector and a context vector
22
Dan Jurafsky
Similarity is computed from dot product
• Remember: two vectors are similar if they have a high dot product
• Cosine is just a normalized dot product
• So:
• Similarity(j,k) ∝ c_k ∙ v_j
• We'll need to normalize to get a probability
23
Dan Jurafsky
Turning dot products into probabilities
• Similarity(j,k) = c_k ∙ v_j
• We use softmax to turn it into probabilities:
P(w_k | w_j) = exp(c_k ∙ v_j) / Σ_{i=1}^{|V|} exp(c_i ∙ v_j)
24
Dan Jurafsky
Embeddings from W and W′
• Since we have two embeddings, v_j and c_j, for each word w_j
• We can either:
• Just use v_j
• Sum them
• Concatenate them to make a double-length embedding
25
Dan Jurafsky
Learning
• Start with some initial embeddings (e.g., random)
• Iteratively "improve" the embeddings for a word, making them
• more like the embeddings of its neighbors
• less like the embeddings of other words
26
Dan Jurafsky
Visualizing W and C as a network for doing error backprop
27
See my modified picture on p. 29
Dan Jurafsky
One-hot vectors
• A vector of length |V|
• 1 for the target word and 0 for the other words
• So if "popsicle" is vocabulary word 5
• The one-hot vector is
• [0,0,0,0,1,0,0,0,0,…,0]
28
Dan Jurafsky
Skip-gram
[Figure: the skip-gram network as two matrix multiplications, with W and C both of size |V| × d]
h = Wᵀ w_t = W_{t,∗}
y = C h = C Wᵀ w_t
w_t ∈ ℝ^{|V|} (one-hot), h ∈ ℝ^d, y ∈ ℝ^{|V|}
y_k = C_{k,∗} ∙ h = C_{k,∗} ∙ W_{t,∗}
29
Dan Jurafsky
Problem with the softmax
• The denominator: we have to compute it over every word in the vocabulary
• Instead: just sample a few of those negative words
30
Dan Jurafsky
Goal in learning
• Make the word like the context words
• We want this to be high: σ(c ∙ v) for the observed (word, context word) pairs
• And not like k randomly selected "noise words"
• We want this to be low: σ(c_noise ∙ v) for the sampled noise words
31
Dan Jurafsky
Skip-gram with negative sampling: Loss function
log(1 − σ(z)) = … = log(σ(−z))
32
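A minimal sketch of this objective for a single (target word, context word) pair, written from the standard formulation with made-up toy vectors; it uses the identity just shown, log(1 − σ(z)) = log σ(−z):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_objective(v_target, c_pos, c_negs):
    """Skip-gram with negative sampling: we maximize
       log sigma(c_pos . v) + sum_i log sigma(-c_neg_i . v),
       i.e. the real context word scores high, the noise words score low."""
    obj = np.log(sigmoid(c_pos @ v_target))
    for c_neg in c_negs:
        obj += np.log(sigmoid(-c_neg @ v_target))   # log(1 - sigma(z)) = log sigma(-z)
    return obj

# Toy 4-dimensional embeddings (random stand-ins for learned vectors).
rng = np.random.default_rng(0)
v = rng.normal(size=4)              # input embedding of the target word
c_pos = rng.normal(size=4)          # output embedding of an observed context word
c_negs = rng.normal(size=(5, 4))    # output embeddings of 5 sampled noise words
print(sgns_objective(v, c_pos, c_negs))
```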
Dan Jurafsky
Relation between skip-grams and PMI!
• If we multiply W W′ᵀ
• We get a |V| × |V| matrix M, each entry m_ij corresponding to some association between input word i and output word j
• Levy and Goldberg (2014b) show that skip-gram reaches its optimum just when this matrix is a shifted version of PMI:
W W′ᵀ = M^PMI − log k
• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.
33
Dan Jurafsky
Properties of embeddings
• Nearest words to some embeddings (Mikolov et al. 2013)
34
Dan Jurafsky
Embeddings capture relational meaning!
vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')
35
Dan Jurafsky
In-depth Reading
• Mikolov's original two papers on word2vec are not really readable by beginners.
• Hard work is needed to really understand the inner workings of the word2vec model and its (many) assumptions/approximations.
• References:
• "word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method" [arXiv 2014] – the objective function
• "word2vec Parameter Learning Explained" [arXiv 2014] – parameter learning
36
Vector Semantics
Brown clustering
Dan Jurafsky
Brown clustering
• An agglomerative clustering algorithm that clusters words based on which words precede or follow them
• These word clusters can be turned into a kind of vector
• We'll give a very brief sketch here.
38
Dan Jurafsky
Brown clustering algorithm
• Each word is initially assigned to its own cluster.
• We then consider merging each pair of clusters; the highest-quality merge is chosen.
• Quality = merges two words that have similar probabilities of preceding and following words
• (More technically, quality = smallest decrease in the likelihood of the corpus according to a class-based language model)
• Clustering proceeds until all words are in one big cluster.
39
Dan Jurafsky
Brown clusters as vectors
• By tracing the order in which clusters are merged, the model builds a binary tree from bottom to top.
• Each word is represented by a binary string = the path from the root to its leaf
• Each intermediate node is a cluster
• "chairman" is 0010, "months" = 01, and verbs = 1
40
Brown Algorithm
• Words merged according to contextual similarity
• Clusters are equivalent to bit-string prefixes
• Prefix length determines the granularity of the clustering
[Figure: binary merge tree over the words president, chairman, CEO, walk, run, sprint, November, October; each leaf is labeled by its bit-string path, with prefixes such as 0, 1, 00, 01, 10, 11, 000, 001 labeling the internal nodes]
Dan Jurafsky
Brown cluster examples
41
Dan Jurafsky
Class-based language model
• Suppose each word was in some class c_i:
P(w_i | w_{i−1}) = P(c_i | c_{i−1}) P(w_i | c_i)
P(corpus | C) = ∏_{i=1}^{n} P(c_i | c_{i−1}) P(w_i | c_i)
42
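To make the class-based bigram model concrete, here is a small sketch with hypothetical classes and probability tables (toy values, not trained Brown clusters), computing log P(corpus | C) as the product on the slide:

```python
import math

# Hypothetical hard class assignments and toy probability tables.
word_class = {"the": "DET", "cat": "NOUN", "dog": "NOUN", "runs": "VERB"}
p_class = {("<s>", "DET"): 0.8, ("DET", "NOUN"): 0.9, ("NOUN", "VERB"): 0.7}
p_word = {("DET", "the"): 1.0, ("NOUN", "cat"): 0.4,
          ("NOUN", "dog"): 0.6, ("VERB", "runs"): 0.5}

def log_prob(corpus):
    """log P(corpus | C) = sum_i [ log P(c_i | c_{i-1}) + log P(w_i | c_i) ]."""
    prev = "<s>"
    total = 0.0
    for w in corpus:
        c = word_class[w]
        total += math.log(p_class[(prev, c)]) + math.log(p_word[(c, w)])
        prev = c
    return total

print(log_prob(["the", "cat", "runs"]))
```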