

Vector Semantics

Dense Vectors

Dan Jurafsky

Sparse versus dense vectors

• PPMI vectors are

• long (length |V|= 20,000 to 50,000)

• sparse (most elements are zero)

• Alternative: learn vectors which are

• short (length 200-1000)

• dense (most elements are non-zero)

2

Dan Jurafsky

Sparse versus dense vectors

• Why dense vectors?

• Short vectors may be easier to use as features in machine
learning (fewer weights to tune)

• Dense vectors may generalize better than storing explicit counts

• They may do better at capturing synonymy:

• car and automobile are synonyms, but they are represented as
distinct dimensions; this fails to capture the similarity between a
word that has car as a neighbor and a word that has automobile as a
neighbor

3

Dan Jurafsky

Three methods for getting short dense
vectors

• Singular Value Decomposition (SVD)

• A special case of this is called LSA – Latent Semantic Analysis

• “Neural Language Model”-inspired predictive models

• skip-grams and CBOW

• Brown clustering

4

Vector Semantics

Dense Vectors via SVD

Dan Jurafsky

Intuition

• Approximate an N-dimensional dataset using fewer dimensions

• By first rotating the axes into a new space

• In which the highest order dimension captures the most
variance in the original dataset

• And the next dimension captures the next most variance, etc.

• Many such (related) methods:
• PCA – principal components analysis

• Factor Analysis

• SVD
6

Dan Jurafsky

7

Dimensionality reduction

Dan Jurafsky

Singular Value Decomposition

8

Any rectangular w × c matrix X equals the product of 3 matrices, X = W S C:

W: same rows as the original matrix, but m columns, each representing a
dimension in a new latent space, such that

• the m column vectors are orthogonal to each other

• columns are ordered by the amount of variance in the dataset each new
dimension accounts for

S: diagonal m × m matrix of singular values expressing the
importance of each dimension.

C: same columns as the original matrix, but m rows, each corresponding to
one of the singular values (latent dimensions)
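
A minimal numpy sketch of this decomposition (the matrix values are made up purely for illustration; numpy returns the three factors in the orientation described above):

```python
import numpy as np

# Toy w x c count matrix X; the values are made up purely for illustration.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
])

# numpy returns the three factors described above:
#   W (w x m), the singular values s (length m), and C (m x c).
W, s, C = np.linalg.svd(X, full_matrices=False)

# Singular values come back sorted from most to least important.
print(s)

# The product reconstructs X up to floating-point error.
assert np.allclose(X, W @ np.diag(s) @ C)
```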

Dan Jurafsky

Singular Value Decomposition

9 Landauer and Dumais 1997

• https://commons.wikimedia.org/wiki/File:Singular-Value-Decomposition.svg
• http://www.ams.org/samplings/feature-column/fcarc-svd


Dan Jurafsky

SVD applied to term-document matrix:
Latent Semantic Analysis

• Instead of keeping all m dimensions, we keep just the top k
singular values (say, k = 300).

• The result is a least-squares approximation to the original X

• But instead of multiplying,
we’ll just make use of W.

• Each row of W:
• A k-dimensional vector

• Representing word w

10

[Figure: in the truncated SVD, only the top k columns of W, the top k × k block of S, and the top k rows of C are kept]

Deerwester et al (1988)

Dan Jurafsky

LSA more details

• 300 dimensions are commonly used

• The cells are commonly weighted by a product of two weights
• Local weight: Log term frequency

• Global weight: either idf or an entropy measure

11
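
A sketch of one common version of this weighting, log term frequency times a global entropy weight (exact formulas vary across LSA papers, so treat the details as an assumption; the function name is mine):

```python
import numpy as np

def lsa_weight(counts):
    """Apply a local (log tf) times global (1 + normalized entropy) weighting
    to a term-document count matrix of shape (|V|, D)."""
    counts = np.asarray(counts, dtype=float)
    local = np.log1p(counts)                                   # local weight: log(tf + 1)
    row_totals = np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    p = counts / row_totals                                    # P(doc | term)
    D = counts.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(p > 0, p * np.log(p), 0.0).sum(axis=1) / np.log(D)
    global_w = 1.0 + ent    # in [0, 1]; terms concentrated in few documents score higher
    return local * global_w[:, None]

# Toy usage: 4 terms x 3 documents.
X = np.array([[3, 0, 0], [1, 1, 1], [0, 2, 1], [0, 0, 4]])
print(lsa_weight(X))
```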

Dan Jurafsky

Let’s return to PPMI word-word matrices

• Can we apply SVD to them?

12

Dan Jurafsky

SVD applied to term-term matrix

13 (I’m simplifying here by assuming the matrix has rank |V|)

Dan Jurafsky

Truncated SVD on term-term matrix

14

• This is essentially a tri-matrix factorization (latent variable) model. But why?
• Can you change it into a (bi-)matrix factorization model? (A sketch of one possible answer follows below.)
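
One possible answer, given here as my own sketch rather than the slide's intended one: fold the singular values into the two outer matrices, for example by splitting $S_k$ symmetrically,

$$X \approx W_k\, S_k\, C_k = \left(W_k\, S_k^{1/2}\right)\left(S_k^{1/2}\, C_k\right),$$

so the truncated SVD can equivalently be read as a product of just two low-rank matrices (word factors times context factors). Putting all of $S_k$ on one side is another common choice.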

Dan Jurafsky

Truncated SVD produces embeddings

15

• Each row of W matrix is a k-dimensional
representation of each word w

• k might range from 50 to 1000

• Generally we keep the top k dimensions,
but some experiments suggest that
getting rid of the top 1 dimension or even
the top 50 dimensions is helpful (Lapesa
and Evert 2014).
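
A numpy sketch of producing such embeddings from a PPMI matrix, with an option to drop the first few dimensions as in Lapesa and Evert (2014); the matrix below is a random placeholder and the function name is mine:

```python
import numpy as np

def svd_embeddings(ppmi, k=300, drop_top=0):
    """Return a matrix whose rows are dense k-dimensional word embeddings.

    ppmi:     |V| x |V| (or |V| x c) PPMI matrix
    k:        number of latent dimensions to keep
    drop_top: optionally discard the first few (highest-variance) dimensions
    """
    W, s, _ = np.linalg.svd(ppmi, full_matrices=False)
    return W[:, drop_top:drop_top + k]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy usage on a random symmetric "PPMI-like" matrix (illustrative only).
rng = np.random.default_rng(42)
M = np.abs(rng.normal(size=(50, 50)))
M = (M + M.T) / 2
emb = svd_embeddings(M, k=10)
print(emb.shape)                 # (50, 10): one 10-dimensional vector per "word"
print(cosine(emb[0], emb[1]))    # similarity of "word 0" and "word 1"
```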

Dan Jurafsky

Embeddings versus sparse vectors

• Dense SVD embeddings sometimes work better than
sparse PPMI matrices at tasks like word similarity
• Denoising: low-order dimensions may represent unimportant information

• Truncation may help the models generalize better to unseen data.

• Having a smaller number of dimensions may make it easier for
classifiers to properly weight the dimensions for the task.

• Dense models may do better at capturing higher order co-
occurrence.

16

Vector Semantics

Embeddings inspired by
neural language models:

skip-grams and CBOW

Dan Jurafsky

Prediction-based models:
an alternative way to get dense vectors

• Skip-gram (Mikolov et al. 2013a) and CBOW (Mikolov et al. 2013b)

• Learn embeddings as part of the process of word prediction.

• Train a neural network to predict neighboring words
• Inspired by neural net language models.

• In so doing, learn dense embeddings for the words in the training corpus.

• Advantages:
• Fast, easy to train (much faster than SVD)

• Available online in the word2vec package

• Including sets of pretrained embeddings!

18
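
As one concrete way to try this (an assumption on my part: the slide refers to Mikolov's original word2vec tool, while the sketch below uses the gensim package, version 4 or later, which reimplements the same skip-gram and CBOW models):

```python
# Assumes the gensim package (>= 4.0); this is just one convenient way to
# train the same skip-gram / CBOW models as the original word2vec tool.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]  # toy corpus; in practice use a large tokenized corpus

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality d of the dense embeddings
    window=2,          # context window C
    sg=1,              # 1 = skip-gram, 0 = CBOW
    negative=5,        # negative sampling with k = 5 noise words
    min_count=1,
)

vec = model.wv["cat"]                 # dense embedding for "cat"
print(model.wv.most_similar("cat"))   # nearest neighbors by cosine
```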

Dan Jurafsky

Skip-grams

• Predict each neighboring word

• in a context window of 2C words

• from the current word.

• So for C=2, we are given word wt and predict these
4 words: $w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2}$

19
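
A tiny sketch of how these training pairs can be enumerated (the helper name and toy sentence are mine):

```python
def skipgram_pairs(tokens, C=2):
    """Yield (target word, context word) pairs from a tokenized corpus,
    using a context window of C words on each side."""
    for t, target in enumerate(tokens):
        for off in range(-C, C + 1):
            if off != 0 and 0 <= t + off < len(tokens):
                yield target, tokens[t + off]

pairs = list(skipgram_pairs("the quick brown fox jumps".split()))
# For the middle word this yields:
# ('brown', 'the'), ('brown', 'quick'), ('brown', 'fox'), ('brown', 'jumps')
```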

Dan Jurafsky

Skip-grams learn 2 embeddings
for each w

input embedding v, in the input matrix W

• Column i of the input matrix W is the d × 1
embedding vi for word i in the vocabulary.

output embedding v′, in output matrix W’

• Row i of the output matrix W′ is a 1 × d
vector embedding v′i for word i in the
vocabulary.

20
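
A shape-level numpy sketch of the two matrices as described on this slide (sizes and random initial values are illustrative; note that the modified picture on p. 29 instead stores the input matrix as |V| × d, i.e. the transpose of this layout):

```python
import numpy as np

V, d = 10_000, 300                              # example vocabulary size and dimension
rng = np.random.default_rng(0)

W = rng.normal(scale=0.01, size=(d, V))         # input matrix: column i is input embedding v_i
W_prime = rng.normal(scale=0.01, size=(V, d))   # output matrix: row i is output embedding v'_i

j, k = 42, 1337                                 # a target word j and a context word k
v_j = W[:, j]                                   # input embedding of word j  (length d)
c_k = W_prime[k, :]                             # output embedding of word k (length d)

score = c_k @ v_j                               # the dot product used on the next slides
print(score)
```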

Dan Jurafsky

Setup

• Walking through the corpus, pointing at word w(t), whose index in
the vocabulary is j, so we’ll call it wj (1 ≤ j ≤ |V|).

• Let’s predict w(t+1), whose index in the vocabulary is k (1 ≤ k ≤ |V|).

• Hence our task is to compute P(wk|wj).

21

Dan Jurafsky

Intuition: similarity as dot-product between a target vector and context vector

22

Dan Jurafsky

Similarity is computed from the dot product

• Remember: two vectors are similar if they have a high dot product

• Cosine is just a normalized dot product

• So: Similarity(j,k) ∝ $c_k \cdot v_j$

• We’ll need to normalize to get a probability

23

Dan Jurafsky

Turning dot products into probabilities

• Similarity(j,k) = $c_k \cdot v_j$

• We use softmax to turn this into a probability:

$p(w_k \mid w_j) = \dfrac{\exp(c_k \cdot v_j)}{\sum_{i \in V} \exp(c_i \cdot v_j)}$

24

Dan Jurafsky

Embeddings from W and W′

• Since we have two embeddings, vj and cj, for each word wj

• We can either:

• Just use vj

• Sum them

• Concatenate them to make a double-length embedding

25

Dan Jurafsky

Learning

• Start with some initial embeddings (e.g., random)

• Iteratively “improve” the embeddings for a word:

• more like the embeddings of its neighbors

• less like the embeddings of other words

26

Dan Jurafsky

Visualizing W and C as a network for doing error backprop

27 (See my modified picture on p. 29)

Dan Jurafsky

One-hot vectors

• A vector of length |V|

• 1 for the target word and 0 for all other words

• So if “popsicle” is vocabulary word 5

• The one-hot vector is

• [0,0,0,0,1,0,0,0,0, …, 0]

28

Dan Jurafsky

Skip-gram

$h = W^{\top} w_t = W_{t,*}$, with one-hot input $w_t \in \mathbb{R}^{|V|}$ and hidden layer $h \in \mathbb{R}^{d}$

$y = C\,h = C\,W^{\top} w_t$, with output $y \in \mathbb{R}^{|V|}$ and $C$ of shape $|V| \times d$

$y_k = C_{k,*} \cdot h = C_{k,*} \cdot W_{t,*}$

29

Dan Jurafsky

Problem with the softmax

• The denominator: we have to compute it over every word in the vocabulary

• Instead: just sample a few of those negative words

30

Dan Jurafsky

Goal in learning

• Make the target word like its context words

• We want this to be high: $\sigma(c_{\text{pos}} \cdot v_t)$ for each true context word

• And not like k randomly selected “noise words”

• We want this to be low: $\sigma(c_{\text{neg}} \cdot v_t)$ for each sampled noise word

31

Dan Jurafsky

Skip-gram with negative sampling: Loss function

For one target word $v_t$, one true context word $c_{\text{pos}}$, and $k$ noise words $c_{\text{neg}_1}, \dots, c_{\text{neg}_k}$:

$L = -\log \sigma(c_{\text{pos}} \cdot v_t) - \sum_{i=1}^{k} \log \sigma(-c_{\text{neg}_i} \cdot v_t)$

using the identity $\log(1 - \sigma(z)) = \log(\sigma(-z))$

32

Dan Jurafsky

Relation between skip-grams and PMI!

• If we multiply $W W'^{\top}$

• We get a |V| × |V| matrix M, each entry mij corresponding to some association between input word i and output word j

• Levy and Goldberg (2014b) show that skip-gram reaches its optimum just when this matrix is a shifted version of PMI:

$W W'^{\top} = M^{\text{PMI}} - \log k$

• So skip-gram is implicitly factoring a shifted version of the PMI matrix into the two embedding matrices.

33

Dan Jurafsky

Properties of embeddings

• Nearest words to some embeddings (Mikolov et al. 2013)

34

Dan Jurafsky

Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)

vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)

35

Dan Jurafsky

In-depth Reading

• Mikolov’s original 2 papers on word2vec are not really readable by beginners.

• Hard work is needed to really understand the inner workings of the word2vec model and its (many) assumptions/approximations

• References:

• Word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method [arXiv 2014]

• Objective function

• Word2vec Parameter Learning Explained [arXiv 2014]

• Parameter learning

36
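
Before moving on to Brown clustering, here is a compact from-scratch sketch of the skip-gram-with-negative-sampling training loop described on the preceding slides. It is a toy illustration only: noise words are drawn uniformly rather than from the unigram^0.75 distribution, there is no subsampling, and all hyperparameters are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

corpus = "the cat sat on the mat the dog sat on the log".split()
vocab = sorted(set(corpus))
w2i = {w: i for i, w in enumerate(vocab)}
ids = [w2i[w] for w in corpus]

V, d, C, k, lr, epochs = len(vocab), 16, 2, 3, 0.05, 200

W = rng.normal(scale=0.1, size=(V, d))    # target (input) embeddings, one row per word
Cm = rng.normal(scale=0.1, size=(V, d))   # context (output) embeddings, one row per word

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(epochs):
    for t, wt in enumerate(ids):
        for off in range(-C, C + 1):
            if off == 0 or not 0 <= t + off < len(ids):
                continue
            pos = ids[t + off]
            neg = rng.integers(0, V, size=k)   # uniform noise words (a simplification)
            v = W[wt].copy()
            # Positive pair: push sigma(c_pos . v) toward 1.
            g = sigmoid(Cm[pos] @ v) - 1.0
            grad_v = g * Cm[pos]
            Cm[pos] -= lr * g * v
            # Negative pairs: push sigma(c_neg . v) toward 0.
            for n in neg:
                gn = sigmoid(Cm[n] @ v)
                grad_v += gn * Cm[n]
                Cm[n] -= lr * gn * v
            W[wt] -= lr * grad_v

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Words that appear in similar contexts ("cat"/"dog") tend to move closer together.
print(cosine(W[w2i["cat"]], W[w2i["dog"]]))
```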
Vector Semantics

Brown clustering

Dan Jurafsky

Brown clustering

• An agglomerative clustering algorithm that clusters words based on which words precede or follow them

• These word clusters can be turned into a kind of vector

• We’ll give a very brief sketch here.

38

Dan Jurafsky

Brown clustering algorithm

• Each word is initially assigned to its own cluster.

• We then consider merging each pair of clusters; the highest-quality merge is chosen.

• Quality = merges two words that have similar probabilities of preceding and following words

• (More technically, quality = smallest decrease in the likelihood of the corpus according to a class-based language model)

• Clustering proceeds until all words are in one big cluster.

39

Dan Jurafsky

Brown Clusters as vectors

• By tracing the order in which clusters are merged, the model builds a binary tree from the bottom up.

• Each word is represented by a binary string = the path from the root to its leaf

• Each intermediate node is a cluster

• “chairman” is 0010, “months” = 01, and verbs = 1

40

Brown Algorithm

• Words merged according to contextual similarity

• Clusters are equivalent to bit-string prefixes

• Prefix length determines the granularity of the clustering

[Figure: a binary merge tree over the words president, chairman, CEO, November, October, run, sprint, and walk, with each leaf labelled by its bit-string path and internal nodes labelled by shared prefixes]

Dan Jurafsky

Brown cluster examples

41

Dan Jurafsky

Class-based language model

• Suppose each word is in some class ci:

$P(w_i \mid w_{i-1}) = P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$

$P(\text{corpus} \mid C) = \prod_{i=1}^{n} P(c_i \mid c_{i-1})\, P(w_i \mid c_i)$

42
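
A toy sketch of the class-based bigram likelihood that Brown clustering greedily tries to keep as high as possible. The hard class assignment below is hand-made purely for illustration; the real algorithm searches over cluster merges to find it.

```python
import numpy as np
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log".split()

# Hypothetical hard assignment of words to classes (made up for illustration).
cls = {"the": 0, "cat": 1, "dog": 1, "mat": 1, "log": 1, "sat": 2, "on": 3}

c_seq = [cls[w] for w in corpus]
class_bigrams = Counter(zip(c_seq, c_seq[1:]))
class_counts = Counter(c_seq)
word_counts = Counter(corpus)

def p_class(c, c_prev):                 # maximum-likelihood P(c_i | c_{i-1})
    return class_bigrams[(c_prev, c)] / class_counts[c_prev]

def p_word(w, c):                       # maximum-likelihood P(w_i | c_i)
    return word_counts[w] / class_counts[c]

# log P(corpus | C) = sum_i [ log P(c_i | c_{i-1}) + log P(w_i | c_i) ]
loglik = sum(
    np.log(p_class(c_seq[i], c_seq[i - 1])) + np.log(p_word(corpus[i], c_seq[i]))
    for i in range(1, len(corpus))
)
# Merging clusters changes this quantity; Brown clustering picks the merge
# that decreases it the least.
print(loglik)
```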