LECTURE 6
Vector Representation and Models for Word Embeddings
Arkaitz Zubiaga, 24th January, 2018
2
Vector space models for language representation.
Word embeddings.
SVD: Singular Value Decomposition.
Iteration-based models.
CBOW and skip-gram models.
Word2Vec and GloVe.
LECTURE 6: CONTENTS
VECTOR SPACE MODELS
4
Goal: compute the probability of a sequence of words:
P(W) = P(w_1, w_2, w_3, w_4, w_5, …, w_n)
Related task: probability of an upcoming word:
P(w_5 | w_1, w_2, w_3, w_4)
Both of the above are language models.
RECAP: STATISTICAL LANGUAGE MODELS
5
So far, we have viewed words as (sequences of) atomic symbols.
We have used edit distance to compute similarity.
N-grams & LMs → what may follow/precede the word?
But this doesn’t tell us anything about semantic similarity, e.g.:
Is “Chinese” closer to “Asian” or to “English”?
Are “king” & “queen” more related than “doctor” & “mountain”?
WORDS AS ATOMIC SYMBOLS
6
We may identify significant similarity based on word overlap between:
“Facebook to fight ‘fake news’ by asking users to rank trust in media outlets”
“Facebook’s latest fix for fake news: ask users what they trust”
But we’ll fail when there isn’t an overlap:
“Zuckerberg announces new feature that crowdsources trustworthiness of news organisations”
WORDS AS ATOMIC SYMBOLS
NO OVERLAP (even using a stemmer/lemmatiser)
7
Likewise for text classification, e.g.:
If a classifier learns that:
“Leicester will welcome back Jamie Vardy for their Premier League clash with Watford”
belongs to the class/topic “sport”,
we’ll fail to also classify the following as “sport”:
“Blind Cricket World Cup: India beat Pakistan by two wickets in thrilling final to retain title”
WORDS AS ATOMIC SYMBOLS
NO OVERLAP
8
Assumptions:
We can represent words as vectors of some dimension.
There is some N-dimensional space that is enough for encoding
language semantics.
Each dimension has some semantic meaning, unknown a priori, but it
could be e.g.:
Whether it is an object/concept/person.
Gender of person.
…
WORD VECTORS OR EMBEDDINGS
9
Each word represented as a vector of dimension |V| (the vocabulary size), with a 1 in the position of that word and 0s elsewhere.
WORD VECTORS: ONE-HOT OR BINARY MODEL
10
Each word represented as a vector of dimension |V| (the vocabulary size).
Still no notion of similarity, e.g. the dot product of any two different one-hot vectors is 0.
Solution: reduce the dimensionality of the vector space.
WORD VECTORS: ONE-HOT OR BINARY MODEL
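As a quick illustration (not from the slides), a minimal numpy sketch of one-hot vectors and why they carry no similarity information; the toy vocabulary is made up:

import numpy as np

# Toy vocabulary, assumed for illustration only.
vocab = ['king', 'queen', 'doctor', 'mountain']
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's position
    v = np.zeros(len(vocab))
    v[word_index[word]] = 1.0
    return v

# The dot product of any two different one-hot vectors is 0: no notion of similarity.
print(np.dot(one_hot('king'), one_hot('queen')))   # 0.0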
11
Bag-of-words: v = {|w_1|, |w_2|, …, |w_n|}
Toy example: hello world hello world hello I like chocolate
v = {3, 2, 1, 1, 1} (counts of hello, world, I, like, chocolate)
Widely used, but largely being replaced by word embeddings.
Con: inefficient for large vocabularies.
Con: doesn’t capture semantics (each word is an unrelated token).
BAG-OF-WORDS MODEL
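A minimal sketch (not from the slides) of the bag-of-words count vector for the toy example above, taking the vocabulary in order of first appearance:

from collections import Counter

corpus = 'hello world hello world hello I like chocolate'
tokens = corpus.split()
vocab = list(dict.fromkeys(tokens))   # vocabulary in order of first appearance
counts = Counter(tokens)
v = [counts[w] for w in vocab]
print(vocab)   # ['hello', 'world', 'I', 'like', 'chocolate']
print(v)       # [3, 2, 1, 1, 1]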
12
Given as input:
A text/corpus.
An offset ∆ (e.g. 5 words).
In a co-occurrence matrix with |V| rows, |V| columns:
The (i, j)th value indicates the number of times words i and j
co-occur within the given offset ∆.
BUILDING A CO-OCCURRENCE MATRIX
13
Examples (∆ = 2 words):
We need to tackle fake news to keep society informed.
How can we build a classifier to deal with fake news?
Fake co-occurs with: to(2), news(2), deal(1), tackle(1), with(1)
Deal (with) and tackle are different tokens for us.
Frequent occurrence in similar contexts will indicate similarity.
BUILDING A CO-OCCURRENCE MATRIX
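A sketch of building the co-occurrence matrix described above (function and variable names are made up; sentences are assumed lower-cased and pre-tokenised), reproducing the counts for “fake” with ∆ = 2:

import numpy as np

def cooccurrence_matrix(sentences, delta=2):
    # |V| x |V| matrix: entry (i, j) counts co-occurrences of words i and j within delta words.
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - delta), min(len(s), i + delta + 1)):
                if j != i:
                    X[idx[w], idx[s[j]]] += 1
    return X, vocab

sentences = ['we need to tackle fake news to keep society informed'.split(),
             'how can we build a classifier to deal with fake news'.split()]
X, vocab = cooccurrence_matrix(sentences, delta=2)
# fake co-occurs with: to(2), news(2), tackle(1), deal(1), with(1)
print({w: int(X[vocab.index('fake'), vocab.index(w)]) for w in ['to', 'news', 'tackle', 'deal', 'with']})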
14
The table will be huge (and sparse) for large |V| (vocabularies).
We need to reduce the dimensionality.
WORD EMBEDDINGS: WORD-WORD MATRIX
15
SVD: Singular Value Decomposition
We build a co-occurrence matrix X (|V| × |V|) with offset ∆.
We use SVD to decompose X as X = U S Vᵀ, where:
U (|V| × r) and V (|V| × r) are unitary matrices, and
S (r × r) is a diagonal matrix.
The rows of U (truncated to the first k columns, i.e. the leading left
singular vectors) are then the word embeddings of the vocabulary.
WORD EMBEDDINGS: SVD METHODS
16
WORD EMBEDDINGS: SVD METHODS
We get |V| vectors of k dimensions each: word embeddings
e.g. word embedding of word w:
WE(w) = {v_1, v_2, …, v_k}
We’ve reduced w’s dimensionality from |V| to k.
17
SVD EXAMPLE IN PYTHON
∆ = 1
like & I co-occur twice
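The code screenshot for this slide isn’t reproduced in the handout; below is a hedged numpy sketch of what it plausibly does, using the corpus from the next slides (punctuation stripped), ∆ = 1, and a plot of the first two dimensions anticipating the plotting slides:

import numpy as np
import matplotlib.pyplot as plt

corpus = ['I like NLP'.split(), 'I like deep learning'.split(), 'I enjoy flying'.split()]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence matrix with offset ∆ = 1 ("like" and "I" co-occur twice).
X = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    for i in range(len(s) - 1):
        X[idx[s[i]], idx[s[i + 1]]] += 1
        X[idx[s[i + 1]], idx[s[i]]] += 1

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Plot the first two dimensions of U: each word becomes a point in 2D.
for w, i in idx.items():
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(w, (U[i, 0], U[i, 1]))
plt.show()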
18
PLOTTING SVD EXAMPLE IN PYTHON
Corpus: I like NLP. I like deep learning. I enjoy flying.
19
PLOTTING SVD EXAMPLE IN PYTHON
Corpus: I like NLP. I like deep learning. I enjoy flying.
NLP and deep aren’t directly connected in the corpus (∆ is 1),
but they have a common context word (like).
20
COMPUTING WORD SIMILARITY
Corpus: I like NLP. I like deep learning. I enjoy flying.
We can compute the similarity between w_i and w_j by comparing
U[i, 0:k] and U[j, 0:k].
21
COMPUTING WORD SIMILARITY
Given 2 words w_1 and w_2, similarity is computed as the dot/inner product, which equates to:
|w_1| * |w_2| * cos(θ)
where |w_1| and |w_2| are the lengths of the two vectors and cos(θ) is the cosine of the angle between them.
22
COMPUTING WORD SIMILARITY
Given 2 words w_1 and w_2, similarity is computed as the dot/inner product, which equates to:
|w_1| * |w_2| * cos(θ)
High similarity for:
near-parallel vectors with high values in the same dimensions.
Low similarity for:
orthogonal vectors, or vectors with low values.
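A small sketch (not from the slides) of the dot-product/cosine similarity described above:

import numpy as np

def cosine_similarity(w1, w2):
    # dot(w1, w2) = |w1| * |w2| * cos(theta); dividing by the lengths leaves cos(theta).
    return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))   # 1.0, parallel vectors
print(cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))   # 0.0, orthogonal vectors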
23
PROS AND CONS OF SVD
Pro: has been shown to perform well in a number of tasks.
Con: dimensions need to change as new words are added to the
corpus, which is costly.
Con: resulting vectors can still be high-dimensional and sparse.
Con: quadratic cost to perform SVD.
ALTERNATIVES TO SVD
25
ALTERNATIVE: ITERATION BASED METHODS
Low-dimensional, dense vectors instead of high-dimensional,
sparse vectors.
Instead of computing co-occurrences from the entire corpus, predict
the surrounding words within a window of length c of every word.
Rely on an update rule that can be applied iteratively.
This will be faster and can easily incorporate a new
sentence/document or add a word to the vocabulary.
This is the idea behind word2vec (Mikolov et al., 2013).
26
WORD2VEC: CBOW AND SKIPGRAM MODELS
Continuous bag-of-words model (CBOW): a language model
where we approximate a word from its left and right context
within a window of size c.
i.e. from the context to the word.
Skip-gram model: a language model where we approximate the
words surrounding a given word within a window of size c to the
left and right of that word.
i.e. from the word to the context.
CBOW and skip-gram are the reverse of each other.
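In gensim (the library used later in these slides), the choice between the two architectures is a single parameter; a hedged sketch, assuming the older gensim API used in these slides (where the dimensionality parameter is called size) and a toy corpus:

from gensim.models import Word2Vec

sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence']]

# sg=0 selects CBOW (context -> word), sg=1 selects skip-gram (word -> context);
# window corresponds to the window size c.
cbow = Word2Vec(sentences, sg=0, window=5, min_count=1, size=100)
skipgram = Word2Vec(sentences, sg=1, window=5, min_count=1, size=100)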
27
WORD2VEC: WHY IS IT COOL?
They are very good for encoding similarity.
28
WORD2VEC: WHY IS IT COOL?
They are very good for inferring word relations:
v(‘Paris’) – v(‘France’) + v(‘Italy’) = v(‘Rome’)
v(‘king’) – v(‘man’) + v(‘woman’) = v(‘queen’)
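With gensim, these analogy queries can be written as follows (a sketch assuming `model` is a Word2Vec model trained on a sufficiently large corpus; the scores shown are purely illustrative):

# v('Paris') - v('France') + v('Italy') ≈ v('Rome')
print(model.wv.most_similar(positive=['Paris', 'Italy'], negative=['France'], topn=1))
# e.g. [('Rome', 0.69)]

# v('king') - v('man') + v('woman') ≈ v('queen')
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# e.g. [('queen', 0.71)]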
29
PROS AND CONS: ITERATION BASED METHODS
Pro: do not need to operate on the entire corpus, which involves very
sparse matrices.
Pro: can capture semantic properties of words as linear
relationships between word vectors.
Pro: fast, and can easily be updated with new sentences
(complexity in the order of O(|C|)).
Con: can’t take into account the vast amount of repetition in the
data.
30
ANOTHER ALTERNATIVE: GLOVE
GloVe (Pennington et al., 2014) is a count-based method that does
dimensionality reduction and performs similarly to word2vec.
Does matrix factorisation.
Can leverage repetitions in the corpus, as it uses the entire word
co-occurrence matrix.
How? Train on the non-zero entries of a global word co-occurrence
matrix built from a corpus, rather than on the entire sparse matrix or
on local word contexts.
31
GLOVE
Computationally expensive the first time, but then much faster, as the
number of non-zero entries is much smaller than the number of words
in the corpus.
The intuition is that relationships between words should be
explored in terms of the ratios of their co-occurrence
probabilities with some probe words k.
35
GLOVE: VISUALISATION
Want to play around?
https://lamyiowce.github.io/word2viz/
36
PYTHON: USING WORD2VEC
Preparing the input:
Word2Vec takes lists of lists of words (lists of sentences) as input.
e.g.:
sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence'],
             ['another', 'sentence'],
             ['and', 'this', 'is', 'the', 'last', 'one']]
37
PYTHON: USING WORD2VEC
Training the model:
model = Word2Vec(sentences, min_count=10, size=300, workers=4)
We will only train vectors for words occurring 10+ times in the corpus (min_count=10).
We want to produce word vectors of 300 dimensions (size=300).
We want to parallelise the task running 4 worker processes (workers=4).
38
PYTHON: USING WORD2VEC
It’s memory intensive!
It stores matrices of #vocabulary (dependent on min_count) × #size
(the size parameter) floats (single precision, i.e. 4 bytes each).
Three such matrices are held in RAM. If you have:
100,000 unique words and size=200, the model will require approx.:
100,000 * 200 * 4 * 3 bytes = ~229 MB.
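The arithmetic above as a quick check (a sketch; the numbers are the ones from this slide):

vocab_size, size = 100_000, 200
bytes_needed = vocab_size * size * 4 * 3   # 3 float32 matrices of |V| x size
print(bytes_needed / 1024**2)              # ~228.9 MB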
39
PYTHON: USING WORD2VEC
Evaluation:
It’s unsupervised, so there is no intrinsic way of evaluating it.
Extrinsic evaluation: test your model in a text classification,
sentiment analysis, machine translation,… task!
● Does it outperform other methods (e.g. bag-of-words)?
● Compare two models A and B: which one’s better?
40
PYTHON: USING WORD2VEC
Loading a stored model:
model = Word2Vec.load_word2vec_format('mymodel.txt', binary=False)
or
model = Word2Vec.load_word2vec_format('mymodel.bin.gz', binary=True)
Resuming training (only possible with a full model stored via model.save(), not the word2vec format above):
model = gensim.models.Word2Vec.load('mymodel')
model.train(more_sentences, total_examples=len(more_sentences), epochs=model.iter)
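The storing side isn’t shown above; a hedged sketch using gensim methods from the same era of the library (file names are placeholders):

# Store only the vectors in word2vec text/binary format (counterpart of load_word2vec_format).
model.wv.save_word2vec_format('mymodel.txt', binary=False)

# Store the full model, which is what allows training to be resumed later.
model.save('mymodel')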
41
PYTHON: USING WORD2VEC
42
PYTHON: USING WORD2VEC
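The code on this slide isn’t reproduced in the handout; a minimal sketch of the lookup it presumably performs (in newer gensim the lookup goes through model.wv; older versions also accept model['computer']):

v = model.wv['computer']   # the vector learned for 'computer'
print(v)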
This will give us the vector representation of ‘computer’:
● v(‘computer’) = {-0.00449447, -0.00310097, …}
How do we then get the vector representations for sentences, e.g.:
● I have installed Ubuntu on my computer
43
PYTHON: USING WORD2VEC
Vector representations for sentences, e.g.:
● I have installed Ubuntu on my computer
Standard practice is either of:
● Summing word vectors (they have the same dimensionality):
v(‘I’) + v(‘have’) + v(‘installed’) + v(‘Ubuntu’) + …
● Getting the average of word vectors:
(v(‘I’) + v(‘have’) + v(‘installed’) + …) / 7
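A sketch of both options with numpy (assuming `model` is a trained Word2Vec model whose vocabulary covers all the tokens):

import numpy as np

tokens = ['I', 'have', 'installed', 'Ubuntu', 'on', 'my', 'computer']
vectors = [model.wv[w] for w in tokens]

sentence_sum = np.sum(vectors, axis=0)    # summed word vectors
sentence_avg = np.mean(vectors, axis=0)   # averaged word vectors (sum divided by 7)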
44
PRE-TRAINED WORD VECTORS
One can train a model from a large corpus (millions, if not billions,
of sentences). This can be time-consuming and memory-intensive.
Pre-trained models are available.
Remember to choose a suitable pre-trained model.
Don’t use word vectors pre-trained on news articles when
you’re working with social media!
45
PRE-TRAINED WORD VECTORS
GloVe’s pre-trained vectors:
https://nlp.stanford.edu/projects/glove/
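A hedged sketch of loading GloVe’s pre-trained vectors with gensim by first converting them to word2vec format (the file name is an example from the GloVe download page, not from the slides):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert the GloVe text format to word2vec format, then load it as read-only vectors.
glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.w2v.txt')
vectors = KeyedVectors.load_word2vec_format('glove.6B.300d.w2v.txt', binary=False)
print(vectors.most_similar('king', topn=3))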
46
PRE-TRAINED WORD VECTORS
Pre-trained word vectors for 30+ languages (from Wikipedia):
https://github.com/Kyubyong/wordvectors
47
PRE-TRAINED WORD VECTORS
UK Twitter word embeddings:
https://figshare.com/articles/UK_Twitter_word_embeddings_II_/5791650
48
REFERENCES
Gensim (word2vec):
https://radimrehurek.com/gensim/
Word2vec tutorial:
https://rare-technologies.com/word2vec-tutorial/
FastText:
https://github.com/facebookresearch/fastText/
GloVe: Global Vectors for Word Representation:
https://nlp.stanford.edu/projects/glove/
49
ASSOCIATED READING
Not yet part of Jurafsky’s book.
See Deep learning for NLP CS224d lectures 1 and 2.
http://cs224d.stanford.edu/syllabus.html
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems (pp. 3111-3119).
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global
vectors for word representation. In Proceedings of the 2014
conference on empirical methods in natural language processing
(EMNLP) (pp. 1532-1543).