LECTURE 6
Vector Representation and Models for Word Embeddings
Arkaitz Zubiaga, 24th January, 2018
2
Vector space models for language representation.
Word embeddings.
SVD: Singular Value Decomposition.
Iteration-based models.
CBOW and skip-gram models.
Word2Vec and GloVe.
LECTURE 6: CONTENTS
VECTOR SPACE MODELS
4
Goal: compute the probability of a sequence of words:
P(W) = P(w_1, w_2, w_3, w_4, w_5, …, w_n)
Related task: probability of an upcoming word:
P(w_5 | w_1, w_2, w_3, w_4)
Both of the above are language models.
RECAP: STATISTICAL LANGUAGE MODELS
5
So far, we have viewed words as (sequences of) atomic symbols.
We have used edit distance to compute similarity.
N-grams & LMs → what may follow/precede the word?
But this doesn’t tell us anything about semantic similarity, e.g.:
Is “Chinese” closer to “Asian” or to “English”?
Are “king” & “queen” more related than “doctor” & “mountain”?
WORDS AS ATOMIC SYMBOLS
6
We may identify significant similarity based on word overlap between:
“Facebook to fight ‘fake news’ by asking users to rank trust in media outlets”
“Facebook’s latest fix for fake news: ask users what they trust”
But we’ll fail when there isn’t an overlap:
“Zuckerberg announces new feature that crowdsources trustworthiness of news organisations”
WORDS AS ATOMIC SYMBOLS
NO OVERLAP (even using a stemmer/lemmatiser)
7
Likewise for text classification, e.g.:
If a classifier learns that:
“Leicester will welcome back Jamie Vardy for their Premier League clash with Watford”
belongs to the class/topic “sport”,
we’ll fail to also classify the following as “sport”:
“Blind Cricket World Cup: India beat Pakistan by two wickets in thrilling final to retain title”
WORDS AS ATOMIC SYMBOLS
NO OVERLAP
8
Assumptions:
We can represent words as vectors of some dimension.
There is some N-dimensional space that is enough for encoding
language semantics.
Each dimension has some semantic meaning, unknown a priori, but it
could be e.g.:
Whether it is an object/concept/person.
Gender of person.
…
WORD VECTORS OR EMBEDDINGS
9
Each word represented as a vector of dimension |V| (the vocabulary size), with a 1 in the position of that word and 0s elsewhere.
WORD VECTORS: ONE-HOT OR BINARY MODEL
10
Each word represented as a vector of dimension |V| (the vocabulary size).
Still no notion of similarity, e.g. the dot product of any two different one-hot vectors is 0.
Solution: reduce the dimensionality of the vector space.
WORD VECTORS: ONE-HOT OR BINARY MODEL
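As a quick illustration (not from the slides), a minimal numpy sketch of one-hot vectors and why they carry no similarity information; the toy vocabulary is made up:

import numpy as np

# Toy vocabulary, assumed for illustration only.
vocab = ['king', 'queen', 'doctor', 'mountain']
word_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # |V|-dimensional vector with a single 1 at the word's position
    v = np.zeros(len(vocab))
    v[word_index[word]] = 1.0
    return v

# The dot product of any two different one-hot vectors is 0: no notion of similarity.
print(np.dot(one_hot('king'), one_hot('queen')))   # 0.0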
11
Bag-of-words: v = {|w_1|, |w_2|, …, |w_n|}
Toy example: hello world hello world hello I like chocolate
v = {3, 2, 1, 1, 1} (counts of hello, world, I, like, chocolate)
Widely used, but largely being replaced by word embeddings.
Con: inefficient for large vocabularies.
Con: doesn’t capture semantics (each word is an unrelated token).
BAG-OF-WORDS MODEL
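A minimal sketch (not from the slides) of the bag-of-words count vector for the toy example above, taking the vocabulary in order of first appearance:

from collections import Counter

corpus = 'hello world hello world hello I like chocolate'
tokens = corpus.split()
vocab = list(dict.fromkeys(tokens))   # vocabulary in order of first appearance
counts = Counter(tokens)
v = [counts[w] for w in vocab]
print(vocab)   # ['hello', 'world', 'I', 'like', 'chocolate']
print(v)       # [3, 2, 1, 1, 1]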
12
Given as input:
A text/corpus.
An offset ∆ (e.g. 5 words).
In a co-occurrence matrix with |V| rows, |V| columns:
The (i, j)th value indicates the number of times words i and j
co-occur within the given offset ∆.
BUILDING A CO-OCCURRENCE MATRIX
13
Examples (∆ = 2 words):
We need to tackle fake news to keep society informed.
How can we build a classifier to deal with fake news?
Fake co-occurs with: to(2), news(2), deal(1), tackle(1), with(1)
Deal (with) and tackle are different tokens for us.
Frequent occurrence in similar contexts will indicate similarity.
BUILDING A CO-OCCURRENCE MATRIX
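A sketch of building the co-occurrence matrix described above (function and variable names are made up; sentences are assumed lower-cased and pre-tokenised), reproducing the counts for “fake” with ∆ = 2:

import numpy as np

def cooccurrence_matrix(sentences, delta=2):
    # |V| x |V| matrix: entry (i, j) counts co-occurrences of words i and j within delta words.
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - delta), min(len(s), i + delta + 1)):
                if j != i:
                    X[idx[w], idx[s[j]]] += 1
    return X, vocab

sentences = ['we need to tackle fake news to keep society informed'.split(),
             'how can we build a classifier to deal with fake news'.split()]
X, vocab = cooccurrence_matrix(sentences, delta=2)
# fake co-occurs with: to(2), news(2), tackle(1), deal(1), with(1)
print({w: int(X[vocab.index('fake'), vocab.index(w)]) for w in ['to', 'news', 'tackle', 'deal', 'with']})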
14
The table will be huge (and sparse) for large |V| (vocabularies).
We need to reduce the dimensionality.
WORD EMBEDDINGS: WORD-WORD MATRIX
15
SVD: Singular Value Decomposition
We build a co-occurrence matrix X (|V| × |V|) with offset ∆.
We use SVD to decompose X as X = U S Vᵀ, where:
U (|V| × r) and V (|V| × r) are unitary matrices, and
S (r × r) is a diagonal matrix.
The rows of U (truncated to the first k columns, i.e. the leading left
singular vectors) are then the word embeddings of the vocabulary.
WORD EMBEDDINGS: SVD METHODS
16
WORD EMBEDDINGS: SVD METHODS
We get |V| vectors of k dimensions each: word embeddings
e.g. word embedding of word w:
WE(w) = {v_1, v_2, …, v_k}
We’ve reduced w’s dimensionality from |V| to k.
17
SVD EXAMPLE IN PYTHON
∆ = 1
like & I co-occur twice
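The code screenshot for this slide isn’t reproduced in the handout; below is a hedged numpy sketch of what it plausibly does, using the corpus from the next slides (punctuation stripped), ∆ = 1, and a plot of the first two dimensions anticipating the plotting slides:

import numpy as np
import matplotlib.pyplot as plt

corpus = ['I like NLP'.split(), 'I like deep learning'.split(), 'I enjoy flying'.split()]
vocab = sorted({w for s in corpus for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence matrix with offset ∆ = 1 ("like" and "I" co-occur twice).
X = np.zeros((len(vocab), len(vocab)))
for s in corpus:
    for i in range(len(s) - 1):
        X[idx[s[i]], idx[s[i + 1]]] += 1
        X[idx[s[i + 1]], idx[s[i]]] += 1

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Plot the first two dimensions of U: each word becomes a point in 2D.
for w, i in idx.items():
    plt.scatter(U[i, 0], U[i, 1])
    plt.annotate(w, (U[i, 0], U[i, 1]))
plt.show()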
18
PLOTTING SVD EXAMPLE IN PYTHON
Corpus: I like NLP. I like deep learning. I enjoy flying.
19
PLOTTING SVD EXAMPLE IN PYTHON
Corpus: I like NLP. I like deep learning. I enjoy flying.
NLP and deep aren’t directly connected in the corpus (∆ is 1),
but they have a common context word (like).
20
COMPUTING WORD SIMILARITY
Corpus: I like NLP. I like deep learning. I enjoy flying.
We can compute the similarity between w_i and w_j by comparing
U[i, 0:k] and U[j, 0:k].
21
COMPUTING WORD SIMILARITY
Given 2 words w_1 and w_2, similarity is computed as the dot/inner product, which equates to:
|w_1| * |w_2| * cos(θ)
where |w_1| and |w_2| are the lengths of the two vectors and cos(θ) is the cosine of the angle between them.
22
COMPUTING WORD SIMILARITY
Given 2 words w_1 and w_2, similarity is computed as the dot/inner product, which equates to:
|w_1| * |w_2| * cos(θ)
High similarity for:
near-parallel vectors with high values in the same dimensions.
Low similarity for:
orthogonal vectors, or vectors with low values.
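A small sketch (not from the slides) of the dot-product/cosine similarity described above:

import numpy as np

def cosine_similarity(w1, w2):
    # dot(w1, w2) = |w1| * |w2| * cos(theta); dividing by the lengths leaves cos(theta).
    return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

print(cosine_similarity(np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])))   # 1.0, parallel vectors
print(cosine_similarity(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])))   # 0.0, orthogonal vectors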
23
PROS AND CONS OF SVD
Pro: has been shown to perform well in a number of tasks.
Con: dimensions need to change as new words are added to the
corpus, which is costly.
Con: resulting vectors can still be high-dimensional and sparse.
Con: quadratic cost to perform SVD.
ALTERNATIVES TO SVD
25
ALTERNATIVE: ITERATION BASED METHODS
Low-dimensional, dense vectors instead of high-dimensional,
sparse vectors.
Instead of computing co-occurrences from the entire corpus, predict
the surrounding words within a window of length c of every word.
Rely on an update rule that can be applied iteratively.
This will be faster and can easily incorporate a new
sentence/document or add a word to the vocabulary.
This is the idea behind word2vec (Mikolov et al., 2013).
26
WORD2VEC: CBOW AND SKIPGRAM MODELS
Continuous bag-of-words model (CBOW): a language model
where we approximate a word from its left and right context
within a window of size c.
i.e. from the context to the word.
Skip-gram model: a language model where we approximate the
words surrounding a given word within a window of size c to the
left and right of that word.
i.e. from the word to the context.
CBOW and skip-gram are the reverse of each other.
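In gensim (the library used later in these slides), the choice between the two architectures is a single parameter; a hedged sketch, assuming the older gensim API used in these slides (where the dimensionality parameter is called size) and a toy corpus:

from gensim.models import Word2Vec

sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence']]

# sg=0 selects CBOW (context -> word), sg=1 selects skip-gram (word -> context);
# window corresponds to the window size c.
cbow = Word2Vec(sentences, sg=0, window=5, min_count=1, size=100)
skipgram = Word2Vec(sentences, sg=1, window=5, min_count=1, size=100)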
27
WORD2VEC: WHY IS IT COOL?
They are very good for encoding similarity.
28
WORD2VEC: WHY IS IT COOL?
They are very good for inferring word relations:
v(‘Paris’) – v(‘France’) + v(‘Italy’) = v(‘Rome’)
v(‘king’) – v(‘man’) + v(‘woman’) = v(‘queen’)
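With gensim, these analogy queries can be written as follows (a sketch assuming `model` is a Word2Vec model trained on a sufficiently large corpus; the scores shown are purely illustrative):

# v('Paris') - v('France') + v('Italy') ≈ v('Rome')
print(model.wv.most_similar(positive=['Paris', 'Italy'], negative=['France'], topn=1))
# e.g. [('Rome', 0.69)]

# v('king') - v('man') + v('woman') ≈ v('queen')
print(model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))
# e.g. [('queen', 0.71)]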
29
PROS AND CONS: ITERATION BASED METHODS
Pro: do not need to operate on the entire corpus, which involves very
sparse matrices.
Pro: can capture semantic properties of words as linear
relationships between word vectors.
Pro: fast, and can easily be updated with new sentences
(complexity in the order of O(|C|)).
Con: can’t take into account the vast amount of repetition in the
data.
30
ANOTHER ALTERNATIVE: GLOVE
GloVe (Pennington et al., 2014) is a count-based method that does
dimensionality reduction and performs similarly to word2vec.
Does matrix factorisation.
Can leverage repetitions in the corpus, as it uses the entire word
co-occurrence matrix.
How? Train on the non-zero entries of a global word co-occurrence
matrix built from a corpus, rather than on the entire sparse matrix or
on local word contexts.
31
GLOVE
Computationally expensive the first time, but then much faster, as the
number of non-zero entries is much smaller than the number of words
in the corpus.
The intuition is that relationships between words should be
explored in terms of the ratios of their co-occurrence
probabilities with some probe words k.
35
GLOVE: VISUALISATION
Want to play around?
https://lamyiowce.github.io/word2viz/
36
PYTHON: USING WORD2VEC
Preparing the input:
Word2Vec takes lists of lists of words (lists of sentences) as input.
e.g.:
sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence'],
             ['another', 'sentence'],
             ['and', 'this', 'is', 'the', 'last', 'one']]
37
PYTHON: USING WORD2VEC
Training the model:
model = Word2Vec(sentences, min_count=10, size=300, workers=4)
We will only train vectors for words occurring 10+ times in the corpus (min_count=10).
We want to produce word vectors of 300 dimensions (size=300).
We want to parallelise the task running 4 worker processes (workers=4).
38
PYTHON: USING WORD2VEC
It’s memory intensive!
It stores matrices of #vocabulary (dependent on min_count) × #size
(the size parameter) floats (single precision, i.e. 4 bytes each).
Three such matrices are held in RAM. If you have:
100,000 unique words and size=200, the model will require approx.:
100,000 * 200 * 4 * 3 bytes = ~229 MB.
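The arithmetic above as a quick check (a sketch; the numbers are the ones from this slide):

vocab_size, size = 100_000, 200
bytes_needed = vocab_size * size * 4 * 3   # 3 float32 matrices of |V| x size
print(bytes_needed / 1024**2)              # ~228.9 MB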
39
PYTHON: USING WORD2VEC
Evaluation:
It’s unsupervised, so there is no intrinsic way of evaluating it.
Extrinsic evaluation: test your model in a text classification,
sentiment analysis, machine translation,… task!
● Does it outperform other methods (e.g. bag-of-words)?
● Compare two models A and B: which one’s better?
40
PYTHON: USING WORD2VEC
Loading a stored model:
model = Word2Vec.load_word2vec_format('mymodel.txt', binary=False)
or
model = Word2Vec.load_word2vec_format('mymodel.bin.gz', binary=True)
Resuming training (only possible with a full model stored via model.save(), not the word2vec format above):
model = gensim.models.Word2Vec.load('mymodel')
model.train(more_sentences, total_examples=len(more_sentences), epochs=model.iter)
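The storing side isn’t shown above; a hedged sketch using gensim methods from the same era of the library (file names are placeholders):

# Store only the vectors in word2vec text/binary format (counterpart of load_word2vec_format).
model.wv.save_word2vec_format('mymodel.txt', binary=False)

# Store the full model, which is what allows training to be resumed later.
model.save('mymodel')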
41
PYTHON: USING WORD2VEC
42
PYTHON: USING WORD2VEC
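The code on this slide isn’t reproduced in the handout; a minimal sketch of the lookup it presumably performs (in newer gensim the lookup goes through model.wv; older versions also accept model['computer']):

v = model.wv['computer']   # the vector learned for 'computer'
print(v)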
This will give us the vector representation of ‘computer’:
● v(‘computer’) = {-0.00449447, -0.00310097, …}
How do we then get the vector representations for sentences, e.g.:
● I have installed Ubuntu on my computer
43
PYTHON: USING WORD2VEC
Vector representations for sentences, e.g.:
● I have installed Ubuntu on my computer
Standard practice is either of:
● Summing word vectors (they have the same dimensionality):
v(‘I’) + v(‘have’) + v(‘installed’) + v(‘Ubuntu’) + …
● Getting the average of word vectors:
(v(‘I’) + v(‘have’) + v(‘installed’) + …) / 7
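A sketch of both options with numpy (assuming `model` is a trained Word2Vec model whose vocabulary covers all the tokens):

import numpy as np

tokens = ['I', 'have', 'installed', 'Ubuntu', 'on', 'my', 'computer']
vectors = [model.wv[w] for w in tokens]

sentence_sum = np.sum(vectors, axis=0)    # summed word vectors
sentence_avg = np.mean(vectors, axis=0)   # averaged word vectors (sum divided by 7)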
44
PRE-TRAINED WORD VECTORS
One can train a model from a large corpus (millions, if not billions,
of sentences). This can be time-consuming and memory-intensive.
Pre-trained models are available.
Remember to choose a suitable pre-trained model.
Don’t use word vectors pre-trained on news articles when
you’re working with social media!
45
PRE-TRAINED WORD VECTORS
GloVe’s pre-trained vectors:
https://nlp.stanford.edu/projects/glove/
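A hedged sketch of loading GloVe’s pre-trained vectors with gensim by first converting them to word2vec format (the file name is an example from the GloVe download page, not from the slides):

from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# Convert the GloVe text format to word2vec format, then load it as read-only vectors.
glove2word2vec('glove.6B.300d.txt', 'glove.6B.300d.w2v.txt')
vectors = KeyedVectors.load_word2vec_format('glove.6B.300d.w2v.txt', binary=False)
print(vectors.most_similar('king', topn=3))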
46
PRE-TRAINED WORD VECTORS
Pre-trained word vectors for 30+ languages (from Wikipedia):
https://github.com/Kyubyong/wordvectors
47
PRE-TRAINED WORD VECTORS
UK Twitter word embeddings:
https://figshare.com/articles/UK_Twitter_word_embeddings_II_/5791650
48
REFERENCES
Gensim (word2vec):
https://radimrehurek.com/gensim/
Word2vec tutorial:
https://rare-technologies.com/word2vec-tutorial/
FastText:
https://github.com/facebookresearch/fastText/
GloVe: Global Vectors for Word Representation:
https://nlp.stanford.edu/projects/glove/
49
ASSOCIATED READING
Not yet part of Jurafsky’s book.
See Deep learning for NLP CS224d lectures 1 and 2.
http://cs224d.stanford.edu/syllabus.html
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).
Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems (pp. 3111-3119).
Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global
vectors for word representation. In Proceedings of the 2014
conference on empirical methods in natural language processing
(EMNLP) (pp. 1532-1543).