
LECTURE 6

Vector Representation and Models for Word Embeddings

Arkaitz Zubiaga, 24th January 2018

2

 Vector space models for language representation.

 Word embeddings.

 SVD: Singular Value Decomposition.

 Iteration-based models.

 CBOW and skip-gram models.

 Word2Vec and GloVe.

LECTURE 6: CONTENTS

VECTOR SPACE MODELS

4

 Goal: compute the probability of a sequence of words:

P(W) = P(w_1, w_2, w_3, w_4, w_5, …, w_n)

 Related task: probability of an upcoming word:

 P(w_5 | w_1, w_2, w_3, w_4)

 Both of the above are language models.

RECAP: STATISTICAL LANGUAGE MODELS

5

 So far, we have viewed words as (sequences of) atomic symbols.

 We have used edit distance to compute similarity.

 N-grams & LMs → what may follow/precede the word?

 But this doesn’t tell us anything about semantic similarity, e.g.:

 Is “Chinese” closer to “Asian” or to “English”?

 Are “king” & “queen” more related than “doctor” &
“mountain”?

WORDS AS ATOMIC SYMBOLS

6

 We may identify significant similarity based on word overlap between:

 “Facebook to fight ‘fake news’ by asking users to rank trust in media outlets”

 “Facebook’s latest fix for fake news: ask users what they trust”

 But we’ll fail when there isn’t an overlap:

 “Zuckerberg announces new feature that crowdsources trustworthiness of news
organisations”

WORDS AS ATOMIC SYMBOLS

NO OVERLAP

Using stemmer/lemmatiser

7

 Likewise for text classification, e.g.:

 If the classifier learns that:

“Leicester will welcome back Jamie Vardy for their Premier League clash with Watford”

belongs to the class/topic “sport”

 We’ll fail to classify the following also as “sport”:

“Blind Cricket World Cup: India beat Pakistan by two wickets in thrilling final to retain title”

WORDS AS ATOMIC SYMBOLS

NO OVERLAP

8

 Assumptions:

 We can represent words as vectors of some dimension.

 There is some N-dimensional space that is sufficient for encoding
language semantics.

 Each dimension has some semantic meaning, unknown a
priori, but could be e.g.:

 Whether it is an object/concept/person.

 Gender of a person.

 …

WORD VECTORS OR EMBEDDINGS

9

 Each word is represented as a one-hot vector of dimension |V| (the vocabulary size).

WORD VECTORS: ONE-HOT OR BINARY MODEL

10

 Each word is represented as a one-hot vector of dimension |V| (the vocabulary size).

 Still no notion of similarity, e.g. the dot product of any two distinct one-hot vectors is 0.

 Solution: reduce the dimensionality of the vector space.

WORD VECTORS: ONE-HOT OR BINARY MODEL
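A minimal sketch (not from the slides) of one-hot vectors for a hypothetical three-word vocabulary, illustrating why this representation carries no notion of similarity:

import numpy as np

# Hypothetical toy vocabulary; each word gets a |V|-dimensional one-hot vector.
vocab = ["hotel", "motel", "cat"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

print(one_hot("hotel"))                     # [1. 0. 0.]
print(one_hot("hotel") @ one_hot("motel"))  # 0.0: any two distinct words have similarity 0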

11

 Bag-of-words: v = {|w_1|, |w_2|, …, |w_n|}

 Toy example: hello world hello world hello I like chocolate
v = {3, 2, 1, 1, 1} for the vocabulary (hello, world, I, like, chocolate); see the sketch below.

 Widely used, but largely being replaced by word embeddings.

 Con: inefficient for large vocabularies.

 Con: doesn’t capture semantics (each word is an unrelated token).

BAG-OF-WORDS MODEL
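A minimal sketch (not from the slides) of the bag-of-words vector for the toy example above, with the vocabulary ordered by first occurrence:

from collections import Counter

text = "hello world hello world hello I like chocolate"
tokens = text.split()

vocab = sorted(set(tokens), key=tokens.index)  # vocabulary in order of first occurrence
counts = Counter(tokens)                       # word frequencies

v = [counts[w] for w in vocab]
print(vocab)  # ['hello', 'world', 'I', 'like', 'chocolate']
print(v)      # [3, 2, 1, 1, 1]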

12

 Given as input:

 A text/corpus.

 An offset ∆ (e.g. 5 words)

 In a co-occurrence matrix with |V| rows, |V| columns:

 The (i, j)th value indicates the number of times words i and j
co-occur within the given offset ∆.

BUILDING A CO-OCCURRENCE MATRIX

13

 Examples (∆ = 2 words):
 We need to tackle fake news to keep society informed.

 How can we build a classifier to deal with fake news?

 Fake co-occurs with: to(2), news(2), deal(1), tackle(1), with(1)

 Deal (with) and tackle are different tokens for us.
 Frequent occurrence in similar contexts will indicate similarity. (See the sketch below.)

BUILDING A CO-OCCURRENCE MATRIX
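A minimal sketch (not the lecture's code) of building such a co-occurrence matrix over the two example sentences, with offset ∆ = 2:

import numpy as np

def cooccurrence_matrix(sentences, delta=2):
    # |V| x |V| matrix: entry (i, j) counts how often words i and j appear
    # within `delta` positions of each other.
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - delta), min(len(s), i + delta + 1)):
                if j != i:
                    X[index[w], index[s[j]]] += 1
    return X, vocab

sentences = [
    "we need to tackle fake news to keep society informed".split(),
    "how can we build a classifier to deal with fake news".split(),
]
X, vocab = cooccurrence_matrix(sentences, delta=2)
fake = vocab.index("fake")
print({vocab[j]: int(X[fake, j]) for j in np.nonzero(X[fake])[0]})
# {'deal': 1, 'news': 2, 'tackle': 1, 'to': 2, 'with': 1}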

14

 The table will be huge (and sparse) for large |V| (vocabularies).

 We need to reduce the dimensionality.

WORD EMBEDDINGS: WORD-WORD MATRIX

15

 SVD: Singular Value Decomposition

 We build a co-occurrence matrix X (|V| × |V|) with offset ∆.
 We use SVD to decompose X as X = U S Vᵀ, where:

 U (|V| × r) and V (|V| × r) are unitary matrices, and

 S (r × r) is a diagonal matrix.

 Each word’s embedding is then given by its row of U, truncated to the
first k columns (i.e. the top-k left singular vectors).

WORD EMBEDDINGS: SVD METHODS

16

WORD EMBEDDINGS: SVD METHODS

We get |V| vectors of k dimensions each: word embeddings,
e.g. the word embedding of word w:

WE(w) = {v_1, v_2, …, v_k}

We’ve reduced w’s dimensionality from |V| to k.

17

SVD EXAMPLE IN PYTHON

∆ = 1

like & I co-occur twice
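A sketch of an SVD example in Python along these lines, assuming the corpus used on the following slides and ∆ = 1:

import numpy as np

corpus = ["I like NLP", "I like deep learning", "I enjoy flying"]
sentences = [s.split() for s in corpus]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}

# Co-occurrence matrix with offset delta = 1 (adjacent words only).
X = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for i, w in enumerate(s):
        for j in (i - 1, i + 1):
            if 0 <= j < len(s):
                X[index[w], index[s[j]]] += 1

# Decompose X with SVD; rows of U (truncated to k columns) are the word embeddings.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
print(X[index["like"], index["I"]])  # 2.0: 'like' & 'I' co-occur twice
print(U[index["like"], :2])          # 2-dimensional embedding of 'like'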

18

PLOTTING SVD EXAMPLE IN PYTHON

 Corpus: I like NLP. I like deep learning. I enjoy flying.
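Continuing from the SVD sketch on the previous slide (reusing U and index), a sketch of plotting the 2-dimensional word embeddings with matplotlib:

import matplotlib.pyplot as plt

for word, i in index.items():
    plt.scatter(U[i, 0], U[i, 1])           # position given by the first two columns of U
    plt.annotate(word, (U[i, 0], U[i, 1]))  # label each point with its word
plt.xlabel("dimension 1")
plt.ylabel("dimension 2")
plt.show()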

19

PLOTTING SVD EXAMPLE IN PYTHON

 Corpus: I like NLP. I like deep learning. I enjoy flying.

NLP and deep aren’t directly
connected in the corpus (∆ is 1),
but have common context (like)

20

COMPUTING WORD SIMILARITY

 Corpus: I like NLP. I like deep learning. I enjoy flying.

We can compute the similarity
between w_i and w_j by comparing:

U[i, 0:k] and U[j, 0:k]

21

COMPUTING WORD SIMILARITY

 Given 2 words w_1 and w_2, similarity is computed as:

 the dot/inner product, which equals:
|w_1| * |w_2| * cos(θ)

(cos(θ) is the cosine of the angle between the vectors; |w_1| and |w_2| are the lengths of w_1 and w_2)

22

COMPUTING WORD SIMILARITY

 Given 2 words w_1 and w_2, similarity is computed as:

 the dot/inner product, which equals:
|w_1| * |w_2| * cos(θ)

 High similarity for:
near-parallel vectors with high values in the same dimensions.

 Low similarity for:
orthogonal vectors or low-magnitude vectors.
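A minimal sketch (not from the slides) of cosine similarity between word vectors, i.e. the dot product with the vector lengths divided out:

import numpy as np

def cosine_similarity(w1, w2):
    # dot product = |w1| * |w2| * cos(theta); dividing by the lengths leaves cos(theta)
    return np.dot(w1, w2) / (np.linalg.norm(w1) * np.linalg.norm(w2))

a = np.array([0.9, 0.8, 0.1])
b = np.array([0.8, 0.9, 0.2])   # near-parallel to a -> high similarity
c = np.array([-0.8, 0.7, 0.0])  # roughly orthogonal to a -> similarity near 0

print(cosine_similarity(a, b))  # close to 1
print(cosine_similarity(a, c))  # close to 0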

23

PROS AND CONS OF SVD

 Pro: has been shown to perform well in a number of tasks.

 Con: dimensions need to change as new words are added to the
corpus, which is costly.

 Con: the resulting vectors can still be high-dimensional and sparse.

 Con: quadratic cost to perform SVD.

ALTERNATIVES TO SVD

25

ALTERNATIVE: ITERATION-BASED METHODS

 Low-dimensional, dense vectors instead of high-dimensional,
sparse vectors.

 Instead of computing co-occurrences from the entire corpus, predict the
surrounding words in a window of length c around every word.

 Rely on an update rule applied iteratively.

 This is faster and can easily incorporate a new
sentence/document or add a word to the vocabulary.

 This is the idea behind word2vec (Mikolov et al., 2013).

26

WORD2VEC: CBOW AND SKIPGRAM MODELS

 Continuous bag-of-words model (CBOW): a language model
where we predict a word from its left and right context
within a window of size c.

 i.e. from context to the word.

 Skip-gram model: a language model where we predict the
words surrounding a given word within a window of size c to the
left and right of that word.

 i.e. from the word to the context.

 CBOW and skip-gram are the reverse of each other (see the sketch below).
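A minimal sketch (not word2vec itself) of the training pairs the two models derive from a window of size c, to make the "context to word" vs. "word to context" distinction concrete:

def cbow_pairs(tokens, c=2):
    # CBOW: predict the centre word from its surrounding context words.
    pairs = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - c):i] + tokens[i + 1:i + 1 + c]
        pairs.append((context, centre))
    return pairs

def skipgram_pairs(tokens, c=2):
    # Skip-gram: predict each context word from the centre word.
    pairs = []
    for i, centre in enumerate(tokens):
        for w in tokens[max(0, i - c):i] + tokens[i + 1:i + 1 + c]:
            pairs.append((centre, w))
    return pairs

tokens = "facebook to fight fake news".split()
print(cbow_pairs(tokens)[2])       # (['facebook', 'to', 'fake', 'news'], 'fight')
print(skipgram_pairs(tokens)[:2])  # [('facebook', 'to'), ('facebook', 'fight')]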

27

WORD2VEC: WHY IS IT COOL?

 They are very good for encoding similarity.

28

WORD2VEC: WHY IS IT COOL?

 They are very good for inferring word relations, e.g.:

 v(‘Paris’) – v(‘France’) + v(‘Italy’) = v(‘Rome’)

 v(‘king’) – v(‘man’) + v(‘woman’) = v(‘queen’)
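These analogy queries can be run with gensim's most_similar; a sketch assuming `model` is a trained Word2Vec model whose vocabulary contains these words (the file name is hypothetical):

from gensim.models import Word2Vec

model = Word2Vec.load("mymodel")  # hypothetical path to a previously saved model

# v('Paris') - v('France') + v('Italy') ~ v('Rome')
print(model.wv.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))

# v('king') - v('man') + v('woman') ~ v('queen')
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))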

29

PROS AND CONS: ITERATION BASED METHODS

 Pro: no need to operate on the entire corpus, which involves very
sparse matrices.

 Pro: can capture semantic properties of words as linear
relationships between word vectors.

 Pro: fast, and can be easily updated with new sentences
(complexity in the order of O(|C|)).

 Con: can’t take into account the vast amount of repetition in the
data.

30

ANOTHER ALTERNATIVE: GLOVE

 GloVe (Pennington et al., 2014) is a count-based method that does
dimensionality reduction and has performance similar to word2vec.

 Does matrix factorisation.

 Can leverage repetitions in the corpus, as it uses the entire word
co-occurrence matrix.

 How? Train on the non-zero entries of a global word co-occurrence
matrix built from a corpus, rather than on the entire sparse matrix or on
local word contexts.

31

GLOVE

 Computationally expensive the first time, then much faster, as the
number of non-zero entries is much smaller than the number of words in the corpus.

 The intuition is that relationships between words should be
explored in terms of the ratios of their co-occurrence
probabilities with some probe word k.
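For reference, the weighted least-squares objective minimised by GloVe (Pennington et al., 2014), written in LaTeX:

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the co-occurrence count of words i and j, w_i and \tilde{w}_j are word and context vectors, b_i and \tilde{b}_j are biases, and f is a weighting function that is zero for X_{ij} = 0 (so only the non-zero entries contribute) and caps the influence of very frequent pairs.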

32–34

GLOVE: VISUALISATION

35

GLOVE: VISUALISATION

 Want to play around?

https://lamyiowce.github.io/word2viz/

36

PYTHON: USING WORD2VEC

 Preparing the input:
Word2Vec takes a list of lists of words (a list of tokenised sentences) as input,

e.g.:

sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence'],
             ['another', 'sentence'],
             ['and', 'this', 'is', 'the', 'last', 'one']]

37

PYTHON: USING WORD2VEC

 Training the model:

model = Word2Vec(sentences, min_count=10, size=300, workers=4)

We will only train vectors for
words occurring 10+ times in
the corpus.

We want to produce word
vectors of 300 dimensions.

We want to parallelise the task
across 4 worker processes.
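A self-contained sketch tying these two slides together, assuming gensim < 4.0 as in the slide (in gensim >= 4.0 the `size` parameter is called `vector_size`):

from gensim.models import Word2Vec

sentences = [['this', 'is', 'my', 'first', 'sentence'],
             ['a', 'short', 'sentence'],
             ['another', 'sentence'],
             ['and', 'this', 'is', 'the', 'last', 'one']]

# min_count=1 only so that this toy corpus keeps all of its words.
model = Word2Vec(sentences, min_count=1, size=100, workers=4)

print(model.wv['sentence'])                       # the 100-dimensional vector for 'sentence'
print(model.wv.most_similar('sentence', topn=2))  # its nearest neighbours in the model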

38

PYTHON: USING WORD2VEC

 It’s memory intensive!

It stores matrices of shape #vocabulary (dependent on min_count) × #size
(the size parameter), of floats (single precision, i.e. 4 bytes each).

Three such matrices are held in RAM. If you have:
100,000 unique words, size=200, the model will require approx.:

100,000*200*4*3 bytes = ~229MB.

39

PYTHON: USING WORD2VEC

 Evaluation:

It’s unsupervised; there is no intrinsic way of evaluating it.

Extrinsic evaluation: test your model in a text classification,
sentiment analysis, machine translation,… task!

● Does it outperform other methods (e.g. bag-of-words)?
● Compare two models A and B: which one’s better?

40

PYTHON: USING WORD2VEC

 Loading a stored model:

model = Word2Vec.load_word2vec_format('mymodel.txt', binary=False)

 or
 model = Word2Vec.load_word2vec_format('mymodel.bin.gz', binary=True)

 Resuming training:

model = gensim.models.Word2Vec.load('mymodel.bin.gz')

 model.train(more_sentences)
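In gensim, only models saved with the library's own save() keep the state needed to resume training (load_word2vec_format only loads the vectors). A sketch of the save/load/resume cycle, assuming `model` is a Word2Vec model trained as on the earlier slide and using parameter names from recent gensim versions, where train() requires total_examples and epochs:

from gensim.models import Word2Vec

model.save('mymodel')                            # store in gensim's native format

model = Word2Vec.load('mymodel')                 # reload it later
more_sentences = [['one', 'more', 'sentence']]
model.build_vocab(more_sentences, update=True)   # add any previously unseen words
model.train(more_sentences,
            total_examples=len(more_sentences),
            epochs=model.epochs)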

41

PYTHON: USING WORD2VEC
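The next slide reads off the vector of 'computer'; a sketch of the query this slide presumably showed, assuming `model` is the Word2Vec model trained on the earlier slides (model['computer'] in older gensim, model.wv['computer'] in newer versions):

v_computer = model.wv['computer']  # vector for the word 'computer'
print(v_computer)                  # e.g. [-0.00449447, -0.00310097, ...]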

42

PYTHON: USING WORD2VEC

 This will give us the vector representation of ‘computer’:

● v(‘computer’) = {-0.00449447, -0.00310097, …}

 How do we then get the vector representations for sentences, e.g.:

● I have installed Ubuntu on my computer

43

PYTHON: USING WORD2VEC

 Vector representations for sentences, e.g.:

● I have installed Ubuntu on my computer

 Standard practice is either of the following (see the sketch below):

● Summing the word vectors (they all have the same dimensionality):
v(‘I’) + v(‘have’) + v(‘installed’) + v(‘Ubuntu’) + …

● Taking the average of the word vectors:
(v(‘I’) + v(‘have’) + v(‘installed’) + …) / 7
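A minimal sketch (not from the slides) of both options, assuming `model` is a trained gensim Word2Vec model containing every word of the sentence:

import numpy as np

sentence = "I have installed Ubuntu on my computer".split()
vectors = [model.wv[w] for w in sentence]

sentence_sum = np.sum(vectors, axis=0)   # option 1: summed word vectors
sentence_avg = np.mean(vectors, axis=0)  # option 2: averaged word vectors (sum / 7 here)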

44

PRE-TRAINED WORD VECTORS

 One can train a model from a large corpus (millions, if not billions,
of sentences). This can be time-consuming and memory-intensive.

 Pre-trained models are available.

 Remember to choose a suitable pre-trained model.

 Don’t use word vectors pre-trained on news articles when
you’re working with social media!

45

PRE-TRAINED WORD VECTORS

 GloVe’s pre-trained vectors:
https://nlp.stanford.edu/projects/glove/

46

PRE-TRAINED WORD VECTORS

 Pre-trained word vectors for 30+ languages (from Wikipedia):
https://github.com/Kyubyong/wordvectors

47

PRE-TRAINED WORD VECTORS

 UK Twitter word embeddings:
https://figshare.com/articles/UK_Twitter_word_embeddings_II_/5791650

48

REFERENCES

 Gensim (word2vec):
https://radimrehurek.com/gensim/

 Word2vec tutorial:
https://rare-technologies.com/word2vec-tutorial/

 FastText:
https://github.com/facebookresearch/fastText/

 GloVe: Global Vectors for Word Representation:
https://nlp.stanford.edu/projects/glove/

49

ASSOCIATED READING

 Not yet part of Jurafsky’s book.
 See Deep learning for NLP CS224d lectures 1 and 2.

http://cs224d.stanford.edu/syllabus.html
 Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).

Distributed representations of words and phrases and their
compositionality. In Advances in neural information processing
systems (pp. 3111-3119).

 Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global
vectors for word representation. In Proceedings of the 2014
conference on empirical methods in natural language processing
(EMNLP) (pp. 1532-1543).