
COMP9444
Neural Networks and Deep Learning
5b. Word Vectors
© Alan Blair, 2017-20

Outline

• statistical language processing
• n-gram models
• co-occurrence matrix
• word representations
• word2vec
• word relationships
• neural machine translation
• combining images and language

What is the meaning of meaning?

• dictionary definitions
• synonyms and antonyms
• taxonomy
  ◮ penguin is-a bird is-a vertebrate

Word Meaning – Synonyms and Taxonomy?

Synonyms for "elegant":
stylish, graceful, tasteful, discerning, refined, sophisticated, dignified, cultivated, distinguished, classic, smart, fashionable, modish, decorous, beautiful, artistic, aesthetic, lovely; charming, polished, suave, urbane, cultured, dashing, debonair; luxurious, sumptuous, opulent, grand, plush, high-class, exquisite

Synonyms, antonyms and taxonomy require human effort; they may be incomplete and force discrete choices, so nuances are lost. Words like "king" and "queen" can be similar in some attributes but opposite in others.

Statistical Language Processing

Could we instead extract some statistical properties automatically, without human involvement?

There was a Crooked Man

There was a crooked man, who walked a crooked mile
And found a crooked sixpence upon a crooked stile.
He bought a crooked cat, who caught a crooked mouse
And they all lived together in a little crooked house.

[illustration: www.kearley.co.uk/images/uploads/JohnPatiencePJ03.gif]

Counting Frequencies

word frequency: a 7, all 1, and 2, bought 1, cat 1, caught 1, crooked 7, found 1, he 1, house 1, in 1, little 1, lived 1, man 1, mile 1, mouse 1, sixpence 1, stile 1, there 1, they 1, together 1, upon 1, walked 1, was 1, who 2

• some words occur frequently in all (or most) documents
• some words occur frequently in a particular document, but not generally
• this information can be useful for document classification
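As a concrete companion to the table above, here is a minimal Python sketch (not from the lecture; the tokenization is deliberately crude) that reproduces these counts with collections.Counter.

```python
from collections import Counter

rhyme = """There was a crooked man, who walked a crooked mile
And found a crooked sixpence upon a crooked stile.
He bought a crooked cat, who caught a crooked mouse
And they all lived together in a little crooked house."""

# lower-case, strip trailing punctuation, split on whitespace
tokens = [w.strip(".,").lower() for w in rhyme.split()]

for word, count in sorted(Counter(tokens).items()):
    print(f"{word:10s} {count}")
```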
Document Classification

[word–document matrix: one row per vocabulary word ("a", "all", "and", ..., "who"), one column per document (doc 1, doc 2, ..., doc X); each entry is the number of times the word occurs in that document]

• each column of the matrix becomes a vector representing the corresponding document
• words like "cat", "mouse", "house" tend to occur in children's books or rhymes
• other groups of words may be characteristic of legal documents, political news, sporting results, etc.
• words occurring many times in one document may skew the vector – might be better to just have a "1" or "0" indicating whether the word occurs at all
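A hedged sketch of the word–document matrix idea in NumPy; the toy documents, their names and the binarization step are illustrative, not part of the lecture material.

```python
import numpy as np
from collections import Counter

def tokenize(text):
    words = [w.strip(".,;!?").lower() for w in text.split()]
    return [w for w in words if w]

docs = {
    "doc 1": "There was a crooked man, who walked a crooked mile.",
    "doc 2": "The cat caught a mouse in the little house.",
}

vocab = sorted({w for text in docs.values() for w in tokenize(text)})
row = {w: i for i, w in enumerate(vocab)}

# one row per word, one column per document
X = np.zeros((len(vocab), len(docs)), dtype=int)
for j, text in enumerate(docs.values()):
    for w, c in Counter(tokenize(text)).items():
        X[row[w], j] = c

# each column of X is a vector representing the corresponding document;
# a 1/0 version stops very frequent words from skewing that vector
X_binary = (X > 0).astype(int)
```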
Counting Consecutive Word Pairs

[matrix of consecutive word-pair counts for the rhyme: rows are the current word, columns the word that follows; for example, the row for "a" has 6 under "crooked" and 1 under "little", and the row for "crooked" has a 1 under each of "man", "mile", "sixpence", "stile", "cat", "mouse" and "house"]

Predictive 1-Gram Word Model

[the same matrix with each row normalized to sum to 1; e.g. the row for "a" becomes 6/7 under "crooked" and 1/7 under "little"]

• by normalizing each row (to sum to 1) we can estimate the probability prob(w_j | w_i) of word w_j occurring after word w_i
• the model captures some common combinations like "there was", "man who", "and found", "he bought", "who caught", "and they", "they all", "lived together", etc.
• need to aggregate over a large corpus, so that unusual words like "crooked" will not dominate
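The two matrices just described can be built in a few lines of NumPy. This is a sketch only; the variable names are mine, and in practice the counts would come from a much larger corpus than the rhyme.

```python
import numpy as np

corpus = ("there was a crooked man who walked a crooked mile and found a crooked "
          "sixpence upon a crooked stile he bought a crooked cat who caught a crooked "
          "mouse and they all lived together in a little crooked house").split()

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V = len(vocab)

# counts[i, j] = number of times word j immediately follows word i
counts = np.zeros((V, V))
for w1, w2 in zip(corpus[:-1], corpus[1:]):
    counts[idx[w1], idx[w2]] += 1

# normalize each row to sum to 1, giving probs[i, j] = prob(w_j | w_i)
row_sums = counts.sum(axis=1, keepdims=True)
probs = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

print(probs[idx["a"], idx["crooked"]])   # 6/7: "a" is followed by "crooked" 6 times out of 7
```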
N-Gram Model

• this 1-gram model can be generalized to a bi-gram, tri-gram, ..., n-gram model by considering the n preceding words
• if the vocabulary is large, we need some tricks to avoid exponential use of memory

1-Gram Text Generator

Sample output from a 1-gram text generator:

"Rashly – Good night is very liberal – it is easily said there is – gyved to a sore distraction in wrath and with my king may choose but none of shapes and editing by this , and shows a sea And what this is miching malhecho ; And gins to me a pass , Transports his wit , Hamlet , my arms against the mind impatient , by the conditions that would fain know ; which , the wicked deed to get from a deed to your tutor ."
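Continuing the sketch above (it assumes vocab, idx and probs from the bigram example), repeatedly sampling the next word from the current word's row gives a simple 1-gram text generator; a sample like the Hamlet-flavoured text above comes from applying the same idea to a much larger corpus.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(start="there", length=12):
    word, out = start, [start]
    for _ in range(length):
        row = probs[idx[word]]
        if row.sum() == 0:          # "house" ends the rhyme and has no successor
            break
        word = rng.choice(vocab, p=row)   # sample the next word from prob(. | word)
        out.append(word)
    return " ".join(out)

print(generate())
```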

Co-occurrence Matrix

• sometimes, we don't necessarily predict the next word, but simply a "nearby" word (e.g. a word occurring within an n-word window centered on that word)
• we can build a matrix in which each row represents a word, and each column a nearby word
• each row of this matrix could be considered as a vector representation for the corresponding word, but the number of dimensions is equal to the size of the vocabulary, which could be very large (∼ 10^5)
  ◮ is there a way to reduce the dimensionality while still preserving the relationships between words?

Co-occurrence Matrix (2-word window)

[co-occurrence counts for the rhyme using a 2-word window, with one row and one column for each vocabulary word]

Co-occurrence Matrix (10-word window)

[the same matrix computed with a 10-word window; the counts are larger and the matrix is much denser]

Co-occurrence Matrix

• by aggregating over many documents, pairs (or groups) of words emerge which tend to occur near each other (but not necessarily consecutively)
  ◮ "cat", "caught", "mouse"
  ◮ "walked", "mile"
  ◮ "little", "house"
• common words tend to dominate the matrix
  ◮ could we sample common words less often, in order to reveal the relationships of less common words?
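A sketch of building a co-occurrence matrix with a symmetric window, assuming the tokenized corpus list from the earlier bigram example; window sizes of 2 and 10 correspond to the two matrices above.

```python
import numpy as np

def cooccurrence_matrix(corpus, window=2):
    """Count how often each pair of words occurs within `window` words of each other."""
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for i, w in enumerate(corpus):
        lo, hi = max(0, i - window), min(len(corpus), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                X[idx[w], idx[corpus[j]]] += 1
    return X, vocab

# X2, vocab = cooccurrence_matrix(corpus, window=2)
# X10, _    = cooccurrence_matrix(corpus, window=10)
# each row of X is a |V|-dimensional vector for the corresponding word
```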

Word Embeddings

"Words that are used and occur in the same contexts tend to purport similar meanings."
— Z. Harris (1954)

"You shall know a word by the company it keeps."
— J.R. Firth (1957)

Aim of Word Embeddings:
Find a vector representation of each word, such that words with nearby representations are likely to occur in similar contexts.

History of Word Embeddings

• Structuralist Linguistics (Firth, 1957)
• Recurrent Networks (Rumelhart, Hinton & Williams, 1986)
• Latent Semantic Analysis (Deerwester et al., 1990)
• Hyperspace Analogue to Language (Lund, Burgess & Atchley, 1995)
• Neural Probabilistic Language Models (Bengio, 2000)
• NLP (almost) from Scratch (Collobert et al., 2008)
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington, Socher & Manning, 2014)
Word Embeddings

[scatter plot: the words of the rhyme ("a", "crooked", "man", "who", "there", "found", "was", "together", "and", "caught", "bought", "stile", "lived", "they", "little", "mouse", "house", ...) plotted in a 2-dimensional embedding space]

Singular Value Decomposition

The co-occurrence matrix X (L×M) can be decomposed as X = U S V^T, where U (L×L) and V (M×M) are unitary (all columns have unit length) and S (L×M) is diagonal, with diagonal entries s_1 ≥ s_2 ≥ ... ≥ s_M ≥ 0.

[diagram: X (L×M) ≈ Ũ (L×N) · S̃ (N×N) · Ṽ^T (N×M)]

We can obtain an approximation for X of rank N < M by truncating U to Ũ (L×N), S to S̃ (N×N) and V to Ṽ (N×M). The kth row of Ũ then provides an N-dimensional vector representing the kth word in the vocabulary.
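A small NumPy sketch of this truncation using numpy.linalg.svd; the function and variable names are mine. It follows the slide in taking the rows of the truncated U as the word vectors (scaling them by the singular values is a common variant).

```python
import numpy as np

def svd_word_vectors(X, n_dims=2):
    """N-dimensional word vectors from an L x M co-occurrence matrix X."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ np.diag(S) @ Vt
    # keep the n_dims largest singular directions; the kth row is the vector for word k
    # (a common variant scales the columns by the singular values: U[:, :n_dims] * S[:n_dims])
    return U[:, :n_dims]

# vecs = svd_word_vectors(X2, n_dims=2)   # e.g. for a 2-D plot like the one above
```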
word2vec and GloVe

Typically, L is the number of words in the vocabulary (about 60,000) and M is either equal to L or, in the case of document classification, the number of documents in the collection. SVD is computationally expensive, proportional to L × M^2 if L ≥ M.

Can we generate word vectors in a similar way, but with less computation, and incrementally?

• word2vec
  ◮ predictive model
  ◮ maximize the probability of a word based on surrounding words
• GloVe
  ◮ count-based model
  ◮ reconstruct a close approximation to the co-occurrence matrix X

Eigenvalue vs. Singular Value Decomposition

(2×2 matrices below are written row by row, with rows separated by semicolons)

Eigenvalue Decomposition:
[0 1; 1 0] = Ω D Ω^(-1), where Ω = (1/√2) [1 1; 1 -1], D = [1 0; 0 -1]

Singular Value Decomposition:
[0 1; 1 0] = U S V^T, where U = [0 1; 1 0], S = [1 0; 0 1], V = [1 0; 0 1]

Eigenvalue vs. Singular Value Decomposition

Eigenvalue Decomposition:
[0 -1; 1 0] = Ω D Ω^(-1), where Ω = (1/√2) [1 1; -i i], D = [i 0; 0 -i]

Singular Value Decomposition:
[0 -1; 1 0] = U S V^T, where U = [0 1; 1 0], S = [1 0; 0 1], V = [1 0; 0 -1]

• if X is symmetric and positive semi-definite, eigenvalue and singular value decompositions are the same
• in general, eigenvalues can be negative or even complex, but singular values are always real and non-negative
• even if X is a square matrix, singular value decomposition treats the source and target as two entirely different spaces
• the word co-occurrence matrix is symmetric but not positive semi-definite; for example, if the text consisted entirely of two alternating letters ..ABABABABABABAB.. then A would be the context for B, and vice-versa

word2vec 1-Word Context Model

[network diagram: a 1-hot input of length V, a linear hidden layer h of size N, and an output layer of length V; W (V×N) holds the input-to-hidden weights and W′ (N×V) the hidden-to-output weights]

The kth row v_k of W is a representation of word k.
The jth column v′_j of W′ is an (alternative) representation of word j.
If the (1-hot) input is k, the linear sum at each output will be u_j = v′_j^T v_k.

Cost Function

Softmax can be used to turn these linear sums u_j into a probability distribution estimating the probability of word j occurring in the context of word k:

prob(j|k) = exp(u_j) / Σ_{j′=1..V} exp(u_{j′}) = exp(v′_j^T v_k) / Σ_{j′=1..V} exp(v′_{j′}^T v_k)

We can treat the text as a sequence of numbers w_1, w_2, ..., w_T, where w_i = j means that the ith word in the text is the jth word in the vocabulary. We then seek to maximize the log probability

(1/T) Σ_{t=1..T} Σ_{-c≤r≤c, r≠0} log prob(w_{t+r} | w_t)

where c is the size of the training context (which may depend on w_t).
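A minimal NumPy sketch of the 1-word context model and its softmax cost, using the notation above (W for input-to-hidden, W′ for hidden-to-output); the sizes, random initialization and helper names are illustrative, not the lecture's own code.

```python
import numpy as np

V, N = 1000, 50                           # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
W  = rng.normal(scale=0.1, size=(V, N))   # row k = v_k, input-to-hidden weights
Wp = rng.normal(scale=0.1, size=(N, V))   # column j = v'_j, hidden-to-output weights

def prob_given_word(k):
    """prob(j | k) for every word j, using the full softmax."""
    h = W[k]                  # hidden activation for 1-hot input k (equals v_k)
    u = h @ Wp                # u_j = v'_j . v_k for every word j
    e = np.exp(u - u.max())   # subtract the max for numerical stability
    return e / e.sum()

# objective (1/T) sum_t sum_{-c<=r<=c, r!=0} log prob(w_{t+r} | w_t)
def log_prob_objective(word_ids, c=2):
    T = len(word_ids)
    total = 0.0
    for t in range(T):
        for r in range(-c, c + 1):
            if r != 0 and 0 <= t + r < T:
                total += np.log(prob_given_word(word_ids[t])[word_ids[t + r]])
    return total / T
```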
word2vec Issues

• word2vec is a linear model in the sense that there is no activation function at the hidden nodes
• this 1-word prediction model can be extended to multi-word prediction in two different ways:
  ◮ Continuous Bag of Words
  ◮ Skip-Gram
• need to sample frequent words less often
• need a computationally efficient alternative to Softmax (Why?)
  ◮ Hierarchical Softmax
  ◮ Negative Sampling

word2vec Weight Updates

If we assume the full softmax, and the correct output is j*, then the cost function is

E = -u_{j*} + log Σ_{j′=1..V} exp(u_{j′})

and the output differentials are

e_j = ∂E/∂u_j = -δ_{jj*} + ∂/∂u_j log Σ_{j′=1..V} exp(u_{j′}) = -δ_{jj*} + prob(j|k)

where δ_{jj*} = 1 if j = j*, and 0 otherwise.

word2vec Weight Updates

hidden-to-output differentials:
∂E/∂w′_{ij} = (∂E/∂u_j)(∂u_j/∂w′_{ij}) = e_j h_i

hidden unit differentials:
∂E/∂h_i = Σ_{j=1..V} (∂E/∂u_j)(∂u_j/∂h_i) = Σ_{j=1..V} e_j w′_{ij}

input-to-hidden differentials:
∂E/∂w_{ki} = (∂E/∂h_i)(∂h_i/∂w_{ki}) = Σ_{j=1..V} e_j w′_{ij} x_k

Continuous Bag Of Words

• if several context words are each used independently to predict the center word, the hidden activation becomes a sum (or average) over all the context words
• note the difference between this and NetTalk – in word2vec (CBOW) all context words share the same input-to-hidden weights

word2vec Skip-Gram Model

• try to predict the context words, given the center word
• this skip-gram model is similar to CBOW, except that in this case a single input word is used to predict multiple context words
• all context words share the same hidden-to-output weights

Hierarchical Softmax

• target words are organized in a Huffman-coded Binary Tree
• each output of the network corresponds to one branch point in the tree
• only those nodes that are visited along the path to the target word are evaluated (which is log_2(V) nodes on average)

Hierarchical Softmax

prob(w = w_t) = Π_{j=1..L(w)-1} σ([n(w, j+1) = child(n(w, j))] v′_{n(w,j)}^T h)

where σ(u) = 1/(1 + exp(-u)), and

[n′ = child(n)] = +1 if n′ is the left child of node n, -1 otherwise.

Negative Sampling

The idea of negative sampling is that we train the network to increase its estimation of the target word j* and reduce its estimate not of all the words in the vocabulary but just a subset of them, W_neg, drawn from an appropriate distribution:

E = -log σ(v′_{j*}^T h) - Σ_{j ∈ W_neg} log σ(-v′_j^T h)

This is a simplified version of Noise Contrastive Estimation (NCE). It is not guaranteed to produce a well-defined probability distribution, but in practice it does produce high-quality word embeddings.

Negative Sampling

• the number of samples is 5-20 for small datasets, 2-5 for large datasets
• empirically, a good choice of the distribution from which to draw the negative samples is P(w) = U(w)^(3/4) / Z, where U(w) is the unigram (word frequency) distribution and Z is a normalizing constant

Subsampling of Frequent Words

In order to diminish the influence of more frequent words, each word in the corpus is discarded with probability

P(w_i) = 1 - √(t / f(w_i))

where f(w_i) is the frequency of word w_i and t ∼ 10^(-5) is an empirically determined threshold.
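Tying the last few slides together, here is a hedged NumPy sketch of one skip-gram training step with negative sampling, plus the subsampling rule; the learning rate, number of negatives and helper names are illustrative choices, not the exact settings from the lecture or the original papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def negative_sampling_step(W, Wp, center_k, context_j, unigram, n_neg=5, lr=0.025):
    """One SGD step on E = -log σ(v'_j.h) - Σ_neg log σ(-v'_neg.h), with h = v_center."""
    # draw negatives from the unigram distribution raised to the 3/4 power
    p = unigram ** 0.75
    p /= p.sum()
    negatives = rng.choice(len(unigram), size=n_neg, p=p)

    h = W[center_k]                           # hidden activation (= v_center)
    grad_h = np.zeros_like(h)
    for j, label in [(context_j, 1.0)] + [(n, 0.0) for n in negatives]:
        e = sigmoid(Wp[:, j] @ h) - label     # prediction error for this output word
        grad_h += e * Wp[:, j]
        Wp[:, j] -= lr * e * h                # update hidden-to-output weights v'_j
    W[center_k] -= lr * grad_h                # update input-to-hidden weights v_center
    # for a full skip-gram step, repeat for each context word in the window

def keep_word(freq, t=1e-5):
    """Subsampling: keep a word with probability sqrt(t / f(w)), i.e. discard with 1 - sqrt(t / f(w))."""
    return rng.random() < np.sqrt(t / freq)
```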
Sentence Completion Task

(use model to choose which word is most likely to occur in this context)

Q1. Seeing the pictures of our old home made me feel .......... and nostalgic.
    A. fastidious   B. indignant   C. wistful   D. conciliatory

Q2. Because the House had the votes to override a presidential veto, the President has no choice but to .......... .
    A. object   B. abdicate   C. abstain   D. compromise

Linguistic Regularities

King + Woman - Man ≃ Queen

More generally,  A is to B as C is to ??

Word Analogy Task

Q1. evening is to morning as dinner is to .......... .
    A. breakfast   B. soup   C. coffee   D. time

Q2. bow is to arrow as .......... is to bullet.
    A. defend   B. lead   C. shoot   D. gun

Capital Cities

[figure]

Word Analogies

d = argmax_x (v_c + v_b - v_a)^T v_x / ||v_c + v_b - v_a||

Word Relationships

[figure]

References

T. Mikolov, K. Chen, G. Corrado & J. Dean, 2013. "Efficient estimation of word representations in vector space", arXiv preprint arXiv:1301.3781.

T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado & J. Dean, 2013. "Distributed representations of words and phrases and their compositionality", NIPS 2013, 3111-19.

Xin Rong, 2014. "word2vec parameter learning explained", arXiv:1411.2738.

https://nlp.stanford.edu/projects/glove/

https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/