COMP9444
Neural Networks and Deep Learning
5b. Word Vectors
© Alan Blair, 2017-20

Outline

statistical language processing
n-gram models
co-occurrence matrix
word representations
word2vec
word relationships
neural machine translation
combining images and language

What is the meaning of meaning?

dictionary definitions
synonyms and antonyms

Synonyms for "elegant"

stylish, graceful, tasteful, discerning, refined, sophisticated, dignified, cultivated, distinguished, classic, smart, fashionable, modish, decorous, beautiful, artistic, aesthetic, lovely; charming, polished, suave, urbane, cultured, dashing, debonair; luxurious, sumptuous, opulent, grand, plush, high-class, exquisite

Word Meaning – Synonyms and Taxonomy?

taxonomy
◮ penguin is-a bird is-a vertebrate

Synonyms, antonyms and taxonomy require human effort, may be incomplete and require discrete choices. Nuances are lost. Words like "king", "queen" can be similar in some attributes but opposite in others.

Statistical Language Processing

Could we instead extract some statistical properties automatically, without human involvement?
There was a Crooked Man
There was a crooked man, who walked a crooked mile
And found a crooked sixpence upon a crooked stile.
He bought a crooked cat, who caught a crooked mouse
And they all lived together in a little crooked house.

(image: www.kearley.co.uk/images/uploads/JohnPatiencePJ03.gif)
Counting Frequencies

word frequency:
a 7, all 1, and 2, bought 1, cat 1, caught 1, crooked 7, found 1, he 1, house 1, in 1, little 1, lived 1, man 1, mile 1, mouse 1, sixpence 1, stile 1, there 1, they 1, together 1, upon 1, walked 1, was 1, who 2

some words occur frequently in all (or most) documents

some words occur frequently in a particular document, but not generally

this information can be useful for document classification
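As a quick illustration (my own sketch, not from the slides), the frequency counts above can be reproduced in a few lines of Python:

```python
from collections import Counter
import re

rhyme = """There was a crooked man, who walked a crooked mile
And found a crooked sixpence upon a crooked stile.
He bought a crooked cat, who caught a crooked mouse
And they all lived together in a little crooked house."""

words = re.findall(r"[a-z]+", rhyme.lower())   # lower-case and strip punctuation
counts = Counter(words)                        # e.g. counts["crooked"] == 7, counts["a"] == 7
print(sorted(counts.items()))
```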
Document Classification

[word–document count matrix: each row is a vocabulary word (a, all, and, ..., who), each column a document (doc 1, doc 2, ..., doc X); the counts from "There was a Crooked Man" fill the column for that rhyme]

each column of the matrix becomes a vector representing the corresponding document

words like "cat", "mouse", "house" tend to occur in children's books or rhymes

other groups of words may be characteristic of legal documents, political news, sporting results, etc.

words occurring many times in one document may skew the vector – might be better to just have a "1" or "0" indicating whether the word occurs at all
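A minimal sketch (mine, with invented toy documents) of turning documents into count vectors, including the binary "occurs at all" variant mentioned above:

```python
import re
from collections import Counter

def doc_vectors(docs, binary=False):
    """Build one count vector per document over a shared vocabulary."""
    tokenised = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted(set(w for doc in tokenised for w in doc))
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)
        row = [counts[w] for w in vocab]
        if binary:                      # "1"/"0": does the word occur at all?
            row = [1 if c > 0 else 0 for c in row]
        matrix.append(row)
    return vocab, matrix

docs = ["there was a crooked man", "the cat caught a mouse in the house"]
vocab, M = doc_vectors(docs, binary=True)   # M[d] is the vector for document d
```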
Counting Consecutive Word Pairs
[table of counts of consecutive word pairs in the rhyme: each row is a word, each column the word that follows it; e.g. "a" is followed by "crooked" six times and by "little" once, while "crooked" is followed once each by "man", "mile", "sixpence", "stile", "cat", "mouse" and "house"]

the model captures some common combinations like "there was", "man who", "and found", "he bought", "who caught", "and they", "they all", "lived together", etc.

Predictive 1-Gram Word Model
by normalizing each row (to sum to 1) we can estimate the probability prob(w_j | w_i) of word w_j occurring after w_i
[the pair-count matrix with each row normalized to sum to 1; e.g. the row for "a" becomes 6/7 under "crooked" and 1/7 under "little"]

need to aggregate over a large corpus, so that unusual words like "crooked" will not dominate
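A rough sketch (mine, not the course's code) of estimating prob(w_j | w_i) by normalizing consecutive-pair counts:

```python
from collections import Counter, defaultdict

def bigram_probs(words):
    """Estimate prob(next | current) by normalizing consecutive-pair counts."""
    pair_counts = defaultdict(Counter)
    for w, w_next in zip(words, words[1:]):
        pair_counts[w][w_next] += 1
    return {w: {v: c / sum(nxt.values()) for v, c in nxt.items()}
            for w, nxt in pair_counts.items()}

words = "there was a crooked man who walked a crooked mile".split()
P = bigram_probs(words)      # P["a"]["crooked"] == 1.0, P["crooked"]["man"] == 0.5
```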
N-Gram Model

this unigram model can be generalized to a bi-gram, tri-gram, ..., n-gram model by considering the n preceding words
if the vocabulary is large, we need some tricks to avoid exponential use of memory
1-Gram Text Generator

“Rashly – Good night is very liberal – it is easily said there is – gyved to a sore distraction in wrath and with my king may choose but none of shapes and editing by this , and shows a sea And what this is miching malhecho ; And gins to me a pass , Transports his wit , Hamlet , my arms against the mind impatient , by the conditions that would fain know ; which , the wicked deed to get from a deed to your tutor .”
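A toy sketch (mine) of sampling text from the estimated next-word distributions; it assumes the bigram_probs function from the earlier sketch:

```python
import random

def generate(P, start, length=20):
    """Sample words from the next-word distributions P[w][w_next]."""
    word, output = start, [start]
    for _ in range(length):
        nxt = P.get(word)
        if not nxt:                      # dead end: word never seen with a successor
            break
        choices, probs = zip(*nxt.items())
        word = random.choices(choices, weights=probs)[0]
        output.append(word)
    return " ".join(output)

# print(generate(P, "a"))   # using P from the previous sketch
```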
Co-occurrence Matrix
sometimes, we don’t necessarily predict the next word, but simply a “nearby word” (e.g. a word occurring within an n-word window centered on that word)
we can build a matrix in which each row represents a word, and each column a nearby word
each row of this matrix could be considered as a vector representation for the corresponding word, but the number of dimensions is equal to the size of the vocabulary, which could be very large (∼ 10^5)
◮ is there a way to reduce the dimensionality while still preserving the relationships between words?
Co-occurrence Matrix (2-word window)

[co-occurrence count matrix for the rhyme using a 2-word window: rows and columns both range over the vocabulary, and most entries are 0 or 1]

Co-occurrence Matrix (10-word window)

[the same matrix with a 10-word window: the counts are much denser, and common words such as "a" and "crooked" have large counts against many other words]

Co-occurrence Matrix
by aggregating over many documents, pairs (or groups) of words emerge which tend to occur near each other (but not necessarily consecutively)
◮ “cat”, “caught”, “mouse”
◮ “walked”, “mile”
◮ “little”, “house”
common words tend to dominate the matrix
◮ could we sample common words less often, in order to reveal the relationships of less common words?
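To make the construction concrete, here is a rough sketch (mine, not from the lecture) of building a windowed co-occurrence matrix:

```python
import numpy as np

def cooccurrence(words, window=2):
    """Count how often each pair of words occurs within `window` positions of each other."""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)), dtype=int)
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                X[index[w], index[words[j]]] += 1
    return vocab, X

words = "there was a crooked man who walked a crooked mile".split()
vocab, X = cooccurrence(words, window=2)
```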
Word Embeddings
“Words that are used and occur in the same contexts tend to purport similar meanings.”
Z. Harris (1954)

“You shall know a word by the company it keeps.”
J.R. Firth (1957)

Aim of Word Embeddings:
Find a vector representation of each word, such that words with nearby representations are likely to occur in similar contexts.

History of Word Embeddings
Structuralist Linguistics (Firth, 1957)
Recurrent Networks (Rumelhart, Hinton & Williams, 1986)
Latent Semantic Analysis (Deerwester et al., 1990)
Hyperspace Analogue to Language (Lund, Burgess & Atchley, 1995)
Neural Probabilistic Language Models (Bengio, 2000)
NLP (almost) from Scratch (Collobert et al., 2008)
word2vec (Mikolov et al., 2013)
GloVe (Pennington, Socher & Manning, 2014)
Word Embeddings

[scatter plot of 2-dimensional word embeddings for the words of the rhyme, including "together", "man", "a", "crooked", "who", "there", "found", "was", "caught", "bought", "stile", "lived", "they", "little", "mouse" and "house"]
Singular Value Decomposition
Co-occurrence matrix X (L×M) can be decomposed as X = U S V^T where U (L×L), V (M×M) are unitary (all columns have unit length) and S (L×M) is diagonal, with diagonal entries s_1 ≥ s_2 ≥ ... ≥ s_M ≥ 0.

We can obtain an approximation for X of rank N < M by truncating U to Ũ (L×N), S to S̃ (N×N) and V to Ṽ (N×M). The kth row of Ũ then provides an N-dimensional vector representing the kth word in the vocabulary.

[diagram illustrating the truncated decomposition X ≈ Ũ S̃ Ṽ described above, with the N retained singular values s_1, ..., s_N on the diagonal of S̃]
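A small numpy sketch (mine) of the truncated SVD; the matrix X and the choice N = 2 are purely illustrative:

```python
import numpy as np

# toy co-occurrence matrix (rows/columns over a tiny vocabulary), purely illustrative
X = np.array([[0, 2, 1],
              [2, 0, 3],
              [1, 3, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U @ np.diag(s) @ Vt

N = 2                          # number of dimensions to keep (arbitrary here)
word_vectors = U[:, :N]        # kth row: N-dimensional vector for the kth word
X_approx = U[:, :N] @ np.diag(s[:N]) @ Vt[:N, :]   # best rank-N approximation
```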
word2vec and GloVe
Typically, L is the number of words in the vocabulary (about 60,000) and M is either equal to L or, in the case of document classification, the number of documents in the collection. SVD is computationally expensive, proportional to L × M² if L ≥ M. Can we generate word vectors in a similar way but with less computation, and incrementally?
word2vec
◮ predictive model
◮ maximize the probability of a word based on surrounding words
GloVe
◮ count-based model
◮ reconstruct a close approximation to the co-occurrence matrix X
Eigenvalue vs. Singular Value Decomposition

Eigenvalue Decomposition:
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = \Omega D \Omega^{-1}, \qquad \Omega = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}, \quad D = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \]

Singular Value Decomposition:
\[ \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} = U S V^T, \qquad U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad S = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]

A second example, this time with complex eigenvalues:

Eigenvalue Decomposition:
\[ \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} = \Omega D \Omega^{-1}, \qquad \Omega = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ -i & i \end{pmatrix}, \quad D = \begin{pmatrix} i & 0 \\ 0 & -i \end{pmatrix} \]

Singular Value Decomposition:
\[ \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix} = U S V^T, \qquad U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad S = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix} \]
if X is symmetric and positive semi-definite, eigenvalue and singular value decompositions are the same.
in general, eigenvalues can be negative or even complex, but singular values are always real and non-negative.
even if X is a square matrix, singular value decomposition treats the source and target as two entirely different spaces.
the word co-occurrence matrix is symmetric but not positive semi- definite; for example, if the text consisted entirely of two alternating letters ..ABABABABABABAB.. then A would be the context for B, and vice-versa.
word2vec 1-Word Context Model

[diagram: a feed-forward network with a V-dimensional 1-hot input layer, an N-dimensional linear hidden layer and a V-dimensional output layer; W is the V×N input-to-hidden weight matrix and W′ the N×V hidden-to-output weight matrix]

The kth row v_k of W is a representation of word k.
The jth column v′_j of W′ is an (alternative) representation of word j.

If the (1-hot) input is k, the linear sum at each output will be u_j = (v′_j)^T v_k
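A minimal numpy sketch (mine, with invented sizes V and N) of the 1-word context forward pass described above:

```python
import numpy as np

V, N = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))       # input-to-hidden weights: row k is v_k
W_prime = rng.normal(size=(N, V)) # hidden-to-output weights: column j is v'_j

k = 3                             # index of the (1-hot) input word
h = W[k]                          # hidden activation = v_k (no nonlinearity)
u = W_prime.T @ h                 # u[j] = v'_j . v_k, one linear sum per output word
```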
Cost Function
Softmax can be used to turn these linear sums u_j into a probability distribution estimating the probability of word j occurring in the context of word k:

\[ \mathrm{prob}(j \mid k) = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} = \frac{\exp({v'_j}^T v_k)}{\sum_{j'=1}^{V} \exp({v'_{j'}}^T v_k)} \]

We can treat the text as a sequence of numbers w_1, w_2, ..., w_T where w_i = j means that the ith word in the text is the jth word in the vocabulary. We then seek to maximize the log probability

\[ \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le r \le c,\ r \ne 0} \log \mathrm{prob}(w_{t+r} \mid w_t) \]

where c is the size of the training context (which may depend on w_t).
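A small sketch (mine) of the softmax step; the vector u of linear sums is invented for illustration:

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a vector of linear sums u_j."""
    e = np.exp(u - np.max(u))
    return e / e.sum()

u = np.array([1.2, -0.3, 0.0, 2.1])   # toy linear sums u_j = v'_j . v_k
p = softmax(u)                        # p[j] estimates prob(j | k); p sums to 1
```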
word2vec Issues

word2vec is a linear model in the sense that there is no activation function at the hidden nodes

this 1-word prediction model can be extended to multi-word prediction in two different ways:
◮ Continuous Bag of Words
◮ Skip-Gram

need to sample frequent words less often
word2vec Weight Updates
If we assume the full softmax, and the correct output is j*, then the cost function is

\[ E = -u_{j^*} + \log \sum_{j'=1}^{V} \exp(u_{j'}) \]

the output differentials are

\[ e_j = \frac{\partial E}{\partial u_j} = -\delta_{j j^*} + \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})}, \qquad \text{where } \delta_{j j^*} = \begin{cases} 1, & \text{if } j = j^*, \\ 0, & \text{otherwise} \end{cases} \]

hidden-to-output differentials

\[ \frac{\partial E}{\partial w'_{ij}} = \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial w'_{ij}} = e_j h_i \]

hidden unit differentials

\[ \frac{\partial E}{\partial h_i} = \sum_{j=1}^{V} \frac{\partial E}{\partial u_j} \frac{\partial u_j}{\partial h_i} = \sum_{j=1}^{V} e_j w'_{ij} \]

input-to-hidden differentials

\[ \frac{\partial E}{\partial w_{ki}} = \frac{\partial E}{\partial h_i} \frac{\partial h_i}{\partial w_{ki}} = \sum_{j=1}^{V} e_j w'_{ij} x_k \]
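A compact numpy sketch (mine) of these gradients for a single (input word k, target word j*) pair; the variable names follow the notation above and the weights are random toy values:

```python
import numpy as np

def full_softmax_gradients(W, W_prime, k, j_star):
    """One training pair (input word k, target word j*): full-softmax gradients."""
    h = W[k]                              # hidden activation = v_k
    u = W_prime.T @ h                     # linear sums u_j
    p = np.exp(u - u.max()); p /= p.sum() # softmax prob(j | k)
    e = p.copy(); e[j_star] -= 1.0        # e_j = -delta_{j j*} + softmax_j
    dW_prime = np.outer(h, e)             # dE/dw'_{ij} = e_j h_i
    dh = W_prime @ e                      # dE/dh_i = sum_j e_j w'_{ij}
    dW_k = dh                             # only row k of W receives a gradient (1-hot input)
    return dW_prime, dW_k

rng = np.random.default_rng(1)
W, W_prime = rng.normal(size=(8, 3)), rng.normal(size=(3, 8))
dWp, dWk = full_softmax_gradients(W, W_prime, k=2, j_star=5)
```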
need a computationally efficient alternative to Softmax (Why?)
◮ Hierarchical Softmax
◮ Negative Sampling
Continuous Bag Of Words
If several context words are each used independently to predict the center word, the hidden activation becomes a sum (or average) over all the context words
Note the difference between this and NetTalk – in word2vec (CBOW) all context words share the same input-to-hidden weights

word2vec Skip-Gram Model

try to predict the context words, given the center word

this skip-gram model is similar to CBOW, except that in this case a single input word is used to predict multiple context words

all context words share the same hidden-to-output weights
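A toy sketch (mine) of generating (center, context) training pairs with a window of size c, as used by both CBOW and Skip-Gram:

```python
def skipgram_pairs(words, c=2):
    """Generate (center, context) training pairs with a window of size c."""
    pairs = []
    for t, center in enumerate(words):
        for r in range(-c, c + 1):
            if r != 0 and 0 <= t + r < len(words):
                pairs.append((center, words[t + r]))
    return pairs

pairs = skipgram_pairs("there was a crooked man".split(), c=2)
# e.g. ("a", "there"), ("a", "was"), ("a", "crooked"), ("a", "man"), ...
```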
Hierarchical Softmax

target words are organized in a Huffman-coded Binary Tree

each output of the network corresponds to one branch point in the tree
only those nodes that are visited along the path to the target word are evaluated (which is log2(V) nodes on average)
\[ \mathrm{prob}(w = w_t) = \prod_{j=1}^{L(w)-1} \sigma\Big( [n(w, j+1) = \mathrm{child}(n(w, j))] \; {v'_{n(w,j)}}^{T} h \Big) \]

where

\[ [n' = \mathrm{child}(n)] = \begin{cases} +1, & \text{if } n' \text{ is left child of node } n, \\ -1, & \text{otherwise,} \end{cases} \qquad \sigma(u) = \frac{1}{1 + \exp(-u)} \]
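A rough sketch (mine) of evaluating this product for one target word; path_nodes, path_signs and V_prime are invented names for the path through the tree and the inner-node vectors v′_n:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hs_probability(h, path_nodes, path_signs, V_prime):
    """Probability of a target word under hierarchical softmax.

    path_nodes: indices of the inner nodes visited from the root to the word
    path_signs: +1 if the next step goes to the left child, -1 otherwise
    V_prime:    matrix whose rows are the inner-node vectors v'_n
    """
    p = 1.0
    for n, sign in zip(path_nodes, path_signs):
        p *= sigmoid(sign * np.dot(V_prime[n], h))
    return p

# toy usage: 3 inner nodes with 4-dimensional vectors, invented path
rng = np.random.default_rng(0)
V_prime, h = rng.normal(size=(3, 4)), rng.normal(size=4)
p = hs_probability(h, path_nodes=[0, 2], path_signs=[+1, -1], V_prime=V_prime)
```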
Negative Sampling
The idea of negative sampling is that we train the network to increase its estimation of the target word j* and reduce its estimate not of all the words in the vocabulary but just a subset of them W_neg, drawn from an appropriate distribution.
The number of samples is 5-20 for small datasets, 2-5 for large datasets.
\[ E = -\log \sigma\big({v'_{j^*}}^T h\big) - \sum_{j \in W_{\mathrm{neg}}} \log \sigma\big(-{v'_j}^T h\big) \]
Empirically, a good choice of the distribution from which to draw the negative samples is P(w) = U(w)^{3/4}/Z, where U(w) is the unigram distribution of the training corpus and Z is a normalizing constant.
This is a simplified version of Noise Contrastive Estimation (NCE). It is not guaranteed to produce a well-defined probability distribution, but in practice it does produce high-quality word embeddings.
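A minimal sketch (mine) of the negative-sampling cost E; the vocabulary size, vectors and sample indices are toy values:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def negative_sampling_loss(h, Vp, j_star, neg):
    """E = -log sigma(v'_{j*} . h) - sum_{j in W_neg} log sigma(-v'_j . h).

    Vp: matrix whose column j is the output vector v'_j.
    """
    loss = -np.log(sigmoid(Vp[:, j_star] @ h))
    loss -= np.sum(np.log(sigmoid(-(Vp[:, neg].T @ h))))
    return loss

rng = np.random.default_rng(0)
Vp, h = rng.normal(size=(3, 8)), rng.normal(size=3)
E = negative_sampling_loss(h, Vp, j_star=5, neg=[1, 2, 7])
```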
Subsampling of Frequent Words
In order to diminish the influence of more frequent words, each word in the corpus is discarded with probability

\[ P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}} \]
where f(w_i) is the frequency of word w_i and t ∼ 10^{-5} is an empirically determined threshold.
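A small sketch (mine) of the subsampling rule, with t = 10^{-5} as on the slide; the relative frequencies are invented:

```python
import random

def keep_word(word, freq, t=1e-5):
    """Randomly decide whether to keep a word; frequent words are dropped more often."""
    p_discard = max(0.0, 1.0 - (t / freq[word]) ** 0.5)
    return random.random() > p_discard

freq = {"the": 0.05, "sixpence": 1e-6}   # toy relative frequencies
corpus = ["the", "sixpence", "the", "the"]
kept = [w for w in corpus if keep_word(w, freq)]
```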
Sentence Completion Task

(use model to choose which word is most likely to occur in this context)

Q1. Seeing the pictures of our old home made me feel .......... and nostalgic.
A. fastidious
B. indignant
C. wistful
D. conciliatory

Q2. Because the House had the votes to override a presidential veto, the President has no choice but to .......... .
A. object
B. abdicate
C. abstain
D. compromise
Linguistic Regularities
King + Woman - Man ≃ Queen

More generally, A is to B as C is to ??

\[ d = \arg\max_x \frac{(v_c + v_b - v_a)^T v_x}{\| v_c + v_b - v_a \|} \]
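A rough sketch (mine) of the argmax above over a toy vocabulary; the random vectors are purely illustrative, real ones would come from word2vec or GloVe:

```python
import numpy as np

def analogy(vec, a, b, c):
    """Solve 'a is to b as c is to ?': argmax_x of (v_c + v_b - v_a).v_x / ||v_c + v_b - v_a||."""
    target = vec[c] + vec[b] - vec[a]
    target = target / np.linalg.norm(target)
    candidates = {w: v for w, v in vec.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: np.dot(target, candidates[w]))

rng = np.random.default_rng(0)
vec = {w: rng.normal(size=5) for w in ["king", "queen", "man", "woman", "apple"]}
print(analogy(vec, "man", "woman", "king"))   # with trained vectors this would return "queen"
```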
Word Analogy Task

Q1. evening is to morning as dinner is to .......... .
A. breakfast
B. soup
C. coffee
D. time
Q2. bow is to arrow as .......... is to bullet
A. defend
B. lead
C. shoot
D. gun

Capital Cities

Word Analogies
Word Relationships
References
T. Mikolov, K. Chen, G. Corrado & J. Dean, 2013. “Efficient estimation of word representations in vector space”, arXiv preprint arXiv:1301.3781.
T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado & J. Dean, 2013. "Distributed representations of words and phrases and their compositionality", NIPS 2013, 3111-19.
Xin Rong, 2014. “word2vec parameter learning explained.”, arXiv:1411.2738.
https://nlp.stanford.edu/projects/glove/
https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-gpus-part-3/