Sparse and dense embeddings as input to neural networks
Input to feedforward neural networks
- A bag-of-words model where the input x is the count of each word (feature) x_i.
- The connections from word (feature) i to each of the hidden units z_k form a vector θ^(x→z)_i that is sometimes described as the embedding of word i.
- With sufficient training data, word embeddings can be learned within the same network as the classification task.
- Pretrained word embeddings can instead be learned separately from unlabeled data, using techniques such as Word2Vec and GLOVE.
- Contextualized word embeddings (e.g., ELMO, BERT) are computed dynamically for a word sequence. This requires more advanced architectures (Transformers) that we will talk about later in the course.
One-hot encodings for features
A one-hot encoding is one in which each dimension corresponds to a unique feature, and the resulting feature vector of a classification instance can be thought of as the sum of indicator feature vectors in which a single dimension has a value of one while all others have a value of zero.
Example:
Consider a bag-of-words representation over a vocabulary of 40,000 words. A short document of 20 words will be represented with a very sparse 40,000-dimensional vector in which at most 20 dimensions have non-zero values.
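A minimal Python sketch of such a representation (the five-word vocabulary here is made up for illustration; a real one would have tens of thousands of entries):

import numpy as np

# hypothetical toy vocabulary; a real one would have ~40,000 entries
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

def bag_of_words(tokens, vocab):
    """Sum of one-hot vectors = count of each vocabulary word."""
    x = np.zeros(len(vocab))
    for tok in tokens:
        if tok in vocab:
            x[vocab[tok]] += 1
    return x

print(bag_of_words(["the", "cat", "sat", "on", "the", "mat"], vocab))
# -> [2. 1. 1. 1. 1.]   (very sparse once the vocabulary is large)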
Sparse vectors for text classification
Sparse vectors for text classification can be viewed as a summation of one-hot features for a text instance:
  [0 0 0 1 0 0 0 0]
+ [0 1 0 0 0 0 0 0]
+ [0 0 0 0 0 0 1 0]
= [0 1 0 1 0 0 1 0]
Shortcomings of sparse representations
- Each feature is a sparse vector in which one dimension is 1 and the rest are 0s (thus "one-hot").
- The dimensionality of a one-hot vector is the same as the number of distinct features.
- Features are completely independent of one another. The feature "word is 'dog'" is as dissimilar to "word is 'thinking'" as it is to "word is 'cat'".
- Features for a single classification instance can only be combined by summation.
- A recent trend is to use dense representations that can capture similarities between features, which leads to better generalization to new data.
Dense vectors for text classification
- Extract a set of linguistic features f_1, ..., f_k that are relevant for predicting the output class.
- For each feature f_i of interest, retrieve the corresponding vector v_i, which can be pre-trained, pre-computed, or randomly initialized.
- Each core feature is embedded into a d-dimensional space (typically d = 50-600) and represented as a vector in that space.
- Combine the vectors (by concatenation, summation, or a combination of both) into an input vector x for a classification instance (see the sketch after this list).
- Note: use concatenation if we care about relative position, but it does not work for variable-length inputs such as the documents in document classification.
- Model training will cause similar features to have similar vectors – information is shared between similar features.
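A minimal PyTorch sketch of this recipe (the sizes and feature ids are illustrative, not from the slides): each feature id is mapped to a d-dimensional vector with nn.Embedding, and the vectors are combined by summation or concatenation.

import torch
import torch.nn as nn

vocab_size, d = 10000, 100          # illustrative sizes
emb = nn.Embedding(vocab_size, d)   # one d-dimensional vector per feature

# a classification instance described by 3 feature ids (e.g., word ids)
feature_ids = torch.tensor([17, 42, 371])
vectors = emb(feature_ids)          # shape (3, d)

x_sum = vectors.sum(dim=0)          # order-insensitive, works for any number of features
x_cat = vectors.flatten()           # concatenation keeps relative position,
                                    # but requires a fixed number of features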
Relationship between one-hot and dense vectors
- Dense representations are typically pre-computed or pre-trained word embeddings.
- One-hot and dense representations may not be as different as one might think.
- In fact, using sparse, one-hot vectors as input when training a neural network amounts to dedicating the first layer of the network to learning a dense embedding vector for each feature based on the training data.
- With task-specific word embeddings, the training set is typically smaller, but the training objective for the embedding and the task objective are one and the same.
- With pre-trained word embeddings, the training data is easy to come by (just unannotated text), but the embedding objective and the task objective may diverge.
Two ways of obtaining dense word vectors
- Count-based methods, known in NLP as Distributional Semantic Models (DSM) or Vector Semantic Models (VSM).
- Predictive methods, originating from the neural network community, aimed at producing distributed representations for words, commonly known as word embeddings.
- Distributed word representations were initially a by-product of neural language models and later became a task in their own right.
Distributional semantics
- Based on the well-known observation of Z. Harris: words are similar if they occur in the same contexts (Harris, 1954).
- Later summarized as a slogan: "You shall know a word by the company it keeps." (J. R. Firth, 1957)
- There is a long history of using word-context matrices to represent word meaning, where each row is a word and each column represents a context word it can occur with in a corpus.
- Each word is represented as a sparse vector in a high-dimensional space.
- Word distances and similarities can then be computed from such a matrix.
Steps for building a distributional semantic model
- Preprocess a (large) corpus: tokenization at a minimum, possibly lemmatization, POS tagging, or syntactic parsing.
- Define the "context" for a target term (a word or phrase). The context can be a window centered on the target term, or terms that are syntactically related to it (subject-of, object-of, etc.).
- Compute a term-context matrix where each row corresponds to a target term and each column corresponds to a context term.
- Each target term is then represented with a high-dimensional vector over context terms.
Mathematical processing for building a DSM
- Weight the term-context matrix with an association-strength metric such as Positive Pointwise Mutual Information (PPMI) to correct for frequency bias:
  PPMI(x, y) = max( log [ p(x,y) / (p(x) p(y)) ], 0 )
- The dimensionality of the matrix can also be reduced with matrix factorization techniques such as singular value decomposition (SVD):
  A = U Σ V^T,  with A ∈ R^{m×n}, U ∈ R^{m×k}, Σ ∈ R^{k×k}, V ∈ R^{n×k}, n >> k
- This results in a matrix that has much lower dimension but retains most of the information of the original matrix.
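A minimal numpy sketch of these two steps, assuming a term-context count matrix C has already been built as described above (the count matrix here is random, just to make the snippet runnable):

import numpy as np

def ppmi(C):
    """Positive PMI weighting of a term-context count matrix C."""
    total = C.sum()
    p_xy = C / total
    p_x = C.sum(axis=1, keepdims=True) / total   # row (term) marginals
    p_y = C.sum(axis=0, keepdims=True) / total   # column (context) marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0                 # log(0) cells -> 0
    return np.maximum(pmi, 0.0)

def reduce_dim(M, k):
    """Truncated SVD: keep the top-k singular directions as dense term vectors."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]                      # each row is a k-dimensional term vector

C = np.random.poisson(1.0, size=(50, 200)).astype(float)  # stand-in counts
vectors = reduce_dim(ppmi(C), k=10)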
Getting pre-trained word embeddings using predictive methods
- Learn word embeddings from large amounts of naturally occurring text, using various language-model objectives:
- Decide on the context window.
- Define the objective function used to predict the context words based on the target word, or to predict the target word based on the context.
- Train the neural network; the resulting weight matrix will serve as the vector representation of the target words.
- "Don't count, predict!" (Baroni et al., 2014) conducted systematic studies and found that predict-based word embeddings outperform count-based embeddings.
- One of the most popular early families of word embeddings is word2vec.
Word2vec
- Word2vec is a software package that consists of two main models: CBOW (Continuous Bag of Words) and Skip-gram.
- It popularized the use of distributed representations as input to neural networks in natural language processing, and inspired many follow-on works (e.g., GLOVE, ELMO, BERT, XLNet).
- It has its roots in language modeling (the use of window-based context to predict the target word), but gives up the goal of getting good language models and focuses instead on getting good word embeddings.
Understanding word2vec: a simple CBOW model with only one context word input
- The input x ∈ R^V is a one-hot vector where x_k = 1 and x_{k'} = 0 for k' ≠ k. Θ ∈ R^{N×V} is the weight matrix from the input layer to the hidden layer. Each column of Θ is an N-dimensional vector representation v_w of the associated word of the input layer:
  z = Θx := v_{w_i}
- Θ' ∈ R^{V×N} is the matrix from the hidden layer to the output layer, and u_{w_j} is the j-th row of Θ'. A "similarity" score o_j for each target word w_j and context word w_i can be computed as:
  o_j = u_{w_j}^T v_{w_i}
- Finally, we use the softmax to obtain a posterior distribution
  p(w_j | w_i) = y_j = exp(o_j) / Σ_{j'=1}^{V} exp(o_{j'})
  where y_j is the output of the j-th unit in the output layer.
A simple CBOW model
Computing the hidden layer is just embedding lookup
The hidden-layer computation "retrieves" v_{w_i}. With the one-hot input x = [0 0 1 0 0 0 0 0 0 0]^T:

v_{w_i} = z = Θx
        = [0.1 0.3 0.5 0.4 0.6 0.1 0.3 0.5 0.4 0.6]
          [0.2 0.5 0.8 0.7 0.9 0.4 0.8 0.2 0.5 0.1]  ×  x
          [0.2 0.5 0.8 0.7 0.9 0.4 0.8 0.2 0.5 0.1]
        = [0.5 0.8 0.8]^T
Note there is no activation at the hidden layer (or there is a linear activation function), so this is a “degenerate neural network”.
Computing the output layer
o = Θ'z =
  [0.3 0.4 0.6]                [0.95]
  [0.7 0.1 0.6]                [0.91]
  [0.5 0.2 0.7]                [0.97]
  [0.2 0.6 0.3]                [0.82]
  [0.6 0.5 0.6]    [0.5]       [1.18]
  [0.3 0.1 0.1]  × [0.8]   =   [0.31]
  [0.2 0.4 0.8]    [0.8]       [1.06]
  [0.3 0.2 0.1]                [0.39]
  [0.3 0.4 0.6]                [0.95]
  [0.3 0.5 0.1]                [0.63]

Each row of Θ' corresponds to the output vector of a target word w_j.
Taking the softmax over the output
y = softmax(o) = [0.11039215 0.10606362 0.11262222 0.09693485 0.13893957 0.05820895 0.12322834 0.063057 0.11039215 0.08016116]^T

The output y is a probability distribution over the entire vocabulary. (The numpy sketch below reproduces this forward pass.)
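The forward pass of this toy example can be reproduced in a few lines of numpy (Θ and Θ' are the matrices from the preceding slides):

import numpy as np

Theta = np.array([[0.1, 0.3, 0.5, 0.4, 0.6, 0.1, 0.3, 0.5, 0.4, 0.6],
                  [0.2, 0.5, 0.8, 0.7, 0.9, 0.4, 0.8, 0.2, 0.5, 0.1],
                  [0.2, 0.5, 0.8, 0.7, 0.9, 0.4, 0.8, 0.2, 0.5, 0.1]])
Theta_prime = np.array([[0.3, 0.4, 0.6], [0.7, 0.1, 0.6], [0.5, 0.2, 0.7],
                        [0.2, 0.6, 0.3], [0.6, 0.5, 0.6], [0.3, 0.1, 0.1],
                        [0.2, 0.4, 0.8], [0.3, 0.2, 0.1], [0.3, 0.4, 0.6],
                        [0.3, 0.5, 0.1]])

x = np.zeros(10); x[2] = 1.0         # one-hot input word (index 2, i.e. the third word)
z = Theta @ x                        # hidden layer = embedding lookup -> [0.5 0.8 0.8]
o = Theta_prime @ z                  # scores -> [0.95 0.91 0.97 ... 0.63]
y = np.exp(o) / np.exp(o).sum()      # softmax -> [0.110 0.106 0.113 ...]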
Input vector and output vector
Since there is no activation function at the hidden layer, the output is really just the dot product of the vector of the input context word and the vector of the output target word.
p(w_j | w_i) = y_j = exp(o_j) / Σ_{j'=1}^{V} exp(o_{j'}) = exp(u_{w_j}^T v_{w_i}) / Σ_{j'=1}^{V} exp(u_{w_{j'}}^T v_{w_i})

where v_{w_i} from Θ is the input vector for word w_i and u_{w_j} from Θ' is the output vector for word w_j.
Computing the gradient on the hidden-output weights
- Use the familiar cross-entropy loss:
  ℓ = -Σ_{j=1}^{|V|} t_j log y_j = -log y_{j*}
  where j* is the index of the target word.
- Given that y_j is the output of a softmax function, the gradient on the output is:
  ∂ℓ/∂o_j = y_j - t_j
  ∂ℓ/∂θ'_{ji} = (∂ℓ/∂o_j)(∂o_j/∂θ'_{ji}) = (y_j - t_j) z_i
- Update the hidden→output weights:
  θ'_{ji} = θ'_{ji} - η (y_j - t_j) z_i
Updating the input→hidden weights
- Compute the error at the hidden layer:
  ∂ℓ/∂z_i = Σ_{j=1}^{V} (∂ℓ/∂o_j)(∂o_j/∂z_i) = Σ_{j=1}^{V} (y_j - t_j) θ'_{ji}
- Since z_i = Σ_{k=1}^{V} θ_{ik} x_k, the derivative of ℓ with respect to the input→hidden weights is:
  ∂ℓ/∂θ_{ik} = (∂ℓ/∂z_i)(∂z_i/∂θ_{ik}) = Σ_{j=1}^{V} (y_j - t_j) θ'_{ji} x_k
- Update the input→hidden weights:
  θ_{ik} = θ_{ik} - η Σ_{j=1}^{V} (y_j - t_j) θ'_{ji} x_k
Gradient computation in matrix form
D_o = Y - T
    = [0.11039215 0.10606362 0.11262222 0.09693485 0.13893957 0.05820895 0.12322834 0.063057 0.11039215 0.08016116]^T
      - [0 0 1 0 0 0 0 0 0 0]^T
    = [0.11039215 0.10606362 -0.88737778 0.09693485 0.13893957 0.05820895 0.12322834 0.063057 0.11039215 0.08016116]^T
Computing the errors at the hidden layer
D_z = D_o^T Θ'
    = [0.110 0.106 -0.887 0.097 0.139 0.058 0.123 0.063 0.110 0.080] × Θ'
    = [-0.115 0.157 -0.194]
Computing the updates to Θ'

∇Θ' = D_o z^T =
  [ 0.110]                       [ 0.055  0.088  0.088]
  [ 0.106]                       [ 0.053  0.085  0.085]
  [-0.887]                       [-0.443 -0.710 -0.710]
  [ 0.097]                       [ 0.048  0.078  0.078]
  [ 0.139]                       [ 0.069  0.111  0.111]
  [ 0.058]  ×  [0.5 0.8 0.8]  =  [ 0.029  0.047  0.047]
  [ 0.123]                       [ 0.062  0.099  0.099]
  [ 0.063]                       [ 0.032  0.050  0.050]
  [ 0.110]                       [ 0.055  0.088  0.088]
  [ 0.080]                       [ 0.040  0.064  0.064]
Computing the update to Θ

∇Θ^T = x D_z = [0 0 1 0 0 0 0 0 0 0]^T × [-0.11538456 0.15687943 -0.19388611]

This is a 10 × 3 matrix whose only non-zero row is the row for the input word:

  row 3 = [-0.11538456 0.15687943 -0.19388611]

so only the column of Θ that corresponds to the input word receives a non-zero update. (The numpy sketch below reproduces these matrix-form gradients.)
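Continuing the numpy sketch of the forward pass (it reuses Theta, Theta_prime, x, z, and y defined there), the matrix-form gradients and one update step look roughly like this; the learning rate is an illustrative choice:

# continuing the forward-pass sketch above
t = np.zeros(10); t[2] = 1.0         # one-hot target word
D_o = y - t                          # output error; the third entry is -0.887...
D_z = Theta_prime.T @ D_o            # hidden-layer error -> [-0.115  0.157 -0.194]

grad_Theta_prime = np.outer(D_o, z)  # gradient on the hidden->output weights
grad_Theta = np.outer(D_z, x)        # only the input word's column is non-zero

eta = 0.1                            # illustrative learning rate
Theta_prime -= eta * grad_Theta_prime
Theta -= eta * grad_Theta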
CBOW for multiple context words
z = (1/M) Θ (x_1 + x_2 + ... + x_M)
  = (1/M) (v_{w_1} + v_{w_2} + ... + v_{w_M})

where M is the number of words in the context, w_1, w_2, ..., w_M are the words in the context, and v_w is an input vector. The loss function is

ℓ = -log p(w_j | w_1, w_2, ..., w_M)
  = -o_{j*} + log Σ_{j'=1}^{V} exp(o_{j'})
  = -u_{w_{j*}}^T z + log Σ_{j'=1}^{V} exp(u_{w_{j'}}^T z)
Computing the hidden layer for multiple context words
z = Θx
  = [0.1 0.3 0.5 0.4 0.6 0.1 0.3 0.5 0.4 0.6]
    [0.2 0.5 0.8 0.7 0.9 0.4 0.8 0.2 0.5 0.1]  ×  [0 0 1 0 1 0 1 0 1 0]^T
    [0.2 0.5 0.8 0.7 0.9 0.4 0.8 0.2 0.5 0.1]
  = [1.8 3.0 3.0]^T

where x is the sum of the one-hot vectors of the four context words (divide by M = 4 to obtain the averaged hidden layer). During backpropagation, the input vectors of four words are updated instead of just one.
Skip-gram: model
p(w_{c,j} = w_{O,c} | w_I) = y_{c,j} = exp(o_{c,j}) / Σ_{j'=1}^{V} exp(o_{j'})

where w_{c,j} is the j-th word on the c-th panel of the output layer; w_{O,c} is the actual c-th word among the output context words; w_I is the only input word; y_{c,j} is the output of the j-th unit on the c-th panel of the output layer; and o_{c,j} is the net input of the j-th unit on the c-th panel of the output layer. Because all panels share the same weights,

o_{c,j} = o_j = u_{w_j} · z,  for c = 1, 2, ..., C
Skip-gram: loss function
ℓ = -log p(w_{O,1}, w_{O,2}, ..., w_{O,C} | w_I)
  = -log Π_{c=1}^{C} [ exp(o_{c,j_c*}) / Σ_{j'=1}^{V} exp(o_{j'}) ]
  = -Σ_{c=1}^{C} o_{c,j_c*} + C × log Σ_{j'=1}^{V} exp(o_{j'})

where j_c* is the index of the actual c-th output context word. The probabilities of the C context words are combined by multiplication, so their log-losses add. Note: o_{j'} is the same for all C panels.
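A small self-contained numpy check of this loss, using the scores o from the earlier CBOW toy example and made-up context-word indices:

import numpy as np

o = np.array([0.95, 0.91, 0.97, 0.82, 1.18, 0.31, 1.06, 0.39, 0.95, 0.63])  # shared scores
context_idx = [1, 4, 6, 8]            # illustrative indices of the C output context words
C = len(context_idx)
loss = -o[context_idx].sum() + C * np.log(np.exp(o).sum())
print(loss)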
Skip-gram: updating the weights
- We take the derivative of ℓ with respect to the net input of every unit on every panel of the output layer, o_{c,j}, and obtain
  e_{c,j} = ∂ℓ/∂o_{c,j} = y_{c,j} - t_{c,j}
  which is the prediction error of the unit.
- We define a V-dimensional vector E = {E_1, ..., E_V} as the sum of the prediction errors over the context words:
  E_j = Σ_{c=1}^{C} e_{c,j}
- Then
  ∂ℓ/∂θ'_{ji} = Σ_{c=1}^{C} (∂ℓ/∂o_{c,j})(∂o_{c,j}/∂θ'_{ji}) = E_j · z_i
- Updating the hidden→output weight matrix:
  θ'_{ji} = θ'_{ji} - η · E_j · z_i
- There is no change in how the input→hidden weights are updated (E_j takes the place of y_j - t_j).
Additional sources on the skip-gram model
- For a step-by-step derivation of the Skip-gram model, here is an excellent tutorial:
  https://aegis4048.github.io/demystifying_neural_network_in_skip_gram_language_modeling
Optimizing computational efficiency
- Computing the softmax at the output layer is expensive: it involves iterating over the entire vocabulary.
- There are two methods for optimizing computational efficiency:
- Hierarchical softmax: an alternative way to compute the probability of a word that reduces the computational complexity from |V| to log|V|.
- Negative sampling: instead of updating the weights for all the words in the vocabulary, only sample a small number of words that are not actual context words in the training corpus.
Hierarchical softmax
[Figure: a binary (Huffman) tree over the vocabulary (Australia, penguin, watch, koala, kangaroo, tree, eat, leaf), with internal nodes n9-n15. Each branch carries a probability, and the probability of a leaf word is the product of the branch probabilities on the path from the root to that leaf.]
Computing the probabilities of the leaf nodes
P("kangaroo" | z) = P_{n15}(Left | z) × P_{n13}(Left | z) × P_{n9}(Right | z)
P_n(Right | z) = 1 - P_n(Left | z)
P_n(Left | z) = σ(θ_n^T z)

where θ_n is a vector from a set of new parameters that replace the output matrix Θ'.
Huffman Tree Building
A simple algorithm:
1. Prepare a collection of n initial Huffman trees, each of which is a single leaf node. Put the n trees onto a priority queue organized by weight (frequency).
2. Remove the two trees with the lowest weight. Join them to create a new tree whose root has the two trees as children and whose weight is the sum of the weights of the two children. Put this new tree back into the priority queue.
3. Repeat step 2 until all of the partial Huffman trees have been combined into one.
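A minimal Python sketch of this algorithm using heapq as the priority queue (the word frequencies are made up):

import heapq
import itertools

def build_huffman(freqs):
    """freqs: dict word -> frequency. Returns the root of the Huffman tree
    as nested (left, right) tuples with words at the leaves."""
    counter = itertools.count()          # tie-breaker so the heap never compares trees
    heap = [(f, next(counter), w) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)  # the two trees with the lowest weight
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (t1, t2)))
    return heap[0][2]

print(build_huffman({"koala": 5, "kangaroo": 3, "eat": 10, "leaf": 1}))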
Negative sampling
- Computing the softmax over the vocabulary is expensive. An alternative is to approximate the softmax by only updating the weights for a small sample of (context) words at a time.
- Given a pair of words (w, c), let P(D = 1 | w, c) be the probability that the pair of words came from the training corpus, and P(D = 0 | w, c) the probability that the pair did not come from the corpus.
- This probability can be modeled as a sigmoid function:
  P(D = 1 | w, c) = σ(u_w^T v_c) = 1 / (1 + e^{-u_w^T v_c})
New learning objective for negative sampling
- We need a new objective for negative sampling, which is to minimize the following loss function:
  L = -Σ_{w_j ∈ D} log σ(o_{w_j}) - Σ_{w_j ∈ D'} log σ(-o_{w_j})
  where D is a set of correct context-target word pairs and D' is a set of incorrect context-target word pairs.
- Note that we sum over the positive samples as well as over the negative samples. In the Skip-gram algorithm there will be multiple positive context words in the output; in the CBOW algorithm there will be only one positive target word.
- The derivative of the loss function with respect to the output score is:
  ∂ℓ/∂o_{w_j} = σ(o_{w_j}) - t_{w_j}
  where t_{w_j} = 1 if w_j ∈ D and t_{w_j} = 0 if w_j ∈ D'.
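A minimal numpy sketch of this objective; u_pos and u_neg stand for the output vectors of the positive words (D) and the sampled negative words (D'), and all names and shapes here are illustrative:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(z, u_pos, u_neg):
    """z: hidden vector (N,); u_pos: (P, N) positive output vectors;
    u_neg: (K, N) negative output vectors."""
    o_pos = u_pos @ z                 # scores of words in D
    o_neg = u_neg @ z                 # scores of words in D'
    return -np.log(sigmoid(o_pos)).sum() - np.log(sigmoid(-o_neg)).sum()

z = np.random.rand(3); u_pos = np.random.rand(1, 3); u_neg = np.random.rand(5, 3)
print(neg_sampling_loss(z, u_pos, u_neg))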
Updates to the hidden→output weights
- Compute the gradient on the output weights:
  ∂ℓ/∂u_{w_j} = (σ(o_{w_j}) - t_{w_j}) z
- When updating the output weights, only the weight vectors for the words in the positive and negative samples need to be updated:
  u_{w_j} = u_{w_j} - η (σ(o_{w_j}) - t_{w_j}) z,  for w_j ∈ D ∪ D'
Updates to the input→hidden weights
- Compute the derivative of the loss function with respect to the hidden layer:
  ∂ℓ/∂z = Σ_{w_j ∈ D ∪ D'} (σ(o_{w_j}) - t_{w_j}) u_{w_j}
- In the CBOW algorithm, the weights for all input context words will be updated. In the Skip-gram algorithm, only the weights for the input target word will be updated:
  v_{w_i} = v_{w_i} - η Σ_{w_j ∈ D ∪ D'} (σ(o_{w_j}) - t_{w_j}) u_{w_j} x_i
How to pick the negative samples?
- If we just randomly pick a word from the corpus, the probability of any given word w_i getting picked is:
  p(w_i) = freq(w_i) / Σ_{j=1}^{V} freq(w_j)
  More frequent words will be more likely to be picked, which may not be ideal.
- Adjust the formula to give less frequent words a bit more chance of being picked:
  p(w_i) = freq(w_i)^{3/4} / Σ_{j=1}^{V} freq(w_j)^{3/4}
- Generate a sequence of words using the adjusted probability, and randomly pick n_{D'} words from it.
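A minimal numpy sketch of the adjusted sampling distribution (the frequencies are made up):

import numpy as np

freq = np.array([100, 50, 10, 5, 1], dtype=float)   # made-up word frequencies
p = freq ** 0.75
p /= p.sum()                 # adjusted distribution, less skewed than raw counts

n_neg = 3
negative_samples = np.random.choice(len(freq), size=n_neg, p=p)
print(negative_samples)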
Use of embeddings: word and short document similarity
- Word embeddings can be used to compute word similarity with the cosine similarity:
  sim_cos(u, v) = u · v / (‖u‖_2 ‖v‖_2)
- How accurately they compute word similarity can also be used to evaluate word embeddings.
- They can also be used to compute the similarity of short documents:
  sim_doc(D_1, D_2) = Σ_{i=1}^{m} Σ_{j=1}^{n} cos(w_i^1, w_j^2)
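A numpy sketch of both similarity measures; the vectors here are random stand-ins for real word embeddings:

import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def doc_similarity(D1, D2):
    """D1: (m, d) and D2: (n, d) matrices of word vectors for two short documents."""
    return sum(cos(w1, w2) for w1 in D1 for w2 in D2)

u, v = np.random.rand(50), np.random.rand(50)
print(cos(u, v))
print(doc_similarity(np.random.rand(5, 50), np.random.rand(7, 50)))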
Use of embeddings: word analogy
- What is even more impressive is that they can be used to compute word analogies:
  analogy(m : w → k : ?) = argmax_{v ∈ V \ {m,k,w}} cos(v, k - m + w)
  analogy(m : w → k : ?) = argmax_{v ∈ V \ {m,k,w}} [ cos(v, k) - cos(v, m) + cos(v, w) ]
  analogy(m : w → k : ?) = argmax_{v ∈ V \ {m,k,w}} cos(v, k) cos(v, w) / (cos(v, m) + ε)
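A sketch of the first (additive) variant, assuming a matrix E of word vectors plus word2id/id2word lookup tables (all of these names are assumptions for illustration):

import numpy as np

def analogy(m, w, k, E, word2id, id2word):
    """Return the word v maximizing cos(v, k - m + w), excluding m, w, k themselves."""
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    target = E_norm[word2id[k]] - E_norm[word2id[m]] + E_norm[word2id[w]]
    scores = E_norm @ (target / np.linalg.norm(target))   # cosine with every word
    for i in (word2id[m], word2id[w], word2id[k]):
        scores[i] = -np.inf                                # exclude the query words
    return id2word[int(np.argmax(scores))]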
Use of word embeddings
- Computing word similarities is not a "real" problem in the eyes of many.
- The most important use of word embeddings is as input for predicting the outcomes of tasks that have real-world applications.
- There is much follow-on work on developing more effective word embeddings, e.g., GLOVE.
- word2vec: http://vectors.nlpl.eu/repository
- fasttext: https://fasttext.cc/docs/en/english-vectors.html
- GLOVE: https://nlp.stanford.edu/projects/glove
Shortcoming of “per-type” word embeddings
- "Per-type" word embeddings learned via models like word2vec do not account for the meanings of words in context:
- "Work out the solution in your head."
- "Heat the solution to 75 degrees Celsius."
- Having the same embedding for both instances of "solution" doesn't make sense.
- The solution is contextualized word embeddings, which are generated on the fly given the entire sentence as input. The same word will have different embeddings if it occurs in different sentences.
- ELMO: https://allennlp.org/elmo
- BERT: https://github.com/google-research/bert
- RoBERTa: https://pytorch.org/hub/pytorch_fairseq_roberta
- Contextualized word embeddings can be fine-tuned when used in a new classification task, in a process called transfer learning.
- This turns out to be a very powerful idea that has led to many breakthroughs.
Embeddings in Pytorch
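A minimal sketch of how the lookup described in these slides is typically done in PyTorch: nn.Embedding stores one trainable vector per word id (playing the role of Θ), and a pre-trained matrix such as word2vec or GLOVE vectors can be loaded into it. The sizes and ids below are illustrative.

import torch
import torch.nn as nn

vocab_size, d = 10, 3
emb = nn.Embedding(vocab_size, d)        # rows of emb.weight play the role of Theta's columns

word_ids = torch.tensor([2, 4, 6, 8])    # word indices instead of one-hot vectors
vectors = emb(word_ids)                  # shape (4, 3): one vector per word
cbow_hidden = vectors.mean(dim=0)        # average of the context vectors, as in CBOW

# loading pre-trained vectors into the layer (stand-in matrix shown here):
pretrained = torch.randn(vocab_size, d)
emb_pre = nn.Embedding.from_pretrained(pretrained, freeze=False)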