COMP9444 Neural Networks and Deep Learning Quiz 6 (Word Vectors)
This is an optional quiz to test your understanding of Word Vectors from Week 5.
1. What are the potential benefits of continuous word representations compared to synonyms or taxonomies?
Synonym, antonym and taxonomy resources require human effort to build, may be incomplete, and force discrete choices. Continuous representations have the potential to capture gradations of meaning and more fine-grained relationships between words, and can be extracted automatically without human involvement.
2. What is meant by the Singular Value Decomposition of a matrix X? What are the special properties of the component matrices? What is the time complexity for computing it?
The Singular Value Decomposition of $X$ is $X = U S V^T$, where $U$ and $V$ are unitary (their columns are orthonormal) and $S$ is diagonal with all entries $\geq 0$.
The time to compute it is proportional to $L \times M^2$ if $X$ is $L$-by-$M$ and $L \geq M$.
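As an illustration of these properties, here is a minimal NumPy sketch; the matrix $X$ and its dimensions are invented for the example:

```python
import numpy as np

# Minimal sketch: SVD of a random L-by-M matrix (L >= M).
L, M = 6, 4
X = np.random.rand(L, M)

# With full_matrices=False, numpy returns U (L x M), the singular
# values S (length M), and V^T (M x M).
U, S, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(M))      # columns of U are orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(M))    # columns of V are orthonormal
assert np.all(S >= 0)                       # diagonal entries of S are >= 0
assert np.allclose(U @ np.diag(S) @ Vt, X)  # X = U S V^T
```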
3. What cost function is used to train the word2vec skip-gram model? (remember to define any symbols you use)
If the text is $w_1 \ldots w_T$ then the cost function is
$$E = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \,\le\, r \,\le\, c,\; r \neq 0} \log\, \mathrm{prob}(w_{t+r} \mid w_t)$$
where $c$ is the size of the context window.
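A toy sketch of computing this cost, assuming $\mathrm{prob}(w_{t+r} \mid w_t)$ is given by a softmax over dot products of word vectors (all sizes and values here are invented):

```python
import numpy as np

np.random.seed(0)
V, d, c = 10, 5, 2                   # vocabulary size, embedding dim, window
W_in  = np.random.randn(V, d) * 0.1  # "input" vectors, one per word
W_out = np.random.randn(V, d) * 0.1  # "output" vectors, one per word
text  = [3, 1, 4, 1, 5, 9, 2, 6]     # words w_1 .. w_T encoded as indices

def log_prob(context, centre):
    # log softmax over the whole vocabulary for the centre word
    scores = W_out @ W_in[centre]
    return scores[context] - np.log(np.sum(np.exp(scores)))

T, cost = len(text), 0.0
for t in range(T):
    for r in range(-c, c + 1):
        if r != 0 and 0 <= t + r < T:
            cost -= log_prob(text[t + r], text[t])
print(cost / T)
```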
4. Explain why full softmax may not be computationally feasible for word-based language processing tasks.
The number of outputs is equal to the total number of words in the lexicon (approximately 60,000) and all of them would need to be evaluated at every step.
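A rough illustration of the per-step work; the lexicon size and embedding dimension are assumptions for the example:

```python
import numpy as np

V, d = 60_000, 300                  # lexicon size, hidden layer size
W_out = np.random.randn(V, d) * 0.01
h = np.random.randn(d)              # hidden unit activations

scores = W_out @ h                  # V x d multiply-adds on EVERY step
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # normalize over all 60,000 words
print(probs.shape)                  # (60000,)
```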
5. Write the formula for Hierarchical Softmax and explain the meaning of all the symbols.
$$\mathrm{prob}(w) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\, n(w,j{+}1) = \mathrm{child}(n(w,j)) \,] \; {v'_{n(w,j)}}^{T} h \right)$$
where $n(w,1), \ldots, n(w,L(w))$ are the nodes along the path in a binary tree
from the root to $w$,
$h$ = hidden unit activations, $\sigma(u) = 1/(1 + \exp(-u))$, and
$[\, n' = \mathrm{child}(n) \,] = +1$ if $n'$ is the left child of node $n$; $-1$ otherwise.
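A minimal sketch of evaluating this product for one word, assuming the inner nodes on the path and the left/right turns have already been read off the tree (the tree itself is not built here; all values are invented):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def hierarchical_prob(path_nodes, turns, V_inner, h):
    """path_nodes: inner nodes n(w,1)..n(w,L(w)-1) on the path to w;
       turns: +1 where the path takes a left child, -1 otherwise."""
    p = 1.0
    for n, s in zip(path_nodes, turns):
        p *= sigmoid(s * (V_inner[n] @ h))  # sigma([.] v'_{n(w,j)}^T h)
    return p

np.random.seed(1)
d = 5
V_inner = np.random.randn(7, d) * 0.1  # one vector v'_n per inner node
h = np.random.randn(d)                 # hidden unit activations
print(hierarchical_prob([0, 1, 4], [+1, -1, +1], V_inner, h))
```

Because the left and right branch probabilities at each node sum to 1, the leaf probabilities sum to 1 over the whole vocabulary while only $L(w)-1$ sigmoids are evaluated per word.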
6. Write the formula for Negative Sampling and explain the meaning of all the symbols.
$$E = -\log \sigma\!\left({v'_{j^*}}^{T} h\right) - \sum_{j \in \mathcal{W}_{\mathrm{neg}}} \log \sigma\!\left(-{v'_j}^{T} h\right)$$
where $h$ = hidden unit activations, $v'_j$ = output vector for word $j$, $j^*$ = target word, and $\mathcal{W}_{\mathrm{neg}}$ = set of negative examples drawn from some distribution.
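A minimal sketch of this cost for one training step, assuming the target index and the negative examples have already been drawn:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def neg_sampling_cost(j_star, neg, W_out, h):
    cost = -np.log(sigmoid(W_out[j_star] @ h))    # push the target word up
    for j in neg:
        cost -= np.log(sigmoid(-(W_out[j] @ h)))  # push each negative down
    return cost

np.random.seed(2)
V, d = 10, 5
W_out = np.random.randn(V, d) * 0.1  # output vectors v'_j
h = np.random.randn(d)               # hidden unit activations
print(neg_sampling_cost(j_star=3, neg=[7, 1, 8], W_out=W_out, h=h))
```

Only the target and the few sampled negatives are evaluated, instead of all outputs as in the full softmax.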
7. From what probability distribution are the negative examples normally drawn?
$$P(w) = U(w)^{3/4} / Z$$
where $U(w)$ = unigram distribution determined by the frequency of each word in the training corpus, and
$Z$ = normalizing constant.
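A minimal sketch of drawing negative samples from this distribution; the word counts are invented for the example:

```python
import numpy as np

np.random.seed(3)
counts = np.array([50, 30, 10, 5, 3, 2], dtype=float)  # toy word counts
U = counts / counts.sum()   # unigram distribution U(w)
P = U ** 0.75               # U(w)^{3/4}
P /= P.sum()                # divide by the normalizing constant Z

print(np.random.choice(len(P), size=5, p=P))  # 5 negative example indices
```

Raising $U(w)$ to the $3/4$ power flattens the distribution, so rare words are sampled somewhat more often than their raw frequency would suggest.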