Vector Semantics
Lizhen Qu
Recap
• Language model:
P(x_1, x_2, \ldots, x_l) = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_{i-1})
– Kneser-Ney smoothing.
– Stupid Backoff:
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{count(w_{i-k+1}^{i})}{count(w_{i-k+1}^{i-1})} & \text{if } count(w_{i-k+1}^{i}) > 0 \\
0.4\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}
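For reference, a minimal Python sketch of the Stupid Backoff score above; the n-gram `counts` dictionary and the unigram base case are illustrative assumptions, while the 0.4 back-off weight comes from the formula.

```python
def stupid_backoff(counts, words, i, k, alpha=0.4):
    """Score S(w_i | w_{i-k+1}^{i-1}): relative frequency of the k-gram,
    backing off to the (k-1)-gram with weight alpha when the k-gram is unseen.
    `counts` maps n-gram tuples to corpus frequencies (assumed; unigrams as 1-tuples).
    Assumes i >= k - 1 so the context slice is valid."""
    if k == 1:
        # base case: unigram relative frequency
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts.get((words[i],), 0) / max(total, 1)
    context = tuple(words[i - k + 1:i])
    kgram = context + (words[i],)
    if counts.get(kgram, 0) > 0:
        return counts[kgram] / counts[context]
    return alpha * stupid_backoff(counts, words, i, k - 1, alpha)
```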
Overview of the NLP Lectures
• Introduction to natural language processing (NLP).
• Regular expressions, sentence splitting,
tokenization, part-of-speech tagging.
• Language models.
• Vector semantics.
– Multiclass logistic regression.
• Parsing.
• Compositional semantics.
Logistic Regression for Binary Classification
P(Y = 1 \mid X = x) = \frac{1}{1 + \exp\big(-(\alpha + \sum_i \beta_i x_i)\big)}
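A minimal NumPy sketch of this predictor, assuming the parameters `alpha` and `beta` have already been learned (the names simply mirror the formula).

```python
import numpy as np

def predict_binary_lr(x, alpha, beta):
    """P(Y = 1 | X = x) = 1 / (1 + exp(-(alpha + sum_i beta_i * x_i)))."""
    z = alpha + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-z))
```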
Multiclass Classification
[Figure: one-versus-rest and one-versus-one decision schemes]
• Key idea: use K discriminant functions and pick the max.
[Figure: decision regions R_i, R_j, R_k; x is assigned to class k when g_k(x) > g_i(x) and g_k(x) > g_j(x)]
Softmax for Classification
• Definition:
P(Y = j \mid X = x_i) = \frac{\exp(z_j)}{\sum_{j'=1}^{K} \exp(z_{j'})}, where z_m = g_m(x_i).

g_m(x) is a linear function, so that g_m(x) = \alpha_m + \sum_k \beta^m_k x_k.
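A small NumPy sketch of the softmax over the K discriminant scores; the max-subtraction is a standard numerical-stability trick, not part of the definition.

```python
import numpy as np

def softmax(z):
    """Turn K scores z_1..z_K into probabilities that sum to 1."""
    z = z - np.max(z)              # stabilises exp() without changing the result
    expz = np.exp(z)
    return expz / expz.sum()

def class_probabilities(x, alphas, betas):
    """P(Y = j | x) with linear discriminants g_m(x) = alpha_m + beta_m . x."""
    z = alphas + betas @ x         # shape (K,)
    return softmax(z)
```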
Training of Multiclass LR
• Training data \{(x_i, y_i)\}_{i=1}^{N}.

\hat{\theta} = \arg\max_{\theta} \log\Big[\prod_{j=1}^{N} P(y_j \mid x_j; \theta)\Big]

Rewrite each y_i with 1-of-K encoding:
if y_1 = 3, then t_1 = (0, 0, 1, 0, 0)
if y_2 = 1, then t_2 = (1, 0, 0, 0, 0)
Training of Multiclass LR
• Training data \{(x_i, y_i)\}_{i=1}^{N}.

\hat{\theta} = \arg\max_{\theta} \log\Big[\prod_{j=1}^{N} P(y_j \mid x_j; \theta)\Big]
            = \arg\max_{\theta} \log\Big[\prod_{j=1}^{N} \prod_{k=1}^{K} P(Y = k \mid x_j; \theta)^{t_{jk}}\Big]
Training of Multiclass LR
• Training data \{(x_i, y_i)\}_{i=1}^{N}.

\hat{\theta} = \arg\max_{\theta} \log\Big[\prod_{j=1}^{N} P(y_j \mid x_j; \theta)\Big]
            = \arg\max_{\theta} \log\Big[\prod_{j=1}^{N} \prod_{k=1}^{K} P(Y = k \mid x_j; \theta)^{t_{jk}}\Big]
            = \arg\max_{\theta} \sum_{j=1}^{N} \sum_{k=1}^{K} t_{jk} \log P(Y = k \mid x_j; \theta)   (cross entropy)
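The last line is (up to sign) the cross-entropy between the one-hot targets t and the model's probabilities. A sketch, assuming `probs[j, k]` holds P(Y = k | x_j; θ):

```python
import numpy as np

def cross_entropy(T, probs, eps=1e-12):
    """Negative of sum_j sum_k t_jk * log P(Y = k | x_j; theta).
    T: (N, K) one-hot targets; probs: (N, K) predicted probabilities.
    eps guards against log(0) and is an implementation detail, not part of the formula."""
    return -np.sum(T * np.log(probs + eps))
```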
Loss Functions (Classification)
Given: D = \{(x_1, y_1), \ldots, (x_m, y_m)\}

Loss: L_D(\theta) = -\log P(D \mid \theta)

Usually we minimize the negative log-likelihood:
\arg\min_{\theta} L_D(\theta) = \arg\max_{\theta} P(D \mid \theta)
Parameter Learning
Have a parametric loss function L(w_0, w_1, \ldots, w_n), where \theta = (w_0, w_1, \ldots, w_n).

Aim to \min_{w} L(w_0, w_1, \ldots, w_n).

Family of stochastic gradient descent (SGD):
1. Start with some w_0, w_1, \ldots, w_n.
2. Repeat: update w_0, w_1, \ldots, w_n to reduce L(w_0, \ldots, w_n).
Until: reach a local minimum.
Gradient Descent
[Figure: loss L plotted against a single weight w, with the gradient \partial L / \partial w at the current point]
Descent Methods
Procedure:
1. Start with some w^0.
2. Repeat: update w by w^{t+1} = w^t + \alpha_t \Delta w^t, where \alpha_t is the step size and \Delta w^t is the search direction.
Until convergence.
Gradient Descent (Univariate)
\frac{\partial L}{\partial w} = \lim_{\Delta w \to 0} \frac{L(w^t + \Delta w) - L(w^t)}{\Delta w}

If L(w^t + \Delta w) < L(w^t), the gradient at w^t is negative, so w^t - \alpha\big(\frac{\partial L}{\partial w}\big) > w^t: the update moves w to the right, towards lower loss.

If L(w^t + \Delta w) > L(w^t), the gradient at w^t is positive, so w^t - \alpha\big(\frac{\partial L}{\partial w}\big) < w^t: the update moves w to the left, towards lower loss.

[Figure: L(w) plotted around w^t for both cases]
Gradient Descent
Procedure:
1. Start with some w^0.
2. Repeat: update w by w^{t+1} = w^t - \alpha_t \frac{\partial L(w^t)}{\partial w}.
Until convergence.
Gradient Descent
Procedure:
1. Start with some w^0.
2. Repeat: update w by w^{t+1} = w^t - \alpha_t \frac{\partial L(w^t)}{\partial w}.
Until convergence.

Example:
\frac{\partial L(w^t)}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} \big((w^t)^T x_i - y_i\big) x_i
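A sketch of the procedure using the example gradient above, which corresponds to the mean squared error of a linear predictor w^T x; the learning rate and iteration count are illustrative assumptions.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, iters=1000):
    """Minimise the squared error of w^T x with the slide's gradient
    (1/N) * sum_i (w^T x_i - y_i) x_i."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = (X @ w - y) @ X / N      # (1/N) sum_i (w^T x_i - y_i) x_i
        w = w - alpha * grad            # w^{t+1} = w^t - alpha * dL/dw
    return w
```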
Search for Optimal Parameters
\min_{w} L(w)

[Figure: a loss surface L(w_0, w_1) over the two parameters w_0 and w_1]
Gradient Descent
\min_{w} L(w)

[Figure: the same loss surface L(w_0, w_1); at each step, move in the direction of the negative gradient -\frac{\partial L(w)}{\partial w}]
SGD with Mini-Batch
• Use mini-batch:
w^{t+1}_j = w^t_j - \alpha \frac{1}{|B|} \sum_{x_i \in B} \frac{\partial}{\partial w_j} L(\theta)

REPEAT for k epochs (with early stopping evaluated on a validation set):
    SHUFFLE the training dataset
    FOR each batch B in the training dataset:
        update the weights with the averaged gradient over B
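A sketch of that loop; `compute_gradient`, `val_loss`, the batch size, and the patience rule for early stopping are hypothetical placeholders.

```python
import numpy as np

def sgd_minibatch(w, X, y, compute_gradient, val_loss,
                  alpha=0.1, batch_size=32, epochs=20, patience=3):
    """Shuffle, iterate over mini-batches, update with the averaged gradient,
    and stop early when the validation loss stops improving.
    `compute_gradient` and `val_loss` are assumed helper functions."""
    best, bad_epochs = float("inf"), 0
    N = X.shape[0]
    for _ in range(epochs):                          # REPEAT k epochs
        order = np.random.permutation(N)             # SHUFFLE training dataset
        for start in range(0, N, batch_size):        # FOR each batch B
            idx = order[start:start + batch_size]
            grad = compute_gradient(w, X[idx], y[idx])   # averaged over the batch
            w = w - alpha * grad
        loss = val_loss(w)                           # early stopping on validation set
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return w
```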
Overview of the NLP Lectures
• Introduction to natural language processing (NLP).
• Regular expressions, sentence splitting,
tokenization, part-of-speech tagging.
• Language models.
• Vector semantics.
– Multiclass logistic regression.
– Feedforward neural networks.
• Parsing.
• Compositional semantics.
Nonlinearity
[Figure: linear classification vs. nonlinear classification, plotted as Y against X]
Artificial Neuron
[Figure: a neuron with inputs x_1, x_2, x_3 and weights w_1, w_2, w_3; its output is the weighted sum \sum_{i=1}^{3} w_i x_i]
Artificial Neuron
[Figure: the same neuron, with inputs x_1, x_2, x_3, weights w_1, w_2, w_3 and output \sum_{i=1}^{3} w_i x_i]

In vector form: x = (x_1, x_2, x_3)^T, w = (w_1, w_2, w_3)^T, so the output is o = w^T x.
Activation Function
• Sigmoid function
sigmoid(x) = \frac{1}{1 + \exp(-x)}

[Figure: the neuron applies the sigmoid to its weighted sum o = \sum_{i=1}^{3} w_i x_i, so the output is sigmoid(o)]
Activation Function
• Hyperbolic tangent: \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}
• Linear rectifier: rectifier(x) = \max(0, x)
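The three activation functions in NumPy, for reference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))   # same as np.tanh(x)

def rectifier(x):            # often called ReLU
    return np.maximum(0, x)
```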
Artificial Neuron (Compact form)
[Figure: a neuron with inputs x_1, x_2, x_3 and weights w_1, w_2, w_3 whose output is g\big(\sum_{i=1}^{3} w_i x_i\big), where g is the activation function]

x = (x_1, x_2, x_3)^T, \quad w = (w_1, w_2, w_3)^T
Feedforward Networks
[Figure: a feedforward network with an input layer (x_1, x_2, x_3), a hidden layer (a^{l_1}_1, a^{l_1}_2) and an output layer (a^{l_2}_1); each hidden and output unit acts like a logistic-regression (LR) classifier over the outputs of the previous layer]
Forward Propagation
[Figure: the same network, with inputs x_1, x_2, x_3, hidden units a^{l_1}_1, a^{l_1}_2 and output unit a^{l_2}_1]

a^{l_1}_1 = g(w^{l_0}_{11} x_1 + w^{l_0}_{21} x_2 + w^{l_0}_{31} x_3)
a^{l_1}_2 = g(w^{l_0}_{12} x_1 + w^{l_0}_{22} x_2 + w^{l_0}_{32} x_3)
a^{l_2}_1 = g(w^{l_1}_{12} a^{l_1}_1 + w^{l_1}_{22} a^{l_1}_2)

Composite function (forward propagation): a^{l_2}_1 = g(W^{l_1} g(W^{l_0} x))
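A sketch of the composite function g(W^{l1} g(W^{l0} x)) for this 3-2-1 network; the weight matrices here are random placeholders, and rows index the units of the next layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W_l0, W_l1, g=sigmoid):
    """a^{l2} = g(W^{l1} g(W^{l0} x)) for a 3-2-1 feedforward network."""
    a_l1 = g(W_l0 @ x)       # hidden layer: a^{l1}_1, a^{l1}_2
    a_l2 = g(W_l1 @ a_l1)    # output layer: a^{l2}_1
    return a_l2

# Example with illustrative random weights
x = np.array([1.0, 0.5, -0.2])
W_l0 = np.random.randn(2, 3)   # rows: hidden units, columns: inputs
W_l1 = np.random.randn(1, 2)
print(forward(x, W_l0, W_l1))
```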
Computing Derivatives
[Figure: a two-weight chain with input x_1 and label y = 1: h_1 = w_1 x_1, o_1 = sigmoid(h_1), h_2 = w_2 o_1, and the final sigmoid(h_2) is compared with y to give the error]

Chain rule → backpropagation: the derivative of the error with respect to each weight is obtained by multiplying the local derivatives backwards along the chain.
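A sketch of backpropagation on this two-weight chain; the squared-error loss is an assumption, since the slide only labels the final node "error".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_chain(x1, y, w1, w2):
    """Forward: h1 = w1*x1, o1 = sigmoid(h1), h2 = w2*o1, o2 = sigmoid(h2),
    error = 0.5*(o2 - y)^2 (assumed loss). Backward: chain rule for dE/dw1, dE/dw2."""
    # forward pass
    h1 = w1 * x1
    o1 = sigmoid(h1)
    h2 = w2 * o1
    o2 = sigmoid(h2)
    error = 0.5 * (o2 - y) ** 2          # assumed squared-error loss
    # backward pass: multiply local derivatives right-to-left
    d_o2 = o2 - y                        # dE/do2
    d_h2 = d_o2 * o2 * (1 - o2)          # sigmoid'(h2) = o2 * (1 - o2)
    d_w2 = d_h2 * o1                     # dE/dw2
    d_o1 = d_h2 * w2
    d_h1 = d_o1 * o1 * (1 - o1)
    d_w1 = d_h1 * x1                     # dE/dw1
    return error, d_w1, d_w2

print(backprop_chain(x1=2.0, y=1.0, w1=0.5, w2=-0.3))
```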
Overview of the NLP Lectures
• Introduction to natural language processing (NLP).
• Regular expressions, sentence splitting, tokenization,
part-of-speech tagging.
• Language models.
• Vector semantics.
– Multiclass logistic regression.
– Feedforward neural networks.
– Word embeddings.
• Parsing.
• Compositional semantics and NLP applications.
Weaknesses of Discrete Representations
• New words are missing (out of vocabulary).
• Hard to compute word similarity.
– Similarity between Canberra and Paris?
• Require human labor to create and maintain.
Distributional Similarity
'study in the united states'
'live in New Zealand'
'study in the january 10'
'stay in the New England'
'live in the US'
'study in the United Kingdom'

'work in the United States'
'work in Australia.'
'its meeting on 05 January'
'annual meeting on 01 December'
'a meeting on January NUM'
'ordinary meeting on 9 December'
'regular meeting of February 10'
Word Co-occurrence Counts
Neural Word Representations
• Using continuous-valued vectors.
• Learned from unlabeled data.
• Capture distributional similarity.
Learning Word Representation
Paris – France + Italy = Rome
Android – Google + Microsoft = Windows
Neural Networks with Word Embeddings
[Figure: pipeline — input text → look-up table → feature learning function h(x) → classifier]
Learning Word Embeddings
• Predict the word in the middle given context words.
\prod_{i=1}^{N} P(x_i \mid x_{i-k}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+k})

where k is the size of the context window.
Continuous Bag-of-Words Model (CBOW) [1,2]
• Basic form:
Let c = 2.

Input words: (x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2})
Look-up table: (e_{i-2}, e_{i-1}, e_i, e_{i+1}, e_{i+2})
Sum layer: e_{sum} = e_{i-2} + e_{i-1} + e_{i+1} + e_{i+2}
Softmax layer:
P(x_i \mid x_{i-c}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+c}) = \frac{\exp(e_i^T e_{sum})}{\sum_{j \in V} \exp(e_j^T e_{sum})}

where c is the context size and V is the vocabulary.
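A sketch of the full-softmax CBOW probability with c = 2; the shared embedding matrix `E` (one row per vocabulary word) mirrors the single look-up table on the slide and is an illustrative placeholder.

```python
import numpy as np

def cbow_probability(E, i, context_ids):
    """P(x_i | context) = exp(e_i . e_sum) / sum_{j in V} exp(e_j . e_sum).
    E: (|V|, d) embedding matrix (assumed); i: index of the middle word;
    context_ids: indices of the 2c context words."""
    e_sum = E[context_ids].sum(axis=0)          # sum layer
    scores = E @ e_sum                          # e_j . e_sum for every j in V
    scores -= scores.max()                      # numerical stability only
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[i]
```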
High Scalability with Negative Sampling
• Approximate the softmax loss with a set of binary
classification losses.
– Positive example: the loss of seeing word i given context
words.
– Negative examples: the loss of not observing each of the |M| words j given the context words, where M is a set of words randomly sampled from the vocabulary.
L_{neg\_sam}(x_i) = \underbrace{-\log \sigma(e_i^T e_{sum})}_{\text{positive example}} - \underbrace{\sum_{u \in M} \log\big(1 - \sigma(e_u^T e_{sum})\big)}_{\text{negative examples}}

where \sigma(z) = \frac{1}{1 + \exp(-z)}.
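A sketch of this loss for one target position; how the negative set M is sampled (e.g. from a unigram distribution) is left outside the snippet.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cbow_neg_sampling_loss(E, i, context_ids, negative_ids):
    """-log sigma(e_i . e_sum) - sum_{u in M} log(1 - sigma(e_u . e_sum)).
    E is an assumed (|V|, d) embedding matrix; negative_ids are the indices in M."""
    e_sum = E[context_ids].sum(axis=0)
    positive = -np.log(sigmoid(E[i] @ e_sum))                            # observed word i
    negatives = -np.sum(np.log(1.0 - sigmoid(E[negative_ids] @ e_sum)))  # unobserved words u in M
    return positive + negatives
```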
Skip-Gram [1,2]
• Basic form:
Input words: (x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2})
Look-up table: e_i, e_{i+j}
Softmax layer:
\prod_{-c \le j \le c,\, j \ne 0} P(x_{i+j} \mid x_i) = \prod_{-c \le j \le c,\, j \ne 0} \frac{\exp(e_{i+j}^T e_i)}{\sum_{l \in V} \exp(e_l^T e_i)}

where c is the context size and V is the vocabulary.
Skip-gram with Negative Sampling
L_{neg\_sam}(x_i) = \sum_{-c \le j \le c,\, j \ne 0} \Big[ \underbrace{-\log \sigma(e_{i+j}^T e_i)}_{\text{positive example}} - \underbrace{\sum_{u \in M_j} \log\big(1 - \sigma(e_u^T e_i)\big)}_{\text{negative examples}} \Big]

where \sigma(z) = \frac{1}{1 + \exp(-z)}.
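The same idea for skip-gram, summing one binary-classification loss per (centre, context) pair; `sample_negatives` is a hypothetical helper that draws the set M_j.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_neg_sampling_loss(E, i, context_ids, sample_negatives):
    """sum over context positions j of
    -log sigma(e_{i+j} . e_i) - sum_{u in M_j} log(1 - sigma(e_u . e_i)).
    E is an assumed (|V|, d) embedding matrix."""
    loss = 0.0
    for j in context_ids:                       # the 2c surrounding words
        neg_ids = sample_negatives()            # fresh negatives M_j for this pair (hypothetical helper)
        loss += -np.log(sigmoid(E[j] @ E[i]))
        loss += -np.sum(np.log(1.0 - sigmoid(E[neg_ids] @ E[i])))
    return loss
```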
Evaluation with Word Analogy [1]
e_? = e_{Athens} - e_{Greece} + e_{Norway}
demo: http://bionlp-www.utu.fi/wv_demo/
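A sketch of answering the analogy by the nearest cosine neighbour to e_Athens - e_Greece + e_Norway; `E`, `word2id` and `id2word` are assumed to come from a trained embedding model.

```python
import numpy as np

def analogy(E, word2id, id2word, a, b, c, topn=1):
    """Return the word(s) whose embedding is closest (cosine similarity)
    to e_a - e_b + e_c, excluding the query words themselves."""
    target = E[word2id[a]] - E[word2id[b]] + E[word2id[c]]
    sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target) + 1e-12)
    exclude = {word2id[a], word2id[b], word2id[c]}
    best = [i for i in np.argsort(-sims) if i not in exclude][:topn]
    return [id2word[i] for i in best]

# e.g. analogy(E, word2id, id2word, 'Athens', 'Greece', 'Norway')  # ideally ['Oslo']
```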
TensorFlow
• Tensors.
– scalar (rank 0) : 1.8, 2 etc.
– vector (rank 1), matrix (rank 2), and higher-rank tensors.
• Define functions on tensors.
• Auto-differentiation.
Computation Graph
• Nodes
– tensors.
– tensor operations.
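A minimal sketch of these points using the TensorFlow 2 eager API (tf.GradientTape); the particular tensors and loss are illustrative only.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])           # rank-1 tensor
W = tf.Variable(tf.random.normal((1, 3)))  # trainable rank-2 tensor

with tf.GradientTape() as tape:
    # a function defined on tensors (operations become nodes in the graph)
    y = tf.sigmoid(tf.matmul(W, tf.reshape(x, (3, 1))))
    loss = tf.reduce_sum((y - 1.0) ** 2)   # illustrative loss

grad = tape.gradient(loss, W)              # auto-differentiation
print(grad)
```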
Applications
• Neural Language Model [4].
• Sentiment analysis [5, 6].
[Figure: both applications use the same stack — an embedding look-up table at the bottom, hidden layers, and a softmax output layer on top]
REFERENCES
• [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
• [2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
• [3] https://code.google.com/archive/p/word2vec/
• [4] Yoshua Bengio, Rejean Ducharme, Pascal Vincent and Christian Jauvin. A Neural Probabilistic Language Model. JMLR, 2003.
• [5] Yoon Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP, 2014.
• [6] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of EMNLP, 2013.
• [7] Understanding Word Vectors. https://medium.com/explorations-in-language-and-learning/understanding-word-vectors-f5f9e9fdef98