
Vector Semantics

Lizhen Qu

Recap

•  Language model.

–  Kneser-Ney smoothing:
–  Stupid Backoff:

Chain rule with a bigram approximation:

$$P(x_1, x_2, \ldots, x_l) = P(x_1) \prod_{i=2}^{l} P(x_i \mid x_{i-1})$$

Stupid Backoff:

$$S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{\operatorname{count}(w_{i-k+1}^{i})}{\operatorname{count}(w_{i-k+1}^{i-1})} & \text{if } \operatorname{count}(w_{i-k+1}^{i}) > 0 \\[6pt]
0.4\, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise}
\end{cases}$$
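As a concrete illustration, here is a minimal Python sketch of Stupid Backoff scoring over pre-computed n-gram counts. The `counts` dictionary layout and the unigram base case are assumptions of this sketch, not spelled out on the slide; only the 0.4 discount and the back-off recursion follow the formula above.

```python
def stupid_backoff(word, context, counts, alpha=0.4):
    """Score word given a context tuple using Stupid Backoff.

    counts: dict mapping n-gram tuples (any length >= 1) to corpus counts.
    Returns a relative score S, not a normalized probability.
    """
    if not context:
        # Unigram base case (assumed here): relative frequency of the word.
        total = sum(c for ngram, c in counts.items() if len(ngram) == 1)
        return counts.get((word,), 0) / total if total else 0.0

    full = context + (word,)
    if counts.get(full, 0) > 0:
        return counts[full] / counts[context]
    # Back off: drop the earliest context word and discount by alpha (0.4).
    return alpha * stupid_backoff(word, context[1:], counts, alpha)


# Example usage with toy counts:
counts = {("the",): 3, ("cat",): 2, ("sat",): 1,
          ("the", "cat"): 2, ("cat", "sat"): 1, ("the", "cat", "sat"): 1}
print(stupid_backoff("sat", ("the", "cat"), counts))  # 1/2 = 0.5
```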

Overview of the NLP Lectures

•  Introduction to natural language processing (NLP).

•  Regular expressions, sentence splitting,
tokenization, part-of-speech tagging.

•  Language models.

•  Vector semantics.
–  Multiclass logistic regression.

•  Parsing.

•  Compositional semantics.

Logistic Regression for Binary Classification

$$P(Y = 1 \mid X = x) = \frac{1}{1 + \exp\!\big(-(\alpha + \sum_i \beta_i x_i)\big)}$$
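A minimal NumPy sketch of this probability; the feature vector, weights, and bias below are made-up values for illustration only.

```python
import numpy as np

def binary_lr_prob(x, beta, alpha):
    """P(Y = 1 | X = x) for binary logistic regression."""
    z = alpha + np.dot(beta, x)      # alpha + sum_i beta_i * x_i
    return 1.0 / (1.0 + np.exp(-z))  # sigmoid

x = np.array([0.5, -1.2, 3.0])       # example feature vector
beta = np.array([0.8, 0.1, -0.4])    # example weights
alpha = 0.2                          # example bias
print(binary_lr_prob(x, beta, alpha))
```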

Classification of Multi-classes

[Figure: "one versus rest" and "one versus one" schemes for reducing a multi-class problem to binary classifiers.]

•  Key idea: use K discriminant functions $g_m(x)$ and pick the max.

[Figure: decision regions $R_i$, $R_j$, $R_k$; an input $x$ falls in region $R_k$ when $g_k(x) > g_i(x)$ and $g_k(x) > g_j(x)$.]

Softmax for Classification

•  Definition:

$$P(Y = j \mid X = x_i) = \frac{\exp(z_j)}{\sum_{j'=1}^{K} \exp(z_{j'})}, \quad \text{where } z_m = g_m(x_i)$$

$g_m(x)$ is a linear function, so that $g_m(x) = \alpha_m + \sum_k \beta^m_k x_k$.
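A minimal NumPy sketch of this classifier. The weights `alpha`, `beta` and the input are made-up illustration values, and the max-subtraction is a standard numerical-stability trick rather than something stated on the slide.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # numerical stability (assumed, standard trick)
    expz = np.exp(z)
    return expz / expz.sum()

def multiclass_lr_probs(x, alpha, beta):
    """P(Y = j | X = x) with linear discriminants g_m(x) = alpha_m + beta_m . x."""
    z = alpha + beta @ x           # z_m = g_m(x)
    return softmax(z)

x = np.array([1.0, 2.0])            # example input
alpha = np.array([0.1, -0.2, 0.3])  # K = 3 biases
beta = np.array([[0.5, -0.1],
                 [0.2,  0.4],
                 [-0.3, 0.1]])      # K x d weight matrix
p = multiclass_lr_probs(x, alpha, beta)
print(p, p.argmax())                # class probabilities and "pick the max"
```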

Training of Multiclass LR

•  Training data: $\{(x_i, y_i)\}_{i=1}^{N}$.

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \log\Big[\prod_{j=1}^{N} P(y_j \mid x_j; \theta)\Big]$$

Rewrite each $y_i$ with 1-of-K encoding (here K = 5):

if $y_1 = 3$, then $t_1 = (0, 0, 1, 0, 0)$
if $y_2 = 1$, then $t_2 = (1, 0, 0, 0, 0)$
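A one-line sketch of the 1-of-K (one-hot) encoding, assuming labels are numbered 1..K as on the slide:

```python
import numpy as np

def one_of_k(y, K):
    """1-of-K encoding for a label y in {1, ..., K}."""
    t = np.zeros(K, dtype=int)
    t[y - 1] = 1                 # labels are 1-indexed on the slide
    return t

print(one_of_k(3, 5))            # [0 0 1 0 0]  (y1 = 3)
print(one_of_k(1, 5))            # [1 0 0 0 0]  (y2 = 1)
```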

Training of Multiclass LR

•  Training data: $\{(x_i, y_i)\}_{i=1}^{N}$.

$$\hat{\theta} = \operatorname*{argmax}_{\theta} \log\Big[\prod_{j=1}^{N} P(y_j \mid x_j; \theta)\Big]
= \operatorname*{argmax}_{\theta} \log\Big[\prod_{j=1}^{N} \prod_{k=1}^{K} P(Y = k \mid x_j; \theta)^{t_{jk}}\Big]$$

Training of Multiclass LR

•  Training data: $\{(x_i, y_i)\}_{i=1}^{N}$.

$$\begin{aligned}
\hat{\theta} &= \operatorname*{argmax}_{\theta} \log\Big[\prod_{j=1}^{N} P(y_j \mid x_j; \theta)\Big] \\
&= \operatorname*{argmax}_{\theta} \log\Big[\prod_{j=1}^{N} \prod_{k=1}^{K} P(Y = k \mid x_j; \theta)^{t_{jk}}\Big] \\
&= \operatorname*{argmax}_{\theta} \sum_{j=1}^{N} \sum_{k=1}^{K} t_{jk} \log P(Y = k \mid x_j; \theta)
\end{aligned}$$

The final sum is the (negated) cross entropy between the 1-of-K targets and the model's predicted distribution.
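A minimal NumPy sketch of this cross-entropy objective, negated so that it becomes a loss to minimize; the target and probability matrices below are made up for illustration.

```python
import numpy as np

def cross_entropy_loss(T, P):
    """Negative log-likelihood: -sum_j sum_k t_jk * log P(Y=k | x_j).

    T: (N, K) matrix of 1-of-K target vectors t_j.
    P: (N, K) matrix of predicted probabilities P(Y=k | x_j; theta).
    """
    return -np.sum(T * np.log(P))

T = np.array([[0, 0, 1, 0, 0],           # y1 = 3
              [1, 0, 0, 0, 0]])          # y2 = 1
P = np.array([[0.1, 0.1, 0.6, 0.1, 0.1],
              [0.7, 0.1, 0.1, 0.05, 0.05]])
print(cross_entropy_loss(T, P))          # -(log 0.6 + log 0.7) ~= 0.87
```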

Loss Functions (Classification)

Given: $D = \{(x_1, y_1), \ldots, (x_m, y_m)\}$

Loss: $L_D(\theta) = -\log P(D \mid \theta)$

$$\operatorname*{argmin}_{\theta} L_D(\theta) = \operatorname*{argmax}_{\theta} P(D \mid \theta)$$

Usually we minimize the negative log-likelihood.

Parameter Learning

Have a parametric loss function $L(w_0, w_1, \ldots, w_n)$, where $\theta = (w_0, w_1, \ldots, w_n)$.

Aim to solve $\min_{w} L(w_0, w_1, \ldots, w_n)$.

Family of stochastic gradient descent (SGD) methods:

1.  Start with some $w_0, w_1, \ldots, w_n$.
2.  Repeat: update $w_0, w_1, \ldots, w_n$ to reduce $L(w_0, \ldots, w_n)$.

Until: reach a local minimum.

Gradient Descent

[Figure: loss curve $L$ plotted against $w$, with the gradient $\frac{\partial L}{\partial w}$ indicated at the current point.]

Descent Methods

Procedure:

1.  Start with some $w^0$.
2.  Repeat: update $w$ by $w^{t+1} = w^t + \alpha_t \Delta w^t$, where $\alpha_t$ is the step size and $\Delta w^t$ is the search direction.

Until convergence.

Gradient Descent (Univariate)

[Figure: univariate loss curve $L(w)$ with the current point $w^t$.]

$$\frac{\partial L}{\partial w} = \lim_{\Delta w \to 0} \frac{L(w^t + \Delta w) - L(w^t)}{\Delta w}$$

If $L(w^t + \Delta w) < L(w^t)$ for a small positive $\Delta w$ (the loss decreases to the right of $w^t$), the derivative is negative, so $w^t - \alpha\left(\frac{\partial L}{\partial w}\right) > w^t$: the update moves $w$ to the right, downhill.

If instead $L(w^t + \Delta w) > L(w^t)$ (the loss increases to the right of $w^t$), the derivative is positive, so $w^t - \alpha\left(\frac{\partial L}{\partial w}\right) < w^t$: the update moves $w$ to the left, again downhill.
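To make the update rule concrete, here is a tiny Python sketch of univariate gradient descent on an example quadratic loss; the loss, starting point, and step size are made-up illustration values.

```python
def L(w):             # example loss: minimized at w = 3
    return (w - 3.0) ** 2

def dL_dw(w):         # its derivative
    return 2.0 * (w - 3.0)

w = 0.0               # start with some w^0
alpha = 0.1           # step size
for t in range(50):
    w = w - alpha * dL_dw(w)   # w^{t+1} = w^t - alpha * dL/dw
print(w)              # close to 3.0: stepping against the derivative walks downhill
```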
Gradient Descent

Procedure:

1.  Start with some $w^0$.
2.  Repeat: update $w$ by $w^{t+1} = w^t - \alpha_t \dfrac{\partial L(w^t)}{\partial w}$.

Until convergence.

Example:

$$\frac{\partial L(w^t)}{\partial w} = \frac{1}{N} \sum_{i=1}^{N} \big((w^t)^T x_i - y_i\big)\, x_i$$

Search for Optimal Parameters

$$\min_{w} L(w)$$

[Figure: the loss surface $L(w_0, w_1)$ plotted over the parameter plane $(w_0, w_1)$.]

Gradient Descent

[Figure: the same surface $L(w_0, w_1)$, descended by repeatedly stepping in the direction $-\frac{\partial L(w)}{\partial w}$.]

SGD with Mini-Batch

•  Use mini-batches:

$$w^{t+1}_j = w^t_j - \alpha \frac{1}{|B|} \sum_{x_i \in B} \frac{\partial}{\partial w_j} L(\theta)$$

REPEAT for k epochs:
    SHUFFLE the training dataset
    FOR each batch in the training dataset: perform one update

Early stopping is evaluated on a validation set.

Nonlinearity

[Figure: a linearly separable dataset ("linear classification") next to one that is not ("nonlinear classification"), both plotted in the (X, Y) plane.]

Artificial Neuron

[Figure: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ feeding a summation unit $\sum_{i=1}^{3} w_i x_i$ that produces the output.]

In vector form:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad
w = \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix}, \quad
o = w^T x$$

Activation Function

•  Sigmoid function: $\mathrm{sigmoid}(x) = \dfrac{1}{1 + \exp(-x)}$, applied to the neuron output $o$ to give $\mathrm{sigmoid}(o)$.

•  Hyperbolic tangent: $\tanh(x) = \dfrac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$

•  Linear rectifier: $\mathrm{rectifier}(x) = \max(0, x)$

Artificial Neuron (Compact form)

$$\text{output} = g\Big(\sum_{i=1}^{3} w_i x_i\Big)$$

Feedforward Networks

[Figure: input layer $(x_1, x_2, x_3)$, hidden layer $(a^{l_1}_1, a^{l_1}_2)$, and output layer $(a^{l_2}_1)$; each hidden and output unit acts as an LR classifier.]

Forward Propagation

$$a^{l_1}_1 = g(w^{l_0}_{11} x_1 + w^{l_0}_{21} x_2 + w^{l_0}_{31} x_3)$$
$$a^{l_1}_2 = g(w^{l_0}_{12} x_1 + w^{l_0}_{22} x_2 + w^{l_0}_{32} x_3)$$
$$a^{l_2}_1 = g(w^{l_1}_{12} a^{l_1}_1 + w^{l_1}_{22} a^{l_1}_2)$$

Composite function (forward propagation): $a^{l_2}_1 = g\big(W^{l_1} g(W^{l_0} x)\big)$

Computing Derivatives

[Figure: a small computation graph in which $x_1$ is multiplied by $w_1$, passed through $\mathrm{sigmoid}(h_1)$, multiplied by $w_2$, and passed through $\mathrm{sigmoid}(h_2)$ to produce $o_1$, which is compared with the target $y = 1$ to give the error.]

Derivatives of the error with respect to the weights are computed with the chain rule; applying the chain rule backwards through the network is backpropagation.
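As a concrete illustration of the last few slides, here is a minimal NumPy sketch: it forward-propagates a tiny 3-2-1 sigmoid network and trains it with mini-batch SGD. To keep the code short it uses finite-difference gradients rather than backpropagation; the network sizes, toy data, learning rate, and batch size are all made-up illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W0, W1):
    """Forward propagation: a_l2 = g(W_l1 g(W_l0 x))."""
    a1 = sigmoid(W0 @ x)           # hidden layer activations
    a2 = sigmoid(W1 @ a1)          # output layer activation
    return a2

def loss(params, X, Y):
    """Mean squared error of the network over a batch."""
    W0, W1 = params
    preds = np.array([forward(x, W0, W1)[0] for x in X])
    return np.mean((preds - Y) ** 2)

def numeric_grad(params, X, Y, eps=1e-5):
    """Finite-difference gradients (a stand-in for backpropagation in this sketch)."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            old = p[idx]
            p[idx] = old + eps
            hi = loss(params, X, Y)
            p[idx] = old - eps
            lo = loss(params, X, Y)
            p[idx] = old
            g[idx] = (hi - lo) / (2 * eps)
        grads.append(g)
    return grads

# Toy dataset: 3 input features, binary target (made up for illustration).
X = rng.normal(size=(32, 3))
Y = (X[:, 0] + X[:, 1] > 0).astype(float)

W0 = rng.normal(scale=0.5, size=(2, 3))   # input -> hidden weights
W1 = rng.normal(scale=0.5, size=(1, 2))   # hidden -> output weights
alpha, batch_size = 0.5, 8

for epoch in range(20):                         # REPEAT k epochs
    order = rng.permutation(len(X))             # SHUFFLE the training dataset
    for start in range(0, len(X), batch_size):  # FOR each batch
        b = order[start:start + batch_size]
        gW0, gW1 = numeric_grad([W0, W1], X[b], Y[b])
        W0 -= alpha * gW0                       # mini-batch SGD update
        W1 -= alpha * gW1

print("final training loss:", loss([W0, W1], X, Y))
```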
Overview of the NLP Lectures

•  Introduction to natural language processing (NLP).

•  Regular expressions, sentence splitting, tokenization, part-of-speech tagging.

•  Language models.

•  Vector semantics.
–  Multiclass logistic regression.
–  Feedforward neural networks.
–  Word embeddings.

•  Parsing.

•  Compositional semantics and NLP applications.

Weaknesses of Discrete Representations

•  Missing new words.

•  Hard to compute word similarity.
–  Similarity between Canberra and Paris?

•  Require human labor to create and maintain.

Distributional Similarity

'study in the united states', 'live in New Zealand', 'study in the january 10', 'stay in the New England', 'live in the US', 'study in the United Kingdom'

'work in the United States', 'work in Australia.', 'its meeting on 05 January', 'annual meeting on 01 December', 'a meeting on January NUM', 'ordinary meeting on 9 December', 'regular meeting of February 10'

Word Co-occurrence Counts

[Figure: table of word co-occurrence counts.]

Neural Word Representations

•  Using continuous-valued vectors.

•  Learned from unlabeled data.

•  Capture distributional similarity.

Learning Word Representation

Paris – France + Italy = Rome
Android – Google + Microsoft = Windows

Neural Networks with Word Embeddings

[Figure: pipeline from input text through an embedding look-up table and a feature-learning function h(x) to a classifier.]

Learning Word Embeddings

•  Predict the word in the middle given context words:

$$\prod_{i=1}^{N} P(x_i \mid x_{i-k}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+k})$$

where k is the size of the context window.

Continuous Bag-of-Words Model (CBOW) [1,2]

•  Basic form:

$$P(x_i \mid x_{i-c}, \ldots, x_{i-1}, x_{i+1}, \ldots, x_{i+c}) = \frac{\exp(e_i^T e_{\mathrm{sum}})}{\sum_{j \in V} \exp(e_j^T e_{\mathrm{sum}})}$$

where c is the context size and V is the vocabulary. Let c = 2:

Input: $(x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2})$
Look-up table: $(e_{i-2}, e_{i-1}, e_i, e_{i+1}, e_{i+2})$
Sum layer: $e_{\mathrm{sum}} = e_{i-2} + e_{i-1} + e_{i+1} + e_{i+2}$
Softmax layer: $\dfrac{\exp(e_i^T e_{\mathrm{sum}})}{\sum_{j \in V} \exp(e_j^T e_{\mathrm{sum}})}$

High Scalability with Negative Sampling

•  Approximate the softmax loss with a set of binary classification losses.
–  Positive example: the loss of seeing word i given the context words.
–  Negative examples: the losses of not observing the |M| words given the context words, where M is a set of words randomly picked from the vocabulary.

$$L_{\mathrm{neg\,sam}}(x_i) = \underbrace{-\log \sigma(e_i^T e_{\mathrm{sum}})}_{\text{positive example}} \; \underbrace{- \sum_{u \in M} \log\big(1 - \sigma(e_u^T e_{\mathrm{sum}})\big)}_{\text{negative examples}}$$

where $\sigma(z) = \frac{1}{1 + \exp(-z)}$.

Skip-Gram [1,2]

•  Basic form:

$$\prod_{-c \le j \le c,\, j \ne 0} P(x_{i+j} \mid x_i) = \prod_{-c \le j \le c,\, j \ne 0} \frac{\exp(e_{i+j}^T e_i)}{\sum_{l \in V} \exp(e_l^T e_i)}$$

where c is the context size and V is the vocabulary. The look-up table maps $(x_{i-2}, x_{i-1}, x_i, x_{i+1}, x_{i+2})$ to the embeddings $e_i$ and $e_{i+j}$.

Skip-Gram with Negative Sampling

$$L_{\mathrm{neg\,sam}}(x_i) = \sum_{-c \le j \le c,\, j \ne 0} \Big[ -\log \sigma(e_{i+j}^T e_i) - \sum_{u \in M_j} \log\big(1 - \sigma(e_u^T e_i)\big) \Big]$$

where $\sigma(z) = \frac{1}{1 + \exp(-z)}$; the first term inside the brackets is the positive example and the sum over $M_j$ covers the negative examples. (A small NumPy sketch of this loss appears after the Applications slide below.)

Evaluation with Word Analogy [1]

$$e_? = e_{\mathrm{Athens}} - e_{\mathrm{Greece}} + e_{\mathrm{Norway}}$$

demo: http://bionlp-www.utu.fi/wv_demo/

TensorFlow

•  Tensors.
–  scalar (rank 0): 1.8, 2, etc.

•  Define functions on tensors.

•  Auto-differentiation.

Computation Graph

•  Nodes:
–  tensors.
–  tensor operations.

Applications

•  Neural language model [4].

•  Sentiment analysis [5, 6].

[Figure: both applications stack an embedding look-up table, hidden layers, and a softmax output.]
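As promised above, here is a minimal NumPy sketch of the skip-gram negative-sampling loss for a single centre word. The vocabulary size, embedding dimension, context indices, number of negative samples, and random embeddings are all made-up illustration values; a real implementation would typically keep separate input and output embedding tables and update them by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def skipgram_neg_sampling_loss(i, context_ids, E, M_per_context):
    """L_neg_sam(x_i): for each context position, the positive-example term
    plus the negative-example terms, following the slide's formula.

    i:              index of the centre word x_i.
    context_ids:    indices of the context words x_{i+j}, -c <= j <= c, j != 0.
    E:              (|V|, d) embedding matrix (one shared table in this sketch).
    M_per_context:  list of index arrays, the negative samples M_j per context word.
    """
    e_i = E[i]
    total = 0.0
    for ctx_id, M_j in zip(context_ids, M_per_context):
        total += -np.log(sigmoid(E[ctx_id] @ e_i))              # positive example
        total += -np.sum(np.log(1.0 - sigmoid(E[M_j] @ e_i)))   # negative examples
    return total

V, d, num_neg = 1000, 50, 5                # toy sizes
E = rng.normal(scale=0.1, size=(V, d))     # randomly initialised embeddings
centre = 42
context = [40, 41, 43, 44]                 # the 2c context word indices (c = 2)
negatives = [rng.integers(0, V, size=num_neg) for _ in context]
print(skipgram_neg_sampling_loss(centre, context, E, negatives))
```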
REFERENCES

•  [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the Workshop at ICLR, 2013.
•  [2] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
•  [3] word2vec: https://code.google.com/archive/p/word2vec/
•  [4] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. A Neural Probabilistic Language Model. JMLR, 2003.
•  [5] Yoon Kim. Convolutional Neural Networks for Sentence Classification. In Proceedings of EMNLP, 2014.
•  [6] Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of EMNLP, 2013.
•  [7] Understanding Word Vectors. https://medium.com/explorations-in-language-and-learning/understanding-word-vectors-f5f9e9fdef98