
Computational Linguistics
CSC 485 Summer 2020
4a. Vector-based Semantics
Gerald Penn
Department of Computer Science, University of Toronto
(slides borrowed from Chris Manning)
Copyright © 2019 Gerald Penn. All rights reserved.

From symbolic to distributed representations
The vast majority of rule-based and statistical NLP work regarded words as atomic symbols: hotel, conference, walk
In vector space terms, this is a vector with one 1 and a lot of zeroes
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
We call this a "one-hot" representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]ᵀ hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
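A minimal numpy sketch (illustrative, not from the slides; toy vocabulary) of why one-hot vectors cannot express similarity: the dot product of any two distinct one-hot vectors is always zero.

```python
import numpy as np

vocab = ["motel", "hotel", "conference", "walk"]

def one_hot(word, vocab):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

motel = one_hot("motel", vocab)
hotel = one_hot("hotel", vocab)

# Distinct one-hot vectors are orthogonal: no notion of similarity.
print(motel @ hotel)   # 0.0
print(motel @ motel)   # 1.0
```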

Distributional similarity based representations
You can get a lot of value by representing a word by means of its neighbors
"You shall know a word by the company it keeps" (J. R. Firth 1957: 11)
One of the most successful ideas of modern NLP
…government debt problems turning into banking crises as has happened in…
…saying that Europe needs unified banking regulation to replace the hodgepodge…
These context words will represent banking

With distributed, distributional representations, syntactic and semantic patterning is captured
[Figure 10: Multidimensional scaling of three verb semantic classes.]
[Figure 11: Multidimensional scaling of present, past, progressive, and past participle verb forms, e.g. SHOW, SHOWED, SHOWING, SHOWN; EAT, ATE, EATING, EATEN; TAKE, TOOK, TAKING, TAKEN; GROW, GREW, GROWING, GROWN.]
[Rohde et al. 2005. An Improved Model of Semantic Similarity Based on Lexical Co-Occurrence]

Menu
1. Vector space representations of language
2. Predict! vs. Count!: The GloVe model of word vectors
3. Wanted: meaning composition functions
4. Tree-structured Recursive Neural Networks for Semantics
5. Natural Language Inference with TreeRNNs

LSA vs. word2vec
LSA: Count!
• Factorize a (maybe weighted, maybe log scaled) term-document or word-context matrix (Schütze 1992) into UΣVᵀ
• Retain only k singular values, in order to generalize
[Cf. Baroni: Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL 2014]
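A small sketch of the Count! recipe, assuming a toy word-context count matrix and plain numpy (a real LSA pipeline would use much larger, carefully weighted counts): factorize with SVD and retain only the top k singular values.

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, columns: context words).
words = ["hotel", "motel", "walk", "run"]
X = np.array([
    [10., 8., 0., 1.],
    [ 9., 7., 0., 0.],
    [ 0., 1., 6., 5.],
    [ 1., 0., 5., 7.],
])

# Optionally log-scale, as many LSA variants do.
X = np.log1p(X)

# Factorize X = U Σ V^T and retain only the top k singular values.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vectors = U[:, :k] * s[:k]   # k-dimensional word representations

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(word_vectors[0], word_vectors[1]))  # hotel vs. motel: similar contexts
print(cosine(word_vectors[0], word_vectors[2]))  # hotel vs. walk: dissimilar contexts
```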

LSA vs. word2vec
LSA: Count! vs.
word2vec CBOW/SkipGram: Predict!
• Train word vectors to try to either:
• Predict a word given its bag-of-words context (CBOW); or
• Predict a context word (position-independent) from the center word
• Update word vectors until they can do this prediction well
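A hedged sketch of the Predict! idea for one (center, context) word pair: scores are dot products between center and context vectors, a softmax turns them into probabilities, and the vectors are updated until the prediction is good. Toy vocabulary and dimensions; real word2vec trains over a corpus and uses negative sampling or a hierarchical softmax rather than the full softmax shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                                      # toy vocabulary size and vector dimension
W_center = rng.normal(scale=0.1, size=(V, d))    # center-word vectors
W_context = rng.normal(scale=0.1, size=(V, d))   # context-word vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

center, context = 2, 4                           # one observed (center, context) pair
lr = 0.1
for _ in range(100):
    scores = W_context @ W_center[center]        # dot product with every context vector
    p = softmax(scores)                          # P(context word | center word)
    err = p.copy()
    err[context] -= 1.0                          # gradient of -log p[context] w.r.t. scores
    grad_center = W_context.T @ err
    W_context -= lr * np.outer(err, W_center[center])
    W_center[center] -= lr * grad_center

# Probability of the observed context word: higher than the initial ~1/V.
print(softmax(W_context @ W_center[center])[context])
```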

Word Analogies: word2vec captures dimensions of similarity as linear relations
Test for linear relationships, examined by Mikolov et al. 2013
a : b :: c : ?   →   find the word whose vector is closest to b − a + c
man : woman :: king : ?
king  [0.30 0.70]
man   [0.20 0.20]
woman [0.60 0.30]
queen [0.70 0.80]
king − man + woman = [0.70 0.80] ≈ queen
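The analogy a : b :: c : ? is answered by nearest-neighbour search around b − a + c. A sketch using the toy 2-d vectors from the slide:

```python
import numpy as np

vecs = {
    "king":  np.array([0.30, 0.70]),
    "man":   np.array([0.20, 0.20]),
    "woman": np.array([0.60, 0.30]),
    "queen": np.array([0.70, 0.80]),
}

def analogy(a, b, c, vecs):
    """Solve a : b :: c : ? by finding the word closest to b - a + c."""
    target = vecs[b] - vecs[a] + vecs[c]
    candidates = [w for w in vecs if w not in (a, b, c)]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return max(candidates, key=lambda w: cos(vecs[w], target))

print(analogy("man", "woman", "king", vecs))  # queen
```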

COALS model (count-modified LSA)
[Rohde, Gonnerman & Plaut, ms., 2005]

From Nissim et al. (2019) arXiv:1905.09866v1 [cs.CL]

Count based vs. direct prediction

Count!: LSA, HAL (Lund & Burgess), COALS (Rohde et al.), Hellinger-PCA (Lebret & Collobert)
• Fast training
• Efficient usage of statistics
• Primarily used to capture word similarity
• Disproportionate importance given to small counts

Predict!: NNLM, HLBL, RNN, word2vec Skip-gram/CBOW (Bengio et al.; Collobert & Weston; Huang et al.; Mnih & Hinton; Mikolov et al.; Mnih & Kavukcuoglu)
• Scales with corpus size
• Inefficient usage of statistics
• Generates improved performance on other tasks
• Can capture complex patterns beyond word similarity

Crucial insight:
Ratios of co-occurrence probabilities can encode meaning components
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]

                            x = solid   x = gas   x = water   x = random
P(x | ice)                  large       small     large       small
P(x | steam)                small       large     large       small
P(x | ice) / P(x | steam)   large       small     ~1          ~1

Crucial insight:
Ratios of co-occurrence probabilities can encode meaning components
Encoding meaning in vector differences
[Pennington, Socher, and Manning, EMNLP 2014]

                            x = solid    x = gas      x = water    x = fashion
P(x | ice)                  1.9 × 10⁻⁴   6.6 × 10⁻⁵   3.0 × 10⁻³   1.7 × 10⁻⁵
P(x | steam)                2.2 × 10⁻⁵   7.8 × 10⁻⁴   2.2 × 10⁻³   1.8 × 10⁻⁵
P(x | ice) / P(x | steam)   8.9          8.5 × 10⁻²   1.36         0.96

Encoding meaning in vector differences
Q: How can we capture ratios of co-occurrence probabilities as meaning components in a word vector space?
A: Log-bilinear model: wᵢ · w̃ⱼ = log P(i | j), so that vector differences capture the ratios: w_x · (w_a − w_b) = log [P(x | a) / P(x | b)]

GloVe: A new model for learning word representations [Pennington et al., EMNLP 2014]
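A sketch of the GloVe weighted least-squares objective, J = Σᵢⱼ f(Xᵢⱼ)(wᵢ·w̃ⱼ + bᵢ + b̃ⱼ − log Xᵢⱼ)², evaluated with numpy over toy co-occurrence counts and random parameters; an actual implementation would minimize this with AdaGrad over counts gathered from a large corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 10
X = rng.integers(0, 50, size=(V, V)).astype(float)   # toy co-occurrence counts

W = rng.normal(scale=0.1, size=(V, d))        # word vectors w_i
W_tilde = rng.normal(scale=0.1, size=(V, d))  # context vectors w~_j
b = np.zeros(V)
b_tilde = np.zeros(V)

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting function: caps the influence of very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_tilde, b, b_tilde, X):
    mask = X > 0                              # only nonzero co-occurrences contribute
    preds = W @ W_tilde.T + b[:, None] + b_tilde[None, :]
    err = preds - np.log(np.where(mask, X, 1.0))
    return np.sum(f(X) * mask * err ** 2)

print(glove_loss(W, W_tilde, b, b_tilde, X))
```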

Word similarities
Nearest words to frog:
1. frogs
2. toad
3. litoria
4. leptodactylidae
5. rana
6. lizard
7. eleutherodactylus
http://nlp.stanford.edu/projects/glove/
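Neighbour lists like the one for frog are typically produced by cosine similarity over the whole vocabulary. A sketch, assuming a hypothetical `embeddings` dict mapping words to numpy vectors (e.g., loaded from the pre-trained GloVe vectors at the URL above):

```python
import numpy as np

def nearest_words(query, embeddings, k=7):
    """Return the k words whose vectors have highest cosine similarity to `query`."""
    q = embeddings[query]
    q = q / np.linalg.norm(q)
    scored = []
    for word, vec in embeddings.items():
        if word == query:
            continue
        scored.append((q @ (vec / np.linalg.norm(vec)), word))
    return [w for _, w in sorted(scored, reverse=True)[:k]]

# Example usage (embeddings would be loaded from pre-trained GloVe vectors):
# print(nearest_words("frog", embeddings))  # e.g. frogs, toad, litoria, ...
```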

Word Analogies
[Mikolov et al., 2012, 2013]
Task: predict the last column

Word analogy task [Mikolov, Yih & Zweig 2013a]
Model                          Dimensions   Corpus size   Performance (Syn + Sem)
CBOW (Mikolov et al. 2013b)    300          1.6 billion   36.1
GloVe (this work)              300          1.6 billion   70.3
CBOW (M et al. 2013b, by us)   300          6 billion     65.7
GloVe (this work)              300          6 billion     71.7
CBOW (Mikolov et al. 2013b)    1000         6 billion     63.7
GloVe (this work)              300          42 billion    75.0

GloVe Visualizations
http://nlp.stanford.edu/projects/glove/

GloVe Visualizations: Company – CEO

GloVe Visualizations: Superlatives

Analogy evaluation and hyperparameters
[Figure: analogy accuracy [%] as a function of training time (hrs), with iterations (GloVe) and negative samples (Skip-Gram/CBOW) on the secondary axes; panel (b): GloVe vs Skip-Gram.]

Analogy evaluation and hyperparameters
[Figure 3: Accuracy [%] (semantic, syntactic, overall) on the analogy task for 300-dimensional vectors trained on different corpora: Wiki2010 (1B tokens), Wiki2014 (1.6B tokens), Gigaword5 (4.3B tokens), Gigaword5 + Wiki2014 (6B tokens), Common Crawl (42B tokens).]

Word Embeddings Conclusion
Developed a model that can translate meaningful relationships between word-word co-occurrence probabilities into linear relations in the word vector space
GloVe shows the connection between Count! work and Predict! work – appropriate scaling of counts gives the properties and performance of Predict! models
Can one explain word2vec’s linear structure?
See Arora, Li, Liang, Ma, & Risteski. 2015. Random Walks on Context Spaces: Towards an Explanation of the Mysteries of Semantic Word Embeddings. [Develops a generative model.]

Compositionality

Artificial Intelligence requires understanding bigger things from knowing about smaller things

We need more! What of larger semantic units?
How can we know when larger units are similar in meaning?
• Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
• Jack Abramoff attempted to bribe two legislators.
People interpret the meaning of larger text units – entities, descriptive terms, facts, arguments, stories – by semantic composition of smaller elements

Representing Phrases as Vectors
[Figure: a 2-d vector space (axes x1, x2) with points for Germany, France, Monday, and Tuesday; similar words lie near each other.]
Vectors for single words are useful as features but limited!
Can we extend the ideas of word vector spaces to phrases?
the country of my birth
the place where I was born

How should we map phrases into a vector space?
Use the principle of compositionality!
The meaning (vector) of a sentence is determined by
(1) the meanings of its words and
(2) a method that combines them.
[Figure: phrase vectors for "the country of my birth" and "the place where I was born" plotted in the same 2-d space as Germany, France, Monday, and Tuesday, landing near each other.]



Tree Recursive Neural Networks (Tree RNNs)
Basic computational unit: Recursive Neural Network
(Goller & Küchler 1996, Costa et al. 2003, Socher et al. ICML 2011)
[Figure: a Recursive Neural Network combines the vectors of two child constituents into a parent vector and a score, applied bottom-up over the parse of "on the mat."]
Socher, Manning, Ng

Version 1: Simple concatenation Tree RNN
p = tanh(W [c1; c2] + b), where tanh is applied element-wise
score = Vᵀ p
Only a single weight matrix = composition function!
No real interaction between the input words!
Not adequate as a composition function for human language
Socher, Manning, Ng
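A minimal numpy sketch of the Version 1 composition, assuming toy dimensions and random parameters: the parent vector is p = tanh(W [c1; c2] + b), its score is Vᵀp, and a single matrix W is shared by every node in the tree.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # dimensionality of every node vector
W = rng.normal(scale=0.1, size=(d, 2 * d))   # single shared composition matrix
b = np.zeros(d)
v = rng.normal(scale=0.1, size=d)            # scoring vector

def compose(c1, c2):
    """p = tanh(W [c1; c2] + b); score = v . p"""
    p = np.tanh(W @ np.concatenate([c1, c2]) + b)
    return p, v @ p

# Compose "the cat" and then "(the cat) sat" bottom-up.
the, cat, sat = (rng.normal(size=d) for _ in range(3))
the_cat, s1 = compose(the, cat)
the_cat_sat, s2 = compose(the_cat, sat)
print(the_cat_sat, s1 + s2)
```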

Version 2: PCFG + Syntactically-Untied RNN
• A symbolic Context-Free Grammar (CFG) backbone is adequate for basic syntactic structure
• We use the discrete syntactic categories of the children to choose the composition matrix
• An RNN can do better with a different composition matrix for different syntactic environments
• The result gives us a better semantics
Socher, Manning, Ng

SU-RNN
Learns a soft notion of head words
Initialization:
[Figure: learned composition matrices for the pairs NP-CC, PRP$-NP, NP-PP, and PP-NP.]
Socher, Manning, Ng

SU-RNN
[Figure: learned composition matrices for the pairs ADVP-ADJP, DT-NP, ADJP-NP, and JJ-NP.]
Socher, Manning, Ng

Version 3: Matrix-vector RNNs
[Socher, Huval, Bhat, Manning, & Ng, 2012]

Version 3: Matrix-vector RNNs
[Socher, Huval, Bhat, Manning, & Ng, 2012]
p = g(W [B a; A b])
P = W_M [A; B]
(each constituent carries both a vector and a matrix)
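A hedged numpy sketch of the MV-RNN composition from Socher et al. (2012), with toy dimensions and random parameters: each constituent has a vector and a matrix, the matrices modify the other child's vector, the parent vector is p = g(W [B a; A b]), and the parent matrix is P = W_M [A; B].

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = rng.normal(scale=0.1, size=(d, 2 * d))     # vector-composition matrix
W_M = rng.normal(scale=0.1, size=(d, 2 * d))   # matrix-composition matrix

def mv_compose(a, A, b, B):
    """Each child carries (vector, matrix); the matrices modify the other child's vector."""
    p = np.tanh(W @ np.concatenate([B @ a, A @ b]))
    P = W_M @ np.vstack([A, B])                # (d, 2d) @ (2d, d) -> (d, d)
    return p, P

a, b = rng.normal(size=d), rng.normal(size=d)
A, B = np.eye(d), np.eye(d)                    # word matrices can start near identity
p, P = mv_compose(a, A, b, B)
print(p.shape, P.shape)                        # (3,) (3, 3)
```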

Classification of Semantic Relationships
• Can an MV-RNN learn how a large syntactic context conveys a semantic relationship?
• My [apartment]e1 has a pretty large [kitchen]e2 → component-whole relationship (e2, e1)
• Build a single compositional semantics for the minimal constituent including both terms
Socher, Manning, Ng

Classification of Semantic Relationships

Classifier   Features                                                                 F1
SVM          POS, stemming, syntactic patterns                                        60.1
MaxEnt       POS, WordNet, morphological features, noun compound system,
             thesauri, Google n-grams                                                 77.6
SVM          POS, WordNet, prefixes, morphological features, dependency parse
             features, Levin classes, PropBank, FrameNet, NomLex-Plus,
             Google n-grams, paraphrases, TextRunner                                  82.2
RNN          –                                                                        74.8
MV-RNN       –                                                                        79.1
MV-RNN       POS, WordNet, NER                                                        82.4

Socher, Manning, Ng

Version 4: Recursive Neural Tensor Network
• Fewer parameters than the MV-RNN
• Allows the two word or phrase vectors to interact multiplicatively
Socher, Manning, Ng

Version 4: Recursive Neural Tensor Network
• Idea: Allow both additive and mediated multiplicative interactions of vectors
Socher, Manning, Ng

Recursive Neural Tensor Network
Socher, Manning, Ng


Recursive Neural Tensor Network
• Use the resulting vectors in the tree as input to a classifier like logistic regression
• Train all weights jointly with gradient descent
Socher, Manning, Ng
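A sketch of the RNTN composition with toy dimensions and random, untrained parameters: each output dimension k gets its own bilinear slice V[k], so p = tanh([c1; c2]ᵀ V^[1:d] [c1; c2] + W [c1; c2]), combining the additive RNN term with a mediated multiplicative term; in the full model these vectors feed a classifier at every node and everything is trained jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.1, size=(d, 2 * d))           # additive (standard RNN) term
V = rng.normal(scale=0.1, size=(d, 2 * d, 2 * d))    # one bilinear slice per output dim

def rntn_compose(c1, c2):
    c = np.concatenate([c1, c2])
    bilinear = np.array([c @ V[k] @ c for k in range(d)])   # mediated multiplicative term
    return np.tanh(bilinear + W @ c)

c1, c2 = rng.normal(size=d), rng.normal(size=d)
print(rntn_compose(c1, c2))
```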

Version 5:
Improving Deep Learning Semantic Representations using a TreeLSTM
[Tai et al., ACL 2015]
Goals:
• Still trying to represent the meaning of a sentence as a location in a (high-dimensional, continuous) vector space
• In a way that accurately handles semantic composition and sentence meaning
• Beat Paragraph Vector!

Tree-Structured Long Short-Term Memory Networks
Use Long Short-Term Memories (Hochreiter and Schmidhuber 1997)
Use syntactic structure

• An LSTM creates a sentence representation via left-to-right composition
• Natural language has syntactic structure
• We can use this additional structure over inputs to guide how representations should be composed

Tree-Structured Long Short-Term Memory Networks [Tai et al., ACL 2015]
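A compact numpy sketch of the Child-Sum TreeLSTM cell from Tai et al. (2015), with toy dimensions and random parameters: a parent node sums its children's hidden states for the input, output, and update gates, but computes a separate forget gate for each child's memory cell.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 5, 4

def param(*shape):
    return rng.normal(scale=0.1, size=shape)

W = {g: param(d_h, d_in) for g in "ifou"}   # input-to-gate weights
U = {g: param(d_h, d_h) for g in "ifou"}    # hidden-to-gate weights
b = {g: np.zeros(d_h) for g in "ifou"}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_treelstm(x, children):
    """children: list of (h_k, c_k) pairs from the node's children (possibly empty)."""
    h_sum = sum((h for h, _ in children), np.zeros(d_h))
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])
    # One forget gate per child, applied to that child's memory cell.
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"])
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c

# Leaves have no children; an internal node combines its children's states.
leaf1 = child_sum_treelstm(rng.normal(size=d_in), [])
leaf2 = child_sum_treelstm(rng.normal(size=d_in), [])
root = child_sum_treelstm(rng.normal(size=d_in), [leaf1, leaf2])
print(root[0].shape)  # (4,)
```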

Results: Semantic Relatedness
SICK 2014 (Sentences Involving Compositional Knowledge)

Method                                 Pearson correlation
Meaning Factory (Bjerva et al. 2014)   0.827
ECNU (Zhao et al. 2014)                0.841
LSTM (sequence model)                  0.853
Tree LSTM                              0.868

Natural Language Inference
Can we tell if one piece of text follows from another?
• Two senators received contributions engineered by lobbyist Jack Abramoff in return for political favors.
• Jack Abramoff attempted to bribe two legislators.
Natural Language Inference = Recognizing Textual Entailment [Dagan 2005; MacCartney & Manning 2009]

The task: Natural language inference
James Byron Dean refused to move without blue jeans
{entails, contradicts, neither} James Dean didn’t dance without pants

MacCartney’s natural logic
An implementable logic for natural language inference without logical forms. (MacCartney and Manning ‘09)
● Sound logical interpretation (Icard and Moss '13)

The task: Natural language inference
Claim: Simple task to define, but engages the full complexity of compositional semantics:
• Lexical entailment
• Quantification
• Coreference
• Lexical/scope ambiguity
• Commonsense knowledge
• Propositional attitudes
• Modality
• Factivity and implicativity …

Natural logic: relations
Seven possible relations between phrases/sentences:

Symbol   Name                                      Example
x ≡ y    equivalence                               couch ≡ sofa
x ⊏ y    forward entailment (strict)               crow ⊏ bird
x ⊐ y    reverse entailment (strict)               European ⊐ French
x ^ y    negation (exhaustive exclusion)           human ^ nonhuman
x | y    alternation (non-exhaustive exclusion)    cat | dog
x ‿ y    cover (exhaustive non-exclusion)          animal ‿ nonhuman
x # y    independence                              hungry # hippo

Natural logic: relation joins
Can our NNs learn to make these inferences over pairs of embedding vectors?

A minimal NN for lexical relations
[Bowman 2014]
● Words are learned embedding vectors.
● One plain TreeRNN or TreeRNTN layer compares the two word vectors (e.g. jeans vs. pants).
● A softmax classifier emits relation labels, e.g. P(entailment) = 0.9.
● Learn everything with SGD.
[Figure: softmax classifier → comparison N(T)N layer → learned word vectors for jeans and pants.]
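A hedged sketch of the architecture in the figure, with toy sizes and random, untrained parameters, and a plain RNN-style comparison layer rather than the tensor variant: two learned word vectors pass through a comparison layer, and a softmax over the seven natural-logic relations reads off the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_cmp, n_rel = 15, 75, 7   # 15d embeddings, 75d comparison layer, 7 relations
relations = ["equivalence", "forward", "reverse", "negation",
             "alternation", "cover", "independence"]

E = {w: rng.normal(scale=0.1, size=d_emb) for w in ["jeans", "pants", "dog", "cat"]}
W_cmp = rng.normal(scale=0.1, size=(d_cmp, 2 * d_emb))   # comparison layer
W_out = rng.normal(scale=0.1, size=(n_rel, d_cmp))       # softmax classifier

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict_relation(w1, w2):
    h = np.tanh(W_cmp @ np.concatenate([E[w1], E[w2]]))   # comparison layer
    p = softmax(W_out @ h)                                # P(relation | w1, w2)
    return relations[int(np.argmax(p))], p

label, probs = predict_relation("jeans", "pants")
print(label, probs.round(3))
```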

Lexical relations: results
● Both models tuned, then trained to convergence on five randomly generated datasets
● Reported figures: % correct (macroaveraged F1)
● Both NNs used 15d embeddings, 75d comparison layer

Quantifiers
Experimental paradigm: Train on relational statements generated from some formal system, test on other such relational statements.
The model needs to:
● Learn the relations between individual words. (lexical relations)
● Learn how lexical relations impact phrasal relations. (projectivity)
● Quantifiers present some of the harder cases of both of these.

Quantifiers
● Small vocabulary
  o Three basic types:
    § Quantifiers: some, all, no, most, two, three, not-all, not-most, less-than-two, less-than-three
    § Predicates: dog, cat, mammal, animal …
    § Negation: not
● 60k examples generated using a generative implementation of the relevant portion of MacCartney and Manning's logic.
● All sentences of the form QPP, with optional negation on each predicate.

Quantifier results

Model                        Train    Test
Most freq. class (# only)    35.4%    35.4%
25d SumNN (sum of words)     96.9%    93.9%
25d TreeRNN                  99.6%    99.2%
25d TreeRNTN                 100%     99.7%

Natural language inference data
[Bowman, Manning & Potts 2015]
● To do NLI on real English, we need to teach an NN model English almost from scratch.
● What data do we have to work with:
  o GloVe/word2vec (useful with any data source)
  o SICK: Thousands of examples created by editing and pairing hundreds of sentences.
  o RTE: Hundreds of examples created by hand.
  o DenotationGraph: Millions of extremely noisy examples (~73% correct?) constructed fully automatically.

Results on SICK (+DG, +tricks) so far

Model              SICK Train   DG Train   Test
Most freq. class   56.7%        50.0%      56.7%
30 dim TreeRNN     95.4%        67.0%      74.9%
50 dim TreeRNTN    97.8%        74.0%      76.9%

Is it competitive? Sort of…
Best result (UIllinois): 84.5% ≈ interannotator agreement!
Median submission (out of 18): 77%
Our TreeRNTN: 76.9%
TreeRNTN is a purely-learned system; none of the systems in the competition were.

Natural language inference data
● The data sources listed above, plus:
  o Stanford NLI corpus: ~600k examples, written by Turkers.

The Stanford NLI corpus

Envoi
There are very good reasons to want to represent meaning with distributed representations
So far, distributional learning has been most effective for this
But cf. [Young, Lai, Hodosh & Hockenmaier 2014] on denotational representations, using visual scenes
However, we want not just word meanings! We want:
• Meanings of larger units, calculated compositionally
• The ability to do natural language inference