seg-49
Context-Dependent Embeddings
‣ Train a neural language model to predict the next word given previous words in the sentence; use its internal representations as context-dependent word embeddings
[ELMo figure: "John visited Madagascar yesterday" fed through per-token Char CNNs into 4096-dim LSTMs with 512-dim projections]
‣ Basic self-attention over token vectors x_1, …, x_n:
α_{i,j} = softmax(x_i^⊤ x_j),   x'_i = Σ_{j=1}^n α_{i,j} x_j
‣ Multi-head version with learned parameters W_k, V_k for head k:
α_{k,i,j} = softmax(x_i^⊤ W_k x_j),   x'_{k,i} = Σ_{j=1}^n α_{k,i,j} V_k x_j
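A minimal numpy sketch of the single-head equations above (shapes and names are assumptions for illustration, not lecture code):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W, V):
    # X: [n, d] token vectors; W, V: [d, d] parameters for one head
    scores = X @ W @ X.T        # scores[i, j] = x_i^T W x_j
    alpha = softmax(scores)     # normalize over j for each position i
    return alpha @ (X @ V.T)    # x'_i = sum_j alpha[i, j] * (V x_j)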
What can self-attention do?
Word-level MT: roughly 1989 – 2000 (source → target)
‣ Big chunks make translation easier
↳ How to get chunks?
↳ How to translate with them?
MT Evaluation
‣ Evaluate MT at the word (phrase) level
‣ BLEU: geometric mean of 1-gram, 2-gram, 3-gram, and 4-gram precision of the output × a brevity penalty
‣ reference ← translation: how many 4-grams of the translation are in the reference?
‣ BLEU scores are low in absolute terms (~30-40), but BLEU correlates with human judgments
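A rough sketch of that computation (uniform geometric mean of 1-4-gram precision times a brevity penalty; real BLEU also clips counts against multiple references — this is an illustrative simplification):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    if not candidate or not reference:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return geo_mean * brevity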
BERT for QA
‣ P(start = i) = softmax(s_start · e_i) and P(end = i) = softmax(s_end · e_i), where e_i is BERT's output for token i and s_start, s_end are learned vectors
‣ Input: [CLS] What event … [SEP] The assassination of …
‣ Caveat: 512 word piece length limit — identify a small context to give as input to BERT
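A hedged sketch of the span scoring above (names and shapes are assumed; not the reference BERT code):

import torch
import torch.nn.functional as F

def span_distributions(token_reprs, s_start, s_end):
    # token_reprs: [seq_len, hidden] BERT outputs; s_start, s_end: [hidden] learned vectors
    p_start = F.softmax(token_reprs @ s_start, dim=0)   # P(start = i)
    p_end = F.softmax(token_reprs @ s_end, dim=0)       # P(end = i)
    return p_start, p_end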
seg-89
Seq2seq Summarization
‣ Extractive paradigm isn't all that flexible, even with compression
‣ Can we just use seq2seq models to simplify things?
[Figure: seq2seq model reads the document "The oat bran craze has …" and generates the summary "Its president quit suddenly"]
‣ Train to produce summary based on document
‣ Need lots of data: most methods are going to be single-document
Chopra et al. (2016)
Seq2seq Headline Generation
‣ Headline generation task: generate headline from first sentence of article (can get lots of data!)
[Figure: example outputs with no attention, with attention, and the reference, for an input sentence]
‣ Works pretty well, though these models can generate incorrect summaries (who has the knee injury?)
‣ What happens if we try this on a longer article?
Chopra et al. (2016)
Seq2seq Summarization
See et al. (2017)
‣ What's wrong with this summary?
Pointer-Generator Model
‣ Copying approach like in Jia+Liang / MT
See et al. (2017)
Pointer-Generator Model
‣ Solutions: copy mechanism, coverage, just like in MT…
Pointer-Generator Model
See et al. (2017)
‣ How abstractive is this, anyway? Mostly doing copying!
‣ Next: pre-trained models which are better at introducing new content
seg-88
Compressive Summarization
Indian Express — A massive earthquake of magnitude 7.3 struck Iraq on Sunday, 103 kms (64 miles) southeast of the city of As-Sulaymaniyah, the US Geological Survey said, reports Reuters. US Geological Survey initially said the quake was of a magnitude 7.2, before revising it to 7.3.
‣ Sentence extraction isn't aggressive enough at removing irrelevant content
‣ Want to extract sentences and also delete content from them
Syntactic Cuts
Berg-Kirkpatrick et al. (2011)
‣ Use syntactic rules to make certain deletions
‣ Delete adjuncts
A massive earthquake of magnitude 7.3 struck Iraq on Sunday, 103 kms (64 miles)…
[Parse fragment: NP-LOC and NP-TMP adjuncts under the VP can be cut]
Syntactic Cuts
Berg-Kirkpatrick et al. (2011)
‣ Use syntactic rules to make certain deletions
‣ Delete second parts of coordination structures
At least six people were killed and many others injured
[Parse fragment: S → S CC S; the second conjoined S can be cut]
Compressive Integer Linear Program
‣ Recall the Gillick+Favre ILP:
‣ Now the s_j variables are nodes or sets of nodes in the parse tree
At least six people were killed and many others injured
[Parse fragment: S → S_1 CC S_2, giving units s_1 and s_2]
‣ New constraint: s_2 ≤ s_1 ("s_1 is a prerequisite for s_2")
Berg-Kirkpatrick et al. (2011)
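As a concrete illustration of how such prerequisite constraints look in an off-the-shelf ILP solver, here is a minimal PuLP sketch (scores, lengths, and variable names are made up for illustration; this is not the exact Gillick+Favre formulation):

import pulp

# Hypothetical unit scores (e.g., from a trained model) and lengths.
scores = {"s1": 2.0, "s2": 1.5}
lengths = {"s1": 12, "s2": 5}
budget = 15

prob = pulp.LpProblem("compressive_summary", pulp.LpMaximize)
x = {u: pulp.LpVariable(u, cat="Binary") for u in scores}

prob += pulp.lpSum(scores[u] * x[u] for u in scores)          # objective: total score
prob += pulp.lpSum(lengths[u] * x[u] for u in lengths) <= budget  # length budget
prob += x["s2"] <= x["s1"]                                    # s1 is a prerequisite for s2

prob.solve()
selected = [u for u in x if x[u].value() == 1]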
Compressive Integer Linear Program
Durrett et al. (2016)
This hasn't been Kellogg's year.
Its president quit suddenly.
The oat-bran craze has cost it market share.
And now Kellogg is canceling its new cereal plant, which would have cost $1 billion.
[Parse fragments: SBAR and NP nodes mark deletable spans; compressed example: "The oat-bran craze has cost Kellogg market share."]
Units x_1, …, x_5
ILP:   max_x  w^⊤ f(x)
s.t. summary(x) obeys length limit
     summary(x) is grammatical
     summary(x) is coherent
Compressive Integer Linear Program
Durrett et al. (2016)
max_x  w^⊤ f(x)
s.t. summary(x) obeys length limit
     summary(x) is grammatical
     summary(x) is coherent
Grammaticality constraints: allow cuts within sentences
Coreference constraints: do not allow pronouns that would refer to nothing
‣ Otherwise, force the pronoun's antecedent to be included in the summary
‣ If we're confident about coreference, rewrite the pronoun (it → Kellogg)
Results
Durrett et al. (2016)
[Plot: content fidelity (ROUGE-1: word overlap with reference) vs. linguistic quality (human study on Mechanical Turk) for first n words, extract sentences, extract clauses (Yoshida et al., 2014), and the full system]
Results
Xu and Durrett (2019)
‣ Model is now a neural model that scores sentences and compression options
‣ Decoding is done by beam search (not ILP); the length constraint is not enforced as strongly anymore
‣ Stronger results on NYT and on CNN/Daily Mail
seg-77
Problems in QA
Devlin et al. (2019)
‣ SQuAD questions … different alphabets!
‣ Japanese => English: different script and very different syntax
seg-99
Ethics in NLP
Types of risk:
‣ Bias amplification: systems exacerbate real-world bias rather than correct for it
‣ Unethical use: powerful systems can be used for bad ends
‣ Exclusion: underprivileged users are left behind by systems
‣ Dangers of automation: automating things in ways we don't understand is dangerous
[Diagram: risks range from application-specific (IE / QA / summarization? machine translation? dialog?) to machine learning generally to deep learning generally]
Bias Amplification
‣ Bias in data: 67% of training images involving cooking are women, model predicts 80% women cooking at test time — amplifies bias
‣ Can we constrain models to avoid this while achieving the same predictive accuracy?
‣ Place constraints on proportion of predictions that are men vs. women?
Zhao et al. (2017)
Bias Amplification
Zhao et al. (2017)
Bias Amplification
Alvarez-Melis and Jaakkola (2011)
‣ English -> French machine translation requires inferring gender even when unspecified
‣ "dancer" is assumed to be female in the context of the word "charming"… but maybe that reflects how language is used?
Exclusion
Alvarez-Melis and Jaakkola (2011)
‣ Most of our annotated data is English data, especially newswire
‣ What about:
  Codeswitching?
  Dialects?
  Other languages? (Non-European/CJK)
‣ Efforts to broaden along all these axes, including Universal Dependencies, Masakhane NLP, …
Dangers of Automation
Dangers of Automation
Slide credit: allout.org (http://allout.org)
‣ Offensive terms
Dangers of Automation
Slide credit: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
‣ "Amazon scraps secret AI recruiting tool that showed bias against women"
‣ "Women's X" organization was a negative-weight feature in resumes
‣ Women's colleges too
‣ Was this a bad model? Maybe it correctly reflected the biases in what the humans did in the actual recruiting process
Dangers of Automation
https://toxicdegeneration.allenai.org/
‣ "Toxic degeneration": systems that generate toxic stuff
‣ System trained on a big chunk of the Internet: conditioning on "SJW", "black" gives the system a chance of recalling bad stuff from its training data
Unethical Use
‣ Surveillance applications?
‣ Generating convincing fake news / fake comments?
‣ What if these were undetectable?
Unethical Use
‣ Wang and Kosinski: gay vs. straight classification based on faces
‣ Blog post by Agüera y Arcas, Todorov, Mitchell: the system detects mostly social phenomena (glasses, makeup, angle of camera, facial hair)
‣ Authors argued they were testing a hypothesis: sexual orientation has a genetic component reflected in appearance
‣ Potentially dangerous tool, and not even good science
Slide credit: https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477
Unethical Use
http://www.faception.com
How to move forward
‣ Hal Daumé III: proposed code of ethics
https://nlpers.blogspot.com/2016/12/should-nlp-and-ml-communities-have-code.html
‣ Value-sensitive design: vsdesign.org
‣ Account for human values in the design process: understand whose values matter here, analyze how technology impacts those values
‣ Many other points, but these are relevant:
‣ Contribute to society and human well-being, and minimize negative consequences of computing systems
‣ Make reasonable effort to prevent misinterpretation of results
‣ Make decisions consistent with safety, health, and welfare of public
‣ Improve understanding of technology, its applications, and its potential consequences (pos and neg)
Reading Comprehension
Knowledge base question answering: question → formal representation → [KB] → answer
  "When was Brie Larson born?" → λd. birthday(BL, d)
But many questions don't map neatly onto a knowledge base:
  What temp should I cook chicken to?
  Why did WW2 start?
  What event led to the start of WW1?
Given a passage such as "The assassination of Franz Ferdinand took place on … This event was a direct cause of WW1.",
reading comprehension = find the span in the passage which answers the question
seg-66
Syntactic Machine Translation
‣ Rather than use phrases, use a synchronous context-free grammar
NP → [DT1 JJ2 NN3; DT1 NN3 JJ2]
DT → [the, la]
DT → [the, le]
NN → [car, voiture]
JJ → [yellow, jaune]
the yellow car ↔ la voiture jaune
‣ Assumes parallel tree structures, but there can be reordering
[Figure: parallel NP trees, English DT1 JJ2 NN3 and French DT1 NN3 JJ2]
‣ Translation = parse the input with "half" of the grammar, read off the other half
Syntactic Machine Translation
Slide credit: Dan Klein
‣ To get flexibility in translation, use lexicalized rules that look like "syntactic phrases"
‣ Leads to HUGE grammars, parsing is slow
seg-70
GPT/GPT-2
Radford et al. (2018, 2019)
‣ GPT: "ELMo with transformers" (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT-2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters — by far the largest of these models trained as of March 2019
‣ Because it's a language model, we can generate from it
GPT-2
credit: OpenAI
Questions
1) How novel is the stuff being generated? (Is it just doing nearest neighbors on a large corpus?)
2) How do we understand and distill what is learned in this model?
3) How do we harness these priors for conditional generation tasks (summarization, generate a report of a basketball game, etc.)
4) Is this technology dangerous? (OpenAI pursued a "staged release" model)
Grover
Zellers et al. (2019)
‣ Sample from a large language model conditioned on a domain, date, authors, and headline
‣ Humans rank Grover-generated propaganda as more realistic than real "fake news"
‣ NOTE: Not a GAN, the discriminator is trained separately from the generator
‣ Fine-tuned Grover can detect Grover propaganda easily — the authors argue for releasing it for this reason
Pre-training Cost
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
‣ XLNet (BERT variant): $30,000 — $60,000 (unclear)
‣ Grover-MEGA: $25,000
‣ BERT: Base $500, Large $7000
‣ This is for a single pre-training run… developing new pre-training techniques may require many runs
‣ Fine-tuning these models can typically be done with a single GPU (but may take 1-3 days for medium-sized datasets)
GPT-3 (June 2020)
Brown et al. (2020)
‣ 175B parameter model: 96 layers, 96 heads, 12k-dim vectors
‣ Trained on Microsoft Azure, estimated to cost roughly $10M. Almost 1000x BERT-large
GPT-3: Few-shot Learning
Brown et al. (2020)
‣ Model "learns" by conditioning on a few examples of the task
‣ Not as good as systems that train on more data, but impressive results in the few-shot setting
IBM Model 1
ā = (a_1, …, a_n),  t̄ = (t_1, …, t_n),  s̄ = (s_1, …, s_m, NULL)
‣ Each word in t̄ aligns to one word in s̄; NULL is a placeholder for unaligned target words
‣ P(t̄, ā | s̄) = P(ā) P(t̄ | ā, s̄) = Π_i P(a_i) P(t_i | s_{a_i})
‣ Model params: a translation probability matrix of size |V_s| × |V_t| (source side includes NULL): P(target | source)
‣ P(t_i | s_{a_i}): look up the probability of t_i given source word s_{a_i}; a_i is a "pointer" into the source
‣ P(a_i) = uniform over {1, …, m, NULL} = 1/(m+1)
Inference in Model 1
‣ P(ā | s̄, t̄) = P(ā, t̄ | s̄) / P(t̄ | s̄); the denominator doesn't depend on ā, so we can ignore it
‣ So P(ā | s̄, t̄) ∝ Π_i P(a_i) P(t_i | s_{a_i}), and each a_i can be inferred independently: P(a_i | s̄, t̄) ∝ P(t_i | s_{a_i})
[Worked example: posterior alignment probabilities for a French word such as "aime", e.g. 2/3 to one English word and 1/3 to NULL]
HMM Model (Vogel, 1996)
‣ P(ā) = Π_i P(a_i | a_{i−1}), with P(a_i | a_{i−1}) = Categorical(a_i − a_{i−1}): "moving the alignment pointer by +1 is most likely"
Learning
‣ No labeled alignment data — optimize the marginal log likelihood over sentence pairs: Σ log P(t̄ | s̄) = Σ log Σ_ā P(t̄, ā | s̄); the model defines P(t̄, ā | s̄)
‣ Expectation maximization (EM)
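A minimal toy sketch of EM for IBM Model 1 under the assumptions above (variable names are illustrative, not lecture code):

from collections import defaultdict

def train_ibm1(bitext, iters=10):
    # bitext: list of (source_words, target_words) pairs; NULL is appended to each source
    t = defaultdict(lambda: 1.0)          # t[(tgt, src)] starts uniform (unnormalized)
    for _ in range(iters):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for src, tgt in bitext:
            src = src + ["NULL"]
            for tw in tgt:
                # E-step: the posterior over alignments factors per target word
                z = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    p = t[(tw, sw)] / z
                    counts[(tw, sw)] += p
                    totals[sw] += p
        # M-step: renormalize translation probabilities per source word
        t = defaultdict(float, {(tw, sw): c / totals[sw] for (tw, sw), c in counts.items()})
    return t

# Usage on a toy bitext:
probs = train_ibm1([(["le", "chat"], ["the", "cat"]), (["le", "chien"], ["the", "dog"])])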
seg-58
Unknown Words
‣ Want to be able to copy named entities like Pont-de-Buis
P(y_i | x, y_1, …, y_{i−1}) = softmax(W [c_i; h̄_i])     (c_i from attention, h̄_i from the RNN hidden state)
‣ Problems: target word has to be in the vocabulary; attention + RNN need to generate good embedding to pick it
Luong et al. (2015), Jean et al. (2015)
Pointer Networks
‣ Standard decoder (P_vocab): softmax over the vocabulary, all words get > 0 prob
P(y_i | x, y_1, …, y_{i−1}) = softmax(W [c_i; h̄_i])
‣ Pointer network: predict from source words instead of the target vocab
[Figure: decoder distribution over source words w_1 w_2 w_3 w_4 ("the movie was great") instead of the full vocabulary]
P_pointer(y_i | x, y_1, …, y_{i−1}) ∝ exp(h_j^⊤ V h̄_i) if y_i = w_j  (and 0 for words not in the source)
Pointer-Generator Mixture Models
‣ Define the decoder model as a mixture model of the P_vocab and P_pointer models (previous slide)
P(y_i | x, y_1, …, y_{i−1}) = P(copy) P_pointer + (1 − P(copy)) P_vocab
‣ Predict P(copy) based on decoder state, input, etc.
‣ Marginalize over the copy variable during training and inference
‣ Model will be able to both generate and copy, and flexibly adapt between the two
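A PyTorch-style sketch of that mixture (shapes and names assumed; not the See et al. implementation):

import torch
import torch.nn.functional as F

def mixture_distribution(dec_state, context, attn_weights, src_ids, W_vocab, w_copy, vocab_size):
    # dec_state, context: [d]; attn_weights: [src_len]; src_ids: LongTensor [src_len] of (extended-)vocab ids
    p_copy = torch.sigmoid(w_copy @ torch.cat([dec_state, context]))        # scalar copy gate P(copy)
    p_vocab = F.softmax(W_vocab @ torch.cat([dec_state, context]), dim=-1)  # [vocab_size]
    # Scatter the attention mass onto the (possibly extended) vocabulary
    p_pointer = torch.zeros(vocab_size).scatter_add(0, src_ids, attn_weights)
    return p_copy * p_pointer + (1 - p_copy) * p_vocab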
Copying
‣ Some words we may want to copy may not be in the fixed output vocab (Pont-de-Buis)
‣ Solution: expand the vocabulary dynamically (the, a, zebra, …, plus source words like Pont-de-Buis, ecotax). New words can only be predicted by copying (they always have 0 probability under P_vocab)
Results on Semantic Parsing
‣ Copying typically helps a bit, but attention captures most of the benefit. However, vocabulary expansion is critical for some tasks (machine translation)
‣ For semantic parsing, copying tokens from the input (texas) can be very useful
Jia and Liang (2016)
seg-59
Handling Rare Words
Sennrich et al. (2016)
‣ Words are a difficult unit to work with: copying can be cumbersome, word vocabularies get very large
‣ Character-level models don't work well
Input: _the _eco tax _port i co _in _Po nt - de - Bu is …
Output: _le _port ique _éco taxe _de _Pont - de - Bui s
‣ Compromise solution: use subword tokens, which may be full words but may also be parts of words
‣ Can achieve transliteration with this, subword structure makes some translations easier to achieve
Byte Pair Encoding (BPE)
‣ Start with every individual byte (character) as its own symbol
‣ Count bigram character cooccurrences in the dictionary
‣ Merge the most frequent pair of adjacent characters
‣ Vocabulary stats are weighted over a large corpus
‣ Doing 30k merges => vocabulary of 30,000 word pieces. Includes many whole words:
and there were no re_ fueling stations anywhere
one of the city 's more un_ princi_ pled real estate agents
Sennrich et al. (2016)
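A toy sketch of the merge loop described above, run over a word-frequency dictionary (illustrative only, not the Sennrich et al. reference code):

from collections import Counter

def bpe_merges(word_freqs, num_merges):
    vocab = {tuple(w): f for w, f in word_freqs.items()}   # each word as a tuple of symbols (chars)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                       # count adjacent pairs, weighted by frequency
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                    # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# e.g. bpe_merges({"refueling": 3, "fueling": 2, "stations": 4}, num_merges=10)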
Word Pieces
‣ SentencePiece library from Google: unigram LM
‣ Alternative to BPE:
  while voc size < target voc size:
    Build a language model over your corpus
    Merge pieces that lead to highest improvement in language model perplexity
‣ Issues: what LM to use? How to make this tractable?
Schuster and Nakajima (2012), Wu et al. (2016)
Comparison
Bostrom and Durrett (2020)
‣ BPE produces less linguistically plausible units than word pieces (unigram LM)
‣ Some evidence that unigram LM works better in pre-trained transformer models
‣ Other work explores ensembling across multiple tokenizations
seg-65
Phrase-Based Machine Translation
Noisy channel model: combine scores from a translation model + language model to translate foreign to English — "translate faithfully but make fluent English"
  P(e|f) ∝ P(f|e) P(e)
Inputs: a phrase table P(f|e) learned from parallel data, e.g.
  cat ||| chat ||| 0.9
  the cat ||| le chat ||| 0.8
  dog ||| chien ||| 0.8
  house ||| maison ||| 0.6
  my house ||| ma maison ||| 0.9
  language ||| langue ||| 0.9
  …
and a language model P(e) trained on unlabeled English data
Phrase-Based Machine Translation
‣ Inputs needed:
‣ Phrase table: set of phrase pairs (e, f) with probabilities P(f|e) — the translation model (TM)
‣ Language model (LM) that scores P(e_i | e_1, …, e_{i−1}) ≈ P(e_i | e_{i−n−1}, …, e_{i−1})
‣ Noisy channel model: P(e|f) ∝ P(f|e) P(e) (ignore the P(f) term)
‣ What we want to find: e produced by a series of phrase-by-phrase translations from an input f
Phrase Lattice
‣ Given an input sentence, look at our phrase table to find all possible translations of all possible spans
‣ Monotonic translation: need to translate each word in order, explore paths in the lattice that don't skip any words
‣ Looks like Viterbi, but the scoring is more complicated
Koehn (2004)
Monotonic Translation
‣ If we translate with beam search, what state do we need to keep in the beam?
‣ Where are we in the sentence
‣ What words have we produced so far (actually only need to remember the last 2 words when using a 3-gram LM)
‣ Score
Monotonic Translation
‣ Beam state: where we're at, what the current translation so far is, and the score of that translation
[Lattice figure: e.g. "Mary" at idx = 1 with score -1.1, expanding to hypotheses like "…did not", "Mary not", "Mary no" at idx = 2 with scores -0.1, -1.2, -2.9]
‣ Advancing the state consists of trying each possible translation that could get us to this timestep
Monotonic Translation
score = log [P(Mary) P(not | Mary) P(Maria | Mary) P(no | not)]   (LM terms and TM terms)
In reality: score = α log P(LM) + β log P(TM), and the TM is broken down into several features
Monotonic Translation
‣ Two ways to get here: Maria + no dio or Maria no + dio
‣ Beam contains options from multiple segmentations of the input — as many hypotheses as paths through the lattice (up to beam size)
Non-Monotonic Translation
‣ More flexible model: can visit the source sentence "out of order"
‣ State needs to describe which words have been translated and which haven't — translated: Maria, dio, una, bofetada
‣ Big enough phrases already capture lots of reorderings, so this isn't as important as you think
"Training" Decoders
‣ score = α log P(t) + β log P(s|t), and P(s|t) is in fact more complex
‣ Usually 5-20 feature weights to set; want to optimize for BLEU score, which is not differentiable
‣ MERT (Och 2003): decode to get 1000-best translations for each sentence in a small training set (<1000 sentences), do line search on the parameters to directly optimize for BLEU
Moses
‣ Toolkit for machine translation due to Philipp Koehn + Hieu Hoang
‣ Pharaoh (Koehn, 2004) is the decoder from Koehn's thesis
‣ Moses implements word alignment, language models, and this decoder, plus a ton more stuff
‣ Highly optimized and heavily engineered, could more or less build SOTA translation systems with this from 2007-2013
seg-71
BART
Lewis et al. (2019)
‣ Sequence-to-sequence BERT variant: permute/mask/delete tokens, then predict the full sequence autoregressively
‣ For downstream tasks: feed the document into both encoder + decoder, use the decoder hidden state as output
BART
Lewis et al. (2019)
‣ Good results on dialogue, summarization tasks. Will discuss more when we get to these problems
‣ Results on GLUE not better than RoBERTa
T5
Raffel et al. (2019)
‣ Pre-training: similar denoising scheme to BART
‣ Frame many problems as sequence-to-sequence ones:
T5
Raffel et al. (2019)
‣ We still haven't hit the limit of bigger data being useful for pre-training: here we see stronger MT results from the biggest data
‣ Colossal Cleaned Common Crawl: 750 GB of text
The Future
‣ Explosion of techniques; remains to be seen what comes out on top
seg-16
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time
probs = ffnn.forward(input)  # [batch_size, num_classes]
loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)  # elementwise product with one-hot labels, summed over the batch
...
‣ Batch sizes from 1-100 often work well
def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    …
Training Basics
‣ Basic formula: compute gradients on a batch, use a first-order optimization method (SGD, Adagrad, etc.)
‣ How to initialize? How to regularize? What optimizer to use?
‣ This segment: some practical tricks. Take deep learning or optimization courses to understand this further
How does initialization affect learning?
[Network diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (m x d matrix) → softmax → P(y|x)]
P(y|x) = softmax(W g(V f(x)))
‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
How does initialization affect learning?
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
Initialization
1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated
‣ Can do random uniform / normal initialization with appropriate scale
‣ Glorot initializer: U[−√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))]
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift+rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
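One way to apply the Glorot/Xavier initializer above in PyTorch (a small illustrative snippet, not from the slides):

import torch.nn as nn

layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(layer.weight)   # U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
nn.init.zeros_(layer.bias)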
Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting, use the whole network at test time
Srivastava et al. (2014)
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ Form of stochastic regularization
‣ One line in PyTorch/Tensorflow
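For reference, that one line in PyTorch looks roughly like this (p is just an example value):

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)        # zeroes each element with prob 0.5 during training
x = dropout(torch.randn(4, 8))     # call .eval() on the module at test time to disable dropout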
Optimizer
‣ Adam (Kingma and Ba, ICLR 2015): very widely used. Adaptive step size + momentum
‣ Wilson et al. NeurIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)
‣ One more trick: gradient clipping (set a max value for your gradients)
seg-11a
Fairness in Classification
‣ Classifiers can be used to make real-world decisions:
‣ Is someone a criminal based on their face? Don't do this!
‣ Who gets an interview?
‣ Who should we lend money to?
‣ Is this online activity suspicious?
‣ Humans making these decisions are typically subject to anti-discrimination laws; how do we ensure classifiers are fair in the same way?
‣ Many other factors to consider when deploying classifiers in the real world (e.g., impact of a false positive vs. a false negative) but we'll focus on fairness here
Evaluating Fairness
Idea 1: Classifiers need to be evaluated beyond just accuracy
‣ T. Anne Cleary (1966-1968): a test is biased if prediction on a subgroup makes consistent nonzero prediction errors compared to the aggregate
‣ Individuals of X group could still score lower on average. But the errors should not be consistently impacting X
[Figure: test result vs. ground truth — a member of π1 has a test result higher than a member of π2 for the same ground truth ability; the test penalizes π2]
Hutchinson and Mitchell (2018)
Evaluating Fairness
Petersen and Novick (1976)
Hutchinson and Mitchell (2018)
Idea 1: Classifiers need to be evaluated beyond just accuracy
‣ Thorndike (1971), Petersen and Novick (1976): fairness in classification: the ratio of predicted positives to ground truth positives must be approximately the same for each group
‣ Allows for different criteria across groups: imposing different classification thresholds actually can give a fairer result
‣ Group 1: 50% positive movie reviews. Group 2: 60% positive movie reviews
‣ A classifier classifying 50% positive in both groups is unfair, regardless of accuracy
‣ Can't we just make our classifiers not depend on sensitive features like gender?
Discrimination
Idea 2: It is easy to build classifiers that discriminate even without meaning to
‣ A feature might correlate with minority group X and penalize that group:
‣ Bag-of-words features can identify particular dialects of English like AAVE or code-switching (using two languages). Impacts classification on social media, etc.
‣ ZIP code as a feature is correlated with race
‣ Reuters: "Amazon scraps secret AI recruiting tool that showed bias against women"
Credit: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
‣ "Women's X" organization, women's colleges were negative-weight features
‣ Accuracy will not catch these problems, very complex to evaluate depending on what humans did in the actual recruiting process
Takeaways
‣ What minority groups in the population should I be mindful of? (Review sentiment: movies with female directors, foreign films, …)
‣ Do aspects of my system or features it uses introduce potential correlations with protected classes or minority groups?
‣ Can I check one of these fairness criteria?
Word Embeddings
Vocab = 10,000 words
"movie was good": the bag-of-words vector is almost all zeros, with 1s at the indices for "movie", "was", "good"
"film is great" = [… film … is … great …] is orthogonal to "movie was good" even though the meanings are similar
Word embeddings: low-dimensional (50-300 dim) representations of words capturing their similarity, e.g. movie ≈ film
How to learn embeddings
J.R. Firth 1957, distributional hypothesis: "You shall know a word by the company it keeps"
  I watched the movie          I developed the film
  The movie inspired me        I watched the film in the darkroom
                               The film inspired me
Brown clusters, …
Mikolov et al. 2013, "word2vec": each word w has a word vector v_w and a context vector c_w; predict each word's context given that word
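A minimal sketch of a skip-gram-style objective under that setup (word and context vectors; illustrative only, no negative sampling):

import numpy as np

def skipgram_logprob(center_id, context_id, W, C):
    # W: [vocab, dim] word vectors, C: [vocab, dim] context vectors
    scores = C @ W[center_id]                          # score every possible context word
    log_probs = scores - np.logaddexp.reduce(scores)   # log softmax over the vocabulary
    return log_probs[context_id]                       # log P(context | center)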
Optimization Basics
Find w̄ to minimize the loss; search over the space of parameters
Training objective: Σ_{i=1}^D loss(x̄^(i), y^(i), w̄) over the training data — a function of w̄
Stochastic gradient descent: repeatedly pick an example i and update
  w̄ ← w̄ − α ∇_w̄ loss(x̄^(i), y^(i), w̄)      (loss on the ith example; α = step size)
Step size example: suppose loss = w² (one feature), starting at w = −1
  The gradient is 2w, so the update is w ← w − 2αw
  If α = 1: w oscillates between −1 and +1 and never converges
  If α = 1/2: w → 0
Choosing Step Size
How to choose the step size?
– Try them out: 1e0, 1e-1, 1e-2, …
– Go from large → small, e.g. 1/t for epoch t (or 1/√t, …) as a fixed schedule, or decrease the step size when performance stagnates on held-out data
Newton's method: w̄ ← w̄ − (∂²L/∂w̄²)^{−1} ∇L — the inverse Hessian is very expensive to compute
Adagrad, Adadelta, Adam: "adaptive" methods, (cheap) approximations to the inverse Hessian (linear in the number of parameters)
Regularization: don't really use [it much in practice here]
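A tiny sketch of the worked example above, L(w) = w² with gradient 2w (illustrative only):

def sgd(w=-1.0, alpha=0.5, steps=10):
    for _ in range(steps):
        grad = 2 * w
        w = w - alpha * grad   # alpha=1.0 oscillates between -1 and +1; alpha=0.5 jumps straight to 0
    return w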
seg-29
Named Entity Recognition
Barack Obama will travel to Hangzhou today for the G20 meeting .
PERSON LOC ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
‣ BIO tagset: begin, inside, outside
‣ Sequence of tags — should we use an HMM?
‣ Why might an HMM not do so well here?
‣ Lots of O's
‣ Insufficient features/capacity with multinomials (especially for unks)
‣ Motivating example for CRFs
Conditional Random Fields
‣ Flexible discriminative model for tagging tasks that can use arbitrary features of the input. Similar to logistic regression, but structured
Barack Obama will travel to Hangzhou today for the G20 meeting .
B-PER I-PER
Curr_word=Barack & Label=B-PER
Next_word=Obama & Label=B-PER
Curr_word_starts_with_capital=True & Label=B-PER
Posn_in_sentence=1st & Label=B-PER
Label=B-PER & Next-Label=I-PER
…
Conditional Random Fields
‣ CRFs: discriminative model with the following form:
P(y|x) = (1/Z) Π_k exp(φ_k(x, y)),   where Z is the normalizer
‣ There is a set of potentials φ_k that each assign a score to x and y; each can be any real-valued scoring function of its arguments
‣ Special case: linear feature-based potentials φ_k(x, y) = w^⊤ f_k(x, y), giving
P(y|x) = (1/Z) exp(Σ_{k=1}^n w^⊤ f_k(x, y))
‣ Looks like our single weight vector multiclass logistic regression model
Log-linear CRFs in General
‣ Conditional model: x is observed, y isn't
‣ Each square in the factor graph represents a potential (φ_k for features f_k) and touches the y's and x's it depends on
[Factor graph figure: potentials φ1, φ2, φ3 over variables y1, y2 and x1, x2, x3]
P(y|x) = (1/Z) exp(Σ_{k=1}^n w^⊤ f_k(x, y))
Z = Σ_{y'} exp(Σ_{k=1}^n w^⊤ f_k(x, y'))
y_best = argmax_{y'} exp(Σ_{k=1}^n w^⊤ f_k(x, y'))
‣ Problem: intractable inference! Sum or max over exponentially many y, even just to compute Z
Sequential CRFs
‣ Omit x from this factor graph. Every feature function can use x because the model conditions on it
[Factor graph figure: chain y1 — y2 — … — yn with transition factors φ_t between adjacent tags and emission factors φ_e on each tag]
‣ Two types of factors: transitions (look at adjacent y's, but not x) and emissions (look at y and all of x)
Sequential CRFs (expressed with potentials φ):
P(y|x) = (1/Z) Π_{i=2}^n exp(φ_t(y_{i−1}, y_i)) Π_{i=1}^n exp(φ_e(y_i, i, x))
Sequential CRFs
Special case: log-linear with transition and emission features:
P(y|x) = (1/Z) exp w^⊤ [ Σ_{i=2}^n f_t(y_{i−1}, y_i) + Σ_{i=1}^n f_e(y_i, i, x) ]
‣ Spoiler: this structure is going to allow us to use dynamic programming (Viterbi) to sum or max over all sequences!
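A hedged sketch of Viterbi decoding for the factorization above, with precomputed log-space scores (array names and shapes are assumptions, not lecture code):

import numpy as np

def viterbi(emission, transition):
    # emission: [n, K] scores phi_e(y_i, i, x); transition: [K, K] scores phi_t(y_prev, y_curr)
    n, K = emission.shape
    best = np.zeros((n, K)); back = np.zeros((n, K), dtype=int)
    best[0] = emission[0]
    for i in range(1, n):
        cand = best[i - 1][:, None] + transition + emission[i][None, :]  # [K_prev, K_curr]
        back[i] = cand.argmax(axis=0)
        best[i] = cand.max(axis=0)
    # Follow backpointers from the best final tag
    y = [int(best[-1].argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]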
HMMs vs. CRFs
‣ Both models are expressible in different factor graph notation
HMM:  P(y, x) = P(y_1) P(x_1|y_1) P(y_2|y_1) P(x_2|y_2) … = P(y_1) Π_{i=2}^n P(y_i|y_{i−1}) Π_{i=1}^n P(x_i|y_i)
CRF:  P(y|x) = (1/Z) Π_{i=2}^n exp(φ_t(y_{i−1}, y_i)) Π_{i=1}^n exp(φ_e(y_i, i, x))
‣ If we squint, an exponentiated potential plays a similar role as a probability
HMMs vs. CRFs
P(y|x) = (1/Z) exp w^⊤ [ Σ_{i=2}^n f_t(y_{i−1}, y_i) + Σ_{i=1}^n f_e(y_i, i, x) ]
‣ HMMs: in the standard HMM, emissions consider one word at a time
‣ CRFs support features over many words simultaneously, non-independent features (e.g., suffixes and prefixes), and don't have to be a generative model
‣ Naive Bayes : logistic regression :: HMMs : CRFs
local vs. global normalization <-> generative vs. discriminative
(locally normalized discriminative models do exist (MEMMs))
seg-15
Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives
y = x * x   --(codegen)-->   (y, dy) = (x * x, 2 * x * dx)
‣ Computation is now something we need to reason about symbolically; use a library like PyTorch (or Tensorflow)
PyTorch
‣ Framework for defining computations that provides easy access to derivatives
‣ Module (torch.nn.Module): defines a neural network (can wrap other modules which implement predefined layers)
forward(x):
    # Takes an example x and computes result
backward():  # produced automatically
    # Computes gradient after forward() is called
‣ If forward() uses crazy math, you have to write backward yourself
Neural Networks for Classification
[Network diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → probs P(y|x)]
P(y|x) = softmax(W g(V f(x)))
Computation Graphs in PyTorch
P(y|x) = softmax(W g(V f(x)))
class FFNN(nn.Module):
    def __init__(self, input_size, hidden_size, out_size):
        super(FFNN, self).__init__()
        self.V = nn.Linear(input_size, hidden_size)
        self.g = nn.Tanh()  # or nn.ReLU(), sigmoid()…
        self.W = nn.Linear(hidden_size, out_size)
        self.softmax = nn.Softmax(dim=0)
‣ Define the forward pass (calling the module, ffnn(x), is syntactic sugar for forward):
    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))
Input to Network
‣ Whatever you define with torch.nn needs its input as some sort of tensor, whether it's integer word indices or real-valued vectors
def form_input(x) -> torch.Tensor:
    # Index words/embed words/etc.
    return torch.from_numpy(x).float()
‣ torch.Tensor is a different data structure from a numpy array, but you can translate back and forth fairly easily
‣ Note that translating out of PyTorch will break backpropagation; don't do this inside your Module
Training and Optimization
P(y|x) = softmax(W g(V f(x)))
ffnn = FFNN(inp, hid, out)
optimizer = optim.Adam(ffnn.parameters(), lr=lr)
for epoch in range(0, num_epochs):
    for (input, gold_label) in training_data:
        ffnn.zero_grad()  # clear gradient variables
        probs = ffnn.forward(input)
        # gold_label: one-hot vector of the label (e.g., [0, 1, 0])
        loss = torch.neg(torch.log(probs)).dot(gold_label)  # negative log-likelihood of correct answer
        loss.backward()
        optimizer.step()
Computing Gradients with Backprop
class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        nn.init.uniform(self.V.weight)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)
‣ Initializing to a nonzero value is critical!
Training a Model
Define modules, etc.
Initialize weights and optimizer
For each epoch:
    For each batch of data:
        Zero out gradient
        Compute loss on batch
        Autograd to compute gradients and take step on optimizer
    [Optional: check performance on dev set to identify overfitting]
Run on dev/test set
seg-14
Vectorization and Softmax
‣ Single scalar probability vs. three classes with "different weights":
[w_1^⊤ f(x), w_2^⊤ f(x), w_3^⊤ f(x)] = [-1.1, 2.1, -0.4]   --softmax-->   class probs [0.036, 0.89, 0.07]
‣ Softmax operation = "exponentiate and normalize":
P(y|x) = exp(w_y^⊤ f(x)) / Σ_{y'∈Y} exp(w_{y'}^⊤ f(x))
‣ We write this as: softmax(W f(x))
Logistic Regression with NNs
‣ Single scalar probability: P(y|x) = exp(w_y^⊤ f(x)) / Σ_{y'∈Y} exp(w_{y'}^⊤ f(x))
‣ Weight vector per class; W is [num classes x num feats]: P(y|x) = softmax(W f(x))
‣ Now one hidden layer: P(y|x) = softmax(W g(V f(x)))
Neural Networks for Classification
[Network diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → probs P(y|x)]
P(y|x) = softmax(W g(V f(x)))
Training Neural Networks
‣ Maximize log likelihood of training data
z = g(V f(x)),   P(y|x) = softmax(Wz)
L(x, i*) = log P(y = i* | x) = log(softmax(Wz) · e_{i*})
         = Wz · e_{i*} − log Σ_j exp(Wz) · e_j
‣ i*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere. Dot by this = select ith index
Backpropagation Picture
[Diagram: f(x) → V → g → z → W → softmax → P(y|x), with gradient ∂L/∂W flowing into W]
P(y|x) = softmax(W g(V f(x)))
‣ Gradient w.r.t. W: looks like logistic regression, can be computed treating z as the features
Backpropagation Picture
‣ Can forget everything after z, treat it as the output and keep backpropping: err(z)
Computing Gradients with Backprop
L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j,   z = g(V f(x)) = activations at the hidden layer,   a = V f(x)
‣ Gradient with respect to V: apply the chain rule
∂L(x, i*)/∂V_ij = (∂L(x, i*)/∂z) (∂z/∂V_ij)
‣ First term: err(z); represents the gradient w.r.t. z
∂z/∂V_ij = (∂g(a)/∂a) (∂a/∂V_ij)
‣ First term: gradient of the nonlinear activation function at a (depends on its current value)
‣ Second term: gradient of a linear function
Backpropagation Picture
[Diagram: f(x) (n features) → V → z (d hidden units) → W → softmax → probs, with gradients ∂L/∂W and ∂z/∂V]
P(y|x) = softmax(W g(V f(x)))
‣ Combine backward gradients err(z) with forward-pass products
seg-28
Trigram Taggers
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ Normal HMM "bigram" model: y1 = NNP, y2 = VBZ, …
‣ Trigram model: y1 = (<s>, NNP), y2 = (NNP, VBZ), …
‣ Probabilities now look like P((NNP, VBZ) | (<s>, NNP)) — more context! We know the verb is occurring two words after the start of the sentence
‣ Tradeoff between model capacity and data size — trigrams are a "sweet spot" for POS tagging
HMM POS Tagging
‣ Penn Treebank English POS tagging: 44 tags
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on words not seen in train
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% acc / 86.0% on unks
‣ MaxEnt tagger (Toutanova + Manning 2000): 96.9% / 87.0%
‣ State-of-the-art (BiLSTM-CRFs, BERT): 97.5% / 89%+
POS Errors
Toutanova + Manning (2000)
official knowledge (JJ/NN NN)   made up the story (VBD RP/IN DT NN)   recently sold shares (RB VBD/VBN NNS)
(NN NN: tax cut, art gallery, …)
Remaining Errors
‣ Lexicon gap (word not seen with that tag in training): 4.5% of errors
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)
‣ Difficult linguistics: 20%
‣ Underspecified / unclear, gold standard inconsistent / wrong: 58%
They set up absurd situations, detached from reality — VBD / VBP? (past or present?)
a $ 10 million fourth-quarter charge against discontinued operations — adjective or verbal participle? JJ / VBN?
Manning 2011 "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?"
POS with Feedforward Networks
Fed raises interest rates in order to …
‣ Word embeddings for each word form the input: f(x) = [emb(raises); emb(interest); emb(rates)] (previous word, curr word, next word; other words, feats, etc.)
‣ f(x) doesn't look like a bag-of-words, instead captures position-sensitive information
POS with Feedforward Networks
Botha et al. (2017)
‣ Hidden layer mixes these different signals and learns feature conjunctions
‣ Use bag-of-character bigram + trigram embeddings for each word
‣ Botha et al. (2017): small FFNNs for NLP tasks
POS with Feedforward Networks
Botha et al. (2017)
‣ Works well on a range of languages
‣ Better than a RNN-based approach (Gillick et al., 2016)
Multiclass Classification
y ∈ {1, …, N}
Two approaches: n binary classifiers (one-vs-all), or a single multiclass classifier with either one weight vector per class (different weights, DW) or different features per class (DF)
  DW: argmax_y w̄_y^⊤ f(x̄)        DF: argmax_y w̄^⊤ f(x̄, y)
Topic Classification
x̄ = "too many drug trials, too few patients",  y ∈ {health, sports, science}
f(x̄) = bag-of-unigrams over {drug, patients, baseball}: f(x̄) = [1, 1, 0]
DW: w̄_health = [2, 5.6, −3], w̄_sports = [1.2, −3.1, 5.7] ⇒ scores 7.6 vs. −1.9
DF: f(x̄, y) = f(x̄) replicated for each class
  f(x̄, y = health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x̄, y = sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]    (Indicator[sentence contains word i ∧ y = sports])
  w̄ = [2, 5.6, −3, 1.2, −3.1, 5.7, …]
Multiclass Perceptron + Logistic Regression
Multiclass Perceptron
Examples: x = [1, 1, 0], y = 1 (health): "too many drug trials, too few patients"; x = [1, 0, 1], y = 2 (sports): "baseball players taking drugs"
for t in epochs:
  for i in data:
    y_pred = argmax_y w̄_y^⊤ f(x^(i))
    if y_pred ≠ y^(i):
      w̄_{y^(i)} ← w̄_{y^(i)} + α f(x^(i))
      w̄_{y_pred} ← w̄_{y_pred} − α f(x^(i))
In the DF view this is w̄ ← w̄ + α [f(x^(i), y^(i)) − f(x^(i), y_pred)], e.g. an update on [1, 1, 0] with gold y = 1 and predicted y = 3 adds α [1, 1, 0, 0, 0, 0, −1, −1, 0]
Multiclass LR
P(y | x̄) = exp(w̄^⊤ f(x̄, y)) / Σ_{y'} exp(w̄^⊤ f(x̄, y'))    (must sum to 1 over y)
In the binary case: y = +1 gives e^{w̄^⊤ f(x̄)}, y = −1 gives 1 = e^0
Gradient of the loss on (x̄^(i), y^(i)): −f(x̄, y^(i)) + Σ_{y'} P(y' | x̄) f(x̄, y')
  If P(y^(i) | x̄) = 1:  −f(x̄, y^(i)) + f(x̄, y^(i)) = 0
  If P(y_bad | x̄) ≈ 1:  −f(x̄, y^(i)) + f(x̄, y_bad)
SGD: w̄ ← w̄ − α ∇_w̄ loss
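A minimal sketch of the multiclass perceptron loop above (different-weights view, one weight vector per class; toy data and names are illustrative):

import numpy as np

def multiclass_perceptron(data, num_classes, num_feats, epochs=10, alpha=1.0):
    W = np.zeros((num_classes, num_feats))
    for _ in range(epochs):
        for f_x, y in data:                      # f_x: feature vector, y: gold class index
            y_pred = int(np.argmax(W @ f_x))
            if y_pred != y:
                W[y] += alpha * f_x              # boost the gold class
                W[y_pred] -= alpha * f_x         # penalize the predicted class
    return W

# e.g. data = [(np.array([1., 1., 0.]), 0), (np.array([1., 0., 1.]), 1)]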
seg-38
Lexicalized Parsing
[Figure: lexicalized tree for "the dog ran to the house": S(ran) → NP(dog) VP(ran); VP(ran) → VBD(ran) PP(to); PP(to) → TO(to) NP(house); NP(dog) → DT(the) NN(dog); NP(house) → DT(the) NN(house)]
Dependency Parsing
[Figure: dependency tree for "the dog ran to the house" with POS tags DT NN VBD TO DT NN and ROOT]
‣ Dependency syntax: syntactic structure is defined by these arcs
‣ Head (parent, governor) connected to dependent (child, modifier)
‣ Each word has exactly one parent except for the ROOT symbol, dependencies must form a directed acyclic graph
‣ POS tags same as before, often run a tagger first as preprocessing
Dependency Parsing
[Figure: the same dependency tree drawn hierarchically]
‣ Still a notion of hierarchy! Subtrees often align with constituents
Labeled Dependency Parsing
the dog ran to the house
‣ Can label dependencies according to syntactic function: det, nsubj, prep, pobj, det
‣ Major source of ambiguity is in the structure, so we focus on that more (labeling separately with a classifier works pretty well)
Dependency vs. Constituency
‣ Constituency: several rule productions need to change
the children ate the cake with a spoon
‣ Dependency: one parent differs
Dependency vs. Constituency
dogs in houses and cats:   [dogs in houses] and cats   vs.   dogs in [houses and cats]
‣ Constituency: ternary rule NP -> NP CC NP
‣ Dependency: first item is the head. Doesn't really make sense
Stanford Dependencies
‣ Designed to be practically useful for relation extraction
Standard vs. Collapsed
Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas
Universal Dependencies
‣ Annotate dependencies with the same representation in many languages (English, Bulgarian, Czech, Swiss, …)
‣ Dependencies are more portable cross-lingually: languages with free word order are not well handled by constituency
Projectivity
‣ Any subtree is a contiguous span of the sentence <-> the tree is projective
‣ Equivalent to drawing the structure and having no crossing arcs
‣ Crossing arcs / nonprojective:
credit: Language Log
Projectivity
‣ Swiss German example: X said [that we helped Hans paint the house]
‣ Many trees in other languages are nonprojective
Pitler et al. (2013)
seg-39
Transition-based Parsing
‣ We can build a dependency parser using a chart-based algorithm like CKY
‣ Time: O(n³), but the algorithm is very tricky!
‣ Transition-based, or shift-reduce, is another style of parser; similar to deterministic parsing for compilers
‣ A tree is built from a sequence of incremental decisions moving left to right through the sentence
‣ Stack contains the partially-built tree, buffer contains the rest of the sentence
Transition System
I ate some spaghetti bolognese   (ROOT)
‣ Initial state: Stack: [ROOT]  Buffer: [I ate some spaghetti bolognese]
‣ Shift: top of buffer -> top of stack
‣ Shift 1: Stack: [ROOT I]  Buffer: [ate some spaghetti bolognese]
‣ Shift 2: Stack: [ROOT I ate]  Buffer: [some spaghetti bolognese]
Transition System
‣ State: Stack: [ROOT I ate]  Buffer: [some spaghetti bolognese]
‣ Left-arc (reduce): let σ denote the stack, σ|w = a stack ending in w
  σ|w−2, w−1 → σ|w−1;  w−2 is now a child of w−1
‣ "Pop two elements, add an arc, put them back on the stack"
‣ State: Stack: [ROOT ate] (with child I)  Buffer: [some spaghetti bolognese]
Arc-Standard Parsing
‣ Arc-standard system: three operations
‣ Shift: top of buffer -> top of stack
‣ Left-Arc: σ|w−2, w−1 → σ|w−1;  w−2 is now a child of w−1
‣ Right-Arc: σ|w−2, w−1 → σ|w−2;  w−1 is now a child of w−2
‣ Start: stack: [ROOT], buffer: [I ate some spaghetti bolognese]
‣ End: stack contains [ROOT], buffer is empty []
‣ How many transitions do we need if we have n words in a sentence?
‣ There are other transition systems, but we won't discuss these; a small code sketch of the three operations follows
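A hedged sketch of the arc-standard operations above (data structures and names are assumptions; a real parser would also track labels and check legality):

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))       # top of buffer -> top of stack

def left_arc(stack, buffer, arcs):
    child = stack.pop(-2)             # w_-2 becomes a child of w_-1
    arcs.append((stack[-1], child))   # (head, dependent)

def right_arc(stack, buffer, arcs):
    child = stack.pop(-1)             # w_-1 becomes a child of w_-2
    arcs.append((stack[-1], child))

# Example derivation for "I ate spaghetti":
stack, buffer, arcs = ["ROOT"], ["I", "ate", "spaghetti"], []
for op in [shift, shift, left_arc, shift, right_arc, right_arc]:
    op(stack, buffer, arcs)
# arcs == [("ate", "I"), ("ate", "spaghetti"), ("ROOT", "ate")]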
Arc-Standard Parsing
I ate some spaghetti bolognese   (ROOT)
S: top of buffer -> top of stack;  LA: pop two, left arc between them;  RA: pop two, right arc between them
      [ROOT]                              [I ate some spaghetti bolognese]
S  →  [ROOT I]                            [ate some spaghetti bolognese]
S  →  [ROOT I ate]                        [some spaghetti bolognese]
LA →  [ROOT ate]  (I <- ate)              [some spaghetti bolognese]
‣ Could do the left arc later! But no reason to wait
‣ Can't attach ROOT <- ate yet even though this is a correct dependency!
Arc-Standard Parsing
S  →  [ROOT ate some]                     [spaghetti bolognese]
S  →  [ROOT ate some spaghetti]           [bolognese]
LA →  [ROOT ate spaghetti]  (some <- spaghetti)   [bolognese]
S  →  [ROOT ate spaghetti bolognese]      []
Arc-Standard Parsing
RA →  [ROOT ate spaghetti]  (bolognese <- spaghetti)   []
RA →  [ROOT ate]            (spaghetti <- ate)         []
RA →  [ROOT]                (ate <- ROOT)              []
Final state: stack [ROOT], empty buffer
‣ Stack consists of all words that are still waiting for right children; end with a bunch of right-arc ops
Building Transition-Based Parsers
‣ Multi-way classification problem: shift, left-arc, or right-arc?
argmax_{a ∈ {S, LA, RA}} w^⊤ f(stack, buffer, a)
[ROOT]  [I ate some spaghetti bolognese]
‣ How do we make the right decision in this case? Only one legal move (shift)
[ROOT ate some spaghetti] (I <- ate)  [bolognese]
‣ How do we make the right decision in this case? (all three actions legal)
Features for Shift-Reduce Parsing
[ROOT ate some spaghetti] (I <- ate)  [bolognese]
‣ Features to know this should left-arc?
‣ One of the harder feature design tasks!
‣ In this case: the stack tag sequence VBD – DT – NN is pretty informative — looks like a verb taking a direct object which has a determiner in it
‣ Things to look at: top words/POS of buffer, top words/POS of stack, leftmost and rightmost children of top items on the stack
Training a Greedy Model
argmax_{a ∈ {S, LA, RA}} w^⊤ f(stack, buffer, a)
‣ Can turn a tree into a decision sequence a by building an oracle
‣ Train a classifier to predict the right decision using these as training data
‣ Training data assumes you made correct decisions up to this point and teaches you to make the correct decision, but what if you screwed up…
Training a Greedy Model
‣ Greedy: 2n local training examples
[Figure: state space from start state to gold end state]
‣ Non-gold states are unobserved during training: we consider making bad decisions but don't condition on bad decisions
seg-11
Text Classification
~20 classes: Sports, Health, …
‣ 20 Newsgroups, Reuters, Yahoo! Answers, …
Textual Entailment
Bowman et al. (2015)
‣ Three-class task over sentence pairs:
A black race car starts up in front of a crowd of people. / A man is driving down a lonely road. — CONTRADICTS
A soccer game with multiple males playing. / Some men are playing a sport. — ENTAILS
A smiling costumed woman is holding an umbrella. / A happy woman in a fairy costume holds an umbrella. — NEUTRAL
‣ Not clear how to do this with simple bag-of-words features
Entity Disambiguation/Entity Linking
‣ 4,500,000 classes (all articles in Wikipedia)
Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.
Which Armstrong? "Lance Edward Armstrong is an American former professional road cyclist" vs. "Armstrong County is a county in Pennsylvania…"
‣ Needs a very different structure for f(x,y) to accommodate so many classes
Authorship Attribution
‣ Early work: Shakespeare's plays, Federalist papers (Hamilton v. Madison)
‣ Statistical methods date back to 1930s and 1940s
‣ Based on handcrafted heuristics like stopword frequencies
‣ Twitter: given a bunch of tweets, can we figure out who wrote them?
‣ Schwartz et al. EMNLP 2013: 500M tweets, take 1000 users with at least 1000 tweets each
‣ Task: given a held-out tweet by one of the 1000 authors, who wrote it?
Authorship Attribution
Schwartz et al. (2013)
‣ SVM with character 4-grams, word 2-grams through 5-grams
50-author task
Authorship Attribution
Schwartz et al. (2013)
‣ k-signature: n-gram that appears in k% of the author's tweets but not appearing for anyone else — suggests why these are so effective
seg-13
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Warp space / Shift / Nonlinear transformation
z = g(V f(x) + b)
y_pred = argmax_y w_y^⊤ z
‣ Ignore the shift / +b term for the rest of the course
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Linear classifier vs. neural network: linear classification in the transformed space!
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
z_1 = g(V_1 f(x))
z_2 = g(V_2 z_1)
…
y_pred = argmax_y w_y^⊤ z_n
Neural Nets
Feature table (features: contains "good", contains "bad", contains "not"):
  good:     [1, 0, 0]   (+1)
  bad:      [0, 1, 0]   (−1)
  not good: [1, 0, 1]
  not bad:  [0, 1, 1]
NNs: transform the data into a latent feature space — replace f(x̄) with a nonlinear function of the original f(x̄)
Define z̄ = g(V f(x̄)); classify with w̄^⊤ z̄
  V: d × (#features) matrix, z̄: d-dimensional feature vector, g: nonlinearity (g = tanh or g = ReLU)
How can V + g give us useful latent features?
NN example
Suppose g = tanh and V maps [x_1, x_2] to [x_1, x_2, x_1 + x_2], i.e. V = [[1, 0], [0, 1], [1, 1]]
  tanh(0) = 0, tanh(1) ≈ 1, tanh(2) ≈ 1
  z̄ = g(V x̄) = [tanh(x_1), tanh(x_2), tanh(x_1 + x_2)]
Sentiment Analysis and Basic Feature Extraction
  "the movie was great! would watch again!"   (+)
  "the film was awful; I'll never watch it again"   (−)
① text x̄ ⇒ f(x̄): feature extraction
② {(f(x̄^(i)), y^(i))}: dataset of D labeled examples ⇒ train classifier
Feature extraction
"the movie was great"
Bag-of-words: assume 10,000 words in the vocabulary
  f = [… the … movie … was … great …]: 4 ones, 9,996 zeros
  counts (how many "the" are present) or presence/absence (0/1)
n-gram: sequence of n consecutive words
Bag-of-n-grams, 2-grams: "the movie", "movie was", "was great"
tf-idf
  tf: count of the term
  idf: inverse document frequency = log(total # docs / # docs containing w)
  score = tf × idf
Preprocessing
① Tokenization: "was great!" → [… "great", "!" …] rather than "great!"; "wasn't" → "was" + "n't"
② [Sometimes] Stopword removal (the, of, a, …)
③ [Sometimes] Casing (lowercasing, truecasing)
④ Handling unknown words: Durrett ⇒ UNK
⑤ Indexing: map each {word, n-gram} into ℕ; use a map
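A small illustrative sketch of the bag-of-words + indexing steps above (toy tokenizer and vocabulary, not the course's reference code):

from collections import Counter

def tokenize(text):
    return text.lower().replace("!", " !").split()

def extract_features(text, vocab_index):
    counts = Counter(tokenize(text))
    feats = [0] * len(vocab_index)
    for word, c in counts.items():
        idx = vocab_index.get(word)      # unknown words could instead map to an UNK index
        if idx is not None:
            feats[idx] = c               # use 1 instead of c for presence/absence features
    return feats

vocab_index = {"the": 0, "movie": 1, "was": 2, "great": 3, "!": 4}
print(extract_features("the movie was great!", vocab_index))  # [1, 1, 1, 1, 1]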
seg-37
PCFG Independence Assumptions
‣ Language is not context-free: NPs in different contexts rewrite differently
‣ [They]_NP received [the package of books]_NP
[Table: how often an NP rewrites as NP PP / DT NN / PRP — all NPs: 11% / 9% / 6%; NPs under S: 9% / 9% / 21%; NPs under VP: 23% / 7% / 4%]
Vertical Markovization
[Trees for "She saw it": basic tree (v = 1) with S → NP VP, NP → PRP, VP → VBD PRP; v = 2 Markovization annotates each symbol with its parent: S^ROOT → NP^S VP^S, VP^S → VBD^VP PRP^VP, NP^S → PRP^NP]
‣ Why is this a good idea?
Binarization Revisited
VP → VBZ NP PP PP   ("sold books to her for $50")
‣ Another way of doing lossless binarization (abusing notation slightly): introduce intermediate symbols VP [VBZ], VP [VBZ NP], VP [VBZ NP PP]
‣ Equivalent to the original because of the chain rule of probability:
P(VP -> VBZ NP PP PP) = P(VBZ | VP) P(NP | VP [VBZ]) P(PP | VP [VBZ NP]) P(PP | VP [VBZ NP PP])
Horizontal Markovization
VP → VBZ NP PP PP   ("sold books to her for $50")
‣ Changes the amount of context remembered in binarization (h = ∞: remember all):
  h = 2: VP [… VBZ NP],  h = 1: VP [… NP],  h = 0: VP
  h = 2: VP [… VBZ],     h = 1: VP [… VBZ], h = 0: VP
‣ In practice: always remember the head tag
Annotating Trees
‣ First apply vertical Markovization, then binarize + apply horizontal
[Trees: VP^S → VBZ^VP NP^VP PP^VP under S^ROOT → NP^S VP^S, with binarized symbols such as VP^S [… VBZ^VP]]
Annotating Trees
Klein and Manning (2003)
Lexicalization
§ What's different between basic PCFG scores here?
§ What (lexical) correlations need to be scored?
‣ Even with parent annotation, these trees have the same rules. Need to use the words
Lexicalization
‣ Annotate each grammar symbol with its "head word": the most important word of that constituent
§ Add "head words" to each phrasal node
§ Syntactic vs. semantic heads
§ Headship not in (most) treebanks
§ Usually use head rules, e.g.:
  § NP: take leftmost NP; take rightmost N*; take rightmost JJ; take right child
  § VP: take leftmost VB*; take leftmost VP; take left child
‣ Rules for identifying headwords (e.g., the last word of an NP before a preposition is typically the head)
‣ Collins and Charniak (late 90s): ~89 F1 with these
Sequence Labeling (Tagging), Tagging with Classifiers
Input x̄ = (x_1, x_2, …, x_n); output ȳ = (y_1, y_2, …, y_n), one prediction per word — structured classification
  Fed raises interest rates 0.5 percent
Predict each y_i independently with logistic regression: P(y_i = y | x̄, i), where i is the index we're predicting at (e.g. i = 3)
  f(x̄) = [1 0 0 1 0 0 0 …] for "Fed raises …"
  BOW: f(x̄, y = NN) = [… 1 … 1 …] is independent of i; a simple alternative is a single feature on the current word
Positional features: f(x̄, y = NN, i = 3) = [… 0 0 0 1 0 …]
Positional features with context: indicators for the current word, previous word, and next word
Conjunctions of properties: Indicator[curr word = interest ∧ tag = NN] — each conjunction acts like a "word" in a bag-of-words feature space
Problem with classification for tagging
  indicators → classifier: what goes wrong? Predictions of a classifier may be incoherent, e.g. in "Fed raises interest rates …", raises could come out VBZ or VBP and interest NNS or NN inconsistently
  Sequence modeling: Hidden Markov Models, Conditional Random Fields
Part-of-Speech Tagging
Ambiguous words: record (noun or verb), arms (noun or verb), …
POS tags: open class (new words can join these categories) vs. closed class
  Open class — Nouns: proper (IBM, Italy), common (cat, snow); Verbs: see, tweet; Adjectives; Adverbs: swiftly
  Closed class — Determiners: the, some; Conjunctions: and, or; Pronouns; Auxiliaries: had [+ VERB]; Modals: could; Prepositions: up, in, to; Particles: up in "made up"
Example
  Fed raises interest rates 0.5 percent
  Candidate tags: Fed: NNP / VBN / VBD; raises: VBZ / NNS; interest: NN / VB / VBP; rates: NNS / VBZ; 0.5: CD; percent: NN
  "I want to interest you" — interest as VB; "I fed …" — fed as VBD; "I had fed …" — fed as VBN
CKY Algorithm
argmax_T P(T | x̄) = argmax_T P(T, x̄)
Viterbi-style dynamic program; runtime O(n³ |G|)
Chart: t[i, j, X] = score of the best derivation of X over span (i, j)
  Base case: t[i, i+1, X] = log P(w_i | X)
  Recurrence: t[i, j, X] = max_{k, X_1, X_2} log P(X → X_1 X_2) + t[i, k, X_1] + t[k, j, X_2]
Example grammar fragment: S → NP VP (1.0), NP → NN NNS (0.5), VBZ → raises (1.0), PRP → it (1.0), …; log(0.5) = −1
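A hedged sketch of the CKY recurrence above for a binarized PCFG in log space (the data structures preterm[(tag, word)] and rules[(X, X1, X2)] holding log-probabilities are assumptions for illustration):

import math
from collections import defaultdict

def cky(words, preterm, rules, nonterms):
    n = len(words)
    t = defaultdict(lambda: -math.inf)   # t[(i, j, X)] = best log score of X over span (i, j)
    for i, w in enumerate(words):
        for X in nonterms:
            if (X, w) in preterm:
                t[(i, i + 1, X)] = preterm[(X, w)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for (X, X1, X2), logp in rules.items():
                for k in range(i + 1, j):
                    score = logp + t[(i, k, X1)] + t[(k, j, X2)]
                    if score > t[(i, j, X)]:
                        t[(i, j, X)] = score
    return t  # t[(0, n, "S")] is the best parse score (backpointers omitted)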
Basic Machine Learning, Gradient Descent
Parameters w̄ to optimize; labeled data {(x̄^(i), y^(i))}_{i=1}^D (input, label); suppose 10,000 features, so w̄ ∈ R^10,000
Searching for an optimal w̄
Optimization problem: formulate a training objective, which is (for us) a sum over examples, and optimize it: Σ_i loss(x̄^(i), y^(i), w̄)
Stochastic gradient descent:
  for t up to num epochs:
    for i up to D: sample j ∈ [1, D]
      w̄ ← w̄ − α ∇_w̄ loss(x̄^(j), y^(j), w̄)      (α = step size; the gradient points towards w̄ that gives higher loss)
Linear Binary Classification
Point x̄ (a string) → feature extractor f → f(x̄) ∈ R^D, e.g. f(x̄) = [3, −1, 2, 1]
Label y ∈ {−1, +1}
Classifier: weight vector w̄ defines a decision boundary; classify positive if w̄^⊤ f(x̄) + b ≥ 0 (we fold in or drop the bias b and do not use it explicitly)
seg-20
Bias in Word Embeddings
‣ Word embeddings are trained from the whole web, and this data encodes biases about the real world
‣ Identify the she – he axis in word vector space, project words onto this axis
‣ Nearest neighbor of (b – a + c)
Bolukbasi et al. (2016)
Manzini et al. (2019)
Debiasing
Bolukbasi et al. (2016)
‣ Identify the gender subspace with gendered words (she, he, woman, man, …)
‣ Project words onto this subspace
‣ Subtract those projections from the original word: homemaker → homemaker'
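A minimal sketch of that projection step, simplified to a single gender direction (illustrative only, not the Bolukbasi et al. reference code):

import numpy as np

def debias(word_vec, gender_dir):
    g = gender_dir / np.linalg.norm(gender_dir)
    projection = (word_vec @ g) * g        # component of the word along the gender direction
    return word_vec - projection           # e.g. homemaker -> homemaker'

# gender_dir could be estimated from differences of gendered word vectors, e.g. she - he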
Hardness of Debiasing
‣ Not that effective… and the male and female words are still clustered together
‣ Bias pervades the word embedding space and isn't just a local property of a few words
Gonen and Goldberg (2019)
seg-34
Syntax
‣ Study of word order and how words form sentences
‣ Why do we care about syntax?
‣ Recognize verb-argument structures (who is doing what to whom?)
‣ Multiple interpretations of words (noun or verb? Fed raises… example)
‣ Higher level of abstraction beyond words: some languages are SVO, some are VSO, some are SOV; parsing can canonicalize
Constituency Parsing
‣ Tree-structured syntactic analyses of sentences
‣ Constituents: (S)entence, (N)oun (P)hrases, (V)erb (P)hrases, (P)repositional (P)hrases, and more
‣ Bottom layer is POS tags
‣ Examples will be in English. Constituency makes sense for a lot of languages but not all
Constituency Parsing
sentential complement: whole embedded sentence
adverbial phrase
unary rule
ternary rule
Constituency Parsing
a refund that the court estimated *-1
Challenges
‣ PP attachment:
§ If we do no annotation, these trees differ only in one rule:
  § VP → VP PP
  § NP → NP PP
§ Parse will go one way or the other, regardless of words
§ Lexicalization allows us to be sensitive to specific words
same parse as "the cake with some icing"
Challenges
‣ Modifier scope: plastic cup holder — [plastic cup] holder vs. plastic [cup holder] (JJ NN NN)
‣ Complement structure: The students complained to the professor that they didn't understand
‣ Coordination scope: The man picked up his hammer and saw (compare: The man picked up his hammer and swung)
Eisenstein textbook
Constituency
‣ How do we know what the constituents are?
‣ Constituency tests:
‣ Substitution by proform (pronoun, do so)
‣ Clefting (It was with a spoon that…)
‣ Answer ellipsis (What did they eat? the cake) (How? with a spoon)
‣ Sometimes constituency is not clear, e.g., coordination: she went to and bought food at the store
Probabilistic Context-Free Grammars
CFG = (N, T, S, R): nonterminals, terminals, start symbol, rules (binary and unary)
  Nonterminals: S, NP, VP, PP; POS tags (preterminals): DT, NN, VBD, IN, NNS
  Terminals: the, children, ate, cake, spoon, …
  Example rules: S → NP VP; VP → VBD NP; NP → DT NN; NP → DT NNS; DT → the; NNS → children; NN → cake; NN → spoon; V → ate
PCFG: rules have probabilities, and the probabilities sum to one per parent: P(rule | parent(rule))
P(tree T) = Π_{r ∈ T} P(r | parent(r))
Steps to Parsing
  Treebank of sentences labeled with trees
  ① Binarization
  ② Estimation of rule probabilities (count and normalize — the maximum likelihood estimate)
  ③ Inference: argmax_T P(T | x̄)
Binarization can be lossless or lossy [garbled example comparing two binarizations of a VP]
seg-21
Applying Embeddings
‣ Approach 1: learn these embeddings as parameters from your data
‣ Approach 2: initialize word embeddings using GloVe, keep fixed
‣ Approach 3: initialize word embeddings with GloVe, fine-tune
‣ Faster because no need to update these parameters
‣ Works best for some tasks
‣ Often works pretty well
‣ First layer of your network: map from word indices to embeddings
‣ Can also evaluate embeddings intrinsically on tasks like word similarity
Deep Averaging Networks
‣ Deep Averaging Networks: feedforward neural network on average of word
embeddings from input
Iyyer et al. (2015)
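A minimal PyTorch sketch of this idea (the dimensions and the two-layer feedforward part are illustrative choices, not the exact Iyyer et al. architecture):

```python
import torch
import torch.nn as nn

class DeepAveragingNetwork(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=100, num_classes=2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.ffnn = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, word_indices):
        # word_indices: LongTensor of shape [sentence length]
        avg = self.embeddings(word_indices).mean(dim=0)      # average of word embeddings
        return torch.log_softmax(self.ffnn(avg), dim=-1)     # log-probabilities over classes
```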
Deep Averaging Networks
‣ Contradicts a widely-held view
that we need to model syntactic
structure to represent language
‣ Simple averaging can work as
well as syntactic composition for
some problems!
Iyyer et al. (2015)
Deep Averaging Networks
{
{
Bag-of-words
Tree-structured
neural networks
Wang and Manning
(2012)
Kim (2014)
Iyyer et al. (2015)
No pretrained
embeddings
Iyyer et al. (2015)
Deep Averaging Networks
‣ Will return to compositionality with syntax and LSTMs
Iyyer et al. (2015)
seg-0
What is the goal of NLP?
‣ Be able to solve problems that require deep understanding of text
Siri, what’s your
favorite kind of
movie?
I like superhero
movies!
What’s come
out recently?
‣ Systems that talk to us: dialogue systems, machine translation,
summarization
The Avengers
What is the goal of NLP?
中共中央政治局7⽉30⽇召开会议,会议分析研究当前经
济形势,部署下半年经济⼯作。
The Political Bureau of the CPC Central Committee held a meeting on July 30 to
analyze and study the current economic situation and plan economic work in the
second half of the year.
People’s Daily, August 10, 2020
‣ Be able to solve problems that require deep understanding of text
‣ Systems that talk to us: dialogue systems, machine translation,
summarization
The Political Bureau
of the CPC Central
Committee July 30 hold a meeting
Translate
What is the goal of NLP?
When was Abraham Lincoln born?
February 12, 1809
Name Birthday
Lincoln, Abraham 2/12/1809
Washington, George 2/22/1732
Adams, John 10/30/1735
‣ Build systems that extract information from text and answer questions
The park has a total of five
visitor centers
five
How many visitor centers are there in Rocky Mountain National Park?
map to Birthday field
“Standard” NLP Pipeline
Syntactic parses
Coreference resolution
Entity disambiguation
Discourse analysis
Summarize
Extract information
Answer questions
Identify sentiment
Translate
Text Annotations        Text Analysis        Applications
‣ All of these components are modeled with statistical
approaches using machine learning
How do we represent language?
Labels
Sequences/tags
Trees
Text
the movie was good +
Beyoncé had one of the best videos of all time    subjective
Tom Cruise stars in the new Mission Impossible film
PERSON WORK_OF_ART
I eat cake with icing
PP
NP
S
NP
VP
VBZ NN
flights to Miami
λx. flight(x) ∧ dest(x)=Miami
How do we use these representations?
Labels
Sequences
Trees
Text AnalysisText
‣ Why is this prediction hard? Because language is complex and ambiguous!
‣ What ambiguities do we need to resolve?
…
Applications
Tree transducers (for machine
translation)
Extract syntactic features
Tree-structured neural networks
end-to-end models …
What makes language hard?
Teacher Strikes Idle Kids
Ban on Nude Dancing on Governor’s Desk
Iraqi Head Seeks Arms
What makes language hard?
‣ There aren't just one or two possibilities, but many!
It is really nice out
il fait vraiment beau
It’s really nice
The weather is beautiful
It is really beautiful outside
He makes truly beauFful
It fact actually handsome
“AI winter”
rule-based,
expert systems
A brief history of NLP techniques
Unsup: topic
models,
grammar induction
Collins vs.
Charniak
parsers
1980 1990 2000 2010 2020
earliest stat MT
work at IBM
Penn
treebank
NP VP
S
Ratnaparkhi
tagger
NNP VBZ
Sup: SVMs,
CRFs, NER,
Sentiment
Semi-sup,
structured
prediction
Neural/
pretrained
‣ This course focuses on supervised learning, semi-supervised methods, and
neural models (including pre-training)
‣ All of these models can handle ambiguity by learning how to map text into
linguistic representations using data
Outline of the course
‣ Classification: linear and neural, word representations
‣ Text analysis: tagging (HMMs, CRFs) and parsing (PCFGs)
‣ Language modeling and pre-training
‣ Question answering and semantics
‣ Machine translation
‣ Applications
Perceptron
Example: binary classifier, prediction = w⊤f(x)
  "movie good"  +1   f = [1 1 0 0]
  "movie bad"   -1   f = [1 0 1 0]
  "not good"    -1   f = [0 1 0 1]      (unigram features over the vocabulary {movie, good, bad, not})
for t up to num_epochs:
    for i up to D:
        y_pred = +1 if w⊤f(x⁽ⁱ⁾) ≥ 0, else -1
        w ← w                  if y_pred = y⁽ⁱ⁾
        w ← w + α f(x⁽ⁱ⁾)      else, if y⁽ⁱ⁾ = +1
        w ← w - α f(x⁽ⁱ⁾)      else, if y⁽ⁱ⁾ = -1
Example 2:  good +1 [1 0 0],  bad -1 [0 1 0],  not good -1 [1 0 1],  not bad +1 [0 1 1]
  → not linearly separable with unigram features
  ↳ Add bigram features
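A minimal NumPy sketch of this perceptron loop (the feature vectors and learning rate α are the illustrative values from the note above):

```python
import numpy as np

def train_perceptron(data, num_epochs=10, alpha=1.0):
    """data: list of (feature_vector, label) pairs with labels in {+1, -1}."""
    w = np.zeros(len(data[0][0]))
    for _ in range(num_epochs):
        for f, y in data:
            y_pred = 1 if w.dot(f) >= 0 else -1
            if y_pred != y:              # no update on correct predictions
                w += alpha * y * f       # add features for +1 examples, subtract for -1
    return w

# Unigram features over [movie, good, bad, not]
data = [(np.array([1, 1, 0, 0]), +1),   # "movie good"
        (np.array([1, 0, 1, 0]), -1),   # "movie bad"
        (np.array([0, 1, 0, 1]), -1)]   # "not good"
w = train_perceptron(data)
```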
9
HMMs: Parameter Estimation
Labeled data (x⁽ⁱ⁾, y⁽ⁱ⁾), i = 1, …, D
Maximize \sum_i \log P(y^{(i)}, x^{(i)})   — generative (joint) likelihood
  = \sum_i \log P(y_1^{(i)}) + \sum_i \sum_j \log P(y_j^{(i)} \mid y_{j-1}^{(i)}) + \sum_i \sum_j \log P(x_j^{(i)} \mid y_j^{(i)})
MLE with frequency counts: biased coin with probability p of heads;
  observing HHHT, \arg\max_p [3 \log p + \log(1 - p)] gives p = 3/4
HMM parameter estimation: count and normalize
Example: T = {N, V, STOP}, V = {they, can, fish}
  Count tag transitions (N→V, V→V, V→STOP, …) and normalize per previous tag;
  count (tag, word) pairs and normalize per tag for emissions — note this is P(can | V), not P(V | can)
Smoothing: add counts to avoid zero probabilities for unseen (tag, word) pairs
seg-31
CRF Learning and Inference
‣ Model: P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
  equivalently, P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Inference
‣ Learning
Inference: Computing (arg)maxes
\max_{y_1, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots e^{\phi_e(y_2, 2, x)} e^{\phi_t(y_1, y_2)} e^{\phi_e(y_1, 1, x)}
‣ This is maxing over a product of terms with exactly the same
structure as HMM inference
‣ Replace log probabilities in Viterbi code with potentials here:
  \log P(y_i | y_{i-1}) \rightarrow \phi_t(y_{i-1}, y_i),   \log P(x_i | y_i) \rightarrow \phi_e(y_i, i, x)
‣ These potentials depend on more of x, but x is constant, and this
doesn't change inference over y
‣ Can do efficient inference in any tree-structured CRF (max-product or
sum-product algorithms)
CRF Learning and Inference
‣ Model: P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
  equivalently, P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Inference: argmax P(y|x) from Viterbi
‣ Learning
Learning in Log-linear CRFs
‣ How did the gradient look in multiclass logistic regression?
  CRF: P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Logistic regression: P(y|x) \propto \exp w^\top f(x, y)
‣ Maximize L(y^*, x) = \log P(y^* | x)
  \frac{\partial}{\partial w_i} L(x_j, y_j^*) = f_i(x_j, y_j^*) - E_y[f_i(x_j, y)]
  (gold feature value minus the model's expectation of the feature value)
‣ Subtract off "expected features," summing over all possible other labels:
  \frac{\partial}{\partial w_i} L(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) P_w(y | x_j)
Learning in Log-linear CRFs
‣ Gradient is analogous to logistic regression:
  CRF: P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Logistic regression: P(y|x) \propto \exp w^\top f(x, y)
‣ Maximize L(y^*, x) = \log P(y^* | x)
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=2}^{n} f_t(y_{i-1}^*, y_i^*) + \sum_{i=1}^{n} f_e(y_i^*, i, x) - E_y\left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ The expectation term is intractable to compute directly: it requires summing over all possible tag sequences
Learning in Log-linear CRFs
‣ Let's focus on the emission feature expectation in
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=2}^{n} f_t(y_{i-1}^*, y_i^*) + \sum_{i=1}^{n} f_e(y_i^*, i, x) - E_y\left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
  E_y\left[ \sum_{i=1}^{n} f_e(y_i, i, x) \right] = \sum_{y \in \mathcal{Y}} P(y|x) \left[ \sum_{i=1}^{n} f_e(y_i, i, x) \right]
                                                 = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} P(y|x) f_e(y_i, i, x)
                                                 = \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)
‣ How do we compute these marginals P(y_i = s | x)? Forward-backward
  P(y_i = s | x) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(y | x)
Training CRFs
‣ Transition features: need to compute P(y_i = s_1, y_{i+1} = s_2 | x), using forward-backward as well
‣ …but you can build a pretty good system without learned transition
features (use heuristic weights, or just enforce constraints like
B-PER -> I-ORG is illegal)
‣ For emission features: gold features minus expected features under the model
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=1}^{n} f_e(y_i^*, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)
CRF Learning and Inference
‣ Model: P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
  equivalently, P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Inference: argmax P(y|x) from Viterbi
‣ Learning: run forward-backward to compute posterior probabilities; then
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=1}^{n} f_e(y_i^*, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)
CRF Learning Implementation
for each epoch
    for each example
        extract features on each emission and transition (look up in cache)
        compute potentials phi based on features + weights
        compute marginal probabilities with forward-backward
        accumulate gradient over all emissions and transitions
‣ Caching is your friend! Cache feature vectors especially, and reduce any
redundant computation
‣ Exploit sparsity where possible, especially in feature vectors and gradients
Debugging CRFs
‣ Hard to know whether inference, learning, or the model is broken!
‣ Compute the objective — is optimization working?
‣ Learning: is the objective going down? Try to fit 1 example / 10
examples. Are you applying the gradient correctly?
‣ Inference: check gradient computation (most likely place for bugs),
check marginals from forward-backward
‣ If the objective is going down but model performance is bad:
‣ Inference: check performance if you decode the training set
seg-19
Skip-Gram
Mikolov et al. (2013)
the dog bit the man
‣ Predict each word of context from word in turn, up to distance k
bit
softmax
Multiply
by W
gold = dog
‣ Parameters: d x |V| word vectors, |V| x d context vectors (stacked into
a matrix W)
‣ Another training example: bit -> the
P(w' | w) = \text{softmax}(W e(w))
‣ Why skip-gram? With window size > 1, we predict a context word
skipping over intermediate words
Continuous Bag-of-words
‣ Predict word from multiple words of context
the dog bit the man
‣ Parameters: d x |V| (one d-length context vector per voc word),
|V| x d word vectors (in matrix W)
dog
the
+
size d
softmax
Multiply
by W
gold label = bit,
no manual labeling
required!
d-dimensional
word embeddings
P(w | w_{-1}, w_{+1}) = \text{softmax}(W(c(w_{-1}) + c(w_{+1})))
size |V| x d
Mikolov et al. (2013)
Hierarchical Softmax
‣ Matmul + softmax over |V| is very slow to compute for both techniques
‣ Hierarchical softmax:
P(w | w_{-1}, w_{+1}) = \text{softmax}(W(c(w_{-1}) + c(w_{+1})))
‣ Standard softmax:
|V| dot products of
size d
log(|V|) dot products of size d,
…
…
the
a
‣ Huffman encode vocabulary,
use binary
classifiers to decide which
branch to take
|V| x d parameters
P(w' | w) = \text{softmax}(W e(w))
‣ log(|V|) binary decisions
Skip-gram:
CBOW:
Mnih and Hinton (2008)
Skip-gram with Negative Sampling
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
(bit, the) => +1
(bit, cat) => -1
(bit, a) => -1
(bit, fish) => -1
‣ Take (word, context) pairs and classify them as "real" or not. Create
random negative examples by sampling from the unigram distribution
words in similar
contexts select for
similar c vectors
P(y = 1 | w, c) = \frac{e^{w \cdot c}}{e^{w \cdot c} + 1}
‣ Objective = \log P(y = 1 | w, c) + \frac{1}{k} \sum_{i=1}^{n} \log P(y = 0 | w_i, c)   (the w_i are sampled negatives)
‣ Are there alternative ways to learn vectors while avoiding the O(|V|) term?
Mikolov et al. (2013)
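A minimal NumPy sketch of one negative-sampling update for a single (word, context) pair; the gradients come from the logistic objective above, while the learning rate and the sampling of negatives are assumed to happen elsewhere:

```python
import numpy as np

def sgns_step(w_vec, c_pos, c_negs, lr=0.025):
    """One skip-gram-with-negative-sampling update.
    w_vec: word vector; c_pos: true context vector; c_negs: list of sampled negative context vectors."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    grad_w = (sigmoid(w_vec @ c_pos) - 1.0) * c_pos              # pull w toward the true context
    c_pos -= lr * (sigmoid(w_vec @ c_pos) - 1.0) * w_vec
    for c_neg in c_negs:                                         # push w away from sampled negatives
        grad_w += sigmoid(w_vec @ c_neg) * c_neg
        c_neg -= lr * sigmoid(w_vec @ c_neg) * w_vec
    w_vec -= lr * grad_w
```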
Connections with Matrix Factorization
‣ Skip-gram model looks at word-word co-occurrences and produces two
types of vectors
word pair
counts
|V|
|V| |V|
d
d
|V|
context vecs
word
vecs
‣ Looks almost like a matrix factorization…can we interpret it this way?
Levy et al. (2014)
Skip-gram as Matrix Factorization
|V|
|V|
M_{ij} = \text{PMI}(w_i, c_j) - \log k
\text{PMI}(w_i, c_j) = \log \frac{P(w_i, c_j)}{P(w_i) P(c_j)} = \log \frac{\text{count}(w_i, c_j)/D}{(\text{count}(w_i)/D)\,(\text{count}(c_j)/D)}
‣ If we sample negative examples from the unigram distribution over words
(k = number of negative samples), the skip-gram objective exactly corresponds to factoring this matrix…
‣ …and it's a weighted factorization problem (weighted by word frequency)
Levy et al. (2014)
GloVe (Global Vectors)
‣ Objective = \sum_{i,j} f(\text{count}(w_i, c_j)) \left( w_i^\top c_j + a_i + b_j - \log \text{count}(w_i, c_j) \right)^2
‣ Also operates on the counts matrix: weighted
regression on the log co-occurrence matrix
‣ Constant in the dataset size (just need counts), quadratic in vocabulary size
‣ By far the most common word vectors used today (5000+ citations)
word pair
counts
|V|
|V|
Pennington et al. (2014)
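A minimal NumPy sketch of the GloVe objective for one (word, context) cell of the counts matrix; the weighting cap of 100 and exponent 0.75 follow the paper, and everything else is illustrative:

```python
import numpy as np

def glove_weight(count, x_max=100.0, alpha=0.75):
    return (count / x_max) ** alpha if count < x_max else 1.0

def glove_cell_loss(w_i, c_j, a_i, b_j, count_ij):
    """Weighted squared error between the model score and the log co-occurrence count."""
    return glove_weight(count_ij) * (w_i @ c_j + a_i + b_j - np.log(count_ij)) ** 2
```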
fastText: Sub-word Embeddings
Bojanowski et al. (2017)
‣ Same as SGNS, but break words down into n-grams with n = 3 to 6
where:
3-grams:
4-grams:
5-grams:
6-grams:
‣ Replace w \cdot c in the skip-gram computation with \sum_{g \in \text{ngrams}} w_g \cdot c
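A minimal sketch of the character n-gram extraction fastText uses (the boundary markers `<` and `>` follow the paper; the scoring function just swaps in the n-gram sum for w · c):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Break a word (with boundary markers) into character n-grams of length 3-6."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def fasttext_score(ngram_vectors, word, c_vec):
    # Replace w . c with the sum of n-gram vectors dotted with the context vector
    return sum(ngram_vectors[g] @ c_vec for g in char_ngrams(word) if g in ngram_vectors)

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
```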
Pre-trained Models: ELMo, GPT, BERT
‣ These encode "subwords" rather than words. Underscore indicates that
the following token continues the existing word
‣ Learn embeddings through language modeling (discussed in the second
half of the course)
and there were no re_ fueling sta=ons anywhere
‣ Embeddings are computed using RNNs and Transformers. We can’t just
look up an embedding for each word, but actually need to run a model
one of the city ’s more un_ princi_ pled real estate agents
‣ Any word is either in the subword vocabulary or can be expressed as a
sequence of subwords in the vocabulary
Skip-gram
Input: large corpus of sentences
Output: word vector v_w and context vector c_w for each word type w
Hyperparameters: word vector dimension d (~50-300), window size k (assume k = 1 here)
  "The film inspired …" → training pairs (film → The), (film → inspired), …;
  take all neighbors of each word token up to k positions away
Skip-gram: probabilistic model of context given word
  P(\text{context} = y \mid \text{word} = x) = \frac{\exp(v_x \cdot c_y)}{\sum_{y'} \exp(v_x \cdot c_{y'})}
  v, c are the parameters of the model
Example: corpus = "I saw", with word and context vectors in 2 dimensions
  If v_{saw} = [1, 0] and c_I = [0, 1], what is P(context = I | word = saw)?
  P(\text{context} = I \mid \text{word} = saw) = \frac{\exp(v_{saw} \cdot c_I)}{\sum_{y} \exp(v_{saw} \cdot c_y)}
Training: maximize \sum_{(x, y)\ \text{pairs in data}} \log P(\text{context} = y \mid \text{word} = x)
  An "impossible" problem: cannot drive every probability to 1
  Initialize parameters randomly
seg-30
Feature Functions
y1 y2 yn…
\phi_e
\phi_t
‣ Phis are flexible (can be NN with 1B+ parameters). Here: sparse linear fcns
\phi_t(y_{i-1}, y_i) = w^\top f_t(y_{i-1}, y_i)
\phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)
P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
Basic Features for NER
Barack Obama will travel to Hangzhou today for the G20 meeting .
O B-LOC
Transitions: f_t(y_{i-1}, y_i) = \text{Ind}[y_{i-1}\ \&\ y_i], e.g. Ind[O — B-LOC]
Emissions: f_e(y_6, 6, x) = Ind[B-LOC & Current word = Hangzhou],
           Ind[B-LOC & Prev word = to]
P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
Emission Features for NER
Leicestershire is a nice place to visit…
I took a vacation to Boston
Apple released a new version…
According to the New York Times…
ORG
ORG
LOC
LOC
Texas governor Greg AbboI said
Leonardo DiCaprio won an award…
PER
PER
LOC
\phi_e(y_i, i, x)
Emission Features for NER
‣ Context features (can’t use in HMM!)
‣ Words before/after
‣ POS before/after (if we run a POS tagger first)
‣ Word features (can use in HMM)
‣ Capitalization
‣ Word shape
‣ Prefixes/suffixes
‣ Lexical indicators
‣ Gazetteers
‣Word clusters
Leicestershire
Boston
Apple released a new version…
According to the New York Times…
Hidden Markov Models
Generative sequence model; two steps: ① parameter estimation ② inference
Tags y_i ∈ T (tag set), words x_i ∈ V (vocabulary)
P(\bar{y}, \bar{x}) = P(y_1) P(x_1 | y_1) P(y_2 | y_1) P(x_2 | y_2) \cdots P(\text{STOP} | y_n)
Parameters: initial distribution P(y_1), transitions P(y_i | y_{i-1}), emissions P(x_i | y_i)
Generative process: generate each tag from the previous tag, then generate each word from its tag
Markov assumption: y_i is conditionally independent of y_1, …, y_{i-2} given y_{i-1}
seg-5
Perceptron as minimizing loss
Perceptron as minimizing loss
x
y
Perceptron Loss Demonstration
−
−
(-1, -1)
(-1, 0)
(-1, 1)
+
+ (1, 2)
x
y
x
y
Perceptron Loss Demonstration
−
−
(-1, -1)
(-1, 0)
(-1, 1)
+
x
y
x
y
seg-7
Sentiment Analysis
the movie was gross and overwrought, but I liked it
this movie was great! would watch again
‣ Bag-of-words doesn't seem sufficient (discourse structure, negation)
this movie was not really very enjoyable
‣ There are some ways around this: extract bigram feature for “not X” for
all X following the not
+
+
—
Sentiment with Bag-of-words Models
‣ Simple feature sets can do pretty well!
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)
Sentiment with Bag-of-words Models
Wang and Manning (2012)
Before neural nets had taken
off — results weren’t that
great
81.5 89.5Kim (2014) CNNs
‣ 10 years later — revisited basic BoW classifiers vs. other methods
Kim (2014)
Sentiment with Bag-of-words Models
https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md
…
‣ Best systems now:
large pretrained
networks
‣ Stanford Sentiment
Treebank (SST)
binary classification
‣ 90 -> 97 over the
last 2 years
seg-32
Forward-Backward Algorithm
‣ How do we compute marginals P(y_i = s | x)?
  P(y_i = s | x) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(y | x)
‣ What did Viterbi compute? P(y_{max} | x) = \max_{y_1, \ldots, y_n} P(y | x)
‣ Can compute marginals with dynamic programming as well using
forward-backward
Forward-Backward Algorithm
P (y3 = 2|x) =
sum of all paths through state 2 at time 3
sum of all paths
slide credit: Dan Klein
Forward-Backward Algorithm
P (y3 = 2|x) =
sum of all paths through state 2 at time 3
sum of all paths
=
‣ Easiest and most flexible to do one
pass to compute the forward probabilities (α) and one to
compute the backward probabilities (β)
slide credit: Dan Klein
Forward-Backward Recurrence
‣ Initial: \alpha_1(s) = \exp(\phi_e(s, 1, x))
‣ Recurrence: \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) \exp(\phi_e(s_t, t, x))
‣ Same as Viterbi but summing
instead of maxing!
‣ These quantities get very small!
Store everything as log probabilities
slide credit: Dan Klein
Forward-Backward Recurrence
‣ Initial: \beta_n(s) = 1
‣ Recurrence: \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) \exp(\phi_e(s_{t+1}, t+1, x))
‣ Difference from forward: count the
emission for the next timestep (not the
current one)
slide credit: Dan Klein
Computing Marginals
P(s_3 = 2 | x) = \frac{\alpha_3(2) \beta_3(2)}{\sum_i \alpha_3(i) \beta_3(i)}
‣ What is the denominator equal to? P(x)
‣ Does this explain why beta doesn't
account for the current timestep?
\alpha_1(s) = \exp(\phi_e(s, 1, x)),\quad \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) \exp(\phi_e(s_t, t, x))
\beta_n(s) = 1,\quad \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) \exp(\phi_e(s_{t+1}, t+1, x))
slide credit: Dan Klein
Computing Marginals
y1 y2 yn…
\phi_e, \phi_t
P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
Z = \sum_y \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
‣ For both HMMs and CRFs:
P(y_i = s | x) = \frac{\text{forward}_i(s)\,\text{backward}_i(s)}{\sum_{s'} \text{forward}_i(s')\,\text{backward}_i(s')}
‣ Normalizing constant: the denominator is Z for CRFs and is analogous to P(x) for HMMs
‣ This denominator is the same for all i! Nice way to debug your forward-
backward implementation and catch bugs
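A minimal NumPy sketch of forward-backward over precomputed potential tables. This version includes transition potentials, which the full CRF model above has even though the simplified recurrences on these slides show only the emission term; real implementations also work in log space rather than multiplying raw potentials:

```python
import numpy as np

def forward_backward(emit, trans):
    """emit: [n x T] potentials exp(phi_e(s, t, x)); trans: [T x T] potentials exp(phi_t(s_prev, s))."""
    n, T = emit.shape
    alpha, beta = np.zeros((n, T)), np.ones((n, T))
    alpha[0] = emit[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]        # sum over previous tags
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (beta[t + 1] * emit[t + 1])      # count the emission for the *next* timestep
    marginals = alpha * beta
    Z = marginals.sum(axis=1, keepdims=True)               # same for every position i: a good debugging check
    return marginals / Z                                   # P(y_i = s | x)
```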
Posteriors vs. Probabilities
P(y_i = s | x) = \frac{\text{forward}_i(s)\,\text{backward}_i(s)}{\sum_{s'} \text{forward}_i(s')\,\text{backward}_i(s')}
‣ The posterior is derived from the parameters and the data (conditioned on x!)
        P(x_i | y_i), P(y_i | y_{i-1})                         P(y_i | x), P(y_{i-1}, y_i | x)
HMM:    model parameters (usually multinomial distributions)   inferred quantities from forward-backward
CRF:    undefined (model is by definition conditioned on x)    inferred quantities from forward-backward
HMM Inference: The Viterbi Algorithm
HMMs: model of P(\bar{y}, \bar{x}) = P(y_1) P(x_1 | y_1) P(y_2 | y_1) \cdots
Inference: \arg\max_{\bar{y}} P(\bar{y} \mid \bar{x}) = \arg\max_{\bar{y}} \frac{P(\bar{y}, \bar{x})}{P(\bar{x})} = \arg\max_{\bar{y}} P(\bar{y}, \bar{x})   (P(\bar{x}) is constant in \bar{y})
  = \arg\max_{\bar{y}} \left[ \log P(y_1) + \sum_i \log P(x_i | y_i) + \sum_i \log P(y_i | y_{i-1}) \right]
Viterbi Dynamic Program
Define v_i(y): an n × |T| table (n = sentence length, |T| = number of tags);
  v_i(y) = score of the best path ending in tag y at time i
Base: v_1(y) = \log P(x_1 | y) + \log P(y)
Recurrence: v_i(y) = \log P(x_i | y) + \max_{y_{prev}} \left[ \log P(y | y_{prev}) + v_{i-1}(y_{prev}) \right]
Viterbi: for i = 1 … n, for y in T: compute v_i(y); finally compute v_{n+1}(\text{STOP}) = \max_y \left[ \log P(\text{STOP} | y) + v_n(y) \right]
Track "backpointers" to recover the best tag sequence, e.g. for "they can can fish"
Beam Search
Viterbi runtime: O(n |T|^2); |T| ≈ 40 for the English Penn Treebank
Beam search: keep only the top k Viterbi states at each time step (k << |T|); expand each state in the
beam and insert its successors into a new priority queue for the next time step
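A minimal NumPy sketch of this dynamic program over log probabilities (an HMM with precomputed log transition and emission tables; the tag and word indexing scheme is illustrative):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, words):
    """log_init: [T], log_trans: [T x T], log_emit: [T x V]; words: list of word indices."""
    n, T = len(words), len(log_init)
    v = np.full((n, T), -np.inf)
    backptr = np.zeros((n, T), dtype=int)
    v[0] = log_init + log_emit[:, words[0]]                 # base case
    for i in range(1, n):
        scores = v[i - 1][:, None] + log_trans              # [prev tag x current tag]
        backptr[i] = scores.argmax(axis=0)
        v[i] = scores.max(axis=0) + log_emit[:, words[i]]   # recurrence
    tags = [int(v[-1].argmax())]
    for i in range(n - 1, 0, -1):                           # follow backpointers
        tags.append(int(backptr[i][tags[-1]]))
    return tags[::-1]
```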
seg-33
NER: Evaluation
Barack Obama will travel to Hangzhou today for the G20 meeting .
PERSON LOC ORG
B-PER I-PER O O O B-LOC B-ORGO O O O O
‣ Prediction of all Os gets 66% accuracy on this example!
‣ What we really want to know: how many named entity chunk
predictions did we get right?
‣ Precision: of the ones we predicted, how many are right?
‣ Recall: of the gold named entities, how many did we find?
‣ F-measure: harmonic mean of these two
‣ Example:
NER: CoNLL 2003
‣ CRF with lexical features can get around 85 F1 on CoNLL 2003: 4
classes (PER, ORG, LOC, MISC), newswire data
‣What else do we need to capture?
‣World knowledge:
The delegation met the president at the airport, Tanjug said.
Nonlocal Features
ORG?
PER?
The delegation met the president at the airport, Tanjug said.
The news agency Tanjug reported on the outcome of the meeting.
‣More complex factor graph structures can let you capture this, or just
decode sentences in order and use features on previous sentences
Finkel et al. (2005), Ratinov and Roth (2009)
Results: CoNLL 2003
Ratinov and Roth (2009)
Lample et al. (2016)
BiLSTM-CRF + ELMo
Peters et al. (2018)
92.2
BERT
Devlin et al. (2019)
92.8
‣ Knowledge from pre-training is a huge factor
Scaling Up
Choi et al. (2018)
‣Modern work uses larger and larger typesets: 17 -> 112 -> 10,000+
‣ Lots of challenges scaling to this many tags and making these
systems useful!
Logistic Regression
Discriminative probabilistic model P(y | \bar{x})   (generative: P(\bar{x}, y))
  P(y = +1 \mid \bar{x}) = \frac{\exp(w^\top f(\bar{x}))}{1 + \exp(w^\top f(\bar{x}))}
Decision boundary? Return +1 if P(y = +1 | \bar{x}) > 0.5, i.e. if w^\top f(\bar{x}) > 0
Training
For a dataset (\bar{x}^{(i)}, y^{(i)}), i = 1 … D, want to maximize \prod_i P(y^{(i)} | \bar{x}^{(i)})   (maximum likelihood)
  ⇔ maximize \sum_i \log P(y^{(i)} | \bar{x}^{(i)})   (log likelihood) ⇔ minimize the negative log likelihood (NLL)
Need to compute \nabla_w \text{loss}(\bar{x}^{(i)}, y^{(i)})
LR Gradients (assume y^{(i)} = +1)
  \nabla_w \text{loss} = \nabla_w \left[ -w^\top f(\bar{x}) + \log(1 + e^{w^\top f(\bar{x})}) \right]
                       = -f(\bar{x}) + f(\bar{x}) \frac{e^{w^\top f(\bar{x})}}{1 + e^{w^\top f(\bar{x})}} = -f(\bar{x})\,(1 - P(y = +1 \mid \bar{x}))
Update: w \leftarrow w + \alpha f(\bar{x})\,(1 - P(y = +1 \mid \bar{x}))
  P(y = +1 | \bar{x}) ≈ 1: no update;   P(y = +1 | \bar{x}) ≈ 0: perceptron-like update
NLL as a function of z = w^\top f(\bar{x}): \log(1 + e^z) - z
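A minimal NumPy sketch of this gradient update for a single example (the learning rate α and feature vector are illustrative):

```python
import numpy as np

def logistic_regression_update(w, f, y, alpha=0.1):
    """One SGD step on the negative log likelihood for an example with label y in {+1, -1}."""
    p_pos = 1.0 / (1.0 + np.exp(-w.dot(f)))   # P(y = +1 | x)
    if y == +1:
        w += alpha * f * (1.0 - p_pos)        # small update when the model is already confident
    else:
        w -= alpha * f * p_pos
    return w
```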
seg-68
BERT
‣ Four major changes compared to ELMo:
‣ Transformers instead of LSTMs
‣ Bidirectional
‣ Decoder: separate module, single cell. Two inputs: hidden state (vector
h or tuple (h, c)) and previous token. Outputs token + new state
Encoder
…
film
le
Decoder Decoder
Seq2seq Training
‣ Objective: maximize
the movie was great le film était bon
le
‣ One loss term for each target-sentence word, feed the correct word
regardless of the model's prediction (called "teacher forcing")
… était [STOP]
Objective = \sum_{(x, y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \ldots, y_{i-1}^*)
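A minimal PyTorch-style sketch of this objective for one (source, target) pair with teacher forcing; `encoder`, `decoder_step`, and `sos_index` are assumed interfaces rather than any particular library's API:

```python
import torch
import torch.nn.functional as F

def seq2seq_loss(encoder, decoder_step, src, tgt, sos_index):
    """src: source token tensor; tgt: gold target token tensor (ending in STOP)."""
    hidden = encoder(src)                      # encode the source into an initial hidden state
    prev_token = torch.tensor(sos_index)       # assumed start-of-sequence token index
    loss = 0.0
    for gold in tgt:
        logits, hidden = decoder_step(prev_token, hidden)
        loss = loss + F.cross_entropy(logits.unsqueeze(0), gold.unsqueeze(0))
        prev_token = gold                      # teacher forcing: feed the gold word, not the prediction
    return loss
```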
Scheduled Sampling
‣ Starting with p = 1 (teacher forcing) and decaying it works best
‣ Scheduled sampling: with probability p, take the gold as input, else
take the model's prediction
the movie was great
la film étais bon [STOP]
le film était
‣ Model needs to do the right thing even with its own predictions
Bengio et al. (2015)
sample
‣ Better approach: train with reinforcement learning
Implementation Details
‣ Sentence lengths vary for both encoder and decoder, so pad to
make sequences the right length, use masking
‣ Encoder: standard RNN module
‣ Decoder: execute one step of computation at a time, so the
computation graph is formulated as taking one input + hidden state
‣ Test time: do this until you generate the stop token
‣ Training: do this until you reach the gold stopping point
Batching and Beam Search
‣ Batching is pretty tricky: the decoder runs across time steps, so you probably
want your label vectors to look like [num timesteps x batch size x num
labels], and iterate upwards by time steps
‣ Beam search: can help with lookahead. Finds the (approximate) highest
scoring sequence:
\arg\max_y \prod_{i=1}^{n} P(y_i \mid x, y_1, \ldots, y_{i-1})
Beam Search
‣Maintain decoder state, token history in beam
Step 1: la: 0.4, le: 0.3, les: 0.1  →  beam = {la [log(0.4)], le [log(0.3)], les [log(0.1)]}
Step 2: expand each hypothesis, e.g. la → film: 0.4 [log(0.4)+log(0.4)], le → film: 0.8 [log(0.3)+log(0.8)], …
‣ Keep both film states! Hidden state vectors are different
the movie was great
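A minimal sketch of one step of beam search in plain Python; `step_fn` is an assumed function that takes a decoder state and the previous token and returns (log-probabilities over the vocabulary, new state):

```python
import heapq

def beam_search_step(beam, step_fn, beam_size):
    """beam: list of (score, tokens, state). Returns the top-k expanded hypotheses."""
    candidates = []
    for score, tokens, state in beam:
        log_probs, new_state = step_fn(state, tokens[-1])
        for tok, lp in enumerate(log_probs):
            # each hypothesis keeps its own decoder state (the two "film" states differ!)
            candidates.append((score + lp, tokens + [tok], new_state))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
```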
Other Architectures
‣ What's the basic abstraction here?
‣ Encoder: sentence -> vector
‣ Decoder: hidden state, output prefix -> new hidden state, new output
‣Wide variety of models can apply here: CNN encoders, decoders can be
any autoregressive model including certain types of CNNs
‣ Transformer: another model discussed later
‣ OR: sentence, output prefix -> new output (more general)
seg-40
CRF Parsing
Taskar et al. (2004)
Hall, Durrett, and Klein (2014)
score
LeB child last word = report ∧ NP PP
NP
w>f
NP PP
NP
2 5 7
=
f
NP PP
NP
2 5 7
He wrote a long report on Mars .
PPNP
NP
=
2 5 7
wrote a long report on Mars .
wrote a long report on Mars .
‣ Can learn that we report [PP], which is common due to reporting on things
‣ Can "neuralize" this by computing these scores with a neural network
CRF Parsing
Discrete + Continuous
He wrote a long report on Mars
NP
PP
NP
‣ Chart remains discrete!
‣ Feedforward pass on nets
‣ Run CKY dynamic program
‣ Discrete feature computaSon
+Discrete ConSnuous
…
Parsing a sentence:
Durrett and Klein (2015)
Using ELMo
Kitaev and Klein (2018)
‣ Improves the neural CRF by using a
transformer layer (self-attentive),
character-level modeling, and ELMo
‣ 95.21 on Penn Treebank dev set — much
better than past parsers! (~92-93)
‣ This constituency parser with BERT is
one of the strongest today, or use a
transition-based version due to Kitaev
and Klein (2020)
Graph-based Dependency Parsers
‣ ConjuncSons of parent and child words + POS, POS of words in between,
POS of surrounding words
DT NNTOVBDDT NN
the housetoranthe dog
ROOT
‣ HEAD=TO & MOD=NN
‣ HEAD=TO & MOD-1=the
‣ HEAD=TO & MOD=house
‣ ARC_CROSSES=DT
f(i, parent(i),x)
McDonald et al. (2005)
‣ Score of a parse is a sum of weights * features over all arcs it contains.
Use dynamic programming to find the best parse
Higher-order Parsing
Koo and Collins (2010)
‣ Track addiSonal state during parsing so
we can look at “grandparents” (and
siblings). O(n4) dynamic program or
use approximate search
DT NNTOVBDDT NN
the housetoranthe dog
ROOT
f(i, parent(i), parent(parent(i)),x)
Biaffine Neural Parsing
‣ Neural CRFs for dependency parsing: let c = LSTM embedding of i, p =
LSTM embedding of parent(i). score(i, parent(i), x) = pTUc
Dozat and Manning (2017)
(num words x hidden size) (num words x
num words)
LSTM looks at words and POS
‣ This is the best graph-based parser today when combined with BERT
Transition-based Parsers
Chen and Manning (2014)
Andor et al. (2016)
‣ 94.61 UAS on the Penn Treebank using a global transition-based system with early
updating (compared to 95.8 for Dozat, 93.7 for Koo in 2009)
‣ Feature set pioneered by Chen and Manning (2014); Google fine-tuned it as part of
the SyntaxNet parser
TransiSon-based Parsers
Chen and Manning (2014)
UnopSmized S-R{
{
{
{
OpSmized S-R
Graph-based
Neural S-R
‣ TransiSon-based parsers are very fast, and now give good performance
seg-97
Cross-lingual Tagging
‣ Labeling POS datasets is expensive
‣ Can we transfer annota;on from high-resource languages (English, etc.)
to low-resource languages?
English
Raw text
POS data
Spanish:
+
Raw text
en-es bitext
POS data
Malagasy
bitext
Raw text+ Malagasy tagger
Spanish
tagger
Cross-lingual Tagging
Das and Petrov (2011)
‣ Can we leverage word alignment here?
N PR V ??
‣ Tag with English tagger, project across bitext, train French tagger?
Works preTy well
I like it a lot
Je l’ aime beaucoup
align
I like it a lot
Je l’ aime beaucoup
N V PR DT ADJ
tag
I like it a lot
Je l’ aime beaucoup
Projected tags
Cross-lingual Parsing
‣ Now that we can POS tag other languages, can we parse them too?
‣ Direct transfer: train a parser over POS sequences in one language, then
apply it to another language
I like tomatoes
PRON VERB NOUN
Je les aime
PRON PRON VERB
I like them
PRON VERB PRON
Parser trained
to accept tag
input
VERB is the
head of PRON
and NOUN
parse
new
data
train
McDonald et al. (2011)
Cross-lingual Parsing
‣Mul;-dir: transfer a parser trained on several source treebanks to the
target language
‣Mul;-proj: more complex annota;on projec;on approach
McDonald et al. (2011)
seg-83
Probing
Tenney et al. (2019)
‣ Given a simple, fixed class of model (e.g., one-layer FFNN), how
well can we predict various things from word representaGons?
‣We want to know what
informaGon is captured in
a neural network. Try to
predict that informaGon
from the network’s
representaGons
Probing: Results
Tenney et al. (2019)
‣ Lex: baseline built on
context-independent vectors
‣ Large gains from
contextualizaGon, and BERT
beats ELMo
Probing: Results
Tenney et al. (2019b)
‣ Purple: BERT-large
performance on each task
(as delta from mean) using
representaGons from that
layer of BERT
‣ Earlier layers of the
network: beSer at POS and
low-level tasks. Later layers
are beSer at higher-level
tasks like coreference
seg-82
Text-based Explana/ons
‣ Can we generate a natural language explana/on of a model’s
behavior?
‣ Possible advantages:
‣ Easy for untrained users to understand
‣ Easy for annotators to provide ground truth human explana/ons
(which may also help our models)
‣ Possible disadvantages:
‣ Hard to generate gramma/cal/seman/cally meaningful text
‣ Can text truly explain a model’s behavior?
Explana/ons of Bird Classifica/on
Hendricks et al. (2016)
‣What makes a visual explana/on? Should be
relevant to the class and the image
‣ Are these features really what the model used?
Explana/ons of Bird Classifica/on
Hendricks et al. (2016)
‣ Are these features really what the model used? The decoder looks at
the image, but what it reports may not truly reflect the model’s
decision-making
‣More likely to produce plausible (look good to humans) but unfaithful
explana/ons!
e-SNLI
Camburu et al. (2019)
‣ e-SNLI: natural language inference with explana/ons
e-SNLI
Camburu et al. (2019)
‣ Similar to birds: explana/on is generated condi/oned on the
label and the network state f
‣ Informa/on from f is fed into the explana/on LSTM, but no
constraint that this must be used. Explana/on might be purely
generated from the label
f = func/on of premise and hypothesis vectors
Latent Textual Explana/ons
‣Model generates text “hypothesis”, which is completely latent
Latcinnik and Berant (2020)
‣ Hypothesis isn’t constrained to be natural language, ends up
being keywords
seg-96
Morphological Analysis
‣ In English, lexical features on words and word vectors are pre;y effec=ve
‣When we’re building systems, we probably want to know base form +
morphological features explicitly
‣ In other languages, lots more unseen words due to rich morphology!
Affects parsing, transla=on, …
‣ How to do this kind of morphological analysis?
Morphological Analysis: Hungarian
Ám a kormány egyetlen adó csökkentését sem javasolja .
[Figure: each word of the Hungarian sentence is annotated with morphological features, e.g. n=singular|case=nominative|proper=no, deg=positive, mood=indicative|t=present|p=3rd|n=singular|def=yes]
But the government does not recommend reducing taxes.
CoNLL dependency treebank
Morphological Analysis
‣ Given a word in context, need to predict what its morphological features
are
‣ Lots of work on Arabic inflec=on (high amounts of ambiguity)
‣ Basic approach: combines two modules:
‣ Lexicon: tells you what possibili=es are for the word
‣ Analyzer: sta=s=cal model that disambiguates
‣Models are largely CRF-like: score morphological features in context
Morphological Inflec=on
‣ Inverse task of analysis: given base form + features, inflect the word
‣ Hard for unknown words — need models that generalize
w i n d e n
Durre; and DeNero (2013)
Morphological Inflec=on for MT
‣Machine transla=on where phrase table is defined in terms of lemmas
‣ “Translate-and-inflect”: translate into uninflected words and predict
inflec=on based on source side
Chahuneau et al. (2013)
Chinese Word Segmenta=on
‣ LSTMs over character
embeddings / character
bigram embeddings to
predict word boundaries
‣Word segmenta=on:
some languages
including Chinese are
totally untokenized
‣ Having the right
segmenta=on can help
machine transla=on
Xinchi Chen et al. (2015)
N-gram Language Modeling
P(\bar{w}) = P(w_1, \ldots, w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) \cdots
n-gram LM: P(\bar{w}) \approx \prod_i P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})   — condition on the previous n-1 words
2-gram LM: P(w_1 | <s>) P(w_2 | w_1) P(w_3 | w_2) \cdots   (w_i conditionally independent of w_1 … w_{i-2} given w_{i-1})
An n-gram LM is an (n-1)-order Markov model
3-gram LM: P(w_1 | <s> <s>) P(w_2 | <s> w_1) P(w_3 | w_1 w_2) \cdots
2-gram: multinomial distributions, |V| x |V| parameters
  P(w | the): 0.001 house, 0.0005 dog, 0.0005 cat, …   — a very flat distribution!
Parameter estimation: MLE from a large corpus
  P(dog | the) = \frac{\text{count}(\text{the, dog})}{\text{count}(\text{the})}
Why language models? ① Generation: machine translation   ② Grammatical error correction   ③ A way to build word vectors
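A minimal sketch of the count-based MLE for a bigram model in plain Python (the tiny corpus is illustrative):

```python
from collections import Counter

def train_bigram_lm(sentences):
    bigram_counts, unigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            unigram_counts[prev] += 1
    # P(w | prev) = count(prev, w) / count(prev)
    return lambda w, prev: bigram_counts[(prev, w)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

p = train_bigram_lm(["the dog ran", "the cat ran"])
print(p("dog", "the"))   # 0.5
```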
seg-55
Seman&c Parsing as Transla&on
‣Write down a linearized form of the seman&c parse, train seq2seq
models to directly translate into this representa&on
‣ Concerns: may produce invalid logical forms…and does it work?
“what states border Texas”
lambda x ( state( x ) and border( x , e89 ) ) )
‣ Benefits: no grammar engineering, latent deriva&ons, …
Jia and Liang (2016)
Encoding Invariances
‣ Parsing-based approaches handle these the same way (with small
possibly divergences due to feature weights)
‣ Don’t change the model, change the data. Data augmenta5on lets
us encode invariances by automa&cally genera&ng new training
examples
what states border Texas what states border Ohio
‣ Can we get seq2seq seman&c parsers to handle these the same way?
Jia and Liang (2016)
Data Augmenta&on
‣ Abstract out en&&es: now we can “remix” examples and encode
invariance to en&ty ID. More complicated remixes too
‣ Lets us synthesize a “what states border ohio ?” example
Jia and Liang (2016)
Seman&c Parsing as Transla&on
Prolog
Lambda calculus
Other DSLs
‣ Handle all of these with uniform machinery!
Jia and Liang (2016)
Seman&c Parsing as Transla&on
‣ Three forms of data
augmenta&on all help
‣ Results on these tasks are s&ll not
as strong as hand-tuned systems
from 10 years ago, but the same
simple model can do well at all
problems
Jia and Liang (2016)
Regex Genera&on from Language
‣ Predict regex from text
‣ Problem: requires a lot of data: 10,000 examples needed to get ~60%
accuracy on pre`y simple regexes
‣ Does not scale when regex specifica&ons are more abstract (I want to
recognize a decimal number less than 20)
Locascio et al. (2016)
SQL Genera&on
‣ Convert natural language
descrip&on into a SQL
query against some DB
‣ How to ensure that well-
formed SQL is generated?
‣ Three seq2seq models
‣ How to capture column
names + constants?
‣ Pointer mechanisms, to
be discussed later
Zhong et al. (2017)
‣ Problem to address: how to generate following
the input closely? Attention mechanism
seg-69
BERT: Model and Applica2ons
‣ BERT Base: 12 layers, 768-dim per
wordpiece token, 12 heads. Total
params = 110M
‣ BERT Large: 24 layers, 1024-dim
per wordpiece token, 16 heads.
Total params = 340M
‣ Posi2onal embeddings and
segment embeddings, 30k word
pieces
‣ This is the model that gets pre-
trained on a large corpus
Devlin et al. (2019)
What can BERT do?
Devlin et al. (2019)
‣ Ar2ficial [CLS] token is used as the vector to do classifica2on from
‣ BERT can also do tagging by predic2ng tags at each word piece
‣ Sentence pair tasks (entailment): feed both sentences into BERT
What can BERT do?
Devlin et al. (2019)
‣ How does BERT model sentence pair tasks?
‣ Transformers can capture interac2ons between the two sentences (even
though the NSP objec2ve doesn’t really cause this to happen)
Transformer
Transformer
…
[CLS] A boy plays in the snow [SEP] A boy is outside
Entails (first sentence implies second is true)
What can BERT NOT do?
Devlin et al. (2019)
‣ BERT cannot generate text (at least not in an obvious way)
‣ Can fill in MASK tokens, but can't generate left-to-right (you can put MASK
at the end repeatedly, but this is slow)
‣ Masked language models are intended to be used primarily for “analysis”
tasks
Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, small
learning rate
‣ Large changes to weights up here
(par2cularly in last layer to route the right
informa2on to [CLS])
‣ Smaller changes to weights lower down in
the transformer
‣ Small LR and short fine-tuning schedule
mean weights don’t change much
‣ More complex “triangular learning
rate” schemes exist
Fine-tuning BERT
‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo
Peters et al. (2019)
Evalua2on
Wang et al. (2019)
Evalua2on
‣ Huge improvements over prior work (even compared to ELMo)
‣ Effec2ve at “sentence pair” tasks: textual entailment (does sentence A
imply sentence B), paraphrase detec2on
Devlin et al. (2019)
Analysis
Clark et al. (2019)
‣ Heads on transformers learn interes2ng and diverse things: content
heads (aeend based on content), posi2onal heads (based on posi2on),
etc.
Analysis
‣ S2ll way worse than what supervised parsing systems can do, but
interes2ng that this is learned organically
RoBERTa
Liu et al. (2019)
‣ “Robustly op2mized BERT”
‣ 160GB of data instead
of 16 GB
‣ Dynamic masking: standard
BERT uses the same MASK
scheme for every epoch,
RoBERTa recomputes them
‣ New training + more data = beeer performance
‣ For this and more: check out Huggingface Transformers or fairseq
Neural Language Models
P(\bar{w}) = P(w_1) P(w_2 | w_1) \cdots P(w_n | w_1, \ldots, w_{n-1})   — model the conditionals with a neural net
Very basic neural LM: P(w_i \mid w_{i-1}) = \frac{\exp(v_{w_i} \cdot c_{w_{i-1}})}{\sum_{w' \in V} \exp(v_{w'} \cdot c_{w_{i-1}})}   (skip-gram-like)
More generally: P(w_i \mid w_1, \ldots, w_{i-1}) = \text{softmax}(U_{w_i} \cdot f(w_1, \ldots, w_{i-1}))
  f: a neural net that embeds the context
  If f is a DAN: ignores the ordering of the context words
  If f is a FFNN: feed in embeddings of the previous words, e.g. "I saw the" → P(w_i | w_1, …, w_{i-1})   (Mnih + Hinton, 2003)
seg-57
Attention
‣ At each decoder state, compute a distribution over source inputs
based on the current decoder state, use that in the output layer
the movie was great le
Attention
the movie was great
h1 h2 h3 h4
h̄1
‣ For each decoder state,
compute a weighted sum of
input states:
e_{ij} = f(\bar{h}_i, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j   (weighted sum of input hidden states — a vector)
‣ Some function f (TBD)
le
P(y_i \mid x, y_1, \ldots, y_{i-1}) = \text{softmax}(W[c_i; \bar{h}_i])
‣ Normal seq2seq: P(y_i \mid x, y_1, \ldots, y_{i-1}) = \text{softmax}(W \bar{h}_i)
Attention Variants
e_{ij} = f(\bar{h}_i, h_j),\quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})},\quad c_i = \sum_j \alpha_{ij} h_j
‣ Note that this all uses the outputs of hidden layers
‣ Bahdanau+ (2014): additive      f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i, h_j])
‣ Luong+ (2015): dot product      f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j
‣ Luong+ (2015): bilinear         f(\bar{h}_i, h_j) = \bar{h}_i^\top W h_j
le
Bahdanau et al. (2014), Luong et al. (2015)
Attention Structures
‣ When do we compute attention? Can compute before or after the RNN cell
After RNN cell (Luong et al., 2015)          Before RNN cell (Bahdanau et al., 2014)
What can attention do?
‣ Can learn to copy a sequence and subsample tokens
0 3 2 1
3 1
‣ LSTMs can learn to count — gives position-based addressing
‣ LSTMs can learn to reject certain inputs — allows us to ignore
undesirable inputs from attention
What can attention do?
‣ Decoder hidden states are now
mostly responsible for selecting
what to attend to
‣ Doesn't take a complex hidden
state to walk monotonically
through a sentence and spit
out word-by-word translations
‣ Encoder hidden states capture
contextual source word identity
Bahdanau et al. (2014)
Batching Attention
the movie was great
the movie was great
token outputs: batch size x sentence length x hidden size
sentence outputs: batch size x hidden size
hidden state: batch size x hidden size
e_{ij} = f(\bar{h}_i, h_j),\quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})},\quad c_i = \sum_j \alpha_{ij} h_j
attention scores = batch size x sentence length
c = batch size x hidden size
‣Make sure tensors are the right size!
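A minimal PyTorch sketch of batched dot-product attention with these shapes; masking of padded source positions is included since padded batches are the usual reason to be careful here:

```python
import torch

def batched_attention(dec_state, enc_outputs, src_mask):
    """dec_state: [batch x hidden]; enc_outputs: [batch x src_len x hidden];
    src_mask: [batch x src_len] with 1 for real tokens, 0 for padding."""
    scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)   # [batch x src_len]
    scores = scores.masked_fill(src_mask == 0, float("-inf"))            # ignore padded positions
    alphas = torch.softmax(scores, dim=1)                                # attention distribution
    context = torch.bmm(alphas.unsqueeze(1), enc_outputs).squeeze(1)     # [batch x hidden]
    return context, alphas
```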
seg-80
Explana’ons in NLP
‣ Neural models in NLP have complex behavior. How can we
understand them?
Lipton (2016)
‣ QA: why did the model prefer Stewart over Devin Funchess?
Explana’ons in NLP
‣ Neural models in NLP have complex behavior. How can we
understand them?
Iyyer et al. (2015)
‣ Sen’ment:
‣ LeK side: predic’ons model makes on individual words
DAN Ground Truth
‣ Tells us how these words combine
Why explana’ons?
Lipton (2016)
‣ Trust: if we see that models are behaving in human-like ways and
making human-like mistakes, we might be more likely to trust them and
deploy them
‣ Causality: if our classifier predicts class y because of input feature x,
does that tell us that x causes y? Not necessarily, but it might be helpful
to know
‣ Informa2veness: more informa’on may be useful (e.g., predic’ng a
disease diagnosis isn’t that useful without knowing more about the
pa’ent’s situa’on)
‣ Fairness: ensure that predic’ons are non-discriminatory
What are explana’ons?
Lipton (2016); Belinkov and Glass (2018)
‣ Some models are naturally transparent: we can understand why they
do what they do (e.g., a decision tree with <10 nodes)
‣ Explana'ons of more complex models
‣ Local explana2ons: highlight what led to this classifica'on decision.
(Counterfactual: if they were different, the model would’ve predicted a
different class)
‣ Text explana2ons: describe the model’s behavior in language
‣ Model probing: auxiliary tasks, challenge sets, adversarial examples to
understand more about how our model works
seg-94
Dialogue and QA
‣ Dialogue is a very natural way to find informa7on from a search engine
or a QA system
‣ Challenges: hard to
annotate good
dialogue datasets in a
purely sta7c way
Iyyer et al. (2017)
Dialogue and QA
Choi et al. (2018)
‣ UW QuAC dataset: Ques7on
Answering in Context
Conversa7onal Machine Reading
‣ Conversa7onal machine
reading: answer repeated
ques7ons based on a passage
‣ Interes7ng and poten7ally
useful idea, but annota7ng
data is very hard!
Saeidi et al. (2018)
seg-95
Morphology
‣ Study of how words form
‣ Deriva7onal morphology: create a new lexeme from a base
estrange (v) => estrangement (n)
become (v) => unbecoming (adj)
I become / she becomes
‣ Inflec7onal morphology: word is inflected based on its context
‣May not be totally regular: enflame => inflammable
‣Mostly applies to verbs and nouns
Morphological Inflec7on
‣ In English: I arrive you arrive he/she/it arrives
we arrive you arrive they arrive
[X] arrived
‣ In French:
Wik7onary
Morphological Inflec7on
Wik7onary
‣ In Spanish:
Morphological Inflec7on
‣ Nomina7ve: I/he/she, accusa7ve: me/him/her, geni7ve: mine/his/hers
‣ Not just verbs either; gender, number, case complicate things
I give the children a book <=> Ich gebe ein Buch
I taught the children <=> Ich unterrichte die Kinder
‣ Da7ve: merged with accusa7ve in English, shows recipient of something
den Kindern
Irregular Inflec7on
‣ Common words are oSen irregular
‣ I am / you are / she is
‣ Less common words typically fall into some regular paradigm —
these are somewhat predictable
‣ Je suis / tu es / elle est
‣ Soy / está / es
Agglu7na7ng Languages
‣ Finnish/Hungarian
(Finno-Ugric), also
Turkish: what a
preposi7on would
do in English is
instead part of the
verb
‣Many possible forms — and in newswire data, only a few are observed
illa7ve: “into” adessive: “on”
halata: “hug”
Morphologically-Rich Languages
‣Many languages spoken all over the world have much richer morphology
than English
‣ CoNLL 2006 / 2007: dependency parsing + morphological analyses for
~15 mostly Indo-European languages
‣Word piece / byte-pair encoding models for MT are prehy good at
handling these if there’s enough data
‣ SPMRL shared tasks (2013-2014): Syntac7c Parsing of Morphologically-
Rich Languages
seg-81
Local Explana+ons
‣ An explana+on should help us answer counterfactual ques+ons:
if the input were x’ instead of x, what would the output be?
that movie was not ____ , in fact it was terrible !
that movie was not great , in fact it was _____ !
that movie was not great , in fact it was terrible !
Model
—
—
+
‣ Perturb input many +mes and assess the impact on the model’s
predic+on
LIME
Ribeiro et al. (2016)
‣ LIME: Locally-Interpretable Model-Agnos+c Explana+ons
‣ Local because we’ll focus on this one example
‣Model-agnos:c: treat model as black box
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
‣ Check predic+ons on subsets of components
‣ Train a model to explain which components yield the model’s preds
LIME
Ribeiro et al. (2016)
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
‣ Break down input into many small pieces for interpretability
x \in \mathbb{R}^d \rightarrow x' \in \{0, 1\}^{d'}
‣ Now learn a model to predict f(x’’) based on x’. This model’s
weights will serve as the explana+on for the decision
‣ If the pieces are very coarse, can
interpret but can’t learn a good
model of the boundary. If pieces
are too fine-grained, can
interpret but not predict
‣ Draw samples by using x’ as a mask to form a new example x’’.
Compute f(x’’)
why
it’s +
LIME
‣ Evalua+on: the authors train a sparse model (only looks at 10
features of each example), then try to use LIME to recover the
features. Greedy: remove features to make predicted class
prob drop by as much as possible
Ribeiro et al. (2016)
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
Gradient-based Methods
‣ Problem: fully removing pieces of the input may cause it to be very
unnatural
data manifold (points we
observe in prac+ce)
LIME zeroes out
certain features
‣ Alterna+ve approach: look at what this perturba+on does locally
right around the data point using gradients
Gradient-based Methods
Simonyan et al. (2013)
‣ Originally used for images
‣ Approximate score with a
first-order Taylor series
approxima+on around the
current data point
‣ Higher gradient magnitude = small
change in pixels leads to large
change in predic+on
Sc = score of class c
I0 = current image
Gradient-based Methods
Simonyan et al. (2013)
Integrated Gradients
Sundararajan et al. (2017)
‣ Suppose you have predic+on = A OR B for features A and B. Changing
either feature doesn’t change the predic+on, but changing both
would. Gradient-based method says neither is important
‣ Integrated gradients: compute
gradients along a path from
the origin to the current data
point, aggregate these to
learn feature importance
‣ Now at intermediate points,
increasing “par+al A” or
“par+al B” reveals the
importance of A and B
seg-56
Problems with Seq2seq Models
‣ Need some notion of input coverage or which input words we've
translated
‣ Encoder-decoder models like to repeat themselves:
Un garçon joue dans la neige → A boy plays in the snow boy plays boy plays
‣ Why does this happen?
‣ Models trained poorly
‣ LSTM state is not behaving as expected, so it gets stuck in a "loop"
of generating the same output tokens again and again
Problems with Seq2seq Models
‣ Bad at long sentences: 1) a fixed-size hidden representation doesn't
scale; 2) LSTMs still have a hard time remembering for really long
periods of time
RNNenc: the model we've
discussed so far
RNNsearch: uses attention
Bahdanau et al. (2014)
Problems with Seq2seq Models
‣ Unknown words:
‣ In fact, we don’t want to encode them, we want a way of directly
looking back at the input and copying them (Pont-de-Buis)
‣ Encoding these rare words into a vector space is really hard
Luong et al. (2015)
Aligned Inputs
le film était bon
the movie was great
the movie was great
le film était bon
‣Much less burden on the hidden
state
‣ Suppose we knew the source and
target would be word-by-word
translated
‣ Can look at the corresponding
input word when transla5ng —
this could scale!
le film était bon [STOP]
‣ How can we achieve this without hardcoding it?
Attention
‣ At each decoder state, compute a distribution over source inputs
based on the current decoder state, use that in the output layer
the movie was great le
Smoothing in N-gram LMs
5-gram models work well (given enough data)!
Interpolate / back off across orders: combine P(w | to), P(w | go to), P(w | want to go to), P(w | hate to go to), …
  P(Austin | to) > 0: seen in the data;  P(Austin | want to go to) = 0 if the corpus isn't huge
Absolute Discounting
Reserve mass from seen n-grams to allocate to unseen n-grams: subtract 0 < k < 1 from each seen count
  P_{AD}(A \mid \text{want to go to}) = \frac{\text{count}(\text{want to go to } A) - k}{\text{count}(\text{want to go to})} + \lambda \, P_{backoff}(A \mid \text{to go to})
  \lambda is set so the distribution normalizes: (number of word types seen in this context) \times k / \text{count}(\text{context})
  Example with k = 0.2 in the context "want to go to": Maui 2 → 1.8, class 1 → 0.8, campus 1 → 0.8
Kneser-Ney smoothing: absolute discounting with a backoff distribution chosen so that P(A) > 0
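A minimal sketch of absolute discounting for one context in plain Python; k = 0.2 and the toy counts follow the note above, and `backoff_prob` is an assumed lower-order distribution:

```python
def absolute_discount_prob(word, context_counts, backoff_prob, k=0.2):
    """context_counts: dict of word -> count seen after this context (e.g. 'want to go to')."""
    total = sum(context_counts.values())
    num_types = len(context_counts)
    lam = k * num_types / total              # mass reserved from seen n-grams
    discounted = max(context_counts.get(word, 0) - k, 0) / total
    return discounted + lam * backoff_prob(word)

counts = {"Maui": 2, "class": 1, "campus": 1}   # counts after "want to go to"
p = absolute_discount_prob("Austin", counts, backoff_prob=lambda w: 0.01)
```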
seg-46
What do RNNs produce?
‣ Encoding of each word (each hi) — can pass this to another layer to make
a predic>on (can also pool these to get a different sentence encoding)
=
‣ Encoding of the sentence (final hi/ci) — can pass this a decoder or make a
classifica>on decision about the sentence
the movie was great
‣ RNN can be viewed as a transforma>on of a sequence of vectors into a
sequence of context-dependent vectors
RNN Uses
‣ Transducer: make some predic>on for each element in a sequence
‣ Acceptor/encoder: encode a sequence into a fixed-sized vector and use
that for some purpose
the movie was great
predict sen>ment (matmul + soNmax)
translate
the movie was great
DT NN VBD JJ
paraphrase/compress
output y = score for each tag, then soNmax
Mul>layer Bidirec>onal RNN
‣ Sentence classifica>on
based on concatena>on of
both final outputs
‣ Token classifica>on based on
concatena>on of both direc>ons’ token
representa>ons
the movie was great the movie was great
What do LSTMs return in Pytorch?
the movie was great
‣ Hidden/cell states are a 2-tuple, tensors of size
[num_layers * num_direc>ons, batch size,
dimensionality]
‣ 2x1xdim here
‣ Outputs are a single tensor of size
[seq_len, batch size, num_direc>ons *
hidden_size]
‣ 4x1xdim here
Training RNNs
the movie was great
‣ Loss = nega>ve log likelihood of probability of gold label
P (y|x)
‣ Backpropagate through en>re network, RNN parameters get a gradient update
from each >mestep
‣ Example: sen>ment analysis
Training RNNs
the movie was great
‣ Loss = nega>ve log likelihood of probability of gold predic>ons, summed over
the tags
‣ Loss terms filter back through network
P (ti|x)
‣ Example: language modeling, POS tagging
seg-52
Combinatory Categorial Grammar
‣ Steedman+Szabolcsi (1980s): formalism bridging syntax and seman?cs
‣ Syntac?c categories (for this lecture): S, NP, “slash”
categories
‣ S\NP: “if I combine with an NP on my leJ
side, I form a sentence” — verb
NP S\NP
Eminem sings
e728 λy. sings(y)
S
sings(e728)
‣ Parallel deriva?ons of syntac?c parse and lambda calculus expression
‣ When you apply this, there has to be a
parallel instance of func?on applica?on on
the seman?cs side
Combinatory Categorial Grammar
‣ Steedman+Szabolcsi (1980s): formalism bridging syntax and seman?cs
‣ Syntac?c categories (for this class): S, NP, “slash” categories
‣ S\NP: “if I combine with an NP on my leJ side, I form a sentence” — verb
‣ (S\NP)/NP: “I need an NP on my right and then on my leJ” — verb with a
direct object
NP S\NP
Eminem sings
e728 λy. sings(y)
S
sings(e728)
NP (S\NP)/NP
Oklahoma borders
e101
Texas
e89
NP
λx.λy borders(y,x)
S\NP
λy borders(y,e89)
S
borders(e101,e89)
CCG Parsing
‣ “What” is a very complex type: needs a noun and needs a S\NP to form a
sentence. S\NP is basically a verb phrase (border Texas)
CCG Parsing
‣ “What” is a very complex type: needs a noun and needs a S\NP to form a
sentence. S\NP is basically a verb phrase (border Texas)
‣ Lexicon is highly ambiguous — all the challenge of CCG parsing is in picking the
right lexicon entries
CCG Parsing
‣ Many ways to build these parsers
‣ One approach: run a “supertagger” (tags the sentence with complex labels), then
run the parser
‣ Parsing is easy once you have the tags, so we’ve reduced it to a (hard) tagging
problem
Building CCG Parsers
‣ Training data looks like pairs of sentences and logical forms
What states border Texas λx. state(x) ∧ borders(x, e89)
‣ Texas corresponds to NP | e89 in the logical form (easy to figure out)
(S/(S\NP))/N | λf.λg.λx. f(x) ∧ g(x)‣ What corresponds to
‣ How do we infer that without being told it?
‣ Problem: we don’t know the deriva?on
‣ GENLEX: takes sentence S and logical form L. Break up logical form into chunks
C(L), assume any substring of S might map to any chunk
Zettlemoyer and Collins (2005)
‣ Latent variable model learns how to map input to deriva?ons through the
learning process, supervised only by the correct answer
Seq2seq Seman?c Parsing
‣ Parsers have been pre`y hard to build…
‣ Cons?tuency/graph-based: complex dynamic programs
‣ Transi?on-based: complex transi?on systems
‣ CCG/seman?c parsers: complex syntax/seman?cs interface, challenging
inference, challenging learning
‣ For seman?c parsing in par?cular: bridging the syntax-seman?cs divide results
in structural weirdnesses in parsers, hard to learn the right seman?c grammar
What states border Texas λ x state( x ) ∧ borders( x , e89 )
‣ seq2seq models will provide a way to learn this mapping directly, without
having to model complex deriva?on structures
seg-85
Summariza(on
‣What makes a good summary?
Summariza(on
BAGHDAD/ERBIL, Iraq (Reuters) – A strong earthquake hit large parts of northern Iraq and the
capital Baghdad on Sunday, and also caused damage in villages across the border in Iran where
state TV said at least six people had been killed.
There were no immediate reports of casualLes in Iraq aMer the quake, whose epicenter was in
Penjwin, in Sulaimaniyah province which is in the semi-autonomous Kurdistan region very close
to the Iranian border, according to an Iraqi meteorology official.
But eight villages were damaged in Iran and at least six people were killed and many others
injured in the border town of Qasr-e Shirin in Iran, Iranian state TV said.
The US Geological Survey said the quake measured a magnitude of 7.3, while an Iraqi
meteorology official put its magnitude at 6.5 according to preliminary informaLon.
Many residents in the Iraqi capital Baghdad rushed out of houses and tall buildings in panic.
…
Summariza(on
Indian Express — A massive earthquake of magnitude 7.3 struck Iraq on Sunday, 103 kms
(64 miles) southeast of the city of As-Sulaymaniyah, the US Geological Survey said, reports
Reuters. US Geological Survey iniLally said the quake was of a magnitude 7.2, before
revising it to 7.3.
The quake has been felt in several Iranian ciLes and eight villages have been damaged.
Electricity has also been disrupted at many places, suggest few TV reports.
A massive earthquake of magnitude 7.3 struck Iraq on Sunday. The epicenter was
close to the Iranian border. Eight villages were damaged and six people were killed in
Iran.
Summary
What makes a good summary?
A strong earthquake of magnitude 7.3 struck Iraq and Iran on Sunday. The epicenter
was close to the Iranian border. Eight villages were damaged and six people were
killed in Iran.
Summary
‣ Content selec(on: pick the right content
‣ Right content was repeated within and across documents
‣ Domain-specific (magnitude + epicenter of earthquakes are important)
‣ Genera(on: write the summary
‣ Extraction: pick whole sentences from the document
‣ Compression: compress those sentences but basically just do dele(on
‣ Abstrac(on: rewrite + reexpress content freely
seg-91
Chatbots: The Turing Test
Turing (1950s)
‣ Imita9on game: A and B are locked in rooms and answer C’s ques9ons
via typewriter. Both are trying to act like B
A B
C
B B
trained judge
C trained judge
Original Interpreta9on: Standard Interpreta9on:
‣ The test is not “does this computer seem human-like to random people
with a web browser?”
ELIZA
‣ Created 1964-1966 at
MIT, heavily scripted
‣ DOCTOR script was
most successful:
repeats user’s input,
asks inane ques9ons
Weizenbaum (1966)
ELIZA
(.*) you (.*) me
Why do you think I $2 you?
‣ Iden9fy keyword, iden9fy context, apply transforma9on rule
‣ Very li_le need to generate new content, but can only have one type of
conversa9on
My (.) (.*)
What else comes to mind when you
think of your $1?
Weizenbaum (1966)
Cleverbot
Carpenter (1986)
‣ “Nearest neighbors”: human says statement A, find a human response
in human-human or human-computer chats to statement A, repeat that
User: Hi there cleverbot.
Cleverbot: How are you?
User: Doing okay. what plans do you have today?
Cleverbot: I do revision.
User: What are you revising for?
Cleverbot: Maths history.
User: How old are you?
Cleverbot: That’s none of your business.
User: What is maths history?
Cleverbot: I don’t know, farming maybe
‣ Can ofen give sensible
answers, but the bot doesn’t
really impose high-level
discourse structure
Chatbots as Transla9on
Ritter et al. (2011)
‣ Can treat as a machine transla9on problem: “translate” from current
u_erance to next one
‣ Filter the data, use sta9s9cal measures to prune extracted phrases to get
be_er performance
Chatbots as Transla9on
Ritter et al. (2011)
seg-90
BART
Lewis et al. (2020)
‣ Sequence-to-sequence BERT
variant: permute/mask/delete
tokens, then predict full
sequence autoregressively
‣ For downstream tasks: feed
document into both encoder +
decoder, use decoder hidden
state as output
‣ Treat summarizaLon as a sequence-to-sequence task with this
model
PEGASUS
Zhang et al. (2020)
‣Mask out “gap sentences” and generate them from the context
‣ Strong results, a bit beUer than BART
BART Summaries
Lewis et al. (2020)
‣ These look great! But they’re not always factual
Factuality via Entailment
Goyal and Durrett (2020)
(Figure: a pre-trained encoder model reads the input x and hypothesis h as "Seven games …. arrested last November [SEP] Seven games involving Nimes were arrested last November"; for each arc in the output dependency parse d(h), e.g. nummod, nsubj:pass, amod, obl:tmod, nmod:involving, an arc embedding is built from the head word, child word, and dependency label over the encoder output E(x; h), and a classifier W makes a per-arc entailment decision y = 1 or y = 0)
‣ Attempt to classify individual dependency arcs in generated output as entailed/
not entailed by the input
“Seven games involving Nimes
were investigated after Conrad
was arrested last November”
Factuality via QA
Wang et al. (2020)
seg-84
Annotation Artifacts
Gururangan et al. (2018); Poliak et al. (2018)
‣ Some datasets might be easy because of how they’re constructed
‣ Understanding our datasets is just as important as understanding
our models in order to know what they’re learning
Task: Natural Language Inference
‣ To create neutral sentences: annotators add information
‣ To create contradictions: annotators add negation
Gururangan et al. (2018); Poliak et al. (2018)
‣Models can do very well
without looking at the premise
Performance of models that only look at the hypothesis:
~70% on 3-class SNLI dataset
What do we do about this?
‣Why is this a problem? Because our models learn these simple cues
and not actually the hard task we want them to learn
‣ Solutions: build harder tasks, tweak data or training objective to
inoculate models against this (many proposals)
‣ They don’t generalize to challenging new examples without these
patterns — understanding this behavior is crucial to explaining what
our models are doing!
Other Tasks: Question Answering
Mudrakarta et al. (2018)
‣ Question type is a very powerful indicator. Only a couple of locations
in this context!
Other Tasks: Multi-hop QA
The Oberoi family is an Indian family that is famous for its involvement
in hotels, namely through The Oberoi Group
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
The Oberoi Group is a hotel company with its head office in Delhi.
…
…
(Figure: Doc 1 and Doc 2, with high lexical overlap)
Chen and Durrett (2019)
Other Tasks: SWAG
Zellers et al. (2018)
‣ Answers were filtered using ELMo to only leave answers that were
difficult for a model to reject…but then BERT solved this dataset
easily
seg-53
Seq2seq Models
‣ Semantic parsing:
What states border Texas λ x state( x ) ∧ borders( x , e89 )
‣ Syntactic parsing
The dog ran (S (NP (DT the) (NN dog) ) (VP (VBD ran) ) )
(but what if we produce an invalid tree or one with different words?)
‣ Machine translation, summarization, dialogue can all be viewed in this
framework as well
Seq2seq Models
‣ Encode a sequence into a fixed-sized vector
the movie was great
‣ Now use that vector to produce a series of tokens as output from a
separate LSTM decoder
le film était bon [STOP]
Seq2seq Models
‣ Generate next word conditioned on previous word as well as hidden state
the movie was great
h̄
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
Decoder has separate
parameters from encoder, so
this can learn to be a language
model (produce a plausible next
word given current one)
P(y|x) = ∏_{i=1}^{n} P(y_i | x, y_1, …, y_{i−1})
P(y_i | x, y_1, …, y_{i−1}) = softmax(W h̄)
Inference
‣ Generate next word conditioned on previous word and hidden state
the movie was great
‣ During inference: need to compute the argmax over the word
predictions and then feed that to the next RNN state
le
‣ Need to actually evaluate computation graph up to this point to form
input for the next state
‣ Decoder is advanced one state at a time until [STOP] is reached
film était bon [STOP]
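A minimal greedy-decoding sketch of this procedure; the encoder, decoder_step, and token ids are hypothetical placeholders standing in for a trained model, not anything defined in these notes:

```python
import torch

def greedy_decode(encoder, decoder_step, W, src_tokens, sos_id, stop_id, max_len=50):
    """Greedy seq2seq inference sketch (encoder/decoder_step are hypothetical).

    decoder_step(prev_token_id, hidden) advances the decoder one state and returns
    (new_hidden, hbar), where hbar is the decoder hidden state used for the softmax.
    """
    hidden = encoder(src_tokens)          # encode source into the initial decoder state
    prev = sos_id
    output = []
    for _ in range(max_len):
        hidden, hbar = decoder_step(prev, hidden)
        scores = W @ hbar                 # [|vocab| x |hidden|] @ [|hidden|] -> [|vocab|]
        prev = int(torch.argmax(scores))  # argmax word is fed back in at the next step
        if prev == stop_id:               # advance one state at a time until [STOP]
            break
        output.append(prev)
    return output
```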
seg-47
RNN Language Modeling
I saw the dog
h_i
P(w | context) = exp(w · h_i) / ∑_{w′} exp(w′ · h_i)
P(w | context) = softmax(W h_i)
‣W is a (vocab size) x (hidden size) matrix; linear layer in PyTorch (rows
are word embeddings)
(the two forms are equivalent; the output is a vector of word probabilities)
Training RNNLMs
I saw the dog
‣ Input is a sequence of words, output is those words shifted by one
I saw the dog running
‣ Allows us to train on predictions across several timesteps
simultaneously (similar to batching but we don’t call this batching)
Training RNNLMs
I saw the dog
‣ Total loss = sum of negative log likelihoods at each position
‣ In PyTorch: simply add the losses together and call .backward()
P(w|context)
loss = −log P(w*|context)
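A hedged sketch of what this looks like in PyTorch; the model sizes, token ids, and tensors are illustrative assumptions, not from the notes:

```python
import torch
import torch.nn as nn

# Sketch: total loss = sum of negative log likelihoods at each position.
vocab_size, hidden_size = 1000, 64                 # illustrative sizes
embed = nn.Embedding(vocab_size, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size)
W = nn.Linear(hidden_size, vocab_size)             # rows act as output word embeddings
loss_fn = nn.CrossEntropyLoss(reduction="sum")     # -log P(w*|context), summed

inputs  = torch.tensor([[1], [5], [8], [2]])       # "I saw the dog", shape [seq len=4, batch=1]
targets = torch.tensor([[5], [8], [2], [9]])       # the same words shifted by one

hidden_states, _ = lstm(embed(inputs))             # [seq len, batch, hidden]
logits = W(hidden_states)                          # [seq len, batch, vocab]
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                    # one backward() call over all positions
```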
Batched LM Training
I saw the dog running in the park and it looked very excited to be there
I saw the dog
I saw the dog running
in the park and
in the park and it
batch dim
‣ Multiple sequences and multiple
timesteps per sequence
looked very excited to be
Batched LM Training
‣ torch.nn.LSTM / torch.nn.GRU: expect input in [seq len, batch, word dim]
format
executed in parallel
‣ Cannot parallelize across timesteps
of RNN since output depends on
previous timesteps, so using larger
batches gives better parallelism
‣ Input: [4, 2, dim]
I saw the dog
in the park and
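A small shape check for the [seq len, batch, word dim] convention; the dimensions are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

# Two sequences of 4 tokens each, batched along dim 1: [seq len=4, batch=2, dim=16]
word_dim, hidden_dim = 16, 32
x = torch.randn(4, 2, word_dim)     # e.g. "I saw the dog" / "in the park and"
lstm = nn.LSTM(word_dim, hidden_dim)
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)                # torch.Size([4, 2, 32]): one output per timestep
print(h_n.shape)                    # torch.Size([1, 2, 32]): final hidden state per sequence
```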
Other Implementation Details
‣ torch.nn.Embedding: maps sequence of word indices to vectors
‣ [126, 285] -> [[0.1, -0.07, 1.2],
[-2.3, 0.2, 1.4]]
‣ Moves from a [sequence length] vector of indices -> [seq len, dim]
tensor, or a [batch, sequence length] matrix -> [batch, seq len, dim]
tensor
LM Evaluation: Perplexity
‣ Accuracy doesn’t make sense — predicting the next word is generally
impossible so accuracy values would be very low
‣ Evaluate LMs on the likelihood of held-out data (averaged to
normalize for length)
(1/n) ∑_{i=1}^{n} log P(w_i | w_1, …, w_{i−1})
‣ Perplexity: exp(average negative log likelihood). Lower is better
‣ Suppose we have probs 1/4, 1/3, 1/4, 1/3 for 4 predictions
‣ Avg NLL (base e) = 1.242 Perplexity = 3.464 <== geometric mean of
denominators
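The worked example can be checked directly (matching the 3.464 above):

```python
import math

# Perplexity = exp(average negative log likelihood); equals the geometric mean of 4, 3, 4, 3
probs = [1/4, 1/3, 1/4, 1/3]
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
print(avg_nll)            # ~1.242
print(math.exp(avg_nll))  # ~3.464
```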
seg-51
Montague Semantics
Id Name Alias Birthdate Sings?
e470 Stefani Germanotta Lady Gaga 3/28/1986 T
e728 Marshall Mathers Eminem 10/17/1972 T
‣ Database containing entities, predicates, etc.
‣ Sentence expresses something about the world which is either true or
false
NP VP
NNP NNP
S
VBP
Lady Gaga sings
‣ Denotation: evaluation of some expression against this database
[[Lady Gaga]] = e470
denotation of this string is an entity
[[sings(e470)]] = True
denotation of this expression is T/F
Montague Semantics
NP VP
NNP NNP
S
VBP
Lady Gaga sings
e470
λy. sings(y)
takes one argument (y, the entity) and
returns a logical form sings(y)
λy. sings(y)
sings(e470)
‣ We can use the syntactic parse as a bridge to the lambda-calculus
representation, build up a logical form (our model) compositionally
function application: apply this to e470
ID
Parses to Logical Forms
NP
VP
NNP NNP
S
VBP
Lady Gaga
sings
e470
λy. sings(y)
sings(e470) ∧ dances(e470)
VP
CC VP
VBP
dances
λy. dances(y)
and
VP: λy. a(y) ∧ b(y) -> VP: λy. a(y) CC VP: λy. b(y)
λy. sings(y) ∧ dances(y)
‣ General rules:
S: f(x) -> NP: x VP: f
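A toy rendering of this compositional process in Python; the encoding of entities and logical forms is an assumption made for this sketch, not a real semantic parser, though it mirrors the slide's example:

```python
# Toy compositional semantics: build logical forms bottom-up from the parse rules above.
LADY_GAGA = "e470"

def sings(y):             # denotation of "sings": entity -> logical form
    return ("sings", y)

def dances(y):
    return ("dances", y)

def coordinate(vp1, vp2):         # VP: λy. a(y) ∧ b(y)  <-  VP CC VP
    return lambda y: ("and", vp1(y), vp2(y))

def apply_S(np, vp):              # S: f(x)  <-  NP: x  VP: f   (function application)
    return vp(np)

print(apply_S(LADY_GAGA, sings))                      # ('sings', 'e470')
print(apply_S(LADY_GAGA, coordinate(sings, dances)))  # ('and', ('sings', 'e470'), ('dances', 'e470'))
```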
Parses to Logical Forms
NP
NNP NNP
S
VBD
Lady Gaga was
e470
λx.λy. born(y,x)
born(e470,3/28/1986)
VP
NP
March 28, 1986born
λy. born(y, 3/28/1986)
VBN
VP
λy. born(y, 3/28/1986)
‣ How to handle tense: should we indicate that this happened in the past?
‣ Function takes two arguments: first x (date), then y (entity)
3/28/1986
Tricky Things
‣ Adverbs/temporality: Lady Gaga sang well yesterday
∃e. type(e,sing) ∧ agent(e,e470) ∧ manner(e,well) ∧ …
‣ “Neo-Davidsonian” view of events: things with many properties:
‣ Quantification: Everyone is friends with someone
‣ Generic: Cats eat mice (all cats eat mice? most cats? some cats?)
∀x ∃y friend(x,y) (different friends)    ∃y ∀x friend(x,y) (one friend)
‣ Same syntactic parse for both! Syntax doesn’t resolve all ambiguities
sings(Lady Gaga, time=yesterday, manner=well)
‣ Indefinite: Amy ate a waffle ∃w. waffle(w) ∧ ate(Amy,w)
Semantic Parsing
‣ For question answering, syntactic parsing doesn’t tell you
everything you want to know, but indicates the right structure
‣ Solution: semantic parsing: many forms of this task depending on
the semantic formalism
‣ Applications: database querying/question answering: produce lambda-
calculus expressions that can be executed in these contexts
seg-45
Gated Connections
‣ Designed to fix “vanishing gradient” problem using gates
‣ Vector-valued “forget gate” f computed based
on input and previous hidden state
‣ Sigmoid: elements of f are in (0, 1)
f = σ(W^{xf} x_t + W^{hf} h_{t−1})
gated: h_t = h_{t−1} ⊙ f + func(x_t)
Elman: h_t = tanh(W x_t + V h_{t−1} + b_h)
‣ If f ≈ 1, we simply sum up a function of all inputs — gradient doesn’t
vanish! More stable without matrix multiply (V) as well
LSTMs
‣ “Cell” c in addition to hidden state h
‣ Vector-valued forget gate f depends on the h hidden state
‣ Basic communication flow: x -> c -> h -> output, each step of this process is gated in
addition to gates from previous timesteps
c_t = c_{t−1} ⊙ f + func(x_t, h_{t−1})
f = σ(W^{xf} x_t + W^{hf} h_{t−1})
‣ “Long short-term memory” network: hidden state is a “short-term” memory
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg Primer
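Written out from the gate equations above, one LSTM step looks roughly like the following sketch; the parameter names are illustrative, and torch.nn.LSTMCell provides the real implementation:

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, params):
    """One LSTM step from the gate equations (illustrative sketch, not library code).

    params is a dict of weight matrices, e.g. params["W_xf"], params["W_hf"] for the
    forget gate, and similarly for input gate i, output gate o, and candidate g.
    """
    f = torch.sigmoid(params["W_xf"] @ x_t + params["W_hf"] @ h_prev)  # forget gate
    i = torch.sigmoid(params["W_xi"] @ x_t + params["W_hi"] @ h_prev)  # input gate
    o = torch.sigmoid(params["W_xo"] @ x_t + params["W_ho"] @ h_prev)  # output gate
    g = torch.tanh(params["W_xg"] @ x_t + params["W_hg"] @ h_prev)     # main computation
    c_t = c_prev * f + i * g          # gated cell update: old cell kept where f ≈ 1
    h_t = o * torch.tanh(c_t)         # gated output / new hidden state
    return h_t, c_t
```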
LSTMs
(Figure: LSTM cell with input x_j, gates f, i, o and candidate g; hidden state h_{j-1} -> h_j, cell state c_{j-1} -> c_j)
‣ f, i, o are gates that control information flow
‣ g reflects the main computation of the cell
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg Primer
LSTMs
‣ Is an LSTM able to “throw out” the previous value of c at this timestep?
‣ Can we ignore a particular input x?
‣ Can we sum up the inputs x?
‣ Can we output something without changing c?
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg Primer
LSTMs
‣ Ignoring recurrent state entirely:
‣ Lets us get feedforward layer over token
‣ Ignoring input:
‣ Lets us discard stopwords
‣ Summing inputs:
‣ Lets us compute a bag-of-words
representation
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs
‣ Gradient still diminishes, but in a controlled way and generally by less — can
initialize forget gate = 1 to remember everything to start
(Figure: similar gradient magnitude flows backward through each timestep)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Other Units
‣ Gated recurrent unit (GRU): simpler architecture without c (“update” and “reset” gates)
‣ Other variants of LSTMs: multiplicative LSTMs, rotational unit of memory (RUM), …
‣ Quantized (binary/ternary weights) and other compressed forms
seg-79
Multi-hop QA
‣ Very few SQuAD questions require actually combining multiple
pieces of information — this is an important capability QA
systems should have
‣ Several datasets test multi-hop reasoning: the ability to answer
questions that draw on several sentences or several documents
Welbl et al. (2018); Yang et al. (2018)
‣WikiHop: annotators
given multiple
documents
HotpotQA
Yang et al. (2018)
Meet Corliss Archer is an American television sitcom that aired on CBS …
Question: What government position was held by the woman who portrayed
Corliss Archer in the film Kiss and Tell ?
Shirley Temple Black was an American actress, businesswoman, and singer …
Kiss and Tell is a comedy film in which 17-year-old Shirley Temple acts as
Corliss Archer .
…
…
As an adult, she served as Chief of Protocol of the United States
(Figure: Doc 1, Doc 2, and Doc 3, linked by shared entities)
Multi-hop Reasoning
Yang et al. (2018)
The Oberoi family is an Indian family that is famous for its involvement
in hotels, namely through The Oberoi Group
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
The Oberoi Group is a hotel company with its head office in Delhi.
…
…
(Figure: Doc 1 and Doc 2, linked by the same entity)
This is an idealized version of multi-hop reasoning. Do models need to do this to
do well on this task?
Multi-hop Reasoning
Yang et al. (2018)
The Oberoi family is an Indian family that is famous for its involvement
in hotels, namely through The Oberoi Group
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
The Oberoi Group is a hotel company with its head office in Delhi.
…
…
(Figure: Doc 1 and Doc 2, with high lexical overlap)
‣ The model can ignore the bridging entity and directly predict the answer
Multi-hop Reasoning
Meet Corliss Archer is an American television sitcom that aired on CBS …
Question: What government position was held by the woman who portrayed
Corliss Archer in the film Kiss and Tell ?
Shirley Temple Black was an American actress, businesswoman, and singer …
Kiss and Tell is a comedy film in which 17-year-old Shirley Temple acts as
Corliss Archer .
…
…
As an adult, she served as Chief of Protocol of the United States
(Figure: Doc 1, Doc 2, and Doc 3, linked by shared entities)
No simple lexical overlap.
…but only one government position appears in the context!
Weaknesses of Multi-hop Reasoning
Chen and Durrett (2019); Min et al. (2019)
(Figure: accuracy on the multiple-choice task for Majority-candidate, BiDAF, NoContext, Entity-GCN, CFC, and BAG (weak baselines vs. state-of-the-art); reported values are 38.8, 42.9, 59.3, 64.8, 66.4, and 67.4)
‣ Can get >60%
accuracy on a
multiple choice task
without using the
context
Weaknesses of Multi-hop Reasoning
Chen and Durrett (2019); Min et al. (2019)
‣ Can get >50%
accuracy on a
tougher task looking
at sentences in
isolation
(Figure: F1 for BiDAF++, QFE, GRN, DFGN, and a sentence-factored model; reported values are 50.8, 58.7, 68.1, 69.0, and 69.7)
State-of-the-art Models
Asai et al. (2020)
‣ Best systems: use hyperlink structure of Wikipedia and a strong
multi-step retrieval model built on BERT
seg-92
Neural Chatbots
Sordoni et al. (2015)
What are you doing
I am going home [STOP]
‣ Just like conventional MT, can train seq2seq models for this task
‣ Hard to evaluate:
OpenSubtitles Training Data
do you want to meet your sponsor for the last 10 years ?
of course ! but he doesn’t want to see me !
yeah , we were just going to hit up taco bell .
well , it’ s my pleasure .
and where had you been just before ?
i’ d been to the palace of the legion of honor , the art gallery .
‣Why might this model perform poorly? What might it be bad at?
Lack of Diversity
Li et al. (2016)
‣ Training to maximize likelihood gives a system that prefers common
responses:
Lack of Diversity
Li et al. (2016)
‣ Solution: mutual information criterion; response R should be
predictive of user utterance U as well
‣ Standard conditional likelihood: log P(R|U)
‣ Mutual information: log [P(R,U) / (P(R) P(U))] = log P(R|U) − log P(R)
‣ log P(R) = probabilities under a language model
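One common way to apply this criterion, in the spirit of the MMI reranking Li et al. describe, is to generate candidate responses and rerank them by log P(R|U) − λ log P(R); the scoring functions and numbers below are illustrative assumptions:

```python
def mmi_rerank(candidates, cond_logprob, lm_logprob, lam=0.5):
    """Rerank candidate responses by log P(R|U) - lambda * log P(R).

    cond_logprob(R) and lm_logprob(R) are hypothetical scoring functions backed by
    the seq2seq model and a separate language model; lam trades off diversity.
    """
    scored = [(cond_logprob(r) - lam * lm_logprob(r), r) for r in candidates]
    return max(scored)[1]

# Toy usage with made-up scores: the generic "I don't know" has high LM probability,
# so it is penalized relative to the more specific response.
cands = ["I don't know", "Going to the store"]
cond = {"I don't know": -2.0, "Going to the store": -3.0}
lm = {"I don't know": -1.0, "Going to the store": -6.0}
print(mmi_rerank(cands, cond.__getitem__, lm.__getitem__))  # -> "Going to the store"
```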
Lack of Diversity
Li et al. (2016)
Specificity
Ko, Durrett, Li (2019)
What are you doing
I don’t know [STOP]
SPECIFICITY=1 (nonspecific)
‣ When training the decoder, condition on the automatically predicted
specificity of the response
I don’t know => SPECIFICITY=1
‣ Train a specificity classifier on labeled data
Going to the store => SPECIFICITY=3
Specificity
Ko, Durrett, Li (2019)
What are you doing
I don’t know [STOP]
SPECIFICITY=1 (nonspecific)
What are you doing
Going to the store [STOP]
SPECIFICITY=4 (specific)
‣ At test time, set
the specificity
level higher to
get less generic
responses
PersonaChat
Zhang et al. (2018)
Meena
Adiwardana et al. (2020)
‣ 2.6B-parameter seq2seq model (larger than GPT-2)
‣ Trained on 341GB of online conversations scraped from public social
media
‣ Sample responses:
Blender
Roller et al. (2020)
‣ 2.7B-param model (like the previous one), also 9.4B-parameter
seq2seq model
‣ “Poly-encoder” Transformer architecture, some training tricks
‣ Three models: retrieve (from training data), generate, retrieve-and-refine
‣ Fine-tuning on three prior datasets: PersonaChat, Empathetic Dialogues
(discuss personal situation, listener is empathetic), Wizard of Wikipedia
(discuss something from Wikipedia)
Blender
Roller et al. (2020)
Blender
Roller et al. (2020)
‣ Inconsistent responses: this
model doesn’t really have
anything to say about itself
‣ Holding a conversation != AI
‣ Can’t acquire new information
‣ Did it learn “fun guy”? No, it
doesn’t understand
phonology. It probably had
this in the data somewhere
seg-86
Extractive Summarization: MMR
Carbonell and Goldstein (1998)
‣ Given some articles and a length budget of k words, pick some
sentences of total length <= k and make a summary
‣ Pick important yet diverse content: maximum marginal relevance (MMR)
While summary is < k words:
    Calculate MMR for each sentence not yet in the summary
    Add the highest-MMR sentence that doesn’t overflow the length
In the formula below: Sim1(Di, Q) says “make this sentence similar to a query”; the subtracted max over Dj ∈ S says “make this sentence maximally different from all others added so far”; and the argmax ranges over “all sentences not yet in the summary”.
MMR = argmax_{D_i ∈ R \ S} [ λ Sim1(D_i, Q) − (1 − λ) max_{D_j ∈ S} Sim2(D_i, D_j) ]
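A compact sketch of this greedy loop; the similarity functions, λ value, and word-count accounting are illustrative assumptions:

```python
def mmr_summarize(sentences, query, sim1, sim2, k, lam=0.7):
    """Greedy MMR selection: important yet diverse sentences, up to k words total.

    sim1(sentence, query) scores relevance, sim2(s1, s2) scores redundancy;
    both similarity functions and lam are illustrative assumptions.
    """
    summary, length = [], 0
    candidates = list(sentences)
    while candidates and length < k:
        def mmr(s):
            redundancy = max((sim2(s, t) for t in summary), default=0.0)
            return lam * sim1(s, query) - (1 - lam) * redundancy
        # Highest-MMR sentence that doesn't overflow the length budget
        fitting = [s for s in candidates if length + len(s.split()) <= k]
        if not fitting:
            break
        best = max(fitting, key=mmr)
        summary.append(best)
        candidates.remove(best)
        length += len(best.split())
    return summary
```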
Centroid
‣ Represent the documents and each sentence as bag-of-words with TF-IDF weighting
While summary is < k words:
    Calculate score(sentence) = cosine(sent-vec, doc-vec)
    Discard all sentences whose similarity with some sentence already in the summary is too high
    Add the best remaining sentence that won’t overflow the summary
Erkan and Radev (2004)
Summarization
‣ Count number of documents each bigram occurs in to measure importance
score(massive earthquake) = 3    score(magnitude 7.3) = 2
score(six killed) = 2    score(Iraqi capital) = 1
‣ ILP formulation: c and s are indicator variables indexed over bigrams (“concepts”) and sentences, respectively
“set c_i to 1 iff some sentence that contains it is included”
sum of included sentences’ lengths can’t exceed L
‣ Find summary that maximizes the score of bigrams it covers
Gillick and Favre (2009)
Evaluation: ROUGE
‣ ROUGE-n: n-gram precision/recall/F1 of summary w.r.t. gold standard
reference: An earthquake was detected in Iraq on Sunday
prediction: A massive earthquake of magnitude 7.3 struck Iraq on Sunday
ROUGE-2 recall = 1 correct bigram (Iraq, Sunday) / 4 reference bigrams
ROUGE-2 precision = 1 correct bigram (Iraq, Sunday) / 6 predicted bigrams
‣ ROUGE-2 correlates somewhat well with human judgments for multi-document summarization tasks
‣ Many hyperparameters: stemming, remove stopwords, etc.
‣ Historically: ROUGE recall @ k {words, characters}. Now: ROUGE F1
Lin (2004)
Results
Ghalandari (2017)
(Results: a better centroid method scores 38.58 / 9.73 / 1.53, compared against Gillick and Favre / bigram recall)
‣ Caveat: these techniques all work better for multi-document than single-document!
Multi-document vs. Single-document
‣ “a massive earthquake hit Iraq” / “a massive earthquake struck Iraq” — lots of redundancy to help select content in the multi-document case
‣ When you have a lot of documents, there are more possible sentences to extract:
But eight villages were damaged in Iran and at least six people were killed and many others injured in the border town of Qasr-e Shirin in Iran, Iranian state TV said.
The quake has been felt in several Iranian cities and eight villages have been damaged.
‣ Multi-document summarization is easier?
seg-87
Neural Extractive Methods
Nallapati et al. (2017)
‣ Most neural methods: encode sentences (with pooling), classify each sentence as belonging to summary or not
‣ Some sort of step (sequence model, ranking) to restrict to at most k sentences (not k words)
BERTSum
Liu et al. (2019)
‣ Final “summarization” layer doesn’t matter that much
‣ LEAD baseline: take the first few document sentences. Very strong for single-document summarization!
seg-93
Task-oriented Dialogue
Siri, find me a good sushi restaurant in Chelsea
Sushi Seki Chelsea is a sushi restaurant in Chelsea with 4.4 stars on Google
How expensive is it?
Entrees are around $30 each
Find me something cheaper
‣ Personal assistants / API front-ends
Task-oriented Dialogue
Hey Alexa, why isn’t my Amazon order here?
Let me retrieve your order. Your order was scheduled to arrive at 4pm today.
It never came
Okay, I can put you through to customer service.
‣ Customer service
ATIS
DARPA (early 1990s), Figure from Tur et al.
(2010)
‣ Given an utterance, predict a domain-specific semantic interpretation
‣ Can formulate as semantic parsing, but simple slot-filling solutions (classifiers) work well too
Intents
‣ 29 different intents
what days of the week do flights from san jose to nashville fly on
does tacoma airport offer transportation from the airport to the downtown area
which flights go from cleveland to indianapolis on april fifth
what meals are served on american flight 811 from tampa to milwaukee
(Intent labels from the figure: flight, ground_service, day_name, meal)
Full Dialogue Task
Find me a good sushi restaurant in Chelsea
restaurant_type <- sushi
location <- Chelsea
curr_result <- execute_search()
Sushi Seki Chelsea is a sushi restaurant in Chelsea with 4.4 stars on Google
How expensive is it?
get_value(cost, curr_result)
Entrees are around $30 each
‣ Involves slot filling, API calls, and generation
Dataflow Graphs for Dialogue
Semantic Machines; Andreas et al. (2020)
Training Dialogue Models
Find me a good sushi restaurant in Chelsea
restaurant_type <- sushi
location <- Chelsea
curr_result <- execute_search()   { wizard enters these
Sushi Seki Chelsea is a sushi restaurant in Chelsea with 4.4 stars on Google   { wizard types this out or invokes templates
‣ Learning from demonstrations: “wizard” pulls the levers and makes the dialogue system update its state and take actions
Goal-oriented Dialogue
‣ Big companies: Apple Siri, Google Assistant, Amazon Alexa, Microsoft Cortana/Semantic Machines, Facebook, Samsung Bixby, Tencent WeChat, ASAPP
‣ Tons of industry interest!
‣ Dozens of startups + medium-sized companies in this space
seg-78
Open-Domain QA
‣ SQuAD-style QA is very artificial, not really a real application
‣ Real QA systems should be able to handle more than just a paragraph of context — theoretically should work over the whole web?
Q: What was Marie Curie the recipient of?
Marie Curie was awarded the Nobel Prize in Chemistry and the Nobel Prize in Physics…
Mother Teresa received the Nobel Peace Prize in…
Curie received his doctorate in March 1895…
Skłodowska received accolades for her early work…
Open-Domain QA
‣ QA pipeline: given a question:
‣ Retrieve some documents with an IR system
‣ Zero in on the answer in those documents with a QA model
‣ This also introduces more complex distractors (bad answers) and should require stronger QA systems
DrQA
Chen et al. (2017)
‣ How often does the retrieved context contain the answer? (uses Lucene)
‣ Full retrieval results using a QA model trained on SQuAD: the task is much harder
Retrieval with BERT
Lee et al. (2019)
‣ Can we do better than a simple IR system?
‣ Encode the query with BERT, pre-encode all paragraphs with BERT; the query is basically nearest neighbors
Natural Questions
Kwiatkowski et al. (2019)
‣ Many SQuAD questions are not suited to the “open” setting because they’re underspecified: Where did the Super Bowl take place?
‣ Questions arose from Google searches, unlike SQuAD questions which were written by people looking at a passage
‣ Short answer F1s < 60, long answer F1s < 75
‣ Natural Questions: real questions, answerable with Wikipedia
seg-44
RNN Motivation
‣ Feedforward NNs can’t handle variable-length input: each position in the feature vector has fixed semantics
‣ Instead, we need to: 1) Process each word in a uniform way
the movie was great
that was great !
2) …while still exploiting the context that that token occurs in
‣ These don’t look related (great is in two different orthogonal subspaces)
RNN Abstraction
‣ Cell that takes some input x, has some hidden state h, and updates that hidden state and produces output y (all vector-valued)
(Figure: previous h (and previous c) plus input x go in; next h (and next c) plus output y come out)
‣ Optionally: cell state c (used in LSTMs but not all architectures)
RNN Uses
‣ Transducer: make some prediction for each element in a sequence (the movie was great -> DT NN VBD JJ; output y = score for each tag, then softmax)
‣ Acceptor/encoder: encode a sequence into a fixed-sized vector and use that for some purpose (predict sentiment with matmul + softmax, translate, paraphrase/compress)
Elman Networks
‣ Computes output from hidden state: y_t = tanh(U h_t + b_y)
‣ Updates hidden state based on input and current hidden state: h_t = tanh(W x_t + V h_{t−1} + b_h)
‣ Long history! (invented in the late 1980s)
Training Elman Networks
the movie was great -> predict sentiment
‣ “Backpropagation through time”: build the network as one big computation graph, some parameters are shared
‣ RNN potentially needs to learn how to “remember” information for a long time!
it was my favorite movie of 2016, though it wasn’t without problems -> +
‣ “Correct” parameter update is to do a better job of remembering the
sentiment of favorite
Vanishing Gradient
‣ Gradient diminishes going through tanh; if not in [-2, 2], gradient is almost 0
(Figure: the gradient shrinks going backward through each tanh cell: gradient, smaller gradient, tiny gradient)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
‣ Repeated multiplication by V causes problems
h_t = tanh(W x_t + V h_{t−1} + b_h)
seg-50
Semantics
‣ Our syntactic representations don’t say much about word meaning
‣ Grounding: associate words with real-world concepts (e.g., red means certain RGB values coming in from a sensor)
‣ Here: formal semantics. Connect language to logical representations of what it means, then worry about grounding those later
Model-Theoretic Semantics
‣ Key idea: can interpret natural language expressions as set-theoretic expressions called models of those sentences
‣ Natural language statement S => interpretation of S that models it
‣ We can now concretely define things like entailment: A entails B if in all
worlds where A is true, B is true
She likes going to that restaurant
‣ Interpretation: defines who she and that restaurant are, making it able
to be concretely evaluated with respect to a world
‣ Our modeling language is first-order logic
First-order Logic
‣ sings is a predicate (with one argument), function f: entity → true/false
‣ Powerful logic formalism including things like entities, relations, and
quantification
‣ [[sings]] = denotation, set of entities which sing (found by executing this
predicate on the world — we’ll come back to this)
Lady Gaga sings
‣ sings(Lady Gaga) = true or false, have to execute this against some
database (world)
Quantification
‣ Universal quantification: “forall” operator
∀x sings(x) ∨ dances(x) → performs(x)   “Everyone who sings or dances performs”
‣ Existential quantification: “there exists” operator
∃x sings(x)   “Someone sings”
‣ Source of ambiguity! “Everyone is friends with someone”:
∀x ∃y friend(x,y)   or   ∃y ∀x friend(x,y)
Logic in NLP
‣ Question answering:
Who are all the American singers named Amy?
λx. nationality(x,USA) ∧ sings(x) ∧ firstName(x,Amy)
‣ Function that maps from x to true/false, like filter. Execute this on the
world to answer the question
‣ Information extraction: Lady Gaga and Eminem are both musicians
musician(Lady Gaga) ∧ musician(Eminem)
‣ Can now do reasoning. Maybe know:
∀x musician(x) => performer(x)
Then: performer(Lady Gaga) ∧ performer(Eminem)
‣ Lambda calculus: powerful system for expressing these functions
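The "execute on the world" step can be mimicked as a filter over a toy database; the entities and attributes below are invented purely for illustration:

```python
# Toy world: execute λx. nationality(x, USA) ∧ sings(x) ∧ firstName(x, Amy) as a filter.
WORLD = [
    {"id": "e101", "firstName": "Amy", "nationality": "USA", "sings": True},
    {"id": "e102", "firstName": "Amy", "nationality": "UK",  "sings": True},
    {"id": "e103", "firstName": "Bob", "nationality": "USA", "sings": False},
]

query = lambda x: x["nationality"] == "USA" and x["sings"] and x["firstName"] == "Amy"
print([e["id"] for e in WORLD if query(e)])   # -> ['e101']
```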