seg-49
Context-Dependent Embeddings
‣ Train a neural language model to predict the next word given previous words in the sentence; use its internal representations as context-dependent word embeddings
[ELMo figure: "John visited Madagascar yesterday" fed through per-token Char CNNs into 4096-dim LSTMs with 512-dim projections]
‣ Basic self-attention over token vectors x_1, …, x_n:
α_{i,j} = softmax(x_i^⊤ x_j),   x'_i = Σ_{j=1}^n α_{i,j} x_j
‣ Multi-head version with learned parameters W_k, V_k for head k:
α_{k,i,j} = softmax(x_i^⊤ W_k x_j),   x'_{k,i} = Σ_{j=1}^n α_{k,i,j} V_k x_j
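A minimal numpy sketch of the single-head equations above (shapes and names are assumptions for illustration, not lecture code):

import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W, V):
    # X: [n, d] token vectors; W, V: [d, d] parameters for one head
    scores = X @ W @ X.T        # scores[i, j] = x_i^T W x_j
    alpha = softmax(scores)     # normalize over j for each position i
    return alpha @ (X @ V.T)    # x'_i = sum_j alpha[i, j] * (V x_j)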
What can self-attention do?
Word-level MT: roughly 1989 – 2000 (source → target)
‣ Big chunks make translation easier
↳ How to get chunks?
↳ How to translate with them?
MT Evaluation
‣ Evaluate MT at the word (phrase) level
‣ BLEU: geometric mean of 1-gram, 2-gram, 3-gram, and 4-gram precision of the output × a brevity penalty
‣ reference ← translation: how many 4-grams of the translation are in the reference?
‣ BLEU scores are low in absolute terms (~30-40), but BLEU correlates with human judgments
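A rough sketch of that computation (uniform geometric mean of 1-4-gram precision times a brevity penalty; real BLEU also clips counts against multiple references — this is an illustrative simplification):

import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    if not candidate or not reference:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(overlap / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return geo_mean * brevity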
BERT for QA
‣ P(start = i) = softmax(s_start · e_i) and P(end = i) = softmax(s_end · e_i), where e_i is BERT's output for token i and s_start, s_end are learned vectors
‣ Input: [CLS] What event … [SEP] The assassination of …
‣ Caveat: 512 word piece length limit — identify a small context to give as input to BERT
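A hedged sketch of the span scoring above (names and shapes are assumed; not the reference BERT code):

import torch
import torch.nn.functional as F

def span_distributions(token_reprs, s_start, s_end):
    # token_reprs: [seq_len, hidden] BERT outputs; s_start, s_end: [hidden] learned vectors
    p_start = F.softmax(token_reprs @ s_start, dim=0)   # P(start = i)
    p_end = F.softmax(token_reprs @ s_end, dim=0)       # P(end = i)
    return p_start, p_end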
seg-89
Seq2seq Summarization
‣ Extractive paradigm isn't all that flexible, even with compression
‣ Can we just use seq2seq models to simplify things?
[Figure: seq2seq model reads the document "The oat bran craze has …" and generates the summary "Its president quit suddenly"]
‣ Train to produce summary based on document
‣ Need lots of data: most methods are going to be single-document
Chopra et al. (2016)
Seq2seq Headline Generation
‣ Headline generation task: generate headline from first sentence of article (can get lots of data!)
[Figure: example outputs with no attention, with attention, and the reference, for an input sentence]
‣ Works pretty well, though these models can generate incorrect summaries (who has the knee injury?)
‣ What happens if we try this on a longer article?
Chopra et al. (2016)
Seq2seq Summarization
See et al. (2017)
‣ What's wrong with this summary?
Pointer-Generator Model
‣ Copying approach like in Jia+Liang / MT
See et al. (2017)
Pointer-Generator Model
‣ Solutions: copy mechanism, coverage, just like in MT…
Pointer-Generator Model
See et al. (2017)
‣ How abstractive is this, anyway? Mostly doing copying!
‣ Next: pre-trained models which are better at introducing new content
seg-88
Compressive Summarization
Indian Express — A massive earthquake of magnitude 7.3 struck Iraq on Sunday, 103 kms (64 miles) southeast of the city of As-Sulaymaniyah, the US Geological Survey said, reports Reuters. US Geological Survey initially said the quake was of a magnitude 7.2, before revising it to 7.3.
‣ Sentence extraction isn't aggressive enough at removing irrelevant content
‣ Want to extract sentences and also delete content from them
Syntactic Cuts
Berg-Kirkpatrick et al. (2011)
‣ Use syntactic rules to make certain deletions
‣ Delete adjuncts
A massive earthquake of magnitude 7.3 struck Iraq on Sunday, 103 kms (64 miles)…
[Parse fragment: NP-LOC and NP-TMP adjuncts under the VP can be cut]
Syntactic Cuts
Berg-Kirkpatrick et al. (2011)
‣ Use syntactic rules to make certain deletions
‣ Delete second parts of coordination structures
At least six people were killed and many others injured
[Parse fragment: S → S CC S; the second conjoined S can be cut]
Compressive Integer Linear Program
‣ Recall the Gillick+Favre ILP:
‣ Now the s_j variables are nodes or sets of nodes in the parse tree
At least six people were killed and many others injured
[Parse fragment: S → S_1 CC S_2, giving units s_1 and s_2]
‣ New constraint: s_2 ≤ s_1 ("s_1 is a prerequisite for s_2")
Berg-Kirkpatrick et al. (2011)
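As a concrete illustration of how such prerequisite constraints look in an off-the-shelf ILP solver, here is a minimal PuLP sketch (scores, lengths, and variable names are made up for illustration; this is not the exact Gillick+Favre formulation):

import pulp

# Hypothetical unit scores (e.g., from a trained model) and lengths.
scores = {"s1": 2.0, "s2": 1.5}
lengths = {"s1": 12, "s2": 5}
budget = 15

prob = pulp.LpProblem("compressive_summary", pulp.LpMaximize)
x = {u: pulp.LpVariable(u, cat="Binary") for u in scores}

prob += pulp.lpSum(scores[u] * x[u] for u in scores)          # objective: total score
prob += pulp.lpSum(lengths[u] * x[u] for u in lengths) <= budget  # length budget
prob += x["s2"] <= x["s1"]                                    # s1 is a prerequisite for s2

prob.solve()
selected = [u for u in x if x[u].value() == 1]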
Compressive Integer Linear Program
Durrett et al. (2016)
This hasn't been Kellogg's year.
Its president quit suddenly.
The oat-bran craze has cost it market share.
And now Kellogg is canceling its new cereal plant, which would have cost $1 billion.
[Parse fragments: SBAR and NP nodes mark deletable spans; compressed example: "The oat-bran craze has cost Kellogg market share."]
Units x_1, …, x_5
ILP:   max_x  w^⊤ f(x)
s.t. summary(x) obeys length limit
     summary(x) is grammatical
     summary(x) is coherent
Compressive Integer Linear Program
Durrett et al. (2016)
max_x  w^⊤ f(x)
s.t. summary(x) obeys length limit
     summary(x) is grammatical
     summary(x) is coherent
Grammaticality constraints: allow cuts within sentences
Coreference constraints: do not allow pronouns that would refer to nothing
‣ Otherwise, force the pronoun's antecedent to be included in the summary
‣ If we're confident about coreference, rewrite the pronoun (it → Kellogg)
Results
Durrett et al. (2016)
[Plot: content fidelity (ROUGE-1: word overlap with reference) vs. linguistic quality (human study on Mechanical Turk) for first n words, extract sentences, extract clauses (Yoshida et al., 2014), and the full system]
Results
Xu and Durrett (2019)
‣ Model is now a neural model that scores sentences and compression options
‣ Decoding is done by beam search (not ILP); the length constraint is not enforced as strongly anymore
‣ Stronger results on NYT and on CNN/Daily Mail
seg-77
Problems in QA
Devlin et al. (2019)
‣ SQuAD questions … different alphabets!
‣ Japanese => English: different script and very different syntax
seg-99
Ethics in NLP
Types of risk:
‣ Bias amplification: systems exacerbate real-world bias rather than correct for it
‣ Unethical use: powerful systems can be used for bad ends
‣ Exclusion: underprivileged users are left behind by systems
‣ Dangers of automation: automating things in ways we don't understand is dangerous
[Diagram: risks range from application-specific (IE / QA / summarization? machine translation? dialog?) to machine learning generally to deep learning generally]
Bias Amplification
‣ Bias in data: 67% of training images involving cooking are women, model predicts 80% women cooking at test time — amplifies bias
‣ Can we constrain models to avoid this while achieving the same predictive accuracy?
‣ Place constraints on proportion of predictions that are men vs. women?
Zhao et al. (2017)
Bias Amplification
Zhao et al. (2017)
Bias Amplification
Alvarez-Melis and Jaakkola (2011)
‣ English -> French machine translation requires inferring gender even when unspecified
‣ "dancer" is assumed to be female in the context of the word "charming"… but maybe that reflects how language is used?
Exclusion
Alvarez-Melis and Jaakkola (2011)
‣ Most of our annotated data is English data, especially newswire
‣ What about:
  Codeswitching?
  Dialects?
  Other languages? (Non-European/CJK)
‣ Efforts to broaden along all these axes, including Universal Dependencies, Masakhane NLP, …
Dangers of Automation
Dangers of Automation
Slide credit: allout.org (http://allout.org)
‣ Offensive terms
Dangers of Automation
Slide credit: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
‣ "Amazon scraps secret AI recruiting tool that showed bias against women"
‣ "Women's X" organization was a negative-weight feature in resumes
‣ Women's colleges too
‣ Was this a bad model? Maybe it correctly reflected the biases in what the humans did in the actual recruiting process
Dangers of Automation
https://toxicdegeneration.allenai.org/
‣ "Toxic degeneration": systems that generate toxic stuff
‣ System trained on a big chunk of the Internet: conditioning on "SJW", "black" gives the system a chance of recalling bad stuff from its training data
Unethical Use
‣ Surveillance applications?
‣ Generating convincing fake news / fake comments?
‣ What if these were undetectable?
Unethical Use
‣ Wang and Kosinski: gay vs. straight classification based on faces
‣ Blog post by Agüera y Arcas, Todorov, Mitchell: the system detects mostly social phenomena (glasses, makeup, angle of camera, facial hair)
‣ Authors argued they were testing a hypothesis: sexual orientation has a genetic component reflected in appearance
‣ Potentially dangerous tool, and not even good science
Slide credit: https://medium.com/@blaisea/do-algorithms-reveal-sexual-orientation-or-just-expose-our-stereotypes-d998fafdf477
Unethical Use
http://www.faception.com
How to move forward
‣ Hal Daumé III: proposed code of ethics
https://nlpers.blogspot.com/2016/12/should-nlp-and-ml-communities-have-code.html
‣ Value-sensitive design: vsdesign.org
‣ Account for human values in the design process: understand whose values matter here, analyze how technology impacts those values
‣ Many other points, but these are relevant:
‣ Contribute to society and human well-being, and minimize negative consequences of computing systems
‣ Make reasonable effort to prevent misinterpretation of results
‣ Make decisions consistent with safety, health, and welfare of public
‣ Improve understanding of technology, its applications, and its potential consequences (pos and neg)
Reading Comprehension
Knowledge base question answering: question → formal representation → [KB] → answer
  "When was Brie Larson born?" → λd. birthday(BL, d)
But many questions don't map neatly onto a knowledge base:
  What temp should I cook chicken to?
  Why did WW2 start?
  What event led to the start of WW1?
Given a passage such as "The assassination of Franz Ferdinand took place on … This event was a direct cause of WW1.",
reading comprehension = find the span in the passage which answers the question
seg-66
Syntactic Machine Translation
‣ Rather than use phrases, use a synchronous context-free grammar
NP → [DT1 JJ2 NN3; DT1 NN3 JJ2]
DT → [the, la]
DT → [the, le]
NN → [car, voiture]
JJ → [yellow, jaune]
the yellow car ↔ la voiture jaune
‣ Assumes parallel tree structures, but there can be reordering
[Figure: parallel NP trees, English DT1 JJ2 NN3 and French DT1 NN3 JJ2]
‣ Translation = parse the input with "half" of the grammar, read off the other half
Syntactic Machine Translation
Slide credit: Dan Klein
‣ To get flexibility in translation, use lexicalized rules that look like "syntactic phrases"
‣ Leads to HUGE grammars, parsing is slow
seg-70
GPT/GPT-2
Radford et al. (2018, 2019)
‣ GPT: "ELMo with transformers" (works better than ELMo)
‣ Train a single unidirectional transformer LM on long contexts
‣ GPT-2: trained on 40GB of text collected from upvoted links from reddit
‣ 1.5B parameters — by far the largest of these models trained as of March 2019
‣ Because it's a language model, we can generate from it
GPT-2
credit: OpenAI
Questions
1) How novel is the stuff being generated? (Is it just doing nearest neighbors on a large corpus?)
2) How do we understand and distill what is learned in this model?
3) How do we harness these priors for conditional generation tasks (summarization, generate a report of a basketball game, etc.)
4) Is this technology dangerous? (OpenAI pursued a "staged release" model)
Grover
Zellers et al. (2019)
‣ Sample from a large language model conditioned on a domain, date, authors, and headline
‣ Humans rank Grover-generated propaganda as more realistic than real "fake news"
‣ NOTE: Not a GAN, the discriminator is trained separately from the generator
‣ Fine-tuned Grover can detect Grover propaganda easily — the authors argue for releasing it for this reason
Pre-training Cost
https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/
‣ XLNet (BERT variant): $30,000 — $60,000 (unclear)
‣ Grover-MEGA: $25,000
‣ BERT: Base $500, Large $7000
‣ This is for a single pre-training run… developing new pre-training techniques may require many runs
‣ Fine-tuning these models can typically be done with a single GPU (but may take 1-3 days for medium-sized datasets)
GPT-3 (June 2020)
Brown et al. (2020)
‣ 175B parameter model: 96 layers, 96 heads, 12k-dim vectors
‣ Trained on Microsoft Azure, estimated to cost roughly $10M. Almost 1000x BERT-large
GPT-3: Few-shot Learning
Brown et al. (2020)
‣ Model "learns" by conditioning on a few examples of the task
‣ Not as good as systems that train on more data, but impressive results in the few-shot setting
IBM Model 1
ā = (a_1, …, a_n),  t̄ = (t_1, …, t_n),  s̄ = (s_1, …, s_m, NULL)
‣ Each word in t̄ aligns to one word in s̄; NULL is a placeholder for unaligned target words
‣ P(t̄, ā | s̄) = P(ā) P(t̄ | ā, s̄) = Π_i P(a_i) P(t_i | s_{a_i})
‣ Model params: a translation probability matrix of size |V_s| × |V_t| (source side includes NULL): P(target | source)
‣ P(t_i | s_{a_i}): look up the probability of t_i given source word s_{a_i}; a_i is a "pointer" into the source
‣ P(a_i) = uniform over {1, …, m, NULL} = 1/(m+1)
Inference in Model 1
‣ P(ā | s̄, t̄) = P(ā, t̄ | s̄) / P(t̄ | s̄); the denominator doesn't depend on ā, so we can ignore it
‣ So P(ā | s̄, t̄) ∝ Π_i P(a_i) P(t_i | s_{a_i}), and each a_i can be inferred independently: P(a_i | s̄, t̄) ∝ P(t_i | s_{a_i})
[Worked example: posterior alignment probabilities for a French word such as "aime", e.g. 2/3 to one English word and 1/3 to NULL]
HMM Model (Vogel, 1996)
‣ P(ā) = Π_i P(a_i | a_{i−1}), with P(a_i | a_{i−1}) = Categorical(a_i − a_{i−1}): "moving the alignment pointer by +1 is most likely"
Learning
‣ No labeled alignment data — optimize the marginal log likelihood over sentence pairs: Σ log P(t̄ | s̄) = Σ log Σ_ā P(t̄, ā | s̄); the model defines P(t̄, ā | s̄)
‣ Expectation maximization (EM)
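A minimal toy sketch of EM for IBM Model 1 under the assumptions above (variable names are illustrative, not lecture code):

from collections import defaultdict

def train_ibm1(bitext, iters=10):
    # bitext: list of (source_words, target_words) pairs; NULL is appended to each source
    t = defaultdict(lambda: 1.0)          # t[(tgt, src)] starts uniform (unnormalized)
    for _ in range(iters):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for src, tgt in bitext:
            src = src + ["NULL"]
            for tw in tgt:
                # E-step: the posterior over alignments factors per target word
                z = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    p = t[(tw, sw)] / z
                    counts[(tw, sw)] += p
                    totals[sw] += p
        # M-step: renormalize translation probabilities per source word
        t = defaultdict(float, {(tw, sw): c / totals[sw] for (tw, sw), c in counts.items()})
    return t

# Usage on a toy bitext:
probs = train_ibm1([(["le", "chat"], ["the", "cat"]), (["le", "chien"], ["the", "dog"])])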
seg-58
Unknown Words
‣ Want to be able to copy named entities like Pont-de-Buis
P(y_i | x, y_1, …, y_{i−1}) = softmax(W [c_i; h̄_i])     (c_i from attention, h̄_i from the RNN hidden state)
‣ Problems: target word has to be in the vocabulary; attention + RNN need to generate good embedding to pick it
Luong et al. (2015), Jean et al. (2015)
Pointer Networks
‣ Standard decoder (P_vocab): softmax over the vocabulary, all words get > 0 prob
P(y_i | x, y_1, …, y_{i−1}) = softmax(W [c_i; h̄_i])
‣ Pointer network: predict from source words instead of the target vocab
[Figure: decoder distribution over source words w_1 w_2 w_3 w_4 ("the movie was great") instead of the full vocabulary]
P_pointer(y_i | x, y_1, …, y_{i−1}) ∝ exp(h_j^⊤ V h̄_i) if y_i = w_j  (and 0 for words not in the source)
Pointer-Generator Mixture Models
‣ Define the decoder model as a mixture model of the P_vocab and P_pointer models (previous slide)
P(y_i | x, y_1, …, y_{i−1}) = P(copy) P_pointer + (1 − P(copy)) P_vocab
‣ Predict P(copy) based on decoder state, input, etc.
‣ Marginalize over the copy variable during training and inference
‣ Model will be able to both generate and copy, and flexibly adapt between the two
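A PyTorch-style sketch of that mixture (shapes and names assumed; not the See et al. implementation):

import torch
import torch.nn.functional as F

def mixture_distribution(dec_state, context, attn_weights, src_ids, W_vocab, w_copy, vocab_size):
    # dec_state, context: [d]; attn_weights: [src_len]; src_ids: LongTensor [src_len] of (extended-)vocab ids
    p_copy = torch.sigmoid(w_copy @ torch.cat([dec_state, context]))        # scalar copy gate P(copy)
    p_vocab = F.softmax(W_vocab @ torch.cat([dec_state, context]), dim=-1)  # [vocab_size]
    # Scatter the attention mass onto the (possibly extended) vocabulary
    p_pointer = torch.zeros(vocab_size).scatter_add(0, src_ids, attn_weights)
    return p_copy * p_pointer + (1 - p_copy) * p_vocab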
Copying
‣ Some words we may want to copy may not be in the fixed output vocab (Pont-de-Buis)
‣ Solution: expand the vocabulary dynamically (the, a, zebra, …, plus source words like Pont-de-Buis, ecotax). New words can only be predicted by copying (they always have 0 probability under P_vocab)
Results on Semantic Parsing
‣ Copying typically helps a bit, but attention captures most of the benefit. However, vocabulary expansion is critical for some tasks (machine translation)
‣ For semantic parsing, copying tokens from the input (texas) can be very useful
Jia and Liang (2016)
seg-59
Handling Rare Words
Sennrich et al. (2016)
‣ Words are a difficult unit to work with: copying can be cumbersome, word vocabularies get very large
‣ Character-level models don't work well
Input: _the _eco tax _port i co _in _Po nt - de - Bu is …
Output: _le _port ique _éco taxe _de _Pont - de - Bui s
‣ Compromise solution: use subword tokens, which may be full words but may also be parts of words
‣ Can achieve transliteration with this, subword structure makes some translations easier to achieve
Byte Pair Encoding (BPE)
‣ Start with every individual byte (character) as its own symbol
‣ Count bigram character cooccurrences in the dictionary
‣ Merge the most frequent pair of adjacent characters
‣ Vocabulary stats are weighted over a large corpus
‣ Doing 30k merges => vocabulary of 30,000 word pieces. Includes many whole words:
and there were no re_ fueling stations anywhere
one of the city 's more un_ princi_ pled real estate agents
Sennrich et al. (2016)
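A toy sketch of the merge loop described above, run over a word-frequency dictionary (illustrative only, not the Sennrich et al. reference code):

from collections import Counter

def bpe_merges(word_freqs, num_merges):
    vocab = {tuple(w): f for w, f in word_freqs.items()}   # each word as a tuple of symbols (chars)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq                       # count adjacent pairs, weighted by frequency
        if not pairs:
            break
        best = max(pairs, key=pairs.get)                    # most frequent adjacent pair
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1]); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# e.g. bpe_merges({"refueling": 3, "fueling": 2, "stations": 4}, num_merges=10)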
Word Pieces
‣ SentencePiece library from Google: unigram LM
‣ Alternative to BPE:
  while voc size < target voc size:
    Build a language model over your corpus
    Merge pieces that lead to highest improvement in language model perplexity
‣ Issues: what LM to use? How to make this tractable?
Schuster and Nakajima (2012), Wu et al. (2016)
Comparison
Bostrom and Durrett (2020)
‣ BPE produces less linguistically plausible units than word pieces (unigram LM)
‣ Some evidence that unigram LM works better in pre-trained transformer models
‣ Other work explores ensembling across multiple tokenizations
seg-65
Phrase-Based Machine Translation
Noisy channel model: combine scores from a translation model + language model to translate foreign to English — "translate faithfully but make fluent English"
  P(e|f) ∝ P(f|e) P(e)
Inputs: a phrase table P(f|e) learned from parallel data, e.g.
  cat ||| chat ||| 0.9
  the cat ||| le chat ||| 0.8
  dog ||| chien ||| 0.8
  house ||| maison ||| 0.6
  my house ||| ma maison ||| 0.9
  language ||| langue ||| 0.9
  …
and a language model P(e) trained on unlabeled English data
Phrase-Based Machine Translation
‣ Inputs needed:
‣ Phrase table: set of phrase pairs (e, f) with probabilities P(f|e) — the translation model (TM)
‣ Language model (LM) that scores P(e_i | e_1, …, e_{i−1}) ≈ P(e_i | e_{i−n−1}, …, e_{i−1})
‣ Noisy channel model: P(e|f) ∝ P(f|e) P(e) (ignore the P(f) term)
‣ What we want to find: e produced by a series of phrase-by-phrase translations from an input f
Phrase Lattice
‣ Given an input sentence, look at our phrase table to find all possible translations of all possible spans
‣ Monotonic translation: need to translate each word in order, explore paths in the lattice that don't skip any words
‣ Looks like Viterbi, but the scoring is more complicated
Koehn (2004)
Monotonic Translation
‣ If we translate with beam search, what state do we need to keep in the beam?
‣ Where are we in the sentence
‣ What words have we produced so far (actually only need to remember the last 2 words when using a 3-gram LM)
‣ Score
Monotonic Translation
‣ Beam state: where we're at, what the current translation so far is, and the score of that translation
[Lattice figure: e.g. "Mary" at idx = 1 with score -1.1, expanding to hypotheses like "…did not", "Mary not", "Mary no" at idx = 2 with scores -0.1, -1.2, -2.9]
‣ Advancing the state consists of trying each possible translation that could get us to this timestep
Monotonic Translation
score = log [P(Mary) P(not | Mary) P(Maria | Mary) P(no | not)]   (LM terms and TM terms)
In reality: score = α log P(LM) + β log P(TM), and the TM is broken down into several features
Monotonic Translation
‣ Two ways to get here: Maria + no dio or Maria no + dio
‣ Beam contains options from multiple segmentations of the input — as many hypotheses as paths through the lattice (up to beam size)
Non-Monotonic Translation
‣ More flexible model: can visit the source sentence "out of order"
‣ State needs to describe which words have been translated and which haven't — translated: Maria, dio, una, bofetada
‣ Big enough phrases already capture lots of reorderings, so this isn't as important as you think
"Training" Decoders
‣ score = α log P(t) + β log P(s|t), and P(s|t) is in fact more complex
‣ Usually 5-20 feature weights to set; want to optimize for BLEU score, which is not differentiable
‣ MERT (Och 2003): decode to get 1000-best translations for each sentence in a small training set (<1000 sentences), do line search on the parameters to directly optimize for BLEU
Moses
‣ Toolkit for machine translation due to Philipp Koehn + Hieu Hoang
‣ Pharaoh (Koehn, 2004) is the decoder from Koehn's thesis
‣ Moses implements word alignment, language models, and this decoder, plus a ton more stuff
‣ Highly optimized and heavily engineered, could more or less build SOTA translation systems with this from 2007-2013
seg-71
BART
Lewis et al. (2019)
‣ Sequence-to-sequence BERT variant: permute/mask/delete tokens, then predict the full sequence autoregressively
‣ For downstream tasks: feed the document into both encoder + decoder, use the decoder hidden state as output
BART
Lewis et al. (2019)
‣ Good results on dialogue, summarization tasks. Will discuss more when we get to these problems
‣ Results on GLUE not better than RoBERTa
T5
Raffel et al. (2019)
‣ Pre-training: similar denoising scheme to BART
‣ Frame many problems as sequence-to-sequence ones:
T5
Raffel et al. (2019)
‣ We still haven't hit the limit of bigger data being useful for pre-training: here we see stronger MT results from the biggest data
‣ Colossal Cleaned Common Crawl: 750 GB of text
The Future
‣ Explosion of techniques; remains to be seen what comes out on top
seg-16
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time
probs = ffnn.forward(input)  # [batch_size, num_classes]
loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)  # elementwise product with one-hot labels, summed over the batch
...
‣ Batch sizes from 1-100 often work well
def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    …
Training Basics
‣ Basic formula: compute gradients on a batch, use a first-order optimization method (SGD, Adagrad, etc.)
‣ How to initialize? How to regularize? What optimizer to use?
‣ This segment: some practical tricks. Take deep learning or optimization courses to understand this further
How does initialization affect learning?
[Network diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (m x d matrix) → softmax → P(y|x)]
P(y|x) = softmax(W g(V f(x)))
‣ How do we initialize V and W? What consequences does this have?
‣ Nonconvex problem, so initialization matters!
How does initialization affect learning?
‣ Nonlinear model… how does this affect things?
‣ If cell activations are too large in absolute value, gradients are small
‣ ReLU: larger dynamic range (all positive numbers), but can produce big values, can break down if everything is too negative
Initialization
1) Can't use zeroes for parameters to produce hidden layers: all values in that hidden layer are always 0 and have gradients of 0, never change
2) Initialize too large and cells are saturated
‣ Can do random uniform / normal initialization with appropriate scale
‣ Glorot initializer: U[−√(6 / (fan-in + fan-out)), +√(6 / (fan-in + fan-out))]
‣ Want variance of inputs and gradients for each layer to be the same
‣ Batch normalization (Ioffe and Szegedy, 2015): periodically shift+rescale each layer to have mean 0 and variance 1 over a batch (useful if net is deep)
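One way to apply the Glorot/Xavier initializer above in PyTorch (a small illustrative snippet, not from the slides):

import torch.nn as nn

layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(layer.weight)   # U[-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))]
nn.init.zeros_(layer.bias)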
Dropout
‣ Probabilistically zero out parts of the network during training to prevent overfitting, use the whole network at test time
Srivastava et al. (2014)
‣ Similar to benefits of ensembling: network needs to be robust to missing signals, so it has redundancy
‣ Form of stochastic regularization
‣ One line in PyTorch/Tensorflow
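For reference, that one line in PyTorch looks roughly like this (p is just an example value):

import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)        # zeroes each element with prob 0.5 during training
x = dropout(torch.randn(4, 8))     # call .eval() on the module at test time to disable dropout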
Optimizer
‣ Adam (Kingma and Ba, ICLR 2015): very widely used. Adaptive step size + momentum
‣ Wilson et al. NeurIPS 2017: adaptive methods can actually perform badly at test time (Adam is in pink, SGD in black)
‣ One more trick: gradient clipping (set a max value for your gradients)
seg-11a
Fairness in Classification
‣ Classifiers can be used to make real-world decisions:
‣ Is someone a criminal based on their face? Don't do this!
‣ Who gets an interview?
‣ Who should we lend money to?
‣ Is this online activity suspicious?
‣ Humans making these decisions are typically subject to anti-discrimination laws; how do we ensure classifiers are fair in the same way?
‣ Many other factors to consider when deploying classifiers in the real world (e.g., impact of a false positive vs. a false negative) but we'll focus on fairness here
Evaluating Fairness
Idea 1: Classifiers need to be evaluated beyond just accuracy
‣ T. Anne Cleary (1966-1968): a test is biased if prediction on a subgroup makes consistent nonzero prediction errors compared to the aggregate
‣ Individuals of X group could still score lower on average. But the errors should not be consistently impacting X
[Figure: test result vs. ground truth — a member of π1 has a test result higher than a member of π2 for the same ground truth ability; the test penalizes π2]
Hutchinson and Mitchell (2018)
Evaluating Fairness
Petersen and Novick (1976)
Hutchinson and Mitchell (2018)
Idea 1: Classifiers need to be evaluated beyond just accuracy
‣ Thorndike (1971), Petersen and Novick (1976): fairness in classification: the ratio of predicted positives to ground truth positives must be approximately the same for each group
‣ Allows for different criteria across groups: imposing different classification thresholds actually can give a fairer result
‣ Group 1: 50% positive movie reviews. Group 2: 60% positive movie reviews
‣ A classifier classifying 50% positive in both groups is unfair, regardless of accuracy
‣ Can't we just make our classifiers not depend on sensitive features like gender?
Discrimination
Idea 2: It is easy to build classifiers that discriminate even without meaning to
‣ A feature might correlate with minority group X and penalize that group:
‣ Bag-of-words features can identify particular dialects of English like AAVE or code-switching (using two languages). Impacts classification on social media, etc.
‣ ZIP code as a feature is correlated with race
‣ Reuters: "Amazon scraps secret AI recruiting tool that showed bias against women"
Credit: https://www.reuters.com/article/us-amazon-com-jobs-automation-insight/amazon-scraps-secret-ai-recruiting-tool-that-showed-bias-against-women-idUSKCN1MK08G
‣ "Women's X" organization, women's colleges were negative-weight features
‣ Accuracy will not catch these problems, very complex to evaluate depending on what humans did in the actual recruiting process
Takeaways
‣ What minority groups in the population should I be mindful of? (Review sentiment: movies with female directors, foreign films, …)
‣ Do aspects of my system or features it uses introduce potential correlations with protected classes or minority groups?
‣ Can I check one of these fairness criteria?
Word Embeddings
Vocab = 10,000 words
"movie was good": the bag-of-words vector is almost all zeros, with 1s at the indices for "movie", "was", "good"
"film is great" = [… film … is … great …] is orthogonal to "movie was good" even though the meanings are similar
Word embeddings: low-dimensional (50-300 dim) representations of words capturing their similarity, e.g. movie ≈ film
How to learn embeddings
J.R. Firth 1957, distributional hypothesis: "You shall know a word by the company it keeps"
  I watched the movie          I developed the film
  The movie inspired me        I watched the film in the darkroom
                               The film inspired me
Brown clusters, …
Mikolov et al. 2013, "word2vec": each word w has a word vector v_w and a context vector c_w; predict each word's context given that word
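A minimal sketch of a skip-gram-style objective under that setup (word and context vectors; illustrative only, no negative sampling):

import numpy as np

def skipgram_logprob(center_id, context_id, W, C):
    # W: [vocab, dim] word vectors, C: [vocab, dim] context vectors
    scores = C @ W[center_id]                          # score every possible context word
    log_probs = scores - np.logaddexp.reduce(scores)   # log softmax over the vocabulary
    return log_probs[context_id]                       # log P(context | center)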
Optimization Basics
Find w̄ to minimize the loss; search over the space of parameters
Training objective: Σ_{i=1}^D loss(x̄^(i), y^(i), w̄) over the training data — a function of w̄
Stochastic gradient descent: repeatedly pick an example i and update
  w̄ ← w̄ − α ∇_w̄ loss(x̄^(i), y^(i), w̄)      (loss on the ith example; α = step size)
Step size example: suppose loss = w² (one feature), starting at w = −1
  The gradient is 2w, so the update is w ← w − 2αw
  If α = 1: w oscillates between −1 and +1 and never converges
  If α = 1/2: w → 0
Choosing Step Size
How to choose the step size?
– Try them out: 1e0, 1e-1, 1e-2, …
– Go from large → small, e.g. 1/t for epoch t (or 1/√t, …) as a fixed schedule, or decrease the step size when performance stagnates on held-out data
Newton's method: w̄ ← w̄ − (∂²L/∂w̄²)^{−1} ∇L — the inverse Hessian is very expensive to compute
Adagrad, Adadelta, Adam: "adaptive" methods, (cheap) approximations to the inverse Hessian (linear in the number of parameters)
Regularization: don't really use [it much in practice here]
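A tiny sketch of the worked example above, L(w) = w² with gradient 2w (illustrative only):

def sgd(w=-1.0, alpha=0.5, steps=10):
    for _ in range(steps):
        grad = 2 * w
        w = w - alpha * grad   # alpha=1.0 oscillates between -1 and +1; alpha=0.5 jumps straight to 0
    return w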
seg-29
Named Entity Recognition
Barack Obama will travel to Hangzhou today for the G20 meeting .
PERSON LOC ORG
B-PER I-PER O O O B-LOC O O O B-ORG O O
‣ BIO tagset: begin, inside, outside
‣ Sequence of tags — should we use an HMM?
‣ Why might an HMM not do so well here?
‣ Lots of O's
‣ Insufficient features/capacity with multinomials (especially for unks)
‣ Motivating example for CRFs
Conditional Random Fields
‣ Flexible discriminative model for tagging tasks that can use arbitrary features of the input. Similar to logistic regression, but structured
Barack Obama will travel to Hangzhou today for the G20 meeting .
B-PER I-PER
Curr_word=Barack & Label=B-PER
Next_word=Obama & Label=B-PER
Curr_word_starts_with_capital=True & Label=B-PER
Posn_in_sentence=1st & Label=B-PER
Label=B-PER & Next-Label=I-PER
…
Conditional Random Fields
‣ CRFs: discriminative model with the following form:
P(y|x) = (1/Z) Π_k exp(φ_k(x, y)),   where Z is the normalizer
‣ There is a set of potentials φ_k that each assign a score to x and y; each can be any real-valued scoring function of its arguments
‣ Special case: linear feature-based potentials φ_k(x, y) = w^⊤ f_k(x, y), giving
P(y|x) = (1/Z) exp(Σ_{k=1}^n w^⊤ f_k(x, y))
‣ Looks like our single weight vector multiclass logistic regression model
Log-linear CRFs in General
‣ Conditional model: x is observed, y isn't
‣ Each square in the factor graph represents a potential (φ_k for features f_k) and touches the y's and x's it depends on
[Factor graph figure: potentials φ1, φ2, φ3 over variables y1, y2 and x1, x2, x3]
P(y|x) = (1/Z) exp(Σ_{k=1}^n w^⊤ f_k(x, y))
Z = Σ_{y'} exp(Σ_{k=1}^n w^⊤ f_k(x, y'))
y_best = argmax_{y'} exp(Σ_{k=1}^n w^⊤ f_k(x, y'))
‣ Problem: intractable inference! Sum or max over exponentially many y, even just to compute Z
Sequential CRFs
‣ Omit x from this factor graph. Every feature function can use x because the model conditions on it
[Factor graph figure: chain y1 — y2 — … — yn with transition factors φ_t between adjacent tags and emission factors φ_e on each tag]
‣ Two types of factors: transitions (look at adjacent y's, but not x) and emissions (look at y and all of x)
Sequential CRFs (expressed with potentials φ):
P(y|x) = (1/Z) Π_{i=2}^n exp(φ_t(y_{i−1}, y_i)) Π_{i=1}^n exp(φ_e(y_i, i, x))
Sequential CRFs
Special case: log-linear with transition and emission features:
P(y|x) = (1/Z) exp w^⊤ [ Σ_{i=2}^n f_t(y_{i−1}, y_i) + Σ_{i=1}^n f_e(y_i, i, x) ]
‣ Spoiler: this structure is going to allow us to use dynamic programming (Viterbi) to sum or max over all sequences!
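A hedged sketch of Viterbi decoding for the factorization above, with precomputed log-space scores (array names and shapes are assumptions, not lecture code):

import numpy as np

def viterbi(emission, transition):
    # emission: [n, K] scores phi_e(y_i, i, x); transition: [K, K] scores phi_t(y_prev, y_curr)
    n, K = emission.shape
    best = np.zeros((n, K)); back = np.zeros((n, K), dtype=int)
    best[0] = emission[0]
    for i in range(1, n):
        cand = best[i - 1][:, None] + transition + emission[i][None, :]  # [K_prev, K_curr]
        back[i] = cand.argmax(axis=0)
        best[i] = cand.max(axis=0)
    # Follow backpointers from the best final tag
    y = [int(best[-1].argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i][y[-1]]))
    return y[::-1]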
HMMs vs. CRFs
‣ Both models are expressible in different factor graph notation
HMM:  P(y, x) = P(y_1) P(x_1|y_1) P(y_2|y_1) P(x_2|y_2) … = P(y_1) Π_{i=2}^n P(y_i|y_{i−1}) Π_{i=1}^n P(x_i|y_i)
CRF:  P(y|x) = (1/Z) Π_{i=2}^n exp(φ_t(y_{i−1}, y_i)) Π_{i=1}^n exp(φ_e(y_i, i, x))
‣ If we squint, an exponentiated potential plays a similar role as a probability
HMMs vs. CRFs
P(y|x) = (1/Z) exp w^⊤ [ Σ_{i=2}^n f_t(y_{i−1}, y_i) + Σ_{i=1}^n f_e(y_i, i, x) ]
‣ HMMs: in the standard HMM, emissions consider one word at a time
‣ CRFs support features over many words simultaneously, non-independent features (e.g., suffixes and prefixes), and don't have to be a generative model
‣ Naive Bayes : logistic regression :: HMMs : CRFs
local vs. global normalization <-> generative vs. discriminative
(locally normalized discriminative models do exist (MEMMs))
seg-15
Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives
y = x * x   --(codegen)-->   (y, dy) = (x * x, 2 * x * dx)
‣ Computation is now something we need to reason about symbolically; use a library like PyTorch (or Tensorflow)
PyTorch
‣ Framework for defining computations that provides easy access to derivatives
‣ Module (torch.nn.Module): defines a neural network (can wrap other modules which implement predefined layers)
forward(x):
    # Takes an example x and computes result
backward():  # produced automatically
    # Computes gradient after forward() is called
‣ If forward() uses crazy math, you have to write backward yourself
Neural Networks for Classification
[Network diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → probs P(y|x)]
P(y|x) = softmax(W g(V f(x)))
Computation Graphs in PyTorch
P(y|x) = softmax(W g(V f(x)))
class FFNN(nn.Module):
    def __init__(self, input_size, hidden_size, out_size):
        super(FFNN, self).__init__()
        self.V = nn.Linear(input_size, hidden_size)
        self.g = nn.Tanh()  # or nn.ReLU(), sigmoid()…
        self.W = nn.Linear(hidden_size, out_size)
        self.softmax = nn.Softmax(dim=0)
‣ Define the forward pass (calling the module, ffnn(x), is syntactic sugar for forward):
    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))
Input to Network
‣ Whatever you define with torch.nn needs its input as some sort of tensor, whether it's integer word indices or real-valued vectors
def form_input(x) -> torch.Tensor:
    # Index words/embed words/etc.
    return torch.from_numpy(x).float()
‣ torch.Tensor is a different data structure from a numpy array, but you can translate back and forth fairly easily
‣ Note that translating out of PyTorch will break backpropagation; don't do this inside your Module
Training and Optimization
P(y|x) = softmax(W g(V f(x)))
ffnn = FFNN(inp, hid, out)
optimizer = optim.Adam(ffnn.parameters(), lr=lr)
for epoch in range(0, num_epochs):
    for (input, gold_label) in training_data:
        ffnn.zero_grad()  # clear gradient variables
        probs = ffnn.forward(input)
        # gold_label: one-hot vector of the label (e.g., [0, 1, 0])
        loss = torch.neg(torch.log(probs)).dot(gold_label)  # negative log-likelihood of correct answer
        loss.backward()
        optimizer.step()
Computing Gradients with Backprop
class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        nn.init.uniform(self.V.weight)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)
‣ Initializing to a nonzero value is critical!
Training a Model
Define modules, etc.
Initialize weights and optimizer
For each epoch:
    For each batch of data:
        Zero out gradient
        Compute loss on batch
        Autograd to compute gradients and take step on optimizer
    [Optional: check performance on dev set to identify overfitting]
Run on dev/test set
seg-14
Vectorization and Softmax
‣ Single scalar probability vs. three classes with "different weights":
[w_1^⊤ f(x), w_2^⊤ f(x), w_3^⊤ f(x)] = [-1.1, 2.1, -0.4]   --softmax-->   class probs [0.036, 0.89, 0.07]
‣ Softmax operation = "exponentiate and normalize":
P(y|x) = exp(w_y^⊤ f(x)) / Σ_{y'∈Y} exp(w_{y'}^⊤ f(x))
‣ We write this as: softmax(W f(x))
Logistic Regression with NNs
‣ Single scalar probability: P(y|x) = exp(w_y^⊤ f(x)) / Σ_{y'∈Y} exp(w_{y'}^⊤ f(x))
‣ Weight vector per class; W is [num classes x num feats]: P(y|x) = softmax(W f(x))
‣ Now one hidden layer: P(y|x) = softmax(W g(V f(x)))
Neural Networks for Classification
[Network diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → probs P(y|x)]
P(y|x) = softmax(W g(V f(x)))
Training Neural Networks
‣ Maximize log likelihood of training data
z = g(V f(x)),   P(y|x) = softmax(Wz)
L(x, i*) = log P(y = i* | x) = log(softmax(Wz) · e_{i*})
         = Wz · e_{i*} − log Σ_j exp(Wz) · e_j
‣ i*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere. Dot by this = select ith index
Backpropagation Picture
[Diagram: f(x) → V → g → z → W → softmax → P(y|x), with gradient ∂L/∂W flowing into W]
P(y|x) = softmax(W g(V f(x)))
‣ Gradient w.r.t. W: looks like logistic regression, can be computed treating z as the features
Backpropagation Picture
‣ Can forget everything after z, treat it as the output and keep backpropping: err(z)
Computing Gradients with Backprop
L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j,   z = g(V f(x)) = activations at the hidden layer,   a = V f(x)
‣ Gradient with respect to V: apply the chain rule
∂L(x, i*)/∂V_ij = (∂L(x, i*)/∂z) (∂z/∂V_ij)
‣ First term: err(z); represents the gradient w.r.t. z
∂z/∂V_ij = (∂g(a)/∂a) (∂a/∂V_ij)
‣ First term: gradient of the nonlinear activation function at a (depends on its current value)
‣ Second term: gradient of a linear function
Backpropagation Picture
[Diagram: f(x) (n features) → V → z (d hidden units) → W → softmax → probs, with gradients ∂L/∂W and ∂z/∂V]
P(y|x) = softmax(W g(V f(x)))
‣ Combine backward gradients err(z) with forward-pass products
seg-28
Trigram Taggers
Fed raises interest rates 0.5 percent
NNP VBZ NN NNS CD NN
‣ Normal HMM "bigram" model: y1 = NNP, y2 = VBZ, …
‣ Trigram model: y1 = (<s>, NNP), y2 = (NNP, VBZ), …
‣ Probabilities now look like P((NNP, VBZ) | (<s>, NNP)) — more context! We know the verb is occurring two words after the start of the sentence
‣ Tradeoff between model capacity and data size — trigrams are a "sweet spot" for POS tagging
HMM POS Tagging
‣ Penn Treebank English POS tagging: 44 tags
‣ Baseline: assign each word its most frequent tag: ~90% accuracy
‣ Trigram HMM: ~95% accuracy / 55% on words not seen in train
‣ TnT tagger (Brants 1998, tuned HMM): 96.2% acc / 86.0% on unks
‣ MaxEnt tagger (Toutanova + Manning 2000): 96.9% / 87.0%
‣ State-of-the-art (BiLSTM-CRFs, BERT): 97.5% / 89%+
POS Errors
Toutanova + Manning (2000)
official knowledge (JJ/NN NN)   made up the story (VBD RP/IN DT NN)   recently sold shares (RB VBD/VBN NNS)
(NN NN: tax cut, art gallery, …)
Remaining Errors
‣ Lexicon gap (word not seen with that tag in training): 4.5% of errors
‣ Unknown word: 4.5%
‣ Could get right: 16% (many of these involve parsing!)
‣ Difficult linguistics: 20%
‣ Underspecified / unclear, gold standard inconsistent / wrong: 58%
They set up absurd situations, detached from reality — VBD / VBP? (past or present?)
a $ 10 million fourth-quarter charge against discontinued operations — adjective or verbal participle? JJ / VBN?
Manning 2011 "Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?"
POS with Feedforward Networks
Fed raises interest rates in order to …
‣ Word embeddings for each word form the input: f(x) = [emb(raises); emb(interest); emb(rates)] (previous word, curr word, next word; other words, feats, etc.)
‣ f(x) doesn't look like a bag-of-words, instead captures position-sensitive information
POS with Feedforward Networks
Botha et al. (2017)
‣ Hidden layer mixes these different signals and learns feature conjunctions
‣ Use bag-of-character bigram + trigram embeddings for each word
‣ Botha et al. (2017): small FFNNs for NLP tasks
POS with Feedforward Networks
Botha et al. (2017)
‣ Works well on a range of languages
‣ Better than a RNN-based approach (Gillick et al., 2016)
Multiclass Classification
y ∈ {1, …, N}
Two approaches: n binary classifiers (one-vs-all), or a single multiclass classifier with either one weight vector per class (different weights, DW) or different features per class (DF)
  DW: argmax_y w̄_y^⊤ f(x̄)        DF: argmax_y w̄^⊤ f(x̄, y)
Topic Classification
x̄ = "too many drug trials, too few patients",  y ∈ {health, sports, science}
f(x̄) = bag-of-unigrams over {drug, patients, baseball}: f(x̄) = [1, 1, 0]
DW: w̄_health = [2, 5.6, −3], w̄_sports = [1.2, −3.1, 5.7] ⇒ scores 7.6 vs. −1.9
DF: f(x̄, y) = f(x̄) replicated for each class
  f(x̄, y = health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
  f(x̄, y = sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]    (Indicator[sentence contains word i ∧ y = sports])
  w̄ = [2, 5.6, −3, 1.2, −3.1, 5.7, …]
Multiclass Perceptron + Logistic Regression
Multiclass Perceptron
Examples: x = [1, 1, 0], y = 1 (health): "too many drug trials, too few patients"; x = [1, 0, 1], y = 2 (sports): "baseball players taking drugs"
for t in epochs:
  for i in data:
    y_pred = argmax_y w̄_y^⊤ f(x^(i))
    if y_pred ≠ y^(i):
      w̄_{y^(i)} ← w̄_{y^(i)} + α f(x^(i))
      w̄_{y_pred} ← w̄_{y_pred} − α f(x^(i))
In the DF view this is w̄ ← w̄ + α [f(x^(i), y^(i)) − f(x^(i), y_pred)], e.g. an update on [1, 1, 0] with gold y = 1 and predicted y = 3 adds α [1, 1, 0, 0, 0, 0, −1, −1, 0]
Multiclass LR
P(y | x̄) = exp(w̄^⊤ f(x̄, y)) / Σ_{y'} exp(w̄^⊤ f(x̄, y'))    (must sum to 1 over y)
In the binary case: y = +1 gives e^{w̄^⊤ f(x̄)}, y = −1 gives 1 = e^0
Gradient of the loss on (x̄^(i), y^(i)): −f(x̄, y^(i)) + Σ_{y'} P(y' | x̄) f(x̄, y')
  If P(y^(i) | x̄) = 1:  −f(x̄, y^(i)) + f(x̄, y^(i)) = 0
  If P(y_bad | x̄) ≈ 1:  −f(x̄, y^(i)) + f(x̄, y_bad)
SGD: w̄ ← w̄ − α ∇_w̄ loss
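A minimal sketch of the multiclass perceptron loop above (different-weights view, one weight vector per class; toy data and names are illustrative):

import numpy as np

def multiclass_perceptron(data, num_classes, num_feats, epochs=10, alpha=1.0):
    W = np.zeros((num_classes, num_feats))
    for _ in range(epochs):
        for f_x, y in data:                      # f_x: feature vector, y: gold class index
            y_pred = int(np.argmax(W @ f_x))
            if y_pred != y:
                W[y] += alpha * f_x              # boost the gold class
                W[y_pred] -= alpha * f_x         # penalize the predicted class
    return W

# e.g. data = [(np.array([1., 1., 0.]), 0), (np.array([1., 0., 1.]), 1)]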
seg-38
Lexicalized Parsing
[Figure: lexicalized tree for "the dog ran to the house": S(ran) → NP(dog) VP(ran); VP(ran) → VBD(ran) PP(to); PP(to) → TO(to) NP(house); NP(dog) → DT(the) NN(dog); NP(house) → DT(the) NN(house)]
Dependency Parsing
[Figure: dependency tree for "the dog ran to the house" with POS tags DT NN VBD TO DT NN and ROOT]
‣ Dependency syntax: syntactic structure is defined by these arcs
‣ Head (parent, governor) connected to dependent (child, modifier)
‣ Each word has exactly one parent except for the ROOT symbol, dependencies must form a directed acyclic graph
‣ POS tags same as before, often run a tagger first as preprocessing
Dependency Parsing
[Figure: the same dependency tree drawn hierarchically]
‣ Still a notion of hierarchy! Subtrees often align with constituents
Labeled Dependency Parsing
the dog ran to the house
‣ Can label dependencies according to syntactic function: det, nsubj, prep, pobj, det
‣ Major source of ambiguity is in the structure, so we focus on that more (labeling separately with a classifier works pretty well)
Dependency vs. Constituency
‣ Constituency: several rule productions need to change
the children ate the cake with a spoon
‣ Dependency: one parent differs
Dependency vs. Constituency
dogs in houses and cats:   [dogs in houses] and cats   vs.   dogs in [houses and cats]
‣ Constituency: ternary rule NP -> NP CC NP
‣ Dependency: first item is the head. Doesn't really make sense
Stanford Dependencies
‣ Designed to be practically useful for relation extraction
Standard vs. Collapsed
Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas
Universal Dependencies
‣ Annotate dependencies with the same representation in many languages (English, Bulgarian, Czech, Swiss, …)
‣ Dependencies are more portable cross-lingually: languages with free word order are not well handled by constituency
Projectivity
‣ Any subtree is a contiguous span of the sentence <-> the tree is projective
‣ Equivalent to drawing the structure and having no crossing arcs
‣ Crossing arcs / nonprojective:
credit: Language Log
Projectivity
‣ Swiss German example: X said [that we helped Hans paint the house]
‣ Many trees in other languages are nonprojective
Pitler et al. (2013)
seg-39
Transition-based Parsing
‣ We can build a dependency parser using a chart-based algorithm like CKY
‣ Time: O(n³), but the algorithm is very tricky!
‣ Transition-based, or shift-reduce, is another style of parser; similar to deterministic parsing for compilers
‣ A tree is built from a sequence of incremental decisions moving left to right through the sentence
‣ Stack contains the partially-built tree, buffer contains the rest of the sentence
Transition System
I ate some spaghetti bolognese   (ROOT)
‣ Initial state: Stack: [ROOT]  Buffer: [I ate some spaghetti bolognese]
‣ Shift: top of buffer -> top of stack
‣ Shift 1: Stack: [ROOT I]  Buffer: [ate some spaghetti bolognese]
‣ Shift 2: Stack: [ROOT I ate]  Buffer: [some spaghetti bolognese]
Transition System
‣ State: Stack: [ROOT I ate]  Buffer: [some spaghetti bolognese]
‣ Left-arc (reduce): let σ denote the stack, σ|w = a stack ending in w
  σ|w−2, w−1 → σ|w−1;  w−2 is now a child of w−1
‣ "Pop two elements, add an arc, put them back on the stack"
‣ State: Stack: [ROOT ate] (with child I)  Buffer: [some spaghetti bolognese]
Arc-Standard Parsing
‣ Arc-standard system: three operations
‣ Shift: top of buffer -> top of stack
‣ Left-Arc: σ|w−2, w−1 → σ|w−1;  w−2 is now a child of w−1
‣ Right-Arc: σ|w−2, w−1 → σ|w−2;  w−1 is now a child of w−2
‣ Start: stack: [ROOT], buffer: [I ate some spaghetti bolognese]
‣ End: stack contains [ROOT], buffer is empty []
‣ How many transitions do we need if we have n words in a sentence?
‣ There are other transition systems, but we won't discuss these; a small code sketch of the three operations follows
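A hedged sketch of the arc-standard operations above (data structures and names are assumptions; a real parser would also track labels and check legality):

def shift(stack, buffer, arcs):
    stack.append(buffer.pop(0))       # top of buffer -> top of stack

def left_arc(stack, buffer, arcs):
    child = stack.pop(-2)             # w_-2 becomes a child of w_-1
    arcs.append((stack[-1], child))   # (head, dependent)

def right_arc(stack, buffer, arcs):
    child = stack.pop(-1)             # w_-1 becomes a child of w_-2
    arcs.append((stack[-1], child))

# Example derivation for "I ate spaghetti":
stack, buffer, arcs = ["ROOT"], ["I", "ate", "spaghetti"], []
for op in [shift, shift, left_arc, shift, right_arc, right_arc]:
    op(stack, buffer, arcs)
# arcs == [("ate", "I"), ("ate", "spaghetti"), ("ROOT", "ate")]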
Arc-Standard Parsing
I ate some spaghetti bolognese   (ROOT)
S: top of buffer -> top of stack;  LA: pop two, left arc between them;  RA: pop two, right arc between them
      [ROOT]                              [I ate some spaghetti bolognese]
S  →  [ROOT I]                            [ate some spaghetti bolognese]
S  →  [ROOT I ate]                        [some spaghetti bolognese]
LA →  [ROOT ate]  (I <- ate)              [some spaghetti bolognese]
‣ Could do the left arc later! But no reason to wait
‣ Can't attach ROOT <- ate yet even though this is a correct dependency!
Arc-Standard Parsing
S  →  [ROOT ate some]                     [spaghetti bolognese]
S  →  [ROOT ate some spaghetti]           [bolognese]
LA →  [ROOT ate spaghetti]  (some <- spaghetti)   [bolognese]
S  →  [ROOT ate spaghetti bolognese]      []
Arc-Standard Parsing
RA →  [ROOT ate spaghetti]  (bolognese <- spaghetti)   []
RA →  [ROOT ate]            (spaghetti <- ate)         []
RA →  [ROOT]                (ate <- ROOT)              []
Final state: stack [ROOT], empty buffer
‣ Stack consists of all words that are still waiting for right children; end with a bunch of right-arc ops
Building Transition-Based Parsers
‣ Multi-way classification problem: shift, left-arc, or right-arc?
argmax_{a ∈ {S, LA, RA}} w^⊤ f(stack, buffer, a)
[ROOT]  [I ate some spaghetti bolognese]
‣ How do we make the right decision in this case? Only one legal move (shift)
[ROOT ate some spaghetti] (I <- ate)  [bolognese]
‣ How do we make the right decision in this case? (all three actions legal)
Features for Shift-Reduce Parsing
[ROOT ate some spaghetti] (I <- ate)  [bolognese]
‣ Features to know this should left-arc?
‣ One of the harder feature design tasks!
‣ In this case: the stack tag sequence VBD – DT – NN is pretty informative — looks like a verb taking a direct object which has a determiner in it
‣ Things to look at: top words/POS of buffer, top words/POS of stack, leftmost and rightmost children of top items on the stack
Training a Greedy Model
argmax_{a ∈ {S, LA, RA}} w^⊤ f(stack, buffer, a)
‣ Can turn a tree into a decision sequence a by building an oracle
‣ Train a classifier to predict the right decision using these as training data
‣ Training data assumes you made correct decisions up to this point and teaches you to make the correct decision, but what if you screwed up…
Training a Greedy Model
‣ Greedy: 2n local training examples
[Figure: state space from start state to gold end state]
‣ Non-gold states are unobserved during training: we consider making bad decisions but don't condition on bad decisions
seg-11
Text Classification
~20 classes: Sports, Health, …
‣ 20 Newsgroups, Reuters, Yahoo! Answers, …
Textual Entailment
Bowman et al. (2015)
‣ Three-class task over sentence pairs:
A black race car starts up in front of a crowd of people. / A man is driving down a lonely road. — CONTRADICTS
A soccer game with multiple males playing. / Some men are playing a sport. — ENTAILS
A smiling costumed woman is holding an umbrella. / A happy woman in a fairy costume holds an umbrella. — NEUTRAL
‣ Not clear how to do this with simple bag-of-words features
Entity Disambiguation/Entity Linking
‣ 4,500,000 classes (all articles in Wikipedia)
Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.
Which Armstrong? "Lance Edward Armstrong is an American former professional road cyclist" vs. "Armstrong County is a county in Pennsylvania…"
‣ Needs a very different structure for f(x,y) to accommodate so many classes
Authorship Attribution
‣ Early work: Shakespeare's plays, Federalist papers (Hamilton v. Madison)
‣ Statistical methods date back to 1930s and 1940s
‣ Based on handcrafted heuristics like stopword frequencies
‣ Twitter: given a bunch of tweets, can we figure out who wrote them?
‣ Schwartz et al. EMNLP 2013: 500M tweets, take 1000 users with at least 1000 tweets each
‣ Task: given a held-out tweet by one of the 1000 authors, who wrote it?
Authorship Attribution
Schwartz et al. (2013)
‣ SVM with character 4-grams, word 2-grams through 5-grams
50-author task
Authorship Attribution
Schwartz et al. (2013)
‣ k-signature: n-gram that appears in k% of the author's tweets but not appearing for anyone else — suggests why these are so effective
seg-13
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Warp space / Shift / Nonlinear transformation
z = g(V f(x) + b)
y_pred = argmax_y w_y^⊤ z
‣ Ignore the shift / +b term for the rest of the course
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Linear classifier vs. neural network: linear classification in the transformed space!
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
z_1 = g(V_1 f(x))
z_2 = g(V_2 z_1)
…
y_pred = argmax_y w_y^⊤ z_n
Neural Nets
Feature table (features: contains "good", contains "bad", contains "not"):
  good:     [1, 0, 0]   (+1)
  bad:      [0, 1, 0]   (−1)
  not good: [1, 0, 1]
  not bad:  [0, 1, 1]
NNs: transform the data into a latent feature space — replace f(x̄) with a nonlinear function of the original f(x̄)
Define z̄ = g(V f(x̄)); classify with w̄^⊤ z̄
  V: d × (#features) matrix, z̄: d-dimensional feature vector, g: nonlinearity (g = tanh or g = ReLU)
How can V + g give us useful latent features?
NN example
Suppose g = tanh and V maps [x_1, x_2] to [x_1, x_2, x_1 + x_2], i.e. V = [[1, 0], [0, 1], [1, 1]]
  tanh(0) = 0, tanh(1) ≈ 1, tanh(2) ≈ 1
  z̄ = g(V x̄) = [tanh(x_1), tanh(x_2), tanh(x_1 + x_2)]
Sentiment Analysis and Basic Feature Extraction
  "the movie was great! would watch again!"   (+)
  "the film was awful; I'll never watch it again"   (−)
① text x̄ ⇒ f(x̄): feature extraction
② {(f(x̄^(i)), y^(i))}: dataset of D labeled examples ⇒ train classifier
Feature extraction
"the movie was great"
Bag-of-words: assume 10,000 words in the vocabulary
  f = [… the … movie … was … great …]: 4 ones, 9,996 zeros
  counts (how many "the" are present) or presence/absence (0/1)
n-gram: sequence of n consecutive words
Bag-of-n-grams, 2-grams: "the movie", "movie was", "was great"
tf-idf
  tf: count of the term
  idf: inverse document frequency = log(total # docs / # docs containing w)
  score = tf × idf
Preprocessing
① Tokenization: "was great!" → [… "great", "!" …] rather than "great!"; "wasn't" → "was" + "n't"
② [Sometimes] Stopword removal (the, of, a, …)
③ [Sometimes] Casing (lowercasing, truecasing)
④ Handling unknown words: Durrett ⇒ UNK
⑤ Indexing: map each {word, n-gram} into ℕ; use a map
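A small illustrative sketch of the bag-of-words + indexing steps above (toy tokenizer and vocabulary, not the course's reference code):

from collections import Counter

def tokenize(text):
    return text.lower().replace("!", " !").split()

def extract_features(text, vocab_index):
    counts = Counter(tokenize(text))
    feats = [0] * len(vocab_index)
    for word, c in counts.items():
        idx = vocab_index.get(word)      # unknown words could instead map to an UNK index
        if idx is not None:
            feats[idx] = c               # use 1 instead of c for presence/absence features
    return feats

vocab_index = {"the": 0, "movie": 1, "was": 2, "great": 3, "!": 4}
print(extract_features("the movie was great!", vocab_index))  # [1, 1, 1, 1, 1]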
seg-37
PCFG Independence Assumptions
‣ Language is not context-free: NPs in different contexts rewrite differently
‣ [They]_NP received [the package of books]_NP
[Table: how often an NP rewrites as NP PP / DT NN / PRP — all NPs: 11% / 9% / 6%; NPs under S: 9% / 9% / 21%; NPs under VP: 23% / 7% / 4%]
Vertical Markovization
[Trees for "She saw it": basic tree (v = 1) with S → NP VP, NP → PRP, VP → VBD PRP; v = 2 Markovization annotates each symbol with its parent: S^ROOT → NP^S VP^S, VP^S → VBD^VP PRP^VP, NP^S → PRP^NP]
‣ Why is this a good idea?
Binarization Revisited
VP → VBZ NP PP PP   ("sold books to her for $50")
‣ Another way of doing lossless binarization (abusing notation slightly): introduce intermediate symbols VP [VBZ], VP [VBZ NP], VP [VBZ NP PP]
‣ Equivalent to the original because of the chain rule of probability:
P(VP -> VBZ NP PP PP) = P(VBZ | VP) P(NP | VP [VBZ]) P(PP | VP [VBZ NP]) P(PP | VP [VBZ NP PP])
Horizontal Markovization
VP → VBZ NP PP PP   ("sold books to her for $50")
‣ Changes the amount of context remembered in binarization (h = ∞: remember all):
  h = 2: VP [… VBZ NP],  h = 1: VP [… NP],  h = 0: VP
  h = 2: VP [… VBZ],     h = 1: VP [… VBZ], h = 0: VP
‣ In practice: always remember the head tag
Annotating Trees
‣ First apply vertical Markovization, then binarize + apply horizontal
[Trees: VP^S → VBZ^VP NP^VP PP^VP under S^ROOT → NP^S VP^S, with binarized symbols such as VP^S [… VBZ^VP]]
Annotating Trees
Klein and Manning (2003)
Lexicalization
§ What's different between basic PCFG scores here?
§ What (lexical) correlations need to be scored?
‣ Even with parent annotation, these trees have the same rules. Need to use the words
Lexicalization
‣ Annotate each grammar symbol with its "head word": the most important word of that constituent
§ Add "head words" to each phrasal node
§ Syntactic vs. semantic heads
§ Headship not in (most) treebanks
§ Usually use head rules, e.g.:
  § NP: take leftmost NP; take rightmost N*; take rightmost JJ; take right child
  § VP: take leftmost VB*; take leftmost VP; take left child
‣ Rules for identifying headwords (e.g., the last word of an NP before a preposition is typically the head)
‣ Collins and Charniak (late 90s): ~89 F1 with these
Sequence Labeling (Tagging), Tagging with Classifiers
Input x̄ = (x_1, x_2, …, x_n); output ȳ = (y_1, y_2, …, y_n), one prediction per word — structured classification
  Fed raises interest rates 0.5 percent
Predict each y_i independently with logistic regression: P(y_i = y | x̄, i), where i is the index we're predicting at (e.g. i = 3)
  f(x̄) = [1 0 0 1 0 0 0 …] for "Fed raises …"
  BOW: f(x̄, y = NN) = [… 1 … 1 …] is independent of i; a simple alternative is a single feature on the current word
Positional features: f(x̄, y = NN, i = 3) = [… 0 0 0 1 0 …]
Positional features with context: indicators for the current word, previous word, and next word
Conjunctions of properties: Indicator[curr word = interest ∧ tag = NN] — each conjunction acts like a "word" in a bag-of-words feature space
Problem with classification for tagging
  indicators → classifier: what goes wrong? Predictions of a classifier may be incoherent, e.g. in "Fed raises interest rates …", raises could come out VBZ or VBP and interest NNS or NN inconsistently
  Sequence modeling: Hidden Markov Models, Conditional Random Fields
Part-of-Speech Tagging
Ambiguous words: record (noun or verb), arms (noun or verb), …
POS tags: open class (new words can join these categories) vs. closed class
  Open class — Nouns: proper (IBM, Italy), common (cat, snow); Verbs: see, tweet; Adjectives; Adverbs: swiftly
  Closed class — Determiners: the, some; Conjunctions: and, or; Pronouns; Auxiliaries: had [+ VERB]; Modals: could; Prepositions: up, in, to; Particles: up in "made up"
Example
  Fed raises interest rates 0.5 percent
  Candidate tags: Fed: NNP / VBN / VBD; raises: VBZ / NNS; interest: NN / VB / VBP; rates: NNS / VBZ; 0.5: CD; percent: NN
  "I want to interest you" — interest as VB; "I fed …" — fed as VBD; "I had fed …" — fed as VBN
CKY Algorithm
argmax_T P(T | x̄) = argmax_T P(T, x̄)
Viterbi-style dynamic program; runtime O(n³ |G|)
Chart: t[i, j, X] = score of the best derivation of X over span (i, j)
  Base case: t[i, i+1, X] = log P(w_i | X)
  Recurrence: t[i, j, X] = max_{k, X_1, X_2} log P(X → X_1 X_2) + t[i, k, X_1] + t[k, j, X_2]
Example grammar fragment: S → NP VP (1.0), NP → NN NNS (0.5), VBZ → raises (1.0), PRP → it (1.0), …; log(0.5) = −1
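A hedged sketch of the CKY recurrence above for a binarized PCFG in log space (the data structures preterm[(tag, word)] and rules[(X, X1, X2)] holding log-probabilities are assumptions for illustration):

import math
from collections import defaultdict

def cky(words, preterm, rules, nonterms):
    n = len(words)
    t = defaultdict(lambda: -math.inf)   # t[(i, j, X)] = best log score of X over span (i, j)
    for i, w in enumerate(words):
        for X in nonterms:
            if (X, w) in preterm:
                t[(i, i + 1, X)] = preterm[(X, w)]
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for (X, X1, X2), logp in rules.items():
                for k in range(i + 1, j):
                    score = logp + t[(i, k, X1)] + t[(k, j, X2)]
                    if score > t[(i, j, X)]:
                        t[(i, j, X)] = score
    return t  # t[(0, n, "S")] is the best parse score (backpointers omitted)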
Basic Machine Learning, Gradient Descent
Parameters w̄ to optimize; labeled data {(x̄^(i), y^(i))}_{i=1}^D (input, label); suppose 10,000 features, so w̄ ∈ R^10,000
Searching for an optimal w̄
Optimization problem: formulate a training objective, which is (for us) a sum over examples, and optimize it: Σ_i loss(x̄^(i), y^(i), w̄)
Stochastic gradient descent:
  for t up to num epochs:
    for i up to D: sample j ∈ [1, D]
      w̄ ← w̄ − α ∇_w̄ loss(x̄^(j), y^(j), w̄)      (α = step size; the gradient points towards w̄ that gives higher loss)
Linear Binary Classification
Point x̄ (a string) → feature extractor f → f(x̄) ∈ R^D, e.g. f(x̄) = [3, −1, 2, 1]
Label y ∈ {−1, +1}
Classifier: weight vector w̄ defines a decision boundary; classify positive if w̄^⊤ f(x̄) + b ≥ 0 (we fold in or drop the bias b and do not use it explicitly)
seg-20
Bias in Word Embeddings
‣ Word embeddings are trained from the whole web, and this data encodes biases about the real world
‣ Identify the she – he axis in word vector space, project words onto this axis
‣ Nearest neighbor of (b – a + c)
Bolukbasi et al. (2016)
Manzini et al. (2019)
Debiasing
Bolukbasi et al. (2016)
‣ Identify the gender subspace with gendered words (she, he, woman, man, …)
‣ Project words onto this subspace
‣ Subtract those projections from the original word: homemaker → homemaker'
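A minimal sketch of that projection step, simplified to a single gender direction (illustrative only, not the Bolukbasi et al. reference code):

import numpy as np

def debias(word_vec, gender_dir):
    g = gender_dir / np.linalg.norm(gender_dir)
    projection = (word_vec @ g) * g        # component of the word along the gender direction
    return word_vec - projection           # e.g. homemaker -> homemaker'

# gender_dir could be estimated from differences of gendered word vectors, e.g. she - he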
Hardness of Debiasing
‣ Not that effective… and the male and female words are still clustered together
‣ Bias pervades the word embedding space and isn't just a local property of a few words
Gonen and Goldberg (2019)
seg-34
Syntax
‣ Study of word order and how words form sentences
‣ Why do we care about syntax?
‣ Recognize verb-argument structures (who is doing what to whom?)
‣ Multiple interpretations of words (noun or verb? Fed raises… example)
‣ Higher level of abstraction beyond words: some languages are SVO, some are VSO, some are SOV; parsing can canonicalize
Constituency Parsing
‣ Tree-structured syntactic analyses of sentences
‣ Constituents: (S)entence, (N)oun (P)hrases, (V)erb (P)hrases, (P)repositional (P)hrases, and more
‣ Bottom layer is POS tags
‣ Examples will be in English. Constituency makes sense for a lot of languages but not all
Constituency Parsing
sentential complement: whole embedded sentence
adverbial phrase
unary rule
ternary rule
Constituency Parsing
a refund that the court estimated *-1
Challenges
‣ PP attachment:
§ If we do no annotation, these trees differ only in one rule:
  § VP → VP PP
  § NP → NP PP
§ Parse will go one way or the other, regardless of words
§ Lexicalization allows us to be sensitive to specific words
same parse as "the cake with some icing"
Challenges
‣ Modifier scope: plastic cup holder — [plastic cup] holder vs. plastic [cup holder] (JJ NN NN)
‣ Complement structure: The students complained to the professor that they didn't understand
‣ Coordination scope: The man picked up his hammer and saw (compare: The man picked up his hammer and swung)
Eisenstein textbook
Constituency
‣ How do we know what the constituents are?
‣ Constituency tests:
‣ Substitution by proform (pronoun, do so)
‣ Clefting (It was with a spoon that…)
‣ Answer ellipsis (What did they eat? the cake) (How? with a spoon)
‣ Sometimes constituency is not clear, e.g., coordination: she went to and bought food at the store
Probabilistic Context-Free Grammars
CFG = (N, T, S, R): nonterminals, terminals, start symbol, rules (binary and unary)
  Nonterminals: S, NP, VP, PP; POS tags (preterminals): DT, NN, VBD, IN, NNS
  Terminals: the, children, ate, cake, spoon, …
  Example rules: S → NP VP; VP → VBD NP; NP → DT NN; NP → DT NNS; DT → the; NNS → children; NN → cake; NN → spoon; V → ate
PCFG: rules have probabilities, and the probabilities sum to one per parent: P(rule | parent(rule))
P(tree T) = Π_{r ∈ T} P(r | parent(r))
Steps to Parsing
  Treebank of sentences labeled with trees
  ① Binarization
  ② Estimation of rule probabilities (count and normalize — the maximum likelihood estimate)
  ③ Inference: argmax_T P(T | x̄)
Binarization can be lossless or lossy [garbled example comparing two binarizations of a VP]
seg-21
Applying Embeddings
‣ Approach 1: learn these embeddings as parameters from your data
‣ Approach 2: initialize word embeddings using GloVe, keep fixed
‣ Approach 3: initialize word embeddings with GloVe, fine-tune
‣ Faster because no need to update these parameters
‣ Works best for some tasks
‣ Often works pretty well
‣ First layer of your network: map from word indices to embeddings
‣ Can also evaluate embeddings intrinsically on tasks like word similarity
Deep Averaging Networks
‣ Deep Averaging Networks: feedforward neural network on average of word
embeddings from input
Iyyer et al. (2015)
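A minimal PyTorch sketch of this idea (the dimensions and the two-layer feedforward part are illustrative choices, not the exact Iyyer et al. architecture):

```python
import torch
import torch.nn as nn

class DeepAveragingNetwork(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=100, num_classes=2):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)
        self.ffnn = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, word_indices):
        # word_indices: LongTensor of shape [sentence length]
        avg = self.embeddings(word_indices).mean(dim=0)      # average of word embeddings
        return torch.log_softmax(self.ffnn(avg), dim=-1)     # log-probabilities over classes
```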
Deep Averaging Networks
‣ Contradicts a widely-held view
that we need to model syntactic
structure to represent language
‣ Simple averaging can work as
well as syntactic composition for
some problems!
Iyyer et al. (2015)
Deep Averaging Networks
{
{
Bag-of-words
Tree-structured
neural networks
Wang and Manning
(2012)
Kim (2014)
Iyyer et al. (2015)
No pretrained
embeddings
Iyyer et al. (2015)
Deep Averaging Networks
‣ Will return to compositionality with syntax and LSTMs
Iyyer et al. (2015)
seg-0
What is the goal of NLP?
‣ Be able to solve problems that require deep understanding of text
Siri, what’s your
favorite kind of
movie?
I like superhero
movies!
What’s come
out recently?
‣ Systems that talk to us: dialogue systems, machine translation,
summarization
The Avengers
What is the goal of NLP?
中共中央政治局7⽉30⽇召开会议,会议分析研究当前经
济形势,部署下半年经济⼯作。
The Political Bureau of the CPC Central Committee held a meeting on July 30 to
analyze and study the current economic situation and plan economic work in the
second half of the year.
People’s Daily, August 10, 2020
‣ Be able to solve problems that require deep understanding of text
‣ Systems that talk to us: dialogue systems, machine translation,
summarization
The Political Bureau
of the CPC Central
Committee July 30 hold a meeting
Translate
What is the goal of NLP?
When was Abraham Lincoln born?
February 12, 1809
Name Birthday
Lincoln, Abraham 2/12/1809
Washington, George 2/22/1732
Adams, John 10/30/1735
‣ Build systems that extract information from text and answer questions
The park has a total of five
visitor centers
five
How many visitor centers are there in Rocky Mountain National Park?
map to Birthday field
“Standard” NLP Pipeline
Syntactic parses
Coreference resolution
Entity disambiguation
Discourse analysis
Summarize
Extract information
Answer questions
Identify sentiment
Translate
Text Annotations        Text Analysis        Applications
‣ All of these components are modeled with statistical
approaches using machine learning
How do we represent language?
Labels
Sequences/tags
Trees
Text
the movie was good +
Beyoncé had one of the best videos of all time    subjective
Tom Cruise stars in the new Mission Impossible film
PERSON WORK_OF_ART
I eat cake with icing
PP
NP
S
NP
VP
VBZ NN
flights to Miami
λx. flight(x) ∧ dest(x)=Miami
How do we use these representations?
Labels
Sequences
Trees
Text AnalysisText
‣ Why is this prediction hard? Because language is complex and ambiguous!
‣ What ambiguities do we need to resolve?
…
Applications
Tree transducers (for machine
translation)
Extract syntactic features
Tree-structured neural networks
end-to-end models …
What makes language hard?
Teacher Strikes Idle Kids
Ban on Nude Dancing on Governor’s Desk
Iraqi Head Seeks Arms
What makes language hard?
‣ There aren't just one or two possibilities, but many!
It is really nice out
il fait vraiment beau
It’s really nice
The weather is beautiful
It is really beautiful outside
He makes truly beauFful
It fact actually handsome
“AI winter”
rule-based,
expert systems
A brief history of NLP techniques
Unsup: topic
models,
grammar induction
Collins vs.
Charniak
parsers
1980 1990 2000 2010 2020
earliest stat MT
work at IBM
Penn
treebank
NP VP
S
Ratnaparkhi
tagger
NNP VBZ
Sup: SVMs,
CRFs, NER,
Sentiment
Semi-sup,
structured
prediction
Neural/
pretrained
‣ This course focuses on supervised learning, semi-supervised methods, and
neural models (including pre-training)
‣ All of these models can handle ambiguity by learning how to map text into
linguistic representations using data
Outline of the course
‣ Classification: linear and neural, word representations
‣ Text analysis: tagging (HMMs, CRFs) and parsing (PCFGs)
‣ Language modeling and pre-training
‣ Question answering and semantics
‣ Machine translation
‣ Applications
Perceptron
Example: binary classifier, prediction = w⊤f(x)
  "movie good"  +1   f = [1 1 0 0]
  "movie bad"   -1   f = [1 0 1 0]
  "not good"    -1   f = [0 1 0 1]      (unigram features over the vocabulary {movie, good, bad, not})
for t up to num_epochs:
    for i up to D:
        y_pred = +1 if w⊤f(x⁽ⁱ⁾) ≥ 0, else -1
        w ← w                  if y_pred = y⁽ⁱ⁾
        w ← w + α f(x⁽ⁱ⁾)      else, if y⁽ⁱ⁾ = +1
        w ← w - α f(x⁽ⁱ⁾)      else, if y⁽ⁱ⁾ = -1
Example 2:  good +1 [1 0 0],  bad -1 [0 1 0],  not good -1 [1 0 1],  not bad +1 [0 1 1]
  → not linearly separable with unigram features
  ↳ Add bigram features
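A minimal NumPy sketch of this perceptron loop (the feature vectors and learning rate α are the illustrative values from the note above):

```python
import numpy as np

def train_perceptron(data, num_epochs=10, alpha=1.0):
    """data: list of (feature_vector, label) pairs with labels in {+1, -1}."""
    w = np.zeros(len(data[0][0]))
    for _ in range(num_epochs):
        for f, y in data:
            y_pred = 1 if w.dot(f) >= 0 else -1
            if y_pred != y:              # no update on correct predictions
                w += alpha * y * f       # add features for +1 examples, subtract for -1
    return w

# Unigram features over [movie, good, bad, not]
data = [(np.array([1, 1, 0, 0]), +1),   # "movie good"
        (np.array([1, 0, 1, 0]), -1),   # "movie bad"
        (np.array([0, 1, 0, 1]), -1)]   # "not good"
w = train_perceptron(data)
```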
9
HMMs: Parameter Estimation
Labeled data (x⁽ⁱ⁾, y⁽ⁱ⁾), i = 1, …, D
Maximize \sum_i \log P(y^{(i)}, x^{(i)})   — generative (joint) likelihood
  = \sum_i \log P(y_1^{(i)}) + \sum_i \sum_j \log P(y_j^{(i)} \mid y_{j-1}^{(i)}) + \sum_i \sum_j \log P(x_j^{(i)} \mid y_j^{(i)})
MLE with frequency counts: biased coin with probability p of heads;
  observing HHHT, \arg\max_p [3 \log p + \log(1 - p)] gives p = 3/4
HMM parameter estimation: count and normalize
Example: T = {N, V, STOP}, V = {they, can, fish}
  Count tag transitions (N→V, V→V, V→STOP, …) and normalize per previous tag;
  count (tag, word) pairs and normalize per tag for emissions — note this is P(can | V), not P(V | can)
Smoothing: add counts to avoid zero probabilities for unseen (tag, word) pairs
seg-31
CRF Learning and Inference
‣ Model: P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
  equivalently, P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Inference
‣ Learning
Inference: Computing (arg)maxes
\max_{y_1, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots e^{\phi_e(y_2, 2, x)} e^{\phi_t(y_1, y_2)} e^{\phi_e(y_1, 1, x)}
‣ This is maxing over a product of terms with exactly the same
structure as HMM inference
‣ Replace log probabilities in Viterbi code with potentials here:
  \log P(y_i | y_{i-1}) \rightarrow \phi_t(y_{i-1}, y_i),   \log P(x_i | y_i) \rightarrow \phi_e(y_i, i, x)
‣ These potentials depend on more of x, but x is constant, and this
doesn't change inference over y
‣ Can do efficient inference in any tree-structured CRF (max-product or
sum-product algorithms)
CRF Learning and Inference
‣ Model: P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
  equivalently, P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Inference: argmax P(y|x) from Viterbi
‣ Learning
Learning in Log-linear CRFs
‣ How did the gradient look in multiclass logistic regression?
  CRF: P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Logistic regression: P(y|x) \propto \exp w^\top f(x, y)
‣ Maximize L(y^*, x) = \log P(y^* | x)
  \frac{\partial}{\partial w_i} L(x_j, y_j^*) = f_i(x_j, y_j^*) - E_y[f_i(x_j, y)]
  (gold feature value minus the model's expectation of the feature value)
‣ Subtract off "expected features," summing over all possible other labels:
  \frac{\partial}{\partial w_i} L(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_y f_i(x_j, y) P_w(y | x_j)
Learning in Log-linear CRFs
‣ Gradient is analogous to logistic regression:
  CRF: P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Logistic regression: P(y|x) \propto \exp w^\top f(x, y)
‣ Maximize L(y^*, x) = \log P(y^* | x)
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=2}^{n} f_t(y_{i-1}^*, y_i^*) + \sum_{i=1}^{n} f_e(y_i^*, i, x) - E_y\left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ The expectation term is intractable to compute directly: it requires summing over all possible tag sequences
Learning in Log-linear CRFs
‣ Let's focus on the emission feature expectation in
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=2}^{n} f_t(y_{i-1}^*, y_i^*) + \sum_{i=1}^{n} f_e(y_i^*, i, x) - E_y\left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
  E_y\left[ \sum_{i=1}^{n} f_e(y_i, i, x) \right] = \sum_{y \in \mathcal{Y}} P(y|x) \left[ \sum_{i=1}^{n} f_e(y_i, i, x) \right]
                                                 = \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}} P(y|x) f_e(y_i, i, x)
                                                 = \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)
‣ How do we compute these marginals P(y_i = s | x)? Forward-backward
  P(y_i = s | x) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(y | x)
Training CRFs
‣ Transition features: need to compute P(y_i = s_1, y_{i+1} = s_2 | x), using forward-backward as well
‣ …but you can build a pretty good system without learned transition
features (use heuristic weights, or just enforce constraints like
B-PER -> I-ORG is illegal)
‣ For emission features: gold features minus expected features under the model
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=1}^{n} f_e(y_i^*, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)
CRF Learning and Inference
‣ Model: P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
  equivalently, P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
‣ Inference: argmax P(y|x) from Viterbi
‣ Learning: run forward-backward to compute posterior probabilities; then
  \frac{\partial}{\partial w} L(y^*, x) = \sum_{i=1}^{n} f_e(y_i^*, i, x) - \sum_{i=1}^{n} \sum_{s} P(y_i = s | x) f_e(s, i, x)
CRF Learning Implementation
for each epoch
    for each example
        extract features on each emission and transition (look up in cache)
        compute potentials phi based on features + weights
        compute marginal probabilities with forward-backward
        accumulate gradient over all emissions and transitions
‣ Caching is your friend! Cache feature vectors especially, and reduce any
redundant computation
‣ Exploit sparsity where possible, especially in feature vectors and gradients
Debugging CRFs
‣ Hard to know whether inference, learning, or the model is broken!
‣ Compute the objective — is optimization working?
‣ Learning: is the objective going down? Try to fit 1 example / 10
examples. Are you applying the gradient correctly?
‣ Inference: check gradient computation (most likely place for bugs),
check marginals from forward-backward
‣ If the objective is going down but model performance is bad:
‣ Inference: check performance if you decode the training set
seg-19
Skip-Gram
Mikolov et al. (2013)
the dog bit the man
‣ Predict each word of context from word in turn, up to distance k
bit
softmax
Multiply
by W
gold = dog
‣ Parameters: d x |V| word vectors, |V| x d context vectors (stacked into
a matrix W)
‣ Another training example: bit -> the
P(w' | w) = \text{softmax}(W e(w))
‣ Why skip-gram? With window size > 1, we predict a context word
skipping over intermediate words
Continuous Bag-of-words
‣ Predict word from multiple words of context
the dog bit the man
‣ Parameters: d x |V| (one d-length context vector per voc word),
|V| x d word vectors (in matrix W)
dog
the
+
size d
softmax
Multiply
by W
gold label = bit,
no manual labeling
required!
d-dimensional
word embeddings
P(w | w_{-1}, w_{+1}) = \text{softmax}(W(c(w_{-1}) + c(w_{+1})))
size |V| x d
Mikolov et al. (2013)
Hierarchical Softmax
‣ Matmul + softmax over |V| is very slow to compute for both techniques
‣ Hierarchical softmax:
P(w | w_{-1}, w_{+1}) = \text{softmax}(W(c(w_{-1}) + c(w_{+1})))
‣ Standard softmax:
|V| dot products of
size d
log(|V|) dot products of size d,
…
…
the
a
‣ Huffman encode vocabulary,
use binary
classifiers to decide which
branch to take
|V| x d parameters
P(w' | w) = \text{softmax}(W e(w))
‣ log(|V|) binary decisions
Skip-gram:
CBOW:
Mnih and Hinton (2008)
Skip-gram with Negative Sampling
‣ d x |V| vectors, d x |V| context vectors (same # of params as before)
(bit, the) => +1
(bit, cat) => -1
(bit, a) => -1
(bit, fish) => -1
‣ Take (word, context) pairs and classify them as "real" or not. Create
random negative examples by sampling from the unigram distribution
words in similar
contexts select for
similar c vectors
P(y = 1 | w, c) = \frac{e^{w \cdot c}}{e^{w \cdot c} + 1}
‣ Objective = \log P(y = 1 | w, c) + \frac{1}{k} \sum_{i=1}^{n} \log P(y = 0 | w_i, c)   (the w_i are sampled negatives)
‣ Are there alternative ways to learn vectors while avoiding the O(|V|) term?
Mikolov et al. (2013)
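A minimal NumPy sketch of one negative-sampling update for a single (word, context) pair; the gradients come from the logistic objective above, while the learning rate and the sampling of negatives are assumed to happen elsewhere:

```python
import numpy as np

def sgns_step(w_vec, c_pos, c_negs, lr=0.025):
    """One skip-gram-with-negative-sampling update.
    w_vec: word vector; c_pos: true context vector; c_negs: list of sampled negative context vectors."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    grad_w = (sigmoid(w_vec @ c_pos) - 1.0) * c_pos              # pull w toward the true context
    c_pos -= lr * (sigmoid(w_vec @ c_pos) - 1.0) * w_vec
    for c_neg in c_negs:                                         # push w away from sampled negatives
        grad_w += sigmoid(w_vec @ c_neg) * c_neg
        c_neg -= lr * sigmoid(w_vec @ c_neg) * w_vec
    w_vec -= lr * grad_w
```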
Connections with Matrix Factorization
‣ Skip-gram model looks at word-word co-occurrences and produces two
types of vectors
word pair
counts
|V|
|V| |V|
d
d
|V|
context vecs
word
vecs
‣ Looks almost like a matrix factorization…can we interpret it this way?
Levy et al. (2014)
Skip-gram as Matrix Factorization
|V|
|V|
M_{ij} = \text{PMI}(w_i, c_j) - \log k
\text{PMI}(w_i, c_j) = \log \frac{P(w_i, c_j)}{P(w_i) P(c_j)} = \log \frac{\text{count}(w_i, c_j)/D}{(\text{count}(w_i)/D)\,(\text{count}(c_j)/D)}
‣ If we sample negative examples from the unigram distribution over words
(k = number of negative samples), the skip-gram objective exactly corresponds to factoring this matrix…
‣ …and it's a weighted factorization problem (weighted by word frequency)
Levy et al. (2014)
GloVe (Global Vectors)
‣ Objective = \sum_{i,j} f(\text{count}(w_i, c_j)) \left( w_i^\top c_j + a_i + b_j - \log \text{count}(w_i, c_j) \right)^2
‣ Also operates on the counts matrix: weighted
regression on the log co-occurrence matrix
‣ Constant in the dataset size (just need counts), quadratic in vocabulary size
‣ By far the most common word vectors used today (5000+ citations)
word pair
counts
|V|
|V|
Pennington et al. (2014)
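A minimal NumPy sketch of the GloVe objective for one (word, context) cell of the counts matrix; the weighting cap of 100 and exponent 0.75 follow the paper, and everything else is illustrative:

```python
import numpy as np

def glove_weight(count, x_max=100.0, alpha=0.75):
    return (count / x_max) ** alpha if count < x_max else 1.0

def glove_cell_loss(w_i, c_j, a_i, b_j, count_ij):
    """Weighted squared error between the model score and the log co-occurrence count."""
    return glove_weight(count_ij) * (w_i @ c_j + a_i + b_j - np.log(count_ij)) ** 2
```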
fastText: Sub-word Embeddings
Bojanowski et al. (2017)
‣ Same as SGNS, but break words down into n-grams with n = 3 to 6
where:
3-grams:
4-grams:
5-grams:
6-grams:
‣ Replace w \cdot c in the skip-gram computation with \sum_{g \in \text{ngrams}} w_g \cdot c
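A minimal sketch of the character n-gram extraction fastText uses (the boundary markers `<` and `>` follow the paper; the scoring function just swaps in the n-gram sum for w · c):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Break a word (with boundary markers) into character n-grams of length 3-6."""
    marked = "<" + word + ">"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def fasttext_score(ngram_vectors, word, c_vec):
    # Replace w . c with the sum of n-gram vectors dotted with the context vector
    return sum(ngram_vectors[g] @ c_vec for g in char_ngrams(word) if g in ngram_vectors)

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<whe', ...]
```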
Pre-trained Models: ELMo, GPT, BERT
‣ These encode "subwords" rather than words. Underscore indicates that
the following token continues the existing word
‣ Learn embeddings through language modeling (discussed in the second
half of the course)
and there were no re_ fueling sta=ons anywhere
‣ Embeddings are computed using RNNs and Transformers. We can’t just
look up an embedding for each word, but actually need to run a model
one of the city ’s more un_ princi_ pled real estate agents
‣ Any word is either in the subword vocabulary or can be expressed as a
sequence of subwords in the vocabulary
Skip-gram
Input: large corpus of sentences
Output: word vector v_w and context vector c_w for each word type w
Hyperparameters: word vector dimension d (~50-300), window size k (assume k = 1 here)
  "The film inspired …" → training pairs (film → The), (film → inspired), …;
  take all neighbors of each word token up to k positions away
Skip-gram: probabilistic model of context given word
  P(\text{context} = y \mid \text{word} = x) = \frac{\exp(v_x \cdot c_y)}{\sum_{y'} \exp(v_x \cdot c_{y'})}
  v, c are the parameters of the model
Example: corpus = "I saw", with word and context vectors in 2 dimensions
  If v_{saw} = [1, 0] and c_I = [0, 1], what is P(context = I | word = saw)?
  P(\text{context} = I \mid \text{word} = saw) = \frac{\exp(v_{saw} \cdot c_I)}{\sum_{y} \exp(v_{saw} \cdot c_y)}
Training: maximize \sum_{(x, y)\ \text{pairs in data}} \log P(\text{context} = y \mid \text{word} = x)
  An "impossible" problem: cannot drive every probability to 1
  Initialize parameters randomly
seg-30
Feature Functions
y1 y2 yn…
\phi_e
\phi_t
‣ Phis are flexible (can be NN with 1B+ parameters). Here: sparse linear fcns
\phi_t(y_{i-1}, y_i) = w^\top f_t(y_{i-1}, y_i)
\phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)
P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
Basic Features for NER
Barack Obama will travel to Hangzhou today for the G20 meeting .
O B-LOC
Transitions: f_t(y_{i-1}, y_i) = \text{Ind}[y_{i-1}\ \&\ y_i], e.g. Ind[O — B-LOC]
Emissions: f_e(y_6, 6, x) = Ind[B-LOC & Current word = Hangzhou],
           Ind[B-LOC & Prev word = to]
P(y|x) \propto \exp w^\top \left[ \sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x) \right]
Emission Features for NER
Leicestershire is a nice place to visit…
I took a vacation to Boston
Apple released a new version…
According to the New York Times…
ORG
ORG
LOC
LOC
Texas governor Greg AbboI said
Leonardo DiCaprio won an award…
PER
PER
LOC
\phi_e(y_i, i, x)
Emission Features for NER
‣ Context features (can’t use in HMM!)
‣ Words before/after
‣ POS before/after (if we run a POS tagger first)
‣ Word features (can use in HMM)
‣ Capitalization
‣ Word shape
‣ Prefixes/suffixes
‣ Lexical indicators
‣ Gazetteers
‣Word clusters
Leicestershire
Boston
Apple released a new version…
According to the New York Times…
Hidden Markov Models
Generative sequence model; two steps: ① parameter estimation ② inference
Tags y_i ∈ T (tag set), words x_i ∈ V (vocabulary)
P(\bar{y}, \bar{x}) = P(y_1) P(x_1 | y_1) P(y_2 | y_1) P(x_2 | y_2) \cdots P(\text{STOP} | y_n)
Parameters: initial distribution P(y_1), transitions P(y_i | y_{i-1}), emissions P(x_i | y_i)
Generative process: generate each tag from the previous tag, then generate each word from its tag
Markov assumption: y_i is conditionally independent of y_1, …, y_{i-2} given y_{i-1}
seg-5
Perceptron as minimizing loss
Perceptron as minimizing loss
x
y
Perceptron Loss Demonstration
−
−
(-1, -1)
(-1, 0)
(-1, 1)
+
+ (1, 2)
x
y
x
y
Perceptron Loss Demonstration
−
−
(-1, -1)
(-1, 0)
(-1, 1)
+
x
y
x
y
seg-7
Sentiment Analysis
the movie was gross and overwrought, but I liked it
this movie was great! would watch again
‣ Bag-of-words doesn't seem sufficient (discourse structure, negation)
this movie was not really very enjoyable
‣ There are some ways around this: extract bigram feature for “not X” for
all X following the not
+
+
—
Sentiment with Bag-of-words Models
‣ Simple feature sets can do pretty well!
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)
Sentiment with Bag-of-words Models
Wang and Manning (2012)
Before neural nets had taken
off — results weren’t that
great
81.5 89.5Kim (2014) CNNs
‣ 10 years later — revisited basic BoW classifiers vs. other methods
Kim (2014)
Sentiment with Bag-of-words Models
https://github.com/sebastianruder/NLP-progress/blob/master/english/sentiment_analysis.md
…
‣ Best systems now:
large pretrained
networks
‣ Stanford Sentiment
Treebank (SST)
binary classification
‣ 90 -> 97 over the
last 2 years
seg-32
Forward-Backward Algorithm
‣ How do we compute marginals P(y_i = s | x)?
  P(y_i = s | x) = \sum_{y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_n} P(y | x)
‣ What did Viterbi compute? P(y_{max} | x) = \max_{y_1, \ldots, y_n} P(y | x)
‣ Can compute marginals with dynamic programming as well using
forward-backward
Forward-Backward Algorithm
P (y3 = 2|x) =
sum of all paths through state 2 at time 3
sum of all paths
slide credit: Dan Klein
Forward-Backward Algorithm
P (y3 = 2|x) =
sum of all paths through state 2 at time 3
sum of all paths
=
‣ Easiest and most flexible to do one
pass to compute the forward probabilities (α) and one to
compute the backward probabilities (β)
slide credit: Dan Klein
Forward-Backward Recurrence
‣ Initial: \alpha_1(s) = \exp(\phi_e(s, 1, x))
‣ Recurrence: \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) \exp(\phi_e(s_t, t, x))
‣ Same as Viterbi but summing
instead of maxing!
‣ These quantities get very small!
Store everything as log probabilities
slide credit: Dan Klein
Forward-Backward Recurrence
‣ Initial: \beta_n(s) = 1
‣ Recurrence: \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) \exp(\phi_e(s_{t+1}, t+1, x))
‣ Difference from forward: count the
emission for the next timestep (not the
current one)
slide credit: Dan Klein
Computing Marginals
P(s_3 = 2 | x) = \frac{\alpha_3(2) \beta_3(2)}{\sum_i \alpha_3(i) \beta_3(i)}
‣ What is the denominator equal to? P(x)
‣ Does this explain why beta doesn't
account for the current timestep?
\alpha_1(s) = \exp(\phi_e(s, 1, x)),\quad \alpha_t(s_t) = \sum_{s_{t-1}} \alpha_{t-1}(s_{t-1}) \exp(\phi_e(s_t, t, x))
\beta_n(s) = 1,\quad \beta_t(s_t) = \sum_{s_{t+1}} \beta_{t+1}(s_{t+1}) \exp(\phi_e(s_{t+1}, t+1, x))
slide credit: Dan Klein
Computing Marginals
y1 y2 yn…
\phi_e, \phi_t
P(y|x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
Z = \sum_y \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))
‣ For both HMMs and CRFs:
P(y_i = s | x) = \frac{\text{forward}_i(s)\,\text{backward}_i(s)}{\sum_{s'} \text{forward}_i(s')\,\text{backward}_i(s')}
‣ Normalizing constant: the denominator is Z for CRFs and is analogous to P(x) for HMMs
‣ This denominator is the same for all i! Nice way to debug your forward-
backward implementation and catch bugs
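A minimal NumPy sketch of forward-backward over precomputed potential tables. This version includes transition potentials, which the full CRF model above has even though the simplified recurrences on these slides show only the emission term; real implementations also work in log space rather than multiplying raw potentials:

```python
import numpy as np

def forward_backward(emit, trans):
    """emit: [n x T] potentials exp(phi_e(s, t, x)); trans: [T x T] potentials exp(phi_t(s_prev, s))."""
    n, T = emit.shape
    alpha, beta = np.zeros((n, T)), np.ones((n, T))
    alpha[0] = emit[0]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ trans) * emit[t]        # sum over previous tags
    for t in range(n - 2, -1, -1):
        beta[t] = trans @ (beta[t + 1] * emit[t + 1])      # count the emission for the *next* timestep
    marginals = alpha * beta
    Z = marginals.sum(axis=1, keepdims=True)               # same for every position i: a good debugging check
    return marginals / Z                                   # P(y_i = s | x)
```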
Posteriors vs. Probabilities
P(y_i = s | x) = \frac{\text{forward}_i(s)\,\text{backward}_i(s)}{\sum_{s'} \text{forward}_i(s')\,\text{backward}_i(s')}
‣ The posterior is derived from the parameters and the data (conditioned on x!)
        P(x_i | y_i), P(y_i | y_{i-1})                         P(y_i | x), P(y_{i-1}, y_i | x)
HMM:    model parameters (usually multinomial distributions)   inferred quantities from forward-backward
CRF:    undefined (model is by definition conditioned on x)    inferred quantities from forward-backward
HMM Inference: The Viterbi Algorithm
HMMs: model of P(\bar{y}, \bar{x}) = P(y_1) P(x_1 | y_1) P(y_2 | y_1) \cdots
Inference: \arg\max_{\bar{y}} P(\bar{y} \mid \bar{x}) = \arg\max_{\bar{y}} \frac{P(\bar{y}, \bar{x})}{P(\bar{x})} = \arg\max_{\bar{y}} P(\bar{y}, \bar{x})   (P(\bar{x}) is constant in \bar{y})
  = \arg\max_{\bar{y}} \left[ \log P(y_1) + \sum_i \log P(x_i | y_i) + \sum_i \log P(y_i | y_{i-1}) \right]
Viterbi Dynamic Program
Define v_i(y): an n × |T| table (n = sentence length, |T| = number of tags);
  v_i(y) = score of the best path ending in tag y at time i
Base: v_1(y) = \log P(x_1 | y) + \log P(y)
Recurrence: v_i(y) = \log P(x_i | y) + \max_{y_{prev}} \left[ \log P(y | y_{prev}) + v_{i-1}(y_{prev}) \right]
Viterbi: for i = 1 … n, for y in T: compute v_i(y); finally compute v_{n+1}(\text{STOP}) = \max_y \left[ \log P(\text{STOP} | y) + v_n(y) \right]
Track "backpointers" to recover the best tag sequence, e.g. for "they can can fish"
Beam Search
Viterbi runtime: O(n |T|^2); |T| ≈ 40 for the English Penn Treebank
Beam search: keep only the top k Viterbi states at each time step (k << |T|); expand each state in the
beam and insert its successors into a new priority queue for the next time step
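A minimal NumPy sketch of this dynamic program over log probabilities (an HMM with precomputed log transition and emission tables; the tag and word indexing scheme is illustrative):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, words):
    """log_init: [T], log_trans: [T x T], log_emit: [T x V]; words: list of word indices."""
    n, T = len(words), len(log_init)
    v = np.full((n, T), -np.inf)
    backptr = np.zeros((n, T), dtype=int)
    v[0] = log_init + log_emit[:, words[0]]                 # base case
    for i in range(1, n):
        scores = v[i - 1][:, None] + log_trans              # [prev tag x current tag]
        backptr[i] = scores.argmax(axis=0)
        v[i] = scores.max(axis=0) + log_emit[:, words[i]]   # recurrence
    tags = [int(v[-1].argmax())]
    for i in range(n - 1, 0, -1):                           # follow backpointers
        tags.append(int(backptr[i][tags[-1]]))
    return tags[::-1]
```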
seg-33
NER: Evaluation
Barack Obama will travel to Hangzhou today for the G20 meeting .
PERSON LOC ORG
B-PER I-PER O O O B-LOC B-ORGO O O O O
‣ Prediction of all Os gets 66% accuracy on this example!
‣ What we really want to know: how many named entity chunk
predictions did we get right?
‣ Precision: of the ones we predicted, how many are right?
‣ Recall: of the gold named entities, how many did we find?
‣ F-measure: harmonic mean of these two
‣ Example:
NER: CoNLL 2003
‣ CRF with lexical features can get around 85 F1 on CoNLL 2003: 4
classes (PER, ORG, LOC, MISC), newswire data
‣What else do we need to capture?
‣World knowledge:
The delegation met the president at the airport, Tanjug said.
Nonlocal Features
ORG?
PER?
The delegation met the president at the airport, Tanjug said.
The news agency Tanjug reported on the outcome of the meeting.
‣More complex factor graph structures can let you capture this, or just
decode sentences in order and use features on previous sentences
Finkel et al. (2005), Ratinov and Roth (2009)
Results: CoNLL 2003
Ratinov and Roth (2009)
Lample et al. (2016)
BiLSTM-CRF + ELMo
Peters et al. (2018)
92.2
BERT
Devlin et al. (2019)
92.8
‣ Knowledge from pre-training is a huge factor
Scaling Up
Choi et al. (2018)
‣Modern work uses larger and larger typesets: 17 -> 112 -> 10,000+
‣ Lots of challenges scaling to this many tags and making these
systems useful!
Logistic Regression
Discriminative probabilistic model P(y | \bar{x})   (generative: P(\bar{x}, y))
  P(y = +1 \mid \bar{x}) = \frac{\exp(w^\top f(\bar{x}))}{1 + \exp(w^\top f(\bar{x}))}
Decision boundary? Return +1 if P(y = +1 | \bar{x}) > 0.5, i.e. if w^\top f(\bar{x}) > 0
Training
For a dataset (\bar{x}^{(i)}, y^{(i)}), i = 1 … D, want to maximize \prod_i P(y^{(i)} | \bar{x}^{(i)})   (maximum likelihood)
  ⇔ maximize \sum_i \log P(y^{(i)} | \bar{x}^{(i)})   (log likelihood) ⇔ minimize the negative log likelihood (NLL)
Need to compute \nabla_w \text{loss}(\bar{x}^{(i)}, y^{(i)})
LR Gradients (assume y^{(i)} = +1)
  \nabla_w \text{loss} = \nabla_w \left[ -w^\top f(\bar{x}) + \log(1 + e^{w^\top f(\bar{x})}) \right]
                       = -f(\bar{x}) + f(\bar{x}) \frac{e^{w^\top f(\bar{x})}}{1 + e^{w^\top f(\bar{x})}} = -f(\bar{x})\,(1 - P(y = +1 \mid \bar{x}))
Update: w \leftarrow w + \alpha f(\bar{x})\,(1 - P(y = +1 \mid \bar{x}))
  P(y = +1 | \bar{x}) ≈ 1: no update;   P(y = +1 | \bar{x}) ≈ 0: perceptron-like update
NLL as a function of z = w^\top f(\bar{x}): \log(1 + e^z) - z
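A minimal NumPy sketch of this gradient update for a single example (the learning rate α and feature vector are illustrative):

```python
import numpy as np

def logistic_regression_update(w, f, y, alpha=0.1):
    """One SGD step on the negative log likelihood for an example with label y in {+1, -1}."""
    p_pos = 1.0 / (1.0 + np.exp(-w.dot(f)))   # P(y = +1 | x)
    if y == +1:
        w += alpha * f * (1.0 - p_pos)        # small update when the model is already confident
    else:
        w -= alpha * f * p_pos
    return w
```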
seg-68
BERT
‣ Four major changes compared to ELMo:
‣ Transformers instead of LSTMs
‣ Bidirectional
‣ Decoder: separate module, single cell. Two inputs: hidden state (vector
h or tuple (h, c)) and previous token. Outputs token + new state
Encoder
…
film
le
Decoder Decoder
Seq2seq Training
‣ Objective: maximize
the movie was great le film était bon
le
‣ One loss term for each target-sentence word, feed the correct word
regardless of the model's prediction (called "teacher forcing")
… était [STOP]
Objective = \sum_{(x, y)} \sum_{i=1}^{n} \log P(y_i^* \mid x, y_1^*, \ldots, y_{i-1}^*)
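A minimal PyTorch-style sketch of this objective for one (source, target) pair with teacher forcing; `encoder`, `decoder_step`, and `sos_index` are assumed interfaces rather than any particular library's API:

```python
import torch
import torch.nn.functional as F

def seq2seq_loss(encoder, decoder_step, src, tgt, sos_index):
    """src: source token tensor; tgt: gold target token tensor (ending in STOP)."""
    hidden = encoder(src)                      # encode the source into an initial hidden state
    prev_token = torch.tensor(sos_index)       # assumed start-of-sequence token index
    loss = 0.0
    for gold in tgt:
        logits, hidden = decoder_step(prev_token, hidden)
        loss = loss + F.cross_entropy(logits.unsqueeze(0), gold.unsqueeze(0))
        prev_token = gold                      # teacher forcing: feed the gold word, not the prediction
    return loss
```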
Scheduled Sampling
‣ Starting with p = 1 (teacher forcing) and decaying it works best
‣ Scheduled sampling: with probability p, take the gold as input, else
take the model's prediction
the movie was great
la film étais bon [STOP]
le film était
‣ Model needs to do the right thing even with its own predictions
Bengio et al. (2015)
sample
‣ Better approach: train with reinforcement learning
Implementation Details
‣ Sentence lengths vary for both encoder and decoder, so pad to
make sequences the right length, use masking
‣ Encoder: standard RNN module
‣ Decoder: execute one step of computation at a time, so the
computation graph is formulated as taking one input + hidden state
‣ Test time: do this until you generate the stop token
‣ Training: do this until you reach the gold stopping point
Batching and Beam Search
‣ Batching is pretty tricky: the decoder runs across time steps, so you probably
want your label vectors to look like [num timesteps x batch size x num
labels], and iterate upwards by time steps
‣ Beam search: can help with lookahead. Finds the (approximate) highest
scoring sequence:
\arg\max_y \prod_{i=1}^{n} P(y_i \mid x, y_1, \ldots, y_{i-1})
Beam Search
‣Maintain decoder state, token history in beam
Step 1: la: 0.4, le: 0.3, les: 0.1  →  beam = {la [log(0.4)], le [log(0.3)], les [log(0.1)]}
Step 2: expand each hypothesis, e.g. la → film: 0.4 [log(0.4)+log(0.4)], le → film: 0.8 [log(0.3)+log(0.8)], …
‣ Keep both film states! Hidden state vectors are different
the movie was great
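A minimal sketch of one step of beam search in plain Python; `step_fn` is an assumed function that takes a decoder state and the previous token and returns (log-probabilities over the vocabulary, new state):

```python
import heapq

def beam_search_step(beam, step_fn, beam_size):
    """beam: list of (score, tokens, state). Returns the top-k expanded hypotheses."""
    candidates = []
    for score, tokens, state in beam:
        log_probs, new_state = step_fn(state, tokens[-1])
        for tok, lp in enumerate(log_probs):
            # each hypothesis keeps its own decoder state (the two "film" states differ!)
            candidates.append((score + lp, tokens + [tok], new_state))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
```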
Other Architectures
‣ What's the basic abstraction here?
‣ Encoder: sentence -> vector
‣ Decoder: hidden state, output prefix -> new hidden state, new output
‣Wide variety of models can apply here: CNN encoders, decoders can be
any autoregressive model including certain types of CNNs
‣ Transformer: another model discussed later
‣ OR: sentence, output prefix -> new output (more general)
seg-40
CRF Parsing
Taskar et al. (2004)
Hall, Durrett, and Klein (2014)
score
LeB child last word = report ∧ NP PP
NP
w>f
NP PP
NP
2 5 7
=
f
NP PP
NP
2 5 7
He wrote a long report on Mars .
PPNP
NP
=
2 5 7
wrote a long report on Mars .
wrote a long report on Mars .
‣ Can learn that we report [PP], which is common due to reporting on things
‣ Can "neuralize" this by computing these scores with a neural network
CRF Parsing
Discrete + Continuous
He wrote a long report on Mars
NP
PP
NP
‣ Chart remains discrete!
‣ Feedforward pass on nets
‣ Run CKY dynamic program
‣ Discrete feature computaSon
+Discrete ConSnuous
…
Parsing a sentence:
Durrett and Klein (2015)
Using ELMo
Kitaev and Klein (2018)
‣ Improves the neural CRF by using a
transformer layer (self-attentive),
character-level modeling, and ELMo
‣ 95.21 on Penn Treebank dev set — much
better than past parsers! (~92-93)
‣ This constituency parser with BERT is
one of the strongest today, or use a
transition-based version due to Kitaev
and Klein (2020)
Graph-based Dependency Parsers
‣ ConjuncSons of parent and child words + POS, POS of words in between,
POS of surrounding words
DT NNTOVBDDT NN
the housetoranthe dog
ROOT
‣ HEAD=TO & MOD=NN
‣ HEAD=TO & MOD-1=the
‣ HEAD=TO & MOD=house
‣ ARC_CROSSES=DT
f(i, parent(i),x)
McDonald et al. (2005)
‣ Score of a parse is a sum of weights * features over all arcs it contains.
Use dynamic programming to find the best parse
Higher-order Parsing
Koo and Collins (2010)
‣ Track addiSonal state during parsing so
we can look at “grandparents” (and
siblings). O(n4) dynamic program or
use approximate search
DT NNTOVBDDT NN
the housetoranthe dog
ROOT
f(i, parent(i), parent(parent(i)),x)
Biaffine Neural Parsing
‣ Neural CRFs for dependency parsing: let c = LSTM embedding of i, p =
LSTM embedding of parent(i). score(i, parent(i), x) = pTUc
Dozat and Manning (2017)
(num words x hidden size) (num words x
num words)
LSTM looks at words and POS
‣ This is the best graph-based parser today when combined with BERT
Transition-based Parsers
Chen and Manning (2014)
Andor et al. (2016)
‣ 94.61 UAS on the Penn Treebank using a global transition-based system with early
updating (compared to 95.8 for Dozat, 93.7 for Koo in 2009)
‣ Feature set pioneered by Chen and Manning (2014); Google fine-tuned it as part of
the SyntaxNet parser
TransiSon-based Parsers
Chen and Manning (2014)
UnopSmized S-R{
{
{
{
OpSmized S-R
Graph-based
Neural S-R
‣ TransiSon-based parsers are very fast, and now give good performance
seg-97
Cross-lingual Tagging
‣ Labeling POS datasets is expensive
‣ Can we transfer annota;on from high-resource languages (English, etc.)
to low-resource languages?
English
Raw text
POS data
Spanish:
+
Raw text
en-es bitext
POS data
Malagasy
bitext
Raw text+ Malagasy tagger
Spanish
tagger
Cross-lingual Tagging
Das and Petrov (2011)
‣ Can we leverage word alignment here?
N PR V ??
‣ Tag with English tagger, project across bitext, train French tagger?
Works preTy well
I like it a lot
Je l’ aime beaucoup
align
I like it a lot
Je l’ aime beaucoup
N V PR DT ADJ
tag
I like it a lot
Je l’ aime beaucoup
Projected tags
Cross-lingual Parsing
‣ Now that we can POS tag other languages, can we parse them too?
‣ Direct transfer: train a parser over POS sequences in one language, then
apply it to another language
I like tomatoes
PRON VERB NOUN
Je les aime
PRON PRON VERB
I like them
PRON VERB PRON
Parser trained
to accept tag
input
VERB is the
head of PRON
and NOUN
parse
new
data
train
McDonald et al. (2011)
Cross-lingual Parsing
‣Mul;-dir: transfer a parser trained on several source treebanks to the
target language
‣Mul;-proj: more complex annota;on projec;on approach
McDonald et al. (2011)
seg-83
Probing
Tenney et al. (2019)
‣ Given a simple, fixed class of model (e.g., one-layer FFNN), how
well can we predict various things from word representaGons?
‣We want to know what
informaGon is captured in
a neural network. Try to
predict that informaGon
from the network’s
representaGons
Probing: Results
Tenney et al. (2019)
‣ Lex: baseline built on
context-independent vectors
‣ Large gains from
contextualizaGon, and BERT
beats ELMo
Probing: Results
Tenney et al. (2019b)
‣ Purple: BERT-large
performance on each task
(as delta from mean) using
representaGons from that
layer of BERT
‣ Earlier layers of the
network: beSer at POS and
low-level tasks. Later layers
are beSer at higher-level
tasks like coreference
seg-82
Text-based Explana/ons
‣ Can we generate a natural language explana/on of a model’s
behavior?
‣ Possible advantages:
‣ Easy for untrained users to understand
‣ Easy for annotators to provide ground truth human explana/ons
(which may also help our models)
‣ Possible disadvantages:
‣ Hard to generate gramma/cal/seman/cally meaningful text
‣ Can text truly explain a model’s behavior?
Explana/ons of Bird Classifica/on
Hendricks et al. (2016)
‣What makes a visual explana/on? Should be
relevant to the class and the image
‣ Are these features really what the model used?
Explana/ons of Bird Classifica/on
Hendricks et al. (2016)
‣ Are these features really what the model used? The decoder looks at
the image, but what it reports may not truly reflect the model’s
decision-making
‣More likely to produce plausible (look good to humans) but unfaithful
explana/ons!
e-SNLI
Camburu et al. (2019)
‣ e-SNLI: natural language inference with explana/ons
e-SNLI
Camburu et al. (2019)
‣ Similar to birds: explana/on is generated condi/oned on the
label and the network state f
‣ Informa/on from f is fed into the explana/on LSTM, but no
constraint that this must be used. Explana/on might be purely
generated from the label
f = func/on of premise and hypothesis vectors
Latent Textual Explana/ons
‣Model generates text “hypothesis”, which is completely latent
Latcinnik and Berant (2020)
‣ Hypothesis isn’t constrained to be natural language, ends up
being keywords
seg-96
Morphological Analysis
‣ In English, lexical features on words and word vectors are pre;y effec=ve
‣When we’re building systems, we probably want to know base form +
morphological features explicitly
‣ In other languages, lots more unseen words due to rich morphology!
Affects parsing, transla=on, …
‣ How to do this kind of morphological analysis?
Morphological Analysis: Hungarian
Ám a kormány egyetlen adó csökkentését sem javasolja .
[Figure: each word of the Hungarian sentence is annotated with morphological features, e.g. n=singular|case=nominative|proper=no, deg=positive, mood=indicative|t=present|p=3rd|n=singular|def=yes]
But the government does not recommend reducing taxes.
CoNLL dependency treebank
Morphological Analysis
‣ Given a word in context, need to predict what its morphological features
are
‣ Lots of work on Arabic inflec=on (high amounts of ambiguity)
‣ Basic approach: combines two modules:
‣ Lexicon: tells you what possibili=es are for the word
‣ Analyzer: sta=s=cal model that disambiguates
‣Models are largely CRF-like: score morphological features in context
Morphological Inflec=on
‣ Inverse task of analysis: given base form + features, inflect the word
‣ Hard for unknown words — need models that generalize
w i n d e n
Durre; and DeNero (2013)
Morphological Inflec=on for MT
‣Machine transla=on where phrase table is defined in terms of lemmas
‣ “Translate-and-inflect”: translate into uninflected words and predict
inflec=on based on source side
Chahuneau et al. (2013)
Chinese Word Segmenta=on
‣ LSTMs over character
embeddings / character
bigram embeddings to
predict word boundaries
‣Word segmenta=on:
some languages
including Chinese are
totally untokenized
‣ Having the right
segmenta=on can help
machine transla=on
Xinchi Chen et al. (2015)
N-gram Language Modeling
P(\bar{w}) = P(w_1, \ldots, w_n) = P(w_1) P(w_2 | w_1) P(w_3 | w_1 w_2) \cdots
n-gram LM: P(\bar{w}) \approx \prod_i P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})   — condition on the previous n-1 words
2-gram LM: P(w_1 | <s>) P(w_2 | w_1) P(w_3 | w_2) \cdots   (w_i conditionally independent of w_1 … w_{i-2} given w_{i-1})
An n-gram LM is an (n-1)-order Markov model
3-gram LM: P(w_1 | <s> <s>) P(w_2 | <s> w_1) P(w_3 | w_1 w_2) \cdots
2-gram: multinomial distributions, |V| x |V| parameters
  P(w | the): 0.001 house, 0.0005 dog, 0.0005 cat, …   — a very flat distribution!
Parameter estimation: MLE from a large corpus
  P(dog | the) = \frac{\text{count}(\text{the, dog})}{\text{count}(\text{the})}
Why language models? ① Generation: machine translation   ② Grammatical error correction   ③ A way to build word vectors
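A minimal sketch of the count-based MLE for a bigram model in plain Python (the tiny corpus is illustrative):

```python
from collections import Counter

def train_bigram_lm(sentences):
    bigram_counts, unigram_counts = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split()
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            unigram_counts[prev] += 1
    # P(w | prev) = count(prev, w) / count(prev)
    return lambda w, prev: bigram_counts[(prev, w)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

p = train_bigram_lm(["the dog ran", "the cat ran"])
print(p("dog", "the"))   # 0.5
```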
seg-55
Seman&c Parsing as Transla&on
‣Write down a linearized form of the seman&c parse, train seq2seq
models to directly translate into this representa&on
‣ Concerns: may produce invalid logical forms…and does it work?
“what states border Texas”
lambda x ( state( x ) and border( x , e89 ) ) )
‣ Benefits: no grammar engineering, latent deriva&ons, …
Jia and Liang (2016)
Encoding Invariances
‣ Parsing-based approaches handle these the same way (with small
possibly divergences due to feature weights)
‣ Don’t change the model, change the data. Data augmenta5on lets
us encode invariances by automa&cally genera&ng new training
examples
what states border Texas what states border Ohio
‣ Can we get seq2seq seman&c parsers to handle these the same way?
Jia and Liang (2016)
Data Augmenta&on
‣ Abstract out en&&es: now we can “remix” examples and encode
invariance to en&ty ID. More complicated remixes too
‣ Lets us synthesize a “what states border ohio ?” example
Jia and Liang (2016)
Seman&c Parsing as Transla&on
Prolog
Lambda calculus
Other DSLs
‣ Handle all of these with uniform machinery!
Jia and Liang (2016)
Seman&c Parsing as Transla&on
‣ Three forms of data
augmenta&on all help
‣ Results on these tasks are s&ll not
as strong as hand-tuned systems
from 10 years ago, but the same
simple model can do well at all
problems
Jia and Liang (2016)
Regex Genera&on from Language
‣ Predict regex from text
‣ Problem: requires a lot of data: 10,000 examples needed to get ~60%
accuracy on pre`y simple regexes
‣ Does not scale when regex specifica&ons are more abstract (I want to
recognize a decimal number less than 20)
Locascio et al. (2016)
SQL Genera&on
‣ Convert natural language
descrip&on into a SQL
query against some DB
‣ How to ensure that well-
formed SQL is generated?
‣ Three seq2seq models
‣ How to capture column
names + constants?
‣ Pointer mechanisms, to
be discussed later
Zhong et al. (2017)
‣ Problem to address: how to generate following
the input closely? Attention mechanism
seg-69
BERT: Model and Applica2ons
‣ BERT Base: 12 layers, 768-dim per
wordpiece token, 12 heads. Total
params = 110M
‣ BERT Large: 24 layers, 1024-dim
per wordpiece token, 16 heads.
Total params = 340M
‣ Posi2onal embeddings and
segment embeddings, 30k word
pieces
‣ This is the model that gets pre-
trained on a large corpus
Devlin et al. (2019)
What can BERT do?
Devlin et al. (2019)
‣ Ar2ficial [CLS] token is used as the vector to do classifica2on from
‣ BERT can also do tagging by predic2ng tags at each word piece
‣ Sentence pair tasks (entailment): feed both sentences into BERT
What can BERT do?
Devlin et al. (2019)
‣ How does BERT model sentence pair tasks?
‣ Transformers can capture interac2ons between the two sentences (even
though the NSP objec2ve doesn’t really cause this to happen)
Transformer
Transformer
…
[CLS] A boy plays in the snow [SEP] A boy is outside
Entails (first sentence implies second is true)
What can BERT NOT do?
Devlin et al. (2019)
‣ BERT cannot generate text (at least not in an obvious way)
‣ Can fill in MASK tokens, but can't generate left-to-right (you can put MASK
at the end repeatedly, but this is slow)
‣ Masked language models are intended to be used primarily for “analysis”
tasks
Fine-tuning BERT
‣ Fine-tune for 1-3 epochs, small
learning rate
‣ Large changes to weights up here
(par2cularly in last layer to route the right
informa2on to [CLS])
‣ Smaller changes to weights lower down in
the transformer
‣ Small LR and short fine-tuning schedule
mean weights don’t change much
‣ More complex “triangular learning
rate” schemes exist
Fine-tuning BERT
‣ BERT is typically better if the whole network is fine-tuned, unlike ELMo
Peters et al. (2019)
Evalua2on
Wang et al. (2019)
Evalua2on
‣ Huge improvements over prior work (even compared to ELMo)
‣ Effec2ve at “sentence pair” tasks: textual entailment (does sentence A
imply sentence B), paraphrase detec2on
Devlin et al. (2019)
Analysis
Clark et al. (2019)
‣ Heads on transformers learn interes2ng and diverse things: content
heads (aeend based on content), posi2onal heads (based on posi2on),
etc.
Analysis
‣ S2ll way worse than what supervised parsing systems can do, but
interes2ng that this is learned organically
RoBERTa
Liu et al. (2019)
‣ “Robustly op2mized BERT”
‣ 160GB of data instead
of 16 GB
‣ Dynamic masking: standard
BERT uses the same MASK
scheme for every epoch,
RoBERTa recomputes them
‣ New training + more data = beeer performance
‣ For this and more: check out Huggingface Transformers or fairseq
Neural Language Models
P(\bar{w}) = P(w_1) P(w_2 | w_1) \cdots P(w_n | w_1, \ldots, w_{n-1})   — model the conditionals with a neural net
Very basic neural LM: P(w_i \mid w_{i-1}) = \frac{\exp(v_{w_i} \cdot c_{w_{i-1}})}{\sum_{w' \in V} \exp(v_{w'} \cdot c_{w_{i-1}})}   (skip-gram-like)
More generally: P(w_i \mid w_1, \ldots, w_{i-1}) = \text{softmax}(U_{w_i} \cdot f(w_1, \ldots, w_{i-1}))
  f: a neural net that embeds the context
  If f is a DAN: ignores the ordering of the context words
  If f is a FFNN: feed in embeddings of the previous words, e.g. "I saw the" → P(w_i | w_1, …, w_{i-1})   (Mnih + Hinton, 2003)
seg-57
Attention
‣ At each decoder state, compute a distribution over source inputs
based on the current decoder state, use that in the output layer
the movie was great le
Attention
the movie was great
h1 h2 h3 h4
h̄1
‣ For each decoder state,
compute a weighted sum of
input states:
e_{ij} = f(\bar{h}_i, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}
c_i = \sum_j \alpha_{ij} h_j   (weighted sum of input hidden states — a vector)
‣ Some function f (TBD)
le
P(y_i \mid x, y_1, \ldots, y_{i-1}) = \text{softmax}(W[c_i; \bar{h}_i])
‣ Normal seq2seq: P(y_i \mid x, y_1, \ldots, y_{i-1}) = \text{softmax}(W \bar{h}_i)
Attention Variants
e_{ij} = f(\bar{h}_i, h_j),\quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})},\quad c_i = \sum_j \alpha_{ij} h_j
‣ Note that this all uses the outputs of hidden layers
‣ Bahdanau+ (2014): additive      f(\bar{h}_i, h_j) = \tanh(W[\bar{h}_i, h_j])
‣ Luong+ (2015): dot product      f(\bar{h}_i, h_j) = \bar{h}_i \cdot h_j
‣ Luong+ (2015): bilinear         f(\bar{h}_i, h_j) = \bar{h}_i^\top W h_j
le
Bahdanau et al. (2014), Luong et al. (2015)
Attention Structures
‣ When do we compute attention? Can compute before or after the RNN cell
After RNN cell (Luong et al., 2015)          Before RNN cell (Bahdanau et al., 2014)
What can attention do?
‣ Can learn to copy a sequence and subsample tokens
0 3 2 1
3 1
‣ LSTMs can learn to count — gives position-based addressing
‣ LSTMs can learn to reject certain inputs — allows us to ignore
undesirable inputs from attention
What can attention do?
‣ Decoder hidden states are now
mostly responsible for selecting
what to attend to
‣ Doesn't take a complex hidden
state to walk monotonically
through a sentence and spit
out word-by-word translations
‣ Encoder hidden states capture
contextual source word identity
Bahdanau et al. (2014)
Batching Attention
the movie was great
the movie was great
token outputs: batch size x sentence length x hidden size
sentence outputs: batch size x hidden size
hidden state: batch size x hidden size
e_{ij} = f(\bar{h}_i, h_j),\quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})},\quad c_i = \sum_j \alpha_{ij} h_j
attention scores = batch size x sentence length
c = batch size x hidden size
‣Make sure tensors are the right size!
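A minimal PyTorch sketch of batched dot-product attention with these shapes; masking of padded source positions is included since padded batches are the usual reason to be careful here:

```python
import torch

def batched_attention(dec_state, enc_outputs, src_mask):
    """dec_state: [batch x hidden]; enc_outputs: [batch x src_len x hidden];
    src_mask: [batch x src_len] with 1 for real tokens, 0 for padding."""
    scores = torch.bmm(enc_outputs, dec_state.unsqueeze(2)).squeeze(2)   # [batch x src_len]
    scores = scores.masked_fill(src_mask == 0, float("-inf"))            # ignore padded positions
    alphas = torch.softmax(scores, dim=1)                                # attention distribution
    context = torch.bmm(alphas.unsqueeze(1), enc_outputs).squeeze(1)     # [batch x hidden]
    return context, alphas
```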
seg-80
Explana’ons in NLP
‣ Neural models in NLP have complex behavior. How can we
understand them?
Lipton (2016)
‣ QA: why did the model prefer Stewart over Devin Funchess?
Explana’ons in NLP
‣ Neural models in NLP have complex behavior. How can we
understand them?
Iyyer et al. (2015)
‣ Sen’ment:
‣ LeK side: predic’ons model makes on individual words
DAN Ground Truth
‣ Tells us how these words combine
Why explana’ons?
Lipton (2016)
‣ Trust: if we see that models are behaving in human-like ways and
making human-like mistakes, we might be more likely to trust them and
deploy them
‣ Causality: if our classifier predicts class y because of input feature x,
does that tell us that x causes y? Not necessarily, but it might be helpful
to know
‣ Informa2veness: more informa’on may be useful (e.g., predic’ng a
disease diagnosis isn’t that useful without knowing more about the
pa’ent’s situa’on)
‣ Fairness: ensure that predic’ons are non-discriminatory
What are explana’ons?
Lipton (2016); Belinkov and Glass (2018)
‣ Some models are naturally transparent: we can understand why they
do what they do (e.g., a decision tree with <10 nodes)
‣ Explana'ons of more complex models
‣ Local explana2ons: highlight what led to this classifica'on decision.
(Counterfactual: if they were different, the model would’ve predicted a
different class)
‣ Text explana2ons: describe the model’s behavior in language
‣ Model probing: auxiliary tasks, challenge sets, adversarial examples to
understand more about how our model works
seg-94
Dialogue and QA
‣ Dialogue is a very natural way to find informa7on from a search engine
or a QA system
‣ Challenges: hard to
annotate good
dialogue datasets in a
purely sta7c way
Iyyer et al. (2017)
Dialogue and QA
Choi et al. (2018)
‣ UW QuAC dataset: Ques7on
Answering in Context
Conversa7onal Machine Reading
‣ Conversa7onal machine
reading: answer repeated
ques7ons based on a passage
‣ Interes7ng and poten7ally
useful idea, but annota7ng
data is very hard!
Saeidi et al. (2018)
seg-95
Morphology
‣ Study of how words form
‣ Deriva7onal morphology: create a new lexeme from a base
estrange (v) => estrangement (n)
become (v) => unbecoming (adj)
I become / she becomes
‣ Inflec7onal morphology: word is inflected based on its context
‣May not be totally regular: enflame => inflammable
‣Mostly applies to verbs and nouns
Morphological Inflec7on
‣ In English: I arrive you arrive he/she/it arrives
we arrive you arrive they arrive
[X] arrived
‣ In French:
Wik7onary
Morphological Inflec7on
Wik7onary
‣ In Spanish:
Morphological Inflec7on
‣ Nomina7ve: I/he/she, accusa7ve: me/him/her, geni7ve: mine/his/hers
‣ Not just verbs either; gender, number, case complicate things
I give the children a book <=> Ich gebe ein Buch
I taught the children <=> Ich unterrichte die Kinder
‣ Da7ve: merged with accusa7ve in English, shows recipient of something
den Kindern
Irregular Inflec7on
‣ Common words are oSen irregular
‣ I am / you are / she is
‣ Less common words typically fall into some regular paradigm —
these are somewhat predictable
‣ Je suis / tu es / elle est
‣ Soy / está / es
Agglu7na7ng Languages
‣ Finnish/Hungarian
(Finno-Ugric), also
Turkish: what a
preposi7on would
do in English is
instead part of the
verb
‣Many possible forms — and in newswire data, only a few are observed
illa7ve: “into” adessive: “on”
halata: “hug”
Morphologically-Rich Languages
‣Many languages spoken all over the world have much richer morphology
than English
‣ CoNLL 2006 / 2007: dependency parsing + morphological analyses for
~15 mostly Indo-European languages
‣Word piece / byte-pair encoding models for MT are prehy good at
handling these if there’s enough data
‣ SPMRL shared tasks (2013-2014): Syntac7c Parsing of Morphologically-
Rich Languages
seg-81
Local Explana+ons
‣ An explana+on should help us answer counterfactual ques+ons:
if the input were x’ instead of x, what would the output be?
that movie was not ____ , in fact it was terrible !
that movie was not great , in fact it was _____ !
that movie was not great , in fact it was terrible !
Model
—
—
+
‣ Perturb input many +mes and assess the impact on the model’s
predic+on
LIME
Ribeiro et al. (2016)
‣ LIME: Locally-Interpretable Model-Agnos+c Explana+ons
‣ Local because we’ll focus on this one example
‣Model-agnos:c: treat model as black box
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
‣ Check predic+ons on subsets of components
‣ Train a model to explain which components yield the model’s preds
LIME
Ribeiro et al. (2016)
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
‣ Break down input into many small pieces for interpretability
x \in \mathbb{R}^d \rightarrow x' \in \{0, 1\}^{d'}
‣ Now learn a model to predict f(x’’) based on x’. This model’s
weights will serve as the explana+on for the decision
‣ If the pieces are very coarse, can
interpret but can’t learn a good
model of the boundary. If pieces
are too fine-grained, can
interpret but not predict
‣ Draw samples by using x’ as a mask to form a new example x’’.
Compute f(x’’)
why
it’s +
LIME
‣ Evalua+on: the authors train a sparse model (only looks at 10
features of each example), then try to use LIME to recover the
features. Greedy: remove features to make predicted class
prob drop by as much as possible
Ribeiro et al. (2016)
https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime
Gradient-based Methods
‣ Problem: fully removing pieces of the input may cause it to be very
unnatural
data manifold (points we
observe in prac+ce)
LIME zeroes out
certain features
‣ Alterna+ve approach: look at what this perturba+on does locally
right around the data point using gradients
Gradient-based Methods
Simonyan et al. (2013)
‣ Originally used for images
‣ Approximate score with a
first-order Taylor series
approxima+on around the
current data point
‣ Higher gradient magnitude = small
change in pixels leads to large
change in predic+on
Sc = score of class c
I0 = current image
Gradient-based Methods
Simonyan et al. (2013)
Integrated Gradients
Sundararajan et al. (2017)
‣ Suppose you have predic+on = A OR B for features A and B. Changing
either feature doesn’t change the predic+on, but changing both
would. Gradient-based method says neither is important
‣ Integrated gradients: compute
gradients along a path from
the origin to the current data
point, aggregate these to
learn feature importance
‣ Now at intermediate points,
increasing “par+al A” or
“par+al B” reveals the
importance of A and B
seg-56
Problems with Seq2seq Models
‣ Need some notion of input coverage or which input words we've
translated
‣ Encoder-decoder models like to repeat themselves:
Un garçon joue dans la neige → A boy plays in the snow boy plays boy plays
‣ Why does this happen?
‣ Models trained poorly
‣ LSTM state is not behaving as expected, so it gets stuck in a "loop"
of generating the same output tokens again and again
Problems with Seq2seq Models
‣ Bad at long sentences: 1) a fixed-size hidden representation doesn't
scale; 2) LSTMs still have a hard time remembering for really long
periods of time
RNNenc: the model we've
discussed so far
RNNsearch: uses attention
Bahdanau et al. (2014)
Problems with Seq2seq Models
‣ Unknown words:
‣ In fact, we don’t want to encode them, we want a way of directly
looking back at the input and copying them (Pont-de-Buis)
‣ Encoding these rare words into a vector space is really hard
Luong et al. (2015)
Aligned Inputs
le film était bon
the movie was great
the movie was great
le film était bon
‣Much less burden on the hidden
state
‣ Suppose we knew the source and
target would be word-by-word
translated
‣ Can look at the corresponding
input word when transla5ng —
this could scale!
le film était bon [STOP]
‣ How can we achieve this without hardcoding it?
Attention
‣ At each decoder state, compute a distribution over source inputs
based on the current decoder state, use that in the output layer
the movie was great le
Smoothing in N-gram LMs
5-gram models work well (given enough data)!
Interpolate / back off across orders: combine P(w | to), P(w | go to), P(w | want to go to), P(w | hate to go to), …
  P(Austin | to) > 0: seen in the data;  P(Austin | want to go to) = 0 if the corpus isn't huge
Absolute Discounting
Reserve mass from seen n-grams to allocate to unseen n-grams: subtract 0 < k < 1 from each seen count
  P_{AD}(A \mid \text{want to go to}) = \frac{\text{count}(\text{want to go to } A) - k}{\text{count}(\text{want to go to})} + \lambda \, P_{backoff}(A \mid \text{to go to})
  \lambda is set so the distribution normalizes: (number of word types seen in this context) \times k / \text{count}(\text{context})
  Example with k = 0.2 in the context "want to go to": Maui 2 → 1.8, class 1 → 0.8, campus 1 → 0.8
Kneser-Ney smoothing: absolute discounting with a backoff distribution chosen so that P(A) > 0
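A minimal sketch of absolute discounting for one context in plain Python; k = 0.2 and the toy counts follow the note above, and `backoff_prob` is an assumed lower-order distribution:

```python
def absolute_discount_prob(word, context_counts, backoff_prob, k=0.2):
    """context_counts: dict of word -> count seen after this context (e.g. 'want to go to')."""
    total = sum(context_counts.values())
    num_types = len(context_counts)
    lam = k * num_types / total              # mass reserved from seen n-grams
    discounted = max(context_counts.get(word, 0) - k, 0) / total
    return discounted + lam * backoff_prob(word)

counts = {"Maui": 2, "class": 1, "campus": 1}   # counts after "want to go to"
p = absolute_discount_prob("Austin", counts, backoff_prob=lambda w: 0.01)
```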
seg-46
What do RNNs produce?
‣ Encoding of each word (each hi) — can pass this to another layer to make
a predic>on (can also pool these to get a different sentence encoding)
=
‣ Encoding of the sentence (final hi/ci) — can pass this a decoder or make a
classifica>on decision about the sentence
the movie was great
‣ RNN can be viewed as a transforma>on of a sequence of vectors into a
sequence of context-dependent vectors
RNN Uses
‣ Transducer: make some predic>on for each element in a sequence
‣ Acceptor/encoder: encode a sequence into a fixed-sized vector and use
that for some purpose
the movie was great
predict sen>ment (matmul + soNmax)
translate
the movie was great
DT NN VBD JJ
paraphrase/compress
output y = score for each tag, then soNmax
Mul>layer Bidirec>onal RNN
‣ Sentence classifica>on
based on concatena>on of
both final outputs
‣ Token classifica>on based on
concatena>on of both direc>ons’ token
representa>ons
the movie was great the movie was great
What do LSTMs return in Pytorch?
the movie was great
‣ Hidden/cell states are a 2-tuple, tensors of size
[num_layers * num_direc>ons, batch size,
dimensionality]
‣ 2x1xdim here
‣ Outputs are a single tensor of size
[seq_len, batch size, num_direc>ons *
hidden_size]
‣ 4x1xdim here
Training RNNs
the movie was great
‣ Loss = nega>ve log likelihood of probability of gold label
P (y|x)
‣ Backpropagate through en>re network, RNN parameters get a gradient update
from each >mestep
‣ Example: sen>ment analysis
Training RNNs
the movie was great
‣ Loss = nega>ve log likelihood of probability of gold predic>ons, summed over
the tags
‣ Loss terms filter back through network
P (ti|x)
‣ Example: language modeling, POS tagging
seg-52
Combinatory Categorial Grammar
‣ Steedman+Szabolcsi (1980s): formalism bridging syntax and seman?cs
‣ Syntac?c categories (for this lecture): S, NP, “slash”
categories
‣ S\NP: “if I combine with an NP on my leJ
side, I form a sentence” — verb
NP S\NP
Eminem sings
e728 λy. sings(y)
S
sings(e728)
‣ Parallel deriva?ons of syntac?c parse and lambda calculus expression
‣ When you apply this, there has to be a
parallel instance of func?on applica?on on
the seman?cs side
Combinatory Categorial Grammar
‣ Steedman+Szabolcsi (1980s): formalism bridging syntax and seman?cs
‣ Syntac?c categories (for this class): S, NP, “slash” categories
‣ S\NP: “if I combine with an NP on my leJ side, I form a sentence” — verb
‣ (S\NP)/NP: “I need an NP on my right and then on my leJ” — verb with a
direct object
NP S\NP
Eminem sings
e728 λy. sings(y)
S
sings(e728)
NP (S\NP)/NP
Oklahoma borders
e101
Texas
e89
NP
λx.λy borders(y,x)
S\NP
λy borders(y,e89)
S
borders(e101,e89)
CCG Parsing
‣ “What” is a very complex type: needs a noun and needs a S\NP to form a
sentence. S\NP is basically a verb phrase (border Texas)
CCG Parsing
‣ “What” is a very complex type: needs a noun and needs a S\NP to form a
sentence. S\NP is basically a verb phrase (border Texas)
‣ Lexicon is highly ambiguous — all the challenge of CCG parsing is in picking the
right lexicon entries
CCG Parsing
‣ Many ways to build these parsers
‣ One approach: run a “supertagger” (tags the sentence with complex labels), then
run the parser
‣ Parsing is easy once you have the tags, so we’ve reduced it to a (hard) tagging
problem
Building CCG Parsers
‣ Training data looks like pairs of sentences and logical forms
What states border Texas λx. state(x) ∧ borders(x, e89)
‣ Texas corresponds to NP | e89 in the logical form (easy to figure out)
(S/(S\NP))/N | λf.λg.λx. f(x) ∧ g(x)‣ What corresponds to
‣ How do we infer that without being told it?
‣ Problem: we don’t know the deriva?on
‣ GENLEX: takes sentence S and logical form L. Break up logical form into chunks
C(L), assume any substring of S might map to any chunk
Zettlemoyer and Collins (2005)
‣ Latent variable model learns how to map input to deriva?ons through the
learning process, supervised only by the correct answer
Seq2seq Seman?c Parsing
‣ Parsers have been pre`y hard to build…
‣ Cons?tuency/graph-based: complex dynamic programs
‣ Transi?on-based: complex transi?on systems
‣ CCG/seman?c parsers: complex syntax/seman?cs interface, challenging
inference, challenging learning
‣ For seman?c parsing in par?cular: bridging the syntax-seman?cs divide results
in structural weirdnesses in parsers, hard to learn the right seman?c grammar
What states border Texas λ x state( x ) ∧ borders( x , e89 )
‣ seq2seq models will provide a way to learn this mapping directly, without
having to model complex deriva?on structures
seg-85
Summariza(on
‣What makes a good summary?
Summariza(on
BAGHDAD/ERBIL, Iraq (Reuters) – A strong earthquake hit large parts of northern Iraq and the
capital Baghdad on Sunday, and also caused damage in villages across the border in Iran where
state TV said at least six people had been killed.
There were no immediate reports of casualLes in Iraq aMer the quake, whose epicenter was in
Penjwin, in Sulaimaniyah province which is in the semi-autonomous Kurdistan region very close
to the Iranian border, according to an Iraqi meteorology official.
But eight villages were damaged in Iran and at least six people were killed and many others
injured in the border town of Qasr-e Shirin in Iran, Iranian state TV said.
The US Geological Survey said the quake measured a magnitude of 7.3, while an Iraqi
meteorology official put its magnitude at 6.5 according to preliminary informaLon.
Many residents in the Iraqi capital Baghdad rushed out of houses and tall buildings in panic.
…
Summariza(on
Indian Express — A massive earthquake of magnitude 7.3 struck Iraq on Sunday, 103 kms
(64 miles) southeast of the city of As-Sulaymaniyah, the US Geological Survey said, reports
Reuters. US Geological Survey iniLally said the quake was of a magnitude 7.2, before
revising it to 7.3.
The quake has been felt in several Iranian ciLes and eight villages have been damaged.
Electricity has also been disrupted at many places, suggest few TV reports.
A massive earthquake of magnitude 7.3 struck Iraq on Sunday. The epicenter was
close to the Iranian border. Eight villages were damaged and six people were killed in
Iran.
Summary
What makes a good summary?
A strong earthquake of magnitude 7.3 struck Iraq and Iran on Sunday. The epicenter
was close to the Iranian border. Eight villages were damaged and six people were
killed in Iran.
Summary
‣ Content selec(on: pick the right content
‣ Right content was repeated within and across documents
‣ Domain-specific (magnitude + epicenter of earthquakes are important)
‣ Genera(on: write the summary
‣ Extraction: pick whole sentences from the document
‣ Compression: compress those sentences but basically just do dele(on
‣ Abstrac(on: rewrite + reexpress content freely
seg-91
Chatbots: The Turing Test
Turing (1950s)
‣ Imita9on game: A and B are locked in rooms and answer C’s ques9ons
via typewriter. Both are trying to act like B
A B
C
B B
trained judge
C trained judge
Original Interpreta9on: Standard Interpreta9on:
‣ The test is not “does this computer seem human-like to random people
with a web browser?”
ELIZA
‣ Created 1964-1966 at
MIT, heavily scripted
‣ DOCTOR script was
most successful:
repeats user’s input,
asks inane ques9ons
Weizenbaum (1966)
ELIZA
(.*) you (.*) me
Why do you think I $2 you?
‣ Iden9fy keyword, iden9fy context, apply transforma9on rule
‣ Very li_le need to generate new content, but can only have one type of
conversa9on
My (.) (.*)
What else comes to mind when you
think of your $1?
Weizenbaum (1966)
Cleverbot
Carpenter (1986)
‣ “Nearest neighbors”: human says statement A, find a human response
in human-human or human-computer chats to statement A, repeat that
User: Hi there cleverbot.
Cleverbot: How are you?
User: Doing okay. what plans do you have today?
Cleverbot: I do revision.
User: What are you revising for?
Cleverbot: Maths history.
User: How old are you?
Cleverbot: That’s none of your business.
User: What is maths history?
Cleverbot: I don’t know, farming maybe
‣ Can ofen give sensible
answers, but the bot doesn’t
really impose high-level
discourse structure
Chatbots as Transla9on
Ritter et al. (2011)
‣ Can treat as a machine transla9on problem: “translate” from current
u_erance to next one
‣ Filter the data, use sta9s9cal measures to prune extracted phrases to get
be_er performance
Chatbots as Transla9on
Ritter et al. (2011)
seg-90
BART
Lewis et al. (2020)
‣ Sequence-to-sequence BERT
variant: permute/mask/delete
tokens, then predict full
sequence autoregressively
‣ For downstream tasks: feed
document into both encoder +
decoder, use decoder hidden
state as output
‣ Treat summarizaLon as a sequence-to-sequence task with this
model
PEGASUS
Zhang et al. (2020)
‣Mask out “gap sentences” and generate them from the context
‣ Strong results, a bit beUer than BART
BART Summaries
Lewis et al. (2020)
‣ These look great! But they’re not always factual
Factuality via Entailment
Goyal and Durrett (2020)
(Figure: a pre-trained encoder model reads the input x and hypothesis h as "Seven games …. arrested last November [SEP] Seven games involving Nimes were arrested last November"; for each arc in the output dependency parse d(h), e.g. nummod, nsubj:pass, amod, obl:tmod, nmod:involving, an arc embedding is built from the head word, child word, and dependency label over the encoder output E(x; h), and a classifier W makes a per-arc entailment decision y = 1 or y = 0)
‣ Attempt to classify individual dependency arcs in generated output as entailed/
not entailed by the input
“Seven games involving Nimes
were investigated after Conrad
was arrested last November”
Factuality via QA
Wang et al. (2020)
seg-84
Annotation Artifacts
Gururangan et al. (2018); Poliak et al. (2018)
‣ Some datasets might be easy because of how they’re constructed
‣ Understanding our datasets is just as important as understanding
our models in order to know what they’re learning
Task: Natural Language Inference
‣ To create neutral sentences: annotators add information
‣ To create contradictions: annotators add negation
Gururangan et al. (2018); Poliak et al. (2018)
‣Models can do very well
without looking at the premise
Performance of models that only look at the hypothesis:
~70% on 3-class SNLI dataset
What do we do about this?
‣Why is this a problem? Because our models learn these simple cues
and not actually the hard task we want them to learn
‣ Solutions: build harder tasks, tweak data or training objective to
inoculate models against this (many proposals)
‣ They don’t generalize to challenging new examples without these
patterns — understanding this behavior is crucial to explaining what
our models are doing!
Other Tasks: Question Answering
Mudrakarta et al. (2018)
‣ Question type is a very powerful indicator. Only a couple of locations
in this context!
Other Tasks: Multi-hop QA
The Oberoi family is an Indian family that is famous for its involvement
in hotels, namely through The Oberoi Group
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
The Oberoi Group is a hotel company with its head office in Delhi.
…
…
(Figure: Doc 1 and Doc 2, with high lexical overlap)
Chen and Durrett (2019)
Other Tasks: SWAG
Zellers et al. (2018)
‣ Answers were filtered using ELMo to only leave answers that were
difficult for a model to reject…but then BERT solved this dataset
easily
seg-53
Seq2seq Models
‣ Semantic parsing:
What states border Texas λ x state( x ) ∧ borders( x , e89 )
‣ Syntactic parsing
The dog ran (S (NP (DT the) (NN dog) ) (VP (VBD ran) ) )
(but what if we produce an invalid tree or one with different words?)
‣ Machine translation, summarization, dialogue can all be viewed in this
framework as well
Seq2seq Models
‣ Encode a sequence into a fixed-sized vector
the movie was great
‣ Now use that vector to produce a series of tokens as output from a
separate LSTM decoder
le film était bon [STOP]
Seq2seq Models
‣ Generate next word conditioned on previous word as well as hidden state
the movie was great
h̄
‣ W size is |vocab| x |hidden state|, softmax over entire vocabulary
Decoder has separate
parameters from encoder, so
this can learn to be a language
model (produce a plausible next
word given current one)
P(y|x) = ∏_{i=1}^{n} P(y_i | x, y_1, …, y_{i−1})
P(y_i | x, y_1, …, y_{i−1}) = softmax(W h̄)
Inference
‣ Generate next word conditioned on previous word and hidden state
the movie was great
‣ During inference: need to compute the argmax over the word
predictions and then feed that to the next RNN state
le
‣ Need to actually evaluate computation graph up to this point to form
input for the next state
‣ Decoder is advanced one state at a time until [STOP] is reached
film était bon [STOP]
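A minimal greedy-decoding sketch of this procedure; the encoder, decoder_step, and token ids are hypothetical placeholders standing in for a trained model, not anything defined in these notes:

```python
import torch

def greedy_decode(encoder, decoder_step, W, src_tokens, sos_id, stop_id, max_len=50):
    """Greedy seq2seq inference sketch (encoder/decoder_step are hypothetical).

    decoder_step(prev_token_id, hidden) advances the decoder one state and returns
    (new_hidden, hbar), where hbar is the decoder hidden state used for the softmax.
    """
    hidden = encoder(src_tokens)          # encode source into the initial decoder state
    prev = sos_id
    output = []
    for _ in range(max_len):
        hidden, hbar = decoder_step(prev, hidden)
        scores = W @ hbar                 # [|vocab| x |hidden|] @ [|hidden|] -> [|vocab|]
        prev = int(torch.argmax(scores))  # argmax word is fed back in at the next step
        if prev == stop_id:               # advance one state at a time until [STOP]
            break
        output.append(prev)
    return output
```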
seg-47
RNN Language Modeling
I saw the dog
h_i
P(w | context) = exp(w · h_i) / ∑_{w′} exp(w′ · h_i)
P(w | context) = softmax(W h_i)
‣W is a (vocab size) x (hidden size) matrix; linear layer in PyTorch (rows
are word embeddings)
(the two forms are equivalent; the output is a vector of word probabilities)
Training RNNLMs
I saw the dog
‣ Input is a sequence of words, output is those words shifted by one
I saw the dog running
‣ Allows us to train on predictions across several timesteps
simultaneously (similar to batching but we don’t call this batching)
Training RNNLMs
I saw the dog
‣ Total loss = sum of negative log likelihoods at each position
‣ In PyTorch: simply add the losses together and call .backward()
P(w|context)
loss = −log P(w*|context)
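A hedged sketch of what this looks like in PyTorch; the model sizes, token ids, and tensors are illustrative assumptions, not from the notes:

```python
import torch
import torch.nn as nn

# Sketch: total loss = sum of negative log likelihoods at each position.
vocab_size, hidden_size = 1000, 64                 # illustrative sizes
embed = nn.Embedding(vocab_size, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size)
W = nn.Linear(hidden_size, vocab_size)             # rows act as output word embeddings
loss_fn = nn.CrossEntropyLoss(reduction="sum")     # -log P(w*|context), summed

inputs  = torch.tensor([[1], [5], [8], [2]])       # "I saw the dog", shape [seq len=4, batch=1]
targets = torch.tensor([[5], [8], [2], [9]])       # the same words shifted by one

hidden_states, _ = lstm(embed(inputs))             # [seq len, batch, hidden]
logits = W(hidden_states)                          # [seq len, batch, vocab]
loss = loss_fn(logits.view(-1, vocab_size), targets.view(-1))
loss.backward()                                    # one backward() call over all positions
```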
Batched LM Training
I saw the dog running in the park and it looked very excited to be there
I saw the dog
I saw the dog running
in the park and
in the park and it
batch dim
‣ Multiple sequences and multiple
timesteps per sequence
looked very excited to be
Batched LM Training
‣ torch.nn.LSTM / torch.nn.GRU: expect input in [seq len, batch, word dim]
format
executed in parallel
‣ Cannot parallelize across timesteps
of RNN since output depends on
previous timesteps, so using larger
batches gives better parallelism
‣ Input: [4, 2, dim]
I saw the dog
in the park and
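A small shape check for the [seq len, batch, word dim] convention; the dimensions are arbitrary illustrative choices:

```python
import torch
import torch.nn as nn

# Two sequences of 4 tokens each, batched along dim 1: [seq len=4, batch=2, dim=16]
word_dim, hidden_dim = 16, 32
x = torch.randn(4, 2, word_dim)     # e.g. "I saw the dog" / "in the park and"
lstm = nn.LSTM(word_dim, hidden_dim)
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)                # torch.Size([4, 2, 32]): one output per timestep
print(h_n.shape)                    # torch.Size([1, 2, 32]): final hidden state per sequence
```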
Other Implementation Details
‣ torch.nn.Embedding: maps sequence of word indices to vectors
‣ [126, 285] -> [[0.1, -0.07, 1.2],
[-2.3, 0.2, 1.4]]
‣ Moves from a [sequence length] vector of indices -> [seq len, dim]
tensor, or a [batch, sequence length] matrix -> [batch, seq len, dim]
tensor
LM Evaluation: Perplexity
‣ Accuracy doesn’t make sense — predicting the next word is generally
impossible so accuracy values would be very low
‣ Evaluate LMs on the likelihood of held-out data (averaged to
normalize for length)
(1/n) ∑_{i=1}^{n} log P(w_i | w_1, …, w_{i−1})
‣ Perplexity: exp(average negative log likelihood). Lower is better
‣ Suppose we have probs 1/4, 1/3, 1/4, 1/3 for 4 predictions
‣ Avg NLL (base e) = 1.242 Perplexity = 3.464 <== geometric mean of
denominators
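The worked example can be checked directly (matching the 3.464 above):

```python
import math

# Perplexity = exp(average negative log likelihood); equals the geometric mean of 4, 3, 4, 3
probs = [1/4, 1/3, 1/4, 1/3]
avg_nll = -sum(math.log(p) for p in probs) / len(probs)
print(avg_nll)            # ~1.242
print(math.exp(avg_nll))  # ~3.464
```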
seg-51
Montague Semantics
Id Name Alias Birthdate Sings?
e470 Stefani Germanotta Lady Gaga 3/28/1986 T
e728 Marshall Mathers Eminem 10/17/1972 T
‣ Database containing entities, predicates, etc.
‣ Sentence expresses something about the world which is either true or
false
NP VP
NNP NNP
S
VBP
Lady Gaga sings
‣ Denotation: evaluation of some expression against this database
[[Lady Gaga]] = e470
denotation of this string is an entity
[[sings(e470)]] = True
denotation of this expression is T/F
Montague Semantics
NP VP
NNP NNP
S
VBP
Lady Gaga sings
e470
λy. sings(y)
takes one argument (y, the entity) and
returns a logical form sings(y)
λy. sings(y)
sings(e470)
‣ We can use the syntactic parse as a bridge to the lambda-calculus
representation, build up a logical form (our model) compositionally
function application: apply this to e470
ID
Parses to Logical Forms
NP
VP
NNP NNP
S
VBP
Lady Gaga
sings
e470
λy. sings(y)
sings(e470) ∧ dances(e470)
VP
CC VP
VBP
dances
λy. dances(y)
and
VP: λy. a(y) ∧ b(y) -> VP: λy. a(y) CC VP: λy. b(y)
λy. sings(y) ∧ dances(y)
‣ General rules:
S: f(x) -> NP: x VP: f
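A toy rendering of this compositional process in Python; the encoding of entities and logical forms is an assumption made for this sketch, not a real semantic parser, though it mirrors the slide's example:

```python
# Toy compositional semantics: build logical forms bottom-up from the parse rules above.
LADY_GAGA = "e470"

def sings(y):             # denotation of "sings": entity -> logical form
    return ("sings", y)

def dances(y):
    return ("dances", y)

def coordinate(vp1, vp2):         # VP: λy. a(y) ∧ b(y)  <-  VP CC VP
    return lambda y: ("and", vp1(y), vp2(y))

def apply_S(np, vp):              # S: f(x)  <-  NP: x  VP: f   (function application)
    return vp(np)

print(apply_S(LADY_GAGA, sings))                      # ('sings', 'e470')
print(apply_S(LADY_GAGA, coordinate(sings, dances)))  # ('and', ('sings', 'e470'), ('dances', 'e470'))
```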
Parses to Logical Forms
NP
NNP NNP
S
VBD
Lady Gaga was
e470
λx.λy. born(y,x)
born(e470,3/28/1986)
VP
NP
March 28, 1986born
λy. born(y, 3/28/1986)
VBN
VP
λy. born(y, 3/28/1986)
‣ How to handle tense: should we indicate that this happened in the past?
‣ Function takes two arguments: first x (date), then y (entity)
3/28/1986
Tricky Things
‣ Adverbs/temporality: Lady Gaga sang well yesterday
∃e. type(e,sing) ∧ agent(e,e470) ∧ manner(e,well) ∧ …
‣ “Neo-Davidsonian” view of events: things with many properties:
‣ Quantification: Everyone is friends with someone
‣ Generic: Cats eat mice (all cats eat mice? most cats? some cats?)
∀x ∃y friend(x,y) (different friends)    ∃y ∀x friend(x,y) (one friend)
‣ Same syntactic parse for both! Syntax doesn’t resolve all ambiguities
sings(Lady Gaga, time=yesterday, manner=well)
‣ Indefinite: Amy ate a waffle ∃w. waffle(w) ∧ ate(Amy,w)
Semantic Parsing
‣ For question answering, syntactic parsing doesn’t tell you
everything you want to know, but indicates the right structure
‣ Solution: semantic parsing: many forms of this task depending on
the semantic formalism
‣ Applications: database querying/question answering: produce lambda-
calculus expressions that can be executed in these contexts
seg-45
Gated Connections
‣ Designed to fix “vanishing gradient” problem using gates
‣ Vector-valued “forget gate” f computed based
on input and previous hidden state
‣ Sigmoid: elements of f are in (0, 1)
f = σ(W^{xf} x_t + W^{hf} h_{t−1})
gated: h_t = h_{t−1} ⊙ f + func(x_t)
Elman: h_t = tanh(W x_t + V h_{t−1} + b_h)
‣ If f ≈ 1, we simply sum up a function of all inputs — gradient doesn’t
vanish! More stable without matrix multiply (V) as well
LSTMs
‣ “Cell” c in addition to hidden state h
‣ Vector-valued forget gate f depends on the h hidden state
‣ Basic communication flow: x -> c -> h -> output, each step of this process is gated in
addition to gates from previous timesteps
c_t = c_{t−1} ⊙ f + func(x_t, h_{t−1})
f = σ(W^{xf} x_t + W^{hf} h_{t−1})
‣ “Long short-term memory” network: hidden state is a “short-term” memory
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg Primer
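Written out from the gate equations above, one LSTM step looks roughly like the following sketch; the parameter names are illustrative, and torch.nn.LSTMCell provides the real implementation:

```python
import torch

def lstm_cell(x_t, h_prev, c_prev, params):
    """One LSTM step from the gate equations (illustrative sketch, not library code).

    params is a dict of weight matrices, e.g. params["W_xf"], params["W_hf"] for the
    forget gate, and similarly for input gate i, output gate o, and candidate g.
    """
    f = torch.sigmoid(params["W_xf"] @ x_t + params["W_hf"] @ h_prev)  # forget gate
    i = torch.sigmoid(params["W_xi"] @ x_t + params["W_hi"] @ h_prev)  # input gate
    o = torch.sigmoid(params["W_xo"] @ x_t + params["W_ho"] @ h_prev)  # output gate
    g = torch.tanh(params["W_xg"] @ x_t + params["W_hg"] @ h_prev)     # main computation
    c_t = c_prev * f + i * g          # gated cell update: old cell kept where f ≈ 1
    h_t = o * torch.tanh(c_t)         # gated output / new hidden state
    return h_t, c_t
```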
LSTMs
(Figure: LSTM cell with input x_j, gates f, i, o and candidate g; hidden state h_{j-1} -> h_j, cell state c_{j-1} -> c_j)
‣ f, i, o are gates that control information flow
‣ g reflects the main computation of the cell
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg Primer
LSTMs
‣ Is an LSTM able to “throw out” the previous value of c at this timestep?
‣ Can we ignore a particular input x?
‣ Can we sum up the inputs x?
‣ Can we output something without changing c?
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Goldberg Primer
LSTMs
‣ Ignoring recurrent state entirely:
‣ Lets us get feedforward layer over token
‣ Ignoring input:
‣ Lets us discard stopwords
‣ Summing inputs:
‣ Lets us compute a bag-of-words
representation
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs
‣ Gradient still diminishes, but in a controlled way and generally by less — can
initialize forget gate = 1 to remember everything to start
(Figure: similar gradient magnitude flows backward through each timestep)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Other Units
‣ Gated recurrent unit (GRU): simpler architecture without c (“update” and “reset” gates)
‣ Other variants of LSTMs: multiplicative LSTMs, rotational unit of memory (RUM), …
‣ Quantized (binary/ternary weights) and other compressed forms
seg-79
Multi-hop QA
‣ Very few SQuAD questions require actually combining multiple
pieces of information — this is an important capability QA
systems should have
‣ Several datasets test multi-hop reasoning: the ability to answer
questions that draw on several sentences or several documents
Welbl et al. (2018); Yang et al. (2018)
‣WikiHop: annotators
given multiple
documents
HotpotQA
Yang et al. (2018)
Meet Corliss Archer is an American television sitcom that aired on CBS …
Question: What government position was held by the woman who portrayed
Corliss Archer in the film Kiss and Tell ?
Shirley Temple Black was an American actress, businesswoman, and singer …
Kiss and Tell is a comedy film in which 17-year-old Shirley Temple acts as
Corliss Archer .
…
…
As an adult, she served as Chief of Protocol of the United States
(Figure: Doc 1, Doc 2, and Doc 3, linked by shared entities)
Multi-hop Reasoning
Yang et al. (2018)
The Oberoi family is an Indian family that is famous for its involvement
in hotels, namely through The Oberoi Group
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
The Oberoi Group is a hotel company with its head office in Delhi.
…
…
(Figure: Doc 1 and Doc 2, linked by the same entity)
This is an idealized version of multi-hop reasoning. Do models need to do this to
do well on this task?
Multi-hop Reasoning
Yang et al. (2018)
The Oberoi family is an Indian family that is famous for its involvement
in hotels, namely through The Oberoi Group
Question: The Oberoi family is part of a hotel company that has a head office
in what city?
The Oberoi Group is a hotel company with its head office in Delhi.
…
…
(Figure: Doc 1 and Doc 2, with high lexical overlap)
‣ The model can ignore the bridging entity and directly predict the answer
Multi-hop Reasoning
Meet Corliss Archer is an American television sitcom that aired on CBS …
Question: What government position was held by the woman who portrayed
Corliss Archer in the film Kiss and Tell ?
Shirley Temple Black was an American actress, businesswoman, and singer …
Kiss and Tell is a comedy film in which 17-year-old Shirley Temple acts as
Corliss Archer .
…
…
As an adult, she served as Chief of Protocol of the United States
(Figure: Doc 1, Doc 2, and Doc 3, linked by shared entities)
No simple lexical overlap.
…but only one government position appears in the context!
Weaknesses of Multi-hop Reasoning
Chen and Durrett (2019); Min et al. (2019)
(Figure: accuracy on the multiple-choice task for Majority-candidate, BiDAF, NoContext, Entity-GCN, CFC, and BAG (weak baselines vs. state-of-the-art); reported values are 38.8, 42.9, 59.3, 64.8, 66.4, and 67.4)
‣ Can get >60%
accuracy on a
multiple choice task
without using the
context
Weaknesses of Multi-hop Reasoning
Chen and Durrett (2019); Min et al. (2019)
‣ Can get >50%
accuracy on a
tougher task looking
at sentences in
isolation
(Figure: F1 for BiDAF++, QFE, GRN, DFGN, and a sentence-factored model; reported values are 50.8, 58.7, 68.1, 69.0, and 69.7)
State-of-the-art Models
Asai et al. (2020)
‣ Best systems: use hyperlink structure of Wikipedia and a strong
multi-step retrieval model built on BERT
seg-92
Neural Chatbots
Sordoni et al. (2015)
What are you doing
I am going home [STOP]
‣ Just like conventional MT, can train seq2seq models for this task
‣ Hard to evaluate:
OpenSubtitles Training Data
do you want to meet your sponsor for the last 10 years ?
of course ! but he doesn’t want to see me !
yeah , we were just going to hit up taco bell .
well , it’ s my pleasure .
and where had you been just before ?
i’ d been to the palace of the legion of honor , the art gallery .
‣Why might this model perform poorly? What might it be bad at?
Lack of Diversity
Li et al. (2016)
‣ Training to maximize likelihood gives a system that prefers common
responses:
Lack of Diversity
Li et al. (2016)
‣ Solution: mutual information criterion; response R should be
predictive of user utterance U as well
‣ Standard conditional likelihood: log P(R|U)
‣ Mutual information: log [P(R,U) / (P(R) P(U))] = log P(R|U) − log P(R)
‣ log P(R) = probabilities under a language model
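One common way to apply this criterion, in the spirit of the MMI reranking Li et al. describe, is to generate candidate responses and rerank them by log P(R|U) − λ log P(R); the scoring functions and numbers below are illustrative assumptions:

```python
def mmi_rerank(candidates, cond_logprob, lm_logprob, lam=0.5):
    """Rerank candidate responses by log P(R|U) - lambda * log P(R).

    cond_logprob(R) and lm_logprob(R) are hypothetical scoring functions backed by
    the seq2seq model and a separate language model; lam trades off diversity.
    """
    scored = [(cond_logprob(r) - lam * lm_logprob(r), r) for r in candidates]
    return max(scored)[1]

# Toy usage with made-up scores: the generic "I don't know" has high LM probability,
# so it is penalized relative to the more specific response.
cands = ["I don't know", "Going to the store"]
cond = {"I don't know": -2.0, "Going to the store": -3.0}
lm = {"I don't know": -1.0, "Going to the store": -6.0}
print(mmi_rerank(cands, cond.__getitem__, lm.__getitem__))  # -> "Going to the store"
```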
Lack of Diversity
Li et al. (2016)
Specificity
Ko, Durrett, Li (2019)
What are you doing
I don’t know [STOP]
SPECIFICITY=1 (nonspecific)
‣ When training the decoder, condition on the automatically predicted
specificity of the response
I don’t know => SPECIFICITY=1
‣ Train a specificity classifier on labeled data
Going to the store => SPECIFICITY=3
Specificity
Ko, Durrett, Li (2019)
What are you doing
I don’t know [STOP]
SPECIFICITY=1 (nonspecific)
What are you doing
Going to the store [STOP]
SPECIFICITY=4 (specific)
‣ At test time, set
the specificity
level higher to
get less generic
responses
PersonaChat
Zhang et al. (2018)
Meena
Adiwardana et al. (2020)
‣ 2.6B-parameter seq2seq model (larger than GPT-2)
‣ Trained on 341GB of online conversations scraped from public social
media
‣ Sample responses:
Blender
Roller et al. (2020)
‣ 2.7B-param model (like the previous one), also 9.4B-parameter
seq2seq model
‣ “Poly-encoder” Transformer architecture, some training tricks
‣ Three models: retrieve (from training data), generate, retrieve-and-refine
‣ Fine-tuning on three prior datasets: PersonaChat, Empathetic Dialogues
(discuss personal situation, listener is empathetic), Wizard of Wikipedia
(discuss something from Wikipedia)
Blender
Roller et al. (2020)
Blender
Roller et al. (2020)
‣ Inconsistent responses: this
model doesn’t really have
anything to say about itself
‣ Holding a conversation != AI
‣ Can’t acquire new information
‣ Did it learn “fun guy”? No, it
doesn’t understand
phonology. It probably had
this in the data somewhere
seg-86
Extractive Summarization: MMR
Carbonell and Goldstein (1998)
‣ Given some articles and a length budget of k words, pick some
sentences of total length <= k and make a summary
‣ Pick important yet diverse content: maximum marginal relevance (MMR)
While summary is < k words:
    Calculate MMR for each sentence not yet in the summary
    Add the highest-MMR sentence that doesn’t overflow the length
In the formula below: Sim1(Di, Q) says “make this sentence similar to a query”; the subtracted max over Dj ∈ S says “make this sentence maximally different from all others added so far”; and the argmax ranges over “all sentences not yet in the summary”.
MMR = argmax_{D_i ∈ R \ S} [ λ Sim1(D_i, Q) − (1 − λ) max_{D_j ∈ S} Sim2(D_i, D_j) ]
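A compact sketch of this greedy loop; the similarity functions, λ value, and word-count accounting are illustrative assumptions:

```python
def mmr_summarize(sentences, query, sim1, sim2, k, lam=0.7):
    """Greedy MMR selection: important yet diverse sentences, up to k words total.

    sim1(sentence, query) scores relevance, sim2(s1, s2) scores redundancy;
    both similarity functions and lam are illustrative assumptions.
    """
    summary, length = [], 0
    candidates = list(sentences)
    while candidates and length < k:
        def mmr(s):
            redundancy = max((sim2(s, t) for t in summary), default=0.0)
            return lam * sim1(s, query) - (1 - lam) * redundancy
        # Highest-MMR sentence that doesn't overflow the length budget
        fitting = [s for s in candidates if length + len(s.split()) <= k]
        if not fitting:
            break
        best = max(fitting, key=mmr)
        summary.append(best)
        candidates.remove(best)
        length += len(best.split())
    return summary
```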
Centroid
‣ Represent the documents and each sentence as bag-of-words with TF-IDF weighting
While summary is < k words:
    Calculate score(sentence) = cosine(sent-vec, doc-vec)
    Discard all sentences whose similarity with some sentence already in the summary is too high
    Add the best remaining sentence that won’t overflow the summary
Erkan and Radev (2004)
Summarization
‣ Count number of documents each bigram occurs in to measure importance
score(massive earthquake) = 3    score(magnitude 7.3) = 2
score(six killed) = 2    score(Iraqi capital) = 1
‣ ILP formulation: c and s are indicator variables indexed over bigrams (“concepts”) and sentences, respectively
“set c_i to 1 iff some sentence that contains it is included”
sum of included sentences’ lengths can’t exceed L
‣ Find summary that maximizes the score of bigrams it covers
Gillick and Favre (2009)
Evaluation: ROUGE
‣ ROUGE-n: n-gram precision/recall/F1 of summary w.r.t. gold standard
reference: An earthquake was detected in Iraq on Sunday
prediction: A massive earthquake of magnitude 7.3 struck Iraq on Sunday
ROUGE-2 recall = 1 correct bigram (Iraq, Sunday) / 4 reference bigrams
ROUGE-2 precision = 1 correct bigram (Iraq, Sunday) / 6 predicted bigrams
‣ ROUGE-2 correlates somewhat well with human judgments for multi-document summarization tasks
‣ Many hyperparameters: stemming, remove stopwords, etc.
‣ Historically: ROUGE recall @ k {words, characters}. Now: ROUGE F1
Lin (2004)
Results
Ghalandari (2017)
(Results: a better centroid method scores 38.58 / 9.73 / 1.53, compared against Gillick and Favre / bigram recall)
‣ Caveat: these techniques all work better for multi-document than single-document!
Multi-document vs. Single-document
‣ “a massive earthquake hit Iraq” / “a massive earthquake struck Iraq” — lots of redundancy to help select content in the multi-document case
‣ When you have a lot of documents, there are more possible sentences to extract:
But eight villages were damaged in Iran and at least six people were killed and many others injured in the border town of Qasr-e Shirin in Iran, Iranian state TV said.
The quake has been felt in several Iranian cities and eight villages have been damaged.
‣ Multi-document summarization is easier?
seg-87
Neural Extractive Methods
Nallapati et al. (2017)
‣ Most neural methods: encode sentences (with pooling), classify each sentence as belonging to summary or not
‣ Some sort of step (sequence model, ranking) to restrict to at most k sentences (not k words)
BERTSum
Liu et al. (2019)
‣ Final “summarization” layer doesn’t matter that much
‣ LEAD baseline: take the first few document sentences. Very strong for single-document summarization!
seg-93
Task-oriented Dialogue
Siri, find me a good sushi restaurant in Chelsea
Sushi Seki Chelsea is a sushi restaurant in Chelsea with 4.4 stars on Google
How expensive is it?
Entrees are around $30 each
Find me something cheaper
‣ Personal assistants / API front-ends
Task-oriented Dialogue
Hey Alexa, why isn’t my Amazon order here?
Let me retrieve your order. Your order was scheduled to arrive at 4pm today.
It never came
Okay, I can put you through to customer service.
‣ Customer service
ATIS
DARPA (early 1990s), Figure from Tur et al.
(2010)
‣ Given an utterance, predict a domain-specific semantic interpretation
‣ Can formulate as semantic parsing, but simple slot-filling solutions (classifiers) work well too
Intents
‣ 29 different intents
what days of the week do flights from san jose to nashville fly on
does tacoma airport offer transportation from the airport to the downtown area
which flights go from cleveland to indianapolis on april fifth
what meals are served on american flight 811 from tampa to milwaukee
(Intent labels from the figure: flight, ground_service, day_name, meal)
Full Dialogue Task
Find me a good sushi restaurant in Chelsea
restaurant_type <- sushi
location <- Chelsea
curr_result <- execute_search()
Sushi Seki Chelsea is a sushi restaurant in Chelsea with 4.4 stars on Google
How expensive is it?
get_value(cost, curr_result)
Entrees are around $30 each
‣ Involves slot filling, API calls, and generation
Dataflow Graphs for Dialogue
Semantic Machines; Andreas et al. (2020)
Training Dialogue Models
Find me a good sushi restaurant in Chelsea
restaurant_type <- sushi
location <- Chelsea
curr_result <- execute_search()   { wizard enters these
Sushi Seki Chelsea is a sushi restaurant in Chelsea with 4.4 stars on Google   { wizard types this out or invokes templates
‣ Learning from demonstrations: “wizard” pulls the levers and makes the dialogue system update its state and take actions
Goal-oriented Dialogue
‣ Big companies: Apple Siri, Google Assistant, Amazon Alexa, Microsoft Cortana/Semantic Machines, Facebook, Samsung Bixby, Tencent WeChat, ASAPP
‣ Tons of industry interest!
‣ Dozens of startups + medium-sized companies in this space
seg-78
Open-Domain QA
‣ SQuAD-style QA is very artificial, not really a real application
‣ Real QA systems should be able to handle more than just a paragraph of context — theoretically should work over the whole web?
Q: What was Marie Curie the recipient of?
Marie Curie was awarded the Nobel Prize in Chemistry and the Nobel Prize in Physics…
Mother Teresa received the Nobel Peace Prize in…
Curie received his doctorate in March 1895…
Skłodowska received accolades for her early work…
Open-Domain QA
‣ QA pipeline: given a question:
‣ Retrieve some documents with an IR system
‣ Zero in on the answer in those documents with a QA model
‣ This also introduces more complex distractors (bad answers) and should require stronger QA systems
DrQA
Chen et al. (2017)
‣ How often does the retrieved context contain the answer? (uses Lucene)
‣ Full retrieval results using a QA model trained on SQuAD: the task is much harder
Retrieval with BERT
Lee et al. (2019)
‣ Can we do better than a simple IR system?
‣ Encode the query with BERT, pre-encode all paragraphs with BERT; the query is basically nearest neighbors
Natural Questions
Kwiatkowski et al. (2019)
‣ Many SQuAD questions are not suited to the “open” setting because they’re underspecified: Where did the Super Bowl take place?
‣ Questions arose from Google searches, unlike SQuAD questions which were written by people looking at a passage
‣ Short answer F1s < 60, long answer F1s < 75
‣ Natural Questions: real questions, answerable with Wikipedia
seg-44
RNN Motivation
‣ Feedforward NNs can’t handle variable-length input: each position in the feature vector has fixed semantics
‣ Instead, we need to: 1) Process each word in a uniform way
the movie was great
that was great !
2) …while still exploiting the context that that token occurs in
‣ These don’t look related (great is in two different orthogonal subspaces)
RNN Abstraction
‣ Cell that takes some input x, has some hidden state h, and updates that hidden state and produces output y (all vector-valued)
(Figure: previous h (and previous c) plus input x go in; next h (and next c) plus output y come out)
‣ Optionally: cell state c (used in LSTMs but not all architectures)
RNN Uses
‣ Transducer: make some prediction for each element in a sequence (the movie was great -> DT NN VBD JJ; output y = score for each tag, then softmax)
‣ Acceptor/encoder: encode a sequence into a fixed-sized vector and use that for some purpose (predict sentiment with matmul + softmax, translate, paraphrase/compress)
Elman Networks
‣ Computes output from hidden state: y_t = tanh(U h_t + b_y)
‣ Updates hidden state based on input and current hidden state: h_t = tanh(W x_t + V h_{t−1} + b_h)
‣ Long history! (invented in the late 1980s)
Training Elman Networks
the movie was great -> predict sentiment
‣ “Backpropagation through time”: build the network as one big computation graph, some parameters are shared
‣ RNN potentially needs to learn how to “remember” information for a long time!
it was my favorite movie of 2016, though it wasn’t without problems -> +
‣ “Correct” parameter update is to do a better job of remembering the
sentiment of favorite
Vanishing Gradient
‣ Gradient diminishes going through tanh; if not in [-2, 2], gradient is almost 0
(Figure: the gradient shrinks going backward through each tanh cell: gradient, smaller gradient, tiny gradient)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
‣ Repeated multiplication by V causes problems
h_t = tanh(W x_t + V h_{t−1} + b_h)
seg-50
Semantics
‣ Our syntactic representations don’t say much about word meaning
‣ Grounding: associate words with real-world concepts (e.g., red means certain RGB values coming in from a sensor)
‣ Here: formal semantics. Connect language to logical representations of what it means, then worry about grounding those later
Model-Theoretic Semantics
‣ Key idea: can interpret natural language expressions as set-theoretic expressions called models of those sentences
‣ Natural language statement S => interpretation of S that models it
‣ We can now concretely define things like entailment: A entails B if in all
worlds where A is true, B is true
She likes going to that restaurant
‣ Interpretation: defines who she and that restaurant are, making it able
to be concretely evaluated with respect to a world
‣ Our modeling language is first-order logic
First-order Logic
‣ sings is a predicate (with one argument), function f: entity → true/false
‣ Powerful logic formalism including things like entities, relations, and
quantification
‣ [[sings]] = denotation, set of entities which sing (found by executing this
predicate on the world — we’ll come back to this)
Lady Gaga sings
‣ sings(Lady Gaga) = true or false, have to execute this against some
database (world)
Quantification
‣ Universal quantification: “forall” operator
∀x sings(x) ∨ dances(x) → performs(x)   “Everyone who sings or dances performs”
‣ Existential quantification: “there exists” operator
∃x sings(x)   “Someone sings”
‣ Source of ambiguity! “Everyone is friends with someone”:
∀x ∃y friend(x,y)   or   ∃y ∀x friend(x,y)
Logic in NLP
‣ Question answering:
Who are all the American singers named Amy?
λx. nationality(x,USA) ∧ sings(x) ∧ firstName(x,Amy)
‣ Function that maps from x to true/false, like filter. Execute this on the
world to answer the question
‣ Information extraction: Lady Gaga and Eminem are both musicians
musician(Lady Gaga) ∧ musician(Eminem)
‣ Can now do reasoning. Maybe know:
∀x musician(x) => performer(x)
Then: performer(Lady Gaga) ∧ performer(Eminem)
‣ Lambda calculus: powerful system for expressing these functions
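The "execute on the world" step can be mimicked as a filter over a toy database; the entities and attributes below are invented purely for illustration:

```python
# Toy world: execute λx. nationality(x, USA) ∧ sings(x) ∧ firstName(x, Amy) as a filter.
WORLD = [
    {"id": "e101", "firstName": "Amy", "nationality": "USA", "sings": True},
    {"id": "e102", "firstName": "Amy", "nationality": "UK",  "sings": True},
    {"id": "e103", "firstName": "Bob", "nationality": "USA", "sings": False},
]

query = lambda x: x["nationality"] == "USA" and x["sings"] and x["firstName"] == "Amy"
print([e["id"] for e in WORLD if query(e)])   # -> ['e101']
```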