Natural Language Processing
CMPSC 442
Week 13, Meeting 38, Three Segments
Outline
● Early Decades
● Shift to Machine Learning Paradigm
● NLP Deep Learning: Excerpts from Mirella Lapata 2017 Keynote
Natural Language Processing
CMPSC 442
Week 13, Meeting 38, Segment 1: Early Decades
Early Vision
● The Ultimate Goal – For computers to use natural language as effectively as humans do . . .
1989 White paper for DARPA on NLP
Application Areas
● Reading and writing text
○ Summarization
○ Extraction into Databases (Information Extraction)
○ Question Answering (as distinct from Information retrieval/search)
● Interactive spoken dialogue for human-machine interaction
○ Informal Speech Input and Output
○ Dialogue ≠ Alexa/Google Assistant/Siri
● Translation: Input and Output Across Different Human Languages
● Natural Language Generation (Translation from symbols to language)
Speech: Continuous Acoustic Energy
● Acoustic models translate continuous acoustic energy to units of sound: many possible analyses for one utterance
● Language models translate combinations of sound to words and phrases: many possible analyses
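This two-stage picture is usually formalized with the standard noisy-channel decomposition for speech recognition (not spelled out on the slide): choose the word sequence W that best explains the acoustic signal A,

\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)

where P(A | W) is the acoustic model and P(W) is the language model.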
An Early Interactive NLP System
● SHRDLU (Winograd, 1968) had a text-based interface
● Could answer questions and simulate actions on the blocks
Another Early Interactive NLP System
● Lunar (Woods, 1971): NLP database access to 1971 lunar samples
● Handled 78% of sentences typed by geologists at 1971 Lunar Rocks conference
○ What is the average concentration of aluminum in high alkali rocks?
○ How many breccias contain olivine?
○ Give me the modal analyses of those samples for all phases.
Apollo 11 astronauts Buzz Aldrin, Michael Collins & Neil Armstrong showing a moonrock to the director of the Smithsonian
Information Extraction
Goal: extract structured facts from text
1 – Label Named Entities (NEs)
2 – Assign Relations between NEs
3 – Create database entries
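A minimal Python sketch of the three steps, using spaCy for step 1; the example sentence, the lexical pattern for the relation, and the record schema are purely illustrative, not part of the lecture.

# Hypothetical illustration of the IE pipeline: NER, relation assignment,
# then a database-style record.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Google acquired DeepMind in 2014.")

# Step 1: label named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)   # e.g. [('Google', 'ORG'), ('DeepMind', 'ORG'), ('2014', 'DATE')]

# Step 2: assign a relation between NEs with a crude lexical pattern (toy rule)
relation = None
orgs = [text for text, label in entities if label == "ORG"]
if "acquired" in doc.text and len(orgs) >= 2:
    relation = ("ACQUIRED", orgs[0], orgs[1])

# Step 3: create a database-style entry
if relation:
    record = {"relation": relation[0], "arg1": relation[1], "arg2": relation[2]}
    print(record)   # e.g. {'relation': 'ACQUIRED', 'arg1': 'Google', 'arg2': 'DeepMind'}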
Zipf’s Law: A Few Words are Everywhere
● Word rank by word frequency on log-log scale for first 10M words in 30 Wikipedias
● Accounts for 80/20 rule (Pareto principle): “80% of the effects come from 20% of the causes”
○ 80% of the text comes from 20% of the words in the vocabulary
○ Very long-tailed distribution of many, many relatively rare words
● Most of language follows Zipf’s law, making it relatively easy for ML to handle 80-90% of cases and quite difficult to handle the rest
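A quick way to check the Zipf pattern on any text (a sketch; corpus.txt is a placeholder file name):

# Count word frequencies and print rank * frequency, which stays roughly
# constant under Zipf's law (frequency is approximately proportional to 1/rank).
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower().split()  # placeholder corpus
counts = Counter(text)
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    print(rank, word, freq, round(rank * freq / len(text), 3))  # last column ~constant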
Ambiguity and Language Efficiency (Piantadosi)
● Language processing evidence suggests that use of contextual information for inference is very rapid
● Capitalizing on re-use of the same word forms for different meanings makes language very efficient
● Ambiguity of words is highest for monosyllabic words, i.e., lowest production effort
● Methods do not yet exist for machines to handle contextual information well
Natural Language Processing
CMPSC 442
Week 13, Meeting 38, Segment 2: Scaling Up through
Machine Learning
Motivation for Syntax and Compositional Semantics
● European languages have more or less fixed word order inside phrases
● Some substrings can stand alone; others cannot
While white is the coolest summer shade, there are lots of pastel hues along with tintable fabrics that will blend with any wardrobe color.
While white is the coolest summer shade, along with tintable fabrics there are lots of pastel hues that will blend with any wardrobe color.
There are lots of pastel hues along with tintable fabrics that will blend with any wardrobe color, while white is the coolest summer shade.
There are lots of pastel hues along with tintable fabrics
white is the coolest summer shade that will blend with any wardrobe color
Phrase Structure Parse Trees
● There are many different grammar formalisms
● Context Free Grammar (CFG)
S → NP VP
NP → Pro
NP → Det Nom
VP → V NP
Nom → Noun
Pro → {I, you, . . .}
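The rules above can be run directly. A minimal sketch with NLTK; the toy lexicon entries (the Pro, Det, Noun, and V terminals) are my additions for the example sentence:

# Parse a sentence with the CFG rules from the slide plus a tiny toy lexicon.
import nltk

grammar = nltk.CFG.fromstring("""
  S    -> NP VP
  NP   -> Pro | Det Nom
  VP   -> V NP
  Nom  -> Noun
  Pro  -> 'I' | 'you'
  Det  -> 'the'
  Noun -> 'patient'
  V    -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the patient".split()):
    tree.pretty_print()   # prints the phrase-structure tree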
Penn Treebank: 1988-1994
First large, annotated NLP corpus
● Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor
● 1M Brown corpus + 1M Switchboard corpus + 1.3M Wall Street Journal
● All tagged with part-of-speech and syntax; consensus on syntactic structure
● Finished before it had practical use
Parsers Trained on Penn TreeBank
● 10⁶-word TreeBank + ML = Robust parsing
Dramatic Gains in Parsing Coverage and Accuracy
Gaps between CFG Syntax and Logical Form
● Active versus passive constructions (see the dependency-parse sketch after this list)
○ The doctor saw the patient (Subject of the verb is the agent of “to see”)
○ The patient was seen by the doctor (Subject of the verb is not the agent)
● Syntactically elided arguments
○ The doctor decided to see the patient. (Elided subject of “to see”: doctor)
○ The doctor persuaded the patient to exercise. (Elided subject of “exercise”: patient)
● Expletive subjects (Logical form has no “agent”)
○ It rained.
○ There was a problem.
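The active/passive gap can be seen directly in a dependency parse. A short sketch with spaCy (dependency labels may vary slightly across model versions):

# Compare grammatical subjects in the active and passive sentences:
# the surface subject differs, but the logical agent of "see" is the doctor in both.
import spacy

nlp = spacy.load("en_core_web_sm")
for sent in ["The doctor saw the patient.", "The patient was seen by the doctor."]:
    doc = nlp(sent)
    print(sent)
    print([(tok.text, tok.dep_, tok.head.text) for tok in doc])
    # active:  ('doctor', 'nsubj', 'saw')            -> surface subject is the agent
    # passive: ('patient', 'nsubjpass', 'seen') and  -> surface subject is not the agent;
    #          ('doctor', 'pobj', 'by')                 the agent appears in the by-phrase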
A Dual Formalism: Combinatory Categorial Grammar
● Categories: atomic elements or functions (alternative to POS)
○ Atomic elements: fewer than Penn Treebank POS tagset
○ Some words are syntactic functions: e.g., prove: (S\NP)/NP
○ Every syntactic rule is paired with a semantic rule
proved := (S\NP3s)/NP : λx.λy.prove′xy
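A toy Python illustration of how the paired semantics composes by function application; the NP fillers and the argument order shown for prove′ are illustrative assumptions, not part of the slide:

# The category (S\NP)/NP says: combine with an object NP on the right,
# then with a subject NP on the left. The paired lambda term composes the same way.
proved = lambda x: lambda y: f"prove'({x}, {y})"   # λx.λy.prove'xy (argument order is illustrative)

vp = proved("the_theorem")   # forward application with a hypothetical object NP  -> S\NP
s  = vp("Marcel")            # backward application with a hypothetical subject NP -> S
print(s)                     # prove'(the_theorem, Marcel)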
Other Syntactic TreeBanks
● Prague Dependency Treebank for Czech, 1.5M words annotated for morphology and dependency syntax
● Negra Treebank for German (355K words)
● CCGBank
○ Created by automatically translating phrase-structure trees from the Penn Treebank via a rule-based approach
○ Produced successful translations of over 99% of the trees in the Penn Treebank, resulting in 48,934 sentences with CCG derivations
○ Provides a lexicon of 44,000 words with over 1200 categories
● Wikipedia page on Treebanks lists > 100 for three dozen languages
Abstract Meaning Representation (AMR)
● A semantic representation language with an annotated corpus
● Currently 60K sentences
● Rooted, directed, edge-labeled, leaf-labeled graphs
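For concreteness, the standard “The boy wants to go” example in PENMAN notation, decoded here with the penman Python library (my illustration, not part of the slide):

# An AMR is a rooted, directed, edge-labeled, leaf-labeled graph;
# note the re-entrancy: variable b (the boy) is both the wanter and the goer.
# Requires: pip install penman
import penman

graph = penman.decode("""
(w / want-01
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
""")
print(graph.triples)
# roughly: [('w', ':instance', 'want-01'), ('w', ':ARG0', 'b'), ('b', ':instance', 'boy'),
#           ('w', ':ARG1', 'g'), ('g', ':instance', 'go-01'), ('g', ':ARG0', 'b')]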
Feature Engineering and Feature Selection for NLP
● Words (counts, weighted counts, proportions, conditional probabilities)
● Word classes
○ Syntactic: POS
○ Semantic and functional: Discourse cue words, sentiment words, pronouns
● Syntactic features
○ CFG: Subtree/node types
○ Dependency grammar: Dependency relations
● Very time-consuming and sometimes lacking in generality
● Requires methods to select features, best as part of the training
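A small sketch of what such hand-built features look like in code; the documents, word lists, and classifier choice are toy assumptions of mine:

# Word-count features plus two hand-crafted word-class features
# (pronoun count, discourse-cue count) feeding a linear classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

docs = ["However, I think the plan will fail.",        # hypothetical examples
        "The experiment succeeded on the first try."]
labels = [0, 1]

PRONOUNS = {"i", "you", "he", "she", "we", "they"}
CUES = {"however", "therefore", "moreover"}

bow = CountVectorizer().fit(docs)
X_counts = bow.transform(docs).toarray()

def handcrafted(doc):
    toks = doc.lower().replace(",", " ").split()
    return [sum(t in PRONOUNS for t in toks),   # pronoun count
            sum(t in CUES for t in toks)]       # discourse-cue count

X = np.hstack([X_counts, [handcrafted(d) for d in docs]])
clf = LogisticRegression().fit(X, labels)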
Natural Language Processing
CMPSC 442
Week 13, Meeting 38, Segment 3: Excerpts from Mirella
Lapata 2017 Keynote
Pros and Cons of Neural Architectures
Pros
● For certain tasks (language modeling, machine translation, semantic parsing, natural language generation) can handle longer contexts much more easily than statistical models
● Can reduce or eliminate costly feature engineering
Cons
● The usual: data hungry, large computational resource needs
● None of the above applications requires symbol grounding
○ Linking words to objects and actions in the world, across contexts
Translating from Multiple Modalities to Text and Back
● Excerpts from Mirella Lapata keynote address at 2017 annual meeting of the Association for Computational Linguistics
NLP Comes to the Rescue!
Example inputs and outputs shown on the slide: “riding a horse”; “define function with argument n; if n is not an integer value, throw a TypeError exception”; “Suggs rushed for 82 yards and scored a touchdown.”; “The Port Authority gave permission to exterminate Snowy Owls at NY City airports.”; “Which animals eat owls?”
A Brief History of Neural Networks
Source: http://qingkaikong.blogspot.com/
Encoder-Decoder Modeling Framework
Kalchbrenner and Blunsom (2013); Cho et al. (2014); Sutskever et al. (2014); Karpathy and Fei-Fei (2015); Vinyals et al. (2015).
Source: https://medium.com/@felixhill/
1. End-to-end training: All parameters are simultaneously optimized to minimize a loss function on the network’s output.
2. Distributed representations share strength: Better exploitation of word and phrase similarities.
3. Better exploitation of context: We can use a much bigger context – both source and partial target text – to translate more accurately.
Essentially a Conditional Recurrent Language Model!
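A minimal PyTorch sketch of that idea (my own illustration, not Lapata's code): the decoder is a recurrent language model whose initial state is the encoder's summary of the source.

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, dim=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src, tgt_in):
        _, h = self.encoder(self.src_emb(src))               # encode source into final hidden state h
        dec_out, _ = self.decoder(self.tgt_emb(tgt_in), h)   # condition the target LM on h
        return self.out(dec_out)                             # per-step scores over target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))      # toy batch of source token ids
tgt_in = torch.randint(0, 1000, (2, 5))   # target tokens shifted right (teacher forcing)
logits = model(src, tgt_in)               # shape (2, 5, 1000); train with cross-entropy vs. gold targets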
In the Remainder of the Talk
We will look at the encoder-decoder framework across three case studies (the simplification task, language to code, and movie summarization) and along these dimensions:
● Translation: different modalities vs. same modality
● Data: comparable vs. parallel
● Training size: small (S), medium (M), large (L)
● Model: encoder, decoder, training objective
Deep Reinforcement Learning
[Diagram: encoder reads source X = x1 x2 x3 x4 x5; decoder outputs Ŷ = ŷ1 ŷ2 ŷ3]
Vanilla encoder-decoder model only learns to copy. We enforce task-specific constraints via reinforcement learning (Ranzato et al., 2016; Li et al., 2016; Narasimhan et al., 2016; Zhang and Lapata, 2017; Williams et al., 2017).
[Diagram: the agent produces action sequence Ŷ; separate simplicity, relevance, and fluency models score Ŷ, and the resulting reward is used to update the agent]
REINFORCE algorithm
View model as an agent which reads source X.
Agent takes action ŷt ∈ V according to policy PRL(ŷt | ŷ1:t−1, X).
Agent outputs Ŷ = (ŷ1, ŷ2, . . . , ŷ|Ŷ|) and receives reward r.
The reward r combines three components:
● Simplicity: SARI (Xu et al., 2016), arithmetic average of n-gram precision and recall of addition, copying, and deletion.
● Relevance: cosine similarity between vectors representing source X and predicted target Ŷ.
● Fluency: normalized sentence probability assigned by an LSTM language model trained on simple sentences.
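A simplified sketch of the REINFORCE update for this setup; the reward weights, baseline, and sampled log-probabilities below are placeholder assumptions of mine, not values from the talk:

# Score-function (REINFORCE) estimator for a sampled sequence y_hat given source x:
# loss = -(r - baseline) * sum_t log P_RL(y_hat_t | y_hat_{<t}, x)
import torch

def reinforce_loss(log_probs, reward, baseline=0.0,
                   w_simple=1.0, w_rel=1.0, w_flu=1.0):
    """log_probs: 1-D tensor of log P_RL(y_hat_t | y_hat_{<t}, x) for the sampled tokens.
    reward: dict with 'simplicity', 'relevance', 'fluency' scores (placeholders)."""
    r = (w_simple * reward["simplicity"]
         + w_rel   * reward["relevance"]
         + w_flu   * reward["fluency"])
    return -(r - baseline) * log_probs.sum()

# Toy usage: pretend the decoder sampled 3 tokens with these log-probabilities
log_probs = torch.tensor([-1.2, -0.7, -2.1], requires_grad=True)
loss = reinforce_loss(log_probs, {"simplicity": 0.6, "relevance": 0.8, "fluency": 0.9})
loss.backward()   # gradients push up the probability of high-reward sequences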
Take-Home Message
Sequence-to-sequence model with a task-specific objective.
The RL framework could be used for other rewriting tasks.
Training data is not perfect and will never be huge.
Simplifications are decent; the system performs well out of domain.
Summary
● Natural language processing (NLP) applies machine learning to language data for a range of applications
● NLP depends heavily on collections of labeled data (TreeBanks, Corpora)
● Early NLP handled grounded language (grounded in databases or simulations of the world)
● Encoder-decoder models, which translate between different symbol systems (e.g., natural language and other symbolic representations), are the most prevalent neural architectures for NLP