CS447: Natural Language Processing
http://courses.engr.illinois.edu/cs447
Julia Hockenmaier
juliahmr@illinois.edu
3324 Siebel Center
Lecture 8:
Formal Grammars
of English
Recap: Wednesday’s lecture
Graphical models for sequence labeling
Directed graphical models
Graphical models are a notation for probability models.
In a directed graphical model, each node represents a
distribution over a random variable:
– P(X) =
Arrows represent dependencies (they define what other
random variables the current node is conditioned on)
– P(Y) P(X | Y ) =
– P(Y) P(Z) P(X | Y, Z) =
Shaded nodes represent observed variables.
White nodes represent hidden variables
– P(Y) P(X | Y) with Y hidden and X observed =
[Figures: a single node X; Y → X; Y and Z → X; Y → X with X shaded (observed) and Y white (hidden)]
HMMs as graphical models
HMMs are generative models of the observed input string w.
They ‘generate’ w with P(w, t) = ∏i P(t(i) | t(i−1)) P(w(i) | t(i)).
When we use an HMM to tag, we observe w, and need to find t.
[Figure: hidden tag nodes t(1) → t(2) → t(3) → t(4), each emitting an observed word w(1) … w(4)]
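To make the factorization concrete, here is a minimal Python sketch that scores one tag sequence under this model. The toy transition and emission probabilities are made up for illustration and are not taken from the lecture.

```python
# Minimal sketch (toy numbers): scoring a tag sequence under the HMM
# factorization P(w, t) = prod_i P(t_i | t_{i-1}) * P(w_i | t_i).

transition = {  # P(tag | previous tag); "<s>" marks the start of the sentence
    ("<s>", "PRP"): 0.5, ("PRP", "VBP"): 0.6, ("VBP", "NN"): 0.4,
}
emission = {    # P(word | tag)
    ("PRP", "I"): 0.3, ("VBP", "eat"): 0.1, ("NN", "sushi"): 0.05,
}

def hmm_joint_prob(words, tags):
    """P(w, t) = prod_i P(t_i | t_{i-1}) P(w_i | t_i)."""
    prob = 1.0
    prev = "<s>"
    for word, tag in zip(words, tags):
        prob *= transition.get((prev, tag), 0.0) * emission.get((tag, word), 0.0)
        prev = tag
    return prob

print(hmm_joint_prob(["I", "eat", "sushi"], ["PRP", "VBP", "NN"]))
# 0.5 * 0.3 * 0.6 * 0.1 * 0.4 * 0.05 = 0.00018
```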
Discriminative probability models
A discriminative or conditional model of the labels t
given the observed input string w models
P(t | w) = ∏iP(t(i) |w(i), t(i−1)) directly.
[Figure: each tag t(i) is conditioned on the observed word w(i) and the previous tag t(i−1)]
Discriminative models
There are two main types of discriminative
probability models:
–Maximum Entropy Markov Models (MEMMs)
–Conditional Random Fields (CRFs)
MEMMs and CRFs:
–are both based on logistic regression
–have the same graphical model
– require the Viterbi algorithm for tagging
–differ in that MEMMs consist of independently
learned distributions, while CRFs are trained to
maximize the probability of the entire sequence
Probabilistic classification
Classification:
Predict a class (label) c for an input x
There are only a (small) finite number of possible class labels
Probabilistic classification:
– Model the probability P( c | x)
P(c | x) is a probability distribution if 0 ≤ P(ci | x) ≤ 1 and ∑i P(ci | x) = 1
–Return the class c* = argmaxi P (ci | x)
that has the highest probability
One standard way to model P( c | x) is logistic
regression (used by MEMMs and CRFs)
Maximum Entropy Markov Models
MEMMs use a MaxEnt classifier for each P(t(i) | w(i), t(i−1)):
Since we use w to refer to words, let’s use λjk as the weight
for the feature function fj(t(i−1), w(i)) when predicting tag tk:

[Figure: t(i−1) → t(i), with w(i) also feeding into t(i)]

P(t(i) = tk | t(i−1), w(i)) = exp(∑j λjk fj(t(i−1), w(i))) / ∑l exp(∑j λjl fj(t(i−1), w(i)))
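As a concrete illustration of this local distribution, here is a minimal Python sketch. The tag set, feature functions, and λ values are invented for illustration; they are not the lecture’s.

```python
import math

# Minimal sketch of the MEMM local distribution:
# P(t_i = t_k | t_{i-1}, w_i) is a softmax over sum_j lambda_{jk} f_j(t_{i-1}, w_i).

TAGS = ["NN", "VBZ", "DT"]

def features(prev_tag, word):
    """A few binary feature functions f_j(t_{i-1}, w_i)."""
    return {
        "prev-tag=DT": 1.0 if prev_tag == "DT" else 0.0,
        "word-capitalized": 1.0 if word[:1].isupper() else 0.0,
        "word-ends-in-s": 1.0 if word.endswith("s") else 0.0,
    }

# LAMBDA[(j, k)] = lambda_{jk}: weight of feature j when predicting tag k
LAMBDA = {("prev-tag=DT", "NN"): 2.0,
          ("word-capitalized", "NN"): 0.5,
          ("word-ends-in-s", "VBZ"): 1.0}

def memm_tag_distribution(prev_tag, word):
    f = features(prev_tag, word)
    scores = {k: math.exp(sum(LAMBDA.get((j, k), 0.0) * v for j, v in f.items()))
              for k in TAGS}
    Z = sum(scores.values())                 # normalize over all tags
    return {k: s / Z for k, s in scores.items()}

print(memm_tag_distribution("DT", "books"))  # most of the mass should go to NN here
```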
Using features
Think of feature functions as useful questions you can
ask about the input x:
– Binary feature functions:
ffirst-letter-capitalized(Urbana) = 1
ffirst-letter-capitalized(computer) = 0
– Integer (or real-valued) features:
fnumber-of-vowels(Urbana) = 3
Which specific feature functions are useful
will depend on your task (and your training data).
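For concreteness, the two feature functions from this slide could be written as follows (a small sketch; the function names are mine):

```python
# Binary and integer-valued feature functions over an input word x.
def f_first_letter_capitalized(x: str) -> int:
    return 1 if x[:1].isupper() else 0

def f_number_of_vowels(x: str) -> int:
    return sum(1 for ch in x.lower() if ch in "aeiou")

assert f_first_letter_capitalized("Urbana") == 1
assert f_first_letter_capitalized("computer") == 0
assert f_number_of_vowels("Urbana") == 3
```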
From features to probabilities
We associate a real-valued weight wic with each
feature function fi(x) and output class c
Note that the feature function fi(x) does not have to depend
on c as long as the weight does (note the double index wic)
This gives us a real-valued score for predicting class c
for input x: score(x,c) = ∑iwic fi(x)
This score could be negative, so we exponentiate it:
score(x,c) = exp( ∑iwic fi(x))
To get a probability distribution over all classes c,
we renormalize these scores:
P(c | x) = score(x,c)∕∑j score(x,cj)
= exp( ∑iwic fi(x))∕∑j exp( ∑iwij fi(x))
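As a quick worked example (the numbers are mine, not from the lecture): with two classes c1 and c2, suppose score(x, c1) = ∑i wic1 fi(x) = 2 and score(x, c2) = 0. Then P(c1 | x) = e^2 / (e^2 + e^0) ≈ 7.39 / 8.39 ≈ 0.88 and P(c2 | x) = 1 / 8.39 ≈ 0.12, so the classifier returns c* = c1.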
Learning = finding weights w
We use conditional maximum likelihood estimation
(and standard convex optimization algorithms)
to find/learn w
(for more details, attend CS446 and CS546)
The conditional MLE training objective:
Find the w that assigns highest probability to all observed
outputs ci given the inputs xi
Learning: finding w

ŵ = argmaxw ∏i P(ci | xi, w)
  = argmaxw ∑i log P(ci | xi, w)
  = argmaxw ∑i log [ exp(∑j wj fj(xi, ci)) / ∑c′ exp(∑j wj fj(xi, c′)) ]
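To make the objective concrete, here is a minimal gradient-ascent sketch for conditional MLE (my own toy setup, not course code): for each weight, the gradient is the observed count of its feature with its class minus the expected count under the current model.

```python
import math

# Gradient ascent on sum_i log P(c_i | x_i, w) for a tiny MaxEnt classifier.

CLASSES = ["A", "B"]

def p_given_x(fx, w):
    """P(c | x, w) as a softmax over sum_j w_{jc} f_j(x)."""
    scores = {c: math.exp(sum(w.get((j, c), 0.0) * v for j, v in fx.items()))
              for c in CLASSES}
    Z = sum(scores.values())
    return {c: s / Z for c, s in scores.items()}

def gradient_step(data, w, lr=0.5):
    """data: list of (feature_dict, gold_class) pairs; returns updated weights."""
    grad = {}
    for fx, gold in data:
        p = p_given_x(fx, w)
        for j, v in fx.items():
            for c in CLASSES:
                observed = v if c == gold else 0.0        # observed feature count
                grad[(j, c)] = grad.get((j, c), 0.0) + observed - p[c] * v
    return {k: w.get(k, 0.0) + lr * g for k, g in grad.items()}

data = [({"f1": 1.0}, "A"), ({"f2": 1.0}, "B")]
w = {}
for _ in range(100):
    w = gradient_step(data, w)
print(p_given_x({"f1": 1.0}, w))   # P(A | x) should now be close to 1
```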
Terminology
Models that are of the form
P(c | x) = score(x,c)∕∑j score(x,cj)
= exp( ∑iwic fi(x))∕∑j exp( ∑iwij fi(x))
are also called loglinear models, Maximum Entropy
(MaxEnt) models, or multinomial logistic regression
models.
CS446 and CS546 should give you more details about these.
The normalizing term ∑j exp( ∑iwij fi(x)) is also called
the partition function and is often abbreviated as Z
Features for Sequence Labeling
What features are useful to model P(t(i) |w(i), t(i−1)) ?
The identity of the previous label
Properties of the current word:
-w(i) starts with/contains a capital letter/number,…
-w(i) contains the character “A” (“B”, …”Z”, …1, 2, …0,….)
-w(i) ends in “ing”, “ed”, ….
-…
Feature engineering is essential for any practical system.
We typically define feature templates (e.g. let any of
the first or last n (= 1, 2, 3, …) characters be used as a
separate feature). This results in a very large number
of actual features (and weights to be learned).
Methods for feature selection become essential.
Feature Engineering
Feature engineering (finding useful features) is
essential to get good performance out of any classifier
This requires domain expertise
We typically define feature templates
(e.g. let any of the first or last n (= 1, 2, 3, …) characters be
used as a separate feature).
This results in a very large number of actual features
(and weights to be learned)
Methods for feature selection become essential.
On to new material…
Previous key concepts
NLP tasks dealing with words…
-POS-tagging, morphological analysis
… require finite-state representations,
-Finite-State Automata and Finite-State Transducers
… the corresponding probabilistic models,
-Probabilistic FSAs and Hidden Markov Models
-Estimation: relative frequency estimation, EM algorithm
… and appropriate search algorithms
-Dynamic programming: Forward, Viterbi, Forward-Backward
The next key concepts
NLP tasks dealing with sentences…
-Syntactic parsing and semantic analysis
… require (at least) context-free representations,
-Context-free grammars, unification grammars
… the corresponding probabilistic models,
-Probabilistic Context-Free Grammars, Loglinear models
-Estimation: Relative Frequency estimation, EM algorithm, etc.
… and appropriate search algorithms
-Dynamic programming: chart parsing, inside-outside
algorithm
Dealing with ambiguity
[Figure: three interacting components: a structural representation (e.g. FSA), a scoring function (a probability model, e.g. HMM), and a search algorithm (e.g. Viterbi)]
Today’s lecture
Introduction to natural language syntax (‘grammar’):
Constituency and dependencies
Context-free Grammars
Dependency Grammars
A simple CFG for English
What is grammar?
No, not really, not in this class.
What is grammar?
Grammar formalisms
(= linguists’ programming languages)
A precise way to define and describe
the structure of sentences.
(N.B.: There are many different formalisms out there, which each define their
own data structures and operations)
Specific grammars
(= linguists’ programs)
Implementations (in a particular formalism) for a particular
language (English, Chinese,….)
Can we define a program that
generates all English sentences?
The number of sentences is infinite.
But we need our program to be finite.
Overgeneration vs. undergeneration
[Figure: the set of English sentences compared with the set of sentences a grammar generates]
English:
John saw Mary.
I ate sushi with tuna.
I ate the cake that John had made for me yesterday.
I want you to go there.
John made some cake.
Did you go there?
…
Not English (overgeneration):
John Mary saw.
with tuna sushi ate I.
Did you went there?
…
Basic sentence structure
[Figure: “I eat sushi.” with I labeled Noun (Subject), eat labeled Verb (Head), and sushi labeled Noun (Object)]
This is a dependency graph:
[Figure: “I eat sushi.” with arcs from eat to I (sbj) and from eat to sushi (obj)]
A finite-state automaton (FSA)
[Figure: an FSA with transitions Noun (Subject) → Verb (Head) → Noun (Object)]
A Hidden Markov Model (HMM)
[Figure: the same states Noun (Subject) → Verb (Head) → Noun (Object), now with emissions “I, you, …”, “eat, drink”, “sushi, …”]
Words take arguments
I eat sushi. ✔
I eat sushi you. ???
I sleep sushi ???
I give sushi ???
I drink sushi ?
Subcategorization
(purely syntactic: what set of arguments do words take?)
Intransitive verbs (sleep) take only a subject.
Transitive verbs (eat) also take one (direct) object.
Ditransitive verbs (give) also take one (indirect) object.
Selectional preferences
(semantic: what types of arguments do words tend to take)
The object of eat should be edible.
A better FSA
[Figure: from Noun (Subject), either an Intransitive Verb (Head) ends the sentence, or a Transitive Verb (Head) continues to Noun (Object)]
Language is recursive
the ball
the big ball
the big, red ball
the big, red, heavy ball
….
Adjectives can modify nouns.
The number of modifiers (aka adjuncts)
a word can have is (in theory) unlimited.
Another FSA
[Figure: Determiner → Noun, with an Adjective self-loop before the Noun]
Recursion can be
more complex
the ball
the ball in the garden
the ball in the garden behind the house
the ball in the garden behind the house next to the school
….
Yet another FSA
[Figure: Det → (Adj loop) → Noun, with a Preposition transition looping back to Det]
So, why do we need anything beyond regular (finite-state) grammars?
What does this mean?
the ball in the garden behind the house
There is an attachment ambiguity: “behind the house” can modify either “the garden” or “the ball”.
FSAs do not generate hierarchical structure
[Figure: the same Det / Adj / Noun / Preposition FSA: it accepts these strings, but assigns them no hierarchical structure]
What is the structure of a sentence?
Sentence structure is hierarchical:
A sentence consists of words (I, eat, sushi, with, tuna)
…which form phrases or constituents: “sushi with tuna”
Sentence structure defines dependencies between words or phrases:
[Figure: bracketing of “I eat sushi with tuna” into nested constituents]
Strong vs. weak
generative capacity
Formal language theory:
-defines language as string sets
– is only concerned with generating these strings
(weak generative capacity)
Formal/Theoretical syntax (in linguistics):
-defines language as sets of strings with (hidden) structure
– is also concerned with generating the right structures
(strong generative capacity)
Context-free grammars (CFGs)
capture recursion
Language has complex constituents
(“the garden behind the house”)
Syntactically, these constituents behave
just like simple ones.
(“behind the house” can always be omitted)
CFGs define nonterminal categories
to capture equivalent constituents.
Context-free grammars
A CFG is a 4-tuple 〈N, Σ, R, S〉 consisting of:
A set of nonterminals N
(e.g. N = {S, NP, VP, PP, Noun, Verb, ….})
A set of terminals Σ
(e.g. Σ = {I, you, he, eat, drink, sushi, ball, …})
A set of rules R
R ⊆ {A → β with left-hand-side (LHS) A ∈ N
and right-hand-side (RHS) β ∈ (N ∪ Σ)* }
A start symbol S ∈ N
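As a purely illustrative rendering of the 4-tuple, here is the NP/PP grammar from the next slide written down as a plain Python structure (using NP as the start symbol for this fragment):

```python
from dataclasses import dataclass

# A sketch of <N, Sigma, R, S> as a Python structure (illustrative only).

@dataclass
class CFG:
    nonterminals: set      # N
    terminals: set         # Sigma
    rules: list            # R: (lhs, rhs) pairs, lhs in N, rhs a tuple over N ∪ Sigma
    start: str             # S

np_grammar = CFG(
    nonterminals={"NP", "PP", "DT", "N", "P"},
    terminals={"the", "a", "ball", "garden", "house", "in", "behind", "with"},
    rules=[("NP", ("DT", "N")), ("NP", ("NP", "PP")), ("PP", ("P", "NP")),
           ("DT", ("the",)), ("DT", ("a",)),
           ("N", ("ball",)), ("N", ("garden",)), ("N", ("house",)),
           ("P", ("in",)), ("P", ("behind",)), ("P", ("with",))],
    start="NP",
)
```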
An example
DT → {the, a}
N → {ball, garden, house, sushi }
P → {in, behind, with}
NP → DT N
NP → NP PP
PP → P NP
N: noun
P: preposition
NP: “noun phrase”
PP: “prepositional phrase”
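A small sketch (my own code, not from the lecture) that expands this grammar at random shows why the recursive rule NP → NP PP makes the set of noun phrases infinite:

```python
import random

# Random expansion of the NP/PP grammar. The depth cap only keeps the
# random samples short; the grammar itself generates arbitrarily long NPs.

RULES = {
    "NP": [["DT", "N"], ["NP", "PP"]],
    "PP": [["P", "NP"]],
    "DT": [["the"], ["a"]],
    "N": [["ball"], ["garden"], ["house"]],
    "P": [["in"], ["behind"], ["with"]],
}

def generate(symbol="NP", depth=0):
    if symbol not in RULES:                 # terminal symbol: emit the word
        return [symbol]
    options = RULES[symbol]
    if depth > 3:                           # deep in the derivation: stop recursing
        options = options[:1]
    rhs = random.choice(options)
    return [word for child in rhs for word in generate(child, depth + 1)]

for _ in range(3):
    print(" ".join(generate()))             # e.g. "the ball in the garden behind the house"
```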
CFGs define parse trees
Correct analysis:
[VP [V eat] [NP [NP sushi] [PP [P with] [NP tuna]]]]   “eat sushi with tuna”
[VP [VP [V eat] [NP sushi]] [PP [P with] [NP chopsticks]]]   “eat sushi with chopsticks”
Incorrect analysis:
[VP [VP [V eat] [NP sushi]] [PP [P with] [NP tuna]]]   “eat sushi with tuna”
[VP [V eat] [NP [NP sushi] [PP [P with] [NP chopsticks]]]]   “eat sushi with chopsticks”
N → {sushi, tuna}
P → {with}
V → {eat}
NP → N
NP → NP PP
PP → P NP
VP → V NP
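The attachment ambiguity can be made concrete with a small brute-force parser (my own sketch). Note that it also needs VP → VP PP, the rule the VP-attachment analyses above rely on, which is not in the rule list shown here:

```python
# Enumerate every parse of a span under the toy grammar above, plus VP -> VP PP.
# "eat sushi with tuna" gets two parses: NP attachment and VP attachment.

LEXICON = {"sushi": "N", "tuna": "N", "chopsticks": "N", "with": "P", "eat": "V"}
UNARY = [("NP", "N")]
BINARY = [("NP", "NP", "PP"), ("PP", "P", "NP"), ("VP", "V", "NP"), ("VP", "VP", "PP")]

def parses(words, symbol, i, j):
    """All trees rooted in `symbol` that span words[i:j]."""
    trees = []
    if j - i == 1 and LEXICON.get(words[i]) == symbol:
        trees.append((symbol, words[i]))
    for lhs, child in UNARY:
        if lhs == symbol:
            trees += [(symbol, t) for t in parses(words, child, i, j)]
    for lhs, left, right in BINARY:
        if lhs == symbol:
            for k in range(i + 1, j):                    # split point
                for lt in parses(words, left, i, k):
                    for rt in parses(words, right, k, j):
                        trees.append((symbol, lt, rt))
    return trees

sentence = "eat sushi with tuna".split()
for tree in parses(sentence, "VP", 0, len(sentence)):
    print(tree)
```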
CFGs and center embedding
The mouse ate the corn.
The mouse that the snake ate ate the corn.
The mouse that the snake that the hawk ate ate ate the corn.
….
CFGs and center embedding
Formally, these sentences are all grammatical,
because they can be generated by the CFG
that is required for the first sentence:
S → NP VP
NP → NP RelClause
RelClause → that NP ate
Problem: CFGs are not able to capture bounded recursion
(‘only embed one or two relative clauses’).
To deal with this discrepancy between what the model predicts
to be grammatical and what humans consider grammatical,
linguists distinguish between a speaker’s competence
(grammatical knowledge) and performance (processing and
memory limitations).
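A short sketch (the lexical items are mine) shows how the recursive rules above generate arbitrarily deep center embeddings:

```python
# NP -> NP RelClause, RelClause -> that NP ate, applied recursively inside the
# relative clause; the lexical choices below are illustrative.

NOUNS = ["the mouse", "the snake", "the hawk", "the eagle"]

def embedded_np(depth, level=0):
    """An NP with `depth` levels of center embedding."""
    if depth == 0:
        return NOUNS[level]
    return f"{NOUNS[level]} that {embedded_np(depth - 1, level + 1)} ate"

for d in range(3):
    print(f"{embedded_np(d).capitalize()} ate the corn.")
# The mouse ate the corn.
# The mouse that the snake ate ate the corn.
# The mouse that the snake that the hawk ate ate ate the corn.
```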
CFGs are equivalent to Pushdown
automata (PDAs)
PDAs are FSAs with an additional stack:
Emit a symbol and push/pop a symbol from the stack
[Figure: a PDA with one transition that pushes ‘x’ on the stack and emits ‘a’, and another that pops ‘x’ from the stack and emits ‘b’; accept if the stack is empty]
This is equivalent to the following CFG:
S → a X b    S → a b
X → a X b    X → a b
Generating a^n b^n
Action                         Stack    String
1. Push x on stack. Emit a.    x        a
2. Push x on stack. Emit a.    xx       aa
3. Push x on stack. Emit a.    xxx      aaa
4. Push x on stack. Emit a.    xxxx     aaaa
5. Pop x off stack. Emit b.    xxx      aaaab
6. Pop x off stack. Emit b.    xx       aaaabb
7. Pop x off stack. Emit b.    x        aaaabbb
8. Pop x off stack. Emit b.    (empty)  aaaabbbb
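The same machine can be written as a few lines of Python (my own sketch): a list serves as the stack, and a string is accepted iff the stack is empty at the end.

```python
# A stack machine for a^n b^n (n >= 1), mirroring the PDA above.

def accepts_anbn(s: str) -> bool:
    stack, seen_b = [], False
    for ch in s:
        if ch == "a" and not seen_b:
            stack.append("x")      # push x on the stack, read an 'a'
        elif ch == "b" and stack:
            seen_b = True
            stack.pop()            # pop x off the stack, read a 'b'
        else:
            return False           # out of order, or a 'b' with nothing to pop
    return seen_b and not stack    # accept iff the stack is empty at the end

print(accepts_anbn("aaaabbbb"))    # True  (the run in the table above)
print(accepts_anbn("aaabb"))       # False (more a's than b's)
```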
Defining grammars for natural language
Two ways to represent structure
Phrase structure trees:
[Trees: the correct and incorrect VP/NP/PP analyses of “eat sushi with tuna” and “eat sushi with chopsticks” from the previous slide]
Dependency trees:
[Graphs: the same attachment choices drawn as head-dependent arcs]
Structure (syntax) corresponds to meaning (semantics)
[Trees: the correct analyses attach “with tuna” to the NP (the tuna is part of the sushi) and “with chopsticks” to the VP (an instrument of eating); the incorrect analyses make the opposite attachments]
Dependency grammar
DGs describe the structure of sentences as a
directed acyclic graph.
The nodes of the graph are the words
The edges of the graph are the dependencies.
Typically, the graph is assumed to be a tree.
Note the relationship between DGs and CFGs:
If a CFG phrase structure tree is translated into DG,
the resulting dependency graph has no crossing edges.
Constituents:
Heads and dependents
There are different kinds of constituents:
Noun phrases: the man, a girl with glasses, Illinois
Prepositional phrases: with glasses, in the garden
Verb phrases: eat sushi, sleep, sleep soundly
Every phrase has a head:
Noun phrases: the man, a girl with glasses, Illinois
Prepositional phrases: with glasses, in the garden
Verb phrases: eat sushi, sleep, sleep soundly
The other parts are its dependents.
Dependents are either arguments or adjuncts
Is string α a constituent?
He talks [in class].
Substitution test:
Can α be replaced by a single word?
He talks [there].
Movement test:
Can α be moved around in the sentence?
[In class], he talks.
Answer test:
Can α be the answer to a question?
Where does he talk? – [In class].
Arguments are obligatory
Words subcategorize for specific sets of arguments:
Transitive verbs (sbj + obj): [John] likes [Mary]
All arguments have to be present:
*[John] likes. *likes [Mary].
No argument can be occupied multiple times:
*[John] [Peter] likes [Ann] [Mary].
Words can have multiple subcat frames:
Transitive eat (sbj + obj): [John] eats [sushi].
Intransitive eat (sbj): [John] eats.
Adjuncts are optional
Adverbs, PPs and adjectives can be adjuncts:
Adverbs: John runs [fast].
a [very] heavy book.
PPs: John runs [in the gym].
the book [on the table]
Adjectives: a [heavy] book
There can be an arbitrary number of adjuncts:
John saw Mary.
John saw Mary [yesterday].
John saw Mary [yesterday] [in town]
John saw Mary [yesterday] [in town] [during lunch]
[Perhaps] John saw Mary [yesterday] [in town] [during lunch]
A context-free grammar for a fragment of English
Noun phrases (NPs)
Simple NPs:
[He] sleeps. (pronoun)
[John] sleeps. (proper name)
[A student] sleeps. (determiner + noun)
Complex NPs:
[A tall student] sleeps. (det + adj + noun)
[The student in the back] sleeps. (NP + PP)
[The student who likes MTV] sleeps. (NP + Relative Clause)
The NP fragment
NP → Pronoun
NP → ProperName
NP → Det Noun
Det → {a, the, every}
Pronoun → {he, she,…}
ProperName → {John, Mary,…}
Noun → AdjP Noun
Noun → N
NP → NP PP
NP → NP RelClause
Adjective phrases (AdjP) and
prepositional phrases (PP)
AdjP → Adj
AdjP → Adv AdjP
Adj → {big, small, red,…}
Adv → {very, really,…}
PP → P NP
P → {with, in, above,…}
The verb phrase (VP)
He [eats].
He [eats sushi].
He [gives John sushi].
He [eats sushi with chopsticks].
VP → V
VP → V NP
VP → V NP PP
VP → VP PP
V → {eats, sleeps, gives, …}
Capturing subcategorization
He [eats]. ✔
He [eats sushi]. ✔
He [gives John sushi]. ✔
He [eats sushi with chopsticks]. ✔
*He [eats John sushi]. ???
VP → Vintrans
VP → Vtrans NP
VP → Vditrans NP NP
VP → VP PP
Vintrans → {eats, sleeps}
Vtrans → {eats}
Vditrans → {gives}
Sentences
[He eats sushi].
[Sometimes, he eats sushi].
[In Japan, he eats sushi].
S → NP VP
S → AdvP S
S → PP S
He says [he eats sushi].
VP → Vcomp S
Vcomp → {says, thinks, believes}
Sentences redefined
[He eats sushi]. ✔
*[I eats sushi]. ???
*[They eats sushi]. ???
S → NP3sg VP3sg
S → NP1sg VP1sg
S → NP3pl VP3pl
We need features to capture agreement:
(number, person, case,…)
Complex VPs
In English, simple tenses have separate forms:
present tense: the girl eats sushi
simple past tense: the girl ate sushi
Complex tenses, progressive aspect and passive
voice consist of auxiliaries and participles:
past perfect tense: the girl has eaten sushi
future perfect: the girl will have eaten sushi
passive voice: the sushi was eaten by the girl
progressive: the girl is/was/will be eating sushi
VPs redefined
He [has [eaten sushi]].
The sushi [was [eaten by him]].
VP → Vhave VPpastPart
VP → Vbe VPpass
VPpastPart → VpastPart NP
VPpass → VpastPart PP
Vhave→ {has}
VpastPart→ {eaten, seen}
We need more nonterminals (e.g. VPpastpart).
N.B.: We call VPpastPart, VPpass, etc. ‘untensed’ VPs.
Coordination
[He eats sushi] and [she drinks tea]
[John] and [Mary] eat sushi.
He [eats sushi] and [drinks tea]
S → S conj S
NP → NP conj NP
VP → VP conj VP
Relative clauses
Relative clauses modify a noun phrase:
the girl [that eats sushi]
Relative clauses lack a noun phrase, which is
understood to be filled by the NP they modify:
‘the girl that eats sushi’ implies ‘the girl eats sushi’
There are subject and object relative clauses:
subject: ‘the girl that eats sushi’
object: ‘the sushi that the girl eats’
Yes/No questions
Yes/no questions consist of an auxiliary, a subject
and an (untensed) verb phrase:
does she eat sushi?
have you eaten sushi?
YesNoQ → Aux NP VPinf
YesNoQ → Aux NP VPpastPart
Wh-questions
Subject wh-questions consist of a wh-word, an
auxiliary and an (untensed) verb phrase:
Who has eaten the sushi?
Object wh-questions consist of a wh-word, an
auxiliary, an NP and an (untensed) verb phrase:
What does Mary eat?
Today’s key concepts
Natural language syntax
Constituents
Dependencies
Context-free grammar
Arguments and modifiers
Recursion in natural language
Today’s reading
Textbook:
Jurafsky and Martin, Chapter 12, sections 1-7