Syntax and Parsing 3: Parsing with CFG
This time:
Basic recognition/parsing strategies
top-down strategy bottom up strategy
Problems with simple strategies
left recursion
empty productions redundant reparsing
Earley’s Algorithm: Chart Parsing
edges and the chart the fundamental rule
Data Science Group (Informatics) NLE/ANLP Autumn 2015 1 / 28
Parsing with CFG
Consider again the following simple CFG:
1) S→NPVP 2) NP→John 3) NP→Mary 4) NP→ARTN 5) ART→a
6) N→song
How can we use this grammar to
7) VP→VNP
8) VP→VNPPP 9) VP→VPP
10) V→sang 11) PP→PNP 12) P→to
Recognise a string such as John sang a song – but not, say, a sang John song
Assign parse structure to John sang a song
Data Science Group (Informatics) NLE/ANLP Autumn 2015 2 / 28
Basic Recognition/Parsing Strategies
There are different strategies that may be adopted:
Top-Down (goal-driven strategy):
Assume you’re looking for S (i.e. sentence)
Use rules ‘forwards’ to ‘expand’ symbols until input is derived — else fail
Bottom-Up:
Assume you’re looking for S
Use rules ‘backwards’ to ‘combine’ symbols until you get S — else fail
Other dimensions:
left-to-right versus right-to-left
depth-first versus breadth-first search
Data Science Group (Informatics) NLE/ANLP Autumn 2015 3 / 28
Top-Down Strategy
1 S :
2 NP VP :
3 VP :
4 VNP :
5 NP :
6 ARTN :
7 N :
John sang a song John sang a song sang a song sang a song a song a song
8
:
(1)S→NPVP (2) NP → John (7)VP→VNP (10) V → sang (4)NP→ARTN
song (5)ART→a (6)N→song
adopted left-to-right strategy searched depth-first
Data Science Group (Informatics) NLE/ANLP
Autumn 2015
4 / 28
Top-Down Strategy: Non-Determinism
N.B.: At step 4, could have chosen to expand VP according to rule 8:
1 S :
2 NP VP :
3 VP :
4 VNPPP :
5 NP PP :
6 ARTNPP :
7NPP: 8PP: ?
John sang a song
John sang a song (1)
sang a song (2) sang a song (8) a song (10) a song (4) song(5) (6)
We need a way of backtracking when wrong choices are made
Data Science Group (Informatics) NLE/ANLP Autumn 2015 5 / 28
Top-Down Strategy: Left Recursion
A problem
for top-down strategy
(13) VP → VP PP
1 2 3 4 5 .
S : NP VP : VP : VP PP : VPPPPP :
John sang a song
John sang a song (1)
sang a song (2) sang a song (13) sang a song (13)
Not clear how we know when to stop expanding with VP → VP PP
Data Science Group (Informatics) NLE/ANLP Autumn 2015 6 / 28
Bottom-Up Strategy
1 John sang a song
2 NP sang a song (2)
3 NPVasong (10)
4 NP V ART song (5)
5 NPVARTN (6)
6 NPVNP (4)
7 NP VP (7)
8S (1)
adopted left-to-right strategy searched depth-first
Data Science Group (Informatics) NLE/ANLP
Autumn 2015
7 / 28
Bottom-Up Strategy: Empty Productions
An empty production has the form: A → ǫ
— i.e. non-terminal A may be ‘expanded’ as the empty string ǫ
This is not as odd as it sounds
Consider including (optional) adjectives in NPs:
(14) NP → ART AP N (15) AP → ǫ
(16) AP → ADJ AP (17) ADJ → happy (18) ADJ → tuneful
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
8 / 28
Bottom-Up Strategy: Empty Productions
These rules allow NPs such as:
a happy song
a happy tuneful song a song
The problem:
The rule AP → ǫ is applicable at every step
Application does not depend on seeing anything in the data
Data Science Group (Informatics) NLE/ANLP Autumn 2015 9 / 28
Parsing and Search
In general, CFG parsing is non-deterministic Top-Down Example:
VP saw Mary
V saw Mary V NP saw Mary
FAIL!
V PP
FAIL!
saw Mary
Branching in this tree expresses alternatives
Parsing algorithms must explore search space systematically
Data Science Group (Informatics) NLE/ANLP Autumn 2015 10 / 28
Parsing and Search: The Agenda
To recover from wrong decisions, must keep a record of states that have not yet been explored
Parser maintains a list of parse states called an agenda:
At each iteration of parsing procedure: Remove state from agenda Generate successor states
Add successors to agenda
Data Science Group (Informatics) NLE/ANLP Autumn 2015 11 / 28
Search Strategy
Agenda is a priority queue
Priority determines order items are considered
Depth-first search:
Last in first out (LIFO) priority order: agenda is a stack
Breadth-first search:
First in first out (FIFO) priority order: agenda is a queue
An AI-search view of parsing
prioritise agenda items according to some heuristic
Data Science Group (Informatics) NLE/ANLP Autumn 2015 12 / 28
Inefficient Re-Parsing
1 S :
2 NP VP :
3 VP :
4 VNPPP :
5 NP PP :
6 ARTNPP :
7 NPP : 8PP: 9PNP :
10VNP : 11 NP : 12 ARTN : 13 N : 14 :
John sang a song John sang a song (1) sang a song (2) sang a song (8)
a song
a song (4)
song
sang a song (7) a song
a song song
(10) (5)
(10)
(4)
(5) (6)
(6)
repetition of rule application
(11)
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
13 / 28
Chart Parsing: The Earley Algorithm
So: simple parsing/recognition strategies suffer from problems:
1 left-recursion (top-down strategy)
2 empty productions (bottom-up strategy)
3 inefficient reparsing of subtrees
4 a further problem is dealing with structural ambiguity
— more on that next time
A general solution is to use dynamic programming – c.f. the Viterbi algorithm and HMMs seen previously
The Earley Algorithm provides an efficient approach to CFG parsing – A.K.A. Chart Parsing
Data Science Group (Informatics) NLE/ANLP Autumn 2015 14 / 28
Chart Parsing
Algorithm builds a data structure called the chart
chart contains a list of edges representing state of parse Each edge involves a dotted rule
A → X1 …Xj •Xj+1 …Xk Symbols on the left of the dot (•) have been ‘found’
— confirmed hypotheses
Symbols on the right of the dot are still to be found
Data Science Group (Informatics) NLE/ANLP Autumn 2015 15 / 28
Edges: Partial Parse Information
Edges may be active or inactive:
NP → ART • N S → • NP VP VP → V NP •
active edge active and empty inactive edge
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
16 / 28
Edges: Position in Chart
In addition, each edge records position of partial parse in input In general, the chart has edges of the form:
⟨i, j, A → α • β⟩
integer i: position in input where the partial parse starts
integer j: position in input where the partial parse finishes
Data Science Group (Informatics) NLE/ANLP Autumn 2015 17 / 28
The Chart
S→NP • VP
NP → John • John
NP → • ART N
V → sang •
sang a
N → song • song
01234
Data Science Group (Informatics) NLE/ANLP Autumn 2015 18 / 28
The Fundamental Rule of Chart Parsing
If the chart contains the edges
⟨i, j, A → α • B β⟩ and ⟨j, k, B → γ •⟩
then add the new edge
⟨i, k, A → α B • β⟩
α, β, γ are possibly empty sequences of symbols
Data Science Group (Informatics) NLE/ANLP Autumn 2015 19 / 28
Example
S → NP VP •
S → NP • VP
VP → V •
ijk
Data Science Group (Informatics) NLE/ANLP Autumn 2015 20 / 28
Initialisation
Fundamental rule applies to chart containing edges
How do we get started?
Initialisation:
Include inactive edges for words in the input string
Data Science Group (Informatics) NLE/ANLP Autumn 2015 21 / 28
Example
Consider the input John sang a song
NP → John • V → sang • ART → a • N → song •
01234
Data Science Group (Informatics) NLE/ANLP Autumn 2015 22 / 28
Rule Invocation: Bottom-Up
If adding an edge
⟨i , j , B → α •⟩ then for every rule of the form
add an edge
A→Bβ
⟨i, i, A → • B β⟩
Data Science Group (Informatics) NLE/ANLP Autumn 2015 23 / 28
Example
NP → • ART N
ART → a •
ij
Data Science Group (Informatics) NLE/ANLP Autumn 2015 24 / 28
Rule Invocation: Top-Down
If adding an edge
⟨i,j,B→α • Aβ⟩ then for every rule of the form
add an edge
A→γ
⟨j, j, A → • γ⟩
Data Science Group (Informatics) NLE/ANLP Autumn 2015 25 / 28
Example
A → NP • VP
ij
VP → • V
VP → • V NP
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
26 / 28
Top-Down Initialisation
Initialise the chart with
⟨0, 0, S → • α⟩
Data Science Group (Informatics) NLE/ANLP Autumn 2015 27 / 28
Next Topic: Probabilistic CFG
The challenge of ambiguity Lexical and structural ambiguity Probabilistic CFG (PCFG) Syntactic disambiguation Training a PCFG
Data Science Group (Informatics) NLE/ANLP Autumn 2015 28 / 28