Syntax and Parsing 3: Parsing with CFG
Parsing with CFG
This time:
Basic recognition/parsing strategies
top-down strategy bottom up strategy
Problems with simple strategies
left recursion
empty productions redundant reparsing
Earley’s Algorithm: Chart Parsing
edges and the chart the fundamental rule
Data Science Group (Informatics) NLE/ANLP
Basic Recognition/Parsing Strategies
There are different strategies that may be adopted:
Top-Down (goal-driven strategy):
Consider again the following simple CFG:
1) S→NPVP 2) NP→John 3) NP→Mary 4) NP→ARTN 5) ART→a
6) N→song
How can we use this grammar to
7) VP→VNP
8) VP→VNPPP 9) VP→VPP
10) V→sang 11) PP→PNP 12) P→to
Recognise a string such as John sang a song
– but not, say, a sang John song
Assign parse structure to John sang a song
Autumn 2015
1 / 28
Data Science Group (Informatics)
Top-Down Strategy
1 S :
2 NP VP :
3 VP :
4 V NP :
5 NP :
6 ART N :
7 N :
NLE/ANLP
John sang a song John sang a song sang a song sang a song a song a song
Autumn 2015
2 / 28
Assume you’re looking for S (i.e. sentence)
Use rules ‘forwards’ to ‘expand’ symbols until input is derived — else fail
Bottom-Up:
Assume you’re looking for S
Use rules ‘backwards’ to ‘combine’ symbols until you get S — else fail
Other dimensions:
left-to-right versus right-to-left
depth-first versus breadth-first search
(1)S→NPVP (2) NP → John (7)VP→VNP (10) V → sang (4)NP→ARTN
8
:
song (5)ART→a (6)N→song
adopted left-to-right strategy searched depth-first
Data Science Group (Informatics) NLE/ANLP Autumn 2015
3 / 28
Data Science Group (Informatics) NLE/ANLP
Autumn 2015
4 / 28
Top-Down Strategy: Non-Determinism
Top-Down Strategy: Left Recursion
N.B.: At step 4, could have chosen to expand VP according to rule 8:
A problem
1 2 3 4 5 .
for top-down strategy
(13) VP → VP PP
1 S
2 NP VP
3 VP
4 VNPPP
5 NP PP
6 ARTNPP
7NPP: 8PP: ?
We need a way of backtracking when wrong choices are made
Not clear how we know when to stop expanding with VP → VP PP
Data Science Group (Informatics) NLE/ANLP Autumn 2015
Bottom-Up Strategy: Empty Productions
An empty production has the form: A → ε
— i.e. non-terminal A may be ‘expanded’ as the empty string ε
This is not as odd as it sounds
Consider including (optional) adjectives in NPs:
(14) NP → ART AP N (15) AP → ε
(16) AP → ADJ AP (17) ADJ → happy (18) ADJ → tuneful
: : : : : :
John sang a song
John sang a song (1)
sang a song (2) sang a song (8) a song (10) a song (4) song(5) (6)
S : NP VP : VP : VP PP : VPPPPP :
John sang a song
John sang a song (1)
sang a song (2) sang a song (13) sang a song (13)
Data Science Group (Informatics)
Bottom-Up Strategy
NLE/ANLP
Autumn 2015
5 / 28
6 / 28
John sang a song
NP sang a song (2) NP V a song (10) NP V ART song (5) NP V ART N (6) NP V NP (4) NP VP (7)
1
2
3
4
5
6
7
8S (1)
adopted left-to-right strategy searched depth-first
Data Science Group (Informatics) NLE/ANLP
Autumn 2015
7 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
8 / 28
Bottom-Up Strategy: Empty Productions
Parsing and Search
These rules allow NPs such as:
a happy song
a happy tuneful song a song
The problem:
The rule AP → ε is applicable at every step
Application does not depend on seeing anything in the data
Data Science Group (Informatics) NLE/ANLP Autumn 2015
Parsing and Search: The Agenda
To recover from wrong decisions, must keep a record of states that have not yet been explored
Parser maintains a list of parse states called an agenda:
At each iteration of parsing procedure:
Remove state from agenda Generate successor states Add successors to agenda
In general, CFG parsing is non-deterministic Top-Down Example:
V saw Mary FAIL!
V NP
saw Mary
V PP FAIL!
saw Mary
VP saw Mary
9 / 28
Branching in this tree expresses alternatives
Parsing algorithms must explore search space systematically
Data Science Group (Informatics) NLE/ANLP Autumn 2015
Search Strategy
Agenda is a priority queue
Priority determines order items are considered
Depth-first search:
Last in first out (LIFO) priority order: agenda is a stack
Breadth-first search:
First in first out (FIFO) priority order: agenda is a queue
An AI-search view of parsing
prioritise agenda items according to some heuristic
10 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
11 / 28
Data Science Group (Informatics) NLE/ANLP Autumn 2015
12 / 28
Inefficient Re-Parsing
Chart Parsing: The Earley Algorithm
1 S :
2 NP VP :
3 VP :
4 VNPPP :
5 NP PP :
6 ARTNPP :
7 NPP :
8 PP :
9 PNP :
10 VNP :
11 NP :
12 ARTN :
13 N :
14 :
Data Science Group (Informatics)
Chart Parsing
John sang a song John sang a song (1) sang a song (2) sang a song (8)
So: simple parsing/recognition strategies suffer from problems:
1 left-recursion (top-down strategy)
2 empty productions (bottom-up strategy)
3 inefficient reparsing of subtrees
4 a further problem is dealing with structural ambiguity
— more on that next time
A general solution is to use dynamic programming – c.f. the Viterbi algorithm and HMMs seen previously
The Earley Algorithm provides an efficient approach to CFG parsing – A.K.A. Chart Parsing
a song a song song
(10) (4)
(5) (6)
(11)
sang a song (7)
repetition of rule application
Autumn 2015
a song a song song
(10) (4) (5) (6)
NLE/ANLP
13 / 28
Data Science Group (Informatics) NLE/ANLP
Edges: Partial Parse Information
Autumn 2015
14 / 28
Algorithm builds a data structure called the chart
chart contains a list of edges representing state of parse
Each edge involves a dotted rule
A → X1 …Xj •Xj+1 …Xk
Symbols on the left of the dot (•) have been ‘found’ — confirmed hypotheses
Symbols on the right of the dot are still to be found
Edges may be active or inactive: NP → ART • N
S → • NP VP VP → V NP •
active edge active and empty inactive edge
Data Science Group (Informatics) NLE/ANLP Autumn 2015
15 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
16 / 28
Edges: Position in Chart
The Chart
In addition, each edge records position of partial parse in input In general, the chart has edges of the form:
⟨i, j, A → α • β⟩
integer i: position in input where the partial parse starts integer j: position in input where the partial parse finishes
S→NP • VP
NP → John • John
NP → • ART N V → sang •
sang a
N → song • song
01234
Data Science Group (Informatics) NLE/ANLP
The Fundamental Rule of Chart Parsing
If the chart contains the edges
⟨i, j, A → α • B β⟩ and ⟨j, k, B → γ •⟩
then add the new edge
⟨i, k, A → α B • β⟩
α, β, γ are possibly empty sequences of symbols
Autumn 2015
17 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
18 / 28
Example
S → NP • VP
VP → V •
S → NP VP •
ijk
Data Science Group (Informatics) NLE/ANLP
Autumn 2015
19 / 28
Data Science Group (Informatics) NLE/ANLP Autumn 2015
20 / 28
Initialisation
Example
Fundamental rule applies to chart containing edges
How do we get started?
Initialisation:
Include inactive edges for words in the input string
Consider the input John sang a song
NP → John • V → sang • ART → a • N → song •
01234
Data Science Group (Informatics) NLE/ANLP
Rule Invocation: Bottom-Up
If adding an edge
then for every rule of the form
Autumn 2015
21 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
22 / 28
Example
⟨i , j , B → α •⟩
A→Bβ
⟨i, i, A → • B β⟩
NP → • ART N
i
ART → a •
add an edge
j
Data Science Group (Informatics) NLE/ANLP
Autumn 2015
23 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
24 / 28
Rule Invocation: Top-Down
Example
If adding an edge
then for every rule of the form
A → NP • VP
ij
⟨i,j,B→α • Aβ⟩
add an edge
A→γ
⟨j, j, A → • γ⟩
NLE/ANLP
VP → • V
VP → • V NP
Data Science Group (Informatics)
Top-Down Initialisation
Initialise the chart with
Autumn 2015
25 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
26 / 28
⟨0, 0, S → • α⟩
Next Topic: Probabilistic CFG
The challenge of ambiguity Lexical and structural ambiguity Probabilistic CFG (PCFG) Syntactic disambiguation Training a PCFG
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
27 / 28
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015
28 / 28