Statistical Parsing Using Treebanks
ANLP: Week 6, Unit 1
Shay Cohen
Based on slides from ANLP 2019
Last class
Recursive Descent Parsing
Shift-Reduce Parsing
CYK:
For j > i + 1:

  Chart[A, i, j] = ⋁_{k = i+1}^{j−1} ⋁_{A → B C} ( Chart[B, i, k] ∧ Chart[C, k, j] )

Seed the chart, for j = i + 1:

  Chart[A, i, i+1] = True if there exists a rule A → w_{i+1}, where w_{i+1} is the (i+1)th word in the string.
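As a refresher, here is a minimal Python sketch of that boolean CYK recogniser. The grammar representation (a word-to-categories map for lexical rules, a (B, C)-to-categories map for binary rules) and the function name are illustrative assumptions, not code from the course.

```python
from collections import defaultdict

def cyk_recognise(words, lexical_rules, binary_rules, start="S"):
    """Boolean CYK: chart[A, i, j] is True iff A can derive words[i:j].

    lexical_rules: dict mapping a word to the set of categories A with A -> word
    binary_rules:  dict mapping (B, C) to the set of categories A with A -> B C
    (Illustrative data format, assumed for this sketch.)
    """
    n = len(words)
    chart = defaultdict(bool)

    # Seed the chart: spans of length 1 (j = i + 1)
    for i, w in enumerate(words):
        for A in lexical_rules.get(w, set()):
            chart[A, i, i + 1] = True

    # Fill longer spans: j > i + 1, split point k with i < k < j
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (B, C), parents in binary_rules.items():
                    if chart[B, i, k] and chart[C, k, j]:
                        for A in parents:
                            chart[A, i, j] = True

    return chart[start, 0, n]
```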
How big a problem is disambiguation?
Early work in computational linguistics tried to develop broad-coverage hand-written grammars.
That is, grammars that include all sentences humans would judge as grammatical in their language, while excluding all other sentences.
As coverage grows, sentences can have hundreds or thousands of parses. Very difficult to write heuristic rules for disambiguation.
Plus, a large grammar is hard to keep track of! Trying to fix one problem can introduce others.
Enter the treebank grammar.
Towards probabilistic parsing
We’ve seen various parsing algorithms, including one that parses exhaustively in polynomial time (CKY).
But we haven’t discussed how to choose which of many possible parses is the right one.
The obvious solution: probabilities.
Today’s lecture
What is a treebank and how do we use it to get a probabilistic grammar?
How do we use probabilities in parsing?
What are some problems that probabilistic CFGs help solve?
What are some remaining weaknesses of simple PCFGs, and what are some ways to address them?
Treebank grammars
The big idea: instead of paying linguists to write a grammar, pay them to annotate real sentences with parse trees.
This way, we implicitly get a grammar (for CFG: read the rules off the trees).
And we get probabilities for those rules (using any of our favorite estimation techniques).
We can use these probabilities to improve disambiguation and even speed up parsing.
Example: The Penn Treebank
The first large-scale parse annotation project, begun in 1989.
Original corpus of syntactic parses: Wall Street Journal text
About 40,000 annotated sentences (1 million words).
Standard phrasal categories like S, NP, VP, PP.
Now many other data sets (e.g. transcribed speech), and different kinds of annotation; also inspired treebanks in many other languages.
Treebank grammars
For example, if we have this tree in our corpus, for the sentence "i saw the man":

(S (NP (Pro i)) (VP (Vt saw) (NP (Det the) (N man))))

Then we add the rules

S → NP VP     NP → Pro     Pro → i
VP → Vt NP    Vt → saw     NP → Det N
Det → the     N → man
With more trees, we can start to count rules and estimate their probabilities.
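As a concrete illustration, here is a minimal sketch of reading rules off a tree and counting them. The nested-tuple tree representation and the helper name rules_of are hypothetical choices for this sketch, not something defined in the lecture.

```python
from collections import Counter

def rules_of(tree):
    """Yield (lhs, rhs) rules from a tree given as nested tuples,
    e.g. ("S", ("NP", ("Pro", "i")), ("VP", ...))."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for c in children:
        if not isinstance(c, str):      # recurse into non-terminal children
            yield from rules_of(c)

tree = ("S",
        ("NP", ("Pro", "i")),
        ("VP", ("Vt", "saw"),
               ("NP", ("Det", "the"), ("N", "man"))))

rule_counts = Counter()
for lhs, rhs in rules_of(tree):
    rule_counts[lhs, rhs] += 1
# e.g. rule_counts["S", ("NP", "VP")] == 1
```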
Other language treebanks
Many annotated with dependency grammar rather than CFG (see next lecture).
Some require paid licenses, others are free. Just a few examples:
Danish Dependency Treebank
Alpino Treebank (Dutch)
Bosque Treebank (Portuguese)
Talbanken (Swedish)
Prague Dependency Treebank (Czech)
TIGER corpus, Tuebingen Treebank, NEGRA corpus (German)
Penn Chinese Treebank
Penn Arabic Treebank
Tuebingen Treebank of Spoken Japanese, Kyoto Text Corpus
Creating a treebank PCFG
A probabilistic context-free grammar (PCFG) is a CFG where each rule A → α (where α is a symbol sequence) is assigned a probability P(α|A).
The sum over all expansions of A must equal 1:

  ∑_{α′} P(α′|A) = 1
Easiest way to create a PCFG from a treebank: MLE
Count all occurrences of A → α in treebank.
Divide by the count of all rule occurrences whose LHS is A to get P(α|A).
But as usual many rules have very low frequencies, so MLE isn’t good enough and we need to smooth.
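Continuing the sketch above, the MLE estimates can be read off the rule counts by normalising within each left-hand side (unsmoothed; mle_pcfg is a hypothetical helper name for illustration).

```python
from collections import Counter, defaultdict

def mle_pcfg(rule_counts):
    """MLE: P(alpha | A) = count(A -> alpha) / count(all rules with LHS A)."""
    lhs_totals = Counter()
    for (lhs, _rhs), c in rule_counts.items():
        lhs_totals[lhs] += c
    probs = defaultdict(dict)
    for (lhs, rhs), c in rule_counts.items():
        probs[lhs][rhs] = c / lhs_totals[lhs]
    return probs
```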
The probability of a parse
Under this model, the probability of a parse t is simply the product of the probabilities of all the rules used in t:

  P(t) = ∏_{A→α ∈ t} P(α|A)
Example:
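A minimal computational sketch of this product, reusing the hypothetical rules_of helper and the rule-probability dictionary from the earlier sketches:

```python
def parse_prob(tree, probs):
    """Probability of a parse: product of P(alpha | A) over rules A -> alpha in the tree."""
    p = 1.0
    for lhs, rhs in rules_of(tree):   # rules_of as sketched earlier
        p *= probs[lhs][rhs]
    return p
```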
The generative model
Like n-gram models and HMMs, PCFGs are a generative model, which assumes sentences are generated as follows:
Start with root category S.
Choose an expansion α for S with probability P(α|S).
Then recurse on each symbol in α.
Continue until all symbols are terminals (nothing left to expand).
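Here is a small sketch of that generative story as a sampler, assuming the same rule-probability dictionary format as above (illustrative only):

```python
import random

def generate(symbol, probs):
    """Expand `symbol` top-down by sampling rules until only terminals remain."""
    if symbol not in probs:           # terminal: nothing left to expand
        return [symbol]
    expansions = list(probs[symbol].items())
    rhs = random.choices([r for r, _ in expansions],
                         weights=[p for _, p in expansions])[0]
    words = []
    for child in rhs:
        words.extend(generate(child, probs))
    return words
```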
Statistical disambiguation example
How can parse probabilities help disambiguate PP attachment?
Let’s use the following PCFG, inspired by Manning & Schuetze (1999):
S  → NP VP    1.0      NP → NP PP        0.4
VP → V NP     0.7      NP → kids         0.1
VP → VP PP    0.3      NP → birds        0.18
PP → P NP     1.0      NP → saw          0.04
P  → with     1.0      NP → fish         0.18
V  → saw      1.0      NP → binoculars   0.1
We want to parse kids saw birds with fish.
Probability of parse 1
t1 = (S (NP kids) (VP (V saw) (NP (NP birds) (PP (P with) (NP fish)))))

Rules used: S → NP VP (1.0), NP → kids (0.1), VP → V NP (0.7), V → saw (1.0), NP → NP PP (0.4), NP → birds (0.18), PP → P NP (1.0), P → with (1.0), NP → fish (0.18).

P(t1) = 1.0 · 0.1 · 0.7 · 1.0 · 0.4 · 0.18 · 1.0 · 1.0 · 0.18 = 0.0009072
The probability of a sentence
Since t implicitly includes the words w⃗, we have P(t) = P(t, w⃗).
So we also have a language model. Sentence probability is obtained by summing over T(w⃗), the set of valid parses of w⃗:

  P(w⃗) = ∑_{t ∈ T(w⃗)} P(t, w⃗) = ∑_{t ∈ T(w⃗)} P(t)
In our example,
P(kids saw birds with fish) = 0.0006804 + 0.0009072 = 0.0015876.
Probability of parse 2
t2 = (S (NP kids) (VP (VP (V saw) (NP birds)) (PP (P with) (NP fish))))

Rules used: S → NP VP (1.0), NP → kids (0.1), VP → VP PP (0.3), VP → V NP (0.7), V → saw (1.0), NP → birds (0.18), PP → P NP (1.0), P → with (1.0), NP → fish (0.18).

P(t2) = 1.0 · 0.1 · 0.3 · 0.7 · 1.0 · 0.18 · 1.0 · 1.0 · 0.18 = 0.0006804, which is less than P(t1) = 0.0009072, so t1 is preferred. Yay!
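As a sanity check, the following sketch scores both attachments under the toy grammar above and sums them to get the sentence probability; it reuses the hypothetical parse_prob helper and nested-tuple tree format from the earlier sketches.

```python
# Toy PCFG from the slides, as a nested dict of P(rhs | lhs).
probs = {
    "S":  {("NP", "VP"): 1.0},
    "VP": {("V", "NP"): 0.7, ("VP", "PP"): 0.3},
    "PP": {("P", "NP"): 1.0},
    "P":  {("with",): 1.0},
    "V":  {("saw",): 1.0},
    "NP": {("NP", "PP"): 0.4, ("kids",): 0.1, ("birds",): 0.18,
           ("saw",): 0.04, ("fish",): 0.18, ("binoculars",): 0.1},
}

# Noun attachment (t1) vs. verb attachment (t2) for "kids saw birds with fish".
t1 = ("S", ("NP", "kids"),
           ("VP", ("V", "saw"),
                  ("NP", ("NP", "birds"),
                         ("PP", ("P", "with"), ("NP", "fish")))))
t2 = ("S", ("NP", "kids"),
           ("VP", ("VP", ("V", "saw"), ("NP", "birds")),
                  ("PP", ("P", "with"), ("NP", "fish"))))

p1, p2 = parse_prob(t1, probs), parse_prob(t2, probs)   # ≈ 0.0009072 and 0.0006804
print(p1, p2, p1 + p2)   # the sum is P(kids saw birds with fish)
```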