Parsing with Probabilities
ANLP: Week 6, Unit 2
Shay Cohen
Based on slides from ANLP 2019
How to find the best parse?
First, remember the standard CKY algorithm.
Fills in cells in well-formed substring table (chart) by combining previously computed child cells.
Example sentence with word-boundary positions: 0 he 1 saw 2 her 3 duck 4
Probabilistic CKY
We also have analogues to the other HMM algorithms.
The inside algorithm computes the probability of the sentence (analogous to the forward algorithm).
Same as CKY above, but instead of storing the best parse for A, store the sum of the probabilities of all analyses of A.
The inside-outside algorithm is a form of EM that learns grammar rule probabilities from unannotated sentences (analogous to forward-backward).
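As a sketch of these two recursions (the δ/β notation here is an assumption, not taken from the slides), the Viterbi-CKY and inside charts differ only in a max versus a sum:

\[
\delta(A,i,j) \;=\; \max_{A \to B\,C}\ \max_{i<k<j}\ P(A \to B\,C)\,\delta(B,i,k)\,\delta(C,k,j) \qquad \text{(Viterbi CKY)}
\]
\[
\beta(A,i,j) \;=\; \sum_{A \to B\,C}\ \sum_{i<k<j}\ P(A \to B\,C)\,\beta(B,i,k)\,\beta(C,k,j) \qquad \text{(inside)}
\]

with base case \(\delta(A,i,i+1) = \beta(A,i,i+1) = P(A \to w_{i+1})\); the probability of the whole sentence is \(\beta(S,0,n)\).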
Completed chart for “he saw her duck” (cell (i, j) lists the categories found spanning positions i to j):
(0,1): Pro, NP         (0,4): S
(1,2): Vt, Vp, N       (1,4): VP
(2,3): Pro, PosPro, D  (2,4): NP
(3,4): N, Vi
Probabilistic CKY
It is straightforward to extend CKY parsing to the probabilistic case.
Goal: return the highest probability parse of the sentence.
When we find an A spanning (i, j), store its probability along with its label and backpointers in cell (i, j).
If we later find an A with the same span but higher probability, replace the probability for A in cell (i,j), and update the backpointers to the new children.
Analogous to Viterbi: we iterate over all possible child pairs (rather than previous states) and store the probability and backpointers for the best one.
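A minimal sketch of this probabilistic CKY loop, assuming a CNF grammar supplied as plain dictionaries of rule probabilities (the function name and data structures are illustrative, not from the slides):

```python
from collections import defaultdict
import math

def viterbi_cky(words, lexical_rules, binary_rules):
    """Return the highest-probability parse of `words` under a CNF PCFG.

    lexical_rules: dict word -> list of (category, prob) for rules A -> word
    binary_rules:  dict (B, C) -> list of (A, prob)       for rules A -> B C
    Log probabilities are used to avoid underflow.
    """
    n = len(words)
    # chart[(i, j)] maps category A -> (log prob, backpointer)
    chart = defaultdict(dict)

    # Base case: spans of length 1, from the lexical rules.
    for i, w in enumerate(words):
        for cat, p in lexical_rules.get(w, []):
            chart[(i, i + 1)][cat] = (math.log(p), w)

    # Fill longer spans bottom-up, exactly as in standard CKY, but keep
    # only the best-scoring analysis of each category in each cell.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for B, (b_logp, _) in chart[(i, k)].items():
                    for C, (c_logp, _) in chart[(k, j)].items():
                        for A, p in binary_rules.get((B, C), []):
                            score = math.log(p) + b_logp + c_logp
                            # If we find an A with the same span but higher
                            # probability, replace its entry and backpointers.
                            if A not in chart[(i, j)] or score > chart[(i, j)][A][0]:
                                chart[(i, j)][A] = (score, (B, C, k))

    def build(cat, i, j):
        _, back = chart[(i, j)][cat]
        if isinstance(back, str):                # lexical backpointer: the word
            return (cat, back)
        B, C, k = back
        return (cat, build(B, i, k), build(C, k, j))

    return build('S', 0, n) if 'S' in chart[(0, n)] else None
```

Replacing the max comparison with a running sum of probabilities (and dropping the backpointers) turns the same loop into the inside algorithm mentioned earlier.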
Best-first probabilistic parsing
So far, we’ve been assuming exhaustive parsing: return all possible parses.
But treebank grammars are huge!!
Exhaustive parsing of WSJ sentences up to 40 words long adds on average over 1 million items to the chart per sentence.¹
Can be hundreds of possible parses, but most have extremely low probability: do we really care about finding these?
Best-first parsing can help.
¹Charniak, Goldwater, and Johnson, WVLC 1998.
Best-first probabilistic parsing
Use probabilities of subtrees to decide which ones to build up further.
Each time we find a new constituent, we give it a score (“figure of merit”) and add it to an agenda², which is ordered by score.
Then we pop the next item off the agenda, add it to the chart, and see which new constituents we can make using it.
We add those to the agenda, and iterate.
Notice we are no longer filling the chart in any fixed order.
Many variations on this idea, often limiting the size of the agenda by pruning out low-scoring edges (beam search).
²a.k.a. a priority queue
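A rough sketch of this agenda loop, assuming the same grammar dictionaries as the CKY sketch above and some figure-of-merit function fom (both hypothetical); real systems add pruning and better bookkeeping:

```python
import heapq
import math

def best_first_parse(words, lexical_rules, binary_rules, fom):
    """Agenda-based best-first parsing sketch.

    fom(log_prob, i, j) is the figure of merit used to order the agenda,
    e.g. fom = lambda logp, i, j: logp / (j - i)   (per-word score).
    Returns the log probability of the first S covering the whole sentence
    (backpointers for recovering the tree are omitted for brevity).
    """
    n = len(words)
    chart = {}           # (cat, i, j) -> log prob, once popped off the agenda
    agenda = []          # max-heap simulated by negating the score

    def push(cat, i, j, logp):
        heapq.heappush(agenda, (-fom(logp, i, j), cat, i, j, logp))

    # Seed the agenda with lexical constituents.
    for i, w in enumerate(words):
        for cat, p in lexical_rules.get(w, []):
            push(cat, i, i + 1, math.log(p))

    while agenda:
        _, cat, i, j, logp = heapq.heappop(agenda)
        if (cat, i, j) in chart:          # keep only the first (highest-fom) copy
            continue
        chart[(cat, i, j)] = logp
        if cat == 'S' and i == 0 and j == n:
            return logp                   # first complete parse found

        # Try to combine the popped constituent with its chart neighbours.
        for (other, a, b), other_logp in list(chart.items()):
            if b == i:                    # other is on the left: A -> other cat
                for A, p in binary_rules.get((other, cat), []):
                    push(A, a, j, math.log(p) + other_logp + logp)
            if a == j:                    # other is on the right: A -> cat other
                for A, p in binary_rules.get((cat, other), []):
                    push(A, i, b, math.log(p) + logp + other_logp)
    return None
```

Capping the size of the agenda (dropping its lowest-scoring entries) turns this into the beam-search variant mentioned above.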
How do we score constituents?
Perhaps according to the probability of the subtree they span? So, P(left example) = (0.7)(1.0)(0.18) = 0.126 and P(right example) = (1.0)(1.0)(0.18) = 0.18.
[VP0.7 [V1.0 saw] [NP0.18 birds]]          [PP1.0 [P1.0 with] [NP0.18 fish]]
Best-first intuition
Suppose red constituents are in the chart already; blue ones are on the agenda. The two candidate parses of “he saw her duck”:

Left:  [S [NP [Pro he]] [VP [Vt saw] [NP [PosPro her] [N duck]]]]
Right: [S [NP [Pro he]] [VP [Vp saw] [NP [Pro her]] [VP [Vi duck]]]]
If the VP in the right-hand tree scores high enough, we’ll pop that next, add it to the chart, then find the S. So, we could complete the whole parse before even finding the alternative VP.
How do we score constituents?
But what about comparing different sized constituents?
[VP0.7 [V1.0 saw] [NP0.18 birds]]

[VP0.3 [VP0.7 [V1.0 saw] [NP0.18 birds]] [PP1.0 [P1.0 with] [NP0.18 fish]]]
A better figure of merit
If we use raw probabilities for the score, smaller constituents will almost always have higher scores.
Meaning we pop all the small constituents off the agenda before the larger ones.
Which would be very much like exhaustive bottom-up parsing!
Instead, we can normalize for length: for example, divide the log probability by the number of words in the constituent.
Very much like we did when comparing language models (recall per-word cross-entropy)!
This works much better, though still not guaranteed to find the best parse first. Other improvements are possible.
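A minimal illustration with the two VPs from the example above, using average log probability per word as the figure of merit (one common choice; the helper name is ours):

```python
import math

def per_word_score(prob, num_words):
    """Figure of merit: average log probability per word spanned."""
    return math.log(prob) / num_words

small_vp = 0.7 * 1.0 * 0.18                    # [VP saw birds], 2 words
big_vp = 0.3 * small_vp * 1.0 * 1.0 * 0.18     # [VP [VP saw birds] [PP with fish]], 4 words

print(small_vp, big_vp)                        # raw probabilities: 0.126 vs ~0.0068
print(per_word_score(small_vp, 2))             # ~ -1.04
print(per_word_score(big_vp, 4))               # ~ -1.25
```

The gap between the scores is now far smaller than between the raw probabilities, so larger constituents are no longer automatically dominated by small ones, although (as noted above) the best parse is still not guaranteed to come out first.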
Vanilla PCFGs: no lexical dependencies
Left (NP attachment):
[S1.0 [NP0.1 kids] [VP0.7 [V1.0 saw] [NP0.4 [NP0.18 birds] [PP1.0 [P1.0 with] [NP0.1 binoculars]]]]]

Right (VP attachment):
[S1.0 [NP0.1 kids] [VP0.3 [VP0.7 [V1.0 saw] [NP0.18 birds]] [PP1.0 [P1.0 with] [NP0.1 binoculars]]]]
Exactly the same probabilities as the “fish” trees, except divide out P(fish|NP) and multiply in P(binoculars|NP) in each case.
So, the same (left) tree is preferred, but now incorrectly!
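A quick arithmetic check, using the rule probabilities shown in the trees above, makes the point explicit:

```python
# The rules shared by both trees cancel out; the attachment decision
# comes down to the rules that differ between the two analyses.
np_attach = 0.7 * 0.4    # VP -> V NP   and  NP -> NP PP
vp_attach = 0.3 * 0.7    # VP -> VP PP  and  VP -> V NP

# P(noun | NP) for "fish" or "binoculars" multiplies both trees equally,
# so NP attachment wins (0.28 > 0.21) no matter which noun appears.
print(np_attach, vp_attach)
```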
But wait a minute…
Best-first parsing shows how simple (“vanilla”) treebank PCFGs can improve efficiency. But do they really solve the problem of disambiguation?
Our example grammar gave the right parse for this sentence: kids saw birds with fish
What happens if we parse this sentence? kids saw birds with binoculars
Vanilla PCFGs: no lexical dependencies
Replacing one word with another with the same POS will never result in a different parsing decision, even though it should!
More examples:
She stood by the door covered in tears vs.
She stood by the door covered in ivy
She called on the student vs.
She called on the phone.
(assuming “on” has the same POS…)
Vanilla PCFGs: no global structural preferences
Ex. in the Switchboard corpus, the probability of NP → Pronoun
in subject position is 0.91: “he saw the dog”
in object position is 0.34: “the dog bit him”
Lots of other rules also have different probabilities depending on where they occur in the sentence.
But PCFGs are context-free, so an NP is an NP is an NP, and will have the same expansion probs regardless of where it appears.
Example of parent annotation
[S^ROOT [NP^S kids] [VP^S [VP^VP [V^VP saw] [NP^VP birds]] [PP^VP [P^PP with] [NP^PP fish]]]]
Ways to fix PCFGs (1): parent annotation
Automatically create new categories that include the old category and its parent.
So, an NP in subject position becomes NP^S, with other NPs becoming NP^VP, NP^PP, etc.
Ex. rules:
S^ROOT → NP^S VP^S
NP^S → Pro^NP
NP^S → NP^NP PP^NP
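A minimal sketch of the transform, on trees written as nested lists [label, child, ...] (this representation and the function name are assumptions for illustration):

```python
def parent_annotate(tree, parent='ROOT'):
    """Return a copy of `tree` in which every non-terminal label is
    annotated with its parent's (un-annotated) label, e.g. NP -> NP^S."""
    if isinstance(tree, str):            # a word: leave it alone
        return tree
    label, *children = tree
    new_label = f'{label}^{parent}'
    return [new_label] + [parent_annotate(c, parent=label) for c in children]

tree = ['S', ['NP', ['Pro', 'he']],
        ['VP', ['Vt', 'saw'], ['NP', ['PosPro', 'her'], ['N', 'duck']]]]
print(parent_annotate(tree))
# ['S^ROOT', ['NP^S', ['Pro^NP', 'he']],
#  ['VP^S', ['Vt^VP', 'saw'], ['NP^VP', ['PosPro^NP', 'her'], ['N^NP', 'duck']]]]
```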
Ways to fix PCFGs (2): lexicalization
Again, create new categories, this time by adding the lexical head of the phrase:
[S-saw [NP-kids kids] [VP-saw [VP-saw [V-saw saw] [NP-birds birds]] [PP-fish [P-with with] [NP-fish fish]]]]
Now consider:
VP-saw → VP-saw PP-fish   vs.   VP-saw → VP-saw PP-binoculars
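A rough sketch of how lexical heads might be percolated up a tree to produce labels like those above; the head-finding table is a toy assumption (real parsers use much richer head rules):

```python
# Toy head rules: for each parent label, which child label supplies the head.
HEAD_CHILD = {'S': 'VP', 'VP': 'VP', 'NP': 'NP', 'PP': 'NP'}

def lexicalize(tree):
    """Annotate every node with its lexical head. Trees are nested lists
    [label, child, ...] with words as strings. Returns (new_tree, head)."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]                    # pre-terminal: head is the word
        return [f'{label}-{word}', word], word
    new_children, heads = [], []
    for c in children:
        new_c, h = lexicalize(c)
        new_children.append(new_c)
        heads.append((c[0], h))
    # Pick the head child by label; fall back to the first child.
    wanted = HEAD_CHILD.get(label)
    head = next((h for lbl, h in heads if lbl == wanted), heads[0][1])
    return [f'{label}-{head}'] + new_children, head

tree = ['S', ['NP', 'kids'],
        ['VP', ['VP', ['V', 'saw'], ['NP', 'birds']],
               ['PP', ['P', 'with'], ['NP', 'fish']]]]
print(lexicalize(tree)[0])   # reproduces the S-saw / VP-saw / PP-fish labels above
```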
Practical issues, again
All this category-splitting makes the grammar much more specific (good!)
But leads to huge grammar blowup and very sparse data (bad!)
Lots of effort over the years on how to balance these two issues.
Complex smoothing schemes (similar to N-gram interpolation/backoff).
More recently, emphasis on automatically learned subcategories and now neural parsers with implicit subcategories.
But how do we know which method works best?
Evaluating parse accuracy
Compare the gold standard tree with the parser output for “he saw her duck”:

Gold:   [S [NP [Pro he]] [VP [Vt saw] [NP [PosPro her] [N duck]]]]
Output: [S [NP [Pro he]] [VP [Vp saw] [NP [Pro her]] [VP [Vi duck]]]]

Precision: (# correct constituents) / (# constituents in parser output) = 3/5
Recall: (# correct constituents) / (# constituents in gold standard) = 3/4
F-score: balances precision and recall: 2pr/(p+r)
Evaluating parse accuracy
Compare the gold standard tree to the parser output (the same two trees as on the previous slide).
An output constituent is counted correct if there is a gold constituent that spans the same sentence positions.
Harsher measure: also require the constituent labels to match.
Pre-terminals don’t count as constituents.
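A minimal sketch of this evaluation, with constituents written as (label, start, end) tuples over word positions (a representation we assume for illustration):

```python
def parseval(gold, output, labeled=False):
    """PARSEVAL-style precision, recall and F-score.

    gold, output: collections of (label, start, end) constituents,
    excluding pre-terminals. If labeled=False, only the spans must match;
    if labeled=True, the labels must match too (the harsher measure).
    """
    key = (lambda c: c) if labeled else (lambda c: (c[1], c[2]))
    gold_set = {key(c) for c in gold}
    out_set = {key(c) for c in output}
    correct = len(gold_set & out_set)
    p = correct / len(out_set)
    r = correct / len(gold_set)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# The two trees from the slides, as labelled spans over "he saw her duck":
gold = [('S', 0, 4), ('NP', 0, 1), ('VP', 1, 4), ('NP', 2, 4)]
output = [('S', 0, 4), ('NP', 0, 1), ('VP', 1, 4), ('NP', 2, 3), ('VP', 3, 4)]
print(parseval(gold, output))   # unlabeled: (0.6, 0.75, ~0.667)
```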
Parsing: where are we now?
Parsing is not just WSJ. Lots of situations are much harder!
Other languages, esp. with free word order (will discuss next time) or little annotated data.
Other domains, esp. with jargon (e.g., biomedical) or non-standard language (e.g., social media text).
And parsing to syntactic constituents isn’t the only option, as we’ll see next time…
Parsing: where are we now?
This lecture discussed the basics of probabilistic parsing and should give you a good idea of the issues involved.
State-of-the-art parsers address these issues in other ways. For comparison, parsing F-scores on WSJ corpus are:
vanilla PCFG: < 80%³
lexicalizing + cat-splitting: 89.5% (Charniak, 2000)
Best current parsers get about 94%.
We’ll say a little bit about recent methods later, but most details in sem 2.
³Charniak (1996) reports 81%, but using gold POS tags as input.