Computational Linguistics
CSC 485 Summer 2020
6. Statistical resolution of PP attachment ambiguities
Gerald Penn
Department of Computer Science, University of Toronto
Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.
Statistical PP attachment methods
• A classification problem.
• Input: verb, noun1, preposition, noun2
Output: V-attach or N-attach
• Example:
  examined the raw materials with the optical microscope
     v           n1          p           n2
  (the verb is possibly omitted)
• Does not cover all PP problems.
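As a minimal sketch of this interface (the names are illustrative, not from the slides), the input quadruple and the output label might be represented as:

from typing import Literal, NamedTuple

Attachment = Literal["V-attach", "N-attach"]

class PPCase(NamedTuple):
    v: str    # verb
    n1: str   # head noun of the object NP
    p: str    # preposition
    n2: str   # head noun inside the PP

# e.g. "examined the raw materials with the optical microscope":
case = PPCase(v="examine", n1="material", p="with", n2="microscope")
# A resolver is then any function from a PPCase to an Attachment.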
Hindle & Rooth 1993: Input 1
• Corpus: Partially parsed news text.
– Automatic.
– Many attachment decisions punted.
– A collection of parse fragments for each sentence.
Hindle, Donald and Rooth, Mats. Structural ambiguity and lexical relations. Computational Linguistics, 19(1), 1993, 103–120.
The radical changes in export and customs regulations evidently are aimed at remedying an extreme shortage of consumer goods in the Soviet Union and assuaging citizens angry over the scarcity of such basic items as soap and windshield wipers.
From Hindle & Rooth 1993
Hindle & Rooth 1993: Input 2
• Data: [v,n,p] triples; v or p may be null; v may be –.
The radical changes in export and customs regulations evidently are aimed at remedying an extreme shortage of consumer goods in the Soviet Union and assuaging citizens angry over the scarcity of such basic items as soap and windshield wipers.
v         n          p
–         change     in
aim       PRO        at
remedy    shortage   of
NULL      good       in
assuage   citizen    NULL
NULL      scarcity   of
From Hindle & Rooth 1993
Hindle & Rooth 1993: Algorithm 1
• Idea: Compute lexical associations (LAs) between p and each of v, n.
— Is the p more associated with the v or with the n?
• Learn a way to compute LA for each [v,n,p] triple.
• Use it to map from [v,n,p] to {V-attach, N-attach}.
Hindle & Rooth 1993: Algorithm 2
Method: Bootstrapping.
1. Label unambiguous cases as N- or V-attach:
   when v or p is NULL, n is a pronoun, or p is of.
2. Iterate (until nothing changes):
   a. Compute the lexical association score for each triple from the data labelled so far.
   b. Label the attachment of any new triples whose score is over a threshold.
3. Deal with “leftovers” (random assignment).
Test cases: Compute the LA score (or fail).
(A sketch of this loop follows.)
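A sketch of the bootstrapping loop in Python, under labelled assumptions: the threshold value and all names are illustrative, the unambiguous-case labels are one reading of the slide's heuristics, and la_score is the count-based score sketched two slides below.

import random

THRESHOLD = 2.0  # illustrative; not the paper's actual cutoff

def bootstrap(triples, la_score):
    """Label (v, n, p) triples as 'V'- or 'N'-attach by bootstrapping."""
    labels = {}
    # 1. Unambiguous cases: no verb candidate or p == "of" -> N-attach;
    #    pronoun object -> V-attach (one reading of the slide's heuristics).
    for i, (v, n, p) in enumerate(triples):
        if p is None:
            continue                       # no PP, nothing to attach
        if v is None or p == "of":
            labels[i] = "N"
        elif n == "PRO":
            labels[i] = "V"
    # 2. Iterate until nothing changes.
    changed = True
    while changed:
        changed = False
        for i, (v, n, p) in enumerate(triples):
            if i in labels or p is None:
                continue
            score = la_score(v, n, p, triples, labels)
            if score is not None and abs(score) > THRESHOLD:
                labels[i] = "V" if score > 0 else "N"
                changed = True
    # 3. Leftovers: random assignment.
    for i, (v, n, p) in enumerate(triples):
        if i not in labels and p is not None:
            labels[i] = random.choice(["V", "N"])
    return labels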
Hindle & Rooth 1993: Algorithm 3
• Lexical association score: log-likelihood ratio of verb- and noun-attachment.
LA(v,n,p) = log2 [P(V-attach p | v,n) / P(N-attach p | v,n)]
• Can’t get these probabilities directly — data are too sparse.
• So estimate them from the data that we can get.
Hindle & Rooth 1993: Algorithm 4
• Lexical association score: log-likelihood ratio of verb- and noun-attachment.
LA(v,n,p) = log2 [P(V-attach p | v,n) / P(N-attach p | v,n)]
          ≈ log2 [(P(V-attach p | v) · P(NULL | n)) ❶ / P(N-attach p | n) ❷]
Based on frequency counts c in the labelled data. What are these probabilities “saying”?
• Why a ratio of probabilities? Why the log of the ratio?
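A sketch of the count-based estimates behind LA, assuming count tables built from the labelled data: c_vp[(v, p)], c_v[v], c_np[(n, p)], c_n_null[n], and c_n[n] (names illustrative; the paper's actual estimator differs in details such as smoothing).

from math import log2

def la(v, n, p, c_vp, c_v, c_np, c_n_null, c_n):
    """LA(v,n,p) = log2[P(V-attach p|v) * P(NULL|n) / P(N-attach p|n)]."""
    if c_v.get(v, 0) == 0 or c_n.get(n, 0) == 0:
        return None                        # no counts for v or n: fail
    p_v_attach = c_vp.get((v, p), 0) / c_v[v]     # c(v, p) / c(v)
    p_null     = c_n_null.get(n, 0) / c_n[n]      # c(n, NULL) / c(n)
    p_n_attach = c_np.get((n, p), 0) / c_n[n]     # c(n, p) / c(n)
    if p_v_attach * p_null == 0 or p_n_attach == 0:
        return None                        # a zero count: score undefined
    return log2(p_v_attach * p_null / p_n_attach)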
Hindle & Rooth 1993: Example 1
Moscow sent more than 100,000 soldiers into Afghanistan …
Choose between:
V-attach: [VP send [NP … soldier NULL] [PP into …]]
N-attach: [VP send [NP … soldier [PP into …]] …]
Hindle & Rooth 1993: Example 2
❶ P(V-attach into | send, soldier)
   ≈ P(V-attach into | send) · P(NULL | soldier)
   = [c(send, into)/c(send)] · [c(soldier, NULL)/c(soldier)]
   ≈ .049 × .800
❷ P(N-attach into | send, soldier)
   ≈ P(N-attach into | soldier)
   = c(soldier, into)/c(soldier)
   ≈ .0007
LA(send, soldier, into) = log2(.049 × .800 / .0007) ≈ 5.81
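As a quick check of the arithmetic with the counts above:

from math import log2
print(log2(0.049 * 0.800 / 0.0007))   # ≈ 5.81 > 0, so V-attach wins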
Hindle & Rooth 1993: Results
• Training: 223K triples
Testing: 1K triples
Results: 80% accuracy
(Baseline: 66% by always choosing noun attachment; human performance: 88%.)
Hindle & Rooth 1993: Discussion
• Advantages: Unsupervised; gives degree of preference.
• Disadvantages: Needs lots of partially parsed data. Words other than v, n, and p don’t get a vote.
• Importance to CL:
• Use of large amounts of unlabelled data, with clever application of linguistic knowledge, to learn useful statistics.
Brill & Resnik 1994: Method
• Corpus-based, non-statistical method.
• Transformation-based learning: Learns a sequence of rules to apply to each input item.
• Form of transformation rules:
  Flip attachment decision (from V to N1 or vice versa) if one of {v,n1,p,n2} is w1 [and one of {v,n1,p,n2} is w2].
  – The bracketed test is an optional conjunct.
  – The input is a quad: it uses the head noun of the PP (n2) too.
• All rules apply, in the order in which they are learned. (A sketch of rule application follows.)
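A sketch of rule application, with an illustrative rule representation (not Brill & Resnik's own encoding):

from typing import NamedTuple, Optional

class Rule(NamedTuple):
    from_label: str                        # "N1" or "V"
    to_label: str                          # flip to this attachment
    cond1: tuple                           # (slot, word), e.g. ("p", "at")
    cond2: Optional[tuple] = None          # optional second conjunct

def apply_rules(quad, rules, initial="N1"):
    """Apply every learned rule, in order, to one {v,n1,p,n2} quad."""
    label = initial
    for r in rules:
        if label != r.from_label:
            continue
        if quad[r.cond1[0]] != r.cond1[1]:
            continue
        if r.cond2 is not None and quad[r.cond2[0]] != r.cond2[1]:
            continue
        label = r.to_label
    return label

# e.g. learned rule "from N1 to V if p is at":
rules = [Rule("N1", "V", ("p", "at"))]
print(apply_rules({"v": "send", "n1": "troops", "p": "at", "n2": "dawn"}, rules))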
Brill & Resnik 1994: Method
Unlabelled text [attachments not assigned]
  ↓ Initial-state labeller
Labelled text [attachments assigned, but maybe not correctly]
  ↓ Learner
Transformations [ordered list of rules to apply to new data]
The learner compares the labelled text against the truth [attachments all correctly labelled], uses the diffs to select a new rule, then applies it. (A sketch of the learner loop follows.)
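A sketch of the greedy learner loop, reusing the illustrative Rule representation from the previous sketch; the stopping criterion is an assumption:

def apply_one(r, quad, label):
    """Apply a single Rule (see the previous sketch) to one labelled quad."""
    if (label == r.from_label and quad[r.cond1[0]] == r.cond1[1]
            and (r.cond2 is None or quad[r.cond2[0]] == r.cond2[1])):
        return r.to_label
    return label

def learn(quads, truth, candidate_rules, initial="N1"):
    """Greedily pick the rule that most reduces disagreement with truth."""
    labels = [initial] * len(quads)        # initial-state labeller
    learned = []
    while True:
        best_rule, best_gain = None, 0
        for r in candidate_rules:
            # net corrections minus new errors this rule would cause
            gain = sum(
                (apply_one(r, q, l) == t) - (l == t)
                for q, l, t in zip(quads, labels, truth)
            )
            if gain > best_gain:
                best_rule, best_gain = r, gain
        if best_rule is None:              # no rule helps any more: stop
            break
        labels = [apply_one(best_rule, q, l) for q, l in zip(quads, labels)]
        learned.append(best_rule)
    return learned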
Brill & Resnik 1994: Example
Some rules learned:
Start by assuming N1 attachment, and then change attachment …
 1. from N1 to V if p is at.
 2. from N1 to V if p is as.
 ⋮
 6. from N1 to V if n2 is year.
 8. from N1 to V if p is in and n1 is amount.
 ⋮
15. from N1 to V if v is have and p is in.
17. from V to N1 if p is of.
Brill & Resnik 1994: Results
• Training: 12K annotated quads
Testing: 500 quads
Results: 80% accuracy (Baseline: 64% by noun attachment)
Brill & Resnik 1994: Discussion
• Advantages: Readable rules (though they may still be hard to interpret); can build bias into the initial annotation; small number of rules.
• Disadvantages: Supervised; no strength of preference. Very memory-intensive.
• Importance to CL:
• Successful general method for non-statistical learning from an annotated corpus.
• Based on a popular (and relatively easily modified) tagger.
Ratnaparkhi 1998: Introduction
• Using large amounts of cheap, noisy data in an unsupervised setting.
• Corpus processing:
• PoS tagged.
• “Chunked” using simple regular expressions.
• “Unambiguous” attachment data:
• Based on errorful heuristics (cf. Hindle & Rooth).
• Quantity versus quality of data.
Ratnaparkhi 1998: Outline
Raw text:
  The professional conduct of lawyers in other jurisdictions …
    ↓ Tagger
PoS-tagged text:
  The/DT professional/JJ conduct/NN of/IN lawyers/NNS in/IN other/JJ jurisdictions/NNS …
    ↓ Chunker
Tagged text with NPs replaced by head nouns:
  conduct/NN of/IN lawyers/NNS in/IN jurisdictions/NNS …
    ↓ Extractor
“Unambiguous” triples, (n,p,n2) and (v,p,n2):
  (n = lawyers, p = in, n2 = jurisdictions) …
    ↓ Morph processor
Final triples with words replaced by base forms:
  (n = lawyer, p = in, n2 = jurisdiction) …
Ratnaparkhi 1998: Unambiguous triples
• Extract (n,p,n2) as “unambiguous” if p ≠ of and:
– n is first noun within k words left of p; and
– no verb occurs within k words left of p; and
– n2 is first noun within k words right of p; and
– no verb occurs between p and n2.
• Extract (v,p,n2) as “unambiguous” if p ≠ of and:
– v (≠ be) is first verb within k words left of p; and
– no noun intervenes between v and p; and
– n2 is first noun within k words right of p; and
– no verb occurs between p and n2.
• Why are “unambiguous” data only 69% correct?
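A sketch of the (n,p,n2) heuristic above, assuming tags collapsed to "N"/"V"/"P" and an illustrative window size k (a simplification of the slide's conditions):

K = 5  # window size; illustrative, not Ratnaparkhi's value

def extract_npn(tokens, k=K):
    """Extract (n, p, n2) triples deemed 'unambiguous' by the heuristic."""
    triples = []
    for i, (w, t) in enumerate(tokens):
        if t != "P" or w == "of":
            continue                     # only prepositions, and p ≠ of
        left = tokens[max(0, i - k):i]
        if any(tag == "V" for _, tag in left):
            continue                     # no verb within k words left of p
        nouns_left = [wd for wd, tag in reversed(left) if tag == "N"]
        if not nouns_left:
            continue
        n = nouns_left[0]                # first noun left of p
        n2 = None
        for wd, tag in tokens[i + 1:i + 1 + k]:
            if tag == "V":
                break                    # no verb between p and n2
            if tag == "N":
                n2 = wd                  # first noun right of p
                break
        if n2 is not None:
            triples.append((n, w, n2))
    return triples

# conduct/NN of/IN lawyers/NNS in/IN jurisdictions/NNS …
toks = [("conduct", "N"), ("of", "P"), ("lawyers", "N"),
        ("in", "P"), ("jurisdictions", "N")]
print(extract_npn(toks))   # [('lawyers', 'in', 'jurisdictions')]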
Ratnaparkhi 1998: Probabilities
• What we have:
• Sets of (v,p) and (n,p). [doesn’t use n2]
• What we need:
• argmaxa P(v,n,p,a), where a is either N-attach or V-attach.
• Notice the probability has all three of v, n, and p, but the extracted data never have both v and n.
Ratnaparkhi 1998: Probabilities
• By the chain rule for probabilities*,
P(v,n,p,a) = P(v) · P(n|v) · P(a|v,n) ❶ · P(p|a,v,n) ❷
[The factors P(v) · P(n|v) have no influence on argmaxa.]
• For a = N-attach: [analogously for V-attach]
❶ P(a = N | v,n) ≈ P(a = N | n) = c(extracted n)/c(all n)
  — how often does this n have an attachment (to any p)? Why?
• But why this specific conditional chain? (We’ll see…)
*P(x1, …, xn) = P(x1) ⋅ P(x2|x1) ⋅ P(x3|x2, x1) ⋅ … ⋅ P(xn|xn–1, xn–2, …, x1)
Ratnaparkhi 1998: Probabilities
• For a = N-attach: [analogously for V-attach]
❷ P(p | a,v,n) ≈ P(p | n,a)
   = c(extracted n with p)/c(extracted n) for this n
   — how often, when this n has an attachment, is it to this p?
   OR, with “add-one”-style smoothing:
   = [c(extracted n with p) + proportion of all n’s with p] / [c(extracted n) + 1] for this n
Ratnaparkhi 1998: Backoffs
• When a count is zero, back off to uniform probabilities:
• P(a = N|n) = 0.5
• Proportion of all n’s with p = 1 / number of prepositions [and analogously for v].
Why?
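A sketch tying together the estimates, smoothing, and backoffs of the last two slides; the count-table names and the overall decision function are illustrative assumptions, not Ratnaparkhi's exact model:

def p_attach(head, c_att, c_all):
    """P(a | head) = c(extracted head) / c(all head); back off to 0.5."""
    if c_all.get(head, 0) == 0:
        return 0.5                      # zero count: uniform backoff
    return c_att.get(head, 0) / c_all[head]

def p_prep(head, p, c_hp, c_att, prior_p, n_preps):
    """Smoothed P(p | head, a), with a uniform backoff for the prior."""
    prior = prior_p.get(p, 1.0 / n_preps)   # proportion of heads seen with p
    return (c_hp.get((head, p), 0) + prior) / (c_att.get(head, 0) + 1)

def decide(v, n, p, t):
    """Pick argmax over a of P(a|·) · P(p|·, a), per the slides."""
    score_n = (p_attach(n, t["c_n_att"], t["c_n"])
               * p_prep(n, p, t["c_np"], t["c_n_att"], t["prior_np"], t["n_preps"]))
    score_v = (p_attach(v, t["c_v_att"], t["c_v"])
               * p_prep(v, p, t["c_vp"], t["c_v_att"], t["prior_vp"], t["n_preps"]))
    return "N-attach" if score_n >= score_v else "V-attach"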
Ratnaparkhi 1998: Results 1
• Training: 900K automatically annotated tuples
Testing: 3K tuples
Results: 82% accuracy (Baseline: 70%)
• Remember: no parse trees
• And it has another feature…
Ratnaparkhi 1998: Results 2
• The rise num to num problem:
• “profits rose 46% to $52 million”
• num to num is more frequent than rise to num
• So why is V-attach correctly preferred?
• P(a = N|rise,num) is lower than for a = V.
• Because there are more occurrences of a p attached to rise than to num.
• P(to|a = N,rise,num) is lower than for a = V.
• Because the proportion of all attachments to num that are with to is lower than the proportion of all attachments to rise that are with to.
Ratnaparkhi 1998: Discussion
• Advantages: unsupervised; portable (also applied to Spanish).
• Disadvantages: very problem-specific.
• Importance to CL:
• Using large amounts of unlabelled data and minimal linguistic tools/knowledge for attachment resolution.
• Clever formulation of probability to match available info.
Evaluating corpus-based methods 1
Questions to consider in evaluation:
• What are the required resources?
• How is the corpus annotated?
• What information is extracted and how?
• How much data is needed?
• What is the information learned?
• Statistics or rules?
• Binary preference or strength of preference?
Evaluating corpus-based methods 2
• What is the size of the test set?
• How good is the performance?
• Absolute performance?
• Reduction in error rate relative to a baseline?