Computational Linguistics
CSC 2501 / 485, Fall 2018

6. Statistical resolution of PP attachment ambiguities

Gerald Penn
Department of Computer Science, University of Toronto

Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.
Statistical PP attachment methods
• A classification problem.
• Input: verb, noun1, preposition, noun2
Output: V-attach or N-attach
• Example:
examined the raw materials with the optical microscope
• Does not cover all PP problems.
[Figure: the example labelled word-by-word as v, n1, p, n2; some elements possibly omitted.]
Hindle & Rooth 1993: Input 1
• Corpus: Partially parsed news text.
  – Automatic.
  – Many attachment decisions punted.
  – A collection of parse fragments for each sentence.

Hindle, Donald and Rooth, Mats. Structural ambiguity and lexical relations. Computational Linguistics, 19(1), 1993, 103–120.
The radical changes in export and customs regulations evidently are aimed at
remedying an extreme shortage of consumer goods in the Soviet Union and
assuaging citizens angry over the scarcity of such basic items as soap and windshield
wipers.
From Hindle & Rooth 1993
Hindle & Rooth 1993: Input 2
• Data: [v, n, p] triples; v or p may be NULL; v may be –.
• Triples extracted from the sentence above:

   v        n         p
   –        change    in
   aim      PRO       at
   remedy   shortage  of
   NULL     good      in
   assuage  citizen   NULL
   NULL     scarcity  of

From Hindle & Rooth 1993
Hindle & Rooth 1993: Algorithm 1
• Idea: Compute lexical associations (LAs) between p and each of v, n.
  — Is the p more associated with the v or with the n?
• Learn a way to compute LA for each [v, n, p] triple.
• Use it to map from [v, n, p] to {V-attach, N-attach}.
Hindle & Rooth 1993: Algorithm 2
Method: Bootstrapping.
1. Label unambiguous cases as N- or V-attach:
   when v or p is NULL, n is a pronoun, or p is of.
2. Iterate (until nothing changes):
   a. Compute the lexical association score for each triple from the data labelled so far.
   b. Label the attachment of any new triple whose score is over threshold.
3. Deal with “leftovers” (random assignment).
Test cases: Compute the LA score (or fail).
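A self-contained sketch of this bootstrapping loop in Python; the pronoun list, the threshold value, and the handling of NULL-p triples are illustrative simplifications, not the paper's exact procedure.

```python
import random
from collections import Counter
from math import log2

PRONOUNS = {"it", "them", "him", "her", "us"}   # illustrative subset

def initial_label(v, n, p):
    """Step 1: label only the unambiguous cases."""
    if v == "NULL" or p == "of":
        return "N"        # no competing verb, or near-certain 'of'
    if n in PRONOUNS:
        return "V"        # pronouns don't take PP modifiers
    return None           # ambiguous: leave to the bootstrap loop

def la_score(v, n, p, cvp, cv, cnp, cn):
    """LA(v,n,p) = log2 of P(V-attach p|v)*P(NULL|n) / P(N-attach p|n)."""
    num = (cvp[v, p] / cv[v]) * (cnp[n, "NULL"] / cn[n])
    den = cnp[n, p] / cn[n]
    return log2(num / den) if num and den else 0.0

def bootstrap(triples, threshold=2.0):
    labels = {i: a for i, (v, n, p) in enumerate(triples)
              if (a := initial_label(v, n, p))}
    changed = True
    while changed:                                # step 2: iterate to fixpoint
        changed = False
        cvp, cv, cnp, cn = Counter(), Counter(), Counter(), Counter()
        for i, (v, n, p) in enumerate(triples):   # counts from labelled data
            cv[v] += 1
            cn[n] += 1
            if labels.get(i) == "V":
                cvp[v, p] += 1
            elif labels.get(i) == "N" or p == "NULL":
                cnp[n, p] += 1                    # includes c(n, NULL)
        for i, (v, n, p) in enumerate(triples):
            if i not in labels and p != "NULL":
                s = la_score(v, n, p, cvp, cv, cnp, cn)
                if abs(s) > threshold:            # step 2b: confident enough?
                    labels[i] = "V" if s > 0 else "N"
                    changed = True
    for i, (v, n, p) in enumerate(triples):       # step 3: random leftovers
        if i not in labels and p != "NULL":
            labels[i] = random.choice(["V", "N"])
    return labels
```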
Hindle & Rooth 1993: Algorithm 3
• Lexical association score: log-likelihood ratio of verb- and noun-attachment:
  LA(v, n, p) = log2 [ P(V-attach p | v, n) / P(N-attach p | v, n) ]
• Can’t get these probabilities directly — data are too sparse.
• So estimate them from the data that we can get.
Hindle & Rooth 1993: Algorithm 4
• Lexical association score: log-likelihood ratio of verb- and noun-attachment:
  LA(v, n, p) = log2 [ P(V-attach p | v, n) ❶ / P(N-attach p | v, n) ❷ ]
• Why a ratio of probabilities? Why the log of the ratio?
• What are these probabilities “saying”? Based on frequency counts c in the labelled data:
  ❶ ≈ P(V-attach p | v) ● P(NULL | n)
  ❷ ≈ P(N-attach p | n)
Hindle & Rooth 1993: Example 1
Moscow sent more than 100,000 soldiers into Afghanistan …
Choose between:
  V-attach: [VP send [NP … soldier NULL] [PP into …]]
  N-attach: [VP send [NP … soldier [PP into …]] …]
Hindle & Rooth 1993: Example 2
❶ P(V-attach into | send, soldier) ≈ P(V-attach into | send) ● P(NULL | soldier)
❷ P(N-attach into | send, soldier) ≈ P(N-attach into | soldier)
Putting in the numbers:
  P(V-attach into | send) ≈ c(send, into) / c(send) = .049
  P(NULL | soldier) ≈ c(soldier, NULL) / c(soldier) = .800
  P(N-attach into | soldier) ≈ c(soldier, into) / c(soldier) = .0007

LA(send, soldier, into) = log2(.049 × .800 / .0007) ≈ 5.81
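The same arithmetic as runnable Python; the raw counts are invented solely to reproduce the ratios above.

```python
from math import log2

# Invented counts chosen only to reproduce the ratios above.
p_v_attach = 49 / 1000      # c(send, into) / c(send)       = .049
p_null     = 8000 / 10000   # c(soldier, NULL) / c(soldier) = .800
p_n_attach = 7 / 10000      # c(soldier, into) / c(soldier) = .0007

la = log2(p_v_attach * p_null / p_n_attach)
print(round(la, 2))         # 5.81 > 0, so V-attach is strongly preferred
```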
Hindle & Rooth 1993: Results
• Training: 223K triples
  Testing: 1K triples
  Results: 80% accuracy
  (Baselines: 66% by noun attachment; 88% by humans.)
Hindle & Rooth 1993: Discussion
• Advantages: Unsupervised; gives degree of preference.
• Disadvantages: Needs lots of partially parsed data. Other words don’t get a vote.
• Importance to CL:
  • Use of large amounts of unlabelled data, with clever application of linguistic knowledge, to learn useful statistics.
Brill & Resnik 1994: Method
• Corpus-based, non-statistical method.
• Transformation-based learning: learns a sequence of rules to apply to each input item.
• Form of transformation rules:
  • Flip attachment decision (from V to N1 or vice versa) if {v, n1, p, n2} is w1 [and {v, n1, p, n2} is w2].
  (The input is a quad: it uses the head noun of the PP, n2, too. The bracketed second test is an optional conjunct.)
• All rules apply, in the order in which they are learned.
Brill & Resnik 1994: Method
Unlabelled text [attachments not assigned]
  → Initial-state labeller
  → Labelled text [attachments assigned, but maybe not correctly]
  → Learner
  → Transformations [ordered list of rules to apply to new data]
The learner uses diffs between the truth [attachments all correctly labelled] and the labelled text to select each new rule, then applies it.
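A self-contained sketch of that learner loop, greedily picking the rule with the largest net error reduction on each round; the rule templates and their encoding are illustrative, not Brill & Resnik's implementation.

```python
# Transformation-based learning sketch. Each item is a quad (v, n1, p, n2);
# truth[i] is its correct attachment, "V" or "N1".

def learn_transformations(quads, truth, max_rules=20):
    labels = ["N1"] * len(quads)           # initial-state labeller: all N1
    rules = []
    for _ in range(max_rules):
        best, best_gain = None, 0
        # Candidate rules: flip the label when one slot equals some word
        # seen in the data (slots 0..3 are v, n1, p, n2).
        for slot in range(4):
            for word in {q[slot] for q in quads}:
                for frm, to in (("N1", "V"), ("V", "N1")):
                    gain = sum((truth[j] == to) - (truth[j] == frm)
                               for j, q in enumerate(quads)
                               if labels[j] == frm and q[slot] == word)
                    if gain > best_gain:
                        best, best_gain = (frm, to, slot, word), gain
        if best is None:                   # no rule reduces errors: stop
            break
        frm, to, slot, word = best
        for j, q in enumerate(quads):      # apply the chosen rule everywhere
            if labels[j] == frm and q[slot] == word:
                labels[j] = to
        rules.append(best)
    return rules                           # ordered list of transformations
```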
Brill & Resnik 1994: Example
Some rules learned:
Start by assuming N1 attachment, and then change
attachment …
1. from N1 to V if p is at.
2. from N1 to V if p is as.
⋮
6. from N1 to V if n2 is year.
8. from N1 to V if p is in and n1 is amount.
⋮
15. from N1 to V if v is have and p is in.
17. from V to N1 if p is of.
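At application time the learned rules fire in order on each new quad. A minimal sketch using the rules above; the dict encoding is just for illustration.

```python
# Ordered rules from this slide, encoded as (from, to, tests).
RULES = [
    ("N1", "V", {"p": "at"}),                   # rule 1
    ("N1", "V", {"p": "as"}),                   # rule 2
    ("N1", "V", {"n2": "year"}),                # rule 6
    ("N1", "V", {"p": "in", "n1": "amount"}),   # rule 8
    ("N1", "V", {"v": "have", "p": "in"}),      # rule 15
    ("V", "N1", {"p": "of"}),                   # rule 17
]

def attach(quad):
    """quad maps 'v', 'n1', 'p', 'n2' to words; returns 'V' or 'N1'."""
    label = "N1"              # initial-state annotation: assume N1
    for frm, to, tests in RULES:
        if label == frm and all(quad.get(k) == w for k, w in tests.items()):
            label = to
    return label

print(attach({"v": "have", "n1": "stake", "p": "in", "n2": "firm"}))  # V
```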
Brill & Resnik 1994: Results
• Training: 12K annotated quads
  Testing: 500 quads
  Results: 80% accuracy
  (Baseline: 64% by noun attachment)
Brill & Resnik 1994: Discussion
• Advantages: Readable rules (though they may be hard to interpret); can build in bias in the initial annotation; small number of rules.
• Disadvantages: Supervised; no strength of preference. Very memory-intensive.
• Importance to CL:
  • Successful general method for non-statistical learning from an annotated corpus.
  • Based on a popular (and relatively easily modified) tagger.
Ratnaparkhi 1998: Introduction
• Using large amounts of cheap, noisy data in an unsupervised setting.
• Corpus processing:
  • PoS tagged.
  • “Chunked” using simple regular expressions.
• “Unambiguous” attachment data:
  • Based on errorful heuristics (cf. Hindle & Rooth).
  • Quantity versus quality of data.
Ratnaparkhi 1998: Outline
Raw text
  → Tagger → PoS-tagged text
  → Chunker → Tagged text with NPs replaced by head nouns
  → Extractor → “Unambiguous” triples (n, p, n2) and (v, p, n2)
  → Morph processor → Final triples with words replaced by base forms

Example:
  Raw:        The professional conduct of lawyers in other jurisdictions …
  Tagged:     The/DT professional/JJ conduct/NN of/IN lawyers/NNS in/IN other/JJ jurisdictions/NNS …
  Chunked:    conduct/NN of/IN lawyers/NNS in/IN jurisdictions/NNS …
  Extracted:  (n = lawyers, p = in, n2 = jurisdictions) …
  Base forms: (n = lawyer, p = in, n2 = jurisdiction) …
Ratnaparkhi 1998: Unambiguous triples
• Extract (n, p, n2) as “unambiguous” if p ≠ of and:
  – n is the first noun within k words left of p; and
  – no verb occurs within k words left of p; and
  – n2 is the first noun within k words right of p; and
  – no verb occurs between p and n2.
• Extract (v, p, n2) as “unambiguous” if p ≠ of and:
  – v (≠ be) is the first verb within k words left of p; and
  – no noun intervenes between v and p; and
  – n2 is the first noun within k words right of p; and
  – no verb occurs between p and n2.
• Why are the “unambiguous” data only 69% correct?
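A sketch of the (n, p, n2) half of this heuristic over PoS-tagged, chunked text; the window size k, the tag tests, and the function names are illustrative assumptions.

```python
K = 5  # illustrative window size

def is_noun(tag):
    return tag.startswith("NN")

def is_verb(tag):
    return tag.startswith("VB")

def extract_noun_triples(sent):
    """sent is a list of (word, PoS-tag) pairs after chunking."""
    triples = []
    for i, (word, tag) in enumerate(sent):
        if tag != "IN" or word == "of":         # a preposition, but not 'of'
            continue
        left = sent[max(0, i - K):i]
        if any(is_verb(t) for _, t in left):    # no verb near p on the left
            continue
        nouns = [w for w, t in left if is_noun(t)]
        if not nouns:
            continue
        n = nouns[-1]                           # first noun left of p
        n2 = None
        for w, t in sent[i + 1:i + 1 + K]:      # scan right of p
            if is_verb(t):
                break                           # verb before n2: reject
            if is_noun(t):
                n2 = w
                break
        if n2:
            triples.append((n, word, n2))
    return triples

sent = [("conduct", "NN"), ("of", "IN"), ("lawyers", "NNS"),
        ("in", "IN"), ("jurisdictions", "NNS")]
print(extract_noun_triples(sent))   # [('lawyers', 'in', 'jurisdictions')]
```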
Ratnaparkhi 1998: Probabilities
• What we have:
  • Sets of (v, p) and (n, p). [doesn’t use n2]
• What we need:
  • argmaxa P(v, n, p, a), where a is either N- or V-attach.
• Notice the probability has all three of v, n, and p, but the extracted data never have both v and n.
• By the chain rule for probabilities*,
  P(v, n, p, a) = P(v) ● P(n|v) ● P(a|v, n) ❶ ● P(p|a, v, n) ❷
  P(v) and P(n|v) have no influence on argmaxa.
• For a = N-attach: [analogously for V-attach]
  ❶ P(a = N|v, n) ≈ P(a = N|n)   [how often does this n have an attachment (to any p)?]
    = c(extracted n) / c(all n)
• But why this specific conditional chain? (We’ll see…)

* P(x1, …, xn) = P(x1) ⋅ P(x2|x1) ⋅ P(x3|x2, x1) ⋅ … ⋅ P(xn|xn–1, xn–2, …, x1)
Ratnaparkhi 1998: Probabilities (cont’d)
• For a = N-attach: [analogously for V-attach]
  ❷ P(p|a, v, n) ≈ P(p|n, a)   [how often, when this n has an attachment, is it to this p?]
    = c(extracted n with p) / c(extracted n)   for this n
  OR, with “add-one” smoothing:
    = [c(extracted n with p) + proportion of all n’s with p] / [c(extracted n) + 1]   for this n
Ratnaparkhi 1998: Backoffs
• When a count is zero, back off to uniform probabilities:
  • P(a = N|n) = 0.5
  • Proportion of all n’s with p = 1 / number of prepositions
  [and analogously for v]. Why?
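Putting the estimates, smoothing, and backoffs together, a sketch of the resulting decision rule; the table names and layout are assumptions, not Ratnaparkhi's implementation.

```python
from collections import Counter

N_PREPS = 50   # illustrative number of distinct prepositions

def p_attach(w, c_attached, c_all):
    """P(a|w): how often this word has an attached p at all."""
    if c_all[w] == 0:
        return 0.5                      # backoff: uniform over {N, V}
    return c_attached[w] / c_all[w]

def p_prep(w, p, c_with_p, c_attached, prior_p):
    """P(p|w, a), with "add-one" smoothing toward the global prior."""
    if c_attached[w] == 0:
        prior_p = 1 / N_PREPS           # backoff: uniform over prepositions
    return (c_with_p[w, p] + prior_p) / (c_attached[w] + 1)

def decide(v, n, p, noun, verb, priors):
    """noun and verb are (c_with_p, c_attached, c_all) Counter triples
    from the "unambiguous" data; priors[p] is the proportion of all
    nouns/verbs attaching p. Returns the argmax over a."""
    prior = priors.get(p, 1 / N_PREPS)
    score_n = p_attach(n, noun[1], noun[2]) * p_prep(n, p, noun[0], noun[1], prior)
    score_v = p_attach(v, verb[1], verb[2]) * p_prep(v, p, verb[0], verb[1], prior)
    return "N" if score_n >= score_v else "V"
```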
Ratnaparkhi 1998: Results 1
• Training: 900K automatically annotated tuples
Testing: 3K tuples
Results: 82% accuracy
(Baseline: 70%)
• Remember: no parse trees
• And it has another feature…
Ratnaparkhi 1998: Results 2
• The rise num to num problem:
  • “profits rose 46% to $52 million”
  • num to num is more frequent than rise to num.
• So why is V-attach correctly preferred?
  • P(a = N|rise, num) is lower than for a = V, because there are more occurrences of a p attached to rise than to num.
  • P(to|a = N, rise, num) is lower than for a = V, because the proportion of all attachments to num that are with to is lower than the proportion of all attachments to rise that are with to.
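The two inequalities can be made concrete with numbers; the values below are invented purely for illustration, not corpus estimates.

```python
# Invented, purely illustrative probabilities for "rose 46% to $52 million".
p_attach = {"N": 0.10,    # P(a = N | num):  num rarely heads an attachment
            "V": 0.80}    # P(a = V | rise): rise very often takes a PP
p_to     = {"N": 0.30,    # P(to | a = N, num)
            "V": 0.60}    # P(to | a = V, rise)

scores = {a: p_attach[a] * p_to[a] for a in ("N", "V")}
print(max(scores, key=scores.get))   # V: both factors favour V-attach
```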
Ratnaparkhi 1998: Discussion
• Advantages: Unsupervised; portable (also Spanish).
• Disadvantages: Very problem-specific.
• Importance to CL:
  • Using large amounts of unlabelled data and minimal linguistic tools/knowledge for attachment resolution.
  • Clever formulation of the probability to match the available information.
Evaluating corpus-based methods 1
Questions to consider in evaluation:
• What are the required resources?
  • How is the corpus annotated?
  • What information is extracted, and how?
  • How much data is needed?
• What is the information learned?
  • Statistics or rules?
  • Binary preference or strength of preference?
Evaluating corpus-based methods 2
• What is the size of the test set?
• How good is the performance?
  • Absolute performance?
  • Reduction in error rate relative to a baseline?