
Computational Linguistics
CSC 2501 / 485, Fall 2018

6. Statistical resolution of PP attachment ambiguities

Gerald Penn
Department of Computer Science, University of Toronto

Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.

Statistical PP attachment methods

• A classification problem.

• Input: verb, noun1, preposition, noun2
  Output: V-attach or N-attach

• Example:

  examined the raw materials with the optical microscope

  (Here v = examined, n1 = materials, p = with, n2 = microscope; one or more of these may be omitted in a given instance.)

• Does not cover all PP attachment problems.

Hindle & Rooth 1993: Input 1

• Corpus: Partially parsed news text.
  – Automatic.
  – Many attachment decisions punted.
  – A collection of parse fragments for each sentence.

Hindle, Donald and Rooth, Mats. Structural ambiguity and lexical relations. Computational Linguistics, 19(1), 1993, 103–120.

The radical changes in export and customs regulations evidently are aimed at remedying an extreme shortage of consumer goods in the Soviet Union and assuaging citizens angry over the scarcity of such basic items as soap and windshield wipers.

(From Hindle & Rooth 1993)

Hindle & Rooth 1993: Input 2

• Data: [v,n,p] triples; v or p may be NULL; v may be – (as for the subject NP change below).

• The triples extracted from the sentence above (from Hindle & Rooth 1993):

  v        n         p
  –        change    in
  aim      PRO       at
  remedy   shortage  of
  NULL     good      in
  assuage  citizen   NULL
  NULL     scarcity  of


Hindle & Rooth 1993: Algorithm 1

• Idea: Compute lexical associations (LAs) between p and each of v, n.
  – Is the p more associated with the v or with the n?

• Learn a way to compute LA for each [v,n,p] triple.

• Use it to map from [v,n,p] to {V-attach, N-attach}.

Hindle & Rooth 1993: Algorithm 2

Method: Bootstrapping.

1. Label the unambiguous cases as N- or V-attach: when v or p is NULL, n is a pronoun, or p is of.

2. Iterate (until nothing changes):
   a. Compute the lexical association score for each triple from the data labelled so far.
   b. Label the attachment of any new triples whose score is over a threshold.

3. Deal with “leftovers” (random assignment).

Test cases: Compute the LA score (or fail). A sketch of this loop appears below.
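To make the bootstrapping loop concrete, here is a minimal sketch in Python. It assumes `triples` is a list of (v, n, p) tuples, with None for a missing v or p and the string "PRO" marking pronoun objects; the threshold value, the simplified LA estimate (which omits the P(NULL | n) machinery of the later slides), and the default for leftovers are illustrative choices, not Hindle & Rooth's exact procedure.

    import math
    from collections import Counter

    THRESHOLD = 2.0  # hypothetical score threshold, in bits

    def initial_label(v, n, p):
        """Step 1: label only the unambiguous triples."""
        if p is None:                # no PP at all: the NULL-attachment case
            return "NULL"
        if v is None or p == "of":   # no verb, or p = 'of': noun attachment
            return "N"
        if n == "PRO":               # pronoun object: verb attachment
            return "V"
        return None                  # ambiguous; left to the bootstrap

    def la_score(v, n, p, cv, cn, cv_tot, cn_tot):
        """Simplified LA: log2 of the ratio of V- to N-attachment estimates."""
        p_v = cv[(v, p)] / cv_tot[v] if cv_tot[v] else 0.0
        p_n = cn[(n, p)] / cn_tot[n] if cn_tot[n] else 0.0
        if p_v == 0.0 or p_n == 0.0:
            return None              # fail: no evidence on one side
        return math.log2(p_v / p_n)

    def bootstrap(triples):
        labels = [initial_label(*t) for t in triples]
        changed = True
        while changed:               # step 2: iterate until nothing changes
            changed = False
            cv, cn, cv_tot, cn_tot = Counter(), Counter(), Counter(), Counter()
            for (v, n, p), lab in zip(triples, labels):
                if lab == "V":
                    cv[(v, p)] += 1; cv_tot[v] += 1
                elif lab == "N":
                    cn[(n, p)] += 1; cn_tot[n] += 1
            for i, ((v, n, p), lab) in enumerate(zip(triples, labels)):
                if lab is not None:
                    continue
                s = la_score(v, n, p, cv, cn, cv_tot, cn_tot)
                if s is not None and abs(s) > THRESHOLD:
                    labels[i] = "V" if s > 0 else "N"
                    changed = True
        # Step 3: leftovers (the paper assigns these randomly; we default to N)
        return ["N" if lab is None else lab for lab in labels]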

Hindle & Rooth 1993: Algorithm 3

• Lexical association score: the log-likelihood ratio of verb- and noun-attachment.

  LA(v,n,p) = log2 [ P(V-attach p | v,n) / P(N-attach p | v,n) ]

• Can’t get these probabilities directly; the data are too sparse.

• So estimate them from the data that we can get.

Hindle & Rooth 1993: Algorithm 4

• Lexical association score: the log-likelihood ratio of verb- and noun-attachment.

  LA(v,n,p) = log2 [ P(V-attach p | v,n) / P(N-attach p | v,n) ]

• Why a ratio of probabilities? Why the log of the ratio? What are these probabilities “saying”?

• The estimates are based on frequency counts c in the labelled data:

  ❶ P(V-attach p | v,n) ≈ P(V-attach p | v) · P(NULL | n)
  ❷ P(N-attach p | v,n) ≈ P(N-attach p | n)

Hindle & Rooth 1993: Example 1

Moscow sent more than 100,000 soldiers into Afghanistan …

Choose between:
  V-attach: [VP send [NP … soldier NULL] [PP into …]]
  N-attach: [VP send [NP … soldier [PP into …]] …]

Hindle & Rooth 1993: Example 2

❶ P(V-attach into | send, soldier)
   ≈ P(V-attach into | send) · P(NULL | soldier)

❷ P(N-attach into | send, soldier)
   ≈ P(N-attach into | soldier)

The component estimates, from counts c in the labelled data:

  P(V-attach into | send) ≈ c(send, into) / c(send) = .049
  P(NULL | soldier) ≈ c(soldier, NULL) / c(soldier) = .800
  P(N-attach into | soldier) ≈ c(soldier, into) / c(soldier) = .0007

  LA(send, soldier, into) = log2(.049 × .800 / .0007) ≈ 5.81

  LA > 0, so V-attach is chosen.
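As a quick check of that arithmetic in Python (the counts are the ones given above):

    import math

    p_v_attach = 0.049   # c(send, into) / c(send)
    p_null     = 0.800   # c(soldier, NULL) / c(soldier)
    p_n_attach = 0.0007  # c(soldier, into) / c(soldier)

    la = math.log2(p_v_attach * p_null / p_n_attach)
    print(round(la, 2))  # 5.81: positive, so V-attach is preferred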

Hindle & Rooth 1993: Results

• Training: 223K triples
  Testing: 1K triples
  Results: 80% accuracy
  (Baselines: 66% by always choosing noun attachment; 88% by humans.)

Hindle & Rooth 1993: Discussion

• Advantages: Unsupervised; gives a degree of preference.

• Disadvantages: Needs lots of partially parsed data. Other words don’t get a vote.

• Importance to CL:
  • Use of large amounts of unlabelled data, with clever application of linguistic knowledge, to learn useful statistics.

Brill & Resnik 1994: Method

• Corpus-based, non-statistical method.

• Transformation-based learning: learns a sequence of rules to apply to each input item.

• Input is a quad {v, n1, p, n2}: unlike Hindle & Rooth, the head noun n2 of the PP is used too.

• Form of transformation rules:
  • Flip the attachment decision (from V to N1 or vice versa) if {v,n1,p,n2} is w1 [and {v,n1,p,n2} is w2], where the bracketed second test is an optional conjunct.

• All rules apply, in the order in which they are learned.

Brill & Resnik 1994: Method

The training loop:

• Unlabelled text [attachments not assigned] goes to an initial-state labeller, which produces labelled text [attachments assigned, but maybe not correctly].

• The learner uses the diffs between the truth [attachments all correctly labelled] and the labelled text to select a new rule, then applies it, and repeats.

• The output is the transformations: an ordered list of rules to apply to new data.

A sketch of this loop appears below.
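Here is a minimal sketch of that learn-and-apply loop in Python. It assumes each training item is a (quad, gold) pair, where quad is a dict with keys "v", "n1", "p", "n2"; for brevity it proposes only single-slot rules (no optional second conjunct), and all names are illustrative rather than Brill & Resnik's implementation.

    SLOTS = ("v", "n1", "p", "n2")

    def candidate_rules(data, labels):
        """Propose rules (from, to, slot, word) that would fix some
        currently mislabelled item."""
        for (quad, gold), lab in zip(data, labels):
            if lab != gold:
                for slot in SLOTS:
                    yield (lab, gold, slot, quad[slot])

    def apply_rule(rule, quad, lab):
        """Flip the label when the tested slot holds the given word."""
        frm, to, slot, word = rule
        return to if lab == frm and quad[slot] == word else lab

    def tbl_learn(data, max_rules=20):
        """Greedy transformation-based learning over (quad, gold) pairs."""
        labels = ["N1"] * len(data)      # initial-state labeller: assume N1
        learned = []
        for _ in range(max_rules):
            best, best_gain = None, 0
            for rule in set(candidate_rules(data, labels)):
                new = [apply_rule(rule, quad, lab)
                       for (quad, _), lab in zip(data, labels)]
                gain = (sum(nl == gold for (_, gold), nl in zip(data, new))
                        - sum(lab == gold for (_, gold), lab in zip(data, labels)))
                if gain > best_gain:
                    best, best_gain = rule, gain
            if best is None:             # no rule reduces the error: stop
                break
            labels = [apply_rule(best, quad, lab)
                      for (quad, _), lab in zip(data, labels)]
            learned.append(best)         # rules are kept in the order learned
        return learned

On new data, the learned rules are applied in the order learned, starting from the N1 default (see the example rules on the next slide).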

Brill & Resnik 1994: Example

Some rules learned (start by assuming N1 attachment, and then change the attachment) …

1. from N1 to V if p is at.
2. from N1 to V if p is as.
6. from N1 to V if n2 is year.
8. from N1 to V if p is in and n1 is amount.
15. from N1 to V if v is have and p is in.
17. from V to N1 if p is of.
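For illustration, here is how some of the single-slot rules above would be applied in order to a new quad (a toy sketch reusing the representation from the earlier code; the two-test rules 8 and 15 are omitted for brevity):

    # Rules as (from, to, slot, word), in the order learned.
    RULES = [
        ("N1", "V", "p", "at"),     # rule 1
        ("N1", "V", "p", "as"),     # rule 2
        ("N1", "V", "n2", "year"),  # rule 6
        ("V", "N1", "p", "of"),     # rule 17
    ]

    def classify(quad):
        lab = "N1"                            # initial state: assume N1-attach
        for frm, to, slot, word in RULES:     # every rule applies, in order
            if lab == frm and quad[slot] == word:
                lab = to
        return lab

    print(classify({"v": "open", "n1": "store", "p": "at", "n2": "noon"}))  # V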

Brill & Resnik 1994: Results

• Training: 12K annotated quads
  Testing: 500 quads
  Results: 80% accuracy
  (Baseline: 64% by always choosing noun attachment)

Brill & Resnik 1994: Discussion

• Advantages: Readable rules (though reading them may still be hard); a bias can be built into the initial annotation; small number of rules.

• Disadvantages: Supervised; no strength of preference. Very memory-intensive.

• Importance to CL:
  • A successful general method for non-statistical learning from an annotated corpus.
  • Based on a popular (and relatively easily modified) tagger.

Ratnaparkhi 1998: Introduction

• Uses large amounts of cheap, noisy data in an unsupervised setting.

• Corpus processing:
  • PoS tagged.
  • “Chunked” using simple regular expressions.

• “Unambiguous” attachment data:
  • Based on errorful heuristics (cf. Hindle & Rooth).
  • Quantity versus quality of data.

Ratnaparkhi 1998: Outline

The pipeline:

  Raw text
  → Tagger → PoS-tagged text
  → Chunker → tagged text with NPs replaced by their head nouns
  → Extractor → “unambiguous” triples (n,p,n2) and (v,p,n2)
  → Morph processor → final triples with words replaced by their base forms

The same example at each stage:

  The professional conduct of lawyers in other jurisdictions …

  The/DT professional/JJ conduct/NN of/IN lawyers/NNS in/IN other/JJ jurisdictions/NNS …

  conduct/NN of/IN lawyers/NNS in/IN jurisdictions/NNS …

  (n = lawyers, p = in, n2 = jurisdictions) …

  (n = lawyer, p = in, n2 = jurisdiction) …

Ratnaparkhi 1998: Unambiguous triples

• Extract (n,p,n2) as “unambiguous” if p ≠ of and:
  – n is the first noun within k words left of p; and
  – no verb occurs within k words left of p; and
  – n2 is the first noun within k words right of p; and
  – no verb occurs between p and n2.

• Extract (v,p,n2) as “unambiguous” if p ≠ of and:
  – v (≠ be) is the first verb within k words left of p; and
  – no noun intervenes between v and p; and
  – n2 is the first noun within k words right of p; and
  – no verb occurs between p and n2.

• Why are the “unambiguous” data only 69% correct?

A sketch of these heuristics appears below.
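To make the heuristics concrete, here is a minimal sketch in Python. It assumes chunked input as a list of (word, tag) pairs with tags collapsed to "N" (head noun), "V" (verb), and "P" (preposition); the window size K and all names are illustrative, not Ratnaparkhi's code.

    K = 5  # hypothetical window size k

    def extract_unambiguous(tokens):
        """Yield ('n'|'v', head, p, n2) tuples matching the heuristics above."""
        for i, (word, tag) in enumerate(tokens):
            if tag != "P" or word == "of":
                continue
            left = tokens[max(0, i - K):i]
            right = tokens[i + 1:i + 1 + K]
            # n2: the first noun right of p, with no verb before it
            n2 = None
            for w, t in right:
                if t == "V":
                    break
                if t == "N":
                    n2 = w
                    break
            if n2 is None:
                continue
            # (n,p,n2): the first noun left of p, with no verb in the window
            if not any(t == "V" for _, t in left):
                nouns = [w for w, t in left if t == "N"]
                if nouns:
                    yield ("n", nouns[-1], word, n2)
                    continue
            # (v,p,n2): the first verb (not 'be') left of p, no noun between
            for w, t in reversed(left):
                if t == "N":
                    break                # a noun intervenes: not unambiguous
                if t == "V":
                    if w != "be":        # the nearest verb must not be 'be'
                        yield ("v", w, word, n2)
                    break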

Ratnaparkhi 1998: Probabilities

• What we have:
  • Sets of (v,p) and (n,p). [doesn’t use n2]

• What we need:
  • argmax_a P(v,n,p,a), where a is either N- or V-attach.

• Notice that the probability involves all three of v, n, and p, but the extracted data never have both v and n.

Ratnaparkhi 1998: Probabilities

• By the chain rule for probabilities*,

  P(v,n,p,a) = P(v) · P(n|v) · P(a|v,n) · P(p|a,v,n)

  The first two factors have no influence on argmax_a; only the last two, ❶ and ❷, matter.

• But why this specific conditional chain? (We’ll see…)

• For a = N-attach [analogously for V-attach]:

  ❶ P(a = N | v,n) ≈ P(a = N | n) = c(extracted n) / c(all n)

  Why? This asks: how often does this n have an attachment (to any p)?

*P(x1, …, xn) = P(x1) · P(x2|x1) · P(x3|x2,x1) · … · P(xn|xn–1, xn–2, …, x1)

Ratnaparkhi 1998: Probabilities

• For a = N-attach [analogously for V-attach]:

  ❷ P(p | a,v,n) ≈ P(p | n,a)
     = c(extracted n with p) / c(extracted n), for this n

  OR, with “add-one” smoothing:
     = [c(extracted n with p) + proportion of all n’s with p] / [c(extracted n) + 1], for this n

  This asks: how often, when this n has an attachment, is it to this p?

Ratnaparkhi 1998: Backoffs

• When a count is zero, back off to uniform probabilities:
  • P(a = N | n) = 0.5
  • Proportion of all n’s with p = 1 / number of prepositions
  [and analogously for v].

• Why?

A sketch of these estimates, with the backoffs, appears below.
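Putting the two estimates and the backoffs together, here is a minimal sketch of the resulting decision rule in Python. The counter names (`v_extr`, `n_all`, etc.) and the way the counts are packaged are illustrative, not Ratnaparkhi's code; `c_extracted[head]` counts how often the head appears with an extracted attachment, and `c_all[head]` how often it appears at all.

    def p_attach(head, c_extracted, c_all):
        """P(a | head): how often this head has an extracted attachment at
        all; backs off to the uniform 0.5 when the head was never seen."""
        if not c_all.get(head):
            return 0.5
        return c_extracted.get(head, 0) / c_all[head]

    def p_prep(head, p, c_with_p, c_extracted, prior, n_preps):
        """P(p | head, a) with add-one smoothing; the prior proportion of
        attachments using p backs off to 1 / number of prepositions."""
        pr = prior.get(p, 1.0 / n_preps)
        return (c_with_p.get((head, p), 0) + pr) / (c_extracted.get(head, 0) + 1)

    def decide(v, n, p, c):
        """argmax over a of P(a | head) * P(p | head, a); the P(v) * P(n|v)
        factors are dropped because they do not affect the argmax."""
        score_v = (p_attach(v, c["v_extr"], c["v_all"])
                   * p_prep(v, p, c["vp"], c["v_extr"], c["p_prior_v"], c["n_preps"]))
        score_n = (p_attach(n, c["n_extr"], c["n_all"])
                   * p_prep(n, p, c["np"], c["n_extr"], c["p_prior_n"], c["n_preps"]))
        return "V" if score_v > score_n else "N"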

Ratnaparkhi 1998: Results 1

• Training: 900K automatically annotated tuples
  Testing: 3K tuples
  Results: 82% accuracy
  (Baseline: 70%)

• Remember: no parse trees.

• And it has another feature…

Ratnaparkhi 1998: Results 2

• The rise num to num problem:
  • “profits rose 46% to $52 million”
  • num to num is more frequent than rise to num.

• So why is V-attach correctly preferred?

• P(a = N | rise, num) is lower than for a = V:
  • because there are more occurrences of a p attached to rise than to num.

• P(to | a = N, rise, num) is lower than for a = V:
  • because the proportion of all attachments to num that are with to is lower than the proportion of all attachments to rise that are with to.

(A numeric illustration follows.)
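To see that argument numerically, here is a toy computation with purely hypothetical counts (not Ratnaparkhi's data), chosen only so that num to num is more frequent than rise to num in raw counts:

    # Hypothetical counts, for illustration only.
    c_num_all,  c_num_extr,  c_num_to  = 100_000, 2_000, 1_000
    c_rise_all, c_rise_extr, c_rise_to = 1_000, 800, 600

    # c(num, to) = 1000 > c(rise, to) = 600, and yet:
    p_n = (c_num_extr / c_num_all) * (c_num_to / c_num_extr)      # 0.02 * 0.50 = 0.01
    p_v = (c_rise_extr / c_rise_all) * (c_rise_to / c_rise_extr)  # 0.80 * 0.75 = 0.60
    print("V-attach" if p_v > p_n else "N-attach")                # V-attach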

Ratnaparkhi 1998: Discussion

• Advantages: Unsupervised; portable (also applied to Spanish).

• Disadvantages: Very problem-specific.

• Importance to CL:
  • Uses large amounts of unlabelled data and minimal linguistic tools/knowledge for attachment resolution.
  • A clever formulation of the probability to match the available information.

Evaluating corpus-based methods 1

Questions to consider in evaluation:

• What are the required resources?
  • How is the corpus annotated?
  • What information is extracted, and how?
  • How much data is needed?

• What is the information learned?
  • Statistics or rules?
  • Binary preference or strength of preference?

Evaluating corpus-based methods 2

• What is the size of the test set?

• How good is the performance?
  • Absolute performance?
  • Reduction in error rate relative to a baseline?