
Computational Linguistics

Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.

9. Statistical parsing

Gerald Penn

Department of Computer Science, University of Toronto

CSC 2501 / 485

Fall 2018

Reading: Jurafsky & Martin: 5.2–5.5.2, 5.6, 12.4, 14.0–1, 14.3–4, 14.6–7. Bird et al.: 8.6.

2

• General idea:

• Assign probabilities to rules in a context-free
grammar.

• Use a likelihood model.

• Combine probabilities of rules in a tree.

• Yields likelihood of a parse.

• The best parse is the most likely one.

Statistical parsing 1

3

• Motivations:

• Uniform process for attachment decisions.

• Use lexical preferences in all decisions.

Statistical parsing 2

4

1. Assign a probability to each rule of grammar, including
lexical productions.

–Parse string of input words with probabilistic rules.
The can will rust.

2. Assign probabilities only to non-lexical productions.

–Probabilistically tag input words with syntactic
categories using a part-of-speech tagger.

–Consider the pre-terminal syntactic categories to be
terminals, parse that string with probabilistic rules.

Det N Modal Verb.

3. “Supertagging” – parsing as tagging with tree
fragments.

Two general approaches

5

• Part-of-speech (PoS) tagging:
Given a sequence of words w1 … wn (from
well-formed text), determine the syntactic
category (PoS) Ci of each word.

• I.e., the best category sequence C1 … Cn to assign to the word sequence w1 … wn.

Part-of-speech tagging 1

Part-of-speech tagging 2
• Example (from Charniak 1997): possible tags for each word:

      The    can           will          rust
      det    modal verb    modal verb    noun
             noun          noun          verb
             verb          verb

• Most likely tag sequence: The/det can/noun will/modal verb rust/verb.

6

7

• We cannot get this probability directly.

• Have to estimate it (through counts).

• Perhaps after first approximating it (by
modifying the formula).

• Counts: Need representative corpus.

Part-of-speech tagging 3

8

• Look at individual words (unigrams):

• Maximum likelihood estimator (MLE):

  P(C|w) ≈ c(w is C) / c(w)      (counts c taken from the corpus)

PoS tagging: Unigram MLE 1

• Problems of MLE:

• Sparse data.

• Extreme cases:
a. Undefined if w is not in the corpus.
b. 0 if w does not appear in a particular category.

9

PoS tagging: Unigram MLE 2

• Smoothing of the formula, e.g.:

  P(C|w) ≈ (c(w is C) + ε) / (c(w) + εN)

• Give small (non-zero) probability value to
unseen events, taken from seen events by
discounting them.

• Various methods to ensure we still have valid
probability distribution.
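A minimal sketch of how such smoothed unigram estimates might be computed from a tagged corpus. The toy corpus, the value ε = 0.1, and the function name are assumptions for illustration; N is taken to be the number of distinct categories, so the estimates for a word still sum to one.

```python
from collections import Counter

def smoothed_unigram_tag_probs(tagged_corpus, epsilon=0.1):
    """P(C|w) ≈ (c(w is C) + ε) / (c(w) + ε·N), with N = number of categories."""
    word_counts = Counter()
    word_tag_counts = Counter()
    tags = set()
    for word, tag in tagged_corpus:
        word_counts[word] += 1
        word_tag_counts[(word, tag)] += 1
        tags.add(tag)
    N = len(tags)

    def p(category, word):
        return (word_tag_counts[(word, category)] + epsilon) / (word_counts[word] + epsilon * N)

    return p

# Hypothetical mini-corpus of (word, tag) pairs:
corpus = [("the", "Det"), ("can", "N"), ("will", "Modal"), ("rust", "V"),
          ("the", "Det"), ("can", "Modal")]
p = smoothed_unigram_tag_probs(corpus)
print(p("N", "can"), p("V", "can"))   # "can" as V is unseen, but still gets a small non-zero value
```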

10

PoS tagging: Unigram MLE 3

11

• Just choosing the most frequent PoS for each
word yields 90% accuracy in PoS tagging.

• But:

• Not uniform across words.

• Accuracy over a whole sentence is low (0.9^n for n words).

• No context: The fly vs. I will fly.

• Need better approximations for P(C1 … Cn | w1 … wn).

PoS tagging: Unigram MLE 4

12

• Use Bayes’s rule to rewrite:

  P(C1 … Cn | w1 … wn) = P(C1 … Cn) × P(w1 … wn | C1 … Cn) / P(w1 … wn)
                              ❶                 ❷

• For a given word string, we want to maximize this, i.e., find the most likely C1 … Cn.

• The denominator is fixed for the given word string, so we just need to maximize the numerator.

PoS tagging: Bayesian method

13

• Approximate ❶ P(C1 … Cn) by predicting each category from the previous n – 1 categories: an n-gram model.
  (Warning: the n of the n-gram model is not the same n as the number of words!)

• Bigram (2-gram) model:

  P(C1 … Cn) ≈ Πi P(Ci | Ci–1)

• Posit pseudo-categories START at C0, and END after Cn. Example:

  P(C1 … Cn) ≈ P(C1 | START) × P(C2 | C1) × ⋯ × P(Cn | Cn–1) × P(END | Cn)

Approximating probabilities 1

• Approximate ❷ P(w1 … wn | C1 … Cn) by assuming that the probability of a word appearing in a category is independent of the words surrounding it:

  P(w1 … wn | C1 … Cn) ≈ Πi P(wi | Ci)

  (the P(wi | Ci) are the lexical generation probabilities)

15

Approximating probabilities 2

• Why is P(w|C) better than P(C|w)?

• P(C|w) is clearly not independent of surrounding
categories.

• Lexical generation probability is somewhat more
independent.

• The complete formula for PoS tagging includes the bigrams, so it does capture some context.

16

Approximating probabilities 3

Putting it all together

  Best tag sequence = argmax over C1 … Cn of Πi P(Ci | Ci–1) × P(wi | Ci)   ❸

Really should use smoothed MLE; cf. slide 10.
MLE for categories is not the same as for words; cf. slide 8.

17
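To make ❸ concrete, here is a sketch that scores a single candidate tag sequence with bigram and lexical generation probabilities. The function and data layout are assumptions; the numbers in the usage are copied from the example corpus on slides 20–21 (only the entries needed here).

```python
def sequence_score(words, tags, trans, lex):
    """Π P(Ci|Ci-1) · P(wi|Ci), with START before C1 and END after Cn."""
    score, prev = 1.0, "START"
    for w, t in zip(words, tags):
        score *= trans.get((prev, t), 0.0) * lex.get((w, t), 0.0)
        prev = t
    return score * trans.get((prev, "END"), 0.0)

# A few probabilities from the artificial corpus of slides 20-21:
trans = {("START", "Det"): 0.71, ("Det", "N"): 1.0, ("N", "V"): 0.34, ("V", "END"): 0.25}
lex = {("a", "Det"): 0.36, ("flower", "N"): 0.063, ("flowers", "V"): 0.053}

print(sequence_score(["a", "flower", "flowers"], ["Det", "N", "V"], trans, lex))
```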

18

• Want to find the argmax (most probable)
C1 … Cn.

• Brute force method: Find all possible
sequences of categories and compute P.

• Unnecessary: Our approximation assumes
independence:

• Category bigrams: Ci depends only on Ci – 1.
Lexical generation: wi depends only on Ci.

• Hence we do not need to enumerate all
sequences independently.

Finding max 1

19

• Bigrams:
Markov model.

• States are categories and transitions are bigrams.

• Lexical generation probabilities:
Hidden Markov model.

• Words are outputs (with given probability) of states.

• A word could be the output of more than one state.

• Current state is unknown (“hidden”).

Finding max 2

Example
• Artificial corpus of 300 PoS-tagged sentences, using only Det, N, V, P.

• The flower flowers like a bird.
Some birds like a flower with fruit beetles.
Like flies like flies.

• Some lexical generation probabilities:

20

P(the|Det) = .54 P(like|N) = .012 P(flower|N) = .063 P(birds|N) = .076

P(a|Det) = .36 P(like|V) = .1 P(flower|V) = .050 P(flies|V) = .076

P(a|N) = .001 P(like|P) = .068 P(flowers|N) = .050 P(flies|N) = .025

⋮ ⋮ P(flowers|V) = .053 ⋮

Based on an example in section 7.3 of: Allen, James. Natural

Language Understanding (2nd ed), 1995, Benjamin Cummings.

21

Bigram Ci–1, Ci   Count Ci–1   Count Ci–1, Ci   P(Ci|Ci–1)      Estimate
START, Det        300          213              P(Det|START)    0.710
START, N          300          87               P(N|START)      0.290
Det, N            558          558              P(N|Det)        1.000
N, V              883          300              P(V|N)          0.340
N, N              883          51               P(N|N)          0.058
N, P              883          307              P(P|N)          0.348
N, END            883          225              P(END|N)        0.255
V, N              300          106              P(N|V)          0.353
V, Det            300          119              P(Det|V)        0.397
V, END            300          75               P(END|V)        0.250
P, Det            307          226              P(Det|P)        0.740
P, N              307          81               P(N|P)          0.260

Markov model: Bigram table
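The estimates in this table are plain MLE ratios of bigram counts to unigram counts over the tagged corpus; a sketch of that computation (the corpus format and function name are assumptions):

```python
from collections import Counter

def bigram_transition_probs(tag_sequences):
    """MLE: P(Ci | Ci-1) = count(Ci-1, Ci) / count(Ci-1), with START/END added."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["START"] + list(tags) + ["END"]
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return {bg: count / unigrams[bg[0]] for bg, count in bigrams.items()}

# On the 300-sentence corpus of slide 20 this yields, e.g.,
# P(Det|START) = 213/300 ≈ 0.71 and P(N|Det) = 558/558 = 1.0.
```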

22

Markov model: Transition probabilities

[Figure: state-transition diagram with states START, Det, N, V, P, END, whose arcs are labelled with the bigram probabilities from the table above (START→Det .71, START→N .29, Det→N 1.0, N→V .34, N→N .058, N→P .348, N→END .255, V→N .353, V→Det .397, V→END .25, P→Det .74, P→N .26).]

23

P(the|Det) = .54 P(like|N) = .012 P(flower|N) = .063 P(birds|N) = .076

P(a|Det) = .36 P(like|V) = .1 P(flower|V) = .050 P(flies|V) = .076

P(a|N) = .001 P(like|P) = .068 P(flowers|N) = .050 P(flies|N) = .025

⋮ ⋮ P(flowers|V) = .053 ⋮

HMM: Lexical generation probabilities

[Figure: the same Markov model extended to an HMM: each state also emits words with the lexical generation probabilities above, e.g. Det emits the (.54) and a (.36); N emits flower (.063), flies (.025), birds (.076), like (.012), a (.001); V emits like (.1), flower (.05), flies (.076); P emits like (.068).]

24

• Given the observed output, we want to find
the most likely path through the model.

Hidden Markov models 1

      The    can           will          rust
      det    modal verb    modal verb    noun
             noun          noun          verb
             verb          verb

25

• At any state in an HMM, how you got there is
irrelevant to computing the next transition.

• So, just need to remember the best path and
probability up to that point.

• Define P(Ci–1) as the probability of the best sequence up to state Ci–1.

• Then find the Ci that maximizes
  P(Ci–1) × P(Ci|Ci–1) × P(wi|Ci)   (❸ from slide 17)

Hidden Markov models 2

Viterbi algorithm
• Given an HMM and an observation O of its

output, finds the most probable sequence S
of states that produced O.

• O = words of sentence, S = PoS tags of sentence

• Parameters of HMM based on large training
corpus.

26
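A compact sketch of the Viterbi computation for this tagging HMM, keeping for each tag the best path so far as described on slide 25. The dictionary-based data layout and names are assumptions; the usage reuses a few probabilities from slides 20–22.

```python
def viterbi(words, tagset, trans, lex):
    """Return (probability, tag sequence) of the most probable path through the HMM."""
    best = {"START": (1.0, [])}       # best[t] = (prob of best path ending in tag t, that path)
    for w in words:
        new_best = {}
        for t in tagset:
            emit = lex.get((w, t), 0.0)
            if emit == 0.0:
                continue
            candidates = [(p * trans.get((prev, t), 0.0) * emit, path + [t])
                          for prev, (p, path) in best.items()]
            new_best[t] = max(candidates, key=lambda c: c[0])
        best = new_best
    finals = [(p * trans.get((prev, "END"), 0.0), path) for prev, (p, path) in best.items()]
    return max(finals, key=lambda f: f[0])

trans = {("START", "Det"): 0.71, ("Det", "N"): 1.0, ("N", "V"): 0.34,
         ("N", "N"): 0.058, ("N", "END"): 0.255, ("V", "END"): 0.25}
lex = {("a", "Det"): 0.36, ("flower", "N"): 0.063,
       ("flowers", "N"): 0.05, ("flowers", "V"): 0.053}
print(viterbi(["a", "flower", "flowers"], ["Det", "N", "V", "P"], trans, lex))
# -> Det N V wins over Det N N for "a flower flowers"
```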

27

• Consider tags as terminals (i.e., use a PoS
tagger to pre-process input texts).

Det N Modal Verb.

• For probability of each grammar rule, use
MLE.

• Probabilities derived from hand-parsed
corpora (treebanks).

• Count the frequency of use, c(C → RHS), of each rule, for each non-terminal C and each different RHS.

What are some problems with this approach?

Statistical chart parsing 1

28

• MLE probability of rules:

• For each rule C → RHS:

  P(C → RHS) = c(C → RHS) / c(C), where c(C) is the total count of rules expanding C.

• Takes no account of the context of use of a
rule: independence assumption.

• Source-normalized: assumes a top-down
generative process.

• NLTK’s pchart demo doesn’t PoS-tag first (words are generated top-down), and it shows P(t) rather than P(t|s). Why?

Statistical chart parsing 2


33

• In this view of chart parsing, the probability of chart entries is relatively simple to calculate. For completed constituents:

  P(e0) = P(C0 → C1 … Cn) × P(e1) × ⋯ × P(en)   ❺

  where e0 is the entry for the current constituent, of category C0, and e1 … en are the chart entries for C1 … Cn in the RHS of the rule.

NB: Unlike for PoS tagging above, the Ci are not necessarily lexical categories.

39

Statistical chart parsing 3

• Consider a complete parse tree, t, with root label S.

• Recasting ❺, t has the probability:

  P(t) = P(S) × Πn P(rule(n) | cat(n))   ❻

  where n ranges over all nodes in the tree t; rule(n) is the rule used for n; cat(n) is the category of n.

• P(S) = 1!

• “Bottoms out” at lexical categories.

• Note that we’re parsing bottom-up, but the generative model “thinks” top-down regardless.
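As a hedged illustration of ❻, NLTK (whose pchart demo the slides mention) also provides a ViterbiParser that returns the single most probable tree together with its P(t). The toy grammar below, with PoS tags as terminals, and its probabilities are invented for this sketch.

```python
import nltk
from nltk.parse import ViterbiParser

# Toy PCFG over pre-terminal tags (probabilities are made up for illustration).
grammar = nltk.PCFG.fromstring("""
    S  -> NP VP        [1.0]
    NP -> 'Det' 'N'    [0.7]
    NP -> 'N'          [0.3]
    VP -> 'Modal' 'V'  [0.6]
    VP -> 'V'          [0.4]
""")

parser = ViterbiParser(grammar)
# "The can will rust" after PoS tagging: Det N Modal V
for tree in parser.parse(["Det", "N", "Modal", "V"]):
    print(tree)          # the most probable parse
    print(tree.prob())   # its probability P(t) = 1.0 * 0.7 * 0.6 = 0.42
```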

40

Statistical chart parsing 4

• Maximum likelihood estimates on an
annotated corpus can be improved to
increase the likelihood of a different,
unannotated corpus

• Step 1: parse the unannotated corpus using
the MLE parameters.

• Step 2: adjust the parameters according to the
expected relative frequencies of different rules
in the parse trees obtained in Step 1:

• ṗ(A→B C) = μ(A→B C) / Z

• ṗ(A→w) = μ(A→w) / Z

41

Inside-Outside Algorithm

• μ(A → B C) = Σi,k,j μ(A → B C, i, k, j)

• μ(A → w) = Σi μ(A, i) δi(w)

where we now count having seen an A from i to j, a B from i to k, and a C from k to j; or an A at position i where the word w appears.

42

Inside-Outside Algorithm 2

• We can define these position-specific μ’s in
terms of:

• the outside probability β(N, p, q)

• the inside probability α(N, p, q)

43

Inside-Outside Algorithm 3

[Figure: a non-terminal N spanning words wp … wq within the sentence w1 … wm; α(N, p, q) is the inside probability of N over that span, and β(N, p, q) is the outside probability of everything outside it.]

• μ(A → B C, i, k, j) = p(A → B C) · β(A, i, j) · α(B, i, k) · α(C, k+1, j)

• μ(A, i) = μ(A, i, i)

• μ(A, i, j) = α(A, i, j) · β(A, i, j)

• Z = α(S, 1, n)

There are also very terse, recursive
formulations of α and β that are amenable to
dynamic programming.
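A sketch of the inside probability (α in the slides' notation) computed bottom-up by dynamic programming, assuming for simplicity a grammar in Chomsky normal form stored as two dictionaries. The function name, indexing convention, and the toy grammar's rule probabilities are assumptions; the lexical numbers are borrowed from slide 20.

```python
from collections import defaultdict

def inside_probs(words, binary_rules, lexical_rules):
    """alpha[(A, i, j)] = P(A derives words i..j), 1-based inclusive, so Z = alpha[(S, 1, n)].
    binary_rules: {(A, B, C): P(A -> B C)}; lexical_rules: {(A, w): P(A -> w)}."""
    n = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words, start=1):                # spans of length 1
        for (A, word), p in lexical_rules.items():
            if word == w:
                alpha[(A, i, i)] += p
    for span in range(2, n + 1):                          # longer spans, shortest first
        for i in range(1, n - span + 2):
            j = i + span - 1
            for (A, B, C), p in binary_rules.items():
                for k in range(i, j):                     # B covers i..k, C covers k+1..j
                    alpha[(A, i, j)] += p * alpha[(B, i, k)] * alpha[(C, k + 1, j)]
    return alpha

binary = {("S", "NP", "VP"): 1.0, ("NP", "Det", "N"): 1.0, ("VP", "V", "NP"): 1.0}
lex = {("Det", "the"): 0.54, ("Det", "a"): 0.36, ("N", "birds"): 0.076,
       ("N", "flower"): 0.063, ("V", "like"): 0.1}
alpha = inside_probs(["the", "birds", "like", "a", "flower"], binary, lex)
print(alpha[("S", 1, 5)])    # Z for this toy grammar and sentence
```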

44

Inside-Outside Algorithm 4

• But just like non-statistical chart parsers, this
one only answers ‘yes’ or ‘no’ (with a
probability) in polynomial time:

• It’s not supposed to matter how we got each
constituent. Just the non-terminal label and the
span are all that should matter.

• There might be exponentially many trees in
this formulation.

• And we’re not calculating the probability that
the input is a sentence – this is only the
probability of one interpretation (tree).

45

Statistical chart parsing 5

47

• Evaluation method:

• Train on part of a parsed corpus.
(I.e., gather rules and statistics.)

• Test on a different part of the corpus.

• In one sense, the best evaluation of a method
like this would be data likelihood, but since
we’re scoring trees instead of strings, it’s
difficult to defend any sort of intuition about
the numbers assigned to them.

Evaluation 1

48

• Evaluation: PARSEVAL measures compare parser output to the known correct parse:

• Labelled recall: fraction of the correct parse's constituents that appear in the output.

• Labelled precision: fraction of constituents in the output that are correct.

• F-measure = harmonic mean of precision and recall = 2PR / (P + R).

Evaluation 2
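A sketch of labelled precision, recall, and F computed from constituents represented as (label, start, end) triples; this representation and the toy example are assumptions, and crossing brackets (next slide) are not handled here.

```python
from collections import Counter

def parseval(gold, output):
    """Labelled precision/recall/F over multisets of (label, start, end) constituents."""
    gold_c, out_c = Counter(gold), Counter(output)
    matched = sum((gold_c & out_c).values())       # constituents agreeing in label and span
    precision = matched / sum(out_c.values())
    recall = matched / sum(gold_c.values())
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

gold   = [("S", 0, 4), ("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)]
output = [("S", 0, 4), ("NP", 0, 2), ("VP", 2, 4), ("NP", 2, 4)]
print(parseval(gold, output))    # (0.5, 0.5, 0.5)
```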

49

• Evaluation: PARSEVAL measures compare
parser output to known correct parse:

• Penalize for cross-brackets per sentence:
Constituents in output that overlap two (or more)
correct ones; e.g., [[A B] C] for [A [B C]].

• [[Nadia] [[smelled] [the eggplant]]]

[[[Nadia] [smelled]] [the eggplant]]

• The labels on the subtrees aren’t necessary

for this one.

Evaluation 3

50

• PARSEVAL is a classifier accuracy score –
much more extensional. All that matters is
the right answer at the end.

• But that still means that we can look at parts
of the right answer.

• Can get ~75% labelled precision, recall, and
F with above methods.

Evaluation 4

Improving statistical parsing

• Problem: Probabilities are based only on
structures and categories:

• But actual words strongly condition which rule
is used (cf Ratnaparkhi).

• Improve results by conditioning on more
factors, including words. Think semantics –
the words themselves give us a little bit of
access to this.

51

• Head of a phrase: its central or key word.

• The noun of an NP, the preposition of a PP, etc.

• Lexicalized grammar: Refine the grammar
so that rules take heads of phrases into
account — the actual words.

• BEFORE: Rule for NP.
AFTER: Rules for NP-whose-head-is-aardvark,
NP-whose-head-is-abacus, …, NP-whose-head-is-
zymurgy.

• And similarly for VP, PP, etc.

52

Lexicalized grammars 1

• Notation: cat(head,tag) for constituent
category cat headed by head with part-of-
speech tag.

• e.g., NP(aardvark,NN), PP(without,IN)

53

Lexicalized grammars 2

NP-whose-head-is-the-NN-aardvark

PP-whose-head-is-the-IN-without

A lexicalized grammar

54

TOP → S(bought,VBD)
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
NP(week,NN) → JJ(Last,JJ) NN(week,NN)
NP(IBM,NNP) → NNP(IBM,NNP)
VP(bought,VBD) → VBD(bought,VBD) NP(Lotus,NNP)
NP(Lotus,NNP) → NNP(Lotus,NNP)

Lexical Rules:

JJ(Last,JJ) → Last
NN(week,NN) → week
NNP(IBM,NNP) → IBM
VBD(bought,VBD) → bought
NNP(Lotus,NNP) → Lotus

Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.

• Number of rules and categories explodes, but
no theoretical change in parsing process
(whether statistical or not).

• But far too specific for practical use; each is
too rarely used to determine its probability.

• Need something more than regular (unlexicalized)
rules and less than complete lexicalization …

• … perhaps we should change the process
after all.

56

Lexicalized grammars 3

Starting from unlexicalized rules:

• 1. Lexicalization: Consider the head word of each node, not just its category:

  P(t) = P(S) × Πn P(rule(n) | head(n))

  where head(n) is the PoS-tagged head word of node n.

• Needs finer-grained probabilities:

• e.g., the probability that rule r is used, given that we have an NP whose head is the noun deficit.

57

Lexicalized parsing 1

Replaces ❻
from slide 40

58

• 2. Head and parent: Condition on the head
and the head of the parent node in the tree:

Lexicalized parsing 2

e.g., probability of rule r given that head is the noun deficit.

e.g., probability that head is the noun deficit,

given that parent phrase’s head is the verb report.

60

• Lexical information introduces context into
CFG.

• Grammar is larger.

• Potential problems of sparse data.

• Possible solutions: Smoothing; back-off
estimates.

Effects on parsing

If you don’t have data for a fine-grained

situation, use data from a coarser-grained

situation that it’s contained in.
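One simple way to realize this idea in code: back off from the fully lexicalized estimate to successively coarser conditioning contexts when the finer counts are missing. This hard back-off is only a sketch; systems like Collins's instead interpolate the levels with learned weights.

```python
def backoff_rule_prob(rule, head_word, head_tag, counts):
    """P(rule | head word, head tag) if attested, else P(rule | head tag), else P(rule).
    `counts` maps a conditioning context to (dict of rule counts, total count)."""
    for context in [(head_word, head_tag), (head_tag,), ()]:
        rule_counts, total = counts.get(context, ({}, 0))
        if total > 0 and rule in rule_counts:
            return rule_counts[rule] / total
    return 0.0
```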

62

Bikel’s 2004 interpretation
• Can condition on any information available in generating the tree.

• Basic idea: Avoid sparseness of
lexicalization by decomposing rules.

• Make plausible independence assumptions.

• Break rules down into small steps (small number
of parameters).

• Each rule still parameterized with word/PoS pair:

S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)

Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.

63

Collins’s “model 1” 1
• Lexical Rules, with probability 1:

tag(word, tag) → word

• Internal Rules, with treebank-based probabilities. Separate the constituents to the left and right of the head; generate them one at a time:

  X(h) → Ln(ln) … L1(l1) H(h) R1(r1) … Rm(rm)

  X, Li, H, and Ri all have the form cat(head,tag).
  Notation: italic lowercase symbol for (head,tag).

64

Collins’s “model 1” 2
• Assume there are additional Ln+1 and Rm+1

representing phrase boundaries (“STOP”).

• Example:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
n = 2, m = 0 (two constituents on the left of the head, zero on the right).
X = S, H = VP, L1 = NP, L2 = NP, L3 = STOP, R1 = STOP.
h = (bought,VBD), l1 = (IBM,NNP), l2 = (week,NN).

• Distinguish probabilities of heads Ph, of left
constituents Pl, and of right constituents Pr.

65

Probabilities of internal rules 1

  P( X(h) → Ln(ln) … L1(l1) H(h) R1(r1) … Rm(rm) )

  ≈ Ph(H | X, h)                            generate head constituent
    × Πi=1..n+1 Pl(Li(li) | X, h, H)        generate left modifiers (stop at STOP)
    × Πi=1..m+1 Pr(Ri(ri) | X, h, H)        generate right modifiers (stop at STOP)

  (by the independence assumption)

Probabilities of internal rules 2

Example:

P( S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD) )

  ≈ Ph(VP | S, bought,VBD)                  generate head constituent
  × Pl(NP(IBM,NNP) | S, bought,VBD, VP)     generate left modifiers
  × Pl(NP(week,NN) | S, bought,VBD, VP)
  × Pl(STOP | S, bought,VBD, VP)
  × Pr(STOP | S, bought,VBD, VP)            generate right modifiers

66
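A sketch of how the decomposed probability of one internal rule could be assembled from separate Ph, Pl, and Pr tables. The function, its table layout, and the call sketched in the comment are placeholders; the distance variables of the next slide are omitted.

```python
def internal_rule_prob(X, head, lefts, rights, Ph, Pl, Pr):
    """Generate the head constituent, then the left and right modifiers (each list
    implicitly ending with STOP), all conditioned on the parent X and its head pair h."""
    H, h = head                                   # e.g. ("VP", ("bought", "VBD"))
    p = Ph.get((H, X, h), 0.0)
    for L in list(lefts) + ["STOP"]:              # left modifiers, head outward
        p *= Pl.get((L, X, h, H), 0.0)
    for R in list(rights) + ["STOP"]:             # right modifiers, head outward
        p *= Pr.get((R, X, h, H), 0.0)
    return p

# Shape of the call for the rule on this slide:
# internal_rule_prob("S", ("VP", ("bought", "VBD")),
#                    lefts=[("NP", ("IBM", "NNP")), ("NP", ("week", "NN"))],
#                    rights=[], Ph=Ph, Pl=Pl, Pr=Pr)
```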

67

• (Badly-named) “distance measure” to capture
properties of attachment relevant to current
modifier.

• Pl(Li(li) | X, h, H) becomes Pl(Li(li) | X, h, H, distancel(i – 1)), and analogously on the right.

• The value of distancex is actually a pair of Boolean random variables:

• Is the string of modifiers 1..(i – 1) of length 0?
  i.e., is modifier i attached adjacent to the head?

• Does the string 1..(i – 1) contain a verb?
  i.e., does the attachment of modifier i cross a verb?

Adding other dependencies

Collins’s “model 1” 4
• Backs off …

• to tag probability when no data for specific word;

• to complete non-lexicalization when necessary.

68

69

Collins’s Models 2 and 3
• Model 2: Add verb subcategorization and

argument/adjunct distinction.

• Model 3: Integrate gaps and trace
identification into model.

• Especially important with addition of
subcategorization.

72

Results and conclusions 1

• Model 2 outperforms Model 1.

• Model 3: Similar performance, but identifies
traces too.

• Model 2 performs best overall:

• LP = 89.6, LR = 89.9 [sentences ≤ 100 words].

• LP = 90.1, LR = 90.4 [sentences ≤ 40 words].

• Rich information improves parsing
performance.

73

Results and conclusions 2

• Strengths:

• Incorporation of lexical and other linguistic
information.

• Competitive results.

• Weaknesses:

• Supervised training.

• Performance tightly linked to particular type of
corpus used.

74

Results and conclusions 3

• Importance to CL:

• High-performance parser showing benefits of
lexicalization and linguistic information.

• Publicly available, widely used in research.

• There was some initial hope that it would make
language models better, but that didn’t pan out.

• But the reverse was fairly successful at giving us some access to semantics: language modelling (conditioning on the actual words) makes parsing better.