Computational Linguistics
Copyright © 2017 Suzanne Stevenson, Graeme Hirst and Gerald Penn. All rights reserved.
9. Statistical parsing
Gerald Penn
Department of Computer Science, University of Toronto
CSC 2501 / 485
Fall 2018
Reading: Jurafsky & Martin: 5.2–5.5.2, 5.6, 12.4, 14.0–1,
14.3–4, 14.6–7. Bird et al: 8.6.
2
• General idea:
• Assign probabilities to rules in a context-free
grammar.
• Use a likelihood model.
• Combine probabilities of rules in a tree.
• Yields likelihood of a parse.
• The best parse is the most likely one.
Statistical parsing 1
3
• Motivations:
• Uniform process for attachment decisions.
• Use lexical preferences in all decisions.
Statistical parsing 2
4
1. Assign a probability to each rule of grammar, including
lexical productions.
–Parse string of input words with probabilistic rules.
The can will rust.
2. Assign probabilities only to non-lexical productions.
–Probabilistically tag input words with syntactic
categories using a part-of-speech tagger.
–Consider the pre-terminal syntactic categories to be
terminals, and parse that string with probabilistic rules.
Det N Modal Verb.
3. “Supertagging” – parsing as tagging with tree
fragments.
Two general approaches
5
• Part-of-speech (PoS) tagging:
Given a sequence of words w1 … wn (from
well-formed text), determine the syntactic
category (PoS) Ci of each word.
• I.e., the best category sequence C1 … Cn to
assign to the word sequence w1 … wn.
Part-of-speech tagging 1
6
Part-of-speech tagging 2
• Example (from Charniak 1997):

The    can          will         rust
det    modal verb   modal verb   noun    ← most likely
       noun         noun         verb
       verb         verb
7
• We cannot get this probability, P(C1 … Cn | w1 … wn), directly.
• Have to estimate it (through counts).
• Perhaps after first approximating it (by
modifying the formula).
• Counts: Need representative corpus.
Part-of-speech tagging 3
8
• Look at individual words (unigrams):
• Maximum likelihood estimator (MLE):
P(C|w) ≈ c(w is C) / c(w)
where c(·) is a count in the corpus.
PoS tagging: Unigram MLE 1
• Problems of MLE:
• Sparse data.
• Extreme cases:
a. Undefined if w is not in the corpus.
b. 0 if w does not appear in a particular category.
9
PoS tagging: Unigram MLE 2
• Smoothing of the formula, e.g.:
P(C|w) ≈ (c(w is C) + ε) / (c(w) + εN)
where N is the number of categories.
• Give small (non-zero) probability value to
unseen events, taken from seen events by
discounting them.
• Various methods to ensure we still have valid
probability distribution.
10
PoS tagging: Unigram MLE 3
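A minimal sketch of this smoothed unigram estimate in Python (the toy corpus, tagset, and ε value here are illustrative, not from the slides):

from collections import Counter

def smoothed_unigram_tag_probs(tagged_corpus, tagset, eps=0.1):
    """P(C|w) ≈ (c(w is C) + eps) / (c(w) + eps*N), N = number of categories."""
    word_counts = Counter(w for w, _ in tagged_corpus)
    pair_counts = Counter(tagged_corpus)
    N = len(tagset)
    def prob(tag, word):
        # Unseen words get 1/N; unseen (word, tag) pairs get a small non-zero value.
        return (pair_counts[(word, tag)] + eps) / (word_counts[word] + eps * N)
    return prob

# Illustrative use with a tiny hand-made tagged corpus:
corpus = [("the", "Det"), ("can", "N"), ("will", "Modal"), ("rust", "V"),
          ("the", "Det"), ("can", "Modal"), ("will", "N")]
p = smoothed_unigram_tag_probs(corpus, tagset={"Det", "N", "V", "Modal"})
print(p("N", "can"))   # seen pairing
print(p("V", "can"))   # unseen pairing, still non-zero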
11
• Just choosing the most frequent PoS for each
word yields 90% accuracy in PoS tagging.
• But:
• Not uniform across words.
• Accuracy is low (0.9^n) when multiplied over n
words.
• No context: The fly vs. I will fly.
• Need better approximations for P(C1 … Cn | w1 … wn).
PoS tagging: Unigram MLE 4
12
• Use Bayes’s rule to rewrite:
P(C1 … Cn | w1 … wn) = ❶ P(C1 … Cn) × ❷ P(w1 … wn | C1 … Cn) / P(w1 … wn)
• For a given word string, we want to maximize
this, i.e., find the most likely C1 … Cn.
• The denominator is constant for the given words,
so we just need to maximize the numerator.
PoS tagging: Bayesian method
13
• Approximate ❶ P(C1 … Cn) by predicting
each category from the previous n – 1 categories:
an n-gram model.
(Warning: this n is not the same as the number of words n!)
• Bigram (2-gram) model:
P(C1 … Cn) ≈ ∏i=1..n+1 P(Ci | Ci–1)
• Posit pseudo-categories START as C0, and
END as Cn+1. Example:
P(Det N V) ≈ P(Det|START) × P(N|Det) × P(V|N) × P(END|V)
Approximating probabilities 1
• Approximate ❷ P(w1 … wn | C1 … Cn) by
assuming that the probability of a word
appearing in a category is independent of the
words surrounding it:
P(w1 … wn | C1 … Cn) ≈ ∏i P(wi | Ci)
The P(wi | Ci) are the lexical generation probabilities.
15
Approximating probabilities 2
• Why is P(w|C) better than P(C|w)?
• P(C|w) is clearly not independent of surrounding
categories.
• Lexical generation probability is somewhat more
independent.
• Complete formula for PoS includes bigrams, and
so it does capture some context.
16
Approximating probabilities 3
Putting it all together
P(C1 … Cn | w1 … wn) ∝ ∏i P(Ci | Ci–1) × P(wi | Ci)    ❸
• Really we should use the smoothed MLE (cf slide 10).
• The MLE for categories is not the same as the MLE for words (cf slide 8).
17
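As a sketch, formula ❸ can be turned directly into a scoring function; the transition and lexical-generation tables (p_trans, p_emit) are assumed here to be given, e.g., estimated as on the following slides:

def tag_sequence_score(words, tags, p_trans, p_emit):
    """Product of P(Ci|Ci-1) * P(wi|Ci) over the sentence, with START/END pseudo-tags.
    p_trans[(prev_tag, tag)] = P(tag|prev_tag); p_emit[(word, tag)] = P(word|tag)."""
    score = 1.0
    prev = "START"
    for w, t in zip(words, tags):
        score *= p_trans.get((prev, t), 0.0) * p_emit.get((w, t), 0.0)
        prev = t
    return score * p_trans.get((prev, "END"), 0.0)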
18
• Want to find the argmax (most probable)
C1 … Cn.
• Brute force method: Find all possible
sequences of categories and compute P.
• Unnecessary: Our approximation assumes
independence:
• Category bigrams: Ci depends only on Ci – 1.
Lexical generation: wi depends only on Ci.
• Hence we do not need to enumerate and score
each sequence separately.
Finding max 1
19
• Bigrams:
Markov model.
• States are categories and transitions are bigrams.
• Lexical generation probabilities:
Hidden Markov model.
• Words are outputs (with given probability) of states.
• A word could be the output of more than one state.
• Current state is unknown (“hidden”).
Finding max 2
Example
• Artificial corpus of 300 PoS-tagged sentences
using only Det, N, V, P.
• The flower flowers like a bird.
Some birds like a flower with fruit beetles.
Like flies like flies.
…
• Some lexical generation probabilities:
20
P(the|Det) = .54 P(like|N) = .012 P(flower|N) = .063 P(birds|N) = .076
P(a|Det) = .36 P(like|V) = .1 P(flower|V) = .050 P(flies|V) = .076
P(a|N) = .001 P(like|P) = .068 P(flowers|N) = .050 P(flies|N) = .025
⋮ ⋮ P(flowers|V) = .053 ⋮
⋮
Based on an example in section 7.3 of: Allen, James. Natural
Language Understanding (2nd ed), 1995, Benjamin Cummings.
21
Bigram Ci–1,Ci   Count Ci–1   Count Ci–1,Ci   Estimate of P(Ci|Ci–1)
START, Det       300          213             P(Det|START) = 0.710
START, N         300          87              P(N|START)   = 0.290
Det, N           558          558             P(N|Det)     = 1.000
N, V             883          300             P(V|N)       = 0.340
N, N             883          51              P(N|N)       = 0.058
N, P             883          307             P(P|N)       = 0.348
N, END           883          225             P(END|N)     = 0.255
V, N             300          106             P(N|V)       = 0.353
V, Det           300          119             P(Det|V)     = 0.397
V, END           300          75              P(END|V)     = 0.250
P, Det           307          226             P(Det|P)     = 0.740
P, N             307          81              P(N|P)       = 0.260
Markov model: Bigram table
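A sketch of how such a table can be computed by MLE from a tagged corpus (the input format, a list of per-sentence tag sequences, is an assumption for illustration):

from collections import Counter

def bigram_transition_probs(tag_sequences):
    """P(Ci|Ci-1) = Count(Ci-1, Ci) / Count(Ci-1), with START/END pseudo-tags."""
    unigrams, bigrams = Counter(), Counter()
    for tags in tag_sequences:
        padded = ["START"] + list(tags) + ["END"]
        for prev, cur in zip(padded, padded[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    return {bg: count / unigrams[bg[0]] for bg, count in bigrams.items()}

# e.g. bigram_transition_probs([["Det", "N", "V", "Det", "N"],
#                               ["N", "V", "P", "Det", "N"]])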
22
Markov model: Transition probabilities
[State-transition diagram: states START, Det, N, V, P, END, with arcs labelled
by the transition probabilities P(Ci|Ci–1) from the bigram table above
(e.g., START→Det .71, Det→N 1.0, N→P .348, P→Det .74, V→END .25).]
23
P(the|Det) = .54 P(like|N) = .012 P(flower|N) = .063 P(birds|N) = .076
P(a|Det) = .36 P(like|V) = .1 P(flower|V) = .050 P(flies|V) = .076
P(a|N) = .001 P(like|P) = .068 P(flowers|N) = .050 P(flies|N) = .025
⋮ ⋮ P(flowers|V) = .053 ⋮
HMM: Lexical generation probabilities
[HMM diagram: the transition network from the previous slide, in which each
state also emits words with the lexical generation probabilities above
(e.g., Det emits "the" with probability .54, V emits "like" with probability .1,
N emits "flies" with probability .025).]
24
• Given the observed output, we want to find
the most likely path through the model.
Hidden Markov models 1
The    can          will         rust
det    modal verb   modal verb   noun
       noun         noun         verb
       verb         verb
25
• At any state in an HMM, how you got there is
irrelevant to computing the next transition.
• So, just need to remember the best path and
probability up to that point.
• Define P(Ci–1) as the probability of the best
sequence up to state Ci–1.
• Then find the Ci that maximizes
P(Ci–1) × P(Ci|Ci–1) × P(wi|Ci)    ❸ from slide 17
Hidden Markov models 2
Viterbi algorithm
• Given an HMM and an observation O of its
output, finds the most probable sequence S
of states that produced O.
• O = words of sentence, S = PoS tags of sentence
• Parameters of HMM based on large training
corpus.
26
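A compact sketch of Viterbi decoding for the bigram HMM of the preceding slides (the dictionary-based parameter tables are an assumed representation):

def viterbi(words, tagset, p_trans, p_emit):
    """Most probable tag sequence for words under a bigram HMM.
    p_trans[(prev_tag, tag)] = P(tag|prev_tag); p_emit[(word, tag)] = P(word|tag)."""
    # best[i][t]: probability of the best path covering words[0..i] and ending in tag t
    best = [{t: p_trans.get(("START", t), 0.0) * p_emit.get((words[0], t), 0.0)
             for t in tagset}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tagset:
            prev, score = max(((s, best[i - 1][s] * p_trans.get((s, t), 0.0))
                               for s in tagset), key=lambda x: x[1])
            best[i][t] = score * p_emit.get((words[i], t), 0.0)
            back[i][t] = prev
    # Close with the END transition, then follow the back-pointers.
    last = max(tagset, key=lambda t: best[-1][t] * p_trans.get((t, "END"), 0.0))
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))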
27
• Consider tags as terminals (i.e., use a PoS
tagger to pre-process input texts).
Det N Modal Verb.
• For probability of each grammar rule, use
MLE.
• Probabilities derived from hand-parsed
corpora (treebanks).
• Count the frequency of use c(C → α) of each rule, for
each non-terminal C and each different RHS α.
What are some problems with this approach?
Statistical chart parsing 1
28
• MLE probability of rules:
• For each rule C → α:
P(C → α) = c(C → α) / c(C)
• Takes no account of the context of use of a
rule: independence assumption.
• Source-normalized: assumes a top-down
generative process.
• NLTK’s pchart demo doesn’t PoS-tag first
(words are generated top-down), and it
shows P(t) rather than P(t|s). Why?
Statistical chart parsing 2
❹
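A sketch of these MLE rule probabilities, using NLTK trees; reading the trees from nltk.corpus.treebank assumes the Penn Treebank sample is installed:

from collections import Counter
from nltk.corpus import treebank

def pcfg_rule_probs(parsed_sents):
    """P(C -> alpha) = c(C -> alpha) / c(C): counts of each production,
    normalized over all productions with the same left-hand side."""
    rule_counts, lhs_counts = Counter(), Counter()
    for tree in parsed_sents:
        for prod in tree.productions():
            rule_counts[prod] += 1
            lhs_counts[prod.lhs()] += 1
    return {prod: count / lhs_counts[prod.lhs()] for prod, count in rule_counts.items()}

rule_probs = pcfg_rule_probs(treebank.parsed_sents()[:200])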
33
• In this view of chart parsing, the probability of chart
entries is relatively simple to calculate. For
completed constituents:
P(e0) = P(C0 → C1 … Cn | C0) × P(e1) × … × P(en)    ❺
where e0 is the entry for the current constituent, of category C0;
e1 … en are the chart entries for C1 … Cn in the RHS of
the rule.
NB: Unlike for PoS tagging above, the Ci are not necessarily lexical categories.
39
Statistical chart parsing 3
• Consider a complete parse tree, t, with root
label S.
• Recasting ❺, t has the probability:
P(t) = P(S) × ∏n P(rule(n) | cat(n))    ❻
where n ranges over all nodes in the tree t;
rule(n) is the rule used for n;
cat(n) is the category of n.
• P(S) = 1!
• “Bottoms out” at lexical categories.
• Note that we’re parsing bottom-up, but the
generative model “thinks” top-down
regardless.
40
Statistical chart parsing 4
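A sketch of ❻ applied to an nltk.Tree, reusing rule probabilities estimated as above (the rule_probs table is an assumption, and unseen rules are not smoothed):

from nltk import Tree

def tree_prob(tree, rule_probs):
    """P(t) = product over all nodes n of P(rule(n) | cat(n));
    P(S) = 1, and the product bottoms out at the lexical productions."""
    p = 1.0
    for prod in tree.productions():
        p *= rule_probs.get(prod, 0.0)   # an unseen rule gives probability 0
    return p

# e.g. tree_prob(Tree.fromstring("(S (NP (DT the) (NN can)) (VP (VBD rusted)))"), rule_probs)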
• Maximum likelihood estimates on an
annotated corpus can be improved to
increase the likelihood of a different,
unannotated corpus.
• Step 1: parse the unannotated corpus using
the MLE parameters.
• Step 2: adjust the parameters according to the
expected relative frequencies of different rules
in the parse trees obtained in Step 1:
• p̂(A → B C) = μ(A → B C) / Z
• p̂(A → w) = μ(A → w) / Z
41
Inside-Outside Algorithm
• μ(A → B C) = Σi,k,j μ(A → B C, i, k, j)
• μ(A → w) = Σi μ(A, i) δi(w)
where we now count having seen an A from i to j, a B
from i to k, and a C from k + 1 to j,
…or an A at location i, where the word w appears.
42
Inside-Outside Algorithm 2
• We can define these position-specific μ’s in
terms of:
• the outside probability β(A, i, j)
• the inside probability α(A, i, j)
43
Inside-Outside Algorithm 3
[Figure: a non-terminal N dominating the span wp … wq within the sentence
w1 … wm; α(N, p, q) is the inside probability of deriving that span from N,
and β(N, p, q) is the outside probability of everything outside the span
(w1 … wp–1 and wq+1 … wm).]
• μ(A → B C, i, k, j) = p(A → B C) β(A, i, j) α(B, i, k) α(C, k + 1, j)
• μ(A, i) = μ(A, i, i)
• μ(A, i, j) = α(A, i, j) β(A, i, j)
• Z = α(S, 1, n)
There are also very terse, recursive
formulations of α and β that are amenable to
dynamic programming.
44
Inside-Outside Algorithm 4
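A sketch of one such dynamic program, computing the inside probabilities α(A, i, j) bottom-up for a PCFG in Chomsky normal form (the dictionary-based grammar representation and 0-based, inclusive spans are assumptions):

from collections import defaultdict

def inside_probs(words, unary_probs, binary_probs):
    """alpha[(A, i, j)] = P(A derives words[i..j]).
    unary_probs[(A, w)] = p(A -> w); binary_probs[(A, B, C)] = p(A -> B C)."""
    n = len(words)
    alpha = defaultdict(float)
    for i, w in enumerate(words):                  # spans of length 1
        for (A, word), p in unary_probs.items():
            if word == w:
                alpha[(A, i, i)] += p
    for length in range(2, n + 1):                 # longer spans, bottom-up
        for i in range(n - length + 1):
            j = i + length - 1
            for (A, B, C), p in binary_probs.items():
                alpha[(A, i, j)] += sum(p * alpha[(B, i, k)] * alpha[(C, k + 1, j)]
                                        for k in range(i, j))
    return alpha    # Z = alpha[("S", 0, n - 1)]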
• But just like non-statistical chart parsers, this
one only answers ‘yes’ or ‘no’ (with a
probability) in polynomial time:
• It’s not supposed to matter how we got each
constituent. Just the non-terminal label and the
span are all that should matter.
• There might be exponentially many trees in
this formulation.
• And we’re not calculating the probability that
the input is a sentence – this is only the
probability of one interpretation (tree).
45
Statistical chart parsing 5
47
• Evaluation method:
• Train on part of a parsed corpus.
(I.e., gather rules and statistics.)
• Test on a different part of the corpus.
• In one sense, the best evaluation of a method
like this would be data likelihood, but since
we’re scoring trees instead of strings, it’s
difficult to defend any sort of intuition about
the numbers assigned to them.
Evaluation 1
Labelled recall: the fraction of the correct (gold) constituents that appear in the parser’s output.
48
• Evaluation: PARSEVAL measures compare
parser output to known correct parse:
• Labelled precision, labelled recall.
• F-measure = harmonic mean of precision and
recall = 2PR / (P + R)
Evaluation 2
Labelled precision: the fraction of constituents in the parser’s output that are correct.
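A sketch of the labelled PARSEVAL scores, with each constituent represented as a (label, start, end) tuple (that representation, and ignoring duplicate constituents by using sets, are simplifying assumptions):

def parseval(gold_constituents, output_constituents):
    """Labelled precision, labelled recall, and their harmonic mean (F)."""
    gold, output = set(gold_constituents), set(output_constituents)
    correct = len(gold & output)
    precision = correct / len(output) if output else 0.0
    recall = correct / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# e.g. parseval({("S", 0, 3), ("NP", 0, 0), ("VP", 1, 3), ("NP", 2, 3)},
#               {("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("NP", 2, 3)})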
49
• Evaluation: PARSEVAL measures compare
parser output to known correct parse:
• Penalize for cross-brackets per sentence:
Constituents in output that overlap two (or more)
correct ones; e.g., [[A B] C] for [A [B C]].
• [[Nadia] [[smelled] [the eggplant]]]
[[[Nadia] [smelled]] [the eggplant]]
• The labels on the subtrees aren’t necessary
for this one.
Evaluation 3
50
• PARSEVAL is a classifier accuracy score –
much more extensional. All that matters is
the right answer at the end.
• But that still means that we can look at parts
of the right answer.
• Can get ~75% labelled precision, recall, and
F with above methods.
Evaluation 4
Improving statistical parsing
• Problem: Probabilities are based only on
structures and categories:
• But actual words strongly condition which rule
is used (cf Ratnaparkhi).
• Improve results by conditioning on more
factors, including words. Think semantics –
the words themselves give us a little bit of
access to this.
51
❹
• Head of a phrase: its central or key word.
• The noun of an NP, the preposition of a PP, etc.
• Lexicalized grammar: Refine the grammar
so that rules take heads of phrases into
account — the actual words.
• BEFORE: Rule for NP.
AFTER: Rules for NP-whose-head-is-aardvark,
NP-whose-head-is-abacus, …, NP-whose-head-is-
zymurgy.
• And similarly for VP, PP, etc.
52
Lexicalized grammars 1
• Notation: cat(head,tag) for constituent
category cat headed by head with part-of-
speech tag.
• e.g., NP(aardvark,NN), PP(without,IN)
53
Lexicalized grammars 2
NP-whose-head-is-the-NN-aardvark
PP-whose-head-is-the-IN-without
A lexicalized grammar
54
TOP → S(bought,VBD)
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
NP(week,NN) → JJ(Last,JJ) NN(week,NN)
NP(IBM,NNP) → NNP(IBM,NNP)
VP(bought,VBD) → VBD(bought,VBD) NP(Lotus,NNP)
NP(Lotus,NNP) → NNP(Lotus,NNP)
Lexical Rules:
JJ(Last,JJ) → Last
NN(week,NN) → week
NNP(IBM,NNP) → IBM
VBD(bought,VBD) → bought
NNP(Lotus,NNP) → Lotus
Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.
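A sketch of the lexicalization step: relabelling each node of an nltk.Tree as cat(head,tag) using a toy head-percolation table (the HEAD_CHILD table below is illustrative, not Collins’s actual head-finding rules):

from nltk import Tree

# Illustrative priority lists: which child category supplies the head of each parent.
HEAD_CHILD = {"TOP": ["S"], "S": ["VP"], "VP": ["VBD", "VB"],
              "NP": ["NN", "NNP"], "PP": ["IN"]}

def lexicalize(tree):
    """Relabel every node as cat(head,tag) and return that node's (head, tag)."""
    if isinstance(tree[0], str):                        # preterminal: label is the tag
        head, tag = tree[0], tree.label()
    else:
        child_cats = [child.label() for child in tree]  # record labels before relabelling
        child_heads = [lexicalize(child) for child in tree]
        head, tag = child_heads[-1]                     # fallback: rightmost child's head
        for cat in HEAD_CHILD.get(tree.label(), []):
            if cat in child_cats:
                head, tag = child_heads[child_cats.index(cat)]
                break
    tree.set_label("%s(%s,%s)" % (tree.label(), head, tag))
    return head, tag

t = Tree.fromstring(
    "(TOP (S (NP (JJ Last) (NN week)) (NP (NNP IBM)) "
    "(VP (VBD bought) (NP (NNP Lotus)))))")
lexicalize(t)    # node labels become S(bought,VBD), NP(week,NN), VP(bought,VBD), ...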
• Number of rules and categories explodes, but
no theoretical change in parsing process
(whether statistical or not).
• But far too specific for practical use; each rule is
used too rarely to estimate its probability reliably.
• Need something more than regular (unlexicalized)
rules and less than complete lexicalization …
• … perhaps we should change the process
after all.
56
Lexicalized grammars 3
Starting from unlexicalized rules:
• 1. Lexicalization: Consider the head word of
each node, not just its category:
• P(t) = P(S) × ∏n P(rule(n) | head(n))
where head(n) is the PoS-tagged head word of node n.
• Needs finer-grained probabilities:
• e.g., probability that rule r is used, given we have an
NP whose head is the noun deficit.
57
Lexicalized parsing 1
Replaces ❻
from slide 40
58
• 2. Head and parent: Condition on the head
and the head of the parent node in the tree:
P(t) = P(S) × ∏n P(rule(n) | head(n)) × P(head(n) | head(parent(n)))
Lexicalized parsing 2
e.g., probability of rule r given that its head is the noun deficit.
e.g., probability that the head is the noun deficit,
given that the parent phrase’s head is the verb report.
60
• Lexical information introduces context into
CFG.
• Grammar is larger.
• Potential problems of sparse data.
• Possible solutions: Smoothing; back-off
estimates.
Effects on parsing
If you don’t have data for a fine-grained
situation, use data from a coarser-grained
situation that it’s contained in.
62
Bikel’s 2004 interpretation
• Can condition on any information available in
generating the tree.
• Basic idea: Avoid sparseness of
lexicalization by decomposing rules.
• Make plausible independence assumptions.
• Break rules down into small steps (small number
of parameters).
• Each rule still parameterized with word/PoS pair:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
Michael Collins. Head-driven statistical models for natural language parsing. Computational Linguistics, 29(4), 2003, 589–637.
63
Collins’s “model 1” 1
• Lexical Rules, with probability 1:
tag(word, tag) → word
• Internal Rules, with treebank-based
probabilities. Separate the constituents to the left
and right of the head; generate them one at a time:
X → Ln … L1 H R1 … Rm
X, Li, H, and Ri all have the form cat(head,tag).
Notation: italic lowercase symbol for a (head,tag) pair,
e.g., h for the (head,tag) of H.
64
Collins’s “model 1” 2
• Assume there are additional Ln+1 and Rm+1
representing phrase boundaries (“STOP”).
• Example:
S(bought,VBD) → NP(week,NN) NP(IBM,NNP) VP(bought,VBD)
n = 2, m = 0 (two constituents on the left of the head, zero on the right).
X = S, H = VP, L1 = NP, L2 = NP, L3 = STOP, R1 = STOP.
h = (bought,VBD), l1 = (IBM,NNP), l2 = (week,NN).
• Distinguish probabilities of heads Ph, of left
constituents Pl, and of right constituents Pr.
65
Probabilities of internal rules 1
P(X → Ln … L1 H R1 … Rm | X, h)
≈ Ph(H | X, h)                                  ← generate head constituent
× ∏i=1..n+1 Pl(Li(li) | X, h, H)                ← generate left modifiers (stop at STOP)
× ∏i=1..m+1 Pr(Ri(ri) | X, h, H)                ← generate right modifiers (stop at STOP)
(by the independence assumption)
Probabilities of internal rules 2
• Example:
P( S(bought,VBD)
→ NP(week,NN) NP(IBM,NNP) VP(bought,VBD) )
≈ Ph(VP | S, bought,VBD)                       ← generate head constituent
× Pl(NP(IBM,NNP) | S, bought,VBD, VP)          ← generate left modifiers
× Pl(NP(week,NN) | S, bought,VBD, VP)
× Pl(STOP | S, bought,VBD, VP)
× Pr(STOP | S, bought,VBD, VP)                 ← generate right modifiers
66
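A sketch of this decomposition, multiplying the three kinds of parameters for one internal rule (the dictionary keys for ph, pl, and pr are an assumed representation; no distance features or backoff):

def internal_rule_prob(parent, head_const, left_mods, right_mods, head, ph, pl, pr):
    """Probability of an internal rule X -> Ln .. L1 H R1 .. Rm, decomposed as
    Ph(H | X, h) * product of Pl(Li | X, h, H) * product of Pr(Ri | X, h, H),
    with STOP generated at each end.  Modifier lists are ordered closest-to-head first."""
    p = ph[(head_const, parent, head)]
    for li in left_mods + ["STOP"]:
        p *= pl[(li, parent, head, head_const)]
    for ri in right_mods + ["STOP"]:
        p *= pr[(ri, parent, head, head_const)]
    return p

# The slide's example (the parameter tables ph, pl, pr would come from treebank counts):
# internal_rule_prob("S", "VP", ["NP(IBM,NNP)", "NP(week,NN)"], [],
#                    ("bought", "VBD"), ph, pl, pr)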
67
• (Badly-named) “distance measure” to capture
properties of attachment relevant to current
modifier.
• Pl(Li(li) | X, h, H) becomes Pl(Li(li) | X, h, H, distancel(i – 1)),
and analogously on the right.
• The value of distancex is actually a pair of Boolean
random variables:
• Is the string of modifiers 1..(i – 1) of length 0?
I.e., is modifier i attached adjacent to the head?
• Does the string of modifiers 1..(i – 1) contain a verb?
I.e., does the attachment of modifier i cross a verb?
Adding other dependencies
Collins’s “model 1” 4
• Backs off …
• to tag probability when no data for specific word;
• to complete non-lexicalization when necessary.
68
69
Collins’s Models 2 and 3
• Model 2: Add verb subcategorization and
argument/adjunct distinction.
• Model 3: Integrate gaps and trace
identification into model.
• Especially important with addition of
subcategorization.
72
Results and conclusions 1
• Model 2 outperforms Model 1.
• Model 3: Similar performance, but identifies
traces too.
• Model 2 performs best overall:
• LP = 89.6, LR = 89.9 [sentences ≤ 100 words].
• LP = 90.1, LR = 90.4 [sentences ≤ 40 words].
• Rich information improves parsing
performance.
73
Results and conclusions 2
• Strengths:
• Incorporation of lexical and other linguistic
information.
• Competitive results.
• Weaknesses:
• Supervised training.
• Performance tightly linked to particular type of
corpus used.
74
Results and conclusions 3
• Importance to CL:
• High-performance parser showing benefits of
lexicalization and linguistic information.
• Publicly available, widely used in research.
• There was some initial hope that it would make
language models better, but that didn’t pan out.
• But it was fairly successful at giving us some
access to semantics, i.e., language modelling
makes parsing better.