
Ling 131A
Introduction to NLP with Python

Tagging

Marc Verhagen, Fall 2017

Contents

•  Final project questions?
•  Assignment 4 questions?
•  Language models
•  POS tagging

– Chain rule
– Bigram tagger
– Brill tagger

•  Train and test corpora; evaluation measures

Language Models
•  A set of rules or a statistical model that underlies some aspect of language

•  Bigram statistics can model what sentences look like
– What is the next word given the previous words in the sentence?

– What is the probability of a sentence?
– The probabilities of all sentences in a language must sum to 1.

•  The Mutual Information score that we saw a few weeks ago can model collocations

Language Models

•  Formal grammars (e.g. regular, context free)
give a hard binary model of all the legal
sentences in a language.

•  For NLP, a probabilistic model of a language
that gives a probability that a string is a
member of a language can be more useful.

What is POS-tagging?

•  Part of speech (POS): also known as lexical category or sometimes as word class

•  POS-tagging: the process of assigning part-of-speech tags to words in a corpus (automatically or manually)
– Usually one part-of-speech per word in that context
– Thus resolving some lexical ambiguity.

•  Tagset: the collection of tags for a particular task

Two scenarios

•  Given a POS-tagged corpus, identify linguistically interesting phenomena
– The Brown corpus, the Penn Treebank

•  Given raw text, automatically assign POS tags to all words (POS-tagging)

What you can do with POS tagged data

•  What is the most common POS tag?
•  What is the most frequent POS tag following another POS tag?

•  What is the most frequent verb/noun in a POS-tagged corpus?

•  Find the most ambiguous words in a POS-tagged corpus

•  Some of these questions may be part of assignment 5; a sketch for the first two follows
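
A minimal sketch for the first two questions, using the Brown news category (the variable names are just for illustration):

import nltk
from nltk.corpus import brown

# most common POS tag in the Brown news category
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
print(nltk.FreqDist(tags).max())

# most frequent tag following a given tag, e.g. 'NN'
tag_cfd = nltk.ConditionalFreqDist(nltk.bigrams(tags))
print(tag_cfd['NN'].max())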

Automatic POS-tagging

•  Two issues: ambiguity, unknown words
•  Approaches:

– Rule-based
•  Regular expressions

– Machine learning
•  Learning statistics (n-gram models)
•  Learning rules (transformation-based learning)

Part of speech (POS) ambiguity

Penn Treebank Tagset

Automatic POS-tagging: using context

•  Water
– Water the → verb
– The water → noun

•  Context is morphological, syntactic or semantic

Morphological context

•  Inflectional morphology
– Verb:
•  destroy, destroying, destroyed
– Noun:
•  destruction, destructions

•  Derivational morphology
– Noun:
•  destruction

Syntactic context

•  Verb:
– The bomb destroyed the building.
– He decided to water the plant.

•  Noun:
– The destruction of the building

Semantic context

•  Verb: action, activity
•  Noun: state, object, etc.
•  Not directly observable, therefore hard to exploit, but it's there:
– A noun in one language is usually translated into a noun in another language

Baselines and toplines

•  Baseline: what is the least you can do?
– Assigning the most frequent tag to all words
– Using regular expressions to exploit morphological information
– Assigning to each word its most likely tag

•  Topline: what is the best you can do?
– Using a combination of morphological, syntactic, and semantic context
– Determining the contribution of each type of context, and mixing them in the optimal way

Baseline #1

•  The Default Tagger assigns the same tag (the overall most likely tag, here 'NN') to every token

import nltk
from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news')
brown_sents = brown.sents(categories='news')

default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(brown_sents[10])
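
To put a number on this baseline, it can be scored against the gold-standard tags (a sketch; NLTK 3 taggers provide evaluate(), which later releases rename to accuracy()):

brown_tagged_sents = brown.tagged_sents(categories='news')
print(default_tagger.evaluate(brown_tagged_sents))   # fraction of tokens that really are 'NN'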

Baseline #2

•  The Regular Expression Tagger uses a couple of morphological rules (a sketch of the patterns follows the tag list below)

import nltk
from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news')
brown_sents = brown.sents(categories='news')

patterns = […]
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[10])

Baseline #2

•  Regular expression tagger rules
– VB: base form, e.g. 'go'
– VBZ: 3rd person singular present, e.g. 'goes'
– VBN: past participle, e.g. 'gone'
– VBG: gerund, e.g. 'going'
– VBD: simple past, e.g. 'went'
– VBP: non-3rd person singular present
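
The patterns list elided above might look roughly like this (a sketch in the style of the NLTK book, not the exact list from the slide):

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else defaults to noun
]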

Baseline #3

•  The Unigram Tagger tags a word token with its
most likely tag, regardless of the context

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

# training step
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

# running the tagger
unigram_tagger.tag(brown_sents[10])

Today

•  Final project questions?
•  Assignment 4 questions?
•  POS tagging

– Chain rule
– Bigram tagger
– Brill tagger

N-Gram Tagging

•  Based on statistics derived from a set of n-grams (unigrams, bigrams, trigrams).

•  Bigram example:
– Given the frequencies of
•  water tagged as a noun after a determiner
•  water tagged as a verb after a determiner

– What is the most likely tag for water if it follows "the"?
N-Gram Model Formulas

•  Chain rule of probability

   P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})

•  Bigram approximation

   P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_{k-1})

Chain Rule

•  Calculating the probability of a sequence of words given conditional probabilities

•  Conditional probability
– a measure of the probability of an event given that another event has occurred

– P(cat|the) =
  the probability of "cat" if we already have "the"

– P(the) · P(cat|the) =
  the probability of "the cat"

Chain Rule Formula

Chain Rule Example

•  Input: "Play it again Sam"
•  What is the probability that

– "it" will follow "play"?
– "again" will follow "play it"?
– "Sam" will follow "play it again"?
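
Multiplying these conditional probabilities gives the chain-rule decomposition of the whole sentence:

   P(play it again Sam) = P(play) · P(it | play) · P(again | play it) · P(Sam | play it again)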

Chain Rule Example

Chain Rule Example

Beyond words

•  We used w_k to stand in for just the word
•  Instead we could use something like ⟨word_k, tag_k⟩, that is, a pair of a word and its tag

•  For example
– instead of looking for the probability of "cat" following "the"
– we can look for the probability of the noun "cat" following the determiner "the"

•  Or we can use any combination of features

Problems

•  Needs a very large amount of memory
•  Combinatorics: storing the probabilities of words given any possible preceding sequence is O(2^n)

N-Gram Models
•  With the full chain rule we estimate the probability of each word given all of the prior context.
•  An N-gram model uses only N-1 words of prior context.
–  Unigram: P(sleeps)
–  Bigram: P(sleeps | cat)
–  Trigram: P(sleeps | the cat)

•  Markov assumption: the future behavior of a system only depends on its recent history
–  in a kth-order Markov model, the next state only depends on the k most recent states (an N-gram model is an (N-1)th-order Markov model)

Bigram Model

•  The probability of the sequence ABCD using all context is
– P(A) · P(B|A) · P(C|AB) · P(D|ABC)

•  But with the Markov assumption you can throw out your long-term memory
– P(A) · P(B|A) · P(C|B) · P(D|C)
– P(play it again sam)
  = P(play) · P(it|play) · P(again|it) · P(sam|again)
  = P(play|<s>) · P(it|play) · P(again|it) · P(sam|again)

Estimating Probabilities
•  N-gram conditional probabilities can be estimated from

raw text based on the relative frequency of word
sequences.

To have a consistent probabilistic model, append a unique start symbol (<s>) and end symbol (</s>) to every sentence and treat these as additional words.

Bigram:

   P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}

N-gram:

   P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}
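
A sketch of estimating bigram relative frequencies with NLTK's frequency distributions (variable names and the example word pair are just for illustration; a full model would also insert <s> and </s> around each sentence):

import nltk
from nltk.corpus import brown

words = [w.lower() for w in brown.words(categories='news')]
bigram_cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
unigram_fd = nltk.FreqDist(words)

# maximum-likelihood estimate P(w_n | w_n-1) = C(w_n-1 w_n) / C(w_n-1)
def bigram_prob(prev, word):
    return bigram_cfd[prev][word] / unigram_fd[prev]

print(bigram_prob('the', 'jury'))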

Examples

•  P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
= .25 x .33 x .0011 x .5 x .68 = .000031

•  P(<s> i want chinese food </s>)
= P(i | <s>) P(want | i) P(chinese | want) P(food | chinese) P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019

Unknown Words

•  How to handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?

•  Train a model that includes an explicit symbol for an unknown word (<UNK>).
– Choose a vocabulary in advance and replace all other words in the training corpus with <UNK>.
– Or: replace the first occurrence of each word in the training data with <UNK>.
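
A minimal sketch of the second strategy (the function name and unknown symbol are illustrative):

def replace_first_occurrences(tokens, unk='<UNK>'):
    """Replace the first occurrence of every word type with the unknown symbol."""
    seen = set()
    out = []
    for tok in tokens:
        out.append(tok if tok in seen else unk)
        seen.add(tok)
    return out

print(replace_first_occurrences(['the', 'cat', 'saw', 'the', 'dog']))
# ['<UNK>', '<UNK>', '<UNK>', 'the', '<UNK>']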

Smoothing

•  Many rare but possible combinations never occur in training (sparse data), so many parameters are zero
–  If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero

•  In practice, parameters are smoothed (regularized) to reassign some probability mass to unseen events.

•  Adding probability mass to unseen events requires removing it from seen ones in order to maintain a joint distribution that sums to 1.

Laplace Smoothing

•  Aka "Add-One Smoothing"
•  "Hallucinate" additional training data in which each possible N-gram occurs exactly once and adjust estimates accordingly.

Bigram:

   P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}

V is the total number of possible (N-1)-grams
(i.e. the vocabulary size for a bigram model).
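
A sketch of the add-one estimate on top of raw counts (illustrative names; NLTK's probability module also provides a LaplaceProbDist built on the same idea):

import nltk
from nltk.corpus import brown

words = [w.lower() for w in brown.words(categories='news')]
bigram_cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
unigram_fd = nltk.FreqDist(words)
V = len(unigram_fd)   # vocabulary size

def laplace_bigram_prob(prev, word):
    # (C(prev word) + 1) / (C(prev) + V)
    return (bigram_cfd[prev][word] + 1) / (unigram_fd[prev] + V)

print(laplace_bigram_prob('the', 'unicorn'))   # nonzero even for an unseen bigram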

Backoff

•  Use a simpler model for those cases where your more complicated (yet better) model has nothing to say

But wait a minute…
none of this was about tagging

Context

•  Unigram tagging
– Uses one item of context: the word itself.

•  N-Gram tagging
– The context is the current word together with the POS tags of the n-1 preceding tokens

(Note: the context can include more than just the tags themselves)

Condi0onal probability

•  P(tag_i | tag_{i-1}, w_i)
•  So we are looking for the probability of tag_i, given that the previous tag was tag_{i-1} and the word itself is w_i
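
This is essentially the context that NLTK's BigramTagger uses (the previous tag plus the current word). A minimal sketch, chaining in backoff taggers because a pure bigram tagger has nothing to say about unseen contexts:

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')

# each tagger hands contexts it has never seen to its backoff tagger
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)

print(t2.tag(brown.sents(categories='news')[10]))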

Drawbacks of N-Gram approaches

•  Conditioned on just the previous tags, even though word tokens can be useful as well
– Conditioning on word tokens is unrealistic

•  Size of the n-gram tables (bigram and trigram tables) can be large, and the n-grams are hard to interpret

•  Transformation-based learning addresses these issues

Transformation-based Tagging

•  Transformation-Based Error-Driven Learning (TBL)
– Aka the Brill tagger, after its inventor Eric Brill
– A simple rule-based approach to automated learning of linguistic knowledge

•  Idea:
– first solve a problem with a simple technique
– then apply transformations to correct mistakes

Tagging with transformation rules

Phrase:    to   increase   grants   to   states   for   vocational   rehabilitation
Unigram:   TO   NN         NNS      TO   NNS      IN    JJ           NN
Rule 1:         VB
Rule 2:                             IN
Output:    TO   VB         NNS      IN   NNS      IN    JJ           NN
Gold:      TO   VB         NNS      IN   NNS      IN    JJ           NN

Tagging with transformation rules

Sample Rule Templates

Templates vs rules

•  A rule is an instance of a template
•  Template: alter(A, B, prevtag(C))
•  Rules:

– alter(modal, noun, prevtag(article))
– alter(noun, verb, prevtag(to))
– alter(verb, noun, prevtag(article))
– alter(verb, noun, prevtag(adjective))

Is there a way to learn these rules automatically?

Two Components in a Transformation

•  A rewrite rule (Action)
–  e.g. change the tag from modal to noun

•  A triggering environment (Condition)
–  e.g. the preceding word is a determiner

The/det can/modal rusted/verb ./.
→
The/det can/noun rusted/verb ./.

Transformation-Based Error-Driven Learning

[Diagram: unannotated text → initial-state annotator → annotated text; the learner compares the annotated text against the truth and produces rules]

Learning procedure

•  Initial tagger:
– Assign the most likely tag to each word
– Decide the most likely tag based on a training corpus
– Tag a test corpus with this initial tagger

•  Transformation
– Write some rules based on the templates
– Use the training corpus to spot patterns
– Apply the transformations to the test corpus

•  Evaluate your accuracy
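
NLTK ships an implementation of this learning loop; the sketch below assumes NLTK 3's nltk.tag.brill_trainer module, a stock template set, and a unigram initial tagger:

import nltk
from nltk.corpus import brown
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = brown.tagged_sents(categories='news')[:2000]

# initial-state annotator: most-likely-tag tagger with an NN default
initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

# learn a handful of transformation rules from the training data
trainer = BrillTaggerTrainer(initial, fntbl37(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

for rule in brill_tagger.rules():
    print(rule)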

Train and Test Corpora
•  A language model must be trained on a large corpus of text.
•  A model can be evaluated based on its ability to generate good results on a held-out test corpus
–  testing on the training corpus gives an optimistically biased estimate

•  The training and test corpus should be representative of the actual application data.
– You may need to adapt a general model to a small amount of new in-domain data by adding a highly weighted small corpus to the original training data.
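
A common way to set this up in NLTK (a sketch; the 90/10 split is just an example, and newer NLTK versions call evaluate() accuracy()):

import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
print(tagger.evaluate(test_sents))   # accuracy on held-out data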

Evaluation Measures

•  Accuracy: percentage of correct tags
•  Precision, recall and f-measure

– FP – false positives
– TP – true positives
– FN – false negatives

Token:    John   lives   in   London
Gold:     ENT    –       –    ENT
System:   –      ENT     –    ENT
Result:   FN     FP      TN   TP

Evaluation Measures

•  Precision = TP / (TP + FP)
•  Recall = TP / (TP + FN)
•  F-measure = 2 · (P · R) / (P + R)

– Harmonic mean of P and R
– Penalizes the score if either P or R is pretty bad
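
These reduce to a few lines of arithmetic; a sketch using the counts from the John/lives/in/London example above (TP=1, FP=1, FN=1):

tp, fp, fn = 1, 1, 1

precision = tp / (tp + fp)                                  # 0.5
recall = tp / (tp + fn)                                     # 0.5
f_measure = 2 * precision * recall / (precision + recall)   # 0.5
print(precision, recall, f_measure)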