15-tagging.pptx
Ling 131A
Introduction to NLP with Python
Tagging
Marc Verhagen, Fall 2017
Contents
• Final project questions?
• Assignment 4 questions?
• Language models
• POS tagging
– Chain rule
– Bigram tagger
– Brill tagger
• Train and test corpora; evaluation measures
Language Models
• A set of rules or a statistical model that underlies
some aspect of language
• Bigram statistics can model what sentences look
like
– What is the next word given the previous words in the
sentence?
– What is the probability of a sentence? (see the sketch
below)
– The probabilities of all sentences in a language must
sum to 1.
• The Mutual Information score that we saw a few
weeks ago can model collocations
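A minimal sketch of collecting such bigram statistics with NLTK; the corpus choice and variable names are illustrative, not from the slides:
import nltk
from nltk.corpus import brown
# count (previous word, next word) pairs in the news portion of the Brown corpus
words = [w.lower() for w in brown.words(categories='news')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
# the most likely word to follow 'the', according to these bigram counts
print(cfd['the'].max())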
Language Models
• Formal grammars (e.g. regular, context free)
give a hard binary model of all the legal
sentences in a language.
• For NLP, a probabilistic model of a language
that gives a probability that a string is a
member of a language can be more useful.
What is POS-tagging?
• Part of speech (POS): also known as lexical
category or sometimes as word class
• POS-tagging: the process of assigning part-of-
speech tags to words in a corpus (automatically
or manually)
– Usually one part-of-speech per word in that context
– Thus resolving some lexical ambiguity.
• Tagset: the collection of tags for a particular task
Two scenarios
• Given a POS-tagged corpus, identify
linguistically interesting phenomena
– The Brown corpus, the Penn Treebank
• Given raw text, automatically assign POS tags
to all words (POS-tagging)
What you can do with POS tagged data
• What is the most common POS tag?
• What is the most frequent POS tag following
another POS tag?
• What is the most frequent verb/noun in a
POS-tagged corpus?
• Find the most ambiguous words in a POS-
tagged corpus
• Some of these questions may be part of
assignment 5 (a sketch for the first two questions
follows below)
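A possible sketch for the first two questions, using NLTK and the Brown news data; the corpus choice and the 'AT' tag are illustrative:
import nltk
from nltk.corpus import brown
tagged = brown.tagged_words(categories='news')
# Q1: the most common POS tag
tag_fd = nltk.FreqDist(tag for (word, tag) in tagged)
print(tag_fd.max())
# Q2: the most frequent tag following another tag, e.g. after 'AT' (article)
cfd = nltk.ConditionalFreqDist(nltk.bigrams(tag for (word, tag) in tagged))
print(cfd['AT'].max())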
Automatic POS-tagging
• Two issues: ambiguity, unknown words
• Approaches:
– Rule-based
• Regular expressions
– Machine learning
• Learning statistics (n-gram models)
• Learning rules (transformation-based learning)
Part of speech (POS) ambiguity
Penn Treebank Tagset
Automatic POS-tagging: using context
• Water
– Water the → verb
– The water → noun
• Context is morphological, syntactic, or
semantic
Morphological context
• Inflectional morphology
– Verb:
• destroy, destroying, destroyed
– Noun:
• destruction, destructions
• Derivational morphology
– Noun:
• destruction
Syntactic context
• Verb:
– The bomb destroyed the building.
– He decided to water the plant.
• Noun:
– The destruction of the building.
Semantic context
• Verb: action, activity
• Noun: state, object, etc.
• Not directly observable, therefore hard to
exploit, but it’s there:
– A noun in one language is usually translated into a
noun in another language
Baselines and toplines
• Baseline: what is the least you can do?
– Assigning the most frequent tag to all words
– Using regular expressions to exploit morphological
information
– Assigning to each word its most likely tag
• Topline: what is the best you can do?
– Use a combination of morphological, syntactic, and
semantic context
– Determine the contribution of each type of context,
and combine them in the optimal way
Baseline #1
• The Default Tagger assigns the same tag (here the
overall most frequent tag, NN) to every token
import nltk
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news')
brown_sents = brown.sents(categories='news')
default_tagger = nltk.DefaultTagger('NN')
default_tagger.tag(brown_sents[10])
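A quick way to score this baseline against the gold-standard tagged sentences; evaluate() is called accuracy() in newer NLTK releases:
brown_tagged_sents = brown.tagged_sents(categories='news')
print(default_tagger.evaluate(brown_tagged_sents))   # roughly 0.13 on Brown news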
Baseline #2
• The Regular Expression Tagger uses a couple
of morphological rules
import nltk
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news')
brown_sents = brown.sents(categories='news')
patterns = [...]   # a list of (regular expression, tag) pairs
regexp_tagger = nltk.RegexpTagger(patterns)
regexp_tagger.tag(brown_sents[10])
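The pattern list is elided on the slide; a possible fill, along the lines of the example in the NLTK book (the exact patterns used in class may differ):
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else: default to noun
]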
Baseline #2
• Regular expression tagger rules
– VB: base form, e.g. 'go'
– VBZ: 3rd person singular present, e.g. 'goes'
– VBN: past participle, e.g. 'gone'
– VBG: gerund, e.g. 'going'
– VBD: simple past, e.g. 'went'
– VBP: non-3rd person singular present
Baseline #3
• The Unigram Tagger tags a word token with its
most likely tag, regardless of the context
import nltk
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')
# training step: learn each word's most frequent tag from the tagged sentences
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
# running the tagger
unigram_tagger.tag(brown_sents[10])
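Continuing the code above: evaluating on the very data the tagger was trained on gives an inflated score (around 0.93 on Brown news); a proper train/test split comes up at the end of these slides.
# optimistic: testing on the training data itself
print(unigram_tagger.evaluate(brown_tagged_sents))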
Today
• Final project questions?
• Assignment 4 questions?
• POS tagging
– Chain rule
– Bigram tagger
– Brill tagger
N-Gram Tagging
• Based on statistics derived from a set of
n-grams (unigrams, bigrams, trigrams).
• Bigram example:
– Given the frequencies of tag–word bigrams in a
tagged corpus
– What is the most likely tag for water given the tag
of the word it follows?
N-Gram Model Formulas
• Chain rule of probability
P(w_1 … w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) … P(w_n|w_1 … w_n-1)
             = ∏ k=1..n P(w_k | w_1 … w_k-1)
• Bigram approximation
P(w_1 … w_n) ≈ ∏ k=1..n P(w_k | w_k-1)
Chain Rule
• Calculating the probability of a sequence of
words given conditional probabilities
• Conditional probability
– measure of the probability of an event given that
another event has occurred
– P(cat|the) =
the probability of “cat” if we already have “the”
– P(the) . P(cat|the) =
the probability of “the cat”
Chain Rule Formula
P(w_1 … w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) … P(w_n|w_1 … w_n-1)
Chain Rule Example
• Input: “Play it again Sam”
• What is the probability that
– ”it” will follow “play”?
– ”again” will follow “play it”?
– ”Sam” will follow “play it again”?
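Putting the three questions together, the chain rule decomposes the probability of the whole phrase (sentence-boundary symbols left out for the moment):
P(play it again Sam)
= P(play) . P(it|play) . P(again|play it) . P(Sam|play it again)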
Beyond words
• We used w_k to stand in for just the word
• Instead we could use something like ⟨w_k, t_k⟩,
that is, a pair of a word and its tag
• For example
– instead of looking for the probability of “cat”
following “the”
– we can look for the probability of the noun “cat”
following the determiner “the”
• Or we can use any combination of features
Problems
• Needs a lot of memory
• The number of parameters for storing the probabilities
of words given any preceding sequence grows
exponentially, O(2^n)
N-Gram Models
• With the regular chain rule we estimate the probability
of each word given prior context.
• An N-gram model uses only N-1 words of prior
context.
– Unigram: P(sleeps)
– Bigram: P(sleeps | cat)
– Trigram: P(sleeps | the cat)
• Markov assumption: the future behavior of a system
only depends on its recent history
– in a kth-order Markov model, the next state only depends
on the k most recent states (an N-gram model is a (N-1)-
order Markov model)
Bigram Model
• The probability of the sequence ABCD using all
context is
– P(A) . P(B|A) . P(C|AB) . P(D|ABC)
• But with the Markov assumption you can
throw out your long-term memory
– P(A) . P(B|A) . P(C|B) . P(D|C)
– P(play it again Sam)
= P(play) . P(it|play) . P(again|it) . P(Sam|again)
= P(play|<s>) . P(it|play) . P(again|it) . P(Sam|again),
with an explicit start-of-sentence symbol <s>
Estimating Probabilities
• N-gram conditional probabilities can be estimated from
raw text based on the relative frequency of word
sequences.
• To have a consistent probabilistic model, append a unique
start (<s>) and end (</s>) symbol to every sentence and treat
these as additional words.
Bigram:
P(w_n | w_n-1) = C(w_n-1 w_n) / C(w_n-1)
N-gram:
P(w_n | w_n-N+1 … w_n-1) = C(w_n-N+1 … w_n-1 w_n) / C(w_n-N+1 … w_n-1)
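A minimal sketch of these relative-frequency estimates in Python; the toy corpus and the <s>/</s> padding are illustrative:
import nltk
corpus = [['i', 'want', 'english', 'food'],
          ['i', 'want', 'chinese', 'food']]
# pad each sentence with start and end symbols, then count unigrams and bigrams
padded = [['<s>'] + sent + ['</s>'] for sent in corpus]
bigram_counts = nltk.FreqDist(bg for sent in padded for bg in nltk.bigrams(sent))
unigram_counts = nltk.FreqDist(w for sent in padded for w in sent)
def p_bigram(w_prev, w):
    # P(w_n | w_n-1) = C(w_n-1 w_n) / C(w_n-1)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
print(p_bigram('<s>', 'i'))   # 1.0 in this tiny corpus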
Examples
• P(<s> i want english food </s>)
= P(i | <s>) P(want | i) P(english | want)
P(food | english) P(</s> | food)
= .25 x .33 x .0011 x .5 x .68 = .000031
• P(<s> i want chinese food </s>)
= P(i | <s>) P(want | i) P(chinese | want)
P(food | chinese) P(</s> | food)
= .25 x .33 x .0065 x .52 x .68 = .00019
Unknown Words
• How to handle words in the test corpus that did
not occur in the training data, i.e. out of
vocabulary (OOV) words?
• Train a model that includes an explicit symbol
for an unknown word (<UNK>)
– Choose a vocabulary in advance and replace other
words in the training corpus with <UNK>
– Replace the first occurrence of each word in the
training data with <UNK>
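A sketch of the first strategy, with the vocabulary chosen by a frequency threshold; the function name, threshold, and symbol are illustrative:
from collections import Counter
def replace_oov(train_sents, min_count=2, unk='<UNK>'):
    # keep words seen at least min_count times; map everything else to <UNK>
    counts = Counter(w for sent in train_sents for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else unk for w in sent] for sent in train_sents]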
Smoothing
• Many rare but possible combinations never occur in
training (sparse data), so many parameters are zero
– If a new combination occurs during testing, it is given a
probability of zero and the entire sequence gets a
probability of zero
• In practice, parameters are smoothed (regularized) to
reassign some probability mass to unseen events.
• Adding probability mass to unseen events requires
removing it from seen ones in order to maintain a joint
distribution that sums to 1.
Laplace Smoothing
• Aka “Add-One Smoothing”
• “Hallucinate” additional training data in which
each possible N-gram occurs exactly once and
adjust estimates accordingly.
Bigram:
P(w_n | w_n-1) = (C(w_n-1 w_n) + 1) / (C(w_n-1) + V)
V is the total number of possible (N-1)-grams
(i.e. the vocabulary size for a bigram model).
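Building on the estimation sketch from a few slides back (bigram_counts and unigram_counts are assumed from there):
V = len(unigram_counts)   # vocabulary size, including <s> and </s>
def p_bigram_laplace(w_prev, w):
    # add-one smoothing: pretend every possible bigram was seen once more
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)
print(p_bigram_laplace('i', 'sleep'))   # nonzero even though 'i sleep' was never seen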
Backoff
• Use a simpler model for those cases where
your more complicated (yet better) model has
nothing to say
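The same idea shows up in NLTK's tagger API as the backoff parameter, which chains a more specific tagger to a simpler one; a sketch using the Brown news data:
import nltk
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(brown_tagged_sents, backoff=t0)
t2 = nltk.BigramTagger(brown_tagged_sents, backoff=t1)
# t2 backs off to t1 (and t1 to t0) whenever it has no statistics for a context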
But wait a minute…
none of this was about tagging
Context
• Unigram tagging
– Using one item of context, the word itself.
• N-Gram tagging
– context is the current word together with the POS
tags of the n-1 preceding tokens
– (note: the context can include more than just the
tag itself)
Conditional probability
• P(tag_i | tag_i-1, w_i)
• So we are looking for the probability of tag_i
given that the previous tag was tag_i-1 and the
word itself is w_i
Drawbacks of N-Gram approaches
• Conditioned on just the previous tags, even
though word tokens can be useful as well
– Conditioning on word tokens is unrealistic
• Size of the n-grams (bigram and trigram
tables) can be large, and the n-grams are hard
to interpret
• Transformation-based learning addresses
these issues
Transformation-based Tagging
• Transformation-Based Error-Driven Learning
(TBL)
– Aka the Brill tagger, after its inventor Eric Brill
– a simple rule-based approach to automated
learning of linguistic knowledge
• Idea:
– first solve a problem with a simple technique
– then apply transformations to correct mistakes
Tagging with transformation rules
Phrase:   to  increase  grants  to  states  for  vocational  rehabilitation
Unigram:  TO  NN        NNS     TO  NNS     IN   JJ          NN
Rule 1:       VB
Rule 2:                         IN
Output:   TO  VB        NNS     IN  NNS     IN   JJ          NN
Gold:     TO  VB        NNS     IN  NNS     IN   JJ          NN
Sample Rule Templates
Templates vs rules
• A rule is an instance of a template
• Template: alter (A, B, prevtag(C))
• Rules:
– alter (modal, noun, prevtag(article))
– alter (noun, verb, prevtag(to))
– alter (verb, noun, prevtag(article))
– alter (verb, noun, prevtag(adjective))
Is there a way to learn these rules automatically?
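Yes; NLTK ships an implementation of Brill's learner. A sketch assuming NLTK 3's brill_trainer module, with the fntbl37 template set and a unigram initial tagger (one possible configuration, not necessarily the one used in class):
import nltk
from nltk.corpus import brown
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer
train_sents = brown.tagged_sents(categories='news')[:2000]
initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
trainer = BrillTaggerTrainer(initial, fntbl37())
brill_tagger = trainer.train(train_sents, max_rules=10)
print(brill_tagger.rules()[:3])   # a few of the learned transformation rules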
Two Components in a Transformation
• A rewrite rule (Action)
– ex. Change the tag from modal to noun
• A triggering environment (Condition)
– ex. The preceding word is a determiner
The/det can/modal rusted/verb ./.
to
The/det can/noun rusted/verb ./.
Transformation-Based Error-Driven Learning
[Diagram: unannotated text is tagged by the initial-state annotator to produce annotated
text; the learner compares the annotated text to the truth and outputs rules]
Learning procedure
• Initial tagger:
– Assign the most likely tag to each word
– Decide the most likely tag based on a training
corpus
– Tag a test corpus with this initial tagger
• Transformations:
– Write some rules based on the templates
– Use the training corpus to spot patterns
– Apply transformations to the test corpus
• Evaluate your accuracy
Train and Test Corpora
• A language model must be trained on a large
corpus of text.
• Model can be evaluated based on its ability to
generate good results on a held-out test corpus
– testing on the training corpus gives an optimistically
biased estimate
• Training and test corpus should be representative
of the actual application data.
– May need to adapt a general model to a small amount
of new in-domain data by adding the highly weighted
small corpus to the original training data.
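A sketch of the held-out split described above, using the Brown news data (the 90/10 proportion is just a common choice):
import nltk
from nltk.corpus import brown
tagged_sents = brown.tagged_sents(categories='news')
cutoff = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:cutoff]    # first 90% for training
test_sents = tagged_sents[cutoff:]     # held-out 10% for evaluation
tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
print(tagger.evaluate(test_sents))     # accuracy on unseen data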
Evaluation Measures
• Accuracy: percentage of correct tags
• Precision, recall and f-measure
– FP – false positives
– TP – true positives
– FN – false negatives
Token:   John   lives   in    London
Gold:    ENT    –       –     ENT
System:  –      ENT     –     ENT
Result:  FN     FP      TN    TP
Evaluation Measures
• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F-measure = 2 . (P . R) / (P + R)
– Harmonic mean of precision and recall
– Low if either P or R is low
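Plugging in the counts from the John/lives/in/London example above (TP = 1, FP = 1, FN = 1):
Precision = 1 / (1 + 1) = 0.5
Recall = 1 / (1 + 1) = 0.5
F-measure = 2 . (0.5 . 0.5) / (0.5 + 0.5) = 0.5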