Part of Speech Tagging
COMP90042 Natural Language Processing, Lecture 5
COPYRIGHT 2020, THE UNIVERSITY OF MELBOURNE
Assignments
• 2 assignments (down from 3)
• 20% of subject (no change)
• 1st assignment will be released in week 4
Workshops
• Online workshops available till week 12
• Workshop slides by tutors:
  ‣ Modules > Workshops > Workshop Slides
Correction on Lecture 3, Page 22
What is Part-of-Speech (POS)?
• AKA word classes, morphological classes, syntactic categories
• Nouns, verbs, adjectives, etc.
• POS tells us quite a bit about a word and its neighbours:
  ‣ nouns are often preceded by determiners
  ‣ verbs preceded by nouns
  ‣ content as a noun pronounced as CONtent
  ‣ content as an adjective pronounced as conTENT
Authorship Attribution Revisited
• Training data:
  ‣ “The lawyer convinced the jury.” → Sam
  ‣ “Ruby travelled around Australia.” → Sam
  ‣ “The hospital was cleaned by the janitor.” → Max
  ‣ “Lunch was served at 12pm.” → Max
• “The bookstore was opened by the manager.” → ?
• Similar structure (passive voice)
  ‣ Not captured by simple BOW representations
• How to ensure a computer knows/learns this?
Information Extraction
• Given this:
  ‣ “Brasilia, the Brazilian capital, was founded in 1960.”
• Obtain this:
  ‣ capital(Brazil, Brasilia)
  ‣ founded(Brasilia, 1960)
• Many steps involved, but first we need to know the nouns (Brasilia, capital), adjectives (Brazilian), verbs (founded) and numbers (1960).
Outline
• Parts of speech, tagsets
• Automatic tagging
POS Open Classes
Open vs closed classes: how readily do POS categories take on new words? Just a few open classes:
• Nouns
  ‣ Proper (Australia) versus common (wombat)
  ‣ Mass (rice) versus count (bowls)
• Verbs
  ‣ Rich inflection (go/goes/going/gone/went)
  ‣ Auxiliary verbs (be, have, and do in English)
  ‣ Transitivity (wait versus hit versus give): number of arguments
POS Open Classes (cont.)
• Adjectives
  ‣ Gradable (happy) versus non-gradable (computational)
• Adverbs
  ‣ Manner (slowly)
  ‣ Locative (here)
  ‣ Degree (really)
  ‣ Temporal (yesterday)
POS Closed Classes (English)
• Prepositions (in, on, with, for, of, over, …)
  ‣ on the table
• Particles
  ‣ brushed himself off
• Determiners
  ‣ Articles (a, an, the)
  ‣ Demonstratives (this, that, these, those)
  ‣ Quantifiers (each, every, some, two, …)
• Pronouns
  ‣ Personal (I, me, she, …)
  ‣ Possessive (my, our, …)
  ‣ Interrogative or Wh (who, what, …)
POS Closed Classes (English, cont.)
• Conjunctions
  ‣ Coordinating (and, or, but)
  ‣ Subordinating (if, although, that, …)
• Modal verbs
  ‣ Ability (can, could)
  ‣ Permission (can, may)
  ‣ Possibility (may, might, could, will)
  ‣ Necessity (must)
• And some more…
  ‣ negatives, politeness markers, etc.
Ambiguity
• Many word types belong to multiple classes
• Compare:
  ‣ Time flies like an arrow
  ‣ Fruit flies like a banana

  Time   flies  like         an          arrow
  noun   verb   preposition  determiner  noun

  Fruit  flies  like  a           banana
  noun   noun   verb  determiner  noun
POS Ambiguity in News Headlines
• British Left Waffles on Falkland Islands
• Juvenile Court to Try Shooting Defendant
• Teachers Strike Idle Kids
• Eye Drops Off Shelf
POS Ambiguity in News Headlines (bracketed readings)
• [British Left] [Waffles] [on] [Falkland Islands]
• [Juvenile Court] [to] [Try] [Shooting Defendant]
• [Teachers Strike] [Idle Kids]
• [Eye Drops] [Off Shelf]
Tagsets
• A compact representation of POS information
  ‣ Usually ≤ 4 capitalized characters
  ‣ Often includes inflectional distinctions
• Major English tagsets
  ‣ Brown (87 tags)
  ‣ Penn Treebank (45 tags)
  ‣ CLAWS/BNC (61 tags)
  ‣ “Universal” (12 tags)
• At least one tagset for all major languages
Major Penn Treebank Tags
NN  noun           VB   verb
JJ  adjective      RB   adverb
DT  determiner     CD   cardinal number
IN  preposition    PRP  personal pronoun
MD  modal          CC   coordinating conjunction
RP  particle       WH   wh-pronoun
TO  to
Penn Treebank Derived Tags
NN: NNS (plural, wombats), NNP (proper, Australia), NNPS (proper plural, Australians)
VB: VB (infinitive, eat), VBP (1st/2nd person present, eat), VBZ (3rd person singular, eats), VBD (past tense, ate), VBG (gerund, eating), VBN (past participle, eaten)
JJ: JJR (comparative, nicer), JJS (superlative, nicest)
RB: RBR (comparative, faster), RBS (superlative, fastest)
PRP: PRP$ (possessive, my)
WH: WH$ (possessive, whose), WDT (wh-determiner, who), WRB (wh-adverb, where)
Tagged Text Example
The/DT limits/NNS to/TO legal/JJ absurdity/NN stretched/VBD another/DT notch/NN this/DT week/NN when/WRB the/DT Supreme/NNP Court/NNP refused/VBD to/TO hear/VB an/DT appeal/VB from/IN a/DT case/NN that/WDT says/VBZ corporate/JJ defendants/NNS must/MD pay/VB damages/NNS even/RB after/IN proving/VBG that/IN they/PRP could/MD not/RB possibly/RB have/VB caused/VBN the/DT harm/NN ./.
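Text in this word/TAG format is easy to parse programmatically. A minimal sketch (splitting each token on its last slash, so the final ./. token parses correctly):

```python
text = "The/DT limits/NNS to/TO legal/JJ absurdity/NN stretched/VBD ./."

# split each "word/TAG" token on its LAST slash, so "./." still parses
pairs = [tok.rsplit("/", 1) for tok in text.split()]
print(pairs[0])   # ['The', 'DT']
print(pairs[-1])  # ['.', '.']
```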
Why Automatically POS Tag?
• Important for morphological analysis, e.g. lemmatisation
• For some applications, we want to focus on certain POS
  ‣ E.g. nouns are important for information retrieval, adjectives for sentiment analysis
• Very useful features for certain classification tasks
  ‣ E.g. genre classification
• POS tags can offer word sense disambiguation
  ‣ E.g. cross/NN vs cross/VB vs cross/JJ
• Can use them to create larger structures (parsing)
Automatic Taggers
• Rule-based taggers
• Statistical taggers
  ‣ Unigram tagger
  ‣ Classifier-based taggers
  ‣ Hidden Markov Model (HMM) taggers
Rule-based Tagging
• Typically starts with a list of possible tags for each word
  ‣ From a lexical resource, or a corpus
• Often includes other lexical information, e.g. verb subcategorisation (its arguments)
• Apply rules to narrow down to a single tag
  ‣ E.g. if DT comes before word, then eliminate VB
  ‣ Relies on some unambiguous contexts
• Large systems have 1000s of constraints
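The narrowing-down process can be sketched in a few lines: start from all possible tags per word, then apply elimination rules. This toy implements only the single DT-eliminates-VB rule from the slide; the lexicon and function names are illustrative, not from a real system.

```python
def rule_based_tag(words, lexicon):
    """Constraint-based narrowing: begin with every possible tag for each
    word, then eliminate tags that rules rule out in context."""
    # unknown words default to noun as the most likely open class
    candidates = [set(lexicon.get(w.lower(), {"NN"})) for w in words]
    for i in range(1, len(words)):
        # rule: a word right after an unambiguous determiner cannot be
        # a base-form verb (VB), e.g. "the cross" is not a verb
        if candidates[i - 1] == {"DT"} and len(candidates[i]) > 1:
            candidates[i].discard("VB")
    return candidates

# toy lexicon: "cross" is three-ways ambiguous until context narrows it
lex = {"the": {"DT"}, "cross": {"NN", "VB", "JJ"}}
print(rule_based_tag(["the", "cross"], lex)[1])  # VB eliminated
```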
Unigram Tagger
• Assign most common tag to each word type
• Requires a corpus of tagged words
• “Model” is just a look-up table
• But actually quite good, ~90% accuracy
  ‣ Correctly resolves about 75% of ambiguity
• Often considered the baseline for more complex approaches
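Because the “model” is just a look-up table, a unigram tagger fits in a few lines. A minimal sketch with a tiny hand-made corpus (the data here is illustrative):

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_corpus):
    """Build a word -> most-frequent-tag look-up table from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    # for each word type, keep only its single most common tag
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

# toy corpus: "flies" is ambiguous, seen twice as VBZ and once as NNS
corpus = [("time", "NN"), ("flies", "VBZ"), ("flies", "NNS"),
          ("flies", "VBZ"), ("like", "IN")]
table = train_unigram_tagger(corpus)
print(table["flies"])  # VBZ
```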
Classifier-Based Tagging
• Use a standard discriminative classifier (e.g. logistic regression, neural network), with features:
  ‣ Target word
  ‣ Lexical context around the word
  ‣ Already classified tags in sentence
• Among the best sequential models
  ‣ But can suffer from error propagation: wrong predictions from previous steps affect the next ones
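The three feature types above can be sketched as a feature-extraction function to feed any off-the-shelf classifier. The feature names and sentinel markers here are illustrative choices, not a fixed standard:

```python
def extract_features(words, i, prev_tags):
    """Features for tagging words[i]: the target word, lexical context,
    and the tag already predicted for the previous word."""
    return {
        "word": words[i].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "prev_tag": prev_tags[-1] if prev_tags else "<s>",
    }

# tagging "flies" after "Time" has already been tagged NN
feats = extract_features(["Time", "flies", "like", "an", "arrow"], 1, ["NN"])
print(feats)
```

Note that because `prev_tag` comes from the model's own earlier predictions at test time, a wrong earlier tag feeds a wrong feature forward, which is exactly the error propagation mentioned above.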
Hidden Markov Models
• A basic sequential (or structured) model
• Like sequential classifiers, use both previous tag and lexical evidence
• Unlike classifiers, treat previous-tag evidence and lexical evidence as independent of each other
  ‣ Less sparsity
  ‣ Fast algorithms for sequential prediction, i.e. finding the best tagging of the entire word sequence
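The fast algorithm for finding the best tagging of a whole sequence is Viterbi decoding. A self-contained sketch; the transition and emission probabilities below are made-up toy numbers, not estimated from any corpus:

```python
def viterbi(words, tags, start, trans, emit):
    """Best tag sequence under an HMM via the Viterbi algorithm."""
    MISSING = 1e-12  # floor for unseen transitions/emissions
    # V[t] = best probability of tagging the prefix so far, ending in tag t
    V = {t: start.get(t, MISSING) * emit[t].get(words[0], MISSING) for t in tags}
    backptrs = []
    for w in words[1:]:
        newV, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: V[p] * trans[p].get(t, MISSING))
            newV[t] = V[prev] * trans[prev].get(t, MISSING) * emit[t].get(w, MISSING)
            ptr[t] = prev
        V, backptrs = newV, backptrs + [ptr]
    # trace the best path backwards from the best final tag
    best = max(tags, key=lambda t: V[t])
    path = [best]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    return path[::-1]

# toy model: "flies" is ambiguous, but NN -> VBZ is the likelier path
tags = ["NN", "VBZ"]
start = {"NN": 0.6, "VBZ": 0.4}
trans = {"NN": {"NN": 0.3, "VBZ": 0.7}, "VBZ": {"NN": 0.8, "VBZ": 0.2}}
emit = {"NN": {"time": 0.3, "flies": 0.1}, "VBZ": {"flies": 0.4}}
print(viterbi(["time", "flies"], tags, start, trans, emit))  # ['NN', 'VBZ']
```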
Unknown Words
• Huge problem in morphologically rich languages (e.g. Turkish)
• Can use things we’ve seen only once (hapax legomena) to best guess for things we’ve never seen before
  ‣ Tend to be nouns, followed by verbs
  ‣ Unlikely to be determiners
• Can use sub-word representations to capture morphology (look for common affixes)
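Looking for common affixes can be sketched as a simple back-off heuristic. The specific suffix rules and fallbacks below are illustrative assumptions, not a trained model:

```python
def guess_tag(word):
    """Guess a Penn Treebank tag for an unseen word from English affixes.
    A rough heuristic sketch: capitalisation and common suffixes first,
    falling back to noun, the most likely open class for unknown words."""
    if word[0].isupper():
        return "NNP"   # unseen capitalised words are often proper nouns
    if word.endswith("ing"):
        return "VBG"   # gerund/present participle
    if word.endswith("ed"):
        return "VBD"   # past tense
    if word.endswith("ly"):
        return "RB"    # adverb
    return "NN"        # default: common noun

print(guess_tag("zorbing"))    # VBG
print(guess_tag("Melbourne"))  # NNP
```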
A Final Word
• Part of speech is a fundamental intersection between linguistics and automatic text analysis
• A fundamental task in NLP, provides useful information for many other applications
• Methods applied to it are typical of language tasks in general, e.g. probabilistic, sequential machine learning
Reading
• JM3 Ch. 8: 8.1–8.3, 8.5.1