
COPYRIGHT 2021, THE UNIVERSITY OF MELBOURNE

COMP90042
Natural Language Processing

Lecture 5
Semester 1 2021 Week 3

Jey Han Lau

Part of Speech Tagging

COMP90042 L5


What is Part of Speech (POS)?

• AKA word classes, morphological classes, syntactic
categories

• Nouns, verbs, adjectives, etc.

• POS tells us quite a bit about a word and its
neighbours:

‣ nouns are often preceded by determiners

‣ verbs often preceded by nouns

‣ content as a noun pronounced as CONtent

‣ content as an adjective pronounced as conTENT


Information Extraction

• Given this:
‣ “Brasilia, the Brazilian capital, was founded in 1960.”

• Obtain this:
‣ capital(Brazil, Brasilia)
‣ founded(Brasilia, 1960)

• Many steps involved but first need to know nouns
(Brasilia, capital), adjectives (Brazilian), verbs
(founded) and numbers (1960).


Outline

• Parts of speech

• Tagsets

• Automatic Tagging


POS Open Classes

Open vs closed classes: how readily do POS
categories take on new words? Just a few open
classes:

• Nouns
‣ Proper (Australia) versus common (wombat)
‣ Mass (rice) versus count (bowls)

• Verbs
‣ Rich inflection (go/goes/going/gone/went)
‣ Auxiliary verbs (be, have, and do in English)
‣ Transitivity (wait versus hit versus give) — number of arguments


POS Open Classes

• Adjectives
‣ Gradable (happy) versus non-gradable (computational)

• Adverbs
‣ Manner (slowly)
‣ Locative (here)
‣ Degree (really)
‣ Temporal (today)


POS Closed Classes (English)
• Prepositions (in, on, with, for, of, over,…)

‣ on the table

• Particles
‣ brushed himself off

• Determiners
‣ Articles (a, an, the)
‣ Demonstratives (this, that, these, those)
‣ Quantifiers (each, every, some, two,…)

• Pronouns
‣ Personal (I, me, she,…)
‣ Possessive (my, our,…)
‣ Interrogative or Wh (who, what, …)


POS Closed Classes (English)

• Conjunctions
‣ Coordinating (and, or, but)
‣ Subordinating (if, although, that, …)

• Modal verbs
‣ Ability (can, could)
‣ Permission (can, may)
‣ Possibility (may, might, could, will)
‣ Necessity (must)

• And some more…

‣ negatives, politeness markers, etc


Is POS universal? What open classes
are seen in all languages?

• Noun
• Verb
• Adjective
• Adverb

PollEv.com/jeyhanlau569


Ambiguity

• Many word types belong to multiple classes

• POS depends on context

• Compare:
‣ Time flies like an arrow
‣ Fruit flies like a banana

Time/noun flies/verb like/preposition an/determiner arrow/noun

Fruit/noun flies/noun like/verb a/determiner banana/noun


POS Ambiguity in News Headlines

• British Left Waffles on Falkland Islands

‣ [British Left] [Waffles] [on] [Falkland Islands]
• Juvenile Court to Try Shooting Defendant

‣ [Juvenile Court] [to] [Try] [Shooting Defendant]
• Teachers Strike Idle Kids

‣ [Teachers Strike] [Idle Kids]  
• Eye Drops Off Shelf

‣ [Eye Drops] [Off Shelf] 


Tagsets


Tagsets

• A compact representation of POS information
‣ Usually ≤ 4 capitalized characters (e.g. NN = noun)
‣ Often includes inflectional distinctions

• Major English tagsets
‣ Brown (87 tags)
‣ Penn Treebank (45 tags)
‣ CLAWS/BNC (61 tags)
‣ “Universal” (12 tags)

• At least one tagset for all major languages


Major Penn Treebank Tags

NN noun VB verb
JJ adjective RB adverb
DT determiner CD cardinal number
IN preposition PRP personal pronoun
MD modal CC coordinating conjunction
RP particle WP wh-pronoun
TO to


Derived Tags (Open Class)
• NN (noun singular, wombat)

‣ NNS (plural, wombats)

‣ NNP (proper, Australia)

‣ NNPS (proper plural, Australians)

• VB (verb infinitive, eat)

‣ VBP (1st /2nd person present, eat)

‣ VBZ (3rd person singular, eats)

‣ VBD (past tense, ate)

‣ VBG (gerund, eating)

‣ VBN (past participle, eaten)


Derived Tags (Open Class)

• JJ (adjective, nice)

‣ JJR (comparative, nicer)

‣ JJS (superlative, nicest)

• RB (adverb, fast)

‣ RBR (comparative, faster)

‣ RBS (superlative, fastest)


Derived Tags (Closed Class)

• PRP (pronoun personal, I)

‣ PRP$ (possessive, my)

• WP (Wh-pronoun, what):

‣ WP$ (possessive, whose)

‣ WDT (wh-determiner, which)

‣ WRB (wh-adverb, where)


Tagged Text Example

The/DT limits/NNS to/TO legal/JJ absurdity/NN stretched/VBD another/DT
notch/NN this/DT week/NN when/WRB the/DT Supreme/NNP Court/NNP
refused/VBD to/TO hear/VB an/DT appeal/NN from/IN a/DT case/NN
that/WDT says/VBZ corporate/JJ defendants/NNS must/MD pay/VB
damages/NNS even/RB after/IN proving/VBG that/IN they/PRP
could/MD not/RB possibly/RB have/VB caused/VBN the/DT harm/NN ./.
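Text in this word/TAG format is straightforward to parse back into (word, tag) pairs. A minimal helper (not part of the slides, written here for illustration) might look like:

```python
def parse_tagged(text):
    """Split Penn-Treebank-style 'word/TAG' tokens into (word, tag) pairs.

    The tag is taken after the LAST slash, so the token './.' parses
    correctly as the word '.' with tag '.'.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

print(parse_tagged("The/DT limits/NNS to/TO legal/JJ absurdity/NN"))
# → [('The', 'DT'), ('limits', 'NNS'), ('to', 'TO'), ('legal', 'JJ'), ('absurdity', 'NN')]
```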


Tag the following sentence with Penn
Treebank’s POS tagset:

CATS SHOULD CATCH MICE EASILY

PollEv.com/jeyhanlau569


Automatic Tagging


Why Automatically POS tag?
• Important for morphological analysis, e.g. lemmatisation

• For some applications, we want to focus on certain POS
‣ E.g. nouns are important for information retrieval, adjectives for
sentiment analysis

• Very useful features for certain classification tasks
‣ E.g. genre attribution (fiction vs. non-fiction)

• POS tags can offer word sense disambiguation
‣ E.g. cross/NN vs. cross/VB vs. cross/JJ

• Can use them to create larger structures (parsing; lectures 14–16)


Automatic Taggers

• Rule-based taggers

• Statistical taggers
‣ Unigram tagger
‣ Classifier-based taggers
‣ Hidden Markov Model (HMM) taggers


Rule-based tagging

• Typically starts with a list of possible tags for each
word
‣ From a lexical resource, or a corpus

• Often includes other lexical information, e.g. verb
subcategorisation (its arguments)

• Apply rules to narrow down to a single tag
‣ E.g. if DT comes before a word, eliminate VB
‣ Relies on some unambiguous contexts

• Large systems have 1000s of constraints
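A constraint like "if DT precedes, eliminate VB" can be sketched as a rule that narrows per-word candidate tag sets. The rule and the candidate sets below are illustrative, not taken from a real constraint grammar:

```python
def narrow(candidates):
    """Apply one rule to a list of candidate tag sets, one set per word:
    if the previous word is unambiguously DT, eliminate base-form verb
    readings for the current word (determiners don't precede VB)."""
    result = []
    for i, tags in enumerate(candidates):
        tags = set(tags)  # copy so the input is not mutated
        if i > 0 and candidates[i - 1] == {"DT"} and len(tags) > 1:
            tags -= {"VB"}
        result.append(tags)
    return result

# "the book": 'book' could be NN or VB, but follows an unambiguous DT
print(narrow([{"DT"}, {"NN", "VB"}]))  # → [{'DT'}, {'NN'}]
```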


Unigram tagger

• Assign most common tag to each word type

• Requires a corpus of tagged words

• “Model” is just a look-up table

• But actually quite good, ~90% accuracy
‣ Correctly resolves about 75% of ambiguity

• Often considered the baseline for more complex
approaches
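The look-up table "model" can be built in a few lines. The toy tagged corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

def train_unigram_tagger(tagged_sentences):
    """Count (word, tag) pairs and keep the most common tag per word type."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(model, words, default="NN"):
    """Look each word up in the table; back off to NN for unseen words."""
    return [(w, model.get(w, default)) for w in words]

corpus = [[("time", "NN"), ("flies", "VBZ"), ("like", "IN")],
          [("fruit", "NN"), ("flies", "NNS"), ("like", "VBP")],
          [("fruit", "NN"), ("flies", "NNS")]]
model = train_unigram_tagger(corpus)
# 'flies' was tagged NNS twice and VBZ once, so the table stores NNS
print(tag(model, ["time", "flies", "fast"]))
# → [('time', 'NN'), ('flies', 'NNS'), ('fast', 'NN')]
```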


Classifier-Based Tagging

• Use a standard discriminative classifier (e.g.
logistic regression, neural network), with features:
‣ Target word
‣ Lexical context around the word
‣ Already classified tags in sentence

• But can suffer from error propagation: wrong
predictions from previous steps affect the next
ones
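The feature set for one word position can be sketched as a dictionary; the particular feature names chosen here (word, suffix, neighbouring words, previous predicted tag) are illustrative:

```python
def features(words, i, prev_tags):
    """Feature dict for word i: the target word, lexical context around it,
    and the tag already predicted for the previous word (tagging proceeds
    left to right, which is where error propagation comes from)."""
    return {
        "word": words[i].lower(),
        "suffix3": words[i][-3:].lower(),
        "prev_word": words[i - 1].lower() if i > 0 else "<s>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "</s>",
        "prev_tag": prev_tags[i - 1] if i > 0 else "<s>",
    }

print(features(["Time", "flies", "fast"], 1, ["NN"]))
# → {'word': 'flies', 'suffix3': 'ies', 'prev_word': 'time',
#    'next_word': 'fast', 'prev_tag': 'NN'}
```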


Hidden Markov Models
• A basic sequential (or structured) model

• Like sequential classifiers, uses both the previous tag and
lexical evidence

• Unlike classifiers, considers all possibilities for the previous tag

• Unlike classifiers, treats previous-tag evidence and lexical
evidence as independent of each other
‣ Less sparsity
‣ Fast algorithms for sequential prediction, i.e. finding the best
tagging of the entire word sequence
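Those independence assumptions mean the joint probability of a tagging factorises into per-position transition and emission terms. A toy sketch, with probabilities invented for illustration:

```python
def hmm_score(words, tags, transition, emission):
    """Joint probability P(tags, words) under the HMM assumptions:
    each tag depends only on the previous tag (transition), and each
    word depends only on its own tag (emission)."""
    p = 1.0
    prev = "<s>"  # sentence-start symbol
    for w, t in zip(words, tags):
        p *= transition[(prev, t)] * emission[(t, w)]
        prev = t
    return p

# invented toy parameters, not estimated from any corpus
transition = {("<s>", "NN"): 0.5, ("NN", "VBZ"): 0.4}
emission = {("NN", "time"): 0.1, ("VBZ", "flies"): 0.05}
print(hmm_score(["time", "flies"], ["NN", "VBZ"], transition, emission))
# → 0.001  (= 0.5 * 0.1 * 0.4 * 0.05)
```

An HMM tagger searches over all tag sequences for the one maximising this score; the Viterbi algorithm (next lecture) does that search efficiently.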

• Next lecture!


Unknown Words

• Huge problem in morphologically rich languages (e.g. Turkish)

• Can use words we’ve seen only once (hapax legomena) as the
best guide for words we’ve never seen before

‣ Tend to be nouns, followed by verbs

‣ Unlikely to be determiners

• Can use sub-word representations to capture
morphology (look for common affixes)
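Affix-based guessing can be sketched as a suffix look-up; the suffix rules below are illustrative examples, not a complete morphological analyser:

```python
def guess_tag(word, suffix_tags, default="NN"):
    """Guess a tag for an unseen word from common English suffixes;
    fall back to NN, since unknown words tend to be nouns."""
    for suffix, tag in suffix_tags:
        if word.lower().endswith(suffix):
            return tag
    return default

# illustrative suffix rules; order matters (checked first to last)
SUFFIX_TAGS = [("ing", "VBG"), ("ed", "VBD"), ("ly", "RB"), ("s", "NNS")]

print(guess_tag("grokking", SUFFIX_TAGS))  # → VBG
print(guess_tag("wombat", SUFFIX_TAGS))    # → NN
```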


A Final Word

• Part of speech is a fundamental intersection
between linguistics and automatic text analysis

• A fundamental task in NLP that provides useful
information for many other applications

• Methods applied to it are typical of language tasks
in general, e.g. probabilistic, sequential machine
learning


Reading

• JM3 Ch. 8-8.2