Accelerated Natural Language Processing Week 1/Unit 2
What’s so Special about Words?
Sharon Goldwater
(based on slides by Philipp Koehn)
This Unit
• What are word types and tokens, and what is the characteristic frequency distribution of word tokens?
• What is morphology, how does it differ across languages, and why does it matter for NLP?
• What’s the difference between a stem, lemma, and affix?
• What are the characteristics of derivational and inflectional morphology?

Video 1: Words as data (Types, tokens, and Zipf’s law)

Data: Words
Possible definition: strings of letters separated by spaces
• But how about:
– punctuation: commas, periods, etc. are normally not part of words, but others are less clear: high-risk, Joe’s, @sloppyjoe
– compounds: website, Computerlinguistikvorlesung
• And what if there are no spaces:
伦敦每日快报指出,两台记载黛安娜王妃一九九七年巴黎死亡车祸调查资料的手提电脑,被从前大都会警察总长的办公室里偷走.
Processing text to decide/extract words is called tokenization.
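To make the tokenization problem concrete, here is a minimal sketch that compares splitting on spaces with a simple regular expression. The sample sentence is made up from the slide's examples, and the pattern is just one illustrative choice, not a standard tokenizer; neither function can segment text written without spaces.

```python
import re

# Made-up sample built from the slide's examples (an assumption, not corpus data).
SAMPLE = "Joe’s high-risk plan, by @sloppyjoe."

def whitespace_tokenize(text):
    """Split on whitespace only; punctuation stays glued to words."""
    return text.split()

def regex_tokenize(text):
    """Keep hyphenated words, possessives and @handles together;
    emit every other punctuation mark as its own token."""
    pattern = r"@?\w+(?:[-'’]\w+)*|[^\w\s]"
    return re.findall(pattern, text)

print(whitespace_tokenize(SAMPLE))
print(regex_tokenize(SAMPLE))
```

Whether high-risk or Joe’s should count as one token or several is a design decision; this pattern simply keeps them whole.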

Word Counts
Out of 24m total word tokens (instances) in the English Europarl corpus, the most frequent are:

any word                  nouns
Frequency    Token        Frequency    Token
1,698,599    the          124,598      European
849,256      of           104,325      Mr
793,731      to           92,195       Commission
640,257      and          66,781       President
508,560      in           62,867       Parliament
407,638      that         57,804       Union
400,467      is           53,683       report
394,778      a            53,547       Council
263,040      I            45,842       States

But there are 93638 distinct words (types) altogether, and 36231 occur only once! Examples:
• cornflakes, mathematicians, fuzziness, jumbling
• pseudo-rapporteur, lobby-ridden, perfunctorily
• Lycketoft, UNCITRAL, H-0695
• policyfor, Commissioneris, 145.95, 27a
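The counts above (tokens, types, and words that occur only once) can be computed with a frequency table. The sketch below does this for a single toy sentence rather than Europarl, using naive whitespace tokenization; the function name is an illustrative choice.

```python
from collections import Counter

def count_words(text):
    """Return (number of tokens, number of types, Counter of type frequencies)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    counts = Counter(tokens)
    return len(tokens), len(counts), counts

n_tokens, n_types, counts = count_words("a cat and a brown dog chased a black dog")
# Types that occur exactly once in the text.
hapaxes = sorted(w for w, c in counts.items() if c == 1)

print(n_tokens, n_types)      # 10 7
print(counts.most_common(2))  # [('a', 3), ('dog', 2)]
print(hapaxes)                # ['and', 'black', 'brown', 'cat', 'chased']
```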
Plotting word frequencies
Order words by frequency. What is the freq of nth ranked word?

Frequency    Token    Rank
1,698,599    the      1
849,256      of       2
793,731      to       3
640,257      and      4
508,560      in       5
407,638      that     6
400,467      is       7
394,778      a        8
263,040      I        9

Rescaling the axes
To really see what’s going on, use logarithmic axes:
Zipf’s law
Summarizes the behaviour we just saw:
$f \times r \approx k$
• f = frequency of a word
• r = rank of a word (if sorted by frequency)
• k = a constant
Why a line in log-scales?
$f r = k \;\Rightarrow\; f = \frac{k}{r} \;\Rightarrow\; \log f = \log k - \log r$
$y = c - x$
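As a rough check of Zipf's law, the sketch below computes f × r for the nine most frequent Europarl tokens listed earlier and fits a least-squares line to log f against log r. Nine points are far too few to estimate the real slope, so treat this as an illustration of the calculation only; the function names are illustrative. An exactly Zipfian list (f = k/r) gives a slope of -1.

```python
import math

# (frequency, token) pairs for the nine most frequent Europarl tokens,
# copied from the frequency table; rank is the 1-based position.
TOP_WORDS = [
    (1_698_599, "the"), (849_256, "of"), (793_731, "to"),
    (640_257, "and"), (508_560, "in"), (407_638, "that"),
    (400_467, "is"), (394_778, "a"), (263_040, "I"),
]

def rank_times_freq(freqs):
    """Return f * r for each frequency, with ranks starting at 1."""
    return [f * r for r, f in enumerate(freqs, start=1)]

def loglog_slope(freqs):
    """Least-squares slope of log f against log r."""
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

freqs = [f for f, _ in TOP_WORDS]
print(rank_times_freq(freqs))          # roughly constant if Zipf's law held exactly
print(round(loglog_slope(freqs), 2))   # slope for the nine observed words

ideal = [1000 / r for r in range(1, 51)]  # synthetic, exactly Zipfian frequencies
print(round(loglog_slope(ideal), 2))      # -1.0
```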

Linguistics and Data
• Data
– looking at real use of language in text
– can learn a lot from empirical evidence
– but: Zipf’s law means there will always be rare instances
• Linguistics
– build a better understanding of language structure
– linguistic analysis points to what is important
– but: many ambiguities cannot be explained easily

Video 2: Introduction to morphology (the structure inside of words)
Two plots from last time
How Many Different Words?
10,000 sentences from the Europarl corpus

Language      Different words
English       16k
French        22k
Dutch         24k
Italian       25k
Portuguese    26k
Spanish       26k
Danish        29k
Swedish       30k
German        32k
Greek         33k
Finnish       55k

Why the difference? Morphology.

Interlude/reminder: types and tokens
The word word is ambiguous.
• Word type: “10k sentences from English Europarl have 16k different words” (unique strings, lexical items)
• Word token: “English Europarl has 54m words” (possibly repeated instances)
a cat and a brown dog chased a black dog: 10 tokens, 7 types.

What is morphology?
The study of wordforms and word formation.
• Structured relationships between words:
play, played, replay, player
played, walked, jumped
• How units of meaning (morphemes) can be arranged to form word types (morphotactics):
de+salin+ate+ion but not ate+salin+ion+de

Why does morphology matter?
• Information retrieval: return pages with related forms.
• Language modelling: make predictions about unseen words
• Machine translation and language understanding: signals differences in meaning (might be expressed using word order in other languages).

Example (Russian):
zhenshina devochke dala knigu
woman+NOM girl+DAT gave book+ACC
‘the woman gave the girl a book’
vs.
zhenshine devochka dala knigu
woman+DAT girl+NOM gave book+ACC
‘the girl gave the woman a book’
A noun’s case marking (a kind of morphology) indicates its role in the sentence, where English uses word order and prepositions.
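The Russian example can be turned into a small sketch in which roles are read off the case marking rather than the word position. The lexicon below is hand-written and covers only the forms in the example; the mapping from case to role and the function name are illustrative assumptions, not a real morphological analyser.

```python
# Toy lexicon: surface form -> (English gloss, case), taken from the example.
LEXICON = {
    "zhenshina": ("woman", "NOM"),
    "zhenshine": ("woman", "DAT"),
    "devochka": ("girl", "NOM"),
    "devochke": ("girl", "DAT"),
    "knigu": ("book", "ACC"),
}

# Assumed role signalled by each case in this sentence pattern.
ROLE_OF_CASE = {"NOM": "subject", "DAT": "indirect object", "ACC": "direct object"}

def roles(sentence):
    """Map each grammatical role to the gloss of the noun that carries it."""
    result = {}
    for word in sentence.split():
        if word in LEXICON:  # the verb 'dala' is not in the lexicon and is skipped
            gloss, case = LEXICON[word]
            result[ROLE_OF_CASE[case]] = gloss
    return result

print(roles("zhenshina devochke dala knigu"))
print(roles("zhenshine devochka dala knigu"))
```

The two sentences differ only in their noun endings, yet the subject and indirect object swap; word position plays no part in this toy analysis.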

Video 3: Stems, lemmas, and affixes

Morphemes: Stems and Affixes
• Two types of morphemes
– stems: small, cat, walk
– affixes: +ed, un+
• Four types of affixes
– suffix
– prefix
– infix
– circumfix

Stems vs. Lemmas
• Lemma: the canonical form or dictionary form of a set of words
– fly, flies, flew and flying all have the lemma fly.
– walk, walks, walked and walking all have the lemma walk.
– walker, walkers have the lemma walker.
• Stem: definitions can vary, but often: the part of the word that is common to all its variants
– stem of produce, production is produc.
– stem of walk, walks, walked, walking, walker, walkers is walk.
– Do fly, flies, flew, flying have a common stem fl? Or maybe only fly and flying share a stem: fly. Decision may depend on application.
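The contrast between lemmas and stems can be sketched in code. Here a lemma is looked up in a hand-written table covering only the slide's examples, while a "stem" is taken to be the longest shared prefix, which is just one of the possible definitions mentioned above. Both names are illustrative.

```python
# Hand-written lemma table covering only the examples in the slide.
LEMMAS = {
    "fly": "fly", "flies": "fly", "flew": "fly", "flying": "fly",
    "walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk",
    "walker": "walker", "walkers": "walker",
}

def common_stem(words):
    """Longest prefix shared by all words (one possible notion of 'stem')."""
    if not words:
        return ""
    shortest = min(words, key=len)
    for i, ch in enumerate(shortest):
        if any(w[i] != ch for w in words):
            return shortest[:i]
    return shortest

print(LEMMAS["flew"], LEMMAS["walkers"])                 # fly walker
print(common_stem(["produce", "production"]))            # produc
print(common_stem(["fly", "flies", "flew", "flying"]))   # fl
print(common_stem(["fly", "flying"]))                    # fly
```

Note that walker and walkers share the stem walk with the verb forms but have a different lemma, which is the distinction the slide draws.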

Suffix
• Plural of nouns
cat+s
• Comparative and superlative of adjectives
small+er small+est
• Formation of adverbs
great+ly
• Verb tenses
walk+ed
• All inflectional morphology in English uses suffixes

Prefix
• In English: these typically change the meaning
• Adjectives
un+friendly dis+interested
• Verbs
re+consider
• Some languages use prefixing much more widely

Not that easy…
• Affixes are not always simply attached
• In writing, some letters may be changed/added/removed
– walk+ed
– frame+d
– emit+ted
– carr(–y)+ied
• In speaking, some sounds may be changed/added/removed
– Compare the final sound: cats [s] vs dogs [z] vs foxes [əz]

Other types of morphology
Mainly in non-English languages; check textbook or online.
• Infixes
• Circumfixes
• Reduplication
• Root and pattern
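The spelling changes listed under "Not that easy…" can be written as a few ordered rules. This is a minimal sketch for the past-tense suffix only: the function name is illustrative, and because consonant doubling in English depends on stress, which the spelling does not show, the caller has to say when to double (an assumption made to keep the example honest). Irregular verbs are not handled.

```python
VOWELS = set("aeiou")

def add_ed(verb, double_final=False):
    """Attach the past-tense suffix with simple English spelling rules."""
    if verb.endswith("e"):
        return verb + "d"                  # frame -> framed
    if len(verb) > 1 and verb.endswith("y") and verb[-2] not in VOWELS:
        return verb[:-1] + "ied"           # carry -> carried
    if double_final:
        return verb + verb[-1] + "ed"      # emit -> emitted
    return verb + "ed"                     # walk -> walked

for verb, double in [("walk", False), ("frame", False), ("emit", True), ("carry", False)]:
    print(verb, "->", add_ed(verb, double_final=double))
```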

Irregular Forms
• Some words have irregular forms:
– is, was, been
– eat, ate, eaten
– go, went, gone
• Irregular forms tend to be the most frequent (and vice versa)

Video 4: Inflection and derivation

Inflectional vs. Derivational Morphology
• So far, distinctions are mainly about form: where does the morpheme go, what does it look like?
• We can also distinguish more by function: inflection or derivation?
• Inflectional morphology typically
– does not change basic meaning or part of speech
– expresses grammatical features or relations between words
– applies to all words of the same part of speech

Inflectional Morphology
• In English, we inflect
– nouns for count (plural: +s) and for possessive case (+’s)
– verbs for tense (+ed, +ing) and a special 3rd person singular present form (+s)
– adjectives in comparative (+er) and superlative (+est) forms.
• In German, we inflect
– nouns for count and case
– verbs for tense, person, and count
– adjectives for count, case, gender, and definiteness
– determiners for count, case and gender
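To illustrate determiner inflection for count, case and gender, the sketch below stores the paradigm of the German definite article and counts how many case/number/gender cells each written form fills. The data structure and function name are illustrative choices; the gender labels follow the slides' wording.

```python
from collections import defaultdict

CASES = ["nominative", "genitive", "dative", "accusative"]
COLUMNS = [("singular", "male"), ("singular", "fem."), ("singular", "n."),
           ("plural", "male"), ("plural", "fem."), ("plural", "n.")]

# One row per case, one entry per (number, gender) column above.
FORMS = {
    "nominative": ["der", "die", "das", "die", "die", "die"],
    "genitive":   ["des", "der", "des", "der", "der", "der"],
    "dative":     ["dem", "der", "dem", "den", "den", "den"],
    "accusative": ["den", "die", "das", "die", "die", "die"],
}

def analyses_by_form():
    """Map each article form to every (case, number, gender) cell it can fill."""
    table = defaultdict(list)
    for case in CASES:
        for (number, gender), form in zip(COLUMNS, FORMS[case]):
            table[form].append((case, number, gender))
    return dict(table)

for form, cells in sorted(analyses_by_form().items()):
    print(form, len(cells))
# das 2, dem 2, den 4, der 6, des 2, die 8
```

Six written forms cover 24 cells, so seeing an article on its own rarely pins down case, number and gender.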

Forms of the German the

Case                          Singular              Plural
                              male   fem.   n.      male   fem.   n.
nominative (subject)          der    die    das     die    die    die
genitive (possessive)         des    der    des     der    der    der
dative (indirect object)      dem    der    dem     den    den    den
accusative (direct object)    den    die    das     die    die    die

Phrase/role: [The A]/s put [the B]/o [of the C]/p [on the D]/io
Not only many different forms, but each form is highly ambiguous.

Inflectional vs. Derivational Morphology
• Inflectional morphology typically
– does not change basic meaning or part of speech
– expresses grammatical features or relations between words
– applies to all words of the same part of speech
• Derivational morphology
– may change the part of speech or meaning of a word
– is not driven by syntactic relations outside the word
– may be “picky”: drama+(t)ize but not traged(-y)+ize
– applies closer to the stem; whereas inflection occurs at word edges: govern+ment+s, centr+al+ize+d

Derivational Morphology
• Changing the part of speech, e.g. noun to verb
word → wordify
• Is it a real word?
• Consulting Google (a few years ago):
– 8,840 hits: e.g., wordify mugs, tshirts and magnets
• Google now returns over 75k hits. (Why?)
• Changing the verb back to a noun
wordify → wordification (8k hits on Google)
• A person/thing who engages in wordification
wordification → wordificator (was 8 hits, now 21k: another app!)
• A person/thing who wordifies
wordify → wordifier (1500 hits on Google)
• What is the difference between a wordifier and a wordificator?
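The derivations above can be sketched as string rewrites, each changing the part of speech. The function names and the exact rewrite rules are illustrative assumptions that fit the wordify examples only; real derivational morphology is "picky" and would need per-word exceptions.

```python
def noun_to_verb(noun):
    """word -> wordify"""
    return noun + "ify"

def verb_to_noun(verb):
    """wordify -> wordification"""
    if not verb.endswith("ify"):
        raise ValueError("this toy rule only handles verbs ending in -ify")
    return verb[:-1] + "ication"

def verb_to_agent(verb):
    """wordify -> wordifier"""
    if not verb.endswith("ify"):
        raise ValueError("this toy rule only handles verbs ending in -ify")
    return verb[:-1] + "ier"

def noun_to_agent(noun):
    """wordification -> wordificator"""
    if not noun.endswith("ation"):
        raise ValueError("this toy rule only handles nouns ending in -ation")
    return noun[:-3] + "or"

verb = noun_to_verb("word")
noun = verb_to_noun(verb)
print(verb, noun, verb_to_agent(verb), noun_to_agent(noun))
# wordify wordification wordifier wordificator
```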

Derivational Morphology
• Turning wordification into an ideology:
wordification → wordificationism (was just 1 hit:)
I think you’re confusing the term “Democracy” with “Capitalism”; I think you mean “Has Capitalism failed”?
No. It hasn’t.
I agree, Hambone; I’m just trying to correct the wordificationism.
Where in the world did you get the word “wordificationism”? Not in the Merriam-Webster dictionary, not in the Thesaurus…
• An adherent of wordificationism
wordificationism → wordificationist
• Used to have 0 hits on Google, now you get these slides!
• We created a new word!

Compounds
• Creating new words by merging multiple words
• (Somewhat) rare in English
home work → homework
web site → website
• More common in other languages (like German)

Acronyms/Initialisms
• Wikileaks / Guardian, document 2007-081-100110-0444:
OGA operating in TF Catamount sector moved into Malekshay for operation. LN Shum Khan ran at the sight of the approaching CFA’s. CF utilized the escalation of force doctrine and shouted to stop, fired warning shots and then fired to wound. The LN was hit in the ankle and treated by Element medics on scene. It was determined through discussions with local Elders that the man was a deaf mute that was nervous of the CF operation. Solatia was made in the form of supplies and the Element mission progressed

Video 5: Morphological variation across languages
Morphology differs across languages
• Usually a trade-off between morphology and syntax (word order)
– Some languages have no verb tenses
→ use explicit time references (yesterday)
– Case inflection determines roles of noun phrase
→ use fixed word order instead
→ use prepositional phrases instead of cased noun phrases
• Examples from the World Atlas of Language Structures (wals.info)
– prefixes vs. suffixes
– cases (zero to more than ten)
– past tense remoteness distinctions