14-tokenization.pptx
Ling 131A
Introduction to NLP with Python
Tokenization
Marc Verhagen, Fall 2017
Contents
• Python sessions
• Final project
• Assignment 4
• Regular Expressions in Python
• Tokenization
Python sessions
• From the registrar:
– Shapiro Science Center GL 14 has been reserved
for LING 131a review sessions on Mondays and
Wednesdays from 1-1:50pm during the semester
while classes are in session with the exception of
Nov 21 (Thanksgiving).
• Start tomorrow
Final Project
• Start thinking about a project
• Project proposal will be due on Tuesday
November 20th
– Submit to me by email
• Groups of up to 4 people
• Should involve some serious coding beyond what
you have done so far
• Delivery vehicle is again GitHub
• Graded on code and final report
Final Project Topics
• Project does not need to involve NLTK
• Programming heavy
– Create your own tokenizer, lemmatizer, POS
tagger, syntactic parser, word aligner, entity
extractor or some other NLP module
• Linguistics heavy
– Analyze a corpus of data or compare a couple of
corpora
– Should go beyond running some NLTK code
Final Project Topics
• Examples from last year
– Natural language interface to relational database
– Haiku bot
– Twitter classification
– Text normalization (currencies)
– Detecting challenging or uncommon words
– Question-answering system
– Spanish chat box
– Generating rhyme patterns
NLP Pipeline
• Text cleanup: raw text processing
• Tokenization/segmentation
• Text normalization: stemming, lemmatization
• POS tagging/morphological analysis
• NP chunking, syntactic parsing
• Semantic analysis
Tokenization
• First level of abstraction
• What are the basic units in your text?
• Knowing that ‘the’ is the same thing whether it
occurs as ‘the’ or ‘the.’
• Usually does not include normalization
– The man in the High Tower.
– Tokenized as
• The man in the High Tower .
– Not as
• the man in the high tower .
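A minimal sketch of the point above, using only Python's re module. Splitting on whitespace leaves the period attached to the last word, while a simple regular expression separates it and keeps the original capitalization. The pattern is an illustrative assumption, not the tokenizer used in the course.

```python
import re

text = "The man in the High Tower."

# Whitespace splitting keeps the period glued to the last word.
whitespace_tokens = text.split()
print(whitespace_tokens)

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
# No lowercasing is applied, so 'The' and 'High' keep their capitals.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
```

Lowercasing, if wanted, would be a separate normalization step applied after tokenization.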
Penn Treebank Tokenization
• Lorillard Inc., the unit of New York-based Loews Corp.
that makes Kent cigarettes, stopped using crocidolite
in its Micronite cigarette filters in 1956.
• Typically, money-fund yields beat comparable invest-
ments because portfolio managers can vary maturities
and go after highest rates.
• Periods and hyphens are ambiguous
• Hyphens originally not considered single tokens in PTB,
but later revised to context-dependent tokenization.
Penn Treebank Tokenization
• Lorillard Inc. , the unit of New York - based Loews
Corp. that makes Kent cigarettes , stopped using
crocidolite in its Micronite cigarette filters in 1956 .
• Typically , money-fund yields beat comparable
investments because portfolio managers can vary
maturities and go after highest rates .
• Periods and hyphens are ambiguous
• Hyphens originally not considered single tokens in PTB,
but later revised to context-dependent tokenization.
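One way to picture the period ambiguity is a rule that splits a final period off a word unless the word is a known abbreviation. The sketch below assumes a tiny hand-written abbreviation list (hypothetical, for illustration only) and is not how the Penn Treebank tokenizer is actually implemented.

```python
# Hypothetical abbreviation list; a real tokenizer needs a much larger one.
ABBREVIATIONS = {"Inc.", "Corp.", "U.S.", "Nov."}

def split_final_period(word):
    """Split a trailing period off unless the word is a known abbreviation."""
    if len(word) > 1 and word.endswith(".") and word not in ABBREVIATIONS:
        return [word[:-1], "."]
    return [word]

words = "Loews Corp. stopped using crocidolite in 1956.".split()
tokens = [tok for w in words for tok in split_final_period(w)]
print(tokens)
```

The rule still fails when an abbreviation ends a sentence, since the same period then does two jobs.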
Penn Treebank Tokenization
• In the new position he will oversee Mazda ’s U.S.
sales , services , parts and marketing operations .
• We did n’t have much of a choice .
• U.S. trade officials said the Philippines and
Thailand would be the main beneficiaries of the
president ’s action .
• Anything ’s possible — how about the new Guinea
Fund ?
• Contractions and the possessive ’s are separated out
Penn Treebank Tokenization
• Assets of the 400 taxable funds grew by $1.5
billion during the latest week.
• Exports in October stood at $5.29 billion, a mere
0.7% increase from a year earlier, while
imports increased sharply to $5.39 billion, up
20% from last year.
Penn Treebank Tokenization
• Assets of the 400 taxable funds grew by $ 1.5
billion during the latest week .
• Exports in October stood at $ 5.29 billion , a mere
0.7 % increase from a year earlier , while
imports increased sharply to $ 5.39 billion , up
20 % from last year .
• Punctuation marks are their own tokens, and
not just periods and commas
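The splitting of $ and % shown above can be approximated with one regular expression. This is a sketch under simple assumptions (decimal numbers stay together, every other symbol becomes its own token), not the official Penn Treebank rules.

```python
import re

# Alternatives are tried left to right:
#   \d+(?:\.\d+)?  a number, optionally with a decimal part (keeps 1.5 together)
#   \w+            a run of word characters
#   [^\w\s]        any single punctuation or symbol character ($, %, comma, period)
PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

sentence = "Assets of the 400 taxable funds grew by $1.5 billion during the latest week."
tokens = PATTERN.findall(sentence)
print(" ".join(tokens))
```

This pattern would wrongly break up abbreviations such as U.S., so in practice it has to be combined with some handling of periods.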
Penn Treebank Tokenization
• The federal government suspended sales of
the U.S. savings bonds because Congress
hasn’t lifted the ceiling on government debt.
• The Treasury said the U.S. will default on Nov.
9 if Congress doesn’t act by then.
Penn Treebank Tokenization
• The federal government suspended sales of
the U.S. savings bonds because Congress has
n’t lifted the ceiling on government debt .
• The Treasury said the U.S. will default on Nov.
9 if Congress does n’t act by then .
• Contractions are separated out
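Separating contractions can be sketched with two regular expressions, one for n't and one for 's. The function name split_clitics is hypothetical, and the rules cover only the two cases shown on these slides.

```python
import re

def split_clitics(word):
    """Split n't and 's off a word, in the style shown on the slides."""
    # "hasn't" -> "has" + "n't"; accepts straight or curly apostrophes.
    match = re.fullmatch(r"(\w+)(n['’]t)", word)
    if match:
        return [match.group(1), match.group(2)]
    # "Mazda's" -> "Mazda" + "'s"
    match = re.fullmatch(r"(\w+)(['’]s)", word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]

words = "Congress hasn't lifted the ceiling".split()
tokens = [tok for w in words for tok in split_clitics(w)]
print(tokens)
```

Other clitics such as 'll, 're and 've would need additional patterns of the same shape.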
Stemming and Lemmatization
• The process of reducing inflected words to their
word stem or root form (aka lemma)
• Strongly related, but there are differences
– Stemmer usually does not take the context into
account and does not care about a word’s category
– Lemmatizer includes dictionary lookup of wordforms
and uses the context of a word
• for example, meeting should be mapped to meet if it is a
verb but not if it is a noun
– Stemming is faster compared with lemmatization, but
can’t ensure the resulting words are legitimate
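The contrast can be made concrete with two toy functions. Both are hypothetical sketches: toy_stem strips a suffix blindly, and toy_lemmatize looks a word up together with its part of speech in a small hand-written table that stands in for a real dictionary such as WordNet.

```python
# Toy lookup table keyed by (word form, part of speech); the entries are
# illustrative assumptions, not taken from a real lexicon.
LEMMAS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("flies", "VERB"): "fly",
    ("flies", "NOUN"): "fly",
}

def toy_stem(word):
    """Strip a common suffix without looking at context or category."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look the word up with its part of speech; fall back to the word itself."""
    return LEMMAS.get((word, pos), word)

print(toy_stem("meeting"), toy_stem("flies"))
print(toy_lemmatize("meeting", "VERB"), toy_lemmatize("meeting", "NOUN"))
```

The stemmer maps flies to fli, which is not a legitimate word, while the lemmatizer returns meet or meeting depending on the category it is given.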
Stemming and Lemmatization
• Regular expressions often used for suffix
stripping, but
– Can’t elegantly handle irregular patterns, e.g.,
women → woman
– Off-the-shelf stemmers (Lancaster and Porter) use
many special rules to handle these irregularities
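A rough sketch of regular-expression suffix stripping, with a small exception table for irregular forms. The table and the suffix list are assumptions made for illustration; the Porter and Lancaster stemmers use far more elaborate rule sets.

```python
import re

# Hypothetical exception table for irregular forms.
IRREGULAR = {"women": "woman", "men": "man", "children": "child"}

# Strip one of a few common inflectional suffixes from the end of a word.
SUFFIX = re.compile(r"(ing|ed|es|s)$")

def strip_suffix(word):
    """Return the irregular base form if listed, else strip a suffix."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    return SUFFIX.sub("", word)

# The regular expression alone leaves 'women' untouched.
print(SUFFIX.sub("", "women"))
print([strip_suffix(w) for w in ["owned", "plays", "women"]])
```

Every irregular form needs its own entry, which is why suffix stripping alone does not scale to irregular patterns.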
Tokenization in NLTK
from nltk import word_tokenize
# word_tokenize relies on NLTK's Punkt sentence tokenizer data (see nltk.download)
text1 = """Lorillard Inc., the unit of New York-based Loews
Corp. that makes Kent cigarettes, stopped using crocidolite in
its Micronite cigarette filters in 1956."""
print(word_tokenize(text1))
Stemming and Lemmatizing
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
words = ['caresses', 'flies', 'dies', 'mules', 'denied',
'died', 'agreed', 'owned', 'humbled', 'sized',
'meeting', 'stating', 'siezing', 'itemization',
'sensational', 'traditional', 'reference',
'colonizer', 'plotted']
# The Porter stemmer applies rule-based suffix stripping
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])
# The lemmatizer needs the WordNet corpus and treats each word as a noun by default
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in words])