14-tokenization.pptx

Ling 131A
Introduction to NLP with Python

Tokenization

Marc Verhagen, Fall 2017

Contents

•  Python sessions
•  Final project
•  Assignment 4
•  Regular Expressions in Python
•  Tokenization

Python sessions

•  From the registrar:
– Shapiro Science Center GL 14 has been reserved
for LING 131a review sessions on Mondays and
Wednesdays from 1-1:50pm during the semester
while classes are in session with the exception of
Nov 21 (Thanksgiving).

•  Start tomorrow

Final Project

•  Start thinking about a project
•  Project proposal will be due on Tuesday
November 20th
–  Submit to me by email

•  Groups of up to 4 people
•  Should involve some serious coding beyond what
you have done so far

•  Delivery vehicle is again GitHub
•  Graded on code and final report

Final Project Topics

•  Project does not need to involve NLTK
•  Programming heavy
– Create your own tokenizer, lemmatizer, POS
tagger, syntactic parser, word aligner, entity
extractor or some other NLP module

•  Linguistics heavy
– Analyze a corpus of data or compare a couple of
corpora

– Should go beyond running some NLTK code

Final Project Topics

•  Examples from last year
– Natural language interface to relational database
– Haiku bot
– Twitter classification
– Text normalization (currencies)
– Detecting challenging or uncommon words
– Question-answering system
– Spanish chat box
– Generating rhyme patterns

NLP Pipeline

Stages, in order:

•  Text cleanup: raw text processing
•  Tokenization/segmentation
•  Text normalization: stemming, lemmatization
•  POS tagging/morphological analysis
•  NP chunking, syntactic parsing
•  Semantic analysis

Tokenization

•  First level of abstraction
•  What are the basic units in your text
•  Knowing that ‘the’ is the same thing whether it
occurs as ‘the’ or ‘the.’

•  Usually does not include normalization
–  The man in the High Tower.
–  Tokenized as

•  The man in the High Tower .
– Not as

•  the man in the high tower .
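
A minimal sketch of the idea above, using only Python's `re` module: punctuation is split off as its own token, but case is left untouched. The function name `simple_tokenize` is made up for this illustration and is not part of NLTK.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens without changing case."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The man in the High Tower.")
print(tokens)

# Normalization (here: lowercasing) is a separate, later step.
normalized = [t.lower() for t in tokens]
print(normalized)
```

Keeping the two steps apart means later components (for example a named entity extractor) can still see the capitalization of "High Tower".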

Penn Treebank Tokenization

•  Lorillard Inc., the unit of New York-based Loews Corp.
that makes Kent cigarettes, stopped using crocidolite
in its Micronite cigarette filters in 1956.

•  Typically, money-fund yields beat comparable invest-
ments because portfolio managers can vary maturities
and go after highest rates.

•  Periods and hyphens are ambiguous
•  Hyphens originally not considered single tokens in PTB,
but later revised to context-dependent tokenization.

Penn Treebank Tokenization

•  Lorillard Inc. , the unit of New York - based Loews
Corp. that makes Kent cigarettes , stopped using
crocidolite in its Micronite cigarette filters in 1956 .

•  Typically , money-fund yields beat comparable
investments because portfolio managers can vary
maturities and go after highest rates .

•  Periods and hyphens are ambiguous
•  Hyphens originally not considered single tokens in PTB,
but later revised to context-dependent tokenization.
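
One common way to deal with the ambiguity of periods is an abbreviation list. The sketch below assumes a tiny hand-picked list (`ABBREVIATIONS`) and a made-up helper name (`split_final_periods`). It is not the Penn Treebank procedure, only an illustration of the idea.

```python
# Hypothetical abbreviation list; a real tokenizer would need a much larger one.
ABBREVIATIONS = {"Inc.", "Corp.", "U.S.", "Nov."}

def split_final_periods(text):
    """Split off commas and sentence-final periods, but keep abbreviation periods."""
    tokens = []
    for chunk in text.split():
        # Peel trailing commas, semicolons and colons off the chunk first.
        trailing = []
        while chunk and chunk[-1] in ",;:":
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        if chunk.endswith(".") and len(chunk) > 1 and chunk not in ABBREVIATIONS:
            # Not a known abbreviation: treat the period as a separate token.
            tokens.extend([chunk[:-1], "."])
        elif chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens

sentence = "Lorillard Inc., the unit of Loews Corp. that makes Kent cigarettes, stopped in 1956."
print(split_final_periods(sentence))
```

The sketch does not handle an abbreviation that also ends the sentence (as in "... sold in the U.S."), where one period does both jobs.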

Penn Treebank Tokenization

•  In the new position he will oversee Mazda ’s U.S.
sales , services , parts and marketing operations .

•  We did n’t have much of a choice .
•  U.S. trade officials said the Philippines and
Thailand would be the main beneficiaries of the
president ’s action .

•  Anything ’s possible -- how about the New Guinea
Fund ?

•  Contractions are separated out

Penn Treebank Tokenization

•  Assets of the 400 taxable funds grew by $1.5
billion during the latest week.

•  Exports in October stood $5.29 billion, a mere
0.7% increase from a year earlier, while
imports increased sharply to $5.39 billion, up
20% from last year.

Penn Treebank Tokenization

•  Assets of the 400 taxable funds grew by $ 1.5
billion during the latest week .

•  Exports in October stood $ 5.29 billion , a mere
0.7 % increase from a year earlier , while
imports increased sharply to $ 5.39 billion , up
20 % from last year .

•  Punctuation marks are their own tokens; this
includes $ and %, not just periods and commas
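
The splitting of currency and percent signs can be sketched with a single regular expression. The pattern and the name `tokenize_numbers` are assumptions made for this illustration; the actual Penn Treebank rules cover many more cases.

```python
import re

TOKEN_PATTERN = re.compile(r"""
    \d+(?:\.\d+)?     # numbers with an optional decimal part: 400, 1.5, 5.29
  | \w+(?:-\w+)*      # words, including hyphenated ones such as money-fund
  | [^\w\s]           # any other single symbol: $ % , .
""", re.VERBOSE)

def tokenize_numbers(text):
    """Return tokens, keeping decimal numbers whole and symbols separate."""
    return TOKEN_PATTERN.findall(text)

print(tokenize_numbers("Assets of the 400 taxable funds grew by $1.5 billion during the latest week."))
print(tokenize_numbers("a mere 0.7% increase"))
```

The number alternative comes first so that the period in 1.5 is not mistaken for sentence punctuation.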

Penn Treebank Tokenization

•  The federal government suspended sales of
the U.S. savings bonds because Congress
hasn’t lifted the ceiling on government debt.

•  The Treasury said the U.S. will default on Nov.
9 if Congress doesn’t act by then.

Penn Treebank Tokenization

•  The federal government suspended sales of
the U.S. savings bonds because Congress has
n’t lifted the ceiling on government debt .

•  The Treasury said the U.S. will default on Nov.
9 if Congress does n’t act by then .

•  Contractions are separated out
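
Separating contractions can be approximated per token with two regular expressions. This is a sketch under assumptions: the helper name `split_contraction` and the short clitic list are invented here and do not cover every case in the Penn Treebank guidelines.

```python
import re

# n't is split off as a unit: "doesn't" -> "does" + "n't"
NEGATION = re.compile(r"(\w+)(n['’]t)", re.IGNORECASE)
# Other clitics are split at the apostrophe: "Mazda's" -> "Mazda" + "'s"
CLITIC = re.compile(r"(\w+)(['’](?:s|re|ve|ll|d|m))", re.IGNORECASE)

def split_contraction(token):
    """Return the token as a list of one or two Treebank-style pieces."""
    for pattern in (NEGATION, CLITIC):
        match = pattern.fullmatch(token)
        if match:
            return [match.group(1), match.group(2)]
    return [token]

for word in ["doesn't", "hasn't", "Mazda's", "choice"]:
    print(word, "->", split_contraction(word))
```

Both the straight and the curly apostrophe are accepted, since text copied from slides or word processors often contains the curly form.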

Stemming and Lemmatization

•  The process of reducing inflected words to their
word stem (stemming) or dictionary form, the lemma
(lemmatization)

•  Strongly related, but there are differences
–  Stemmer usually does not take the context into
account and does not care about a word’s category

–  Lemmatizer includes dictionary lookup of wordforms
and uses the context of a word
•  for example, meeting should be mapped to meet if it is a
verb but not if it is a noun

–  Stemming is faster than lemmatization, but
can’t ensure that the resulting words are legitimate
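
The contrast can be made concrete with a toy example. Everything below is invented for illustration (`LEXICON`, `toy_stem`, `toy_lemmatize`); it is not how the NLTK stemmers or the WordNet lemmatizer are implemented.

```python
# Toy lexicon: (wordform, part of speech) -> lemma
LEXICON = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("women", "NOUN"): "woman",
}

def toy_stem(word):
    """Strip a few common suffixes, ignoring context and word category."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word, pos):
    """Look up the (word, pos) pair; fall back to the word itself."""
    return LEXICON.get((word, pos), word)

print(toy_stem("meeting"))               # same result for noun and verb
print(toy_lemmatize("meeting", "VERB"))
print(toy_lemmatize("meeting", "NOUN"))
print(toy_stem("denied"))                # the result is not a legitimate word
```

The stemmer needs no dictionary and no tagger, which is why it is fast; the lemmatizer needs both, which is why it can tell the noun from the verb.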

Stemming and Lemmatization

•  Regular expressions often used for suffix
stripping, but
– Can’t elegantly handle irregular patterns, e.g.,
women → woman

– Off-the-shelf stemmers (Lancaster and Porter) use
many special rules to handle these irregularities
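
A sketch of regex-based suffix stripping with an exception table for irregular forms. The rules are loosely modeled on plural handling and are an assumption for this example, not the actual Porter or Lancaster rule set; `IRREGULAR`, `SUFFIX_RULES` and `regex_stem` are hypothetical names.

```python
import re

# Irregular forms cannot be captured by a suffix rule, so they are listed explicitly.
IRREGULAR = {"women": "woman", "men": "man", "children": "child", "mice": "mouse"}

# Ordered rules: the first pattern that matches is applied.
SUFFIX_RULES = [
    (re.compile(r"sses$"), "ss"),    # caresses -> caress
    (re.compile(r"ies$"), "i"),      # flies -> fli
    (re.compile(r"(?<!s)s$"), ""),   # mules -> mule, but caress stays caress
]

def regex_stem(word):
    """Return the exception-table entry if there is one, else strip a suffix."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for pattern, replacement in SUFFIX_RULES:
        if pattern.search(word):
            return pattern.sub(replacement, word)
    return word

for w in ["caresses", "flies", "mules", "caress", "women"]:
    print(w, "->", regex_stem(w))
```

Every new irregular form needs its own table entry, which is the inelegance the slide refers to.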

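Tokenization in NLTK

from nltk import word_tokenize

# word_tokenize relies on NLTK's Punkt sentence tokenizer data; if it is missing,
# run nltk.download('punkt') once (newer NLTK releases may ask for 'punkt_tab').

text1 = """Lorillard Inc., the unit of New York-based Loews
Corp. that makes Kent cigarettes, stopped using crocidolite in
its Micronite cigarette filters in 1956."""

print(word_tokenize(text1))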

Stemming and Lemmatizing

from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

words = ['caresses', 'flies', 'dies', 'mules', 'denied',
         'died', 'agreed', 'owned', 'humbled', 'sized',
         'meeting', 'stating', 'siezing', 'itemization',
         'sensational', 'traditional', 'reference',
         'colonizer', 'plotted']

# The stemmer applies suffix rules only; it needs no extra data
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])

# The lemmatizer needs the WordNet data (nltk.download('wordnet')) and
# treats every word as a noun unless a pos argument is given
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w) for w in words])