Topic 2: Text Documents and Pre-Processing
Feature Extraction
Sentence segmentation
Tokenisation
Regular expressions
Canonicalisation
Stemming and lemmatisation
Morphological Processes
Inflection and derivation
Morphological Analysers
The Porter stemmer
Finite State models
Data Science Group (Informatics)
NLE/ANLP
Autumn 2015 1 / 27
Document Pre-Processing
Figure: processing pipeline. Documents A, B and C → feature extractor → features for A, B and C → task specific module → output
Document Feature Extraction
Identifying the tokens and sentences that make up the document
Segment document into sentences — sentence segmentation
Segment sentences into sequences of tokens — tokenisation
Canonicalise tokens
— stemming or lemmatisation
Sentence Segmentation
Breaking a document into a sequence of sentences
Punkt sentence segmenter
T. Kiss and J. Strunk (2006) Unsupervised Multilingual Sentence Boundary Detection, Computational Linguistics, 32(4).
aclweb.org/anthology-new/J/J06/J06-4003.pdf
The default sentence segmenter in NLTK
nltk.googlecode.com/svn/trunk/doc/api/nltk.tokenize.punkt-module.html
Tricky Example
CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 each yesterday, according to lead underwriter L.F. Rothschild & Co.
Difficult to find reliable features that characterise uses of punctuation that mark the end of a sentence
Errors lead to problems for down-stream processing steps
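To see why the example is tricky, here is a minimal sketch of a naive splitter that treats every full stop followed by whitespace as a sentence boundary. The function name naive_split is hypothetical, and this is deliberately not how Punkt works; it only illustrates the failure mode.

```python
import re

TEXT = (
    "CELLULAR COMMUNICATIONS INC. sold 1,550,000 common shares at $21.75 "
    "each yesterday, according to lead underwriter L.F. Rothschild & Co."
)

def naive_split(text):
    """Split wherever a full stop is followed by whitespace."""
    pieces = re.split(r"(?<=\.)\s+", text)
    return [p for p in pieces if p]

sentences = naive_split(TEXT)
for s in sentences:
    print(repr(s))
```

The single sentence is cut into three pieces, after "INC." and after "L.F.", because the full stops in abbreviations look exactly like sentence-final ones to this rule.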
Punkt Sentence Segmenter
Language independent
Works directly with raw text (no annotation required)
Does not make use of any pre-compiled lists
Avoids dependence on orthographic information — works well even with lower case text
Punkt Sentence Segmenter
Possible to accurately identify abbreviations
— collocation of truncated word and period
— tend to be short
— often have internal periods
Rules for identification of abbreviations generalise across languages
99.2% accurate identification of abbreviations!
Tokenisation
What is tokenisation?
Text is just a sequence of characters
Useful to chunk into sequence of tokens
Things that can be tokens:
— word
— item of punctuation
— numerical quantities (e.g. £2,314.99)
Contractions
[…, “can’t”,…] ⇒ […, “ca”, “n’t”,…]
[…, “he’ll”,…] ⇒ […, “he”, “’ll”,…]
[…, “weren’t”,…] ⇒ […, “were”, “n’t”,…]
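The splits above can be sketched with a single regular expression. This is a minimal illustration under stated assumptions: the list of clitics and the function name split_contractions are hypothetical choices, not taken from any particular tokeniser.

```python
import re

# Clitics to split off; this short list is an illustrative assumption.
# Both the straight (') and the typographic (’) apostrophe are accepted.
CLITIC = re.compile(r"^(.+?)(n['’]t|['’](?:ll|re|ve|s|d|m))$", re.IGNORECASE)

def split_contractions(tokens):
    """Replace each contracted token by its two parts; leave others alone."""
    out = []
    for tok in tokens:
        m = CLITIC.match(tok)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out

print(split_contractions(["can't", "he'll", "weren't", "cat"]))
```

Whether "n't" should be a token of its own is exactly the kind of decision that affects later processing, e.g. negation handling.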
Tokenising in NLTK
Can use regular expressions to write a tokeniser
A regular expression can characterise the form of a token
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
Lots of examples discussed in NLTK textbook
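A regular-expression tokeniser can be approximated with the standard library alone. The sketch below uses re.findall with the three-way pattern \w+|\$[\d\.]+|\S+ (words, dollar amounts, any other run of non-whitespace); the helper name regexp_tokenise is hypothetical. The assumption is that NLTK's RegexpTokenizer returns the same matches for a pattern like this one, which has no capturing groups.

```python
import re

# Alternatives are tried left to right at each position:
# a word, then a dollar amount, then any other non-whitespace run.
TOKEN_PATTERN = r"\w+|\$[\d\.]+|\S+"

def regexp_tokenise(text, pattern=TOKEN_PATTERN):
    """Return all non-overlapping matches of pattern, in order."""
    return re.findall(pattern, text)

tokens = regexp_tokenise("Good muffins cost $3.88\nin New York.")
print(tokens)
# ['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.']
```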
Regular Expressions
What are they?
Convenient way of using a single expression to describe more than one possible string
Suppose you are looking for uses of colour or color
colou?r
Any word beginning and ending with letter s
s\w*s
Typically involves a matching process
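The two patterns can be tried out with Python's re module. One assumption is added here: the second pattern is wrapped in \b word boundaries so that it matches whole words rather than stretches inside longer words. The sample text is made up for illustration.

```python
import re

text = "The colour of the color swatch on the colr chart; she sells sea shells."

# The u is optional, so both spellings match; "colr" does not.
spellings = re.findall(r"colou?r", text)

# Whole words that begin and end with s.
s_words = re.findall(r"\bs\w*s\b", text)

print(spellings)  # ['colour', 'color']
print(s_words)    # ['sells', 'shells']
```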
Regular Expressions
^[Ww]hat (can|does) s?he mean\s+by\s+(\w+)\s*\.$
.      any character
\s     whitespace
\S     non-whitespace
\w     word character (letter, digit or underscore)
^      start anchor
$      end anchor
[ab]   alternative letters
(X|Y)  alternative strings
?      optional
+      one or more
*      any number
+?     one or more (minimal)
*?     any number (minimal)
\      escape
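A short sketch of the example pattern in use, followed by the difference between greedy and minimal matching. The test sentences are invented for illustration.

```python
import re

QUESTION = re.compile(r"^[Ww]hat (can|does) s?he mean\s+by\s+(\w+)\s*\.$")

m = QUESTION.match("What does she mean by   tokenisation .")
groups = m.groups()          # the two parenthesised parts
print(groups)                # ('does', 'tokenisation')

# Ends in "?" rather than ".", so the end of the pattern cannot match.
no_match = QUESTION.match("what can he mean by it?")
print(no_match)              # None

# Greedy + takes as much as it can; minimal +? takes as little as it can.
greedy = re.findall(r"<.+>", "<a><b>")
minimal = re.findall(r"<.+?>", "<a><b>")
print(greedy)                # ['<a><b>']
print(minimal)               # ['<a>', '<b>']
```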
Regular Expressions and Finite State Systems
Regular Expressions are equivalent to Finite State Machines
Tokenisation can be viewed as a process of translation
Corresponding abstract machine: Finite State Transducers
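To make the equivalence concrete, here is a hand-built finite state machine that accepts exactly the strings matched in full by colou?r. The state numbering and the function name accepts are choices made for this sketch.

```python
# Transition table: (current state, input character) -> next state.
TRANSITIONS = {
    (0, "c"): 1,
    (1, "o"): 2,
    (2, "l"): 3,
    (3, "o"): 4,
    (4, "u"): 5,  # the optional u
    (4, "r"): 6,  # skipping the u
    (5, "r"): 6,
}
ACCEPTING = {6}

def accepts(string):
    """Run the machine over string; accept if it ends in a final state."""
    state = 0
    for ch in string:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no arc for this character
            return False
    return state in ACCEPTING

print(accepts("color"), accepts("colour"), accepts("colr"))
```

A transducer extends this idea by attaching an output symbol to each arc, so that running the machine produces a translation of the input rather than just accept or reject.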
Tokenisation Issues
Decisions about what should constitute a token are crucial
Has impact on down-stream processing steps
Tokens will be the atomic units of meaning
Issue for consideration in Lab Session 3
Token Canonicalisation
Removing “unwanted” distinctions
— what makes a distinction “unwanted” is a matter of context
Token counts become less sparse
— token counts form basis of much of statistical text analysis
Morphological variants of a word
— might be useful when measuring document topical similarity
Spelling variants of a word
Collapsing synonyms to canonical term
— might be useful in biomedical text processing
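The effect on token counts can be sketched as follows. Case folding and the small VARIANTS table are assumptions made for illustration; which distinctions count as "unwanted" depends on the task.

```python
from collections import Counter

tokens = ["Start", "start", "starts", "colour", "color", "Colour"]

# Hypothetical mapping from variant forms to a canonical form.
VARIANTS = {"color": "colour", "starts": "start"}

def canonicalise(token):
    """Lower-case the token, then map known variants to one form."""
    lowered = token.lower()
    return VARIANTS.get(lowered, lowered)

raw_counts = Counter(tokens)
canonical_counts = Counter(canonicalise(t) for t in tokens)

print(raw_counts)        # six distinct tokens, each seen once
print(canonical_counts)  # two distinct tokens, each seen three times
```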
Morphology
Morphology:
The study of the way that words are built up out of smaller parts
“smaller parts” → morphemes
Morphemes:
The building blocks of words
The smallest meaningful units of a language.
Morphological processes:
Inflectional or derivational
Regular or irregular
Concatenative or non-concatenative
Morphology, Lemmatisation and Stemming
Some Examples of English Derivational Morphology
start (VERB): cause to happen or begin
starter (NOUN): thing that starts something
start + er (AFFIX): thing that acts
restart (VERB): start again
re (AFFIX): again + start (VERB)
restartable (ADJECTIVE): capable of being restarted
restart (VERB) + able (AFFIX): capable of being
Some Examples of English Inflectional Morphology
starts (VERB): causes to happen or begin
start + s (AFFIX): 3rd person, singular, present tense (regular verbs)
started (VERB): caused to happen or begin
start + ed (AFFIX): past tense (regular verbs)
starters (NOUN): things that start something
starter + s (AFFIX): plural (regular nouns)
Classes of Morpheme
Roots:
basic word form that is not further analysable in terms of inflectional or derivational morphology
e.g. start, stop, book, hazy, understand, policeman, …
Stems:
a word form which may be inflected by further affixes
e.g. start, restart, restarter, restartability, …
Affixes:
Prefixes: re(start), un(restartable), …
Suffixes: starter, starts, restartable,…
Also: Infixes and circumfixes (though not really in English)
OK, but what’s it good for anyway?!
Spell-checking
e.g. is antidisestablishmentarianisms a legal word?
Decoding word class and meaning:
proogable (ADJECTIVE?): capable of being prooged??
reproogers (NOUN?): things that can proog something again??
Providing efficient lexical representation:
No need to store regular word forms explicitly
Store base forms and have rules for generating/analysing
Canonicalising text data (stemming or lemmatisation):
reduce words to a common base form
e.g. start, starts, started, starting: all forms of start
Stemming and Lemmatisation
Stemming:
Removal of inflectional affixes: start-s, start-ed, start-ing
May conflate unrelated words: arm, army
And what about irregular words: sings, sang, sung ?
Lemmatisation:
Reduce to a ‘dictionary headword’: sings, sang, sung → SING
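The contrast can be sketched with two toy functions. The suffix list, the minimum stem length of three characters and the tiny LEMMAS table are all assumptions made for illustration; real stemmers and lemmatisers are considerably more careful.

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative inflectional suffixes
LEMMAS = {"sings": "sing", "sang": "sing", "sung": "sing"}  # toy headword table

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lookup_lemma(word):
    """Map a word form to its headword if it is in the table."""
    return LEMMAS.get(word, word)

print([naive_stem(w) for w in ["starts", "started", "starting"]])
print([naive_stem(w) for w in ["sings", "sang", "sung"]])
print([lookup_lemma(w) for w in ["sings", "sang", "sung"]])
```

Suffix stripping handles the regular forms of start but leaves sang and sung untouched, whereas the lookup maps all three forms of sing to one headword.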
Stemmers: Porter Stemmer
Lexicon free stemmer: nltk.PorterStemmer()
Rewrite rules:
ATIONAL → ATE (e.g. relational, relate)
FUL → ε (e.g. hopeful, hope)
SSES → SS (e.g. caresses, caress)
Errors of commission:
organization → organ
university → universe
Errors of omission:
urgency (not stemmed to urgent)
European (not stemmed to Europe)
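The three rewrite rules listed above can be applied with a few lines of Python. This is only a sketch of the rule format: the full Porter algorithm applies its rules in ordered steps and adds conditions on the remaining stem, none of which is modelled here. The function name apply_rewrite_rules is hypothetical.

```python
# (suffix, replacement) pairs for the three rules on the slide.
RULES = [
    ("ational", "ate"),
    ("ful", ""),
    ("sses", "ss"),
]

def apply_rewrite_rules(word):
    """Apply the first rule whose suffix matches; otherwise return word."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["relational", "hopeful", "caresses", "start"]:
    print(w, "->", apply_rewrite_rules(w))
```

With NLTK installed, nltk.PorterStemmer().stem(word) gives the output of the complete algorithm, which will differ from this sketch on many words.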
Lemmatisers: Morphy
A bit cleverer than the Porter Stemmer: nltk.corpus.wordnet.morphy()
Knowledge of inflectional morphology
An exception list for irregulars
Apply stemming rules, then:
compare the result to the WordNet dictionary
if result is a real word, then keep it, else use the original word.
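The procedure described above can be sketched with toy data. The exception list, suffix rules and dictionary below are small stand-ins invented for illustration; they are not WordNet's actual files, and toy_morphy is a hypothetical name.

```python
EXCEPTIONS = {"sang": "sing", "sung": "sing", "geese": "goose"}  # irregulars
DETACHMENT_RULES = [("es", ""), ("s", ""), ("ed", ""), ("ing", "")]
DICTIONARY = {"start", "fox", "bus", "sing", "goose"}  # stand-in for WordNet

def toy_morphy(word):
    """Exception list first, then suffix rules checked against the dictionary."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in DETACHMENT_RULES:
        if word.endswith(suffix):
            candidate = word[: len(word) - len(suffix)] + replacement
            if candidate in DICTIONARY:  # keep only real words
                return candidate
    return word  # no rule gave a real word: keep the original

for w in ["sang", "foxes", "starts", "started", "bus"]:
    print(w, "->", toy_morphy(w))
```

The dictionary check is what stops "bus" from being reduced to "bu", a mistake that a lexicon free stemmer has no way of detecting.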
See: http://wordnet.princeton.edu/man/morphy.7WN.html
Morphological Analysers: Finite State Automata
Finite state automata for lexical recognition and generation:
Figure (from Jurafsky & Martin, Ch 3): finite state automaton over states q0 to q5, with arcs labelled un-, ε, adj-root1, adj-root2, -er/-ly/-est (after adj-root1) and -er/-est (after adj-root2)
adj-root1: e.g. happy → unhappy, happier, happily, unhappiest, …
adj-root2: e.g. big → big, bigger, biggest (cf. *unbig, *bigly)
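A recogniser of this kind can be sketched as a transition table over morpheme classes. The state layout below is a simplified assumption and is not claimed to reproduce the figure exactly; the root lists are illustrative, the input is assumed to be already split into morphemes, and spelling changes such as happy + er → happier are ignored.

```python
ADJ_ROOT1 = {"happy", "real", "clear"}  # may take un-, -er, -ly, -est
ADJ_ROOT2 = {"big", "small"}            # may take only -er, -est

TRANSITIONS = {
    ("q0", "un-"): "q1",
    ("q0", "adj-root1"): "q2",
    ("q1", "adj-root1"): "q2",
    ("q2", "-er"): "q3", ("q2", "-ly"): "q3", ("q2", "-est"): "q3",
    ("q0", "adj-root2"): "q4",
    ("q4", "-er"): "q5", ("q4", "-est"): "q5",
}
FINAL = {"q2", "q3", "q4", "q5"}

def label(morpheme):
    """Map a root to its class; affixes label arcs directly."""
    if morpheme in ADJ_ROOT1:
        return "adj-root1"
    if morpheme in ADJ_ROOT2:
        return "adj-root2"
    return morpheme

def recognise(morphemes):
    """Accept a morpheme sequence if it leads from q0 to a final state."""
    state = "q0"
    for m in morphemes:
        state = TRANSITIONS.get((state, label(m)))
        if state is None:
            return False
    return state in FINAL

print(recognise(["un-", "happy"]), recognise(["big", "-er"]))
print(recognise(["un-", "big"]), recognise(["big", "-ly"]))
```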
Morphological Analysers: Finite State Transducers
Morphological parsing: mapping between lexical (morphemes) and surface (orthographic) representations:
lexical: c a t +N +PL
surface: c a t s
Figure: from Jurafsky & Martin, Ch 3
Kimmo Koskenniemi (1983) Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production.
Appropriate computational model: Finite State Transducer
Morphological Analysers: Finite State Transducers
Need to deal with spelling changes. For example:
cat + s → cats
BUT fox + s → foxes
In practice, make use of two mappings:
Figure: from Jurafsky & Martin, Ch 3
Lexical to intermediate level marks morphemes and boundaries
Intermediate to surface level deals with spelling changes
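The two mappings can be sketched as two small functions applied in sequence. These are plain string functions standing in for transducers, covering regular noun plurals only. The boundary symbols ^ (morpheme boundary) and # (word end), the tag strings and the function names are assumptions made for this sketch.

```python
import re

def lexical_to_intermediate(lexical):
    """Mark morphemes and boundaries: cat+N+PL -> cat^s#, cat+N+SG -> cat#."""
    if lexical.endswith("+N+PL"):
        return lexical[: -len("+N+PL")] + "^s#"
    if lexical.endswith("+N+SG"):
        return lexical[: -len("+N+SG")] + "#"
    raise ValueError(f"unsupported analysis: {lexical}")

def intermediate_to_surface(intermediate):
    """Apply the spelling change, then delete the boundary symbols."""
    # e-insertion: add e between a stem ending in x, s or z and the plural s
    with_e = re.sub(r"(x|s|z)\^s#", r"\1^es#", intermediate)
    return with_e.replace("^", "").replace("#", "")

for lexical in ["cat+N+PL", "fox+N+PL", "cat+N+SG"]:
    intermediate = lexical_to_intermediate(lexical)
    print(lexical, "->", intermediate, "->", intermediate_to_surface(intermediate))
```

Keeping the spelling rule in its own stage means it is stated once and applies to every stem, rather than being repeated in the lexicon.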
Next Topic: Document Classification
Classification scenarios
Sentiment analysis
Topic Relevance
Document filtering
Word list based classifiers