NLTK Texts and Corpora
LING 131A, Fall 2018
Marc Verhagen, Brandeis University
Today
• Quiz 1 results
– questions 8, 9, 11 and 12 updates
• Putting list comprehensions to bed
– [ [ (x,y) for x in lst1 ] for y in lst2 ]
• Animals, Dogs and a Zoo
– doctest revisited
• Coding standards
• NLTK Texts and Corpora
• Assignment 3
Quiz 1 Results
[Chart: distribution of Quiz 1 scores; axis ticks run 0–12 and 12–30]
List comprehensions
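To make the nested comprehension from the agenda concrete, here is a minimal sketch. The contents of lst1 and lst2 are made up for illustration; the point is the difference between the nested form (one inner list per y) and the flat form (a single list of pairs).

```python
# Illustrative inputs (assumed; the slide only names lst1 and lst2)
lst1 = [1, 2, 3]
lst2 = ["a", "b"]

# Nested comprehension: the outer loop is over lst2, and each y
# produces its own inner list of (x, y) pairs.
nested = [[(x, y) for x in lst1] for y in lst2]

# The same thing written as explicit loops.
nested_loop = []
for y in lst2:
    row = []
    for x in lst1:
        row.append((x, y))
    nested_loop.append(row)

# Flat comprehension: same pairs in the same order, but one list.
flat = [(x, y) for y in lst2 for x in lst1]

print(nested)  # [[(1, 'a'), (2, 'a'), (3, 'a')], [(1, 'b'), (2, 'b'), (3, 'b')]]
print(flat)    # [(1, 'a'), (2, 'a'), (3, 'a'), (1, 'b'), (2, 'b'), (3, 'b')]
```

Reading tip: in the nested form, the rightmost `for` is the outer loop; in the flat form, the leftmost `for` is the outer loop.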
Animals in the Zoo
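The in-class code is not reproduced on the slide, so the following is only a sketch of what an Animal/Dog/Zoo example with doctests might look like. The class bodies, method names (speak, add, roll_call) and animal names are all assumptions made for illustration.

```python
import doctest


class Animal:
    """A generic animal (hypothetical example class).

    >>> Animal("Rex").speak()
    'Rex makes a sound'
    """

    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"


class Dog(Animal):
    """A Dog is an Animal that overrides speak().

    >>> Dog("Fido").speak()
    'Fido says woof'
    """

    def speak(self):
        return f"{self.name} says woof"


class Zoo:
    """A container for animals.

    >>> zoo = Zoo()
    >>> zoo.add(Animal("Rex"))
    >>> zoo.add(Dog("Fido"))
    >>> zoo.roll_call()
    ['Rex makes a sound', 'Fido says woof']
    """

    def __init__(self):
        self.animals = []

    def add(self, animal):
        self.animals.append(animal)

    def roll_call(self):
        # Each animal answers with its own version of speak()
        return [animal.speak() for animal in self.animals]


if __name__ == "__main__":
    # Runs every >>> example in the docstrings and reports mismatches
    doctest.testmod()
```

Saved as a script, the examples can also be checked with `python3 -m doctest -v` followed by the file name.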
Python style guide
• https://www.python.org/dev/peps/pep-0008/
• Install and run
• Key insight: code is read much more often than it is
written. Style guides are about consistency and
readability
• But… always:
– Guides are just guides
– There is danger in getting distracted too much by the guide
– A Foolish Consistency is the Hobgoblin of Little Minds
$ pip3 install pep8
$ pep8 some_script.py
Corpora
• Large collection of texts
– Structured (has levels of annotation)
– Balanced
– Samples of real world text
• competence versus performance
• Used for
– Hypothesis testing
– Statistical analysis
– Training examples for supervised machine learning
Distribution
• “You know a word by the company it keeps”
• Distribution
– Frequency distribution
– Neighboring words
• Concordance/KWIC
• Collocations
– Similar words
• Words that have the same neighbors
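The ideas on this slide can be sketched in plain Python. This is not NLTK's implementation (NLTK's Text class offers concordance(), collocations() and similar() for real use); the toy sentence and the helper names kwic, contexts and similar are assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy "corpus" (assumed for illustration)
tokens = "the cat sat on the mat and the dog sat on the rug".split()


def kwic(tokens, keyword, width=2):
    """Key Word In Context: each hit with `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def contexts(tokens):
    """Map each word to the set of (previous, next) neighbor pairs."""
    ctx = defaultdict(set)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word].add((prev, nxt))
    return ctx


def similar(tokens, word):
    """Words that share at least one (previous, next) context with `word`."""
    ctx = contexts(tokens)
    target = ctx.get(word, set())
    return sorted(w for w in ctx if w != word and ctx[w] & target)


print(kwic(tokens, "sat"))
# [('the cat', 'sat', 'on the'), ('the dog', 'sat', 'on the')]
print(similar(tokens, "cat"))
# ['dog']
```

Here "cat" and "dog" come out as similar because both occur between "the" and "sat", which is the "same neighbors" idea in miniature.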
Corpora in NLTK
Corpus reader
• Define common interfaces for corpora that have
different formats
• High-level tasks can then just use these common
interfaces
• Sample corpus readers implemented in NLTK
– PlaintextCorpusReader, CategorizedPlaintextCorpusReader,
CategorizedTaggedCorpusReader,
BracketParseCorpusReader,
DependencyCorpusReader, WordlistCorpusReader …
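The "common interface" idea can be illustrated without NLTK. In the sketch below, PlainReader and TaggedReader are hypothetical stand-ins (not NLTK classes) for readers of two different formats; both expose a words() method, so the high-level function vocabulary_size does not need to know which format it is reading.

```python
class PlainReader:
    """Hypothetical reader for whitespace-separated plain text."""

    def __init__(self, text):
        self._text = text

    def words(self):
        return self._text.split()


class TaggedReader:
    """Hypothetical reader for word/TAG formatted text."""

    def __init__(self, text):
        self._text = text

    def words(self):
        # Strip the tag so callers see the same thing as with PlainReader
        return [tok.rsplit("/", 1)[0] for tok in self._text.split()]

    def tagged_words(self):
        # Extra, format-specific method: (word, tag) pairs
        return [tuple(tok.rsplit("/", 1)) for tok in self._text.split()]


def vocabulary_size(reader):
    """High-level task that relies only on the shared words() interface."""
    return len({w.lower() for w in reader.words()})


plain = PlainReader("The dog saw the cat")
tagged = TaggedReader("The/DT dog/NN saw/VBD the/DT cat/NN")

print(vocabulary_size(plain))   # 4
print(vocabulary_size(tagged))  # 4
print(tagged.tagged_words()[1])  # ('dog', 'NN')
```

NLTK's real corpus readers follow the same pattern on a larger scale, with methods such as words() and sents() shared across formats.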
Frequency Distributions and
Conditional Frequency Distributions
• Taking text provided by corpus readers, we
can build frequency distributions and
conditional frequency distributions (and from
them compute probabilities and conditional
probabilities)
• Using these distributions, we can inspect data,
or build language models or classifiers (See
Chapter 6 for building classifiers)
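A frequency distribution is essentially a table of counts. NLTK's FreqDist is a subclass of collections.Counter, so the dependency-free sketch below uses Counter directly; the toy token list and the helper name prob are assumptions for illustration.

```python
from collections import Counter

# Toy token list (assumed); in practice this would come from a corpus reader
tokens = "the cat sat on the mat and the dog sat on the rug".split()

# Count how often each word occurs
fdist = Counter(tokens)
total = sum(fdist.values())


def prob(word):
    """Relative frequency of `word`, a simple estimate of its probability."""
    return fdist[word] / total


print(fdist["the"])          # 4
print(fdist.most_common(1))  # [('the', 4)]
print(total)                 # 13
```

With NLTK itself, FreqDist adds conveniences on top of this, such as N() for the total count and freq(word) for the relative frequency.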
Conditional Frequency Distributions
A collection of frequency distributions, each one for a different “condition”
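That definition maps directly onto a dictionary of counters. The sketch below uses defaultdict(Counter) rather than NLTK's ConditionalFreqDist (which is itself built on defaultdict); the (condition, word) pairs and the helper name cond_prob are invented for illustration.

```python
from collections import Counter, defaultdict

# (condition, word) pairs; here the condition is a made-up genre label
pairs = [
    ("news", "the"), ("news", "will"), ("news", "the"),
    ("romance", "could"), ("romance", "the"), ("romance", "could"),
]

# One frequency distribution (Counter) per condition
cfd = defaultdict(Counter)
for condition, word in pairs:
    cfd[condition][word] += 1


def cond_prob(word, condition):
    """Estimate P(word | condition) from the counts for that condition."""
    counts = cfd[condition]
    total = sum(counts.values())
    return counts[word] / total if total else 0.0


print(sorted(cfd))              # ['news', 'romance']
print(cfd["news"]["the"])       # 2
print(cfd["romance"]["could"])  # 2
```

NLTK's ConditionalFreqDist accepts the same kind of (condition, sample) pairs and is indexed the same way, for example cfd["news"]["the"].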
NLTK
• Text
• FreqDist
• CorpusReader
– PlaintextCorpusReader
– CategorizedTaggedCorpusReader
• ConcatenatedCorpusView
• StreamBackedCorpusView