NLTK Texts and Corpora

LING 131A, Fall 2018
Marc Verhagen, Brandeis University

Today

•  Quiz 1 results
– questions 8, 9, 11 and 12 updates

•  Putting list comprehensions to bed
– [ [ (x,y) for x in lst1 ] for y in lst2 ]

•  Animals, Dogs and a Zoo
– doctest revisited

•  Coding standards
•  NLTK Texts and Corpora
•  Assignment 3

Quiz 1 Results

[Figure: chart of Quiz 1 results; x-axis ticks 12 to 30, y-axis ticks 0 to 12]

List comprehensions
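
A minimal sketch of the nested comprehension from the agenda, with made-up values for lst1 and lst2 (assumptions, not from the original slide). It contrasts the nested form, which builds a list of lists, with the flat form, which builds one list of pairs.

```python
# Hypothetical sample data for illustration.
lst1 = [1, 2, 3]
lst2 = ["a", "b"]

# Nested comprehension: the outer loop (over lst2) creates one inner list
# per element y; each inner list pairs every x in lst1 with that y.
nested = [[(x, y) for x in lst1] for y in lst2]

# Flat comprehension: the for clauses are read left to right,
# exactly like nested for loops, and produce a single list.
flat = [(x, y) for y in lst2 for x in lst1]

print(nested)
print(flat)
```

Flattening the nested result row by row gives the same pairs, in the same order, as the flat comprehension.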

Animals in the Zoo
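
The class definitions discussed in class are not reproduced on this slide, so the Animal, Dog and Zoo classes below are hypothetical stand-ins. The sketch only illustrates how doctest examples in docstrings can double as tests.

```python
import doctest


class Animal:
    """A generic animal with a name (hypothetical example class).

    >>> Animal("Generic").speak()
    'Generic makes a sound'
    """

    def __init__(self, name):
        self.name = name

    def speak(self):
        return f"{self.name} makes a sound"


class Dog(Animal):
    """A dog is an animal that overrides speak().

    >>> Dog("Fido").speak()
    'Fido says woof'
    """

    def speak(self):
        return f"{self.name} says woof"


class Zoo:
    """A simple container of animals.

    >>> zoo = Zoo([Animal("Generic"), Dog("Fido")])
    >>> len(zoo)
    2
    >>> zoo.names()
    ['Generic', 'Fido']
    """

    def __init__(self, animals):
        self.animals = list(animals)

    def __len__(self):
        return len(self.animals)

    def names(self):
        return [animal.name for animal in self.animals]


if __name__ == "__main__":
    # Runs every >>> example in the docstrings and reports mismatches.
    doctest.testmod()
```

Running the file as a script prints nothing when all examples pass; adding -v on the command line lists each example as it is checked.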

Python style guide

•  https://www.python.org/dev/peps/pep-0008/
•  Install and run

$ pip3 install pep8
$ pep8 some_script.py

•  Key insight: code is read much more often than it is
written. Style guides are about consistency and
readability

•  But… always:
– Guides are just guides
– There is danger in getting distracted too much by the guide
– A Foolish Consistency is the Hobgoblin of Little Minds

Corpora

•  Large set of texts
– Structured (has levels of annotation)
– Balanced
– Samples of real-world text
• competence versus performance

•  Used for
– Hypothesis testing
– Statistical analysis
– Training examples for supervised machine learning

Distribution

•  “You shall know a word by the company it keeps” (J. R. Firth, 1957)

•  Distribution
– Frequency distribution
– Neighboring words
• Concordance/KWIC
• Collocations

– Similar words
• Words that have the same neighbors
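
To make these notions concrete, here is a rough standard-library sketch of a frequency distribution, neighboring-word counts, a KWIC concordance, and "similar words" defined as words that share neighbors. The toy sentence and the helper names kwic, contexts and similar are made up for illustration; NLTK's Text class offers concordance(), collocations() and similar() for real work, and its implementations differ from this sketch.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, already tokenized.
tokens = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat slept . the dog slept .").split()

# Frequency distribution: how often each word occurs.
freq = Counter(tokens)

# Neighboring words: counts of adjacent word pairs (bigrams).
bigrams = Counter(zip(tokens, tokens[1:]))


def kwic(tokens, keyword, width=2):
    """Keyword in context: `width` tokens on each side of every match."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines


def contexts(tokens):
    """Map each word to a Counter of its (previous, next) neighbor pairs."""
    ctx = defaultdict(Counter)
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        ctx[word][(prev, nxt)] += 1
    return ctx


def similar(tokens, word):
    """Rank other words by the number of contexts they share with `word`."""
    ctx = contexts(tokens)
    target = ctx.get(word, Counter())
    scores = Counter()
    for other, other_ctx in ctx.items():
        shared = target.keys() & other_ctx.keys()
        if other != word and shared:
            scores[other] = len(shared)
    return [w for w, _ in scores.most_common()]


print(freq.most_common(3))
print(kwic(tokens, "cat"))
print(similar(tokens, "cat"))
```

In this toy corpus "cat" and "dog" both occur between "the" and "sat" and between "the" and "slept", so "dog" is the only word reported as similar to "cat".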

Corpora in NLTK

Corpus reader

•  Define common interfaces for corpora that have
different formats

•  High-level tasks can then just use these common
interfaces

•  Sample corpus readers implemented in NLTK
– PlaintextCorpusReader, CategorizedPlaintextCorpusReader,
CategorizedTaggedCorpusReader,
BracketParseCorpusReader,
DependencyCorpusReader, WordListCorpusReader …
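
As a rough illustration of the common interface, the sketch below points a PlaintextCorpusReader at a temporary directory holding two made-up text files (the file names and contents are assumptions). It only uses fileids(), words() and raw(), which do not need any downloaded NLTK data; sents() is avoided because it needs the Punkt sentence tokenizer model.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical file contents for illustration.
samples = {
    "a.txt": "The cat sat on the mat.",
    "b.txt": "The dog slept.",
}

with tempfile.TemporaryDirectory() as root:
    for name, content in samples.items():
        with open(os.path.join(root, name), "w", encoding="utf-8") as f:
            f.write(content)

    # The second argument is a regular expression selecting the corpus files.
    corpus = PlaintextCorpusReader(root, r".*\.txt")

    fileids = corpus.fileids()
    # words() returns a lazy corpus view, so materialize it with list()
    # while the files still exist.
    words_a = list(corpus.words("a.txt"))
    total_words = len(corpus.words())
    raw_b = corpus.raw("b.txt")

print(fileids)
print(words_a)
print(total_words)
```

The same three calls work on the other reader classes, which is what lets higher-level code ignore the underlying file format.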

Frequency Distributions and
Conditional Frequency Distributions

•  Taking text provided by corpus readers, we
can build frequency distributions and
conditional frequency distributions, and from
them compute probabilities and conditional
probabilities

•  Using these distributions, we can inspect data,
or build language models or classifiers (See
Chapter 6 for building classifiers)
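
A small sketch of a frequency distribution built with nltk.FreqDist over a made-up token list (in practice the tokens would come from a corpus reader). The freq() method gives the relative frequency of a sample, which can serve as a simple probability estimate.

```python
from nltk import FreqDist

# Hypothetical tokens; a corpus reader's words() would normally supply these.
tokens = "the cat sat on the mat and the dog sat on the rug".split()

fd = FreqDist(tokens)

print(fd.N())             # total number of tokens counted
print(fd["the"])          # raw count of one sample
print(fd.freq("the"))     # relative frequency: count / total
print(fd.most_common(3))  # most frequent samples with their counts
print(fd.hapaxes())       # samples that occur exactly once
```

FreqDist is a subclass of collections.Counter, so the usual Counter operations also apply.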

Conditional Frequency Distributions

A collection of frequency distributions, one for each “condition”
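
A minimal sketch with nltk.ConditionalFreqDist, using genre as the condition. The two genres and their texts are invented for illustration; the constructor takes an iterable of (condition, sample) pairs.

```python
from nltk import ConditionalFreqDist

# Hypothetical mini corpus: one short text per genre.
texts = {
    "news": "the vote was held and the count was close",
    "romance": "the kiss was long and the night was warm",
}

# Build (condition, sample) pairs: the genre is the condition, the word the sample.
pairs = [(genre, word) for genre, text in texts.items() for word in text.split()]
cfd = ConditionalFreqDist(pairs)

print(sorted(cfd.conditions()))    # the conditions seen in the data
print(cfd["news"]["vote"])         # count of "vote" given genre "news"
print(cfd["romance"]["vote"])      # unseen samples have count 0
print(cfd["news"].freq("the"))     # relative frequency of "the" within "news"
```

Each cfd[condition] is an ordinary FreqDist, so freq() on it gives an estimate of the conditional probability of a word given the genre.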

NLTK

•  Text
•  FreqDist
•  CorpusReader
– PlaintextCorpusReader
– CategorizedTaggedCorpusReader

•  ConcatenatedCorpusView
•  StreamBackedCorpusView
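
As a quick illustration of the Text class, the sketch below wraps a made-up token list in nltk.text.Text. Only methods that need no downloaded data are used; collocations() is left out because it relies on NLTK's stopwords corpus.

```python
from nltk import FreqDist
from nltk.text import Text

# Hypothetical tokens; Text accepts any sequence of tokens,
# including the views returned by corpus readers.
tokens = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat slept . the dog slept .").split()

text = Text(tokens)

print(len(text))          # number of tokens
print(text.count("cat"))  # occurrences of one word
vocab = text.vocab()      # a FreqDist over the tokens
print(vocab.most_common(2))

# Prints each occurrence of the word with its surrounding context.
text.concordance("cat", width=40, lines=5)
```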