NLTK Texts and Corpora
LING 131A, Fall 2018
Marc Verhagen, Brandeis University
Today
• Assignment 1
• Assignment 2
• Quiz 1: content and examples
• Some loose ends on classes
– extra class
– variable access
– class methods
• NLTK texts and corpora
• Exercise
Quiz contents
• All lecture notes
• NLTK book chapters 1 and 2
– see LATTE for more precise info
• Questions
– multiple choice, mostly on Python
– open-ended NLTK questions
– a couple of open-ended Python programming questions
Loose end from last week
class Student:
    def __init__(self, n, a):
        self.full_name = n
        self.age = a
    def get_age(self):
        self.hair = "black"
        return self.age
    def get_hair_color(self):
        return self.hair
Loose end from last week
>>> bob = Student('Bob Smith', 23)
>>> bob.full_name      # Access an attribute.
'Bob Smith'
>>> bob.age            # Access an attribute.
23
>>> bob.hair           # Access an attribute.
??                     # This gives an AttributeError: hair has not been set yet.
>>> bob.get_age()      # Call a method (which also sets self.hair).
23
>>> bob.hair           # Access the attribute again.
??                     # Now it succeeds, returning 'black'.
Class methods
• Regular instance methods are associated with an instance of a class
• Class methods are associated with the class itself
>>> fluffy = Dog('fluffy')
>>> fluffy.get_name()
'fluffy'
>>> Dog.get_count()
1
class Dog(object):
    count = 0                     # class attribute, shared by all instances
    def __init__(self, name):
        self.name = name
        self.__class__.count += 1
    def get_name(self):           # regular instance method (used on the previous slide)
        return self.name
    @classmethod
    def get_count(cls):           # class method: receives the class, not an instance
        return cls.count

if __name__ == '__main__':
    d1 = Dog('fluffy')
    d2 = Dog('fido')
    print(Dog.get_count())        # prints 2
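A class method can also be called on an instance, for example fluffy.get_count(); the lookup goes through the instance's class, and the method still receives the class (not the instance) as its first argument.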
NLTK Texts and Corpora
Text Corpus
• Structured collection of texts
– That is, a corpus is usually built for some purpose
• Used for text analysis and for training ML models
• Some types:
– raw versus annotated
– monolingual versus multilingual
– text only versus multi-modal
– parallel/aligned/comparable
– Types in NLTK (see the example below)
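For example, a few of the corpora bundled with NLTK can be inspected like this (a quick sketch, not from the slides; it assumes the corpus data has been fetched with nltk.download()):

>>> from nltk.corpus import brown, gutenberg
>>> gutenberg.fileids()[:3]               # raw, monolingual, text-only corpus
>>> brown.categories()[:5]                # categorized (and POS-tagged) corpus
>>> brown.words(categories='news')[:10]   # tokens from one category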
Distribution
• "You shall know a word by the company it keeps" (J. R. Firth)
• Distribution
– Frequency distribution
– Neighboring words
• Concordance/KWIC
• Collocations
– Similar words
• Words that have the same neighbors (see the NLTK session below)
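NLTK's Text objects expose these notions directly. A short session, as a sketch, using text1 (Moby Dick) from nltk.book:

>>> from nltk.book import text1
>>> text1.concordance("monstrous")                # KWIC view: the word in its contexts
>>> text1.similar("monstrous")                    # words that occur in similar contexts
>>> text1.common_contexts(["monstrous", "very"])  # contexts shared by the two words
>>> text1.collocations()                          # bigrams that occur together unusually often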
Zipf’s Law
Given some text, the frequency of any word is inversely proportional to its rank in the frequency table.
• The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
• Only a small set of words (types) accounts for a large part of the text. For example, the Brown Corpus of American English has about a million words (tokens), and only 135 vocabulary items are needed to account for half of them.
FreqDist(text6).plot(20)
FreqDist(text6).plot(100)
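A minimal sketch of how such plots are produced with FreqDist (text6 is Monty Python and the Holy Grail from nltk.book):

>>> from nltk.book import text6
>>> from nltk import FreqDist
>>> fd = FreqDist(text6)
>>> fd.most_common(10)              # the short, very frequent head of the distribution
>>> fd.plot(20)                     # rank/frequency plot of the 20 most frequent words
>>> fd.plot(100, cumulative=True)   # cumulative plot shows how few types cover the text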
Bigrams and Collocations
• Collocations are special kinds of bigrams
– Mutual Information
– Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, Volume 16, Issue 1, March 1990. Pages 22-29.
– Defined as
MI(x, y) = log2( P(x, y) / (P(x) P(y)) )
Examples

MI      c(x)   c(y)   c(x,y)   bigram
11.05      8      8        8   Round Table
10.73     10     10       10   Pie Iesu
10.73     10     10       10   Iesu domine
 7.54      7     13        1   sacred quest
 6.00      7     38        1   join my
 2.85    107     22        1   it will
-1.85    204    299        1   you the
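Scores like these can be computed with NLTK's collocation tools (a sketch; NLTK's pmi measure uses the same log-base-2 formula, so the numbers should be comparable):

>>> from nltk.book import text6
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> finder = BigramCollocationFinder.from_words(text6)
>>> finder.score_ngrams(BigramAssocMeasures.pmi)[:10]   # (bigram, MI) pairs, highest first
>>> finder.nbest(BigramAssocMeasures.pmi, 10)           # just the top bigrams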
sacred quest
length of text6 is 16,967
P(x,y) = 1/16,967 = 0.0000589
P(x) = 7/16,967 = 0.0004126
P(y) = 13/16,967 = 0.0007662
MI(x,y) = log2(0.0000589 / (0.0004126 * 0.0007662))
= log2(186.3133) = 7.54
MI(x, y) = log2( P(x, y) / (P(x) P(y)) )
https://www.youtube.com/watch?v=YgYEuJ5u1K0
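The arithmetic in the worked example can be checked in a few lines (a sketch using Python's math module):

>>> import math
>>> N = 16967                        # tokens in text6
>>> p_xy = 1 / N                     # count('sacred quest') = 1
>>> p_x = 7 / N                      # count('sacred') = 7
>>> p_y = 13 / N                     # count('quest') = 13
>>> math.log2(p_xy / (p_x * p_y))    # ≈ 7.54, as in the table above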
NLTK
• Text
• FreqDist
• CorpusReader
– PlaintextCorpusReader
– CategorizedTaggedCorpusReader
• ConcatenatedCorpusView
• StreamBackedCorpusView
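A minimal sketch of pointing PlaintextCorpusReader at your own files (the directory name and file pattern are made up for illustration):

>>> from nltk.corpus import PlaintextCorpusReader
>>> reader = PlaintextCorpusReader('my_corpus', r'.*\.txt')   # hypothetical directory of .txt files
>>> reader.fileids()
>>> reader.words()[:20]   # returned as a lazy corpus view, not a plain list
>>> reader.sents()[:2]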