N-gram Language Models¶
In this notebook, we’ll be building bigram language models from scratch. The first part of building a language model is collecting counts from corpora. We’ll do some preprocessing first, lowercasing everything and adding <s> (start) and </s> (end) symbols at the beginning and end of each sentence. For bigrams, we use dictionaries of dictionaries with strings as keys, which is a convenient though not particularly memory-efficient way to represent things. We will use the unigram counts later for smoothing.
In [1]:
from collections import defaultdict
from collections import Counter
def convert_sentence(sentence):
    return ["<s>"] + [w.lower() for w in sentence] + ["</s>"]

def get_counts(sentences):
    bigram_counts = defaultdict(Counter)
    unigram_counts = Counter()
    start_count = 0  # "<s>" counts: need these for bigram probs
    # collect initial unigram statistics
    for sentence in sentences:
        sentence = convert_sentence(sentence)
        for word in sentence[1:]:  # from 1, so we don't count the <s> token
            unigram_counts[word] += 1
        start_count += 1
    # collect bigram counts
    for sentence in sentences:
        sentence = convert_sentence(sentence)
        # generate a list of bigrams
        bigram_list = zip(sentence[:-1], sentence[1:])
        # iterate over bigrams
        for bigram in bigram_list:
            first, second = bigram
            bigram_counts[first][second] += 1
    token_count = float(sum(unigram_counts.values()))
    return unigram_counts, bigram_counts, start_count, token_count
Once we have counts, we can use them to generate sentences. Here we use numpy.random.choice, which allows us to randomly choose from a list based on a corresponding list of probabilities, which we calculate by normalizing the raw counts. We start with <s>, generate the next word from the bigram counts that begin with <s>, then use that word to generate the next word, and so on. Generation stops when it produces an </s>. We return a string with some fixes to make it look like a proper sentence.
In [2]:
from numpy.random import choice
def generate_sentence(bigram_counts):
    current_word = "<s>"
    sentence = [current_word]
    while current_word != "</s>":
        # get counts for the previous word
        prev_word = current_word
        prev_word_counts = bigram_counts[prev_word]
        # obtain bigram probability distribution given the previous word
        bigram_probs = []
        total_counts = float(sum(prev_word_counts.values()))
        for word in prev_word_counts:
            bigram_probs.append(prev_word_counts[word] / total_counts)
        # sample the next word
        current_word = choice(list(prev_word_counts.keys()), p=bigram_probs)
        sentence.append(current_word)
    # get rid of start and end of sentence tokens
    sentence = " ".join(sentence[1:-1])
    return sentence
Now let’s try out our n-gram driven sentence generator on samples from two corpora: the Penn Treebank, and some out-of-copyright English literature from Project Gutenberg:
In [4]:
import nltk
from nltk.corpus import gutenberg, treebank
nltk.download('gutenberg')
nltk.download('treebank')
gutenberg_unigrams, gutenberg_bigrams, gutenberg_start_count, gutenberg_token_count = get_counts(gutenberg.sents())
print("Gutenberg")
for i in range(1, 6):
    print("Sentence %d" % i)
    print(generate_sentence(gutenberg_bigrams))

treebank_unigrams, treebank_bigrams, treebank_start_count, treebank_token_count = get_counts(treebank.sents())
print("Treebank")
for i in range(1, 6):
    print("Sentence %d" % i)
    print(generate_sentence(treebank_bigrams))
[nltk_data] Downloading package gutenberg to /Users/jason/nltk_data…
[nltk_data] Package gutenberg is already up-to-date!
[nltk_data] Downloading package treebank to /Users/jason/nltk_data…
[nltk_data] Unzipping corpora/treebank.zip.
Gutenberg
Sentence 1
the hivite under the porch that you off from earlier time reliev ‘ d by the sons of the days to shew a company generally cut : 3 : 29 : ‘ t enjoy his brethren .
Sentence 2
3 : else on the field and so doing .
Sentence 3
he took no advocate for lightning to him into the field .
Sentence 4
nor bring them that he will break !
Sentence 5
cetus , to take away , through such a little forgotten me in new continent of torrent of the den by the years , lest ye his cumbrous elements , lies shall pay his magic glass of their preparations of nature of which , and the full , south – gallery , and we both grow again .
Treebank
Sentence 1
the average rate of performance of the dam , could help *-1 off about 24,000 shares 0 it *exp*-2 ‘s recent months , the sound , hid his pyramids of spinoff company had the first time , at all in japan ‘s nomination to discuss the usia publicly traded * through the souper combo product development corp. , a certificate of deposit and general corp. reported cancer deaths since june 1 , editor and miles of america inc. said 0 employees , which *t*-57 is * to be assumed *-121 by architect julia morgan stanley & rubber , the 32 *u* fiscal 1989 , had been able *-2 across the bronx and loan and mrs. yeargin , which they are exercised *-2 to the dumpster ” -lrb- d. , a plant , campbell shares of the japanese industrial unit to get a share .
Sentence 2
* growing interest costs .
Sentence 3
your tv : “ considerable energy ‘s polystyrene foam , a u.s. should have *-1 noncompetitively to place *t*-1 are already had requested duty-free treatment *t*-15 per share .
Sentence 4
“ is peculiar to management books for customers whose sales of * encouraging new drugs .
Sentence 5
mr. mcgovern himself raising the u.s. this instance , gillett holdings ltd. has experienced a public-relations machine tied * prosecuting mrs. ward .
Generally, we can see some local coherence, but most of these sentences are complete nonsense. The sentences from the two corpora are also noticeably different: it’s obvious that the Project Gutenberg model is trained on literature, whereas the Penn Treebank data is financial. For the latter, there are some strange tokens (those starting with *) that we should probably have filtered out.
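If we wanted cleaner Treebank samples, one option (not done above) would be to strip those trace tokens before collecting counts. A minimal sketch, assuming we simply discard any token starting with *; the helper name is illustrative:
In [ ]:
def remove_trace_tokens(sentences):
    # drop Penn Treebank empty-element tokens such as *T*-1 or *U* (illustrative helper)
    cleaned = []
    for sentence in sentences:
        words = [w for w in sentence if not w.startswith("*")]
        if words:
            cleaned.append(words)
    return cleaned
# e.g. get_counts(remove_trace_tokens(treebank.sents()))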
Using language models to generate sentences is fun but not very useful. A more practical application is assigning a probability to a sentence. To do that for anything but toy examples, however, we will need to smooth our models so that a sentence isn’t assigned zero probability just because it contains a bigram we never saw in training. Here, we’ll test two fairly simple smoothing techniques, add-k smoothing and interpolated smoothing. In both cases, we will calculate the log probability, to avoid working with very small numbers. The functions below give the (log) probability of a single word given the previous word in a sentence.
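Written out, the add-k estimate implemented below is
$P_{add\text{-}k}(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i) + k}{C(w_{i-1}) + k|V|}$
where $C(\cdot)$ denotes counts from the training corpus and $|V|$ is the vocabulary size.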
Notice that interpolation is implemented using 3 probabilities: the bigram, the unigram and a “zerogram” probability. The “zerogram” refers to the probability of any word appearing at all. We need this extra term to account for out-of-vocabulary (OOV) words, which get zero probability from both the bigram and unigram distributions. Estimating the probability of OOV words is a general problem: here we use a heuristic, a uniform distribution over all words in the vocabulary (1 / |V|).
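Putting the three terms together, the interpolated estimate implemented below is
$P_{interp}(w_i \mid w_{i-1}) = \lambda_1 P(w_i \mid w_{i-1}) + \lambda_2 P(w_i) + (1 - \lambda_1 - \lambda_2) \frac{1}{|V|}$
where $\lambda_1$ and $\lambda_2$ are the bigram and unigram weights passed in as lambdas.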
In [6]:
import math
def get_log_prob_addk(prev_word, word, unigram_counts, bigram_counts, k):
    sm_bigram_counts = bigram_counts[prev_word][word] + k
    sm_unigram_counts = unigram_counts[prev_word] + k*len(unigram_counts)
    return math.log(sm_bigram_counts / sm_unigram_counts)

def get_log_prob_interp(prev_word, word, unigram_counts, bigram_counts, start_count, token_count, lambdas):
    bigram_lambda = lambdas[0]
    unigram_lambda = lambdas[1]
    zerogram_lambda = 1 - lambdas[0] - lambdas[1]
    # start by getting bigram probability
    sm_bigram_counts = bigram_counts[prev_word][word] * bigram_lambda
    if sm_bigram_counts == 0.0:
        interp_bigram_counts = 0
    else:
        if prev_word == "<s>":
            u_counts = start_count
        else:
            u_counts = unigram_counts[prev_word]
        interp_bigram_counts = sm_bigram_counts / float(u_counts)
    # unigram probability
    interp_unigram_counts = (unigram_counts[word] / token_count) * unigram_lambda
    # "zerogram" probability: this is to account for out-of-vocabulary words
    # this is just 1 / |V|
    vocab_size = len(unigram_counts)
    interp_zerogram_counts = (1 / float(vocab_size)) * zerogram_lambda
    return math.log(interp_bigram_counts + interp_unigram_counts + interp_zerogram_counts)
Extending this to calculate the probability of an entire sentence is trivial.
In [7]:
def get_sent_log_prob_addk(sentence, unigram_counts, bigram_counts, start_count, token_count, k):
    sentence = convert_sentence(sentence)
    bigram_list = zip(sentence[:-1], sentence[1:])
    return sum([get_log_prob_addk(prev_word,
                                  word,
                                  unigram_counts,
                                  bigram_counts,
                                  k) for prev_word, word in bigram_list])

def get_sent_log_prob_interp(sentence, unigram_counts, bigram_counts, start_count, token_count, lambdas):
    sentence = convert_sentence(sentence)
    bigram_list = zip(sentence[:-1], sentence[1:])
    return sum([get_log_prob_interp(prev_word,
                                    word,
                                    unigram_counts,
                                    bigram_counts,
                                    start_count,
                                    token_count,
                                    lambdas) for prev_word, word in bigram_list])

sentence = "revenue increased last quarter .".split()
print(get_sent_log_prob_addk(sentence, gutenberg_unigrams, gutenberg_bigrams, gutenberg_start_count,
                             gutenberg_token_count, 0.05))
print(get_sent_log_prob_interp(sentence,
                               gutenberg_unigrams,
                               gutenberg_bigrams,
                               gutenberg_start_count,
                               gutenberg_token_count,
                               (0.8, 0.19)))
print(get_sent_log_prob_addk(sentence, treebank_unigrams, treebank_bigrams, treebank_start_count,
                             treebank_token_count, 0.05))
print(get_sent_log_prob_interp(sentence,
                               treebank_unigrams,
                               treebank_bigrams,
                               treebank_start_count,
                               treebank_token_count,
                               (0.8, 0.19)))
-48.460406447015146
-49.538820071127226
-39.776378681452364
-39.245480234555636
The output for our sample sentence looks reasonable; in particular, the Treebank model assigns a noticeably higher probability, which is what we’d expect given the input sentence. The differences between the two smoothing techniques are more modest (though keep in mind this is a logarithmic scale). Now, let’s use perplexity to evaluate the different smoothing techniques at the level of a corpus. For this, we’ll use the Brown corpus, dividing it up randomly into a training set and a test set based on an 80/20 split.
In [8]:
from nltk.corpus import brown
from random import shuffle
nltk.download('brown')
sents = list(brown.sents())
shuffle(sents)
cutoff = int(0.8*len(sents))
training_set = sents[:cutoff]
test_set = [[word.lower() for word in sent] for sent in sents[cutoff:]]
brown_unigrams, brown_bigrams, brown_start_count, brown_token_count = get_counts(training_set)
[nltk_data] Downloading package brown to /Users/jason/nltk_data…
[nltk_data] Package brown is already up-to-date!
Since our probabilities are in log space, we will calculate perplexity in log space as well, then take the exponential at the end (here $m$ is the number of tokens in the test data $W$):
$PP(W) = \sqrt[m]{\frac{1}{P(W)}}$
$\log{PP(W)} = -\frac{1}{m} \log{P(W)}$
In [9]:
def calculate_perplexity(sentences, unigram_counts, bigram_counts, start_count,
                         token_count, smoothing_function, parameter):
    total_log_prob = 0
    test_token_count = 0
    for sentence in sentences:
        test_token_count += len(sentence) + 1  # have to consider the end token
        total_log_prob += smoothing_function(sentence,
                                             unigram_counts,
                                             bigram_counts,
                                             start_count,
                                             token_count,
                                             parameter)
    return math.exp(-total_log_prob / test_token_count)
Let’s see how our two smoothing techniques do with a range of possible parameter values.
In [10]:
print("add k")
for k in [0.0001, 0.001, 0.01, 0.05, 0.2, 1.0]:
    print(k)
    print(calculate_perplexity(test_set,
                               brown_unigrams,
                               brown_bigrams,
                               brown_start_count,
                               brown_token_count,
                               get_sent_log_prob_addk,
                               k))

print("interpolation")
for bigram_lambda in [0.98, 0.95, 0.75, 0.5, 0.25, 0.001]:
    unigram_lambda = 0.99 - bigram_lambda
    lambdas = (bigram_lambda, unigram_lambda)
    print(lambdas)
    print(calculate_perplexity(test_set,
                               brown_unigrams,
                               brown_bigrams,
                               brown_start_count,
                               brown_token_count,
                               get_sent_log_prob_interp,
                               lambdas))
add k
0.0001
747.860084740397
0.001
608.1037030464857
0.01
715.7003595336809
0.05
1033.197366149906
0.2
1681.6595463271226
1.0
3486.8451342027374
interpolation
(0.98, 0.010000000000000009)
762.7719951864839
(0.95, 0.040000000000000036)
585.2040019562442
(0.75, 0.24)
425.4470331471468
(0.5, 0.49)
422.80505963590605
(0.25, 0.74)
497.62931007291564
(0.001, 0.989)
980.3820682875372
Our results indicate that, with regard to perplexity, interpolation is generally better than add-k. A fairly low (though not too low) k works best for add-k, whereas the best lambdas are in the middle of the range, with a slight preference for putting more weight on the bigram probability, which makes sense.
From the basis given here, you can try playing around with some of the other corpora in NLTK and see if you get similar results. You could also implement a trigram model, or another kind of smoothing, to see if you can get better perplexity scores.
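As a starting point for the trigram extension, here is a minimal sketch of trigram count collection in the same style as get_counts above (it pads with an extra <s> so the first real word has a full two-token history); the smoothing and perplexity functions would need analogous changes:
In [ ]:
def get_trigram_counts(sentences):
    trigram_counts = defaultdict(Counter)
    for sentence in sentences:
        # extra <s> so the first real word has two tokens of context
        sentence = ["<s>"] + convert_sentence(sentence)
        for first, second, third in zip(sentence[:-2], sentence[1:-1], sentence[2:]):
            trigram_counts[(first, second)][third] += 1
    return trigram_counts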
In [ ]: