Lab08
Statistical Language Model (SLM)
A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability $P(w_1, \ldots, w_m)$ to the whole sequence. A common simplification is to assume that the probability of a word depends only on the previous $n - 1$ words. This is known as an n-gram model (a unigram model when $n = 1$).
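Concretely, the chain rule factorizes the joint probability, and the n-gram assumption truncates each conditioning history to the previous $n - 1$ words:
$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$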
Bigrams and Trigrams
An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence in the form of an (n − 1)-order Markov model. Using Latin numerical prefixes, an n-gram of size 1 is referred to as a “unigram”, size 2 is a “bigram”, and size 3 is a “trigram”. English cardinal numbers are sometimes used, e.g., “four-gram”, “five-gram”, and so on.
For example, the frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on.
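As a minimal sketch (using a made-up sentence rather than a real corpus), such a bigram frequency distribution can be computed with nltk.util.bigrams and collections.Counter:
from collections import Counter
from nltk.util import bigrams
# a made-up example sentence, tokenized into words
tokens = "the cat sat on the mat".split()
# count how often each adjacent word pair occurs
bigram_counts = Counter(bigrams(tokens))
print(bigram_counts.most_common(3))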
Let’s see how to build such a model with NLTK. First, let’s download the Reuters corpus and inspect it.
In [ ]:
import nltk
from nltk.util import bigrams, trigrams
from collections import Counter, defaultdict
from nltk.corpus import reuters
nltk.download('reuters')
!unzip -qq /root/nltk_data/corpora/reuters.zip -d /root/nltk_data/corpora
nltk.download('punkt')
[nltk_data] Downloading package reuters to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
Out[ ]:
True
In [ ]:
first_sentence = reuters.sents()[0]
first_sentence
Out[ ]:
['ASIAN',
 'EXPORTERS',
 'FEAR',
 'DAMAGE',
 'FROM',
 'U',
 '.',
 'S',
 '.-',
 'JAPAN',
 'RIFT',
 'Mounting',
 'trade',
 'friction',
 'between',
 'the',
 'U',
 '.',
 'S',
 '.',
 'And',
 'Japan',
 'has',
 'raised',
 'fears',
 'among',
 'many',
 'of',
 'Asia',
 "'",
 's',
 'exporting',
 'nations',
 'that',
 'the',
 'row',
 'could',
 'inflict',
 'far',
 '-',
 'reaching',
 'economic',
 'damage',
 ',',
 'businessmen',
 'and',
 'officials',
 'said',
 '.']
Now let’s see what the n-grams look like. More details can be found in the NLTK documentation for bigrams(), trigrams(), and ngrams().
In [ ]:
print("bigrams without pad: ", list(bigrams(first_sentence)))
print("bigrams with pad: ", list(bigrams(first_sentence, pad_left=True, pad_right=True)))
print("trigrams without pad: ", list(trigrams(first_sentence)))
print("trigrams with pad: ", list(trigrams(first_sentence, pad_left=True, pad_right=True)))
bigrams without pad: [('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.')]
bigrams with pad: [(None, 'ASIAN'), ('ASIAN', 'EXPORTERS'), ('EXPORTERS', 'FEAR'), ('FEAR', 'DAMAGE'), ('DAMAGE', 'FROM'), ('FROM', 'U'), ('U', '.'), ('.', 'S'), ('S', '.-'), ('.-', 'JAPAN'), ('JAPAN', 'RIFT'), ('RIFT', 'Mounting'), ('Mounting', 'trade'), ('trade', 'friction'), ('friction', 'between'), ('between', 'the'), ('the', 'U'), ('U', '.'), ('.', 'S'), ('S', '.'), ('.', 'And'), ('And', 'Japan'), ('Japan', 'has'), ('has', 'raised'), ('raised', 'fears'), ('fears', 'among'), ('among', 'many'), ('many', 'of'), ('of', 'Asia'), ('Asia', "'"), ("'", 's'), ('s', 'exporting'), ('exporting', 'nations'), ('nations', 'that'), ('that', 'the'), ('the', 'row'), ('row', 'could'), ('could', 'inflict'), ('inflict', 'far'), ('far', '-'), ('-', 'reaching'), ('reaching', 'economic'), ('economic', 'damage'), ('damage', ','), (',', 'businessmen'), ('businessmen', 'and'), ('and', 'officials'), ('officials', 'said'), ('said', '.'), ('.', None)]
trigrams without pad: [('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict', 'far'), ('inflict', 'far', '-'), ('far', '-', 'reaching'), ('-', 'reaching', 'economic'), ('reaching', 'economic', 'damage'), ('economic', 'damage', ','), ('damage', ',', 'businessmen'), (',', 'businessmen', 'and'), ('businessmen', 'and', 'officials'), ('and', 'officials', 'said'), ('officials', 'said', '.')]
trigrams with pad: [(None, None, 'ASIAN'), (None, 'ASIAN', 'EXPORTERS'), ('ASIAN', 'EXPORTERS', 'FEAR'), ('EXPORTERS', 'FEAR', 'DAMAGE'), ('FEAR', 'DAMAGE', 'FROM'), ('DAMAGE', 'FROM', 'U'), ('FROM', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.-'), ('S', '.-', 'JAPAN'), ('.-', 'JAPAN', 'RIFT'), ('JAPAN', 'RIFT', 'Mounting'), ('RIFT', 'Mounting', 'trade'), ('Mounting', 'trade', 'friction'), ('trade', 'friction', 'between'), ('friction', 'between', 'the'), ('between', 'the', 'U'), ('the', 'U', '.'), ('U', '.', 'S'), ('.', 'S', '.'), ('S', '.', 'And'), ('.', 'And', 'Japan'), ('And', 'Japan', 'has'), ('Japan', 'has', 'raised'), ('has', 'raised', 'fears'), ('raised', 'fears', 'among'), ('fears', 'among', 'many'), ('among', 'many', 'of'), ('many', 'of', 'Asia'), ('of', 'Asia', "'"), ('Asia', "'", 's'), ("'", 's', 'exporting'), ('s', 'exporting', 'nations'), ('exporting', 'nations', 'that'), ('nations', 'that', 'the'), ('that', 'the', 'row'), ('the', 'row', 'could'), ('row', 'could', 'inflict'), ('could', 'inflict', 'far'), ('inflict', 'far', '-'), ('far', '-', 'reaching'), ('-', 'reaching', 'economic'), ('reaching', 'economic', 'damage'), ('economic', 'damage', ','), ('damage', ',', 'businessmen'), (',', 'businessmen', 'and'), ('businessmen', 'and', 'officials'), ('and', 'officials', 'said'), ('officials', 'said', '.'), ('said', '.', None), ('.', None, None)]
Now, let’s build a trigram model using the Reuters corpus. Building a bigram model is completely analogous and easier.
In [ ]:
# create a model which contains the trigram counts
model = defaultdict(lambda: defaultdict(lambda: 0))
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
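For comparison, the analogous bigram counts (conditioning on a single previous word) could be collected with a sketch like the following; the bigram_model name is purely illustrative and is not used later in this lab:
# analogous bigram counts: how often w2 follows w1 (illustration only)
bigram_model = defaultdict(lambda: defaultdict(lambda: 0))
for sentence in reuters.sents():
    for w1, w2 in bigrams(sentence, pad_left=True, pad_right=True):
        bigram_model[w1][w2] += 1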
In [ ]:
# inspect the counts of some trigrams
print(model["what", "the"]["economists"])
print(model["what", "the"]["nonexistingword"])
# count of sentences starting with "The"
print(model[None, None]["The"])
2
0
8839
In [ ]:
# convert counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count
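This normalization is the maximum-likelihood estimate of each trigram probability:
$P(w_3 \mid w_1, w_2) = \dfrac{\text{count}(w_1, w_2, w_3)}{\sum_{w} \text{count}(w_1, w_2, w)}$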
In [ ]:
print(model["what", "the"]["economists"])
print(model["what", "the"]["nonexistingword"])
# probability of a sentence starting with "The"
print(model[None, None]["The"])
0.043478260869565216
0.0
0.16154324146501936
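For example, the first value works out to $2/46 \approx 0.0435$: “economists” accounts for 2 of the 46 observed continuations of the context (“what”, “the”). Likewise, the last value is the count 8839 divided by the total number of sentences in the corpus, i.e., the fraction of sentences that start with “The”.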
Now we have a trigram language model. Let’s generate some text; the output is actually quite readable.
In [ ]:
import random
text = [None, None]
sentence_finished = False
# Keep generating the next word until reaching the end of the sentence
while not sentence_finished:
    # Randomly select a probability threshold r
    r = random.random()
    accumulator = .0
    # Go through the possible w3 conditioned on the current w1 and w2
    for word in model[tuple(text[-2:])].keys():
        # Accumulate the probability
        accumulator += model[tuple(text[-2:])][word]
        # When the threshold is reached, use the current w3 as the next word to be generated
        if accumulator >= r:
            text.append(word)
            break
    # If the last two generated tokens are None, we have reached the end of the sentence and stop generating
    if text[-2:] == [None, None]:
        sentence_finished = True
# The generated sentence is as follows
' '.join([t for t in text if t])
Out[ ]:
'Johnson Geneva U . S . CORN GROWERS BLAST CANADA CORN DECISION UNJUSTIFIED - YEUTTER The U . K . TRADE There is a greater proportion of companies formed by & lt ; KAWS . T > and Saga Petroleum A / S of Stavanger has been raised by 4 . 60 dlrs Net loss 9 . 32 billion dlrs .'
Decoding Algorithms
In NLP tasks such as chatbots, text summarization, and machine translation, the model must predict a sequence of words.
It is common for models developed for these types of problems to output a probability distribution over each word in the vocabulary for each word in the output sequence. It is then left to a decoder process to transform the probabilities into a final sequence of words.
Decoding the most likely output sequence involves searching through all possible output sequences based on their likelihood. The vocabulary often contains tens or hundreds of thousands of words, or even millions. The number of candidate sequences therefore grows exponentially with the length of the output, and exhaustive search is intractable.
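To see why, note that with a vocabulary of size $|V|$ and an output of length $T$ there are $|V|^T$ complete sequences to score; for example, $|V| = 10{,}000$ and $T = 10$ already gives $10^{40}$ candidates.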
In practice, heuristic search methods are used to return one or more approximate or “good enough” decoded output sequences for a given prediction.
Candidate sequences of words are scored based on their likelihood. It is common to use a greedy search or a beam search to locate candidate sequences of text. We will look at both of these decoding algorithms now.
Greedy Decoder
A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence. This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.
We can demonstrate the greedy search approach to decoding with a small contrived example in Python. We start with a prediction problem that involves a sequence of 10 words, where each word is predicted as a probability distribution over a vocabulary of 5 words.
In [ ]:
from numpy import array
from numpy import argmax
In [ ]:
# define a sequence of 10 words over a vocab of 5 words
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
[0.5, 0.4, 0.3, 0.2, 0.1],
[0.1, 0.2, 0.3, 0.4, 0.5],
[0.5, 0.4, 0.3, 0.2, 0.1],
[0.1, 0.2, 0.3, 0.4, 0.5],
[0.5, 0.4, 0.3, 0.2, 0.1],
[0.1, 0.2, 0.3, 0.4, 0.5],
[0.5, 0.4, 0.3, 0.2, 0.1],
[0.1, 0.2, 0.3, 0.4, 0.5],
[0.5, 0.4, 0.3, 0.2, 0.1]]
data = array(data)
We will assume that the words have been integer encoded, such that the column index can be used to look up the associated word in the vocabulary. Therefore, the task of decoding becomes the task of selecting a sequence of integers from the probability distributions.
The argmax() mathematical function can be used to select the index of an array that has the largest value. We can use this function to select the word index that is most likely at each step in the sequence. This function is provided directly in numpy.
The greedy_decoder() function below implements this decoder strategy using the argmax function.
In [ ]:
# greedy decoder: pick only the single most likely word at each step
def greedy_decoder(data):
    # index of the largest probability in each row
    return [argmax(s) for s in data]
Running the example outputs a sequence of integers that could then be mapped back to words in the vocabulary.
In [ ]:
# decode sequence
result = greedy_decoder(data)
print(result)
[4, 0, 4, 0, 4, 0, 4, 0, 4, 0]
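For illustration only, if we attach a hypothetical five-word vocabulary, the decoded indices map straight back to tokens:
# hypothetical vocabulary, purely for illustration
vocab = ['a', 'b', 'c', 'd', 'e']
print([vocab[i] for i in result])  # ['e', 'a', 'e', 'a', 'e', 'a', 'e', 'a', 'e', 'a']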
Beam Search Decoder
Another popular heuristic is the beam search that expands upon the greedy search and returns a list of most likely output sequences.
Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.
We do not need to start with random states; instead, we start with the k most likely words as the first step in the sequence. Common beam widths are 1 (equivalent to greedy search) and 5 or 10 for typical machine translation benchmarks. Larger beam widths usually improve output quality, since keeping more candidate sequences increases the chance of finding one that matches the target well, but this comes at the cost of slower decoding.
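Because a sequence’s probability is a product of many small numbers, the decoder below scores each candidate by the sum of negative log probabilities, so the best candidate is the one with the lowest score:
$\text{score}(w_1, \ldots, w_T) = -\sum_{t=1}^{T} \log P_t(w_t)$
where $P_t$ is the model’s predicted distribution at step $t$.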
In [ ]:
from numpy import array
from numpy import argmax
from numpy import log
The beam_search_decoder() function below implements the beam search decoder.
In [ ]:
# beam search
def beam_search_decoder(data, k):
    sequences = [[list(), 0.0]]
    # walk over each step in the sequence
    for step, row in enumerate(data):
        all_candidates = list()
        # expand each current candidate
        for i in range(len(sequences)):
            seq, score = sequences[i]
            for j in range(len(row)):
                # we sum negative log probabilities, so the best candidate has the minimum score (highest probability)
                candidate = [seq + [j], score + (-log(row[j]))]
                all_candidates.append(candidate)
        # order all candidates by score
        ordered = sorted(all_candidates, key=lambda tup: tup[1])
        # select the k best
        sequences = ordered[:k]
        # display the k-best sequences at this step
        print("The", str(k), "best sequences at step ", str(step), ": ")
        print(sequences)
        print()
    return sequences
We can tie this together with the sample data from the previous section and this time return the 3 most likely sequences. Running the example prints both the integer sequences and their scores (sums of negative log probabilities, so lower is better).
In [ ]:
# decode sequence
result = beam_search_decoder(data, 3)
print()
print(“The final decoded 3 best sequences: “)
for seq in result:
print(seq)
The 3 best sequences at step 0 :
[[[4], 0.6931471805599453], [[3], 0.916290731874155], [[2], 1.2039728043259361]]
The 3 best sequences at step 1 :
[[[4, 0], 1.3862943611198906], [[4, 1], 1.6094379124341003], [[3, 0], 1.6094379124341003]]
The 3 best sequences at step 2 :
[[[4, 0, 4], 2.0794415416798357], [[4, 0, 3], 2.3025850929940455], [[4, 1, 4], 2.3025850929940455]]
The 3 best sequences at step 3 :
[[[4, 0, 4, 0], 2.772588722239781], [[4, 0, 4, 1], 2.995732273553991], [[4, 0, 3, 0], 2.995732273553991]]
The 3 best sequences at step 4 :
[[[4, 0, 4, 0, 4], 3.4657359027997265], [[4, 0, 4, 0, 3], 3.6888794541139363], [[4, 0, 4, 1, 4], 3.6888794541139363]]
The 3 best sequences at step 5 :
[[[4, 0, 4, 0, 4, 0], 4.1588830833596715], [[4, 0, 4, 0, 4, 1], 4.382026634673881], [[4, 0, 4, 0, 3, 0], 4.382026634673881]]
The 3 best sequences at step 6 :
[[[4, 0, 4, 0, 4, 0, 4], 4.852030263919617], [[4, 0, 4, 0, 4, 0, 3], 5.075173815233827], [[4, 0, 4, 0, 4, 1, 4], 5.075173815233827]]
The 3 best sequences at step 7 :
[[[4, 0, 4, 0, 4, 0, 4, 0], 5.545177444479562], [[4, 0, 4, 0, 4, 0, 4, 1], 5.768320995793772], [[4, 0, 4, 0, 4, 0, 3, 0], 5.768320995793772]]
The 3 best sequences at step 8 :
[[[4, 0, 4, 0, 4, 0, 4, 0, 4], 6.238324625039508], [[4, 0, 4, 0, 4, 0, 4, 0, 3], 6.461468176353717], [[4, 0, 4, 0, 4, 0, 4, 1, 4], 6.461468176353717]]
The 3 best sequences at step 9 :
[[[4, 0, 4, 0, 4, 0, 4, 0, 4, 0], 6.931471805599453], [[4, 0, 4, 0, 4, 0, 4, 0, 4, 1], 7.154615356913663], [[4, 0, 4, 0, 4, 0, 4, 0, 3, 0], 7.154615356913663]]
The final decoded 3 best sequences:
[[4, 0, 4, 0, 4, 0, 4, 0, 4, 0], 6.931471805599453]
[[4, 0, 4, 0, 4, 0, 4, 0, 4, 1], 7.154615356913663]
[[4, 0, 4, 0, 4, 0, 4, 0, 3, 0], 7.154615356913663]
Exercise
E1. Please describe two alternative solutions for preventing the zero-count issue in n-gram language models. Do not just name them; describe how they work.
Your answer: Check page 22 and the corresponding lecture recording.
E2. Neural Language Model
You are required to modify the example code below so that it works with beam search (k > 1).
Now, let’s see how to build a language model for generating natural language text by implementing and training a Recurrent Neural Network. The objective of this model is to generate new text given some input text. Let’s start building the architecture.
In [ ]:
import numpy as np
from numpy import array
from numpy import argmax
from numpy import log
Let’s use a popular nursery rhyme, “Cat and Her Kittens”, as our corpus. A corpus is simply a collection of text documents.
In [ ]:
import re
# Pad sequences to the max length
def pad_sequences_pre(input_sequences, maxlen):
    output = []
    for inp in input_sequences:
        if len(inp) < maxlen:
            output.append([0]*(maxlen-len(inp)) + inp)
        else:
            output.append(inp[:maxlen])
    return output

# Prepare the data
def dataset_preparation(data):
    corpus = data.lower().split("\n")
    normalized_text = []
    for string in corpus:
        tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
        normalized_text.append(tokens)
    tokenized_sentences = [sentence.strip().split(" ") for sentence in normalized_text]
    word_list_dict = {}
    for sent in tokenized_sentences:
        for word in sent:
            if word != "":
                word_list_dict[word] = 1
    word_list = list(word_list_dict.keys())
    word_to_index = {word: word_list.index(word) for word in word_list}
    total_words = len(word_list) + 1
    # create input sequences using lists of tokens
    input_sequences = []
    for line in tokenized_sentences:
        token_list = []
        for word in line:
            if word != "":
                token_list.append(word_to_index[word])
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    # pad sequences
    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences_pre(input_sequences, maxlen=max_sequence_len))
    # create predictors and label
    predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
    return predictors, np.array(label), max_sequence_len, total_words, word_list, word_to_index
data = '''The cat and her kittens
They put on their mittens
To eat a christmas pie
The poor little kittens
They lost their mittens
And then they began to cry.
O mother dear, we sadly fear
We cannot go to-day,
For we have lost our mittens
If it be so, ye shall not go
For ye are naughty kittens'''
predictors, label, max_sequence_len, total_words, word_list, word_to_index = dataset_preparation(data)
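As a quick sanity check (a hypothetical inspection cell; no output is shown here), one could print the shapes of the prepared arrays and the vocabulary size:
# inspect the prepared training data
print(predictors.shape, label.shape)  # one row per n-gram prefix
print(total_words, max_sequence_len)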
In [ ]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.metrics import accuracy_score

# Define the model
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim_1, hidden_dim_2, total_words):
        super(LSTMTagger, self).__init__()
        self.hidden_dim_1 = hidden_dim_1
        self.hidden_dim_2 = hidden_dim_2
        self.word_embeddings = nn.Embedding(total_words, embedding_dim)
        self.lstm1 = nn.LSTM(embedding_dim, hidden_dim_1, batch_first=True)
        self.lstm2 = nn.LSTM(hidden_dim_1, hidden_dim_2, batch_first=True)
        self.hidden2tag = nn.Linear(hidden_dim_2, total_words)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out_1, _ = self.lstm1(embeds)
        lstm_out_2, _ = self.lstm2(lstm_out_1)
        tag_space = self.hidden2tag(lstm_out_2[:, -1, :])
        # The reason we are using log_softmax here is that we want to calculate -log(p) and find the minimum score
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

# Parameter setting
EMBEDDING_DIM = 10
HIDDEN_DIM_1 = 150
HIDDEN_DIM_2 = 100
batch_size = predictors.shape[0]
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM_1, HIDDEN_DIM_2, total_words).cuda()
loss_function = nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
sentence = torch.from_numpy(predictors).cuda().to(torch.int64)
targets = torch.from_numpy(label).cuda().to(torch.int64)

# Training
for epoch in range(100):
    model.train()
    model.zero_grad()
    tag_scores = model(sentence)
    loss = loss_function(tag_scores, targets)
    loss.backward()
    optimizer.step()
    if epoch % 10 == 9:
        model.eval()
        _, predicted = torch.max(tag_scores, 1)
        prediction = predicted.view(-1).cpu().numpy()
        t = targets.view(-1).cpu().numpy()
        acc = accuracy_score(prediction, t)
        print('Epoch: %d, training loss: %.4f, training acc: %.2f%%' % (epoch+1, loss.item(), 100*acc))
Epoch: 10, training loss: 3.6768, training acc: 6.25%
Epoch: 20, training loss: 3.5086, training acc: 8.33%
Epoch: 30, training loss: 3.1279, training acc: 18.75%
Epoch: 40, training loss: 2.6786, training acc: 22.92%
Epoch: 50, training loss: 2.2757, training acc: 39.58%
Epoch: 60, training loss: 1.9705, training acc: 56.25%
Epoch: 70, training loss: 1.6983, training acc: 81.25%
Epoch: 80, training loss: 1.4756, training acc: 83.33%
Epoch: 90, training loss: 1.2986, training acc: 87.50%
Epoch: 100, training loss: 1.1193, training acc: 91.67%
The code below only works with k = 1; it does not keep multiple candidates. You need to modify it so that it works with k > 1.
In [ ]:
# convert an index back to its word
def ind_to_word(predicted_ind):
    for word, index in word_to_index.items():
        if index == predicted_ind:
            return word
    return ""

# get the top k most likely predicted results
def get_topK(predicted, k=1):
    # Get the indices of the k highest log-probabilities
    # Since the input is just one sentence, we can use [0] to extract the prediction result
    top_k = np.argsort(predicted[0])[-k:]
    # return a list of tuples
    # tuple[0]: word_id, tuple[1]: log(p)
    return [(id, predicted[0][id]) for id in top_k]

# To-Do: modify this function
# Generate text; currently it only works with k=1
# Hint: the easiest way is to modify the inner loop below, where only the top-1 prediction is kept, but this is not compulsory
def generate_text(seed_text, next_words, max_sequence_len, k=1):
    seed_candidates = [(seed_text, .0)]
    for _ in range(next_words):
        successives = []
        # if k = 1, len(seed_candidates) will always be 1
        for i in range(len(seed_candidates)):
            seed_text, score = seed_candidates[i]
            token_list = [word_to_index[word] for word in seed_text.split()]
            token_list = pad_sequences_pre([token_list], maxlen=max_sequence_len-1)
            seed_input = torch.from_numpy(np.array(token_list)).cuda().to(torch.int64)
            predicted = model(seed_input).cpu().detach().numpy()
            # Since this only works with k = 1, we can simply use [0] to get the word id and log(p)
            # However, if k = 3, you can't simply use [0] to get the candidates
            id, s = get_topK(predicted, k)[0]
            # get the output word
            output_word = ind_to_word(id)
            # append the word to the sentence and accumulate the score as -log(p)
            successives.append((seed_text + ' ' + output_word, score - s))
        # Keep the k lowest accumulated scores (highest accumulated probabilities)
        # and use them as the seed candidates for predicting the next word
        ordered = sorted(successives, key=lambda tup: tup[1])
        seed_candidates = ordered[:k]
    return seed_candidates[0][0]

print(generate_text("we naughty", 3, max_sequence_len, k=1))
# print(generate_text("we naughty", 3, max_sequence_len, k=3))
# Please note that k=1 and k=3 can produce the same output because this is only a small dataset.
we naughty lost their mittens
Sample output (your output may differ, since it depends on the trained model)
we naughty lost their mittens
E2 Sample Solution
In [ ]:
def generate_text(seed_text, next_words, max_sequence_len, k=1):
    seed_candidates = [(seed_text, .0)]
    for _ in range(next_words):
        successives = []
        for i in range(len(seed_candidates)):
            seed_text, score = seed_candidates[i]
            token_list = [word_to_index[word] for word in seed_text.split()]
            token_list = pad_sequences_pre([token_list], maxlen=max_sequence_len-1)
            seed_input = torch.from_numpy(np.array(token_list)).cuda().to(torch.int64)
            predicted = model(seed_input).cpu().detach().numpy()
            # expand every candidate with each of its top-k continuations
            for id, s in get_topK(predicted, k):
                output_word = ind_to_word(id)
                successives.append((seed_text + ' ' + output_word, score - s))
        # keep the k candidates with the lowest accumulated -log(p)
        ordered = sorted(successives, key=lambda tup: tup[1])
        seed_candidates = ordered[:k]
    return seed_candidates[0][0]

print(generate_text("we naughty", 3, max_sequence_len, k=1))
print(generate_text("we naughty", 3, max_sequence_len, k=3))
we naughty go to day
we naughty go to day