COMP5900X Assignment 2 (Supplementary Materials)¶
Use this code to answer the questions in Assignment 2 Part 1.
Sentiment Analysis of IMDB Dataset¶
In the following, we’ll be building a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative). This will be done on movie reviews, using the IMDb dataset.
We will use:
• bidirectional LSTM
• multi-layer LSTM (Deep LSTM)
Preparing Data¶
One of the main concepts of TorchText is the Field. These define how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either “pos” or “neg”.
The parameters of a Field specify how the data should be processed.
We use the TEXT field to define how the review should be processed, and the LABEL field to process the sentiment.
Our TEXT field has tokenize = 'spacy' as an argument. This defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the spaCy tokenizer. If no tokenize argument is passed, the default is simply splitting the string on spaces.
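To see the difference, here is a minimal sketch (assuming spaCy and its English model are installed, as they are needed for the rest of this notebook anyway) comparing the spaCy tokenizer with a plain whitespace split:
import spacy

nlp = spacy.load('en')
sentence = "This film isn't great."

# spaCy splits off punctuation and contractions,
# e.g. ['This', 'film', 'is', "n't", 'great', '.']
print([tok.text for tok in nlp.tokenizer(sentence)])

# Splitting on spaces keeps "isn't" and "great." as single tokens:
# ['This', 'film', "isn't", 'great.']
print(sentence.split())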
LABEL is defined by a LabelField, a special subclass of the Field class specifically used for handling labels. We will explain the dtype argument later.
We also set the random seeds for reproducibility.
We’ll be using packed padded sequences, which will make our LSTM only process the non-padded elements of our sequence, and for any padded element the output will be a zero tensor. To use packed padded sequences, we have to tell the RNN how long the actual sequences are. We do this by setting include_lengths = True for our TEXT field. This will cause batch.text to now be a tuple with the first element being our sentence (a numericalized tensor that has been padded) and the second element being the actual lengths of our sentences.
In [0]:
import torch
from torchtext import data
from torchtext import datasets
SEED = 2019
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.LabelField(dtype = torch.float)
Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP).
The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as torchtext.datasets objects. It processes the data using the Fields we have previously defined. The IMDb dataset consists of 50,000 movie reviews, each marked as being a positive or negative review.
In [0]:
from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)
We can also check an example.
In [0]:
print(vars(train_data.examples[0]))
The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the .split() method.
By default this splits 70/30, however by passing a split_ratio argument, we can change the ratio of the split, i.e. a split_ratio of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set.
We also pass our random seed to the random_state argument, ensuring that we get the same train/validation split each time.
In [0]:
import random
train_data, valid_data = train_data.split(random_state = random.seed(SEED))
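As an aside, if we wanted the 80/20 split mentioned above instead of the default, we could pass split_ratio explicitly (a sketch, not run in this notebook):
# 80% training / 20% validation instead of the default 70/30 split
train_data, valid_data = train_data.split(split_ratio = 0.8,
                                          random_state = random.seed(SEED))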
We can see how many examples are in each split.
In [0]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')
Next, we have to build a vocabulary. This is effectively a lookup table where every unique word in your dataset has a corresponding index (an integer).
We do this because our machine learning model cannot operate on strings, only numbers. Each index is used to construct a one-hot vector for each word. A one-hot vector is a vector where all of the elements are 0 except one, which is 1, and whose dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.
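For example, a toy sketch of a one-hot vector (with a made-up vocabulary of $V = 5$ words, not our real one):
import torch

V = 5           # toy vocabulary size
word_index = 2  # index of some word in the toy vocabulary

one_hot = torch.zeros(V)
one_hot[word_index] = 1.0
print(one_hot)  # tensor([0., 0., 1., 0., 0.])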

The number of unique words in our training set is over 100,000, which means that our one-hot vectors would have over 100,000 dimensions! This would make training slow and the model possibly wouldn't fit on the GPU. We therefore keep only the top 25,000 most common words. What do we do with words that appear in examples but have been cut from the vocabulary? We replace them with a special unknown, or <unk>, token.
Instead of having our word embeddings initialized randomly, we initialize them with these pre-trained word embeddings. We get these vectors simply by specifying which vectors we want and passing them as an argument to build_vocab. TorchText handles downloading the vectors and associating them with the correct words in our vocabulary.
Here, we'll be using the "glove.6B.100d" vectors. GloVe is the algorithm used to calculate the vectors; 6B indicates these vectors were trained on 6 billion tokens and 100d indicates these vectors are 100-dimensional.
The theory is that these pre-trained vectors already have words with similar semantic meaning close together in vector space, e.g. “terrible”, “awful”, “dreadful” are nearby. This gives our embedding layer a good initialization as it does not have to learn these relations from scratch.
Note: these vectors are about 862MB, so watch out if you have a limited internet connection.
By default, TorchText will initialize words that are in your vocabulary but not in your pre-trained embeddings to zero. We don't want this, and instead initialize them randomly by setting unk_init to torch.Tensor.normal_. Those words will then be initialized from a Gaussian distribution.
In [0]:
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data,
                 max_size = MAX_VOCAB_SIZE,
                 vectors = "glove.6B.100d",
                 unk_init = torch.Tensor.normal_)
LABEL.build_vocab(train_data)
We can see the vocabulary directly using the itos (int-to-string) list.
In [0]:
print(TEXT.vocab.itos[:10])
We can see the vocab size.
In [0]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")
Why is the vocab size 25,002 and not 25,000? One of the additional tokens is the <unk> token and the other is the <pad> token.
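We can check the indices TorchText assigned to these two special tokens (by default they should be 0 and 1, though treat the exact values as an implementation detail):
print(TEXT.unk_token, TEXT.vocab.stoi[TEXT.unk_token])  # typically: <unk> 0
print(TEXT.pad_token, TEXT.vocab.stoi[TEXT.pad_token])  # typically: <pad> 1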
When we feed sentences into our model, we feed a batch of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same length. Thus, to ensure this, any sentence shorter than the longest one in the batch is padded.
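A toy sketch of what padding does (using pad_sequence directly and assuming the usual pad index of 1; the BucketIterator below handles this for us):
import torch
from torch.nn.utils.rnn import pad_sequence

# Two numericalized "sentences" of different lengths
seq_a = torch.tensor([4, 9, 12, 7])
seq_b = torch.tensor([5, 3])

# Pad the shorter one up to the length of the longest in the batch
batch = pad_sequence([seq_a, seq_b], padding_value = 1)
print(batch.shape)   # torch.Size([4, 2]) -> [max sent len, batch size]
print(batch[:, 1])   # tensor([5, 3, 1, 1]) -> trailing positions are padding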

The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.
We’ll use a BucketIterator which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.
We also want to place the tensors returned by the iterator on the GPU (if one is available). PyTorch handles this using torch.device, we then pass this device to the iterator.
One more thing: to use packed padded sequences, all of the tensors within a batch need to be sorted by their lengths. This is handled in the iterator by setting sort_within_batch = True.
In [0]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    sort_within_batch = True,
    device = device)
Build the Model¶
The next stage is building the model that we’ll eventually train and evaluate. Our three layers are an embedding layer, our RNN, and a linear layer.
The embedding layer is used to transform our sparse one-hot vector (sparse as most of the elements are 0) into a dense embedding vector (dense as the dimensionality is a lot smaller and all the elements are real numbers). This embedding layer is simply a single fully connected layer. As well as reducing the dimensionality of the input to the LSTM, there is the theory that words which have similar impact on the sentiment of the review are mapped close together in this dense vector space. We will initialize the embedding layer with GloVe embeddings.
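To make the "embedding layer is a fully connected layer" point concrete, here is a small sketch (toy sizes, not the real model) showing that an embedding lookup equals multiplying a one-hot vector by the embedding weight matrix:
import torch
import torch.nn as nn

V, D = 5, 3                      # toy vocabulary size and embedding dimension
embedding = nn.Embedding(V, D)

idx = torch.tensor([2])          # look up word index 2
one_hot = torch.zeros(1, V)
one_hot[0, 2] = 1.0

# The lookup and the matrix product give the same dense vector
print(torch.allclose(embedding(idx), one_hot @ embedding.weight))  # True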

The LSTM layer inside MyRNN takes in our dense vector $x_t$, the previous hidden state $h_{t-1}$, and the previous cell state $c_{t-1}$, and uses them to calculate the next hidden state $h_t$ and cell state $c_t$.
$$(h_t, c_t) = \text{LSTM}(x_t, h_{t-1}, c_{t-1})$$
Thus, the model using an LSTM looks something like (with the embedding layers omitted):

The initial cell state, $c_0$, like the initial hidden state is initialized to a tensor of all zeros.
Finally, the linear layer takes the final hidden state and feeds it through a fully connected layer, $f(h_T)$, transforming it to the correct output dimension.
Bidirectional RNN¶
As well as having an RNN processing the words in the sentence from the first to the last (a forward RNN), we have a second RNN processing the words in the sentence from the last to the first (a backward RNN).
In PyTorch, the hidden state (and cell state) tensors returned by the forward and backward RNNs are stacked on top of each other in a single tensor.
We make our sentiment prediction using a concatenation of the last hidden state from the forward LSTM (obtained from final word of the sentence), $h_T^\rightarrow$, and the last hidden state from the backward LSTM (obtained from the first word of the sentence), $h_T^\leftarrow$, i.e. $\hat{y}=f(h_T^\rightarrow, h_T^\leftarrow)$
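A quick shape check (illustration only, with one bidirectional layer and the embedding/hidden sizes used later in this notebook) of how the forward and backward hidden states are stacked and then concatenated:
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size = 100, hidden_size = 200, bidirectional = True)
x = torch.randn(30, 64, 100)      # [sent len, batch size, emb dim]

output, (hidden, cell) = lstm(x)
print(hidden.shape)               # torch.Size([2, 64, 200]) = [num directions, batch size, hid dim]

# Concatenate the final forward and backward hidden states
y_input = torch.cat((hidden[0], hidden[1]), dim = 1)
print(y_input.shape)              # torch.Size([64, 400])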
Multi-layer RNN¶
In multi-layer LSTM (also called deep LSTM) we add additional LSTMs on top of the initial standard LSTM, where each LSTM added is another layer. The hidden state output by the first (bottom) LSTM at time-step $t$ will be the input to the LSTM above it at time step $t$. The prediction is then made from the final hidden state of the final (highest) layer.
Implementation Details¶
We are not going to learn the embedding for the <pad> token, because padding tokens are irrelevant to determining the sentiment of a sentence. We do this by passing the index of the pad token as the padding_idx argument to nn.Embedding, so its embedding stays at whatever it is initialized to (we initialize it to all zeros later).
To use an RNN instead of the LSTM, you can use nn.RNN instead of nn.LSTM. Also, note that the LSTM returns the output and a tuple of the final hidden state and the final cell state, whereas the standard RNN only returns the output and the final hidden state.
As the final hidden state of our LSTM has both a forward and a backward component, which will be concatenated together, the size of the input to the nn.Linear layer is twice that of the hidden dimension size.
Implementing bidirectionality and adding additional layers are done by passing values for the num_layers and bidirectional arguments for the RNN/LSTM.
The LSTM has a dropout argument which adds dropout on the connections between hidden states in one layer to hidden states in the next layer.
As we are passing the lengths of our sentences to be able to use packed padded sequences, we have to add a second argument, text_lengths, to forward.
Before we pass our embeddings to the LSTM, we need to pack them, which we do with nn.utils.rnn.packed_padded_sequence. This will cause our LSTM to only process the non-padded elements of our sequence. The LSTM will then return packed_output (a packed sequence) as well as the hidden and cell states (both of which are tensors). Without packed padded sequences, hidden and cell are tensors from the last element in the sequence, which will most probably be a pad token, however when using packed padded sequences they are both from the last non-padded element in the sequence.
We then unpack the output sequence, with nn.utils.rnn.pad_packed_sequence, to transform it from a packed sequence to a tensor. The elements of output from padding tokens will be zero tensors (tensors where every element is zero). Usually, we only have to unpack output if we are going to use it later on in the model. Although we aren’t in this case, we still unpack the sequence just to show how it is done.
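A small standalone sketch of packing and unpacking (toy sizes; the lengths are in decreasing order, which is exactly what sort_within_batch = True guarantees), showing that the outputs at padded positions come back as zeros:
import torch
import torch.nn as nn

embedded = torch.randn(4, 2, 3)          # [sent len, batch size, emb dim]
lengths = [4, 2]                         # second sequence has two padded positions

packed = nn.utils.rnn.pack_padded_sequence(embedded, lengths)
lstm = nn.LSTM(3, 5)
packed_output, (hidden, cell) = lstm(packed)

output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)
print(output.shape)    # torch.Size([4, 2, 5])
print(output[2:, 1])   # all zeros: the padded positions of the second sequence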
The final hidden state, hidden, has a shape of [num layers $\times$ num directions, batch size, hid dim]. These are ordered [forward_layer_0, backward_layer_0, forward_layer_1, backward_layer_1]. As we want the final (top) layer forward and backward hidden states, we get hidden[2,:,:] and hidden[3,:,:], and concatenate them together before passing them to the linear layer (after applying dropout).
The input dimension is the dimension of the one-hot vectors, which is equal to the vocabulary size i.e. 25,002.
The embedding dimension is the size of the dense word vectors i.e. 100.
The hidden dimension is the size of the hidden states i.e. 200.
The output dimension is usually the number of classes; however, with only 2 classes the prediction can be expressed as a single number between 0 and 1, so the output can be 1-dimensional, i.e. a single scalar real number.
In [0]:
import torch.nn as nn

class MyRNN(nn.Module):
    def __init__(self):
        super().__init__()

        vocab_size = 25002
        embedding_dim = 100
        hidden_dim = 200
        output_dim = 1
        n_layers = 2
        bidirectional = True
        dropout = 0.5
        pad_idx = TEXT.vocab.stoi[TEXT.pad_token]

        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx = pad_idx)

        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim,
                            num_layers = n_layers,
                            bidirectional = bidirectional,
                            dropout = dropout)

        self.fc = nn.Linear(hidden_dim * 2, output_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self, text, text_lengths):

        #text = [sent len, batch size]

        embedded = self.embedding(text)

        #embedded = [sent len, batch size, emb dim]

        #pack sequence
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)

        packed_output, (hidden, cell) = self.lstm(packed_embedded)

        #unpack sequence
        output, output_lengths = nn.utils.rnn.pad_packed_sequence(packed_output)

        #output = [sent len, batch size, hid dim * 2]
        #output over padding tokens are zero tensors

        #hidden = [num layers * 2, batch size, hid dim]
        #cell = [num layers * 2, batch size, hid dim]

        #concat the final forward (hidden[2,:,:]) and backward (hidden[3,:,:]) hidden states
        #and apply dropout
        hidden = self.dropout(torch.cat((hidden[2,:,:], hidden[3,:,:]), dim = 1))

        #hidden = [batch size, hid dim * 2]

        return self.fc(hidden)
We now create an instance of our MyRNN class. We'll print out the total number of parameters in the model, as well as the number of parameters in each layer. For example, ('lstm.weight_ih_l0', 80000) indicates that there are 100$\times$200 parameters connecting the embeddings (dim=100) to the hidden layer (dim=200), plus 3$\times$ additional sets of parameters corresponding to the three gates of the LSTM, for a total of 100$\times$200$\times$4=80,000 parameters. Likewise, ('lstm.weight_hh_l0', 160000) indicates that there are 200$\times$200 parameters connecting the previous hidden state (dim=200) to the hidden layer (dim=200), plus 3$\times$ additional sets corresponding to the three gates, for a total of 200$\times$200$\times$4=160,000 parameters.
In [0]:
model = MyRNN()

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

print(MyRNN())

[(n, p.numel()) for n, p in MyRNN().named_parameters()]
Next we will copy the pre-trained word embeddings we loaded earlier into the embedding layer of our model.
We retrieve the embeddings from the field’s vocab, and check they’re the correct size, [vocab size, embedding dim]
In [0]:
pretrained_embeddings = TEXT.vocab.vectors
print(pretrained_embeddings.shape)
We then replace the initial weights of the embedding layer with the pre-trained embeddings.
Note: this should always be done on the weight.data and not the weight!
In [0]:
model.embedding.weight.data.copy_(pretrained_embeddings)
As our <unk> and <pad> tokens are not in the pre-trained vocabulary, they were initialized via unk_init (i.e. from a Gaussian distribution) when we built the vocab. It is preferable to set both of their embeddings to all zeros, to explicitly tell our model that, initially, they are irrelevant for determining sentiment.
We do this by manually setting their row in the embedding weights matrix to zeros. We get their row by finding the index of the tokens, which we have already done for the padding index.
In [0]:
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
EMBEDDING_DIM = 100
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)
print(model.embedding.weight.data)
We can now see that the first two rows of the embedding weights matrix have been set to zeros. As we passed the index of the pad token to the padding_idx argument of the embedding layer, it will remain zeros throughout training; the <unk> token embedding, however, will be learned.
Train the Model¶
Now we'll set up the training and then train the model. First, we'll create an optimizer. This is the algorithm we use to update the parameters of the model. We will use Adam instead of the SGD we used in Assignment 1. SGD updates all parameters with the same learning rate, and choosing this learning rate can be tricky. Adam adapts the learning rate for each parameter, giving parameters that are updated more frequently lower learning rates and parameters that are updated infrequently higher learning rates.
In [0]:
import torch.optim as optim
optimizer = optim.Adam(model.parameters())
Next, we’ll define our loss function. In PyTorch this is commonly called a criterion.
The loss function here is binary cross-entropy with logits. Our model currently outputs an unbounded real number. As our labels are either 0 or 1, we want to restrict the predictions to a number between 0 and 1, which we do with the sigmoid function. We then use this bounded scalar to calculate the loss using binary cross-entropy.
The BCEWithLogitsLoss criterion carries out both the sigmoid and the binary cross entropy steps.
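As a quick sanity check (a sketch with made-up numbers), BCEWithLogitsLoss applied to raw logits matches sigmoid followed by BCELoss, while being more numerically stable:
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])   # made-up "unbounded" model outputs
labels = torch.tensor([1.0, 0.0, 1.0])

loss_a = nn.BCEWithLogitsLoss()(logits, labels)
loss_b = nn.BCELoss()(torch.sigmoid(logits), labels)
print(loss_a.item(), loss_b.item())       # the two values agree (up to floating point error)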
Using .to, we can place the model and the criterion on the GPU (if we have one).
In [0]:
criterion = nn.BCEWithLogitsLoss()
model = model.to(device)
criterion = criterion.to(device)
Our criterion function calculates the loss, but we have to write our own function to calculate the accuracy.
This function first feeds the predictions through a sigmoid, squashing the values between 0 and 1, and then rounds them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).
We then calculate how many rounded predictions equal the actual labels and average it across the batch.
In [0]:
def binary_accuracy(preds, y):
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division
    acc = correct.sum() / len(correct)
    return acc
The train function iterates over all examples, one batch at a time.
model.train() is used to put the model in "training mode", which enables dropout (and batch normalization, though we aren't using batch normalization in this model).
For each batch, we first zero the gradients. Each parameter in the model has a grad attribute which stores the gradient computed by loss.backward(). PyTorch does not automatically clear (or "zero") the gradients from the previous update, so they must be zeroed manually.
As we have set include_lengths = True, our batch.text is a tuple with the first element being the numericalized tensor and the second element being the actual lengths of each sequence. We separate these into their own variables, text and text_lengths, before passing them to the model.
We then feed the batch of sentences, batch.text, into the model. Note, you do not need to do model.forward(batch.text), simply calling the model works. The squeeze is needed as the predictions are initially size [batch size, 1], and we need to remove the dimension of size 1 as PyTorch expects the predictions input to our criterion function to be of size [batch size].
The loss and accuracy are then calculated using our predictions and the labels, batch.label, with the loss being averaged over all examples in the batch.
In [0]:
def train(model, iterator, optimizer, criterion):

    epoch_loss = 0
    epoch_acc = 0

    model.train()

    for batch in iterator:

        optimizer.zero_grad()

        text, text_lengths = batch.text

        predictions = model(text, text_lengths).squeeze(1)

        loss = criterion(predictions, batch.label)
        acc = binary_accuracy(predictions, batch.label)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
evaluate is similar to train, with a few modifications as you don’t want to update the parameters when evaluating.
model.eval() puts the model in "evaluation mode", which disables dropout (and batch normalization, though we are not using batch normalization in this model).
No gradients are calculated on PyTorch operations inside the with no_grad() block. This causes less memory to be used and speeds up computation.
The rest of the function is the same as train, with the removal of optimizer.zero_grad(), loss.backward() and optimizer.step(), as we do not update the model’s parameters when evaluating.
In [0]:
def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_acc = 0

    model.eval()

    with torch.no_grad():

        for batch in iterator:

            text, text_lengths = batch.text

            predictions = model(text, text_lengths).squeeze(1)

            loss = criterion(predictions, batch.label)
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)
We also create a helper function to tell us how long our epochs take.
In [0]:
import time
def epoch_time(start_time, end_time):
elapsed_time = end_time – start_time
elapsed_mins = int(elapsed_time / 60)
elapsed_secs = int(elapsed_time – (elapsed_mins * 60))
return elapsed_mins, elapsed_secs
Finally, we train our model… At each epoch, if the validation loss is the best we have seen so far, we’ll save the parameters of the model and then after training has finished we’ll use that model on the test set.
In [0]:
N_EPOCHS = 5

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'best-model.pt')

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}%')
Finally, the metric we actually care about: the test loss and accuracy, which we get using the parameters that gave us the best validation loss.
In [0]:
model.load_state_dict(torch.load('best-model.pt'))

test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')
User Input¶
We can now use our model to predict the sentiment of any sentence we give it. Our predict_sentiment function does a few things:
• sets the model to evaluation mode
• tokenizes the sentence, i.e. splits it from a raw string into a list of tokens
• indexes the tokens by converting them into their integer representation from our vocabulary
• gets the length of our sequence
• converts the indexes, which are a Python list, into a PyTorch tensor
• adds a batch dimension by unsqueezing
• converts the length into a tensor
• squashes the output prediction to a real number between 0 and 1 with the sigmoid function
• converts the tensor holding a single value into a Python number with the item() method
We are expecting reviews with a negative sentiment to return a value close to 0 and positive reviews to return a value close to 1.
In [0]:
import spacy
nlp = spacy.load('en')

def predict_sentiment(model, sentence):
    model.eval()
    tokenized = [tok.text for tok in nlp.tokenizer(sentence)]
    indexed = [TEXT.vocab.stoi[t] for t in tokenized]
    length = [len(indexed)]
    tensor = torch.LongTensor(indexed).to(device)
    tensor = tensor.unsqueeze(1)
    length_tensor = torch.LongTensor(length)
    prediction = torch.sigmoid(model(tensor, length_tensor))
    return prediction.item()
An example negative review...
In [0]:
predict_sentiment(model, "This film is terrible")
An example positive review...
In [0]:
predict_sentiment(model, "This film is great")