worksheet07_solutions
COMP90051 Workshop 7¶
Recurrent neural networks (RNNs)¶
In this worksheet, we’ll implement a recurrent neural network (RNN) for sentiment analysis of movie reviews. The input to the network will be a movie review, represented as a string, and the output will be a binary label which is “1” if the sentiment is positive and “0” if the sentiment is negative. In practice, a network like this could be applied to social media monitoring—e.g. tracking customers’ perceptions of a new product, or prioritising support for disgruntled customers.
By the end of this worksheet, you should be able to:
convert raw text to a vector representation for input into a neural network
build a neural network with a recurrent architecture to take advantage of the order of words in text
use tf.data to feed data into a neural network as an alternative to NumPy arrays
In [1]:
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 108
import tensorflow as tf
from packaging import version
print("TensorFlow version: ", tf.__version__)
assert version.parse(tf.__version__) >= version.parse("2.3"), \
    "This notebook requires TensorFlow 2.3 or above."
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
TensorFlow version: 2.6.0
1. Large Movie Review (IMDb) Dataset¶
In order to train and evaluate our sentiment analysis model, we’ll use the Large Movie Review Dataset.
It contains 50,000 movie reviews crawled from the IMDb website, split evenly into train/test sets.
The dataset excludes neutral reviews (with a rating $> 4/10$ and $< 7/10$) and defines:
a positive review as having a rating $\geq 7/10$
a negative review as having a rating $\leq 4/10$.
More information about the dataset is available here.
We'll download the dataset using the TensorFlow Datasets (TFDS) library, which provides access to a variety of machine learning datasets.
First, we need to install the TFDS library using pip, since it's not included in core TensorFlow.
In [2]:
!pip install -q tensorflow-datasets
After installing the library, we can use it to download the dataset.
In [3]:
import tensorflow_datasets as tfds
(train_ds, test_ds), ds_info = tfds.load(name='imdb_reviews', split=('train', 'test'), as_supervised=True, with_info=True)
The train_ds and test_ds variables represent the train/test datasets respectively; however, they aren't stored in NumPy arrays as we're used to.
Instead, the TFDS library returns a special type of object called a TensorFlow Dataset.
A Dataset offers two main advantages over NumPy arrays:
It can represent large datasets without reading all of the data into memory. This is important for large-scale applications, where the dataset must be streamed from disk into memory.
It represents the output of a data pipeline (which may include transformations) rather than the data itself.
Since a Dataset is not guaranteed to completely exist in memory, random access (i.e. indexing) is not supported.
Instead, a Dataset behaves more like a Python iterable—we can only request the next element, which may be a single instance or a batch of instances.
Let's iterate over the first 3 elements in train_ds and print them out.
In [4]:
for x, y in train_ds.take(3).as_numpy_iterator():
    # decode binary string to UTF-8
    review = x.decode()
    # convert integer binary label to string
    label = "positive" if y == 1 else "negative"
    print("A {} review:\n".format(label), review, "\n")
A negative review:
This was an absolutely terrible movie. Don't be lured in by Christopher Walken or Michael Ironside. Both are great actors, but this must simply be their worst role in history. Even their great acting could not redeem this movie's ridiculous storyline. This movie is an early nineties US propaganda piece. The most pathetic scenes were those when the Columbian rebels were making their cases for revolutions. Maria Conchita Alonso appeared phony, and her pseudo-love affair with Walken was nothing but a pathetic emotional plug in a movie that was devoid of any real meaning. I am disappointed that there are movies like this, ruining actor's like Christopher Walken's good name. I could barely sit through it.
A negative review:
I have been known to fall asleep during films, but this is usually due to a combination of things including, really tired, being warm and comfortable on the sette and having just eaten a lot. However on this occasion I fell asleep because the film was rubbish. The plot development was constant. Constantly slow and boring. Things seemed to happen, but with no explanation of what was causing them or why. I admit, I may have missed part of the film, but i watched the majority of it and everything just seemed to happen of its own accord without any real concern for anything else. I cant recommend this film at all.
A negative review:
Mann photographs the Alberta Rocky Mountains in a superb fashion, and Jimmy Stewart and Walter Brennan give enjoyable performances as they always seem to do.
But come on Hollywood – a Mountie telling the people of Dawson City, Yukon to elect themselves a marshal (yes a marshal!) and to enforce the law themselves, then gunfighters battling it out on the streets for control of the town?
Nothing even remotely resembling that happened on the Canadian side of the border during the Klondike gold rush. Mr. Mann and company appear to have mistaken Dawson City for Deadwood, the Canadian North for the American Wild West.
Canadian viewers be prepared for a Reefer Madness type of enjoyable howl with this ludicrous plot, or, to shake your head in disgust.
Question: Based on the above training examples, what features do you think will be important for predicting positive/negative sentiment? If the words in the text were randomly shuffled, do you think accuracy would suffer?
Answer: The word order is probably not crucial for predicting sentiment. The presence or absence of certain words provides a strong signal, regardless of where they appear in the text. For example, the presence of words like “terrible”, “disappointed” and “ridiculous” is likely to be correlated with negative sentiment. However, mistakes could be made if word order is ignored—e.g. “terrible” may refer to an experience a character goes through in a movie, rather than the movie itself.
When working with Datasets, batching and random shuffling are treated as steps in the data input pipeline.
In the code block below, we add these steps to the pipeline in preparation for pre-processing and training:
First, we cache the entire dataset in memory. This is okay since the dataset is relatively small.
Second, we randomly shuffle the instances.
Third, we group the instances into batches of size 128.
Fourth, we call prefetch so that upcoming batches are prepared in the background during training, reducing latency.
In [5]:
train_ds = train_ds.cache()
train_ds = train_ds.shuffle(ds_info.splits['train'].num_examples)
train_ds = train_ds.batch(128)
train_ds = train_ds.prefetch(tf.data.experimental.AUTOTUNE)
test_ds = test_ds.cache()
test_ds = test_ds.shuffle(ds_info.splits['test'].num_examples)
test_ds = test_ds.batch(128)
test_ds = test_ds.prefetch(tf.data.experimental.AUTOTUNE)
2. Text preprocessing¶
In order to feed textual data into a neural network, we need to transform it into a vector representation.
This is often done in two steps:
Tokenisation.
The text is split into a sequence of words/tokens.
For example, when splitting on white space, the text “This was an absolutely terrible movie” would become [“This”, “was”, “an”, “absolutely”, “terrible”, “movie”].
Word vectorisation/embedding.
Individual words/tokens are mapped to a vector representation.
We’ll start with a simple one-hot vector encoding, and map to a lower-dimensional representation later on.
For the one-hot vector encoding, we choose a vocabulary of size $K$.
Each word in the vocabulary is associated with a dimension in the space, or equivalently a word id in the set $\{0, \ldots, K-1\}$.
Word ids 0 and 1 are reserved to represent padding and unknown words (also known as out-of-vocabulary words), respectively.
In between steps 1 and 2 above, we’ll also do some standardisation:
we’ll convert text to lowercase
we’ll remove HTML break tags “&lt;br /&gt;” (which you might have noticed in the third review we printed out above)
we’ll remove punctuation characters
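To make steps 1 and 2 concrete, here is a toy sketch (with a tiny made-up vocabulary, for illustration only) of tokenising a sentence by splitting on whitespace and mapping each token to an integer id, with id 0 reserved for padding and id 1 for unknown words:
# Toy illustration only: split on whitespace and look up ids in a made-up vocabulary.
# Id 0 is reserved for padding and id 1 for unknown (out-of-vocabulary) words.
toy_vocab = {"this": 2, "movie": 3, "was": 4, "terrible": 5}
def toy_vectorise(text, max_len=8):
    tokens = text.lower().split()
    ids = [toy_vocab.get(token, 1) for token in tokens]  # 1 = unknown word
    ids = ids[:max_len]                                  # truncate long sequences
    return ids + [0] * (max_len - len(ids))              # pad short sequences with 0
print(toy_vectorise("This movie was absolutely terrible"))
# [2, 3, 4, 1, 5, 0, 0, 0]  ("absolutely" is out of the toy vocabulary)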
We can perform all of these steps using a built-in layer from Keras called TextVectorization.
It does everything we need by default, apart from the standardisation, which we need to specify manually.
In the code block below, we define a function to perform standardisation.
In [6]:
import re
import string
def custom_standardisation(input_data):
    """
    Args:
        input_data : a Tensor of shape (batch_size,)
    Return:
        a Tensor of shape (batch_size,)
    """
    # Convert string to lowercase
    input_data = tf.strings.lower(input_data)
    # Remove break tags
    input_data = tf.strings.regex_replace(input_data, "<br />", " ")
    # Remove punctuation
    punctuation_pattern = "[%s]" % re.escape(string.punctuation)
    input_data = tf.strings.regex_replace(input_data, punctuation_pattern, "")
    return input_data
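As a quick sanity check, we can apply the function to a made-up example string (the exact tensor formatting shown in the comment is indicative only):
example = tf.constant(["This was GREAT!<br />Loved it."])
print(custom_standardisation(example))
# Expected: a string tensor containing b'this was great loved it'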
Next we define the TextVectorization layer.
Note that we’re limiting the size of the vocabulary to $K = 5000$.
We’ll also require that the sequences be of uniform length 300 to improve efficiency.
If a review contains more than 300 words, only the first 300 words will be kept.
If a review contains fewer than 300 words, the remaining “slots” will be filled with padding (id 0).
In [7]:
VOCAB_SIZE = 5000
MAX_LEN = 300
vectorise_layer = TextVectorization(max_tokens=VOCAB_SIZE, output_mode="int",
                                    standardize=custom_standardisation,
                                    output_sequence_length=MAX_LEN)
To set the vocabulary for the layer, we need to call the adapt method and pass in some training data.
In [8]:
# Training dataset without class labels
text_ds = train_ds.map(lambda x, y: x)
vectorise_layer.adapt(text_ds)
We can access the learned vocabulary using the get_vocabulary method.
The most frequently occurring words are assigned the smallest integer ids.
In [9]:
vectorise_layer.get_vocabulary()[:10]
Out[9]:
['', '[UNK]', 'the', 'and', 'a', 'of', 'to', 'is', 'in', 'it']
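Before looking at a whole batch, it can help to vectorise a single made-up sentence: common words map to small ids, rarer words map to id 1 ([UNK]), and the output is zero-padded up to MAX_LEN.
sample = tf.constant(["This movie was absolutely terrible"])
sample_ids = vectorise_layer(sample)
print(sample_ids.shape)           # (1, 300)
print(sample_ids[0, :8].numpy())  # first few ids; any out-of-vocabulary word shows up as 1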
To test our TextVectorization layer, let’s apply it to the first batch of training data and print out the first 3 examples in the batch.
In [10]:
for batch in text_ds.take(1).map(vectorise_layer).as_numpy_iterator():
    print(batch[0:3])
[[ 44 22 62 494 2 3979 4857 433 17 91 22 139 40 66
9 296 5 1 288 610 482 109 75 141 10 68 21 255
12 57 28 43 4214 12 11 17 7 40 4 696 3743 127
5 1 85 4 309 1 1162 7 2 930 1 15 4 1
16 4 1223 2593 12 43 33 1 3 1 1 3 11 1
2830 43 4 200 5 1 12 4917 503 127 28 1632 5 2
1 30 4 59 216 53 4 4446 2908 8 2 2685 1389 60
13 8 2 776 5 1 456 2 186 12 47 23 57 1083
2 1 2 92 1 7 46 425 5 1 52 1 8 1381
1 1 1 16 4 169 5 1 1 40 38 4 1575 1
5 1 37 4 1 6 440 725 1851 15 1 99 1080 15
11 63 34 23 21 2 166 17 525 18 2 760 182 2
313 145 4857 433 10 102 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]
[ 29 31 56 6 22 1774 148 22 75 6 298 2 497 2
256 55 337 307 55 454 27 544 69 6 66 11 19 16
24 155 1024 20 624 256 30 2 1 1 8 355 10 13
18 1 147 164 10 268 53 1061 242 48 10 13 308 18
10 76 109 2697 9 260 2 132 16 2789 4619 145 9 622
1 147 293 9 124 392 69 1 55 337 13 4 638 2912
3417 331 2981 2 273 15 2079 8 360 15 212 33 558 8
6 103 48 13 406 6 26 33 215 17 10 67 106 104
95 30 2 1 2537 6 11 1740 1 3 2 614 5 1
73 11 268 596 4401 1 41 2 4174 736 100 1178 2 731
10 13 1 18 53 30 12 4188 575 10 2729 2 1 100
308 171 622 236 10 746 10 13 124 2079 100 1 11 535
1173 240 560 1 19 4853 9 6 130 9 13 4 1 567
15 4 1 1 12 67 240 6 2802 24 85 1024 10 89
1762 55 337 15 650 69 18 9 13 4 221 72 15 4
516 6 2346 14 33 1118 10 124 161 9 176 1173 240 10
236 124 52 1502 16 2 491 5 2 1644 2912 3417 14 33
1509 574 14 23 31 2 1 1480 1 14 2 1 1 574
6 1 3626 14 2 1 1 1 120 33 1 283 2 479
491 7 329 6 590 35 2 1 3945 6 2 1 1 2
1348 1 2 1684 1364 5 11 1677 19 23 85 1059 2 1462
1169 2815 35 1 2 858]
[ 11 17 13 380 10 199 16 31 2 46 48 533 151 9
58 26 122 89 424 123 59 3469 1 2182 13 374 54 139
1227 6 1617 1 715 1 7 1522 4 951 287 2 921 27
90 8 2 17 10 97 109 26 123 246 13 2 1 5
48 142 642 3 2048 4012 54 13 1020 18 390 8 348 5
46 50 111 3018 103 40 38 1779 16 1 1 13 4 4293
122 41 99 79 1202 17 22 76 26 1 6 21 25 1030
123 3615 59 3438 10 115 22 7 77 4 49 35 703 2
2542 870 59 6 366 145 92 1398 125 1134 45 5 297 3
1633 6 2 81 36 25 446 106 9 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0]]
3. RNN-based classifier¶
Now that we’ve prepared a layer for preprocessing, we’re ready to implement the neural network.
The network will consist of three main components:
Embedding layer. This layer will map word vectors from a sparse one-hot encoded representation to a dense lower-dimensional embedding space. It is effectively equivalent to a dense layer with a linear activation function (see the sketch after this list).
RNN layer. This layer will sequentially apply an RNN cell to the densely-encoded word vectors. Since we’re doing classification, we only need to use the output at the end of the sequence. In other applications, such as language modelling, we would use the output from each element of the sequence.
Dense layers. These layers will process the final output of the RNN layer to produce a sentiment classification.
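As a minimal check of the claim in point 1 (illustration only, using a tiny randomly-initialised embedding), an Embedding lookup gives the same result as multiplying a one-hot row vector by the embedding weight matrix:
emb = layers.Embedding(input_dim=5, output_dim=3)   # tiny embedding for illustration
lookup = emb(tf.constant([2]))                      # direct lookup of word id 2 (builds the layer)
W = emb.get_weights()[0]                            # embedding matrix, shape (5, 3)
one_hot = tf.one_hot([2], depth=5)                  # one-hot row vector for word id 2
matmul = tf.matmul(one_hot, W)                      # dense layer with linear activation
print(np.allclose(lookup.numpy(), matmul.numpy()))  # True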
Exercise: Complete the following code block to implement an RNN-based sentiment classifier. You should add three layers:
an LSTM layer with L_UNITS = 32 units followed by
a Dense layer with D1_UNITS = 32 units, and
a Dense layer with 1 unit.
[Note: you may like to experiment with your own architecture. For example, you could use a SimpleRNN layer in place of an LSTM layer, or a bidirectional RNN.]
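For reference, two possible drop-in replacements for the LSTM layer mentioned in the note above (untested sketches, using the same 32 units as the solution below):
alt_rnn = layers.SimpleRNN(32)                    # plain RNN cell: fewer parameters, shorter memory
alt_bidi = layers.Bidirectional(layers.LSTM(32))  # reads the sequence in both directions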
In [11]:
DIM_EMBED = 32
L_UNITS = 32
D1_UNITS = 32
model = keras.Sequential([
    # Specify that the model will accept a 1D tensor of strings as input with
    # an unknown batch size
    keras.Input(shape=(None,), dtype=tf.string),
    # Vectorise the strings using the preprocessing layer we defined above
    vectorise_layer,
    # Map the sequence of one-hot encoded integer vectors to a
    # lower-dimensional embedding space. By setting `mask_zero=True`, we
    # instruct Keras to apply a mask so that downstream layers can ignore
    # padded parts of the sequence.
    layers.Embedding(VOCAB_SIZE, DIM_EMBED, mask_zero=True),
    # Sequentially apply an LSTM cell to the sequence of word embeddings,
    # returning only the final output
    layers.LSTM(L_UNITS),  # fill in
    # Apply two dense layers to output the probability of a positive sentiment
    layers.Dense(D1_UNITS, activation='relu'),  # fill in
    layers.Dense(1, activation='sigmoid')  # fill in
])
# Parameter counts for reference (with VOCAB_SIZE = 5000, MAX_LEN = 300,
# DIM_EMBED = 32, L_UNITS = 32, D1_UNITS = 32):
#   LSTM layer: 8320 parameters
#   SimpleRNN layer with the same units: 2080 parameters
# i.e. an LSTM has 4x the parameters of a SimpleRNN
# (one weight set for each of its three gates plus the cell update)
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
text_vectorization (TextVect (None, 300) 0
_________________________________________________________________
embedding (Embedding) (None, 300, 32) 160000
_________________________________________________________________
lstm (LSTM) (None, 32) 8320
_________________________________________________________________
dense (Dense) (None, 32) 1056
_________________________________________________________________
dense_1 (Dense) (None, 1) 33
=================================================================
Total params: 169,409
Trainable params: 169,409
Non-trainable params: 0
_________________________________________________________________
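The parameter counts in the summary can be verified by hand (a quick sanity check, mirroring the comments in the cell above):
embed_params = VOCAB_SIZE * DIM_EMBED                           # 5000 * 32 = 160000
rnn_params = DIM_EMBED * L_UNITS + L_UNITS * L_UNITS + L_UNITS  # SimpleRNN equivalent: 2080
lstm_params = 4 * rnn_params                                    # 4 weight sets (3 gates + cell update): 8320
dense1_params = L_UNITS * D1_UNITS + D1_UNITS                   # 1056
dense2_params = D1_UNITS * 1 + 1                                # 33
print(embed_params + lstm_params + dense1_params + dense2_params)  # 169409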
Since the final output is a scalar corresponding to the probability of a positive sentiment, we’ll minimise the binary cross-entropy loss.
In the code block below, we compile the model and run training for 15 epochs.
Training may take 10-15 min to complete on a CPU.
In [12]:
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=False),
              optimizer=keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
history = model.fit(train_ds, validation_data=test_ds, epochs=15)
Epoch 1/15
196/196 [==============================] – 36s 172ms/step – loss: 0.6920 – accuracy: 0.5359 – val_loss: 0.6893 – val_accuracy: 0.6243
Epoch 2/15
196/196 [==============================] – 32s 165ms/step – loss: 0.6011 – accuracy: 0.7258 – val_loss: 0.4830 – val_accuracy: 0.8002
Epoch 3/15
196/196 [==============================] – 32s 164ms/step – loss: 0.4112 – accuracy: 0.8421 – val_loss: 0.3892 – val_accuracy: 0.8473
Epoch 4/15
196/196 [==============================] – 32s 165ms/step – loss: 0.3319 – accuracy: 0.8800 – val_loss: 0.3466 – val_accuracy: 0.8621
Epoch 5/15
196/196 [==============================] – 32s 165ms/step – loss: 0.2862 – accuracy: 0.8973 – val_loss: 0.3252 – val_accuracy: 0.8708
Epoch 6/15
196/196 [==============================] – 32s 164ms/step – loss: 0.2574 – accuracy: 0.9079 – val_loss: 0.3194 – val_accuracy: 0.8722
Epoch 7/15
196/196 [==============================] – 33s 166ms/step – loss: 0.2370 – accuracy: 0.9160 – val_loss: 0.3270 – val_accuracy: 0.8693
Epoch 8/15
196/196 [==============================] – 34s 171ms/step – loss: 0.2191 – accuracy: 0.9237 – val_loss: 0.3079 – val_accuracy: 0.8740
Epoch 9/15
196/196 [==============================] – 33s 167ms/step – loss: 0.2070 – accuracy: 0.9288 – val_loss: 0.3189 – val_accuracy: 0.8724
Epoch 10/15
196/196 [==============================] – 32s 166ms/step – loss: 0.1974 – accuracy: 0.9328 – val_loss: 0.3220 – val_accuracy: 0.8710
Epoch 11/15
196/196 [==============================] – 33s 166ms/step – loss: 0.1879 – accuracy: 0.9376 – val_loss: 0.3404 – val_accuracy: 0.8666
Epoch 12/15
196/196 [==============================] – 32s 166ms/step – loss: 0.1798 – accuracy: 0.9411 – val_loss: 0.3427 – val_accuracy: 0.8662
Epoch 13/15
196/196 [==============================] – 33s 166ms/step – loss: 0.1734 – accuracy: 0.9438 – val_loss: 0.3530 – val_accuracy: 0.8650
Epoch 14/15
196/196 [==============================] – 32s 165ms/step – loss: 0.1657 – accuracy: 0.9478 – val_loss: 0.3512 – val_accuracy: 0.8639
Epoch 15/15
196/196 [==============================] – 33s 169ms/step – loss: 0.1605 – accuracy: 0.9517 – val_loss: 0.3974 – val_accuracy: 0.8598
Let’s plot loss curves to check for overfitting.
In [13]:
plt.plot(history.epoch, history.history['loss'], label='Train')
plt.plot(history.epoch, history.history['val_loss'], label='Test')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
The loss curves show that the model is overfitting: the training loss continues to decrease while the test loss bottoms out (around epoch 8) and then starts to rise.
Question: What actions can we take to prevent overfitting? What could be the main cause of overfitting in this case?
Answer:
It’s not surprising that the model is overfitting, as the training data is relatively small, and we have not applied any regularisation.
Looking at the number of parameters in the various layers, we see that the Embedding layer contains about 94% of the total parameters (160,000 of 169,409).
So it may be one of the main culprits for overfitting.
To reduce overfitting of the embedding layer in particular, we could try:
Reducing the size of the vocabulary and/or embedding space.
Applying dropout to the features in the embedding space. This can be done by adding a Dropout layer after the Embedding layer (see the sketch after this list).
Using a pre-trained embedding layer. TensorFlow Hub provides a repository of pretrained models/layers. For example, we could use the gnews-swivel-20dim word embedding which was trained on a 130GB corpus of English news articles.
Using more movie review data (if we can find some).
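For the dropout suggestion in particular, a minimal sketch of where the Dropout layer would go (the rate of 0.5 is arbitrary, and this variant is untrained):
model_dropout = keras.Sequential([
    keras.Input(shape=(None,), dtype=tf.string),
    vectorise_layer,
    layers.Embedding(VOCAB_SIZE, DIM_EMBED, mask_zero=True),
    layers.Dropout(0.5),  # randomly zero embedding features during training
    layers.LSTM(L_UNITS),
    layers.Dense(D1_UNITS, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])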
4. Testing the model¶
Since we integrated text preprocessing into the model, it’s straightforward to apply it to new test instances.
We can simply pass in a NumPy array of strings, and immediately get sentiment predictions.
Exercise: Test the sentiment classification model on a movie review of your own.
In [14]:
# fill in
my_review = "The plot is nothing, the narrative arc is all over the place, and the comedy is tired and sad."
print("My review:\n", my_review)
pred_sentiment = model.predict(np.expand_dims(my_review, 0))
print("Sentiment: ", pred_sentiment.squeeze())
My review:
The plot is nothing, the narrative arc is all over the place, and the comedy is tired and sad.
Sentiment: 0.19684121
Bonus: visualising the embedding space (optional)¶
Exercise: Run the code block below to save the word dictionary and weights in the Embedding layer as TSV files. You can then load these files in the Embedding Projector web app to visualise the embedding space.
In [15]:
import io
weights = model.layers[1].get_weights()[0]
vocab = model.layers[0].get_vocabulary()
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for index, word in enumerate(vocab):
    if index == 0:
        continue  # skip 0, it's padding.
    vec = weights[index]
    out_v.write('\t'.join([str(x) for x in vec]) + "\n")
    out_m.write(word + "\n")
out_v.close()
out_m.close()