Sentiment Classification: classifying IMDB reviews¶
In this task, you will learn how to process text data and how to train neural networks on limited text data using pre-trained embeddings for sentiment classification (classifying a review document as “positive” or “negative” based solely on the text content of the review).
We will use the Embedding layer in Keras to represent the text input. The Embedding layer is best understood as a dictionary mapping integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks them up in an internal dictionary, and returns the associated vectors. It is effectively a dictionary lookup.
The Embedding layer takes as input a 2D tensor of integers of shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable lengths: for instance, we could feed the Embedding layer batches of shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15). All sequences within a single batch must have the same length, though (since we need to pack them into a single tensor), so sequences that are shorter than others should be padded with zeros, and sequences that are longer should be truncated.
This layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then be processed by an RNN layer or a 1D convolution layer.
You can instantiate the Embedding layer by randomly initialising its weights (its internal dictionary of token vectors). During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, your embedding space will show a lot of structure, a kind of structure specialised for the specific problem you were training your model on. Alternatively, you can instantiate the Embedding layer by initialising its weights with pre-trained word embeddings, such as GloVe word embeddings pre-trained on Wikipedia articles.
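As a quick illustration of this lookup behaviour (using arbitrary toy sizes rather than the values used later in this task, and assuming a Keras version whose Embedding layer accepts the input_length argument), the sketch below embeds a batch of integer sequences and prints the shape of the resulting 3D tensor.
In [ ]:
from keras.models import Sequential
from keras.layers import Embedding
import numpy as np

# Toy example: 1,000 possible token indices, each mapped to an 8-dimensional
# vector, with input sequences of length 10.
toy_model = Sequential()
toy_model.add(Embedding(1000, 8, input_length=10))
toy_model.compile(optimizer='rmsprop', loss='mse')

# A batch of 2 sequences, each consisting of 10 integer word indices
dummy_batch = np.random.randint(0, 1000, size=(2, 10))
print(toy_model.predict(dummy_batch).shape)  # (2, 10, 8): one 8-d vector per token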
a) Download the IMDB data as raw text¶
First, create a “data” directory, then head to http://ai.stanford.edu/~amaas/data/sentiment/ and download the raw IMDB dataset (if the URL isn’t working any more, just Google “IMDB dataset”). Save it into the “data” directory and uncompress it. Store the individual reviews in a list of strings, one string per review, and collect the review labels (positive / negative) in a separate labels list.
In [ ]:
import os
# write your code here
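One possible sketch (not the required solution) is shown below; it assumes the archive was uncompressed into data/aclImdb, the directory name used by the Stanford distribution, and reads the labelled reviews from its train sub-directory.
In [ ]:
import os

# Assumption: the downloaded archive was uncompressed into data/aclImdb
imdb_dir = os.path.join('data', 'aclImdb')
train_dir = os.path.join(imdb_dir, 'train')

texts = []   # one string per review
labels = []  # 0 = negative, 1 = positive

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            labels.append(0 if label_type == 'neg' else 1)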
b) Pre-process the review documents¶
Pre-process the review documents by tokenising them, and split the data into training and test sets. You can restrict the training data to the first 1,000 reviews and only consider the top 5,000 words in the dataset. You can also cut reviews after 100 words (that is, each review is truncated to at most 100 words).
In [ ]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np
# write your code here
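A minimal sketch of one way to do this, assuming texts and labels were built as in part (a); the shuffle before splitting is an extra step (the raw reviews are grouped by label), and the variable names are just suggestions.
In [ ]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100             # cut reviews after 100 words
training_samples = 1000  # restrict the training data to the first 1,000 reviews
max_words = 5000         # only consider the top 5,000 words in the dataset

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index  # full word -> integer index mapping

data = pad_sequences(sequences, maxlen=maxlen)  # pad / truncate to length 100
labels = np.asarray(labels)

# Shuffle before splitting, since the reviews are ordered negative-then-positive
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data, labels = data[indices], labels[indices]

x_train, y_train = data[:training_samples], labels[:training_samples]
x_test, y_test = data[training_samples:], labels[training_samples:]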
c) Download the GloVe word embeddings and map each word in the dataset to its pre-trained GloVe word embedding.¶
First go to https://nlp.stanford.edu/projects/glove/ and download the pre-trained embeddings from 2014 English Wikipedia into the “data” directory. It’s an 822MB zip file named glove.6B.zip, containing embedding vectors for 400,000 words (or non-word tokens); we will use the 100-dimensional vectors. Unzip it.
Parse the unzipped file (a plain-text file) to build an index mapping words (as strings) to their vector representations (as vectors of numbers).
Build an embedding matrix that will later be loaded into an Embedding layer. It must be a matrix of shape (max_words, embedding_dim), where row i contains the embedding_dim-dimensional vector for the word of index i in our reference word index (built during tokenisation). Note that index 0 is not supposed to stand for any word or token; it is a placeholder.
In [ ]:
# write your code here
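A possible sketch, assuming the zip was extracted into data/glove.6B and reusing word_index from part (b); max_words is set to 10,000 here so that the matrix matches the Embedding layer built in part (d).
In [ ]:
import os
import numpy as np

glove_dir = os.path.join('data', 'glove.6B')  # assumed unzip location
embedding_dim = 100

# Index mapping words (strings) to their 100-dimensional GloVe vectors
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Embedding matrix: row i holds the vector for the word of index i
max_words = 10000
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        vector = embeddings_index.get(word)
        if vector is not None:
            embedding_matrix[i] = vector  # words missing from GloVe stay all-zeros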
d) Build and train a simple Sequential model¶
The model contains an Embedding layer with the maximum number of tokens set to 10,000 and an embedding dimensionality of 100. Initialise the Embedding layer with the pre-trained GloVe word vectors. Set the maximum length of each review to 100. Flatten the 3D embedding output to 2D and add a Dense layer as the classifier. Train the model with the ‘rmsprop’ optimiser. You need to freeze the Embedding layer by setting its trainable attribute to False so that its weights are not updated during training.
In [ ]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense
# write your code here
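A minimal sketch under the assumptions above; the hidden Dense(32) layer, the number of epochs, the batch size and the 20% validation split are choices of this sketch, not requirements of the task.
In [ ]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_words, embedding_dim, maxlen = 10000, 100, 100

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())                       # (samples, maxlen * embedding_dim)
model.add(Dense(32, activation='relu'))    # optional hidden layer (an assumption)
model.add(Dense(1, activation='sigmoid'))  # binary sentiment classifier

# Load the pre-trained GloVe weights and freeze the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=32,
                    validation_split=0.2)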
e) Plot the training and validation losses and accuracies, and evaluate the trained model on the test set.¶
What do you observe from the results?
In [ ]:
import matplotlib.pyplot as plt
# write your code here
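One way to sketch this, assuming history, model, x_test and y_test from the previous parts, and that the accuracy metric was recorded under the key 'acc' (some Keras versions use 'accuracy' instead).
In [ ]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

# Evaluate on the held-out test set
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)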
f) Add an LSTM layer to the simple neural network architecture and re-train the model on the training set. Plot the training and validation losses/accuracies, then evaluate the trained model on the test set and report the result.¶
In [ ]:
from keras.layers import LSTM
# write your code here
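A possible sketch: the Flatten layer is replaced by an LSTM with 32 units (the unit count, epochs and validation split are arbitrary choices of this sketch), while the frozen pre-trained Embedding layer is kept as in part (d).
In [ ]:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(LSTM(32))                        # processes the embedded sequence
model.add(Dense(1, activation='sigmoid'))  # binary sentiment classifier

# Load the pre-trained GloVe weights and freeze the Embedding layer
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10, batch_size=32,
                    validation_split=0.2)

# Evaluate on the test set and report the result
test_loss, test_acc = model.evaluate(x_test, y_test)
print('Test accuracy:', test_acc)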