W05L1-1-WordEmbeddings
Using word embeddings¶
This notebook is based on the code samples found in Chapter 6, Section 1 of Deep Learning with Python and hosted on https://github.com/fchollet/deep-learning-with-python-notebooks.
Note that the original text features far more content, in particular further explanations and figures.
In [1]:
import tensorflow as tf
tf.config.experimental.list_physical_devices()
Out[1]:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
 PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'),
 PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'),
 PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]
In [2]:
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
In [3]:
from tensorflow import keras
keras.__version__
Out[3]:
'2.3.0-tf'
A popular and powerful way to associate a vector with a word is the use of dense “word vectors”, also called “word embeddings”.
While the vectors obtained through one-hot encoding are binary, sparse (mostly made of zeros) and very high-dimensional (same dimensionality as the
number of words in the vocabulary), “word embeddings” are low-dimensional floating point vectors
(i.e. “dense” vectors, as opposed to sparse vectors).
Unlike word vectors obtained via one-hot encoding, word embeddings are learned from data.
It is common to see word embeddings that are 256-dimensional, 512-dimensional, or 1024-dimensional when dealing with very large vocabularies.
On the other hand, one-hot encoding words generally leads to vectors that are 20,000-dimensional or higher (capturing a vocabulary of 20,000
tokens in this case). So, word embeddings pack more information into far fewer dimensions.
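To make the contrast concrete, here is a minimal sketch (not part of the original notebook; the vocabulary size of 10,000 and the embedding size of 8 are just illustrative) of what the two representations of a single word look like:
import numpy as np

vocab_size = 10000                   # assumed vocabulary size
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0                    # sparse: a single 1.0 among 10,000 zeros
embedding = np.random.uniform(-0.05, 0.05, size=8)  # dense: 8 learned floats
print(one_hot.shape, embedding.shape)  # (10000,) (8,)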
There are two ways to obtain word embeddings:
Learn word embeddings jointly with the main task you care about (e.g. document classification or sentiment prediction).
In this setup, you would start with random word vectors, then learn your word vectors in the same way that you learn the weights of a neural network.
Load into your model word embeddings that were pre-computed using a different machine learning task than the one you are trying to solve.
These are called “pre-trained word embeddings”.
Learning word embeddings with the Embedding layer¶
The simplest way to associate a dense vector with a word is to pick the vector at random and let the model learn the best values of the vector during training.
Keras provides the Embedding layer that facilitates this.
The Embedding layer is normally the first layer of the neural network.
The Embedding layer takes a word index as input.
In [4]:
from tensorflow.keras.layers import Embedding
# The Embedding layer takes at least two arguments:
# the number of possible tokens, here 1000 (1 + maximum word index),
# and the dimensionality of the embeddings, here 64.
embedding_layer = Embedding(1000, 64)
The Embedding layer is best understood as a dictionary mapping integer indices (which stand for specific words) to dense vectors. It takes
integers as input, looks them up in an internal dictionary, and returns the associated vectors. It's effectively a dictionary lookup.
The Embedding layer takes as input a 2D tensor of integers, of shape (samples, sequence_length), where each entry is a sequence of
integers. It can embed sequences of variable lengths, so for instance we could feed into our embedding layer above batches that could have
shapes (32, 10) (batch of 32 sequences of length 10) or (64, 15) (batch of 64 sequences of length 15). All sequences in a batch must
have the same length, though (since we need to pack them into a single tensor), so sequences that are shorter than others should be padded
with zeros, and sequences that are longer should be truncated.
This layer returns a 3D floating point tensor, of shape (samples, sequence_length, embedding_dimensionality). Such a 3D tensor can then
be processed by an RNN layer or a 1D convolution layer (both will be introduced in the next sections).
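As a quick sanity check (illustrative, not in the original notebook), you can call the embedding_layer defined above on a small integer batch and inspect the output shape; the index values below are arbitrary:
import numpy as np

# Batch of 2 sequences of length 5; indices must be < 1000 for this layer.
dummy_batch = np.array([[4, 20, 7, 0, 0],
                        [1, 2, 3, 4, 5]])
output = embedding_layer(dummy_batch)
print(output.shape)  # expected: (2, 5, 64), i.e. (samples, sequence_length, embedding_dim)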
When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just like with any
other layer. During training, these word vectors will be gradually adjusted via backpropagation, structuring the space into something that the
downstream model can exploit. Once fully trained, your embedding space will show a lot of structure — a kind of structure specialized for
the specific problem you were training your model for.
Keras' pad_sequences converts a list of sequences (lists of word indices) into a 2D matrix so that:
If a sequence is longer than the maximum length, the sequence is truncated (by default at the beginning).
If a sequence is shorter than the maximum length, zeros are padded (by default at the beginning).
In [5]:
from tensorflow.keras import preprocessing
my_data = [[1,2,23,43], [2,6,1,31,3,4,21]]
preprocessing.sequence.pad_sequences(my_data, maxlen=6)
Out[5]:
array([[ 0,  0,  1,  2, 23, 43],
       [ 6,  1, 31,  3,  4, 21]], dtype=int32)
Let’s apply this idea to the IMDB movie review sentiment prediction task that you are already familiar with. Let’s quickly prepare
the data. We will restrict the movie reviews to the top 10,000 most common words (like we did the first time we worked with this dataset),
and cut the reviews after only 20 words. Our network will simply learn 8-dimensional embeddings for each of the 10,000 words, turn the
input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and train a single Dense
layer on top for classification.
In [6]:
from tensorflow.keras.datasets import imdb
from tensorflow.keras import preprocessing
# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words
# (among top max_features most common words)
maxlen = 20
# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
/home/diego/anaconda3/lib/python3.8/site-packages/tensorflow/python/keras/datasets/imdb.py:155: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify ‘dtype=object’ when creating the ndarray
x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
/home/diego/anaconda3/lib/python3.8/site-packages/tensorflow/python/keras/datasets/imdb.py:156: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify ‘dtype=object’ when creating the ndarray
x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
In [7]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense
model = Sequential()
# We specify the maximum input length to our Embedding layer
# so we can later flatten the embedded inputs
model.add(Embedding(10000, 8, input_length=maxlen))
# After the Embedding layer,
# our activations have shape `(samples, maxlen, 8)`.
# We flatten the 3D tensor of embeddings
# into a 2D tensor of shape `(samples, maxlen * 8)`
model.add(Flatten())
# We add the classifier on top
model.add(Dense(1, activation='sigmoid'))
In [8]:
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten (Flatten)            (None, 160)               0         
_________________________________________________________________
dense (Dense)                (None, 1)                 161       
=================================================================
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
The following code trains the model using 10 epochs and a batch size of 32. Also, prior to training the model, it partitions the data into a training set and a validation set. A validation split of 0.2 indicates that 20% of the data set is used for the validation set.
Keras will allocate the first samples of the data set to the training set, and the final samples to the validation set. If you want a random partition of the data set, you should shuffle the data before calling fit.
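For example, a minimal way to shuffle before fitting (illustrative only; the *_shuffled names are not used elsewhere in this notebook) would be:
import numpy as np

perm = np.random.permutation(len(x_train))   # random ordering of the sample indices
x_train_shuffled = x_train[perm]
y_train_shuffled = y_train[perm]
# model.fit(x_train_shuffled, y_train_shuffled, epochs=10, batch_size=32, validation_split=0.2)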
In [9]:
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)
Epoch 1/10
625/625 [==============================] – 2s 4ms/step – loss: 0.6696 – acc: 0.6215 – val_loss: 0.6141 – val_acc: 0.7106
Epoch 2/10
625/625 [==============================] – 3s 5ms/step – loss: 0.5367 – acc: 0.7533 – val_loss: 0.5191 – val_acc: 0.7360
Epoch 3/10
625/625 [==============================] – 3s 5ms/step – loss: 0.4587 – acc: 0.7894 – val_loss: 0.4953 – val_acc: 0.7462
Epoch 4/10
625/625 [==============================] – 3s 4ms/step – loss: 0.4213 – acc: 0.8088 – val_loss: 0.4903 – val_acc: 0.7560
Epoch 5/10
625/625 [==============================] – 3s 4ms/step – loss: 0.3949 – acc: 0.8238 – val_loss: 0.4921 – val_acc: 0.7610
Epoch 6/10
625/625 [==============================] – 3s 4ms/step – loss: 0.3722 – acc: 0.8367 – val_loss: 0.4947 – val_acc: 0.7608
Epoch 7/10
625/625 [==============================] – 2s 4ms/step – loss: 0.3522 – acc: 0.8492 – val_loss: 0.4969 – val_acc: 0.7588
Epoch 8/10
625/625 [==============================] – 3s 4ms/step – loss: 0.3330 – acc: 0.8601 – val_loss: 0.5028 – val_acc: 0.7594
Epoch 9/10
625/625 [==============================] – 3s 4ms/step – loss: 0.3143 – acc: 0.8711 – val_loss: 0.5079 – val_acc: 0.7610
Epoch 10/10
625/625 [==============================] – 3s 5ms/step – loss: 0.2971 – acc: 0.8796 – val_loss: 0.5160 – val_acc: 0.7586
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.subplot(121)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
#plt.figure()
plt.subplot(122)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
We get to a validation accuracy of ~76%, which is pretty good considering that we only look at the first 20 words in every review. But
note that merely flattening the embedded sequences and training a single Dense layer on top leads to a model that treats each word in the
input sequence separately, without considering inter-word relationships and sentence structure (e.g. it would likely treat both “this movie
is shit” and “this movie is the shit” as being negative “reviews”). The following alternatives would normally give better results:
Average the word embeddings to generate a fixed-size summary of each sequence. Keras provides the GlobalAveragePooling1D layer for this, for example:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
# After the Embedding layer,
# our activations have shape `(samples, maxlen, 8)`.
model.add(GlobalAveragePooling1D())
# After computing the average,
# our activations have shape `(samples, 8)`.
# We add the classifier on top.
model.add(Dense(1, activation='sigmoid'))
Add a recurrent layer (we will cover this later in the course).
Add a 1D convolutional layer (see the textbook for details).
Option 1 is a quick solution that sometimes gives surprisingly good results. Options 2 and 3 are more complex solutions that learn word dependencies in the input text.
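For reference, a rough sketch of option 3 (a 1D convolution followed by pooling) could look like the following. This is an illustration only, with untuned layer sizes, and is not part of the original notebook:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

model = Sequential()
model.add(Embedding(10000, 8, input_length=maxlen))
# The convolution looks at windows of 7 consecutive word embeddings,
# so neighbouring words are combined into local features before pooling.
model.add(Conv1D(32, 7, activation='relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(1, activation='sigmoid'))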
Using pre-trained word embeddings¶
Sometimes, you have so little training data available that you could never use your data alone to learn an appropriate task-specific embedding
of your vocabulary. What can you do then?
We can then use pre-trained word embeddings, created by a third party!
Instead of learning word embeddings jointly with the problem you want to solve, you could be loading embedding vectors from a pre-computed
embedding space known to be highly structured and to exhibit useful properties — that captures generic aspects of language structure. The
rationale behind using pre-trained word embeddings in natural language processing is very much the same as for using pre-trained convnets
in image classification: we don’t have enough data available to learn truly powerful features on our own, but we expect the features that
we need to be fairly generic, i.e. common visual features or semantic features. In this case it makes sense to reuse features learned on a
different problem.
Such word embeddings are generally computed using word occurrence statistics (observations about what words co-occur in sentences or
documents), using a variety of techniques, some involving neural networks, others not. The idea of a dense, low-dimensional embedding space
for words, computed in an unsupervised way, was initially explored by Bengio et al. in the early 2000s, but it only started really taking
off in research and industry applications after the release of one of the most famous and successful word embedding schemes: the Word2Vec
algorithm, developed by Mikolov at Google in 2013. Word2Vec dimensions capture specific semantic properties, e.g. gender.
There are various pre-computed databases of word embeddings that you can download and start using in a Keras Embedding layer. Word2Vec is one
of them. Another popular one is called “GloVe”, developed by Stanford researchers in 2014. It stands for “Global Vectors for Word
Representation”, and it is an embedding technique based on factorizing a matrix of word co-occurrence statistics. Its developers have made
available pre-computed embeddings for millions of English tokens, obtained from Wikipedia data or from Common Crawl data.
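As an optional aside (not part of this notebook), the third-party gensim library can be used to explore such pre-trained vectors and observe these semantic regularities, e.g. the classic king - man + woman ≈ queen analogy. The snippet below assumes gensim is installed and downloads a small GloVe model on first use:
import gensim.downloader as api

# Downloads a pre-trained 100-dimensional GloVe model (roughly 130MB) on first use.
wv = api.load('glove-wiki-gigaword-100')
# Vector arithmetic: 'king' - 'man' + 'woman' should rank 'queen' highly.
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))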
Putting it all together: from raw text to word embeddings¶
Let’s take a look at how you can get started using GloVe embeddings in a Keras model. The same method will of course be valid for Word2Vec
embeddings or any other word embedding database that you can download. We will also use this example to introduce Keras’ text tokenization
techniques.
We will be using a model similar to the one we just went over — embedding sentences in sequences of vectors, flattening them and training a
Dense layer on top. But we will do it using pre-trained word embeddings, and instead of using the pre-tokenized IMDB data packaged in
Keras, we will start from scratch, by downloading the original text data.
Download the IMDB data as raw text¶
First, head to http://ai.stanford.edu/~amaas/data/sentiment/ and download the raw IMDB dataset (if the URL isn’t working anymore, just
Google “IMDB dataset”). Uncompress it.
In [12]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
–2021-03-22 11:17:19– http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)… 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’
aclImdb_v1.tar.gz 100%[===================>] 80.23M 4.55MB/s in 34s
2021-03-22 11:17:53 (2.35 MB/s) – ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]
In [13]:
!tar xzf aclImdb_v1.tar.gz
In [14]:
!ls aclImdb
imdbEr.txt imdb.vocab README test train
Now let’s collect the individual training reviews into a list of strings, one string per review, and let’s also collect the review labels
(positive / negative) into a labels list:
In [15]:
import os
imdb_dir = 'aclImdb'
train_dir = os.path.join(imdb_dir, 'train')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
In [16]:
texts[0]
Out[16]:
‘I initially gained interest in this film after reading a review saying this movie reminded the reviewer of Silent Hill.
Being a huge Silent Hill fan, and disappointed with it\’s movie debut, I thought I would give this one a chance. Mind, Fearnet only lists this movie as “Dark Floors”, not by it\’s full name. So when I saw the name “Mr. Lordi” in the credits I immediately thought of the band (I had a few friends in college that like them) but didn\’t think it was important and quickly pushed the thought aside.
The film starts out strong. Despite the fact “creepy little girl” has been done to DEATH, the good use of audio and sense of isolation really started to piece the the story together. The tense atmosphere built rapidly, and every indication pointed to the film being excellent. As monsters are the true stars of horror, I couldn\’t wait to see what was lurking in the halls of the hospital the main characters had found themselves trapped in…
And then the first monster showed up, and I found myself greatly underwhelmed. By the time the second appeared, I boggled at the fact it looked like it had just come from a Megadeth concert, and the silliness turned me off completely.
Over the course of the movie the atmosphere did remain intact, and the story left you wondering just what was going on, but the scares were pretty much non-existent. However, I held out hope that the end would make it all worthwhile. Unfortunately that was not to be the case. By the time the movie had reached it\’s climax, I was in utter disbelief, and I immediately recognized the big bad in his final reveal… The lead singer of Lordi? Seriously?
Was that what the movie all boiled down to? A bunch of poor souls being chased around a hospital by Lordi band members? The silly monster design suddenly made sense. If you\’re going to be that corny, may as well through the members of Marilyn Manson, or even KISS in there too. Not to mention the fact that I\’m pretty sure I saw the ending in one of Lordi\’s music videos a few years ago. They had to go and make an entire movie off of it?
Worst of all, when I found out what really had been going on, all I could manage was a yawn. I\’m not going to “ruin” it for you, but I can safely say it\’s probably a plot device you\’ve seen before. Most likely more then once.
So, unless you\’re a huge Lordi fan, stay away from this. It\’s not scary, it doesn\’t bring anything new to the table (although it does a decent job of borrowing from other horror movies, mainly Silent Hill). And, I can\’t stress this enough, LORDI is the antagonist. LORDI. Talk about a buzzkill.
Really, you\’d be better off trying to scare yourself watching Slipknot music videos. In other words, it\’s just not possible.’
In [17]:
labels[0]
Out[17]:
0
Tokenize the data¶
Keras' tokenizer can be used to map each word to a word index through the following two steps (a toy example is sketched after the list):
Create a mapping of words to indices using fit_on_texts on the training data.
Find the indices of the training data using texts_to_sequences.
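Here is a toy illustration of these two steps (the sentences and variable names are made up for this example only):
from tensorflow.keras.preprocessing.text import Tokenizer

toy_texts = ['The cat sat on the mat.', 'The dog ate my homework.']
toy_tokenizer = Tokenizer(num_words=100)
toy_tokenizer.fit_on_texts(toy_texts)               # step 1: build the word-to-index mapping
print(toy_tokenizer.word_index)                     # e.g. {'the': 1, 'cat': 2, ...}
print(toy_tokenizer.texts_to_sequences(toy_texts))  # step 2: encode each text as a list of indices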
Let’s vectorize the texts we collected, and prepare a training and validation split.
We will merely be using the concepts we introduced earlier in this section.
Because pre-trained word embeddings are meant to be particularly useful on problems where little training data is available (otherwise,
task-specific embeddings are likely to outperform them), we will add the following twist: we restrict the training data to its first 200
samples. So we will be learning to classify movie reviews after looking at just 200 examples…
In [18]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
maxlen = 100 # We will cut reviews after 100 words
training_samples = 200 # We will be training on 200 samples
validation_samples = 10000 # We will be validating on 10000 samples
max_words = 10000 # We will only consider the top 10,000 words in the dataset
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
data = pad_sequences(sequences, maxlen=maxlen)
labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)
Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
Incidentally, you can also use Keras' tokeniser to generate a one-hot encoding by calling texts_to_matrix with the mode='binary' option. Look at the Keras documentation for other encoding options: https://faroit.github.io/keras-docs/1.2.2/preprocessing/text/
In [19]:
one_hot = tokenizer.texts_to_matrix(texts, mode='binary')
one_hot.shape
Out[19]:
(25000, 10000)
In [20]:
# Split the data into a training set and a validation set
# But first, shuffle the data, since we started from data
# where samples are ordered (all negative first, then all positive).
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]
Download the GloVe word embeddings¶
Head to https://nlp.stanford.edu/projects/glove/ (where you can learn more about the GloVe algorithm), and download the pre-computed
embeddings from 2014 English Wikipedia. It's an 822MB zip file named glove.6B.zip, containing 50-, 100-, 200- and 300-dimensional embedding vectors for
400,000 words (or non-word tokens). Un-zip it.
In [19]:
!wget http://nlp.stanford.edu/data/glove.6B.zip
–2021-03-16 15:21:43– http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)… 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80… connected.
HTTP request sent, awaiting response… 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
–2021-03-16 15:21:44– https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443… connected.
HTTP request sent, awaiting response… 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
–2021-03-16 15:21:45– http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)… 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80… connected.
HTTP request sent, awaiting response… 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’
glove.6B.zip 100%[===================>] 822.24M 1.60MB/s in 4m 20s
2021-03-16 15:26:06 (3.16 MB/s) – ‘glove.6B.zip’ saved [862182613/862182613]
The following code unzips the data. If you do not have unzip installed or you are using Google Colaboratory, you may need to run this first:
In [ ]:
!apt install unzip
In [21]:
!unzip glove.6B.zip
Archive: glove.6B.zip
inflating: glove.6B.50d.txt
inflating: glove.6B.100d.txt
inflating: glove.6B.200d.txt
inflating: glove.6B.300d.txt
Alternatively, mount Google Drive (if you are using Google Colaboratory)¶
If you are using Google Colaboratory you can use cloud instances with a GPU, and you can also store data in your Google Drive. The following cells of code mount your Google Drive after an authorisation step. Uncomment them and run them in a Google Colaboratory notebook.
In [21]:
#from google.colab import drive
#drive.mount('/gdrive')
In [22]:
#cp '/gdrive/My Drive/COMP348/glove/glove.6B.100d.txt' .
Pre-process the embeddings¶
Let's parse the un-zipped file to build an index mapping words (as strings) to their vector representation (as number vectors). The file is a text file where each line contains the word followed by its vector representation. For example, the first lines of glove.6B.100d.txt are:
In [22]:
!head glove.6B.100d.txt
the -0.038194 -0.24487 0.72812 -0.39961 0.083172 0.043953 -0.39141 0.3344 -0.57545 0.087459 0.28787 -0.06731 0.30906 -0.26384 -0.13231 -0.20757 0.33395 -0.33848 -0.31743 -0.48336 0.1464 -0.37304 0.34577 0.052041 0.44946 -0.46971 0.02628 -0.54155 -0.15518 -0.14107 -0.039722 0.28277 0.14393 0.23464 -0.31021 0.086173 0.20397 0.52624 0.17164 -0.082378 -0.71787 -0.41531 0.20335 -0.12763 0.41367 0.55187 0.57908 -0.33477 -0.36559 -0.54857 -0.062892 0.26584 0.30205 0.99775 -0.80481 -3.0243 0.01254 -0.36942 2.2167 0.72201 -0.24978 0.92136 0.034514 0.46745 1.1079 -0.19358 -0.074575 0.23353 -0.052062 -0.22044 0.057162 -0.15806 -0.30798 -0.41625 0.37972 0.15006 -0.53212 -0.2055 -1.2526 0.071624 0.70565 0.49744 -0.42063 0.26148 -1.538 -0.30223 -0.073438 -0.28312 0.37104 -0.25217 0.016215 -0.017099 -0.38984 0.87424 -0.72569 -0.51058 -0.52028 -0.1459 0.8278 0.27062
, -0.10767 0.11053 0.59812 -0.54361 0.67396 0.10663 0.038867 0.35481 0.06351 -0.094189 0.15786 -0.81665 0.14172 0.21939 0.58505 -0.52158 0.22783 -0.16642 -0.68228 0.3587 0.42568 0.19021 0.91963 0.57555 0.46185 0.42363 -0.095399 -0.42749 -0.16567 -0.056842 -0.29595 0.26037 -0.26606 -0.070404 -0.27662 0.15821 0.69825 0.43081 0.27952 -0.45437 -0.33801 -0.58184 0.22364 -0.5778 -0.26862 -0.20425 0.56394 -0.58524 -0.14365 -0.64218 0.0054697 -0.35248 0.16162 1.1796 -0.47674 -2.7553 -0.1321 -0.047729 1.0655 1.1034 -0.2208 0.18669 0.13177 0.15117 0.7131 -0.35215 0.91348 0.61783 0.70992 0.23955 -0.14571 -0.37859 -0.045959 -0.47368 0.2385 0.20536 -0.18996 0.32507 -1.1112 -0.36341 0.98679 -0.084776 -0.54008 0.11726 -1.0194 -0.24424 0.12771 0.013884 0.080374 -0.35414 0.34951 -0.7226 0.37549 0.4441 -0.99059 0.61214 -0.35111 -0.83155 0.45293 0.082577
. -0.33979 0.20941 0.46348 -0.64792 -0.38377 0.038034 0.17127 0.15978 0.46619 -0.019169 0.41479 -0.34349 0.26872 0.04464 0.42131 -0.41032 0.15459 0.022239 -0.64653 0.25256 0.043136 -0.19445 0.46516 0.45651 0.68588 0.091295 0.21875 -0.70351 0.16785 -0.35079 -0.12634 0.66384 -0.2582 0.036542 -0.13605 0.40253 0.14289 0.38132 -0.12283 -0.45886 -0.25282 -0.30432 -0.11215 -0.26182 -0.22482 -0.44554 0.2991 -0.85612 -0.14503 -0.49086 0.0082973 -0.17491 0.27524 1.4401 -0.21239 -2.8435 -0.27958 -0.45722 1.6386 0.78808 -0.55262 0.65 0.086426 0.39012 1.0632 -0.35379 0.48328 0.346 0.84174 0.098707 -0.24213 -0.27053 0.045287 -0.40147 0.11395 0.0062226 0.036673 0.018518 -1.0213 -0.20806 0.64072 -0.068763 -0.58635 0.33476 -1.1432 -0.1148 -0.25091 -0.45907 -0.096819 -0.17946 -0.063351 -0.67412 -0.068895 0.53604 -0.87773 0.31802 -0.39242 -0.23394 0.47298 -0.028803
of -0.1529 -0.24279 0.89837 0.16996 0.53516 0.48784 -0.58826 -0.17982 -1.3581 0.42541 0.15377 0.24215 0.13474 0.41193 0.67043 -0.56418 0.42985 -0.012183 -0.11677 0.31781 0.054177 -0.054273 0.35516 -0.30241 0.31434 -0.33846 0.71715 -0.26855 -0.15837 -0.47467 0.051581 -0.33252 0.15003 -0.1299 -0.54617 -0.37843 0.64261 0.82187 -0.080006 0.078479 -0.96976 -0.57741 0.56491 -0.39873 -0.057099 0.19743 0.065706 -0.48092 -0.20125 -0.40834 0.39456 -0.02642 -0.11838 1.012 -0.53171 -2.7474 -0.042981 -0.74849 1.7574 0.59085 0.04885 0.78267 0.38497 0.42097 0.67882 0.10337 0.6328 -0.026595 0.58647 -0.44332 0.33057 -0.12022 -0.55645 0.073611 0.20915 0.43395 -0.012761 0.089874 -1.7991 0.084808 0.77112 0.63105 -0.90685 0.60326 -1.7515 0.18596 -0.50687 -0.70203 0.66578 -0.81304 0.18712 -0.018488 -0.26757 0.727 -0.59363 -0.34839 -0.56094 -0.591 1.0039 0.20664
to -0.1897 0.050024 0.19084 -0.049184 -0.089737 0.21006 -0.54952 0.098377 -0.20135 0.34241 -0.092677 0.161 -0.13268 -0.2816 0.18737 -0.42959 0.96039 0.13972 -1.0781 0.40518 0.50539 -0.55064 0.4844 0.38044 -0.0029055 -0.34942 -0.099696 -0.78368 1.0363 -0.2314 -0.47121 0.57126 -0.21454 0.35958 -0.48319 1.0875 0.28524 0.12447 -0.039248 -0.076732 -0.76343 -0.32409 -0.5749 -1.0893 -0.41811 0.4512 0.12112 -0.51367 -0.13349 -1.1378 -0.28768 0.16774 0.55804 1.5387 0.018859 -2.9721 -0.24216 -0.92495 2.1992 0.28234 -0.3478 0.51621 -0.43387 0.36852 0.74573 0.072102 0.27931 0.92569 -0.050336 -0.85856 -0.1358 -0.92551 -0.33991 -1.0394 -0.067203 -0.21379 -0.4769 0.21377 -0.84008 0.052536 0.59298 0.29604 -0.67644 0.13916 -1.5504 -0.20765 0.7222 0.52056 -0.076221 -0.15194 -0.13134 0.058617 -0.31869 -0.61419 -0.62393 -0.41548 -0.038175 -0.39804 0.47647 -0.15983
and -0.071953 0.23127 0.023731 -0.50638 0.33923 0.1959 -0.32943 0.18364 -0.18057 0.28963 0.20448 -0.5496 0.27399 0.58327 0.20468 -0.49228 0.19974 -0.070237 -0.88049 0.29485 0.14071 -0.1009 0.99449 0.36973 0.44554 0.28998 -0.1376 -0.56365 -0.029365 -0.4122 -0.25269 0.63181 -0.44767 0.24363 -0.10813 0.25164 0.46967 0.3755 -0.23613 -0.14129 -0.44537 -0.65737 -0.042421 -0.28636 -0.28811 0.063766 0.20281 -0.53542 0.41307 -0.59722 -0.38614 0.19389 -0.17809 1.6618 -0.011819 -2.3737 0.058427 -0.2698 1.2823 0.81925 -0.22322 0.72932 -0.053211 0.43507 0.85011 -0.42935 0.92664 0.39051 1.0585 -0.24561 -0.18265 -0.5328 0.059518 -0.66019 0.18991 0.28836 -0.2434 0.52784 -0.65762 -0.14081 1.0491 0.5134 -0.23816 0.69895 -1.4813 -0.2487 -0.17936 -0.059137 -0.08056 -0.48782 0.014487 -0.6259 -0.32367 0.41862 -1.0807 0.46742 -0.49931 -0.71895 0.86894 0.19539
in 0.085703 -0.22201 0.16569 0.13373 0.38239 0.35401 0.01287 0.22461 -0.43817 0.50164 -0.35874 -0.34983 0.055156 0.69648 -0.17958 0.067926 0.39101 0.16039 -0.26635 -0.21138 0.53698 0.49379 0.9366 0.66902 0.21793 -0.46642 0.22383 -0.36204 -0.17656 0.1748 -0.20367 0.13931 0.019832 -0.10413 -0.20244 0.55003 -0.1546 0.98655 -0.26863 -0.2909 -0.32866 -0.34188 -0.16943 -0.42001 -0.046727 -0.16327 0.70824 -0.74911 -0.091559 -0.96178 -0.19747 0.10282 0.55221 1.3816 -0.65636 -3.2502 -0.31556 -1.2055 1.7709 0.4026 -0.79827 1.1597 -0.33042 0.31382 0.77386 0.22595 0.52471 -0.034053 0.32048 0.079948 0.17752 -0.49426 -0.70045 -0.44569 0.17244 0.20278 0.023292 -0.20677 -1.0158 0.18325 0.56752 0.31821 -0.65011 0.68277 -0.86585 -0.059392 -0.29264 -0.55668 -0.34705 -0.32895 0.40215 -0.12746 -0.20228 0.87368 -0.545 0.79205 -0.20695 -0.074273 0.75808 -0.34243
a -0.27086 0.044006 -0.02026 -0.17395 0.6444 0.71213 0.3551 0.47138 -0.29637 0.54427 -0.72294 -0.0047612 0.040611 0.043236 0.29729 0.10725 0.40156 -0.53662 0.033382 0.067396 0.64556 -0.085523 0.14103 0.094539 0.74947 -0.194 -0.68739 -0.41741 -0.22807 0.12 -0.48999 0.80945 0.045138 -0.11898 0.20161 0.39276 -0.20121 0.31354 0.75304 0.25907 -0.11566 -0.029319 0.93499 -0.36067 0.5242 0.23706 0.52715 0.22869 -0.51958 -0.79349 -0.20368 -0.50187 0.18748 0.94282 -0.44834 -3.6792 0.044183 -0.26751 2.1997 0.241 -0.033425 0.69553 -0.64472 -0.0072277 0.89575 0.20015 0.46493 0.61933 -0.1066 0.08691 -0.4623 0.18262 -0.15849 0.020791 0.19373 0.063426 -0.31673 -0.48177 -1.3848 0.13669 0.96859 0.049965 -0.2738 -0.035686 -1.0577 -0.24467 0.90366 -0.12442 0.080776 -0.83401 0.57201 0.088945 -0.42532 -0.018253 -0.079995 -0.28581 -0.01089 -0.4923 0.63687 0.23642
” -0.30457 -0.23645 0.17576 -0.72854 -0.28343 -0.2564 0.26587 0.025309 -0.074775 -0.3766 -0.057774 0.12159 0.34384 0.41928 -0.23236 -0.31547 0.60939 0.25117 -0.68667 0.70873 1.2162 -0.1824 -0.48442 -0.33445 0.30343 1.086 0.49992 -0.20198 0.27959 0.68352 -0.33566 -0.12405 0.059656 0.33617 0.37501 0.56552 0.44867 0.11284 -0.16196 -0.94346 -0.67961 0.18581 0.060653 0.43776 0.13834 -0.48207 -0.56141 -0.25422 -0.52445 0.097003 -0.48925 0.19077 0.21481 1.4969 -0.86665 -3.2846 0.56854 0.41971 1.2294 0.78522 -0.29369 0.63803 -1.5926 -0.20437 1.5306 0.13548 0.50722 0.18742 0.48552 -0.28995 0.19573 0.0046515 0.092879 -0.42444 0.64987 0.52839 0.077908 0.8263 -1.2208 -0.34955 0.49855 -0.64155 -0.72308 0.26566 -1.3643 -0.46364 -0.52048 -1.0525 0.22895 -0.3456 -0.658 -0.16735 0.35158 0.74337 0.26074 0.061104 -0.39079 -0.84557 -0.035432 0.17036
‘s 0.58854 -0.2025 0.73479 -0.68338 -0.19675 -0.1802 -0.39177 0.34172 -0.60561 0.63816 -0.26695 0.36486 -0.40379 -0.1134 -0.58718 0.2838 0.8025 -0.35303 0.30083 0.078935 0.44416 -0.45906 0.79294 0.50365 0.32805 0.28027 -0.4933 -0.38482 -0.039284 -0.2483 -0.1988 1.1469 0.13228 0.91691 -0.36739 0.89425 0.5426 0.61738 -0.62205 -0.31132 -0.50933 0.23335 1.0826 -0.044637 -0.12767 0.27628 -0.032617 -0.27397 0.77764 -0.50861 0.038307 -0.33679 0.42344 1.2271 -0.53826 -3.2411 0.42626 0.025189 1.3948 0.65085 0.03325 0.37141 0.4044 0.35558 0.98265 -0.61724 0.53901 0.76219 0.30689 0.33065 0.30956 -0.15161 -0.11313 -0.81281 0.6145 -0.44341 -0.19163 -0.089551 -1.5927 0.37405 0.85857 0.54613 -0.31928 0.52598 -1.4802 -0.97931 -0.2939 -0.14724 0.25803 -0.1817 1.0149 0.77649 0.12598 0.54779 -1.0316 0.064599 -0.37523 -0.94475 0.61802 0.39591
In [23]:
glove_dir = ''
embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))
Found 400000 word vectors.
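As an informal sanity check (not in the original notebook), you can compare a few of the loaded vectors with cosine similarity; semantically related words should generally score higher than unrelated ones:
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# 'good' vs 'great' is expected to score higher than 'good' vs 'car'.
print(cosine_similarity(embeddings_index['good'], embeddings_index['great']))
print(cosine_similarity(embeddings_index['good'], embeddings_index['car']))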
Now, let’s build an embedding matrix that we will be able to load into an Embedding layer. It must be a matrix of shape (max_words,
embedding_dim), where each entry i contains the embedding_dim-dimensional vector for the word of index i in our reference word index
(built during tokenization). Note that the index 0 is not supposed to stand for any word or token — it’s a placeholder.
In [24]:
embedding_dim = 100
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
Define a model¶
We will be using the same model architecture as before:
In [25]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
Load the GloVe embeddings in the model¶
The Embedding layer has a single weight matrix: a 2D float matrix where each entry i is the word vector meant to be associated with
index i. Simple enough. Let's just load the GloVe matrix we prepared into our Embedding layer, the first layer in our model.
Additionally, we freeze the embedding layer (we set its trainable attribute to False), so that the pre-trained embeddings are not updated during the training stage.
In [26]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
We can now observe that the number of trainable parameters is much smaller:
In [27]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_1 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 320,065
Non-trainable params: 1,000,000
_________________________________________________________________
Train and evaluate¶
Let's compile our model and train it:
In [28]:
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
Epoch 1/10
7/7 [==============================] - 1s 211ms/step - loss: 2.0225 - acc: 0.4800 - val_loss: 0.7286 - val_acc: 0.5077
Epoch 2/10
7/7 [==============================] - 2s 250ms/step - loss: 0.6273 - acc: 0.6750 - val_loss: 0.9464 - val_acc: 0.5046
Epoch 3/10
7/7 [==============================] - 2s 260ms/step - loss: 0.6680 - acc: 0.7000 - val_loss: 0.6979 - val_acc: 0.5248
Epoch 4/10
7/7 [==============================] - 1s 166ms/step - loss: 0.5253 - acc: 0.6750 - val_loss: 0.8610 - val_acc: 0.5046
Epoch 5/10
7/7 [==============================] - 2s 246ms/step - loss: 0.5288 - acc: 0.6950 - val_loss: 0.7361 - val_acc: 0.5085
Epoch 6/10
7/7 [==============================] - 1s 181ms/step - loss: 0.4491 - acc: 0.6950 - val_loss: 0.7009 - val_acc: 0.5337
Epoch 7/10
7/7 [==============================] - 2s 259ms/step - loss: 0.4431 - acc: 0.7700 - val_loss: 0.6998 - val_acc: 0.5427
Epoch 8/10
7/7 [==============================] - 2s 297ms/step - loss: 0.4451 - acc: 0.8050 - val_loss: 0.9960 - val_acc: 0.5053
Epoch 9/10
7/7 [==============================] - 2s 222ms/step - loss: 0.3991 - acc: 0.8350 - val_loss: 0.8773 - val_acc: 0.5075
Epoch 10/10
7/7 [==============================] - 1s 189ms/step - loss: 0.4066 - acc: 0.8300 - val_loss: 0.6917 - val_acc: 0.5645
Let's plot its performance over time:
In [29]:
%matplotlib inline
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.subplot(121)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
#plt.figure()
plt.subplot(122)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
The model quickly starts overfitting, unsurprisingly given the small number of training samples. Validation accuracy has high variance for
the same reason, but it seems to reach the mid-to-high 50s.
Note that your mileage may vary: since we have so few training samples, performance is heavily dependent on which exact 200 samples we
picked, and we picked them at random. If it worked really poorly for you, try picking a different random set of 200 samples, just for the
sake of the exercise (in real life you don't get to pick your training data).
We can also try to train the same model without loading the pre-trained word embeddings and without freezing the embedding layer. In that
case, we would be learning a task-specific embedding of our input tokens, which is generally more powerful than pre-trained word embeddings
when lots of data is available. However, in our case, we have only 200 training samples. Let's try it:
In [30]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense
model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.summary()
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
flatten_2 (Flatten)          (None, 10000)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 32)                320032    
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 33        
=================================================================
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
7/7 [==============================] - 2s 248ms/step - loss: 0.6950 - acc: 0.4850 - val_loss: 0.6928 - val_acc: 0.5059
Epoch 2/10
7/7 [==============================] - 2s 258ms/step - loss: 0.5189 - acc: 0.9700 - val_loss: 0.6993 - val_acc: 0.5057
Epoch 3/10
7/7 [==============================] - 2s 282ms/step - loss: 0.3022 - acc: 0.9850 - val_loss: 0.6942 - val_acc: 0.5200
Epoch 4/10
7/7 [==============================] - 2s 275ms/step - loss: 0.1380 - acc: 1.0000 - val_loss: 0.7058 - val_acc: 0.5146
Epoch 5/10
7/7 [==============================] - 2s 253ms/step - loss: 0.0675 - acc: 1.0000 - val_loss: 0.7071 - val_acc: 0.5184
Epoch 6/10
7/7 [==============================] - 2s 309ms/step - loss: 0.0342 - acc: 1.0000 - val_loss: 0.7079 - val_acc: 0.5223
Epoch 7/10
7/7 [==============================] - 2s 225ms/step - loss: 0.0187 - acc: 1.0000 - val_loss: 0.7140 - val_acc: 0.5236
Epoch 8/10
7/7 [==============================] - 1s 207ms/step - loss: 0.0109 - acc: 1.0000 - val_loss: 0.7162 - val_acc: 0.5227
Epoch 9/10
7/7 [==============================] - 2s 258ms/step - loss: 0.0065 - acc: 1.0000 - val_loss: 0.7325 - val_acc: 0.5213
Epoch 10/10
7/7 [==============================] - 1s 164ms/step - loss: 0.0039 - acc: 1.0000 - val_loss: 0.7256 - val_acc: 0.5271
In [31]:
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.subplot(121)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
#plt.figure()
plt.subplot(122)
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()
Validation accuracy stalls in the low 50s. So in our case, pre-trained word embeddings do outperform jointly learned embeddings. If you
increase the number of training samples, this will quickly stop being the case -- try it as an exercise.
Finally, let's evaluate the model on the test data. First, we will need to tokenize the test data. Remember that we have created a mapping of words to indices. We now reuse this mapping by calling texts_to_sequences. You should not call fit_on_texts now! (Why not?)
In [32]:
test_dir = os.path.join(imdb_dir, 'test')
labels = []
texts = []
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname[-4:] == '.txt':
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)
And let's load and evaluate the first model:
In [33]:
model.load_weights('pre_trained_glove_model.h5')
model.evaluate(x_test, y_test)
782/782 [==============================] - 2s 3ms/step - loss: 0.6912 - acc: 0.5644
Out[33]:
[0.6911874413490295, 0.5644000172615051]
We get an appalling test accuracy of 56%. Working with just a handful of training samples is hard!