README
Overview
The practical exercises are based around a program for predicting dictionary head words, given
a definition. A neural network is trained to compose words in a definition so that the resulting
definition vector is close to the vector for the corresponding head word. For example, one of
the training instances could be:

fawn: a young deer
The default options in the program are such that fawn has a pre-trained word embedding,
whereas the embeddings for the words in the definition are learned. There are currently two
options for composing the words: an LSTM and a bag-of-words model. For the former, the
words in the definition are composed using an LSTM sequence model, and the final hidden
state is taken as the representation for the definition. For the latter, the word vectors in the
definition are simply averaged. For both composition methods, the objective is to build a vector
for the definition (a young deer) which is close to (as measured by the cosine distance) the
vector for the head word (fawn).
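To make the composition step concrete, here is a minimal NumPy sketch of the bag-of-words case and the cosine measure (the function names are illustrative and not taken from the actual code; cosine distance is simply one minus this similarity):

import numpy as np

def bow_compose(definition_vectors):
    # Bag-of-words composition: average the word vectors in the definition
    return np.mean(definition_vectors, axis=0)

def cosine_similarity(u, v):
    # Training pushes this value up for matching (definition, head word) pairs
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))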
There are three parts to the practical (detailed below). First, you will be asked to make an
addition to the code by writing an evaluation function which calculates the median rank of the
correct head word, given a definition, when a list of possible head words is
ranked by the model. Second, you will be asked to perform some experiments to see how the
results vary when various parameters in the model are changed, for example the optimizer
and the learning rate. Finally, you will be asked to write a report based on the first two
parts.
Get the Code and Data
Unpack the data archive:

tar xzvf data_practical.tgz
cd into the src directory, and take a look at the code. The following command runs the training
procedure:
python train_definition_model.py
If you run watch nvidia-smi in another terminal (after ssh’ing into the machine) you’ll see the GPU usage
while the program is running (we’d like this to be high, to fully utilize the GPU resources).
The program is saving models after each epoch. (Note that the default model directory is /tmp.
You may want to change this if there is insufficient space available on /tmp, or increase the
available space.) You will probably want to use the Linux screen command when carrying out
full training of a model, so that you can exit the ssh session and log back in again at a later time
(without killing the training process).
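For example, screen -S train starts a session named train, Ctrl-a d detaches from it while leaving the training process running, and screen -r train reattaches to it when you log back in.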
The saved models can be reloaded and used as part of the evaluation procedure, with the
addition of a couple of flags:
python train_definition_model.py --restore --evaluate

The --restore flag will find the latest model from the saved model directory and load it, and the
--evaluate flag runs the evaluation routine.
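For reference, the checkpoint-loading step behind a flag like --restore usually looks something like the following. This is only a sketch using standard TensorFlow 1.x calls, not the actual code in train_definition_model.py, and the stand-in variable is hypothetical:

import tensorflow as tf

model_dir = "/tmp"  # the default model directory mentioned above

w = tf.Variable(0.0, name="w")  # stand-in for the model's variables
saver = tf.train.Saver()

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(model_dir)  # path to the newest checkpoint, or None
    if ckpt is not None:
        saver.restore(sess, ckpt)  # reload the saved parameters
    else:
        sess.run(tf.global_variables_initializer())  # no checkpoint yet: start fresh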
Part I: Write an Evaluation Function
Currently the evaluation function just prints out a message. What we would like it to do is
calculate the median rank of the correct head word for the 200 development test instances,
over the complete vocabulary. The development data is in data/concept_descriptions.tok (for
you to look at), and has already been processed into a form suitable for reading into the model.
These 200 development test instances are what you are going to use to evaluate the model.
For example, one of the test instances has hat as its head word. The code will need to build a
vector for the definition, and then create a similarity ranking for all the words in the
vocabulary. We would like the vector for hat to be the closest, but the vocabulary is large, so
this is a difficult task. What you should find is that the result is very good for a few cases
(the correct head word appears in the top 10), but for many instances it is ranked much
further down.
In terms of modifying the code, the TensorFlow graph is already set up to compute the score for
each word in the vocabulary. What you need to do is query the graph to get back the scores as
a NumPy array, and then use NumPy to calculate the rankings. The evaluate_model function
contains a few high-level comments to help you along.
You will want to use all 200 development test examples, so when creating and running the
evaluation function think carefully about the appropriate batch size.
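The ranking step itself only needs a few lines of NumPy. Here is a minimal sketch, assuming you have already fetched a matrix of similarity scores of shape (num_examples, vocab_size) from the graph (e.g. via sess.run), together with the vocabulary index of each instance's correct head word; the names below are hypothetical, and if the graph returns cosine distances rather than similarities the comparison should be inverted:

import numpy as np

def median_rank(scores, head_ids):
    # scores: (num_examples, vocab_size) similarity of each definition to every word
    # head_ids: (num_examples,) vocabulary index of the correct head word
    correct = scores[np.arange(len(head_ids)), head_ids]
    # Rank = number of vocabulary words scoring strictly higher, plus one
    ranks = (scores > correct[:, None]).sum(axis=1) + 1
    return np.median(ranks), ranks

Returning the per-instance ranks as well makes it easy to see which of the 200 examples the model handles well and which it does not, which you will need for the report.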
Part II: Experiment with some of the Model (Hyper-)Parameters
The model is currently trained using the Adam optimizer. Try another one, e.g. gradient
descent. Also try a few values for the initial learning rate. What behaviour do you observe? Does
the loss still go down at the same rate? Does the optimizer appear to be finding a good
minimum?
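As a sketch of the kind of change involved, and assuming the script uses the TensorFlow 1.x tf.train optimizer classes (the variable names below are placeholders, not the identifiers used in train_definition_model.py):

import tensorflow as tf

# Toy loss, standing in for the model's actual loss tensor
w = tf.Variable(5.0)
loss = tf.square(w)

learning_rate = 0.001
# Default in the practical: Adam
# train_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
# One alternative to try: plain gradient descent
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)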
How long does the model have to train for before you start to see reasonable performance on
some of the examples? Does it begin to overfit? How does the batch size affect the efficiency
and effectiveness of the training procedure? What about the embedding size? RNN vs.
bag-of-words?
Try to think of a few more experiments you can run to probe the behaviour of the model. This
part of the practical is left deliberately open-ended.
Think carefully about what experiments you want to run. Training the model to convergence
may take a while, and you will not be able to do this an unlimited number of times.
Part III: Write a Report
The report should be no longer than 5,000 words (not including the appendix).
The report should include an appendix containing the code you wrote for the evaluation
function (and only that additional code), as well as some screenshots of the evaluation output.
You may also include tables or graphs of results in the appendix.
Part I of your report should contain a summary of your median ranking findings. What was the
overall median? Which examples did the model perform well on? Which ones did it perform
badly on?
Part II of your report should contain a description of the experiments you ran to investigate
how the values of various hyper-parameters affect the performance of the model, and any
additional experiments to probe the behaviour of the model.