Practical Week 05

An End-to-End Text Classification System¶
In this workshop you will implement a text classification system from scratch. This means that we will not rely on Keras’ convenient data sets. Those data sets come pre-processed, so it is useful to know how to tokenise text collections not provided by Keras and find their word indices.

The task will be to classify questions. To run this task we advise using Google Colaboratory (also called Google Colab), a cloud solution for running Jupyter notebooks. The demonstrator will show how to use Google Colab. For additional information, and to practise with the use of notebooks in Google Colab, you can also follow this link:

Welcome notebook and link to additional resources

Question Classification¶
NLTK has a corpus of questions and their question types according to a particular classification scheme (e.g. DESC refers to a question expecting a descriptive answer, such as one starting with “How”; HUM refers to a question expecting an answer referring to a human). Below is an example of how to use the corpus:

In [1]:

import nltk
nltk.download("qc")
from nltk.corpus import qc
train = qc.tuples("train.txt")
test = qc.tuples("test.txt")

[nltk_data] Downloading package qc to /home/diego/nltk_data...
[nltk_data] Package qc is already up-to-date!

In [2]:

train[:3]

Out[2]:

[('DESC:manner', 'How did serfdom develop in and then leave Russia ?'),
 ('ENTY:cremat', 'What films featured the character Popeye Doyle ?'),
 ('DESC:manner', "How can I find a list of celebrities ' real names ?")]

In [3]:

test[:3]

Out[3]:

[('NUM:dist', 'How far is it from Denver to Aspen ?'),
 ('LOC:city', 'What county is Modesto , California in ?'),
 ('HUM:desc', 'Who was Galileo ?')]

Exercise: Find all question types¶
Write Python code that lists all the possible question types of the training set (remember: for data exploration, never look at the test set).

In [4]:

qtypes = # … write your answer here

In [5]:

qtypes

Out[5]:

['ENTY:sport',
 'ENTY:substance',
 'NUM:period',
 'ENTY:religion',
 'ENTY:termeq',
 'HUM:desc',
 'ABBR:exp',
 'DESC:desc',
 'NUM:volsize',
 'ENTY:techmeth',
 'NUM:other',
 'LOC:country',
 'ENTY:event',
 'NUM:weight',
 'ENTY:instru',
 'ABBR:abb',
 'LOC:state',
 'HUM:title',
 'NUM:dist',
 'NUM:code',
 'LOC:mount',
 'LOC:city',
 'ENTY:product',
 'NUM:perc',
 'DESC:def',
 'ENTY:color',
 'LOC:other',
 'ENTY:lang',
 'ENTY:currency',
 'ENTY:food',
 'ENTY:animal',
 'NUM:date',
 'ENTY:veh',
 'ENTY:other',
 'DESC:manner',
 'ENTY:body',
 'ENTY:word',
 'HUM:ind',
 'ENTY:cremat',
 'HUM:gr',
 'DESC:reason',
 'ENTY:dismed',
 'NUM:speed',
 'NUM:money',
 'ENTY:letter',
 'NUM:temp',
 'NUM:count',
 'ENTY:symbol',
 'ENTY:plant',
 'NUM:ord']
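
For reference, one way to produce such a list is the sketch below (sets are unordered, so your ordering may differ):

qtypes = list(set(label for label, question in train))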

Exercise: Find all general types¶
The question types have two parts. The first part describes a general type, and the second part defines a subtype. For example, the question type DESC:manner belongs to the general DESC type and within that type to the manner subtype. Let’s focus on the general types only. Write Python code that lists all the possible general types (there are 6 of them).

In [6]:

general_types = # … write your answer here
general_types

Out[6]:

['NUM', 'HUM', 'LOC', 'ABBR', 'ENTY', 'DESC']
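
One possible sketch, keeping only the part of each question type before the colon:

general_types = list(set(qtype.split(":")[0] for qtype in qtypes))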

Exercise: Partition the data¶
The corpus comes with a train and a test set, but for this exercise we want a partition into train, dev-test, and test. Combine all the data into one array and do a 3-way partition into train, dev-test, and test. Make sure that you shuffle the data before partitioning, and that you keep only the general label types.

In [8]:

# … write your answer here
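
One possible solution sketch; the 70/15/15 proportions and the random seed are assumptions, not requirements of the exercise:

import random

# Keep only the general part of each label, then shuffle and split.
data = [(label.split(":")[0], question) for label, question in list(train) + list(test)]
random.seed(1234)  # assumed seed, so that the partition is reproducible
random.shuffle(data)
n = len(data)
train_data = data[:int(n * 0.7)]                  # 70% train
devtest_data = data[int(n * 0.7):int(n * 0.85)]   # 15% dev-test
test_data = data[int(n * 0.85):]                  # 15% test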

Exercise: Tokenise the data¶
Use Keras’ tokeniser to tokenise all the data. For this exercise we will use only the 100 most frequent words in the training set (since you aren’t supposed to use the dev-test or test sets to extract features).

In [9]:

from tensorflow.keras.preprocessing.text import Tokenizer

# Write your code here

In [10]:

indices_train = # … write your code here
indices_devtest = # … write your code here
indices_test = # … write your code here
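
A possible sketch, using the train_data, devtest_data and test_data names from the partition sketch above (an assumption). Note that with num_words=100 the tokeniser keeps the 99 most frequent words, since index 0 is reserved:

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts([question for label, question in train_data])  # fit on training data only

indices_train = tokenizer.texts_to_sequences([question for label, question in train_data])
indices_devtest = tokenizer.texts_to_sequences([question for label, question in devtest_data])
indices_test = tokenizer.texts_to_sequences([question for label, question in test_data])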

Exercise: Vectorize the data¶
The following code shows the distribution of question lengths in my training data (yours could differ because of the shuffle):

In [11]:

%matplotlib inline
from matplotlib import pyplot as plt
plt.hist([len(d) for d in indices_train])

Out[11]:

(array([  43., 1001., 1327.,  815.,  169.,  162.,   43.,    7.,    2.,    2.]),
 array([ 0. ,  1.8,  3.6,  5.4,  7.2,  9. , 10.8, 12.6, 14.4, 16.2, 18. ]),
 )

[Histogram: number of training questions per question length in words]

The histogram shows that the longest question in the training data has 18 word indices, but by far most of the questions have at most 10. Based on this, use Keras’ pad_sequences to vectorize the questions into sequences of 10 word indices. The default is to truncate the beginning of each sequence, but we want to truncate the end (since the first words of a question are often the most important for determining the question type). For this you can use the option truncating='post': https://keras.io/preprocessing/sequence/

In [12]:

from tensorflow.keras.preprocessing.sequence import pad_sequences
maxlen = 10
x_train = pad_sequences(indices_train, maxlen=maxlen, truncating='post')
x_devtest = pad_sequences(indices_devtest, maxlen=maxlen, truncating='post')
x_test = pad_sequences(indices_test, maxlen=maxlen, truncating='post')

Exercise: Vectorise the labels¶
Convert the labels to one-hot encoding. If you use Keras’ to_categorical, you first need to convert the labels to integers.

In [13]:

from tensorflow.keras.utils import to_categorical

# Write your code here
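
A sketch, assuming the general_types list and the partitioned data from the earlier exercises:

# Map each general label to an integer, then one-hot encode.
label_index = {label: i for i, label in enumerate(sorted(general_types))}
y_train = to_categorical([label_index[label] for label, question in train_data],
                         num_classes=len(label_index))
y_devtest = to_categorical([label_index[label] for label, question in devtest_data],
                           num_classes=len(label_index))
y_test = to_categorical([label_index[label] for label, question in test_data],
                        num_classes=len(label_index))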

Exercise: Define the model¶
Define a model for classification. For this model, use a feedforward architecture with an embedding layer of size 20, a layer that computes the average of the word embeddings (use GlobalAveragePooling1D), and a hidden layer of 16 units with relu activation. You need to determine the size and activation of the output layer.

In [14]:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, GlobalAveragePooling1D

embedding_dim = 20

# Write your code here

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 10, 20)            2000
_________________________________________________________________
global_average_pooling1d (Gl (None, 20)                0
_________________________________________________________________
dense (Dense)                (None, 16)                336
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 102
=================================================================
Total params: 2,438
Trainable params: 2,438
Non-trainable params: 0
_________________________________________________________________
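
One definition consistent with the summary above (a sketch; the input dimension of 100 matches the tokeniser's num_words):

model = Sequential([
    Embedding(100, embedding_dim, input_length=maxlen),  # 100 x 20 = 2,000 parameters
    GlobalAveragePooling1D(),                            # averages the 10 word embeddings
    Dense(16, activation='relu'),                        # (20 + 1) x 16 = 336 parameters
    Dense(6, activation='softmax')                       # (16 + 1) x 6 = 102 parameters
])

The output layer has 6 units, one per general type, with softmax activation so that the outputs can be read as probabilities over the mutually exclusive classes.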

Exercise: Train and evaluate¶
Train your model. In the process you need to determine the optimal number of epochs. Then answer the following questions:

What was the optimal number of epochs and how did you determine this?
Is the system overfitting? Justify your answer.

In [1]:

# Write your code here
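
A minimal training sketch; the epoch count of 30 is an assumed upper bound, and plt is the pyplot module imported earlier:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    epochs=30,
                    validation_data=(x_devtest, y_devtest),
                    verbose=2)

# The optimal number of epochs is roughly where the dev-test loss stops
# decreasing; training loss falling while dev-test loss rises is the
# usual sign of overfitting.
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='dev-test loss')
plt.legend()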

Optional Exercise: Data exploration¶
Plot the distribution of labels in the training data and compare it with the distribution of labels in the devtest data. Also plot the distribution of predictions on the devtest data. What can you learn from this?

In [2]:

# Write your code here
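
A possible starting point (a sketch, assuming the label_index mapping and the partitioned data from the earlier sketches):

from collections import Counter

train_counts = Counter(label for label, question in train_data)
devtest_counts = Counter(label for label, question in devtest_data)
predicted = model.predict(x_devtest).argmax(axis=1)
index_label = {i: label for label, i in label_index.items()}
predicted_counts = Counter(index_label[i] for i in predicted)

labels = sorted(label_index)
plt.bar(labels, [train_counts[label] for label in labels])
plt.title('Label distribution in the training data')

Comparing the three distributions can reveal, for example, whether the classifier over-predicts the majority classes.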

Optional Exercise: Improve your system¶
Try the following options:

Use pre-trained word embeddings.
Use recurrent neural networks (a starting sketch appears below).

Feel free to try each option separately and in combination, and compare the results. Also feel free to try other variants of the initial architecture, such as:

Introducing more hidden layers.
Changing the size of embeddings.
Changing the number of units in the hidden layer(s).
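
As a starting point for the recurrent option, one possible variant (a sketch) replaces the averaging layer with an LSTM:

from tensorflow.keras.layers import LSTM

rnn_model = Sequential([
    Embedding(100, embedding_dim, input_length=maxlen),
    LSTM(16),  # processes the word embeddings in sequence instead of averaging them
    Dense(6, activation='softmax')
])
rnn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])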

In [0]: