Practical Week 6: Multi-label Classification
In this practical you will practice multi-label classification, in preparation for assignment 2.
In a multi-label classification task, each document may have one or more labels (or sometimes none). An example of multi-label classification with images is assigning keywords to each image. This blog post shows an example of multi-label text classification using scikit-learn and describes several methods you can use for the task. In this practical we will use the same data:
Multi-label Text Classification
The following code loads the CMU Movie Summary Corpus and prepares it for use in the practical exercises. The corpus consists of movie plots and additional information, including the genre (or genres) of each movie. Your task will be to determine the genres of a movie, given the text of the movie plot. For this code to work, the following file must be in the same folder as this notebook:
MovieSummaries.zip
In [11]:
import pandas as pd
from zipfile import ZipFile
# open the archive once; the data files are read directly from it
zip_file = ZipFile('MovieSummaries.zip')
# the metadata file has no header row, so name the columns we need
metadata = pd.read_csv(zip_file.open('MovieSummaries/movie.metadata.tsv'), sep='\t', header=None)
metadata.columns = ["movie_id", 1, "movie_name", 3, 4, 5, 6, 7, "genre"]
genres = metadata[["movie_id", "movie_name", "genre"]]
genres.head()
Out[11]:
movie_id movie_name genre
0 975900 Ghosts of Mars {"/m/01jfsb": "Thriller", "/m/06n90": "Science…
1 3196793 Getting Away with Murder: The JonBenét Ramsey … {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biograp…
2 28463795 Brun bitter {"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "D…
3 9363483 White Of The Eye {"/m/01jfsb": "Thriller", "/m/0glj9q": "Erotic…
4 261236 A Woman in Flames {"/m/07s9rl0": "Drama"}
In [12]:
plots = pd.read_csv(zip_file.open('MovieSummaries/plot_summaries.txt'), sep='\t', header=None)
plots.columns = ["movie_id", "plot"]
plots.head()
Out[12]:
movie_id plot
0 23890098 Shlykov, a hard-working taxi driver and Lyosha…
1 31186339 The nation of Panem consists of a wealthy Capi…
2 20663735 Poovalli Induchoodan is sentenced for six yea…
3 2231378 The Lemon Drop Kid , a New York City swindler,…
4 595909 Seventh-day Adventist Church pastor Michael Ch…
In [13]:
movies = pd.merge(plots, genres, on='movie_id')
movies.head()
Out[13]:
movie_id plot movie_name genre
0 23890098 Shlykov, a hard-working taxi driver and Lyosha… Taxi Blues {"/m/07s9rl0": "Drama", "/m/03q4nz": "World ci…
1 31186339 The nation of Panem consists of a wealthy Capi… The Hunger Games {"/m/03btsm8": "Action/Adventure", "/m/06n90":…
2 20663735 Poovalli Induchoodan is sentenced for six yea… Narasimham {"/m/04t36": "Musical", "/m/02kdv5l": "Action"…
3 2231378 The Lemon Drop Kid , a New York City swindler,… The Lemon Drop Kid {"/m/06qm3": "Screwball comedy", "/m/01z4y": "…
4 595909 Seventh-day Adventist Church pastor Michael Ch… A Cry in the Dark {"/m/0lsxr": "Crime Fiction", "/m/07s9rl0": "D…
In [14]:
import json
# each genre entry is a JSON object mapping ids to genre names; keep only the names
genres_lists = []
for i in movies['genre']:
    genres_lists.append(list(json.loads(i).values()))
movies['genre'] = genres_lists
movies.head()
Out[14]:
movie_id plot movie_name genre
0 23890098 Shlykov, a hard-working taxi driver and Lyosha… Taxi Blues [Drama, World cinema]
1 31186339 The nation of Panem consists of a wealthy Capi… The Hunger Games [Action/Adventure, Science Fiction, Action, Dr…
2 20663735 Poovalli Induchoodan is sentenced for six yea… Narasimham [Musical, Action, Drama, Bollywood]
3 2231378 The Lemon Drop Kid , a New York City swindler,… The Lemon Drop Kid [Screwball comedy, Comedy]
4 595909 Seventh-day Adventist Church pastor Michael Ch… A Cry in the Dark [Crime Fiction, Drama, Docudrama, World cinema…
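To see what the conversion in the cell above does to a single row, here is a minimal sketch on one hand-written genre string. The string is illustrative (it reuses two id/name pairs visible in the output above and is not copied from a specific movie), and the empty object is an assumed case for a movie with no genre.

```python
import json

# Illustrative genre string in the same format as the corpus column.
raw_genre = '{"/m/0lsxr": "Crime Fiction", "/m/01jfsb": "Thriller"}'

parsed = json.loads(raw_genre)        # dict mapping id -> genre name
genre_names = list(parsed.values())   # keep only the names, in their original order
print(genre_names)

# Assumed edge case: an empty JSON object yields an empty list of labels.
no_genre = list(json.loads('{}').values())
print(no_genre)
```

The ids are discarded because only the genre names are needed as labels.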
The following code uses scikit-learn’s MultiLabelBinarizer to generate a column for each movie genre:
In [15]:
from sklearn.preprocessing import MultiLabelBinarizer
multilabel_binarizer = MultiLabelBinarizer()
# fit the binarizer and transform the target variable in one step
y = multilabel_binarizer.fit_transform(movies['genre'])
# add one 0/1 column per genre
for idx, genre in enumerate(multilabel_binarizer.classes_):
    movies[genre] = y[:, idx]
movies.head()
Out[15]:
movie_id plot movie_name genre Absurdism Acid western Action Action Comedy Action Thrillers Action/Adventure Addiction Drama Adult Adventure Adventure Comedy Airplanes and airports Albino bias Alien Film Alien invasion Americana Animal Picture Animals Animated Musical Animated cartoon Animation Anime Anthology Anthropology Anti-war Anti-war film Apocalyptic and post-apocalyptic fiction Archaeology Archives and records Art film Auto racing Avant-garde B-Western B-movie Backstage Musical Baseball Beach Film … Star vehicle Statutory rape Steampunk Stoner film Stop motion Superhero Superhero movie Supermarionation Supernatural Surrealism Suspense Swashbuckler films Sword and Sandal Sword and sorcery Sword and sorcery films Tamil cinema Teen Television movie The Netherlands in World War II Therimin music Thriller Time travel Tokusatsu Tollywood Tragedy Tragicomedy Travel Vampire movies War effort War film Werewolf fiction Western Whodunit Women in prison films Workplace Comedy World History World cinema Wuxia Z movie Zombie Film
0 23890098 Shlykov, a hard-working taxi driver and Lyosha… Taxi Blues [Drama, World cinema] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
1 31186339 The nation of Panem consists of a wealthy Capi… The Hunger Games [Action/Adventure, Science Fiction, Action, Dr… 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 20663735 Poovalli Induchoodan is sentenced for six yea… Narasimham [Musical, Action, Drama, Bollywood] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 2231378 The Lemon Drop Kid , a New York City swindler,… The Lemon Drop Kid [Screwball comedy, Comedy] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 595909 Seventh-day Adventist Church pastor Michael Ch… A Cry in the Dark [Crime Fiction, Drama, Docudrama, World cinema… 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
5 rows × 367 columns
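The effect of MultiLabelBinarizer is easier to see on a small example. The sketch below uses a made-up list of four movies (the toy labels are assumptions, not rows from the corpus) and shows the sorted label set and the resulting 0/1 indicator matrix.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy label lists: one list of genres per movie; the last movie has no genre.
toy_genres = [["Drama", "World cinema"], ["Comedy"], ["Comedy", "Drama"], []]

mlb = MultiLabelBinarizer()
toy_y = mlb.fit_transform(toy_genres)

print(list(mlb.classes_))  # distinct labels, sorted alphabetically
print(toy_y)               # one row per movie, one column per label
print(toy_y.shape)         # (number of movies, number of distinct labels)

# inverse_transform maps indicator rows back to tuples of label names.
recovered = mlb.inverse_transform(toy_y)
print(recovered)
```

Each column of the matrix is the target of one binary classification problem, which is how the genre columns added to the movies data frame are used later.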
Finally, the following code uses scikit-learn to split the data into a training set, a dev-test set, and a test set.
In [17]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(movies, random_state=42, test_size=0.30, shuffle=True)
train, devtest = train_test_split(train, random_state=42, test_size=0.30, shuffle=True)
print("Training size:", len(train))
print("Devtest size:", len(devtest))
print("Test size:", len(test))
Training size: 20679
Devtest size: 8863
Test size: 12662
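Because the second split takes 30% of the remaining 70%, the final proportions are roughly 49% training, 21% dev-test and 30% test. The arithmetic can be checked with the sketch below, which assumes that the held-out size of each split is rounded up; under that assumption it reproduces the sizes printed above.

```python
import math

# Total number of movies, taken as the sum of the three sizes printed above.
n_total = 20679 + 8863 + 12662

# First split: 30% held out as the test set (assumed to be rounded up).
n_test = math.ceil(0.30 * n_total)
n_rest = n_total - n_test

# Second split: 30% of the remainder held out as the dev-test set.
n_devtest = math.ceil(0.30 * n_rest)
n_train = n_rest - n_devtest

print(n_train, n_devtest, n_test)
print(round(n_train / n_total, 2), round(n_devtest / n_total, 2), round(n_test / n_total, 2))
```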
In [31]:
train_texts = list(train['plot'])
train_labels = train.drop(labels=['movie_id', 'movie_name', 'plot', 'genre'], axis=1).to_numpy()
In [32]:
devtest_texts = list(devtest['plot'])
devtest_labels = devtest.drop(labels=['movie_id', 'movie_name', 'plot', 'genre'], axis=1).to_numpy()
In [33]:
test_texts = list(test['plot'])
test_labels = test.drop(labels=['movie_id', 'movie_name', 'plot', 'genre'], axis=1).to_numpy()
Exercise: A Simple Classifier
Design a TensorFlow-Keras neural model that has the following sequence of layers:
An input layer that accepts the tf-idf encoding of the input text, restricted to the top 8000 words. For this, you can use scikit-learn’s TfidfVectorizer with the option max_features=8000.
An output layer with as many cells as there are possible movie genres. There are 363 distinct genres; can you find out how to calculate this number?
Each output cell will act as a binary classifier, so the activation function should be sigmoid, and the loss function should be binary_crossentropy.
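The reason for pairing sigmoid with binary cross-entropy can be illustrated without any neural network library. The sketch below is a plain-Python approximation for a single example with three hypothetical labels; the logit values, the clipping constant and the averaging over labels are assumptions chosen for illustration, not a reproduction of the Keras implementation.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability, independently of the other labels."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over the labels of one example."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

logits = [2.0, -1.0, 0.0]   # hypothetical raw outputs for three labels
target = [1, 0, 1]          # this example carries the first and third label

probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(target, probs)
predicted = [int(p >= 0.5) for p in probs]  # threshold each label separately

print([round(p, 3) for p in probs], round(loss, 3), predicted)
```

Unlike a softmax, the probabilities do not have to sum to 1, so any number of labels can be switched on for the same example.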
Train your neural model using the training set. Determine the optimal number of epochs by examining the accuracy results on the devtest set. The model summary should look like this:
Layer (type)                 Output Shape              Param #
=================================================================
dense_2 (Dense)              (None, 363)               2904363
=================================================================
Total params: 2,904,363
Trainable params: 2,904,363
Non-trainable params: 0
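As a quick sanity check on the summary above, the parameter count of a single dense layer is one weight per input-output pair plus one bias per output cell. The variable names below are only illustrative.

```python
n_features = 8000   # size of the tf-idf vocabulary (max_features)
n_genres = 363      # number of output cells

# weights (inputs x outputs) plus one bias per output cell
dense_params = n_features * n_genres + n_genres
print(dense_params)
```

If your own summary shows a different number, the vocabulary size or the number of output cells probably differs from the ones requested.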
In [1]:
# Write your code here
Exercise: A Recurrent Neural Network
Implement a recurrent neural network that has the following sequence of layers:
An embedding layer that generates embedding vectors with 100 dimensions. Set the maximum input length to 100 words.
An LSTM layer that generates an output of 120 dimensions.
A final output layer that has 363 cells with a sigmoid activation.
The model summary should look like this:
Layer (type)                 Output Shape              Param #
=================================================================
embedding_4 (Embedding)      (None, 100, 100)          800000
_________________________________________________________________
lstm_4 (LSTM)                (None, 120)               106080
_________________________________________________________________
dense_7 (Dense)              (None, 363)               43923
=================================================================
Total params: 950,003
Trainable params: 950,003
Non-trainable params: 0
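The parameter counts in this summary can be reproduced by hand, which is a useful check that your layers have the intended sizes. The sketch below uses the standard LSTM count of four gates, each with input weights, recurrent weights and a bias; the variable names are only illustrative.

```python
vocab_size = 8000     # num_words used by the tokenizer
embedding_dim = 100   # dimensions of each embedding vector
lstm_units = 120      # dimensions of the LSTM output
n_genres = 363        # number of output cells

# one embedding vector per word in the vocabulary
embedding_params = vocab_size * embedding_dim

# four gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# weights plus biases of the output layer
dense_params = lstm_units * n_genres + n_genres

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Note that the maximum input length of 100 words does not affect the number of parameters, because the same weights are reused at every time step.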
For this exercise, use Keras’ tokenizer with the option num_words=8000 (so that you use the same vocabulary size as in the previous exercise).
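To make clear what kind of input the embedding layer expects, the sketch below imitates the two preprocessing steps in plain Python: mapping words to integer ids with a capped vocabulary, and padding or truncating each sequence to a fixed length. It is not the Keras API. The helper names build_vocab and encode are hypothetical, the whitespace tokenization is a simplification, and the choices made here (id 0 reserved for padding, out-of-vocabulary words dropped, padding and truncation at the start) are assumptions that mirror what I understand to be the Keras defaults.

```python
from collections import Counter

def build_vocab(texts, num_words):
    """Map the most frequent words to ids 1..num_words-1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    most_common = [word for word, _ in counts.most_common(num_words - 1)]
    return {word: idx for idx, word in enumerate(most_common, start=1)}

def encode(text, vocab, maxlen):
    """Turn a text into a fixed-length list of word ids."""
    ids = [vocab[w] for w in text.lower().split() if w in vocab]  # drop unknown words
    ids = ids[-maxlen:]                       # too long: keep the last maxlen ids
    return [0] * (maxlen - len(ids)) + ids    # too short: pad with 0 at the start

toy_texts = ["the ship lands", "the ship leaves", "the crew sleeps"]
vocab = build_vocab(toy_texts, num_words=3)   # keeps only the 2 most frequent words

padded = encode("the ship lands", vocab, maxlen=5)
truncated = encode("the ship lands", vocab, maxlen=1)
print(vocab, padded, truncated)
```

In the exercise the same idea applies with num_words=8000 and a maximum length of 100, so every plot becomes a list of 100 integers.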
In [2]:
# Write your code here
Optional Exercise: A More Complex Neural Network
Try to improve your classifiers by trying some of these options:
Use a different number of words, a different number of embedding dimensions, etc.
Add hidden layers.
Stack LSTM layers (this may make the system much slower to train; use it only if your computer has a GPU).
Use pre-trained word embeddings.
Use BERT from the Huggingface transformers library https://github.com/huggingface/transformers (this may make the system much slower to train; use it only if your computer has a GPU).
In [ ]:
# Write your code here