COSC2779LabExercises_W8
¶
COSC 2779 | Deep Learning
¶
Week 8 Lab Exercises: **Classify text by using transfer learning from a pre-trained embedding**
¶
Introduction¶
In this tutorial, you will learn how to classify text by using transfer learning from a pre-trained embedding.
A pre-trained model is a saved network that was previously trained on a large dataset, typically on a large-scale language-modelling task. You can either use the pre-trained model as is, or use transfer learning to customise it for a given task.
In this tutorial, you will:
Use pre-trained word embeddings in TensorFlow.
The lab is partly based on "How to use pre-trained word vectors" by NormalizedNerd.
This notebook is designed to run on Google Colab. If you would like to run it on your local machine, make sure that you have TensorFlow 2.0 installed.
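If you are running the notebook locally, you can quickly confirm which TensorFlow version is installed with a cell such as the one below.
In [ ]:
import tensorflow as tf
# The printed version should start with "2." for this lab.
print(tf.__version__)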
Setting up the Notebook¶
Let’s first load the packages we need.
In [ ]:
import tensorflow as tf
AUTOTUNE = tf.data.experimental.AUTOTUNE
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
import pathlib
import shutil
import tempfile
from IPython import display
from matplotlib import pyplot as plt
We can use TensorBoard to view the learning curves. Let's first set it up.
In [ ]:
logdir = pathlib.Path(tempfile.mkdtemp())/"tensorboard_logs"
shutil.rmtree(logdir, ignore_errors=True)
# Load the TensorBoard notebook extension
%load_ext tensorboard
# Open an embedded TensorBoard viewer
%tensorboard --logdir {logdir}/models
We can also write our own function to plot the models' training histories once training has completed.
In [ ]:
from itertools import cycle
def plotter(history_hold, metric='binary_crossentropy', ylim=[0.0, 1.0]):
    cycol = cycle('bgrcmk')
    for name, item in history_hold.items():
        y_train = item.history[metric]
        y_val = item.history['val_' + metric]
        x_train = np.arange(0, len(y_val))
        c = next(cycol)
        plt.plot(x_train, y_train, c + '-', label=name + '_train')
        plt.plot(x_train, y_val, c + '--', label=name + '_val')
    plt.legend()
    plt.xlim([1, max(plt.xlim())])
    plt.ylim(ylim)
    plt.xlabel('Epoch')
    plt.ylabel(metric)
    plt.grid(True)
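plotter expects a dictionary that maps a run name to the History object returned by model.fit (i.e. an object whose .history dictionary holds the metric and its val_ counterpart). The cell below is a small, optional smoke test with made-up loss values, just to illustrate the expected structure; the real histories are produced by model.fit later in the lab.
In [ ]:
# Optional smoke test of plotter with fabricated values (illustration only).
# Real History objects come from model.fit later in this notebook.
class _FakeHistory:
    def __init__(self, history):
        self.history = history

demo_histories = {'demo_run': _FakeHistory({'loss': [1.2, 0.9, 0.7],
                                            'val_loss': [1.3, 1.1, 1.0]})}
plotter(demo_histories, metric='loss', ylim=[0.0, 2.0])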
Loading the dataset¶
We are going to use the HappyDB database for this lab.
HappyDB is a collection of happy moments described by individuals experiencing those moments.
The task is to classify the text of happy moments into the following classes:
affection
achievement
enjoy_the_moment
bonding
leisure
nature
exercise
The data can be downloaded from: HappyDB. I have also uploaded the data to Canvas.
If you use this dataset for any other purpose, please cite:
@inproceedings{asai2018happydb,
title = {HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments},
author = {Asai, Akari and Evensen, Sara and Golshan, Behzad and Halevy, Alon
and Li, Vivian and Lopatenko, Andrei and Stepanov, Daniela and Suhara, Yoshihiko
and Tan, Wang-Chiew and Xu, Yinzhan},
booktitle = {Proceedings of LREC 2018},
month = {May}, year={2018},
address = {Miyazaki, Japan},
publisher = {European Language Resources Association (ELRA)}
}
Download the data in Colab (via Google Drive)
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
In [ ]:
!cp /content/drive/'My Drive'/COSC2779/COSC2779lab10/cleaned_hm.csv .
Read the dataset and explore
In [ ]:
data = pd.read_csv("cleaned_hm.csv")
data.head()
We will be using the cleaned_hm column as our x and the predicted_category column as our y.
In [ ]:
data["predicted_category"].value_counts()
In [ ]:
data["num_sentence"].value_counts()
Data cleaning and pre-processing¶
Data pre-processing¶
Deleting happy moments with more than 10 sentences
In [ ]:
mod_data = data.loc[data['num_sentence'] <= 10]
mod_data["predicted_category"].value_counts()
Categorical to numerical
In [ ]:
encode = {
    "affection": 0,
    "achievement": 1,
    "bonding": 2,
    "enjoy_the_moment": 3,
    "leisure": 4,
    "nature": 5,
    "exercise": 6
}

mod_data["predicted_category"] = mod_data["predicted_category"].apply(lambda x: encode[x])
mod_data.head()
Text cleaning¶
Cleaning text is an important part of NLP and involves a lot of engineering. The code below uses the Python string package and nltk to do some cleaning. A more popular approach to text pre-processing is regular expressions, via the associated Python library re.
In [ ]:
import string
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

happy_lines = list()
lines = mod_data["cleaned_hm"].values.tolist()

for line in lines:
    # tokenize the text
    tokens = word_tokenize(line)
    # convert to lower case
    tokens = [w.lower() for w in tokens]
    # remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    # remove non-alphabetic tokens
    words = [word for word in stripped if word.isalpha()]
    happy_lines.append(words)
The above cleans the cleaned_hm column and converts each happy moment into a list of words.
In [ ]:
happy_lines[:2]
Generating the task vocabulary¶
You can use the Keras Tokenizer to generate a vocabulary for your data.
In [ ]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

validation_split = 0.20
max_length = 55

tokenizer_obj = Tokenizer()
tokenizer_obj.fit_on_texts(happy_lines)
sequences = tokenizer_obj.texts_to_sequences(happy_lines)

word_index = tokenizer_obj.word_index
print("unique tokens - " + str(len(word_index)))
vocab_size = len(tokenizer_obj.word_index) + 1
print('vocab_size - ' + str(vocab_size))

lines_pad = pad_sequences(sequences, maxlen=max_length, padding='post')
category = mod_data['predicted_category'].values
Split the dataset into train and validation
In [ ]:
indices = np.arange(lines_pad.shape[0])
np.random.shuffle(indices)
lines_pad = lines_pad[indices]
category = category[indices]

n_values = np.max(category) + 1
Y = np.eye(n_values)[category]

num_validation_samples = int(validation_split * lines_pad.shape[0])

X_train_pad = lines_pad[:-num_validation_samples]
y_train = Y[:-num_validation_samples]
X_val_pad = lines_pad[-num_validation_samples:]
y_val = Y[-num_validation_samples:]
To demonstrate the utility of transfer learning, I will sample a smaller training dataset.
In [ ]:
# Randomly sample some train data
train_len = X_train_pad.shape[0]
idx = np.random.randint(train_len, size=train_len//25)

X_train_pad_sampled = X_train_pad[idx, :]
y_train_sampled = y_train[idx]
In [ ]:
print('Shape of X_train_pad:', X_train_pad.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of X_train_pad_sampled:', X_train_pad_sampled.shape)
print('Shape of y_train_sampled:', y_train_sampled.shape)
print('Shape of X_val_pad:', X_val_pad.shape)
print('Shape of y_val:', y_val.shape)
No transfer learning¶
First, let's try a simple model without transfer learning.
In [ ]:
def get_callbacks(name):
    return [
        tf.keras.callbacks.TensorBoard(logdir/name, histogram_freq=1),
    ]
In [ ]:
from tensorflow.keras.layers import Dense, Embedding, GRU, LSTM, Bidirectional
from tensorflow.keras.models import Sequential

embedding_dim = 100

embedding_layer = Embedding(len(word_index) + 1,
                            embedding_dim,
                            input_length=max_length,
                            trainable=True)

model_glove = Sequential()
model_glove.add(embedding_layer)
model_glove.add(LSTM(units=32, dropout=0.2, recurrent_dropout=0.25))
model_glove.add(Dense(7, activation='softmax'))

model_glove.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
print(model_glove.summary())
To save time in the lab, let's train the model for 5 epochs. You may change this later.
In [ ]:
EPOCH = 5
In [ ]:
m_histories = {}

m_histories['no_TL'] = model_glove.fit(X_train_pad_sampled, y_train_sampled,
                                       batch_size=32, epochs=EPOCH,
                                       validation_data=(X_val_pad, y_val),
                                       callbacks=get_callbacks('models/no_TL'),
                                       verbose=1)
In [ ]:
plotter(m_histories, ylim=[0.0, 2.0], metric='loss')
In [ ]:
plotter(m_histories, ylim=[0.0, 1.1], metric='categorical_accuracy')
What are your observations?
With Transfer Learning¶
Now let's explore how we can transfer the word embeddings. We will be using the GloVe word embeddings made available by Stanford researchers: GloVe: Global Vectors for Word Representation.
I have downloaded the 100-dimensional embeddings trained on Wikipedia. You can download the appropriate word embeddings from the above site and upload them to your Google Drive.
In [ ]:
!cp /content/drive/'My Drive'/COSC2779/COSC2779lab10/glove.6B.100d.txt .
Read the GloVe vectors from file
In [ ]:
file = open('glove.6B.100d.txt', encoding='utf-8')
glove_vectors = dict()
for line in file:
    values = line.split()
    word = values[0]
    features = np.asarray(values[1:], dtype='float32')
    glove_vectors[word] = features
file.close()
Transfer the GloVe embedding vectors to your embedding matrix. This is done by mapping our task vocabulary to the GloVe vocabulary.
In [ ]:
E_T = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    embedding_vector = glove_vectors.get(word)
    if embedding_vector is not None:
        E_T[i] = embedding_vector
Train the model. Note the changes to the embedding layer: it is now initialised with the GloVe weights and frozen (trainable=False).
In [ ]:
from tensorflow.keras.layers import Dense, Embedding, GRU, LSTM, Bidirectional
from tensorflow.keras.models import Sequential

embedding_layer_TL = Embedding(len(word_index) + 1,
                               embedding_dim,
                               weights=[E_T],
                               input_length=max_length,
                               trainable=False)

model_glove = Sequential()
model_glove.add(embedding_layer_TL)
model_glove.add(LSTM(units=32, dropout=0.2, recurrent_dropout=0.25))
model_glove.add(Dense(7, activation='softmax'))

model_glove.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
print(model_glove.summary())
In [ ]:
m_histories['with_TL'] = model_glove.fit(X_train_pad_sampled, y_train_sampled,
                                         batch_size=32, epochs=EPOCH,
                                         validation_data=(X_val_pad, y_val),
                                         callbacks=get_callbacks('models/with_TL'),
                                         verbose=1)
In [ ]:
plotter(m_histories, ylim=[0.0, 2.0], metric='loss')
In [ ]:
plotter(m_histories, ylim=[0.0, 1.1], metric='categorical_accuracy')
Did you see an improvement when using transfer learning?
Finally, we can use the TensorBoard Embedding Projector to visualise the learned embedding vectors.
In [ ]:
import os
from tensorboard.plugins import projector

# Save the labels separately, one word per line.
with open(os.path.join(logdir, 'metadata.tsv'), "w") as f:
    for word, i in word_index.items():
        f.write("{}\n".format(word))

# Save the weights we want to analyse as a variable. Note that the first
# value represents any unknown word, which is not in the metadata, so
# we will remove that value.
weights = tf.Variable(model_glove.layers[0].get_weights()[0][1:])

# Create a checkpoint from the embedding; the filename and key are the
# name of the tensor.
checkpoint = tf.train.Checkpoint(embedding=weights)
checkpoint.save(os.path.join(logdir, "embedding.ckpt"))

# Set up the projector config.
config = projector.ProjectorConfig()
embedding = config.embeddings.add()
# The name of the tensor will be suffixed by `/.ATTRIBUTES/VARIABLE_VALUE`.
embedding.tensor_name = "embedding/.ATTRIBUTES/VARIABLE_VALUE"
embedding.metadata_path = 'metadata.tsv'
projector.visualize_embeddings(logdir, config)
If you want to fine-tune the pre-trained embedding, you can unfreeze the embedding layer:
In [ ]:
for layer in model_glove.layers:
    print(layer.name)
    if layer.name == 'embedding_1':
        layer.trainable = True
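Setting trainable = True only takes effect once the model is recompiled, and the layer name ('embedding_1') depends on how many Embedding layers have been created in the current session, so it may differ on your run. The cell below is a minimal sketch of how you might continue training with the unfrozen embedding; the reduced learning rate is an illustrative choice, not a prescribed value.
In [ ]:
# Sketch: recompile after unfreezing the embedding layer, then fine-tune.
# A smaller learning rate (here 1e-4, an assumed value) helps avoid
# destroying the pre-trained GloVe weights.
model_glove.compile(loss='categorical_crossentropy',
                    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                    metrics=['categorical_accuracy'])

m_histories['fine_tune'] = model_glove.fit(X_train_pad_sampled, y_train_sampled,
                                           batch_size=32, epochs=EPOCH,
                                           validation_data=(X_val_pad, y_val),
                                           callbacks=get_callbacks('models/fine_tune'),
                                           verbose=1)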