COSC2779LabExercises_W09_BERT
COSC 2779 | Deep Learning
Week 9 Lab Exercises: **Classify text – BERT**
Introduction
In this tutorial, you will learn how to classify text by using transfer learning from a BERT model.
A pre-trained model is a saved network that was previously trained on a large dataset, typically on a large-scale language-modelling task. You can either use the pre-trained model as is or use transfer learning to customise it for a given task.
In this tutorial, you will:
Use a pre-trained BERT model in TensorFlow.
The lab is partly based on the TensorFlow tutorial Classify text with BERT.
This notebook is designed to run on Google Colab. If you would like to run it on your local machine, make sure you have TensorFlow 2 installed.
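If you are running locally, one quick way to check your installation is to print the installed TensorFlow version:
In [ ]:
import tensorflow as tf
print(tf.__version__)   # expect a 2.x version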
Setting up the Notebook
Let’s first load the packages we need.
In [ ]:
import tensorflow as tf
AUTOTUNE = tf.data.experimental.AUTOTUNE
import numpy as np
import pandas as pd
import tensorflow_datasets as tfds
import pathlib
import shutil
import tempfile
from IPython import display
from matplotlib import pyplot as plt
Loading the dataset
We are going to use the HappyDB dataset for this lab.
HappyDB is a collection of happy moments described by individuals experiencing those moments.
The task is to classify the text of each happy moment into one of the following classes:
affection
achievement
enjoy_the_moment
bonding
leisure
nature
exercise
The data can be downloaded from HappyDB. I have also uploaded the data to Canvas.
If you use this dataset for any other purpose, please cite:
@inproceedings{asai2018happydb,
  title     = {HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments},
  author    = {Asai, Akari and Evensen, Sara and Golshan, Behzad and Halevy, Alon and Li, Vivian and Lopatenko, Andrei and Stepanov, Daniela and Suhara, Yoshihiko and Tan, Wang-Chiew and Xu, Yinzhan},
  booktitle = {Proceedings of LREC 2018},
  month     = {May},
  year      = {2018},
  address   = {Miyazaki, Japan},
  publisher = {European Language Resources Association (ELRA)}
}
In [ ]:
from google.colab import drive
drive.mount('/content/drive')
In [ ]:
!cp /content/drive/'My Drive'/COSC2779/COSC2779lab9/HappyDBData.zip .
In [ ]:
!unzip -q -o HappyDBData.zip
!rm HappyDBData.zip
Next, let's create a data loader.
In [ ]:
AUTOTUNE = tf.data.AUTOTUNE
batch_size = 32
seed = 42
raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'HappyDBData/Train',
    batch_size=batch_size,
    label_mode='categorical',
    validation_split=0.2,
    subset='training',
    seed=seed)
class_names = raw_train_ds.class_names
train_ds = raw_train_ds.cache().prefetch(buffer_size=AUTOTUNE)

val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'HappyDBData/Train',
    batch_size=batch_size,
    label_mode='categorical',
    validation_split=0.2,
    subset='validation',
    seed=seed)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'HappyDBData/Test',
    label_mode='categorical',
    batch_size=batch_size)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)
Data cleaning and pre-processing
We learned how to do data cleaning in the last lab, so that step is left as an exercise for you here. Try to incorporate the data cleaning directly into the data loader; a possible starting point is sketched below.
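For example, a light cleaning step can be attached to the tf.data pipeline using tf.strings ops. The sketch below is only a starting point (the regexes are placeholders, not the exact cleaning from last week's lab); keep the cleaning light, since the BERT preprocessing model used later already lowercases and tokenises punctuation for uncased models.
In [ ]:
import tensorflow as tf

def clean_text(text, label):
  # Light-touch cleaning only; adjust the patterns to match last week's lab.
  text = tf.strings.lower(text)
  text = tf.strings.regex_replace(text, r'<[^>]+>', ' ')   # drop any markup
  text = tf.strings.regex_replace(text, r'\s+', ' ')       # collapse whitespace
  return tf.strings.strip(text), label

# Example: map the cleaning over the raw dataset before caching/prefetching.
# train_ds = raw_train_ds.map(clean_text, num_parallel_calls=tf.data.AUTOTUNE) \
#                        .cache().prefetch(buffer_size=tf.data.AUTOTUNE)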
Loading models from TensorFlow Hub
BERT and other Transformer encoder architectures have been wildly successful on a variety of tasks in NLP (natural language processing). They compute vector-space representations of natural language that are suitable for use in deep learning models. The BERT family of models uses the Transformer encoder architecture to process each token of input text in the full context of all tokens before and after, hence the name: Bidirectional Encoder Representations from Transformers.
BERT models are usually pre-trained on a large corpus of text, then fine-tuned for specific tasks. To learn more about BERT, please read the following tutorials:
A Visual Guide to Using BERT for the First Time
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
The Illustrated Transformer
Selecting the BERT model
Here you can choose which BERT model you will load from TensorFlow Hub and fine-tune. There are multiple BERT models available.
BERT-Base, Uncased and seven more models with trained weights released by the original BERT authors.
Small BERTs have the same general architecture but fewer and/or smaller Transformer blocks, which lets you explore tradeoffs between speed, size and quality.
ALBERT: four different sizes of “A Lite BERT” that reduces model size (but not computation time) by sharing parameters between layers.
BERT Experts: eight models that all have the BERT-base architecture but offer a choice between different pre-training domains, to align more closely with the target task.
Electra has the same architecture as BERT (in three different sizes), but gets pre-trained as a discriminator in a set-up that resembles a Generative Adversarial Network (GAN).
BERT with Talking-Heads Attention and Gated GELU [base, large] has two improvements to the core of the Transformer architecture.
The model documentation on TensorFlow Hub has more details and references to the research literature. Follow the links above, or click on the tfhub.dev URL printed after the next cell execution.
The suggestion is to start with a Small BERT (with fewer parameters) since they are faster to fine-tune. If you like a small model but with higher accuracy, ALBERT might be your next option. If you want even better accuracy, choose one of the classic BERT sizes or their recent refinements like Electra, Talking Heads, or a BERT Expert.
Aside from the models available below, there are multiple versions of the models that are larger and can yield even better accuracy, but they are too big to be fine-tuned on a single GPU. You will be able to do that on the Solve GLUE tasks using BERT on a TPU colab.
You’ll see in the code below that switching the tfhub.dev URL is enough to try any of these models, because all the differences between them are encapsulated in the SavedModels from TF Hub.
In [ ]:
# A dependency of the preprocessing for BERT inputs
!pip install -q -U tensorflow-text
!pip install -q tf-models-official
import os
import tensorflow_hub as hub
import tensorflow_text as text
from official.nlp import optimization # to create AdamW optimizer
tf.get_logger().setLevel('ERROR')
In [ ]:
bert_model_name = 'small_bert/bert_en_uncased_L-4_H-512_A-8'

map_name_to_handle = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_L-12_H-768_A-12/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-6_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-8_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-10_H-768_A-12/1',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/1',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-256_A-4/1',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-512_A-8/1',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-768_A-12/1',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_base/2',
    'electra_small':
        'https://tfhub.dev/google/electra_small/2',
    'electra_base':
        'https://tfhub.dev/google/electra_base/2',
    'experts_pubmed':
        'https://tfhub.dev/google/experts/bert/pubmed/2',
    'experts_wiki_books':
        'https://tfhub.dev/google/experts/bert/wiki_books/2',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/talkheads_ggelu_bert_en_base/1',
}
map_model_to_preprocess = {
    'bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_en_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_cased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-2_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-4_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-6_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-8_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-10_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-128_A-2':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-256_A-4':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-512_A-8':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'small_bert/bert_en_uncased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'bert_multi_cased_L-12_H-768_A-12':
        'https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3',
    'albert_en_base':
        'https://tfhub.dev/tensorflow/albert_en_preprocess/3',
    'electra_small':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'electra_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_pubmed':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'experts_wiki_books':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
    'talking-heads_base':
        'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3',
}
tfhub_handle_encoder = map_name_to_handle[bert_model_name]
tfhub_handle_preprocess = map_model_to_preprocess[bert_model_name]
print(f'BERT model selected           : {tfhub_handle_encoder}')
print(f'Preprocess model auto-selected: {tfhub_handle_preprocess}')
The preprocessing model
Text inputs need to be transformed to numeric token ids and arranged in several Tensors before being input to BERT. TensorFlow Hub provides a matching preprocessing model for each of the BERT models discussed above, which implements this transformation using TF ops from the TF.text library. It is not necessary to run pure Python code outside your TensorFlow model to preprocess text.
The preprocessing model must be the one referenced by the documentation of the BERT model, which you can read at the URL printed above. For BERT models from the drop-down above, the preprocessing model is selected automatically.
Note: You will load the preprocessing model into a hub.KerasLayer to compose your fine-tuned model. This is the preferred API to load a TF2-style SavedModel from TF Hub into a Keras model.
In [ ]:
bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
Let’s try the preprocessing model on some text and see the output:
In [ ]:
text_test = ['this is the ninth lab in deep learning course!']
text_preprocessed = bert_preprocess_model(text_test)

print(f'Keys       : {list(text_preprocessed.keys())}')
print(f'Shape      : {text_preprocessed["input_word_ids"].shape}')
print(f'Word Ids   : {text_preprocessed["input_word_ids"][0, :12]}')
print(f'Input Mask : {text_preprocessed["input_mask"][0, :12]}')
print(f'Type Ids   : {text_preprocessed["input_type_ids"][0, :12]}')
As you can see, you now have the three outputs from the preprocessing that a BERT model expects (input_word_ids, input_mask and input_type_ids).
Some other important points:
The input is truncated to 128 tokens. The number of tokens can be customized (a sketch follows this list), and you can find more details in the Solve GLUE tasks using BERT on a TPU colab.
The input_type_ids only have one value (0) because this is a single-sentence input. For a multiple-sentence input, there would be one number for each input segment.
Since this text preprocessor is a TensorFlow model, it can be included in your model directly.
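If you do need a different sequence length, the preprocessing SavedModel also exposes its tokenize and bert_pack_inputs pieces separately (see the TF Hub documentation of the preprocessing model). A rough sketch, using an arbitrary length of 64 tokens:
In [ ]:
import tensorflow as tf
import tensorflow_hub as hub

preprocessor = hub.load(tfhub_handle_preprocess)

# Tokenize the raw strings into ragged word-piece ids.
tokenize = hub.KerasLayer(preprocessor.tokenize)
tokenized = tokenize(tf.constant(text_test))

# Pack the tokenized segment to a custom sequence length (64 instead of the default 128).
packer = hub.KerasLayer(preprocessor.bert_pack_inputs,
                        arguments=dict(seq_length=64))
custom_inputs = packer([tokenized])

print(custom_inputs['input_word_ids'].shape)   # (1, 64)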
Using the BERT model
Before putting BERT into your own model, let’s take a look at its outputs. You will load it from TF Hub and see the returned values.
In [ ]:
bert_model = hub.KerasLayer(tfhub_handle_encoder)
In [ ]:
bert_results = bert_model(text_preprocessed)
print(f'Loaded BERT: {tfhub_handle_encoder}')
print(f'Pooled Outputs Shape   : {bert_results["pooled_output"].shape}')
print(f'Pooled Outputs Values  : {bert_results["pooled_output"][0, :12]}')
print(f'Sequence Outputs Shape : {bert_results["sequence_output"].shape}')
print(f'Sequence Outputs Values: {bert_results["sequence_output"][0, :12]}')
The BERT models return a map with three important keys: pooled_output, sequence_output and encoder_outputs:
pooled_output represents each input sequence as a whole. The shape is [batch_size, H]. You can think of this as an embedding for the entire happy-moment text.
sequence_output represents each input token in context. The shape is [batch_size, seq_length, H]. You can think of this as a contextual embedding for every token in the happy-moment text.
encoder_outputs are the intermediate activations of the L Transformer blocks. outputs["encoder_outputs"][i] is a Tensor of shape [batch_size, seq_length, H] with the outputs of the i-th Transformer block, for 0 <= i < L. The last value of the list is equal to sequence_output (a quick check is shown below).
For the fine-tuning you are going to use the pooled_output array.
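You can verify these shapes, and that the last entry of encoder_outputs matches sequence_output, with a quick check on the bert_results computed above:
In [ ]:
encoder_outputs = bert_results['encoder_outputs']

print(f'Number of Transformer blocks (L): {len(encoder_outputs)}')
print(f'Shape of each block output      : {encoder_outputs[0].shape}')

# The last intermediate activation is the sequence_output itself.
print(tf.reduce_all(tf.equal(encoder_outputs[-1], bert_results['sequence_output'])))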
Define your model
You will create a very simple fine-tuned model, with the preprocessing model, the selected BERT model, one Dense and a Dropout layer.
Note: for more information about the base model's input and output you can follow the model's URL for documentation. Here specifically, you don't need to worry about it because the preprocessing model will take care of that for you.
In [ ]:
def build_classifier_model():
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer(tfhub_handle_preprocess, name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  # trainable=False keeps the BERT weights frozen (feature extraction);
  # set trainable=True if you want to fine-tune the encoder as well.
  encoder = hub.KerasLayer(tfhub_handle_encoder, trainable=False, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(7, activation='softmax', name='classifier')(net)  # 7 happiness categories
  return tf.keras.Model(text_input, net)
In [ ]:
classifier_model = build_classifier_model()
bert_raw_result = classifier_model(tf.constant(text_test))
print(bert_raw_result)  # the classifier already ends in a softmax, so these are class probabilities
In [ ]:
tf.keras.utils.plot_model(classifier_model)
Model training
You now have all the pieces to train a model, including the preprocessing module, BERT encoder, data, and classifier.
Loss function
Since this is a multi-class classification problem and the model outputs a probability distribution over the classes, you'll use the categorical cross-entropy loss function.
In [ ]:
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False)
metrics = tf.metrics.CategoricalAccuracy()
Optimizer
For fine-tuning, let's use the same optimizer that BERT was originally trained with: "Adaptive Moments" (Adam). This optimizer minimizes the prediction loss and also applies weight-decay regularization (decoupled from the adaptive moments), a variant known as AdamW.
For the learning rate (init_lr), you will use the same schedule as BERT pre-training: linear decay of a notional initial learning rate, prefixed with a linear warm-up phase over the first 10% of training steps (num_warmup_steps). In line with the BERT paper, the initial learning rate is smaller for fine-tuning (best of 5e-5, 3e-5, 2e-5).
In [ ]:
epochs = 5
steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
num_train_steps = steps_per_epoch * epochs
num_warmup_steps = int(0.1*num_train_steps)
init_lr = 3e-5
optimizer = optimization.create_optimizer(init_lr=init_lr,
num_train_steps=num_train_steps,
num_warmup_steps=num_warmup_steps,
optimizer_type='adamw')
Loading the BERT model and training
Using the classifier_model you created earlier, you can compile the model with the loss, metric and optimizer.
In [ ]:
classifier_model.compile(optimizer=optimizer,
loss=loss,
metrics=metrics)
In [ ]:
print(f'Training model with {tfhub_handle_encoder}')
history = classifier_model.fit(x=train_ds,
validation_data=val_ds,
epochs=epochs)
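Once training finishes, you may also want to evaluate the fine-tuned classifier on the held-out test split loaded earlier, for example:
In [ ]:
loss, accuracy = classifier_model.evaluate(test_ds)

print(f'Test loss     : {loss}')
print(f'Test accuracy : {accuracy}')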