Fine-tuning with BERT
In this workshop, we’ll learn how to use a pre-trained BERT model for a sentiment analysis task. We’ll be using the PyTorch framework and Hugging Face’s transformers library, which provides a suite of transformer models with a consistent interface.
Note: You may find certain parts of the code difficult to follow. This is because the model is built on the PyTorch framework, so there’ll be PyTorch syntax scattered throughout the code. If you want to understand the code better, you’re encouraged to work through the PyTorch tutorial and to read the code of a PyTorch language model.
Now let’s enable the TPU for the Colab notebook. We can do this by going to “Runtime > Change runtime type” and selecting “TPU” as the hardware accelerator, then clicking Save.
First, let’s install the PyTorch and transformers packages, as well as PyTorch/XLA for TPU support.
In [ ]:
!pip install torch torchvision transformers
In [ ]:
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
!pip install cloud-tpu-client==0.10 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.8-cp37-cp37m-linux_x86_64.whl
In [ ]:
import torch
#[FOR TPU]
import torch_xla
import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp
import torch_xla.utils.utils as xu
#[FOR TPU] Checking device
t = torch.randn(2, 2, device=xm.xla_device())
print(t.device)
print(t)
The installation will take a couple of minutes.
Once the packages are installed, we’ll load a pre-trained BERT model. We’ll use the smaller uncased BERT model (uncased means the data used for pre-training BERT is all lowercased).
In [ ]:
#load pretrained bert base model
from transformers import BertModel
bert_model = BertModel.from_pretrained('bert-base-uncased')
print("Done loading BERT model.")
BERT uses WordPiece tokenisation, which is a sub-word tokenisation algorithm like BPE. Let’s tokenise a sentence with WordPiece and see how it works.
In [ ]:
from transformers import BertTokenizer
#load BERT's WordPiece tokenisation model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
sentence = 'I enjoyed this movie sooooo much.'
tokens = tokenizer.tokenize(sentence)
print(tokens)
We can see that most words are tokenised as single tokens, but the word “sooooo” is tokenised into three subwords: “soo”, “##oo” and “##o”. The double hash “##” indicates that a token is a subword continuation of the word it belongs to.
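As an optional check, the tokenizer can also reassemble these subwords back into a string; convert_tokens_to_string merges the “##” continuation pieces with the token before them.
In [ ]:
#Optional: merge the WordPiece tokens back into a string
#convert_tokens_to_string joins the "##" continuation pieces with the preceding token
print(tokenizer.convert_tokens_to_string(tokens))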
BERT prepends every sentence with a special [CLS] token. As we saw in the lecture, we need this token when we fine-tune BERT for downstream tasks (e.g. spam detection).
BERT also terminates a sentence with a special [SEP] token. While this token doesn’t do much when we’re working with problems that have single sentences as input (e.g. sentiment classification), it’s useful when we’re working with problems that involve sentence pairs (e.g. textual entailment and sentence similarity classification). In those cases, we need [SEP] to indicate when the first sentence finishes, and when the second sentence starts (e.g. for the sentence pair (this movie is fantastic, this film is amazing), the input to BERT will be: [CLS] this movie is fantastic [SEP] this film is amazing [SEP]).
Note: Recall that in addition to the masked language model objective, BERT is also pre-trained with the next-sentence prediction objective. [SEP] is needed for the next-sentence objective (since it involves a sentence pair), and [CLS] is also used in this objective to classify the input sentence pair. So [CLS] and [SEP] aren’t used only during fine-tuning; they are also used during pre-training.
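If you’d like to see how a sentence pair is encoded in practice, here’s a minimal sketch using the same tokenizer: calling it with two sentences adds [CLS] and [SEP] automatically, and the token_type_ids it returns correspond to the segment IDs we’ll construct by hand later.
In [ ]:
#Encode a sentence pair; the tokenizer inserts [CLS] and [SEP] for us
pair = tokenizer('this movie is fantastic', 'this film is amazing')
print(tokenizer.convert_ids_to_tokens(pair['input_ids']))
print(pair['token_type_ids']) #0s for the first sentence, 1s for the second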
Now let’s preprocess the sentence by prepending [CLS] and appending [SEP].
In [ ]:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)
Since we’ll be using minibatches when fine-tuning BERT, we need to pad the input sequences so that they are all of the same length. We’ll use the [PAD] token for this.
Additionally, we need to create an attention mask vector. The attention mask is a binary vector that tells BERT which tokens should and should not be attended to. We need it here because we want BERT to ignore the [PAD] tokens (i.e. to not consider them when doing self-attention).
Now let’s pad the input sequence to a fixed length (12 in this example) and create the binary attention mask vector.
In [ ]:
T = 12
padded_tokens = tokens + ['[PAD]' for _ in range(T - len(tokens))]
print(padded_tokens)
attn_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]
print(attn_mask)
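As an aside, the tokenizer can do all of this preprocessing (adding [CLS]/[SEP], padding to a maximum length and building the attention mask) in a single call; a minimal sketch, assuming the same maximum length of 12:
In [ ]:
#Let the tokenizer add special tokens, pad and build the attention mask in one call
encoded = tokenizer(sentence, padding='max_length', truncation=True, max_length=12)
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
print(encoded['attention_mask'])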
The last preprocessing step is to create the segment IDs. The segment IDs form another binary vector, denoting which sentence (first or second) each token in the input sequence belongs to. We use 0 to denote the first sentence, and 1 the second sentence. As we are working with a single sentence as input for sentiment analysis, the segment IDs are just a vector of 0’s. But for a sentence pair classification problem like sentence similarity classification, an input like [CLS] this movie is fantastic [SEP] this film is amazing [SEP] would have segment IDs of [0 0 0 0 0 0 1 1 1 1 1].
In [ ]:
seg_ids = [0 for _ in range(len(padded_tokens))] #Since we only have a single sequence as input
print(seg_ids)
Now let’s convert the tokens in the preprocessed sentence into their respective vocab IDs.
In [ ]:
token_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
print(token_ids)
We can see [CLS] has a vocab ID of 101, [SEP] 102 and [PAD] 0.
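These special-token IDs are also exposed as attributes on the tokenizer, so we can double-check them:
In [ ]:
#The special-token IDs are available directly on the tokenizer
print(tokenizer.cls_token_id, tokenizer.sep_token_id, tokenizer.pad_token_id)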
Next let’s turn the IDs into torch tensors, and feed the input sequence to BERT.
In [ ]:
import torch
#Converting all the input vectors to torch tensors
token_ids_t = torch.tensor(token_ids).unsqueeze(0) #Shape : [1, 12]
attn_mask_t = torch.tensor(attn_mask).unsqueeze(0) #Shape : [1, 12]
seg_ids_t = torch.tensor(seg_ids).unsqueeze(0) #Shape : [1, 12]
#Feed them to bert and get the contextualised embeddings
outputs = bert_model(token_ids_t, attention_mask = attn_mask_t,\
token_type_ids = seg_ids_t)
hidden_reps = outputs.last_hidden_state
print(hidden_reps.shape)
print(hidden_reps[0, 0, :10])
hidden_reps (outputs.last_hidden_state) contains the contextualised embeddings for all tokens in the sentence ([CLS] is the first token).
It has a shape of [1, 12, 768] because we have one sentence in our minibatch, that sentence is 12 tokens long, and each token has a contextualised embedding of 768 dimensions.
outputs.pooler_output is the pooled output: the [CLS] embedding passed through an additional linear layer and tanh activation.
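We won’t use the pooled output in this workshop, but we can inspect its shape if we’re curious:
In [ ]:
#Pooled output: one 768-dimensional vector per sentence in the minibatch
print(outputs.pooler_output.shape) #Expect [1, 768]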
Now that we understand how to preprocess sentences and get contextualised embeddings from BERT, let’s move on to the actual classification problem: sentiment analysis.
We’ll be using the Stanford Sentiment Treebank data as our dataset. Let’s upload the data (10-train.tsv and 10-dev.tsv) to Colab. We can do this by clicking the folder icon on the left, and selecting “Upload” to upload the two files.
Once the files are uploaded, you should see them appearing in the file system.
We’ll now create SSTDataset, a dataset class that loads the data and provides a __getitem__ function to fetch a sentence, preprocess it (following the steps we did earlier) and return the tokenised sentence, attention mask and ground-truth label.
Note: we do not need to create the segment IDs here since we’re working with a task (sentiment analysis) that only has single sentences as input.
In [ ]:
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer
import pandas as pd
class SSTDataset(Dataset):
    def __init__(self, filename, maxlen):
        #Store the contents of the file in a pandas dataframe
        self.df = pd.read_csv(filename, delimiter = '\t')
        #Initialize the BERT tokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.maxlen = maxlen

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):
        #Selecting the sentence and label at the specified index in the data frame
        sentence = self.df.loc[index, 'sentence']
        label = self.df.loc[index, 'label']
        #Preprocessing the text to be suitable for BERT
        tokens = self.tokenizer.tokenize(sentence) #Tokenize the sentence
        tokens = ['[CLS]'] + tokens + ['[SEP]'] #Inserting the CLS and SEP tokens at the beginning and end of the sentence
        if len(tokens) < self.maxlen:
            tokens = tokens + ['[PAD]' for _ in range(self.maxlen - len(tokens))] #Padding sentences
        else:
            tokens = tokens[:self.maxlen-1] + ['[SEP]'] #Pruning the list to be of the specified max length
        tokens_ids = self.tokenizer.convert_tokens_to_ids(tokens) #Obtaining the indices of the tokens in the BERT vocabulary
        tokens_ids_tensor = torch.tensor(tokens_ids) #Converting the list to a pytorch tensor
        #Obtaining the attention mask, i.e. a tensor with 1s for non-padded tokens and 0s for padded ones
        attn_mask = (tokens_ids_tensor != 0).long()
        return tokens_ids_tensor, attn_mask, label
Now let's create the training and development data using the SSTDataset class and PyTorch's DataLoader.
In [ ]:
from torch.utils.data import DataLoader
#Creating instances of training and development set
#maxlen sets the maximum length a sentence can have
#any sentence longer than this length is truncated to the maxlen size
train_set = SSTDataset(filename = '10-train.tsv', maxlen = 30)
dev_set = SSTDataset(filename = '10-dev.tsv', maxlen = 30)
#Creating instances of training and development dataloaders
train_loader = DataLoader(train_set, batch_size = 64, num_workers = 5)
dev_loader = DataLoader(dev_set, batch_size = 64, num_workers = 5)
print("Done preprocessing training and development data.")
Recall how we use BERT for a downstream task:

We feed the input sequence to BERT, and take the contextualised embedding of [CLS] produced by BERT, and pass it to a classifier, which can be a simple feedforward network. As our task is a binary classification problem (positive or negative sentiment label), we only need the output layer to produce a single scalar value to denote the probability of the positive class.
Note: when we fine-tune our model we'll update all the parameters in the model (BERT's and the classification layer's).
In [ ]:
import torch
import torch.nn as nn
from transformers import BertModel
class SentimentClassifier(nn.Module):
    def __init__(self):
        super(SentimentClassifier, self).__init__()
        #Instantiating the BERT model object
        self.bert_layer = BertModel.from_pretrained('bert-base-uncased')
        #Classification layer
        #input dimension is 768 because the [CLS] embedding has a dimension of 768
        #output dimension is 1 because we're working with a binary classification problem
        self.cls_layer = nn.Linear(768, 1)

    def forward(self, seq, attn_masks):
        '''
        Inputs:
            -seq : Tensor of shape [B, T] containing token ids of sequences
            -attn_masks : Tensor of shape [B, T] containing attention masks to avoid any contribution from PAD tokens
        '''
        #Feeding the input to the BERT model to obtain contextualized representations
        outputs = self.bert_layer(seq, attention_mask = attn_masks)
        cont_reps = outputs.last_hidden_state
        #Obtaining the representation of the [CLS] token
        cls_rep = cont_reps[:, 0]
        #Feeding cls_rep to the classification layer
        logits = self.cls_layer(cls_rep)
        return logits
Now let's create the sentiment classifier, initialised with pre-trained BERT's parameters. We'll put the model on the TPU (the commented-out lines show the GPU alternative). This step might take a bit of time.
In [ ]:
# [FOR GPU]
#device = 0 #gpu ID
#print("Creating the sentiment classifier, initialised with pretrained BERT-BASE parameters...")
#net = SentimentClassifier()
#net.cuda(device) #Enable gpu support for the model
#print("Done creating the sentiment classifier.")
# [FOR TPU]
# Only instantiate model weights once in memory.
print("Creating the sentiment classifier, initialised with pretrained BERT-BASE parameters...")
WRAPPED_MODEL = xmp.MpModelWrapper(SentimentClassifier())
device = xm.xla_device()
model = WRAPPED_MODEL.to(device) #Enable TPU support for the model
print("Done creating the sentiment classifier.")
We need to define a loss function for our model. Since it's a binary classification task, we'll use binary cross-entropy. We'll also use Adam as our optimiser.
In [ ]:
import torch.nn as nn
import torch.optim as optim
lr = 2e-5
# [FOR TPU] need to scale learning rate to world size.
lr = lr * xm.xrt_world_size() # remove this if you want to run with GPU
criterion = nn.BCEWithLogitsLoss()
opti = optim.Adam(model.parameters(), lr = lr)
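To see what BCEWithLogitsLoss expects, here's a tiny illustrative example: it takes raw logits (the sigmoid is applied internally) and float labels of the same shape, which is why the training loop below calls logits.squeeze(-1) and labels.float().
In [ ]:
#Toy example: BCEWithLogitsLoss takes raw logits and float labels of the same shape
toy_logits = torch.tensor([1.5, -0.3]) #raw scores from the classifier
toy_labels = torch.tensor([1.0, 0.0]) #ground-truth labels as floats
print(criterion(toy_logits, toy_labels))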
Next let's define the training function. During training, we fetch a minibatch of examples from the data loader and feed it to the classifier to get the output logits and compute the loss. loss.backward() computes the gradients of the parameters with respect to the loss, and the optimiser step (opti.step() on GPU, or xm.optimizer_step(opti) on TPU) updates the parameters using those gradients.
In [ ]:
import time

def train(net, criterion, opti, train_loader, dev_loader, max_eps, device):
    best_acc = 0
    st = time.time()
    for ep in range(max_eps):
        net.train()
        # [FOR TPU] Using ParallelLoader
        train_loader2 = pl.ParallelLoader(train_loader, [device])
        train_loader2 = train_loader2.per_device_loader(device)
        dev_loader2 = pl.ParallelLoader(dev_loader, [device])
        dev_loader2 = dev_loader2.per_device_loader(device)
        for it, (seq, attn_masks, labels) in enumerate(train_loader2):
            #Clear gradients
            opti.zero_grad()
            #[FOR GPU] Converting these to cuda tensors
            #seq, attn_masks, labels = seq.cuda(device), attn_masks.cuda(device), labels.cuda(device)
            #Obtaining the logits from the model
            logits = net(seq, attn_masks)
            #Computing loss
            loss = criterion(logits.squeeze(-1), labels.float())
            #Backpropagating the gradients
            loss.backward()
            #[FOR GPU] Optimization step
            #opti.step()
            #[FOR TPU] Optimization step
            xm.optimizer_step(opti)
            if it % 100 == 0:
                #Please remove [xla:{}] and xm.get_ordinal() if you want to run with GPU
                acc = get_accuracy_from_logits(logits, labels)
                print("[xla:{}] Iteration {} of epoch {} complete. Loss: {}; Accuracy: {}; Time taken (s): {}".format(xm.get_ordinal(), it, ep, loss.item(), acc, (time.time()-st)))
                st = time.time()
        dev_acc, dev_loss = evaluate(net, criterion, dev_loader2, device)
        print("Epoch {} complete! Development Accuracy: {}; Development Loss: {}".format(ep, dev_acc, dev_loss))
        if dev_acc > best_acc:
            print("Best development accuracy improved from {} to {}, saving model...".format(best_acc, dev_acc))
            best_acc = dev_acc
            torch.save(net.state_dict(), 'sstcls_{}.dat'.format(ep))
Notice that we check the development performance after every training epoch, and only save the model when we see a performance improvement.
There are a couple more housekeeping functions we need to define for evaluation.
In [ ]:
def get_accuracy_from_logits(logits, labels):
    probs = torch.sigmoid(logits.unsqueeze(-1))
    soft_probs = (probs > 0.5).long()
    acc = (soft_probs.squeeze() == labels).float().mean()
    return acc

def evaluate(net, criterion, dataloader, device):
    net.eval()
    mean_acc, mean_loss = 0, 0
    count = 0
    #[FOR TPU] Using ParallelLoader
    dataloader = pl.ParallelLoader(dataloader, [device])
    dataloader = dataloader.per_device_loader(device)
    with torch.no_grad():
        for seq, attn_masks, labels in dataloader:
            #[FOR GPU] Converting these to cuda tensors
            #seq, attn_masks, labels = seq.cuda(device), attn_masks.cuda(device), labels.cuda(device)
            logits = net(seq, attn_masks)
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            mean_acc += get_accuracy_from_logits(logits, labels)
            count += 1
    return mean_acc / count, mean_loss / count
We’re almost ready to fine-tune the sentiment classifier! We’ll train with 1 epoch and see what performance we get.
There are about 1000 iterations in one epoch, and 100 iterations generally take about 15 seconds on TPU or 20-35 seconds on GPU, so the TPU is quite a bit faster than the GPU. Note, though, that you might find the TPU slower for the first 300 iterations or so.
In [ ]:
num_epoch = 1
#fine-tune the model
train(model, criterion, opti, train_loader, dev_loader, num_epoch, device)
Hopefully your model gets around 90% accuracy, which is pretty good performance. Feel free to run it again with more epochs and see if you can get better performance.
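If you want to reload the best checkpoint afterwards, here is a minimal sketch (assuming the best model came from epoch 0, so the file is sstcls_0.dat; note that on TPU runs you may need to save with xm.save inside train() so the weights are moved to CPU before serialisation):
In [ ]:
#Rebuild the classifier and load the saved weights
#Adjust the filename to whichever epoch gave the best development accuracy
best_model = SentimentClassifier()
best_model.load_state_dict(torch.load('sstcls_0.dat'))
best_model = best_model.to(device)
best_model.eval()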
In [ ]: