
Fine-tuning with BERT
In this workshop, we’ll learn how to use a pre-trained BERT model for a sentiment analysis task. We’ll be using the pytorch framework, and huggingface’s transformers library, which provides a suite of transformer models with a consistent interface.
Note: You may find certain parts of the code difficult to follow. This is because the model is built on the pytorch framework, so there’ll be pytorch syntax littered throughout the code. If you want to understand the code better, you’re encouraged to do the pytorch tutorial, and also to go through the code of a pytorch language model.
Now let’s enable the GPU for the Colab notebook. We can do this by going to “Runtime > Change runtime type”, selecting “GPU” as the hardware accelerator, and clicking Save.

First, let’s install the pytorch and transformers packages.
In [ ]:
!pip install torch torchvision transformers

The installation will take a couple of minutes.
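Before moving on, it can be useful to confirm that the runtime can actually see a GPU. This quick check isn’t part of the workshop code itself; if it prints False, revisit the runtime settings above.
In [ ]:
import torch

#Optional sanity check: confirm that a CUDA-capable GPU is visible to pytorch
print(torch.cuda.is_available())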
Once the packages are installed, we’ll load a pre-trained BERT model. We’ll use the smaller uncased BERT model (uncased means the data used for pre-training BERT is all lowercased).
In [ ]:
#load pretrained bert base model
from transformers import BertModel

bert_model = BertModel.from_pretrained('bert-base-uncased')

print("Done loading BERT model.")
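As a rough sanity check (not something the workshop requires), we can also count the model’s parameters; BERT-base has roughly 110 million.
In [ ]:
#Optional: count the number of parameters in BERT-base (roughly 110 million)
print(sum(p.numel() for p in bert_model.parameters()))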

BERT uses WordPiece tokenisation, which is a sub-word tokenisation algorithm like BPE. Let’s tokenise a sentence with WordPiece and see how it works.
In [ ]:
from transformers import BertTokenizer

#load BERT's WordPiece tokenisation model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence = 'I enjoyed this movie sooooo much.'
tokens = tokenizer.tokenize(sentence)
print(tokens)

We can see that most words are tokenised as single tokens, but the word “sooooo” is tokenised into three subwords: “soo”, “##oo” and “##o”. The double hash “##” indicates that a subword is a continuation within a word (i.e. it is not the start of a word).
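If you ever need to map the subwords back to the original surface word, you can join the pieces and strip the “##” markers. The snippet below is just a small illustration and isn’t needed for the rest of the workshop.
In [ ]:
#Illustration only: reconstruct the original word from its WordPiece subwords
pieces = ['soo', '##oo', '##o']
word = ''.join(p[2:] if p.startswith('##') else p for p in pieces)
print(word) #sooooo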

BERT prepends every sentence with a special [CLS] token. As we saw in the lecture, we need this token when we fine-tune BERT for downstream tasks (e.g. spam detection).
BERT also terminates a sentence with a special [SEP] token. While this token doesn’t do much when we’re working with problems that have single sentences as input (e.g. sentiment classification), it’s useful when we’re working with problems that involve sentence pairs (e.g. textual entailment and sentence similarity classification). In those cases, we need [SEP] to indicate when the first sentence finishes, and when the second sentence starts (e.g. for the sentence pair (this movie is fantastic, this film is amazing), the input to BERT will be: [CLS] this movie is fantastic [SEP] this film is amazing [SEP]).
Note: Recall that in addition to the masked language model objective, BERT is also pre-trained with the next-sentence prediction objective. [SEP] is needed for the next-sentence objective (since it involves a sentence pair), and [CLS] is also used in this objective to classify the input sentence pair. So [CLS] and [SEP] aren’t used only during fine-tuning; they are also used during pre-training.
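To make the sentence pair format concrete, here is a small sketch (not needed for our single-sentence sentiment task) of how such an input could be assembled from two tokenised sentences.
In [ ]:
#Illustration only: building a sentence pair input for BERT
sent_a = tokenizer.tokenize('this movie is fantastic')
sent_b = tokenizer.tokenize('this film is amazing')
pair_tokens = ['[CLS]'] + sent_a + ['[SEP]'] + sent_b + ['[SEP]']
print(pair_tokens)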
Now let’s preprocess the sentence by prepending [CLS] and appending [SEP].
In [ ]:
tokens = ['[CLS]'] + tokens + ['[SEP]']
print(tokens)

Since we’ll be using minibatches when fine-tuning BERT, we need to pad the input sequences so that they are all of the same length. We’ll use the [PAD] token for this.
Additionally, we need to create an attention mask vector. The attention mask is a binary vector that tells BERT which tokens should and should not be attended to. We need it here because we want BERT to ignore the [PAD] tokens (i.e. not consider them when doing self-attention).
Now let’s pad the input sequence to a fixed length (12 in this example) and create the binary attention mask vector.
In [ ]:
T = 12

padded_tokens = tokens + ['[PAD]' for _ in range(T - len(tokens))]
print(padded_tokens)

attn_mask = [1 if token != '[PAD]' else 0 for token in padded_tokens]
print(attn_mask)

The last preprocessing step is to create the segment IDs. The segment IDs are again a binary vector, denoting the sentence ID (first or second sentence) of each token in the input sequence. We use 0 to denote the first sentence, and 1 the second sentence. As we are working with a single sentence as input for sentiment analysis, the segment IDs are just a vector of 0s. But for a sentence pair classification problem like sentence similarity classification, an input like [CLS] this movie is fantastic [SEP] this film is amazing [SEP] would have segment IDs of [0 0 0 0 0 0 1 1 1 1 1].
In [ ]:
seg_ids = [0 for _ in range(len(padded_tokens))] #Since we only have a single sequence as input
print(seg_ids)
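For completeness, here is what the segment IDs would look like for the sentence pair example mentioned above; again this is just an illustration, since our sentiment task only has a single sentence as input.
In [ ]:
#Illustration only: segment IDs for a sentence pair input
pair_tokens = ['[CLS]'] + tokenizer.tokenize('this movie is fantastic') + ['[SEP]'] \
              + tokenizer.tokenize('this film is amazing') + ['[SEP]']
first_sep = pair_tokens.index('[SEP]')
pair_seg_ids = [0] * (first_sep + 1) + [1] * (len(pair_tokens) - first_sep - 1)
print(pair_seg_ids)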

Now let’s convert the tokens of the preprocessed sentence into their vocab IDs.
In [ ]:
token_ids = tokenizer.convert_tokens_to_ids(padded_tokens)
print(token_ids)

We can see [CLS] has a vocab ID of 101, [SEP] 102 and [PAD] 0.
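We can also map the IDs back to tokens to confirm the conversion is lossless; this round-trip check is purely illustrative.
In [ ]:
#Optional round-trip check: convert the vocab IDs back to tokens
print(tokenizer.convert_ids_to_tokens(token_ids))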
Next let’s turn the IDs into torch tensors, and feed the input sequence to BERT.
In [ ]:
import torch

#Converting all the input vectors to torch tensors
token_ids_t = torch.tensor(token_ids).unsqueeze(0) #Shape : [1, 12]
attn_mask_t = torch.tensor(attn_mask).unsqueeze(0) #Shape : [1, 12]
seg_ids_t = torch.tensor(seg_ids).unsqueeze(0) #Shape : [1, 12]

#Feed them to bert and get the contextualised embeddings
#return_dict = False makes the model return a tuple (hidden states, pooled output),
#matching the tuple unpacking below (needed for newer versions of the transformers library)
hidden_reps, _ = bert_model(token_ids_t, attention_mask = attn_mask_t,
                            token_type_ids = seg_ids_t, return_dict = False)
print(hidden_reps.shape)
print(hidden_reps[0, 0, :10])

hidden_reps contains the contextualised embeddings for all tokens in the sentence.
It has a shape of [1, 12, 768] because we have one sentence in our minibatch, the sentence has a length of 12 tokens, and each token has a contextualised embedding of 768 dimensions.
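Since fine-tuning will use the [CLS] embedding as the sentence representation, here is a quick sketch of how to slice it out of hidden_reps (just to illustrate the indexing; the classifier we build later does this internally).
In [ ]:
#The [CLS] token is the first token, so its contextualised embedding is:
cls_embedding = hidden_reps[:, 0, :] #Shape : [1, 768]
print(cls_embedding.shape)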

Now that we understand how to preprocess sentences and obtain contextualised embeddings from BERT, let’s move on to the actual classification problem: sentiment analysis.
We’ll be using the Stanford Sentiment Treebank data as our dataset. Let’s upload the data (10-train.tsv and 10-dev.tsv) to Colab. We can do this by clicking the folder icon on the left, and selecting “Upload” to upload the two files.
Once the files are uploaded, you should see them appear in the file browser.

We’ll now create SSTDataset, a dataset class that loads the data and provides a __getitem__ function, which fetches a sentence, preprocesses it (following the steps we did earlier) and returns the tokenised sentence, attention mask and ground truth label.
Note: we do not need to create the segment IDs here since we’re working with a task (sentiment analysis) that only has single sentences as input.
In [ ]:
import torch
from torch.utils.data import Dataset
from transformers import BertTokenizer
import pandas as pd

class SSTDataset(Dataset):

    def __init__(self, filename, maxlen):

        #Store the contents of the file in a pandas dataframe
        self.df = pd.read_csv(filename, delimiter = '\t')

        #Initialize the BERT tokenizer
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

        self.maxlen = maxlen

    def __len__(self):
        return len(self.df)

    def __getitem__(self, index):

        #Selecting the sentence and label at the specified index in the data frame
        sentence = self.df.loc[index, 'sentence']
        label = self.df.loc[index, 'label']

        #Preprocessing the text to be suitable for BERT
        tokens = self.tokenizer.tokenize(sentence) #Tokenize the sentence
        tokens = ['[CLS]'] + tokens + ['[SEP]'] #Inserting the CLS and SEP tokens at the beginning and end of the sentence
        if len(tokens) < self.maxlen:
            tokens = tokens + ['[PAD]' for _ in range(self.maxlen - len(tokens))] #Padding sentences
        else:
            tokens = tokens[:self.maxlen-1] + ['[SEP]'] #Pruning the list to be of specified max length

        tokens_ids = self.tokenizer.convert_tokens_to_ids(tokens) #Obtaining the indices of the tokens in the BERT vocabulary
        tokens_ids_tensor = torch.tensor(tokens_ids) #Converting the list to a pytorch tensor

        #Obtaining the attention mask i.e. a tensor containing 1s for non-padded tokens and 0s for padded ones
        attn_mask = (tokens_ids_tensor != 0).long()

        return tokens_ids_tensor, attn_mask, label

Now let's create the training and development data using the SSTDataset class and pytorch's DataLoader.
In [ ]:
from torch.utils.data import DataLoader

#Creating instances of training and development set
#maxlen sets the maximum length a sentence can have
#any sentence longer than this length is truncated to the maxlen size
train_set = SSTDataset(filename = '10-train.tsv', maxlen = 30)
dev_set = SSTDataset(filename = '10-dev.tsv', maxlen = 30)

#Creating instances of training and development dataloaders
train_loader = DataLoader(train_set, batch_size = 64, num_workers = 5)
dev_loader = DataLoader(dev_set, batch_size = 64, num_workers = 5)

print("Done preprocessing training and development data.")

Recall how we use BERT for a downstream task: we feed the input sequence to BERT, take the contextualised embedding of [CLS] produced by BERT, and pass it to a classifier, which can be a simple feedforward network. As our task is a binary classification problem (positive or negative sentiment label), the output layer only needs to produce a single scalar value denoting the probability of the positive class.
Note: when we fine-tune our model we'll update all the parameters in the model (BERT's and the classification layer's).
In [ ]:
import torch
import torch.nn as nn
from transformers import BertModel

class SentimentClassifier(nn.Module):

    def __init__(self):
        super(SentimentClassifier, self).__init__()

        #Instantiating BERT model object
        self.bert_layer = BertModel.from_pretrained('bert-base-uncased')

        #Classification layer
        #input dimension is 768 because the [CLS] embedding has a dimension of 768
        #output dimension is 1 because we're working with a binary classification problem
        self.cls_layer = nn.Linear(768, 1)

    def forward(self, seq, attn_masks):
        '''
        Inputs:
            -seq : Tensor of shape [B, T] containing token ids of sequences
            -attn_masks : Tensor of shape [B, T] containing attention masks to be used to avoid contribution of PAD tokens
        '''

        #Feeding the input to the BERT model to obtain contextualized representations
        #return_dict = False makes the model return a tuple, matching the unpacking below (needed for newer versions of transformers)
        cont_reps, _ = self.bert_layer(seq, attention_mask = attn_masks, return_dict = False)

        #Obtaining the representation of the [CLS] token (the first token)
        cls_rep = cont_reps[:, 0]

        #Feeding cls_rep to the classifier layer
        logits = self.cls_layer(cls_rep)

        return logits

Now let's create the sentiment classifier with pre-trained BERT's parameters. We'll put the model on the GPU. This step might take a bit of time.
In [ ]:
gpu = 0 #gpu ID
print("Creating the sentiment classifier, initialised with pretrained BERT-BASE parameters...")
net = SentimentClassifier()
net.cuda(gpu) #Enable gpu support for the model
print("Done creating the sentiment classifier.")

We need to define a loss function for our model. Since it's a binary task, we'll use binary cross-entropy. We'll also use Adam as our optimiser.
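Note that SentimentClassifier returns raw logits (it does not apply a sigmoid), so we use BCEWithLogitsLoss, which applies the sigmoid internally and is more numerically stable. The criterion and optimiser are defined in the next cell; the small sketch below is an optional aside checking that BCEWithLogitsLoss on a logit matches a manual sigmoid followed by BCELoss.
In [ ]:
import torch
import torch.nn as nn

#Optional aside: BCEWithLogitsLoss(logit, y) gives the same value as BCELoss(sigmoid(logit), y)
logit = torch.tensor([0.3])
y = torch.tensor([1.0])
print(nn.BCEWithLogitsLoss()(logit, y).item())
print(nn.BCELoss()(torch.sigmoid(logit), y).item())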
In [ ]:
import torch.nn as nn
import torch.optim as optim

criterion = nn.BCEWithLogitsLoss()
opti = optim.Adam(net.parameters(), lr = 2e-5)

Next let's define the training function. During training, we fetch a minibatch of examples from the data loader and feed it to the classifier to get the output logits and compute the loss. loss.backward() computes the gradients of the parameters with respect to the loss, and opti.step() updates the parameters using those gradients.
In [ ]:
import time

def train(net, criterion, opti, train_loader, dev_loader, max_eps, gpu):

    best_acc = 0
    st = time.time()

    for ep in range(max_eps):

        net.train() #Make sure the model is in training mode (evaluate() switches it to eval mode)

        for it, (seq, attn_masks, labels) in enumerate(train_loader):

            #Clear gradients
            opti.zero_grad()

            #Converting these to cuda tensors
            seq, attn_masks, labels = seq.cuda(gpu), attn_masks.cuda(gpu), labels.cuda(gpu)

            #Obtaining the logits from the model
            logits = net(seq, attn_masks)

            #Computing loss
            loss = criterion(logits.squeeze(-1), labels.float())

            #Backpropagating the gradients
            loss.backward()

            #Optimization step
            opti.step()

            if it % 100 == 0:
                acc = get_accuracy_from_logits(logits, labels)
                print("Iteration {} of epoch {} complete. Loss: {}; Accuracy: {}; Time taken (s): {}".format(it, ep, loss.item(), acc, (time.time()-st)))
                st = time.time()

        dev_acc, dev_loss = evaluate(net, criterion, dev_loader, gpu)
        print("Epoch {} complete! Development Accuracy: {}; Development Loss: {}".format(ep, dev_acc, dev_loss))
        if dev_acc > best_acc:
            print("Best development accuracy improved from {} to {}, saving model...".format(best_acc, dev_acc))
            best_acc = dev_acc
            torch.save(net.state_dict(), 'sstcls_{}.dat'.format(ep))

Notice that we check the development performance after every training epoch, and only save the model when we see a performance improvement.
There are a couple more housekeeping functions we need to define for evaluation.
In [ ]:
def get_accuracy_from_logits(logits, labels):
    probs = torch.sigmoid(logits.unsqueeze(-1))
    soft_probs = (probs > 0.5).long()
    acc = (soft_probs.squeeze() == labels).float().mean()
    return acc

def evaluate(net, criterion, dataloader, gpu):
    net.eval()

    mean_acc, mean_loss = 0, 0
    count = 0

    with torch.no_grad():
        for seq, attn_masks, labels in dataloader:
            seq, attn_masks, labels = seq.cuda(gpu), attn_masks.cuda(gpu), labels.cuda(gpu)
            logits = net(seq, attn_masks)
            mean_loss += criterion(logits.squeeze(-1), labels.float()).item()
            mean_acc += get_accuracy_from_logits(logits, labels)
            count += 1

    return mean_acc / count, mean_loss / count

We’re almost ready to fine-tune the sentiment classifier! We’ll train for 1 epoch and see what performance we get.
There are about 1000 iterations in one epoch, and 100 iterations generally take about 20 seconds, so one epoch of training should take no more than 200 seconds on a single GPU.
In [ ]:
num_epoch = 1

#fine-tune the model
train(net, criterion, opti, train_loader, dev_loader, num_epoch, gpu)

Hopefully your model gets around 90% accuracy, which is pretty good performance. Feel free to run it again with more epochs and see if you can get better performance.
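Once fine-tuning has finished and a checkpoint has been saved (e.g. sstcls_0.dat after epoch 0), you can reload it and classify new sentences. The cell below is a rough sketch of how this might look; the sentence and the checkpoint filename are just examples, and the preprocessing mirrors what SSTDataset does.
In [ ]:
#Rough sketch: reload a saved checkpoint and classify a new sentence
net.load_state_dict(torch.load('sstcls_0.dat')) #example filename; use whichever epoch was saved
net.eval()

sentence = 'a touching and beautifully acted film'
tokens = ['[CLS]'] + tokenizer.tokenize(sentence) + ['[SEP]']
token_ids = torch.tensor(tokenizer.convert_tokens_to_ids(tokens)).unsqueeze(0).cuda(gpu)
attn_mask = (token_ids != 0).long()

with torch.no_grad():
    prob = torch.sigmoid(net(token_ids, attn_mask)).item()
print("Probability of positive sentiment: {:.3f}".format(prob))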