Report
In order to understand the model, I first trained the default model and tested it with my own code. The test code is similar to the evaluation code in train.py. The original model's F1-score on the test dataset is about 0.7915.
For Task 1:
Take TP as the number of positive labels ('TAR' or 'HYP') that are predicted as positive, FP as the number of negative ('O') labels that are predicted as positive, and FN as the number of positive labels that are predicted as negative. According to the definition of the F1-score:
F1-score = 2 * precision * recall / (precision + recall)
So we can (a code sketch of this procedure follows the list):
1) Traverse the golden list to find all positive labels; suppose the number of positive labels is m.
2) Traverse the predicted list to find all positive labels; suppose the number of predicted positive labels is n.
3) Compare the golden list with the predicted list to find the true positives; suppose there are q true positives in total.
4) Calculate recall = q / m and precision = q / n, then calculate the F1-score according to the formula above.
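A minimal sketch of this computation in Python, assuming the gold and predicted tags are two aligned flat lists of per-token labels and that a true positive requires the predicted label to match the gold label exactly (the actual test code may differ in its details):

def f1_score(golden, predicted, positive_labels=("TAR", "HYP")):
    # m: positive labels in the gold list; n: positive labels in the predictions;
    # q: positions where a positive gold label is predicted with the same label.
    m = sum(1 for g in golden if g in positive_labels)
    n = sum(1 for p in predicted if p in positive_labels)
    q = sum(1 for g, p in zip(golden, predicted)
            if g in positive_labels and p == g)
    precision = q / n if n else 0.0
    recall = q / m if m else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: f1_score(["O", "TAR", "HYP", "O"], ["TAR", "TAR", "O", "O"])
# gives m = 2, n = 2, q = 1, so precision = recall = 0.5 and F1 = 0.5.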
If we train the model for 50 epochs and save the final model, the final loss is about 0.0057 and the F1-score on the test dataset is about 0.7914991384261918.
If we train the model for 50 epochs and save the model that performs best on the development dataset, the smallest loss during training is 0.0037 and the F1-score on the test dataset is about 0.8082633957391867.
We can conclude that the best model may appear during the training process; the final model is not necessarily the best. The development dataset plays a role similar to the test dataset: it is used only to select the model during training, not to update its parameters, so a model that performs well on the development set is likely to perform well on the test set as well.
For Task 2:
I read the code at https://github.com/pytorch/pytorch/blob/v0.4.1/torch/nn/_functions/rnn.py#L26 carefully and compared the standard LSTM cell with the modified one; in fact we only need to modify one line of the source code to change the definition of the cell state. See todo.py for details.
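For reference, here is a minimal sketch of the standard LSTM cell update (paraphrased from the linked file, with simplified names). The line computing the new cell state cy is the one that Task 2 changes; the exact replacement is in todo.py:

import torch
import torch.nn.functional as F

def lstm_cell(x, hx, cx, w_ih, w_hh, b_ih=None, b_hh=None):
    # Gates are computed from the current input and the previous hidden state.
    gates = F.linear(x, w_ih, b_ih) + F.linear(hx, w_hh, b_hh)
    ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1)
    ingate = torch.sigmoid(ingate)
    forgetgate = torch.sigmoid(forgetgate)
    cellgate = torch.tanh(cellgate)
    outgate = torch.sigmoid(outgate)
    # Standard definition of the cell state; this is the single line to modify.
    cy = (forgetgate * cx) + (ingate * cellgate)
    hy = outgate * torch.tanh(cy)
    return hy, cy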
As above, if we train the original model for 50 epochs and save the final model, the final loss is about 0.0057 and the F1-score on the test dataset is about 0.7914991384261918.
If we train the model with the new LSTM cell for 50 epochs and save the final model, the final loss is about 0.0045 and the F1-score on the test dataset is about 0.7869481765834933.
According to these results, the original model with the standard LSTM cell performs better.
For Task 3:
According to the picture below and the code in model.py, we need to (see the sketch after this list):
(1) Calculate the char embedding for each word in each sentence.
(2) Extract char-level features through a Bi-LSTM whose structure is the same as the Bi-LSTM used in the original model.
(3) Concatenate the output of the last cell with the word embedding.
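A minimal sketch of this char-level feature extractor as a PyTorch module, with hypothetical names and embedding/hidden sizes (the actual sizes and wiring in model.py may differ):

import torch
import torch.nn as nn

class CharBiLSTM(nn.Module):
    # Char-level Bi-LSTM; its final forward/backward states form the char feature.
    def __init__(self, num_chars, char_emb_dim=25, char_hidden_dim=25):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_emb_dim)
        self.char_lstm = nn.LSTM(char_emb_dim, char_hidden_dim,
                                 bidirectional=True, batch_first=True)

    def forward(self, char_ids):
        # char_ids: (num_words, max_word_len) character indices for one sentence.
        emb = self.char_emb(char_ids)               # (num_words, max_word_len, char_emb_dim)
        _, (h_n, _) = self.char_lstm(emb)           # h_n: (2, num_words, char_hidden_dim)
        # Concatenate the final forward and backward hidden states per word.
        return torch.cat([h_n[0], h_n[1]], dim=-1)  # (num_words, 2 * char_hidden_dim)

# In the tagger, this char feature is concatenated with the word embedding, e.g.
# word_repr = torch.cat([word_emb, char_feats], dim=-1), before the word-level Bi-LSTM.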
As above, if we train the original model for 50 epochs and save the final model, the final loss is about 0.0057 and the F1-score on the test dataset is about 0.7914991384261918.
If we train the model with char embeddings for 50 epochs and save the final model, the final loss is about 0.0030 and the F1-score on the test dataset is about 0.7893020221787345.
According to these results, the loss with char embeddings is lower than the loss without them; however, the char embeddings do not improve the model's F1-score.
Since the model with char embeddings is more complex, it should not perform worse than the original model. I think it may need more training epochs, though the F1-score is still only about 0.7902 with 100 epochs.