MH6812: Deep Learning for Natural Language Processing
Assignment 2 Submission Due: 24 Jan, 2019 before 2:30 pm
Submission: via email; send it to srjoty@ntu.edu.sg with the subject "MH6812 Assignment 2".
1 Question One (50 marks)
Named Entity Recognition (NER) is an important information extraction task that requires identifying and classifying named entities in a given text. The entity types are usually predefined, for example location, organization, person, and time. In this exercise, you will learn how to develop a neural NER model in PyTorch. In particular, you will learn:
(a) How to prepare data (input and output) for developing an NER model.
(b) How to design and fit/train a neural NER model.
(c) How to use the trained NER model to predict named entity types for a new text.
You can use this code as your codebase and build on top of it. This code implements the CNN-LSTM-CRF NER model described in [1]. The model has the architecture shown in Figure 1. The dataset for this exercise is the standard CoNLL NER dataset, which is also available in the data directory of the codebase. Please do the following:
(i) Download the code repository. The dataset should have three files: eng.train, eng.testb, and eng.testa, to be used for training, testing, and validation, respectively. The dataset contains four types of named entities: PERSON, LOCATION, ORGANIZATION, and MISC; it uses the BIO tagging scheme introduced in the lecture. Familiarize yourself with the data and the BIO tagging scheme used in it.
(ii) Run the code and see the results. The code performs basic preprocessing steps to generate the tag mapping, word mapping, and character mapping that you should use. You should understand the preprocessing and data loading functions.
(iii) Go through the model part (get_lstm_features(..), class BiLSTM_CRF(nn.Module)). Notice that it implements a character-level encoder both with a convolutional neural network (CNN) and with an LSTM recurrent neural network. The default setting uses a CNN for character-level encoding. Please read these implementations carefully.
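To get familiar with the BIO scheme mentioned in the first step, it can help to decode a tagged sentence back into entity spans. The following is a minimal sketch and is not part of the codebase: the function name `bio_to_spans` and the tag strings such as `B-PER` are illustrative assumptions, and the tag names in the data files may differ.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, int, int, str]]:
    """Group per-token BIO tags into (type, start, end_exclusive, text) spans."""
    spans = []
    start, ent_type = None, None
    # A sentinel "O" at the end flushes a span that runs to the last token.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, label = tag.partition("-")
        # An open span continues only on an I- tag of the same entity type.
        continues = prefix == "I" and label == ent_type
        if ent_type is not None and not continues:
            spans.append((ent_type, start, i, " ".join(tokens[start:i])))
            start, ent_type = None, None
        # B- always opens a span; a stray I- with no open span is treated as an opening.
        if prefix in ("B", "I") and ent_type is None:
            start, ent_type = i, label
    return spans


# Illustrative sentence (not taken from the dataset).
tokens = ["Barack", "Obama", "visited", "Paris", "."]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(bio_to_spans(tokens, tags))
# [('PER', 0, 2, 'Barack Obama'), ('LOC', 3, 4, 'Paris')]
```

Treating a stray I- tag as the start of a span is a design choice here, so that data in which B- is used only between adjacent entities of the same type still decodes sensibly.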
[Figure: stacked layers, from bottom to top: Word Embedding, Char Representation, Forward LSTM, Backward LSTM, CRF Layer. Example input "We are playing soccer" with output tags PRP VBP VBG NN.]
Figure 1: The CNN-LSTM-CRF model for NER by [1].
(iv) The code implements a word-level encoder with an LSTM network. In its default setting, the output layer of the network has a CRF layer. You can change it to a regular Softmax layer. For now, leave it as it is (i.e., use CRF). Your job is to replace the LSTM-based word-level encoder with a CNN layer (a convolutional layer followed by an optional max pooling layer). The CNN layer should have the same output dimensions (out_channels) as the LSTM.
(v) Report the test-set results when you use only one such CNN layer in your network. Report results when you use an LSTM-based character-level encoder. In each case, report the number of parameters in your model.
(vi) Now increase the number of CNN layers in your network and see the impact on your results. In each case, report the results on the test set (use the validation set to select the best model).
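One possible shape for a CNN word-level encoder of the kind asked for in the steps above is sketched here. It is a sketch under assumptions, not the codebase's implementation: the class name `CNNWordEncoder`, the helper `count_parameters`, the kernel size of 3, and the demo sizes (130 input features, 400 output channels) are all hypothetical. Set `out_channels` to whatever dimension the existing LSTM encoder passes on to the next layer. The optional max pooling layer is left out; if you add one, it would need stride 1 and matching padding so that there is still one vector per token.

```python
import torch
import torch.nn as nn


class CNNWordEncoder(nn.Module):
    """Hypothetical stand-in for an LSTM word-level encoder.

    Maps (batch, seq_len, input_dim) to (batch, seq_len, out_channels).
    """

    def __init__(self, input_dim: int, out_channels: int,
                 kernel_size: int = 3, num_layers: int = 1, dropout: float = 0.5):
        super().__init__()
        if kernel_size % 2 == 0:
            raise ValueError("kernel_size must be odd to preserve sequence length")
        layers = []
        in_channels = input_dim
        for _ in range(num_layers):
            # padding = kernel_size // 2 keeps one output vector per token,
            # which a per-token output layer such as a CRF requires.
            layers.append(nn.Conv1d(in_channels, out_channels, kernel_size,
                                    padding=kernel_size // 2))
            layers.append(nn.ReLU())
            in_channels = out_channels
        self.convs = nn.Sequential(*layers)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Conv1d expects (batch, channels, length), so swap the last two axes.
        h = self.convs(x.transpose(1, 2))
        return self.dropout(h.transpose(1, 2))


def count_parameters(module: nn.Module) -> int:
    """Number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Demo with assumed sizes: 2 sentences of 7 tokens, 130 features per token.
encoder = CNNWordEncoder(input_dim=130, out_channels=400, num_layers=1)
features = encoder(torch.randn(2, 7, 130))
print(features.shape, count_parameters(encoder))
```

Calling `count_parameters` on the whole model, rather than on the encoder alone, gives the figure to report for each configuration, and raising `num_layers` gives the deeper variants.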
References
[1] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064-1074, Berlin, Germany, August 2016. Association for Computational Linguistics.