
AU22 CSE 5525 Homework #2: HMM and CRF for Named Entity Recognition
Deadline: 11:59PM on 10/05/2022.
Academic Integrity
You can discuss the homework assignments with other students and study course materials together towards solutions. However, all of the code and the report you write must be your own! If you do discuss or study together with others (except via posts/comments on Teams), please list their names at the top of your written submission. Unless academic misconduct is detected, this will not be counted against your credit. (Plagiarism will be detected automatically on Carmen, e.g., by comparing your HW with all available resources, as well as checked manually by the TA.)


If the instructor or TA suspects that a student has committed academic misconduct in this course, they are obligated by university rules to report their suspicions to the Committee on Academic Misconduct. Please see the Statements for CSE 5525 on the Carmen pages for more details.
In this project, you’ll first implement the Viterbi algorithm on a fixed model (an HMM) and then generalize it to forward-backward and implement learning and decoding for a feature-based CRF. This assignment serves two purposes: (1) You will be exposed to inference and learning for a simple structured model where exact inference is possible. (2) You will learn some of the engineering factors that need to be considered when implementing a model like this.
The expectation is that you understand CRFs and HMMs from scratch and can implement them based on the code provided to you. Sequence models are fundamental in NLP; implementing a basic version from scratch will help you absorb the concepts and pave the way for understanding more advanced topics. You should not call existing HMM/CRF libraries or use automatic differentiation in your solution to this assignment.
Timeline & Credit
You will have around 3 weeks to work on this programming assignment. However, there are many events/deadlines in October, such as the Final Project Mid-term Report and the Midterm. Late submissions will NOT be accepted, so start early and plan ahead! This homework consists of two mandatory coding parts (part 1: 30 points; part 2: 50 points), one optional coding part (part 3: 20 bonus points), and a 1-2 page report (20 points), which together count for 10% of your final grade.
Questions?
Please create a post on MS teams to get timely help from other students, the TA, and the instructor. Remember that participation takes 5% of your final grade. You can show your participation by actively answering others’ questions or participating in the discussion!

1 Background and Dataset
Named entity recognition is the task of identifying references to named entities of certain types in text. We use data presented in the CoNLL 2003 Shared Task (Tjong Kim Sang and De Meulder, 2003) [1]. An example of a data instance (e.g., “Singapore Refining Company expected to shut CDU 3.”) is given below:
Singapore NNP I-NP B-ORG
Refining NNP I-NP I-ORG
Company NNP I-NP I-ORG
expected VBD I-VP O
to TO I-VP O
shut VB I-VP O
CDU NNP I-NP B-ORG
3 CD I-NP I-ORG
. . O O
There are four columns here: the word, the POS tag, the chunk bit (a form of shallow parsing—you can ignore this), and the column containing the NER tag. NER labels are given in a BIO tag scheme: beginning, inside, outside. In the example above, two named entities are present: Singapore Refining Company and CDU 3. O tags denote text not part of a named entity. B tags indicate the start of a named entity, and I tags indicate the continuation of the previous named entity. Both B and I tags are hyphenated and contain a type after the hyphen, which in this dataset is one of PER, ORG, LOC, or MISC. One B tag can immediately follow another B tag in the case where a one-word entity is followed immediately by another entity. However, note that an I tag can only follow an I tag or B tag of the same type.
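As a concrete illustration of the tag-to-chunk correspondence, here is a minimal sketch of converting a BIO tag sequence into (label, start, end) chunks. Note that nerdata.py already provides this conversion; the function below is only illustrative.

```python
def bio_to_chunks(tags):
    """Convert a list of BIO tag strings into (label, start, end) chunks with
    semi-inclusive spans. Illustrative only; nerdata.py provides the real version."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # A new B- tag, an O tag, or an I- tag of a different type closes any open chunk
        if tag == "O" or tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            if label is not None:
                chunks.append((label, start, i))
            start, label = (i, tag[2:]) if tag != "O" else (None, None)
        # An I- tag matching the open chunk's type simply extends it
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks

# The example sentence above yields two ORG chunks: (ORG, 0, 3) and (ORG, 6, 8)
print(bio_to_chunks(["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-ORG", "I-ORG", "O"]))
```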
A NER system’s job is to predict the NER chunks of an unseen sentence, i.e., predict the last column given the others. Output is typically evaluated according to chunk-level F-measure.1 To evaluate a single sentence, let C denote the predicted set of labeled chunks where each chunk is represented by a tuple of (label, start index, end index) and let C∗ denote the gold set of chunks. We compute precision, recall, and F1 as follows:
Precision = |C ∩ C∗| / |C|;  Recall = |C ∩ C∗| / |C∗|;  F1 = (2 · Precision · Recall) / (Precision + Recall)
The gold labeled chunks from the example above are C∗ = {(ORG, 0, 3), (ORG, 6, 8)}, using 0-based indexing and semi-inclusive notation for intervals. As defined in [1], precision is the percentage of named entities found by the learning system that are correct, and recall is the percentage of named entities present in the corpus that are found by the system. A named entity is correct only if it is an exact match of the corresponding entity in the data file.
To generalize to corpus-level evaluation, the numerators and denominators of precision and recall are aggregated across the corpus. State-of-the-art systems can get above 90 F1 on this dataset; you should be aiming to get close to this and build systems that can get in at least the mid-80s.
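For example, assuming chunks are stored as sets of (label, start, end) tuples, corpus-level scores could be computed with a sketch like the following (the evaluation code provided in nerdata.py is what is actually used for grading):

```python
def chunk_f1(pred_chunks, gold_chunks):
    """Chunk-level precision/recall/F1 for a corpus, where each argument is a
    list of per-sentence sets of (label, start, end) chunks. Illustrative sketch;
    the provided evaluation code in nerdata.py is authoritative."""
    num_correct = sum(len(p & g) for p, g in zip(pred_chunks, gold_chunks))
    num_pred = sum(len(p) for p in pred_chunks)
    num_gold = sum(len(g) for g in gold_chunks)
    precision = num_correct / num_pred if num_pred > 0 else 0.0
    recall = num_correct / num_gold if num_gold > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1

# Single-sentence example from above: a perfect prediction gives P = R = F1 = 1.0
gold = [{("ORG", 0, 3), ("ORG", 6, 8)}]
print(chunk_f1(gold, gold))
```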
1.1 Data
eng.train is the training set, eng.testa is the standard NER development set, and eng.testb.blind is a blind test set you will submit results on. The deu* files are German data, which you're NOT required to do anything with in this assignment (but feel free to consider using them for your final project).
1.2 Getting started
Download the data and code. You will need Python 3 and numpy.2 Try running:
python ner.py
1Tag-level accuracy isn’t used because of the prevalence of the O class—always predicting O would give extremely high accuracy!
2If you don’t have numpy, see https://www.scipy.org/install.html. You may try installing with anaconda.

This will run a very bad NER system, which simply outputs the most common tag for every token, or O if it hasn’t been seen before. The system will print a bunch of warnings on the dev set – this baseline doesn’t even produce consistent tag sequences.
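Conceptually, this baseline amounts to nothing more than the following sketch (the real implementation is BadNerModel in ner.py; the code below uses Python's collections.Counter rather than the Counter in utils.py and assumes sentences come as (words, tags) pairs):

```python
from collections import Counter, defaultdict

def train_most_common_tag(labeled_sentences):
    """Sketch of the BadNerModel idea: remember the most frequent tag for each
    word seen in training; unseen words fall back to O. Illustrative only."""
    counts = defaultdict(Counter)
    for words, tags in labeled_sentences:          # assumed (words, tags) pairs
        for word, tag in zip(words, tags):
            counts[word][tag] += 1
    most_common = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [most_common.get(w, "O") for w in words]

tagger = train_most_common_tag([(["shut", "CDU"], ["O", "B-ORG"])])
print(tagger(["shut", "CDU", "3"]))   # ['O', 'B-ORG', 'O']
```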
1.3 Framework Code
You have been given the following framework code to start with:
ner.py: Contains the implementation of BadNerModel and the main function, which reads the data, trains the appropriate model, and evaluates it on the test set.
nerdata.py: Utilities for reading NER data, evaluation code, and functions for converting from the BIO representation to the chunk representation and back. The main abstraction to pay attention to here is LabeledSentence, which contains a sequence of Token objects (wrappers around words, POS tags, and chunk information) and a set of Chunk objects representing a labeling. Gold examples are LabeledSentences, and your system will also return LabeledSentences as predictions.
utils.py: Indexer and Counter are the same as in HW#1, with Indexer additionally being useful for mapping between labels and indices in dynamic programming. A Beam data structure is also provided for Part 3.
optimizers.py: Three optimizer classes are implemented: SGD, unregularized Adagrad, and L1-regularized Adagrad. These wrap the weight vector and expose access and score methods to use it; the weights are updated by apply_gradient_update, which takes as input a Counter of feature values for the gradient as well as the size of the batch the gradient was computed on.
models.py: You should feel free to modify anything in this file as you need, but the scaffolding will likely serve you well. We describe this code in more detail in the following sections.
Next, try running:
python ner.py --model HMM
This will crash with an error message. You have to implement Viterbi decoding (Part 1 below) to make the HMM work.
2 What You Need to Do
2.1 Part 1: Viterbi Decoding (30 points)
Go to models.py. The train_hmm_model function estimates initial state, transition, and emission probabilities from the labeled data and returns a new HmmNerModel with these probabilities. Your task is to implement Viterbi decoding in this model so that it can return LabeledSentence predictions in decode.
We’ve provided an abstraction for you in ProbabilisticSequenceScorer. This abstraction is meant to help you build inference code that can work both for generative/probabilistic scoring and for feature-based scoring. score_init scores the initial HMM state, score_transition scores an HMM state transition, and score_emission scores the HMM emission. All of these are implemented as log probabilities. Note that this abstraction operates in terms of indexed tags, where the indices have been produced by tag_indexer. This allows you to score tags directly in the dynamic programming state space without converting back and forth to strings all the time.
You should implement the Viterbi algorithm with scores that come from log probabilities to find the highest log-probability path.
P(y_1, …, y_n | x_1, …, x_n) ∝ P(y_1, …, y_n, x_1, …, x_n) = P(y_1) [ ∏_{i=2}^{n} P(y_i | y_{i−1}) ] [ ∏_{i=1}^{n} P(x_i | y_i) ]

argmax_{y_1,…,y_n} P(y_1, …, y_n | x_1, …, x_n) = argmax_{y_1,…,y_n} [ log P(y_1) + ∑_{i=2}^{n} log P(y_i | y_{i−1}) + ∑_{i=1}^{n} log P(x_i | y_i) ]
If you are not sure about the interface to the code, take a look at BadNerModel decoding and use that as a template.
Grading expectation. Viterbi decoding for the HMM can get above 75 F1 on the development set. Take this as a reference point and debug your implementation if its performance is substantially lower. Note that very low performance here likely indicates a bug in your code, which can result in partial credit for this part and may also cause trouble when implementing Part 2.

Implementation tips. You have a lot of freedom to determine the implementation details, but here are some implementation tips:
1. Python data structures like lists and dictionaries can be pretty inefficient. Consider using numpy arrays in dynamic programs.
2. Once you run your dynamic program, you still need to extract the best answer. Typically this is done by either storing a backpointer for each cell to know how that cell’s value was derived or by using a backward pass over the chart to reconstruct the sequence (backtracing).
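Putting these tips together, here is a minimal log-space Viterbi sketch using numpy arrays and backpointers. The scorer callables mirror the description of ProbabilisticSequenceScorer above, but their exact argument lists here (tag indices and a word position) are assumptions for illustration rather than the actual signatures in models.py.

```python
import numpy as np

def viterbi_decode(sentence_len, num_tags, score_init, score_transition, score_emission):
    """Log-space Viterbi with backpointers. The three score_* callables are assumed
    to take tag indices (and a position for emissions) and return log-domain scores;
    adapt the calls to the actual scorer interface in models.py."""
    chart = np.full((sentence_len, num_tags), -np.inf)   # best log score ending in tag t at position i
    backptrs = np.zeros((sentence_len, num_tags), dtype=int)
    for t in range(num_tags):
        chart[0, t] = score_init(t) + score_emission(t, 0)
    for i in range(1, sentence_len):
        for t in range(num_tags):
            scores = np.array([chart[i - 1, prev] + score_transition(prev, t)
                               for prev in range(num_tags)])
            backptrs[i, t] = int(np.argmax(scores))
            chart[i, t] = scores[backptrs[i, t]] + score_emission(t, i)
    # Follow backpointers from the best final tag to recover the full sequence
    best = [int(np.argmax(chart[sentence_len - 1]))]
    for i in range(sentence_len - 1, 0, -1):
        best.append(int(backptrs[i, best[-1]]))
    return list(reversed(best))   # tag indices for positions 0..n-1
```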
2.2 Part 2: CRF Training (50 points)
In the CRF training phase, you will implement learning and inference for a CRF sequence tagger with a fixed feature set. We provide a simple CRF feature set with emission features only. While you’ll need to impose some kind of sequential constraints in the model, transition features are often slow to learn: you should be able to get good performance by constraining the model to only produce valid BIO sequences (prohibiting a transition to I-X from anything except I-X and B-X). Crucially, this model does not rely on transition features, but instead uses hardcoded transitions.
We provide a code skeleton in CrfNerModel and train_crf_model. The latter calls feature extraction in extract_emission_features and builds a cache of features for each example — you can feel free to use this cache or not. It then loops through each example and applies the gradient updates appropriately.
You should take the following steps to implement your CRF:
1. Generalize your Viterbi code to forward-backward.
2. Extend your forward-backward code to use a scorer based on features—perhaps you might write a FeatureBasedSequenceScorer with an interface similar to ProbabilisticSequenceScorer for this purpose.
3. Implement compute_gradient by computing the stochastic gradient of the feature vector for a sentence (a rough sketch covering steps 1–3 is given after this list).
4. Use the stochastic gradient in a learning loop to learn feature weights for the NER system.
5. Implement decode in CrfNerModel, which should be able to follow your Viterbi code using the new scorer.
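Here is the rough sketch referenced in step 3, covering steps 1–3 under a simplified setup: a per-position matrix of emission scores plus a fixed log-domain transition matrix (with -inf for forbidden BIO transitions), and an emission-only gradient. Names such as emission_scores, transition_scores, and feature_cache describe one possible way to organize your code, not objects in the provided skeleton.

```python
import numpy as np

def forward_backward(emission_scores, transition_scores):
    """emission_scores: (n, T) log-domain scores; transition_scores: (T, T)
    log-domain scores (use -np.inf for forbidden BIO transitions).
    Returns log marginals log P(y_i = t | x) of shape (n, T) and log Z."""
    n, T = emission_scores.shape
    alpha = np.full((n, T), -np.inf)
    beta = np.full((n, T), -np.inf)
    alpha[0] = emission_scores[0]
    for i in range(1, n):
        for t in range(T):
            alpha[i, t] = np.logaddexp.reduce(alpha[i - 1] + transition_scores[:, t]) + emission_scores[i, t]
    beta[n - 1] = 0.0
    for i in range(n - 2, -1, -1):
        for t in range(T):
            beta[i, t] = np.logaddexp.reduce(transition_scores[t] + emission_scores[i + 1] + beta[i + 1])
    log_z = np.logaddexp.reduce(alpha[n - 1])
    # Sanity check from the tips below: alpha + beta sums to log Z at every position
    log_marginals = alpha + beta - log_z
    return log_marginals, log_z

def emission_gradient(feature_cache, gold_tags, log_marginals):
    """Stochastic gradient of the log-likelihood w.r.t. emission feature weights for
    one sentence: gold feature counts minus expected counts under the model.
    feature_cache[i][t] is assumed to hold the feature indices firing for tag t at
    position i (as produced by extract_emission_features)."""
    gradient = {}  # sparse map from feature index to gradient value
    n, T = log_marginals.shape
    for i in range(n):
        for feat in feature_cache[i][gold_tags[i]]:
            gradient[feat] = gradient.get(feat, 0.0) + 1.0
        for t in range(T):
            prob = np.exp(log_marginals[i, t])
            for feat in feature_cache[i][t]:
                gradient[feat] = gradient.get(feat, 0.0) - prob
    return gradient
```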
Grading expectation. You should be able to get a score of at least 85 F1 on the development set. Assignments falling short of this will be judged on completeness and awarded partial or full credit accordingly: if you get between 75 and 85 F1, for example, your implementation may be mostly working and you will receive substantial credit; please also conduct error analysis of your implementation. [One reference implementation gets 88.2 F1 using the given emission features and unregularized Adagrad as the optimizer—see if you can beat that!]
Implementation tips. You have a lot of freedom to determine the implementation details, but here are some implementation tips:
• Make sure that your probabilities from forward-backward normalize correctly! You should be able to sum the forward x backward chart values at each sentence index and get the same value (the normalizing constant). You can check your forward-backward implementation in the HMM model if that’s useful.
• When implementing forward-backward, you’ll probably want to do so in log space rather than real space. (+, x) in real space translates to (log-sum-exp, +) in log space. Use numpy.logaddexp.
• Remember that the NER tag definition has hard constraints in it (only B-X or I-X can transition to I-X), so be sure to build these into your inference procedure; a sketch of one way to encode such a transition mask follows this list. You can also incorporate features on tag pairs, but this is substantially more challenging and not necessary to get good performance.
• If your code is too slow, try (a) making use of the feature cache to reduce computation and (b) exploiting sparsity in the gradients (Counter is a good class for maintaining sparse maps). Run your code with python -m cProfile ner.py to print a profile and help identify bottlenecks.
• Implement things in baby steps! First, make sure that your marginal probabilities look correct on a single example. Then make sure that your optimizer can fit a very small training set (10 examples); you might want to write a small amount of extra code to compute the log-likelihood and check that it goes up, along with train accuracy. Then work on scaling things up to more data and optimizing for development performance.
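As referenced in the hard-constraints bullet above, one simple way to encode valid BIO transitions is to precompute a boolean mask over tag indices, assuming the Indexer from utils.py exposes get_object(index) as in HW#1 (check the actual method names in your copy):

```python
import numpy as np

def build_transition_mask(tag_indexer):
    """Boolean matrix allowed[prev, curr] enforcing valid BIO sequences:
    I-X may only follow B-X or I-X with the same type X. Assumes tag_indexer
    maps contiguous indices to tag strings such as "B-ORG" via get_object."""
    num_tags = len(tag_indexer)
    allowed = np.ones((num_tags, num_tags), dtype=bool)
    for prev_idx in range(num_tags):
        prev_tag = tag_indexer.get_object(prev_idx)
        for curr_idx in range(num_tags):
            curr_tag = tag_indexer.get_object(curr_idx)
            if curr_tag.startswith("I-"):
                entity_type = curr_tag[2:]
                allowed[prev_idx, curr_idx] = prev_tag in ("B-" + entity_type, "I-" + entity_type)
    return allowed

# In log space, forbidden transitions simply become -inf scores:
# transition_scores = np.where(allowed, 0.0, -np.inf)
```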
Part 3 (optional): Beam Search (20 bonus points)
Finally, you have the option to implement beam search for your CRF model. Beam search is an approximate inference scheme that only maintains a certain number of hypotheses at each timestep during inference, unlike Viterbi which keeps track of all of them. At each state, the highest-scoring hypotheses are kept in a data structure called a beam.
You should implement this modification in the decode beam method in CrfNerModel. You can use whatever data structures you want to implement the beam, but the Beam class provided in utils.py provides a natural way to handle a beam of scored hypotheses, although it’s not as efficient as it could be.
Your inference (decoding the test set) using a beam size of 2 should not lose more than 5 F1 compared to your Viterbi implementation.
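A minimal sketch of the idea, using plain Python lists and the same hypothetical scorer interface as the Viterbi sketch in Part 1 rather than the provided Beam class (whose actual API you should check in utils.py):

```python
def beam_decode(sentence_len, num_tags, score_init, score_transition, score_emission, beam_size=2):
    """Approximate decoding: keep only the beam_size best (tag sequence, log score)
    hypotheses at each position instead of one entry per tag as in Viterbi."""
    beam = [([t], score_init(t) + score_emission(t, 0)) for t in range(num_tags)]
    beam = sorted(beam, key=lambda h: h[1], reverse=True)[:beam_size]
    for i in range(1, sentence_len):
        candidates = []
        for tags, score in beam:
            for t in range(num_tags):
                new_score = score + score_transition(tags[-1], t) + score_emission(t, i)
                candidates.append((tags + [t], new_score))
        beam = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return beam[0][0]  # tag indices of the highest-scoring surviving hypothesis
```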
Note that the --inference BOTH flag allows you to run both beam search and Viterbi at the same time, to facilitate comparing them. Furthermore, once you've implemented and run Part 2 correctly, you don't need to retrain the model every time: whenever the CRF is trained, the model is written to crf_model.pkl. You can add the --crf_model_path crf_model.pkl argument to your script to skip training and load this saved model. Thus, you can test your inference without having to retrain the model, which is quite time-consuming.
Report (20 points)
Write a report of around 1-2 pages as a PDF, not including references (though you are not expected to reference many papers). Your report should list your collaborators (i.e., who you discussed HW#2 with, if any; see the beginning of this PDF). In your writeup, you should report the following things:
Accuracy and runtime for your HMM and CRF model using Viterbi on the development set.
What you did: describe the relevant details of your implementation. Even if you could not make your implementation work, describe your effort.
Briefly discuss a few error cases and how the system could be further improved.
(Bonus questions for beam search) Try multiple beam sizes for the CRF model, including beam size 1. (You can plot these values as a graph or in a table.) Describe what trends you see in accuracy and runtime. How does beam search compare to your expectations here? Under what circumstances do you think beam search would be more effective relative to Viterbi? That is, what characteristics of a problem (sequence length, number of states, model, features, etc.) would change what you’ve observed?
What You Should Submit and Our Expectations
You should submit the following to Carmen:
• Your code and the trained model (stored as crf_model.pkl) as a .zip or .tgz file (Training is time-consuming, so the auto-grader will not be retraining your model; it'll use the one you upload.)
• Your CRF’s output on the testb blind test set in CoNLL format (named as eng.testb.out). Replace the fourth column with your predicted labels.
• The 1-2 page report.
Whenever you train your model, it's written to crf_model.pkl in the current directory. If you specify --crf_model_path filename as an argument, the script reads the trained model from the filename instead of doing the training.
Your code will be graded on the following criteria:
1. Execution: your code should train and evaluate within the time limits without crashing.

2. Accuracy of your models on the development set
3. Accuracy on the blind test set: this is not explicitly reported by the autograder but we may consider it
Before you submit, make sure that the following commands work:
python ner.py --model HMM
python ner.py --model CRF --crf_model_path crf_model.pkl
For the bonus part:
python ner.py --model BOTH --crf_model_path crf_model.pkl
3.1 Acknowledgment
This homework assignment is largely adapted from NLP course materials from UT Austin.
References
[1] Erik F. Tjong Kim Sang and Fien De Meulder. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142–147, 2003.
