
Homework 2: Tagging and Parsing

Student Name:

Student ID:

Python version used:

General info

Due date: 5pm, Thursday March 30

Submission method: see LMS

Submission materials: completed copy of this iPython notebook

Late submissions: -20% per day

Marks: 5% of mark for class

Overview: In this homework, you’ll be building a hidden Markov model (HMM) part-of-speech tagger and applying it to text. This includes training over labelled text, then prediction (decoding) to tag an unlabelled sentence and evaluating the result. The catch is that you’re to do so using algorithms from probabilistic parsing. The HMM is a kind of weighted state-machine, which formally defines a (probabilistic) regular language. Context-free languages subsume the regular languages, and therefore an HMM is a specific type of PCFG. Your task will be to use this equivalence in order to implement HMM decoding using PCFG parsing routines, and use this for part-of-speech tagging.

Materials: See the main class LMS page for information on the basic setup required for this class, including an iPython notebook viewer and the python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. In particular, if you are not using a lab computer which already has it installed, we recommend installing all the data for NLTK, since you will need various parts of it to complete this assignment. You can also use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn’t run on the marker’s machine, you will lose marks.

Evaluation: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don’t ask for). The amount each section is worth is given in parentheses after the instructions. You will be marked not only on the correctness of your methods, but also the quality and efficiency of your code: in particular, you should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to Python style requirements. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it.

Extra credit: Each homework has a task which is optional with respect to getting full marks on the assignment, but that can be used to offset any points lost on this or any other homework assignment (but not the final project or the exam). We recommend you skip over this step on your first pass, and come back if you have time: the amount of effort required to receive full marks (1 point) on an extra credit question will be substantially more than earning the same amount of credit on other parts of the homework.

Updates: Any major changes to the assignment will be announced via LMS. Minor changes and clarifications will be announced in the forum on LMS; we recommend you check the forum regularly.

Academic Misconduct: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s Academic Misconduct policy where inappropriate levels of collusion or plagiarism are deemed to have taken place.

HMM Tagging

Instructions: For this homework we will be using the tagged sentences in the nltk.corpus.treebank corpus. You should start by accessing this dataset, using the tagged_sents method to access the sentences in the corpus, which are composed of pairs of tokens and their part-of-speech tags. For this homework, use all the fileids for training (we won’t bother with heldout validation or testing sets for this homework). For each sentence, add special start-of-sentence and end-of-sentence tokens to the start and end. You should lower-case the tokens in the text, to limit the number of parameters in the model and allow for better generalisation.
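Below is a minimal sketch of this preprocessing step, assuming the strings <s> and </s> as placeholder sentinels (use whatever start/end symbols you prefer):

```python
# Sketch only: load the treebank sample, lower-case words, and bracket each
# sentence with start/end sentinels. The '<s>'/'</s>' strings are placeholder
# choices, not required names.
import nltk
from nltk.corpus import treebank

tagged_corpus = []
for sent in treebank.tagged_sents():          # all fileids by default
    sent = [(word.lower(), tag) for word, tag in sent]
    sent = [('<s>', '<s>')] + sent + [('</s>', '</s>')]
    tagged_corpus.append(sent)
```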

Estimate the parameters of a first-order HMM tagger over the training partition. First order means that the transition probabilities depend on one previous tag, i.e., the A matrix stores $p(t \mid t')$ entries, where t and t' are the current and previous tag, respectively. Estimating the parameters (a.k.a. training) means learning the transition (A), observation (O) and starting state (pi) parameters. You will need to compute normalised frequency counts, as described in the lecture (and also the workshop and reading). You may want to store these using special dictionary variants from collections, such as Counter or defaultdict, or use numpy.array objects. Don’t worry about smoothing the parameters for this homework.

You should include in your transition parameters the transition to the end-of-sentence token, which ends the sequence. You may want to do the same for the start-of-sentence token, in which case the pi vector can be treated as a special row in A.
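One possible way to accumulate and normalise these counts, assuming the tagged_corpus list from the sketch above (with start/end tokens already added), is:

```python
# Sketch only: relative-frequency estimates for the transition (A) and
# observation (O) distributions. Because each sentence begins with the
# start-of-sentence token, its row of A doubles as the pi vector.
from collections import Counter, defaultdict

transition_counts = defaultdict(Counter)   # previous tag -> current tag counts
observation_counts = defaultdict(Counter)  # tag -> word counts

for sent in tagged_corpus:
    for (_, prev_tag), (word, tag) in zip(sent, sent[1:]):
        transition_counts[prev_tag][tag] += 1
    for word, tag in sent:
        observation_counts[tag][word] += 1

def normalise(counts):
    """Turn nested counts into conditional probability dictionaries."""
    return {context: {event: n / sum(counter.values())
                      for event, n in counter.items()}
            for context, counter in counts.items()}

A = normalise(transition_counts)   # A[t_prev][t] = p(t | t_prev)
O = normalise(observation_counts)  # O[t][w] = p(w | t)
```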

Your code should print out the following summary numbers: 1) the number of tag types, 2) the number of word tokens and word types (unique words), and 3) the number of parameters in each of A, O, and pi.

(1.5 marks)

In [ ]:

Instructions: The next part requires thinking about the relationships between formal languages. Recall that the regular languages (which HMMs define) are a subset of the context-free languages (which PCFGs define). Note that although we have weighted (probabilistic) grammars, the same relation still holds. Your challenge is to figure out a context-free grammar that corresponds to the HMM tagger, and the weighting of each production.

To help get started, consider a part-of-speech tagged sentence, e.g., Donald/NNP has/VBZ small/JJ hands/NNS . The sequence of tags forms a chain, which is already a type of tree. Think about how this tagged sequence is scored by the HMM to compute the joint probability, which might help to figure out what non-terminal symbols you might need and where to attach a special S start symbol.

Use the HMM parameters computed above to construct a PCFG, using the nltk.grammar.ProbabilisticProduction and nltk.grammar.PCFG classes. This will require you to come up with a ‘grammar’ representation of the HMM, which you should create directly from your HMM parameters including the HMM probabilities in the PCFG productions. Your grammar should be in Chomsky Normal Form.
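The snippet below only illustrates the nltk classes named above, with toy symbols and hand-picked probabilities; your grammar must instead be generated from the A and O parameters estimated earlier:

```python
# Illustration of the nltk API only -- not the tagging grammar itself.
from nltk.grammar import Nonterminal, ProbabilisticProduction, PCFG

S, NP = Nonterminal('S'), Nonterminal('NP')
productions = [
    ProbabilisticProduction(S, [NP, NP], prob=1.0),        # binary rule (CNF)
    ProbabilisticProduction(NP, ['dinosaurs'], prob=0.4),  # lexical rule (CNF)
    ProbabilisticProduction(NP, ['deals'], prob=0.6),
]
toy_grammar = PCFG(S, productions)
print(toy_grammar)
```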

Finally, implement a probabilistic CYK parser using your tagging PCFG. Feel free to cut-and-paste code from the supplied notebooks, although ensure when you do this you clearly specify in comments the source of the code (and if you make changes to the copied code, then use comments to flag these changes.) Apply this “tagger” to the test sentence

a dozen dinosaurs debate deals for daily dessert delivery .

and print the predicted tree and the sequence of POS tags. During code development you may want to also view the chart.

Note: Depending on how you go about this, you may need to relax the test for grammar validity in the PCFG constructor by changing the constant nltk.grammar.PCFG.EPSILON.
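For example (the tolerance value below is an arbitrary illustration; pick whatever your grammar needs):

```python
import nltk

# Loosen the check that each nonterminal's production probabilities sum to 1,
# which can fail due to floating-point rounding over many productions.
nltk.grammar.PCFG.EPSILON = 0.1
```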

(2.5 marks)

In [ ]:

Instructions: This approach is pretty slow, which you may notice if you apply your method to much longer sentences. Clearly the CYK algorithm is too general, and thus is not optimised for tagging grammars. Your job is to tailor the CYK algorithm to speed up its tagging performance. Start by pasting in a copy of the PCYK function from the notebook, and then make your edits. Please ensure that you clearly mark the parts you have changed with comments. You are free to sacrifice generality of the algorithm, so that it only works for your style of grammar. Are there aspects of your grammar that can be exploited by small changes to the parsing code? Looking carefully at the parse chart might help you to identify these opportunities.

Print out the time of running the original CYK parsing for an example sentence, and the time for your improved method (you may want to use the timeit library.) Also print the tag predictions for both parsing methods, to verify that they produce the same results.
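A possible timing harness is sketched below; pcyk_parse and fast_tag_parse are hypothetical names standing in for the original parser and your tailored one:

```python
# Sketch only: time the two parsers on the same sentence. The function names
# pcyk_parse and fast_tag_parse are placeholders for your own implementations.
import timeit

sentence = 'a dozen dinosaurs debate deals for daily dessert delivery .'.split()

start = timeit.default_timer()
original_tree = pcyk_parse(sentence)
print('original CYK:', timeit.default_timer() - start, 'seconds')

start = timeit.default_timer()
tailored_tree = fast_tag_parse(sentence)
print('tailored CYK:', timeit.default_timer() - start, 'seconds')
```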

(1 mark)

In [ ]:

Bonus: higher-order tagging

Instructions: Implement a second-order HMM tagger where the transition distribution considers the previous two tags as context, $p(t_i \mid t_{i-2},t_{i-1})$, in place of the single previous tag in the first-order HMM. Feel free to go to even higher orders. You will have to be careful with low-count events, as some perfectly valid sequences of three tags may not have been seen in training, so some smoothing will be needed. One simple method is to average the 1st-order and 2nd-order estimates, i.e., using $0.5 \times p(t_i \mid t_{i-2},t_{i-1}) + 0.5 \times p(t_i \mid t_{i-1})$ for the transition parameters. Your implementation should encode this HMM as an equivalent PCFG grammar, which you validate over the test sentence above (or other sentences you make up; but see the OOV discussion below).
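A small sketch of this interpolation, assuming hypothetical dictionaries p1[t_prev][t] and p2[(t_prev2, t_prev)][t] holding the first- and second-order relative-frequency estimates:

```python
# Sketch only: interpolate second- and first-order transition estimates.
# p1 and p2 are placeholder names for your own probability tables.
def interpolated_transition(t, t_prev, t_prev2, p1, p2, weight=0.5):
    """Return weight * p(t | t_prev2, t_prev) + (1 - weight) * p(t | t_prev)."""
    second_order = p2.get((t_prev2, t_prev), {}).get(t, 0.0)
    first_order = p1.get(t_prev, {}).get(t, 0.0)
    return weight * second_order + (1.0 - weight) * first_order
```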

(1 bonus mark)

In [ ]:

A final word
In practice, HMMs are not usually implemented as PCFGs. Instead, more tailored and optimised algorithms are used which are specific to the HMM. This homework aims to emphasise the similarities between the two approaches, and elucidate the link between the CYK and Viterbi algorithms.

Another important aspect is handling out-of-vocabulary (OOV) words, i.e., words encountered in testing but not seen in training. The test sentence above was constructed to avoid this issue, and I suggest that if you test with other sentences, you keep to the training vocabulary. Dealing with OOVs is out of scope, for now, although we will discuss how these can be handled in various contexts later in the subject.