1 Language Modelling
One of the most basic tasks in handling textual content is language modelling, an unsupervised machine learning problem. In this assignment, you will explore the use of language modelling to predict which emoji a tweet is likely to end with. This could later form the basis of an opinion mining model.

1.1 Academic Code of Conduct

You are welcome to discuss the project with other students. However, sharing code and/or text is not permitted, and each person must submit their own project. Plagiarism detection tools will be run on your submissions. If you have questions regarding the project, post them on the discussion forum.

1.2 Preliminaries

You may use any programming language, and basic libraries such as NumPy, to solve this project. Make sure that we can run your code with these dependencies when you submit. However, you are not allowed to use pre-implemented functions to build the language model, i.e. you are required to implement n-gram construction, smoothing and language model prediction yourself. If in doubt, please ask in the discussion forum.

2 Project Task
You are given the multilingual emoji prediction dataset from 2018. Your task is to implement and
evaluate a language model that, given a sentence, predicts the most likely emoji to complete the
sentence. This section provides details on how to do this.

To help you with this project, I strongly recommend keeping "Speech and Language Processing" by Jurafsky and Martin at hand, particularly Chapter 4 on language modelling with n-grams. You can find it here: https://web.stanford.edu/~jurafsky/slp3/

2.1 Data

We will use the English part of the data from a recent research competition, SemEval 2018 Task 2 on Multilingual Emoji Prediction: https://competitions.codalab.org/competitions/17344.

You can download the dataset with already crawled English tweets, to be used in this assignment, here:
https://www.dropbox.com/s/cjoi8y8sgdur77t/Semeval2018-Task2-Emoji-Detection.zip?dl=0. Please read the contained README files carefully. The dataset is prepared such that it is split into a "training" and a "trial" part. Tweets and the corresponding emojis the tweets end with are stored separately.

2.2 Data Preprocessing

Construct the vocabulary (i.e. the set of all unique words and emojis observed) over the training part of
the dataset. This should include an “unknown” token for words unseen at test time.
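As a concrete illustration, the vocabulary construction might look like the following Python sketch (function names and the "<unk>" token string are my own choices; whitespace tokenisation is a simplifying assumption, since real tweets would benefit from a proper tokeniser):

```python
from collections import Counter

UNK = "<unk>"  # back-off token standing in for words unseen at test time

def build_vocab(tweets, min_count=1):
    """Collect all unique tokens (words and emojis) from the training tweets."""
    counts = Counter(tok for tweet in tweets for tok in tweet.split())
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    vocab.add(UNK)  # ensure the unknown token is always in the vocabulary
    return vocab

# toy example
tweets = ["good morning everyone", "good vibes only"]
vocab = build_vocab(tweets)
```

At prediction time, any token not found in `vocab` would be mapped to `UNK` before look-up.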

2.3 Language Model Construction

Implement several variants of a language model (simple n-gram model, n-gram model with add-k
smoothing, n-gram model with interpolation), which you will later evaluate and contrast, as follows:

2.3.1 Construct an n-gram based language model over the training set. Choose "n" freely (though note that you will have to motivate and critique this choice as part of the questions later). The n-gram language model should predict the probability of a sentence occurring in the training dataset.
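One possible shape for the counting step is sketched below in Python (function names and the padding symbols "<s>"/"</s>" are my own conventions, not prescribed by the assignment):

```python
from collections import Counter

def ngrams(tokens, n):
    """Yield n-grams over a sequence padded with sentence boundaries."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

def train_ngram_lm(sentences, n):
    """Count n-grams and their (n-1)-gram histories over the training set."""
    ngram_counts, history_counts = Counter(), Counter()
    for sent in sentences:
        for gram in ngrams(sent.split(), n):
            ngram_counts[gram] += 1
            history_counts[gram[:-1]] += 1
    return ngram_counts, history_counts

def mle_prob(gram, ngram_counts, history_counts):
    """Maximum-likelihood estimate P(w | history) = c(history, w) / c(history)."""
    h = history_counts[gram[:-1]]
    return ngram_counts[gram] / h if h else 0.0

# toy bigram model over two tiny "tweets"
nc, hc = train_ngram_lm(["a b", "a c"], n=2)
```

A sentence probability is then the product of these conditional estimates over all its n-grams (in practice, a sum of log-probabilities to avoid underflow).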

2.3.2 As outlined in Chapter 4, Sections 4.4.2 and 4.4.3, of the "Speech and Language Processing" book, implement add-k smoothing and interpolation.
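The two variants can be sketched as follows (a minimal illustration with toy counts; the function names, the value of k and the λ weights are placeholders that you must choose and justify yourself):

```python
from collections import Counter

def addk_prob(gram, ngram_counts, history_counts, k, V):
    """Add-k smoothed estimate: (c(history, w) + k) / (c(history) + k * V)."""
    return (ngram_counts[gram] + k) / (history_counts[gram[:-1]] + k * V)

def interpolated_prob(p_unigram, p_bigram, lambdas=(0.3, 0.7)):
    """Linear interpolation of unigram and bigram estimates; lambdas sum to 1."""
    l_uni, l_bi = lambdas
    return l_uni * p_unigram + l_bi * p_bigram

# toy counts: bigram ("a", "b") seen once, history ("a",) seen twice, |V| = 3
nc = Counter({("a", "b"): 1})
hc = Counter({("a",): 2})
p = addk_prob(("a", "b"), nc, hc, k=1.0, V=3)   # (1 + 1) / (2 + 3) = 0.4
```

Note that with `Counter`, unseen n-grams simply return a count of 0, so the same functions cover seen and unseen events.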

2.4 Evaluation

You will evaluate your resulting language model on the “trial” part of the dataset in two ways.

2.4.1 Internal evaluation: you will evaluate the language model using perplexity. The perplexity of a language model P for a test set W = w1w2…wN is defined as follows:
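The formula itself appears to have been lost in conversion; the standard definition, as given by Jurafsky and Martin, is:

```latex
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
```

so a lower perplexity means the model assigns higher probability to the test set.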

2.4.2 External evaluation: you will evaluate how well the model can predict the emojis that tweets end with. For this, calculate the probability, under the language model, of each tweet in the test set ending in each of the candidate emojis. The predicted emoji for each tweet should be the one with the highest probability over the emoji choices. Evaluate the performance of your resulting unsupervised emoji prediction model using the macro F1 over emoji labels.

For each of the emojis, calculate the F1 as follows:
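The formula appears to be missing here; the standard per-class definition, with true positives, false positives and false negatives counted separately for each emoji, is:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot P \cdot R}{P + R}
```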

Note down the F1 achieved for each emoji separately, then compute the macro F1 by averaging the individual F1 scores over the number of emojis.

Question 1: What size did you use for "n" in the n-gram language model, what values did you use for "k" in add-k smoothing, and what values of "λ" for interpolation? Why is that? Explain the advantages and disadvantages of your choices.

Question 2: Compute the scores for the intrinsic and extrinsic evaluation for each of the language model variants. What do the scores you achieved for the intrinsic and extrinsic evaluation mean? Describe them in your own words.

Question 3: Instead of using the language model to predict the most likely emoji to complete the sentence, use each of the language models to predict the most likely word following each of the trial tweets, without adding the emojis at the end. What do you observe? Discuss.

Question 4: What are the advantages and disadvantages of using an n-gram based language model, as you have implemented here, over a neural language model, as also introduced in the language modelling lecture?

Question 5: One of the main problems with language processing models, including language models, is that they are limited to the words they have seen, i.e. those in the originally constructed vocabulary. One scenario in which this issue arises is misspelled words. Discuss why this is problematic and how a language model could be used to correct spelling mistakes.

Question 6: Augment your language model resulting from 2.3.2 to handle unseen words at test time. The new language model should be able to: 1) assign a more meaningful probability to these words than that corresponding to the "unknown" back-off token; 2) continue a sentence in which the previous word is unseen in a more meaningful way than your language model from 2.3.2 would. Assess this by measuring and contrasting perplexity.

Question 7: Test the usefulness of your new language model from Question 6 for the task of correcting spelling mistakes. 1) For each of the tweets in the development set, create a corresponding version with at least 3 types of automatically introduced spelling mistakes. You can introduce these by permuting two neighbouring characters in each word of the tweet, e.g. "characters" -> "charatcers". The position of the first character to be permuted should be chosen randomly, though for simplicity you may restrict it to the second half of each word; 2) Use your new language model to correct those mistakes, i.e. to reconstruct the original tweets from the versions with automatically introduced spelling mistakes; 3) Measure the performance of your system by computing the F1 at the word level.
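Step 1 could be sketched like this (a minimal illustration; function names and the exact sampling scheme are my own, provided the swap position falls in the second half of the word as suggested):

```python
import random

def misspell_word(word, rng):
    """Swap two neighbouring characters; the swap position is drawn at random
    from the second half of the word."""
    if len(word) < 4:  # too short to permute meaningfully
        return word
    i = rng.randrange(len(word) // 2, len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def misspell_tweet(tweet, seed=0):
    """Introduce one adjacent-character swap into every word of a tweet."""
    rng = random.Random(seed)
    return " ".join(misspell_word(w, rng) for w in tweet.split())

# e.g. "characters" can become "charatcers" (indices 5 and 6 swapped)
```

Seeding the random generator keeps the corrupted development set reproducible across runs, which makes the F1 comparison in step 3 fair.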
