
Sentiment Analysis
This assignment is inspired by a typical real-life scenario. Imagine you have been hired as a Data Scientist by a major airline company. Your job is to analyse the Twitter feed to determine customer sentiment towards your company and its competitors.
In this assignment, you will be given a collection of tweets about US airlines. The tweets have been manually labelled for sentiment, categorised as positive, negative or neutral. Important: Do not distribute these tweets on the Internet, as this breaches Twitter’s Terms of Service.
You are expected to assess various supervised machine learning methods, using a variety of features and settings, to determine which methods work best for sentiment classification in this domain. The assignment has two components: programming to produce a collection of models for sentiment analysis, and a report evaluating the effectiveness of those models. The programming part involves developing Python code for data preprocessing of tweets and for experimenting with methods using NLP and machine learning toolkits. The report involves evaluating and comparing the models using various metrics, and comparing the machine learning models to a baseline method.
You will use the NLTK toolkit for basic language preprocessing, and scikit-learn for feature construction and for evaluating the machine learning models. You will be given an example of how to use NLTK and scikit-learn for this assignment (example.py). For the sentiment analysis baseline, NLTK includes a hand-crafted (crowdsourced) sentiment analyser, VADER [1], which may perform well in this domain because of the way it uses emojis and other features of social media text to intensify sentiment. However, the accuracy of VADER is difficult to anticipate because: (i) crowdsourcing is in general highly unreliable, and (ii) this dataset might not include much use of emojis and other markers of sentiment.
[1] https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109
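To make the baseline concrete, here is a minimal sketch of how VADER can be called through NLTK. The thresholds of 0.05 and -0.05 on the compound score are the convention suggested by VADER’s authors for three-way classification, not something this assignment fixes, and the function name vader_label is ours:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon', quiet=True)  # one-off download of the lexicon
analyser = SentimentIntensityAnalyzer()

def vader_label(tweet_text):
    # map VADER's compound score onto the three sentiment classes
    compound = analyser.polarity_scores(tweet_text)['compound']
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

print(vader_label('I love this airline :)'))  # -> positive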
Data and Methods
A training dataset is a tsv (tab-separated values) file containing a number of tweets, one tweet per line, with linebreaks within tweets removed. Each line of the tsv file has three fields: instance number, tweet text and sentiment (positive, negative or neutral). A test dataset is a tsv file in the same format as the training dataset, except that your code should ignore the sentiment field. Training and test datasets can be drawn from a supplied file dataset.tsv (see below).
For all models except VADER, consider a tweet to be a collection of words, where a word is a string of at least two letters, numbers or the symbols #, @, _, $ or %, delimited by a space, after removing all other characters (two characters is the default minimum word length for CountVectorizer in scikit-learn). URLs should be treated as a space, so they delimit words. Note that deleting “junk” characters may create longer words that were previously separated by those characters.
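One plausible reading of this word definition is sketched below; the URL pattern and the exact character class are our assumptions, so adapt them to your interpretation. Because a custom preprocessor replaces CountVectorizer’s built-in preprocessing, case is left untouched here:

import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(tweet_text):
    text = re.sub(r'https?://\S+', ' ', tweet_text)  # URLs become a space
    return re.sub(r'[^A-Za-z0-9#@_$% ]', '', text)   # delete all other characters

# a word: a run of at least two allowed characters, delimited by spaces
vectorizer = CountVectorizer(preprocessor=preprocess,
                             token_pattern=r'[A-Za-z0-9#@_$%]{2,}')

vectorizer.fit(['Great crew!! @united https://t.co/xyz #win'])
print(vectorizer.get_feature_names_out())  # tokens kept: #win, @united, Great, crew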
Use the supervised learning methods discussed in the lectures: Decision Trees (DT), Bernoulli Naive Bayes (BNB) and Multinomial Naive Bayes (MNB). Do not code these methods yourself: instead use the implementations from scikit-learn. Read the scikit-learn documentation on Decision Trees [2] and Naive Bayes [3], and the linked pages describing the parameters of the methods. Look at example.py to see how to use CountVectorizer and how to train and test the machine learning algorithms, including how to generate metrics for the models developed.
[2] https://scikit-learn.org/stable/modules/tree.html
[3] https://scikit-learn.org/stable/modules/naive_bayes.html
The programming part of the assignment is to produce DT, BNB and MNB models and your own model for sentiment analysis in Python programs that can be called from the command line to train and classify tweets read from correctly formatted tsv files. The report part of the assignment is to analyse these models using a variety of parameters, preprocessing tools, scenarios and baselines.
Programming
You will produce and submit four Python programs: (i) DT_sentiment.py, (ii) BNB_sentiment.py, (iii) MNB_sentiment.py and (iv) sentiment.py. The first three of these are standard models as defined below. The last is a model that you develop following experimentation with the data. Use the given dataset (dataset.tsv) containing 5000 labelled tweets to develop the models.
These programs, when called from the command line with two file names as arguments (the first a training dataset, the second a test dataset), should print to standard output the instance number and the sentiment assigned by the classifier to each tweet in the test set when trained on the training set, one tweet per line with a space between the two fields. Each sentiment must be the string “positive”, “negative” or “neutral”. For example:
python3 DT_sentiment.py training.tsv test.tsv > output.txt
should write to the file output.txt the instance number and sentiment of each tweet in test.tsv, as determined by the Decision Tree classifier trained on training.tsv. When reading in training and test datasets, make sure your code reads all the instances: some Python readers default to the “excel” dialect, which treats double quotes as quote characters rather than ordinary text, and can therefore silently merge fields or skip instances.
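As an illustration, here is one way to read such a file with Python’s csv module; the function name read_tsv is ours, and QUOTE_NONE is what disables the default “excel” quoting behaviour:

import csv

def read_tsv(path, with_labels=True):
    # QUOTE_NONE makes double quotes inside tweets ordinary characters
    rows = []
    with open(path, encoding='utf-8') as f:
        for fields in csv.reader(f, delimiter='\t', quoting=csv.QUOTE_NONE):
            instance_id, text = fields[0], fields[1]
            label = fields[2] if with_labels and len(fields) > 2 else None
            rows.append((instance_id, text, label))
    return rows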
Standard Models
Train the three standard models on the supplied dataset of 5000 tweets (the whole of dataset.tsv). For Decision Trees, use scikit-learn’s Decision Tree method with criterion set to 'entropy' and with random_state=0. Scikit-learn’s Decision Tree method does not implement pruning; instead, make sure Decision Tree construction stops when a node covers fewer than 50 examples (1% of the training set). Decision Trees are likely to lead to fragmentation, so to avoid overfitting and reduce computation time, for all Decision Tree models use as features only the 1000 most frequent words from the vocabulary (after preprocessing to remove “junk” characters as described above). Write code to train and test a Decision Tree model in DT_sentiment.py.
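A minimal sketch of such a model is shown below, using toy data in place of dataset.tsv. Mapping the stopping rule to min_samples_split=50 (a node covering fewer than 50 examples is never split) is our reading of the specification, so check it against the scikit-learn documentation:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train_texts = ['great flight @united', 'delayed again #fail', 'on time today']
train_labels = ['positive', 'negative', 'neutral']  # toy stand-ins for dataset.tsv

vectorizer = CountVectorizer(max_features=1000)  # the 1000 most frequent words
X_train = vectorizer.fit_transform(train_texts)

clf = DecisionTreeClassifier(criterion='entropy', random_state=0,
                             min_samples_split=50)
clf.fit(X_train, train_labels)
print(clf.predict(vectorizer.transform(['another delayed flight'])))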
For both BNB and MNB, use scikit-learn’s implementations, but use all of the words in the vocabulary as features. Write two Python programs for training and testing Naive Bayes models, one a BNB model and one an MNB model, in BNB_sentiment.py and MNB_sentiment.py.
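The corresponding sketch for the Naive Bayes models, again on toy data, differs only in the estimator and in leaving max_features unset so that the whole vocabulary is used:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

texts = ['love this airline', 'worst service ever', 'flight was fine']
labels = ['positive', 'negative', 'neutral']  # toy stand-ins

vectorizer = CountVectorizer()  # no max_features: the whole vocabulary
X = vectorizer.fit_transform(texts)

bnb = BernoulliNB().fit(X, labels)    # binarises the counts internally
mnb = MultinomialNB().fit(X, labels)  # models the raw counts
print(mnb.predict(vectorizer.transform(['fine service'])))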
Your Model
Develop your best model for sentiment classification by varying the number and type of input features for the learners, the parameters of the learners, and the training/test set split, as described in your report (see below). Submit one program, sentiment.py, that trains and tests a model.
Report
In the report, you will first evaluate the standard models, then present your own model. For evaluating all models, report the results of training on the first 4000 tweets in dataset.tsv (the “training set”) and testing on the remaining 1000 tweets (the “test set”), rather than using the full dataset of 5000 tweets for training; accordingly, stop the Decision Tree classifiers when nodes cover fewer than 40 tweets rather than 50. Use the metrics (micro- and macro-averaged accuracy, precision, recall and F1) and classification reports from scikit-learn. Show the results in either tables or plots, and write a short paragraph in your response to each item below. The answer to each question should be self-contained. Your report should be at most 10 pages. Do not include appendices.
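For reference, the required numbers can all be produced with standard scikit-learn calls along the following lines; the label lists here are hypothetical placeholders:

from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

y_true = ['positive', 'negative', 'neutral', 'negative']  # hypothetical labels
y_pred = ['positive', 'negative', 'negative', 'negative']

print(classification_report(y_true, y_pred, zero_division=0))  # per-class P/R/F1
print(accuracy_score(y_true, y_pred))  # equals the micro-average in this single-label setting
for avg in ('micro', 'macro'):
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average=avg, zero_division=0)
    print(avg, p, r, f1)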
1. (1 mark) Give simple descriptive statistics showing the frequency distribution for the sentiment classes for the whole dataset of 5000 tweets. What do you notice about the distribution?
2. (2 marks) Develop BNB and MNB models from the training set using (a) the whole vocabulary, and (b) the most frequent 1000 words from the vocabulary (as defined using CountVectorizer, after preprocessing by removing “junk” characters). Show all metrics on the test set comparing the two approaches for each method. Explain any similarities and differences in results.
3. (2 marks) Evaluate the three standard models with respect to the VADER baseline. Show all metrics on the test set and comment on the performance of the baseline and of the models relative to the baseline.
4. (2 marks) Evaluate the effect on classifier performance, for the three standard models, of preprocessing the input features by applying NLTK English stop word removal and then NLTK Porter stemming (a sketch of this pipeline is given after the question list). Show all metrics with and without preprocessing on the test set and explain the results.
5. (2 marks) Evaluate the effect that converting all letters to lower case has on classifier performance for the three standard models. Show all metrics with and without conversion to lower case on the test set and explain the results.
6. (6 marks) Describe your best method for sentiment analysis and justify your decision. Give some experimental results for your method trained on the training set of 4000 tweets and tested on the test set of 1000 tweets. Provide a brief comparison of your model to the standard models and the baseline (use the results from the previous questions).
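The preprocessing for question 4 might look like the following sketch, assuming NLTK’s standard English stopword list and PorterStemmer; how you plug it into CountVectorizer (for instance via its tokenizer argument) is up to you:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords', quiet=True)  # one-off download of the stopword list
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

def stop_and_stem(tokens):
    # remove English stop words, then Porter-stem whatever remains
    return [stemmer.stem(t) for t in tokens if t.lower() not in STOP_WORDS]

print(stop_and_stem(['the', 'flights', 'were', 'delayed', 'at', 'boarding']))
# -> ['flight', 'delay', 'board']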