Only the first submission that you mark will be counted. Do not press mark until you are ready to submit.
This problem set assesses how well you work with strings by guiding you towards a basic Bayesian classifier. Consider the sample of tweets using the #SydneyTrains hashtag, provided in Tweets.csv.
(a) Read Tweets.csv as a pandas DataFrame and save it as dftweets. Return the number of tweets in the dataset.
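A rough sketch of one approach, assuming Tweets.csv sits in the working directory and each row is a single tweet (the variable name num_tweets below is illustrative, not part of the scaffold):

    import pandas as pd

    # Read the tweet sample into a DataFrame
    dftweets = pd.read_csv("Tweets.csv")

    # One row per tweet, so the row count is the number of tweets
    num_tweets = len(dftweets)
    num_tweets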
(b) Look through the data and create a variable called collist that contains a list of the column headers. Return the variable collist at the end of the cell.
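A minimal sketch for this part, assuming dftweets already exists from part (a):

    # Column headers of the DataFrame as a plain Python list
    collist = list(dftweets.columns)
    collist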
(c) Create a new column called mod_text that contains a lower-cased copy of full_text with punctuation removed. Remove only the punctuation characters specified in punclist. Save only the mod_text column as a CSV file called mod_text.csv, with no header.
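One possible approach, assuming punclist is a string of punctuation characters supplied in the scaffold (the example value in the comment is illustrative only):

    import re

    # e.g. punclist = ".,!?;:"  -- use the punclist provided in the scaffold
    pattern = "[" + re.escape(punclist) + "]"

    # Lower-case full_text and strip only the listed punctuation
    dftweets["mod_text"] = (
        dftweets["full_text"]
        .str.lower()
        .str.replace(pattern, "", regex=True)
    )

    # Save just this column, with no header and no index column
    dftweets["mod_text"].to_csv("mod_text.csv", header=False, index=False)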
(d) Make a list of the 10 most frequently occurring words, with the most frequent word at the top. Frequency is measured by the number of tweets that contain the word (not by total occurrences). Save the list as topwords.csv, where each line contains exactly one word and there is no header.
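A sketch of this document-frequency count, assuming mod_text from part (c) and that words are separated by whitespace:

    from collections import Counter

    # For each word, count the number of tweets that contain it at least once
    doc_freq = Counter()
    for text in dftweets["mod_text"]:
        doc_freq.update(set(str(text).split()))

    # The 10 most common words, most frequent first
    topwords = [word for word, count in doc_freq.most_common(10)]

    # One word per line, no header
    with open("topwords.csv", "w") as f:
        f.write("\n".join(topwords) + "\n")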
(e) The dataset contains a column indicating whether each tweet is about a bad experience (a personal complaint about Sydney Trains). Make another list of the 10 most frequently occurring words, this time counting only the tweets flagged as bad experiences, and sort it so the most frequent word is at the top. Save the list as BEtopwords.csv, where each line contains exactly one word and there is no header.
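The same idea restricted to the flagged tweets; the column name bad_experience and the flag value 1 below are placeholders, since the real header is not stated here (check collist from part (b)):

    from collections import Counter

    # Placeholder column name and flag value -- adjust to the actual dataset
    be_tweets = dftweets[dftweets["bad_experience"] == 1]

    # Count the number of flagged tweets containing each word
    be_freq = Counter()
    for text in be_tweets["mod_text"]:
        be_freq.update(set(str(text).split()))

    BEtopwords = [word for word, count in be_freq.most_common(10)]

    with open("BEtopwords.csv", "w") as f:
        f.write("\n".join(BEtopwords) + "\n")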
In a Bayesian classifier, a set of features is used to classify observations. Features are most useful for classifying if they are rare in the general population but common in the category of interest. See the example in the Week 3 e-Learning on Bayes’ Rule. For the task below, we are interested in a classifier with a single feature (i.e. the presence of a single word in the tweet is the only information we use to classify the tweet).
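In symbols, Bayes' rule for a single word feature reads

    P(BE | word) = P(word | BE) · P(BE) / P(word)

so the posterior P(BE | word) is large when the word is common among bad-experience tweets (high P(word | BE)) but rare across the whole sample (low P(word)).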
(f) Choose a single word to use as a classifier for identifying tweets that are about personal bad experiences on the Sydney Trains network. The goal is to choose a word that maximises the probability of a tweet belonging to the bad-experience category based only on the presence of this word. You should test the effectiveness of your word on the Tweets.csv dataset, but 10% of your mark will be based on how successfully your classifier performs on a hidden validation dataset. Success will be measured as the ratio of the true positive rate (the fraction of bad-experience tweets that are correctly classified as bad experiences) to the false positive rate (the fraction of other tweets that are incorrectly classified as bad experiences). You must also satisfy the additional constraint that your word identifies at least half of the bad experiences in Tweets.csv. Save the word you want to use as a string in a variable called myword.
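One way to sanity-check a candidate word on Tweets.csv (a rough sketch, reusing dftweets, mod_text and the placeholder bad_experience column from the earlier parts; the word "delayed" below is purely illustrative, not a suggested answer):

    # Candidate word -- replace with your own choice
    myword = "delayed"

    # Does each tweet contain the word as a whole token?
    contains = dftweets["mod_text"].astype(str).str.split().apply(lambda words: myword in words)

    is_be = dftweets["bad_experience"] == 1   # placeholder column name

    # True positive rate: fraction of bad-experience tweets the word catches
    tpr = contains[is_be].mean()
    # False positive rate: fraction of other tweets the word wrongly flags
    fpr = contains[~is_be].mean()

    print(tpr, fpr, tpr / fpr if fpr > 0 else float("inf"))
    # The task requires tpr to be at least 0.5 on Tweets.csv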
You may run your notebook as many times as you like. Be mindful that the code and comments provided in the initial scaffold are required for the marking, so do not delete them.
You may discuss in general terms with your peers and tutors, but you must write your solution yourself. If you get ideas from your peers or a website, you should use comments to attribute the source.
When you are satisfied with your solution, press mark. Your mark for the problem set is based on the first valid submission. After you receive your mark, you will be able to resubmit in order to test changes and see how they affect the score, but the recorded mark will be based on the initial submission.