Assignment 3: Spam, wonderful spam
Thomas Lumley
9/5/2018
This assignment uses the SMS Spam dataset from Canvas.
- Use rpart to fit and prune (if necessary) a tree predicting spam/non-spam from the common word counts in the wordmatrix matrix. Report the accuracy with a confusion matrix. Plot the fitted tree (without all the labels) and comment on its shape.
- For each common word $i$ in wordmatrix, compute the numbers $y_i$ and $n_i$ that give the number of occurrences in spam and non-spam messages respectively. The overall evidence provided by having this word in a message can be approximated by $e_i = \log(y_i + 1) - \log(n_i + 1)$. A 'naïve Bayes' classifier sums up the $e_i$ for every (common) word in the message to get an overall score for each message and then splits this at some threshold to get a classification. Construct a naïve Bayes classifier and choose the threshold so the proportion of spam predicted is the same as the proportion observed. Report the accuracy with a confusion matrix. (It's called naïve Bayes because it would be a Bayesian predictor if the words were all independently chosen, which they obviously won't be.)
- Read the description at the UCI archive of how the dataset was constructed. Why is spam/non-spam accuracy likely to be higher with this dataset than in real life? What can you say about the generalisability of the classifier to particular populations of text users?
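For the tree question, the fit, prune and confusion-matrix workflow might look roughly like the sketch below. It runs on simulated data, not the course dataset: `toy`, `toy_matrix` and the four word columns are made up for illustration. Picking the `cp` value with the smallest cross-validated error is one common pruning rule, not the only reasonable one.

```r
library(rpart)

set.seed(1)
# Hypothetical stand-in for the course data: 500 simulated messages
# and 4 made-up "common words". Replace with the real word counts.
n_msg <- 500
is_spam <- rbinom(n_msg, 1, 0.2) == 1
toy_matrix <- cbind(
  free = rpois(n_msg, ifelse(is_spam, 1.0, 0.05)),
  txt  = rpois(n_msg, ifelse(is_spam, 0.8, 0.05)),
  ok   = rpois(n_msg, ifelse(is_spam, 0.05, 0.5)),
  the  = rpois(n_msg, 0.7)
)
toy <- data.frame(spam = factor(ifelse(is_spam, "spam", "ham")), toy_matrix)

# Grow a deliberately large tree (small cp), then prune it back.
fit <- rpart(spam ~ ., data = toy, method = "class", cp = 0.0001)

# Assumed pruning rule: keep the cp with the lowest cross-validated error.
cp_table <- fit$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

# Confusion matrix and accuracy on the training data (an optimistic estimate).
pred <- predict(pruned, newdata = toy, type = "class")
confusion <- table(actual = toy$spam, predicted = pred)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)

# Plot only the shape of the tree; omitting text() leaves the labels off.
# plot() fails on a root-only tree, hence the guard.
if (nrow(pruned$frame) > 1) plot(pruned)
```

Calling `printcp(fit)` or `plotcp(fit)` before pruning shows how the cross-validated error changes with tree size, which helps in deciding whether pruning is needed at all.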
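The naïve Bayes scoring can be sketched on a tiny made-up example. Everything here (`toy_counts`, `toy_spam`, the three words) is hypothetical. The sketch also makes two assumptions: it reads "number of occurrences" as total word counts, and it adds $e_i$ once per word present in a message rather than once per occurrence. With tied scores, a quantile threshold may match the observed spam proportion only approximately.

```r
# Hypothetical toy data: 6 messages by 3 made-up "common words".
toy_counts <- matrix(
  c(2, 0, 0,
    1, 1, 0,
    0, 0, 1,
    0, 1, 2,
    0, 0, 1,
    1, 0, 0),
  ncol = 3, byrow = TRUE,
  dimnames = list(NULL, c("free", "call", "ok"))
)
toy_spam <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE)

# y_i and n_i: occurrences of word i in spam and non-spam messages.
y <- colSums(toy_counts[toy_spam, , drop = FALSE])
n <- colSums(toy_counts[!toy_spam, , drop = FALSE])

# Evidence for each word: e_i = log(y_i + 1) - log(n_i + 1).
e <- log(y + 1) - log(n + 1)

# Score per message: sum of e_i over the words present in it
# (presence rather than count is an assumption of this sketch).
present <- 1 * (toy_counts > 0)
score <- as.vector(present %*% e)

# Threshold chosen so the predicted spam proportion matches the observed one.
threshold <- quantile(score, probs = 1 - mean(toy_spam))
pred <- score > threshold

confusion <- table(actual = toy_spam, predicted = pred)
print(confusion)
```

On real data it is worth checking `mean(pred)` against `mean(toy_spam)` after thresholding, since many messages can share the same score.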