Assignment 3: Spam, wonderful spam

Thomas Lumley
9/5/2018
This assignment uses the SMS Spam dataset from Canvas.

  1. Use rpart to fit and prune (if necessary) a tree predicting spam/non-spam from the common word counts in the wordmatrix matrix. Report the accuracy with a confusion matrix. Plot the fitted tree (without all the labels) and comment on its shape.
  2. For each common word in wordmatrix, compute the numbers $y_i$ and $n_i$ that give the number of occurrences in spam and non-spam messages respectively. The overall evidence provided by having this word in a message can be approximated by $e_i = \log(y_i + 1) - \log(n_i + 1)$. A 'naïve Bayes' classifier sums up the $e_i$ for every (common) word in the message to get an overall score for each message and then splits this at some threshold to get a classification. Construct a naïve Bayes classifier and choose the threshold so the proportion of spam predicted is the same as the proportion observed. Report the accuracy with a confusion matrix. (It's called naïve Bayes because it would be a Bayesian predictor if the words were all independently chosen, which they obviously won't be.)
  3. Read the description at the UCI archive of how the dataset was constructed. Why is spam/non-spam accuracy likely to be higher with this dataset than in real life? What can you say about the generalisability of the classifier to particular populations of text users?
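
The mechanics of questions 1 and 2 can be sketched on simulated data. This is only an illustration under stated assumptions, not a model answer: `toy_words`, `is_spam` and the four word columns are made-up stand-ins for the real `wordmatrix` and spam labels, the tree is pruned at the complexity parameter with the smallest cross-validated error (one common choice among several), and the naïve Bayes score counts each word once per message if it is present.

```r
library(rpart)

set.seed(1)

# Hypothetical stand-ins for the real data: a small word-count matrix
# and a logical spam indicator. Names and rates are invented.
n_msgs  <- 400
is_spam <- rbinom(n_msgs, 1, 0.2) == 1
toy_words <- cbind(
  free  = rpois(n_msgs, ifelse(is_spam, 1.00, 0.05)),
  prize = rpois(n_msgs, ifelse(is_spam, 0.80, 0.02)),
  home  = rpois(n_msgs, ifelse(is_spam, 0.05, 0.50)),
  the   = rpois(n_msgs, 1.00)
)

# Question 1: fit a classification tree, then prune it.
toy_df <- data.frame(
  spam = factor(ifelse(is_spam, "spam", "ham"), levels = c("ham", "spam")),
  toy_words
)
fit <- rpart(spam ~ ., data = toy_df, method = "class")

# Assumption: prune at the CP value with the lowest cross-validated error.
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

tree_pred      <- predict(pruned, toy_df, type = "class")
tree_confusion <- table(actual = toy_df$spam, predicted = tree_pred)
tree_accuracy  <- mean(tree_pred == toy_df$spam)

# Draw the tree without labels (no text() call); skip if it is only a root.
if (nrow(pruned$frame) > 1) plot(pruned)

# Question 2: per-word evidence e_i = log(y_i + 1) - log(n_i + 1).
y_i <- colSums(toy_words[is_spam, , drop = FALSE])   # occurrences in spam
n_i <- colSums(toy_words[!is_spam, , drop = FALSE])  # occurrences in non-spam
e_i <- log(y_i + 1) - log(n_i + 1)

# Assumption: a word contributes its e_i once if it appears in the message.
presence <- (toy_words > 0) * 1
score    <- drop(presence %*% e_i)

# Pick the threshold so that roughly the observed share of messages is
# flagged as spam. Tied scores can make the match only approximate.
threshold <- quantile(score, probs = 1 - mean(is_spam))
nb_pred   <- score > threshold

nb_confusion <- table(actual = is_spam, predicted = nb_pred)
nb_accuracy  <- mean(nb_pred == is_spam)

print(tree_confusion)
print(nb_confusion)
```

With the real data, `toy_words` would be replaced by `wordmatrix` and `is_spam` by the observed labels. Both accuracies here are computed on the same messages used for fitting, so they are optimistic.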