Individual Assignment 2 – Machine Learning
BUMK 758W
Overview
In this assignment, you will work on the reviewer viewership dataset. You will perform classification and
regression analysis using the machine learning methods covered in class. This is an individual assignment.
Do not discuss the problems with other students.
The dataset is provided as a single CSV file, “reviewer_withavg.csv”, available on ELMS.
Preparation
You will use R for this assignment. In order to successfully complete the assignment, please ensure you
understand the R code that accompanies the class lectures. You should have the following packages installed
to run the R code covered in class and/or for your project: class, klaR, e1071, leaps, lars, and MASS.
Classification
First, create a column “Popular” in the dataset. Any reviewer who is Top10, Top50, or Top100 is considered
popular. That is, set Popular to 1 if any of Top10, Top50, or Top100 is 1, and 0 otherwise.
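As a minimal sketch of the rule above, the column can be built with a single logical OR. The small data frame below is an invented stand-in for the real file (which you would load with read.csv), and the name reviewers is only an assumption for illustration.

```r
# Hypothetical stand-in for read.csv("reviewer_withavg.csv"); the values are invented.
reviewers <- data.frame(
  Top10  = c(1, 0, 0, 0),
  Top50  = c(0, 1, 0, 0),
  Top100 = c(0, 0, 0, 1)
)

# Popular is 1 when at least one of the three flags is 1, and 0 otherwise.
reviewers$Popular <- as.integer(
  reviewers$Top10 == 1 | reviewers$Top50 == 1 | reviewers$Top100 == 1
)

print(reviewers)
```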
In this classification assignment, the task is to classify each reviewer as either Popular or not Popular,
using the four reviewer characteristics shown in class: avg_centrality, avg_content, avg_viewership, and
avg_enhconent. Use the first 50 records as the testing dataset and the last 149 records as the training
dataset. Run the classification using three methods: k-NN, Naïve Bayes, and SVM. For k-NN, test every
integer value of k from 1 to 10. For SVM, test both linear and polynomial kernels.
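The split and the k-NN loop described above might be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic (randomly generated with the same column names as the real file), count_errors is a hypothetical helper, and the features are used unscaled. The code shown in class may differ.

```r
library(class)  # provides knn()

# Synthetic stand-in for read.csv("reviewer_withavg.csv"); all values are invented.
set.seed(1)
n <- 199
reviewers <- data.frame(
  avg_centrality = rnorm(n),
  avg_content    = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n)
)
reviewers$Popular <- as.integer(reviewers$avg_viewership + rnorm(n) > 0.5)

features  <- c("avg_centrality", "avg_content", "avg_viewership", "avg_enhconent")
test_idx  <- 1:50   # first 50 records
train_idx <- 51:n   # last 149 records

train_x <- reviewers[train_idx, features]
test_x  <- reviewers[test_idx, features]
train_y <- factor(reviewers$Popular[train_idx], levels = c(0, 1))
test_y  <- factor(reviewers$Popular[test_idx],  levels = c(0, 1))

# Hypothetical helper: number of misclassified records.
count_errors <- function(pred, truth) sum(as.character(pred) != as.character(truth))

# Error counts on the training and testing datasets for k = 1, ..., 10.
errors <- data.frame(k = 1:10, train = NA_integer_, test = NA_integer_)
for (k in errors$k) {
  errors$train[k] <- count_errors(knn(train_x, train_x, cl = train_y, k = k), train_y)
  errors$test[k]  <- count_errors(knn(train_x, test_x,  cl = train_y, k = k), test_y)
}
print(errors)
```

Note that on the training dataset, k = 1 gives zero errors whenever no two records share identical feature values, because each record is its own nearest neighbor.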
Please submit your R code, and answer the following questions:
1. For k-NN, which k value minimizes classification error in the training dataset? What is the error
   count?
2. For k-NN, which k value minimizes classification error in the testing dataset? What is the error
   count?
3. For Naïve Bayes, what is the error count for the testing dataset?
4. For SVM, what are the error counts for the training dataset using linear and polynomial kernels,
   respectively?
5. For SVM, what are the error counts for the testing dataset using linear and polynomial kernels,
   respectively?
6. Which of the above algorithms has the best performance?
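The Naïve Bayes and SVM error counts asked for above could be obtained along the following lines. This is a sketch on synthetic data, assuming the e1071 implementations (naiveBayes and svm) with their default settings. klaR also offers a Naïve Bayes function, and the class code may use different options. The helper count_errors is hypothetical.

```r
library(e1071)  # provides naiveBayes() and svm()

# Synthetic stand-in for the real data; all values are invented.
set.seed(1)
n <- 199
reviewers <- data.frame(
  avg_centrality = rnorm(n),
  avg_content    = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n)
)
reviewers$Popular <- factor(
  as.integer(reviewers$avg_viewership + rnorm(n) > 0.5), levels = c(0, 1)
)

test_df  <- reviewers[1:50, ]   # first 50 records
train_df <- reviewers[51:n, ]   # last 149 records

# Hypothetical helper: number of misclassified records.
count_errors <- function(pred, truth) sum(as.character(pred) != as.character(truth))

# Naive Bayes: fit on the training data, count errors on the testing data.
nb_fit <- naiveBayes(Popular ~ ., data = train_df)
nb_test_errors <- count_errors(predict(nb_fit, test_df), test_df$Popular)

# SVM: one fit per kernel; rows are train/test, columns are the two kernels.
svm_errors <- sapply(c(linear = "linear", polynomial = "polynomial"), function(kern) {
  fit <- svm(Popular ~ ., data = train_df, kernel = kern)
  c(train = count_errors(predict(fit, train_df), train_df$Popular),
    test  = count_errors(predict(fit, test_df),  test_df$Popular))
})

print(nb_test_errors)
print(svm_errors)
```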
Regression
In the regression exercise, we will predict “avg_content” of a reviewer using the other characteristics of the
reviewer: avg_centrality, avg_viewership, avg_enhconent, Top10, Top50, Top100, Advisor, Lead. We will use
the entire dataset to fit the model.
First, fit a standard multiple regression model with avg_content as the dependent variable and the eight
variables listed above as the independent variables. Interpret the result.
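A minimal sketch of this fit with lm is shown below. The data frame is synthetic (the column names follow the assignment, the values are invented), so the printed coefficients say nothing about the real dataset.

```r
# Synthetic stand-in for read.csv("reviewer_withavg.csv"); all values are invented.
set.seed(1)
n <- 199
reviewers <- data.frame(
  avg_centrality = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n),
  Top10   = rbinom(n, 1, 0.1),
  Top50   = rbinom(n, 1, 0.2),
  Top100  = rbinom(n, 1, 0.3),
  Advisor = rbinom(n, 1, 0.2),
  Lead    = rbinom(n, 1, 0.2)
)
reviewers$avg_content <- 1 + 0.8 * reviewers$avg_viewership +
  0.5 * reviewers$Top10 + rnorm(n)

# Multiple regression of avg_content on the eight other characteristics.
fit <- lm(avg_content ~ avg_centrality + avg_viewership + avg_enhconent +
            Top10 + Top50 + Top100 + Advisor + Lead, data = reviewers)

# Coefficient estimates, standard errors, p-values, and R-squared.
print(summary(fit))
```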
Next, we will find out which variables should be included in the regression model for predicting
“avg_content”. Please use both full model search and forward-stepwise regression for this task. The model
selection criterion is Mallows’ Cp. Please answer the following questions:
1. What is the best model according to full model search?
2. What is the best model according to forward-stepwise regression?
3. What variables should be included if we want to use only three independent variables? Is the set the
   same for full model search and for forward-stepwise regression? What variables should be included
   if we want to use five independent variables?
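One way to approach the questions above is with regsubsets from the leaps package, sketched here on synthetic data. It is an assumption that "full model search" corresponds to method = "exhaustive". The helpers vars_of and best_by_cp are hypothetical, and the class code may instead use lars or other settings.

```r
library(leaps)  # provides regsubsets()

# Synthetic stand-in for the real data; all values are invented.
set.seed(1)
n <- 199
reviewers <- data.frame(
  avg_centrality = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n),
  Top10   = rbinom(n, 1, 0.1),
  Top50   = rbinom(n, 1, 0.2),
  Top100  = rbinom(n, 1, 0.3),
  Advisor = rbinom(n, 1, 0.2),
  Lead    = rbinom(n, 1, 0.2)
)
reviewers$avg_content <- 1 + 0.8 * reviewers$avg_viewership +
  0.5 * reviewers$Top10 + rnorm(n)

# Keep only the response and the eight candidate predictors.
predictors <- c("avg_centrality", "avg_viewership", "avg_enhconent",
                "Top10", "Top50", "Top100", "Advisor", "Lead")
model_data <- reviewers[, c("avg_content", predictors)]

full_search <- regsubsets(avg_content ~ ., data = model_data,
                          nvmax = 8, method = "exhaustive")
forward     <- regsubsets(avg_content ~ ., data = model_data,
                          nvmax = 8, method = "forward")

# Hypothetical helper: variables in the selected model with `size` predictors.
vars_of <- function(search, size) {
  chosen <- summary(search)$which[size, ]
  setdiff(names(chosen)[chosen], "(Intercept)")
}

# Hypothetical helper: variables in the model with the smallest Mallows' Cp.
best_by_cp <- function(search) vars_of(search, which.min(summary(search)$cp))

print(best_by_cp(full_search))
print(best_by_cp(forward))
print(vars_of(full_search, 3))
print(vars_of(forward, 3))
print(vars_of(full_search, 5))
print(vars_of(forward, 5))
```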
Deliverable
Please submit your R code and your answers to the questions in a single file (either plain text or a Word
document is fine), with the R code in an appendix.
Due Date
This assignment is due by midnight on Friday, April 21, 2017.
Grading
This assignment is worth 25% of your final grade for the course. Grading is based on completeness and
correctness.