Individual Assignment 2 – Machine Learning

BUMK 758W

Overview

In this assignment, you will work on the reviewer viewership dataset. You will perform classification and

regression analysis using the machine learning methods covered in class. This is an individual assignment.

You are not supposed to discuss the problems with any other students.

The dataset is provided to you as one CSV file, “reviewer_withavg.csv”, available on ELMS.

Preparation

You will use R for this assignment. In order to successfully complete the assignment, please ensure you

understand the R code that accompanies the class lectures. You should have the following packages installed

to run the R code covered in class and/or for your project: class, klaR, e1071, leaps, lars, and MASS.

Classification

First, create a column “Popular” in the dataset. Any reviewer who is Top10, Top50, or Top100 is considered

popular. That is, set Popular to 1 if any of Top10, Top50, or Top100 is 1, and 0 otherwise.
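
The rule above can be sketched in R as follows. This is a minimal illustration on a four-row toy data frame, not the real data; the name `reviewers` is an assumption, and with the actual file the data frame would instead come from something like `read.csv("reviewer_withavg.csv")`.

```r
# Toy stand-in for the real data; only the three indicator columns matter here.
reviewers <- data.frame(
  Top10  = c(1, 0, 0, 0),
  Top50  = c(0, 1, 0, 0),
  Top100 = c(0, 0, 0, 1)
)

# Popular is 1 when at least one of the three indicators equals 1, else 0.
reviewers$Popular <- as.integer(
  reviewers$Top10 == 1 | reviewers$Top50 == 1 | reviewers$Top100 == 1
)

print(reviewers)
```

Storing Popular as an integer keeps it consistent with the other 0/1 columns; it can be converted with `factor()` later when a classifier expects a class label.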

In this classification assignment, our task is to classify each reviewer as either Popular or not Popular, using the

four characteristics of the reviewer as shown in class: avg_centrality, avg_content, avg_viewership,

avg_enhconent. You will use the first 50 records as the testing dataset, and the last 149 records as the

training dataset. You need to run classification using three methods: k-NN, Naïve Bayes, and SVM. For

k-NN, you need to test values for k between 1 and 10. For SVM, you will test both linear and polynomial

kernels.
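
One possible way to organize the train/test split and the k-NN loop is sketched below. It runs on synthetic data generated in the block itself (an assumption, so that the example is self-contained), and the object names `dat`, `features`, and `errors` are hypothetical. It uses `knn()` from the class package listed above.

```r
library(class)  # provides knn()

set.seed(1)
n <- 199  # 50 testing + 149 training rows, as described above

# Synthetic stand-in data (assumption): the real values come from the CSV file.
dat <- data.frame(
  avg_centrality = rnorm(n),
  avg_content    = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n)
)
dat$Popular <- as.integer(
  dat$avg_centrality + dat$avg_viewership + rnorm(n, sd = 0.5) > 0
)

features  <- c("avg_centrality", "avg_content", "avg_viewership", "avg_enhconent")
test_idx  <- 1:50   # first 50 records
train_idx <- 51:n   # last 149 records

train_x <- dat[train_idx, features]
test_x  <- dat[test_idx, features]
# Fix the level set so predicted and true labels can be compared directly.
train_y <- factor(dat$Popular[train_idx], levels = c(0, 1))
test_y  <- factor(dat$Popular[test_idx],  levels = c(0, 1))

# Error counts on the training and testing data for k = 1, ..., 10.
errors <- data.frame(k = 1:10, train_err = NA_integer_, test_err = NA_integer_)
for (k in 1:10) {
  pred_train <- knn(train_x, train_x, cl = train_y, k = k)
  pred_test  <- knn(train_x, test_x,  cl = train_y, k = k)
  errors$train_err[k] <- sum(pred_train != train_y)
  errors$test_err[k]  <- sum(pred_test  != test_y)
}
print(errors)
```

Note that with k = 1 every training point is its own nearest neighbor, so the training error is zero by construction; this is worth keeping in mind when comparing training and testing errors.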

Please submit your R code, and answer the following questions:

• For k-NN, which k value minimizes classification error in the training dataset? What is the error

count?

• For k-NN, which k value minimizes classification error in the testing dataset? What is the error

count?

• For Naïve Bayes, what is the error count for the testing dataset?

• For SVM, what are the error counts for the training dataset using linear and polynomial kernels,

respectively?

• For SVM, what are the error counts for the testing dataset using linear and polynomial kernels,

respectively?

• Which of the above algorithms has the best performance?
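
The Naïve Bayes and SVM error counts asked about above could be computed along the following lines. This is a sketch under assumptions: the data are synthetic, the names `dat`, `train`, `test`, and `svm_errors` are hypothetical, and `naiveBayes()` and `svm()` are taken from the e1071 package (klaR offers an alternative Naïve Bayes implementation). All tuning parameters are left at their defaults.

```r
library(e1071)  # provides naiveBayes() and svm()

set.seed(1)
n <- 199

# Synthetic stand-in data (assumption): the real values come from the CSV file.
dat <- data.frame(
  avg_centrality = rnorm(n),
  avg_content    = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n)
)
dat$Popular <- factor(
  as.integer(dat$avg_centrality + dat$avg_viewership + rnorm(n, sd = 0.5) > 0),
  levels = c(0, 1)
)

test  <- dat[1:50, ]   # first 50 records
train <- dat[51:n, ]   # last 149 records

# Naive Bayes: fit on the training data, count errors on the testing data.
nb_fit      <- naiveBayes(Popular ~ ., data = train)
nb_test_err <- sum(predict(nb_fit, test) != test$Popular)

# SVM: one fit per kernel, with error counts on both datasets.
svm_errors <- sapply(c("linear", "polynomial"), function(kern) {
  fit <- svm(Popular ~ ., data = train, kernel = kern)
  c(train = sum(predict(fit, train) != train$Popular),
    test  = sum(predict(fit, test)  != test$Popular))
})

print(nb_test_err)
print(svm_errors)  # rows: train/test, columns: kernel
```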

Regression

In the regression exercise, we will predict “avg_content” of a reviewer using the other characteristics of the

reviewer: avg_centrality, avg_viewership, avg_enhconent, Top10, Top50, Top100, Advisor, Lead. We will use

the entire dataset to fit the model.

First, fit a standard multiple regression model using avg_content as the dependent variable, and the others as

the independent variables.

• Please fit the model and interpret the result.
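
A minimal sketch of the multiple regression fit with `lm()` follows. The data are simulated (an assumption, including treating Advisor and Lead as 0/1 indicators), so the coefficients printed here say nothing about the real dataset; only the model formula mirrors the variables named above.

```r
set.seed(1)
n <- 199

# Simulated stand-in data (assumption): the real values come from the CSV file.
dat <- data.frame(
  avg_centrality = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n),
  Top10   = rbinom(n, 1, 0.1),
  Top50   = rbinom(n, 1, 0.2),
  Top100  = rbinom(n, 1, 0.3),
  Advisor = rbinom(n, 1, 0.2),
  Lead    = rbinom(n, 1, 0.2)
)
# The response depends on two predictors plus noise, purely for illustration.
dat$avg_content <- 1 + 2 * dat$avg_viewership - dat$avg_centrality + rnorm(n)

# Standard multiple regression with avg_content as the dependent variable.
fit <- lm(avg_content ~ avg_centrality + avg_viewership + avg_enhconent +
            Top10 + Top50 + Top100 + Advisor + Lead, data = dat)

# Coefficient estimates, standard errors, p-values, and R-squared.
print(summary(fit))
```

When interpreting the output, each coefficient is the estimated change in avg_content for a one-unit change in that variable with the others held fixed, and the p-values indicate which effects are distinguishable from zero.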

Next, we will find out which variables should be included in the regression model for predicting

“avg_content”. Please use both full model search and forward-stepwise regression for this task. The model

selection criterion is Mallows’ Cp. Please answer the following questions:

• What is the best model according to full model search?

• What is the best model according to forward-stepwise regression?

• What variables should be included if we want to use only three independent variables? Is the set the

same for full model search and for forward-stepwise regression? What variables should be included

if we want to use five independent variables?
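
The two search strategies can be compared with `regsubsets()` from the leaps package, as in the sketch below. It again uses simulated data (an assumption), and it reads "full model search" as an exhaustive best-subset search, which is also an assumption about the intended method. The object names are hypothetical.

```r
library(leaps)  # provides regsubsets()

set.seed(1)
n <- 199

# Simulated stand-in data (assumption): the real values come from the CSV file.
dat <- data.frame(
  avg_centrality = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n),
  Top10   = rbinom(n, 1, 0.1),
  Top50   = rbinom(n, 1, 0.2),
  Top100  = rbinom(n, 1, 0.3),
  Advisor = rbinom(n, 1, 0.2),
  Lead    = rbinom(n, 1, 0.2)
)
dat$avg_content <- 1 + 2 * dat$avg_viewership - dat$avg_centrality + rnorm(n)

# Best model of each size (1 to 8 predictors) under each search strategy.
full <- regsubsets(avg_content ~ ., data = dat, nvmax = 8, method = "exhaustive")
fwd  <- regsubsets(avg_content ~ ., data = dat, nvmax = 8, method = "forward")

full_sum <- summary(full)
fwd_sum  <- summary(fwd)

# Model size with the smallest Mallows' Cp under each strategy.
best_full <- which.min(full_sum$cp)
best_fwd  <- which.min(fwd_sum$cp)
print(names(coef(full, best_full)))
print(names(coef(fwd, best_fwd)))

# Variables chosen when the model is restricted to three or five predictors.
print(names(coef(full, 3)))
print(names(coef(fwd, 3)))
print(names(coef(full, 5)))
print(names(coef(fwd, 5)))
```

Forward-stepwise selection adds one variable at a time and never removes one, so its three-variable model always contains its two-variable model; the exhaustive search has no such constraint, which is why the two can disagree at a given size.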

Deliverable

Please submit your R code and answers to the questions in a single file (either plain text or a

Word document is fine), putting the R code in the appendix.

Due Date

This assignment is due by midnight Friday, April 21, 2017.

Grading

This assignment is worth 25% of your final grade for the course. Grading will be done based on

completeness and correctness.