Overview
Individual Assignment 2 – Machine Learning BUMK 746
In this assignment, you will perform classification and regression analysis using the machine learning methods covered in class. This is an individual assignment. You are not supposed to discuss the problems with any other students.
The datasets are provided to you in two CSV files, “reviewer_withavg.csv” and “bank-full.csv”, available on ELMS.
Preparation
You will use R for this assignment. In order to successfully complete the assignment, please ensure you understand the R code that accompanies the class lectures. You should have the following packages installed to run the R code covered in class and/or for your project: class, klaR, e1071, rpart, randomForest, leaps, lars, MASS.
Part I: Classification using k-NN, Naïve Bayes, and SVM
In this part of the assignment, you will use the reviewer dataset. First, create a column “Popular” in the dataset. Any reviewer who is Top10, Top50, or Top100 is considered popular. That is, set Popular to 1 if any of Top10, Top50, or Top100 is 1, and 0 otherwise.
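The rule above can be sketched in base R as follows. This is a minimal illustration on a made-up four-row data frame; the name reviewers is an assumption, and only the column names Top10, Top50, Top100, and Popular come from the handout.

```r
# Toy stand-in for the reviewer data (hypothetical values);
# only the three Top* indicator columns matter for this step.
reviewers <- data.frame(
  Top10  = c(1, 0, 0, 0),
  Top50  = c(0, 0, 1, 0),
  Top100 = c(0, 0, 1, 0)
)

# Popular is 1 when at least one indicator equals 1, and 0 otherwise.
reviewers$Popular <- as.integer(
  reviewers$Top10 == 1 | reviewers$Top50 == 1 | reviewers$Top100 == 1
)

print(reviewers)
```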
The task is to classify a reviewer as either Popular or not Popular, using the four characteristics of the reviewer as shown in class: avg_centrality, avg_content, avg_viewership, avg_enhconent. You will use the first 50 records as the testing dataset, and the last 149 records as the training dataset. You will evaluate three classification methods: k-NN, Naïve Bayes, and SVM. For k-NN, you need to test values for k between 1 and 10. For SVM, you will test both linear and polynomial kernels.
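One possible way to set up the split and the k-NN loop is sketched below. It runs on synthetic data, so the numbers it prints say nothing about the real reviewer file; the data-generating rule and the object names (train_x, errors, and so on) are assumptions made for illustration. The knn() function comes from the class package listed under Preparation.

```r
library(class)  # provides knn()

set.seed(1)  # synthetic data and knn() tie-breaking are random
n <- 199     # 50 testing records + 149 training records
reviewers <- data.frame(
  avg_centrality = rnorm(n),
  avg_content    = rnorm(n),
  avg_viewership = rnorm(n),
  avg_enhconent  = rnorm(n)
)
# Made-up labelling rule, for illustration only.
reviewers$Popular <- factor(
  as.integer(reviewers$avg_centrality + reviewers$avg_content > 0)
)

features  <- c("avg_centrality", "avg_content", "avg_viewership", "avg_enhconent")
test_idx  <- 1:50           # first 50 records
train_idx <- (n - 148):n    # last 149 records

train_x <- reviewers[train_idx, features]
test_x  <- reviewers[test_idx, features]
train_y <- reviewers$Popular[train_idx]
test_y  <- reviewers$Popular[test_idx]

# Error counts on the training and testing data for k = 1, ..., 10.
errors <- sapply(1:10, function(k) {
  c(
    train = sum(knn(train_x, train_x, cl = train_y, k = k) != train_y),
    test  = sum(knn(train_x, test_x,  cl = train_y, k = k) != test_y)
  )
})
colnames(errors) <- paste0("k=", 1:10)
print(errors)

best_k_test <- which.min(errors["test", ])  # first k with the fewest test errors
```

Note that which.min() returns only the first minimizer, so ties between k values need to be checked by eye.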
Please answer the following questions:
• For k-NN, which k value minimizes classification error in the training dataset? What is the error count?
• For k-NN, which k value minimizes classification error in the testing dataset? What is the error count?
• For Naïve Bayes, what is the error count for the testing dataset?
• For SVM, what are the error counts for the training dataset using linear and polynomial kernels, respectively?
• For SVM, what are the error counts for the testing dataset using linear and polynomial kernels, respectively?
• Judging by error count, which of the above algorithms has the best performance?
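The error counts asked for above can all be obtained the same way: predict, then count the mismatches. The sketch below shows this for Naïve Bayes and for SVM with both kernels, using naiveBayes() and svm() from the e1071 package listed under Preparation. The data are synthetic, and the column names x1 and x2 and the helper error_count() are hypothetical.

```r
library(e1071)  # provides naiveBayes() and svm()

set.seed(2)  # synthetic data only
n <- 199
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$Popular <- factor(as.integer(d$x1 + d$x2 > 0))  # made-up labelling rule
test  <- d[1:50, ]
train <- d[51:n, ]

# Number of misclassified rows for a fitted model on a given data frame.
error_count <- function(model, data) {
  sum(predict(model, newdata = data) != data$Popular)
}

models <- list(
  naive_bayes = naiveBayes(Popular ~ x1 + x2, data = train),
  svm_linear  = svm(Popular ~ x1 + x2, data = train, kernel = "linear"),
  svm_poly    = svm(Popular ~ x1 + x2, data = train, kernel = "polynomial")
)

errors <- rbind(
  train = sapply(models, error_count, data = train),
  test  = sapply(models, error_count, data = test)
)
print(errors)
```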
Part II: Classification using Tree and Random Forest
In this part of the assignment, you will use the bank dataset. You will use the first 80% of the records as the training dataset, and the remaining 20% as the testing dataset.
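A sequential 80/20 split might look like the sketch below. It uses a ten-row toy data frame in place of the bank file, and rounding the cut point down with floor() is an assumption, since the handout does not say how to round.

```r
bank <- data.frame(id = 1:10)  # toy stand-in for the bank data

n_train <- floor(0.8 * nrow(bank))                 # assumption: round down
train <- bank[seq_len(n_train), , drop = FALSE]    # first 80% of the records
test  <- bank[-seq_len(n_train), , drop = FALSE]   # remaining 20%

c(train = nrow(train), test = nrow(test))
```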
The task is to predict the variable “y” (valued “yes” or “no”), using the following characteristics: duration, month, poutcome, job, education, and marital. Please perform the following tasks and answer the corresponding questions:
• Fit a classification tree to predict y using the list of characteristics specified above. Use cp=0.0001. Please report the confusion matrices and the F1 scores for the training and testing datasets.
• Prune the tree fitted in the previous step. What complexity parameter should you set? Why? (Please show evidence to support your answer.) Please report the confusion matrices and the F1 scores of the pruned tree for the training and testing datasets. How does the predictive performance of the pruned tree compare with that of the original tree?
• Fit a random forest to predict y from the same characteristics, using 50 trees. Please report the confusion matrices and the F1 scores for the training and testing datasets. How does the predictive performance compare with that of the original and the pruned tree? What are the two most important independent variables?
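The tree-related steps above can be sketched with the rpart package as follows. The data are synthetic with only two predictors, and the helpers f1_score() and confusion() are hypothetical names. Two choices here are assumptions rather than requirements of the handout: "yes" is treated as the positive class for F1, and the pruning cp is taken as the row of the cp table with the lowest cross-validated error (xerror), which is one common rule among several.

```r
library(rpart)  # listed under Preparation

set.seed(3)  # synthetic data; rpart's cross-validation is also random
n <- 1000
bank <- data.frame(
  duration = rexp(n, rate = 1 / 250),
  marital  = factor(sample(c("married", "single", "divorced"), n, replace = TRUE))
)
p_yes  <- plogis(-2 + bank$duration / 250)  # made-up relationship
bank$y <- factor(ifelse(runif(n) < p_yes, "yes", "no"), levels = c("no", "yes"))

n_train <- floor(0.8 * n)
train <- bank[seq_len(n_train), ]
test  <- bank[-seq_len(n_train), ]

# Confusion matrix with predictions in rows and actual values in columns.
confusion <- function(model, data) {
  pred <- predict(model, newdata = data, type = "class")
  table(predicted = pred, actual = data$y)
}

# F1 = 2 * precision * recall / (precision + recall) for the positive class.
f1_score <- function(cm, positive = "yes") {
  tp <- cm[positive, positive]
  if (tp == 0) return(0)  # avoids 0/0 when no positives are predicted correctly
  precision <- tp / sum(cm[positive, ])
  recall    <- tp / sum(cm[, positive])
  2 * precision * recall / (precision + recall)
}

full_tree <- rpart(y ~ duration + marital, data = train, method = "class",
                   control = rpart.control(cp = 0.0001))

# cptable lists the cross-validated error (xerror) for each candidate cp.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(full_tree, cp = best_cp)

print(cp_table)
print(confusion(full_tree, test))
print(confusion(pruned, test))
f1 <- c(full = f1_score(confusion(full_tree, test)),
        pruned = f1_score(confusion(pruned, test)))
print(f1)
```

The same confusion() and f1_score() helpers apply to a random forest. In the randomForest package, randomForest() accepts ntree = 50 and importance() reports variable importance, although for that model the class predictions come from predict() without type = "class".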
Part III: Regression
In this part of the assignment, you will use the reviewer dataset to predict “avg_content” of a reviewer using the following characteristics of the reviewer: avg_centrality, avg_viewership, avg_enhconent, Top10, Top50, Top100, Advisor, Lead. We will use the entire dataset to fit the model.
The task is to find out which variables should be included in the regression model for predicting “avg_content”. Please use both full model search and forward-stepwise regression for this task. The model selection criterion is Mallows’ Cp. Please answer the following questions:
• What is the best model according to full model search?
• What is the best model according to forward-stepwise regression?
• What variables should be included if we want to use only three independent variables? Is the set the same for full model search and for forward-stepwise regression? What variables should be included if we want to use five independent variables?
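To make the selection criterion concrete, the sketch below computes Mallows’ Cp by hand in base R and applies it to both search strategies. It uses synthetic data with four hypothetical predictors x1 to x4, and the helper names rss() and mallows_cp() are assumptions. Cp is taken as RSS_p / sigma2 - n + 2p, where p counts the intercept and sigma2 is the residual variance of the full model. This is meant to show what the criterion measures; the class code may instead rely on the packages listed under Preparation.

```r
set.seed(4)  # synthetic data only
n <- 120
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$avg_content <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)  # x3 and x4 are pure noise
predictors <- c("x1", "x2", "x3", "x4")

# Residual sum of squares of the linear model using the given variables.
rss <- function(vars) {
  fit <- lm(reformulate(vars, response = "avg_content"), data = d)
  sum(resid(fit)^2)
}

# Error variance estimated from the full model.
sigma2 <- rss(predictors) / (n - length(predictors) - 1)

# Mallows' Cp; p = number of variables + 1 for the intercept.
mallows_cp <- function(vars) rss(vars) / sigma2 - n + 2 * (length(vars) + 1)

# Full model search: evaluate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)
cp_all    <- sapply(subsets, mallows_cp)
best_full <- subsets[[which.min(cp_all)]]

# Forward stepwise: at each step add the variable that lowers RSS the most,
# then pick the model size along that path with the smallest Cp.
selected <- character(0)
path <- list()
while (length(selected) < length(predictors)) {
  remaining <- setdiff(predictors, selected)
  rss_try   <- sapply(remaining, function(v) rss(c(selected, v)))
  selected  <- c(selected, remaining[which.min(rss_try)])
  path[[length(selected)]] <- selected
}
cp_path      <- sapply(path, mallows_cp)
best_forward <- path[[which.min(cp_path)]]

print(best_full)
print(best_forward)
print(path[[3]])  # the three-variable model on the forward path
```

The two searches can disagree at a given model size because forward stepwise never drops a variable once it has been added. In the leaps package, regsubsets() with method = "exhaustive" or method = "forward" performs these searches, and summary() of the result includes a cp component.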
Deliverable
Please submit your R code and answers to the questions. Please submit them in a text file (either plain text or Word document is fine), putting the R code in the appendix.
Due Date