BU510.650 – Data Analytics

Assignment # 4

Please submit two documents: Your answers to each part of every question in .pdf or .doc format, and
your R script, in .R format. In your document with answers, please do *not* respond with R output only.
While it is okay to include R output in that document, please make sure you spell out the response to
the question asked. Please submit your assignment through Blackboard and name your files using the
convention LastName_FirstName_AssignmentNumber. For example, Yazdi_Mohammad_4.pdf and
Yazdi_Mohammad_4.R.

For question 1, please watch the class recording "Decision Tree in R".
For question 2, please watch the class recording "KNN in R".

1. In this question, you will estimate a decision tree for the AutoLoss data. The data file for this
question, AutoLoss-DT.csv, is slightly different from the data file in Assignment 2. In particular,
instead of the actual loss amount for each vehicle, it has a column called HighLoss, which indicates
whether the loss is high (“Yes”) or low (“No”) for each vehicle. Our goal is to create a decision tree
that predicts whether the loss for a vehicle will be high or low.

To begin your work on this question, run the following two lines of code: The first one replaces ?s
with NA while reading the data from the .csv file, and the second one removes all the observations
with any NA.

    AutoLoss <- read.csv("AutoLoss-DT.csv", na.strings = "?", stringsAsFactors = TRUE)
    AutoLoss <- na.omit(AutoLoss)

**Please include set.seed(5) once at the beginning of your code, so we all get the same results.**

a) Fit a decision tree to the entire data, with HighLoss as the response and all other variables as
predictors. Plot the tree (including the names of predictors in the plot) and answer the following
questions: Which predictors are used at the nodes of the tree? How many terminal nodes (leaves) does
the tree have?

b) Determine the best tree size, using cross-validation and pruning. (See how we accomplished this in
TASK 7 of Carseats example.) Plot the tree you obtained (including the names of predictors in the
plot).

c) Use the best tree to answer the following question (you do not need to use R for this): Suppose my
car fits the description shown below. Will this car incur a high loss or not?

    FuelType  Aspiration  NumDoors  BodyStyle  DriveWheels  Length  Width  Height
    gas       std         two       wagon      4wd          160     70     60

    Weight  EngineSize  Horsepower  PeakRPM  Citympg  Price
    3423    122         241         5000     26       23000

Class recording links:
https://jhucarey.zoom.us/rec/share/HRqUolOnnf8tMqM47Uu0sTt1192gZQ9SRL_T-9WI0xKA6m-48OOKJV7Dv1nv05Ic.YQSibK9DmvMA7ktF
https://jhucarey.zoom.us/rec/share/tfIWCKjCpxYIiSY2PXk52H17c41I4uvhEw84COnvstNHqA_w51KdIochtgjXKGj8.Xya88ePmSADAN2Ef?startTime=1668713491000

2. In this question, you will use the K-Nearest Neighbors (KNN) algorithm to predict whether a
passenger will survive or not. To begin your work on this question, first read the data from the file
"TitanicforKNN.csv" to a data frame named Titanic.

**Note: Please review the data before proceeding. You will notice that I already converted all the
categorical variables (Gender, Fare, Class) into 0-1 columns. I did so, because KNN does not work well
with non-numeric variables.**

Next, split the data into training data and test data, using random selection. Include half of the
records in the training data and the rest in the test data.
You learned how to do this using the sample function; see Task 3 in Carseats-DecisionTree.R for a
related example. (**Remember to include set.seed(1) before the random selection in your code, so we
all end up making the same split.**)

(a) Run the KNN algorithm to predict the response variable Survived for each passenger in the test
data. Do this for K = 2, 4, and 6. According to these predictions for K = 2, 4, and 6, what is the
proportion of passengers in the test data that will survive?

R Hints: To run the function knn(), recall that you need four inputs:
    (i) a matrix that contains the values of predictors in the training data,
    (ii) a matrix that contains the values of predictors in the test data,
    (iii) a vector containing the values of the response (Survived) in the training data,
    (iv) a value for K.
To obtain (i), remove the Survived column from the training data. To obtain (ii), remove the Survived
column from the test data. To obtain (iii), create a vector that stores the values of the Survived
column in the training data. See Smarket-KNN.R for a related example.

(b) For each K, compute the accuracy of predictions for the test data. Which K works best in this
case?
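
The four knn() inputs described in the R Hints can be sketched as follows. This is only a minimal
illustration under stated assumptions, not a solution to the question: the data frame toy is made-up
random data standing in for the Titanic file, and the names toy, x1, x2, train_idx, train_X, test_X,
train_y, and results are hypothetical. Only knn() (from the class package), sample(), set.seed(1), and
the column name Survived come from the assignment itself.

```r
library(class)  # knn() lives in the 'class' package

set.seed(1)  # fix the random numbers so the split is reproducible

# Made-up stand-in data (NOT the Titanic file): two numeric predictors
# and a 0/1 response called Survived.
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$Survived <- as.integer(toy$x1 + toy$x2 + rnorm(n, sd = 0.5) > 0)

# Random half/half split with sample()
train_idx <- sample(1:nrow(toy), nrow(toy) / 2)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Inputs (i)-(iii): drop Survived from the predictors, keep it as the response
train_X <- train[, names(train) != "Survived", drop = FALSE]
test_X  <- test[, names(test) != "Survived", drop = FALSE]
train_y <- train$Survived

# Input (iv): try several values of K and record two summaries for each
results <- data.frame(K = c(2, 4, 6), prop_survive = NA_real_, accuracy = NA_real_)
for (i in seq_len(nrow(results))) {
  pred <- knn(train_X, test_X, train_y, k = results$K[i])
  # Share of test rows predicted as Survived = 1
  results$prop_survive[i] <- mean(pred == "1")
  # Share of test rows where the prediction matches the actual value
  results$accuracy[i] <- mean(pred == as.character(test$Survived))
}
print(results)
```

With the real data, the same steps would presumably start from the Titanic data frame read from
"TitanicforKNN.csv" in place of toy. The numbers printed for the toy data say nothing about the
Titanic results.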