Final Examination

Problem 1. Use the final-p1.csv dataset for this problem.
The given dataset has seven 2-dimensional objects and they are divided into two clusters. In the dataset, the first column is the object name, the second and the third columns are two attributes, and the last column shows the cluster of each object.
Suppose that you are running the k-means algorithm on the dataset. After a certain number of iterations, you have two clusters as shown in the given dataset.
Run one more iteration of the k-means algorithm and show the two clusters at the end of the iteration. You must show which objects belong to each cluster at the end of the iteration. Use Manhattan distance when calculating the distance between objects. You must show all intermediate steps and calculations.
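The iteration described above can be sketched in code as a way to check a hand calculation. This is a minimal sketch under stated assumptions: the objects and starting clusters below are made-up toy values (not the contents of final-p1.csv), each centroid is taken to be the mean of its current members, and ties are broken in favor of the cluster listed first. The function names are hypothetical.

```python
# One k-means iteration with Manhattan distance.
# Toy data only: these are NOT the values in final-p1.csv.

def manhattan(p, q):
    """Manhattan (L1) distance between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def kmeans_step(objects, assignment):
    """objects: name -> (x, y); assignment: name -> cluster label."""
    # Step 1: recompute each centroid from the current cluster members.
    clusters = {}
    for name, label in assignment.items():
        clusters.setdefault(label, []).append(objects[name])
    centroids = {label: centroid(pts) for label, pts in clusters.items()}
    # Step 2: reassign every object to its nearest centroid.
    new_assignment = {}
    for name, point in objects.items():
        new_assignment[name] = min(
            centroids, key=lambda c: manhattan(point, centroids[c]))
    return centroids, new_assignment

objects = {"a": (1, 1), "b": (2, 1), "c": (4, 3), "d": (5, 4), "e": (2, 2)}
assignment = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2}
centroids, new_assignment = kmeans_step(objects, assignment)
print(centroids)
print(new_assignment)
```

In this toy example, object e starts in cluster 2 but is closer to the recomputed centroid of cluster 1, so it changes cluster.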
Problem 2. Use the final-p2.csv dataset for this problem.
The given dataset has ten 2-dimensional objects and the dataset is used for classification. In the dataset, the first two columns are attributes and the last column is the class attribute.
Problem 2-(1). Classify a new object O1 = using the KNN algorithm that we discussed in class with K = 3.
Problem 2-(2). Classify a new object O2 = using the KNN algorithm that we discussed in class with K = 5.
You must show all intermediate steps and calculations.
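A KNN classification of this kind can be sketched as follows. The exam does not state which distance measure or voting rule was used in class, so this sketch assumes Euclidean distance and a simple majority vote. The training objects and the query point are made-up toy values, not the contents of final-p2.csv, and `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter

def knn_classify(training, query, k):
    """training: list of ((x, y), label) pairs.
    Returns the predicted label and the k nearest neighbors.
    Assumes Euclidean distance and majority voting."""
    ranked = sorted(training, key=lambda row: math.dist(row[0], query))
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0], neighbors

# Toy data only: these are NOT the values in final-p2.csv.
training = [((1, 1), "A"), ((2, 2), "A"), ((3, 1), "A"),
            ((6, 5), "B"), ((7, 7), "B"), ((6, 7), "B")]
label, neighbors = knn_classify(training, (3, 2), k=3)
print(label, neighbors)
```

Listing the sorted distances, as the `neighbors` value does, mirrors the intermediate steps a written answer would show.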
Problem 3. Use the final-p3.csv dataset for this problem.
Problem 3-(1). From the given dataset, mine all frequent itemsets using the Apriori algorithm. Show all candidate itemsets and frequent itemsets. You should follow the process described in the book and lecture (i.e., C1 → L1 → C2 → L2 → …). Assume that the minimum support = 60% (or 3 or more transactions).
Problem 3-(2). Sort all frequent 3-itemsets by their item number. Then, select the first frequent 3-itemset from the sorted list of frequent 3-itemsets and mine all strong rules from this itemset. Assume that the minimum confidence = 80%.
You must show all intermediate steps and calculations.
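The C1 → L1 → C2 → L2 → … process and the rule-mining step can be sketched as below. This is an illustrative sketch, not the book's exact procedure: the join step here takes unions of frequent k-itemsets rather than the prefix-based join, relying on the prune step to discard the extra candidates. The transactions are made-up toy values (not final-p3.csv), and the function names are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {k: {itemset: support count}} for all frequent k-itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]               # C1
    frequent, k = {}, 1
    while candidates:
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Lk
        if not level:
            break
        frequent[k] = level
        # Join: unions of two frequent k-itemsets that have k + 1 items.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k))]     # C(k+1)
        k += 1
    return frequent

def strong_rules(itemset, frequent, min_conf):
    """Rules X -> Y with X ∪ Y = itemset and confidence >= min_conf."""
    support = {s: n for level in frequent.values() for s, n in level.items()}
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            conf = support[itemset] / support[frozenset(lhs)]
            if conf >= min_conf:
                rules.append((set(lhs), set(itemset) - set(lhs), conf))
    return rules

# Toy data only: these are NOT the transactions in final-p3.csv.
transactions = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3}, {2, 4}, {1, 4}]
frequent = apriori(transactions, min_count=3)
first = min(frequent[3], key=sorted)   # first 3-itemset in item-number order
rules = strong_rules(first, frequent, min_conf=0.8)
print(frequent)
print(rules)
```

The confidence of a rule X → Y is computed as support(X ∪ Y) / support(X), which is why only the support counts gathered during the Apriori pass are needed.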

Problem 5. Use the final-p5.csv file for this problem.
The given file has a table that represents the test result of a classifier on a certain dataset.
Problem 5-(1). For each row, compute TP, FP, TN, FN, TPR, and FPR.
Problem 5-(2). Plot the ROC curve for the dataset. You must draw the curve yourself. You must not use Weka, R, or other data mining or data analysis software to generate the curve. However, you may use spreadsheet software.
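The per-row counts and ROC points can be sketched as below. The layout of final-p5.csv is not described here, so this sketch assumes the common textbook layout: tuples sorted by decreasing classifier score, with row i treating the top i tuples as predicted positive. The labels are made-up toy values and `roc_rows` is a hypothetical name.

```python
def roc_rows(actual):
    """actual: true labels ('P' or 'N'), sorted by decreasing classifier score.
    Row i (1-based) treats the top i tuples as predicted positive."""
    total_p = actual.count("P")
    total_n = actual.count("N")
    rows, tp, fp = [], 0, 0
    for label in actual:
        if label == "P":
            tp += 1          # a positive tuple now predicted positive
        else:
            fp += 1          # a negative tuple now predicted positive
        fn, tn = total_p - tp, total_n - fp
        rows.append({"TP": tp, "FP": fp, "TN": tn, "FN": fn,
                     "TPR": tp / total_p, "FPR": fp / total_n})
    return rows

# Toy data only: these are NOT the values in final-p5.csv.
actual = ["P", "P", "N", "P", "N", "N"]
rows = roc_rows(actual)
# ROC points are (FPR, TPR) pairs, starting from the origin.
points = [(0.0, 0.0)] + [(r["FPR"], r["TPR"]) for r in rows]
for r in rows:
    print(r)
```

The `points` list can be pasted into a spreadsheet and drawn as a scatter plot with connecting lines.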

Problem 7. Use the final-p7.csv dataset for this problem. The dataset has four nominal attributes.
Problem 7-(1). Construct the contingency table for two attributes A3 and Class. You may use any tool for this.
Problem 7-(2). Perform Pearson’s chi-square test and determine whether the two attributes are correlated. Use a significance level of 5%. You must not use any data mining or data analysis tool for this problem. You must do all calculations yourself and you must show all intermediate steps and calculations.
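The arithmetic behind the chi-square statistic can be sketched as follows, purely as an illustration of the steps (the problem itself requires a hand calculation). The observed counts are a made-up 2×2 table, not the A3/Class table from final-p7.csv, and `chi_square` is a hypothetical name. Each expected count is (row total × column total) / n, and the degrees of freedom are (rows − 1) × (columns − 1).

```python
def chi_square(table):
    """table: list of rows of observed counts.
    Returns (chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Toy data only: this is NOT the contingency table from final-p7.csv.
observed = [[30, 10],
            [20, 40]]
stat, dof = chi_square(observed)
print(stat, dof)
```

For this toy table the statistic is about 16.67 with 1 degree of freedom, which exceeds the 5% critical value of 3.841 for 1 degree of freedom, so independence would be rejected. For a different table size, the critical value has to be looked up for the corresponding degrees of freedom.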
Problem 8. Use the final-p8.csv dataset for this problem.
The given dataset has five 2-dimensional objects and they are divided into two clusters. In the dataset, the first column is the object name, the second and the third columns are two attributes, and the last column shows the cluster of each object.
Calculate the distance between the two clusters using Ward’s distance. Use Euclidean distance when calculating the distance between objects. You must show all intermediate steps and calculations.
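Ward's distance can be sketched as below, assuming the usual definition: the increase in the sum of squared Euclidean distances to the centroid (SSE) that results from merging the two clusters. If the course used a different scaling, the numbers will differ accordingly. The points are made-up toy values (not final-p8.csv), and the function names are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in points)

def ward_distance(a, b):
    """Increase in SSE caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)

def ward_closed_form(a, b):
    """Equivalent form: |A||B| / (|A| + |B|) * ||c_A - c_B||^2."""
    ca, cb = centroid(a), centroid(b)
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq

# Toy data only: these are NOT the objects in final-p8.csv.
cluster_a = [(1, 1), (3, 1)]
cluster_b = [(5, 4), (5, 6), (8, 5)]
print(ward_distance(cluster_a, cluster_b))
print(ward_closed_form(cluster_a, cluster_b))
```

Both functions give the same value, so the closed form offers a quick cross-check of the longer SSE calculation.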
Problem 9. Use the final-p9.csv dataset for this problem.

Suppose you built two classifier models M1 and M2 from the same training dataset and tested them on the same test dataset using 10-fold cross-validation. The error rates obtained over 10 iterations (in each iteration the same training and test partitions were used for both M1 and M2) are given in the dataset. Determine whether there is a significant difference between the two models using the statistical method that we discussed in class. Use a significance level of 5%. If there is a significant difference, which one is better? You must show all intermediate steps and calculations.

Note: When you calculate var(M1 – M2), calculate a sample variance (not a population variance).
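The exam does not name the statistical method. A common choice for comparing two models over the same cross-validation folds is the paired t-test, and the sketch below assumes that is the intended method. It computes t = mean(d) / sqrt(var(d) / k), where d is the list of per-fold differences M1 − M2, k = 10, and var is the sample variance as the note requires. The error rates are made-up toy percentages (not final-p9.csv), and `paired_t` is a hypothetical name.

```python
import math
from statistics import mean, variance  # variance() divides by n - 1 (sample)

def paired_t(err1, err2):
    """Paired t statistic for per-fold error rates of two models.
    Returns (t, degrees of freedom)."""
    diffs = [a - b for a, b in zip(err1, err2)]
    k = len(diffs)
    t = mean(diffs) / math.sqrt(variance(diffs) / k)
    return t, k - 1

# Toy data only (error rates in percent): NOT the values in final-p9.csv.
m1 = [30.0, 32.0, 28.0, 31.0, 29.0, 33.0, 30.0, 31.0, 29.0, 32.0]
m2 = [27.0, 30.0, 27.0, 28.0, 28.0, 30.0, 28.0, 29.0, 27.0, 29.0]
t, dof = paired_t(m1, m2)

# Two-sided critical value at the 5% level for 9 degrees of freedom.
T_CRITICAL = 2.262
significant = abs(t) > T_CRITICAL
print(t, dof, significant)
```

With these toy values, t is about 8.82 on 9 degrees of freedom, so the difference is significant. Since the differences M1 − M2 are positive on average, M2 has the lower error rate and would be judged the better model.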
Problem 10. Use the final-p10.csv dataset for this problem.
This dataset is a time series that shows the price of a certain item collected over 15 years (year 2000 through year 2014). You want to forecast the next year’s price based on the prices of previous years.
Suppose that you forecast the price of a certain year t, based on the prices of the previous 2 years, using the following formula:

$$\widehat{\text{price}}_t = 1.03 \times \frac{\text{price}_{t-1} + \text{price}_{t-2}}{2}$$

Here, the subscript t denotes a year. For example, the forecast price of 2020 is calculated as:
price of 2020 = 1.03*((price of 2019) + (price of 2018))/2
Calculate the MSE of this forecast model. You must show all intermediate steps and calculations.
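The MSE calculation can be sketched as below, using the forecast rule from the 2020 example above. The sketch assumes that the first two years have no forecast because they lack two prior prices, so a 15-year series yields 13 squared errors, and that MSE is the mean of those squared errors. The prices are made-up toy values (not final-p10.csv), and `forecast_mse` is a hypothetical name.

```python
def forecast_mse(prices, growth=1.03):
    """prices: yearly prices in chronological order.
    Forecast for year t is growth * (price[t-1] + price[t-2]) / 2."""
    squared_errors = []
    for t in range(2, len(prices)):       # first two years have no forecast
        forecast = growth * (prices[t - 1] + prices[t - 2]) / 2
        squared_errors.append((prices[t] - forecast) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Toy data only: these are NOT the prices in final-p10.csv.
prices = [100.0, 100.0, 103.0, 110.0]
mse = forecast_mse(prices)
print(mse)
```

Here the third price is forecast almost exactly (103.0) and the fourth is forecast as 104.545 against an actual 110.0, giving an MSE of about 14.88 over the two forecasts.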