程序代写代做代考 Hive algorithm go graph decision tree data mining Data Mining (CSC 503/SENG 474)

Data Mining (CSC 503/SENG 474)
Assignment 1 Due on Friday, June 26th, 11:55pm
Instructions:
• You must complete this assignment on your own; this includes any coding/implementing, running of experiments, generating plots, analyzing results, writing up results, and working out problems. Assignments are for developing skills to make you strong. If you do the assignments well, you’ll almost surely do better on the midterm and final.
• On the other hand, you can certainly have high-level discussions with classmates about course material. You also are welcome to come to office hours (or otherwise somehow ask me and the TAs things if you are stuck).
• You must type up your analysis and solutions; I strongly encourage you to use LaTeX to do this. LaTeX is great both for presenting figures and math.
• Please submit your solutions via conneX by the due date/time indicated above. This is a hard deadline. Any late assignment will result in a zero (I won’t accept assignments even one day late; “I forgot to hit submit on conneX”, etc. will not be met with any sympathy). However, I will make an exception if I am notified prior to the deadline with an acceptable excuse and if you further can (soon thereafter) provide a signed note related to this excuse.
1

This assignment has two parts. The first part involves implementing some of the methods that you have already seen and analyzing their results on some classification problems; this part is for all students (both CSC 503 and SENG 474). The second part is actually only for graduate students (only CSC 503). The first part might seem like a lot of work. However, if you do it well, you’ll become strong, and also gain valuable skills for your project. This is a good thing.
1 Experiments and Analysis
First, “implement” the following methods:
• Decision trees (with pruning). For simplicity, I suggest going with reduced error pruning (this is the form of pruning we covered in class). For the split criterion, you can use informa- tion gain or the Gini index or some other good criterion. You might even try a few different ones as part of your analysis (explained further below).
• Random forests (no pruning). What forest size should you use? Well, you should exper-
iment with different forest sizes and see what happens. For the size of the random sample
(sampled with replacement) used to learn each tree, I suggest setting the number of samples
with replacement to be equal to the original sample size. For the number of random features
d (for d features) • Neural networks. Any number of layers is fine, as long at there is at least one hidden layer;
when selecting the split feature at each decision node, I suggest starting at and experimenting by going up and down from there.
I suggest going with 3 layers (i.e. the input layer, 1 hidden layer, and the output layer).
I put “implement” in quotes because I won’t actually require you to implement these methods; you can instead use machine learning software that you find online. If you do implement anything, you can use whatever programming language you like. For neural networks in particular, I shame- fully recommend using an existing implementation here, but eventually (after this assignment) for your own edification it would be great if you implement a neural network yourself.
What to test on
You’ll be analyzing the performance of these methods on two classification problems.
For the first problem, you’ll be using the heart disease dataset from the UCI repository:
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
In order to make your life a bit easier, I have provided a slightly cleaned and modified dataset, cleaned_processed.cleveland.data. In case you’re interested, you can find details on how I prepared this dataset in Appendix A. The last attribute/feature (the label) takes values 0 and 1, where 0 indicates no heart disease and 1 indicates some level of heart disease. Keep in mind that the dataset has not been split into a training and test set. You should do this. For any training/test split, a good rule of thumb is 80% for the training set and 20% for the test set.
The second problem will actually be designed by you. It will again be a classification problem, meaning that it consists of a set of training examples and a set of test examples for a classification task. For simplicity, I suggest going with a binary classification task. You can either generate the data yourself or find some data online. The important thing is that the problem should be interesting: that is to say, the learning task should not be so easy that every method can obtain zero test error, but it should be easy enough where some method is able to obtain a reasonably small test error. What does “reasonably small test error” mean? Well, for binary classification, this would be a test error well below 50%. Why? Because a classifier that just randomly predicts with equal probability will always get close to 50% test error.
2

How to do the analysis
The core of this assignment, meaning what is worth the most marks, is the experiment analysis. Your analysis will go into a file called report.pdf. This file should:
• contain a description of the second classification problem (the one you came up with or selected), including an explanation of why the problem is interesting; specifically, explain why it is non-trivial and why it enables good comparison of the methods.
• present the performance of the methods in terms of the training and test error on each of the problems. In particular, you should present graphs that show how each of training error and test error vary with training set size. But beyond that, you should also play with the parameters of the methods to see the effect. For instance, what happens when you change the learning rate (or number of nodes in the hidden layer, or the number of iterations of training) for the neural network? What happens if you change the number of random features used for the random forest? What happens if you change the pruning rule (or use a different split criterion) for the decision tree? As much as possible, try to present your results with plots.
• contain a detailed analysis of your results, including various design choices you made (like choice of split criterion for decision trees, choice of nonlinearity for neural networks, etc.). Try and explain the results that you got for the various experiments you ran, and use these results to compare the methods. Think about the best parameter settings for the methods (and maybe think about how the best parameter setting might change as the training sample size increases). Ideally, use your analysis to come up with ideas on how you might improve the methods.
Please make your analysis concise (yet thorough). Don’t ramble, and don’t make stuff up. Act as if you are a respected scientist presenting some work to your colleagues (and you want them to continue being your colleagues after the presentation).
What to submit
In all, for the analysis portion of the assignment, you should submit (as a zip file):
• the file report.pdf explained above;
• a file called README.txt which contains instructions for running your code (for any code you wrote) or gives proper attribution to any code/datasets you got from other sources (like the Internet, for example). If you mostly used some existing code/software but needed to modify a few files, then give attribution and mention the files you modified.
• a file called code.zip, containing your code. Please organize this file well (embrace the idea of directories). In case you mostly used some existing code/software but needed to modify a few files, then just provide those files here, and make sure your README.txt mentions those files in relation to the existing code/software.
• any additional files that you need; in particular, this should include the training and test sets for the second classification problem in case you came up with your own problem.
3

2 Problem-solving part – CSC 503 only
In this problem, we consider the gradient descent algorithm for a two-dimensional regression prob- lem. So, we have 2 continuous features x1 and x2, and the task is to predict a continuous target y.
􏰀x 􏰁
Suppose that for a given input xi = 1,i , we predict using hypotheses of the following form:
x2,i
fw(xi)=w1x1,i +w2x2,i +w3x1,ix2,i +b.
Assume that we have n training examples (x1, y1), . . . , (xn, yn). Suppose that we measure the error according to the squared error with a further penalty on the squared Euclidean norm of w. Then, for a fixed, positive number λ, the training error can be written as
1􏰂n λ
(yi − fw(xi))2 + 2 ∥w∥2 = 2n (yi − fw(xi))2 + 2 wj2.
i=1 j=1
In the below, assume the learning rate (also called the step size) is some positive number η.
E(w) = 2n
1 􏰂n λ 􏰂d
i=1
3

• •
(a) Derive the gradient descent update for b. (b) Derive the gradient descent update for w1. (c) Derive the gradient descent update for w3.
(d) Show the full gradient descent update for the vector w. How you’ll be marked
For each of CSC 503 and SENG 474, the total number of marks is 100, and you receive 10 marks for your code (in case you write code) together with the classification problem that you design.
For undergrads (SENG 474), the analysis (in report.pdf) is worth 90 marks.
For grad students (CSC 503), the analysis (in report.pdf) is worth 80 marks and the
Problem-solving part is worth 10 marks.
4

A Cleaned version of processed.cleveland.data
In order to obtain the provided cleaned version of the data, I started from the original dataset processed.cleveland.data, which can be found in the Data Folder from the link above. You should not use this dataset!
In the original processed.cleveland.data dataset, some of the 6 examples have one feature with the value ‘?’. This is a missing value. There are various ways to deal with missing values, but an easy way (when not many values are missing) is to simply discard those examples; this is what I did. Also, the last attribute/feature (the label) takes values {0, 1, 2, 3, 4}. I formed a binary classification task by:
• keeping the label 0 as is;
• grouping together the labels in the set {1, 2, 3, 4} by changing their label to 1. Note that all
of these non-zero labels indicate some level of heart disease.
5