3rd Assignment COMPX523-22A [v4]
Gomes, May 2022
1 Overall Description
This assignment is worth 30% of the final score and has only one part. The goal of this assignment is to perform and analyse experiments involving data stream classification and exploration.
Important information:
• Deadline: See Moodle
• There is no coding involved in this assignment; you are encouraged to use MOA to perform your experiments.1
• You might want to watch the lecture from 05-05-2022, where feature scoring was discussed.2
• Your submission must contain at least one PDF file containing your report. Optionally, you can also submit other analyses and a Jupyter notebook. However, notice that a Jupyter notebook does not replace the PDF report.
Throughout this assignment, there are (Optional) questions and requests. These give no extra credit and do not influence the final score. They are challenges that may go beyond what was taught in the course or require more coding and analysis effort.
1 You can use river as well, but some algorithms are not available in river yet.
2 Precisely, the example about how to use the Feature Analysis tab starts around minute 54.
2 Evaluation and Analysis
Remember to address all the guiding questions in your report. Some experiments will require visualisations that can be created using the Feature Analysis tab in MOA.
Evaluation framework. For Experiment 1, you will use a test-then-train evaluation approach and report the final result obtained, i.e. the result obtained after testing on all instances. For some of the (Optional) questions in Experiment 2 you might want to use a prequential evaluation.
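If you choose river instead of the MOA GUI, a test-then-train loop can be written directly. Below is a minimal sketch, using river's bundled Phishing dataset as a stand-in for the assignment data; note that the prediction step always precedes the learning step:

    from river import tree, metrics, datasets

    model = tree.HoeffdingTreeClassifier()
    acc = metrics.Accuracy()

    for x, y in datasets.Phishing():   # stand-in stream; use the assignment data
        y_pred = model.predict_one(x)  # test first...
        if y_pred is not None:         # the very first prediction may be None
            acc.update(y, y_pred)
        model.learn_one(x, y)          # ...then train on the same instance

    print(acc)  # final accuracy after testing on all instances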
Metrics. For Experiment 1, report accuracy and recall (per class). Experiment 2 mostly relies on visualisations, and you won't need to report the actual values of the feature importances.
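In river, per-class recall can be read off a classification report. The sketch below uses toy labels purely to show the update mechanics; in the assignment, the updates would sit inside the test-then-train loop above, and the label names here are hypothetical:

    from river import metrics

    acc = metrics.Accuracy()
    report = metrics.ClassificationReport()  # precision/recall/F1 per class

    # Toy labels only, to illustrate the update pattern.
    y_true = ["high", "high", "low", "low", "low"]
    y_pred = ["high", "low", "low", "low", "high"]

    for yt, yp in zip(y_true, y_pred):
        acc.update(yt, yp)
        report.update(yt, yp)

    print(acc)     # overall accuracy
    print(report)  # read the Recall column for each class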
Dataset. You only need the "house elec hour" data. Refer to the attached materials for the ".arff" and ".csv" versions.
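If you work in river, the ".csv" version can be streamed with iter_csv. A minimal loading sketch follows; the file name and target column below are guesses, so check the actual header of the attached file:

    from river import stream

    data = stream.iter_csv(
        "house_elec_hour.csv",  # hypothetical file name; use the attached file
        target="class",         # assumption: the label column is named "class"
    )

    for x, y in data:
        # Values arrive as strings; pass converters={column: float, ...}
        # to iter_csv if you want numeric features.
        print(x, y)
        break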
2.1 Experiment 1: Benchmark of classification algorithms
Perform experiments using the following algorithms: Hoeffding Tree, Naive Bayes, Streaming Random Patches (SRP), and Adaptive Random Forest (ARF). Use the default values for the hyperparameters of these algorithms (exception: use number of learners = 10 for the ARF and SRP algorithms).
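In MOA, these settings are configured through the GUI. If you run the benchmark in river instead, a possible configuration sketch is below; module paths vary slightly across river versions, so treat the import locations as assumptions:

    from river import tree, naive_bayes, ensemble, forest

    models = {
        "Hoeffding Tree": tree.HoeffdingTreeClassifier(),
        "Naive Bayes": naive_bayes.GaussianNB(),
        "SRP": ensemble.SRPClassifier(n_models=10),  # 10 learners, per the assignment
        # In older river versions, ARF lives at ensemble.AdaptiveRandomForestClassifier.
        "ARF": forest.ARFClassifier(n_models=10),    # 10 learners, per the assignment
    }

Each of these models can then be plugged into the test-then-train loop sketched in Section 2.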
2.1.1 Guiding questions for the analysis of Experiment 1
These are questions that you must cover in your analysis; you can also discuss other interesting aspects that you observed. Remember that you are not required to answer the (Optional) questions.
1. What is the best model in terms of accuracy, assuming that correct predictions on either class label are equally important?
2. Focusing on the model that obtained the highest accuracy, is there any class that is harder to classify correctly? How can you identify that, and which class is it?
3. (Optional) Is it possible to improve the predictive performance on each class by using an algorithm such as C-SMOTE? To find out, execute C-SMOTE and examine the output results.
2.2 Experiment 2: Data exploration
The goal of this experiment is to visualise the data and gain a better understanding of it. You might want to use the Feature Analysis tab in MOA to perform the data exploration tasks in this experiment.
• (Task 1) Plot the values of all the input features (and the class label) for the whole dataset (a plotting sketch in Python appears after this list).
• (Task 2) Choose at least 2 different periods (i.e. from instance n to instance m) in which to visualise the data for all features (see guiding question 1 below to decide which periods).
• (Task 3) Using the same configuration for SRP as in Experiment 1, create a plot of the features' importance according to the COVER metric. Important: use a window size of 100 instead of the default value of 500; this will give you more detail about the variations over time.
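The Feature Analysis tab covers Tasks 1 and 2 directly. If you prefer a notebook, here is a minimal pandas/matplotlib sketch; the file name is a guess, and a categorical class column may need to be encoded or dropped before plotting:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("house_elec_hour.csv")  # hypothetical file name

    # Task 1: every column (features and class label) over the whole stream.
    df.plot(subplots=True, figsize=(10, 2 * len(df.columns)))
    plt.tight_layout()
    plt.show()

    # Task 2: zoom into a chosen period, i.e. instances n to m.
    n, m = 2000, 3000  # placeholder period; pick yours from the Task 1 plot
    df.iloc[n:m].plot(subplots=True, figsize=(10, 2 * len(df.columns)))
    plt.tight_layout()
    plt.show()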
2.2.1 Guiding questions for the analysis of Experiment 2
1. (Related to Task 1 and Task 2) When you visualise the features over time for the whole data, are there periods of great variation (e.g. long periods with 0 readings or bursts of very high values)? If yes, choose at least 2 such periods to visualise and report their characteristics (i.e. are only some of the features behaving differently during these periods, or all of them?).
2. (Related to Task 3) Do the feature importances for SRP change over time? Which feature is the most important most of the time?
3. (Related to Task 3) Are there features that have no impact on SRP predictions? Which features are they, and how can you verify that?
4. (Related to Task 3) Choose a period of variation in the features' importance (e.g. feature A used to be the most important, but now it is the second most important and feature B is the most important) and compare it against a plot of the involved features' values during the same period. What can be observed? Is it possible to explain these variations in importance based on the values?
5. (Optional) This question extends Question 4. Visualise the predictive performance (accuracy) of the algorithm during that same period to understand whether the variation in feature importance reflects a change in the model, which in turn caused a noticeable impact on accuracy. You will need to present the accuracy using a prequential evaluation (instead of test-then-train) to notice variations in accuracy over short periods of time (a sliding-window accuracy sketch appears after this list).
6. (Optional) If you increase the number of learners in SRP from 10 to 100, does this change the overall importance of the features, or are the results similar to those obtained with SRP using 10 learners?
7. (Optional) Are the feature importances consistent when using ARF or SRP? To verify that, you can visualise the feature importances obtained from ARF with 10 learners and compare them to those obtained by SRP with 10 learners.
8. (Optional) Cluster the data using an algorithm (such as ClusTree). Do the clusters change over time? Could you observe any other interesting behaviour based on the clustered data?
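For Question 5, prequential accuracy over a sliding window can be computed by hand. The sketch below uses a 100-instance window and river's Phishing dataset as a stand-in; recent river releases also ship a rolling-metric wrapper, but the manual version avoids version-specific APIs:

    from collections import deque
    from river import tree, datasets

    model = tree.HoeffdingTreeClassifier()
    window = deque(maxlen=100)  # 1 = correct prediction, 0 = wrong
    curve = []                  # windowed accuracy over time

    for x, y in datasets.Phishing():  # stand-in stream; use the assignment data
        y_pred = model.predict_one(x)
        if y_pred is not None:
            window.append(int(y_pred == y))
            curve.append(sum(window) / len(window))
        model.learn_one(x, y)

    # Plotting `curve` reveals short-lived accuracy drops that the final
    # test-then-train number would hide.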