1 Introduction
This assignment is based on the human activity recognition using smartphones1 dataset. This data has been collected from a group of 30 volunteers aged 19-48 years. Each person performed six activities (walking, walking upstairs, walking downstairs, sitting, standing, lying down) whilst wearing a smartphone on the waist. Using the phone’s embedded accelerometer and gyroscope, data was captured relating to the 3-axial linear acceleration and 3-axial angular velocity at a sample rate of 50Hz. Video was then used to manually annotate the data.
Figure 1: Activity monitoring (Image: Universitat Politècnica de Catalunya, Catalonia, Spain)
The dataset presented here (and used in this assignment) is a sub-table of the original data and contains 1,200 data instances (or objects), where a single activity is represented by a single data instance. There are 336 attributes (or features) including the decision class label (‘ACTIVITY’), which can take the values 1 (walking), 2 (walking upstairs), 3 (walking downstairs), 4 (sitting), 5 (standing), 6 (lying down). Each class is represented roughly equally, with about 200 instances each. The goal is to predict the activity (walking, sitting, etc.) using the extracted sensor input values. The sensor acceleration signal also has gravitational and body motion components. More information regarding the process and data can be obtained from the link provided in the footnote. It is important to note that the dataset for this assignment is a sub-table of the original data linked in the footnote, and the distribution of the data has also been modified. This means that any classifiers learned on the full dataset will not generalise well to the data for this assignment.
Note also that some data instances have missing feature values. You need to be mindful of this when building classifiers.
2 General Guidelines
As with the previous assignment for this module, you should run experiments using WEKA (version 3.8.1) Explorer. If you are working on a machine outside of the departmental network (e.g. your laptop), please ensure that you download and install this exact version. This is very important as your results need to be reproducible on a departmental machine.
When dealing with large datasets in WEKA, it is possible that you may encounter out-of-memory errors because the JVM has not reserved enough memory for WEKA (i.e. the ‘heap size’ is too small). If this is the case, then there are a number of ways to address this – see the appendix for more info.
In many cases, the WEKA Explorer allows you to modify the random seed that will be used. However, you should use the default seed to aid reproducibility of your results. You will have to work out some details in applying the WEKA Explorer, including the meaning of certain terms (such as ‘random seed’, as referred to above).
1 https://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
3 The Task
Essentially, there are three different subtasks for this assignment which are discussed in more detail in each section below.
3.1 Task 1 – Evaluating classifier performance (30%)
The first task is to examine how some of the classifiers provided in WEKA perform on the provided humanActivity.arff dataset.
1. You should train and evaluate different classifiers on the dataset by first loading the humanActivity.arff dataset. In this task, you will be using cross-validation (CV) on the dataset to evaluate classifier performance. In order to reduce computation time you may use 10-fold cross-validation. (For reference, a scripted sketch of this CV procedure is given at the end of this section.)
2. A good starting point for the analysis is to try both Naive Bayes (NaiveBayes) and C4.5 decision trees (J48). However, for a complete analysis I would expect you to explore the data using at least four different learners, including some of the sub-symbolic learners that we covered in the lectures (e.g. SVM, NN, etc.). Compare the results (percent correct, PC, etc.) for the different classifiers. What do you notice? What do you think would be a reasonable baseline against which to compare the classification performance? You are not limited to any particular selection of classifier learners and you may try others (as many as you like, in fact). However, you must provide solid reasoning for choosing those that we have not covered in the material for this module (and their inclusion in the report), and have an understanding of how they work, as well as stating why you think they perform better or worse than others. It is not sufficient to simply include lots of results for different classifiers without any meaningful analysis.
3. In the introduction section, it is mentioned that some instance-attribute values are missing. It is important to deal with these as they may affect the performance of the learners applied to the data. One way of addressing this might be to use one of the filters provided in WEKA to replace such values with something else (a sketch of one such filter, applied via the Java API, is given at the end of this section). This may be appropriate in some cases and not in others. It is important to bear this in mind, however, and you should check the data carefully, use your best judgement and clearly state any assumptions you make when performing any subsequent experiments.
The use of a filter, or indeed manual manipulation of the dataset by identifying any values to be changed, of course results in a dataset different from the original. You should now re-analyse the modified dataset (or datasets, if you attempt more than one approach) and present your findings. What effect does this have on the models that are learned? Again, as in the previous step, you should provide sound reasoning for any conclusions you draw and classifiers you choose.
4. Returning to the missing attribute values: can you suggest another way, apart from the strategies employed by the WEKA filters, to change the values? If so, you should describe it and present any experimental results, if you make any changes to the data. Discuss the effects that these changes might have on the learner.
You should use the dataset you believe to be the most suitable, and explain your choice in your report. Discuss (in your report) the analysis of the experiments performed.
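Although the experiments themselves should be run in the WEKA Explorer, it can be useful to sanity-check results programmatically. The following is a minimal sketch using the WEKA Java API; it assumes the class attribute (ACTIVITY) is the last attribute in the file, runs 10-fold CV with the Explorer’s default seed (1), and includes ZeroR (a simple majority-class predictor) as one possible baseline:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidateActivity {
    public static void main(String[] args) throws Exception {
        // Load the data; ACTIVITY is assumed to be the last attribute.
        Instances data = DataSource.read("humanActivity.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // ZeroR predicts the majority class and serves as a baseline;
        // NaiveBayes and J48 are the suggested starting points.
        Classifier[] learners = { new ZeroR(), new NaiveBayes(), new J48() };
        for (Classifier learner : learners) {
            Evaluation eval = new Evaluation(data);
            // 10-fold cross-validation with the default seed of 1,
            // matching the Explorer's defaults.
            eval.crossValidateModel(learner, data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n",
                    learner.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}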
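Similarly, one of the WEKA filters referred to in point 3 above is ReplaceMissingValues, which substitutes the attribute mean (numeric attributes) or mode (nominal attributes) for each missing entry. A minimal sketch, again assuming ACTIVITY is the last attribute and using a hypothetical output filename:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;

public class FillMissingValues {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("humanActivity.arff");
        data.setClassIndex(data.numAttributes() - 1); // assumed class position

        // Replace each missing value with the attribute's mean (numeric)
        // or mode (nominal), computed over the whole dataset.
        ReplaceMissingValues fill = new ReplaceMissingValues();
        fill.setInputFormat(data);
        Instances filled = Filter.useFilter(data, fill);

        // Save the modified dataset so it can be re-analysed in the Explorer.
        ArffSaver saver = new ArffSaver();
        saver.setInstances(filled);
        saver.setFile(new File("humanActivity-filled.arff")); // hypothetical name
        saver.writeBatch();
    }
}

Note that this is exactly the kind of global substitution that may be appropriate in some cases and not others, so any use of it should be justified in the report.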
3.2 Task 2 – Feature selection (40%)
In any dataset, there may be superfluous features – i.e. those that do not contribute to determining the classes of the data instances. Indeed, such attributes can often be misleading, and classification performance can be negatively affected as a result. One way in which we can address this issue is to perform attribute (or feature) selection.
You are asked to perform feature selection by carefully considering the properties of the original attributes (mean, standard deviation, etc.), and applying any changes (removal or inclusion of attributes) for the supplied training set, before then also applying the same changes to the test set.
The two datasets provided for this are available on Blackboard: humanActivityTrain.arff and humanActivityTest.arff. Note: you should not use an automatic feature selection method or mechanism, such as those included in WEKA, for the first two parts of this task – I am interested in your analysis, not in whether you have chosen an optimal subset of the data. The use of a feature selection algorithm will be obvious from the detail provided in your report.
1. Examine each of the attributes in the training dataset (humanActivityTrain.arff) in detail using the visualisation tools provided in the WEKA Explorer preprocess tab. Document/calculate some basic measures: mean, standard deviation, maximum and minimum, and whatever other metrics you consider appropriate. What does this tell you about each of the features of the dataset? Document your findings in the report.
2. Using the findings from 1) above, can you suggest a strategy for the inclusion or removal of certain features? If so, perform a manual removal/inclusion of features on humanActivityTrain.arff using the preprocess tab in WEKA and save the new dataset. Remove the same features from humanActivityTest.arff (a scripted sketch of this train/test-consistent removal is given at the end of this section). Now, re-analyse the performance of the classifiers that you chose for Task 1. What do you notice? Is performance better or worse? Provide an analysis of why you think the performance has changed. You may include/exclude as many features as you like, as long as you have sound reasoning for doing so. Also, you may generate more than one dataset with selected features and perform analyses accordingly.
3. Very often, as part of some classification algorithms (e.g. tree-based approaches), feature information gain scores are used as a splitting criterion to perform classification. What can you say about the tree that is generated from humanActivityTrain.arff and the features that are used to split upon, using the J48 classifier in WEKA? Is the tree similar for the humanActivityTest.arff dataset? Document your findings in the report (a sketch for inspecting the learned tree programmatically is also given at the end of this section).
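As a scripted counterpart to the manual removal described in point 2 above, WEKA’s Remove filter can be configured once and applied to both files, which guarantees that the train and test sets stay compatible. The attribute indices below are placeholders, not a recommended subset; they should come from your own analysis:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class RemoveFeatures {
    // Apply the same attribute removal to a file and save the result.
    static void removeAndSave(String inPath, String outPath, String indices)
            throws Exception {
        Instances data = DataSource.read(inPath);
        Remove remove = new Remove();
        remove.setAttributeIndices(indices); // 1-based ranges, e.g. "5,12,40-60"
        remove.setInputFormat(data);
        Instances reduced = Filter.useFilter(data, remove);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(reduced);
        saver.setFile(new File(outPath));
        saver.writeBatch();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder indices -- substitute the features identified in
        // your own analysis of the training set.
        String toRemove = "5,12,40-60";
        removeAndSave("humanActivityTrain.arff", "trainReduced.arff", toRemove);
        removeAndSave("humanActivityTest.arff", "testReduced.arff", toRemove);
    }
}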
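For point 3, a quick way to see which features J48 actually splits on is to build the tree and print its textual representation; the attributes chosen near the root are those with the highest information gain at each split. A minimal sketch, again assuming the class attribute is last in the file:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectTree {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("humanActivityTrain.arff");
        train.setClassIndex(train.numAttributes() - 1); // assumed class position

        // Build C4.5 with default parameters; printing the model lists the
        // attributes used at each split, starting from the root.
        J48 tree = new J48();
        tree.buildClassifier(train);
        System.out.println(tree);
        System.out.println("Tree size: " + tree.measureTreeSize());
    }
}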
3.3 Task 3 – Summary of results/findings (15%)
The final task is to draw upon the findings of Tasks 1 and 2 above; your summary should address the following questions:
1. What is a good model for the full dataset using CV, and why? Provide a reasoned argument summarising the findings in your report.
2. What effect does the approach you employed for dealing with missing-valued data (denoted ‘?’ in the .arff file) have upon the dataset and the performance of the classifiers? State why you think that the approach you have used is appropriate.
3. Which features are most useful in the dataset? Refer to your findings of the analysis of the individual features, as well as any methods such as decision trees (J48) which used feature information gain metrics. Is feature selection appropriate for this classification problem/dataset, and why?
4. Does a reduced dataset offer any advantages over the full dataset, even if there are some negative changes in overall performance?
4 Submission and Marking
You are required to submit two separate things:
1. A report in .pdf format via TurnItIn. You should aim to keep your answers concise, while also conveying the important information. A report of 2,800 words max. (including references) is appropriate for this. You should also include the TurnItIn word count in your document so that it can be verified. Please do not exceed the word limit as this will delay the marking process.
2. The modified dataset(s) created in Task 2. This should be submitted via Blackboard not TurnItIn.
Your assignment will be assessed according to the department’s assessment criteria for essays (see Student Handbook Appendix AC) and marked based on your report and any supporting data you submit. In general it is advisable to concentrate on appropriately sized selections or excerpts of data to support your discussion. It is not necessary to include very large excerpts of data; however, if you feel the need to include such elements, please do not place them in the main text but in an appendix, so as not to clutter the format of the report. The following marking scheme will be used to mark your submission:
• Task 1 – Evaluating classifier performance (see description above) (30%)
• Task 2 – Feature selection (see description above) (40%)
• Task 3 – Summary of results/findings (15%)
• Report – readability, correct formatting, layout and proper referencing (15%)
Appendix 1 – Increasing the Java Heap Size in WEKA
1. For Windows & Linux: On Windows, you can change the heap size parameter in the RunWeka.ini file (see the example entry at the end of this appendix). On either Linux or Windows, you can also start WEKA by navigating (using the command line) to the folder where the weka.jar file is located on your machine and typing:
java -Xmx2048m -jar weka.jar
to start WEKA. The figure ‘2048’ sets the maximum heap size in megabytes; it can be increased further if desired.
2. For Mac OS: Open:
System Preferences → Java Control Panel → Java → click on View.
Edit the Runtime args box for the user to include the parameter, e.g. -Xmx2048m, then click Apply.
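For reference, in the Windows distribution of WEKA 3.8 the heap size used by the launcher is normally controlled by a maxheap entry in RunWeka.ini. The exact key and default value can vary between versions, so treat the line below as an illustrative example rather than a definitive setting:

# Example entry in RunWeka.ini (Windows) -- illustrative value only
maxheap=2048m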