COMP20008 2020 SM1 Workshop Week 10 Experimental design
Part A
1. What is the difference between supervised and unsupervised learning?
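Supervised learning: The data includes class labels, and the model learns to predict them.
Unsupervised learning: There are no class labels; the model looks for structure (e.g. clusters) in the features alone.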
2. What is the difference between training data and testing data?
Training data: Used to build a classification model
Testing data: Used to test its validity
The two must be kept separate: if we evaluate a model on the data it was trained on,
overfitting goes undetected and its performance is assessed as being unrealistically high.
3. What is the purpose of k-fold cross validation?
Splits the dataset into k parts (folds); each fold is used once as the test set while the other k-1 folds form the training set.
We then train and test a model for each split and report the average performance.
This reduces the likelihood of overfitting and selection bias
and provides a more reliable assessment of performance.
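The procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the labels are invented, the "model" is a toy majority-class predictor standing in for a real classifier, and the function names (k_fold_indices, majority_label, cross_validate) are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def majority_label(labels):
    """Toy 'model': always predict the most common training label."""
    return max(set(labels), key=labels.count)

def cross_validate(labels, k=5, seed=0):
    """Return the mean test accuracy over k folds."""
    folds = k_fold_indices(len(labels), k, seed)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        # "Train" on the other k-1 folds.
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = majority_label(train_labels)
        # Test on the held-out fold.
        correct = sum(labels[i] == prediction for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)

# Invented labels for illustration: 70 'healthy', 30 'not healthy'.
labels = ["healthy"] * 70 + ["not healthy"] * 30
mean_accuracy = cross_validate(labels, k=5)
print(f"mean accuracy over 5 folds: {mean_accuracy:.2f}")
```

Every instance is used for testing exactly once, so the reported average does not depend on one lucky or unlucky split.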
4. Suppose Alice takes a dataset D with 100 instances, 4 features, plus a class label feature. She computes the correlation of each of the 4 features with the class label using mutual information and discards the two features with lowest correlation. She now has a processed version D′ of the dataset (2 features, class label and 100 instances). She splits D′ into two: 80% training (80 instances) and 20% testing (20 instances). She learns a decision tree model on the training set and evaluates the model accuracy on the testing set. She reports the accuracy as being 90%. Why might this estimate of 90% accuracy be over-optimistic? Give reasons.
Over-optimistic because the feature selection has been done
using the test data as well as the training data.
It’s possible that feature selection done on the training data only
would have selected different features.
For a fair assessment, the test data
should not be involved in the model generation in any way,
and the test-train split should be the first thing you do.
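A minimal sketch of the fairer ordering, in plain Python: split first, then rank features by mutual information using the training instances only. The dataset here is randomly generated for illustration (it is not Alice's data), the features are assumed to be discrete, and the helper names mutual_information and top_features are hypothetical.

```python
import math
import random
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def top_features(rows, labels, keep=2):
    """Rank feature columns by MI with the label; return the best `keep` column indices."""
    n_features = len(rows[0])
    scores = [
        mutual_information([r[j] for r in rows], labels) for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:keep]

# Invented dataset: 100 instances, 4 binary features, binary class label.
rng = random.Random(0)
rows = [[rng.randint(0, 1) for _ in range(4)] for _ in range(100)]
labels = [r[0] ^ (rng.random() < 0.1) for r in rows]  # label mostly follows feature 0

# Step 1: split FIRST (80 training / 20 testing).
order = list(range(100))
rng.shuffle(order)
train_idx, test_idx = order[:80], order[80:]

# Step 2: choose features using the training instances only.
train_rows = [rows[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
selected = top_features(train_rows, train_labels, keep=2)

# Step 3: apply the same column choice to both sets; the test set never influenced it.
train_reduced = [[r[j] for j in selected] for r in train_rows]
test_reduced = [[rows[i][j] for j in selected] for i in test_idx]
print("selected feature columns:", selected)
```

Alice's version corresponds to running Step 2 on all 100 instances before Step 1, which lets the test instances influence which features the model sees.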

Part B
The Aurin repository contains a large number of datasets that data scientists can use to help answer important questions in society. The following exercise illustrates a possible simple scenario and is designed to get you thinking about how you might use the Aurin urban data repository that is hosted by the University.
Suppose our question is: “Are we building enough green spaces in Victoria to ensure a healthy population?”
• Question 1: Who would be interested in an answer to this question and why?
• Log in to the Aurin portal https://aurin.org.au
• Select Victoria as your region of interest.
• Browse through the available datasets and see what data is available.
• Add the dataset “2015 Local Government Area (LGA) Statistical Profiles”. You should select all the attributes to include. This dataset includes information about the number of people reporting high blood pressure across different regions in the State. We will use this as a measure of people’s health.
• Add the dataset “LGA Visit to green space (once per week)”. You should select all the attributes to include. This dataset contains information about the number of people who visit local green space each week, across different regions in the State.
• Download each of these datasets as a CSV file.
• Question 2: What feature would you use to join these datasets together?
• Question 3: Describe two different techniques you could use to help identify a relationship between the visits to green space and reports of high blood pressure.
• Question 4: Describe how you could use the data to make a prediction about people’s overall health based on the information available. Describe the steps you might use to evaluate your prediction.
• Question 5: What challenges do you think might arise in studying these questions?
• Question 6: What are the next steps you might take when studying these questions?
• Question 7: (If you have time) Implement one of these techniques and report the results.
Part B possible answers
The point of the exercise is to get students thinking about how to approach a practical problem. There aren’t any right answers in this part.
I suggest working through this in groups, then discussing the different approaches with the class. Some sample ideas:

Q1: Department of Health or urban planning department personnel, from senior to junior level. If a relationship between green spaces and health can be established, it might be used as evidence for building more of them. Obviously it’s a very complex question, and political issues might well trump evidence-based decision making!
Q2: Use the LGA code (local government area code) to join the datasets together. Could create a data frame holding the relevant attributes of each dataset. Could code the join yourself, or else use pandas-specific functions such as merge or join.
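As a rough illustration of coding the join yourself, the sketch below uses only the Python standard library. The column names (lga_code, high_bp_pct, green_visit_pct) and all values are invented placeholders; the real AURIN exports will use different names, so check the CSV headers first.

```python
import csv
import io

# Stand-ins for the two downloaded CSV files. Column names and values are
# invented for illustration; the real AURIN exports will differ.
health_csv = io.StringIO(
    "lga_code,lga_name,high_bp_pct\n"
    "1001,LGA A,25.0\n"
    "1002,LGA B,27.5\n"
    "1003,LGA C,26.0\n"
)
green_csv = io.StringIO(
    "lga_code,green_visit_pct\n"
    "1001,60.0\n"
    "1003,48.5\n"
    "9999,51.0\n"
)

def read_by_key(handle, key):
    """Index the rows of a CSV file by the value in column `key`."""
    return {row[key]: row for row in csv.DictReader(handle)}

health = read_by_key(health_csv, "lga_code")
green = read_by_key(green_csv, "lga_code")

# Inner join: keep only the LGA codes present in both datasets.
joined = [
    {**health[code], **green[code]}
    for code in sorted(health.keys() & green.keys())
]

for row in joined:
    print(row)
```

With pandas, the equivalent would be along the lines of health_df.merge(green_df, on="lga_code"), which performs an inner join by default. Either way, it is worth checking how many LGAs drop out of the join because they appear in only one file.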
Q3: Some possible approaches:
• Scatter plots of green space vs blood pressure
• Correlation between blood pressure and number of visits to green space
• Others as well; get students to discuss
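The correlation approach in Q3 could be sketched as follows, using the Pearson coefficient as one possible choice of correlation measure. The per-LGA values are invented for illustration only and the variable names are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented per-LGA values, for illustration only.
green_visit_pct = [40.0, 45.0, 50.0, 55.0, 60.0]
high_bp_pct = [30.0, 29.0, 27.0, 26.0, 24.0]

r = pearson_r(green_visit_pct, high_bp_pct)
print(f"Pearson r = {r:.3f}")
```

A value near -1 or +1 indicates a strong linear relationship and a value near 0 indicates little linear relationship. Even a strong value says nothing about causation.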
Q4: This is more complex and less well defined.
One approach: First decide on how to define overall health. We could take many of the health-related metrics and use them to create discrete groups like ‘healthy’ and ‘not healthy’. This could be data driven or make use of expert advice.
Could then select other relevant attributes and build a classification model (e.g. decision tree, k-NN).
In order to evaluate, we need to ensure the experimental design is done appropriately. Consider the points in the lecture, i.e. perform test-training split before doing anything else and use k-fold cross validation.
Can consider overall accuracy or more complicated metrics to evaluate performance.
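To illustrate why overall accuracy alone can mislead, here is a small sketch that also computes precision and recall for one class. The labels are invented, the "model" simply predicts ‘healthy’ for everything, and the function name evaluate is hypothetical.

```python
def evaluate(actual, predicted, positive):
    """Return accuracy, precision and recall, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    # Report 0.0 when a ratio is undefined (no positive predictions or no positive instances).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

# Invented test-set labels: 8 'healthy' LGAs and 2 'not healthy' ones.
actual = ["healthy"] * 8 + ["not healthy"] * 2
# A model that always predicts 'healthy'.
predicted = ["healthy"] * 10

scores = evaluate(actual, predicted, positive="not healthy")
print(scores)
```

Here the model is right on 8 of 10 instances, yet it never identifies a ‘not healthy’ LGA, which is the class a health department would presumably care most about.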
Q5: Challenges for Q3 include a relatively small dataset, only aggregate (LGA-level) data, possible data quality issues, correlation vs causation and others.
Some challenges for Q4 are similar; there are also issues with defining classes for the prediction model and with evaluating model performance on limited data. Discuss the relationship between the challenges.
Q6: Perhaps investigate what other data is available, e.g. data.melbourne.vic.gov.au, data.vic.gov.au. Consider how to obtain individual-level data for better analysis. Consider how to implement a study to test the model.