COMP20008
Elements of Data Processing
Semester 1, 2021
Lectures 5 and 6: Correlation & Clustering – Exercises
Contact: pauline.lin@unimelb.edu.au
© University of Melbourne 2021
Question 2ai) from 2016 exam
a) Richard is a data wrangler. He does a survey and constructs a dataset recording average time/day spent studying and average grade for a population of 1000 students:
Student Name | Average time per day spent studying | Average Grade
… | … | …
i) Richard computes the Pearson correlation coefficient between Average time per day spent studying and Average grade and obtains a value of 0.85. He concludes that more time spent studying causes a student’s grade to increase. Explain the limitations of this reasoning and suggest two alternative explanations for the 0.85 result.
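One way to see why a high correlation does not by itself establish causation is to simulate a confounder. The sketch below is illustrative only: the hidden variable `motivation`, the noise levels and every numeric parameter are assumptions, not part of the exam question. In the simulated data, study time has no direct effect on grade, yet the Pearson coefficient is high because both depend on the same hidden variable.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000                # population size taken from the question

# Hypothetical confounder: it drives BOTH study time and grade.
motivation = [rng.gauss(0, 1) for _ in range(n)]

# Study time depends on motivation plus noise (assumed parameters).
study_hours = [3 + m + rng.gauss(0, 0.4) for m in motivation]

# Grade depends on motivation plus noise, and NOT on study_hours.
grade = [65 + 10 * m + rng.gauss(0, 4) for m in motivation]

r = pearson(study_hours, grade)
print(f"Pearson r = {r:.2f}")
```

With these assumed parameters the coefficient lands in the same range as Richard's 0.85, even though changing `study_hours` in the simulation would leave `grade` unchanged.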
Question 2aii) from 2016 exam
ii) Richard separately discretises the two features Average time per day spent studying and Average grade, each into 2 bins. He then computes the normalised mutual information between these two features and obtains a value of 0.1, which seems surprisingly low to him. Suggest two reasons that might explain the mismatch between the normalised mutual information value of 0.1 and the Pearson correlation coefficient of 0.85. Explain any assumptions made.
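The effect of coarse binning can be explored with a small simulation. The sketch below makes several assumptions that are not in the question: the data are drawn from a bivariate normal with correlation 0.85, each feature is split into 2 bins at its median, and NMI is computed as MI divided by min(H(X), H(Y)), which is one common normalisation. It is meant only to show that the two measures sit on different scales and that 2 bins discard most of the detail in continuous data.

```python
import math
import random
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def nmi(a, b):
    """MI(a, b) / min(H(a), H(b)); assumes neither feature is constant."""
    h_a, h_b = entropy(a), entropy(b)
    mutual_info = h_a + h_b - entropy(list(zip(a, b)))
    return mutual_info / min(h_a, h_b)


def two_bins(values, threshold):
    """Discretise into 2 bins: 0 if value <= threshold, else 1."""
    return [int(v > threshold) for v in values]


rng = random.Random(1)
n, rho = 1000, 0.85  # assumed sample size and target correlation

x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]

# Median split: a hypothetical choice of bin boundary.
x_bins = two_bins(x, sorted(x)[n // 2])
y_bins = two_bins(y, sorted(y)[n // 2])

r = pearson(x, y)
score = nmi(x_bins, y_bins)
print(f"Pearson r = {r:.2f}, NMI with 2 bins = {score:.2f}")
```

Under these assumptions the NMI comes out well below the Pearson value. Moving the threshold passed to `two_bins` away from the median is a quick way to see how sensitive the result is to the choice of bin boundaries.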
For NMI, which binning is better?
A
B
Understanding the Algorithm
For which dataset does k-means require fewer iterations (k = 2)?
Dataset 1 Dataset 2
Understanding the Algorithm
Which is the output of k-means where k = 3?
A:
B:
C: depends on the initial seeds
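The dependence on initial seeds can be checked on a toy example. The sketch below is a minimal pure-Python version of the standard k-means loop (assign each point to its nearest centroid, recompute centroids, stop when the assignment no longer changes). The four points, the seeds and the use of k = 2 rather than k = 3 are assumptions chosen so that the result is easy to verify by hand; they are not the datasets from the slide.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, seeds, max_iter=100):
    """Plain k-means starting from the given seeds; returns (labels, centroids)."""
    centroids = list(seeds)
    labels = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments did not change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist2(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: corners of a wide, short rectangle (hypothetical).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Seeds on opposite short sides: clusters split left / right.
labels_a, cent_a = kmeans(points, seeds=[(0, 0), (4, 0)])
# Seeds on the same short side: clusters split bottom / top.
labels_b, cent_b = kmeans(points, seeds=[(0, 0), (0, 1)])

sse_a = sse(points, labels_a, cent_a)
sse_b = sse(points, labels_b, cent_b)
print(labels_a, sse_a)  # [0, 0, 1, 1] 1.0
print(labels_b, sse_b)  # [0, 1, 0, 1] 16.0
```

Both runs converge, but to different clusterings: the second set of seeds leaves the algorithm in a local optimum with a much larger sum of squared errors.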
Discussions
• How could I tell which clustering is better?
• How could I come up with a good clustering?
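As one possible starting point for this discussion, two candidate clusterings of the same points can be compared with an internal quality score. The sketch below uses the mean silhouette coefficient, which is only one of several options (the sum of squared errors is another). The toy points and both labelings are hypothetical, and singleton clusters are given a score of 0 by convention.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; higher means tighter, better separated clusters."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, grouped by cluster label.
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster
            continue
        a = sum(own) / len(own)                               # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data: corners of a wide, short rectangle (hypothetical).
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

left_right = [0, 0, 1, 1]  # groups the points that are close together
bottom_top = [0, 1, 0, 1]  # groups points that are far apart

score_lr = mean_silhouette(points, left_right)
score_bt = mean_silhouette(points, bottom_top)
print(f"left/right: {score_lr:.2f}, bottom/top: {score_bt:.2f}")
```

A common way to arrive at a good clustering in practice is to run k-means from several different seeds (and possibly several values of k) and keep the result with the best score.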