The University of Melbourne
Semester One 2017 Exam
School: Computing and Information Systems
Subject Number: COMP20008
Subject Title: Elements of Data Processing
Exam Duration: 2 hours
Reading Time: 15 minutes
This paper has 11 pages
Authorised Materials:
No calculators may be used.
Instructions to Invigilators:
Supply students with standard script books.
This exam paper can be taken away by the students after the exam.
This paper may be held by the Baillieu Library.
Instructions to Students:
Answer all 5 questions. The maximum number of marks for this exam is 50. Start each question (but not each part of a question) on a new page.
Formulae

Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Pearson's correlation coefficient: $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$, where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

Entropy: $H(X) = -\sum_{i=1}^{\#\mathrm{categories}} p_i \log p_i$, where $p_i$ is the proportion of points in the $i$th category

Conditional entropy: $H(Y|X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y|X = x)$, where $\mathcal{X}$ is the set of all possible categories for $X$

Mutual information: $MI(X, Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)$

Normalised mutual information: $NMI(X, Y) = \dfrac{MI(X, Y)}{\min(H(X), H(Y))}$

Accuracy: $A = \dfrac{TP + TN}{TP + TN + FP + FN}$, where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives
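These definitions translate directly into code. A minimal Python sketch (the helper names are our own, not part of the exam) computing entropy, conditional entropy, min-normalised NMI, and accuracy for two lists of categorical labels:

import math
from collections import Counter

def entropy(labels):
    # H(X) = -sum_i p_i log2 p_i over the categories of X
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(y, x):
    # H(Y|X) = sum_x p(x) H(Y | X = x)
    n = len(x)
    return sum((c / n) * entropy([yi for xi, yi in zip(x, y) if xi == cat])
               for cat, c in Counter(x).items())

def nmi(x, y):
    # NMI(X, Y) = MI(X, Y) / min(H(X), H(Y))
    mi = entropy(y) - conditional_entropy(y, x)
    return mi / min(entropy(x), entropy(y))

def accuracy(predicted, actual):
    # A = (TP + TN) / (TP + TN + FP + FN): the proportion of exact matches
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)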
1. a) Consider the following XML fragment
i) (1 mark) What is the namespace URI of the element name?
ii) (1 mark) What is the namespace URI of the element degree?
iii) (1 mark) What is the namespace URI of the element workshop?
b) (1 mark) Provide a JSON representation of the following XML fragment.
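As a generic illustration only (the invented fragment below is not the one printed in the exam), a common XML-to-JSON convention maps element names to keys and nested elements to nested objects:

XML:   <student><name>Alice</name><degree>BSc</degree></student>
JSON:  {"student": {"name": "Alice", "degree": "BSc"}}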
c) (3 marks) Explain two advantages of applying principal components analysis to a dataset that has 500 features and 20000 instances. Explain one disadvantage.
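A minimal sketch of the scenario in part c), assuming scikit-learn is available (the component count of 10 and the random data are arbitrary choices for illustration):

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(20000, 500)               # 20000 instances, 500 features
pca = PCA(n_components=10)                   # keep the top 10 principal components
X_reduced = pca.fit_transform(X)             # shape (20000, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained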
d) (1 mark) Provide a real world example that illustrates the concept of “data missing not completely at random”.
e) (2 marks) Explain how outlier detection and removal fits in the data wrangling pipeline.
2. a) Below is pseudocode for the VAT algorithm described in lectures. It outputs a permutation P of the objects.
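A Python sketch of the standard Prim-style VAT ordering, offered as a reconstruction (it need not match the lecture pseudocode line-for-line, so the "line 6" in part i) refers to the original figure, not this sketch):

import numpy as np

def vat_order(D):
    # D is an n x n dissimilarity matrix; returns a permutation P of 0..n-1
    n = D.shape[0]
    P = [int(np.argmax(D.max(axis=1)))]   # start at an endpoint of the largest dissimilarity
    remaining = set(range(n)) - set(P)
    while remaining:
        # append the unselected object most similar to any already-selected object
        j = min(remaining, key=lambda r: min(D[i, r] for i in P))
        P.append(j)
        remaining.remove(j)
    return P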
i) (3 marks) In line 6, explain why the pair of most similar objects oi and oj is chosen. Why not the least similar instead?
ii) (2 marks) Explain how a reordered dissimilarity matrix D∗ can be created using the permutation P.
b) (3 marks) Maxine is having a conversation about data wrangling. She says “The VAT algorithm is nice, but parallel coordinates is even more useful for understanding a dataset”. Explain one way in which use of parallel coordinates could reveal similar information to use of VAT. Explain one way in which use of parallel coordinates could reveal information that would not be revealed by use of VAT.
c) (2 marks) Using an example, explain the difference between i) user based methods for collaborative filtering and ii) item based methods for collaborative filtering.
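A toy illustration of the distinction in part c) (the ratings matrix is invented): user-based methods compare rows of a user-item ratings matrix, while item-based methods compare columns.

import numpy as np

# rows = users, columns = items; 0 means "not yet rated" (toy data)
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5]])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(R[0], R[1]))        # user-based: similarity between users 0 and 1
print(cosine(R[:, 0], R[:, 3]))  # item-based: similarity between items 0 and 3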
3. Consider the following dataset, which records the predictions of a k-nearest neighbor classifier on a test set of 10 instances. There are two classes, X and Y. For each instance, both the class predicted by the classifier and its actual (true) class are recorded.
Instance ID   Predicted class   Actual class
1             X                 X
2             X                 Y
3             Y                 Y
4             X                 X
5             X                 Y
6             Y                 X
7             X                 X
8             Y                 Y
9             Y                 X
10            Y                 Y
a) (1 mark) Would Pearson correlation be suitable to compute the correlation between the Predicted class and Actual class? Why or why not?
b) (1 mark) Write a formula for the entropy of Predicted class.
c) (2 marks) Write a formula for the normalised mutual information between Predicted class and Actual class.
d) (1 mark) What is the accuracy of the classifier for this test data?
e) (3 marks) Suppose we have some other classifier and use it to make predictions on the same test set of 10 instances. For this other classifier, suppose the normalised mutual information between Predicted class and Actual class is reported to be 1 and its accuracy is reported to be 0. Is this possible? Explain your reasoning.
f) (2 marks) What is the purpose of separating a dataset into training and test sets, when evaluating the performance of a classifier?
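For checking answers mechanically, the table can be transcribed and scored with scikit-learn; average_method="min" matches the min-normalisation on the formula sheet (a sketch, not a model answer):

from sklearn.metrics import accuracy_score, normalized_mutual_info_score

predicted = list("XXYXXYXYYY")   # instances 1..10, from the table above
actual    = list("XYYXYXXYXY")
print(accuracy_score(actual, predicted))
print(normalized_mutual_info_score(actual, predicted, average_method="min"))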
4. a) Consider the 2-D dataset: {(1,2),(2,2),(3,6),(4,6),(6,6),(12,12)}. We want to use single linkage hierarchical clustering with Euclidean distance. The initial proximity matrix is shown below.
Suppose the first two iterations yield the new proximity matrix below.
i) (1 mark) Draw the proximity matrix with the “?” cells filled in. Your filled-in cell values may use square root symbols.
ii) (1 mark) Based on your answer to i), which clusters will be merged next? Explain your answer.
iii) (1 mark) Draw the (partial) dendrogram obtained after completing the merge in question ii).
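The initial proximity matrix can be recomputed from the six points given; a sketch using scipy, which also produces the full single-linkage merge sequence:

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage

points = np.array([(1, 2), (2, 2), (3, 6), (4, 6), (6, 6), (12, 12)])
D = squareform(pdist(points, metric="euclidean"))   # 6 x 6 initial proximity matrix
print(np.round(D, 2))

Z = linkage(points, method="single")   # one row per merge: ids, distance, cluster size
print(Z)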
b) (2 marks) Consider the following two step procedure:
Step 1: We apply the k-means algorithm on a dataset D having n instances, using k = 4. This outputs a clustering C.
Step 2: Suppose we then create a new dataset D′, by adding two copies of D together. i.e. D′ = D + D. D′ thus has 2n instances and each instance from D has 2 copies in D′. We then apply the k-means algorithm on D′, with k = 4 and using the same initial centroids as in Step 1. This outputs a clustering C′.
Compare the executions of the k-means algorithm in Step 1 and Step 2. Explain how they will be similar or different in terms of final output, number of iterations, and time complexity. State any assumptions made.
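A sketch for experimenting with part b), assuming scikit-learn (the stand-in dataset and the choice of initial centroids are ours):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
D = rng.random((100, 2))                 # stand-in dataset with n = 100
D_prime = np.vstack([D, D])              # D' = D + D: every instance duplicated
init = D[:4]                             # same initial centroids for both runs

C = KMeans(n_clusters=4, init=init, n_init=1).fit(D)
C_prime = KMeans(n_clusters=4, init=init, n_init=1).fit(D_prime)
print(C.n_iter_, C_prime.n_iter_)        # compare iteration counts
print(np.allclose(C.cluster_centers_, C_prime.cluster_centers_))  # same final centroids?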
c) i) (1 mark) One can compute a quality measure for a clustering, known as SSE (sum of squared errors). SSE is the sum of squared Euclidean distances of objects from their cluster centroids:

$SSE = \sum_{i=1}^{k} \sum_{x \in c_i} d(x, c_i)^2$

where $c_i$ is the centroid of cluster $c_i$.
As the number of clusters increases, would you expect SSE to increase or decrease? Explain your answer.
ii) (2 marks) The Figure below shows the result of k-means clustering (k=2) with Euclidean distance on two datasets. The solid crosses are the centroids of clusters c1 and c2. The solid circle objects are in cluster c1 and the empty triangle objects are in cluster c2. Assume both datasets are drawn using the same distance scale.
Assuming sA = SSE(clusters for dataset A) and sB = SSE(clusters for dataset B), which of the following relations holds? Why? State any assumptions made.

• sA = sB
• sA < sB
• sA > sB
[Figure: two panels, Dataset A and Dataset B, each a scatter plot showing clusters c1 and c2 with their centroids marked as solid crosses.]
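SSE is mechanical to compute once cluster labels are known; a minimal sketch (the toy points and labels are invented):

import numpy as np

def sse(X, labels):
    # sum, over clusters, of squared Euclidean distances to the cluster centroid
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]])
labels = np.array([0, 0, 1, 1])
print(sse(X, labels))   # 0.5 + 0.5 = 1.0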
d) (2 marks) Max is having a conversation about data integration. He says “Using blocking for record linkage between two datasets (dataset A and dataset B) is a bad idea. It is too time consuming to assign the records to blocks. It is much better instead to directly compare the records in A against the records in B without using any blocking step”. Argue why Max’s statement is incorrect.
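A toy illustration of why Max is wrong about cost (the records and the block key, the first letter of the string, are invented):

from collections import defaultdict

A = ["smith, john", "jones, mary", "brown, ann"]
B = ["smith, jon", "brown, anne", "taylor, sam"]

blocks = defaultdict(list)
for rec in B:
    blocks[rec[0]].append(rec)       # assigning B's records to blocks is one linear pass

# only record pairs sharing a block key are compared, instead of all |A| * |B| pairs
candidate_pairs = [(a, b) for a in A for b in blocks[a[0]]]
print(len(candidate_pairs), "comparisons instead of", len(A) * len(B))  # 2 instead of 9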
5. a) (1 mark) In the context of differential privacy, what is global sensitivity and how does it affect the amount of noise added to the output?
b) (2 marks) Assume we have a dataset with three features: Gender, Resident and Education. Gender is a categorical feature with two possible values (Male, Female), Resident is a categorical feature with two possible values (Yes, No), and Education is a categorical feature with four possible values (Highschool, Diploma, Undergraduate, Postgraduate).
If we are interested in ‘count’ queries, what is the value of global sensitivity? Explain your answer.
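A minimal sketch of the Laplace mechanism for a count query; the sensitivity is left as a parameter, since deriving its value is exactly what part b) asks for:

import numpy as np

def noisy_count(true_count, epsilon, sensitivity):
    # Laplace mechanism: noise scale = global sensitivity / epsilon,
    # so larger sensitivity means more noise at the same privacy budget
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

print(noisy_count(42, epsilon=0.5, sensitivity=1.0))  # sensitivity value illustrative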
c) (3 marks) Bloom filters are useful for privacy preserving data linkage with approximate matching. Suppose we are using k hash functions for a Bloom filter with length n = 1000. Suppose each hash function is biased, such that the probability of mapping its input to an even number is twice the probability of mapping it to an odd number.
Explain how the bias of the hash functions affects the probability of a false positive for a Bloom filter membership query. How does the bias affect the probability of a false negative? State any assumptions made.
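A sketch of a Bloom filter whose hash functions are biased toward even positions, for experimenting with this question (the biasing construction is one invented way to realise the stated 2:1 bias):

import hashlib
import random

N, K = 1000, 3   # filter length and number of hash functions

def biased_hash(item, seed, n=N):
    # deterministic per (item, seed); lands on an even slot with probability 2/3
    # and an odd slot with probability 1/3, i.e. even is twice as likely as odd
    h = int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16)
    rng = random.Random(h)
    slot = 2 * rng.randrange(n // 2)
    return slot if rng.random() < 2 / 3 else slot + 1

bits = [0] * N

def add(item):
    for s in range(K):
        bits[biased_hash(item, s)] = 1

def query(item):
    return all(bits[biased_hash(item, s)] for s in range(K))

add("alice"); add("robert")
print(query("alice"), query("charlie"))   # True, and probably False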
d) (2 marks) Suppose we are using a three-party exact matching scheme for privacy preserving data linkage. Alice and Bob each have their own dataset and they are trying to link the datasets together on people’s names. They agree on a pre-processing strategy to standardise their data, which converts all upper case letters to lower case letters and converts all nicknames to full names (e.g. “jim” to “james”, “bob” to “robert”, “liz” to “elizabeth”). Explain a possible disadvantage of their data pre-processing strategy with respect to the effectiveness of the privacy preserving data linkage.
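The agreed pre-processing is straightforward to mirror in code (the nickname table contains just the examples from the question):

NICKNAMES = {"jim": "james", "bob": "robert", "liz": "elizabeth"}

def standardise(name):
    # lower-case everything, then expand known nicknames to full names
    name = name.lower()
    return NICKNAMES.get(name, name)

print(standardise("Bob"))    # robert
print(standardise("Maria"))  # maria (unchanged)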
e) Consider the following dataset, which is 3-anonymous. The POI feature records a user’s last visited point of interest.
Name   Zip Code   Age    Nationality   POI
–      30**       ≤ 25   *             Heart Clinic
–      30**       ≤ 25   *             University
–      30**       ≤ 25   *             University
–      30**       > 40   *             Heart Clinic
–      30**       > 40   *             Church
–      30**       > 40   *             Bar
–      305*       3*     *             Bar
–      305*       3*     *             Bar
–      305*       3*     *             Bar
i) (1 mark) Which features have been used as a quasi-identifier? Explain your answer.
ii) (1 mark) Describe a possible homogeneity privacy attack.
End of Exam