COMP20008 Semester One 2019 Exam
The University of Melbourne Semester One 2019 Exam
School: Computing and Information Systems Subject Number: COMP20008
Subject Title: Elements of Data Processing
Copyright By PowCoder代写 加微信 powcoder
Exam Duration: 2 hours Reading Time: 15 minutes This paper has 8 pages
Authorised Materials:
No calculators may be used.
Instructions to Invigilators:
Supply students with standard script books.
Instructions to Students:
Answer all 14 questions. The maximum number of marks for this exam is 50.
Page 1 of 8
COMP20008 Semester One 2019 Exam
Formulae and Notation Euclidean distance: d(x, y) = ni=1(xi − yi)2
ni=1 (xi −x ̄)2
Pearson’s correlation coefficient: r = wherex ̄=n1 ni=1xi andy ̄=n1 ni=1yi
Entropy: H(X) = −|X| p log p i=1i 2i
ni=1 (yi −y ̄)2
where pi is the proportion of points in the ith category.
Conditional entropy: H(Y |X) = x∈X p(x)H(Y |X = x)
where X is the set of all possible categories for X and |X| is the number of categories.
Mutual information:
MI(X, Y ) = H(Y ) − H(Y |X) = H(X) − H(X|Y )
Normalised mutual information: NMI(X, Y ) = MI(X,Y )
n (x −x ̄)(y −y ̄) i=1i i
min(H(X),H(Y )) Accuracy: A= TP+TN where
TP+TN+FP+FN TP is number of true positives
TN is number of true negatives FP is number of false positives FN is number of false negatives
SetUnion: X∪Y istheunionofsetsX andY (elementsinX orinY orin both X and Y )
Set intersection: X ∩ Y is the intersection of sets X and Y (elements in both X and Y )
Page 2 of 8
COMP20008 Semester One 2019 Exam
1. This question concerns representations of data in XML/JSON.
(a) (0.5 mark) Consider the XML data below. Is it well-formed? Explain your answer.
COMP20008
(b) (0.5 mark) Explain the purpose of ‘uom’ and ‘open u’ in the XML data above.
(c) (1 mark) Provide an effective representation of the following data as a list of ‘person’ records in JSON format.
Name Age Interests
Tom 45 Jack 34 Susan 29
Gardening, Bird watching Movies, Video Games Cycling, Chess
2. Consider the task of processing readings from multiple temperature sensors.
(a) (1 mark) Explain the difference between readings which are noise and readings which are outliers.
(b) (2 marks) Using a simple example, explain two ways in which using a histogram for outlier detection in temperature sensor data might provide misleading results.
(c) Consider the following dataset of readings from a set of 10 temperature sensors collected at a particular time (one reading per sensor).
175, 179, 5, 185, 190, 191, 193, 193, 201, 33
i. (1 mark) Calculate the interquartile range (IQR) of this data.
ii. (1 mark) Explain why readings 5 and 33 would be categorised as outliers in a Tukey boxplot representation.
iii. (2 marks) There is in fact one sensor reading that has a missing value and was not shown above. Suggest an imputation method for this reading, in light of the data above, and justify your choice.
Page 3 of 8
COMP20008 Semester One 2019 Exam
3. Consider a recommender system for an online bookstore Books R Us which records ratings of books from users, and makes recommendations to users about books they might be interested to purchase. The system uses a popu- larity based strategy for its recommendations, randomly recommending a book from the top 50 books. The top 50 books are determined according to total number of purchases in the store during the last month.
(a) (2 marks) Explain one advantage and one disadvantage of this popularity based strategy, compared to using an item based collaborative filtering strategy for making recommendations. Explain any assumptions made.
(b) (2 marks) A data science consultant suggests to Books R Us that an alter- native strategy would be to train a decision tree classification model to predict whether a user will like a particular book. The features of the classification model would include information about the customer such as (age, gender, lo- cation) and also information about the book (cost, topic, year of publication). The class label to be predicted would be {like, not like}.
Explain one advantage and one disadvantage of this scheme. Explain any assumptions made.
4. Consider a dataset of four data instances {x1, x2, x3, x4} which has the follow- ing pairwise dissimilarity (distance) matrix.
x1 x2 x3 x4
x1 0 1 4 5
x2 1 0 2 6
x3 4 2 0 3
x4 5 6 3 0
(a) (3 marks) Apply single link agglomerative hierarchical clustering to group the data instances represented by this matrix. Draw the resulting dendrogram. Show all working.
(b) (2 marks) Explain one general advantage and one general disadvantage of single link agglomerative hierarchical clustering compared to k-means cluster- ing.
Page 4 of 8
COMP20008 Semester One 2019 Exam
5. Consider the task of detecting a strong negative (linear) correlation between two features, using visualisation.
(a) (2 marks) Explain, with the aid of a simple diagram, how a parallel co- ordinates plot could help a user detect a strong negative (linear) correlation between two features.
(b) (1 mark) Explain, with the aid of a simple diagram, how a scatter plot could help a user detect a strong negative (linear) correlation between two features.
(c) (1 mark) Which of the two plots do you believe would be more effective for this task? Explain and justify your answer.
6. (2 marks) Consider a dataset D1 with 4000 instances and 10 features. Consider two possible strategies to cluster this dataset, using k = 3.
• (S1) Apply PCA to D1 and retain the top two principal components. Let the resulting 2-dimensional dataset be D2. Apply k-means on D2.
• (S2) Apply k-means directly on D1.
In what circumstances would you expect the first strategy (S1) to be more effective than the second strategy (S2)? In what circumstances would you expect the second strategy (S2) to be more effective than the first strategy (S1)? Explain any assumptions made.
7. (4 marks) Given a dataset with four classes, A, B, C and D, suppose the root node of a decision tree has 125 instances of class A, 355 instances of class B, 200 instances of class C and 500 instances of class D. Consider a candidate split of this root node into four children, using a numeric feature F with values from 0 to 100.
• The first child contains instances with 0 ≤ F < 20 and has class propor- tions (50 class A and 75 class B),
• The second child contains instances with 20 ≤ F < 40 and has class proportions (50 class A and 225 class B),
• The third child contains instances with 40 ≤ F < 60 and has class proportions (25 class A, 30 class B, 50 class C, 100 class D )
• The fourth child contains instances with 60 ≤ F ≤ 100 and has class proportions (25 class B, 150 class C, 400 class D)
Using these numbers, write an expression for computing the utility of this candidate split using the information gain criterion. The expression may be complex and you do not need to simplify it to a single number.
Page 5 of 8
COMP20008 Semester One 2019 Exam
8. A data scientist collects a large number of data pairs (age, height) of people from birth to 80 years old. She computes a Pearson correlation coefficient between age and height.
(a) (1 mark) Would you expect the correlation to be positive or negative? Why?
(b) (1 mark) Would you expect the correlation to be similar in value if it was computed using a different set of people? Explain your answer and any assumptions made.
9. (3 marks) Suppose I have a dataset of 5000 customers (instances). Each cus- tomer is described by 10 features. There is a binary class label of {highvalue, lowvalue}.
Suppose we wish to empirically decide whether using k-NN will be a more accurate model than decision tree, for predicting the value of a new customer. Explain the steps one should follow in order to make this decision. Explain any assumptions made.
10. Business X and Business Y have decided to conduct a joint marketing cam- paign. For this marketing campaign, they first need to determine how many customers they have in common (how many people are in the customer list of both businesses). They involved a trusted 3rd party, Company Z, and imple- mented the following 3-party privacy preserving protocol, making use of the SHA-256 one way hashing function.
#In the following, the ’+’ symbol indicates string concatenation (joining two strings)
#Business X does the following SetX=empty
For each customer at Business X
SetX=SetX ∪ SHA-256(“First Name”+“Last Name”) Send SetX to Company Z
#Business Y does the following SetY=empty
For each customer at Business Y
SetY=SetY ∪ SHA-256(“First Name”+“Last Name”) Send SetY to Company Z
#Company Z does the following result=count(SetX ∩ SetY)
Share result with Business X and Business Y
Page 6 of 8
COMP20008 Semester One 2019 Exam
(a) (1 mark) Suppose Company Z is malicious and intends to compromise the privacy of the data. Explain two possible privacy attacks that Company Z can perform.
(b) (1 mark) Explain what changes to the protocol would be required for resisting these two possible attacks.
(c) (2 marks) According to the protocol, Company Z can only compute ex- act matches between pairs of records from Business X and Business Y. Propose changes to the protocol so that approximate similarity can be computed in a privacy preserving manner by Company Z.
11. In a data de-duplication scenario, blocking can be used to reduce the cost of finding matching entities.
(a) (1 mark) Using the records below, propose a simple blocking strategy that results in at most two records being in the same block, when applied to the four records below.
, 7 , Carlton , 7 , Brunswick Jan Cheng, 7 , Lee, 14 , Carlton
(b) (1 mark) Based on your proposed strategy, show the blocks having two records. (Do not show blocks with only one record.)
(c) (2 marks) Provide two reasons as to why your proposed blocking strategy might perform poorly when applied to a large real dataset.
Page 7 of 8
COMP20008 Semester One 2019 Exam
12. Consider the following dataset which has been anonymised. The quasi-identifiers are {Sex,Age,Postcode} and the sensitive attribute is Diagnosis.
Age Postcode 20-25 318* 20-25 318* 40-45 318* 30-35 318* 20-25 318* 30-35 318* 30-35 318* 40-45 318* 40-45 318* 20-25 318* 30-35 318* 40-45 318*
Diagnosis skin rash flu headache skin rash flu
flu headache headache headache skin rash headache headache
(a) (2 marks) In the context of k-anonymity: Is this data 1-anonymous? Is it 2-anonymous? Is it 3-anonymous? Is it 4-anonymous? Is it 5-anonymous? Is it 6-anonymous? Is it 7-anonymous? Is it 8-anonymous? Explain your answers.
(b) (2 marks) In the context of l-diversity: Is this data 1-diverse? Is it 2- diverse? Is it 3-diverse? Is it 4-diverse? Is it 5-diverse? Is it 5-diverse? Is it 6-diverse? Is it 7-diverse? Is it 8-diverse? Explain your answers.
13. (2 marks) Suppose Bob signs a document with his digital signature. Fred receives the document and changes its contents, but leaves the digital signature unchanged. How could a third party (Alice), know that the document has been modified from its original version, by someone other than Bob?
14. (3marks)ConsiderthefollowingquotefromWilliamMougayaraboutblockchain
“Online identity and reputation will be decentralized. We will own the data that belongs to us”.
Using the example of a blockchain for storing education related data, argue 3 distinct reasons why this quote could be true.
End of Exam
Page 8 of 8
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com