Section A: Short Answer Questions
Questions 1-10 – Total 24 marks
*** Answer ALL questions in this section ***
(Answers should be fewer than five (5) sentences)
Question 1.
Briefly define what data science is (one sentence), and name two aspects of data science. [2 marks]
Question 2.
Explain briefly the difference between interval and ratio data. [2 marks]
Question 3.
Briefly explain what “data dimension reduction” means in data pre-processing. Why is it important? [2 marks]
Question 4.
Both the mean and the median are typical ways to summarise the “average” of a data set. List two ways to explore the “spread” of the data. [2 marks]
Question 5.
Briefly explain the overfitting issue in classification. [2 marks]
Question 6.
There are several ways to handle missing values in data pre-processing. List three among them. [3 marks]
Question 7.
What are the roles of k in k-NN classification and in k-means clustering? [2 marks]
Question 8.
Define the terms ‘support’ and ‘confidence’ as they relate to association rule mining. Are the ‘support’ and ‘confidence’ of a rule X→Y the same as those of the rule Y→X? Why or why not? [4 marks]
Question 9.
Briefly describe the “bootstrap” data partition method in classification. [2 marks]
Question 10.
Briefly describe the three different types of clustering criteria that are used for hierarchical clustering analysis. [3 marks]
SECTION B BEGINS NEXT PAGE
Section B: Application Questions
Questions 11-13. Total 30 marks
*** Answer ALL questions in this section ***
Question 11.
The performance of a classifier is given in the confusion matrix below. The training dataset consisted of 1000 patient records from a medical study aimed at identifying whether a patient has a rare disease. [8 marks]
(a) Calculate the following statistics for the above confusion matrix:
1. Accuracy of the classification
2. Overall, how often was it wrong?
3. When it predicts yes, how often is it correct?
4. When it’s actually no, how often does it predict yes?
5. When it’s actually no, how often does it predict no?
6. When it’s actually yes, how often does it predict yes?
(b) Looking at the accuracy, sensitivity and specificity, what can you say about the classifier’s performance?
Question 12.
Draw the dendrogram based on the distance matrix given below by applying the agglomerative hierarchical clustering algorithm with the single-link technique. The distances in the table below were calculated using the Manhattan distance function. Show all the steps up to the final stage. [8 marks]
Question 13.
The table shows the Car Theft data set, consisting of 16 instances. Each car has four attributes, namely Colour, Type, Origin and Drive, and a class attribute Stolen. These attributes determine whether the car is classified as stolen. Use Naïve Bayesian classification to predict the class of the last data point. Perform any necessary adjustments to the data in your evaluation. [14 marks]
END OF EXAMINATION