THIS PAPER IS FOR STUDENTS STUDYING AT: (tick where applicable)
[x] Caulfield [ ] Clayton [ ] Parkville [ ] Peninsula [ ] Monash Extension [ ] Off Campus Learning [ ] Malaysia [ ] Sth Africa [ ] Other (specify)
Office Use Only
2018/2019 Summer Semester Examination Period (February 2019)
Faculty of Business and Economics
EXAM CODES: ETX2250 / ETF5922
TITLE OF PAPER: Data Visualisation and Analytics
EXAM DURATION: 2 hours writing time
READING TIME: 10 minutes
During an exam, you must not have in your possession any item/material that has not been authorised for your exam. This includes books, notes, paper, electronic device/s, mobile phone, smart watch/device, calculator, pencil case, or writing on any part of your body. Any authorised items are listed below. Items/materials on your desk, chair, in your clothing or otherwise on your person will be deemed to be in your possession.
No examination materials are to be removed from the room. This includes retaining, copying, memorising or noting down content of exam material for personal use or to share with any other person by any means following your exam.
Failure to comply with the above instructions, or attempting to cheat or cheating in an exam is a discipline offence under Part 7 of the Monash University (Council) Regulations, or a breach of instructions under Part 3 of the Monash University (Academic Board) Regulations.
AUTHORISED MATERIALS
OPEN BOOK [ ] YES [x] NO
CALCULATORS [x] YES [ ] NO
(If YES, only a HP 10bII+ calculator is permitted, except at Malaysia and South Africa campuses where an ‘approved for use’ Faculty label is permitted)
SPECIFICALLY PERMITTED ITEMS [x] YES [ ] NO
If yes, items permitted are: Ruler
Candidates must complete this section if required to write answers within this paper
STUDENT ID: __ __ __ __ __ __ __ __ DESK NUMBER: __ __ __ __ __
Page 1 of 8
Background for Questions 1 and 2
The following scenario is used in Questions 1 and 2.
A researcher is interested in comparing economic and health indicators across countries in Africa, Asia and the Middle East based on data from the World Bank. The data used consists of a data frame countries.df:
• GDP: Per capita Gross Domestic Product, in adjusted 2011 U.S. Dollars.
• LaborRate: Labor force participation rate.
• HealthExp: Health expenditures in U.S. Dollars.
• InfMortality: Infant mortality per 1000 live births.
• RegionName: Region, taking values Africa, Asia, Middle East.
• Name: Name of country.
Question 1 [3 + 3 + 4 + 5 + 5 + 5 = 25 marks]
Use Figures 1.1 through 1.3 to answer the following questions.
a) What is the correlation between infant mortality per 1000 live births and GDP in Asia?
b) Approximately what are the median infant mortality per 1000 live births and the median GDP in Asia?
c) Discuss a prominent outlier in the Africa data which is apparent in Figure 1.3. Explain what you can determine about this outlier using information from any relevant graphs.
d) Using Figure 1.1, discuss the relationship between Health Expenditure and Infant Mortality.
e) Which graph would you use to highlight the difference in infant mortality between Africa and the other two regions? Discuss your chosen graph in detail.
f) Write down the ggplot command for creating Figure 1.2 by using the variable names from the background of this section.
Question 2 [5 + 5 + 5 = 15 marks]
a) Table 2.1 represents a subset of the data for 3 countries in Africa. Suppose subtab.df is the data frame containing the columns of Table 2.1. Implement by hand the following command, thus rewriting this data in long form:
gather(data = subtab.df, key = 'Measurement', value = 'Quantvalue', -Name)
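To illustrate what a wide-to-long reshape of this kind does, here is a minimal sketch in plain Python rather than R. The function name gather, the country names and all numbers are hypothetical and are not the contents of Table 2.1; the column-by-column stacking order is intended to mirror how tidyr's gather orders its output.

```python
def gather(rows, id_col, key="Measurement", value="Quantvalue"):
    """Reshape wide rows (a list of dicts) into long form, keeping id_col fixed."""
    value_cols = [c for c in rows[0] if c != id_col]
    long_rows = []
    # Stack one source column at a time, so all rows for the first
    # measurement come before any row for the second measurement.
    for col in value_cols:
        for row in rows:
            long_rows.append({id_col: row[id_col], key: col, value: row[col]})
    return long_rows


# Hypothetical wide-form data (NOT the values of Table 2.1).
wide = [
    {"Name": "CountryA", "GDP": 1000.0, "HealthExp": 50.0},
    {"Name": "CountryB", "GDP": 2000.0, "HealthExp": 80.0},
]

long_form = gather(wide, id_col="Name")
for record in long_form:
    print(record)
```

Each wide row contributes one long row per measurement column, so 2 countries with 2 measurements give 4 long rows.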
b) Suppose you have Table 2.2 as input. Write down the dplyr commands for calculating the average value for each measurement across the regions. The input data frame is called “subtab.df”.
c) Suppose Table 2.2 is stored in a database under the table name public.subset. Write down the SQL commands for calculating the average value for each measurement across the regions.
Table 2.1
Table 2.2
Question 3 [(5 + 2 + 3) + 5 + 5 + 5 = 25 marks]
a) The data set europe.csv provides the values of economic indicators in Europe, as shown in Table 3.1. Answer the following questions about the code in Figure 3.1.
i) Describe the results of implementing the code in Figure 3.1.
ii) Why is the scale command used? What does it do?
iii) Figure 3.2 is a plot of ss.df$ss against ss.df$k. What does this graph tell you? Which k would you choose and why?
Figure 3.1
Table 3.1
Figure 3.2
b) In the context of hierarchical cluster analysis, explain what a linkage method measures. Explain the two linkage methods “single” and “complete”.
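As a concrete illustration of the two linkage definitions, the following is a minimal sketch in Python, assuming the pairwise distances between points A to F tabulated for part c). The names DIST, single_link, complete_link and agglomerate are illustrative only, and ties are broken by taking the first minimum found, which is an arbitrary choice.

```python
from itertools import combinations

LABELS = ["A", "B", "C", "D", "E", "F"]
# Pairwise Euclidean distances, copied from the table used in part c).
MATRIX = [
    [0, 2, 9, 12, 6, 8.5],
    [2, 0, 7, 10, 5, 7],
    [9, 7, 0, 3, 6, 4],
    [12, 10, 3, 0, 8, 6],
    [6, 5, 6, 8, 0, 2],
    [8.5, 7, 4, 6, 2, 0],
]
DIST = {
    (LABELS[i], LABELS[j]): MATRIX[i][j]
    for i in range(len(LABELS))
    for j in range(len(LABELS))
}


def single_link(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(DIST[(p, q)] for p in c1 for q in c2)


def complete_link(c1, c2):
    """Complete linkage: distance between the farthest pair of members."""
    return max(DIST[(p, q)] for p in c1 for q in c2)


def agglomerate(linkage):
    """Repeatedly merge the two closest clusters; return (a, b, height) per merge."""
    clusters = [(label,) for label in LABELS]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        height = linkage(a, b)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))
    return merges


single_merges = agglomerate(single_link)
complete_merges = agglomerate(complete_link)
for a, b, height in single_merges:
    print("single  ", a, b, height)
for a, b, height in complete_merges:
    print("complete", a, b, height)
```

The merge heights returned by agglomerate are the values that would be read off the vertical scale of the corresponding dendrogram.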
c) The Euclidean distances between points A, B, C, D, E and F are shown in Table 3.4. Draw a reasonably accurate dendrogram, including vertical scale, that corresponds to a hierarchical collection of clusters for these points, using single linkage.
Table 3.2
      A     B     C     D     E     F
A     0     2     9    12     6   8.5
B     2     0     7    10     5     7
C     9     7     0     3     6     4
D    12    10     3     0     8     6
E     6     5     6     8     0     2
F   8.5     7     4     6     2     0
d) Suppose you are performing a cluster analysis using a data set consisting of ten binary variables V1 to V10. Two of the cases are shown in rows C1 and C2 of Table 3.2.
     V1  V2  V3  V4  V5  V6  V7  V8  V9  V10
C1    1   1   1   0   1   1   1   1   0    0
C2    0   1   0   0   1   0   1   0   1    0
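As background for the sub-parts of d), the following is a minimal sketch in Python of the two similarity measures for a pair of binary cases, using the rows C1 and C2 given for this part. The function name binary_similarity is illustrative. It assumes the usual definitions: simple matching counts agreements on both 1 and 0, while Jaccard ignores positions where both cases are 0.

```python
def binary_similarity(x, y):
    """Return (simple_matching, jaccard) for two equal-length 0/1 vectors."""
    both_one = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    both_zero = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    mismatches = sum(1 for p, q in zip(x, y) if p != q)
    simple_matching = (both_one + both_zero) / len(x)
    # Jaccard leaves joint absences (0, 0) out of the denominator.
    denominator = both_one + mismatches
    jaccard = both_one / denominator if denominator else 0.0
    return simple_matching, jaccard


# Cases C1 and C2 over variables V1 to V10.
C1 = [1, 1, 1, 0, 1, 1, 1, 1, 0, 0]
C2 = [0, 1, 0, 0, 1, 0, 1, 0, 1, 0]

smc, jac = binary_similarity(C1, C2)
print("simple matching:", smc)
print("Jaccard:", jac)
```

A common convention, assumed here rather than stated in the paper, is to convert either similarity s into a distance as 1 - s.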
a. Calculate the simple matching and Jaccard measures of similarity for these two cases.
b. Also explain how you would decide which of these two measures to use.
c. How do the similarity measures relate to measures of distance between data points?
Question 4 [(2 + 2 + 2) + (3 + 3 + 4 + 4) = 20 marks]
We consider data from the U.S. Bureau of Transportation Statistics to predict whether an accident will result in injuries, based on five initial factors that are recorded in the emergency call. The goal is to optimise the decision of whether to send an ambulance or only the fire brigade.
The variables from the codebook are:
• vehl_invl: Number of vehicles involved
• alchl_i: Alcohol involved = 1, not involved = 2
• mancol_i_r: 0=no collision, 1=head-on, 2=other form of collision
• rel_rwy_r: 1=accident on roadway, 0=not on roadway
• spd_lim: Speed limit, miles per hour
Half the data was randomly selected as training data, and the tree in Figure 4.1 was constructed:
Figure 4.1
(a) Consider the process of building the initial tree.
i) At the root node, for example, the classification tree algorithm splits the data based on whether the number of cars involved is below or above 3. How is this choice made?
ii) Explain how we choose the value of the target variable to assign to a leaf.
iii) What does this mean for any accident report with at least 5 cars involved?
(b) Now consider the completed trees.
i) How many terminal nodes are in the tree?
ii) How would you reach the second last terminal node from the right? (TRUE .41 .59)
iii) What are the variables and their values to visit the decision node spd_lim >= 48?
iv) Write down the decision rules for this tree.
v) Calculate the accuracy for the confusion matrix on the test set. Is it a good classifier? Explain why.
                 Predicted
                 False   True
Actual  False      414     84
        True       347    155
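The accuracy calculation can be sketched in Python as follows, assuming the rows of the confusion matrix are the actual classes and the columns are the predicted classes. Accuracy itself does not depend on that orientation because it uses only the diagonal, but the true-positive rate shown alongside it does. The variable names are illustrative.

```python
# confusion[actual][predicted], assuming rows = actual and columns = predicted.
confusion = {
    False: {False: 414, True: 84},
    True: {False: 347, True: 155},
}

total = sum(count for row in confusion.values() for count in row.values())
correct = confusion[False][False] + confusion[True][True]

# Accuracy: share of all test cases on the diagonal.
accuracy = correct / total

# True-positive rate: share of actual True cases predicted True
# (valid only under the row/column assumption stated above).
true_positive_rate = confusion[True][True] / sum(confusion[True].values())

print("total cases:", total)
print("accuracy:", accuracy)
print("true-positive rate:", round(true_positive_rate, 3))
```

Comparing the accuracy with the share of the larger class gives a simple baseline for judging whether the classifier adds much beyond always predicting one outcome.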
Question 5 [6 + 6 + 3 = 15 marks]
An online provider of statistics courses is interested in assessing alternative sequencing and combinations of courses, and therefore wishes to conduct association analysis on its data for past students. Table 5.1 shows a sample of their data, with each row representing an individual student and each column representing a statistics course that they offer as identified by the column headings.
Table 5.1
ID  Intro  Expt  StatWrite  Survey design
 1      1     0          0              0
 2      0     1          0              1
 3      0     1          1              1
 4      1     0          0              0
 5      1     0          0              0
 6      0     0          0              0
 7      1     0          0              0
 8      0     1          1              0
 9      1     0          0              0
10      0     0          0              0
11      1     0          0              0
12      0     0          0              0
13      0     0          0              0
14      0     0          0              1
15      0     0          0              1
16      1     1          1              1
17      1     0          0              0
18      1     0          0              0
19      1     0          0              0
20      0     0          0              0
21      0     0          1              1
22      0     0          0              0
23      1     0          1              1
24      1     0          0              0
25      1     0          1              0
26      0     1          1              0
27      0     1          1              0
28      1     0          0              0
Consider the association rule {Forecast, Regression} → {DataMining}.
(a) Based on the sample provided in Table 5.1, calculate for this association rule:
• The support of the association rule itemset
• The confidence of the association rule
• The lift of the association rule.
(b) Interpret each of the numbers calculated in (a) in relation to the present application and explain any role they may have in assessing the usefulness of the association rule.
(c) If you find out that a student has taken the Forecast and Regression courses, does this make it more likely that they will take the DataMining course? If so, how much more likely?
DataMining  Cat Data  Regression  Forecast
         1         0           0         0
         0         1           0         0
         1         1           1         0
         0         0           0         0
         1         0           0         0
         1         0           1         1
         0         0           0         0
         0         1           0         1
         0         0           0         0
         0         1           1         0
         0         0           0         0
         1         0           0         0
         1         0           0         0
         1         0           0         1
         1         0           1         1
         0         1           0         1
         0         0           1         0
         0         0           0         1
         0         0           0         0
         0         0           1         0
         0         1           0         0
         1         0           1         1
         0         1           0         0
         0         0           1         1
         0         0           0         0
         1         0           1         1
         1         1           1         0
         1         0           1         1
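The three rule measures in Question 5 can be sketched in Python as follows. The five student records below are hypothetical and are not taken from Table 5.1; only the course names and the rule {Forecast, Regression} → {DataMining} come from the question. The function name rule_metrics is illustrative, and the sketch assumes the antecedent and consequent each occur in at least one record.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n                    # share of records containing the whole itemset
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # confidence relative to the base rate
    return support, confidence, lift


# Hypothetical course histories, one set per student (NOT the data of Table 5.1).
students = [
    {"Forecast", "Regression", "DataMining"},
    {"Forecast", "Regression"},
    {"DataMining"},
    {"Intro"},
    {"Forecast", "Regression", "DataMining", "Intro"},
]

support, confidence, lift = rule_metrics(
    students, antecedent={"Forecast", "Regression"}, consequent={"DataMining"}
)
print("support:", support)
print("confidence:", round(confidence, 3))
print("lift:", round(lift, 3))
```

A lift above 1 indicates that the consequent is more common among records containing the antecedent than in the data overall.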
*** END OF EXAMINATION ***