Course code & title : Session : Time allowed :
IS6400 Business Data Analytics Semester B 2020/21
Two hours
CITY UNIVERSITY OF HONG KONG
This paper has 8 pages (including this cover page).
1. This exam consists of FIVE questions.
2. Please answer ALL FIVE questions.
3. Please provide all your answers on a new word document, save it under “your full name_your student ID”, submit it to Canvas under assignment FINAL EXAM.
This is an open-book examination.
Candidates are allowed to use the following materials/aids: Textbook, notes, lecture slides, tutorials
Materials/aids other than those stated above are not permitted. Candidates will be subject to disciplinary action if cheating or plagiarism by communicating with other peers are found.
If you encounter problems during the exam, you should contact:
IS General Office for general matters. Tel: +852 34428521 Instructors for exam question related matters
• Junming Liu Tel: +852 34429323 Skype ID: liujm0731 junmiliu@cityu.edu.hk
~1~
Question 1 (20 points)
A telecommunication company wants to investigate the customer churn phenomenon. Customer churn occurs when customers stop doing business with a company. The telecommunication company has collected some data. Please check the data file Q1.xlsx. The major variables are defined below:
― Name:customer’sname,
― Gender:acustomer’sgender,
― Age:acustomer’sagewhenthedatawascollected,
― PaymentMethod:themethodacustomerpaidthebill,
― Churn:churnstatus,
― LastTransaction:theamountacustomerpaidforhis/herlasttransaction.
The company wants to perform some descriptive analytics and needs your help. Please answer the following questions (keep two decimal places):
(1) In total, how many records (rows) have missing value in the variable “Churn”? Please provide two methods to process missing values. (4 points)
(2) Among all the customers, what is the interquartile range (IQR) of last transaction? (3 points)
(3) Overall, what is the Churn rate (exclude rows with missing values)? (3 points)
(4) Among the churned customers, what is the mean and standard deviation of last transaction? (3 points)
(5) Among the churned customers, how many of them are male, female, respectively? What is the percentage of customers who did not provide gender data? (3 points)
(6) What are the churn rates for the customers who used cash, cheque, and credit card to make payment, respectively? Which payment method has the highest churn rate? (4 points)
~2~
Question 2 (20 points)
A milk company has collected quarterly time series of milk sales, check out Q2milk.csv for the detailed data. The company would like to perform some analysis to predict Q2 2018 sales for planning. Please use the Q2milk.csv to conduct time series analysis and answer the following questions (keep all the answers to two decimal places). You can use ANY software you prefer. Pls submit the code and the screenshot of results for your analysis.
(1) Does the sales data demonstrate a trend? Is the trend upward or downward? Please use linear regression results to justify your answers. Please include a screenshot of your software outputs with timestamp (for MacOS, the time is in the upper right corner; for Windows, the time is in the lower right corner) (3 points).
(2) Build a graph to visualize the sales data. Based on your graph, does the sales data have seasonality? Describe the pattern of the seasonality, i.e. how the sales change over the seasons (3 points).
(3) Use Winter’s Multiplicative Model to forecast the sales of Quarter 2 in 2018. (5 points).
(4) Run the seasonal decomposition with the sales data, using Multiplicative model. What are the seasonal indexes of the Quarter 1 to Quarter 4? (4 points)
(5) Based on decomposition, please generate deseasonalized sales data, after that use Holt’s Model to forecast the deseasonlized sales of Quarter 2 in 2018. Based on the Q2 seasonal index you derived in question 4, please calculate the overall forecast for Q2 2018. (5 points)
~3~
Problem 3 Decision Tree (20 points).
Consider the decision tree shown in Figure 1below, and the corresponding training and validation sets in Tables 1 and 2 respectively.
(1) Estimate the error rate of the tree using both the training error and the pessimistic error considering model complexity. While computing the error with pessimistic approach, to account for model complexity, use a penalty value of 1 to each leaf node. (2 points)
(2) Compute the validation error of the tree using the validation set shown in Table 2. (2 points)
(3) Comment on the behavior of training and validation set errors with respect to model complexity by evaluating the pruned tree models. (8 points)
(4) Is the testing dataset imbalanced? If yes, what evaluation metrics can be used to evaluate the performances, please make comments. (8 points)
~4~
Figure 1 Decision Tree structure for Question 2 Table 1 Training Dataset
Table 2 Validation Dataset
~5~
Question 4 (20 points)
YYBeauty is a cosmetic store. It carries a wide variety of makeup products and tools. Let us assume that YYBeauty has hired you as a consultant to address some questions related to product promoting strategies. You have done an association rules mining, using a data set of 500 transactions. The following figure (Figure 4.1) gives you an idea of what the data set looks like. It only shows 8 transactions out of the 500 transactions.
Figure 4.1: Screenshot of transactional database (total transactions=500)
After conducting the association rules mining, you generate both the frequent item sets (see Figure 4.2) and the association rules (see Figure 4.3). The minimum support is 0.1 and the minimum confidence is 80%.
Size Support Item 1 Item 2 Item 3 Item 4
1
0.564
Concealer
1
0.424
Bronzer
1
0.412
Blush
1
0.348
Nail Polish
1
0.22
Brushes
2
0.28
Concealer
Bronzer
2
0.304
Concealer
Blush
2
0.218
Concealer
Nail Polish
2
0.154
Concealer
Brushes
2
0.218
Bronzer
Blush
2
0.206
Bronzer
Nail Polish
2
0.168
Bronzer
Brushes
2
0.164
Blush
Nail Polish
2
0.22
Nail Polish
Brushes
3
0.178
Concealer
Bronzer
Blush
3
0.152
Concealer
Bronzer
Nail Polish
3
0.124
Concealer
Bronzer
Brushes
3
0.136
Concealer
Blush
Nail Polish
3
0.154
Concealer
Nail Polish
Brushes
3
0.116
Bronzer
Blush
Nail Polish
3
0.168
Bronzer
Nail Polish
Brushes
4 0.124 Concealer Bronzer Nail Polish Brushes
Figure 4.2: Frequent Item Sets
~6~
# Antecedents Consequents Support Confidence Lift Ratio
1
Concealer, Nail Polish, Brushes
Bronzer
0.12
0.81
1.90
2
Concealer, Brushes
Bronzer
0.12
0.81
1.90
3
Concealer, Brushes
Bronzer, Nail Polish
0.12
0.81
3.91
4
Concealer, Brushes
Nail Polish
0.15
1.00
2.87
5
Concealer, Bronzer, Nail Polish
Brushes
0.12
0.82
3.71
6
Concealer, Bronzer, Brushes
Nail Polish
0.12
1.00
2.87
7
Brushes
Nail Polish
0.22
1.00
2.87
8
Bronzer, Nail Polish
Brushes
0.17
0.82
3.71
9
Bronzer, Brushes
Nail Polish
0.17
1.00
2.87
10
Bronzer, Blush
Concealer
0.18
0.82
1.45
11
Blush, Nail Polish
Concealer
0.14
0.83
1.47
Figure 4.3: Association Rules
Based on the above frequent item sets and association rules, answer the following questions:
(1) If a consumer purchased Concealer and Brushes already, what item will the customer
purchase with the highest probability? Justify your recommendations. (5 points)
(2) Please explain Support, Confidence, and Lift Ratio and use them to interpret the 4th Association Rule. (10 points)
(3) If you want to do a four-product bundle, which products should you put into the bundle and why? (5 points)
~7~
Question 5 (20 points)
Bitcoin is a cryptocurrency. Its price has increased dramatically in the past five years. The Q4 BTC-USD.csv contains Bitcoin close prices from 30/04/2013 to 30/04/2018 in column “Close.” Please use this data set to answer the following questions:
(1) If you are asked to use autoregressive models to generate foresting, which variable do you think will work better as dependent variable or variable processing. Why? (5 points)
(a) The original value of Bitcoin prices Closet, or
(b) The first order differenced value D_Closet =𝐶𝑙𝑜𝑠𝑒& − 𝐶𝑙𝑜𝑠𝑒&().
(2) Pls use an autoregressive method to forecast for Bitcoin Prices in the next five trading days (keep the answers to two decimal places) (5 points). Pls submit the code and the screenshot of results for your analysis.
Trading date Forecast
(3) Design a rolling-forecasting evaluation method to evaluate the “next-day” prediction performance. Please illustrate the validation dataset, evaluation metrics and calculation, and model comparison in your method. (This is a method design question; NO programming is needed.) (10 points)
May 1, 2018
May 2, 2018
May 3, 2018
May 4, 2018
May 5, 2018
– END –
~8~