161.777 Practical Data Mining
Semester 1, 2022
Assignment 2
Part A: Cluster Analysis [15 marks]
The DUNGAREE data set gives the number of pairs of four different types of dungarees sold at stores over a specific time period. Each row represents an individual store. There are six columns in the data set. One column is the store identification number, and the remaining columns contain the number of pairs of each type of jeans sold.
Name       Description
STOREID    Identification number of the store
FASHION    Number of pairs of fashion jeans sold at the store
LEISURE    Number of pairs of leisure jeans sold at the store
STRETCH    Number of pairs of stretch jeans sold at the store
ORIGINAL   Number of pairs of original jeans sold at the store
SALESTOT   Total number of pairs of jeans sold (the sum of FASHION, LEISURE, STRETCH, and ORIGINAL)
Create a diagram and enter the DUNGAREE data.
1) Assign the variable STOREID the model role ID and the variable SALESTOT the model role Rejected. Be sure that the remaining variables have the Input model role and the Interval measurement level. Why should the variable SALESTOT be rejected? [1 mark]
2) Examine the distributions of the variables. Are there any unusual data values? Are there missing values that should be replaced? Report any changes you made to the data, and justify your actions. [2 marks]
3) There are two ways to standardise in SAS EM: you can use the Standardize (or Range) function in the Transform Variables node, or the Standardization option within the Cluster node.
a) Examine the help files for the Transform Variables and Cluster nodes. Explain how they each standardise the variables and explain any difference between them. [2 marks]
b) Do you think standardisation of the variables is wise in this case? What might happen if you did not standardise the inputs? [2 marks]
4) Run the cluster analyses using the inbuilt standardisation in the Cluster node and the automated method of choosing k.
a) Briefly explain how SAS EM automatically chooses k. [2 marks]
b) How many classes are returned in this case? Provide a figure that justifies this number.
5) Add a Segment Profile node and provide a set of histograms showing the distributions of the variables within each segment. Pick one segment and explain how it differs from the overall dataset. [4 marks]
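As background for question 3, the two generic rescalings it mentions can be sketched outside SAS EM. The short Python example below applies z-score standardisation and range (min-max) scaling to a few made-up store rows and compares a Euclidean distance before and after scaling. The function names and the data are hypothetical, and the code makes no claim about what the Transform Variables or Cluster nodes compute internally.

```python
from math import dist
from statistics import mean, stdev


def zscore(values):
    """Centre on the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def range_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical counts for five stores; NOT taken from the DUNGAREE data.
fashion = [10, 20, 30, 40, 50]
original = [1000, 3000, 2000, 5000, 4000]

raw = list(zip(fashion, original))
scaled = list(zip(zscore(fashion), zscore(original)))

# Euclidean distance between the first two stores, before and after scaling.
raw_distance = dist(raw[0], raw[1])
scaled_distance = dist(scaled[0], scaled[1])

print(range_scale(fashion))
print(round(raw_distance, 2), round(scaled_distance, 2))
```

In this toy data the raw distance is driven almost entirely by the variable with the larger counts, while after z-scoring both variables contribute on a comparable scale.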
Part B: Market Basket Analysis [10 marks]
The BANK data set contains service information for nearly 8,000 customers. There are three variables in the data set, as shown in the table below.

Name      Model Role   Measurement Level   Description
ACCOUNT   ID           Nominal             Account Number
SERVICE   Target       Nominal             Type of Service
VISIT     Sequence     Ordinal             Order of Product Purchase

The BANK data set has over 32,000 rows. Each row of the data set represents a customer-service combination. Therefore, a single customer can have multiple rows in the data set, each row representing one of the products he or she owns. The median number of products per customer is three.
The 13 products represented in the SERVICE variable use the following abbreviations:
ATM automated teller machine debit card
AUTO automobile instalment loan
CCRD credit card
CD certificate of deposit
CKCRD check/debit card
CKING checking account
HMEQLC home equity line of credit
IRA individual retirement account
MMDA money market deposit account
MTG mortgage
PLOAN personal/consumer instalment loan
SVG saving account
TRUST personal trust account
Set up the data source node as above, BUT set the VISIT role to Rejected. Run the Association node (using default settings) and use the results to answer the following questions. (You may also need to Explore the variables.)
1) What is the most common type of bank service? [1 mark]
2) What two types of bank service occur most commonly together? Does this give a useful rule? Discuss the support, confidence levels and lift for this combination of bank service.
3) What is the probability of a customer with a checking account (CKING) also having a credit card? [1 mark]
4) What is the probability of a customer with a mortgage (MTG) also having a saving account?
5) To what extent do having a mortgage and a checking account increase the probability of having a saving account?
6) Produce a link graph showing the most useful rules. Explain how you decided what the most useful rules are and how they are shown on the link graph. [2 marks]
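The questions above rely on support, confidence and lift. As a reminder of the standard definitions, the Python sketch below computes all three from rows laid out like BANK, with one row per account-service pair. The rows, account numbers and function names are invented for illustration and say nothing about the real data or about how the Association node is implemented.

```python
from collections import defaultdict

# Hypothetical (account, service) rows in the same long format as BANK;
# the values are invented for illustration only.
rows = [
    (1, "CKING"), (1, "SVG"), (1, "ATM"),
    (2, "CKING"), (2, "SVG"),
    (3, "CKING"), (3, "CCRD"),
    (4, "SVG"),
    (5, "CKING"), (5, "SVG"), (5, "CCRD"),
]

# Collapse the rows into one set of services per account.
baskets = defaultdict(set)
for account, service in rows:
    baskets[account].add(service)


def support(items):
    """Fraction of accounts holding every service in `items`."""
    items = set(items)
    return sum(items <= held for held in baskets.values()) / len(baskets)


def confidence(lhs, rhs):
    """Estimated P(rhs | lhs): support of both sides over support of lhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """Confidence of lhs => rhs divided by the baseline support of rhs."""
    return confidence(lhs, rhs) / support(rhs)


print(round(support({"CKING", "SVG"}), 4))
print(round(confidence({"CKING"}, {"SVG"}), 4))
print(round(lift({"CKING"}, {"SVG"}), 4))
```

A lift above 1 means the left-hand side makes the right-hand side more likely than its baseline rate; a lift near or below 1, as in this toy data, means the rule adds little even when its confidence looks high.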