Data Resource Management
Data Mining Homework
Your task is to produce a list of prospective customers on whom your corporation will spend a significant amount of its financial resources and a substantial share of its limited staff time.
Using the three data mining models we have already built in our labs, build a predictive model that includes the dependent variable, the independent variables, and the values for those variables. Specifically:
Your dependent (predicted) variable is Bike Buyer.
The independent variables that should be included in your model are:
Age
CommuteDistance
EnglishOccupation
EnglishEducation
Gender
HouseOwner
MaritalStatus
NumberCarsOwned
NumberChildrenAtHome
TotalChildren
YearlyIncome
You have already built three data mining models using three different data mining algorithms. These models have already proposed to you some important factors/variables such as EnglishOccupation, NumberCarsOwned, NumberChildrenAtHome, and others.
The three mining models have already proposed values for those variables as well. For example, EnglishOccupation = Professional, NumberCarsOwned = 0, and NumberChildrenAtHome = 0 or 2.
Reflecting on the variables and values proposed by the three mining models, assign your own values to these variables, values that you think make sense and will maximize the effectiveness of your model. For example, for EnglishEducation, you might want to include prospective customers with Bachelors and Masters degrees. For NumberCarsOwned, you might want to include the values of 1 and 2.
After you have decided on the variables to include in your final model as well as their assigned values, create an Excel workbook. In one worksheet, enter your model variables and indicate their associated values. In one column, list the variable, underneath it list its values, and right next to it the explanation or rationale for the selection of the particular values. It should look like the following:
NumberCarsOwned
(1, 2) why?
Name this worksheet “ModelAndVariables”
Now, connect to SQL Server and the AdventureWorks database and, using SQL, retrieve from the ProspectiveBuyer table the records that satisfy the model you have developed. That is, you will use the variable values as criteria for your query.
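As an illustration only, here is a minimal sketch of such a query; the criteria below are hypothetical placeholders, so substitute the variables and values you actually chose for your model:
SELECT ProspectiveBuyerKey, FirstName, LastName, AddressLine1, AddressLine2
FROM dbo.ProspectiveBuyer
WHERE NumberCarsOwned IN (1, 2)                                  -- hypothetical value set
  AND NumberChildrenAtHome = 0                                   -- hypothetical value set
  AND DATEDIFF(YYYY, BirthDate, GETDATE()) BETWEEN 30 AND 45;    -- hypothetical age range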
Once you have the result set, paste it into a second worksheet in the Excel workbook you have already created. Name this worksheet “ProposedCustomerList”.
Name the workbook “YourName_DataMiningHomework”.
Data Mining Lab 2
TASKS IN THIS LAB
Create the clustering mining model
Create the Naive Bayes mining model
Explore the Cluster Diagram
Explore Cluster Characteristics
Explore Cluster Discrimination
Explore the Dependency Network
Explore the model in the Attribute Profiles tab
Explore the Attribute Characteristics
Explore the Attribute Discrimination tab
Create a lift chart for the models
Make a copy of the Decision Tree Model
Create a case filter on a mining model
Explore the filtered models
Test the lift of the filtered models
Create a calculated column in the prospective buyers table
Creating a DMX Prediction Query
Design the prediction query
Run the query and view results
PART I – ADDING A CLUSTERING AND NAÏVE BAYES MODEL
The Clustering model groups your customers by attributes that influence their bike buying behavior. The Naive Bayes model enables you to explore relationships among different attributes.
Create the Clustering Mining Model
- Click the Mining Models tab in Data Mining Designer in SQL Server Data Tools (SSDT). Notice that the designer displays two columns, one for the mining structure and one for the TrainingModelDecisionTrees mining model, which you have already created:
- Right-click the Structure column and select New Mining Model.
- In the New Mining Model dialog box, in Model name, type TrainingModelClustering.
- In Algorithm name, select Microsoft Clustering.
- Click OK.
The new model now appears in the Mining Models tab of Data Mining Designer. This model, built with the Microsoft Clustering algorithm, groups customers with similar characteristics into clusters and predicts bike buying behavior for each cluster.
Create the Naive Bayes mining model
- Right-click the Structure column, and select New Mining Model.
- In the New Mining Model dialog box, under Model name, type TrainingModelNaiveBayes.
- In Algorithm name, select Microsoft Naive Bayes, then click OK.
A message appears stating that the Microsoft Naive Bayes algorithm does not support the Age and Yearly Income columns, which are continuous variables.
- Click Yes to acknowledge the message and continue.
Your environment should now look like this:
- Deploy your models to the server. In the Mining Model menu, select Process Mining Structure and All Models. You might need to enter your user ID and password for accessing the SQL Server database.
- Click Run in the Process Mining Structure – Targeted Mailing dialog box. The Process Progress dialog box opens to display the details of model processing. Model processing might take some time, depending on your computer.
PART II – EXPLORING THE CLUSTERING MODEL
The clustering algorithm groups cases into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.
Explore the Cluster Diagram
- Click the Mining Model Viewer tab. You might be asked to reprocess your models; select OK. By default, the designer opens to the first model that was added to the structure, in this case TrainingModelDecisionTrees.
- At the top left of the Mining Model Viewer tab, click the Mining Model list and select TrainingModelClustering.
- In the adjacent Viewer drop-down menu, select Microsoft Cluster Viewer if it is not already selected.
- In the Shading Variable box, select Bike Buyer.
- Select 1 in the State box to explore those cases where a bike was purchased. The Density legend tells us that the cluster with the darkest shading has the highest percentage of bike buyers.
- Move your mouse over the cluster with the darkest shading. A tooltip displays the percentage of cases that have the attribute Bike Buyer = 1.
- Select the cluster that has the highest density, right-click the cluster, select Rename Cluster and type Bike Buyers High for later identification. Click OK.
- Find the cluster that has the lightest shading (and the lowest density). Right-click the cluster, select Rename Cluster and type Bike Buyers Low. Click OK.
- Click the Bike Buyers High cluster and drag it to an area of the pane that gives you a clear view of its connections to the other clusters. When you select a cluster, the lines that connect it to other clusters are highlighted, so that you can easily see all the relationships for this cluster. When the cluster is not selected, the darkness of the lines tells you how strong the relationships are among all the clusters in the diagram. If the shading is light or nonexistent, the clusters are not very similar.
- Use the slider to the left of the network to filter out the weaker links and find the clusters with the closest relationships. The corporate marketing department might want to combine similar clusters when determining the best method for marketing to these customers.
Explore Cluster Profiles
The Cluster Profiles tab provides an overall view of the TrainingModelClustering model. The Cluster Profiles tab contains a column for each cluster in the model. The first column (Variables) lists the attributes that are associated with the clusters. The States column represents the values for each particular attribute. For example, the HouseOwner flag has two states, 0 and 1, plus a missing state. A value of 1 means the customer owns a house, while 0 means he or she does not.
The distribution of a discrete variable/attribute is shown as a colored bar, with the maximum number of bars displayed controlled by the Histogram bars list. Continuous variables/attributes are displayed with a diamond chart, which represents the mean and standard deviation in each cluster.
- Set Histogram bars to 5. In our model, 5 is the maximum number of states for any one variable.
- If the Mining Legend blocks the display of the Attribute profiles, move it out of the way.
- Select the Bike Buyers High column and drag it to the right of the Population column.
- Select the Bike Buyers Low column and drag it to the right of the Bike Buyers High column.
- If you need to, adjust the widths of the columns so that you can see the full name of the cluster at the top.
- Click the Bike Buyers High column. The Variables column is sorted in order of importance for that cluster. Scroll through the column and review the characteristics of the Bike Buyers High cluster. For example, they are more likely to have a short commute.
- Right-click the Age cell in the Bike Buyers High column and select Show Legend. The Mining Legend displays a more detailed view, and you can see the age range of these customers as well as the mean age.
- Right-click the Bike Buyers Low column and select Hide Column. As soon as you do, a “more clusters” drop-down menu appears above the cluster profiles. You can use it to display the cluster again.
Explore Cluster Characteristics
With the Cluster Characteristics tab, you can examine in more detail the characteristics that make up a cluster. For example, if you select Bike Buyers High from the Cluster list, you can see the characteristics of the customers in this cluster.
Explore Cluster Discrimination
With the Cluster Discrimination tab, you can explore the characteristics that distinguish one cluster from another. After you select two clusters, one from the Cluster 1 list, and one from the Cluster 2 list, the viewer calculates the differences between the clusters and displays a list of the attributes that distinguish the clusters most.
- In the Cluster 1 box, select Bike Buyers High.
- In the Cluster 2 box, select Bike Buyers Low.
- Click Variables to sort alphabetically.
Some of the more substantial differences between the customers in the Bike Buyers Low and Bike Buyers High clusters include age, car ownership, number of children, and region.
PART III – EXPLORING THE NAIVE BAYES MODEL
The Microsoft Naive Bayes algorithm provides several methods for displaying the interaction between bike buying and the input attributes.
Explore the Dependency Network
- At the top left of the Mining Model Viewer tab, click the Mining Model list and select TrainingModelNaiveBayes.
- Use the Viewer list to switch to Microsoft Naive Bayes Viewer.
- Click the Bike Buyer node to identify its dependencies. The brownish shading indicates that all of the attributes have an effect on bike buying.
- Adjust the slider to identify the most influential attribute. As you lower the slider, only the attributes that have the greatest effect on the [Bike Buyer] attribute remain.
Explore the model in the Attribute Profiles tab
- In the Predictable box, verify that Bike Buyer is selected. If the Mining Legend is blocking the display of the Attribute profiles, move it out of the way.
- In the Histogram bars box, select 5. In our model, 5 is the maximum number of states for any one variable. The attributes that affect the state of this predictable attribute are listed together with the values of each state of the input attributes and their distributions in each state of the predictable attribute.
- In the Attributes column, find Number Cars Owned. Notice the differences in the histograms for bike buyers (column labeled 1) and non-buyers (column labeled 0). A person with zero or one car is much more likely to buy a bike.
- Right-click the Number Cars Owned cell in the bike buyer (column labeled 1) column and select Show Legend. The Mining Legend displays a more detailed view.
Explore the Attribute Characteristics
With the Attribute Characteristics tab, you can select an attribute and value from the two drop down boxes to visualize the top factors.
- In the Attribute list, verify that Bike Buyer is selected.
- Set the Value to 1. In the viewer, you will see that customers who have no children at home, have short commutes, and live in the North America region are more likely to buy a bike.
Explore the Attribute Discrimination tab
With the Attribute Discrimination tab, you can investigate the relationship between two discrete values of bike buying and other attribute values. Because the TrainingModelNaiveBayes model has only two states, 1 and 0, you do not have to make any changes to the viewer. In the viewer, you can see that people who do not own cars tend to buy bicycles, and people who own two cars tend not to buy bicycles.
PART IV – TESTING ACCURACY WITH LIFT CHARTS
Now it is time to test your models for accuracy. Because the data in the testing set already contains known values for bike buying, it is easy to determine whether the model’s predictions are correct. The model that performs the best will be used by the company marketing department to identify the customers for their mailing campaign.
Validation is an important step in the data mining process. Knowing how well your targeted mailing mining models perform against real data is important before you deploy the models into a production environment. Analysis Services provides a variety of methods to determine the accuracy of mining models. In this case we will take a look at a lift chart. Typically, the predictive accuracy of a mining model is measured by either lift or classification accuracy. For this tutorial we will use the lift chart method.
- In Data Mining Designer, switch to the Mining Accuracy Chart tab. Click on the Input Selection tab if it is not already selected.
- In the Select data set to be used for Accuracy Chart (at the bottom of the screen), select Use mining structure test cases to use the historical data in the mining structure.
The next step is to select the models (decision trees, clustering, bayes) that you want to include in the lift chart, the predictable column (BikeBuyer), and the value to predict (0 or 1).
Create a lift chart for the models
- On the Input Selection tab of Data Mining Designer, under Select predictable mining model columns to show in the lift chart, verify that Synchronize Prediction Columns and Values is selected.
- In the Predictable Column Name column, verify that Bike Buyer is selected for each model.
- In the Show column, verify that all models are selected.
- In the Predict Value column, select 1. The same value is automatically filled in for each model that has the same predictable column.
Your screen should now look like:
- Select the Lift Chart tab to display the lift chart.
The lift chart plots the three models you created plus a Random Guess Model as well as an Ideal Model. The mining models you created will fall between these two extremes; between a random guess and a perfect prediction. Any improvement from the random guess is considered to be lift.
- Use the legend to locate the colored lines representing the Ideal Model and the Random Guess Model.
You’ll notice that the TrainingModelDecisionTrees model provides the greatest lift, outperforming both the Clustering and Naive Bayes models. Your lift chart should look like the following. For more details, read the article at https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/lift-chart-analysis-services-data-mining?view=sql-server-2017.
Key observations at the 50% population mark of the lift chart:
- Our best model, at 50% of the population, predicts only 74.65% of the target states.
- The ideal model needs only 50% of the population to predict 100% of the target states.
- The threshold we will use in the prediction query to select customers: in this model, we will select customers with a predict probability above 45%.
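Putting these numbers together: at the 50% population mark a random guess would reach about 50% of the bike buyers, whereas our best model reaches 74.65% of them, a lift of roughly 74.65 / 50 ≈ 1.5 over random selection.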
PART V: WORKING WITH PREDICTIONS
You have developed, trained, tested, and explored the data mining models you created. Now you are ready to use the models to identify future prospective customers for the company.
The role of DMX
After you have tested the accuracy of your mining models and decided that you are satisfied with them, you can then create prediction queries by using the Prediction Query Builder on the Mining Model Prediction tab in the Data Mining Designer. This interface helps you build queries in the Data Mining Extensions (DMX) language. DMX has syntax similar to that of SQL but is used for queries against data mining objects.
The Prediction Query Builder has three views. With the Design and Query views, you can build and examine your query. You can then run the query and view the results in the Result view.
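To give a feel for the language before we build the full query in the designer, here is a small illustrative DMX singleton prediction query against our decision tree model (this is only a sketch, not one of the lab steps, and the input values are made up):
SELECT
  Predict([Bike Buyer]) AS [Predicted Bike Buyer],
  PredictProbability([Bike Buyer]) AS [Probability]
FROM [TrainingModelDecisionTrees]
NATURAL PREDICTION JOIN
(SELECT 35 AS [Age], 1 AS [Number Cars Owned], 0 AS [Number Children At Home]) AS t
// The NATURAL PREDICTION JOIN matches the constant input columns to model columns by name.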
Create an Age calculated column in the prospective buyers table
- In Solution Explorer, double-click the TargetedMailing data source.
- Right-click the table ProspectiveBuyer, and select New Named Calculation.
- In the Create Named Calculation dialog box, for Column name, type calcAge.
- For Description, type Calculate age based on birthdate.
- In the Expression box, type the SQL function DATEDIFF(YYYY,[BirthDate],getdate()) and then click OK.
Because the table ProspectiveBuyer (the data we purchased to run our models) does not have an Age column corresponding to the one we have in our historical data (vTargetMail), we calculate customer age from the BirthDate column in the ProspectiveBuyer table. Since Age was identified as the most influential column for predicting bike buying, it must exist in both the model and in the input table.
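If you want to sanity-check this expression directly against the database (optional; this is plain T-SQL you can run in a SQL Server query window, not part of the designer steps), a query like the following shows the computed age next to each birth date:
SELECT TOP (10)
  ProspectiveBuyerKey,
  BirthDate,
  DATEDIFF(YYYY, [BirthDate], GETDATE()) AS calcAge   -- same expression as the named calculation
FROM dbo.ProspectiveBuyer;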
Creating a DMX Prediction Query
- In Solution Explorer, double-click the Targeted mining structure to open it.
- On the Mining Model Prediction tab, in the Mining Model box, click Select Model. In the Select Mining Model dialog box, navigate through the tree to the Targeted structure, expand the structure, select TrainingModelDecisionTrees, and then click OK. This is the model we will use to make predictions, since it is the strongest one we have.
- In the Select Input Table(s) box, click Select Case Table. This is where you specify the input table that contains the new cases on which the model will make predictions.
- In the Select Table dialog box, in the Data Source list, select TargetedMailing.
- In Table/View Name, select the ProspectiveBuyer (dbo) table, and then click OK. This is the table that contains the new data from marketing on which we want to use the data mining models to make predictions.
The ProspectiveBuyer table most closely resembles the vTargetMail case table.
Mapping the Columns
After you select the table to be used with our mining models (input table), the Prediction Query Builder creates a default mapping between the mining model and the input table, based on the names of the columns. At least one column from the structure must match a column in the external data.
- Right-click a line connecting the Mining Model window to the Select Input Table window, and select Modify Connections.
- Under Table Column, click the Bike Buyer cell and select ProspectiveBuyer.Unknown from the dropdown. This is the attribute that we will predict in the new ProspectiveBuyer table. Click OK.
- Under Table Column, click the Age cell and select ProspectiveBuyer.calcAge from the dropdown. We need to manually modify this mapping since we created a calculated field for the age.
- Click OK.
Design the prediction query
- The first button on the toolbar of the Mining Model Prediction tab is the Design/Query/Result button. Click the down arrow on this button, and select Design. Your query might already be in design view, as in the following image:
- Click the cell in the first empty row in the Source column, and then select Prediction Function.
- In the Prediction Function row, in the Field column, select PredictProbability. In the Alias column of the same row, type Probability of result.
- From the Mining Model window, select and drag [Bike Buyer] into the Criteria/Argument cell. When you let go, [TrainingModelDecisionTrees].[Bike Buyer] appears in the Criteria/Argument cell.
- Click the next empty row in the Source column, and then select the TrainingModelDecisionTrees mining model.
- In the TrainingModelDecisionTrees row, in the Field column, select Bike Buyer.
- In the TrainingModelDecisionTrees row, in the Criteria/Argument column, type "=1" (without the double quotes).
- Click the next empty row in the Source column, and then select the ProspectiveBuyer table.
- In the ProspectiveBuyer row, in the Field column, select ProspectiveBuyerKey. This adds the unique identifier to the prediction query so that you can identify who is and who is not likely to buy a bicycle.
- Add five more rows to the grid. For each row, select the ProspectiveBuyer table as the Source and then add the following columns in the Field cells:
- calcAge
- LastName
- FirstName
- AddressLine1
- AddressLine2
Your query design should look like the following:
Run the query and view results
- In the Mining Model Prediction tab, on the upper left corner of the query window, select the Result button.
- After the query runs and the results are displayed, you can review the results. The Mining Model Prediction tab displays contact information for potential customers who are likely to be bike buyers. The Probability of result column indicates the probability of the prediction being correct. You can use these results to determine which potential customers to target for the mailing.
- At this point, you can save the results. The most convenient way is to save the DMX code of the query to Notepad for future reference. To do that, simply switch to the query text view using the icons at the upper left of its interface, and then copy and paste the code into Notepad. This way we can reuse it any number of times we would like to.
Here is the DMX for the query:
SELECT
(PredictProbability([TrainingModelDecisionTrees].[Bike Buyer])) as [Probability of result],
[TrainingModelDecisionTrees].[Bike Buyer],
t.[ProspectiveBuyerKey],
t.[calcAge],
t.[LastName],
t.[FirstName],
t.[AddressLine1],
t.[AddressLine2]
FROM
[TrainingModelDecisionTrees]
PREDICTION JOIN
OPENQUERY([Adventure Works DW],
'SELECT
[ProspectiveBuyerKey],
(DATEDIFF(YYYY,[BirthDate],getdate())) AS [calcAge],
[LastName],
[FirstName],
[AddressLine1],
[AddressLine2],
[MaritalStatus],
[Gender],
[YearlyIncome],
[TotalChildren],
[NumberChildrenAtHome],
[HouseOwnerFlag],
[NumberCarsOwned],
[Unknown]
FROM
[dbo].[ProspectiveBuyer]
') AS t
ON
[TrainingModelDecisionTrees].[Marital Status] = t.[MaritalStatus] AND
[TrainingModelDecisionTrees].[Gender] = t.[Gender] AND
[TrainingModelDecisionTrees].[Yearly Income] = t.[YearlyIncome] AND
[TrainingModelDecisionTrees].[Total Children] = t.[TotalChildren] AND
[TrainingModelDecisionTrees].[Number Children At Home] = t.[NumberChildrenAtHome] AND
[TrainingModelDecisionTrees].[House Owner Flag] = t.[HouseOwnerFlag] AND
[TrainingModelDecisionTrees].[Number Cars Owned] = t.[NumberCarsOwned] AND
[TrainingModelDecisionTrees].[Bike Buyer] = t.[Unknown] AND
[TrainingModelDecisionTrees].[Age] = t.[calcAge]
WHERE
[TrainingModelDecisionTrees].[Bike Buyer] =1
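If you want the query to return only customers above the 45% probability threshold discussed with the lift chart (a sketch; adjust the cutoff to whatever threshold you settle on), you can add a PredictProbability filter to the WHERE clause, for example:
WHERE
[TrainingModelDecisionTrees].[Bike Buyer] = 1 AND
PredictProbability([TrainingModelDecisionTrees].[Bike Buyer]) > 0.45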