
Data Resource Management

Data Mining Homework  

Your task is to produce a list of prospective customers on whom your corporation will spend a significant amount of money and a large share of its limited staff's time. 

Using the three data mining models we have already built in our labs, build a predictive model that includes the dependent variable, the independent variables, and the values for those variables.  Specifically:

Your dependent variable is Bike Buyer.

The independent variables that should be included in your model are:

Age

CommuteDistance

EnglishOccupation

EnglishEducation

Gender

HouseOwner

MaritalStatus

NumberCarsOwned

NumberChildrenAtHome

TotalChildren

YearlyIncome   

You have already built three data mining models using three different data mining algorithms.  These models have already proposed to you some important factors/variables such as EnglishOccupation, NumberCarsOwned, NumberChildrenAtHome, and others. 

The three mining models have already proposed values for those variables as well.  For example, EnglishOccupation = Professional, NumberCarsOwned = 0, and NumberChildrenAtHome = 0 or 2.

Reflecting on the variables and values proposed by the three mining models, assign your own values to these variables, choosing values that you think make sense and will maximize the effectiveness of your model.  For example, for EnglishEducation, you might want to include prospective customers with Bachelors and Masters degrees.  For NumberCarsOwned, you might want to include the values 1 and 2.
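One way to sanity-check the values you pick (a sketch, assuming you can query the AdventureWorksDW database directly; swap in any of the candidate columns) is to look at how those values are actually distributed among the prospects:

```sql
-- How many prospects fall under each NumberCarsOwned value?
SELECT NumberCarsOwned, COUNT(*) AS Prospects
FROM dbo.ProspectiveBuyer
GROUP BY NumberCarsOwned
ORDER BY NumberCarsOwned;
```

A value that covers almost nobody (or almost everybody) in the prospect pool is unlikely to make your model more effective.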

After you have decided on the variables to include in your final model, as well as their assigned values, create an Excel workbook.  In one worksheet, enter your model variables and indicate their associated values.  In one column, list each variable and, underneath it, its values; right next to it, give the explanation or rationale for selecting that particular value.  It should look like the following:

NumberCarsOwned

(1, 2)                    why?

Name this worksheet “ModelAndVariables” 

Now, connect to SQL Server and the AdventureWorks database and, using SQL, retrieve from the ProspectiveBuyer table the records that satisfy the model you have developed.  That is, you will use the variable values as criteria for your query.
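As a sketch, the query might look like the following.  The criteria values here are placeholders only; substitute the values from your ModelAndVariables worksheet.  Note also that in ProspectiveBuyer the column names may differ slightly from the model's (for example, Occupation rather than EnglishOccupation):

```sql
-- Placeholder criteria: replace with the values you chose for your model.
SELECT ProspectiveBuyerKey, FirstName, LastName,
       Occupation, NumberCarsOwned, NumberChildrenAtHome, YearlyIncome
FROM dbo.ProspectiveBuyer
WHERE Occupation = 'Professional'
  AND NumberCarsOwned IN (1, 2)
  AND NumberChildrenAtHome = 0;
```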

Once you have the result set, paste it into a second worksheet in the Excel workbook you have already created.  Name this worksheet “ProposedCustomerList”.

Name the workbook “YourName_DataMiningHomework”.

Data Mining Lab 2

TASKS IN THIS LAB

Create the Clustering mining model

Create the Naive Bayes mining model

Explore the Cluster Diagram

Explore Cluster Profiles

Explore Cluster Characteristics

Explore Cluster Discrimination

Explore the Dependency Network

Explore the model in the Attribute Profiles tab

Explore the Attribute Characteristics

Explore the Attribute Discrimination tab

Create a lift chart for the models

Using Filters

Make a copy of the Decision Tree model

Create a case filter on a mining model

Process the filtered model

Explore the filtered models

Test the lift of the filtered models

The role of DMX

Create a calculated column in the prospective buyers table

Creating a DMX Prediction Query

Mapping the Columns

Design the prediction query

Run the query and view results

PART I – ADDING A CLUSTERING AND NAÏVE BAYES MODEL

The Clustering model groups your customers by attributes that influence their bike buying behavior. The Naive Bayes model enables you to explore relationships among different attributes.

Create the Clustering Mining Model

  1. Click the Mining Models tab in Data Mining Designer in SQL Server Data Tools (SSDT).

Notice that the designer displays two columns, one for the mining structure and one for the TrainingModelDecisionTrees mining model, which you have already created:

  2. Right-click the Structure column and select New Mining Model.

  3. In the New Mining Model dialog box, in Model name, type TrainingModelClustering.

  4. In Algorithm name, select Microsoft Clustering.

  5. Click OK.

The new model now appears in the Mining Models tab of Data Mining Designer. This model, built with the Microsoft Clustering algorithm, groups customers with similar characteristics into clusters and predicts bike buying behavior for each cluster.

Create the Naive Bayes mining model

  1. Right-click the Structure column, and select New Mining Model.
     
  2. In the New Mining Model dialog box, under Model name, type TrainingModelNaiveBayes.

  3. In Algorithm name, select Microsoft Naive Bayes, then click OK.

A message appears stating that the Microsoft Naive Bayes algorithm does not support the Age and Yearly Income columns, which are continuous variables.

  4. Click Yes to acknowledge the message and continue.

Your environment should now look like this:

  1. Deploy your models to the server.  In the Mining Model menu, select Process Mining Structure and All Models.  You might need to enter your userid and password for accessing the SQL Server database.

  2. Click Run in the Process Mining Structure – Targeted Mailing dialog box.  The Process Progress dialog box opens to display the details of model processing.  Model processing might take some time, depending on your computer.

PART II – EXPLORING THE CLUSTERING MODEL

The clustering algorithm groups cases into clusters that contain similar characteristics. These groupings are useful for exploring data, identifying anomalies in the data, and creating predictions.

Explore the Cluster Diagram

  1. Click the Mining Model Viewer tab.  You might be asked to reprocess your models; select OK.  By default, the designer opens to the first model that was added to the structure, in this case TrainingModelDecisionTrees.

  2. At the top left of the Mining Model Viewer tab, click on the Mining Model list to select the TrainingModelClustering.

  3. In the adjacent Viewer drop down menu, select Microsoft Cluster Viewer if it is not already selected.

  4. In the Shading Variable box, select Bike Buyer.

  5. Select 1 in the State box to explore those cases where a bike was purchased. The Density legend tells us that the cluster with the darkest shading has the highest percentage of bike buyers.

  6. Move your mouse over the cluster with the darkest shading. A tooltip displays the percentage of cases that have the attribute, Bike Buyer = 1.

  7. Select the cluster that has the highest density, right-click the cluster, select Rename Cluster and type Bike Buyers High for later identification. Click OK.

  8. Find the cluster that has the lightest shading (and the lowest density). Right-click the cluster, select Rename Cluster and type Bike Buyers Low. Click OK.

  9. Click the Bike Buyers High cluster and drag it to an area of the pane that will give you a clear view of its connections to the other clusters. When you select a cluster, the lines that connect this cluster to other clusters are highlighted, so that you can easily see all the relationships for this cluster. When the cluster is not selected, you can tell by the darkness of the lines how strong the relationships are amongst all the clusters in the diagram. If the shading is light or nonexistent, the clusters are not very similar.

  10. Use the slider to the left of the network, to filter out the weaker links and find the clusters with the closest relationships. The corporate marketing department might want to combine similar clusters together when determining the best method for marketing to these customers.

Explore Cluster Profiles

The Cluster Profiles tab provides an overall view of the TrainingModelClustering model.  It contains a column for each cluster in the model.  The first column (Variables) lists the attributes that are associated with the clusters.  The States column shows the values each attribute can take.  For example, the HouseOwner flag has three states: 0, 1, and missing.  1 means the customer owns a house, while 0 means he or she does not.

The distribution of a discrete variable/attribute is shown as a colored bar, with the maximum number of bars displayed controlled by the Histogram bars list.  Continuous variables/attributes are displayed with a diamond chart, which represents the mean and standard deviation in each cluster.

  1. Set Histogram bars to 5. In our model, 5 is the maximum number of states for any one variable.

  2. If the Mining Legend blocks the display of the Attribute profiles, move it out of the way.

  3. Select the Bike Buyers High column and drag it to the right of the Population column.

  4. Select the Bike Buyers Low column and drag it to the right of the Bike Buyers High column.
     
  5. If you need to, adjust the widths of the columns so that you can see the full name of the cluster at the top.

  6. Click the Bike Buyers High column. The Variables column is sorted in order of importance for that cluster. Scroll through the column and review characteristics of the Bike Buyer High cluster. For example, they are more likely to have a short commute.

  7. Right-click the Age cell in the Bike Buyers High column and select Show Legend.  The Mining Legend displays a more detailed view and you can see the age range of these customers as well as the mean age.

  8. Right-click the Bike Buyers Low column and select Hide Column. As soon as you do that a “more clusters” drop down menu appears above the cluster profiles.  You can use it to display the cluster again.

Explore Cluster Characteristics

With the Cluster Characteristics tab, you can examine in more detail the characteristics that make up a cluster. For example, if you select Bike Buyers High from the Cluster list, you can see the characteristics of the customers in this cluster.

Explore Cluster Discrimination

With the Cluster Discrimination tab, you can explore the characteristics that distinguish one cluster from another. After you select two clusters, one from the Cluster 1 list, and one from the Cluster 2 list, the viewer calculates the differences between the clusters and displays a list of the attributes that distinguish the clusters most.

  1. In the Cluster 1 box, select Bike Buyers High.
  2. In the Cluster 2 box, select Bike Buyers Low.
  3. Click Variables to sort alphabetically.

Some of the more substantial differences between the customers in the Bike Buyers Low and Bike Buyers High clusters include age, car ownership, number of children, and region.

PART III – EXPLORING THE NAIVE BAYES MODEL

The Microsoft Naive Bayes algorithm provides several methods for displaying the interaction between bike buying and the input attributes.

Explore the Dependency Network

  1. At the top left of the Mining Model Viewer tab, click on the Mining Model list to select the TrainingModelNaiveBayes.

  2. Use the Viewer list to switch to Microsoft Naive Bayes Viewer.

  3. Click the Bike Buyer node to identify its dependencies. The brownish shading indicates that all of the attributes have an effect on bike buying.

  4. Adjust the slider to identify the most influential attribute. As you lower the slider, only the attributes that have the greatest effect on the [Bike Buyer] attribute remain.

Explore the model in the Attribute Profiles tab

  1. In the Predictable box, verify that Bike Buyer is selected.  If the Mining Legend is blocking display of the Attribute profiles, move it out of the way.

  2. In the Histogram bars box, select 5. In our model, 5 is the maximum number of states for any one variable. The attributes that affect the state of this predictable attribute are listed together with the values of each state of the input attributes and their distributions in each state of the predictable attribute.

  3. In the Attributes column, find Number Cars Owned. Notice the differences in the histograms for bike buyers (column labeled 1) and non-buyers (column labeled 0). A person with zero or one car is much more likely to buy a bike.
     
  4. Right-click the Number Cars Owned cell in the bike buyer (column labeled 1) column and select Show Legend. The Mining Legend displays a more detailed view.

Explore the Attribute Characteristics

With the Attribute Characteristics tab, you can select an attribute and value from the two drop down boxes to visualize the top factors.

  1. In the Attribute list, verify that Bike Buyer is selected.

  2. Set the Value to 1. In the viewer, you will see that customers who have no children at home, short commutes, and live in the North America region are more likely to buy a bike.

Explore the Attribute Discrimination tab

With the Attribute Discrimination tab, you can investigate the relationship between two discrete values of bike buying and other attribute values. Because the TrainingModelNaiveBayes model has only two states, 1 and 0, you do not have to make any changes to the viewer.  In the viewer, you can see that people who do not own cars tend to buy bicycles, and people who own two cars tend not to buy bicycles.

PART IV – TESTING ACCURACY WITH LIFT CHARTS

Now it is time to test your models for accuracy. Because the data in the testing set already contains known values for bike buying, it is easy to determine whether the model’s predictions are correct. The model that performs the best will be used by the company marketing department to identify the customers for their mailing campaign.

Validation is an important step in the data mining process. Knowing how well your targeted mailing mining models perform against real data is important before you deploy the models into a production environment. Analysis Services provides a variety of methods to determine the accuracy of mining models. In this case we will take a look at a lift chart. Typically, the predictive accuracy of a mining model is measured by either lift or classification accuracy. For this tutorial we will use the lift chart method.

  1. In Data Mining Designer switch to the Mining Accuracy Chart tab. Click on the Input Selection tab if it is not already selected.

  2. In the Select data set to be used for Accuracy Chart (at the bottom of the screen), select Use mining structure test cases to use the historical data in the mining structure.

The next step is to select the models (decision trees, clustering, Naive Bayes) that you want to include in the lift chart, the predictable column (Bike Buyer), and the value to predict (0 or 1).

Create a lift chart for the models

  1. On the Input Selection tab of Data Mining Designer, under Select predictable mining model columns to show in the lift chart, verify that Synchronize Prediction Columns and Values is selected.

  2. In the Predictable Column Name column, verify that Bike Buyer is selected for each model.

  3. In the Show column verify that all models are selected.

  4. In the Predict Value column, select 1. The same value is automatically filled in for each model that has the same predictable column.

    Your screen should now look like:


  5. Select the Lift Chart tab to display the lift chart.

The lift chart plots the three models you created plus a Random Guess Model as well as an Ideal Model. The mining models you created will fall between these two extremes; between a random guess and a perfect prediction. Any improvement from the random guess is considered to be lift.

  • Use the legend to locate the colored lines representing the Ideal Model and the Random Guess Model.

You’ll notice that the TrainingModelDecisionTrees model provides the greatest lift, outperforming both the Clustering and Naive Bayes models.  Your lift chart should look like the following.  For more details, read the article https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/lift-chart-analysis-services-data-mining?view=sql-server-2017.

Reading the chart at the 50% population mark:

  • The ideal model needs only 50% of the population to predict 100% of the target states.

  • Our best model at 50% of the population predicts only 74.65% of the target states.

  • This gap suggests the threshold we will use in the query to select customers: in this model, we will select customers with a predict probability above 45%.

PART V: WORKING WITH PREDICTIONS

You have developed, trained, tested, and explored the data mining models you created. Now you are ready to use the models to identify future prospective customers for the company.

The role of DMX

After you have tested the accuracy of your mining models and decided that you are satisfied with them, you can then create prediction queries by using the Prediction Query Builder on the Mining Model Prediction tab in the Data Mining Designer. This interface helps you build queries in DMX, or the Data Mining Extension (DMX) language. DMX has syntax like that of SQL but is used for queries against data mining objects.
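For orientation, a DMX prediction query has roughly this shape.  This is a generic sketch: MyModel, MyDataSource, and the column names are placeholders, not objects from this lab.

```sql
SELECT
  Predict([MyModel].[Target Column])           -- a DMX prediction function
FROM
  [MyModel]                                    -- a trained mining model, not a table
PREDICTION JOIN
  OPENQUERY([MyDataSource],
    'SELECT [InputColumn] FROM [dbo].[NewCases]') AS t
ON
  [MyModel].[Input Column] = t.[InputColumn]   -- map model columns to input columns
```

The Prediction Query Builder generates exactly this kind of statement for you, so you rarely need to write it by hand.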

The Prediction Query Builder has three views. With the Design and Query views, you can build and examine your query. You can then run the query and view the results in the Result view.

Create an Age calculated column in the prospective buyers table

  1. In Solution Explorer, double-click the TargetedMailing data source.

  2. Right-click the table, ProspectiveBuyer, and select New Named Calculation.

  3. In the Create Named Calculation dialog box, for Column name, type calcAge.

  4. For Description, type Calculate age based on birthdate.

  5. In the Expression box, type the SQL function DATEDIFF(YYYY,[BirthDate],getdate()) and then click OK.

Because the table ProspectiveBuyer (the data we purchased to run our models) does not have an Age column corresponding to the one we have in our historical data (vTargetMail), we calculate customer age from the BirthDate column in the ProspectiveBuyer table. Since Age was identified as the most influential column for predicting bike buying, it must exist in both the model and in the input table.
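If you want to sanity-check the expression before relying on the named calculation (a sketch, assuming you can query the AdventureWorksDW database directly):

```sql
-- Note: DATEDIFF(YYYY, ...) counts calendar-year boundaries crossed, so it can
-- overstate age by one for anyone whose birthday has not yet occurred this year.
-- That approximation is acceptable for this lab.
SELECT TOP 5
    ProspectiveBuyerKey,
    BirthDate,
    DATEDIFF(YYYY, BirthDate, GETDATE()) AS calcAge
FROM dbo.ProspectiveBuyer;
```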

Creating a DMX Prediction Query

  1. In Solution Explorer, double-click the Targeted Mailing mining structure to open it.

  2. On the Mining Model Prediction tab in the Mining Model box, click Select Model. In the Select Mining Model dialog box, navigate through the tree to the Targeted structure, expand the structure, select TrainingModelDecisionTrees, and then click OK.  This is the model we will use to make predictions since it is the strongest we have.

  3. In the Select Input Table(s) box, click Select Case Table.  This is where we choose the input table of new cases on which we want the model to make predictions.

  4. In the Select Table dialog box, in the Data Source list, select TargetedMailing.

  5. In Table/View Name, select the ProspectiveBuyer (dbo) table, and then click OK.  This is the table that contains the new data from marketing on which we want to use the data mining models to make predictions.

The ProspectiveBuyer table most closely resembles the vTargetMail case table.

Mapping the Columns

After you select the table to be used with our mining models (input table), the Prediction Query Builder creates a default mapping between the mining model and the input table, based on the names of the columns. At least one column from the structure must match a column in the external data.

  1. Right-click a line connecting the Mining Model window to the Select Input Table window, and select Modify Connections.

  2. Under Table Column, click the Bike Buyer cell and select ProspectiveBuyer.Unknown from the dropdown.  This is the attribute that we will predict in the new ProspectiveBuyer table.  Click OK.

  3. Under Table Column, click the Age cell and select ProspectiveBuyer.calcAge from the dropdown.  We need to modify this mapping manually because we created a calculated field for the age.
     
  4. Click OK.

Design the prediction query

  1. The first button on the toolbar of the Mining Model Prediction tab switches among the Design, Query, and Result views.  Click the down arrow on this button and select Design.  Your query might already be in Design view, as in the following image:


     
  2. Click the cell in the first empty row in the Source column, and then select Prediction Function.

  3. In the Prediction Function row, in the Field column, select PredictProbability.  In the Alias column of the same row, type Probability of result.

  4. From the Mining Model window, select and drag [Bike Buyer] into the Criteria/Argument cell.  When you let go, [TrainingModelDecisionTrees].[Bike Buyer] appears in the Criteria/Argument cell.

  5. Click the next empty row in the Source column, and then select TrainingModelDecisionTrees mining model.

  6. In the TrainingModelDecisionTrees row, in the Field column, select Bike Buyer.

  7. In the TrainingModelDecisionTrees row, in the Criteria/Argument column, type “=1” (without the double quotes).

  8. Click the next empty row in the Source column, and then select ProspectiveBuyer table.

  9. In the ProspectiveBuyer row, in the Field column, select ProspectiveBuyerKey.  This adds the unique identifier to the prediction query so that you can identify who is and who is not likely to buy a bicycle.

  10. Add five more rows to the grid. For each row, select the ProspectiveBuyer table as the Source and then add the following columns in the Field cells:

    • calcAge
    • LastName
    • FirstName
    • AddressLine1
    • AddressLine2

Your query design should look like the following:

Run the query and view results

  1. In the Mining Model Prediction tab, on the upper left corner of the query window, select the Result button.

  2. After the query runs and the results are displayed, you can review the results. The Mining Model Prediction tab displays contact information for potential customers who are likely to be bike buyers. The Probability of result column indicates the probability of the prediction being correct. You can use these results to determine which potential customers to target for the mailing.

  3. At this point, you can save the results.  The most convenient way is to save the DMX code of the query to Notepad for future reference.  To do that, switch the query to its code view using the icons at the upper left of its interface, then copy and paste the code into Notepad.  This way we can reuse it as many times as we would like.

Here is the DMX of the query:

SELECT
  (PredictProbability([TrainingModelDecisionTrees].[Bike Buyer])) AS [Probability of result],
  [TrainingModelDecisionTrees].[Bike Buyer],
  t.[ProspectiveBuyerKey],
  t.[calcAge],
  t.[LastName],
  t.[FirstName],
  t.[AddressLine1],
  t.[AddressLine2]
FROM
  [TrainingModelDecisionTrees]
PREDICTION JOIN
  OPENQUERY([Adventure Works DW],
    'SELECT
      [ProspectiveBuyerKey],
      (DATEDIFF(YYYY,[BirthDate],getdate())) AS [calcAge],
      [LastName],
      [FirstName],
      [AddressLine1],
      [AddressLine2],
      [MaritalStatus],
      [Gender],
      [YearlyIncome],
      [TotalChildren],
      [NumberChildrenAtHome],
      [HouseOwnerFlag],
      [NumberCarsOwned],
      [Unknown]
    FROM
      [dbo].[ProspectiveBuyer]
    ') AS t
ON
  [TrainingModelDecisionTrees].[Marital Status] = t.[MaritalStatus] AND
  [TrainingModelDecisionTrees].[Gender] = t.[Gender] AND
  [TrainingModelDecisionTrees].[Yearly Income] = t.[YearlyIncome] AND
  [TrainingModelDecisionTrees].[Total Children] = t.[TotalChildren] AND
  [TrainingModelDecisionTrees].[Number Children At Home] = t.[NumberChildrenAtHome] AND
  [TrainingModelDecisionTrees].[House Owner Flag] = t.[HouseOwnerFlag] AND
  [TrainingModelDecisionTrees].[Number Cars Owned] = t.[NumberCarsOwned] AND
  [TrainingModelDecisionTrees].[Bike Buyer] = t.[Unknown] AND
  [TrainingModelDecisionTrees].[Age] = t.[calcAge]
WHERE
  [TrainingModelDecisionTrees].[Bike Buyer] = 1
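To apply the 45% probability threshold chosen in Part IV, the WHERE clause of the query can be extended as follows.  This is a sketch; any other cutoff works the same way, and the 0.45 value is simply the threshold this lab settled on.

```sql
-- Keep only prospects whose predicted probability of buying exceeds 45%.
WHERE
  [TrainingModelDecisionTrees].[Bike Buyer] = 1 AND
  PredictProbability([TrainingModelDecisionTrees].[Bike Buyer]) > 0.45
```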