Chapter Two: Data Understanding and Data Preparation
Goals of the Chapter
To understand the major activities involved in the data understanding and data preparation phases
Introduction to SAS Enterprise Miner 15.1
Data Preprocessing
Data Understanding phase involves:
Data collection
Familiarization with the data, to discover initial insights
Evaluate the quality of the data
Data Preparation phase involves:
Select the cases and variables
Clean the data
Integrate the data sets if needed
Perform transformations on variables if needed
These phases may account for about 60-70% of a typical data mining project time.
The two phases are not mutually exclusive.
Shall we explore the data before cleaning them, or shall we clean the data before exploring them?
Data Understanding
Values, variables, and objects
A single unit of information is a value of a variable, where each variable can take a number of different values.
Two types of values: numerical and categorical
Numerical values are expressed by numbers.
Categorical values usually describe qualitative concepts such as Yes/No, Accept/Reject, Fair/Good/Very Good/Excellent etc.
Numbers are also frequently used to represent qualitative concepts. For example, use 1 for Yes and 0 for No.
A categorical variable can be:
binary – there are only two values. E.g. Gender.
nominal – there are more than two values. E.g. District.
ordinal – like nominal but some ordering exists. E.g. Income group.
Data Understanding
Values, variables, and objects
Objects are also known as observations, records, examples, units, cases, or individuals.
Objects represent entities described by one or more variables.
Consider an example concerning customers of a branch of a bank.
A customer is an object that can be described by a number of variables, such as identity, sex, age, current account balance, credit risk level.
Data Understanding
Data Sets
Objects described by the same variables are grouped to form data sets.
An object may have only one record in a data set (e.g. demographic variables), or multiple records (e.g. purchase records).
Many data sets are organized as flat files.
A flat file is a formatted table composed of rows and columns, like a matrix.
Conventionally, the rows represent objects, and the columns represent variables.
Flat files are often generated from data stored in other more complex formats such as databases.
Data Understanding
Data Sets
Example: A data set of loan applicants (Cont’d)
The first row of a data set is often the name of each variable.
Each column can be separated by either a blank space, comma, or tab.
It is possible that several objects are related to the same applicant.
Some values are blank. This indicates the value of the variable is unknown for the object.
A value may be incorrectly outside the acceptable range of the variable.
Data Understanding
Data Sets
Example: A data set of loan applicants (Cont’d)
Each applicant (object) is described by the following set of variables:
IDENTITY (Identity of the applicant)
BAD (1 = applicant defaulted on loan, 0 = loan repaid)
LOAN (Amount of the loan request)
MORTDUE (Amount due on existing mortgage)
VALUE (Value of current property)
REASON (Debtcon = debt consolidation, Homeimp = home improvement)
JOB (Six occupational categories)
YOJ (Years at present job)
NINQ (Number of recent credit inquiries)
DEBTINC (Debt-to-income ratio)
Data Understanding (Self-Study)
Data Storage
In data mining, we often deal with data sets that are kept in special data storage and management systems because:
The corresponding data set may not fit into the memory of a computer used for data mining and thus a data management system is used to fetch the data.
The data mining methods may need to work on many different subsets of data, and a data management system is required to efficiently retrieve the required pieces of data.
The data may need to be dynamically added and updated, sometimes by different people in different locations. A data management system is required to handle updating.
The flat file may include significant portions of redundant information, which can be avoided if a single data set is stored in multiple tables.
Data Understanding (Self-Study)
Data Storage
Database Management Systems
It consists of a database that stores the data, and a set of programs for management and fast access to the database.
The most common database type is a relational database, which consists of a set of tables. Each table is rectangular and can be perceived as analogous to a single flat file. Each object in a table is assigned a key variable that serves as its unique identifier. Through this key, objects in different tables can be linked together.
A specialized language, called Structured Query Language (SQL), allows the user to specify queries that contain a list of relevant variables and constraints on those variables.
SQL also provides the ability to aggregate data by computing functions such as sum, average, count, maximum, and minimum.
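As a minimal sketch of such a query (assuming a hypothetical orders table with columns customer_id and amount), PROC SQL computes per-customer aggregates:

proc sql;
   /* one summary row per customer: count, total, average, max, min */
   create table cust_summary as
   select customer_id,
          count(*)    as n_orders,
          sum(amount) as total_amount,
          avg(amount) as avg_amount,
          max(amount) as max_amount,
          min(amount) as min_amount
   from orders
   group by customer_id;
quit;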
Data Understanding (Self-Study)
Data Storage
Database Management Systems
Example: A relational database of customer orders
Data Understanding (Self-Study)
Data Storage
Data Warehouses
Consider a bank that operates a number of branches in different cities, each having its own database. Using a database, we are able to analyze data in individual branches, but it can be very difficult to perform analysis across all branches. In this case, a data warehouse is used.
A data warehouse is a repository of data collected in different locations (relational databases) and stored using a unified schema. Data warehouses are usually created by applying a set of processing steps to data coming from multiple databases. The steps usually include data cleaning, data transformation, data integration, data loading, and periodic data updates.
A data warehouse usually uses a multidimensional database structure, where each dimension corresponds to a variable or a set of variables selected by the user to be included in the schema.
Data can be merged to provide the user with a higher-level summarization, or broken down into subranges to present the user with more detailed information.
Data Understanding (Self-Study)
Data Storage
Advanced Data Storage
A new breed of databases is needed to satisfy the needs of more specialized users who must handle more than just numerical and nominal data.
New databases handle spatial data such as maps; hypertext such as HTML and XML; and multimedia data such as combinations of text, image, video, and audio.
Data Understanding
The Amount of Data
The size of a data set being analyzed has direct impact on not only the data storage and retrieval, but also on the amount of time required for processing such data by data mining algorithms.
While a large amount of data can be stored cheaply, thanks to the great reduction in storage (hard disk or RAM) costs, and retrieved quickly, thanks to advances in database techniques and faster networks, data mining algorithms remain quite sensitive to the size of a data set.
Three dimensions of data size are usually considered:
The number of objects
The number of variables
The number of values a variable assumes
Data Understanding
The Amount of Data
How many objects?
Consider the density: the prevalence of the outcome of interest.
When estimating a real number (such as amount of spending), the full range of the target should be available in the data set for model building.
When classifying a categorical target, each level of the target must be represented by sufficient number of records.
Consider the available time and resource: the larger the data set, the more computer power (disk space, RAM, and CPU speed) and time are needed.
Data Understanding
The Amount of Data
How many variables?
A final model is usually based on just a few variables. But these few variables are often derived by combining several other variables.
Avoid throwing away variables that seem unlikely to be interesting at the early stage of a project.
Domain experts provide useful insight into what variables to include, but it is important not to be constrained by their expertise.
How much history is required?
Should have enough historical data to capture calendar related events, such as seasonality.
Data from too far in the past may not be as useful for mining because of changing market conditions.
For many customer-focused applications, two to three years of history is appropriate, plus data about the beginning of the customer relationship.
Data Understanding
Preliminary Data Analysis
Unless a data mining analyst is already very familiar with the data set in hand, which is very unlikely in practice, the analyst often performs some preliminary data analyses on the data set.
The purposes of such preliminary data analyses include:
To understand the distribution of the variables, such as range of values, average, skewness, categorical frequency, etc.
To identify trivial errors in the data set, such as incorrect values, missing values, possible outliers, unary variables, rare events, etc.
To verify that only intentionally selected (or interested) objects and variables are included.
Data Understanding
Preliminary Data Analysis
Data Description
To explore the selected variables through appropriate descriptive statistics (univariate and multivariate).
Univariate statistics (a SAS sketch follows this list):
For interval variable:
Statistics: number of records with missing value, number of records without missing value, mean, standard deviation, minimum, maximum, and skewness.
Charts: Histogram.
For categorical variable:
Statistics: Relative frequency table.
Charts: Bar or pie.
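A minimal SAS sketch of these univariate summaries, using the loan applicant variables introduced earlier (the data set name loans is an assumption):

/* interval variables: counts, mean, std, min, max, skewness */
proc means data=loans n nmiss mean std min max skew;
   var loan mortdue value yoj ninq debtinc;
run;

/* categorical variables: frequency tables, including missing values */
proc freq data=loans;
   tables bad reason job / missing;
run;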
Data Understanding
Preliminary Data Analysis
Data Description
Multivariate statistics (a SAS sketch follows this list):
For interval variables:
Statistics: Correlation coefficient.
Chart: Scatter plot, 3-D histogram
For categorical variables:
Statistics: Cross-table, Chi-Square statistics.
Chart: 3-D bar chart.
For interval variable and categorical variable:
Statistics: compute descriptive statistics for the interval variable by each level of the categorical variable.
Chart: Histogram of the interval variable by each level of the categorical variable.
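A matching sketch for the multivariate summaries, under the same assumed loans data set:

/* correlation coefficients between interval variables */
proc corr data=loans;
   var loan value debtinc;
run;

/* cross-table with chi-square statistic for two categorical variables */
proc freq data=loans;
   tables reason*job / chisq;
run;

/* statistics of an interval variable by each level of a categorical variable */
proc means data=loans mean std;
   class job;
   var loan;
run;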
Data Preparation
Much of the raw data contained in databases (even data warehouses) is not preprocessed: it is incomplete and noisy.
For example, the databases may contain
Fields that are obsolete or redundant,
Missing values,
Outliers,
Data in a form not suitable for the data mining models,
Values not consistent with policy or common sense.
In order to be useful for data mining purposes, the databases need to undergo preprocessing, in the form of data cleaning and data transformation.
Data Preparation
Basic issues that must be resolved in data preparation:
How do I handle time-series data? – Data rearrangement
How do I express data values? – Data transformation
How do I clean up the data? – Data cleansing
How do I handle missing values? – Data imputation
Do I need to create new variables? – Data derivation
Do I need to reduce the amount of data? – Data reduction
These issues need not be tackled in any designated order, and each issue may need to be tackled more than once.
Data Preparation
Data rearrangement
Most business records are time series, e.g. the monthly bills of each customer. They cannot be analyzed by most data mining models without further modification.
Most time series records need to be rearranged (flattened out) by performing a reverse pivot on them: the values stored in multiple rows of a column are spread across separate columns of a single row per object, as in the sketch below and the example tables at the end of this chapter.
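A minimal sketch of the reverse pivot in SAS, assuming a bills table with columns customer_id, quarter, and due_amount (hypothetical names):

proc sort data=bills;
   by customer_id;
run;

/* spread the quarterly rows into DueAmount_1 ... DueAmount_4 columns */
proc transpose data=bills out=bills_flat(drop=_name_) prefix=DueAmount_;
   by customer_id;   /* one output row per customer */
   id quarter;       /* quarter number becomes the column suffix */
   var due_amount;   /* the values being spread across columns */
run;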
Data Preparation
Data cleansing
Includes operations that correct bad values and filter bad records out of the data set.
Some common problems and solutions:
A record has missing values in many of the columns to be included in the model.
If the values cannot be found, delete the record.
Values in the identity column of a combined data set are not unique.
If the values of the other columns are identical, keep only one of the records.
If the values of the other columns are different, it is necessary to find out the reasons that cause the problem. It is possible that the entire data set could be generated incorrectly.
Data Preparation
Data cleansing
Some common problems and solutions:
Data encoding is inconsistent.
The same variable may be encoded differently in different source data tables.
E.g. Gender in one table may be coded as 0 and 1, but in another table it is coded as M and F.
Standardize the values of the same variable across the tables.
Some codes for a categorical variable are not valid.
Replace the codes with correct codes if possible. Otherwise delete the values and leave them to be imputed later.
Value is out of the possible numeric range.
E.g. the Age column contains the value 999, which was used to represent a missing value.
Delete the values and leave them to be imputed later.
Data Preparation
Data cleansing
Some common problems and solutions:
Unary column or close to unary column.
For unsupervised data mining tasks, delete the column as it will not provide useful information.
For supervised data mining tasks, check the association between the column and the target before deleting it.
A very large proportion of the records in a column are blank.
If the blanks are not meaningful, the column can be deleted.
If one considers that there may be a pattern in the missingness, a binary variable can be created to indicate the presence of a missing value for the respective column.
Data Preparation
Data imputation
To fill the blanks (missing values or nulls) in a column with guessed values.
Many DM algorithms will ignore records that contain blanks.
Regression, neural network, and clustering models ignore records with blanks.
Decision tree algorithms accept records with blanks.
Two types of missing value:
Real missing value: A value exists, but it is unknown. E.g. the birth date of a customer.
No stored information: The variable does not apply. E.g. total spending in previous six months is not defined for customers who have less than 6 months of tenure, or the customers did not make any purchase in the six months.
Replacing the unknown age of a customer (real missing value) makes sense, but imputing the spending (no stored information) of a non-spending customer does not.
Data Preparation
Data imputation
It is important to investigate missing values to evaluate whether the observations with missing values are somehow related to the target value, if a target variable is available.
Things not to do
Don’t throw records away without further consideration. For example, perhaps due to certain acquisition channels, information about certain age groups may not be available.
Don’t replace a missing value of a numeric variable with a special numeric value. For example, do not replace a missing value of age by 999.
Consider doing nothing
Some models (e.g. decision tree) are able to make good use of records containing missing values by treating missing as a category.
Missing value may turn out to be an important predictor.
Data Preparation
Data imputation
Imputation methods for not stored information
In most cases, they can be replaced by zero, or by another value that represents the business knowledge behind them.
Data Preparation
Data imputation
Imputation methods for real missing value
Impute with a constant (a sketch follows this list).
For numeric variable, commonly used constants are mean or median of the column, or other reasonable numeric constant.
This may take an observation that was unusual with respect to a given variable and make it look typical.
If the proportion of missing values is relatively high, this may create a spike at the mean value of the distribution.
For categorical variable, the missing value can be naturally handled by treating the missing value as an additional class of the variable.
When the proportion of missing observations is extremely small, it would be more appropriate to use the mode so that there is no need to bring in a new parameter to the model for a trivial number of observations.
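A minimal sketch of mean imputation in SAS, assuming a customers data set with an interval variable income (hypothetical names); the REPONLY option of PROC STDIZE replaces only the missing values and leaves the rest unchanged:

proc stdize data=customers out=customers_imp method=mean reponly;
   var income;   /* blanks are replaced by the column mean */
run;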
Data Preparation
Data imputation
Imputation methods for real missing value
Use other variables to predict the most likely values for the blanks.
For example, to impute the income for a customer, one could look up the salaries of all known customers in the same occupation and age range of the customer.
Very often, a decision tree is built for a variable to predict the missing value. This approach can be used for both numeric and categorical variables.
It may also involve other variables that are of little importance to the model, thus making the set of variables required by the model unnecessarily large.
The tree imputation method should only be used for selected variables, and should be performed after the variable selection process.
Data Preparation
Data transformation for numeric variables
The numeric variables in a data set may have different scales. Some data mining techniques are strongly influenced by the scale of the variables. E.g: Clustering.
Possible solutions (a sketch follows):
[0,1] Range standardization (min-max normalization): To transform interval variables from the original range to a range between 0 and 1, using x' = (x − min) / (max − min).
Central tendency standardization (z-score standardization): To transform interval variables from the original distribution to a distribution with mean 0 and standard deviation 1, using z = (x − mean) / (standard deviation).
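Both standardizations are available through PROC STDIZE; a minimal sketch, assuming a cars data set with interval variables weight and horsepower:

/* [0,1] range standardization: x' = (x - min) / (max - min) */
proc stdize data=cars out=cars_range method=range;
   var weight horsepower;
run;

/* z-score standardization: z = (x - mean) / (standard deviation) */
proc stdize data=cars out=cars_z method=std;
   var weight horsepower;
run;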
Data Preparation
Data transformation for numeric variables
A numeric variable can be extremely skewed. This means that the amount of available observations varies greatly across the range of the values covered by the concerned variable. The few observations at the tail of its distribution may make the variable appear to be more (or less) important than it actually is.
Possible solutions:
Non-linear transformations, such as the natural logarithm or a power transformation, are often used to generate more consistent support across the range of values.
Apply an appropriate binning transformation that creates approximately equal numbers of observations in each created bin.
Data Preparation
Data transformation for numeric variables
A numeric variable may equal one value quite frequently but follow a certain distribution elsewhere. Ignoring the duality of this variable can lead to understating or overstating the importance of the variable.
Possible solutions:
Replace the original variable by two created variables (see the sketch after this list). An indicator variable (1, 0) indicates whether the value is in the spike; the other variable equals the original value if it is not in the spike, otherwise its value is missing and is to be imputed later.
It is often better to fit a model to each group (in spike and not in spike). It is not practical to take this approach for too many variables.
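A minimal data step sketch of the first solution, assuming an accounts data set whose balance variable spikes at 0 (hypothetical names):

data accounts_split;
   set accounts;
   in_spike = (balance = 0);            /* 1 if the value sits in the spike */
   if in_spike then balance_rest = .;   /* missing, to be imputed later */
   else balance_rest = balance;         /* original value outside the spike */
run;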
Data Preparation
Data transformation for numeric variables
A numeric variable may have a spike at a value for most of the observations and the distribution is virtually flat for the remaining relatively small portion of the data.
Possible solution:
Create a binary indicator (1, 0) that indicates whether the value is in the spike or not.
Data Preparation
Data transformation for numeric variables
Binning: Group the values of an interval variable into a number of bins so that each bin has a sufficient number of observations.
A grouped interval variable becomes a categorical column, with the option of imposing an order on the bins.
This allows data mining classification models to discover non-linear relationships between the binned groups and the target.
Equal width binning
Each bin covers equal width of the column values.
Not appropriate for a sparsely distributed column or a column with outliers.
E.g. The weights of cars are binned with an equal width of 1000 lbs each, namely 0 – <1000, 1000 – <2000, 2000 – <3000, 3000 – <4000, and so on.
Data Preparation
Data transformation for numeric variables
Binning
Equal frequency binning (quantile binning)
Each bin contains approximately the same number of records.
Start with 10 bins and re-adjust if necessary.
Equal data values should belong to the same bin, so the frequency in each bin may differ when there are repeated values.
Difficult to determine the best number of bins.
For example, the weights of cars are binned into 5 bins with approximately equal frequency.
Data Preparation
Data transformation for numeric variables
Binning
Equal frequency binning
Steps for forming q equal frequency bins (see the PROC RANK sketch below):
Index each observation 1, …, N from the lowest to the highest value.
Compute Qk = k/q, for k = 1, 2, …, q−1.
Determine the index of the k-th of the q-quantiles of the data set: Ik = N × Qk. If Ik is not an integer, round up to the next integer; the data value at that index is the k-th q-quantile. If Ik is an integer, take the average of the data value at that index and the next data value as the k-th q-quantile.
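In SAS, PROC RANK with the GROUPS= option produces equal frequency bins directly (tied values fall into the same bin under the default TIES=MEAN); a minimal sketch, assuming a cars data set:

proc rank data=cars out=cars_binned groups=5;
   var weight;         /* interval variable to bin */
   ranks weight_bin;   /* bin index 0-4 added to the output data set */
run;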
Data Preparation
Data transformation for numeric variables
Binning
Equal frequency binning
Example: The 4-quantiles of the data set 1, 2, 3, …, 18 (N = 18):
Q1 = 0.25. The index of the 1st of the 4-quantiles: 18 × 0.25 = 4.5, which rounds up to 5. The 1st 4-quantile is 5.
Q2 = 0.5. The index of the 2nd of the 4-quantiles: 18 × 0.5 = 9, which is an integer, so the average of the ninth and tenth values is taken: 9.5. The 2nd 4-quantile is 9.5.
Q3 = 0.75. The index of the 3rd of the 4-quantiles: 18 × 0.75 = 13.5, which rounds up to 14. The 3rd 4-quantile is 14.
The range of each bin is: < 5; ≥ 5 – < 9.5; ≥ 9.5 – < 14; ≥ 14.
See the document ‘Creating Quantiles.docx’ for more examples.
Data Preparation
Data transformation for numeric variables
Binning
Optimal binning
Form bins for an interval variable so that the bins have the highest predictive power with respect to the target.
No need to pre-determine the exact number of bins.
Optimal bins are found by using a decision tree algorithm with the variable to be transformed as the only input variable.
Records with missing values may be binned separately or merged with one of the bins.
Data Preparation
Data transformation for categorical variables
Binary dummy variable: A binary dummy variable is a numerical variable coded with 0 or 1 to reflect the presence or absence of a particular categorical code in a given column.
Not all data mining models require such transformation (e.g. Decision tree).
Most data mining software will perform the transformation as needed automatically.
If it is necessary to apply the transformation manually (a sketch follows this list):
Use 1 binary dummy variable for a binary categorical column.
For a nominal categorical column with m levels:
Regression model: Use m-1 binary dummies.
Neural network model: Use m-1 or m binary dummies.
Cluster analysis: Use m binary dummies.
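A minimal data step sketch of manual dummy coding, assuming a customers data set with a three-level nominal variable district taking the values 'North', 'South', and 'East' (all names hypothetical):

data customers_dummy;
   set customers;
   district_north = (district = 'North');   /* 1 if true, 0 otherwise */
   district_south = (district = 'South');
   /* 'East' is the reference level when m-1 dummies are used;
      add district_east = (district = 'East') for m dummies */
run;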
Data Preparation
Data transformation for categorical variables
Categorical variable with too many levels
In general, a categorical variable with k levels requires a minimum of k-1 parameters in the model.
Excessive numbers of parameters can slow down processing and generate very little performance improvement.
Possible solutions:
Seek a higher-level hierarchy that makes sense when modeling such a variable.
E.g. Street name can be replaced by district name, product can be replaced by type or group of product.
Data Preparation
Data transformation for categorical variables
Categorical variable with levels that rarely occur
Some categorical variables have a few dominant levels that account for the majority of the data as well as a large number of levels that are extremely small in comparison.
Possible solutions:
Group the infrequently occurring categories into an “other” category.
It might also be possible to group the infrequently occurring levels with a more frequently occurring level that seems to make the most sense.
Data Preparation
Data transformation for categorical variables
If there is a target, one may employ an optimal grouping method.
Optimal grouping:
Combine some existing levels of a categorical variable into new levels so that the new levels have the highest predictive power with respect to the target column.
The method is identical to that for interval variables.
Categorical ordinal levels can only be grouped with neighbouring levels.
Data Preparation
Data derivation
This is about defining new variables that express the information inherent in the data in ways that make the information more useful or more readily available to data mining techniques.
Turning counts into rates
Tallies for events that occur over time, such as the number of purchases, can be converted to rates by dividing by some fixed time unit to get the number of events (e.g. number of purchases) per time unit (e.g. per month, or per year). This allows records with different tenures to be compared (a sketch follows the example below).
For example, a customer with ten events in six months of tenure is experiencing events at the same rate as someone who experiences twenty per year.
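A minimal data step sketch of this conversion, assuming hypothetical columns n_purchases and tenure_months:

data customers_rate;
   set customers;
   /* monthly rate; left missing when tenure is zero */
   if tenure_months > 0 then
      purchases_per_month = n_purchases / tenure_months;
run;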
Data Preparation
Data derivation
Relative measures
Variables are often stored as counts of members of particular categories, for example the number of purchases in each of 5 product categories. The problem with this type of count is that the counts tend to rise and fall together (they are correlated with the total count of all categories).
The count for each category can be converted into a percentage (or proportion) of all categories. For example, number of purchases in category A / total number of purchases in all categories.
If there are N categories, knowing the percentages for N − 1 of them is sufficient to determine the value of the remaining category.
Be aware that the total count for some records could be 0. The transformation for these records generates missing values, as the sketch below makes explicit.
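A minimal data step sketch, assuming hypothetical counts cat_a, cat_b, and cat_c; a zero total deliberately leaves the shares missing:

data purchases_share;
   set purchases;
   total = sum(cat_a, cat_b, cat_c);
   if total > 0 then do;
      pct_a = cat_a / total;
      pct_b = cat_b / total;
      pct_c = cat_c / total;
   end;
run;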
Data Preparation
Data derivation
Number of days since
Records the number of days since the last known activity.
Example: Suppose today’s date is 5-Dec-2014
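A minimal data step sketch, assuming a SAS date column last_activity_date (hypothetical name); subtracting two SAS dates yields the number of days between them:

data customers_recency;
   set customers;
   days_since = '05DEC2014'd - last_activity_date;
run;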
Data Preparation
Data derivation
Other frequently derived columns:
Total column
E.g. Total sales column is derived by summing the individual monthly sales column.
Ratio column
E.g. sales ratio of two consecutive time units; ratio of amount paid and amount billed.
Difference column
When two things are usually equal, the few cases where they are not may indicate something of interest. For example, in catalog or online retail sales data, the columns original order value and net order value are usually the same, but the difference between the two columns may reveal the unprofitable behaviour of a particular group of customers: the returners.
Data Preparation
Data reduction
Data set with many variables
It is a blessing, as more descriptive information is available.
It is a curse, as many variables are likely to be correlated, more columns may contain missing values, and more computing resources are required.
A categorical variable with a high number of levels will also generate a large number of dummy variables.
Risk of overfitting
More variables provide more opportunities for overfitting in supervised modelling because there will be fewer observations at each value of each variable.
With more variables available, the model tends to memorize specific cases in the model data set, rather than generalizing from them in a way that creates stable models.
Data Preparation
Data reduction
Risk of correlation among variables
Correlated variables might describe the same phenomenon, but some of the variables may be more descriptive than others.
For example, to describe declining usage, the ratio of this year’s usage to last year’s being less than one clearly describes “declining usage”. Recent usage being low would be correlated with declining usage but has less descriptive power than the ratio.
For unsupervised data mining models, or other models that rely on distance calculations, correlated variables overweight certain characteristics.
For example, if average monthly spending and yearly spending are two variables, the distances among “high spending” customers are going to be small along both dimensions. In effect, the model focuses on just one aspect of the data, such as total spending.
Data Preparation
Data reduction
Variable reduction (selection) techniques
Using the target variable and input variables
For categorical targets, drop input variables with a low chi-square value.
For continuous targets, drop input variables with low R2 value.
Build a large decision tree classification model. Select the set of variables included in the tree.
Be aware that decision trees can select variables that are highly correlated with each other.
Build a regression model with a stepwise (or forward, or backward) model selection procedure. Select the set of variables included in the model.
This procedure can be time consuming for a large number of input variables.
Can create a combined set of candidate variables that are chosen by any of the methods.
Data Preparation
Data reduction
Variable reduction (selection) techniques
Using input variables only
Method 1: Derive principal components for all variables or subset of variables, and to select some number of these for modelling purposes.
The principal components do not necessarily lend themselves to explanation.
Method 2: Generate variable clusters by using a clustering algorithm. Then select the most descriptive variable in each cluster.
Method 3: Each variable is assigned to either the first principal component or the second principal component based on its contribution. This variable assignment becomes the highest split in the variable hierarchy. The process repeats itself for each variable cluster. Stop the iterative reassignment when some criteria (such as proportion of variation explained) are met.
Data Preparation
Data reduction
Variable reduction techniques
Using input variables only
Method 4:
Stage 1: Use Method 2 to generate a set of global clusters. The number of clusters obtained at this stage is not more than the integer value of (number of variables)/100 + 2.
Stage 2: Apply Method 3 to each global cluster identified in Stage 1.
Methods 2 to 4 should be applied to interval variables only, since dummy variables from one categorical variable could be clustered into different variable clusters.
Verifying the Resultant Data Set (Self-study)
It is not uncommon to have performed hundreds of transformation actions on the operational data sets.
The final combined data sets often contain errors.
Verifying the combined data sets can save a considerable amount of rework.
It may be necessary to perform this step more than once as new data or new transformations are applied from time to time.
Verifying the Resultant Data Set (Self-study)
Methods of verification
For a discrete column, compare the distributions from the data source tables and the combined / resultant data set.
For an interval column, compare the statistics (min, max, average, and standard deviation) between the data source tables and the combined / resultant data set.
Examine the data type of each column in the combined / resultant data set.
Compare the number of objects of interest (such as IDs) in the original data to that in the combined / resultant data set.
If a data set was filtered, the filtered records should not appear in the new data set.
If a data set was sampled, the distributions or statistics for each column in the new data set should be similar to that in the original data set.
Verifying the Resultant Data Set (Self-study)
Methods of verification
If a column value was recoded or grouped, the number of records at each value in the transformed column should be identical to the total number of records with the corresponding set of values in the original column.
If missing values were filled in, the filled column should contain no missing values.
For a derived column, examine a few randomly selected records to confirm that the created values agree with the rules for creating the column.
For a derived column, examine its distribution.
Documenting the Data Preparation Steps (Self-Study)
Documenting the data understanding and preparation steps is a tedious process, but it:
Ensures the integrity and accuracy of the data.
Maintains the production of the combined / resultant data sets.
Provides useful documentation for verification purposes.
Information required for each data set:
Name of the table.
Description of the table.
Experimental unit.
Sources.
Column name (also record the original column name if available).
Data type of the column.
Applied transformations and imputation methods.
Value range for interval column.
List of levels for categorical column.
SAS Enterprise Miner 15.1
Example 2.1: Creating a new SAS EM project
Select New Project in the Welcome window.
Create a new project:
Step 1: Click Next button.
Step 2: Name the project as Project1. Click Next button.
For locally installed EM: You may specify directory C:\ (or other directory) as the SAS Server Directory. Do not specify a directory which has special characters (space, Chinese etc.) in its name.
Step 3: Click on Next to confirm SAS Folder Location.
A folder named Project1 will be created under your user folder on cloud.
Step 4: Click on Next and Finish buttons.
An EM project created in SAS OnDemand cannot be downloaded.
SAS Enterprise Miner 15.1
Example 2.1: Create a new SAS EM project
Interface components: Diagram Workspace, Tool Buttons, Project Panel, Properties Panel, Help Panel, Tools.
SAS Enterprise Miner 15.1
Example 2.1: Define source data
EM can only work with SAS data sets through a pre-defined SAS library. For convenience, create or upload the required SAS data sets in SAS Studio or SAS EG.
Since you have registered for the course MS6711_2020, you can share my SAS data sets in SAS OnDemand. The name of an uploaded SAS data set must be in lowercase letters.
Click Project1 in SAS EM project panel and then select the … button on the right of the Start-Up Code property in the properties panel to open Start-Up code window.
Type the LIBNAME statement below in the Start-Up Code window, where u45408717 is my identity in SAS OnDemand:
libname mydata 'my_shared_file_links/u45408717/SAS Data';
MYDATA is the created SAS library name, and the path is the definition of this library.
Click Run Now button and then OK button to save the statements and close the code window.
SAS Enterprise Miner 15.1
Example 2.1: Define source data
SAS Enterprise Miner 15.1
Example 2.1: Defining source data
Right click Data Sources under Project1 in the project panel, then select Create Data Source.
Data Source Wizard
Step 1: Source: SAS table, click Next button.
Step 2: Browse, expand MYDATA library and select Customer_join data set, click OK and then Next button.
If MYDATA library is not in the available library list, click Refresh button.
Step 3 – 4: Click Next button.
Step 5: Review Column Metadata. By default, each variable is assigned with a Role value and Level value.
SAS Enterprise Miner 15.1
Example 2.1: Defining source data
Data Source Wizard (Optional)
If the Advanced option in Step 4 is selected, one can assign the Role value to Reject according to the specified level thresholds.
Under the default Advanced setting, the Role values of variables City, Postcode, and State are assigned to Rejected.
To change the default threshold levels, in Step 4:
Check Advanced option and click Customize button to change the threshold levels.
For example, change the Reject Levels Count Threshold value to 50 (or other value as applicable). Then click OK button.
One can use the Advanced option to quickly identify variables with high number of levels.
For this example, we use Basic option.
SAS Enterprise Miner 15.1
Example 2.1: Defining source data
Data Source Wizard
Step 5: (Cont’d)
To look at the field values for some of the records, select all variables, then click the Explore button.
In Sample Properties of Explore Window, set the value of Sample Method to Random, and the value of Fetch Size to Max. Click Apply button.
SAS Enterprise Miner 15.1
Example 2.1: Defining source data
Data Source Wizard
Step 5: (Cont’d)
Enlarge the Data window to view the sampled records.
Enlarge the Sample Statistics to view the statistics based on the sample.
SAS Enterprise Miner 15.1
Example 2.1: Defining source data
Data Source Wizard
Step 5: (Cont’d)
Close the Explore window and return to the Variables window.
It is often necessary to adjust the Role value or Level value manually. Otherwise, data may be encoded or treated inappropriately at later stages.
Right click the Role cell of the Activated_yearmth row and select Input.
Change the Role of Customerid to ID.
Click Next button to save the changes and to go to next step.
Step 6-7: Click Next button.
Step 8: Click Finish button.
SAS Enterprise Miner 15.1
Example 2.1: Defining source data
Variables Window
It is also possible to adjust the Role and Level of variables after the Data Source Wizard is closed.
Select Customer_join in Data Sources to review the properties of the data set in the Properties Panel.
Select the … button to the right of the Variables property to open the variables window as before.
SAS Enterprise Miner 15.1
Example 2.1: Creating a New Project Diagram
Right click Project1 (or Diagrams), then select Create Diagram.
Enter name Customer Diagram (or other name you prefer).
SAS Enterprise Miner 15.1
Example 2.1: Input Data Node
In each diagram, there must be at least one Input Data Node in the workspace.
Drag Customer_join from the Data Sources into the workspace.
SAS Enterprise Miner 15.1
Example 2.1: Input Data Node
Click the Input Data node in the workspace to show the property of the node in the Properties Panel.
Select the … button to the right of the Variables property to activate Variable Window.
As in the Variables window in Data Sources, you can change the role and measurement level of each variable.
Changes that are made here only affect subsequent analyses that are based on this node.
You can view the distributions of selected variables as before.
Right click the node and select Run to run the node.
SAS Enterprise Miner 15.1
Example 2.1: StatExplore Node
Used for generating summary statistics for the variables in the data set.
Drag StatExplore node from Explore Tool into the diagram.
Connect the Input Data node to the StatExplore node.
SAS Enterprise Miner 15.1
Example 2.1: StatExplore Node
Select the StatExplore node in the diagram and select the … button to the right of Variables to activate the Variables Window.
Select a variable or some variables (hold control key and select the variables simultaneously) and click Explore as before.
If some variables are not to be used for the summary report, set their Use values to No.
Change Number of Observations of Data property to All for using all observations in the report.
To view summary statistics of the selected variables:
Right click the StatExplore node, select Run, and view the Results.
In the Output window, the variables City and Postcode are both reported with 513 levels. In fact, the actual numbers of levels of both variables are higher than 513.
SAS Enterprise Miner 15.1
Example 2.1: MultiPlot Node
Used for visualizing the distribution of selected variables.
Drag MultiPlot node from Explore Tool into the diagram.
Connect the StatExplore node to the MultiPlot node.
To plot the distributions of the variables, run the node.
If some variables are not to be used for the plotting, set their Use values in Variables window to No.
SAS Enterprise Miner 15.1
Example 2.1: MultiPlot Node
An error occurred because the variables City and Postcode contain too many levels.
Set the Use values of City and Postcode in the Variables window to No. Run the node again.
SAS Enterprise Miner 15.1
Example 2.1: Graph Explore Node
Used for plotting charts (such as distribution, box plots, scatter plot etc.) for single variables and multiple variables
The resulting plot is fully interactive.
Drag Graph Explore node from Explore Tool into the diagram.
Connect the MultiPlot node to Graph Explore.
Set the following properties:
Method in Sample Properties to Random.
Size in Sample Properties to Max.
SAS Enterprise Miner 15.1
Example 2.1: Graph Explore Node
Run the node. Select the Results button to open the Graph Explore Results window.
Select the Sample Table and click View | Plot from the result menu to select the type of chart to be drawn.
Draw the chart for each selected variable.
Close the Results window. Any open graphs are persisted; when the Results window is opened again, the graphs will be recreated.
SAS Enterprise Miner 15.1
Example 2.1: Drop Node
Used for removing selected variables from the data set, so that nodes connected after the Drop node will not have access to these variables.
Example: For illustration purpose, the variables City and Postcode are dropped in this example.
Drag Drop node from Modify tools into the diagram.
Connect the Input Data node to the Drop node.
Select the Drop node in the diagram to review its properties in the property panel.
SAS Enterprise Miner 15.1
Example 2.1: Drop Node
Example (Continued)
Select the Drop node in the diagram and select the … button to the right of Variables to activate the Variables Window.
Click the Drop cell of the City row and select Yes to drop the variable.
Repeat the same for the variable Postcode. Click the OK button.
In the Property Panel of the Drop node, set the value of Drop from Table to Yes (that is, remove the variables completely from the output data set of this node) and set all the other options in Drop Selection Options to No.
Run the node.
SAS Enterprise Miner 15.1
Example 2.1: Filter Node
Used for removing certain observations from the data set.
E.g. Remove all observations with CurrentBalance > $400.
Drag Filter Node from Sample tools into the Diagram.
Connect Drop Node to Filter Node.
Select Filter Node to review its properties in the Property Panel.
Set Default Filtering Method to None for both Class and Interval variables respectively.
User now needs to specify which observations are to be filtered out.
SAS Enterprise Miner 15.1
Example 2.1: Filter Node
E.g. Continued
Select the … button to the right of the Interval Variables property to activate the Interactive Interval Filter window.
Select the CurrentBalance and click the Filtering Method cell at CurrentBalance row.
Select User Specified filtering method and type 400 at the Filter Upper Limit cell to filter out records with CurrentBalance > 400.
User may click Generate Summary button to view the distribution of the selected variable.
Click OK button and Run the node.
SAS Enterprise Miner 15.1
Example 2.1: Replacement Node
Used for replacing values of selected variables.
Example: Replace all levels of Churn_reason except ACTIVATION by CHURNED.
Drag the Replacement node from Modify tools into the diagram.
Connect the Filter node to Replacement node.
Select the Replacement node to review its properties in the Property Panel.
Set Default Limits Method of Interval Variables to None.
Click the … button to the right of Replacement Editor of Class variables.
SAS Enterprise Miner 15.1
Example 2.1: Replacement Node
Example: Continued
In the Replacement Editor, use the Replacement Value column to replace the value of class variable levels.
Replace all levels except ACTIVATION of Churn_reason by CHURNED.
Click OK button. Then run the node.
SAS Enterprise Miner 15.1
Example 2.1: Transform Variables node
Used for creating new variables that are the transformation of existing variables in the data set.
Available transformation methods include:
For interval variables: simple transformations (such as log, square root, etc.), binning transformations, and power transformations.
For class variables: grouping rare levels transformation, and dummy indicator transformation.
Customized transformation.
Examples:
Transform the variable tot_paid_amt into 5 bins of equal size.
Group rare levels of the variable Region into Others.
Create a new variable average_amt_perinvoice, which equals tot_invoice_amt / invoice_count.
SAS Enterprise Miner 15.1
Example 2.1: Transform Variables node
Create equal frequency bins
Drag Transform Variables node from the Modify tools into the diagram.
Connect the Replacement node to the Transform node.
Select the Transform Variables node in the diagram to review its properties in Property Panel.
Click the … button to activate the Variable table to set the transformation method for each variable.
Click the Method cell at the tot_paid_amt row and select Quantile (equal binning) from the drop list.
Type value 5 in the respective Number of Bins cell. Click OK and run the node.
SAS Enterprise Miner 15.1
Example 2.1: Transform Variables node
Group rare levels
Change the cutoff value of the Grouping Method in the Property Panel from the default 0.1 (10%) to 0.01 (1%).
Click … button to the right of Variable to activate the Variable table again.
Click the Method cell at the Region row and select Group Rare Levels from the drop list. Click OK. Then run the node.
SAS Enterprise Miner 15.1
Example 2.1: Transform Variables node
Create a new variable
Set the following Sample Properties:
Method to Random.
Size to Max.
Select … button to the right of Formulas in the Property Panel.
Click the Create icon in the Formula Builder window to activate the Add Transformation window.
Type in the new variable name and the formula. Click OK.
Run the Transform Variables node.
SAS Enterprise Miner 15.1
Example 2.1: Sample node
Used for creating a sample of records by random, cluster, or stratified sampling method.
To create a balanced sample of churned and activation records:
Drag the Sample node from Sample tools into the diagram.
Connect the Transform node to the Sample node.
Click … of Variables property of the Sample node to activate the Variables window.
Change the Sample Role of Rep_churn_reason to Stratification. Click OK to save the change.
Change the Sample Method of Train property to Stratify.
Change the Type of Size property to Number of Observations.
Change the Observations of Size property to 20,000.
Change the Criterion of Stratified property to Equal.
Run the node.
SAS Enterprise Miner 15.1
Example 2.1: Closing and re-opening an EM project
Click Close This Project in the File menu, and then click Exit to quit EM. The project is saved automatically.
To open the project again at the same machine:
Start SAS EM as usual and select Open Project in the Welcome window.
In the Open Project window, select Project1 (or another project). Then click on OK.
SAS Enterprise Miner 15.1 (Self-Study)
Example 2.1: Create Model Package
A package contains part of the results from the nodes; the contents of the package can be viewed on another PC.
The created package is stored under the Reports folder of the project by default.
Users can create as many packages as needed for a project.
A package contains:
A snapshot of the flow associated with the reporting node.
The settings and results for the entire flow can be viewed.
To create a package:
Right click the last node of the path and select Create Model Package from the list.
You are then prompted to enter a name for the model package.
SAS Enterprise Miner 15.1 (Self-Study)
Example 2.1: Create Model Package
Example:
Right click the Graph Explore node and select Create Model Package.
Enter Project1_Package as the name of the Model Package.
From SAS Studio, you can see that a folder named Project1_Package_XXXXX is created under the folder Reports of the project. Do not change the files contained in this folder.
This folder is self-contained. It does not need to be contained in a project.
To view a Model Package, either open it within the same project, or download miningResult.spk (or Project1_Package in EM) to your local hard disk. Then,
Create a new EM project or open an existing EM project.
In the new EM project, select File | Open Model Package, then locate and select the file miningResult.spk inside the Model Package folder. Click Open.
SAS Enterprise Miner 15.1
Example 2.2
A charity wants to develop a donor selection model for selecting potential donors from the records stored in the database.
DONOR_RAW_DATA.SAS7BDAT represents a randomly selected sample from the database for model development purposes.
Create a new EM project named Project2. Import the data set into the project. Keep this project saved for later use.
Define the data source with donor_raw_data.
19,372 observations.
50 columns (see Donor_raw_description.pdf for the description of the variables).
Variable Donation indicates whether the respondent donated in the last campaign (1=Donated, 0 = Not donated).
Variable Donation_Amt indicates the amount donated by the donor in the last campaign. It has no value for those who did not donate.
SAS Enterprise Miner 15.1
Example 2.2
Define role and level for the variables
In Variables window (or Data Source Wizard) of Data Sources, set the Role and Level for the following variables: (Click Compute Summary button to display statistics)
Control_Number: ID role
Donation: Target role; Binary level; Descending order
Donation_Amt: Rejected role
File_Avg_Gift: Rejected role (identical to LifeTime_Avg_Gift_AMT)
Frequency_Status_97NK: Ordinal level
Home_Owner: Binary level
Income_Group: Nominal level
In_House: Binary level
Pep_Star: Binary level
Published_Phone: Binary level
Recent_Star_Status: Interval level
Wealth_Rating: Nominal level
Click OK to save the changes.
Create a new diagram and then drag the Donor_Raw_Data node from the Data Sources to the diagram.
SAS Enterprise Miner 15.1
Example 2.2
Data Exploration
Connect the Data Input node to a StatExplore (or MultiPlot).
From the results of StatExplore node, we note the followings:
Donation levels: 1 (25%), 0 (75%).
Cluster_Code contains high number of levels.
Large proportions of Income_Group, Wealth_Rating, and Donor_Age are missing. Months_Since_Last_Prom_Resp contains a small number of missing values.
Missing values of the variables SES (2.34%) and Urbancity (2.34%) are represented by a ‘?’ symbol.
Donor_Gender has 4 different values.
The minimum value of Months_Since_Last_Prom_Resp is negative.
SAS Enterprise Miner 15.1
Example 2.2
Data Exploration
Connect the Data Input node to a StatExplore (or MultiPlot).
From the results of StatExplore node, we also note the followings:
Distributions of Lifetime_Avg_Gift_Amt, Lifetime_Gift_Range, Lifetime_Max_Gift_Amt, Lifetime_Min_Gift_Amt, Mor_Hit_Rate, Median_Home_Value, and Pct_Attribute1 are fairly skewed (non-symmetrical) and have high kurtosis (very spiky).
Many variables have high chi-square statistics with the target Donation.
SAS Enterprise Miner 15.1
Example 2.2
Replace values of variables
Connect Input Data node to a Replacement node.
Set Default Limits Method of Interval Variables to None so that no value replacement will be applied to interval variables by default.
In Replacement Editor of Interval variables:
Replace the negative values (< 0) of Months_Since_Last_Prom_Resp by missing values.
SAS Enterprise Miner 15.1
Example 2.2
Replace values of variables
In the same Replacement node (or a new one):
In the Replacement Editor of Class Variables:
Replace value A of Donor_Gender by U.
Replace value . of Income_Group by 99.
Replace value . of Wealth_Rating by 99.
Click OK to continue.
Run the Replacement node. Always check the results.
This set of replacements involves only the values of the respective observation for each variable; it can therefore be applied before the data partition.
SAS Enterprise Miner 15.1
Example 2.2
Impute missing values of Rep_Months_Since_Last_Prom_Resp
Since the actual value of Months_Since_Last_Prom_Resp is larger than or equal to the value of Months_Since_Last_Gift, set the missing values of Rep_Months_Since_Last_Prom_Resp equal to the value of Months_Since_Last_Gift of the same observation.
Connect the Replacement node to a SAS Code node.
In the Code Editor of the SAS Code node, type the following code into the Training Code window. Then run the node.
data &em_export_train;
   set &em_import_data;
   /* fill a missing replacement value with the months-since-last-gift value */
   if rep_months_since_last_prom_resp = . then
      rep_months_since_last_prom_resp = months_since_last_gift;
run;
SAS Enterprise Miner 15.1
Example 2.2
Data Partition
Connect the SAS Code node to a Partition node.
Set the following properties for the Partition node:
In Variables window, set the Partition Role of Donation to Stratification.
Partition method: Stratified.
Data Set Percentages: training (60%), validation (20%), and test (20%).
Run the node.
All subsequent data modifications and model development will be based on the training set and the validation set. The same data modifications will be applied to the validation and test sets automatically.
SAS Enterprise Miner 15.1
Example 2.2
Create optimal groups for selected categorical variable
Connect the Partition node to a Variable Selection node.
In the Variables window of the Variable Selection node:
Set the Use status of Donation and Cluster_Code to Yes; set all the other variables to No.
Click OK to continue.
Set Rejects Unused Input in the Train options to No so that all variables are passed to successor nodes.
Set Use Group Variables in the R-Square Options to Yes to group the levels of a class variable based on the relationship of the variable to the target variable.
Run the node.
In Results, select View | Variable Tables | Group Variables to view the grouping.
SAS Enterprise Miner 15.1
Example 2.2
Create optimal bins for interval variables and keep missing values, if any, as a separate group.
Connect the Variable Selection to a Transform node.
Set Interval Inputs of Default Methods to Optimal Binning.
Set Treat Missing as Level of Default Methods to Yes to keep the missing values, if any, as a valid bin by themselves.
Set Number of Bins in Optimal Binning to 10 so that at most 10 bins will be created. The minimum acceptable value is 4.
Set Hide and Reject in Score to No so that the original variables are visible and with Input role in the output.
Run the node.
SAS Enterprise Miner 15.1
Example 2.2
Variables selection
Connect the Transform node to another Variable Selection node.
Set the Target Model in Train to R and Chi-square so that variables are kept only if they are selected by both methods.
Set the Number of bins in the Chi-Square Options to 20 to speed up the process.
Set the Use AOV16 Variables and Use Group Variables in the R-Square Options to No, as the interval variables are already binned and the categorical variables considered have only a small number of levels.
Run the node.
In Results, Variable Tables | Variable Selection shows the set of variables selected by both methods.
Keep this project saved for later use.
Example tables for the data rearrangement (reverse pivot) slide.

Before the reverse pivot (one row per customer per quarter):

Customer ID   Quarter   Due Amount
A             1         $41.25
A             2         $91.63
A             3         $30.12
A             4         $77.70
B             1         $57.53
B             2         $10.93
B             3         $68.18
B             4         $92.96

After the reverse pivot (one row per customer):

Customer ID   DueAmount_1   DueAmount_2   DueAmount_3   DueAmount_4
A             $41.25        $91.63        $30.12        $77.70
B             $57.53        $10.93        $68.18        $92.96