Chapter 1
Data Mining: An Overview
Goals of the Chapter
Definition of data mining.
Data mining process.
A data mining case.
What Is Data Mining?
Knowledge discovery (KD), or knowledge discovery in databases (KDD), has been defined as the ‘non-trivial extraction of implicit, previously unknown and potentially useful information from data’.
It is a process of which data mining forms just one part.
[Figure: an idealized version of the KD process]
Over the years, KD has gone by many different names, such as machine learning, business intelligence, predictive modelling, predictive analytics, big data analysis, and so on.
Many of these are just buzzwords. They mean more or less the same kind of process.
Although the terms KD and data mining are distinct, many researchers and practitioners use data mining as a synonym for the KD process.
We adopt a similar view in this course.
Brief history of data mining
Collecting and storing data on computers, tapes and disks started in the 1960s.
The next evolutionary step in data mining came in the 1980s with the introduction of relational databases and structured query languages.
The term “data mining” was introduced in 1989 by the researcher Gregory Piatetsky-Shapiro. At that time, only hard-to-use algorithms were available.
Next came the introduction of data warehousing in the 1990s. Online analytical processing (OLAP) and multidimensional databases contributed to the growth of data warehousing.
In the late 1990s, data mining suites were developed by vendors such as SPSS, IBM, and SAS. These suites allowed the user to perform several discovery tasks and also supported data transformation and visualization.
Four areas contributed to the growth of data mining in its current form. They are:
Statistical concepts deal with data and the relationships among them. These concepts are the building blocks of advanced data mining techniques;
Machine learning (artificial intelligence) is the area that tries to simulate human thought processes or human intelligence in statistical problems;
The increasing availability of automatic data capture mechanisms allows not only more events to be recorded, but also more information to be captured per event;
Computer power (speed and storage space) allows large volumes of multidimensional data to be analysed at extremely high speed.
In their book “Data Mining Techniques” (3rd edition), Berry and Linoff give the following definition of data mining:
“Data mining is a business process for exploring large amounts of data to discover meaningful patterns and rules.”
This definition has several parts, all of which are important:
Data mining is a business process
Large amount of data
Meaningful patterns and rules
Data mining is a business process
A business process is a collection of related, structured activities or tasks that produce a specific service or product for a particular customer or customers.
It can often be visualized with a flowchart as a sequence of activities with interleaving decision points.
A business process begins with a mission objective and ends with achievement of the business objective.
Large amount of data
How much is a lot of data?
The desktop version of Excel allows just over 1 million rows of data (1,048,576) on a single worksheet.
Excel contains powerful tools for visualizing and understanding data, provided the data fit on a single worksheet.
What if the data do not fit on a single worksheet?
For example, consider the transaction data generated by a large supermarket over a week.
Even if the data do fit on a single worksheet, can you find groups of transactions by similarity?
Excel has no built-in tool of this kind, although some add-in tools are available.
Meaningful patterns and rules
Finding trivial patterns in data is not tremendously difficult. For example, more female customers purchased product X than male customers, or more middle-aged customers purchased product Y than other age groups.
Data mining is about finding non-trivial patterns.
For example: finding three groups of customers, namely valuable customers with low risk, valuable customers with some risk, and customers with high risk.
For example: identifying the customers who are most likely to leave.
Data mining techniques can be broadly divided into two categories: supervised and unsupervised methods.
Supervised methods
These include algorithms used for classification and estimation.
In estimation, there is a numerical target (dependent) variable, such as age, income, or amount of spending.
In classification, there is a categorical target variable with two or more levels, such as Yes / No; Default / Not Default; Renew / Not Renew; or High Risk / Medium Risk / Low Risk.
A supervised algorithm examines a large set of records, each record containing information on the target variable as well as a set of input (independent or predictor) variables.
It “learns” which combinations of input variables are associated with each value of the target variable.
The algorithm then looks at new records, for which no information about the target variable is available. Based on what it has learnt, the algorithm decides the appropriate value (numeric or categorical) of the target variable for each new record.
Examples of supervised methods: regression, logistic regression, decision trees, neural networks, etc.
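The learn-then-predict cycle described above can be sketched with the simplest estimation method on the list, a one-variable least-squares regression. This is only an illustration: the training records, the variable names (income, spending) and the new customer are all made up.

```python
# Hypothetical training records: (income in $1000s, annual spending in $1000s).
# Spending is the numerical target variable; income is the single input variable.
training = [(30, 6.0), (40, 8.0), (50, 10.0), (60, 12.0), (70, 14.0)]


def fit_line(records):
    """Least-squares fit of: target = intercept + slope * input."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply the learnt relationship to a record whose target is unknown."""
    intercept, slope = model
    return intercept + slope * x


# "Learning" step: examine records where the target is known.
model = fit_line(training)

# "Prediction" step: a new customer with known income but unknown spending.
new_income = 55
estimate = predict(model, new_income)
print(f"Estimated spending for income {new_income}: {estimate:.1f}")
```

A classification method follows the same two steps; only the target variable (categorical instead of numerical) and the form of the fitted model change.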
Unsupervised methods
These algorithms are used for finding hidden structure in a data set that has no target variable.
There is no learning from cases where such a target variable is known.
Examples of unsupervised methods: association analysis and clustering.
Association analysis has also been adapted into rule-based classification (a kind of supervised method).
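As an illustration of finding structure without a target variable, the sketch below groups a handful of transactions by similarity. It uses k-means, one common clustering algorithm (the slides do not name a specific one), and the transactions, the two variables and the starting centroids are all hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Very small k-means: assign each point to its nearest centroid, then re-average."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical transactions: (number of items, total amount). No target variable.
transactions = [(2, 10), (3, 12), (2, 11), (20, 150), (22, 160), (21, 155)]

# Starting centroids are chosen by hand here to keep the example deterministic.
centroids, clusters = kmeans(transactions, [transactions[0], transactions[3]])

for centre, members in zip(centroids, clusters):
    print("centre", centre, "members", members)
```

The algorithm is never told which transactions are "small baskets" and which are "large baskets"; the two groups emerge from the similarity of the records alone.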
Other analytic techniques
Online analytical processing (OLAP)
Allows users to explore a data cube and drill down to understand interesting features that they encounter.
The user needs to come up with his or her own hypotheses about the possible relations between the variables and look for confirmation by observing the data.
OLAP answers questions like “What proportion of customers with low income and a lot of debt defaulted on a loan?”
DM answers questions like “What are the characteristics of the customers who are likely to default on a loan?”
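The difference between the two kinds of question can be sketched on a toy table. The customer records below are invented, and the exhaustive scan over segments merely stands in for a real data mining algorithm, which would search a far larger space of patterns.

```python
from itertools import product

# Hypothetical customer records (made-up values, for illustration only).
customers = [
    {"income": "low", "debt": "high", "defaulted": True},
    {"income": "low", "debt": "high", "defaulted": True},
    {"income": "low", "debt": "high", "defaulted": False},
    {"income": "low", "debt": "low", "defaulted": False},
    {"income": "high", "debt": "high", "defaulted": False},
    {"income": "high", "debt": "high", "defaulted": True},
    {"income": "high", "debt": "low", "defaulted": False},
    {"income": "high", "debt": "low", "defaulted": False},
]


def default_rate(records, **conditions):
    """Share of defaulters among the records matching all the given conditions."""
    matching = [r for r in records if all(r[k] == v for k, v in conditions.items())]
    if not matching:
        return None
    return sum(r["defaulted"] for r in matching) / len(matching)


# OLAP-style: the user supplies the hypothesis (low income, high debt) and checks it.
olap_answer = default_rate(customers, income="low", debt="high")

# DM-style: no hypothesis is supplied; every segment is examined and ranked.
segments = {}
for income, debt in product(["low", "high"], ["low", "high"]):
    rate = default_rate(customers, income=income, debt=debt)
    if rate is not None:
        segments[(income, debt)] = rate
riskiest = max(segments, key=segments.get)

print("OLAP answer:", olap_answer)
print("Riskiest segment found by scanning:", riskiest)
```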
Statistical methods
Statisticians and data miners use many of the same techniques, such as regression, but start from different assumptions.
Most statistical methods assume an underlying distribution for a given data set, such as the normal distribution. Based on that assumption, explicit inference, such as hypothesis testing, can be made about the parameters in the model.
Data miners make no assumption about the distribution of a given data set. They believe that if a pattern is found in a given data set, the same pattern should also be found in another data set with similar variable values.
Big data analytics
Big data is a general term used to describe the voluminous amount of unstructured and semi-structured data a company creates (e.g. text messages or videos generated on Twitter or Facebook). These data would take too much time and cost too much money to load into a relational database for analysis.
Because of the size and unstructured nature of these data, a new class of big data technologies has emerged. These include NoSQL databases (a type of database design) and Hadoop and MapReduce (both programming frameworks that support distributed computing environments).
Most of the effort in big data analytics goes into organizing and storing the data in an easy-to-retrieve form.
Most data mining algorithms can still be deployed in a distributed computing environment once the data are organized.
Major business applications
Customer loyalty/attrition
Market basket analysis
Direct marketing
Customer segmentation
Fraud Detection
Credit risk
Insurance premium determination
Data Mining Process
A misconception: Data mining simply consists of picking a computer-based tool to match the presented problem, applying it, and automatically obtaining a solution.
Data mining is not a random application of statistics and machine-learning methods and tools. It is a carefully planned and considered process of deciding what will be most useful, promising and revealing.
Simply knowing many algorithms used for data analysis is not sufficient for a successful data mining project. It is important to understand the overall approach.
Strictly speaking, no separate data mining process exists, because data mining is supposed to be one step of the knowledge discovery process (KDP).
Since data mining has become a synonym for knowledge discovery, the KDP models proposed from the 1990s onwards have been adopted as data mining process models.
Popular KDP models include:
Academic Research Model, proposed by Fayyad et al. in 1996.
Cross-Industry Standard Process for Data Mining (CRISP-DM), proposed by Daimler-Chrysler, SPSS, and NCR in 2000.
Hybrid model, proposed by Cios et al. in 2005.
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Business understanding
Clearly understand the project objectives and requirements from a business perspective.
Translate these goals and restrictions into a DM problem definition.
Prepare a preliminary strategy for achieving these objectives.
Data Understanding
Collect the data.
Become familiar with the data and discover initial insights.
Evaluate the quality of the data.
Data Preparation
Select the cases and variables that are appropriate for analysis.
Clean the data so that it is ready for modeling tools.
Integrate the data sets if needed.
Perform transformations on variables if needed.
Modeling
Select and apply appropriate modeling techniques.
Calibrate model settings to optimize results.
Often, several different techniques may be applied for the same data mining problem.
This may require looping back to the data preparation step, in order to bring the form of the data into line with the specific requirements of a particular data mining technique.
Evaluation
Models must be evaluated for quality and effectiveness.
Determine whether the model achieves the objectives set for it in Step 1.
Establish whether some important facet of the business or research problem has not been sufficiently accounted for.
Make a decision regarding the use of the data mining results.
Deployment
The discovered knowledge must be organized and presented in a way the end user can use.
Depending on the requirements, this can be as simple as a report or as complex as implementing a repeatable KDP.
The CRISP-DM model has been used in domains such as medicine, engineering, marketing, and sales.
The sequence of the steps in CRISP-DM is not strictly one-way; moving back and forth between different steps is always required.
In the CRISP-DM diagram, the outer circle symbolizes the cyclic nature of data mining itself.
The lessons learned during the process can trigger new, often more focused business questions.
A Case Study in Business Data Mining
Refer to “A Data Mining Approach for Retailing Bank Customer Attrition Analysis”
By XiaoHua Hu, Applied Intelligence 22, 47-60, 2005
Business understanding
A bank provides a loan service for over 750,000 customers with $1.5 billion outstanding.
Every month, the bank receives some 5,700 calls from customers wishing to close their account.
In addition, many customers will use the product as long as the introductory rate is in effect and then lapse thereafter.
The only retention effort implemented so far is to reduce the rate for some approached customers.
Business problem
The management of the bank wants to explore the possibility of setting up a knowledge-based retention effort, through a combination of effective segmentation, customer profiling, data mining, and credit scoring, that can retain more customers while maximizing revenue.
There are different types of attriters in the product line:
Slow attriters
Fast attriters
Cross selling
High risk
Pirating
A customer can display a subset of these behaviours over the life of the loan.
The study focused on two attrition problems:
Using data on accounts that remained continuously open in the last 4 months, predict, with 60 days' advance notice, the likelihood that a particular customer will opt to voluntarily close his/her account, either by phone or write-in.
Using data on accounts that remained continuously open in the last 4 months, predict, with 60 days' advance notice, the likelihood that a particular customer will have his/her account transferred to a competing institution. The account may or may not remain open.
Data mining problem
Develop models that predict the customers who are likely to attrite within 30 to 60 days on an ongoing basis.
Identify the characteristics of the most profitable/desirable customer segments in order to develop policies to ensure their continued support, to grow the group, and to acquire more customers with similar characteristics.
Identify customer groups whose characteristics lend themselves to migrating from unprofitable/dormant to profitable. Once identified, the characteristics can enable the development of risk, maintenance and opportunity policies tailored to a successful migration.
Data Selection
The following data sources were used:
Credit card data warehouse
Account-related demographic and credit bureau information
Account-related segmentation values based on the bank’s segmentation scheme
Database that stores all cheques processed
Data preparation
Monthly data sets were combined into one data file. Time-sensitive variables were “unrolled”. The data set contained around 870 variables.
Create target variable
For each account, the target variable was assigned the value 1 for an attriter and 0 for a non-attriter.
Preliminary data analysis
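Removed variables with a unary or near-unary value (the same value for all, or almost all, accounts).
Removed variables with low correlation to the target variable.
Detected and removed data “leakers” (variables that give away the value of the target variable).
The number of variables in the data set was reduced to 242.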
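The first two screening steps above can be sketched as follows. The thresholds (95% for "near unary", 0.1 for "low correlation"), the column names and the data are assumptions made for the example; the case study does not state the cut-offs it used, and leaker detection is not covered here because it usually needs domain knowledge.

```python
from collections import Counter
from math import sqrt


def is_near_unary(values, threshold=0.95):
    """True if a single value accounts for at least `threshold` of the records."""
    most_common_count = Counter(values).most_common(1)[0][1]
    return most_common_count / len(values) >= threshold


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def screen(columns, target, unary_threshold=0.95, min_abs_corr=0.1):
    """Keep only variables that are neither near-unary nor weakly correlated with the target."""
    kept = []
    for name, values in columns.items():
        if is_near_unary(values, unary_threshold):
            continue
        if abs(pearson(values, target)) < min_abs_corr:
            continue
        kept.append(name)
    return kept


# Hypothetical data: 8 accounts, target 1 = attriter, 0 = non-attriter.
target = [1, 0, 1, 0, 1, 0, 0, 0]
columns = {
    "balance": [100, 900, 150, 800, 120, 950, 870, 910],  # strongly related to target
    "country_code": [1, 1, 1, 1, 1, 1, 1, 1],             # unary
    "branch_digit": [1, 2, 2, 3, 3, 1, 2, 2],             # uncorrelated with target
}

kept = screen(columns, target)
print(kept)
```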
Randomly split the data set into subsets for training and verification
Around 2.2% of the accounts were attriters.
A balanced data set with a 50-50 split between the two categories (attriters and non-attriters) was created for model building.
Another, unbalanced data set with 97.8% non-attriters was created for model verification.
Applied data mining algorithms to the balanced data set
Bayesian, decision tree, neural network, and ensemble models.
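One way to produce such a pair of data sets is sketched below. It assumes the balanced set is obtained by keeping every attriter in the training half and randomly undersampling the non-attriters; the case study summary does not say how the balancing was actually done, and the account records and field names here are hypothetical.

```python
import random


def split_and_balance(accounts, train_fraction=0.5, seed=0):
    """Randomly split accounts, then balance the training part by undersampling."""
    rng = random.Random(seed)
    shuffled = accounts[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, verify = shuffled[:cut], shuffled[cut:]

    attriters = [a for a in train if a["attrite"] == 1]
    non_attriters = [a for a in train if a["attrite"] == 0]
    # Assumption: keep all attriters and draw an equal number of non-attriters.
    balanced = attriters + rng.sample(non_attriters, len(attriters))
    rng.shuffle(balanced)
    # The verification part keeps its natural (unbalanced) class mix.
    return balanced, verify


# Hypothetical population: 1,000 accounts, 22 of them attriters (2.2%).
accounts = [{"id": i, "attrite": 1 if i < 22 else 0} for i in range(1000)]
balanced, verify = split_and_balance(accounts)

n_pos = sum(a["attrite"] for a in balanced)
print("balanced set:", len(balanced), "accounts,", n_pos, "attriters")
print("verification set:", len(verify), "accounts,",
      sum(a["attrite"] for a in verify), "attriters")
```

Verifying on the unbalanced set matters because that is the class mix the deployed model will actually face.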
Evaluation Criterion
Prediction accuracy is not a suitable evaluation criterion for a data mining application such as attrition analysis.
The goal of the attrition analysis is not to predict the behaviour of every customer, but to find a good subset of customers in which the percentage of attriters is high.
Lift instead of predictive accuracy was used as an evaluation criterion.
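Using the usual definition of lift (the attrition rate among the top-scored customers divided by the attrition rate among all customers), the criterion can be computed as in the sketch below. The scores and labels are invented for the example.

```python
def lift_at(scores, labels, top_fraction):
    """Lift of the top `top_fraction` of records when ranked by score (highest first)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(round(len(ranked) * top_fraction)))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate


# Hypothetical example: 100 customers with attrition scores 1.00, 0.99, ..., 0.01.
scores = [i / 100 for i in range(100, 0, -1)]
labels = [0] * 100
for position in (0, 2, 10, 40, 80):  # 5 actual attriters, i.e. 5% overall
    labels[position] = 1

# 2 of the 4 highest-scored customers are attriters (50%), against 5% overall.
lift = lift_at(scores, labels, top_fraction=0.04)
print(f"Lift in the top 4%: {lift:.1f}")
```

A lift of 1 means the model does no better than contacting customers at random; the higher the lift, the more attriters are concentrated in the selected subset.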
Data mining results
For the top 4% of attrition scores, the lift values of the deployed models were in the range of 3.9 to 6.6.
Several variables were found useful:
Most recent current balance
Current balance with constant values
Segment
Annual charge date
Number of payments
Incentive interest
Field test
The field test was designed to check two things:
Whether the top percentage of the customer attrition list does contain a concentration of attriters.
Whether the data mining based marketing approach is effective for retention.
Based on the attrition scores of the current customers, the top 4% of the customers were selected. Half of them were contacted and offered some incentive packages to encourage them to stay with the company. The other half were not contacted at all.
Two months later, the attrition rate of the contacted group was 0.12%, whereas the attrition rate of the non-contacted group was 5.6%. The historical attrition rate for a two-month period was 1.1%.
The lower attrition rate among the contacted group indicated that the proactive action did have an impact on the customers’ behaviour.
The high attrition rate among the non-contacted group demonstrated that the data mining model was accurate and that the top 4% captured a highly concentrated proportion of attriters.
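As a rough check on these figures, and assuming the historical two-month rate of 1.1% is a fair baseline for the whole customer base, the lift observed in the non-contacted group and the relative reduction in attrition achieved by contacting customers work out as:

```latex
\[
\text{lift}_{\text{non-contacted}} \approx \frac{5.6\%}{1.1\%} \approx 5.1,
\qquad
\text{relative reduction} = \frac{5.6\% - 0.12\%}{5.6\%} \approx 97.9\%.
\]
```

A lift of about 5.1 falls inside the 3.9 to 6.6 range reported for the deployed models on the top 4% of scores.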