ISOM3360 Data Mining for Business Analytics, Session 2
Data Mining Basics
Instructor: Department of ISOM Spring 2022
What is Data Mining?
Data mining (knowledge discovery from data)
Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from large amounts of data
Involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
Non-Trivial Data Mining Results
Beers and diapers were often bought together by customers.
Phoenix is not a good place for selling golf clubs, despite the many golf courses nearby.
People who buy small pads that adhere to the bottom of chair legs (to protect the floor) are more likely to be good credit risks.
Vegetarians tend to miss fewer flights.
Common Data Mining Tasks
Classification and class probability estimation
Determine which class an individual belongs to
Regression
Estimate the numerical value of some variable for an individual
Similarity matching
Identify similar individuals based on known attributes
Clustering
Group individuals together by their similarity
Co-occurrence grouping (frequent itemset mining)
Find associations between entities based on transactions involving them
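As a rough illustration of co-occurrence grouping, the sketch below counts how often each pair of items appears in the same transaction. The transactions and the helper name `pair_counts` are made up for this example; practical frequent itemset mining relies on more efficient algorithms than this brute-force count.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set is the content of one shopping basket.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair a canonical (alphabetical) order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

counts = pair_counts(transactions)
most_common_pair, support = counts.most_common(1)[0]
print(most_common_pair, support)
```

With these invented baskets, beer and diapers form the pair that co-occurs most often.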
Commonly Used Induction Algorithms
Revisit: Customer Retention
Which customers should they target with a special offer, prior to contract expiration?
Example Data Mining Solutions
Decision tree technique
If Education = ‘high’ and Gender = ‘male’, then the customer is likely to churn.
Logistic regression technique
Calculate the probability of churning given the features of a customer.
Nearest neighbor technique
Calculate how similar a customer is to existing churning customers.
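The three techniques can be sketched side by side as follows. This is only an illustration: the numeric features (`tenure_years`, `monthly_bill`), the logistic weights, and the example customers are all assumptions invented here, not fitted to any data or taken from the course.

```python
import math

# Hypothetical customer; the numeric features are invented for illustration.
customer = {"education": "high", "gender": "male",
            "tenure_years": 1.0, "monthly_bill": 80.0}

def tree_rule(c):
    """Decision-tree style rule from the slide: returns True if likely to churn."""
    return c["education"] == "high" and c["gender"] == "male"

def logistic_probability(c, w_tenure=-0.8, w_bill=0.02, bias=-0.5):
    """Logistic regression form: p = 1 / (1 + exp(-(bias + w . x))).
    The weights are made up, not learned from data."""
    z = bias + w_tenure * c["tenure_years"] + w_bill * c["monthly_bill"]
    return 1.0 / (1.0 + math.exp(-z))

def distance(c, other):
    """Euclidean distance on the two numeric features (smaller = more similar)."""
    return math.hypot(c["tenure_years"] - other["tenure_years"],
                      c["monthly_bill"] - other["monthly_bill"])

# Hypothetical existing customer who is known to have churned.
known_churner = {"tenure_years": 1.5, "monthly_bill": 76.0}

print(tree_rule(customer))
print(round(logistic_probability(customer), 3))
print(round(distance(customer, known_churner), 3))
```

The rule gives a yes/no answer, the logistic form gives a probability between 0 and 1, and the nearest neighbor view gives a distance to a known churner.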
A Process View to Data Mining
Data Mining Basic Terminologies
Data, target variable, model
Supervised vs. unsupervised learning
Classification vs. regression
Training vs. testing
Mining phase vs. using phase
Example (Instance)
A fact or a data point; described by a set of attributes (fields, columns, variables, features).
A data set:
A set of examples
A sample/subset of the universe
One example/instance
Can you name a few attributes of the following?
A stock
An apartment
Target Variable
A special variable that is the interest/target of the task.
Equivalent statistics terminology:
Attributes: variables
Target variable: dependent variable
Target variable
Types of Attributes/Variables (I)
Numerical variable (quantitative data)
Discrete variable: has only a finite or countably infinite set of values (often integer variables)
Example: the number of items bought by a customer (e.g., 12)
Continuous variable: has real numbers as attribute values
Example: the time that the customer spends (e.g., 16.49 min)
Types of Attributes/Variables (II)
Categorical variable (qualitative data)
Ordinal variable: has categories that can be meaningfully ordered
Example: course grade (A, B, C, D, …)
Nominal variable: the categories have no meaningful order
Example: location region ( , , , etc.)
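One practical consequence of the ordinal/nominal distinction is how the values might be turned into numbers for a learner. The sketch below is one common convention, not something prescribed by the slides: ordinal values get an ordered integer code, nominal values get one 0/1 indicator per category. The region labels are hypothetical placeholders.

```python
# Ordinal: the categories have a meaningful order, so an ordered code is reasonable.
GRADE_ORDER = ["D", "C", "B", "A"]  # lowest to highest

def encode_ordinal(grade):
    """Map a grade to its rank; a larger number means a better grade."""
    return GRADE_ORDER.index(grade)

# Nominal: no meaningful order, so use one 0/1 indicator per category (one-hot).
REGIONS = ["North", "South", "East"]  # hypothetical region labels

def encode_nominal(region):
    """Return a one-hot list with a 1 in the position of the given region."""
    return [1 if region == r else 0 for r in REGIONS]

print(encode_ordinal("A"), encode_ordinal("C"))
print(encode_nominal("South"))
```

Using an integer code for a nominal variable would suggest an order that does not exist, which is why the indicator form is used there.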
In-Class Exercise
The size of a company (#employees) is a ____________ variable.
The average height of students taking this class is a ____________ variable.
The country that an individual lives in is a ____________ variable.
Education level (less than high school, high school, bachelor, master, doctoral) is a ____________ variable.
Unstructured Data
A frequently cited rule of thumb holds that somewhere around 80-90% of all potentially usable business information may originate in unstructured form (text, images, social networks, etc.).
Product and hotel reviews
Blogs, forums and other social media
Voice of the customer data
Machine logs
Web logs
Unstructured data: Text (0/1 Representation)
Each row in the table represents a document.
Each attribute indicates whether or not a word appears in the document.
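A minimal sketch of the 0/1 representation, assuming whitespace-separated words and a few invented documents: each document becomes one row, and each column records whether a vocabulary word appears in it.

```python
# Hypothetical documents, invented for illustration.
documents = [
    "great hotel great view",
    "terrible service",
    "great service",
]

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_binary_row(doc):
    """1 if the word appears in the document at least once, else 0."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocabulary]

table = [to_binary_row(doc) for doc in documents]

print(vocabulary)
for row in table:
    print(row)
```

Note that the repeated word "great" in the first document still yields a single 1: this representation records presence, not frequency.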
Unstructured data: Image
What are the features of an image?
One tongue, two eyes, one nose?
A computer represents an image using RGB pixels. A 640×480 image consists of 307,200 pixels. Each pixel is an RGB tuple of three values, each between 0 and 255; e.g., (255, 0, 0) is red.
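The pixel arithmetic can be checked with a small sketch. The all-red image and the idea of flattening the pixels into one feature vector are assumptions made here for illustration; real image pipelines usually rely on dedicated libraries.

```python
WIDTH, HEIGHT = 640, 480
RED = (255, 0, 0)

# A stand-in image: a HEIGHT x WIDTH grid in which every pixel is red.
image = [[RED for _ in range(WIDTH)] for _ in range(HEIGHT)]

num_pixels = WIDTH * HEIGHT      # 640 * 480 pixels
num_values = num_pixels * 3      # one value per R, G, B channel

# Flatten the grid into one long list of numbers, a simple way to
# present the pixels to a learner as attributes.
features = [channel for row in image for pixel in row for channel in pixel]

print(num_pixels, num_values, features[:3])
```

Even a modest image therefore produces hundreds of thousands of raw attributes.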
Unstructured data: Network
Your network properties, such as your neighbors, your neighbors’ neighbors, and your “centrality”
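These network properties can be computed from a simple adjacency structure. The sketch below uses an invented friendship network and degree centrality, which is only one of several centrality measures; the slides do not say which one is intended.

```python
# Hypothetical undirected friendship network as an adjacency mapping.
network = {
    "you": {"ann", "bob"},
    "ann": {"you", "cat"},
    "bob": {"you", "cat", "dan"},
    "cat": {"ann", "bob"},
    "dan": {"bob"},
}

def neighbors(node):
    """Nodes directly connected to the given node."""
    return network[node]

def neighbors_of_neighbors(node):
    """Nodes two steps away, excluding the node itself and its direct neighbors."""
    direct = network[node]
    two_step = set()
    for n in direct:
        two_step |= network[n]
    return two_step - direct - {node}

def degree_centrality(node):
    """Fraction of the other nodes that this node is directly connected to."""
    return len(network[node]) / (len(network) - 1)

print(sorted(neighbors("you")))
print(sorted(neighbors_of_neighbors("you")))
print(degree_centrality("you"), degree_centrality("bob"))
```

Each of these values can then serve as an attribute of the individual, alongside ordinary structured attributes.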
Data Mining Process Revisit
A model is:
A pattern.
A summarization of relationships in the data.
A simplified representation of reality created to serve a specific purpose.
Some examples
IF Balance >= 50K AND Age > 45
THEN Default = ‘no’ ELSE Default = ‘yes’
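Written as code, the example rule is just a function from attribute values to a predicted target value. The function name `predict_default` is an assumption, as is reading "50K" as 50,000 in whatever currency the balance uses.

```python
def predict_default(balance, age):
    """Example rule: IF Balance >= 50K AND Age > 45 THEN 'no' ELSE 'yes'.
    Assumes 50K means 50,000."""
    if balance >= 50_000 and age > 45:
        return "no"
    return "yes"

print(predict_default(60_000, 50))  # both conditions hold
print(predict_default(60_000, 45))  # age is not strictly greater than 45
```

Note the boundary behavior: a balance of exactly 50,000 satisfies the first condition, but an age of exactly 45 does not satisfy the second.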
A learner or inducer or algorithm
A method or algorithm used to generalize a model from a set of examples.
Learner: induces a pattern from examples
In practice, people use “model” and “learner” interchangeably, but they are different: the learner is the method, and the model is what it produces from the examples.
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’
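The distinction can be made concrete with a toy learner. In this sketch, `learn_balance_rule` is the learner: it takes examples and returns a model, which is itself a function. The single-threshold search and the training examples are assumptions chosen for brevity, not the induction algorithm used in the course.

```python
def learn_balance_rule(examples):
    """Toy learner: pick the balance threshold that misclassifies the fewest
    training examples, and return the resulting model (a function)."""
    def errors(threshold):
        # Number of examples the rule "balance >= threshold -> 'no'" gets wrong.
        return sum(
            ("no" if balance >= threshold else "yes") != default
            for balance, default in examples
        )

    candidates = sorted({balance for balance, _ in examples})
    best = min(candidates, key=errors)

    def model(balance):
        return "no" if balance >= best else "yes"

    model.threshold = best  # expose the learned threshold for inspection
    return model

# Hypothetical training examples: (balance, default).
training = [(10_000, "yes"), (30_000, "yes"), (55_000, "no"), (80_000, "no")]
model = learn_balance_rule(training)

print(model.threshold, model(60_000), model(20_000))
```

Running the same learner on different examples would yield a different model, which is why the two terms should not be confused.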
Supervised vs. Unsupervised Learning
Supervised learning (prediction): learns a model that predicts a target outcome from a set of other attributes, using training data in which the target value is known.
Stock price prediction (numerical target variable)
Credit card default (binary target variable)
Unsupervised learning (relationship mining): finds relationships in the data without reference to a target variable.
Beer and diapers
Key: is there a target that we are trying to predict?
Classification vs. Regression
The difference is the type of target variable:
Classification: categorical target variable
Is this customer “loyal” or “likely to terminate contract”?
Is a credit card use “legitimate” or “fraudulent”?
Regression: numerical target variable
How much is a customer going to spend?
What is the credit score of a customer?
Both are supervised learning!
Predictive DM/Modeling: the Philosophy
Data you already have
Data you will have
Build/Evaluate model
Model
Apply to new data
Model Evaluation
Supervised Learning
Ground truth: Yes
Evaluation: predictive performance
Unsupervised Learning
Ground truth: No
Evaluation: intelligibility
Model Training vs. Model Testing
After learning a model, can we estimate how well the model would perform on new data?
Solution: split data into two parts
Training data to learn the model.
Testing data to evaluate performance of learned model on “new” data.
Never ever use testing data to learn your model!
Why do we want to split data into two parts?
Data Splitting for Training and Testing
Training data
Testing data (Hold-out data)
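A minimal sketch of a random hold-out split. The 30% test fraction, the fixed seed, and the helper name `split_data` are assumptions for illustration; the slides do not specify a split ratio.

```python
import random

def split_data(examples, test_fraction=0.3, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for testing.
    Returns (training_data, testing_data)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

examples = list(range(10))  # stand-in for ten data examples
train, test = split_data(examples)

print(len(train), len(test))
```

The two parts are disjoint, so the testing data plays the role of "new" data that the learner has never seen.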
Model Training (on Training Data)
Learner: induces a pattern from examples
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’
Model Testing (on Testing/Hold-Out Data)
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’
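Testing then amounts to comparing the model's predictions with the known target values in the hold-out data. The sketch below uses accuracy as the performance measure and a few invented hold-out examples; both are assumptions, and other measures may suit a given problem better.

```python
def predict_default(balance, age):
    """The example rule, assuming 50K means 50,000."""
    return "no" if balance >= 50_000 and age > 45 else "yes"

# Hypothetical hold-out examples: (balance, age, actual default).
holdout = [
    (60_000, 50, "no"),
    (20_000, 30, "yes"),
    (70_000, 40, "no"),   # the rule predicts "yes" here: a mistake
    (55_000, 60, "no"),
]

def accuracy(model, data):
    """Fraction of examples whose predicted target equals the actual target."""
    correct = sum(model(balance, age) == actual for balance, age, actual in data)
    return correct / len(data)

holdout_accuracy = accuracy(predict_default, holdout)
print(holdout_accuracy)
```

Because none of these examples were used to build the rule, the result is an estimate of performance on new data.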
Process for Supervised/Unsupervised Learning
Supervised Learning
Unsupervised Learning
Evaluation
Evaluation
Data Mining Phase vs. Use Phase (Supervised)
Data mining phase: training and testing data have known values of the target attribute; data mining produces a model.
Use phase: the model makes predictions for new data, whose target attribute value is unknown.
In-Class Exercise
TelCo, a major telecommunications firm, wants to investigate its problem with customer attrition, or “churn”.
This is a saturated market; a large proportion of cell-phone customers leave when their contracts expire.
Q: Which customers should they target with a special offer, prior to contract expiration?
Try to come up with a data-driven solution to the problem. Use the concepts you learned today.
Lay out a step-by-step plan (high level).
Step-by-Step Plan