ISOM3360 Data Mining for Business Analytics, Session 2
Data Mining Basics
Instructor: Department of ISOM Spring 2022

What is Data Mining?
Data mining (knowledge discovery from data)
• Automatic extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from large amounts of data.
• Involves methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

Non-Trivial Data Mining Results
Beer and diapers are often bought together by customers.
Phoenix is not a good place for selling golf clubs, despite the many golf courses nearby.
People who buy small pads that adhere to the bottom of chair legs (to protect the floor) are more likely to be good credit risks.
Vegetarians tend to miss fewer flights.

Common Data Mining Tasks
Classification and class probability estimation
• Determine which class an individual belongs to
Regression
• Estimate the numerical value of some variable for an individual
Similarity matching
• Identify similar individuals based on known attributes
Clustering
• Group individuals together by their similarity
Co-occurrence grouping (frequent itemset mining)
• Find associations between entities based on transactions involving them
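
As a rough illustration of co-occurrence grouping, the sketch below counts how often each pair of items appears in the same basket. The baskets and the helper name `pair_support` are made up for this example; real frequent itemset mining uses more efficient algorithms on far larger data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; the items are illustrative only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair a canonical order, so (a, b) and (b, a) match
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(transactions)
most_frequent_pair = max(support, key=support.get)
```

Here `most_frequent_pair` is the pair that co-occurs in the largest share of baskets, which is the kind of association a retailer might act on.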

Commonly Used Induction Algorithms

Revisit: Customer Retention
Which customers should they target with a special offer, prior to contract expiration?

Example Data Mining Solutions
Decision tree technique
• If Education = ‘high’ and Gender = ‘male’, then customer is likely to churn.
Logistic regression technique
• Calculate the probability of churning given the features of a customer.
Nearest neighbor technique
• Calculate how similar a customer is to existing churning customers.
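
The three techniques above can be sketched side by side on one customer. Everything numeric here is an assumption for illustration: the feature names `months_left` and `calls_to_support`, the logistic weights, and the reference churner are invented, whereas in practice the weights and rules are learned from data.

```python
import math

# Hypothetical customer with made-up feature values (not from the course data).
customer = {"education": "high", "gender": "male",
            "months_left": 2, "calls_to_support": 5}

def tree_rule(c):
    """Decision-tree-style rule from the slide: True means likely to churn."""
    return c["education"] == "high" and c["gender"] == "male"

def logistic_probability(c, w_months=-0.3, w_calls=0.4, bias=-1.0):
    """Logistic regression: squash a weighted sum of features into (0, 1).
    The weights are invented here; normally they are learned from data."""
    z = bias + w_months * c["months_left"] + w_calls * c["calls_to_support"]
    return 1.0 / (1.0 + math.exp(-z))

def distance(a, b, keys=("months_left", "calls_to_support")):
    """Euclidean distance over numeric features; smaller means more similar."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

# Hypothetical customer who is already known to have churned.
known_churner = {"months_left": 1, "calls_to_support": 6}

likely_to_churn = tree_rule(customer)
churn_probability = logistic_probability(customer)
gap_to_churner = distance(customer, known_churner)
```

The rule gives a yes/no answer, the logistic function gives a probability, and the distance measures how close the customer is to a known churner.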

A Process View to Data Mining

Data Mining Basic Terminologies
Data, target variable, model
Supervised vs. unsupervised learning
Classification vs. regression
Training vs. testing
Mining phase vs. using phase

Example (Instance)
• A fact or a data point; described by a set of attributes (fields, columns, variables, features).
A data set:
• A set of examples
• A sample/subset of the universe
One example/instance

Can you name a few attributes of the following?
• A stock
• An apartment

Target Variable
A special variable that is the interest/target of the task.
Equivalent statistics terminology:
• Attributes: independent variables
• Target variable: dependent variable
Target variable

Types of Attributes/Variables (I)
Numerical variable (quantitative data)
• Discrete variable: has only a finite or countably infinite set of values (often integer variables)
  ◦ Example: the number of items bought by a customer (e.g., 12)
• Continuous variable: has real numbers as attribute values
  ◦ Example: the time that the customer spends (e.g., 16.49 min)

Types of Attributes/Variables (II)
Categorical variable (qualitative data)
• Ordinal variable: has categories that can be meaningfully ordered
  ◦ Example: course grade (A, B, C, D, …)
• Nominal variable: the categories have no meaningful order
  ◦ Example: location region ( , , , etc.)

In-Class Exercise
The size of a company (#employees) is a ____________ variable.
The average height of students taking this class is a ____________ variable.
The country that an individual lives in is a ____________ variable.
Education level (less than high school, high school, bachelor, master, doctoral) is a ____________ variable.

Unstructured Data
A commonly cited rule of thumb holds that somewhere around 80-90% of all potentially usable business information may originate in unstructured form (text, images, social networks, etc.).
• Product and hotel reviews
• Blogs, forums, and other social media
• Voice of the customer data
• Machine logs
• Web logs

Unstructured data: Text (0/1 Representation)
• Each row in the table represents a document.
• Each attribute (column) indicates whether or not a word appears in the document.
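
A minimal sketch of the 0/1 representation, assuming a tiny invented corpus and simple whitespace tokenization (real text pipelines also handle punctuation, casing, and stop words):

```python
# Hypothetical mini-corpus; the review texts are invented for illustration.
documents = [
    "great hotel great location",
    "terrible service",
    "great service",
]

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def to_binary_row(doc, vocab):
    """Return 1 for each vocabulary word that appears in the document, else 0."""
    words = set(doc.split())
    return [1 if word in words else 0 for word in vocab]

# One row per document, one column per vocabulary word.
table = [to_binary_row(doc, vocabulary) for doc in documents]
```

Note that "great" appears twice in the first document but is still recorded as 1: this representation tracks presence, not frequency.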

Unstructured data: Image
What are the features of an image?
One tongue, two eyes, one nose?
A computer represents an image as a grid of RGB pixels. A 640×480 image consists of 307,200 pixels. Each pixel is an RGB tuple of three values, each between 0 and 255; e.g., (255, 0, 0) is red.
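
The pixel arithmetic can be checked with a small sketch that builds an all-red image as nested lists. The nested-list layout is an assumption chosen for readability; image libraries typically use compact arrays instead.

```python
WIDTH, HEIGHT = 640, 480

# An image as a HEIGHT x WIDTH grid of (R, G, B) tuples, all red here.
RED = (255, 0, 0)
image = [[RED for _ in range(WIDTH)] for _ in range(HEIGHT)]

num_pixels = sum(len(row) for row in image)  # counts every pixel once
num_values = num_pixels * 3                  # three channel values per pixel
```

So even a small image yields hundreds of thousands of raw attributes, far more than a typical business data table.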

Unstructured data: Network
Your network properties, such as your neighbors, your neighbors’ neighbors, and your “centrality”.
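
As a sketch of how such network properties become features, the code below derives neighbors' neighbors and one simple notion of centrality (degree centrality) from a small invented friendship network. The names and the choice of degree centrality are assumptions; other centrality measures exist.

```python
# Hypothetical undirected friendship network as an adjacency dict.
network = {
    "you":  {"ann", "bob"},
    "ann":  {"you", "cara"},
    "bob":  {"you", "cara", "dan"},
    "cara": {"ann", "bob"},
    "dan":  {"bob"},
}

def neighbors_of_neighbors(graph, node):
    """Nodes exactly two steps away: reached through a neighbor,
    excluding the node itself and its direct neighbors."""
    direct = graph[node]
    two_step = set()
    for neighbor in direct:
        two_step |= graph[neighbor]
    return two_step - direct - {node}

def degree_centrality(graph, node):
    """Share of the other nodes that this node is directly connected to."""
    return len(graph[node]) / (len(graph) - 1)

second_circle = neighbors_of_neighbors(network, "you")
your_centrality = degree_centrality(network, "you")
```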

Data Mining Process Revisit

A model is:
• A pattern.
• A summarization of relationships in the data.
• A simplified representation of reality created to serve a specific purpose.
Some examples
• IF Balance >= 50K AND Age > 45 THEN Default = ‘no’ ELSE Default = ‘yes’
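
Written as code, the example rule is just a function from a customer's attributes to a predicted class. The function name `predict_default` is chosen for this sketch, and 50K is read as 50,000.

```python
def predict_default(balance, age):
    """The slide's rule as a model: attributes in, predicted class out."""
    if balance >= 50_000 and age > 45:
        return "no"
    return "yes"
```

For instance, `predict_default(60_000, 50)` falls in the IF branch, while a customer who fails either condition falls in the ELSE branch.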

A learner (or inducer, or algorithm)
• A method or algorithm used to generalize a model from a set of examples.
Learner: induces a pattern from examples
In practice, people use “model” and “learner” interchangeably, but they are different: the learner is the algorithm, and the model is the pattern it produces.
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’
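
To make the learner/model distinction concrete, here is a deliberately tiny learner that induces a one-attribute rule from examples. It is a toy sketch, not one of the course's algorithms: the training examples are invented, only Balance is used, and the search simply tries each observed balance as a cutoff.

```python
# Hypothetical training examples: (balance, default label). Values are made up.
examples = [
    (10_000, "yes"), (20_000, "yes"), (35_000, "yes"),
    (50_000, "no"), (65_000, "no"), (80_000, "no"),
]

def learn_threshold_rule(data):
    """Learner: try each observed balance as a cutoff and keep the one
    that misclassifies the fewest training examples."""
    best_cutoff, best_errors = None, len(data) + 1
    for cutoff, _ in data:
        errors = sum(
            ("no" if balance >= cutoff else "yes") != label
            for balance, label in data
        )
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff

def make_model(cutoff):
    """Model: the induced rule, ready to apply to any balance."""
    return lambda balance: "no" if balance >= cutoff else "yes"

cutoff = learn_threshold_rule(examples)  # the learner runs once, on examples
model = make_model(cutoff)               # the model is what it leaves behind
```

The learner is the search procedure; the model is the resulting rule, which can be reused without rerunning the search.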

Supervised vs. Unsupervised Learning
Supervised learning (prediction): learns a model that predicts a target outcome from a set of other attributes, using training data where the target value is known.
• Stock price prediction (numerical target variable)
• Credit card default (binary target variable)
Unsupervised learning (relationship mining): finds relationships in the data without reference to a target variable.
• Beer and diapers
Key: is there a target that we are trying to predict?

Classification vs. Regression
The difference is the type of target variable:
• Classification: categorical target variable
  ◦ Is this customer “loyal” or “likely to terminate contract”?
  ◦ Is a credit card use “legitimate” or “fraudulent”?
• Regression: numerical target variable
  ◦ How much is a customer going to spend?
  ◦ What is the credit score of a customer?
Both are supervised learning!

Predictive DM/Modeling: the Philosophy
• Data you already have: build and evaluate the model.
• Data you will have: apply the model to new data.

Model Evaluation
Supervised Learning
• Ground truth: Yes
• Evaluation: predictive performance
Unsupervised Learning
• Ground truth: No
• Evaluation: intelligibility

Model Training vs. Model Testing
• After learning a model, how can we estimate how well it would perform on new data?
Solution: split data into two parts
• Training data to learn the model.
• Testing data to evaluate the performance of the learned model on “new” data.
• Never ever use testing data to learn your model!
Why do we want to split data into two parts?

Data Splitting for Training and Testing
Training data
Testing data (Hold-out data)
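
A minimal sketch of a hold-out split, assuming a random 70/30 division (the slides do not specify a ratio) and a hypothetical helper named `train_test_split`:

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples, then hold out the last portion for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    split_at = len(shuffled) - n_test
    return shuffled[:split_at], shuffled[split_at:]

# Hypothetical data set of 10 examples, identified by their index.
data = list(range(10))
training_data, testing_data = train_test_split(data)
```

Every example lands in exactly one of the two parts, so nothing used for learning is reused for evaluation.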

Model Training (on Training Data)
Learner: induces a pattern from examples
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’

Model Testing (on Testing/Hold-Out Data)
IF Balance >= 50K AND Age > 45 THEN Default = ‘no’
ELSE Default = ‘yes’
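
Testing then amounts to applying the learned rule to hold-out examples and comparing predictions with the known outcomes. The hold-out rows below are invented, and accuracy is used as one simple measure of predictive performance.

```python
def predict_default(balance, age):
    """The rule learned on the training data."""
    return "no" if balance >= 50_000 and age > 45 else "yes"

# Hypothetical hold-out examples: (balance, age, actual default). Values are made up.
testing_data = [
    (72_000, 51, "no"),
    (55_000, 40, "yes"),
    (30_000, 60, "yes"),
    (90_000, 48, "yes"),  # the rule predicts "no" here, so this one is an error
    (12_000, 29, "yes"),
]

correct = sum(
    predict_default(balance, age) == actual
    for balance, age, actual in testing_data
)
accuracy = correct / len(testing_data)
```

Because these examples played no part in learning the rule, `accuracy` is an estimate of how the model would do on genuinely new customers.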

Process for Supervised/Unsupervised Learning
(Figure: process diagrams for supervised learning and unsupervised learning, each including an evaluation step.)

Data Mining Phase vs. Use Phase (Supervised)
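Data mining phase
• Training and testing data have known values of the target attribute.
• Data mining on these data produces the model.
Use phase
• New data has an unknown value of the target attribute.
• The model is applied to the new data to produce a prediction.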

In-Class Exercise
TelCo, a major telecommunications firm, wants to investigate its problem with customer attrition, or “churn”.
This is a saturated market; a large proportion of cell-phone customers leave when their contracts expire.
Q: Which customers should they target with a special offer, prior to contract expiration?
• Try to come up with a data-driven solution to the problem.
• Use the concepts you learned today.
• Lay out a step-by-step plan (high level).

Step-by-Step Plan
