ISOM3360 Data Mining for Business Analytics, Session 3
Data Understanding and Preparation
Instructor: Department of ISOM Spring 2022

A Few Questions to Ask Yourself
What is data mining?
What are variables/target variable?
What is supervised vs. unsupervised learning? What is classification vs. regression?
What is mining phase vs. using phase?
What is the general data mining process?

A Process View to Data Mining

In-Class Exercise
TelCo, a major telecommunications firm, wants to investigate its problem with customer attrition, or “churn”.
This is a saturated market; a large proportion of cell-phone customers leave when their contracts expire.
Q: Which customers should they target with a special offer, prior to contract expiration?
• Try to come up with a data-driven solution to the problem.
• Use the concepts you learned today.
• Lay out a step-by-step plan (high level).

Step-by-Step Plan
• Target variable
• Supervised learning
• Classification
• Customer variables
• Data
• Prediction
• Evaluation

A Process View to Data Mining

Where are the data from?
• Company sales database
• Company customer database
• Survey data
• Government public data (e.g., https://data.gov.hk/en/)
• Third-party data provider
• Social media data

Example: Zillow is the most popular online real estate information site in the US.
https://www.zillow.com/
Zestimates: a tool for estimating home value.

To Describe the Dataset
What do your records represent?
What does each attribute mean?
What type is each attribute?
• Categorical (e.g., nominal, ordinal)
• Numerical (e.g., discrete, continuous)
• Text

Boston Housing Dataset
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Data Understanding: Explore the Dataset
A preliminary investigation of the data helps you understand its specific characteristics and select appropriate data mining algorithms.
Things to look at
• Class imbalance
• Dispersion of data attribute values
• Skewness, outliers, missing values
• Correlation analysis
Visualization tools are important
• Histograms, box plots
• Scatter plots

Class Balance
Many datasets have a discrete (binary) class attribute
• What is the frequency of each class?
• Is there a considerably less frequent class?
Sometimes, classes have very unequal frequencies
• Medical diagnosis: 90% healthy, 10% disease
• Online purchase: 99% don’t buy, 1% buy
• Fraud detection: 99.9% of transactions are not fraudulent
Data mining algorithms may give poor results due to the class imbalance problem
• Identify the problem in an initial phase
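
Checking class frequencies early takes only a few lines. The sketch below is one possible way to do it in plain Python; the label names and the 990/10 split are hypothetical values chosen to mirror the "99% don't buy, 1% buy" example, not data from the course.

```python
from collections import Counter

def class_distribution(labels):
    """Return {class: (count, proportion)} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: (n, n / total) for cls, n in counts.items()}

# Hypothetical labels: 990 non-buyers and 10 buyers.
labels = ["no_buy"] * 990 + ["buy"] * 10
dist = class_distribution(labels)
for cls, (n, share) in dist.items():
    print(f"{cls}: {n} records ({share:.1%})")
```

A very small share for one class is a signal to keep imbalance in mind when choosing and evaluating a model.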

Bar Charts
A bar chart presents categorical data using rectangular bars whose heights are proportional to the values they represent.
[Figure: bar chart with the categories “Didn’t survive” and “Survived”]

Useful Statistics
Discrete attributes
• Frequency of each value (bar charts)
• Mode = value with highest frequency
Continuous attributes
• Range of values, i.e., min and max
• Mean (average)
• Skewed distribution

Five-Number Summary
A set of descriptive statistics that provides information about a dataset: (min, Q1, Q2, Q3, max)
• Minimum (min), lower quartile (Q1), median (Q2), upper quartile (Q3), maximum (max)
Attribute values: 6 47 49 15 42 41 7 39 43 40 36
Sorted: 6 7 15 36 39 40 41 42 43 47 49
What is the five-number summary of this dataset?
min = 6, Q1 = 15, Q2 = 40, Q3 = 43, max = 49
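
The exercise can be checked with a short sketch. It assumes the "median of each half" convention for quartiles, with the overall median excluded from both halves when the number of values is odd. Other tools use different quartile conventions and may report slightly different Q1 and Q3 for the same data. The function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    excluding the overall median when the number of values is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])

values = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]
print(five_number_summary(values))  # (6, 15, 40, 43, 49)
```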

A box plot can provide useful information about an attribute
• Sample’s range
• Normality of the distribution
• Skew (asymmetry) of the distribution
• Plot extreme cases within the sample

Histograms
A histogram is an estimate of the probability distribution of a continuous variable.
• “Bin” the range of values (i.e., divide the entire range of values into a series of intervals) and count how many values fall into each interval
• The bins (intervals) must be adjacent, and are often of equal size.
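
The bin-and-count idea can be sketched without a plotting library. The example below assumes equal-width, left-closed bins (with the maximum value placed in the last bin) and at least two distinct values; the function name, the sample values, and the choice of 4 bins are illustrative only.

```python
def histogram_counts(values, n_bins):
    """Split [min, max] into n_bins equal-width bins and count values per bin.

    Assumption: bins are left-closed, and the maximum goes into the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts

values = [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]
edges, counts = histogram_counts(values, n_bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    # One "#" per value gives a rough text histogram.
    print(f"[{left:6.2f}, {right:6.2f}) {'#' * count}")
```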

Outliers are values that lie far away from the bulk of the data.
• E.g., anything over 3 standard deviations away from the mean
• Outliers can be legitimate instances or values
Some algorithms may produce poor results in the presence of outliers (need to identify and remove them)
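
The 3-standard-deviation rule of thumb can be sketched as follows. The population standard deviation is assumed here, and the sample values are made up for illustration. With very few records no value can be 3 standard deviations from the mean, so the rule only makes sense on reasonably sized samples.

```python
from statistics import mean, pstdev

def three_sigma_outliers(values, k=3.0):
    """Return values more than k standard deviations away from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) > k * sigma]

# Hypothetical data: twenty values near 50 and one extreme value.
values = [48, 52, 50, 49, 51] * 4 + [500]
print(three_sigma_outliers(values))  # [500]
```

Whether a flagged value should be removed is a separate decision, since an outlier can be a legitimate record.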

Correlation Analysis (Numerical Data)
Correlation is a measure that captures the statistical relationship between two variables.
Pearson’s correlation coefficient
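
For reference, a minimal sketch of Pearson's correlation coefficient using its standard definition (covariance divided by the product of the standard deviations). The function name and sample data are illustrative, and both inputs are assumed to be equal-length, non-constant lists.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))   # perfectly increasing: 1.0
print(pearson_r(xs, [10, 8, 6, 4, 2]))   # perfectly decreasing: -1.0
```

The coefficient ranges from -1 to 1 and captures only linear relationships.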

Scatter Plot
A scatter plot is a two-dimensional data visualization that shows the correlation between two variables.
Which one shows positive correlation, negative correlation, and no correlation?

Data Preprocessing
Data is often collected for unspecified applications
• Data may have quality problems that need to be addressed before applying a data mining technique
  ◦ Noise and outliers
  ◦ Missing values
• Preprocessing may be needed to make data more suitable for data mining.
“If you want to find gold dust, move the rocks out of the way first!”

Data Preprocessing
Data transformation might be needed
• Handling missing values
• Handling categorical variables
• Feature transformation
  ◦ e.g., log transformation
  ◦ Normalization (back to this when clustering is discussed)
• Feature discretization
• Feature selection

Missing Values
Data is not always available
Missing data may be due to various reasons
• Data not entered due to misunderstanding
• Certain data may not be considered important at the time of entry
• Deleted accidentally
Missing data may carry some information content
• A credit application may carry information by noting which field the applicant did not complete

How to Handle Missing Values: Remove
Remove data instances that have missing values (may affect a lot of records)
• Okay if no more than 5% of the records are affected
Remove attributes with missing values (may leave out important features)

How to Handle Missing Values: Infer
Use a global constant to fill in the missing value
• e.g., “unknown”. (May create a new class!)
Use the average (or most frequent) value to fill in the missing value
Use the attribute mean (or most frequent value) for all samples belonging to the same class to fill in the missing value
Other more sophisticated (data mining) methods
• e.g., find the k nearest neighbors of the point and fill in their average (or most frequent) value
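
Two of the simpler strategies above, the overall mean and the per-class mean, can be sketched in plain Python. Missing values are represented as `None`; the function names, the `income` values, and the `churn` labels are hypothetical. The per-class version assumes every class has at least one observed value.

```python
from statistics import mean

def impute_mean(values):
    """Fill missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_class_mean(values, classes):
    """Fill missing entries with the mean of observed values in the same class."""
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]

# Hypothetical records: income in thousands, with a churn label per customer.
income = [30, None, 50, 90, None, 110]
churn = ["yes", "yes", "yes", "no", "no", "no"]
print(impute_mean(income))               # [30, 70, 50, 90, 70, 110]
print(impute_class_mean(income, churn))  # [30, 40, 50, 90, 100, 110]
```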

Missing Values
No matter what technique you use to overcome the problem, it comes at a price. The more guessing you have to do, the further the database drifts from the real data. This, in turn, can affect the accuracy and validity of the mining results.

Handling Categorical Variables
Categorical variable
• Size: small, medium, large
• Industry: Finance, IT, Marketing, etc.
Some data mining algorithms support categorical values without further manipulation, but many more do not.

Handling Categorical Variables
Automobile dataset [link]
Some variables in the dataset are categorical

One-Hot-Encoder
Converts each category value into a new (dummy) column and assigns a 1 or 0 value to the column.
• e.g., drive_wheels: rwd, fwd, 4wd
Any disadvantage of using this strategy?
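
A minimal sketch of one-hot encoding for the drive_wheels example. The function name and the four sample records are illustrative; each record becomes a dictionary with one 0/1 column per category.

```python
def one_hot_encode(values, prefix):
    """Turn a list of category values into rows of 0/1 dummy columns."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{cat}": int(v == cat) for cat in categories}
        for v in values
    ]

drive_wheels = ["rwd", "fwd", "4wd", "fwd"]  # hypothetical records
for row in one_hot_encode(drive_wheels, "drive_wheels"):
    print(row)
```

Because every distinct category adds a column, an attribute with many categories produces a wide, mostly-zero table. That is one point to consider when answering the question above.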

Feature Transformation: Taking Log
A function is applied to each value of an attribute
• e.g., use a log transformation to turn data with a highly skewed distribution into data that are less skewed
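
The effect of taking logs can be illustrated on a small, made-up income list with one very large value. The sketch assumes strictly positive values (the logarithm is undefined otherwise) and uses base 10; it compares the mean-to-median ratio before and after as a rough indicator of skew.

```python
import math
from statistics import mean, median

def log_transform(values):
    """Apply log10 to each value; assumes all values are strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log transformation needs strictly positive values")
    return [math.log10(v) for v in values]

# Hypothetical incomes with a long right tail.
incomes = [20_000, 25_000, 30_000, 35_000, 40_000, 1_000_000]
log_incomes = log_transform(incomes)

# A mean far above the median suggests right skew.
print(mean(incomes) / median(incomes))
print(mean(log_incomes) / median(log_incomes))
```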

Feature Normalization
Normalization (standardization) helps prevent attributes with large ranges from outweighing attributes with small ranges. It brings all variables to the same scale.
Difference in one coordinate (in this example, weight) is insignificant compared to a change in the other coordinate (height).

Feature Normalization: min-max
Rescaling (min-max normalization): $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
Note: apply only to non-target variables.
Suppose that the minimum and maximum values for the income variable are $12,000 and $98,000, respectively. By min-max normalization, a value of $73,600 for income is transformed to?
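
The exercise can be checked with a short sketch, assuming the usual min-max rescaling to the range [0, 1], i.e. (x - min) / (max - min). The function name is illustrative.

```python
def min_max_normalize(x, x_min, x_max):
    """Rescale x to [0, 1] given the attribute's minimum and maximum."""
    return (x - x_min) / (x_max - x_min)

scaled = min_max_normalize(73_600, 12_000, 98_000)
print(round(scaled, 3))  # 61,600 / 86,000, about 0.716
```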

Feature Normalization: z-score
Standardization: $z = \frac{x - \mu}{\sigma}$
• where $\mu$ is the mean of feature values, and $\sigma$ is the standard deviation of feature values.
Suppose that the mean and standard deviation of the values for the income variable are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to?
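
A quick check of the exercise, assuming the standard z-score definition (value minus mean, divided by standard deviation). The function name is illustrative.

```python
def z_score(x, mu, sigma):
    """Standardize x given the attribute's mean and standard deviation."""
    return (x - mu) / sigma

z = z_score(73_600, 54_000, 16_000)
print(z)  # 19,600 / 16,000 = 1.225
```

A value of 1.225 means the income lies 1.225 standard deviations above the mean.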

Do you apply normalization on training or testing set?
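
The slides pose this as a question without giving the answer. A common practice, sketched here as an assumption rather than as the course's answer, is to estimate the normalization parameters on the training set only and then apply those same parameters to both the training and testing sets, so that no information from the test data leaks into preprocessing. All names and values are hypothetical.

```python
from statistics import mean, pstdev

def fit_standardizer(train_values):
    """Estimate mean and standard deviation from the training data only."""
    return mean(train_values), pstdev(train_values)

def apply_standardizer(values, mu, sigma):
    """Standardize values using previously fitted parameters."""
    return [(v - mu) / sigma for v in values]

train = [40_000, 50_000, 60_000, 70_000]  # hypothetical training incomes
test = [45_000, 90_000]                   # hypothetical testing incomes

mu, sigma = fit_standardizer(train)
train_z = apply_standardizer(train, mu, sigma)
test_z = apply_standardizer(test, mu, sigma)  # reuses the training parameters
print(train_z)
print(test_z)
```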

Feature Discretization
To transform a continuous attribute into a discrete attribute
• Some data mining algorithms only work with discrete attributes (without extension)
  ◦ e.g., decision trees, naïve Bayes
• Better results may be obtained with discretized attributes

Feature Discretization: Binning
Unsupervised discretization
• Equal-interval binning
  ◦ Divide attribute values into N intervals of equal size
  ◦ The width of intervals: (max - min)/N
  ◦ The most straightforward way, but outliers may dominate the presentation, and it cannot handle skewed data well.
Attribute values: {0, 4, 12, 16, 18, 24, 26, 28}. Can you put these values into 3 bins using equal-interval binning?
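
One way to work through the exercise is sketched below. It assumes left-closed intervals, with the maximum value placed in the last bin; a different boundary convention could move values that fall exactly on a bin edge. The function name is illustrative.

```python
def equal_interval_bins(values, n_bins):
    """Assign values to n_bins intervals of width (max - min) / n_bins.

    Assumption: intervals are left-closed; the maximum goes into the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins

values = [0, 4, 12, 16, 18, 24, 26, 28]
width, bins = equal_interval_bins(values, n_bins=3)
print(round(width, 2))  # (28 - 0) / 3, about 9.33
print(bins)             # [[0, 4], [12, 16, 18], [24, 26, 28]]
```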

Feature Selection
• Many data mining algorithms work better if the number of attributes is lower
• Simplified model is easier to interpret.
• Reduce storage requirement and training time.
• Reduce overfitting and achieve better generalization performance.
Techniques:
• To be discussed in Feature Selection (Week 8).
