ISOM3360 Data Mining for Business Analytics, Session 3
Data Understanding and Preparation
Instructor: Department of ISOM, Spring 2022
A Few Questions to Ask Yourself
What is data mining?
What are variables/target variable?
What is supervised vs. unsupervised learning? What is classification vs. regression?
What is mining phase vs. using phase?
What is the general data mining process?
A Process View to Data Mining
In-Class Exercise
TelCo, a major telecommunications firm, wants to investigate its problem with customer attrition, or “churn”
This is a saturated market; a large proportion of cell-phone customers leave when their contracts expire.
Q: Which customers should they target with a special offer, prior to contract expiration?
Try to come up with a data-driven solution to the problem. Use the concepts you learned today.
Lay out a step-by-step plan (high level).
Step-by-Step Plan
Target variable: customer churn
Supervised learning: classification
Data: customer variables (attributes)
Prediction for current customers
Testing and evaluation
A Process View to Data Mining
Where are the data from?
Company sales database
Company customer database
Survey data
Government public data
Third-party data providers
Social media data
https://data.gov.hk/en/
Example: Zillow is the most popular online real estate information site in the US.
https://www.zillow.com/
Zestimates: a tool for estimating home value.
To Describe the Dataset
What do your records represent?
What does each attribute mean?
What type of attributes?
Categorical (e.g., nominal, ordinal)
Numerical (e.g., discrete, continuous)
Text
Boston Housing Dataset
https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html
Data Understanding: Explore the Dataset
Preliminary investigation of the data to better understand its specific characteristics; this helps in selecting appropriate data mining algorithms.
Things to look at
Class imbalance
Dispersion of data attribute values
Skewness, outliers, missing values
Correlation analysis
Visualization tools are important: histograms, box plots, scatter plots
Class Balance
Many datasets have a discrete (binary) class attribute.
What is the frequency of each class?
Is there a considerably less frequent class?
Sometimes, classes have very unequal frequencies:
Medical diagnosis: 90% healthy, 10% disease
Online purchase: 99% don't buy, 1% buy
Fraud detection: 99.9% of transactions are not fraudulent
Data mining algorithms may give poor results due to the class imbalance problem
Identify this problem at an early stage (see the sketch below)
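A quick way to check class frequencies, sketched here with pandas on a hypothetical churn label (the column values below are made up for illustration):

```python
import pandas as pd

# Hypothetical churn labels; in practice, read these from your dataset
churn = pd.Series(["no", "no", "no", "yes", "no", "no", "no", "no", "yes", "no"])

print(churn.value_counts())                # absolute count of each class
print(churn.value_counts(normalize=True))  # relative frequency of each class
```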
Bar Charts
A bar chart presents categorical data with rectangular bars with heights proportional to the values that they represent.
[Bar chart example: category counts for "Didn't survive" vs. "Survived"]
Useful Statistics
Discrete attributes
Frequency of each value (bar charts)
Mode = value with highest frequency
Continuous attributes
Range of values, i.e., min and max
Mean (average)
Skewed distribution
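As a sketch, these summary statistics can be computed with pandas (the toy columns below are illustrative, not from a course dataset):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["male", "female", "male", "male"],   # discrete attribute
                   "age": [22, 38, 26, 35]})                     # continuous attribute

print(df["sex"].value_counts())          # frequency of each value
print(df["sex"].mode()[0])               # mode = most frequent value
print(df["age"].min(), df["age"].max())  # range: min and max
print(df["age"].mean())                  # mean (average)
print(df.describe())                     # summary of all numeric columns
```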
Five-Number Summary
A set of descriptive statistics that provide information about a dataset: (min, Q1, Q2, Q3, max)
Minimum (min), lower quartile (Q1), median (Q2), upper quartile (Q3), maximum (max)
Attribute values: 6 47 49 15 42 41 7 39 43 40 36
Sorted: 6 7 15 36 39 40 41 42 43 47 49
What is the five-number summary of this dataset?
min = 6, Q1 = 15, Q2 = 40, Q3 = 43, max = 49
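A minimal Python sketch that reproduces this answer, using the median-of-halves convention implied by the slide (other quartile conventions, e.g. NumPy's default interpolation, can give slightly different Q1/Q3):

```python
import statistics

values = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]
s = sorted(values)                              # [6, 7, 15, 36, 39, 40, 41, 42, 43, 47, 49]

q2 = statistics.median(s)                       # median of all values -> 40
q1 = statistics.median(s[:len(s) // 2])         # median of the lower half -> 15
q3 = statistics.median(s[(len(s) + 1) // 2:])   # median of the upper half -> 43

print(min(s), q1, q2, q3, max(s))               # 6 15 40 43 49
```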
Box Plots
A box plot can provide useful information about an attribute:
Sample’s range
Normality of the distribution
Skew (asymmetry) of the distribution
Extreme cases within the sample
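A minimal matplotlib sketch, reusing the attribute values from the five-number-summary example above:

```python
import matplotlib.pyplot as plt

values = [6, 47, 49, 15, 42, 41, 7, 39, 43, 40, 36]

# The box spans Q1 to Q3, the line inside is the median,
# and the whiskers/points mark the extreme values
plt.boxplot(values)
plt.ylabel("attribute value")
plt.show()
```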
Histograms
A histogram is an estimate of the probability distribution of a continuous variable.
“Bin” the range of values (i.e., divide the entire range of values into a series of intervals) and count how many values fall into each interval
The bins (intervals) must be adjacent, and are often of equal size.
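A small sketch of binning and plotting a histogram with NumPy/matplotlib (the income data is synthetic, generated only for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=0.5, size=1000)   # synthetic, right-skewed data

counts, bin_edges = np.histogram(income, bins=20)       # 20 adjacent, equal-width bins
plt.hist(income, bins=20)
plt.xlabel("income")
plt.ylabel("count")
plt.show()
```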
Outliers
Outliers are values that lie far away from the bulk of the data.
E.g., anything over 3 standard deviations away from the mean
Outliers can be legitimate instances or values
Some algorithms may produce poor results in the presence of outliers (need to identify and remove them)
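A hedged sketch of the "3 standard deviations" rule mentioned above; the data here is synthetic, and the threshold of 3 is just the rule of thumb from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(loc=50, scale=5, size=200), 500.0)  # 200 typical values + one extreme value

z = (x - x.mean()) / x.std()      # standardize each value
outliers = x[np.abs(z) > 3]       # flag anything > 3 standard deviations from the mean
print(outliers)                   # should contain 500.0
```

Whether a flagged value should actually be removed still requires judgment, since outliers can be legitimate instances.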
Correlation Analysis (Numerical Data)
Correlation is a measure that captures the statistical relationship between two variables.
Pearson’s correlation coefficient
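For reference, Pearson's correlation coefficient between two attributes x and y over n records is

```latex
r_{xy} \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
                  {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

It ranges from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect positive linear relationship). In pandas, df.corr() computes these pairwise coefficients for all numeric columns.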
Scatter Plot
A scatter plot is a two-dimensional data visualization that shows the correlation between two variables.
Which one shows positive correlation, negative correlation, and no correlation?
Data Preprocessing
Data is often collected for unspecified applications
Data may have quality problems that need to be addressed before applying a data mining technique
Noise and outliers
Missing values
Preprocessing may be needed to make data more suitable for data mining.
“If you want to find gold dust, move the rocks out of the way first!”
Data Preprocessing
Data transformation might be needed
Handling missing values
Handling categorical variables
Feature transformation
e.g., log transformation
Normalization (back to this when clustering is discussed)
Feature discretization
Feature selection
Missing Values
Data is not always available
Missing data may be due to various reasons
Data not entered due to misunderstanding
Certain data may not be considered important at the time of entry
Deleted accidentally
Missing data may carry some information content
A credit application may carry information by noting which field the applicant did not complete
How to Handle Missing Values: Remove
Remove data instances that have missing values (may affect a lot of records)
Okay if not more than 5% of the records are affected
Remove attributes with missing values (may leave out important features)
How to Handle Missing Values: Infer
Use a global constant to fill in the missing value, e.g., "unknown" (may create a new class!)
Use the average (or most frequent) value to fill in the missing value
Use the attribute mean (or most frequent value) for all samples belonging to the same class to fill in the missing value
Other more sophisticated (data mining) methods
e.g., finding the k nearest neighbors to the point and filling in the average (or most frequent) value
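As an illustrative sketch (not the course's prescribed tool), scikit-learn provides imputers for the mean-fill and nearest-neighbor strategies above; the toy DataFrame is made up:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

df = pd.DataFrame({"income": [50_000, np.nan, 62_000, 58_000],
                   "age":    [34, 29, np.nan, 41]})

# Fill each missing entry with the column mean
mean_filled = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(df),
                           columns=df.columns)

# Fill each missing entry from the k nearest neighbors (k = 2 here)
knn_filled = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                          columns=df.columns)
print(mean_filled)
print(knn_filled)
```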
Missing Values
No matter what technique you use to handle the problem, it comes at a price. The more guessing you have to do, the further away from the real data the dataset becomes. This, in turn, can affect the accuracy and validity of the mining results.
Handling Categorical Variables
Categorical variable
Size: small, medium, large
Industry: Finance, IT, Marketing, etc…
Some data mining algorithms can support categorical values without further manipulation, but many other algorithms cannot.
Handling Categorical Variables
Automobile dataset [link]
Some variables in the dataset are categorical
One-Hot-Encoder
Convert each category value into a new (dummy) column and assign a 1 or 0 value to the column.
e.g., drive_wheels: rwd, fwd, 4wd
Any disadvantage of using this strategy?
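A minimal pandas sketch of one-hot encoding for the drive_wheels example (pd.get_dummies is one common implementation; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

df = pd.DataFrame({"drive_wheels": ["rwd", "fwd", "4wd", "fwd"]})

# One new 0/1 column per category value
dummies = pd.get_dummies(df, columns=["drive_wheels"], dtype=int)
print(dummies)
#    drive_wheels_4wd  drive_wheels_fwd  drive_wheels_rwd
# 0                 0                 0                 1
# 1                 0                 1                 0
# ...
```

One commonly cited disadvantage: an attribute with many distinct categories creates many new columns, inflating the dimensionality of the dataset.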
Feature Transformation: Taking Log
A function is applied to each value of an attribute
E.g., use the log function to transform data that has a highly skewed distribution into data that is less skewed.
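A short NumPy sketch of the log transformation (log1p, i.e., log(1 + x), is used here as a convenient variant that also handles zero values; plain np.log works for strictly positive data; the numbers are arbitrary):

```python
import numpy as np

skewed = np.array([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])   # highly skewed values

less_skewed = np.log1p(skewed)   # log(1 + x), applied element-wise
print(less_skewed)
```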
Feature Normalization
Normalization (standardization) helps to prevent attributes with large ranges from outweighing attributes with small ranges. It brings all variables to the same scale.
Difference in one coordinate (in this example, weight) is insignificant compared to a change in the other coordinate (height).
Feature Normalization: min-max
Only apply to non-target variables.
Rescaling (min-max normalization): x' = (x - min) / (max - min)
Suppose that the minimum and maximum values for the income variable are $12,000 and $98,000, respectively. By min-max normalization, a value of $73,600 for income is transformed to?
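A worked check using the min-max formula above:

```latex
x' = \frac{x - \min}{\max - \min}
   = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}
   = \frac{61{,}600}{86{,}000} \approx 0.716
```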
Feature Normalization: z-score
Standardization (z-score normalization): z = (x - μ) / σ
where μ is the mean of the feature values and σ is the standard deviation of the feature values.
Suppose that the mean and standard deviation of the values for the income variable are $54,000 and $16,000, respectively. With z-score normalization, a value of $73,600 for income is transformed to?
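A worked check using the z-score formula above:

```latex
z = \frac{x - \mu}{\sigma}
  = \frac{73{,}600 - 54{,}000}{16{,}000}
  = \frac{19{,}600}{16{,}000} = 1.225
```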
Do you apply normalization on training or testing set?
Feature Discretization
To transform a continuous attribute into a discrete attribute
Some data mining algorithms only work with discrete attributes (without extension)
e.g., decision trees, naïve Bayes
Better results may be obtained with discretized attributes
Feature Discretization: Binning
Unsupervised discretization
Equal-interval binning
Divide attribute values into N intervals of equal size
The width of intervals: (max-min)/N
The most straightforward way, but outliers may dominate the presentation, and it cannot handle skewed data well.
Attribute values: {0, 4, 12, 16, 18, 24, 26, 28}. Can you put these values into 3 bins using equal-interval binning?
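A quick check of the exercise with pandas (pd.cut performs equal-width binning; the expected grouping follows the (max - min)/N = 28/3 ≈ 9.33 interval width):

```python
import pandas as pd

values = [0, 4, 12, 16, 18, 24, 26, 28]

binned = pd.cut(values, bins=3)   # 3 equal-width intervals, width ≈ 9.33
print(pd.Series(binned).value_counts().sort_index())
# Expected bins: {0, 4}, {12, 16, 18}, {24, 26, 28}
```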
Feature Selection
Many data mining algorithms work better if the number of attributes is lower
Simplified model is easier to interpret.
Reduce storage requirement and training time.
Reduce overfitting and achieve better generalization performance.
Techniques:
To be discussed in Feature Selection (Week 8).