Data Quality Report –
Case Study
Fundamentals of Machine Learning for Predictive Data Analytics,
John D. Kelleher, Brian Mac Namee and Aoife D’Arcy
Chapter 3
Case Study: Analytical Base Table
Case study: Data Quality Reports
Analysis – Missing values
NUM SOFT TISSUE – 2%
Possible solution: leave it as it is, document it in the data quality plan
MARITAL STATUS – 61.2%
Possible solution – remove the feature
Decision → remove both MARITAL STATUS and INCOME from ABT
Examining the raw data, we notice that in instances where MARITAL STATUS is missing, INCOME has the value 0
The histogram for INCOME shows an unusual pattern: a large number of 0s
Discussion with the business revealed that MARITAL STATUS and INCOME were collected together, and INCOME 0 represents a missing value
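This kind of sentinel recoding can be sketched with pandas (a minimal sketch on a hypothetical mini-ABT; the values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-ABT illustrating the pattern: wherever MARITAL STATUS
# is missing, INCOME was recorded as 0 (a sentinel, not a real income).
abt = pd.DataFrame({
    "MARITAL STATUS": ["married", None, "single", None],
    "INCOME": [52000.0, 0.0, 38000.0, 0.0],
    "NUM CLAIMS": [2, 1, 3, 1],
})

# Recode the sentinel 0 as a true missing value before profiling the data.
abt.loc[abt["MARITAL STATUS"].isna(), "INCOME"] = np.nan

# Both features are then dropped from the ABT, as decided above.
abt = abt.drop(columns=["MARITAL STATUS", "INCOME"])
```

Recoding before dropping matters if you later revisit the decision: the data quality report would otherwise under-count INCOME's missing values.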
Analysis – Irregular cardinality
INSURANCE TYPE – cardinality 1
Nothing wrong with the data; it refers to the type of claim (CI = Car Insurance). However, it is not useful for predictions
Decision → feature is removed from the ABT
Continuous features with low cardinality:
NUM CLAIMANTS, NUM CLAIMS, SOFT TISSUE, SOFT TISSUE %, FRAUD FLAG – all have cardinality <10 (dataset of 500 instances)
Decision → no issues
Decision → change FRAUD FLAG to categorical feature (note – this is also our target feature, so the type will have large impact on the ML approach we take)
Discussion with the business indicates that NUM CLAIMANTS, NUM CLAIMS, SOFT TISSUE, and SOFT TISSUE % naturally take a low number of values
FRAUD FLAG has cardinality 2, indicating a categorical feature labelled as continuous, with values 0 and 1.
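The cardinality checks and the FRAUD FLAG re-typing could look like this in pandas (a sketch on an invented slice of the ABT):

```python
import pandas as pd

# Hypothetical slice of the ABT; column names follow the case study.
abt = pd.DataFrame({
    "NUM CLAIMS": [1, 2, 1, 3, 1],
    "FRAUD FLAG": [0, 1, 0, 0, 1],
    "INSURANCE TYPE": ["CI", "CI", "CI", "CI", "CI"],
})

# Cardinality = number of distinct values per feature.
cardinality = abt.nunique()

# INSURANCE TYPE has cardinality 1: constant, no predictive value -> drop it.
abt = abt.drop(columns=[c for c in abt.columns if cardinality[c] == 1])

# FRAUD FLAG has cardinality 2: re-type it as a categorical (target) feature.
abt["FRAUD FLAG"] = abt["FRAUD FLAG"].astype("category")
```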
Analysis - Outliers
CLAIM AMOUNT – min value -99,999
Decision → invalid outlier; remove it and replace it with a missing value
Examining the raw data, we notice this value comes from instance d3 in our table.
Examining the histogram for CLAIM AMOUNT, we don’t see a large bar at that value, so this is an isolated instance
The value looks like an input error, which was confirmed with the business
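Replacing the invalid outlier with a missing value might look like this (invented values apart from the −99,999 sentinel from the case study):

```python
import numpy as np
import pandas as pd

# Hypothetical CLAIM AMOUNT values; -99,999 is the invalid outlier from d3.
claim_amount = pd.Series([1625.0, 15028.0, -99999.0, 5672.0])

# Replace the invalid outlier with a missing value rather than deleting
# the whole instance (the other features of d3 remain usable).
claim_amount = claim_amount.replace(-99999.0, np.nan)
```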
CLAIM AMOUNT, TOTAL CLAIMED, NUM CLAIMS, AMOUNT RECEIVED – all have unusually high maximum values, especially compared to the median and 3rd quartile
TOTAL CLAIMED, NUM CLAIMS – both high values are from d460 in ABT
Decision → d460 was removed from the ABT
This policy seems to have made many more claims than the others
Investigation with the business shows this is a valid outlier; however, it was a company policy and was included in the ABT by mistake.
CLAIM AMOUNT and AMOUNT RECEIVED – both from d302 in ABT
Decision → document the outliers and possibly handle them later
Examining the raw data, we note this is a large claim for a serious injury
Analysis of the histograms shows that the large values are not unique (several small bars on the right-hand side of the histograms)
Handling missing values
Approach 1: Drop any features that have missing values
Rule of thumb – only use this if 60% or more of the values are missing; otherwise look for a different approach
Approach 2: Apply complete case analysis.
Delete any instances where one or more features are missing values
Can result in significant data loss, can introduce bias (if distribution of missing values is not random)
Recommendation: Remove only instances with missing value for target feature
Approach 3: Derive a missing indicator feature from features with missing values.
A binary feature that flags whether the value was present or missing in the original data. May be useful if the reason the values are missing is related to the target feature
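Approach 3 can be sketched in pandas (hypothetical feature values; the indicator column name is ours):

```python
import pandas as pd

# Hypothetical feature with missing values.
df = pd.DataFrame({"MARITAL STATUS": ["married", None, "single", None]})

# Missing indicator feature: 1 where the original value was missing, else 0.
df["MARITAL STATUS MISSING"] = df["MARITAL STATUS"].isna().astype(int)
```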
Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present.
The most common approach to imputation is to replace missing values for a feature with a measure of the central tendency of that feature.
Continuous features – usually mean and median
Categorical features – usually the mode
We would be reluctant to use imputation on features missing in excess of 30% of their values and would strongly recommend against the use of imputation on features missing in excess of 50% of their values.
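Central-tendency imputation can be sketched as follows (invented values; column names follow the case study):

```python
import pandas as pd

df = pd.DataFrame({
    "NUM SOFT TISSUE": [0.0, 1.0, None, 0.0, 2.0],        # continuous
    "CLAIM TYPE": ["soft", "hard", None, "soft", "soft"],  # categorical
})

# Continuous feature: impute with the median (robust to skewed data).
df["NUM SOFT TISSUE"] = df["NUM SOFT TISSUE"].fillna(
    df["NUM SOFT TISSUE"].median()
)

# Categorical feature: impute with the mode (most frequent level).
df["CLAIM TYPE"] = df["CLAIM TYPE"].fillna(df["CLAIM TYPE"].mode()[0])
```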
Handling outliers – clamp transformation
The easiest way to handle outliers is to use a clamp transformation that clamps all values above an upper threshold and below a lower threshold to these threshold values, thus removing the offending outliers
a_i = lower   if a_i < lower
      upper   if a_i > upper
      a_i     otherwise
where a_i is a specific value of feature a, and lower and upper are the lower and upper thresholds.
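The clamp transformation can be written directly in Python (a minimal sketch; the function name is ours):

```python
def clamp(a_i, lower, upper):
    """Clamp a single feature value to the [lower, upper] thresholds."""
    if a_i < lower:
        return lower
    if a_i > upper:
        return upper
    return a_i
```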
Upper and lower limits can be set manually based on domain knowledge, or can be calculated from the data
Method 1:
lower = 1st quartile value – 1.5 × inter-quartile range
upper = 3rd quartile value + 1.5 × inter-quartile range
Method 2:
Use the mean value of a feature ± 2 times standard deviation
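Both threshold methods can be computed as below (invented values; NumPy's default percentile interpolation and population standard deviation are assumptions):

```python
import numpy as np

# Hypothetical feature values with one large outlier.
a = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 40.0])

# Method 1: quartile-based thresholds.
q1, q3 = np.percentile(a, [25, 75])
iqr = q3 - q1
lower_q = q1 - 1.5 * iqr
upper_q = q3 + 1.5 * iqr

# Method 2: mean +/- 2 standard deviations.
lower_sd = a.mean() - 2 * a.std()
upper_sd = a.mean() + 2 * a.std()
```

Note that the outlier itself inflates the mean and standard deviation, so Method 1 is usually the more robust of the two.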
Clamp transformation – for and against
Performing clamp transformation may remove the most interesting (and predictive) data from the dataset
However, some machine learning techniques perform poorly in presence of outliers
Clamp transformations are only recommended when you suspect the outliers will affect the performance of the model
Case study: Data Quality Plan
NUM SOFT TISSUE
Missing only 2% of the values, so we’ll use imputation.
We’ll impute a measure of central tendency: the mean value (0.2) doesn’t naturally occur in the data set, so we’ll use the median, 0
CLAIM AMOUNT, AMOUNT RECEIVED
We’ll use a clamp transformation.
The feature values follow an exponential distribution, so the threshold-setting methods we discussed won’t work too well.
Following a discussion with the business, we were advised that a lower limit of 0 and an upper limit of 80,000 make sense for the clamp transformation.
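With the business-advised limits, the clamp is a one-liner in pandas (invented claim values; only the 0 and 80,000 thresholds come from the case study):

```python
import pandas as pd

# Hypothetical CLAIM AMOUNT values, one above the agreed upper limit.
claim_amount = pd.Series([1625.0, 15028.0, 132952.0, 5672.0])

# Clamp transformation with the business-advised thresholds.
claim_amount = claim_amount.clip(lower=0, upper=80_000)
```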
Visualising relationship between features
In preparation for creating predictive models, it’s always a good idea to investigate the relationships between the variables
This can identify pairs of closely related features (which in turn can be used to reduce the size of the ABT)
Scatter plot (continuous features)
SPLOM (Scatter plot matrix)
Bar plot (categorical features)
“Small multiples approach”
If the two features being visualised have a strong relationship, then the bar plots for each level of the second feature will look noticeably different to one another, and to the overall bar plot.
Small multiple histograms
“Small multiple” histograms can be used to compare a categorical with a continuous feature
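A small-multiples histogram could be drawn with matplotlib (synthetic data; feature names follow the case study):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: one continuous feature, one categorical feature.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CLAIM AMOUNT": rng.exponential(scale=10_000, size=300),
    "FRAUD FLAG": rng.choice(["0", "1"], size=300),
})

# One histogram of the continuous feature per level of the categorical one,
# with shared axes so the panels are directly comparable.
levels = sorted(df["FRAUD FLAG"].unique())
fig, axes = plt.subplots(1, len(levels), sharex=True, sharey=True)
for ax, level in zip(axes, levels):
    ax.hist(df.loc[df["FRAUD FLAG"] == level, "CLAIM AMOUNT"], bins=20)
    ax.set_title(f"FRAUD FLAG = {level}")
plt.close(fig)
```

If the panels look noticeably different from one another, the two features are related.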
Covariance and correlation
Covariance and correlation provide us with formal measures of the relationship between two continuous variables
Covariance has a possible range of [-∞, ∞] where negative values indicate negative relationship, positive values indicate positive relationship, and values around 0 indicate little or no relationship
Correlation
Correlation is a normalized form of covariance
Correlation has a range of [-1, 1]
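The relationship between the two measures can be checked numerically with NumPy (invented values; ddof=1 gives the sample versions):

```python
import numpy as np

# Two hypothetical continuous features with a strong positive relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 6.0, 7.9, 10.0])

# Sample covariance and Pearson correlation.
cov_xy = np.cov(x, y)[0, 1]
corr_xy = np.corrcoef(x, y)[0, 1]

# Correlation = covariance normalized by the two standard deviations.
manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))
```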
Correlation matrix
Summary (Data Quality Reports)
The key outcomes of the data exploration process are that the practitioner should
Have gotten to know the features within the ABT, especially their central tendencies, variations, and distributions.
Have identified any data quality issues within the ABT, in particular missing values, irregular cardinality, and outliers.
Have corrected any data quality issues due to invalid data.
Have recorded any data quality issues due to valid data in a data quality plan along with potential handling strategies.
Be confident that enough good quality data exists to continue with a project.