
The Data Quality Report Getting To Know The Data Identifying Data Quality Issues Handling Data Quality Issues Summary
Fundamentals of Machine Learning for
Predictive Data Analytics
Chapter 3: Data Exploration Sections 3.1, 3.2, 3.3, 3.4
John D. Kelleher and Brian Mac Namee and Aoife D’Arcy

1 The Data Quality Report
    Case Study: Motor Insurance Fraud
2 Getting To Know The Data
    Case Study: Motor Insurance Fraud
3 Identifying Data Quality Issues
    Case Study: Motor Insurance Fraud
4 Handling Data Quality Issues
    Handling Missing Values
    Handling Outliers
    Case Study: Motor Insurance Fraud
5 Summary

The Data Quality Report

A data quality report includes tabular reports that describe the characteristics of each feature in an ABT using standard statistical measures of central tendency and variation.
The tabular reports are accompanied by data visualizations:
A histogram for each continuous feature in an ABT.
A bar plot for each categorical feature in an ABT.

Table: The structures of the tables included in a data quality report to describe (a) continuous features and (b) categorical features.
(a) Continuous Features
Feature | Count | % Miss. | Card. | Min. | 1st Qrt. | Mean | Median | 3rd Qrt. | Max. | Std. Dev.
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
(b) Categorical Features
Feature | Count | % Miss. | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode %
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
-- | -- | -- | -- | -- | -- | -- | -- | -- | --
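
The two table structures above can be sketched in code. The following is a minimal illustration using only the Python standard library, not the authors' implementation. The function names, the use of None to mark a missing value, the inclusive quartile convention, the sample standard deviation, and the choice to compute mode percentages over the non-missing values are all assumptions made for this sketch.

```python
import statistics
from collections import Counter


def continuous_report(values):
    """Summarise one continuous feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    # Inclusive quartiles are one of several conventions (assumption).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(values),
        "pct_missing": 100 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
        "min": min(present),
        "q1": q1,
        "mean": statistics.mean(present),
        "median": median,
        "q3": q3,
        "max": max(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }


def categorical_report(values):
    """Summarise one categorical feature; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    ranked = Counter(present).most_common(2)
    mode, mode_freq = ranked[0]
    second, second_freq = ranked[1] if len(ranked) > 1 else (None, 0)
    return {
        "count": len(values),
        "pct_missing": 100 * (len(values) - len(present)) / len(values),
        "cardinality": len(set(present)),
        "mode": mode,
        "mode_freq": mode_freq,
        # Percentages are taken over the non-missing values (assumption).
        "mode_pct": 100 * mode_freq / len(present),
        "second_mode": second,
        "second_mode_freq": second_freq,
        "second_mode_pct": 100 * second_freq / len(present),
    }
```

Calling one of these functions once per feature and stacking the resulting dictionaries gives one row per feature, in the column order of the tables above.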

Case Study: Motor Insurance Fraud
The following slides show a portion of the ABT that has been developed for the motor insurance claims fraud detection problem.

Table: Portions of the ABT for the motor insurance claims fraud detection problem. A dash (-) marks a missing value.
ID | TYPE | INC. | MARITAL STATUS | NUM CLMNTS. | INJURY TYPE | HOSPITAL STAY | CLAIM AMNT. | TOTAL CLAIMED | NUM CLAIMS | NUM SOFT TISS. | % SOFT TISS. | CLAIM AMT RCVD. | FRAUD FLAG
1 | CI | 0 | - | 2 | Soft Tissue | No | 1,625 | 3,250 | 2 | 2 | 1.0 | 0 | 1
2 | CI | 0 | - | 2 | Back | Yes | 15,028 | 60,112 | 1 | - | 0 | 15,028 | 0
3 | CI | 54,613 | Married | 1 | Broken Limb | No | -99,999 | 0 | 0 | 0 | 0 | 572 | 0
4 | CI | 0 | - | 4 | Broken Limb | Yes | 5,097 | 11,661 | 1 | 1 | 1.0 | 7,864 | 0
5 | CI | 0 | - | 4 | Soft Tissue | No | 8,869 | 0 | 0 | 0 | 0 | 0 | 1
6 | CI | 0 | - | 1 | Broken Limb | Yes | 17,480 | 0 | 0 | 0 | 0 | 17,480 | 0
7 | CI | 52,567 | Single | 3 | Broken Limb | No | 3,017 | 18,102 | 2 | 1 | 0.5 | 0 | 1
8 | CI | 0 | - | 2 | Back | Yes | 7,463 | 0 | 0 | 0 | 0 | 7,463 | 0
9 | CI | 0 | - | 1 | Soft Tissue | No | 2,067 | 0 | 0 | 0 | 0 | 2,067 | 0
10 | CI | 42,300 | Married | 4 | Back | No | 2,260 | 0 | 0 | 0 | 0 | 2,260 | 0
...
300 | CI | 0 | - | 2 | Broken Limb | No | 2,244 | 0 | 0 | 0 | 0 | 2,244 | 0
301 | CI | 0 | - | 1 | Broken Limb | No | 1,627 | 92,283 | 3 | 0 | 0 | 1,627 | 0
302 | CI | 0 | - | 3 | Serious | Yes | 270,200 | 0 | 0 | 0 | 0 | 270,200 | 0
303 | CI | 0 | - | 1 | Soft Tissue | No | 7,668 | 92,806 | 3 | 0 | 0 | 7,668 | 0
304 | CI | 46,365 | Married | 1 | Back | No | 3,217 | 0 | 0 | - | 0 | 1,653 | 0
...
458 | CI | 48,176 | Married | 3 | Soft Tissue | Yes | 4,653 | 8,203 | 1 | 0 | 0 | 4,653 | 0
459 | CI | 0 | - | 1 | Soft Tissue | Yes | 881 | 51,245 | 3 | 0 | 0 | 0 | 1
460 | CI | 0 | - | 3 | Back | No | 8,688 | 729,792 | 56 | 5 | 0.08 | 8,688 | 0
461 | CI | 47,371 | Divorced | 1 | Broken Limb | Yes | 3,194 | 11,668 | 1 | 0 | 0 | 3,194 | 0
462 | CI | 0 | - | 1 | Soft Tissue | No | 6,821 | 0 | 0 | 0 | 0 | 0 | 1
...
491 | CI | 40,204 | Single | 1 | Back | No | 75,748 | 11,116 | 1 | 0 | 0 | 0 | 1
492 | CI | 0 | - | 1 | Broken Limb | No | 6,172 | 6,041 | 1 | - | 0 | 6,172 | 0
493 | CI | 0 | - | 1 | Soft Tissue | Yes | 2,569 | 20,055 | 1 | 0 | 0 | 2,569 | 0
494 | CI | 31,951 | Married | 1 | Broken Limb | No | 5,227 | 22,095 | 1 | 0 | 0 | 5,227 | 0
495 | CI | 0 | - | 2 | Back | No | 3,813 | 9,882 | 3 | 0 | 0 | 0 | 1
496 | CI | 0 | - | 1 | Soft Tissue | No | 2,118 | 0 | 0 | 0 | 0 | 0 | 1
497 | CI | 29,280 | Married | 4 | Broken Limb | Yes | 3,199 | 0 | 0 | 0 | 0 | 0 | 1
498 | CI | 0 | - | 1 | Broken Limb | Yes | 32,469 | 0 | 0 | 0 | 0 | 16,763 | 0
499 | CI | 46,683 | Married | 1 | Broken Limb | No | 179,448 | 0 | 0 | - | 0 | 179,448 | 0
500 | CI | 0 | - | 1 | Broken Limb | No | 8,259 | 0 | 0 | 0 | 0 | 0 | 1

Table: A data quality report for the motor insurance claims fraud detection ABT.
(a) Continuous Features
Feature | Count | % Miss. | Card. | Min. | 1st Qrt. | Mean | Median | 3rd Qrt. | Max. | Std. Dev.
INCOME | 500 | 0.0 | 171 | 0.0 | 0.0 | 13,740.0 | 0.0 | 33,918.5 | 71,284.0 | 20,081.5
NUM CLAIMANTS | 500 | 0.0 | 4 | 1.0 | 1.0 | 1.9 | 2 | 3.0 | 4.0 | 1.0
CLAIM AMOUNT | 500 | 0.0 | 493 | -99,999 | 3,322.3 | 16,373.2 | 5,663.0 | 12,245.5 | 270,200.0 | 29,426.3
TOTAL CLAIMED | 500 | 0.0 | 235 | 0.0 | 0.0 | 9,597.2 | 0.0 | 11,282.8 | 729,792.0 | 35,655.7
NUM CLAIMS | 500 | 0.0 | 7 | 0.0 | 0.0 | 0.8 | 0.0 | 1.0 | 56.0 | 2.7
NUM SOFT TISSUE | 500 | 2.0 | 6 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 5.0 | 0.6
% SOFT TISSUE | 500 | 0.0 | 9 | 0.0 | 0.0 | 0.2 | 0.0 | 0.0 | 2.0 | 0.4
AMOUNT RECEIVED | 500 | 0.0 | 329 | 0.0 | 0.0 | 13,051.9 | 3,253.5 | 8,191.8 | 295,303.0 | 30,547.2
FRAUD FLAG | 500 | 0.0 | 2 | 0.0 | 0.0 | 0.3 | 0.0 | 1.0 | 1.0 | 0.5

Table: A data quality report for the motor insurance claims fraud detection ABT.
(b) Categorical Features
Feature | Count | % Miss. | Card. | Mode | Mode Freq. | Mode % | 2nd Mode | 2nd Mode Freq. | 2nd Mode %
INSURANCE TYPE | 500 | 0.0 | 1 | CI | 500 | 1.0 | - | - | -
MARITAL STATUS | 500 | 61.2 | 4 | Married | 99 | 51.0 | Single | 48 | 24.7
INJURY TYPE | 500 | 0.0 | 4 | Broken Limb | 177 | 35.4 | Soft Tissue | 172 | 34.4
HOSPITAL STAY | 500 | 0.0 | 2 | No | 354 | 70.8 | Yes | 146 | 29.2

Figure: Visualizations of the continuous and categorical features in the motor insurance claims fraud detection ABT in Table 2 [7]. Panels (y-axis: density): (a) INCOME, (b) NUM CLAIMANTS, (c) CLAIM AMOUNT, (d) TOTAL CLAIMED.

Figure: Visualizations of the continuous and categorical features in the motor insurance claims fraud detection ABT in Table 2 [7]. Panels (y-axis: density): (a) NUM CLAIMS, (b) NUM SOFT TISSUE, (c) % SOFT TISSUE, (d) AMOUNT RECEIVED.

Figure: Visualizations of the continuous and categorical features in the motor insurance claims fraud detection ABT in Table 2 [7]. Panels (y-axis: density): (a) MARITAL STATUS (levels: Missing, Married, Single, Divorced), (b) INJURY TYPE (levels: Broken Limb, Soft Tissue, Back, Serious), (c) HOSPITAL STAY (levels: No, Yes).

Figure: Visualizations of the continuous and categorical features in the motor insurance claims fraud detection ABT in Table 2 [7]. Panels (y-axis: density): (a) FRAUD FLAG (levels: 0, 1), (b) INSURANCE TYPE (level: CI).

Getting To Know The Data

For categorical features, we should:
Examine the mode, 2nd mode, mode %, and 2nd mode %, as these tell us the most common levels within these features and will identify if any levels dominate the dataset.
For continuous features, we should:
Examine the mean and standard deviation of each feature to get a sense of the central tendency and variation of the values within the dataset for the feature.
Examine the minimum and maximum values to understand the range that is possible for each feature.

When we generate histograms of features, there are a number of common, well-understood shapes that we should look out for.

(a) Uniform (b) Normal (Unimodal) (c) Unimodal (skewed right)
Figure: Histograms for different sets of data, each of which exhibits well-known, common characteristics.

(a) Unimodal (skewed left) (b) Exponential (c) Multimodal
Figure: Histograms for different sets of data, each of which exhibits well-known, common characteristics.

A uniform distribution indicates that a feature is equally likely to take a value in any of the ranges present.
Figure: Histogram of a feature with a uniform distribution.

Features following a normal distribution are characterized by a strong tendency towards a central value and symmetrical variation to either side of this.
Figure: Histogram of a feature with a normal (unimodal) distribution.

Skew is simply a tendency towards very high (right skew) or very low (left skew) values.
Figure: Histograms of features with (a) a unimodal (skewed left) distribution and (b) a unimodal (skewed right) distribution.

In a feature following an exponential distribution the likelihood of occurrence of a small number of low values is very high, but sharply diminishes as values increase.
Figure: Histogram of a feature with an exponential distribution.

A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated.
Figure: Histogram of a feature with a multimodal distribution.

The probability density function for the normal distribution (or Gaussian distribution) is
$$
N(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{1}
$$
where x is any value, and μ and σ are parameters that define the shape of the distribution: the population mean and population standard deviation.
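
As a small illustration, Equation (1) can be evaluated directly. The sketch below uses only the Python standard library; the function name normal_pdf and the check that σ is positive are assumptions added for this example.

```python
import math


def normal_pdf(x, mu, sigma):
    """Evaluate the normal density N(x, mu, sigma) of Equation (1)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)
```

The density peaks at x = μ, and a larger σ lowers and widens the curve.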

Figure: Three normal distributions with different means but identical standard deviations (μ=0, σ=1; μ=−2, σ=1; μ=+2, σ=1). Axes: Value (x) against Density (y).

Figure: Three normal distributions with identical means but different standard deviations (μ=0, σ=1; μ=0, σ=2; μ=0, σ=6). Axes: Value (x) against Density (y).

The 68-95-99.7 rule is a useful characteristic of the normal distribution.
The rule states that approximately:
68% of the observations will be within one σ of μ
95% of the observations will be within two σ of μ
99.7% of the observations will be within three σ of μ
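
The three percentages in the rule can be checked numerically. For a normal distribution, the fraction of observations within k standard deviations of the mean is erf(k/√2). The sketch below uses the standard library function math.erf; the function name fraction_within is an assumption for this example.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution lying within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * fraction_within(k):.1f}%")
```

The exact fractions are slightly different from the rounded figures in the rule (for example, about 95.4% rather than 95% for two standard deviations), which is why the rule is stated as approximate.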

Figure: An illustration of the 68-95-99.7 percentage rule that a normal distribution defines as the expected distribution of observations. The x-axis is marked from μ−3σ to μ+3σ in steps of σ. The grey region defines the area where 95% of observations are expected.

Case Study: Motor Insurance Fraud
Examine the data quality report for the motor insurance fraud prediction scenario and comment on the central tendency and variation of each feature.

Identifying Data Quality Issues

A data quality issue is loosely defined as anything unusual about the data in an ABT.
The most common data quality issues are:
missing values
irregular cardinality
outliers
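
A rough sketch of how these three issues might be flagged automatically for a single continuous feature is shown below. It is an illustration only: the function name flag_issues, the use of None for missing values, the treatment of a cardinality of 1 as irregular, and the 1.5 × inter-quartile-range rule of thumb for outliers are all assumptions that do not come from these slides. Anything the sketch flags still needs to be inspected by hand.

```python
import statistics


def flag_issues(values, iqr_multiplier=1.5):
    """Return a list of possible data quality issues for one continuous feature."""
    present = [v for v in values if v is not None]
    issues = []

    missing = len(values) - len(present)
    if missing:
        issues.append(f"missing values ({100 * missing / len(values):.1f}%)")

    # A feature with a single distinct value carries no information (assumption).
    if len(set(present)) == 1:
        issues.append("irregular cardinality (1)")

    # Rule-of-thumb outlier thresholds based on the inter-quartile range (assumption).
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    low = q1 - iqr_multiplier * (q3 - q1)
    high = q3 + iqr_multiplier * (q3 - q1)
    if any(v < low for v in present):
        issues.append("outliers (low)")
    if any(v > high for v in present):
        issues.append("outliers (high)")
    return issues
```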

The data quality issues we identify from a data quality report will be of two types:
Data quality issues due to invalid data.
Data quality issues due to valid data.

Table: The structure of a data quality plan.
Feature | Data Quality Issue | Potential Handling Strategies
-- | -- | --
-- | -- | --
-- | -- | --
-- | -- | --

Table: The data quality plan for the motor insurance fraud prediction ABT.
Feature | Data Quality Issue | Potential Handling Strategies
NUM SOFT TISSUE | Missing values (2%) |
CLAIM AMOUNT | Outliers (high) |
AMOUNT RECEIVED | Outliers (high) |

Handling Data Quality Issues

Handling Missing Values
Approach 1: Drop any features that have missing values.
Approach 2: Apply complete case analysis.
Approach 3: Derive a missing indicator feature from features with missing values.
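
The three approaches can be sketched on a small in-memory dataset. In this illustration the ABT is assumed to be a list of dictionaries (one per instance) with None marking a missing value; the function names and the `_MISSING` suffix for indicator features are also assumptions made for the example.

```python
def drop_features_with_missing(rows):
    """Approach 1: keep only the features that are never missing."""
    if not rows:
        return []
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]


def complete_cases(rows):
    """Approach 2: keep only the instances that have no missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def add_missing_indicators(rows):
    """Approach 3: add a binary indicator feature for each feature with missing values."""
    if not rows:
        return []
    affected = [k for k in rows[0] if any(r[k] is None for r in rows)]
    return [
        {**r, **{f"{k}_MISSING": int(r[k] is None) for k in affected}}
        for r in rows
    ]
```

Approach 1 removes columns and Approach 2 removes rows, so both can discard a large amount of data; Approach 3 keeps everything and adds one column per affected feature.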

Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present.
The most common approach to imputation is to replace missing values for a feature with a measure of the central tendency of that feature.
We would be reluctant to use imputation on features missing in excess of 30% of their values and would strongly recommend against the use of imputation on features missing in excess of 50% of their values.
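
A minimal sketch of imputation with a measure of central tendency is given below, here using the median. The function name impute_median, the use of None for missing values, and the decision to raise an error above a configurable missing fraction (defaulting to the 30% figure mentioned above) are assumptions made for this illustration.

```python
import statistics


def impute_median(values, max_missing_fraction=0.3):
    """Replace missing values (None) with the median of the values that are present."""
    present = [v for v in values if v is not None]
    missing_fraction = 1 - len(present) / len(values)
    if missing_fraction > max_missing_fraction:
        raise ValueError(
            f"{100 * missing_fraction:.0f}% of values are missing; imputation is not advised"
        )
    fill = statistics.median(present)
    return [fill if v is None else v for v in values]
```

For a categorical feature the same idea applies with the mode in place of the median.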

Handling Outliers
The easiest way to handle outliers is to use a clamp transformation that clamps all values above an upper threshold and below a lower threshold to these threshold values, thus removing the offending outliers:
$$
a_i =
\begin{cases}
\mathit{lower} & \text{if } a_i < \mathit{lower} \\
\mathit{upper} & \text{if } a_i > \mathit{upper} \\
a_i & \text{otherwise}
\end{cases} \tag{2}
$$
where a_i is a specific value of feature a, and lower and upper are the lower and upper thresholds.
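
Equation (2) translates directly into code. The sketch below assumes the feature is held as a plain list of numbers with no missing values; the function name clamp and the check that lower does not exceed upper are additions for this example.

```python
def clamp(values, lower, upper):
    """Apply the clamp transformation of Equation (2) to every value of a feature."""
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    return [min(max(v, lower), upper) for v in values]


# Example with manually chosen thresholds of 0 and 80,000.
print(clamp([-99999, 5097, 270200], 0, 80000))  # [0, 5097, 80000]
```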

Case Study: Motor Insurance Fraud
What handling strategies would you recommend for the data quality issues found in the motor insurance fraud ABT?

Table: The data quality plan for the motor insurance fraud prediction ABT.
Feature | Data Quality Issue | Potential Handling Strategies
NUM SOFT TISSUE | Missing values (2%) | Imputation (median: 0.0)
CLAIM AMOUNT | Outliers (high) | Clamp transformation (manual: 0, 80,000)
AMOUNT RECEIVED | Outliers (high) | Clamp transformation (manual: 0, 80,000)

Summary

The key outcomes of the data exploration process are that the practitioner should:
1 Have gotten to know the features within the ABT, especially their central tendencies, variations, and distributions.
2 Have identified any data quality issues within the ABT, in particular missing values, irregular cardinality, and outliers.
3 Have corrected any data quality issues due to invalid data.
4 Have recorded any data quality issues due to valid data in a data quality plan along with potential handling strategies.
5 Be confident that enough good quality data exists to continue with a project.
