
Data Quality Report –
Case Study
Fundamentals of Machine Learning for Predictive Data Analytics,
John D. Kelleher, Brian Mac Namee and Aoife D'Arcy
Chapter 3

Case Study: Analytical Base Table

Case study: Data Quality Reports

Analysis – Missing values
NUM SOFT TISSUE – 2%
Possible solution: leave it as it is, document it in the data quality plan
MARITAL STATUS – 61.2%
Possible solution: remove the feature

Decision → remove both MARITAL STATUS and INCOME from ABT

Examining the raw data, we notice that in the rows where MARITAL STATUS is missing, INCOME has the value 0
The histogram for INCOME shows an unusual pattern: a large number of 0s
Discussion with the business revealed that MARITAL STATUS and INCOME were collected together, and an INCOME of 0 represents a missing value
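
A minimal sketch of how such a missing-value check could be run in pandas. The file name and column labels are assumptions for illustration, not from the source:

```python
import pandas as pd

# Load the ABT (file name and column labels are hypothetical)
abt = pd.read_csv("motor_insurance_fraud.csv")

# Percentage of missing values per feature, as reported in the data quality report
missing_pct = abt.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))

# Cross-check the suspected pattern: INCOME is 0 wherever MARITAL STATUS is missing
mask = abt["Marital Status"].isna()
print(abt.loc[mask, "Income"].describe())
```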

Analysis – Irregular cardinality

INSURANCE TYPE – cardinality 1
Nothing wrong with the data: the feature refers to the type of claim (CI = Car Insurance). However, a feature with a single value is not useful for prediction

Decision → feature is removed from the ABT
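
A quick way to compute feature cardinalities, sketched in pandas (the file and column names are assumptions carried over from the earlier sketch):

```python
import pandas as pd

abt = pd.read_csv("motor_insurance_fraud.csv")  # hypothetical file name

# Number of distinct values per feature; a cardinality of 1 flags a constant feature
print(abt.nunique().sort_values())

# Drop the constant feature identified in the report (column name assumed)
abt = abt.drop(columns=["Insurance Type"])
```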

Analysis – Irregular cardinality
Continuous features with low cardinality:
NUM CLAIMANTS, NUM CLAIMS, SOFT TISSUE, SOFT TISSUE %, FRAUD FLAG – all have cardinality < 10 (dataset of 500 instances)

Decision → no issues
Decision → change FRAUD FLAG to a categorical feature (note – this is also our target feature, so its type will have a large impact on the ML approach we take)

Discussion with the business indicates that NUM CLAIMANTS, NUM CLAIMS, SOFT TISSUE and SOFT TISSUE % naturally take a low number of values
FRAUD FLAG has cardinality 2, which indicates a categorical feature labelled as continuous, with values 0 and 1

Analysis – Outliers

CLAIM AMOUNT – min value -99,999

Decision → invalid outlier, remove it and replace it with a missing value

Examining the raw data we notice this value comes from instance d3 in our table
Examining the histogram for CLAIM AMOUNT we don't see a large bar at that value, so this is an isolated instance
The value looks like an input error, which was confirmed with the business

Analysis – Outliers

CLAIM AMOUNT, TOTAL CLAIMED, NUM CLAIMS, AMOUNT RECEIVED – all have unusually high maximum values, especially compared to the median and 3rd quartile
TOTAL CLAIMED, NUM CLAIMS – both high values are from instance d460 in the ABT

Decision → d460 was removed from the ABT

This policy seems to have made many more claims than the others
Investigation with the business shows this is a valid outlier; however, it is a company policy and was included in the ABT by mistake

Analysis – Outliers

CLAIM AMOUNT and AMOUNT RECEIVED – both high values are from instance d302 in the ABT

Decision → document the outliers and possibly handle them later

Examining the raw data we note this is a large claim for a serious injury
Analysis of the histograms shows that the large values are not unique (several small bars on the right-hand side of the histograms)

Handling missing values

Approach 1: Drop any features that have missing values
Rule of thumb – only use if 60% or more of the values are missing, otherwise look for a different approach to handle them
Approach 2: Apply complete case analysis – delete any instances where one or more features have missing values
Can result in significant data loss and can introduce bias (if the distribution of missing values is not random)
Recommendation: remove only instances with a missing value for the target feature
Approach 3: Derive a missing indicator feature from features with missing values
A binary feature that flags whether the value was present or missing in the original data
May be useful if the reason the values are missing is related to the target feature

Handling missing values

Imputation replaces missing feature values with a plausible estimated value based on the feature values that are present
The most common approach to imputation is to replace missing values for a feature with a measure of the central tendency of that feature
Continuous features – usually the mean or median
Categorical features – usually the mode
We would be reluctant to use imputation on features missing in excess of 30% of their values and would strongly recommend against the use of imputation on features missing in excess of 50% of their values

Handling outliers – clamp transformation

The easiest way to handle outliers is to use a clamp transformation, which clamps all values above an upper threshold and below a lower threshold to these threshold values, thus removing the offending outliers:

clamp(a_i) = lower  if a_i < lower
clamp(a_i) = upper  if a_i > upper
clamp(a_i) = a_i    otherwise

where a_i is a specific value of feature a, and lower and upper are the lower and upper thresholds
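
A minimal sketch of the clamp transformation in Python; the function name and the use of pandas are assumptions, not from the source:

```python
import pandas as pd

def clamp(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Clamp transformation: values below `lower` become lower,
    values above `upper` become upper, everything else is unchanged."""
    return values.clip(lower=lower, upper=upper)

# Example (thresholds suggested later in the case study for CLAIM AMOUNT):
# abt["Claim Amount"] = clamp(abt["Claim Amount"], lower=0, upper=80_000)
```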
Handling outliers – clamp transformation

Upper and lower limits can be set manually based on domain knowledge, or can be calculated from the data (a code sketch of both methods appears at the end of this section)
Method 1: lower = 1st quartile value – 1.5 × inter-quartile range
          upper = 3rd quartile value + 1.5 × inter-quartile range
Method 2: use the mean value of the feature ± 2 × standard deviation

Handling outliers – clamp transformation

Clamp transformation – for and against
Performing a clamp transformation may remove the most interesting (and predictive) data from the dataset
However, some machine learning techniques perform poorly in the presence of outliers
Clamp transformations are only recommended when you suspect the outliers will affect the performance of the model

Case study: Data Quality Plan

Case study: Data Quality Plan
NUM SOFT TISSUE
Missing only 2% of the values, so we'll use imputation
We'll use the mean or the median (the mean value of 0.2 doesn't naturally occur in the data set, so we'll use the median, 0)

Case study: Data Quality Plan
CLAIM AMOUNT, AMOUNT RECEIVED
We'll use the clamp transformation
Both features follow an exponential distribution, so the threshold methods we discussed won't work too well
Following a discussion with the business we were advised that a lower limit of 0 and an upper limit of 80,000 make sense for the clamp transformation

Case study: Data Quality Plan

Visualising relationships between features

In preparation for creating predictive models it's always a good idea to investigate the relationships between the variables
This can identify pairs of closely related features (which in turn can be used to reduce the size of the ABT)
Scatter plot (continuous features)
SPLOM (scatter plot matrix)
Bar plot (categorical features)

"Small multiples approach"
If the two features being visualised have a strong relationship, then the bar plots for each level of the second feature will look noticeably different to one another, and to the overall bar plot

Small multiple histograms
"Small multiple" histograms can be used to compare a categorical feature with a continuous feature

Small multiple histograms (cont.)

Covariance and correlation
Covariance and correlation provide us with formal measures of the relationship between two continuous variables
Covariance has a possible range of [-∞, ∞], where negative values indicate a negative relationship, positive values indicate a positive relationship, and values around 0 indicate little or no relationship

Correlation
Correlation is a normalised form of covariance, obtained by dividing the covariance by the product of the standard deviations of the two features
Correlation has a range of [-1, 1]

Correlation matrix

Summary (Data Quality Reports)
The key outcomes of the data exploration process are that the practitioner should
Have gotten to know the features within the ABT, especially their central tendencies, variations, and distributions.
Have identified any data quality issues within the ABT, in particular missing values, irregular cardinality, and outliers.
Have corrected any data quality issues due to invalid data.
Have recorded any data quality issues due to valid data in a data quality plan along with potential handling strategies.
Be confident that enough good quality data exists to continue with a project.
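
As referenced above, a minimal sketch of the two data-driven threshold methods for the clamp transformation; the function names and the pandas usage are assumptions for illustration:

```python
import pandas as pd

def iqr_thresholds(values: pd.Series) -> tuple[float, float]:
    """Method 1: thresholds from the 1st/3rd quartiles and the inter-quartile range."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def mean_sd_thresholds(values: pd.Series) -> tuple[float, float]:
    """Method 2: thresholds at the mean ± 2 standard deviations."""
    mean, sd = values.mean(), values.std()
    return mean - 2 * sd, mean + 2 * sd

# Example usage with the clamp() sketch shown earlier (column name assumed):
# lower, upper = iqr_thresholds(abt["Claim Amount"])
# abt["Claim Amount"] = clamp(abt["Claim Amount"], lower, upper)
```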