Data Cleansing — 2
Faculty of Information Technology, Monash University, Australia
FIT5196 week 7
(Monash) FIT5196 1 / 24
Missing Value
Missing values in the Switzerland heart disease data set are indicated by “?”.
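A minimal sketch of how such a "?" marker might be converted to a proper missing value when loading the data. The three rows and the column names below are made up for illustration and are not taken from the actual data set; only the "?" convention comes from the slide.

```python
import csv
import io

# Hypothetical extract: the values and column names are illustrative only.
RAW = """age,chol,thalach
32,?,127
34,?,154
38,?,?
"""

def read_with_missing(text, missing_token="?"):
    """Parse CSV text, mapping the missing-value token to None."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            key: None if value.strip() == missing_token else float(value)
            for key, value in record.items()
        })
    return rows

rows = read_with_missing(RAW)

# Count the missing entries per column.
missing_counts = {col: sum(row[col] is None for row in rows) for col in rows[0]}
print(missing_counts)
```

Converting the marker at load time keeps "?" from being treated as a valid string value in later analysis.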
Reasons for Missing Values
Equipment errors
Absence of survey participants
Unavailability of GPS signals in rural areas
Change of circumstances, such as death or graduation
Filter questions: a set of survey questions that is asked only of some
participants, e.g. those who indicate they are married
Consequences of Missing Values
Why is missing data a problem in data analysis?
• Virtually all standard statistical methods presume complete information for all the
variables included in the analysis.
Consequences: ignoring or inappropriately handling missing data may lead
to
• Biased estimation: over- or underestimated sample mean and variance
• Incorrect inferences/results: garbage in, garbage out
“The only really good solution to the missing data problem is not to have
any. So in the design and execution of research projects, it is essential to
put great effort into minimising the occurrence of missing data. Statistical
adjustments can never make up for sloppy research” — Paul D. Allison
Outline
1 Missing Data Mechanisms
2 Missing Data Pattern
3 Methods for Handling Missing Values
4 Summary
Missing Data Mechanisms
Missing data mechanisms describe the relationships between measured variables and the
probability of missing data.
Deciding on a method for analysing missing values requires
understanding both the reasons for the missing values and the nature
of the data for the missing observations.
Three different missingness mechanisms:
• Missing at random (MAR)
• Missing completely at random (MCAR)
• Missing not at random (MNAR)
Mechanisms: Missing at Random (MAR)
MAR: the probability of missing data on a variable is related to some other
measured variable (or variables) in the analysis model but not to the values
of the variable itself.
• $B$: a binary $n \times p$ matrix indicating the missingness of the data
• $Y = (Y_{obs}, Y_{mis})$
− $Y_{obs}$: observed part of $Y$
− $Y_{mis}$: missing part of $Y$
• $\eta$: some unknown parameter
• MAR is defined probabilistically as
$p(B \mid Y_{obs}, Y_{mis}, \eta) = p(B \mid Y_{obs}, \eta)$
which says that the probability of missingness depends only on the observed portion of the
data, via some parameter $\eta$ that relates $Y_{obs}$ to $B$.
Practical issue: no way to confirm that the probability of missing data on Y
is solely a function of other measured variables.
Examples
• A psychologist is studying quality of life in a group of cancer patients and
finds that elderly patients and patients with less education have a higher
propensity to refuse the quality of life questionnaire.
− The missingness in quality of life is related to age and education.
• An educational researcher is studying reading achievement and finds that
Hispanic students have a higher rate of missing data than Caucasian students.
− The missingness in reading achievement is related to the ethnic group of the
students.
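A small simulation sketch of the first example may help. All numbers are made up: the link between age and quality of life (`qol`), the age cut-off of 65, and the refusal probabilities in the hypothetical `p_missing` rule are assumptions chosen only to produce a MAR mechanism.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up population: quality of life (qol) declines slightly with age.
n = 20000
ages = [random.uniform(30, 90) for _ in range(n)]
qol = [80 - 0.3 * a + random.gauss(0, 5) for a in ages]

def p_missing(age):
    """Assumed MAR rule: older patients refuse the questionnaire more often."""
    return 0.6 if age >= 65 else 0.1

# B[i] is True when qol[i] is missing; it depends on age only, never on qol.
B = [random.random() < p_missing(a) for a in ages]

full_mean = statistics.mean(qol)
observed_mean = statistics.mean(q for q, b in zip(qol, B) if not b)

# Within one age group the observed cases are still representative.
old_full = [q for q, a in zip(qol, ages) if a >= 65]
old_observed = [q for q, a, b in zip(qol, ages, B) if a >= 65 and not b]

print(f"all ages: full = {full_mean:.2f}, observed = {observed_mean:.2f}")
print(f"age >= 65: full = {statistics.mean(old_full):.2f}, "
      f"observed = {statistics.mean(old_observed):.2f}")
```

In this sketch the overall observed mean is shifted because older patients are under-represented, while the comparison within the older group stays close. That is the sense in which MAR missingness can be handled using the observed variables.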
Mechanisms: Missing Completely at Random (MCAR)
MCAR: the probability of missing data on a variable is unrelated to other
measured variables and is unrelated to the values of the variable itself.
• $B$: a binary $n \times p$ matrix indicating the missingness of the data
• $Y = (Y_{obs}, Y_{mis})$
− $Y_{obs}$: observed part of $Y$
− $Y_{mis}$: missing part of $Y$
• $\eta$: some unknown parameter
• MCAR is defined probabilistically as
$p(B \mid Y_{obs}, Y_{mis}, \eta) = p(B \mid \eta)$
which says that some parameter still governs the probability that an entry of $B$ takes on
a value of zero or one, but missingness is no longer related to the data.
MCAR is a more restrictive condition than MAR.
Both MAR and MCAR could be ignorable.
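One consequence of the MCAR definition is that cases with and without missing values should look alike on every other variable. The sketch below illustrates this on simulated data. The variables `x` and `y`, the 20% missing rate and the helper `welch_t` are all assumptions made for illustration.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

n = 5000
x = [random.gauss(50, 10) for _ in range(n)]       # fully observed variable
y = [0.5 * xi + random.gauss(0, 5) for xi in x]    # variable that loses values

# MCAR: every y value is dropped with the same probability, whatever x or y is.
B = [random.random() < 0.2 for _ in range(n)]

x_missing = [xi for xi, b in zip(x, B) if b]
x_complete = [xi for xi, b in zip(x, B) if not b]

def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b)
    )

t = welch_t(x_missing, x_complete)
y_full_mean = statistics.mean(y)
y_observed_mean = statistics.mean(yi for yi, b in zip(y, B) if not b)

print(f"t statistic comparing x across the two groups: {t:.2f}")
print(f"mean of y: full = {y_full_mean:.2f}, observed = {y_observed_mean:.2f}")
```

Under MCAR the t statistic is expected to be small and the observed mean of `y` close to the full mean, so the loss is mainly one of sample size rather than bias.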
Example (figure omitted), adapted from “Dealing with missing data: Key assumptions and methods for applied analysis” by Marina Soley-Bori.
Effect of MCAR:
Table 6.1: Summary of Effects of Missingness Corrections for Math Achievement Scores
| Data set | N | Mean Math IRT Score | SD Math IRT Score | Skew, Kurtosis Math IRT Score | Mean Reading IRT Scores (Not Missing) | Mean Reading IRT Scores (Missing) | F | Average Error of Estimates (SD) | Correlation With Reading IRT Score | Effect Size (r²) |
|---|---|---|---|---|---|---|---|---|---|---|
| Original Data ("Population") | 15,163 | 38.03 | 11.94 | −0.02, −0.85 | | | | | .77 | .59 |
| Missing Completely at Random (MCAR) | 12,099 | 38.06 | 11.97 | −0.03, −0.86 | 29.98 | 30.10 | < 1, ns | | .77* | .59 |
| Missing Not at Random (MNAR), Low | 12,134 | 43.73 | 9.89 | −0.50, 0.17 | 33.63 | 23.09 | 5,442.49, p < .0001, η² = .26 | | .70* | .49 |
| Missing Not at Random (MNAR), Extreme | 7,578 | 38.14 | 8.26 | −0.01, 0.89 | 30.26 | 29.74 | 10.84, p < .001, η² = .001 | | .61* | .37 |
| Missing Not at Random (MNAR), Inverse | 4,994 | 37.60 | 5.99 | 0.20, 0.60 | 29.59 | 30.20 | 13.35, p < .001, η² = .001 | | −.20* | .04 |
Table is from "Dealing with missing or incomplete data".
Test MCAR: separate the missing and the complete cases on a particular variable and examine group mean differences on other variables in the data set.
• Univariate t-test comparisons: use a t-test to examine those group mean differences, one variable at a time.
− A non-significant t statistic: consistent with the data being MCAR
− A significant t statistic (or alternatively, a large mean difference): the data are not MCAR, i.e. they are MAR or MNAR.
• Little’s MCAR test: a multivariate extension of the t-test approach that simultaneously evaluates mean differences on every variable in the data set
− A global test of MCAR that applies to the entire data set
Mechanisms: Missing Not at Random (MNAR)
MNAR: the probability of missing data on a variable is related to the values of the variable itself, even after controlling for other variables.
• $B$: a binary $n \times p$ matrix indicating the missingness of the data
• $Y = (Y_{obs}, Y_{mis})$
− $Y_{obs}$: observed part of $Y$
− $Y_{mis}$: missing part of $Y$
• $\eta$: some unknown parameter
• MNAR is defined probabilistically by
$p(B \mid Y_{obs}, Y_{mis}, \eta)$
which does not simplify: the probability of missingness depends on the missing values $Y_{mis}$ themselves.
Examples
• Students with poor reading skills have missing test scores because they experienced reading comprehension difficulties during the exam.
− The missingness in reading achievement is related to reading skills.
• A number of patients in the cancer trial become so ill (e.g., their quality of life becomes so poor) that they can no longer participate in the study.
− The missingness in quality of life is related to the quality of life itself.
Effects of MNAR: see the MNAR rows of Table 6.1, "Summary of Effects of Missingness Corrections for Math Achievement Scores" (from "Dealing with missing or incomplete data"). For example, in the MNAR-Low condition the mean math score rises from 38.03 in the original data to 43.73, and the correlation with the reading score falls from .77 to .70.
MAR, MCAR vs. MNAR?
Example (figure omitted), adapted from "Applied Missing Data Analysis" by Craig K. Enders.
Missing Data Pattern
A missing data pattern refers to the configuration of observed and missing values in a data set.
The univariate pattern has missing values isolated to a single variable.
A monotone missing data pattern is typically associated with a longitudinal study where participants drop out and never return.
A general pattern has missing values dispersed throughout the data matrix in a haphazard fashion.
Example: a study examining the effects of a program to increase students’ knowledge of their asthma. The study examines how a measure of a student’s self-efficacy beliefs about controlling their asthma symptoms relates to a number of predictors.
Figures (omitted) are from "A review of methods for missing data" by Pigott.
Methods for Handling Missing Values
Deletion method: Listwise Deletion
Listwise deletion (also known as complete-case analysis) discards the data for any case that has one or more missing values.
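A minimal sketch of listwise deletion on a few hypothetical records. The field names and values are invented for illustration, and `None` is assumed to mark a missing value.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "iq": 78, "job_performance": None},
    {"id": 2, "iq": 84, "job_performance": 9},
    {"id": 3, "iq": None, "job_performance": 10},
    {"id": 4, "iq": 105, "job_performance": 12},
]

def listwise_delete(rows):
    """Keep only the cases with no missing value on any variable."""
    return [row for row in rows if all(v is not None for v in row.values())]

complete_cases = listwise_delete(records)
print([row["id"] for row in complete_cases])
```

Two of the four cases are discarded here even though each is missing only one value, which shows how quickly the sample can shrink.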
The figure illustrating listwise deletion (omitted) is from "Applied Missing Data Analysis".
Considerations
• The primary benefit of listwise deletion is convenience: it produces a common set of cases for all analyses.
• It assumes MCAR data and can produce distorted parameter estimates when this assumption does not hold.
• Deleting the incomplete data records can produce a dramatic reduction in the total sample size, the magnitude of which increases as the missing data rate or number of variables increases.
Deletion method: Pairwise Deletion
Pairwise deletion (also known as available-case analysis) attempts to mitigate the loss of data by eliminating cases on an analysis-by-analysis basis.
Example: computing a covariance. Figures (omitted) are from "A Review of Methods for Missing Data" by Therese D. Pigott.
Considerations:
• It requires MCAR data and can produce distorted parameter estimates when this assumption does not hold.
• Its performance depends on the magnitude of the correlations that exist between variables.
• It can produce estimated covariance matrices whose implied correlations fall outside the range of −1.0 to 1.0, which causes estimation problems for multivariate analyses that use a covariance matrix as input data.
• It lacks a consistent sample base, which causes problems in computing standard errors and covariances.
Single Imputation methods
Single imputation generates a single replacement value for each missing data point.
• Yields a complete data set
• Produces biased parameter estimates
• Underestimates standard errors
Methods
• Mean imputation
• Regression imputation
• Stochastic regression imputation
Arithmetic mean imputation
Arithmetic mean imputation (also referred to as mean substitution) takes the seemingly appealing tack of filling in the missing values with the arithmetic mean of the available cases.
$\mu_{complete} = 10.35$, $\mu_{miss} = 11.7$, $\mu_{impute} = 11.7$
Regression imputation
Regression imputation replaces missing values with predicted scores from a regression equation.
Basic idea: use information from the complete variables to fill in the incomplete variables.
Two steps:
1 Estimate a set of regression equations that predict the incomplete variables from the complete variables.
2 Generate predicted values for the incomplete variables.
Example
• Regression function: $\widehat{JP}_i = \hat{\beta}_0 + \hat{\beta}_1 (IQ_i) = -2.065 + 0.123 (IQ_i)$
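The sketch below illustrates mean imputation and regression imputation side by side. The `iq` and `jp` values are hypothetical and are not the data behind the slides, so the fitted coefficients will differ from any reported in the source. The helper `fit_simple_ols` is an illustrative name.

```python
import statistics

# Hypothetical data: iq is complete, jp (job performance) has missing values.
iq = [78, 84, 87, 91, 96, 99, 105, 112, 118, 134]
jp = [None, None, None, 9, 10, 12, 11, 13, 12, 14]

def fit_simple_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sxy / sxx
    return my - b1 * mx, b1

# Step 1: estimate the regression on the complete cases only.
pairs = [(x, y) for x, y in zip(iq, jp) if y is not None]
b0, b1 = fit_simple_ols([x for x, _ in pairs], [y for _, y in pairs])

# Step 2: replace each missing value with its predicted score.
jp_regression = [b0 + b1 * x if y is None else y for x, y in zip(iq, jp)]

# For comparison: mean imputation fills every gap with the same value.
observed_mean = statistics.mean(y for y in jp if y is not None)
jp_mean = [observed_mean if y is None else y for y in jp]

print(f"fitted line: jp = {b0:.3f} + {b1:.3f} * iq")
print("regression-imputed:", [round(v, 2) for v in jp_regression])
print("mean-imputed:      ", [round(v, 2) for v in jp_mean])
```

Mean imputation ignores `iq` entirely and shrinks the variance of `jp`. Regression imputation uses `iq`, but every imputed value falls exactly on the fitted line.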
Effects of mean and regression imputation
Figures (omitted) are from "Dealing with missing or incomplete data".
Stochastic regression imputation
Stochastic regression imputation adds random residuals to the predicted values generated by standard regression imputation.
Basic idea: restore lost variability to the data and effectively eliminate the biases associated with standard regression imputation methods.
Three steps:
1 Estimate a set of regression equations that predict the incomplete variables from the complete variables.
2 Generate predicted values for the incomplete variables.
3 Add a normally distributed residual term to each predicted score.
Example:
• Regression function: $\widehat{JP}_i = \hat{\beta}_0 + \hat{\beta}_1 (IQ_i) + z_i = -2.065 + 0.123 (IQ_i) + z_i$, with $z_i \sim \mathrm{Normal}(0, \sigma^2_{JP|IQ})$, where $\sigma^2_{JP|IQ}$ is the residual variance.
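A minimal sketch of stochastic regression imputation, using hypothetical `iq` and `jp` values rather than the data behind the slides. Estimating the residual standard deviation from the complete cases with n − 2 degrees of freedom is an assumption of this sketch.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so the random residuals are reproducible

# Hypothetical data: iq is complete, jp (job performance) has missing values.
iq = [78, 84, 87, 91, 96, 99, 105, 112, 118, 134]
jp = [None, None, None, 9, 10, 12, 11, 13, 12, 14]

complete = [(x, y) for x, y in zip(iq, jp) if y is not None]
xs = [x for x, _ in complete]
ys = [y for _, y in complete]

# Step 1: fit jp = b0 + b1 * iq on the complete cases.
mx, my = statistics.mean(xs), statistics.mean(ys)
b1 = sum((x - mx) * (y - my) for x, y in complete) / sum((x - mx) ** 2 for x in xs)
b0 = my - b1 * mx

# Residual standard deviation from the complete cases (assumed n - 2 d.o.f.).
residuals = [y - (b0 + b1 * x) for x, y in complete]
sigma = math.sqrt(sum(r ** 2 for r in residuals) / (len(complete) - 2))

# Steps 2 and 3: predicted score plus a random residual z_i ~ Normal(0, sigma^2).
jp_stochastic = [
    b0 + b1 * x + random.gauss(0, sigma) if y is None else y
    for x, y in zip(iq, jp)
]
print([round(v, 2) for v in jp_stochastic])
```

The imputed values now scatter around the fitted line instead of sitting on it, which is what restores the variability lost by plain regression imputation.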
Of the deletion and single imputation methods above, stochastic regression imputation is the only one that gives unbiased parameter estimates under an MAR missing data mechanism.
Imputation with K-Nearest Neighbour
The idea: use the values of the K nearest neighbours to impute the missing value.
To estimate a missing value $y_{i,h}$ in the i-th observation $y_i$:
• Select the K observations whose attribute values are most similar to $y_i$.
• Estimate the missing value as
− categorical values: the most common value among the neighbours
− numerical values: the average of the neighbours' values
Weighted KNN imputation (KNNI):
$\hat{y}_{i,h} = \frac{\sum_{j \in I_{Kih}} s_i(y_j)\, y_{j,h}}{\sum_{j \in I_{Kih}} s_i(y_j)}$
where $I_{Kih}$ is taken to be the index set of the K neighbours used for $y_{i,h}$ and $s_i(y_j)$ the similarity of $y_j$ to $y_i$.
Other imputation methods
Hot-deck imputation: a collection of techniques that impute the missing values with scores from “similar” respondents.
• Example: consider a general population survey in which some respondents refuse to disclose their income. Hot-deck imputation
− classifies respondents into cells based on demographic characteristics such as gender, age, race, and marital status
− replaces the missing values with a random draw from the income distribution of respondents who share the same constellation of demographic characteristics as the individual with missing data.
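The hot-deck procedure in the income example can be sketched as follows. The respondents, the choice of cell variables and the helper `hot_deck_impute` are hypothetical, and drawing one donor uniformly at random within the cell is only one of several hot-deck variants.

```python
import random
from collections import defaultdict

random.seed(7)  # fixed seed so the donor draws are reproducible

# Hypothetical survey records; income None means the respondent refused.
respondents = [
    {"gender": "F", "age_group": "30-39", "income": 52000},
    {"gender": "F", "age_group": "30-39", "income": 61000},
    {"gender": "F", "age_group": "30-39", "income": None},
    {"gender": "M", "age_group": "50-59", "income": 75000},
    {"gender": "M", "age_group": "50-59", "income": None},
]

def hot_deck_impute(rows, cell_keys, target):
    """Fill missing `target` values with a random donor value from the same cell."""
    donors = defaultdict(list)
    for row in rows:
        if row[target] is not None:
            donors[tuple(row[k] for k in cell_keys)].append(row[target])
    imputed = []
    for row in rows:
        new_row = dict(row)  # copy, so the input records are left untouched
        cell = tuple(row[k] for k in cell_keys)
        if new_row[target] is None and donors[cell]:
            new_row[target] = random.choice(donors[cell])
        imputed.append(new_row)
    return imputed

completed = hot_deck_impute(respondents, ["gender", "age_group"], "income")
print([row["income"] for row in completed])
```

A respondent whose cell has no donors is left missing in this sketch. In practice the cells would be coarsened or another method used for such cases.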
Last observation carried forward (LOCF): specific to longitudinal designs
• Imputes missing repeated-measures variables with the observation that immediately precedes dropout
Evaluate a missing-data method
Minimise bias: Although it is well known that missing data can introduce bias into parameter estimates, a good method should make that bias as small as possible.
Maximise the use of available information: We want to avoid discarding any data, and we want to use the available data to produce parameter estimates that are efficient (i.e., have minimum sampling variability).
Yield good estimates of uncertainty: We want accurate estimates of standard errors, confidence intervals and p-values.
Summary
What we discussed
• Missing value patterns
• Missing value mechanisms
• Different methods used to handle missing values
Acknowledgement: the content of these slides is based on
• Chapters 1 and 2 in "Applied Missing Data Analysis" by Craig K. Enders
• "A review of methods for missing data" by Therese D. Pigott
• "Dealing with missing data: Key assumptions and methods for applied analysis" by Marina Soley-Bori
• Chapters 2 and 3 in "Missing data" by Paul D. Allison
Assessment 2 released.
• Due date: Wednesday 3 Oct.