Day-2 : Getting Started with Data Cleaning
Delivered By: Prakash Sukhwal, Associate Lecturer & Consultant, Analytics & Intelligent Systems Practice (isspcs@nus.edu.sg)
Data Files:
1. injuries_2017_Part1.csv
2. injuries_2017_descriptions.csv
3. injuries_prods.csv
In [ ]:
In [4]:
In [5]:
# setting up the infrastructure
#install.packages(‘pacman’, repos = “http://cran.us.r-project.org”)
# using this we load the libraries
# it installs the ones we do not have in our system pacman::p_load(tidyverse, rebus,IRdisplay, lubridate, editrules)
# pay attention to the ‘/’
setwd(“C:/Users/isspcs/Desktop/workshop-data”)
In [6]:
Source: An introduction to data cleaning with R; Statistics Netherlands, The Hague/Heerlen 2013
In [8]:
injury = read.csv(‘injuries_2017_Part1.csv’, sep=”,”, comment.char = “”, stringsAsFactor = F)
head(injury,2)
dim(injury)
case_num trmt_date age sex race body_part
diag diag_other disp
170510444 29/4/2017 3 Male White
170633260 7/6/2017 70 Male Not stated
386906 13
1. Technically Correct Data
Finger Internal
Crushing
Ingested Object
NA Re NA A
– each value in the table belongs to a variable
– has a correct data type as acceptable in real world
1.1 coerce to correct data types
In [ ]:
In [7]:
In [ ]:
# check data types of all the columns
sapply(injury, class)
# create a value frequency map for weight
# write code
# data entry error 4 1/2 instead of 4.5 weight_cleaned = str_replace_all( ) # complete code
In [ ]:
typeof(weight_cleaned)
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
1.2 Data Augmentation
head(weight_cleaned)
# remove spaces
# complete code
weight_cleaned = str_replace_all()
# add weight_cleaned to injury dataframe
# complete code
injury$weight =
head(injury)
dim(injury)
# examine the datatypes
sapply(injury, class)
summary(injury)
– bring in more data
– joins
– files:
narratives (injuries_2017_descriptions.csv)
product names (injuries_prods.csv)
In [37]:
In [ ]:
# read in narratives
# complete code
injury_descp = read.csv() head(injury_descp) dim(injury_descp)
In [ ]:
# perform a left join using by = ‘case_num’
# complete code
injury_data_whole_v1 = left_join(, , by = “case_num”)
In [ ]:
head(injury_data_whole_v1)
dim(injury_data_whole_v1)
In [ ]:
# read in prod names
# complete code
injury_prods = read.csv() head(injury_prods) dim(injury_prods)
In [ ]:
In [ ]:
In [ ]:
1.3 fix Data types and handle dates
In [ ]:
# now the names are different for the key column
# complete code
injury_data_whole_v2 = left_join(, , by = c( = ))
head(injury_data_whole_v2)
dim(injury_data_whole_v2)
# set columns to factors
columns = c(‘sex’, ‘race’, ‘body_part’, ‘disposition’, ‘location’, ‘title’) injury_data_whole_v2[, columns] = lapply(, ) # complete code
In [ ]:
# date field
# complete code to set date field with proper type
# use lubridate package
injury_data_whole_v2$trmt_date =
In [ ]:
1.4 unique string values
– need for normalization
In [ ]:
# examine the datatypes
sapply(injury_data_whole_v2, class)
# examine the unique values for possible errors
injury_data_whole_v2 %>% distinct(location)
injury_data_whole_v2 %>% distinct(sex)
injury_data_whole_v2 %>% distinct(race)
normalize strings
In [ ]:
# normalize strings
# trim the white spaces
injury_data_whole_v2$location = injury_data_whole_v2$sex =
standardize values
In [ ]:
# standardize values
# complete code
injury_data_whole_v2 = injury_data_whole_v2%>%
mutate(gender = recode_factor())
In [ ]:
2. Consistent Data
– technically correct data which can be further used for analysis.
– may have follwing issues:
1. duplicates
2. missings (NA)
3. errors
4. outliers etc.
2.1 Detection & Localization of Errors
– univariate and multivariate constraint violations
A. Duplicates
example
In [ ]:
# cross tabulate values for sanity check
table(injury_data_whole_v2$gender, injury_data_whole_v2$sex)
# example
x = c(1:5, 3:8, 7,8) duplicated(x) anyDuplicated(x)
In [ ]:
# write code for checking duplicates in dataset # hint: anyDuplicated
anyDuplicated(injury)
df[!duplicated(df[c(“col1″,”col2”)]),]
B. Missing Values
– NAs
– summary stats show the presence of NAs
In [ ]:
In [ ]:
# get rid of diag_other first as it is full of NAs
# complete code
injury_data_whole_v2 = injury_data_whole_v2 %>% select()
# assign the complete cases to version 3
# complete code
injury_data_whole_v3 = injury_data_whole_v2 %>% filter()
In [ ]:
Alternates
– df %>% na.omit
– df %>% drop_na
– df %>% filter(!is.na(x1)) # based on single column
summary(injury_data_whole_v3)
C. Outliers
In [ ]:
injury_data_whole_v3 %>% ggplot(aes(gender, age)) + geom_boxplot()
In [ ]:
removal (if needed)
x[!x %in% boxplot.stats(x)$out]
further reading: http://www.questionflow.org/2017/12/26/combined-outlier-detection- with-dplyr-and-ruler/ (http://www.questionflow.org/2017/12/26/combined-outlier- detection-with-dplyr-and-ruler/)
D. Feature Generation
– Age
– Gender
– we will use narrative column for this purpose:
sample = “70-YOM ATE LARGE PIECE OF CHICKEN W/BROCCOLI,
FELT LIKE THROAT CLOSED UP & STARTING SPITTING UP DROOL.
DX: ESOPHAGEAL OBSTRUCTION D/T FOOD IMP”
sample2 = “20 YO M C/O SMOKE INHALATION/SHOULDER INJ S/P ELECTRIC
AL FIRE/DRYER
INHOME/LIFTING ELDERLY REL FROM HOME DX RT SHLD INJ SMOKE INHALAT
ION FD?”
rebus package
notations of interest
some special character name in rebus package
# default coef is 1.5 IQR
boxplot.stats(injury_data_whole_v3$age, coef = 2)$out
SPC ==> \s
DOT ==> .
START ==> ^
END ==> $
DGT ==> \d
some function names in rebus package optional() ==> ?
or() ==> ?:
one_or_more() ==> +
capture() used with one of the above functions
In [ ]:
In [ ]:
In [ ]:
In [ ]:
# Age
pattern = capture(optional(DGT) %R% DGT) %R% capture(or(‘Y’, ‘YO’,’M’,’-‘, ” “))
sample = “1OYOF SLIPPED BACKWARDS ON PUDDLE STIKING BACK ON REFIGERATOR AT THE GROCERY STORE. NO LOC.
DX: ACUTE LOW BACK PAIN WITHOUT SCIATICA ”
str_match(sample, pattern)
# test the pattern
age = str_match() # complete code age
In [ ]:
#age2[,2]
In [ ]:
In [ ]:
In [ ]:
In [ ]:
In [ ]:
TASK: extract gender from narrative and store in gender2 column
In [ ]:
#perform string match on whole dataframe # complete code
age = str_match()
# add the extracted age to version 3 as age2 column
# complete code
injury_data_whole_v3$age2 =
head(injury_data_whole_v3,4)
dim(injury_data_whole_v3)
summary(injury_data_whole_v3$age2)
injury_data_whole_v3%>% filter(is.na(age2))
###### TASK: extract gender from narrative and
###### store in gender2 column
# gender2
# complete code
pattern =
In [ ]:
# test on both the samples
sample = “70-YOM ATE LARGE PIECE OF CHICKEN W/BROCCOLI, FELT LIKE THROAT CLOSED UP & STARTING SPITTING UP DROOL. DX: ESOPHAGEAL OBSTRUCTION D/T FOOD IMP”
sample2 = “20 YO M C/O SMOKE INHALATION/SHOULDER INJ S/P ELECTRICAL FIRE/DRYER INHOME/LIFTING ELDERLY REL FROM HOME DX RT SHLD INJ SMOKE INHALATION FD?”
gender= str_match(sample2, pattern) gender
In [ ]:
In [ ]:
In [ ]:
E. Inconsistencies (domain knowledge)
– occurs when a record has value which is not realistic (ex: Age <0)
- use basic function if checks are not complicated
- alternatively,edit rules can be applied using editrules package in
R
- best practice is to maintain edit rules in a separate log
Scenario
Let us assume that the gender documented by medical staff in narrativ
e is accurate.
write a logic to spot data entry errors
gender= str_match(injury_data_whole_v3$narrative, pattern) injury_data_whole_v3$gender2 = as.factor(gender[,2])
injury_data_whole_v3 %>% filter(is.na(gender2)) %>% head()
summary(injury_data_whole_v3)
In [ ]:
# quick look at the cross tab of 2 gender columns
# write code
In [ ]:
editrule package in R
In [ ]:
In [ ]:
# complete code for the given scenario
###### editrule package in R
# edit rule on age
(age_constraint = editset(c(“age >= 0”, “age <= 120")))
# check for violations
violations = violatedEdits(age_constraint, injury_data_whole_v3) summary(violations)
References:
1. Data Source: https://www.cpsc.gov/Research--Statistics/NEISS-Injur
y-Data; further cleaned https://gith
ub.com/hadley/neiss/tree/master/data
2. An introduction to data cleaning with R: Edwin de Jonge and Mark v
an der Loo
3. A Data Scientist's Guide to Acquiring, Cleaning and Managing Data
in R: Samuel Buttrey
4. https://stringr.tidyverse.org/reference/str_match.html
5. https://stringr.tidyverse.org/reference/str_extract.html
In [ ]:
# read more on data anonymization
# imputation techniques
# fuzzy logics
# Hmisc, MICE, deducorrect, LittleMCAR
In [ ]: