Seminar 7 Get Started with Data Pre-Processing using the dplyr package then apply
missing data methods to the Titanic passengers dataset in R
First install the dplyr package with: install.packages(“dplyr”, dependencies=TRUE) then add it to your R library using: library(“dplyr”)
When using dplyr remember the following:
1. first argument is always a data frame
2. subsequent arguments say what to do with the data frame
3. the sequence of arguments will return a new data frame (data frame used in the first argument is not changed unless you do so by assignation (i.e. using <- ))
Now go through the following code entering each line into R’s command line interface:
Now that you have learned to use the dplyr verbs try to answer the following questions using this dataset: beer <- read.csv("http://www.hodgett.co.uk/brewdog.csv", header=TRUE)
• Create a data frame containing only beers with an ABV of 8 or greater.
• Create a data frame containing only beers a bottle size of 375ml.
• Create a data frame with only 2 columns (beer name and price)
• Create a data frame with only 2 columns (beer name and price) where the beers are ordered in ascending order of price.
• Create a data frame with only 2 columns (beer name and price) where the beers are ordered in descending order of price.
• Create a data frame with only IPA beers listed in order of ascending price.
• Create a version of the brewdog beer dataset with an additional column containing Price times ABV.
• Create a data frame summarising every variable in the brewdog beer dataset.
The answers:
1. filter(beer, ABV >= 8)
2. filter(beer, Bottle.Size..ml. == “375ml”)
3. select(beer, Name, Price)
4. arrange(select(beer, Name, Price), Price)
5. arrange(select(beer, Name, Price), desc(Price))
6. arrange(filter(beer, Style == “IPA”), Price)
7. mutate(beer, ABV_Price = Price * ABV)
8. summary(beer)
Now we can move onto dealing with missing data using the titanic passengers’ dataset which has many missing values.
To start, import the following data:
data <- read.csv("http://www.hodgett.co.uk/titanic.csv", header=TRUE)
If you run head(data) you will see that there are 14 columns in the data frame, the table below describes what the columns names mean:
You can identify the number of complete cases with sum(complete.cases(data)) and the number of incomplete cases with sum(!complete.cases(data)).
You can have a better look at the data using a bar and combinations plot from the VIM package. Firstly install and add the VIM Package to your R library using install.packages("VIM") &
library("VIM")
Then use the following code to create the plot: aggr(data, numbers=TRUE, prop=FALSE)
From this you can see that the variables body and age have a very high percentage of missing values. In fact there are more rows with age and body missing than there are complete cases:
Explore the data
Before getting stuck into replacing the missing values in the dataset, here are a few things you could explore:
• You can see how many of the passengers were male or female with summary(data$sex) or see the average age with summary(data$age)
• You can also find the average age with mean(data$age, na.rm=TRUE) or have a look at the range of ages with a histogram: hist(data$age)
• You could create a subset of the dataset containing data only about male survivors:
qdata <- data[which(data$sex=="male" & data$survived==1),] hist(qdata$age)
Can you use your new dplyr skills to see if more males or females survived?
• You could look to see what variables correlate with each other using a corrgram plot. First install then add the corrgram package to your R library using install.packages("corrgram") & library("corrgram"). Then you can create a corrgram plot using corrgram(data) – Dark blue indicates strong positive correlation and dark red indicates high negative correlation.
You could also look at the correlation between the incomplete cases and the other variables. To do this make a copy of the dataset using missdata <- data then add the incomplete cases as a new column in missdata using: missdata$missing <- as.numeric(!complete.cases(data)) then you can use corrgram(missdata) to generate the plot.
Simple Imputation
You can apply simple imputation to the missing ages in the dataset using the mean value.
Start by creating a copy of the dataset and call it si using: si <- data
Now use the following command to set all of the values with NA in si$age to the mean of si$age:
si$age[is.na(si$age)] <- mean(si$age, na.rm=TRUE)
You can check that the NA values have been updated by comparing summary(data) and summary(si)
Multiple Imputation
Now apply multiple imputation to the missing ages in the dataset using the mice package in R.
To start, install and add the mice package with: install.packages("mice")&library("mice")
As the titanic dataset is relatively large with mixed datatypes, running a multiple imputation with all of the variables would take a very long time, so instead we will use a subset of the data and save the imputation to the variable imi:
imi <- mice( subset(data, select = c('survived', 'pclass', 'sex', 'age', 'embarked')), m = 5, maxit = 10)
Then you can save the imputed values to the variable mi with: mi <- complete(imi)
You can then compare the imputed age values for single and multiple imputation with: summary(si$age) and summary(mi$age)
hist(si$age) and hist(mi$age)
As you would expect with simple imputation the histogram shows a high frequency of values around
the mean whilst multiple imputation shows values that are more evenly spread.