Tidy data
STT 180 Module 2 Lecture 4
Dola Pathak
Michigan State University
(Michigan State University) Introduction to Data Science 1 / 15
Learning objectives
1 Able to identify messy data.
2 Understand when to clean messy data.
3 Learn to effectively use the tools in tidyverse to tidy data.
(Michigan State University) Introduction to Data Science 2 / 15
Tidy data
To follow along, load the tidyverse package with
library(tidyverse)
Type table1 in your Console pane
table1
# A tibble: 6 x 4
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
This is a tidy data set that consists of data on tuberculosis cases for three
countries in two years. The country’s population is also given.
(Michigan State University) Introduction to Data Science 3 / 15
Untidy data
Some common examples of untidy data sets:
table2
# A tibble: 12 x 4
country year type count
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
(Michigan State University) Introduction to Data Science 4 / 15
Untidy data
table3
# A tibble: 6 x 3
country year rate
*
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
(Michigan State University) Introduction to Data Science 5 / 15
Untidy data
table4a
# A tibble: 3 x 3
country `1999` `2000`
*
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
table4b
# A tibble: 3 x 3
country `1999` `2000`
*
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
(Michigan State University) Introduction to Data Science 6 / 15
Untidy data
table5
# A tibble: 6 x 4
country century year rate
*
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
(Michigan State University) Introduction to Data Science 7 / 15
What defines a tidy data set?
A tidy data set has the following three properties:
1 each variable has its own column,
2 each observation (case) has its own row,
3 each value has its own cell.
(Michigan State University) Introduction to Data Science 8 / 15
Why tidy data?
Tidy data will allow you to
1 make use of R’s vectorized operations,
2 easily subset and manipulate the data frame or tibble,
3 effectively use ggplot2 and dplyr,
4 efficiently collaborate.
(Michigan State University) Introduction to Data Science 9 / 15
Package tidyr
Package tidyr contains four functions to help you tackle common
problems associated with messy data. This package is automatically
loaded when you load tidyverse. The functions are
1 gather()
2 spread()
3 separate()
4 unite()
(Michigan State University) Introduction to Data Science 10 / 15
Package tidyr
Problem: I have a variable spread across multiple columns.
table4a
# A tibble: 3 x 3
country `1999` `2000`
*
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
gather(data = table4a, key = “year”,
value = “cases”, -1)
# A tibble: 6 x 3
country year cases
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
Function gather() transforms your data from wide format to long format.
(Michigan State University) Introduction to Data Science 11 / 15
Package tidyr
Problem: I have a variable spread across multiple columns.
table4a
# A tibble: 3 x 3
country `1999` `2000`
*
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
gather(data = table4a, key = “year”,
value = “cases”, -1)
# A tibble: 6 x 3
country year cases
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
Function gather() transforms your data from wide format to long format.
(Michigan State University) Introduction to Data Science 11 / 15
Package tidyr
Problem: I have a variable spread across multiple columns.
table4a
# A tibble: 3 x 3
country `1999` `2000`
*
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
gather(data = table4a, key = “year”,
value = “cases”, -1)
# A tibble: 6 x 3
country year cases
1 Afghanistan 1999 745
2 Brazil 1999 37737
3 China 1999 212258
4 Afghanistan 2000 2666
5 Brazil 2000 80488
6 China 2000 213766
Function gather() transforms your data from wide format to long format.
(Michigan State University) Introduction to Data Science 11 / 15
Package tidyr
Problem: I have observations scattered across multiple rows.
table2
# A tibble: 12 x 4
country year type count
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
spread(data = table2, key = “type”, value = “count”)
# A tibble: 6 x 4
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Function spread() transforms your data from long format to wide format.
(Michigan State University) Introduction to Data Science 12 / 15
Package tidyr
Problem: I have observations scattered across multiple rows.
table2
# A tibble: 12 x 4
country year type count
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
spread(data = table2, key = “type”, value = “count”)
# A tibble: 6 x 4
country year cases population
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Function spread() transforms your data from long format to wide format.
(Michigan State University) Introduction to Data Science 12 / 15
Package tidyr
Problem: I have cells with multiple values.
table3
# A tibble: 6 x 3
country year rate
*
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
separate(data = table3, col = “rate”,
into = c(“cases”, “population”),
sep = “/”)
# A tibble: 6 x 4
country year cases population
*
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Function separate() will split a column into multiple columns at a
specified character.
(Michigan State University) Introduction to Data Science 13 / 15
Package tidyr
Problem: I have cells with multiple values.
table3
# A tibble: 6 x 3
country year rate
*
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
separate(data = table3, col = “rate”,
into = c(“cases”, “population”),
sep = “/”)
# A tibble: 6 x 4
country year cases population
*
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Function separate() will split a column into multiple columns at a
specified character.
(Michigan State University) Introduction to Data Science 13 / 15
Package tidyr
Problem: I have one value spread across multiple cells.
table5
# A tibble: 6 x 4
country century year rate
*
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
unite(data = table5, col = “year”,
century:year, sep = “”)
# A tibble: 6 x 3
country year rate
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Function unite() will paste together multiple columns into a single
column.
(Michigan State University) Introduction to Data Science 14 / 15
Package tidyr
Problem: I have one value spread across multiple cells.
table5
# A tibble: 6 x 4
country century year rate
*
1 Afghanistan 19 99 745/19987071
2 Afghanistan 20 00 2666/20595360
3 Brazil 19 99 37737/172006362
4 Brazil 20 00 80488/174504898
5 China 19 99 212258/1272915272
6 China 20 00 213766/1280428583
unite(data = table5, col = “year”,
century:year, sep = “”)
# A tibble: 6 x 3
country year rate
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
Function unite() will paste together multiple columns into a single
column.
(Michigan State University) Introduction to Data Science 14 / 15
Helpful guidelines
• Every messy data set is different.
• Tidy the data one step at a time, and save the result at each step.
• Look at help(package = “tidyr”) for other functions in package
tidyr.
• It may be easier to tidy some things before you read the data into R.
• Tidy data may not be possible for certain data sets (gene expression
data, corpus objects, igraph objects). Be willing to work with other
structures.
(Michigan State University) Introduction to Data Science 15 / 15