CS计算机代考程序代写 data science Tidy data

Tidy data
STT 180 Module 2 Lecture 4

Dola Pathak

Michigan State University

(Michigan State University) Introduction to Data Science 1 / 15

Learning objectives

1 Able to identify messy data.

2 Understand when to clean messy data.

3 Learn to effectively use the tools in tidyverse to tidy data.

(Michigan State University) Introduction to Data Science 2 / 15

Tidy data

To follow along, load the tidyverse package with

library(tidyverse)

Type table1 in your Console pane

table1

# A tibble: 6 x 4

country year cases population

1 Afghanistan 1999 745 19987071

2 Afghanistan 2000 2666 20595360

3 Brazil 1999 37737 172006362

4 Brazil 2000 80488 174504898

5 China 1999 212258 1272915272

6 China 2000 213766 1280428583

This is a tidy data set that consists of data on tuberculosis cases for three
countries in two years. The country’s population is also given.

(Michigan State University) Introduction to Data Science 3 / 15

Untidy data

Some common examples of untidy data sets:

table2

# A tibble: 12 x 4

country year type count

1 Afghanistan 1999 cases 745

2 Afghanistan 1999 population 19987071

3 Afghanistan 2000 cases 2666

4 Afghanistan 2000 population 20595360

5 Brazil 1999 cases 37737

6 Brazil 1999 population 172006362

7 Brazil 2000 cases 80488

8 Brazil 2000 population 174504898

9 China 1999 cases 212258

10 China 1999 population 1272915272

11 China 2000 cases 213766

12 China 2000 population 1280428583

(Michigan State University) Introduction to Data Science 4 / 15

Untidy data

table3

# A tibble: 6 x 3

country year rate

1 Afghanistan 1999 745/19987071

2 Afghanistan 2000 2666/20595360

3 Brazil 1999 37737/172006362

4 Brazil 2000 80488/174504898

5 China 1999 212258/1272915272

6 China 2000 213766/1280428583

(Michigan State University) Introduction to Data Science 5 / 15

Untidy data

table4a

# A tibble: 3 x 3

country `1999` `2000`

1 Afghanistan 745 2666

2 Brazil 37737 80488

3 China 212258 213766

table4b

# A tibble: 3 x 3

country `1999` `2000`

1 Afghanistan 19987071 20595360

2 Brazil 172006362 174504898

3 China 1272915272 1280428583

(Michigan State University) Introduction to Data Science 6 / 15

Untidy data

table5

# A tibble: 6 x 4

country century year rate

1 Afghanistan 19 99 745/19987071

2 Afghanistan 20 00 2666/20595360

3 Brazil 19 99 37737/172006362

4 Brazil 20 00 80488/174504898

5 China 19 99 212258/1272915272

6 China 20 00 213766/1280428583

(Michigan State University) Introduction to Data Science 7 / 15

What defines a tidy data set?

A tidy data set has the following three properties:

1 each variable has its own column,

2 each observation (case) has its own row,

3 each value has its own cell.

(Michigan State University) Introduction to Data Science 8 / 15

Why tidy data?

Tidy data will allow you to

1 make use of R’s vectorized operations,

2 easily subset and manipulate the data frame or tibble,

3 effectively use ggplot2 and dplyr,

4 efficiently collaborate.

(Michigan State University) Introduction to Data Science 9 / 15

Package tidyr

Package tidyr contains four functions to help you tackle common
problems associated with messy data. This package is automatically
loaded when you load tidyverse. The functions are

1 gather()

2 spread()

3 separate()

4 unite()

(Michigan State University) Introduction to Data Science 10 / 15

Package tidyr

Problem: I have a variable spread across multiple columns.

table4a

# A tibble: 3 x 3

country `1999` `2000`

1 Afghanistan 745 2666

2 Brazil 37737 80488

3 China 212258 213766

gather(data = table4a, key = “year”,

value = “cases”, -1)

# A tibble: 6 x 3

country year cases

1 Afghanistan 1999 745

2 Brazil 1999 37737

3 China 1999 212258

4 Afghanistan 2000 2666

5 Brazil 2000 80488

6 China 2000 213766

Function gather() transforms your data from wide format to long format.

(Michigan State University) Introduction to Data Science 11 / 15

Package tidyr

Problem: I have a variable spread across multiple columns.

table4a

# A tibble: 3 x 3

country `1999` `2000`

1 Afghanistan 745 2666

2 Brazil 37737 80488

3 China 212258 213766

gather(data = table4a, key = “year”,

value = “cases”, -1)

# A tibble: 6 x 3

country year cases

1 Afghanistan 1999 745

2 Brazil 1999 37737

3 China 1999 212258

4 Afghanistan 2000 2666

5 Brazil 2000 80488

6 China 2000 213766

Function gather() transforms your data from wide format to long format.

(Michigan State University) Introduction to Data Science 11 / 15

Package tidyr

Problem: I have a variable spread across multiple columns.

table4a

# A tibble: 3 x 3

country `1999` `2000`

1 Afghanistan 745 2666

2 Brazil 37737 80488

3 China 212258 213766

gather(data = table4a, key = “year”,

value = “cases”, -1)

# A tibble: 6 x 3

country year cases

1 Afghanistan 1999 745

2 Brazil 1999 37737

3 China 1999 212258

4 Afghanistan 2000 2666

5 Brazil 2000 80488

6 China 2000 213766

Function gather() transforms your data from wide format to long format.

(Michigan State University) Introduction to Data Science 11 / 15

Package tidyr

Problem: I have observations scattered across multiple rows.

table2

# A tibble: 12 x 4

country year type count

1 Afghanistan 1999 cases 745

2 Afghanistan 1999 population 19987071

3 Afghanistan 2000 cases 2666

4 Afghanistan 2000 population 20595360

5 Brazil 1999 cases 37737

6 Brazil 1999 population 172006362

7 Brazil 2000 cases 80488

8 Brazil 2000 population 174504898

9 China 1999 cases 212258

10 China 1999 population 1272915272

11 China 2000 cases 213766

12 China 2000 population 1280428583

spread(data = table2, key = “type”, value = “count”)

# A tibble: 6 x 4

country year cases population

1 Afghanistan 1999 745 19987071

2 Afghanistan 2000 2666 20595360

3 Brazil 1999 37737 172006362

4 Brazil 2000 80488 174504898

5 China 1999 212258 1272915272

6 China 2000 213766 1280428583

Function spread() transforms your data from long format to wide format.

(Michigan State University) Introduction to Data Science 12 / 15

Package tidyr

Problem: I have observations scattered across multiple rows.

table2

# A tibble: 12 x 4

country year type count

1 Afghanistan 1999 cases 745

2 Afghanistan 1999 population 19987071

3 Afghanistan 2000 cases 2666

4 Afghanistan 2000 population 20595360

5 Brazil 1999 cases 37737

6 Brazil 1999 population 172006362

7 Brazil 2000 cases 80488

8 Brazil 2000 population 174504898

9 China 1999 cases 212258

10 China 1999 population 1272915272

11 China 2000 cases 213766

12 China 2000 population 1280428583

spread(data = table2, key = “type”, value = “count”)

# A tibble: 6 x 4

country year cases population

1 Afghanistan 1999 745 19987071

2 Afghanistan 2000 2666 20595360

3 Brazil 1999 37737 172006362

4 Brazil 2000 80488 174504898

5 China 1999 212258 1272915272

6 China 2000 213766 1280428583

Function spread() transforms your data from long format to wide format.

(Michigan State University) Introduction to Data Science 12 / 15

Package tidyr

Problem: I have cells with multiple values.

table3

# A tibble: 6 x 3

country year rate

1 Afghanistan 1999 745/19987071

2 Afghanistan 2000 2666/20595360

3 Brazil 1999 37737/172006362

4 Brazil 2000 80488/174504898

5 China 1999 212258/1272915272

6 China 2000 213766/1280428583

separate(data = table3, col = “rate”,

into = c(“cases”, “population”),

sep = “/”)

# A tibble: 6 x 4

country year cases population

1 Afghanistan 1999 745 19987071

2 Afghanistan 2000 2666 20595360

3 Brazil 1999 37737 172006362

4 Brazil 2000 80488 174504898

5 China 1999 212258 1272915272

6 China 2000 213766 1280428583

Function separate() will split a column into multiple columns at a
specified character.

(Michigan State University) Introduction to Data Science 13 / 15

Package tidyr

Problem: I have cells with multiple values.

table3

# A tibble: 6 x 3

country year rate

1 Afghanistan 1999 745/19987071

2 Afghanistan 2000 2666/20595360

3 Brazil 1999 37737/172006362

4 Brazil 2000 80488/174504898

5 China 1999 212258/1272915272

6 China 2000 213766/1280428583

separate(data = table3, col = “rate”,

into = c(“cases”, “population”),

sep = “/”)

# A tibble: 6 x 4

country year cases population

1 Afghanistan 1999 745 19987071

2 Afghanistan 2000 2666 20595360

3 Brazil 1999 37737 172006362

4 Brazil 2000 80488 174504898

5 China 1999 212258 1272915272

6 China 2000 213766 1280428583

Function separate() will split a column into multiple columns at a
specified character.

(Michigan State University) Introduction to Data Science 13 / 15

Package tidyr

Problem: I have one value spread across multiple cells.

table5

# A tibble: 6 x 4

country century year rate

1 Afghanistan 19 99 745/19987071

2 Afghanistan 20 00 2666/20595360

3 Brazil 19 99 37737/172006362

4 Brazil 20 00 80488/174504898

5 China 19 99 212258/1272915272

6 China 20 00 213766/1280428583

unite(data = table5, col = “year”,

century:year, sep = “”)

# A tibble: 6 x 3

country year rate

1 Afghanistan 1999 745/19987071

2 Afghanistan 2000 2666/20595360

3 Brazil 1999 37737/172006362

4 Brazil 2000 80488/174504898

5 China 1999 212258/1272915272

6 China 2000 213766/1280428583

Function unite() will paste together multiple columns into a single
column.

(Michigan State University) Introduction to Data Science 14 / 15

Package tidyr

Problem: I have one value spread across multiple cells.

table5

# A tibble: 6 x 4

country century year rate

1 Afghanistan 19 99 745/19987071

2 Afghanistan 20 00 2666/20595360

3 Brazil 19 99 37737/172006362

4 Brazil 20 00 80488/174504898

5 China 19 99 212258/1272915272

6 China 20 00 213766/1280428583

unite(data = table5, col = “year”,

century:year, sep = “”)

# A tibble: 6 x 3

country year rate

1 Afghanistan 1999 745/19987071

2 Afghanistan 2000 2666/20595360

3 Brazil 1999 37737/172006362

4 Brazil 2000 80488/174504898

5 China 1999 212258/1272915272

6 China 2000 213766/1280428583

Function unite() will paste together multiple columns into a single
column.

(Michigan State University) Introduction to Data Science 14 / 15

Helpful guidelines

• Every messy data set is different.

• Tidy the data one step at a time, and save the result at each step.

• Look at help(package = “tidyr”) for other functions in package
tidyr.

• It may be easier to tidy some things before you read the data into R.

• Tidy data may not be possible for certain data sets (gene expression
data, corpus objects, igraph objects). Be willing to work with other
structures.

(Michigan State University) Introduction to Data Science 15 / 15

Related Posts