CS计算机代考程序代写 case study Stat 260, Lecture 6, Tidy Data

Stat 260, Lecture 6, Tidy Data

Stat 260, Lecture 6, Tidy Data

David Stenning

1 / 33

Load packages and datasets

library(tidyverse)
table1 <- read_csv("lec06table1.csv", col_types=cols( country=col_character(), year=col_integer(), cases=col_integer(), population=col_integer() ) ) 2 / 33 Create tibbles used in examples table2 <- table1 %>%
pivot_longer(c(cases,population),names_to=”type”,values_to=”count”) %>%
arrange(country,year)

table3 <- table1 %>%
mutate(rate = paste(cases,population,sep=”/”)) %>%
select(-cases,-population)

table4a <- table1 %>%
select(country,year,cases) %>%
pivot_wider(names_from=year,values_from=cases)

table4b <- table1 %>%
select(country,year,population) %>%
pivot_wider(names_from=year,values_from=population)

3 / 33

Reading

Required Reading:

I Tidy Data: Chapter 12 of online text.

Useful References:

I Advanced: Wickham (2014) [https://www.jstatsoft.org/index.
php/jss/article/view/v059i10/v59i10.pdf]

I Data import (readr/tidyr) cheatsheet at
[https://github.com/rstudio/cheatsheets/raw/master/data-
import.pdf]

4 / 33

https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf
https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf

Tidy data

I In a tidy dataset,
I each variable has its own column,
I each observation has its own row, and
I each value has its own cell.

I (See Figure 12.1 of the online text.)

I The tibble table1 that we read in earlier is tidy:
table1

## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

5 / 33

Why is table1 tidy?

I It is never straightforward to answer this.

I These are WHO data on Tuberculosis cases.
I The variables are country, year, number of cases and population
I The observations are country/year combinations.
I Each value is in its own cell.

I country and year describe the observational unit, and so
there must be one row for each, and there must be variables
that record country and year.

I cases and population are what we measure on the
observational unit and so must be variables.

6 / 33

Non-tidy data
I There are many ways to be non-tidy.
I Exercise: Why are table2 and table4a not tidy?

print(table2,n=6)

## # A tibble: 12 x 4
## country year type count
##
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## # … with 6 more rows
print(table4a,n=6)

## # A tibble: 3 x 3
## country `1999` `2000`
##
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

7 / 33

Why tidy?

I Good statistics (exploratory, visualization, modelling) requires
that we identify the observational unit.

I R is efficient at computing with vectors, so variables as
column-vectors are efficient.

I Tidyverse tools require it; e.g.,
table1 %>% mutate(rate=cases/population*100000)

## # A tibble: 6 x 5
## country year cases population rate
##
## 1 Afghanistan 1999 745 19987071 3.73
## 2 Afghanistan 2000 2666 20595360 12.9
## 3 Brazil 1999 37737 172006362 21.9
## 4 Brazil 2000 80488 174504898 46.1
## 5 China 1999 212258 1272915272 16.7
## 6 China 2000 213766 1280428583 16.7

8 / 33

table1 %>% group_by(year) %>% summarize(sum(cases))

## # A tibble: 2 x 2
## year `sum(cases)`
##
## 1 1999 250740
## 2 2000 296920
ggplot(table1,aes(x=year,y=cases,color=country)) + geom_point()

0

50000

100000

150000

200000

1999.00 1999.25 1999.50 1999.75 2000.00
year

ca
se

s

country

Afghanistan

Brazil

China

9 / 33

Exercises

I Compute rate from table2.
I Compute rate from table4a and table4b.

10 / 33

pivot_longer()

table4a

## # A tibble: 3 x 3
## country `1999` `2000`
##
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766

I The problem with table 4a is that the cases variable is split
across the columns 1999 and 2000.

11 / 33

pivot_longer()

I To tidy table 4a, we must “pivot” these columns into a new
pair of variables, resulting in a longer tidy table.

I For this operation we need:
(1) the set of columns whose names are not variables but rather

values, which is c(1999,2000),
(2) the name of the variable to move the column names to, which

is year, and
(3) the name of the variable to move the column values to, which is

cases.

12 / 33

pivot_longer()

table4a %>% pivot_longer(c(`1999`,`2000`),names_to=”year”,values_to=”cases”)

## # A tibble: 6 x 3
## country year cases
##
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil 1999 37737
## 4 Brazil 2000 80488
## 5 China 1999 212258
## 6 China 2000 213766
# year and cases are in quotes because those variables are not in table4a

I Exercise Repeat for table4b

13 / 33

pivot_longer(): another example
Billboard Top 100 rankings of songs
bb <- read_csv("billboard.csv") bb ## # A tibble: 317 x 83 ## year artist.inverted track time genre date.entered date.peaked x1st.week ##
## 1 2000 Destiny’s Child Inde~ 03:38 Rock 2000-09-23 2000-11-18 78
## 2 2000 Santana Mari~ 04:18 Rock 2000-02-12 2000-04-08 15
## 3 2000 Savage Garden I Kn~ 04:07 Rock 1999-10-23 2000-01-29 71
## 4 2000 Madonna Music 03:45 Rock 2000-08-12 2000-09-16 41
## 5 2000 Aguilera, Chri~ Come~ 03:38 Rock 2000-08-05 2000-10-14 57
## 6 2000 Janet Does~ 04:17 Rock 2000-06-17 2000-08-26 59
## 7 2000 Destiny’s Child Say ~ 04:31 Rock 1999-12-25 2000-03-18 83
## 8 2000 Iglesias, Enri~ Be W~ 03:36 Latin 2000-04-01 2000-06-24 63
## 9 2000 Sisqo Inco~ 03:52 Rock 2000-06-24 2000-08-12 77
## 10 2000 Lonestar Amaz~ 04:25 Coun~ 1999-06-05 2000-03-04 81
## # … with 307 more rows, and 75 more variables: x2nd.week ,
## # x3rd.week , x4th.week , x5th.week , x6th.week ,
## # x7th.week , x8th.week , x9th.week , x10th.week ,
## # x11th.week , x12th.week , x13th.week , x14th.week ,
## # x15th.week , x16th.week , x17th.week , x18th.week ,
## # x19th.week , x20th.week , x21st.week , x22nd.week ,
## # x23rd.week , x24th.week , x25th.week , x26th.week ,
## # x27th.week , x28th.week , x29th.week , x30th.week ,
## # x31st.week , x32nd.week , x33rd.week , x34th.week ,
## # x35th.week , x36th.week , x37th.week , x38th.week ,
## # x39th.week , x40th.week , x41st.week , x42nd.week ,
## # x43rd.week , x44th.week , x45th.week , x46th.week ,
## # x47th.week , x48th.week , x49th.week , x50th.week ,
## # x51st.week , x52nd.week , x53rd.week , x54th.week ,
## # x55th.week , x56th.week , x57th.week , x58th.week ,
## # x59th.week , x60th.week , x61st.week , x62nd.week ,
## # x63rd.week , x64th.week , x65th.week , x66th.week ,
## # x67th.week , x68th.week , x69th.week , x70th.week ,
## # x71st.week , x72nd.week , x73rd.week , x74th.week ,
## # x75th.week , x76th.week

14 / 33

Structure of the Billboard table

I Columns year through date.peaked describe the song, then
x1st.week through x76th.week are the chart positions for
the first through 76th weeks.
I If a song is on the chart for less than 76 weeks, its position is

NA for any missing weeks.
I Weeks are not variables, they are the time data for the time

series.

15 / 33

Tidying the Billboard data

I Main step is to pivot the rankings in the different weeks into a
rank variable.

I Before pivoting, will select/rename some of the variables.
I After pivoting, will create some new variables and sort the data

frame.

16 / 33

Select and rename

I Won’t need time or genre.
I Recall that select() from dplyr can use – to de-select

I Rename artist.inverted
I Recall that rename() from dplyr takes arguments of the form

newname = oldname
bb <- bb %>% select(-time,-genre) %>%
rename(artist = artist.inverted)

17 / 33

Pivoting the weeks into a “longer” version of the Billboard
data

I We will leave each song info variable as-is.
I The set of columns whose names are values, not variables, are

the weeks.
I The name of the variable to move the column names,

i.e.,x1st.week:x76th.week, to will be weeks.
I The name of the variable to move the column values, i.e. the

chart positions, to will be rank.
I There will be missing values, which we can remove.

18 / 33

bb %>% pivot_longer(c(x1st.week:x76th.week),names_to=”week”,
values_to=”rank”,values_drop_na=TRUE) %>%

mutate(week= parse_number(week)) %>% # replace x1st.week with 1, etc.
arrange(artist,track,week) # sorting by artist, track, week

## # A tibble: 5,307 x 7
## year artist track date.entered date.peaked week rank
##
## 1 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 1 87
## 2 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 2 82
## 3 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 3 72
## 4 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 4 77
## 5 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 5 87
## 6 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 6 94
## 7 2000 2 Pac Baby Don’t Cry (Keep Ya H~ 2000-02-26 2000-03-11 7 99
## 8 2000 2Ge+her The Hardest Part Of Break~ 2000-09-02 2000-09-09 1 91
## 9 2000 2Ge+her The Hardest Part Of Break~ 2000-09-02 2000-09-09 2 87
## 10 2000 2Ge+her The Hardest Part Of Break~ 2000-09-02 2000-09-09 3 92
## # … with 5,297 more rows

19 / 33

pivot_wider()
I The opposite pivot to pivot_longer() is pivot_wider().
I It’s used when observations are split across multiple rows.
I E.G., table2 has observations for each country/year split

across two rows:
table2

## # A tibble: 12 x 4
## country year type count
##
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583

20 / 33

Pivoting table2

I To tidy table2 we need two parameters to pivot with:
(1) the column to take variable names from, which is type, and
(2) the column to take values from, which is count.

table2 %>% pivot_wider(names_from=type,values_from=count)

## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
# Here, the variables `type` and `count` already exist in table2,
# so we don’t include quotes around them in pivot_wider().

I Exercise Select country, year and cases from table1 and pivot to obtain a
table with rows for each year and columns for each country. (Note: such data is
not tidy.)

21 / 33

Separating
I separate() splits a column on a specified separator, or at a

specified character number.
print(table3,n=4)

## # A tibble: 6 x 3
## country year rate
##
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## # … with 2 more rows
table3 %>% separate(rate,into=c(“cases”,”population”),sep=”/”) %>% print(n=4)

## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## # … with 2 more rows

I Notice that cases and population are character columns.
22 / 33

Separating based on character number
print(table3,n=4)

## # A tibble: 6 x 3
## country year rate
##
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## # … with 2 more rows
table3 %>% separate(rate,into=c(“first”,”remainder”),sep=1)

## # A tibble: 6 x 4
## country year first remainder
##
## 1 Afghanistan 1999 7 45/19987071
## 2 Afghanistan 2000 2 666/20595360
## 3 Brazil 1999 3 7737/172006362
## 4 Brazil 2000 8 0488/174504898
## 5 China 1999 2 12258/1272915272
## 6 China 2000 2 13766/1280428583

23 / 33

Convert type of columns after separating

table3 %>% separate(rate,into=c(“cases”,”population”),sep=”/”,
convert=TRUE)

## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583

24 / 33

Missing data

I When we used pivot_longer() on the Billboard data we set
values_drop_na=TRUE to remove weeks where a given track
was not on the charts.

bb %>% select(track,x23rd.week:x25th.week) %>% print(n=4)

## # A tibble: 317 x 4
## track x23rd.week x24th.week x25th.week
##
## 1 Independent Women Part I 10 12 15
## 2 Maria, Maria 26 36 48
## 3 I Knew I Loved You 8 12 14
## 4 Music 29 44 NA
## # … with 313 more rows

I The missings in bb are “explicit”; when pivot_longer()
removes them they become “implicit” (e.g., no row for week 25
for track 4).

25 / 33

Making implicit missing data explicit

I pivot_wider() will make implicit missing values explicit if
needed for a row.

stocks <- tibble( year=c(2015,2016,2016), qtr = c(1,1,2), return = c(1.0,2.0,3.0)) stocks ## # A tibble: 3 x 3 ## year qtr return ##
## 1 2015 1 1
## 2 2016 1 2
## 3 2016 2 3
stocks %>% pivot_wider(names_from=year,values_from=return)

## # A tibble: 2 x 3
## qtr `2015` `2016`
##
## 1 1 1 2
## 2 2 NA 3

26 / 33

Make implicit explicit with complete()

I complete() creates rows for all combinations of input
variables and fills in missing values where necessary.

stocks %>% complete(year,qtr)

## # A tibble: 4 x 3
## year qtr return
##
## 1 2015 1 1
## 2 2015 2 NA
## 3 2016 1 2
## 4 2016 2 3

27 / 33

Case Study: WHO TB data

I The who dataset comes with tidyr. We’ll use a related (less
tidy) version from the WHO website.

tb <- read_csv("tb.csv") 28 / 33 Structure of TB table I First column is 2-letter country code, second is year, third is number of new cases for that country/year. I Then come TB counts for different gender/age categories. I new_sp is “new cases by positive pulmonary smear assay” I gender is m or f I two special age categories 0-4, 5-14, I age categories 0-14, 15-24, 25-34, 35-44, 45-54, 55-65, 65+, unknown (u) I Gender/age columns are not variables, they are data on the observed units. I Tidy data would have one row for each country, year, gender and age category, with a column of counts 29 / 33 Tidying the TB data I Recall structure of the data: country, year, count of new cases, counts of new cases by gender/age categories. names(tb)[1:10] ## [1] "iso2" "year" "new_sp" "new_sp_m04" "new_sp_m514" ## [6] "new_sp_m014" "new_sp_m1524" "new_sp_m2534" "new_sp_m3544" "new_sp_m4554" I Main step is to “pivot” TB prevalence in the different gender/age categories into a count variable. I Complicated by the coding of gender/age categories I Before pivoting, will remove unneeded variables. 30 / 33 Remove variables I Won’t need overall count I Special categories 0-4 and 5-14 overlap with 0-14, so remove I Age unknown not useful for analysing trends, so remove tb <- select(tb,-new_sp, -contains("04"), -contains("514"), -new_sp_mu, -new_sp_fu) tb ## # A tibble: 5,769 x 16 ## iso2 year new_sp_m014 new_sp_m1524 new_sp_m2534 new_sp_m3544 new_sp_m4554 ##
## 1 AD 1989 NA NA NA NA NA
## 2 AD 1990 NA NA NA NA NA
## 3 AD 1991 NA NA NA NA NA
## 4 AD 1992 NA NA NA NA NA
## 5 AD 1993 NA NA NA NA NA
## 6 AD 1994 NA NA NA NA NA
## 7 AD 1996 0 0 0 4 1
## 8 AD 1997 0 0 1 2 2
## 9 AD 1998 0 0 0 1 0
## 10 AD 1999 0 0 0 1 1
## # … with 5,759 more rows, and 9 more variables: new_sp_m5564 ,
## # new_sp_m65 , new_sp_f014 , new_sp_f1524 ,
## # new_sp_f2534 , new_sp_f3544 , new_sp_f4554 ,
## # new_sp_f5564 , new_sp_f65 31 / 33

Pivot counts for demographic groups
I Create demographic variable demog and count variable count by pivoting

over all variables except iso2 and year.

tb_longer <- tb %>%
pivot_longer(c(new_sp_m014:new_sp_f65),

names_to=”demog”,
values_to=”count”,
values_drop_na=TRUE)

tb_longer

## # A tibble: 33,615 x 4
## iso2 year demog count
##
## 1 AD 1996 new_sp_m014 0
## 2 AD 1996 new_sp_m1524 0
## 3 AD 1996 new_sp_m2534 0
## 4 AD 1996 new_sp_m3544 4
## 5 AD 1996 new_sp_m4554 1
## 6 AD 1996 new_sp_m5564 0
## 7 AD 1996 new_sp_m65 0
## 8 AD 1996 new_sp_f014 0
## 9 AD 1996 new_sp_f1524 1
## 10 AD 1996 new_sp_f2534 1
## # … with 33,605 more rows

32 / 33

Separate gender from age category.

I First remove new_sp_, then separate result on first column
maxlen <- max(nchar(tb_longer$demog)) tb <- tb_longer %>% mutate(demog = substr(demog,8,maxlen)) %>%

separate(demog, into=c(“gender”,”agecat”),sep=1)
tb

## # A tibble: 33,615 x 5
## iso2 year gender agecat count
##
## 1 AD 1996 m 014 0
## 2 AD 1996 m 1524 0
## 3 AD 1996 m 2534 0
## 4 AD 1996 m 3544 4
## 5 AD 1996 m 4554 1
## 6 AD 1996 m 5564 0
## 7 AD 1996 m 65 0
## 8 AD 1996 f 014 0
## 9 AD 1996 f 1524 1
## 10 AD 1996 f 2534 1
## # … with 33,605 more rows

33 / 33