Stat 260, Lecture 4, Data Transformation
Stat 260, Lecture 4, Data Transformation
David Stenning
1 / 21
Load packages
library(gapminder)
library(dplyr)
library(nycflights13)
head(flights)
## # A tibble: 6 x 19 2 / 21 Reading Required reading: I Workflow basics: Chapter 4 of online text Also, for a helpful reference: I dplyr cheatsheet at 3 / 21 https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf Data manipulation with dplyr I We have seen the need to manipulate datasets when studying gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap))
gm07 <- filter(gapminder,year==2007)
I In this lecture we will cover data manipulation more
systematically
4 / 21
dplyr
I The tidyverse package dplyr contains intuitive tools for
manipulating data sets.
I These tools are named after verbs.
I The five key verbs (functions) are
1. filter() to select subsets of observations (rows),
2. arrange() to reorder rows,
3. select() to select variables (columns),
4. mutate() to create new variables from existing ones, and
5. summarize() to calculate summary statistics.
5 / 21
Filter rows with filter()
I We previously used filter() to extract a subset of the
gapminder dataset based on a logical condition.
gm07 <- filter(gapminder,year==2007)
I gm07 now contains the rows of gapminder in which the
variable year is equal to 2007.
I The == is a comparison.
I We specify the rows to extract with comparison and logical
operators.
6 / 21
Relational (comparison) operators
I The basic relational operators are described in ?Comparison:
== is equal, != is not equal, > is greater than, < is less than,
>= is greater than or equal, <= is less than or equal.
I Watch out for finite-precision arithmetic.
2>3; 2/2 == 1; sqrt(2)^2 == 2; near(sqrt(2)^2,2) ## [1] FALSE ## [1] TRUE ## [1] FALSE ## [1] TRUE 7 / 21 Logical operators !TRUE ; TRUE | FALSE & FALSE; (TRUE | FALSE) & FALSE # eval parentheses first ## [1] FALSE ## [1] TRUE ## [1] FALSE I Relational comparisons can be combined with logicals ## [1] TRUE I Logical operations between vectors are element-wise. I Exercises: First, extract all flights from January or February. 9 / 21 Many logicals and %in% I In lab 2 we saw the %in% operator: ## [1] NA ## [1] NA ## [1] NA ## [1] TRUE 11 / 21 is.na() I Test for NA with is.na(): ## [1] TRUE NA FALSE ## [1] TRUE TRUE FALSE I Exercise: Extract all flights from January with missing 12 / 21 Sorting with arrange() I arrange() changes the order of rows, putting NAs last ## # A tibble: 4 x 2 13 / 21 Descending order I The default in arrange() is increasing order. arrange(vv,desc(x)) ## # A tibble: 4 x 2 I Exercise: Arrange the mpg data set by decreasing order in the 14 / 21 Selecting columns with select() column. functions. select(flights,month,day,hour,minute) ## # A tibble: 336,776 x 4 15 / 21 Ranges select(flights,month:minute) ## # A tibble: 336,776 x 17 ## # A tibble: 336,776 x 2 16 / 21 Renaming variables I You can rename variables as they are selected. flights <- rename(flights,tail_num = tailnum)
17 / 21
Helper functions
I Some useful helper functions are starts_with(),
ends_with() and contains().
I See ?select for a complete list.
select(flights,contains("dep"))
## # A tibble: 336,776 x 3
## dep_time sched_dep_time dep_delay
## I Exercise Select all variables with “dep” or “arr” in the name. 18 / 21 Add new variables with mutate() I We’ve used mutate() to add log-GDP to the gapminder data gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap))
I You can use variables as they are created by mutate():
flights <- mutate(flights,gain = arr_delay - dep_delay, gainh = gain/60)
19 / 21
Summaries and grouping
I A common task is “split-apply-combine”: We want to split a
data set into groups defined by one variable, apply a statistical
summary to each group, and then combine the results.
I With dplyr we use group_by() and summarize().
by_day <- group_by(flights,year,month,day)
summarize(by_day, delay = mean(dep_delay,na.rm=TRUE))
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day delay
## I Note: mean() returns NA whenever there are missing values, unless we set Combining operations with the pipe without the need to store intermediate results group_by(month,day) %>% ## # A tibble: 365 x 4 I Notice that we omit the data set name when a function
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay
## # tailnum
## # hour
I Data transformation: Chapter 5 of online text
[https://github.com/rstudio/cheatsheets/raw/master/data-
transformation.pdf]
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
data visualization. For example:
I Basic logical operators: ! is NOT, & is AND, | is OR.
(2==2) | (2==3)
x <- c(TRUE,TRUE,FALSE); y <- c(FALSE,TRUE,TRUE)
!x ; x&y ; x|y
## [1] FALSE FALSE TRUE
## [1] FALSE TRUE FALSE
## [1] TRUE TRUE TRUE
8 / 21
filter() example
I Extract all flights from January, with departure delay of more
than 1:
jan13 <- filter(flights, month==1 & dep_delay > 1)
Then extract all flights from January or February that have a
departure delay of 1 or more.
hiv <- read.csv("../Labs/HIVprev.csv",stringsAsFactors = FALSE)
cc <- c("Botswana","Central African Republic","Congo","Kenya","Lesotho",
"Malawi","Namibia","South Africa","Swaziland","Uganda","Zambia",
"Zimbabwe")
hihiv <- filter(hiv,Country %in% cc)
I This is a convenient alternative to
filter(hiv,Country=="Botswana" | Country=="Central African Republic", etc. )
I Exercise: The nycflights13 package includes a table
airlines that translates the two-letter airline codes in
flights into the full names of the airlines. Extract all flights
operated by United, American or Delta.
10 / 21
Missing data: NA
I In R, missing data (not available) is denoted by NA.
I NA takes precendence in all comparison and arithmetic
operations, and almost all logical operations.
NA > 3; NA+10; NA & TRUE; NA | TRUE
vv <- c(10,NA,1)
is.na(vv)
## [1] FALSE TRUE FALSE
vv>1
is.na(vv) | vv> 1
departure delay.
vv <- tibble(x=c(NA,10,10,1),y=c("one","two","three","four"))
arrange(vv,x)
## # A tibble: 4 x 2
## x y
##
## 1 1 four
## 2 10 two
## 3 10 three
## 4 NA one
arrange(vv,x,y)
## x y
##
## 1 1 four
## 2 10 three
## 3 10 two
## 4 NA one
I Use desc() for descending order.
## x y
##
## 1 10 two
## 2 10 three
## 3 1 four
## 4 NA one
number of cylinders (variable cyl) and increasing order by
engine displacement (variable displ) within cylinders.
I Whereas filter() subsets by row, select() subsets by
I We specify columns by their name, possibly using helper
I Select month, day, hour and minute from flights:
## month day hour minute
##
## 1 1 1 5 15
## 2 1 1 5 29
## 3 1 1 5 40
## 4 1 1 5 45
## 5 1 1 6 0
## 6 1 1 5 58
## 7 1 1 6 0
## 8 1 1 6 0
## 9 1 1 6 0
## 10 1 1 6 0
## # … with 336,766 more rows
I Select or de-select a range of columns with :
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 1 1 517 515 2 830 819
## 2 1 1 533 529 4 850 830
## 3 1 1 542 540 2 923 850
## 4 1 1 544 545 -1 1004 1022
## 5 1 1 554 600 -6 812 837
## 6 1 1 554 558 -4 740 728
## 7 1 1 555 600 -5 913 854
## 8 1 1 557 600 -3 709 723
## 9 1 1 557 600 -3 838 846
## 10 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 10 more variables: arr_delay
## # carrier
## # air_time
select(flights,-(month:minute))
## year time_hour
##
## 2 2013 2013-01-01 05:00:00
## 3 2013 2013-01-01 05:00:00
## 4 2013 2013-01-01 05:00:00
## 5 2013 2013-01-01 06:00:00
## 6 2013 2013-01-01 05:00:00
## 7 2013 2013-01-01 06:00:00
## 8 2013 2013-01-01 06:00:00
## 9 2013 2013-01-01 06:00:00
## 10 2013 2013-01-01 06:00:00
## # … with 336,766 more rows
I To rename a variable without selecting it use rename().
## 1 517 515 2
## 2 533 529 4
## 3 542 540 2
## 4 544 545 -1
## 5 554 600 -6
## 6 554 558 -4
## 7 555 600 -5
## 8 557 600 -3
## 9 557 600 -3
## 10 558 600 -2
## # … with 336,766 more rows
set.
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # … with 355 more rows
na.rm=TRUE. 20 / 21
I The pipe %>% can be used to chain together operations,
select(flights,month,day,dep_delay) %>%
summarize(count = n(), delay=mean(dep_delay,na.rm=TRUE))
## # Groups: month [12]
## month day count delay
##
## 1 1 1 842 11.5
## 2 1 2 943 13.9
## 3 1 3 914 11.0
## 4 1 4 915 8.95
## 5 1 5 720 5.73
## 6 1 6 832 7.15
## 7 1 7 933 5.42
## 8 1 8 899 2.55
## 9 1 9 902 2.28
## 10 1 10 932 2.84
## # … with 355 more rows
receives data from the pipe. 21 / 21