CS计算机代考程序代写 chain Stat 260, Lecture 4, Data Transformation

Stat 260, Lecture 4, Data Transformation

Stat 260, Lecture 4, Data Transformation

David Stenning

1 / 21

Load packages

library(gapminder)
library(dplyr)
library(nycflights13)
head(flights)

## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # … with 11 more variables: arr_delay , carrier , flight ,
## # tailnum , origin , dest , air_time , distance ,
## # hour , minute , time_hour

2 / 21

Reading

Required reading:

I Workflow basics: Chapter 4 of online text
I Data transformation: Chapter 5 of online text

Also, for a helpful reference:

I dplyr cheatsheet at
[https://github.com/rstudio/cheatsheets/raw/master/data-
transformation.pdf]

3 / 21

https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf

Data manipulation with dplyr

I We have seen the need to manipulate datasets when studying
data visualization. For example:

gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap)) gm07 <- filter(gapminder,year==2007) I In this lecture we will cover data manipulation more systematically 4 / 21 dplyr I The tidyverse package dplyr contains intuitive tools for manipulating data sets. I These tools are named after verbs. I The five key verbs (functions) are 1. filter() to select subsets of observations (rows), 2. arrange() to reorder rows, 3. select() to select variables (columns), 4. mutate() to create new variables from existing ones, and 5. summarize() to calculate summary statistics. 5 / 21 Filter rows with filter() I We previously used filter() to extract a subset of the gapminder dataset based on a logical condition. gm07 <- filter(gapminder,year==2007) I gm07 now contains the rows of gapminder in which the variable year is equal to 2007. I The == is a comparison. I We specify the rows to extract with comparison and logical operators. 6 / 21 Relational (comparison) operators I The basic relational operators are described in ?Comparison: == is equal, != is not equal, > is greater than, < is less than, >= is greater than or equal, <= is less than or equal. I Watch out for finite-precision arithmetic. 2>3; 2/2 == 1; sqrt(2)^2 == 2; near(sqrt(2)^2,2)

## [1] FALSE

## [1] TRUE

## [1] FALSE

## [1] TRUE

7 / 21

Logical operators
I Basic logical operators: ! is NOT, & is AND, | is OR.

!TRUE ; TRUE | FALSE & FALSE; (TRUE | FALSE) & FALSE # eval parentheses first

## [1] FALSE

## [1] TRUE

## [1] FALSE

I Relational comparisons can be combined with logicals
(2==2) | (2==3)

## [1] TRUE

I Logical operations between vectors are element-wise.
x <- c(TRUE,TRUE,FALSE); y <- c(FALSE,TRUE,TRUE) !x ; x&y ; x|y ## [1] FALSE FALSE TRUE ## [1] FALSE TRUE FALSE ## [1] TRUE TRUE TRUE 8 / 21 filter() example I Extract all flights from January, with departure delay of more than 1: jan13 <- filter(flights, month==1 & dep_delay > 1)

I Exercises: First, extract all flights from January or February.
Then extract all flights from January or February that have a
departure delay of 1 or more.

9 / 21

Many logicals and %in%

I In lab 2 we saw the %in% operator:
hiv <- read.csv("../Labs/HIVprev.csv",stringsAsFactors = FALSE) cc <- c("Botswana","Central African Republic","Congo","Kenya","Lesotho", "Malawi","Namibia","South Africa","Swaziland","Uganda","Zambia", "Zimbabwe") hihiv <- filter(hiv,Country %in% cc) I This is a convenient alternative to filter(hiv,Country=="Botswana" | Country=="Central African Republic", etc. ) I Exercise: The nycflights13 package includes a table airlines that translates the two-letter airline codes in flights into the full names of the airlines. Extract all flights operated by United, American or Delta. 10 / 21 Missing data: NA I In R, missing data (not available) is denoted by NA. I NA takes precendence in all comparison and arithmetic operations, and almost all logical operations. NA > 3; NA+10; NA & TRUE; NA | TRUE

## [1] NA

## [1] NA

## [1] NA

## [1] TRUE

11 / 21

is.na()

I Test for NA with is.na():
vv <- c(10,NA,1) is.na(vv) ## [1] FALSE TRUE FALSE vv>1

## [1] TRUE NA FALSE
is.na(vv) | vv> 1

## [1] TRUE TRUE FALSE

I Exercise: Extract all flights from January with missing
departure delay.

12 / 21

Sorting with arrange()

I arrange() changes the order of rows, putting NAs last
vv <- tibble(x=c(NA,10,10,1),y=c("one","two","three","four")) arrange(vv,x) ## # A tibble: 4 x 2 ## x y ##
## 1 1 four
## 2 10 two
## 3 10 three
## 4 NA one
arrange(vv,x,y)

## # A tibble: 4 x 2
## x y
##
## 1 1 four
## 2 10 three
## 3 10 two
## 4 NA one

13 / 21

Descending order

I The default in arrange() is increasing order.
I Use desc() for descending order.

arrange(vv,desc(x))

## # A tibble: 4 x 2
## x y
##
## 1 10 two
## 2 10 three
## 3 1 four
## 4 NA one

I Exercise: Arrange the mpg data set by decreasing order in the
number of cylinders (variable cyl) and increasing order by
engine displacement (variable displ) within cylinders.

14 / 21

Selecting columns with select()
I Whereas filter() subsets by row, select() subsets by

column.
I We specify columns by their name, possibly using helper

functions.
I Select month, day, hour and minute from flights:

select(flights,month,day,hour,minute)

## # A tibble: 336,776 x 4
## month day hour minute
##
## 1 1 1 5 15
## 2 1 1 5 29
## 3 1 1 5 40
## 4 1 1 5 45
## 5 1 1 6 0
## 6 1 1 5 58
## 7 1 1 6 0
## 8 1 1 6 0
## 9 1 1 6 0
## 10 1 1 6 0
## # … with 336,766 more rows

15 / 21

Ranges
I Select or de-select a range of columns with :

select(flights,month:minute)

## # A tibble: 336,776 x 17
## month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 1 1 517 515 2 830 819
## 2 1 1 533 529 4 850 830
## 3 1 1 542 540 2 923 850
## 4 1 1 544 545 -1 1004 1022
## 5 1 1 554 600 -6 812 837
## 6 1 1 554 558 -4 740 728
## 7 1 1 555 600 -5 913 854
## 8 1 1 557 600 -3 709 723
## 9 1 1 557 600 -3 838 846
## 10 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 10 more variables: arr_delay ,
## # carrier , flight , tailnum , origin , dest ,
## # air_time , distance , hour , minute
select(flights,-(month:minute))

## # A tibble: 336,776 x 2
## year time_hour
## ## 1 2013 2013-01-01 05:00:00
## 2 2013 2013-01-01 05:00:00
## 3 2013 2013-01-01 05:00:00
## 4 2013 2013-01-01 05:00:00
## 5 2013 2013-01-01 06:00:00
## 6 2013 2013-01-01 05:00:00
## 7 2013 2013-01-01 06:00:00
## 8 2013 2013-01-01 06:00:00
## 9 2013 2013-01-01 06:00:00
## 10 2013 2013-01-01 06:00:00
## # … with 336,766 more rows

16 / 21

Renaming variables

I You can rename variables as they are selected.
I To rename a variable without selecting it use rename().

flights <- rename(flights,tail_num = tailnum) 17 / 21 Helper functions I Some useful helper functions are starts_with(), ends_with() and contains(). I See ?select for a complete list. select(flights,contains("dep")) ## # A tibble: 336,776 x 3 ## dep_time sched_dep_time dep_delay ##
## 1 517 515 2
## 2 533 529 4
## 3 542 540 2
## 4 544 545 -1
## 5 554 600 -6
## 6 554 558 -4
## 7 555 600 -5
## 8 557 600 -3
## 9 557 600 -3
## 10 558 600 -2
## # … with 336,766 more rows

I Exercise Select all variables with “dep” or “arr” in the name.

18 / 21

Add new variables with mutate()

I We’ve used mutate() to add log-GDP to the gapminder data
set.

gapminder <- mutate(gapminder, log10GdpPercap = log10(gdpPercap)) I You can use variables as they are created by mutate(): flights <- mutate(flights,gain = arr_delay - dep_delay, gainh = gain/60) 19 / 21 Summaries and grouping I A common task is “split-apply-combine”: We want to split a data set into groups defined by one variable, apply a statistical summary to each group, and then combine the results. I With dplyr we use group_by() and summarize(). by_day <- group_by(flights,year,month,day) summarize(by_day, delay = mean(dep_delay,na.rm=TRUE)) ## # A tibble: 365 x 4 ## # Groups: year, month [12] ## year month day delay ##
## 1 2013 1 1 11.5
## 2 2013 1 2 13.9
## 3 2013 1 3 11.0
## 4 2013 1 4 8.95
## 5 2013 1 5 5.73
## 6 2013 1 6 7.15
## 7 2013 1 7 5.42
## 8 2013 1 8 2.55
## 9 2013 1 9 2.28
## 10 2013 1 10 2.84
## # … with 355 more rows

I Note: mean() returns NA whenever there are missing values, unless we set
na.rm=TRUE. 20 / 21

Combining operations with the pipe
I The pipe %>% can be used to chain together operations,

without the need to store intermediate results
select(flights,month,day,dep_delay) %>%

group_by(month,day) %>%
summarize(count = n(), delay=mean(dep_delay,na.rm=TRUE))

## # A tibble: 365 x 4
## # Groups: month [12]
## month day count delay
##
## 1 1 1 842 11.5
## 2 1 2 943 13.9
## 3 1 3 914 11.0
## 4 1 4 915 8.95
## 5 1 5 720 5.73
## 6 1 6 832 7.15
## 7 1 7 933 5.42
## 8 1 8 899 2.55
## 9 1 9 902 2.28
## 10 1 10 932 2.84
## # … with 355 more rows

I Notice that we omit the data set name when a function
receives data from the pipe. 21 / 21