Stat 260, Lecture 5, Reading Data
David Stenning
1 / 23
Load packages
library(tidyverse)
library(nycflights13)
2 / 23
Reading
Required Reading:
- Workflow: scripts: Chapter 6 of online text
- Introduction to data wrangling: Chapter 9 of online text
- Tibbles: Chapter 10 of online text
- Reading data with readr: Chapter 11 of online text
Useful Reference:
- Data import (readr/tidyr) cheatsheet at
https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf
3 / 23
Data Wrangling
From Ch. 9 of the online text:
4 / 23
Tibbles
- In base R, the data structure used to hold data sets is the data
frame.
- We can make a data frame from vectors as follows:
dd <- data.frame(x=c(NA,10,1),y=c("one","two","three"))
dd
##    x     y
## 1 NA   one
## 2 10   two
## 3  1 three
- The tidyverse authors find the default behaviour of data frames
to be odd, and so implemented an improvement called tibbles:
tt <- tibble(x=c(NA,10,1),y=c("one","two","three"))
tt
## # A tibble: 3 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
## 3     1 three
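The slide does not say which defaults are considered odd. The sketch below shows three commonly cited differences; the choice of these three, and the object names df, tb, tb2, df2 and tb3, are assumptions made for illustration.

```r
library(tibble)

# data.frame() rewrites non-syntactic column names by default
df <- data.frame(`my var` = 1:3)
names(df)    # "my.var"

# tibble() keeps the column name exactly as given
tb <- tibble(`my var` = 1:3)
names(tb)    # "my var"

# tibble() builds columns in order, so y can refer to x
tb2 <- tibble(x = 1:3, y = x^2)
tb2$y        # 1 4 9

# $ on a data frame silently accepts a prefix of a column name;
# on a tibble it returns NULL and warns
df2 <- data.frame(value = 1:3)
df2$val      # 1 2 3
tb3 <- tibble(value = 1:3)
tb3$val      # NULL, with a warning about an unknown column
```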
5 / 23
data frames to tibbles and back
- Data frames can be coerced to tibbles and vice versa.
as_tibble(dd)
## # A tibble: 3 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
## 3     1 three
as.data.frame(tt)
##    x     y
## 1 NA   one
## 2 10   two
## 3  1 three
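One way to confirm which kind of object you hold after coercion is to inspect its class. This is a small sketch; the name back is hypothetical.

```r
library(tibble)

dd <- data.frame(x = c(NA, 10, 1), y = c("one", "two", "three"))

tt <- as_tibble(dd)
class(tt)          # "tbl_df" "tbl" "data.frame"
is_tibble(tt)      # TRUE

back <- as.data.frame(tt)
class(back)        # "data.frame"
is_tibble(back)    # FALSE
```

A tibble is still a data frame (is.data.frame(tt) is TRUE), so most functions that expect a data frame accept it.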
6 / 23
tibble printing
- One difference between data frames and tibbles is how they are
printed.
- Printing a data frame: all rows and columns, up to your R
session’s max.print.
- Printing a tibble: the first 10 rows, as many columns as fit the
screen, and the column data types.
flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
7 / 23
Control printing of tibbles
- To see all rows/columns of a tibble, best to View() it.
- But you can also print all rows and columns by setting
options(dplyr.print_min=Inf) and
options(tibble.width=Inf).
8 / 23
Extracting columns as vectors
- Use the basic tools $ and [[ to extract a variable from a tibble
or data frame:
dd$x
## [1] NA 10  1
tt$x
## [1] NA 10  1
dd[["x"]]
## [1] NA 10  1
tt[["x"]]
## [1] NA 10  1
9 / 23
Subsetting: columns
- Using select() is the preferred method to subset columns of
a data frame or tibble, but we can also use the more basic tool
[; e.g.,
tt[,"x"]
## # A tibble: 3 x 1
##       x
##   <dbl>
## 1    NA
## 2    10
## 3     1
tt[,c("x","y")]
## # A tibble: 3 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
## 3     1 three
dd[,"x"] # returns a vector
## [1] NA 10  1
dd[,c("x","y")]
##    x     y
## 1 NA   one
## 2 10   two
## 3  1 three
10 / 23
Subsetting: rows
- Using filter() is the preferred method to extract rows of a
data frame or tibble, but we can also use [.
tt[2,]
## # A tibble: 1 x 2
##       x y
##   <dbl> <chr>
## 1    10 two
tt[1:2,]
## # A tibble: 2 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
dd[2,]
##    x   y
## 2 10 two
dd[1:2,]
##    x   y
## 1 NA one
## 2 10 two
11 / 23
Exercise
- Create a data frame myd and tibble myt that each have
columns named cat, dog and mouse. Each column should be
of length three, but the values in each column are up to you.
- What do names(myd) and names(myt) return?
- Create the variable a1 <- c("cat","dog","bird","fish")
and the variable a2 <- c("cat","tiger"). We can combine
logicals with [ to subset. What do the following return?
  - myd[,names(myd) %in% a1]
  - myd[,names(myd) %in% a2]
  - myt[,names(myt) %in% a1]
  - myt[,names(myt) %in% a2]
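The earlier slides on printing, extracting and subsetting name View(), select() and filter() without showing them in use. The sketch below pairs each base tool with a dplyr counterpart on the small tibble tt; the pairing of pull() with $ and of slice() with positional [ is an added illustration, not something stated on the slides.

```r
library(dplyr)
library(tibble)

tt <- tibble(x = c(NA, 10, 1), y = c("one", "two", "three"))

# Columns
x_vec <- pull(tt, x)      # a plain vector, like tt$x or tt[["x"]]
x_tbl <- select(tt, x)    # a one-column tibble, like tt[, "x"]

# Rows
first2 <- slice(tt, 1:2)  # rows by position, like tt[1:2, ]
big <- filter(tt, x > 5)  # rows by condition; rows where x is NA are dropped

# A data frame keeps its shape under [ only if drop = FALSE is given
dd <- as.data.frame(tt)
x_df <- dd[, "x", drop = FALSE]

# Print every row and column of one tibble without changing global options
print(tt, n = Inf, width = Inf)
```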
12 / 23
Importing data
- We read in the HIV prevalence data with the base R function
read.csv(), which returned a data frame.
- We will now discuss the tidyverse equivalent, read_csv(),
which returns a tibble.
hiv <- read_csv("../Labs/HIVprev.csv")
## Parsed with column specification:
## cols(
##   Country = col_character(),
##   year = col_double(),
##   prevalence = col_double()
## )
13 / 23
Why use read_csv() instead of read.csv()?
read_csv():
- reports how each column of the CSV file was “parsed” (more
on this later),
- returns a tibble,
- never converts strings to factors, which matches
stringsAsFactors = FALSE in read.csv() (the read.csv() default only
since R 4.0.0; recall:
hiv <- read.csv("../Labs/HIVprev.csv",stringsAsFactors = FALSE) ),
- is faster, and
- is more consistent across operating systems.
14 / 23
Other read_ functions
- CSV stands for comma-separated values; CSV files are also called
comma-delimited files.
- read_csv2() reads semicolon-delimited files,
- read_tsv() reads tab-delimited files,
- read_delim() reads files with a user-specified delimiter.
- Exercise: A file called “chicken.C” contains the following data
on two chickens, with IDs 22 and 33, who laid 2 and 1 eggs,
respectively. (Reference: https://isotropic.org/papers/chicken.pdf) How
would you read this data file into R?
IDCeggs
22C2
33C1
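As a sketch of read_delim() with a user-specified delimiter, the code below writes a small pipe-delimited file to a temporary location and reads it back. The file contents, the "|" delimiter, and the names path and hens are made up for illustration and are not the exercise file.

```r
library(readr)

# Write a made-up pipe-delimited file to a temporary path
path <- tempfile(fileext = ".txt")
writeLines(c("id|eggs", "1|4", "2|0"), path)

# delim names the single character that separates fields
hens <- read_delim(path, delim = "|")
hens
```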
15 / 23
Skip and comments
- Some files contain a header that describes the data, aka
meta-data, that we should skip when reading.
- Some files include comments that start with common
characters, such as “#”, that we wish to drop.
- Example file:
This is a header
that you should skip
# this is a comment
A,B,C
1,2.2,1999-05-10
4,5.5,2001-04-04 # another comment
16 / 23
Reading example into R with read_csv()
In lec05exfile.csv we have:
This is a header
that you should skip
# this is a comment
A,B,C
1,2.2,1999-05-10
4,5.5,2001-04-04 # another comment
read_csv("lec05exfile.csv",skip=2,comment="#")
## Parsed with column specification:
## cols(
##   A = col_double(),
##   B = col_double(),
##   C = col_date(format = "")
## )
## # A tibble: 2 x 3
##       A     B C
##   <dbl> <dbl> <date>
## 1     1   2.2 1999-05-10
## 2     4   5.5 2001-04-04
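17 / 23
Parsing a vector
- read_csv() prints a message that describes how each
column of the input file was parsed.
- Parsing a file depends on the parse_* functions, such as
parse_number(), that parse vectors.
- The parse_* functions take a vector of character strings as
input and return a vector of a given mode, handling missing
values as specified by the user.
parse_number(c("$10.55","33%","Number is 44","."),na=".")
## [1] 10.55 33.00 44.00    NA
- The parse functions are designed to handle data formats and
character sets from around the world.
- In this course we assume North American data formats and
character set.
- See the text if you need other formats.
18 / 23
Other parsing functions
- parse_logical(), parse_integer(), parse_double(),
parse_character(), parse_factor(), parse_datetime(),
parse_date() and parse_time().
- Use the str() function to see the mode of an object:
str(parse_logical(c("TRUE","FALSE")))
## logi [1:2] TRUE FALSE
str(parse_logical(c("1","0")))
## logi [1:2] TRUE FALSE
str(parse_integer(c("1","0")))
## int [1:2] 1 0
str(parse_double(c("1","0")))
## num [1:2] 1 0
str(parse_factor(c("1","0")))
## Factor w/ 2 levels "1","0": 1 2
19 / 23
Dates and times
- These parsers have default formats for dates and times, but
your best bet is to specify the format yourself.
- The formatting rules are described in help(strptime).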
dd <- c("05/14/1966/12/34/56","04/02/2002/07/43/00","08/17/2005/07/22/00",
"08/12/2008/16/20/00")
dd <- parse_datetime(dd,format = "%m/%d/%Y/%H/%M/%S")
str(dd)
## POSIXct[1:4], format: "1966-05-14 12:34:56" "2002-04-02 07:43:00" "2005-08-17 07:22:00" ...
mean(dd)
## [1] "1995-09-18 22:59:59 UTC"
diff(dd)
## Time differences in days
## [1] 13106.797 1232.985 1091.374
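To see why specifying the format yourself is the safer choice, consider a date string that is valid under two conventions. This is a minimal sketch; the example strings and the names ambiguous, us, intl and bad are made up for illustration.

```r
library(readr)

ambiguous <- "01/02/2015"

# The same string is a different date under each format
us <- parse_date(ambiguous, format = "%m/%d/%Y")    # 2 January 2015
intl <- parse_date(ambiguous, format = "%d/%m/%Y")  # 1 February 2015

# A string that does not fit the stated format becomes NA, with a warning
bad <- parse_date("31/12/2015", format = "%m/%d/%Y")
```

Stating the format turns a silent misreading into either the date you intended or a visible parsing failure.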
20 / 23
Parsing files
- read_csv() and other read functions guess at the format of
each column. Sometimes this works, sometimes not.
- You can read about how these functions guess in the text.
- Here we’ll focus on manually specifying the format.
- Recall our example file, lec05exfile.csv, which we will read in:
This is a header
that you should skip
# this is a comment
A,B,C
1,2.2,1999-05-10
4,5.5,2001-04-04 # another comment
dat <- read_csv("lec05exfile.csv",skip=2,comment="#")
## Parsed with column specification:
## cols(
##   A = col_double(),
##   B = col_double(),
##   C = col_date(format = "")
## )
21 / 23
Parsing files
- Cut-and-paste the guess and replace parsers as necessary:
dat <- read_csv("lec05exfile.csv",skip=2,comment="#",
col_types=cols(
A = col_integer(),
B = col_double(),
C = col_date(format = "%Y-%m-%d")
)
)
str(dat$A)
## int [1:2] 1 4
- For reproducibility your R scripts should have a manual
specification of the parsing of each column, rather than relying
on guesses that can change as your data changes.
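The following sketch illustrates how a guess can change with the data. Two made-up files differ in a single value; the file contents and the names f1, f2, spec and d2 are assumptions for illustration.

```r
library(readr)

# Two made-up files that differ only in one value of column A
f1 <- tempfile(fileext = ".csv")
writeLines(c("A,B", "1,2.2", "4,5.5"), f1)
f2 <- tempfile(fileext = ".csv")
writeLines(c("A,B", "1,2.2", "four,5.5"), f2)

# Guessing: the type of A silently depends on the file contents
class(read_csv(f1)$A)   # "numeric"
class(read_csv(f2)$A)   # "character"

# Manual specification: the type is fixed, and the bad value is flagged
spec <- cols(A = col_integer(), B = col_double())
d2 <- read_csv(f2, col_types = spec)   # warns about a parsing problem
d2$A                                   # 1 NA
problems(d2)                           # lists the row and column that failed
```

With the manual specification a change in the data produces a warning you can inspect, rather than a column that quietly changes type.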
22 / 23
Exercise
- Copy the following data to a file and read it in to R. Specify
the column types yourself, based on the descriptions in the
header of the file. Hint: read about col_factor().
# Variable fert is a factor that records the
# type of fertilizer used in the experiment,
# date records the date and time of the experiment
# and yield is the yield of corn.
fert,date,yield
F1A2,2018/04/01/12/30,22.56
F1A1,2018/04/02/12/00,26.06
F2A2,2018/04/01/12/45,32.03
F2A1,2018/04/02/12/00,33.21
23 / 23