Stat 260, Lecture 5, Reading Data

David Stenning

Load packages

library(tidyverse)
library(nycflights13)

Reading

Required Reading:

- Workflow: scripts: Chapter 6 of online text
- Introduction to data wrangling: Chapter 9 of online text
- Tibbles: Chapter 10 of online text
- Reading data with readr: Chapter 11 of online text

Useful Reference:

- Data import (readr/tidyr) cheatsheet at
  https://github.com/rstudio/cheatsheets/raw/master/data-import.pdf

Data Wrangling

From Ch. 9 of the online text:

Tibbles
- In base R, the data structure used to hold data sets is the data frame.
- We can make a data frame from vectors as follows:

dd <- data.frame(x=c(NA,10,1),y=c("one","two","three"))
dd

##    x     y
## 1 NA   one
## 2 10   two
## 3  1 three

- The tidyverse authors find the default behaviour of data frames to be odd, and so implemented an improvement called tibbles:

tt <- tibble(x=c(NA,10,1),y=c("one","two","three"))
tt

## # A tibble: 3 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
## 3     1 three
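
One frequently cited example of such default behaviour is partial matching of column names by $. The sketch below assumes that this is among the behaviours meant, since the slide does not list them. The objects df_pm and tb_pm are hypothetical and not part of the lecture.

```r
library(tibble)

# Hypothetical objects, not from the lecture.
df_pm <- data.frame(xyz = c(1, 2, 3))
tb_pm <- tibble(xyz = c(1, 2, 3))

# A data frame silently completes the partial name "xy" to "xyz".
from_df <- df_pm$xy

# A tibble does not guess: it warns and returns NULL.
from_tb <- suppressWarnings(tb_pm$xy)

print(from_df)
print(is.null(from_tb))
```

Silent completion can hide a misspelled column name, which is one reason a stricter $ may be preferable.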

data frames to tibbles and back

- Data frames can be coerced to tibbles and vice versa.

as_tibble(dd)

## # A tibble: 3 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
## 3     1 three
as.data.frame(tt)

##    x     y
## 1 NA   one
## 2 10   two
## 3  1 three

tibble printing
- One difference between data frames and tibbles is how they are printed.

- Printing a data frame: all rows and columns, up to your R session’s max.print.

- Printing a tibble: the first 10 rows, as many columns as fit the screen, and the column data types.

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # … with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Control printing of tibbles

- To see all rows/columns of a tibble, it is best to View() it.

- But you can also print all rows and columns by setting
options(tibble.print_min=Inf) and
options(tibble.width=Inf).
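
As an alternative to changing session-wide options, the print() method for tibbles accepts n and width arguments for a single call. The sketch below uses a hypothetical 25-row tibble named t25 and compares the number of printed lines with and without n = Inf.

```r
library(tibble)

# Hypothetical 25-row tibble, just to have more rows than the default shows.
t25 <- tibble(id = 1:25, sq = (1:25)^2)

# Default printing shows only the first 10 of the 25 rows.
default_lines <- capture.output(print(t25))

# n = Inf prints every row; width = Inf prints every column.
all_lines <- capture.output(print(t25, n = Inf, width = Inf))

# The full printout is longer than the default one.
print(length(all_lines) > length(default_lines))
```

Passing the arguments to print() leaves the defaults untouched for the rest of the session.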

Extracting columns as vectors

- Use the basic tools $ and [[ to extract a variable from a tibble
or data frame:

dd$x

## [1] NA 10 1
tt$x

## [1] NA 10 1
dd[["x"]]

## [1] NA 10 1
tt[["x"]]

## [1] NA 10 1

Subsetting: columns
- Using select() is the preferred method to subset columns of
a data frame or tibble, but we can also use the more basic tool
[; e.g.,

tt[,"x"]

## # A tibble: 3 x 1
##       x
##   <dbl>
## 1    NA
## 2    10
## 3     1
tt[,c("x","y")]

## # A tibble: 3 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
## 3     1 three
dd[,"x"] # returns a vector

## [1] NA 10 1
dd[,c("x","y")]

##    x     y
## 1 NA   one
## 2 10   two
## 3  1 three
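
The output above shows that selecting a single column with [ returns a vector from a data frame but a tibble from a tibble. A minimal sketch of how to check this, and of the base R argument drop = FALSE that keeps the data frame structure, follows. The result names (vec_from_dd and so on) are made up for illustration.

```r
library(tibble)

dd <- data.frame(x = c(NA, 10, 1), y = c("one", "two", "three"))
tt <- tibble(x = c(NA, 10, 1), y = c("one", "two", "three"))

# Single-column [ on a data frame drops to a plain vector by default.
vec_from_dd <- dd[, "x"]

# drop = FALSE keeps a one-column data frame instead.
df_from_dd <- dd[, "x", drop = FALSE]

# Single-column [ on a tibble always returns a tibble.
tb_from_tt <- tt[, "x"]

print(is.data.frame(vec_from_dd))
print(is.data.frame(df_from_dd))
print(is_tibble(tb_from_tt))
```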

Subsetting: rows

- Using filter() is the preferred method to extract rows of a
data frame or tibble, but we can also use [.

tt[2,]

## # A tibble: 1 x 2
##       x y
##   <dbl> <chr>
## 1    10 two
tt[1:2,]

## # A tibble: 2 x 2
##       x y
##   <dbl> <chr>
## 1    NA one
## 2    10 two
dd[2,]

##    x   y
## 2 10 two
dd[1:2,]

##    x   y
## 1 NA one
## 2 10 two
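
One practical difference between the two approaches appears when the condition involves missing values. The sketch below is an illustration that goes slightly beyond the slide: it subsets dd by the condition x > 5 using both [ and filter(). The result names by_bracket and by_filter are made up.

```r
library(dplyr)

dd <- data.frame(x = c(NA, 10, 1), y = c("one", "two", "three"))

# Logical subsetting with [ : NA > 5 is NA, which yields a row of NAs.
by_bracket <- dd[dd$x > 5, ]

# filter() keeps only rows where the condition is TRUE, so the NA case is dropped.
by_filter <- filter(dd, x > 5)

print(nrow(by_bracket))  # 2
print(nrow(by_filter))   # 1
```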

Exercise

- Create a data frame myd and tibble myt that each have
columns named cat, dog and mouse. Each column should be
of length three, but the values in each column are up to you.

- What do names(myd) and names(myt) return?
- Create the variable a1 <- c("cat","dog","bird","fish") and the variable a2 <- c("cat","tiger"). We can combine logicals with [ to subset. What do the following return?
  - myd[,names(myd) %in% a1]
  - myd[,names(myd) %in% a2]
  - myt[,names(myt) %in% a1]
  - myt[,names(myt) %in% a2]

Importing data

- We read in the HIV prevalence data with the base R function read.csv(), which returned a data frame.
- We will now discuss the tidyverse equivalent, read_csv(), which returns a tibble.

hiv <- read_csv("../Labs/HIVprev.csv")

## Parsed with column specification:
## cols(
##   Country = col_character(),
##   year = col_double(),
##   prevalence = col_double()
## )

Why use read_csv() instead of read.csv()?

read_csv():

- reports how each column of the CSV file was “parsed” (more on this later),
- returns a tibble,
- does not convert character columns to factors, like read.csv() with stringsAsFactors = FALSE (recall: hiv <- read.csv("../Labs/HIVprev.csv",stringsAsFactors = FALSE) ),
- is faster, and
- is more consistent across operating systems.

Other read_ functions

- CSV stands for comma-separated values; CSV files are also called comma-delimited files.
- read_csv2() reads semicolon-delimited files,
- read_tsv() reads tab-delimited files,
- read_delim() reads files with a user-specified delimiter.
- Exercise: A file called “chicken.C” contains the following data on two chickens, with IDs 22 and 33, who laid 2 and 1 eggs, respectively. (Reference: https://isotropic.org/papers/chicken.pdf) How would you read this data file into R?

IDCeggs
22C2
33C1

Skip and comments

- Some files contain a header that describes the data, aka meta-data, that we should skip when reading.
- Some files include comments that start with common characters, such as “#”, that we wish to drop.
- Example file:

This is a header that you should skip
# this is a comment
A,B,C
1,2.2,1999-05-10
4,5.5,2001-04-04
# another comment

Reading example into R with read_csv()

In lec05exfile.csv we have:

This is a header that you should skip
# this is a comment
A,B,C
1,2.2,1999-05-10
4,5.5,2001-04-04
# another comment

read_csv("lec05exfile.csv",skip=2,comment="#")

## Parsed with column specification:
## cols(
##   A = col_double(),
##   B = col_double(),
##   C = col_date(format = "")
## )
## # A tibble: 2 x 3
##       A     B C
##   <dbl> <dbl> <date>
## 1     1   2.2 1999-05-10
## 2     4   5.5 2001-04-04
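
To try the read_csv() call above without the course file, one option is to write the same lines to a temporary file first. This is only a sketch: it assumes the final comment sits on its own line, which the slide layout does not make clear, and the names tmp and ex are made up.

```r
library(readr)

# Write a copy of the example file to a temporary location.
tmp <- tempfile(fileext = ".csv")
writeLines(c("This is a header that you should skip",
             "# this is a comment",
             "A,B,C",
             "1,2.2,1999-05-10",
             "4,5.5,2001-04-04",
             "# another comment"),
           tmp)

# skip = 2 drops the first two lines; comment = "#" drops the trailing comment.
ex <- read_csv(tmp, skip = 2, comment = "#")
print(ex)
```

Note that skip counts lines from the top of the file, so here it also removes the first comment line.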

Parsing a vector

- read_csv() prints a message that describes how each
column of the input file was parsed.

- Parsing a file depends on the parse_* functions, such as
parse_number(), that parse vectors.

- The parse_* functions take a vector of character strings as
input and return a vector of a given mode, handling missing
values as specified by the user.

parse_number(c("$10.55","33%","Number is 44","."),na=".")

## [1] 10.55 33.00 44.00 NA

- The parse functions are designed to handle data formats and
character sets from around the world.

- In this course we assume North American data formats and
character set.

- See the text if you need other formats.
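
For a sense of what handling other formats looks like, the sketch below uses the locale argument of the parse functions with readr's locale() helper. The input strings and the names eu_double and eu_number are illustrative choices, not taken from the lecture.

```r
library(readr)

# Decimal comma, as used in much of continental Europe.
eu_double <- parse_double("1,23", locale = locale(decimal_mark = ","))

# Period used as the grouping mark between thousands.
eu_number <- parse_number("123.456.789", locale = locale(grouping_mark = "."))

print(eu_double)
print(eu_number)
```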

Other parsing functions
- parse_logical(), parse_integer(), parse_double(),
parse_character(), parse_factor(), parse_datetime(),
parse_date() and parse_time().

- Use the str() function to see the mode of an object:

str(parse_logical(c("TRUE","FALSE")))

## logi [1:2] TRUE FALSE
str(parse_logical(c("1","0")))

## logi [1:2] TRUE FALSE
str(parse_integer(c("1","0")))

## int [1:2] 1 0
str(parse_double(c("1","0")))

## num [1:2] 1 0
str(parse_factor(c("1","0")))

## Factor w/ 2 levels "1","0": 1 2
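
In the last call the factor levels follow the order in which values appear in the input. If a particular order is wanted, parse_factor() accepts a levels argument. A minimal sketch, with the hypothetical name f01:

```r
library(readr)

# Supply the levels explicitly so that "0" comes first.
f01 <- parse_factor(c("1", "0"), levels = c("0", "1"))

print(levels(f01))
print(as.integer(f01))
```

With the levels fixed in advance, the integer codes no longer depend on how the data happen to be ordered.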

Dates and times

- These parsers have default formats for dates and times, but
your best bet is to specify the format yourself.

- The formatting rules are described in help(strptime).

dd <- c("05/14/1966/12/34/56","04/02/2002/07/43/00","08/17/2005/07/22/00",
        "08/12/2008/16/20/00")
dd <- parse_datetime(dd,format = "%m/%d/%Y/%H/%M/%S")
str(dd)

## POSIXct[1:4], format: "1966-05-14 12:34:56" "2002-04-02 07:43:00" "2005-08-17 07:22:00" ...
mean(dd)

## [1] "1995-09-18 22:59:59 UTC"
diff(dd)

## Time differences in days
## [1] 13106.797 1232.985 1091.374

Parsing files

- read_csv() and other read functions guess at the format of each column. Sometimes this works, sometimes not.
- You can read about how these functions guess in the text.
- Here we’ll focus on manually specifying the format.
- Recall our example file, lec05exfile.csv, which we will read in:

This is a header that you should skip
# this is a comment
A,B,C
1,2.2,1999-05-10
4,5.5,2001-04-04
# another comment

dat <- read_csv("lec05exfile.csv",skip=2,comment="#")

## Parsed with column specification:
## cols(
##   A = col_double(),
##   B = col_double(),
##   C = col_date(format = "")
## )

Parsing files

- Cut-and-paste the guess and replace parsers as necessary:

dat <- read_csv("lec05exfile.csv",skip=2,comment="#",
  col_types=cols(
    A = col_integer(),
    B = col_double(),
    C = col_date(format = "%Y-%m-%d")
  )
)
str(dat$A)

## int [1:2] 1 4

- For reproducibility your R scripts should have a manual specification of the parsing of each column, rather than relying on guesses that can change as your data changes.

Exercise

- Copy the following data to a file and read it in to R. Specify the column types yourself, based on the descriptions in the header of the file. Hint: read about col_factor().

# Variable fert is a factor that records the
# type of fertilizer used in the experiment,
# date records the date and time of the experiment
# and yield is the yield of corn.
fert,date,yield
F1A2,2018/04/01/12/30,22.56
F1A1,2018/04/02/12/00,26.06
F2A2,2018/04/01/12/45,32.03
F2A1,2018/04/02/12/00,33.21