Lecture 4: Data Management
Spatial Data Science II
Dr. Adams
Overview
Tidy data
Tibble vs. Data.Frame Data Import
tidyr
Data Analysis Pipeline
(Wickham and Grolemund 2016)
Where does the time go in data analysis?
https://whatsthebigdata.com/2016/05/01/ data-scientists-spend-most-of-their-time-cleaning-data/
What is the worst part of data analysis?
https://whatsthebigdata.com/2016/05/01/ data-scientists-spend-most-of-their-time-cleaning-data/
Why is it such a pain
Lots of people work on building new tools for data modelling Very-little work has been accomplished on data cleaning tools Data formats are very inconsistent
Disconnect between data entry / data generation and analysis Historically smaller datasets
Families and Data
“Happy families are all alike; every unhappy family is unhappy in its own way.” – Leo Tolstoy
“Like families, tidy datasets are all alike but every messy dataset is messy in its own way.” – (Wickham 2014)
General data structure
Composed of rows and columns Columns are often labelled
Rows can be labelled
We call this rectangular data (flat file)
Excel table
ArcGIS attribute table CSV file
Terminology
Dataset
A collection of values
Numbers (Quantitative) Strings (Qualitative)
Values
Organized as variables and observations
Variable: Values measured for an attribute (e.g. height, population or age)
Observation: Values measured for an entity (e.g. person, country or neighbourhood)
Data Example
Biological Oxygen Demand
datasets::BOD
## Time demand
##1 1 8.3
##2 2 10.3
##3 3 19.0
##4 4 16.0
##5 5 15.6
##6 7 19.8
Biological Oxygen Demand
datasets::BOD is a dataset Time is a variable
demand is a variable
There are six observations 12 values
A set of data organization rules
Tidy Data is a set of rules that allows us to maintain our data in a consistent fashion (Wickham 2014)
Less time on data munging (data preparation)
Tidy Data Rules
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
(Wickham 2014)
Benefits of tidy data
It is easier to learn to work with different tools when they share a consistent input.
R can use vectorized functions, which work well with tidy data. Operations that occur on the entire vector in an efficient way
Opposed to using loops, which are slow in R
Data.Frame
Most people begin with data.frame() in R. read.csv() reads data into a data.frame().
BOD
## Time demand
##1 1 8.3
##2 2 10.3
##3 3 19.0
##4 4 16.0
##5 5 15.6
##6 7 19.8
Tibble
library(tidyverse) as_tibble(BOD)
## # A tibble: 6 x 2
## Time demand
##
##1 1 8.3
##2 2 10.3
##3 319
##4 416
##5 5 15.6
##6 7 19.8
Tibble
Prints first 10 rows
Prints only columns that fit on screen
Prints column types
Prints dimensions of data (row, column)
Issues with data.frame
colnames(BOD)
## [1] “Time” “demand”
# Partial Matching
BOD$Ti
## [1] 1 2 3 4 5 7
Column Missing Warnings
# Convert BOD to tibble
tibbleBOD <- as_tibble(BOD)
# Try partial match
tibbleBOD$Ti
## Warning: Unknown or uninitialised column: 'Ti'.
## NULL
Select variable in a data.frame
To select a columm (varible) we use the $ operator with a data.frame
BOD$Time
## [1] 1 2 3 4 5 7
Select varaible in a tibbel
To select a columm (varible) we use the select() function with a tibble
tibbleBOD %>% select(Time)
## # A tibble: 6 x 1 ## Time
##
##1 1
##2 2 ##3 3 ##4 4 ##5 5 ##6 7
Extracting rows by criteria – data.frame
BOD[which(BOD$Time > 2),]
## Time demand ##3 3 19.0 ##4 4 16.0 ##5 5 15.6 ##6 7 19.8
Extracting rows by criteria – tibble
tibbleBOD %>% filter(Time > 2)
## # A tibble: 4 x 2
## Time demand
##
##1 319
##2 416
##3 5 15.6
##4 7 19.8
Subset Observations (Tibble)
Data Wrangling Cheat Sheet
dplyr::filter()
Rows that meet a logical criteria
dplyr::distinct()
Remove duplicate rows
dplyr::sample_frac()
Random sample of fraction of rows
dplyr::sample_n()
Random sample of n rows
dplyr::slice()
Select rows by posistion, e.g. 10:15
Data Import
Data in libraries
Data can be stored and accessed with libraries (packages) in R.
# EPA Dataset
ggplot::mpg # Gapminder
gapminder::gapminder
# Biological Oxygen Demand
datasets::BOD
data.frame to tibble
Many datasets are data.frames
# data.frame to tibble as_tibble()
as_tibble(BOD)
Functions outside of the tidyverse in other packages may not accept a tibble.
# tibble to data.frame with as.data.frame()
as.data.frame(mpg)
Note: Base R functions typically use dots between words, and tidyverse functions use underscores.
Reading Rectangular Files
carat cut colour
0.23 Ideal E 0.25 Good J 1.2 Fair I
Comma Separated Values
CSV files are very common and likely your best choice for data.
1. They are human readable
It can be opened in a text editor
2. Ideal for rectangular data
3. Compatible with most software
CSV Structure
If you were to open a csv file in a text editor:
carat,cut,colour 0.23,deal,E 0.25,Good,J 1.2,Fair,I
Reading and Writing CSV
R works well with csv files
read.csv()
Stores input as a data.frame
write.csv()
Writes a data.frame to an output file
Working with tibbles (Note the underscore)
read_csv() write_csv()
readr::read_csv()
library(readr)
readr::read_csv(
file, # File Path
col_names = TRUE, # First row column names?
na = c(“”, “NA”), # Character for missing values skip = 0, # Skip N rows
)
data <- read_csv("C:/somedata.csv", na = "-999")
Reading Other Files with readr
reader can be used for other flat files (rectangular files)
read_csv2()
semicolon separated files (common in countries where , is used
as the decimal place) read_tsv()
reads tab delimited files
read_delim()
reads in files with any delimiter
Outside of the tidy verse
Haven: reading SPSS, Stata and SAS files
readxl: reading excel files .xls & .xlsx
DBI: accessing databases in-conjunction with RMySQL / RSQLite / RPostgreSQL
DEMO: Read in and prepare an Air Pollution csv file
Lecture 4 files: air_pollution.csv
Dealing with wide data
When a variable is spread across multiple columns
This is often a time variable
As seen in our air pollution data
tidyr::gather()
Wide Data
# Demo table
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## *
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
tidyr::gather()
table4a %>%
gather(`1999`, `2000`, key = “year”, value = “cases”
## # A tibble: 6 x 3
## country year cases
##
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
)
gather() notes
Unlike a data.frame column names in a tibble can begin with numbers
To access these column names, you use back ticks var Below the tilde ~ on the keyboard
key, gathering the column names cases, the values from the columns
Long Data
One observation might be scattered across multiple rows.
# Demo Long Table
table2
## # A tibble: 12 x 4
## country year type
##
## 1 Afghanistan 1999 cases
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil
## 6 Brazil
## 7 Brazil
## 8 Brazil
## 9 China
## 10 China
1999 cases 37737
1999 population 172006362
2000 cases 80488
2000 population 174504898
1999 cases 212258
1999 population 1272915272
count
745
tidyr::spread()
table2 %>%
spread(key = type, value = count)
## # A tibble: 6 x 4
## country year cases population
##
## 1 Afghanistan 1999 745
## 2 Afghanistan 2000 2666
## 3 Brazil
## 4 Brazil
## 5 China
## 6 China
19987071
20595360
1999 37737 172006362
2000 80488 174504898
1999 212258 1272915272
2000 213766 1280428583
spread() notes
key: column containing variable names
value: where to obtain values for each column created by the
key
Tidy Verse Demonstration
Demonstration of the tidy-verse for data management.
Hadley Wickham Data Analysis, https://youtu.be/go5Au01Jrvs
Homework
Reading: Poole, M. A., & O’Farrell, P. N. (1971). The assumptions of the linear regression model. Transactions of the Institute of British Geographers, 145-158.
References
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). doi:10.18637/jss.v059.i10.
Wickham, Hadley, and Garrett Grolemund. 2016. “R for Data Science.”