Stat 260, Lecture 7, Relational Data
Stat 260, Lecture 7, Relational Data
David Stenning
1 / 34
Load packages and datasets
library(tidyverse)
library(nycflights13)
# Example from http://www.itl.nist.gov/div897/ctg/dm/sql_examples.htm
station <- read_csv("station.csv",col_types=cols(ID=col_integer()))
stats <- read_csv("stats.csv",col_types=cols(ID=col_integer()))
2 / 34
Reading
Required Reading:
I Relational Data: Chapter 13 of online text.
I In the sections on “Joins”, we will focus on the left-join (in the
Mutating Joins section of the online text) and the semi-join (in
the Filtering Joins section of the online text).
Useful Reference:
I Data transformation (dplyr) cheatsheet at
[https://github.com/rstudio/cheatsheets/raw/master/data-
transformation.pdf]
3 / 34
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf
Multiple Tables
I Modern data comes in multiple tables, called relational data.
I Such structure is motivated by relational database management
systems (RDBMS) that revolutionized database management.
I Example station and stats tables for weather data:
station
## # A tibble: 3 x 5
## ID City State Lat_N Long_W
##
## 1 13 Phoenix AZ 33 112
## 2 44 Denver CO 40 105
## 3 66 Caribou ME 47 68
stats
## # A tibble: 6 x 4
## ID Month Temp_F Rain_I
##
## 1 13 1 57.4 0.31
## 2 13 7 91.7 5.15
## 3 44 1 27.3 0.18
## 4 44 7 74.8 2.11
## 5 66 1 6.7 2.1
## 6 66 7 65.8 4.52
4 / 34
Relations in the Weather Data
I The relation, or connection between station and stats is the
ID variable present in both.
I Think of the IDs as short-hand for the info in the station
table.
I For example, ID 13 from station is short-hand for: Phoenix
AZ at latitude 33N and longitude 112W.
I The stats table is much more concise for not repeating the
info on each station.
5 / 34
Joining tables
I However, in some cases we may wish to include information,
such as station name, in the stats table.
I The text calls this a “mutating join”.
I Or, we may wish to filter weather measurements in stats
based on the characteristics of the stations, such as latititude
Lat_N >= 40.
I The text calls this a “filtering join”.
6 / 34
The nycflights13 Relational Data
I We used the flights data from nycflights13. There are
several other tables in this package:
print(airlines,n=3)
## # A tibble: 16 x 2
## carrier name
##
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## # … with 13 more rows
print(airports,n=3)
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
##
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Y~
## 2 06A Moton Field Municipal Airp~ 32.5 -85.7 264 -6 A America/Chica~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chica~
## # … with 1,455 more rows
7 / 34
print(planes,n=3)
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
##
## 1 N10156 2004 Fixed wing mu~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 2 N102UW 1998 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 3 N103US 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## # … with 3,319 more rows
print(weather,n=3)
## # A tibble: 26,115 x 15 8 / 34 Relations in the nycflights Tables Science): I Exercise: The relationship between the weather and 9 / 34 Keys to the station table. print(station,n=3) ## # A tibble: 3 x 5 ## # A tibble: 6 x 4 10 / 34 Multiple Keys I It may take multiple variables to uniquely identify an I For example, in the weather table from nycflights13, we print(weather,n=3) ## # A tibble: 26,115 x 15 11 / 34 Tables with no Primary Key I Some tables lack the variables needed to uniquely identify its I For example, in the flights table year-month-day, flight flights %>% ## # A tibble: 11 x 6 12 / 34 Surrogate Key I You can add a “surrogate key” to a table with no primary key I Exercise Select year, month, day, flight and tailnum from 13 / 34 Joining Tables I To start, add the data in our stations table to the weather columns to stats like a call to mutate() would. ## # A tibble: 6 x 8 14 / 34 A Note on Joins I The term “join” is from SQL (Structured Query Language), I There are many types of joins. In this class we will focus on I However, for a bit of context, we will discuss inner- and 15 / 34 Inner Joins I To illustrate inner- and outer-joins we remove one of the rows station <- station[-3,]
I Now repeat the left_join():
stats %>% left_join(station) ## # A tibble: 6 x 8 16 / 34 The Left-Join as an Outer Join I In the output of left_join(), the data from stats is on the I The left_join() is an “outer join”, because it keeps I Similarly, a right-join keeps all observations in the right table, 17 / 34 The Inner-Join I The inner-join keeps observations that appear in both tables. ## # A tibble: 4 x 8 I This might be good if we only wanted data from stations in 18 / 34 Defining the Key Columns I The station and stats tables are very easy to join because I The by argument to left_join() lets you specify the keys to I The default is by = NULL. and stats. 19 / 34 Left-Joins with nycflights13 I To illustrate left-joins with the nycflights13 data we use the flights2 <- flights %>% flights2 ## # A tibble: 336,776 x 8 20 / 34 Natural Join of weather I flights2 and weather share the variables year, month, day, flights2 %>% left_join(weather) ## # A tibble: 336,776 x 18 21 / 34 by = x I Use by=x to join on a specific column. and planes. Use only tailnum. ## # A tibble: 336,776 x 16 22 / 34 Matching Keys with Different Names I The airport codes are in either origin or dest in the I Use, e.g., by= c(“dest” = “faa”) ## # A tibble: 336,776 x 15 23 / 34 Exercise I Change the name of the ID column in stats to Station. station tables. 24 / 34 Filtering Joins I In past lectures we have use %in% for filtering a table according c(“Canada”,”United States”)) another table. ## # A tibble: 10 x 2 25 / 34 I The “old” way and the semi-join way: ## # A tibble: 141,145 x 8 ## # A tibble: 141,145 x 8 26 / 34 Notes I The n column of the top_dest table used for filtering does not I This ensures the same behaviour as the “old” way. columns, as in the following exercise. 27 / 34 Exercise I From the original flights table, create a table called use summarize() to compute total delays (watch out for I Do a semi-join to filter flights to these days. 28 / 34 Set Operations I An add-on to this chapter is the set operations: intersect(), I Illustrate with intersect(). 29 / 34 Case Study: TB Map tb <- read_csv("tb.csv") %>% -new_sp_mu, -new_sp_fu) %>% key=demog,value=count,na.rm=TRUE)%>% tb2 <- tb %>% spread(key=gender,value=count) %>% 30 / 34 I Use map data from the maps package mutate(iso2 = iso.alpha(region)) ## # A tibble: 99,338 x 7 31 / 34 I Extract TB data on children in 2000 and join to ww. filter(agecat == “014”,year==2000) %>% wwchild00 <- ww %>% left_join(tbchild00) 32 / 34 I Can’t distinguish countries on the count scale, so transform to ggplot(wwchild00,aes(x=long,y=lat,fill=log(count+1), geom_polygon() + coord_quickmap() −50 0 50 −100 0 100 200 la 0 2 4 6 8 33 / 34 Improvements I Crude numbers of cases are not very informative. Large I Would have been better as rates: 2000 and the ISO 2 country code. 34 / 34
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
##
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## # … with 26,112 more rows, and 4 more variables: precip
## # pressure
I Figure from Section 13.2 of the online text (R for Data
airports tables is not shown on the diagram. What is it?
I The ID variable in the station table is its “primary key”.
I It uniquely identifies observations (rows) in the table.
I The ID column in the stats table is a “foreign key” that links
I Primary/foreign keys are the “relations” in relational data.
## ID City State Lat_N Long_W
##
## 1 13 Phoenix AZ 33 112
## 2 44 Denver CO 40 105
## 3 66 Caribou ME 47 68
print(stats,n=3)
## ID Month Temp_F Rain_I
##
## 1 13 1 57.4 0.31
## 2 13 7 91.7 5.15
## 3 44 1 27.3 0.18
## # … with 3 more rows
observation in a table.
need year-month-day-hour plus origin.
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
##
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## # … with 26,112 more rows, and 4 more variables: precip
## # pressure
observations.
and tailnum do not identify the flight.
count(year,month,day,flight,tailnum) %>%
filter(n > 1) %>% print(n=3)
## year month day flight tailnum n
##
## 1 2013 2 9 303
## 2 2013 2 9 655
## 3 2013 2 9 1623
## # … with 8 more rows
using mutate() and row_number().
flights and add a surrogate key to this 5-column table.
statistics in the stats table.
I This is what the text calls a mutating joins, because it adds
stats %>% left_join(station)
## ID Month Temp_F Rain_I City State Lat_N Long_W
##
## 1 13 1 57.4 0.31 Phoenix AZ 33 112
## 2 13 7 91.7 5.15 Phoenix AZ 33 112
## 3 44 1 27.3 0.18 Denver CO 40 105
## 4 44 7 74.8 2.11 Denver CO 40 105
## 5 66 1 6.7 2.1 Caribou ME 47 68
## 6 66 7 65.8 4.52 Caribou ME 47 68
which is the standard language used to construct RDBMS
queries.
the two that I think are most useful in data analysis, the
left-join (a mutating join) and the semi-join (a filtering join).
outer-joins.
of station.
## ID Month Temp_F Rain_I City State Lat_N Long_W
##
## 1 13 1 57.4 0.31 Phoenix AZ 33 112
## 2 13 7 91.7 5.15 Phoenix AZ 33 112
## 3 44 1 27.3 0.18 Denver CO 40 105
## 4 44 7 74.8 2.11 Denver CO 40 105
## 5 66 1 6.7 2.1
## 6 66 7 65.8 4.52
left.
observations in one or more of the tables, in this case the left
table.
and a full-join keeps all observations in both tables.
stats %>% inner_join(station)
## ID Month Temp_F Rain_I City State Lat_N Long_W
##
## 1 13 1 57.4 0.31 Phoenix AZ 33 112
## 2 13 7 91.7 5.15 Phoenix AZ 33 112
## 3 44 1 27.3 0.18 Denver CO 40 105
## 4 44 7 74.8 2.11 Denver CO 40 105
station, but there is a tendency to accidentally loose data.
I Better to use a filtering join (next topic).
the primary key in station has the same name as the primary
key in stats.
match on.
I Uses all variables that appear in both tables; ID for station
I This is called a “natural” join.
reduced flights table defined in the text.
select(year:day, hour, origin, dest, tailnum, carrier)
## year month day hour origin dest tailnum carrier
##
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # … with 336,766 more rows
hour and origin.
## year month day hour origin dest tailnum carrier temp dewp humid
##
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir
## # wind_speed
## # visib
I For example, year means something different in flights2
flights2 %>% left_join(planes, by=”tailnum”)
## year.x month day hour origin dest tailnum carrier year.y type
##
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA
## # … with 336,766 more rows, and 6 more variables: manufacturer
## # model
flights2 table, and in faa in the airports table.
flights2 %>% left_join(airports,by=c(“dest” = “faa”))
## year month day hour origin dest tailnum carrier name lat lon alt
##
## 1 2013 1 1 5 EWR IAH N14228 UA Geor~ 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Geor~ 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miam~ 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL Hart~ 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chic~ 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort~ 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Wash~ 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orla~ 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic~ 42.0 -87.9 668
## # … with 336,766 more rows, and 3 more variables: tz
## # tzone
I With these modified tables, do a left-join of the stats and
to a character string.
I E.G., gapminder %>% filter(country %in%
I Filtering joins are an extension to filter a table according to
top_dest <- flights2 %>% count(dest,sort=TRUE) %>% head(n=10)
print(top_dest,n=4)
## dest n
##
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## # … with 6 more rows
flights2%>% filter(dest %in% top_dest$dest) %>% print(n=4)
## year month day hour origin dest tailnum carrier
##
## 1 2013 1 1 5 JFK MIA N619AA AA
## 2 2013 1 1 6 LGA ATL N668DN DL
## 3 2013 1 1 5 EWR ORD N39463 UA
## 4 2013 1 1 6 EWR FLL N516JB B6
## # … with 141,141 more rows
flights2 %>% semi_join(top_dest) %>% print(n=4)
## year month day hour origin dest tailnum carrier
##
## 1 2013 1 1 5 JFK MIA N619AA AA
## 2 2013 1 1 6 LGA ATL N668DN DL
## 3 2013 1 1 5 EWR ORD N39463 UA
## 4 2013 1 1 6 EWR FLL N516JB B6
## # … with 141,141 more rows
appear in the output.
I The power of semi_join() is in matching to multiple
top_dep_delay comprised of the year-month-days with the 3
largest total delays, where total delay is defined as the sum of
the dep_delay variable for each year-month-day.
I Hints: use group_by() to group flights by year-month-day;
missing values); use arrange() to sort on your total delays
variable (you want to sort in descending order)
union(), and setdiff(), which can act on pairs of tables.
v1 <- c("apple","pen","pineapple"); v2 <- c("apple","orange","grape")
intersect(v1,v2)
## [1] "apple"
df1 <- tibble(x=c(1,2),y=c(1,1)); df2 <- tibble(x=c(1,1),y=c(1,2))
intersect(df1,df2)
## # A tibble: 1 x 2
## x y
##
## 1 1 1
select(-new_sp, -contains(“04”), -contains(“514”),
gather(new_sp_m014:new_sp_f65,
mutate(demog = substr(demog,8,12)) %>%
separate(demog, into=c(“gender”,”agecat”),sep=1)
mutate(count = m+f) %>%
select(-m,-f) %>%
filter(!is.na(iso2))
library(maps) # install.packages(“maps”) first
ww <- as_tibble(map_data("world")) %>%
ww
## long lat group order region subregion iso2
##
## 1 -69.9 12.5 1 1 Aruba
## 2 -69.9 12.4 1 2 Aruba
## 3 -69.9 12.4 1 3 Aruba
## 4 -70.0 12.5 1 4 Aruba
## 5 -70.1 12.5 1 5 Aruba
## 6 -70.1 12.6 1 6 Aruba
## 7 -70.0 12.6 1 7 Aruba
## 8 -70.0 12.6 1 8 Aruba
## 9 -69.9 12.5 1 9 Aruba
## 10 -69.9 12.5 1 10 Aruba
## # … with 99,328 more rows
tbchild00 <- tb2 %>%
select(iso2,count)
log-count (plus 1 to avoid log of 0).
group=group)) +
long
t
log(count + 1)
countries have large counts.
I Need a table of the population of children in each country in
I Join population to the TB data, and calculate rates per 100,000