CS计算机代考程序代写 SQL database case study Stat 260, Lecture 7, Relational Data

Stat 260, Lecture 7, Relational Data

Stat 260, Lecture 7, Relational Data

David Stenning

1 / 34

Load packages and datasets

library(tidyverse)
library(nycflights13)
# Example from http://www.itl.nist.gov/div897/ctg/dm/sql_examples.htm
station <- read_csv("station.csv",col_types=cols(ID=col_integer())) stats <- read_csv("stats.csv",col_types=cols(ID=col_integer())) 2 / 34 Reading Required Reading: I Relational Data: Chapter 13 of online text. I In the sections on “Joins”, we will focus on the left-join (in the Mutating Joins section of the online text) and the semi-join (in the Filtering Joins section of the online text). Useful Reference: I Data transformation (dplyr) cheatsheet at [https://github.com/rstudio/cheatsheets/raw/master/data- transformation.pdf] 3 / 34 https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf Multiple Tables I Modern data comes in multiple tables, called relational data. I Such structure is motivated by relational database management systems (RDBMS) that revolutionized database management. I Example station and stats tables for weather data: station ## # A tibble: 3 x 5 ## ID City State Lat_N Long_W ##
## 1 13 Phoenix AZ 33 112
## 2 44 Denver CO 40 105
## 3 66 Caribou ME 47 68
stats

## # A tibble: 6 x 4
## ID Month Temp_F Rain_I
##
## 1 13 1 57.4 0.31
## 2 13 7 91.7 5.15
## 3 44 1 27.3 0.18
## 4 44 7 74.8 2.11
## 5 66 1 6.7 2.1
## 6 66 7 65.8 4.52

4 / 34

Relations in the Weather Data

I The relation, or connection between station and stats is the
ID variable present in both.

I Think of the IDs as short-hand for the info in the station
table.
I For example, ID 13 from station is short-hand for: Phoenix

AZ at latitude 33N and longitude 112W.
I The stats table is much more concise for not repeating the

info on each station.

5 / 34

Joining tables

I However, in some cases we may wish to include information,
such as station name, in the stats table.
I The text calls this a “mutating join”.

I Or, we may wish to filter weather measurements in stats
based on the characteristics of the stations, such as latititude
Lat_N >= 40.
I The text calls this a “filtering join”.

6 / 34

The nycflights13 Relational Data

I We used the flights data from nycflights13. There are
several other tables in this package:

print(airlines,n=3)

## # A tibble: 16 x 2
## carrier name
##
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## # … with 13 more rows
print(airports,n=3)

## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
##
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/New_Y~
## 2 06A Moton Field Municipal Airp~ 32.5 -85.7 264 -6 A America/Chica~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/Chica~
## # … with 1,455 more rows

7 / 34

print(planes,n=3)

## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
##
## 1 N10156 2004 Fixed wing mu~ EMBRAER EMB-1~ 2 55 NA Turbo-~
## 2 N102UW 1998 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## 3 N103US 1999 Fixed wing mu~ AIRBUS INDUST~ A320-~ 2 182 NA Turbo-~
## # … with 3,319 more rows
print(weather,n=3)

## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
##
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## # … with 26,112 more rows, and 4 more variables: precip ,
## # pressure , visib , time_hour

8 / 34

Relations in the nycflights Tables
I Figure from Section 13.2 of the online text (R for Data

Science):

I Exercise: The relationship between the weather and
airports tables is not shown on the diagram. What is it?

9 / 34

Keys
I The ID variable in the station table is its “primary key”.
I It uniquely identifies observations (rows) in the table.
I The ID column in the stats table is a “foreign key” that links

to the station table.
I Primary/foreign keys are the “relations” in relational data.

print(station,n=3)

## # A tibble: 3 x 5
## ID City State Lat_N Long_W
##
## 1 13 Phoenix AZ 33 112
## 2 44 Denver CO 40 105
## 3 66 Caribou ME 47 68
print(stats,n=3)

## # A tibble: 6 x 4
## ID Month Temp_F Rain_I
##
## 1 13 1 57.4 0.31
## 2 13 7 91.7 5.15
## 3 44 1 27.3 0.18
## # … with 3 more rows

10 / 34

Multiple Keys

I It may take multiple variables to uniquely identify an
observation in a table.

I For example, in the weather table from nycflights13, we
need year-month-day-hour plus origin.

print(weather,n=3)

## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_gust
##
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## # … with 26,112 more rows, and 4 more variables: precip ,
## # pressure , visib , time_hour

11 / 34

Tables with no Primary Key

I Some tables lack the variables needed to uniquely identify its
observations.

I For example, in the flights table year-month-day, flight
and tailnum do not identify the flight.

flights %>%
count(year,month,day,flight,tailnum) %>%
filter(n > 1) %>% print(n=3)

## # A tibble: 11 x 6
## year month day flight tailnum n
##
## 1 2013 2 9 303 2
## 2 2013 2 9 655 2
## 3 2013 2 9 1623 2
## # … with 8 more rows

12 / 34

Surrogate Key

I You can add a “surrogate key” to a table with no primary key
using mutate() and row_number().

I Exercise Select year, month, day, flight and tailnum from
flights and add a surrogate key to this 5-column table.

13 / 34

Joining Tables

I To start, add the data in our stations table to the weather
statistics in the stats table.
I This is what the text calls a mutating joins, because it adds

columns to stats like a call to mutate() would.
stats %>% left_join(station)

## # A tibble: 6 x 8
## ID Month Temp_F Rain_I City State Lat_N Long_W
##
## 1 13 1 57.4 0.31 Phoenix AZ 33 112
## 2 13 7 91.7 5.15 Phoenix AZ 33 112
## 3 44 1 27.3 0.18 Denver CO 40 105
## 4 44 7 74.8 2.11 Denver CO 40 105
## 5 66 1 6.7 2.1 Caribou ME 47 68
## 6 66 7 65.8 4.52 Caribou ME 47 68

14 / 34

A Note on Joins

I The term “join” is from SQL (Structured Query Language),
which is the standard language used to construct RDBMS
queries.

I There are many types of joins. In this class we will focus on
the two that I think are most useful in data analysis, the
left-join (a mutating join) and the semi-join (a filtering join).

I However, for a bit of context, we will discuss inner- and
outer-joins.

15 / 34

Inner Joins

I To illustrate inner- and outer-joins we remove one of the rows
of station.

station <- station[-3,] I Now repeat the left_join(): stats %>% left_join(station)

## # A tibble: 6 x 8
## ID Month Temp_F Rain_I City State Lat_N Long_W
##
## 1 13 1 57.4 0.31 Phoenix AZ 33 112
## 2 13 7 91.7 5.15 Phoenix AZ 33 112
## 3 44 1 27.3 0.18 Denver CO 40 105
## 4 44 7 74.8 2.11 Denver CO 40 105
## 5 66 1 6.7 2.1 NA NA
## 6 66 7 65.8 4.52 NA NA

16 / 34

The Left-Join as an Outer Join

I In the output of left_join(), the data from stats is on the
left.

I The left_join() is an “outer join”, because it keeps
observations in one or more of the tables, in this case the left
table.

I Similarly, a right-join keeps all observations in the right table,
and a full-join keeps all observations in both tables.

17 / 34

The Inner-Join

I The inner-join keeps observations that appear in both tables.
stats %>% inner_join(station)

## # A tibble: 4 x 8
## ID Month Temp_F Rain_I City State Lat_N Long_W
##
## 1 13 1 57.4 0.31 Phoenix AZ 33 112
## 2 13 7 91.7 5.15 Phoenix AZ 33 112
## 3 44 1 27.3 0.18 Denver CO 40 105
## 4 44 7 74.8 2.11 Denver CO 40 105

I This might be good if we only wanted data from stations in
station, but there is a tendency to accidentally loose data.
I Better to use a filtering join (next topic).

18 / 34

Defining the Key Columns

I The station and stats tables are very easy to join because
the primary key in station has the same name as the primary
key in stats.

I The by argument to left_join() lets you specify the keys to
match on.

I The default is by = NULL.
I Uses all variables that appear in both tables; ID for station

and stats.
I This is called a “natural” join.

19 / 34

Left-Joins with nycflights13

I To illustrate left-joins with the nycflights13 data we use the
reduced flights table defined in the text.

flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)

flights2

## # A tibble: 336,776 x 8
## year month day hour origin dest tailnum carrier
##
## 1 2013 1 1 5 EWR IAH N14228 UA
## 2 2013 1 1 5 LGA IAH N24211 UA
## 3 2013 1 1 5 JFK MIA N619AA AA
## 4 2013 1 1 5 JFK BQN N804JB B6
## 5 2013 1 1 6 LGA ATL N668DN DL
## 6 2013 1 1 5 EWR ORD N39463 UA
## 7 2013 1 1 6 EWR FLL N516JB B6
## 8 2013 1 1 6 LGA IAD N829AS EV
## 9 2013 1 1 6 JFK MCO N593JB B6
## 10 2013 1 1 6 LGA ORD N3ALAA AA
## # … with 336,766 more rows

20 / 34

Natural Join of weather

I flights2 and weather share the variables year, month, day,
hour and origin.

flights2 %>% left_join(weather)

## # A tibble: 336,776 x 18
## year month day hour origin dest tailnum carrier temp dewp humid
##
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
## 7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
## 8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
## 9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
## 10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
## # … with 336,766 more rows, and 7 more variables: wind_dir ,
## # wind_speed , wind_gust , precip , pressure ,
## # visib , time_hour

21 / 34

by = x

I Use by=x to join on a specific column.
I For example, year means something different in flights2

and planes. Use only tailnum.
flights2 %>% left_join(planes, by=”tailnum”)

## # A tibble: 336,776 x 16
## year.x month day hour origin dest tailnum carrier year.y type
##
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixe~
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixe~
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixe~
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixe~
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixe~
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixe~
## 7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixe~
## 8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixe~
## 9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixe~
## 10 2013 1 1 6 LGA ORD N3ALAA AA NA
## # … with 336,766 more rows, and 6 more variables: manufacturer ,
## # model , engines , seats , speed , engine

22 / 34

Matching Keys with Different Names

I The airport codes are in either origin or dest in the
flights2 table, and in faa in the airports table.

I Use, e.g., by= c(“dest” = “faa”)
flights2 %>% left_join(airports,by=c(“dest” = “faa”))

## # A tibble: 336,776 x 15
## year month day hour origin dest tailnum carrier name lat lon alt
##
## 1 2013 1 1 5 EWR IAH N14228 UA Geor~ 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA Geor~ 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miam~ 25.8 -80.3 8
## 4 2013 1 1 5 JFK BQN N804JB B6 NA NA NA
## 5 2013 1 1 6 LGA ATL N668DN DL Hart~ 33.6 -84.4 1026
## 6 2013 1 1 5 EWR ORD N39463 UA Chic~ 42.0 -87.9 668
## 7 2013 1 1 6 EWR FLL N516JB B6 Fort~ 26.1 -80.2 9
## 8 2013 1 1 6 LGA IAD N829AS EV Wash~ 38.9 -77.5 313
## 9 2013 1 1 6 JFK MCO N593JB B6 Orla~ 28.4 -81.3 96
## 10 2013 1 1 6 LGA ORD N3ALAA AA Chic~ 42.0 -87.9 668
## # … with 336,766 more rows, and 3 more variables: tz , dst ,
## # tzone

23 / 34

Exercise

I Change the name of the ID column in stats to Station.
I With these modified tables, do a left-join of the stats and

station tables.

24 / 34

Filtering Joins

I In past lectures we have use %in% for filtering a table according
to a character string.
I E.G., gapminder %>% filter(country %in%

c(“Canada”,”United States”))
I Filtering joins are an extension to filter a table according to

another table.
top_dest <- flights2 %>% count(dest,sort=TRUE) %>% head(n=10)
print(top_dest,n=4)

## # A tibble: 10 x 2
## dest n
##
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## # … with 6 more rows

25 / 34

I The “old” way and the semi-join way:
flights2%>% filter(dest %in% top_dest$dest) %>% print(n=4)

## # A tibble: 141,145 x 8
## year month day hour origin dest tailnum carrier
##
## 1 2013 1 1 5 JFK MIA N619AA AA
## 2 2013 1 1 6 LGA ATL N668DN DL
## 3 2013 1 1 5 EWR ORD N39463 UA
## 4 2013 1 1 6 EWR FLL N516JB B6
## # … with 141,141 more rows
flights2 %>% semi_join(top_dest) %>% print(n=4)

## # A tibble: 141,145 x 8
## year month day hour origin dest tailnum carrier
##
## 1 2013 1 1 5 JFK MIA N619AA AA
## 2 2013 1 1 6 LGA ATL N668DN DL
## 3 2013 1 1 5 EWR ORD N39463 UA
## 4 2013 1 1 6 EWR FLL N516JB B6
## # … with 141,141 more rows

26 / 34

Notes

I The n column of the top_dest table used for filtering does not
appear in the output.

I This ensures the same behaviour as the “old” way.
I The power of semi_join() is in matching to multiple

columns, as in the following exercise.

27 / 34

Exercise

I From the original flights table, create a table called
top_dep_delay comprised of the year-month-days with the 3
largest total delays, where total delay is defined as the sum of
the dep_delay variable for each year-month-day.
I Hints: use group_by() to group flights by year-month-day;

use summarize() to compute total delays (watch out for
missing values); use arrange() to sort on your total delays
variable (you want to sort in descending order)

I Do a semi-join to filter flights to these days.

28 / 34

Set Operations

I An add-on to this chapter is the set operations: intersect(),
union(), and setdiff(), which can act on pairs of tables.

I Illustrate with intersect().
v1 <- c("apple","pen","pineapple"); v2 <- c("apple","orange","grape") intersect(v1,v2) ## [1] "apple" df1 <- tibble(x=c(1,2),y=c(1,1)); df2 <- tibble(x=c(1,1),y=c(1,2)) intersect(df1,df2) ## # A tibble: 1 x 2 ## x y ##
## 1 1 1

29 / 34

Case Study: TB Map

tb <- read_csv("tb.csv") %>%
select(-new_sp, -contains(“04”), -contains(“514”),

-new_sp_mu, -new_sp_fu) %>%
gather(new_sp_m014:new_sp_f65,

key=demog,value=count,na.rm=TRUE)%>%
mutate(demog = substr(demog,8,12)) %>%
separate(demog, into=c(“gender”,”agecat”),sep=1)

tb2 <- tb %>% spread(key=gender,value=count) %>%
mutate(count = m+f) %>%
select(-m,-f) %>%
filter(!is.na(iso2))

30 / 34

I Use map data from the maps package
library(maps) # install.packages(“maps”) first
ww <- as_tibble(map_data("world")) %>%

mutate(iso2 = iso.alpha(region))
ww

## # A tibble: 99,338 x 7
## long lat group order region subregion iso2
##
## 1 -69.9 12.5 1 1 Aruba AW
## 2 -69.9 12.4 1 2 Aruba AW
## 3 -69.9 12.4 1 3 Aruba AW
## 4 -70.0 12.5 1 4 Aruba AW
## 5 -70.1 12.5 1 5 Aruba AW
## 6 -70.1 12.6 1 6 Aruba AW
## 7 -70.0 12.6 1 7 Aruba AW
## 8 -70.0 12.6 1 8 Aruba AW
## 9 -69.9 12.5 1 9 Aruba AW
## 10 -69.9 12.5 1 10 Aruba AW
## # … with 99,328 more rows

31 / 34

I Extract TB data on children in 2000 and join to ww.
tbchild00 <- tb2 %>%

filter(agecat == “014”,year==2000) %>%
select(iso2,count)

wwchild00 <- ww %>% left_join(tbchild00)

32 / 34

I Can’t distinguish countries on the count scale, so transform to
log-count (plus 1 to avoid log of 0).

ggplot(wwchild00,aes(x=long,y=lat,fill=log(count+1),
group=group)) +

geom_polygon() + coord_quickmap()

−50

0

50

−100 0 100 200
long

la
t

0

2

4

6

8
log(count + 1)

33 / 34

Improvements

I Crude numbers of cases are not very informative. Large
countries have large counts.

I Would have been better as rates:
I Need a table of the population of children in each country in

2000 and the ISO 2 country code.
I Join population to the TB data, and calculate rates per 100,000

34 / 34