#####################################
########## Review of joins ##########
#####################################
# Use the tidyverse and nycflights13 packages
library(tidyverse)
library(nycflights13)
#######################
###### Left join ######
#######################
# First, use select() to remove unwanted variables for this demonstration
flights2 <- flights %>% select(year:day, hour, tailnum, carrier)
# Now, let’s examine the flights2 tibble and the airlines tibble
flights2
airlines
# Notice that both the flights2 and airlines tibble contain the variable `carrier`.
# In addition, airlines containes the variable `name`. We can therefore add `name`
# to `flights2` using a left_join().
flights2_new1 <- flights2 %>% left_join(airlines,by=”carrier”)
flights2_new1
# Like mutate, left_join() adds variable to the right. It keeps all observations in “x,”
# which for this example is flights2.
# Because `carrier` is the only variable that appears in both flights2 and airlines, we
# did not need to specify “by = ” in our code:
flights2_new2 <- flights2 %>% left_join(airlines)
flights2_new2
# We can ask whether these two tibbles are indeed identical:
identical(flights2_new1,flights2_new2)
# Does order matter, i.e., left_join(x,y) vs. left_join(y,x); i.e., x %>% left_join(y), etc.
airlines %>% left_join(flights2) #Do not do this!
# Above, we get a tibble with the same number of rows and columns and essentially
# containing the same information, but the ordering is different. (This ordering,
# particularly of the columns, is less “natural”).
# As seen in the required reading, we *do* need to use “by = ” if there are multiple
# common variables and we need to specify which to join on.
flights2
planes
# Because both tables have a variable called “year,” and because they mean different
# things (year of flight vs. year of manufacture for flights2 vs. planes, respectively),
# we need to specify to join by `tailnum`:
flights2 %>% left_join(planes, by=”tailnum”)
# Note that we keep both `year` variables, now called `year.x` and `year.y` to be clear.
# What if we didn’t specify “by”?
flights2 %>% left_join(planes)
# Notice that R doesn’t return an error, but rather returns “NA” when matches can’t be
# found (because of the different meaning of `year`).
# Finally, it is sometimes helpful to link variable `a` in table `x` to a variable `b`
# in table `y`, which we can do with: by = c(“a” = “b”).
flights3 <- flights %>% select(year:day, hour, origin, dest, tailnum, carrier)
flights3
airports
# We can join based on the origin airport:
flights3 %>% left_join(airports, by=c(“origin” = “faa”))
# Or we can join based on the destination airport:
flights3 %>% left_join(airports, c(“dest” = “faa”))
flights3 %>% left_join(airports, by=c(“origin” = “faa”)) %>% left_join(airports, c(“dest” = “faa”))
### What we generally find is that x %>% left_join(y) is going to keep all the
### observations (i.e.,rows) in x, while adding additional columns from y to the
### right, similar to mutate(). (This is why doing airlines %>% left_join(flights2)
### is a bit improper, even if the same information is retained. The table “airlines”
### is in a way just a key that tells us the full airline name based on the carrier
### code, which is why it is only 16×2, whereas flights2 contains much more information
### with 336,766 rows. So when doing a left_join, we want to add a column(s) to flights2,
### not add both columns and hundreds of thousands of rows to airlines; the latter will
### also be slower.)
### for x and y; left_join(x,y), or x %>% left_join(y)
#######################
###### Semi join ######
#######################
# The code x %>% semi_join(y) keeps *only* the observations (rows) in x that have a match in y.
# Suppose we want to find the top 4 most popular *destinations*:
top_dest <- flights %>% count(dest,sort=TRUE) %>% head(4)
top_dest
# We can get information about each flight that went to those top destinations:
tmp <- flights %>% semi_join(top_dest)
# Note that the number of columns in the above tibble remains the same as in flights,
# but the number of rows have change because we’ve removed all the rows that do not
# correspond to a value of `dest` in `top_dest`, i.e., for which the destination
# airport was not one of the top 4 destinations: ORD, ATL, LAX, BOS.
# For semi_join, order is very important. x %>% semi_join(y) is *not* the same as
# y %>% semi_join(x):
top_dest %>% semi_join(flights) # DO NOT DO THIS IN PRACTICE!
# Unlike left_join(), semi_join() does not duplicate observations; only the existence
# of a match is important. It also does not add variables to `x`.
##############################
###### Summary of joins ######
##############################
### In summary, you would want to use left_join() if you would instead be using a series
### of complicated mutate() statements, whereas you would want to use semi_join() in
### place of a series of complicated filter() functions. You generally want the number
### of rows to remain but the number of columns to expand when using left_join(). You
### generally want the number of columns to remain the same but the number of rows to
### be reduced when using semi_join.