Practical 8. Assignment 1 support session
Lex Comber January 2018
Contents

1. Introduction
2. The data
3. Useful database functions in dplyr
   3.1 Single table manipulation
   Task 1
   3.2 Two table manipulation
   Task 2
4. Final considerations
1. Introduction
You have been set the task of answering the following three questions in Assignment 1:
• Q1: What is the household penetration of ‘PRIVATE LABEL THIN SPAGHETTI’? That is, out of all customers purchasing Pasta, what percent purchase this brand?
• Q2: How does the household penetration of the product PRIVATE LABEL THIN SPAGHETTI vary within the two regions, relative to the sales of Pasta in each region?
• Q3: For which category of products (pasta, pasta sauces, syrups and pancake mixes) does the provision of coupons appear to have a positive impact? For this you should consider which customers first purchased an item in a category using a coupon, and then how many of these customers made additional purchases of the item in the category.
The Assignment submission should be around 1000 words and accounts for 50% of the final module mark. It is based on the Dunnhumby data, Carbo-Loading: A Relational Database, as described at https://www.dunnhumby.com/sourcefiles. You will need to download the data from the site. It includes 4 datasets in CSV / SAV formats and a User Guide. You have read .csv files into R before using the read.csv function.
For this practical session you will need to have the following packages installed and loaded.
# test for package existence and install
if (!is.element("tidyverse", installed.packages())) install.packages("tidyverse", dep = T)
if (!is.element("nycflights13", installed.packages())) install.packages("nycflights13", dep = T)
# load into the R session
library(tidyverse)
library(nycflights13)
2. The data
You should read the data into R and, if you want to take advantage of the relational database functionality of the tibble format and the dplyr functions wrapped in the tidyverse package, convert them to tibble format. Note: some of the files are large and may take a little while to load into R.
Then, as ever, you should inspect them (note that these have been named as follows in the below: cl for dh_causal_lookup.csv, pl for dh_product_lookup.csv, sl for dh_store_lookup.csv and tr for dh_transactions.csv).
library(tidyverse)
cl <- as_tibble(read.csv("dunnhumby/dh_causal_lookup.csv"))
pl <- as_tibble(read.csv("dunnhumby/dh_product_lookup.csv"))
sl <- as_tibble(read.csv("dunnhumby/dh_store_lookup.csv"))
tr <- as_tibble(read.csv("dunnhumby/dh_transactions.csv"))

cl

## # A tibble: 351,372 x 6
##           upc store  week          feature_desc   display_desc geography
##         <dbl> <int> <int>                <fctr>         <fctr>     <int>
##  1 7680850108     1    68 Wrap Interior Feature Not on Display         1
##  2 5100001212     1    66     Wrap Back Feature Not on Display         1
##  3 5100002792     1    72 Interior Page Feature Not on Display         1
##  4 3620000300     1    55 Wrap Interior Feature Not on Display         1
##  5 4112907742     1    68 Wrap Interior Feature Not on Display         1
##  6 3620000250     1    55 Wrap Interior Feature Not on Display         1
##  7 5100001210     1    66     Wrap Back Feature Not on Display         1
##  8 3620000218     1    55 Wrap Interior Feature Not on Display         1
##  9 3620000217     1    55 Wrap Interior Feature Not on Display         1
## 10 6172005010     1    76        Not on Feature   Rear End Cap         1
## # ... with 351,362 more rows

pl

## # A tibble: 927 x 5
##          upc         product_description   commodity         brand
##        <dbl>                      <fctr>      <fctr>        <fctr>
##  1 111112360   VINCENT S ORIG MARINARA S pasta sauce     Vincent's
##  2 566300023         PINE MOUNTAIN SYRUP      syrups Pine Mountain
##  3 566300028           MILLER CANE SYRUP      syrups        Miller
##  4 566300029           MILLER CANE SYRUP      syrups        Miller
##  5 566300035         PINE MOUNTAIN SYRUP      syrups Pine Mountain
##  6 601011292   BARILLA MARINARA PSTA SCE pasta sauce       Barilla
##  7 601011293   BARILLA SWT PEPPER/GARLIC pasta sauce       Barilla
##  8 601011294   BARILLA ARRABBIATA SPCY P pasta sauce       Barilla
##  9 601011295 "BARILLA\\BL GRN OLIVE SCE" pasta sauce       Barilla
## 10 601011296      BARILLA BASILICO SAUCE pasta sauce       Barilla
## # ... with 917 more rows, and 1 more variables: product_size <fctr>

sl

## # A tibble: 387 x 2
##    store store_zip_code
##    <int>          <int>
##  1     1          37865
##  2     2          30084
##  3     3          30039
##  4     4          31210
##  5     5          30044
##  6     6          31204
##  7     7          30064
##  8     8          37918
##  9     9          31406
## 10    10          35801
## # ... with 377 more rows

tr

## # A tibble: 5,197,681 x 11
##           upc dollar_sales units time_of_transaction geography  week
##         <dbl>        <dbl> <int>               <int>     <int> <int>
##  1 7680850106         0.80     1                1100         2     1
##  2 3620000470         3.59     1                1100         2     1
##  3 1800028064         2.25     1                1137         2     1
##  4 9999985067         0.85     1                1148         2     1
##  5 9999985131         2.19     1                1323         2     1
##  6 5100002794         2.19     1                1323         2     1
##  7 1800000957         3.45     1                1415         2     1
##  8 9999985051         1.29     1                1415         2     1
##  9 9999985053         0.75     1                1513         2     1
## 10 3620000446         2.19     1                1523         2     1
## # ... with 5,197,671 more rows, and 5 more variables: household <int>,
## #   store <int>, basket <int>, day <int>, coupon <int>
The Guide to the data includes the following figure (Figure 1), which shows how the different datasets can be linked together using commonly named attributes:
3. Useful database functions in dplyr
It is recommended that you explore the help for the dplyr package:
?dplyr
If you go to the index link at the bottom of the help page that opens, you will be taken to an alphabetical list of the functions in the package. At the top of that page, in common with many R packages, there is a link to the User guides, package vignettes and other documentation. You could click on this to explore how the package is coded, how it uses memory efficiently, and the SQLite database functionality that dplyr draws from.
3.1 Single table manipulation
Perhaps the place to start is the main vignette in dplyr:

vignette("dplyr", package = "dplyr")
And the two table operations
vignette("two-table", package = "dplyr")
Figure 1: The Dunnhumby data schema.
You will need the nycflights13 package for these vignettes. It contains datasets that are used to illustrate the operation of the dplyr functions.
This vignette illustrates the basic tools for manipulating a single dataset that are supported in dplyr:
• filter() (and slice())
• arrange()
• select() (and rename())
• distinct()
• mutate() (and transmute())
• summarise()
• sample_n() (and sample_frac())
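To see what these verbs do before applying them to the Dunnhumby tables, here is a minimal sketch on a small made-up tibble (toy is invented purely for illustration):

```r
library(tidyverse)

# a small made-up transactions table
toy <- tibble(
  household    = c(1, 1, 2, 3, 3, 3),
  commodity    = c("pasta", "syrups", "pasta", "pasta", "pasta sauce", "pasta"),
  dollar_sales = c(0.80, 2.10, 1.20, 0.90, 2.50, 1.10))

# filter: keep only the pasta transactions
toy %>% filter(commodity == "pasta")

# arrange: sort by spend, largest first
toy %>% arrange(desc(dollar_sales))

# mutate: add a derived column
toy %>% mutate(cents = dollar_sales * 100)

# summarise (here with group_by): total spend per household
toy %>% group_by(household) %>% summarise(total = sum(dollar_sales))
```

The same verbs work unchanged on tr and pl once they are in tibble format.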
You have used some of these already in earlier practicals. You should load the nycflights13 data:
library(nycflights13)
flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
Task 1
Your task is to work through and explore these tools, following the example code snippets in the vignette, and to consider how they relate to your Assignment objectives. You should just about be able to answer Q1 using the functions in this section if you know the code for the brand, although it may be easier using some of the merge and join functions described later.
Some logical tests and the match and unique functions could be useful.
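If match and unique are unfamiliar, their behaviour is easy to check on small made-up vectors before applying them to the upc codes:

```r
# made-up upc codes for a set of products, and a vector of transaction upcs
upcs      <- c(111, 222, 333)
trans.upc <- c(222, 999, 111, 222)

# match gives the position of each transaction upc in upcs, or NA
match(trans.upc, upcs)          # 2 NA 1 2

# so !is.na(match(...)) flags the transactions of interest
!is.na(match(trans.upc, upcs))  # TRUE FALSE TRUE TRUE

# unique drops duplicates, e.g. repeat purchases by one household
unique(c(1, 1, 2, 3, 3))        # 1 2 3
```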
# Q1: What is the household penetration of `PRIVATE LABEL THIN SPAGHETTI`?
# That is, out of all customers purchasing Pasta, what percent purchase this brand?
# There are a number of stages to this
product_name <- "PRIVATE LABEL THIN SPAGHETTI" # check the exact spelling in pl$product_description
commodity_name <- "pasta"
# 1. Identify the upc codes for the product and the commodity
index1 <- which(pl$product_description == product_name)
index2 <- which(pl$commodity == commodity_name)
# product
upc.1 <- unlist(pl[index1, "upc"])
# commodity
upc.2 <- unlist(pl[index2, "upc"])
# 2. Then match these to the transactions upc:
# filter and get rid of the NAs
tmp1 <- filter(tr, !is.na(match(tr$upc, upc.1)))
tmp2 <- filter(tr, !is.na(match(tr$upc, upc.2)))
# 3. Then subset the transactions data and determine
# how many households buy the product / commodity
n.product <- length(unique(tmp1$household))
n.commod <- length(unique(tmp2$household))
# the household penetration (%) is then 100 * n.product / n.commod
3.2 Two table manipulation
Perhaps the place to start is the section entitled Two-table verbs in the dplyr two-table vignette (dplyr::two-table); again, if you click on the source link you will see the .Rmd (RMarkdown) code that compiles to produce the vignette.
The details of the joins and set operations in this section are at the heart of developing a successful analysis for Assignment 1. Again you should work through these illustrations yourself, with a view to understanding how they could be used in your analysis. If you undertook the answer to Q1 using the single table manipulation functions listed in Section 3.1, you should be able to identify how you could have done it using the different join functions with filter.
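For example, a join-based sketch of Q1 might look something like the below. Note that this uses small made-up stand-ins for pl and tr so that it runs in isolation; with the real tables you would drop the first two assignments (and check the exact brand spelling in product_description):

```r
library(tidyverse)

# made-up stand-ins for the product lookup (pl) and transactions (tr)
pl <- tibble(
  upc = c(101, 102, 103),
  product_description = c("PRIVATE LABEL THIN SPAGHETTI",
                          "BARILLA SPAGHETTI", "MILLER CANE SYRUP"),
  commodity = c("pasta", "pasta", "syrups"))
tr <- tibble(upc       = c(101, 102, 102, 103, 101),
             household = c(1, 1, 2, 3, 4))

# households buying any pasta product
pasta.hh <- tr %>% inner_join(pl, by = "upc") %>%
  filter(commodity == "pasta") %>% distinct(household)

# households buying the specific brand
brand.hh <- tr %>% inner_join(pl, by = "upc") %>%
  filter(product_description == "PRIVATE LABEL THIN SPAGHETTI") %>%
  distinct(household)

# penetration: % of pasta-buying households purchasing the brand
100 * nrow(brand.hh) / nrow(pasta.hh)
```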
Similarly, the answer to Q2 can be derived from Q1 if you can split the subsetting / results of the filter functions into the 2 geographies.
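One possible shape for that Q2 split, again sketched on made-up data, is to group by geography before counting distinct households (the real tr codes the two regions as 1 and 2 in its geography column; brand.buy is an invented flag here):

```r
library(tidyverse)

# made-up joined transactions; brand.buy flags purchases of the target brand
trans <- tibble(
  geography = c(1, 1, 1, 2, 2),
  household = c(1, 2, 1, 3, 4),
  brand.buy = c(TRUE, FALSE, TRUE, TRUE, FALSE))

# per-region penetration: distinct brand buyers over distinct buyers
pen <- trans %>%
  group_by(geography) %>%
  summarise(n.all   = n_distinct(household),
            n.brand = n_distinct(household[brand.buy]),
            pct     = 100 * n.brand / n.all)
pen
```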
Task 2
Your task is to work through and explore these tools, following the example code snippets in the vignette, and to consider how they relate to your Assignment objectives – you should be able to answer Q1, Q2 and Q3 using the functions in this section. Again the filter and join functions will be useful, as will logical or grep functions to select the product family (commodity) and brand (product_description). Also, notice how the pipe operator (%>%) can be used to push the results / outputs of one operation into the inputs of another.
4. Final considerations
The third part of the Assignment requires you to do some thinking: your task is to determine for which category of products (pasta, pasta sauces, syrups and pancake mixes) the provision of coupons appears to have a positive impact. And it has been suggested that you should consider which customers first purchased an item in a category using a coupon, and then how many of these customers made additional purchases of items in that category.
Some further suggestions:
- think about what the day and coupon variables indicate
- think how arrange from dplyr could be used to sort the data
- break the problem down into discrete steps (e.g. create a list of all ‘pasta’ transactions; create a list of all customers who purchased pasta with a coupon; . . . etc)
- do it for just one product category (e.g. pasta) to get the steps etc working
- do it for a sample of the transactions data so that you can develop the code efficiently: working with the whole dataset may take time for some of the operations you will have to undertake
- think about the outputs you want – number of customers? proportion of customers? both?
- then think about putting it all into functions, that when run together in sequence, return the desired outputs
my.q3.func <- function(prod.cat){
  # some code
  # some more code
  # etc ...
  return(val)
}
• if functions use data held in working memory, then they may need to be defined (nested) within a function:
my.q3.func <- function(prod.cat){
  # some code to create temp.variable
  my.q3.nested.func <- function(temp.variable) {
    # some code
    return(return.val)
  }
  # some more code, etc ...
  return(val)
}
• Then you should be able to pass variables to the function in a similar way to the below:

my.q3.func("pasta")
my.q3.func("pasta sauces")
# ... etc

So, for example, the way that I did it was to write a function that could be called to analyse a commodity such as pasta. For commod – the name of the commodity – it does the following:
1. Finds households who bought with a coupon
# link the product and transaction data
# filter for coupon and commodity
tr %>% left_join(pl) %>%
  filter(coupon == 1) %>%
  filter(commodity == commod) -> r1
# the unique households that bought commod with a coupon
h <- unique(r1$household)
2. Links the product and transaction data again
tr %>% left_join(pl) %>% filter(commodity == commod) -> r2
3. Then it has a function to examine each household in h, which does the following:

# - extracts the household's transactions from r2
# - sorts these by day and the inverse of coupon (i.e. coupon going from 1 to 0)
# - extracts (selects) the coupon variable
r2 %>% filter(household == x) %>% arrange(day, -coupon) %>% select(coupon) %>% unlist -> tmp

4. Finally it tests whether the first element of the coupon variable is 1

# (i.e. they used a coupon the first time)
# AND that its length is > 1 (i.e. the household bought the commodity more than once)
if (tmp[1] == 1 & length(tmp) > 1) return(x)
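Putting steps 1–4 together, one possible shape for the function is sketched below. This is only an illustration of the structure described above, not the definitive answer: the helper name first.buy, the sapply over h and the returned counts are my own choices, and the small made-up pl and tr stand-ins would be replaced by the real tables:

```r
library(tidyverse)

# small made-up stand-ins for the real product lookup and transactions
pl <- tibble(upc = c(1, 2),
             product_description = c("A PASTA", "A SYRUP"),
             commodity = c("pasta", "syrups"))
tr <- tibble(upc       = c(1, 1, 1, 2),
             household = c(10, 10, 20, 30),
             day       = c(1, 5, 2, 1),
             coupon    = c(1, 0, 0, 1))

my.q3.func <- function(commod) {
  # 1. households that bought the commodity with a coupon
  tr %>% left_join(pl, by = "upc") %>%
    filter(coupon == 1, commodity == commod) -> r1
  h <- unique(r1$household)
  # 2. all transactions for the commodity
  tr %>% left_join(pl, by = "upc") %>%
    filter(commodity == commod) -> r2
  # 3. & 4. for each household in h, check that its first purchase
  # (sorted by day, with coupon going 1 to 0 within day) used a coupon
  # AND that it purchased more than once
  first.buy <- function(x) {
    r2 %>% filter(household == x) %>%
      arrange(day, -coupon) %>%
      select(coupon) %>% unlist -> tmp
    if (tmp[1] == 1 & length(tmp) > 1) return(x) else return(NULL)
  }
  repeat.buyers <- unlist(sapply(h, first.buy))
  # counts of coupon-first households, and of those that bought again
  c(n.coupon = length(h), n.repeat = length(repeat.buyers))
}

my.q3.func("pasta")
```

With the full data, the same call pattern (my.q3.func("pasta"), my.q3.func("pasta sauces"), . . . etc) then lets you compare the categories.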