R语言代写 Practical 8. Assignment 1 support session

Practical 8. Assignment 1 support session

Lex Comber January 2018

Contents

1. Introduction 1

2. The data 2

3. Useful database functions in dplyr 3 3.1Singletablemanipulation ………………………………… 3 Task 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.2Twotablemanipulation …………………………………. 6 Task 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

4. Final considerations 6

1. Introduction

You have been set the task of answering the following 3 questions in Assignment 1

• Q1: What is the household penetration of ‘PRIVATE LABEL THIN SPAGHETTI’? That is, out of all customers purchasing Pasta, what percent purchase this brand?

• Q2: How does the household penetration of the product PRIVATE LABEL THIN SPAGHETTI vary within the two regions, relative to the sales of Pasta in each region?

• Q3: For which category of products (pasta, pasta sauces, syrups and pancake mixes) does the provision of coupons appear to have a positive impact? For this you should consider which customers first purchased an item in a category using a coupon, and then how many of these customers made additional purchases of the item in the category.

The Assignment submission should be around 1000 words and accounts for 50% of the final module mark. It is based on the Dunnhumby data, Carbo-Loading: A Relational Database as described here https://www. dunnhumby.com/sourcefiles. You will need to download the data from the site. It includes 4 datasets in CSV / SAV formats and a User Guide. You have read .csv files into R before using the read.csv function.

For this practical session you will need to have the following packages installed and loaded.

# test for package existance and install

if (!is.element(“tidyverse”, installed.packages())) install.packages(“tidyverse”, dep = T)

if (!is.element(“nycflights13”, installed.packages())) install.packages(“nycflights13”, dep = T)

# load into the R session

library(tidyverse) library(nycflights13)

1

2. The data

You should read the data into R and if you want to take advantage of the relational database functionality of tibble formats and dplyr functions as wrapped in the tidyverse package, you may wish to read them in and convert them to tibble format. Note: Some of the files are large and may take a little while to load into R.

Then, as ever you should inspect them (note that these have been named as follows in the below: cl for dh_causal_lookup.csv, pl for dh_product_lookup.csv, sl for dh_store_lookup.csv and tr for dh_transactions.csv.

cl

## # A tibble: 351,372 x 6
##           upc store  week          feature_desc   display_desc geography

library(tidyverse)
cl <- as_tibble(read.csv(“dunnhumby/dh_causal_lookup.csv”)) pl <- as_tibble(read.csv(“dunnhumby/dh_product_lookup.csv”)) sl <- as_tibble(read.csv(“dunnhumby/dh_store_lookup.csv”)) tr <- as_tibble(read.csv(“dunnhumby/dh_transactions.csv”))

                                                                      <int>
                                                                          1
                                                                          1
                                                                          1
                                                                          1
                                                                          1
                                                                          1
                                                                          1
                                                                          1
                                                                          1
                                                                          1
##  2 566300023
##  3 566300028
##  4 566300029
##  5 566300035
##  6 601011292   BARILLA MARINARA PSTA SCE pasta sauce
##  7 601011293   BARILLA SWT PEPPER/GARLIC pasta sauce
##  8 601011294   BARILLA ARRABBIATA SPCY P pasta sauce
##  9 601011295 "BARILLA\\BL GRN OLIVE SCE" pasta sauce
## 10 601011296      BARILLA BASILICO SAUCE pasta sauce
## # ... with 917 more rows, and 1 more variables: product_size <fctr>

sl

## # A tibble: 387 x 2
  • ##     store store_zip_code
    
  • ##  <int> <int> ##11 37865
##         <dbl> <int> <int>                <fctr>         <fctr>
##  1 7680850108
##  2 5100001212
##  3 5100002792
##  4 3620000300
##  5 4112907742
##  6 3620000250
##  7 5100001210
##  8 3620000218
##  9 3620000217
## 10 6172005010
## # ... with 351,362 more rows
1    68 Wrap Interior Feature Not on Display
1    66     Wrap Back Feature Not on Display
1    72 Interior Page Feature Not on Display
1    55 Wrap Interior Feature Not on Display
1    68 Wrap Interior Feature Not on Display
1    55 Wrap Interior Feature Not on Display
1    66     Wrap Back Feature Not on Display
1    55 Wrap Interior Feature Not on Display
1    55 Wrap Interior Feature Not on Display
1    76        Not on Feature   Rear End Cap

pl

## # A tibble: 927 x 5
##          upc         product_description   commodity
##        <dbl>                      <fctr>      <fctr>
##  1 111112360   VINCENT S ORIG MARINARA S pasta sauce
PINE MOUNTAIN SYRUP
  MILLER CANE SYRUP
  MILLER CANE SYRUP
               brand
              <fctr>
           Vincent's
syrups Pine Mountain
syrups        Miller
syrups        Miller
syrups Pine Mountain
PINE MOUNTAIN SYRUP
Barilla
Barilla
Barilla
Barilla
Barilla

2

##22
##33
##44
##55
##66
##77
##88
##99
## 10 10
## # … with 377 more rows

tr

## # A tibble: 5,197,681 x 11
##           upc dollar_sales units time_of_transaction geography  week
30084
30039
31210
30044
31204
30064
37918
31406
35801
##         <dbl>
##  1 7680850106
##  2 3620000470
##  3 1800028064
##  4 9999985067
##  5 9999985131
##  6 5100002794
##  7 1800000957
##  8 9999985051
##  9 9999985053
## 10 3620000446
## # ... with 5,197,671 more rows, and 5 more variables: household <int>,
## #   store <int>, basket <int>, day <int>, coupon <int>

The Guide to the data includes the following figure, which shows how the different datasets can be linked together using commonly named attributes:

3. Useful database functions in dplyr
It is recommended that you explore the help for the dplyr packages

?dplyr

If you go to the index link at the bottom of the help page that you open, you will be taken to an alphabetical list of the functions in the package. However at the top of the page, in common with many R packages there is a link to examine the User guides, package vignettes and other documentation. You could click on this explore how the package is coded, how it efficiently uses memory, the database functionality of the SQLite package that dplyr draws from.

3.1 Single table manipulation

Perhaps the place to start are the vignettes in dplyr: vignette(“dplyr”, package = “dplyr”)

And the two table operations

vignette(“two-table”, package = “dplyr”)

<dbl> <int>
 0.80     1
 3.59     1
 2.25     1
 0.85     1
 2.19     1
 2.19     1
 3.45     1
 1.29     1
 0.75     1
 2.19     1
<int>     <int> <int>
 1100         2     1
 1100         2     1
 1137         2     1
 1148         2     1
 1323         2     1
 1323         2     1
 1415         2     1
 1415         2     1
 1513         2     1
 1523         2     1

3

Figure 1: The Dunnhumby data schema.

4

You will need the nycflights13 package for these vignettes. It contains datasets that are used to illustrate the operation of the dplyr functions.

This vignette illustrates the basic tools for manipulating a single dataset that are supported in plyr:

• filter() (and slice()) • arrange()
• select() (and rename()) • distinct()

• mutate() (and transmute())
• summarise()
• sample_n() (and sample_frac())

You have used some of these already in earlier practicals. You should load the nycflights13 data:

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time

library(nycflights13) flights

##    <int> <int> <int>    <int>
<int>     <dbl>    <int>
  515         2      830
  529         4      850
  540         2      923
  545        -1     1004
  600        -6      812
  558        -4      740
  600        -5      913
  600        -3      709
  600        -3      838
  600        -2      753
##  1  2013
##  2  2013
##  3  2013
##  4  2013
##  5  2013
##  6  2013
##  7  2013
##  8  2013
##  9  2013
## 10  2013
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,

## # ## # ## #

arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
minute <dbl>, time_hour <dttm>
1     1      517
1     1      533
1     1      542
1     1      544
1     1      554
1     1      554
1     1      555
1     1      557
1     1      557
1     1      558

Task 1

Your task is to work through and explore these tools, following the example code snippets in the vignette and consider how they relate to your Assignment objectives – you should just about be able to answer Q1 using the functions in this section if you know what the code is for the brand, although it may be easier using some of the merge and join functions described later.

Some logical lests and the match and unique functions could be useful.

# Q1: What is the household penetration of `PRIVATE LABEL THIN SPAGHETTI`?
# That is, out of all customers purchasing Pasta, what percent purchase this brand?

# There are a number of stages to this
# 1. Identify the upc codes for product and commodity index1 <- which(pl$product_description == product_name) index2 <- which(pl$commodity == commodity_name)
# product
upc.1 <- unlist(pl[index1,”upc”])
# commodity

5

upc.2 <- unlist(pl[index2,”upc”])
# 2. then match these to the transactions upc
# filter and get rid of the NAs
tmp1 <- filter(tr, !is.na(match(tr$upc, upc.1)))
tmp2 <- filter(tr, !is.na(match(tr$upc, upc.2)))
# 3. then subset the transactions data
# and determine the how many hosueholds buy product / commodity n.product <- length(unique(tmp1$household))
n.commod <- length(unique(tmp2$household))

3.2 Two table manipulation

Perhaps the place to start is the section entitled Two-table verbs at dplyr::two-table, and again if you click on the source link you will see the .Rmd (RMarkdown) code that compiles to produce the vignette.

The details of the joins and set operations in this section are at the heart of developing a successful analysis for Assignment 1. Again you should work through these illustrations yourself, with a view to understanding how they could be used in your analysis. If you undertook the answer to Q1 using the single table manipulation functions listed in Section 3.1, you should be able to identify how you could have done it using the different join functions with filter.

Similarly, the answer to Q2 can be derived from Q1 if you can split the subsetting / results of the filter functions into the 2 geographies.

Task 2

Your task is to work through and explore these tools, following the example code snippets in the vignette and consider how they relate to your Assignment objectives – you should be able to Q1, Q2 and Q3 using the functions in this section. Again the filter and join functions will be useful as will logical or grep functions to select the product family (commodity) and brand (product_description). Also, notice how the pipe commands (%>%) can be used to push the results / outputs of one operation into the inputs of another.

4. Final considerations

The third part of Assignment requires you to do some thinking: your task is to determine For which category of products (pasta, pasta sauces, syrups and pancake mixes) does the provision of coupons appear to have a positive impact?. And, it has been suggested that you should consider which customers first purchased an item in a category using a coupon, and then how many of these customers made additional purchases of the item in the category.

Some further suggestions:

  • think about what the day and coupon variables indicate
  • think how arrange from dplyr could be used to sort the data
  • break the problem down into discrete steps (e.g. create a list of all ‘pasta’ transactions; create a list of

    all customers who purchased pasta with a coupon; . . . etc)

  • do it for just one product category (eg pasta) to the steps etc working
  • do it for a sample of the transactions data so that you can develop the code efficiently: working with

    the whole dataset may take time for some of the operations you will have to undertake

  • think about the outputs you want – number of customers? proportion of customers? both?
  • then think about putting it all into functions, that when run together in sequence, return the desired

    outputs

6

my.q3.func <- function(prod.cat){ some code
some more code
etc…

return(val) }

• if functions are using data in working memory then they may need to compiled within a function:

my.q3.func <- function(prod.cat){

some code to create temp.variable my.q3.nested.func(temp.variable) {

some code

return(return.val) }

some more code etc… return(val)

}

• Then you should be able to pass variables to the function in a similar way to the below:

So for example, the way that I did was to write a function that could be called to analyse a commodity such as pasta. It does the following for ‘commod’ – the name of the commodity

my.q3.func(“pasta”) my.q3.func(“pasta sauces”)

…etc

1. Finds households who bought with a coupon
# link the product and transaction data
# filter for coupon and commodity
tr %>% left_join(pl) %>% filter(coupon == 1) %>% filter(commodity == commod) -> r1 # the unique households that bought commod with coupon

h <- unique(r1$household)
2. Links the product and transaction data again

tr %>% left_join(pl) %>% filter(commodity == commod) -> r2

3. Then it has a function to examine each household in h that does the following # – extracts the household transactions from r2
# – sort this by day and the inverse of coupon (ie coupon going from 1 to 0)
# – extract (select) the coupon variable

r2 %>% filter(household == x) %>% arrange(day, -coupon) %>% select(coupon) %>% unlist- 4. Finally it tests to see if the first element of the coupon variable = 1

  # (i.e. they used a coupon first time)
  # AND that its length is > 1 (ie the household bought the commodity more than once)

if (tmp[1] == 1 & length(tmp) > 1) return(x)

7

> tmp