
Assignment 1 Support Session
Lex Comber January 2019

Contents
1. Introduction
2. The data
3. Useful database functions in dplyr
   3.1 Single table manipulation
   Task 1
   3.2 Two table manipulation
   Task 2
4. Final considerations
1. Introduction
You have been set the task of answering the following 3 questions in Assignment 1:
• Q1: What is the household penetration of ‘PRIVATE LABEL THIN SPAGHETTI’? That is, out of all customers purchasing Pasta, what percent purchase this brand?
• Q2: How does the household penetration of the product PRIVATE LABEL THIN SPAGHETTI vary within the two regions, relative to the sales of Pasta in each region?
• Q3: For which category of products (pasta, pasta sauces, syrups and pancake mixes) does the provision of coupons appear to have a positive impact? For this you should consider which customers first purchased an item in a category using a coupon, and then how many of these customers made additional purchases of the item in the category.
The Assignment submission should be around 1000 words and accounts for 50% of the final module mark. It is based on the Dunnhumby data, Carbo-Loading: A Relational Database, as described at https://www.dunnhumby.com/sourcefiles. You will need to download the data from the site. It includes 4 datasets in CSV / SAV formats and a User Guide. You have read .csv files into R before using the read.csv function.
For this practical session you will need to have the following packages installed and loaded.
# test for package existence and install
if (!is.element("tidyverse", installed.packages())) install.packages("tidyverse", dep = T)
if (!is.element("nycflights13", installed.packages())) install.packages("nycflights13", dep = T)
# load into the R session
library(tidyverse)
library(nycflights13)

2. The data
You should read the data into R. To take advantage of the relational database functionality of the tibble format and the dplyr functions wrapped in the tidyverse package, you may wish to convert the data to tibbles as you read them in. Note: some of the files are large and may take a little while to load into R.
Then, as ever, you should inspect them (note that they have been named as follows below: cl for dh_causal_lookup.csv, pl for dh_product_lookup.csv, sl for dh_store_lookup.csv and tr for dh_transactions.csv).
library(tidyverse)
cl <- as_tibble(read.csv("dunnhumby/dh_causal_lookup.csv"))
pl <- as_tibble(read.csv("dunnhumby/dh_product_lookup.csv"))
sl <- as_tibble(read.csv("dunnhumby/dh_store_lookup.csv"))
tr <- as_tibble(read.csv("dunnhumby/dh_transactions.csv"))
cl
## # A tibble: 351,372 x 6
##    upc         store  week feature_desc          display_desc   geography
## # … with 351,362 more rows
pl
## # A tibble: 927 x 5
##    upc        product_description       commodity  brand       product_size
## # … with 917 more rows
sl
## # A tibble: 387 x 2
##    store store_zip_code
## # … with 377 more rows
tr
## # A tibble: 5,197,681 x 11
##    upc  dollar_sales units time_of_transac~ geography  week household
## # … with 5,197,671 more rows, and 4 more variables: store, basket, day, coupon
The Guide to the data includes the following figure, which shows how the different datasets can be linked together using commonly named attributes:
Figure 1: The Dunnhumby data schema.
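For example, a minimal sketch of such a linkage (a sketch only, assuming the shared key columns upc and store shown in the outputs above):
# link transactions to their product and store details via the shared keys
tr %>%
  left_join(pl, by = "upc") %>%    # adds product_description, commodity, brand, ...
  left_join(sl, by = "store") %>%  # adds store_zip_code
  head()
Here left_join() keeps every transaction and simply appends the matching columns from the lookup tables; these join functions are discussed further in Section 3.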

3. Useful database functions in dplyr
It is recommended that you explore the help for the dplyr package:
?dplyr
If you go to the index link at the bottom of the help page that opens, you will be taken to an alphabetical list of the functions in the package. However, at the top of the page, in common with many R packages, there is a link to the user guides, package vignettes and other documentation. You could click on this to explore how the package is coded, how it efficiently uses memory, and the database functionality, such as the SQLite back-end, that dplyr can draw on.
3.1 Single table manipulation
Perhaps the place to start is the vignettes in dplyr:
vignette("dplyr", package = "dplyr")
And the two-table operations:
vignette("two-table", package = "dplyr")
You will need the nycflights13 package for these vignettes. It contains datasets that are used to illustrate the operation of the dplyr functions.
This vignette illustrates the basic tools for manipulating a single dataset that are supported in dplyr (a short sketch combining several of them follows the list):
• filter() (and slice())
• arrange()
• select() (and rename())
• distinct()
• mutate() (and transmute())
• summarise()
• sample_n() (and sample_frac())
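For instance, a possible sketch using the pl tibble loaded earlier, together with group_by() (the commodity string "pasta" is an assumption about how that category is coded, which you should check against pl$commodity):
# number of distinct products per brand within one commodity, largest first
pl %>%
  filter(commodity == "pasta") %>%
  select(brand, product_description) %>%
  distinct() %>%
  group_by(brand) %>%
  summarise(n_products = n()) %>%
  arrange(desc(n_products))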
You have used some of these already in earlier practicals. You should load the nycflights13 data:
library(nycflights13)
flights
## # A tibble: 336,776 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
## # … with 336,766 more rows, and 12 more variables: sched_arr_time,
## #   arr_delay, carrier, flight, tailnum, origin, dest, air_time,
## #   distance, hour, minute, time_hour
Task 1
Your task is to work through and explore these tools, following the example code snippets in the vignette, and to consider how they relate to your Assignment objectives. You should just about be able to answer Q1 using the functions in this section if you know what the code is for the brand, although it may be easier using some of the merge and join functions described later.
Some logical tests and the match and unique functions could be useful.
# Q1: What is the household penetration of `PRIVATE LABEL THIN SPAGHETTI`?
# That is, out of all customers purchasing Pasta, what percent purchase this brand?
# There are a number of stages to this
# 1. Identify the upc codes for product and commodity
# (assuming product_name and commodity_name have been set first, e.g.
# product_name <- "PRIVATE LABEL THIN SPAGHETTI" and commodity_name <- "pasta")
index1 <- which(pl$product_description == product_name)
index2 <- which(pl$commodity == commodity_name)
# product
upc.1 <- unlist(pl[index1, "upc"])
# commodity
upc.2 <- unlist(pl[index2, "upc"])
# 2. then match these to the transactions upc
# filter and get rid of the NAs
tmp1 <- filter(tr, !is.na(match(tr$upc, upc.1)))
tmp2 <- filter(tr, !is.na(match(tr$upc, upc.2)))
# 3. then subset the transactions data
# and determine how many households buy the product / commodity
n.product <- length(unique(tmp1$household))
n.commod <- length(unique(tmp2$household))
3.2 Two table manipulation
Perhaps the place to start is the section entitled Two-table verbs at dplyr::two-table, and again if you click on the source link you will see the .Rmd (RMarkdown) code that compiles to produce the vignette. The details of the joins and set operations in this section are at the heart of developing a successful analysis for Assignment 1. Again you should work through these illustrations yourself, with a view to understanding how they could be used in your analysis.
If you undertook the answer to Q1 using the single table manipulation functions listed in Section 3.1, you should be able to identify how you could have done it using the different join functions with filter. Similarly, the answer to Q2 can be derived from Q1 if you can split the subsetting / results of the filter functions into the 2 geographies.
Task 2
Your task is to work through and explore these tools, following the example code snippets in the vignette, and to consider how they relate to your Assignment objectives - you should be able to answer Q1, Q2 and Q3 using the functions in this section. Again the filter and join functions will be useful, as will logical or grep functions to select the product family (commodity) and brand (product_description). Also, notice how the pipe commands (%>%) can be used to push the results / outputs of one operation into the inputs of another.
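For example, a possible sketch of the kind of join-plus-filter pipeline meant here, splitting the Q1 penetration by geography as Q2 requires (the commodity string "pasta" and the exact product description are assumptions you should verify against pl):
# household penetration of the product within each geography
tr %>%
  left_join(pl, by = "upc") %>%
  filter(commodity == "pasta") %>%
  group_by(geography) %>%
  summarise(
    n_pasta_hh   = n_distinct(household),
    n_product_hh = n_distinct(household[product_description == "PRIVATE LABEL THIN SPAGHETTI"]),
    penetration  = 100 * n_product_hh / n_pasta_hh)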

4. Final considerations
The third part of the Assignment requires you to do some thinking: your task is to determine ‘For which category of products (pasta, pasta sauces, syrups and pancake mixes) does the provision of coupons appear to have a positive impact?’. And, it has been suggested that you should consider which customers first purchased an item in a category using a coupon, and then how many of these customers made additional purchases of the item in the category.
Some further suggestions:
• think about what the day and coupon variables indicate
• think how arrange from dplyr could be used to sort the data
• break the problem down into discrete steps (e.g. create a list of all ‘pasta’ transactions; create a list of all customers who purchased pasta with a coupon; . . . etc)
• do it for just one product category (e.g. pasta) to get the steps etc. working
• do it for a sample of the transactions data so that you can develop the code efficiently: working with the whole dataset may take time for some of the operations you will have to undertake
• think about the outputs you want – number of customers? proportion of customers? both?
• then think about putting it all into functions that, when run together in sequence, return the desired outputs
• if functions are using data in working memory then they may need to be compiled (defined) within a function:
my.q3.func <- function(prod.cat){
  # some code
  # some more code etc...
  return(val)
}
my.q3.func <- function(prod.cat){
  # some code to create temp.variable
  my.q3.nested.func <- function(temp.variable) {
    # some code
    return(return.val)
  }
  # some more code etc...
  return(val)
}
• Then you should be able to pass variables to the function in a similar way to the below:
my.q3.func("pasta")
my.q3.func("pasta sauces")
# ...etc
So for example, the way that I did it was to write a function that could be called to analyse a commodity such as pasta. It does the following for ‘commod’ - the name of the commodity:
1. Finds households who bought with a coupon
# link the product and transaction data
# filter for coupon and commodity
tr %>% left_join(pl) %>%
  filter(coupon == 1) %>%
  filter(commodity == commod) -> r1
# the unique households that bought commod with coupon
h <- unique(r1$household)
2. Links the product and transaction data again
tr %>% left_join(pl) %>%
  filter(commodity == commod) -> r2
3. Then it has a function to examine each household (x) in h that does the following:
# – extracts the household transactions from r2
# – sorts these by day and the inverse of coupon (i.e. coupon going from 1 to 0)
# – extracts (selects) the coupon variable
r2 %>% filter(household == x) %>%
  arrange(day, -coupon) %>%
  select(coupon) %>%
  unlist() -> tmp
4. Finally it tests to see if the first element of the coupon variable = 1
# (i.e. they used a coupon first time)
# AND that its length is > 1 (i.e. the household bought the commodity more than once)
if (tmp[1] == 1 & length(tmp) > 1) return(x)
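Putting these steps together, one possible sketch of the whole function follows. This is an assumption-laden outline rather than the exact code used here: the sapply() loop, the returned counts and the commodity string passed in at the end are all illustrative.
my.q3.func <- function(commod){
  # 1. households who bought the commodity with a coupon
  tr %>% left_join(pl, by = "upc") %>%
    filter(coupon == 1, commodity == commod) -> r1
  h <- unique(r1$household)
  # 2. all transactions for the commodity
  tr %>% left_join(pl, by = "upc") %>%
    filter(commodity == commod) -> r2
  # 3. & 4. for each household, test whether its first purchase (sorted by day,
  # with coupon = 1 first on ties) used a coupon AND it bought more than once
  came.back <- sapply(h, function(x){
    tmp <- r2 %>% filter(household == x) %>%
      arrange(day, -coupon) %>%
      select(coupon) %>% unlist()
    tmp[1] == 1 & length(tmp) > 1
  })
  # return the counts of interest
  c(coupon.households = length(h), repeat.after.coupon = sum(came.back))
}
my.q3.func("pasta")
While developing something like this it may be worth working with a sample of tr (for example tr[sample(nrow(tr), 100000), ]), as the per-household loop over the full transactions data will be slow.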