Data Challenge — Question 1
Using San Francisco Ford GoBike Share dataset hosted on Google’s BigQuery, write queries to get the following:
1. How many trips were made for each month in 2015?
2. What region do millenials (birth year between 1981 and 1996) prefer to start their trips? What region do Generation X (birth year between 1965 and 1980) prefer to start their trips?
3. What is the average number of trips per station with kiosk?
Instructions
1. San Francisco Ford Go Bike Share is a public dataset hosted on Google’s BigQuery. Completing the
challenge is free with BigQuery’s 1TB/mo of free tier processing. Access BigQuery using a Google account. Upon creating a google cloud account, create a project (i.e. My Data Challenge Project).
2. Write queries to return the answers for each of the three questions above.
3. Share the queries in your submission to the data challenge
Data Challenge — Question 2
local_datetime
user
store_nam e
city
state
zip
dwell
load_id
load_date
12/10/17 17:29
f5845
soulcycle
houston
tx
78259
5
1096
12/11/17 17:29
12/12/17 6:11
71e61
la fitness
austin
tx
77573
45
1109
12/13/17 6:11
12/13/17 5:57
9059f
soulcycle
dallas
tx
75206
50
1831
12/14/17 5:57
12/14/17 5:32
8c566
la fitness
spring
tx
77379
40
1206
12/15/17 5:32
…
…
…
…
…
…
…
…
…
12/15/17 5:41
9059f
soulcycle
dallas
tx
75206
55
9938
12/16/17 5:41
12/16/17 13:19
871fb
crossfit
dickinso n
tx
77539
15
1110
12/17/17 13:19
12/13/17 5:57
9059f
soulcycle
dallas
tx
75206
50
9921
12/16/17 6:57
In the above table, there is a duplicate row due to an error in processing. An example of duplicate rows is highlighted in red. The goal is to have only one row per unique combination of local_datetime, user and store_name. Imagine this is a table with billions of rows and multiple such duplicates.
1. Discuss some ways you might try to identify the duplicate rows using SQL.
2. How would you create a new table without duplicate rows using only SQL?
Data Challenge — Question 3
Given the following details about tables in a database.
table_a: person’s name and their selected greeting choice
id
greeting
betty
hello
elle
bonjour
…
…
frank
hiya
zoey
howdy
table_b: person’s name and the name of their dog(s) (if they have at least one dog)
id
dog
betty
rio
elle
bow-wow
…
…
zoey
raffy
table_c: person’s name and their veterinarian bill
id
vet_bill
betty
27.20
betty
150.99
elle
85.20
elle
47.50
…
…
zoey
79.50
zoey
15.75
zoey
35.00
You execute the following queries:
Do you see any potential issue with the following query? If yes, what?
with p as
(
select a.id, b.dog
from table_a a
left join table_b b
on a.id = b.id
)
select p.id, p.dog, sum(c.vet_bill) as sum_of_vet_bill from p
left join table_c c
on c.id = p.id
group by p.id, p.dog;
How is the above query different than the one below? Do you see any potential issue with the following query? If yes, what?
with p as
(
select a.id, b.dog
from table_a a
left join table_b b
on a.id = b.id
)
select p.id, p.dog, sum(c.vet_bill) as sum_of_vet_bill from table_c c
left join p
on c.id = p.id
group by p.id, p.dog;
Data Bonus Challenge
The most trusted news publications such as Wall Street Journal, Forbes, and Business Insider utilize inMarket to understand how consumer behavior is evolving and what it means for global brands (an example is on the following page). In this challenge, we provide a small sample of our anonymized data. Using the two data tables and data sources available online (i.e. newspaper, stock market news, etc.), discover business trends in the second half of 2014.
store_visits.csv file is a table containing information on the behavior of consumers surveyed at select chains.
Field Name
Data Type
Description
time_week
date
Time of visit
device_id
bytes/string
Device identifier
category
string
Category of chain
chain
string
Name of chain
visits
integer
Number of times device_id was at chain
store_purchases.csv file is a table containing information about consumer purchases.
Field Name
Data Type
Description
time_week
date
Time of visit
device_id
bytes/string
Device identifier
item_category
string
Category of item_type
item_type
string
Type of item device_id purchased
item_count
integer
Number of item_type purchased
Instructions
1. Evaluate store_visits.csv and store_purchases.csv using analytical tool(s) of your choice (i.e. Tableau,
SQL, Python, R, Excel, etc.)
2. Create a 1-2 page summary with 2 market trends that PR professionals can pitch to news outlets. Consider things like loyalty, market segmentation, and shopping behavior. (Feel free to include additional page(s) explaining your methodologies.)
3. Bonus: Share any additional findings you thought were interesting about the data