1 Assignment 3 1.1 Instructions
assignment-3 October 26, 2018
For full credit, earn 100 points. Problem C is big, and worth 100 points on its own, with its own bonus questions. Problem A and Problem B are worth 35 and 25 respectively, so full credit re- quires significant progress on Problem C, but Problem A and Problem B are more closed-ended, i.e., likely can be finished more quickly. Please indicate clearly which components of the assign- ment are completed as you fill out your solution and turn it in.
1.2 Problem A (35 points)
For this problem, you will be working with flight data from the Bureau of Transporttaion Statistics. For development, these data are a little largeish (~700Mb/7 million flights) for prototyping, so two truncated files are provided under data/2007-10k.csv and data/2008-10k.csv. Important: it’s strongly recommended to prototype and develop code using the truncated data.
Design note: The code you will develop as part of this problem’s solution should be general- ized so that it works when there are more than just two years’ worth of data, i.e. when there are more files than just 2007.csv and 2008.csv.
To work with the full dataset, go to the link and download the data files for 2007 and 2008. Put these files in the data/ directory and extract them. (The files can be extracted from the command line by navigating into data/ and running bunzip2 200{7..8}.csv.bz2.
A1. (5 points) Write a function that takes a year as an input argument and loads the data for that year into a pandas dataframe, then drops the rows in the dataframe that have a null in any of these columns: “Year”, “Month”, “DayofMonth”, “DepTime”, “Origin”, and “Dest”, and then returns this dataframe.
In [ ]:
A2. (10 points) Update the function so that before returning the dataframe, it creates a new column in the dataframe that contains datetime objects holding the departure date of the flight.
In [ ]:
A3. (5 points) Update the function so that it also takes an airport code as an input argument, and returns a dataframe of flights originating from that airport that occurred in the specified year. [Hint: while prototyping, to test your code choose an ‘Origin’ that’s in the truncated file, like Philly’s airport!]
1
In [ ]:
A4. (5 points) Using this function, create dataframes holding the flight data for Philadelphia International Airport (PHL) for 2007 and 2008. Then use the .groupby() method to obtain the busiest month of the year for both years. Did this change from 2007 to 2008?
In [ ]:
A5. (10 points) Update the function so that instead of a year value, it now takes two dates as input, denoting a range. The function must now return all flights originating from the specified airport within this range of time. If the range spans multiple years, the function should load data from all necessary files and return a single dataframe containing all the data within the specified range of time.
Using this function, get all the flight data for flights from PHL for all of 2007 and 2008. Then, create a daily count of flights over all of the days in the two years.
In [ ]:
BONUS. (5 points) Display the daily counts in a plot. In [ ]:
1.3 Problem B (25 points)
In this problem, you will be using the Baseball Databank provided by Sean Lahman. This data is already in the data/baseballdatabank2017.1/core/ directory. It contains a collection of tables. The immediate goal will be to create a dataframe that has the following information for each baseball player:
• Batting statistics (to be described) • Fielding statistics (to be described) • Pitching statistics (to be described) • Their salaries
• The teams they played for
• Their full names: First Middle Last • Their heights and weights
The dataset has a data dictionary available at:
• ../data/baseballdatabank-2017.1/core/readme2014.txt
The data and tables you will need are: • Batting.csv
– The number of games played and at bats
– The number of runs, hits, doubles, triples, homeruns, RBIs, strikeouts, and times hit by
• Fielding.csv
– The number of games played
2
– The number of opponents put out, assisted outs, and fielding errors • Master.csv
– Their full names
– Their heights and weights
• Pitching.csv
– The number of games played, won, lost
– The number of strikeouts, hits, earned runs, homeruns, and batters hit by pitches
• Salaries.csv
– The players salary
• Teams.csv
– The name of the player’s team
– The year the team was named its name
B1. (5 points) Load the data and keep only the columns of interest.
In [ ]:
B2. (5 points) Create a function that takes a year as input and outputs subsets of each of the tables with data from only that year.
In [ ]:
B3. (10 points) Write a new function that will again take a year as input, then use the previous filtering function to get separate tables, and finally merge these tables using the appropriate joins. Determine which columns have the same names but different values, then determine a useful naming scheme of suffixes that indicates the table the column was originally drawn from. Use this scheme in combination with the suffixes argument of the merge() function to avoid column name conflicts. Also remove duplicate columns from the merged table.
In [ ]:
B4. (5 points) There will be nulls in the data. Determine how to deal with the NAs and apply this strategy.
In [ ]:
1.4 Problem C (100 points)
In this exercise we’ll build and expand a cheese and wine pairings database with the high-level goal of linking wine varieties and recipe ingredients.
A Google search on food and wine pairings turned up a literature Ph.D. candidate’s blog post: + http://sedimentality.com/drinking-wine/list-of-wine-and-food-pairings/
This post is interesting, because instead of pairing dishes with wine vintages or brands, it pairs ingredients with wine varieties, tagging ingredients by several categories. Thus, we could pull the
3
data from this website to join wine varieties from a large wine reviews database to ingredients in a large recipes database. Your goal in this section of the assignment is to collect the ingredient- variety-category data.
C1. (5/5 pts) Why are we allowed to scrape this data out of the website?
Explain what information is posted (or not posted) on this domain, allowing us to definitely move forward with this exercise. Use the markdown cell below.
Response.
C2. (5/10 pts) Download the blog post using the requests module and store the html response.
In [ ]:
C3. (5/15 pts) Navigate to the same URL in your web browser, and identify where the target data are by inspecting the html (with your eyes) and matching up tags with the target content.
In the markdown cell below, describe the tags that you will need to extract the data from. Provide as much information as is necessary on the nesting of the document so as to be able to describe the location of the target data for parsing.
Response.
C4. (10/25 pts) Parse the html text with BeautifulSoup in preparation for scraping. Then, print out the target data. [Hint: use BeautifulSoup’s .find_all(tag_list) method to iterate through the tags identified in C3. From there, check to see if a tag has the data of interest by using the .find(tag) method. ]
In [ ]:
C5. (5/30 pts) Interpret the output from the previous part to make a plan for how to split the target data up with regular expressions to create a structured data object. Describe any structure that you’ll use in the markdown cell, below.
Response.
C6. (5/35 pts) Create an associative array (dictionary) data structure that can hold the ingre- dients in their categories, by wine variety. Use a default dictionary with its default value as a dictionary.
In [ ]:
C7. (10/45 pts) Complete the web-scraping script’s loop in C4 and store the extracted data in the data structure you initialized in C6. Print the stored data to confirm your output. Note you’ll need to use the regular expresions module to complete this section.
In [ ]:
C8. (5/50 pts) Write your data out to a json file. Name this file: ./data/wine_ingredient-pairings.json.
Note: since defaultdicts are not json serialiable (you can’t make a file out of them), you will have to coerce (type cast) your final data to a standard Python dictionary to finish the job.
In [ ]:
4
C9. (5/55 pts) What limitations did you observe in this data, and in your extraction of this data?
Were there any issues either with this data’s content, volume, its veracity or accuracy? Were you able to get all of the data present? Did you wind up getting some data that you didn’t want or need, or maybe not in the right form? Take a look at the wine and categories keys—are they consistent? Be as specific as possible!
Response.
BONUS (5 pts) Supposing we were going to rebuild the scraped wine-food pairings data as a gold standard through a survey. Are there any changes to allowable food types or to the manner in which the data are organized that might be beneficial to the wine recommendation task?
Response.
C10. (5/60 pts) Download and familiarize yourself with the cuisine prediction dataset: https://www.kaggle.com/kaggle/recipe-ingredients-dataset (it’s only necessary to download the train.json file and work with that). We’re going to enrich our current data by adding re- sults from this dataset.
In [ ]:
C11. (15/75 pts) Expand each wine variety’s list of foods to include any co-appearing in the recipes found in the above Kaggle dataset. Make sure to record how often ingredients in the Kaggle recipes occured alongside foods in the wine variety list. Store this information in another nested defaultdict, with the following structure:
wines = { wine: {
food_category: { pairing: Counter(), ...
},
… },
… }
In [ ]:
BONUS. (5 pts) How would you create the above structure for wines using nested a nested defaultdict?
In [ ]:
C12. (10/90 pts) We’d also like to incorporate a wine review dataset. Download and familiarize yourself with the dataset, found at: https://www.kaggle.com/zynicide/wine-reviews. Specifi- cally, utilize the file: winemag-data-130k-v2.csv to link each of the scraped wines found in wines to the specific names of the wines which are of that variety (i.e., for the Sauvignon wine, create a list of all the wines with variety column that say Sauvignon, and attach them to the wines object from before).
In [ ]:
5
C13. (5/95 pts) Now write a function which allows the user to input a list of ingredients and also a choice of ingredient category (such as “Cheese/nuts”), and outputs a list of all wine names that pair with these ingredients. This function won’t incorporate the additional ingredients we obtained from the Kaggle dataset.
In [ ]:
C14. (5/100 pts) This should give you a ton of results! Maybe too many for the function to ac- tually be useful in helping pick a wine. Perhaps there’s some disagreement between the datasets. How should the pairings and ingredients from the scraped data and the Kaggle data each indi- vidually be pre-processed as text to obtain a good alignment?
Response.
6