Assignment 4¶
Cleaning and Exploring Data with Pandas¶

In this quiz, you will investigate restaurant food safety scores for restaurants in San Francisco. Above is a sample score card for a restaurant. The scores and violation information have been made available by the San Francisco Department of Public Health.
Loading Food Safety Data¶
There are 2 files in the data directory:
1. business.csv containing food establishments in San Francisco
2. inspections.csv containing retaurant inspections records
Let’s start by loading them into Pandas dataframes. One of the files, business.csv, has encoding (ISO-8859-1), so we will account for that when reading it.
Question 1¶
Question 1a¶
Read the two files noted above into pandas dataframes called bus and ins respectively. Print the first 5 rows of each to inspect them.
In [ ]:
In [ ]:
In [ ]:
Examining the Business data¶
From its name alone, we expect the businesses.csv file to contain information about the restaurants. Let’s investigate this dataset.
Question 2¶
Question 2a: How many records are there?¶
In [ ]:
Question 2b: How many unique business IDs are there?¶
In [ ]:
Question 2c: is the business_id unique in this data?¶
In [ ]:
Question 2d: What are the 5 most common businesses by name, and how many are there in San Francisco?¶
In [ ]:
Zip code¶
Next, let’s explore some of the variables in the business table. We begin by examining the postal code.
Question 3¶
Question 3a¶
How are the zip code values encoded in python: ints, floats, strings, booleans …
To answer this you might want to examine a particular entry.
In [ ]:
In [ ]:
Question 3b¶
What are the unique values of postal_code?
In [ ]:
Question 3c¶
Let’s say we decide to exclude the businesses that have no zipcode for our analysis (which might include food trucks for example). Use the list of valid 5-digit zip codes below to create a new dataframe called bus_valid, with only businesses whose postal_codes show up in this list of valid zipcodes. How many businesses are there in this new dataframe?
In [ ]:
In [ ]:
Latitude and Longitude¶
Another aspect of the data we want to consider is the prevalence of missing values. If many records have missing values then we might be concerned about whether the nonmissing values are representative of the population.
Question 4¶
Consider the longitude and latitude in the business DataFrame.
Question 4a¶
How many businesses are missing longitude values, working with only the businesses that are in the list of valid zipcodes?
In [ ]:
Question 4b¶
For these zip codes find the number of businesses in each zip code and the number without longitude values.
Create a new dataframe of counts of the null and proportion of null values, storing the result in bus_sf_latlong. It should have 3 columns:
1. postal_code: Contains the zip codes in the validZip variable above.
2. null_lon: The number of missing values for the zip code.
3. not_null_lon: The number of present values for the zip code.
In [ ]:
4c. Do any zip codes appear to have more than their ‘fair share’ of missing longitude?¶
To answer this, you will want to compute the proportion of missing longitude values for each zip code, and print the proportion missing longitude, and print the top five zipcodes in descending order of proportion missing postal_code.
In [ ]:
Investigate the inspection data¶
Let’s now turn to the inspection DataFrame. Earlier, we found that ins has 4 columns, these are named business_id, score, date and type. In this section, we determine the granularity of ins and investigate the kinds of information provided for the inspections.
Question 5¶
Question 5a¶
As with the business data, assess whether there is one inspection record for each business, by counting how many rows are in the data and how many unique businesses there are in the data. If they are exactly the same number, it means there is only one inspection per business, clearly.
In [ ]:
Question 5b¶
What values does type take on? How many occurrences of each value is in the DataFrame? Create a new dataframe ins2, keeping from the dataframe only records with values of type that occur more than 10 times in the inspections file. (eliminate records that have values of type that occur rarely). Check the result to make sure rare types are eliminated.
In [ ]:
In [ ]:
Question 5c¶
Since the data was stored in a .csv file, the dates are formatted as strings such as 20160503. Once we read in the data, we would like to have dates in an appropriate format for analysis. Add a new column called year by capturing the first four characters of the date column.
BIG Hint: you could apply a lambda function here to convert the values to a string and slice off the first four characters.
In [ ]:
Question 5d¶
What range of years is covered in this data set? Are there roughly same number of inspections each year? Maybe you should drop records for any years with less than 50 inspections to avoid biasing an analysis. Do that to create a new dataframe called ins3.
In [ ]:
In [ ]:
Let’s examine only the inspections for one year, 2016. This puts businesses on a more equal footing because inspection guidelines generally refer to how many inspections should occur in a given year.
In [ ]:
Question 6¶
Question 6a¶
Merge the business and 2016 inspections data, keeping all businesses regarless of whether they show up in the inspections file. Show the first several rows of the resulting dataframe.
In [ ]:
Question 6b¶
Print the 20 lowest rated businesses names, their addresses and their ratings.
In [ ]:
Done!¶
Now submit your notebook to the bcourses site.
In [ ]: