CIS 545 Homework 1: Data Wrangling and Cleaning¶
Fall 2020 | Due September 30th, 10:00 PM EST
Hello future data scientists and welcome to CIS 545! In this homework, you will familiarize yourself with Pandas 🐼, the cutest animal and one of the essential libraries for data science. This homework is focused on one of the most important tasks in data science: preparing datasets so that they can be analyzed, plotted, used for machine learning models, etc.
This homework is broken into five sections that analyze two groups of datasets (although section 1 has the most sub-parts)!
Working with New York flu and hospital acquired infection data
Working with Airbnb and Uber Data and on a merged/joined version of the Airbnb and Uber datasets (More on this later…)
Note: Before starting, you must click on the "Copy To Drive" option in the top bar. This is the master notebook, so you will not be able to save your changes without copying it! Once you click on that, make sure you are working on that version of the notebook so that your work is saved.
Run the following two cells to set up the notebook
In [ ]:
%%capture
!pip3 install penngrader
!pip3 install py_stringsimjoin
!pip install python-Levenshtein
from penngrader.grader import *
# Import necessary libraries
!pip install geocoder
import pandas as pd
import numpy as np
import seaborn as sns
from string import ascii_letters
import matplotlib.pyplot as plt
from datetime import datetime
import time
import geocoder
import py_stringsimjoin as ssj
import py_stringmatching as sm
from Levenshtein import distance
from difflib import SequenceMatcher
import requests
from lxml import html
In [ ]:
!wget https://penn-cis545-files.s3.amazonaws.com/NY_Hospital_Acquired_Infections__Beginning_2008.csv
!wget https://penn-cis545-files.s3.amazonaws.com/UPDATED_2_zillow_df.csv
!wget https://penn-cis545-files.s3.amazonaws.com/UPDATED_2_airbnb_df.csv
!wget https://penn-cis545-files.s3.amazonaws.com/UPDATED_3_NY_Flu_Confirmed_Cases_By_County__Beginning_2009_10_Season.csv
[wget output trimmed: NY_Hospital_Acquired_Infections__Beginning_2008.csv (4.1M), UPDATED_2_zillow_df.csv (3.5M), UPDATED_2_airbnb_df.csv (3.8M), and UPDATED_3_NY_Flu_Confirmed_Cases_By_County__Beginning_2009_10_Season.csv (5.5M) all downloaded and saved successfully]
What is Pandas?¶
Apart from animals, Pandas is a Python library to aid with data manipulation/analysis. It is built on top of NumPy, another Python library that provides efficient computation on matrices and arrays.
Let’s also get familiarized with the PennGrader. It was developed specifically for 545 by a previous TA, Leonardo Murri.
PennGrader was developed to provide students with instant feedback on their answers. You can submit your answer and know instantly whether it's right or wrong. We then record your most recent answer in our backend database. Let's try it out! Fill in the cell below with your 8-digit Penn ID and then run the following cell to initialize the grader.
In [ ]:
# PLEASE ENSURE YOUR PENN-ID IS ENTERED CORRECTLY. IF NOT, THE AUTOGRADER WON'T
# KNOW WHO TO ASSIGN POINTS TO IN OUR BACKEND
STUDENT_ID = 0000000000 # YOUR PENN-ID GOES HERE AS AN INTEGER #
In [ ]:
grader = PennGrader(homework_id = 'CIS545_Fall_2020_HW1', student_id = STUDENT_ID)
PennGrader initialized with Student ID: 0
Make sure this is correct or we will not be able to store your grade
Pandas (the animal) are lazy. Their days are made up of eating and sleeping. Just like mine. Let’s run a cell just to make sure PennGrader works
TODO: PennGrader Check¶
Change favorite_activity to whichever panda "activity" you prefer (2 points)
Note: We’ll use cells like these “TODO” above to indicate what is important to have in each section of the notebook. Some general guidelines:
You don’t have to do all of these in one cell/step; we’re just labeling them for each section which might have smaller sub-sections (ex: look at how 1.2 is set up) for your reference
Make sure to read these carefully and do everything that is asked
Make sure to run all the PennGrader test cells; if we forgot anything, please let us know ASAP on Piazza so that we can update the Markdown cells here
In [ ]:
# In this cell, put which panda activity you prefer (eating or sleeping)
# Input the activity name in all lowercase
favorite_activity = ""
In [ ]:
# Run this cell to submit to PennGrader!
# [CIS 545 PennGrader Cell] - 2 points
grader.grade(test_case_id = 'panda_test', answer = favorite_activity)
You earned 0/2 points.
But don't worry, you can re-submit and we will keep only your latest score.
You just had your first experience with the Penn Grader! For the future questions, once you have completed a question, you can submit your answer to the Penn Grader for immediate feedback. Awesome, right?
We will use scores from PennGrader to determine your grade. You will still need to submit your notebook so we can check for cheating and plagiarism. Do not cheat.
Note: If you run Penn Grader after the due date for any question, your assignment will be marked late, even if you already had full points for the question before the deadline. To remedy this, if you’re going to run your notebook after the deadline, either do not run the grading cells, or reinitialize the grader with an empty or clearly fake ID such as 999999999999 (please use 10+ digits to be clearly a fake STUDENT_ID)
Adding our data so that our code can find it¶
We can't be data scientists without data! We provided code for you to download the data (the "wget" cell from earlier). If you open the Files view in the left sidebar, you should see the four CSV files we just downloaded.
Part 1: Working with New York Influenza and Hospital-Acquired Infection Data¶
In this part of the homework we will be working with two different healthcare focused datasets!
The New York Influenza dataset contains data about flu cases in New York. The New York Hospital-Acquired Infection dataset contains information pertaining to infections patients acquired while in different hospitals across New York state.
Let’s first get both of our datasets loaded into two Pandas Dataframes, one for each of the datasets. Use pandas’ read_csv functionality, which you can find documentation for here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
While reading documentation is hard at first, we strongly encourage you to get into a good habit of it, since many times your questions will be answered directly by the documentation (ex: “why isn’t my dataframe dropping duplicates” or “why didn’t this dataframe update”).
TODO: Part 1 Intro¶
Save the Hospital-Acquired Infection dataframe to a variable named: hospital_df
Save the New York Influenza dataframe to a variable named: flu_df
Save the data types to flu_df_types (2 points)
In [ ]:
# TODO: Import your two files to pandas dataframes -- make sure the dataframes are named correctly!
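# A hedged sketch (one common approach), assuming the files downloaded by the
# wget cell above sit in the current working directory:
hospital_df = pd.read_csv('NY_Hospital_Acquired_Infections__Beginning_2008.csv')
flu_df = pd.read_csv('UPDATED_3_NY_Flu_Confirmed_Cases_By_County__Beginning_2009_10_Season.csv')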
In [ ]:
# Now run this cell to update the column names; please don't delete this!
hospital_df.columns = [column.replace(' ', '_') for column in hospital_df.columns]
flu_df.columns = [column.replace(' ', '_') for column in flu_df.columns]
The cell above uses list comprehension in Python. If you are unfamiliar with this concept, you can read more about it here: https://docs.python.org/3/tutorial/datastructures.html
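For instance, a minimal (hedged) example in the same spirit as the renaming cell above:
lowered = [name.lower() for name in ['County', 'Region']]   # ['county', 'region']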
Let's focus on flu_df for now; we will work with the hospital dataframe in a bit. We would like to take a look at what the flu dataframe looks like. Display the first 10 rows of the dataframe in the cell below (take a look at the documentation to find out how to do this!)
In [ ]:
# TODO: Display the first 10 rows of `flu_df`
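# A hedged sketch (one common approach):
flu_df.head(10)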
Another thing that is often helpful is to inspect the types of each column in a dataframe. Output the types of flu_df in the cell below.
In [ ]:
# TODO: Display the datatypes in `flu_df`
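# A hedged sketch (one common approach):
flu_df.dtypes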
Save the types of the Season, Region, Count, and FIPS columns to a series and pass it into the autograder cell below.
In [ ]:
# TODO: Just a warmup to get exposed to indexing and series vs. dataframe
flu_df_types=
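# A hedged sketch: indexing .dtypes with a list of labels keeps the result a
# Series, e.g.
# flu_df_types = flu_df.dtypes[['Season', 'Region', 'Count', 'FIPS']]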
In [ ]:
# View the output here!
flu_df_types
Pass the series you just created into the "answer" parameter below (2 points)
In [ ]:
# [CIS 545 PennGrader Cell] - 2 points
grader.grade(test_case_id = 'flu_df_types', answer = flu_df_types)
1.1 Dropping data¶
When you work with data, you'll have NaNs, duplicates, or columns that don't give much insight into the data. There are different ways to deal with missing values (e.g., imputation, which you can read about on your own), but for now, let's drop some of these rows to clean up our data. Note that there might be multiple ways to do each step.
Refer to the documentation if you get stuck -- it's your best friend!
TODO: 1.1¶
Drop duplicate rows
Drop rows that have nulls (e.g. NaN) in them
Drop the "Season" and "FIPS" columns. These columns aren't necessary for the rest of the homework.
Drop all rows with the “Count” column equal to 0
Save the result to flu_df_cleaned (5 points)
In [ ]:
# TODO: Drop duplicate rows
In [ ]:
# TODO: Drop rows that have any nulls (NaN) in them
In [ ]:
# TODO: Drop the Season and FIPS columns
In [ ]:
# TODO: Drop all rows with "Count" equal to 0
In [ ]:
flu_df_cleaned=
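# A hedged sketch chaining the steps above (column names assume the
# underscore-renaming cell was run):
# flu_df_cleaned = flu_df.drop_duplicates().dropna().drop(columns=['Season', 'FIPS'])
# flu_df_cleaned = flu_df_cleaned[flu_df_cleaned['Count'] != 0]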
View your dataframe again to check if you dropped the columns and the appropriate rows properly:
In [ ]:
flu_df_cleaned
Pass the cleaned flu dataframe to the following cell as the input for the "answer" parameter (5 points)
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'flu_drop_test', answer = flu_df_cleaned)
1.2 Extracting Month, Day, Year of Each Row in the flu_df¶
Next, let's extract information from the Week_Ending_Date column. We want to add columns for the Month, Day, and Year of the ending date. Make sure these new columns have datatypes that allow for mathematical operations.
Hint: We highly recommend using the apply function
TODO: 1.2¶
Extract the columns Year, Month and Day from the Week_Ending_Date column -- this is case-sensitive!
Convert these columns to a sensible type that would allow us to do mathematical operations on them (hint: strings probably won’t be so good here)
Store this as flu_df_cleaned (5 points)
Extract columns for Latitude and Longitude appropriately and update flu_df_cleaned (5 points)
Compute the average week number that data was collected on and store it in mean_week_num_collected (5 points)
In [ ]:
# TODO: First, extract the Month, Day, and Year of each entry into a numeric type (see the `apply` function).
flu_df_cleaned=
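# A hedged sketch, assuming Week_Ending_Date parses cleanly with pd.to_datetime
# (the .dt accessors already return numeric columns):
# dates = pd.to_datetime(flu_df_cleaned['Week_Ending_Date'])
# flu_df_cleaned['Year'] = dates.dt.year
# flu_df_cleaned['Month'] = dates.dt.month
# flu_df_cleaned['Day'] = dates.dt.day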
Pass the dataframe into the autograder cell below (5 points) -- please note that the column names are case-sensitive!
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'flu_end_date_test', answer = flu_df_cleaned)
As mentioned above, we now have two more tasks to do in order to clean up and wrangle this flu dataframe. The first is as follows: one column in the flu_df_cleaned dataframe contains longitude and latitude data; create two columns, one for latitude and one for longitude (make sure the new columns have numeric types).
Name the columns Latitude and Longitude (5 points)
In [ ]:
# TODO: Next, extract columns for `Latitude` and `Longitude` by processing the tuple in `County_Centroid`
In [ ]:
# TODO: Cast the `Latitude` and `Longitude` column into numeric types
flu_df_cleaned=
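# A hedged sketch, assuming County_Centroid holds strings like '(42.1, -75.9)':
# latlong = flu_df_cleaned['County_Centroid'].str.strip('()').str.split(',', expand=True)
# flu_df_cleaned['Latitude'] = pd.to_numeric(latlong[0])
# flu_df_cleaned['Longitude'] = pd.to_numeric(latlong[1])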
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'flu_lat_long', answer = flu_df_cleaned)
Lastly, we are interested in learning a bit more about when this data was collected. We aren't sure if every week of the year is equally represented in the dataset. We want to know the average week number of the Week_Ending_Date column. The week number is a scalar that ranges from 1 to 52.
Store this as mean_week_num_collected. Think about what type of variable this should be, considering that we’re looking at a week number! (5 points)
In [ ]:
# TODO: Finally, compute the average week of the year using the "Week_Ending_Date" column and set your answer to "mean_week_num_collected"
mean_week_num_collected =
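# A hedged sketch; .dt.isocalendar().week (pandas >= 1.1) gives the week number.
# Since a week number is integral, think about what type the answer should be:
# weeks = pd.to_datetime(flu_df_cleaned['Week_Ending_Date']).dt.isocalendar().week
# mean_week_num_collected = int(weeks.mean())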
In [ ]:
# View the output of the variable here before running the grader cell
print(mean_week_num_collected)
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'flu_week_mean', answer = mean_week_num_collected)
1.4 Which Counties have the highest average count of Influenza A from September to March?¶
Let’s find which counties have the highest average count of Influenza A from September to March (flu season!), inclusive. We’ll use a groupby function here, and also do a bit of postprocessing to clean up our dataframe.
TODO: 1.4¶
Create a dataframe with only average (mean) influenza A counts and County name. The columns should be named County and Average_Flu_A_Count
Order it from highest to lowest average influenza A count
Edit County so that only the first letter of each county name is capitalized
Store this as flu_A_by_county (8 points)
Make a plot of this data using the matplotlib package (4 points)
In [ ]:
# TODO: Create `flu_A_by_county` – look at the groupby functions!
flu_A_by_county=
In [ ]:
# TODO: Make sure to update your dataframe as appropriate so it matches the syntax expected!
flu_A_by_county=
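# A hedged sketch. The exact Disease label for influenza A is an assumption --
# check flu_df_cleaned['Disease'].unique() first:
# flu_a = flu_df_cleaned[(flu_df_cleaned['Disease'] == 'INFLUENZA_A') &
#                        ((flu_df_cleaned['Month'] >= 9) | (flu_df_cleaned['Month'] <= 3))]
# flu_A_by_county = (flu_a.groupby('County')['Count'].mean()
#                         .reset_index()
#                         .rename(columns={'Count': 'Average_Flu_A_Count'})
#                         .sort_values('Average_Flu_A_Count', ascending=False))
# flu_A_by_county['County'] = flu_A_by_county['County'].str.capitalize()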
Your table should look something like
County Average_Flu_A_Count
Countyname Number1
Othercounty Number2
In [ ]:
# [CIS 545 PennGrader Cell] - 8 points
grader.grade(test_case_id = 'flu_A_top_county', answer = flu_A_by_county)
Plot a bar plot of the 5 counties with the highest average Influenza A counts from September to March. We will be grading this manually, and we understand there will be a lot of differences between individual plots, so don't worry too much about this. (4 points)
We're going to be making sure your plots are correct and reasonable; small things like axis labels and colors, we'd invite you to add if you want to make your plot better and more useful to an end user. We likely won't be answering questions about the specifics here as a result!
In [ ]:
# TODO: Create a plot using matplotlib. - 4 points.
# You may also use other libraries like plotly or seaborn if you like, but we can't guarantee the same level of help for these other libraries
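# A hedged sketch with matplotlib:
# top5 = flu_A_by_county.head(5)
# plt.bar(top5['County'], top5['Average_Flu_A_Count'])
# plt.xlabel('County')
# plt.ylabel('Average Influenza A Count (Sep-Mar)')
# plt.title('Top 5 NY Counties by Average Influenza A Count')
# plt.show()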
1.5 Looking at Hospital Infection Data¶
Now let's bring in our hospital infection data.
Just like flu_df, we need to clean and prepare the hospital infection data (more data wrangling!). Visualize the hospital_df dataframe below to take a look at it.
In [ ]:
#TODO: Take a look at the hospital_df
TODO: 1.5¶
Compute second_null, a series giving, for each row, the column that contains the row's second null value (this won't be the same column for all rows) (8 points)
Before we clean this dataframe, we would like to answer a finicky question about it. We want to get an idea of the extent of the null values in the dataframe. For each row in hospital_df, we want to know which column contains the second null value in the row.
Create a series named second_null that answers this question. You may find functions like cumsum() and idxmax() helpful -- we'd recommend looking at the documentation to think about how to use them before asking on Piazza!
In [ ]:
# TODO: Compute `second_null`
second_null=
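# A hedged sketch: isnull() marks the nulls, cumsum(axis=1) counts them left to
# right, and idxmax(axis=1) returns the first column where the running count
# hits 2, i.e., the column holding the row's second null. Rows with fewer than
# two nulls would need separate handling:
# second_null = (hospital_df.isnull().cumsum(axis=1) == 2).idxmax(axis=1)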
Pass your second_null series into the grader cell below:
In [ ]:
# [CIS 545 PennGrader Cell] - 8 points
grader.grade(test_case_id = 'find_second_null', answer = second_null)
1.6 Cleaning up Hospital Infection Data¶
We now want to clean up our hospital infection data. Let's follow some of the same steps we performed with flu_df for dropping duplicates/nulls, plus two more cleaning tasks.
TODO: 1.6¶
Drop duplicates and null values from hospital_df
Create columns for Latitude and Longitude
Replace every instance of "Not significantly different than NYS 2018 average" in the Comparison_Results column with the string "p-value greater than 0.05, NOT significantly different than NYS 2018 average"
Save this as hospital_df_cleaned (8 points)
First, drop all duplicates and nulls as we did before!
In [ ]:
# TODO: Drop duplicates and nulls
Now, create columns for latitude and longitude
In [ ]:
# TODO: Create columns for `Latitude` and `Longitude`
Finally, replace the entries in Comparison_Results
In [ ]:
#TODO: Replace the text as mentioned above!
hospital_df_cleaned=
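# A hedged sketch; the name of the column holding the coordinates is an
# assumption -- inspect hospital_df.columns for the real one:
# hospital_df_cleaned = hospital_df.drop_duplicates().dropna()
# latlong = hospital_df_cleaned['Location'].str.extract(r'\(([-\d.]+),\s*([-\d.]+)\)')
# hospital_df_cleaned['Latitude'] = pd.to_numeric(latlong[0])
# hospital_df_cleaned['Longitude'] = pd.to_numeric(latlong[1])
# hospital_df_cleaned['Comparison_Results'] = hospital_df_cleaned['Comparison_Results'].replace(
#     'Not significantly different than NYS 2018 average',
#     'p-value greater than 0.05, NOT significantly different than NYS 2018 average')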
Pass the dataframe hospital_df_cleaned into the grader cell below (8 points)
In [ ]:
# [CIS 545 PennGrader Cell] - 8 points
grader.grade(test_case_id = 'hospital_df_clean_up', answer = hospital_df_cleaned)
Part 2: Grabbing some more data: Generating Zipcodes¶
We are really interested in comparing whether flu cases are at all related to hospital acquired infections. As such we would like to do a location based comparison. We have a few tasks in this part:
TODO: Part 2¶
Order the flu dataframe by Disease (lexicographically) and create a new dataframe called flu_df_limited that contains only the first 300 rows. (5 points)
Order the hospital_df by Facility_ID (from least to greatest) and create a new dataframe called hospital_df_limited that contains only the first 300 rows. (5 points)
You will use the flu_df_cleaned and hospital_df_cleaned to make these new smaller dataframes.
Notice that neither the flu dataframe nor the hospital dataframe has a zipcode column:
We want the zipcode for each row for each of the two dataframes
Add a new column to the dataframe with the associated zipcode for the given longitude/latitude point
This can be achieved through using the latitude and longitude. Google around for ways to convert latitude/longitude points into a zipcode in Python. We suggest a few options:
https://geopy.readthedocs.io/en/stable/
https://developer.mapquest.com/
https://stackoverflow.com/questions/54320931/python-code-for-reverse-geo-coding-using-google-api
This part can be a little challenging. Keep working at it and don’t get discouraged — TAs are also here to help (office hours & Piazza)!
You’ll find that we might have to iterate over the values, and there are a couple ways to do this:
apply functions that do operations in parallel (generally best-practice in pandas)
for-loops and other iterators…but if you run this on the entire dataset it will be slow, so we would recommend thinking a bit on how to make things efficient if you choose to go this route
Make sure the added zipcode column is named flu_zipcode for flu_df_limited and hospital_zipcode for hospital_df_limited. Note: keep the zipcode column type as strings. Name your two new dataframes flu_df_limited and hospital_df_limited and submit them to their respective autograder cells below.
In [ ]:
# There are a few options to do this problem but we recommend using the geocoder package. Let’s import it here.
import geocoder
# Take a look at the geocoder.mapquest function
In [ ]:
# TODO: Create `flu_df_limited`
flu_df_limited=
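# A hedged sketch: sort lexicographically by Disease, keep the first 300 rows
# flu_df_limited = flu_df_cleaned.sort_values('Disease').head(300)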
In [ ]:
# TODO: Create `hospital_df_limited`
hospital_df_limited=
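# A hedged sketch: sort by Facility_ID ascending, keep the first 300 rows
# hospital_df_limited = hospital_df_cleaned.sort_values('Facility_ID').head(300)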
In [ ]:
# TODO: Find and add flu zipcodes to `flu_df_limited` — and do the same for `hospital_df_limited`
hospital_df_limited=
flu_df_limited=
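# A hedged sketch of the reverse-geocoding step. MAPQUEST_KEY is a placeholder
# for your own API key; geocoder.mapquest's reverse lookup exposes the zipcode
# via .postal, and we keep it as a string as required:
# def to_zip(lat, lng):
#     g = geocoder.mapquest([lat, lng], method='reverse', key=MAPQUEST_KEY)
#     return str(g.postal)
# flu_df_limited['flu_zipcode'] = flu_df_limited.apply(
#     lambda row: to_zip(row['Latitude'], row['Longitude']), axis=1)
# hospital_df_limited['hospital_zipcode'] = hospital_df_limited.apply(
#     lambda row: to_zip(row['Latitude'], row['Longitude']), axis=1)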
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'hospital_df_limited', answer = hospital_df_limited)
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'flu_df_clean_up', answer = flu_df_limited)
Part 3: Combining the data¶
When you become a full-time data scientist, your data will often be spread out across multiple files/tables. The way to combine these tables is through join/merge operations. If you're familiar with SQL, this will feel very familiar. If not, don't worry. I believe in you!
To start, here's a nice diagram which shows you the different types of joins: [Venn diagram of join types]
A clarifying point: The two venn diagrams with the “(if Null)” are also called Left Outer Join and Right Outer Join
When working in data science, a large portion of the time we will want to perform an inner join (see the diagram above for information on what an inner join is if you are not familiar). As such, perform an inner join between hospital_df_limited and flu_df_limited on the Year columns (we would like matches between the Year column of hospital_df_limited and the Year column of flu_df_limited).
Note This is a change made on 9/20 5PM EDT to correct the previous typos that did not include _limited in the df names. It is also in the FAQ.
Name your answer inner_joined_df (4 points).
In [ ]:
#TODO: Perform an inner join between the hospital_df_limited and flu_df_limited on the Year column.
inner_joined_df =
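# A hedged sketch, assuming both frames carry a Year column at this point:
# inner_joined_df = hospital_df_limited.merge(flu_df_limited, on='Year', how='inner')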
In [ ]:
# [CIS 545 PennGrader Cell] - 4 points
grader.grade(test_case_id = 'basic_join_test', answer = inner_joined_df)
TODO: Part 3.1¶
Let’s now do a join that is a little bit more interesting / uncommon.
Find all the zipcodes that appear in one dataset but DO NOT appear in both:
If 11111 is in flu data and hospital infection data, we don’t want it
If 22222 is in flu data but not in hospital, we want to keep it
If 33333 is in hospital infection data but not in flu, we want to keep it
In other words, we want all the zipcodes that have flu cases but no hospital infections, and vice versa (this is probably impossible in the real world, but our data is a small sample of the real world). Note we are using the hospital_df_limited and flu_df_limited dataframes here
Hint: Google around for Exclusive Full Joins
We want the final answer as a two-column dataframe with columns named flu_zips and hospital_zips (6 points)
flu_zips should contain the zipcodes that only exist within the flu dataset, and vice versa for the hospital_zips column
If a row has a zipcode in hospital_zips, that same row should not have a flu_zips entry (it should be NaN); otherwise, that zipcode would be shared between both datasets
Make sure your final answer dataframe is named: combined_df
In [ ]:
# TODO: Create combined_df
combined_df=
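# A hedged sketch of an exclusive full join via merge's indicator flag:
# m = flu_df_limited[['flu_zipcode']].drop_duplicates().merge(
#         hospital_df_limited[['hospital_zipcode']].drop_duplicates(),
#         left_on='flu_zipcode', right_on='hospital_zipcode',
#         how='outer', indicator=True)
# m = m[m['_merge'] != 'both']
# combined_df = m.rename(columns={'flu_zipcode': 'flu_zips',
#                                 'hospital_zipcode': 'hospital_zips'})[['flu_zips', 'hospital_zips']]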
Pass your newly created dataframe combined_df to the PennGrader cell below:
In [ ]:
# [CIS 545 PennGrader Cell] - 6 points
grader.grade(test_case_id = 'join_test', answer = combined_df)
Part 4: Working With Zillow and AirBnB data!¶
Let's now look at some Zillow and AirBnB data! As a data scientist, you will often be asked to work with many different kinds of datasets (and through this course you will get hands-on experience with data from many different fields!)
Read in the Zillow and AirBnB data to pandas dataframes and take a look at it so that you are familiar with the features (columns) in both datasets.
In [ ]:
# TODO: Take a look at the Zillow Df
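# A hedged sketch, assuming the UPDATED_2 file downloaded earlier:
zillow_df = pd.read_csv('UPDATED_2_zillow_df.csv')
zillow_df.head()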
In [ ]:
#TODO: Take a look at the AirBnB Df
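# A hedged sketch, assuming the UPDATED_2 file downloaded earlier:
airbnb_df = pd.read_csv('UPDATED_2_airbnb_df.csv')
airbnb_df.head()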
4.1 Correlation Matrix¶
We want to see if, on average, there is any correlation between the gross square footage of apartments and Airbnb prices for shared zipcodes, and if there is a correlation, how strong it is.
TODO: 4.1¶
Find the average value of each column for each zipcode within Zillow
Find the average value of each column for each zipcode within AirBnb
Merge them on the matching column:
Your merged table may now have 1-3 nonsense columns
Since the values have been averaged by zipcode, some of these columns no longer include valid information… think on it and visit office hours if you get stuck!
Identify and drop these (can also be done before merging)
Generate the correlation matrix. Find the value associated with Zillow Gross Square footage vs. Airbnb Price. Hint: Read about Pandas “corr()” function.
Name your final correlation matrix dataframe: correlation_matrix (5 points)
Plot a correlation matrix — just to get a sense of what it might look like!
Note: You can correctly answer the following section using just Zillow's gross square footage column and Airbnb's price column, but after this question we want to plot a correlation matrix, which will require all the columns (not just the two listed above)
In [ ]:
# TODO: Create your correlation matrix
correlation_matrix=
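# A hedged sketch; the zipcode column names are assumptions -- check both
# dataframes for the actual labels. Averaging by zipcode makes id-like columns
# meaningless, which is the "nonsense columns" hint above:
# zillow_avg = zillow_df.groupby('ZIP_CODE').mean().reset_index()
# airbnb_avg = airbnb_df.groupby('zipcode').mean().reset_index()
# merged = zillow_avg.merge(airbnb_avg, left_on='ZIP_CODE', right_on='zipcode')
# correlation_matrix = merged.drop(columns=['ZIP_CODE', 'zipcode']).corr()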
Submit the correlation matrix to the autograder
(5 points)
In [ ]:
# [CIS 545 PennGrader Cell] - 5 points
grader.grade(test_case_id = 'corr_mat', answer = correlation_matrix)
Here we provide code for you to visualize the correlation matrix. In the following code snippet below, please assign your correlation matrix to the variable named “corr” and then run the cell. You should see a correlation matrix!
This isn’t a graded cell; but you might find that knowing how to plot correlation matrices is helpful in general EDA and maybe later on in the class! 🙂
In [ ]:
sns.set(style = "white")
# Compute the correlation matrix
# ASSIGN THE "corr" VARIABLE TO YOUR CORRELATION MATRIX
corr = correlation_matrix
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title("Correlation Heatmap: Airbnb & Zillow Data")
plt.show()
4.2 Joining columns on roughly similar data¶
This problem is pretty difficult conceptually. We’re putting it in the assignment because we know you all can do this!
There will be times when you have similar (but not exact) data between columns that you would like to join on. Conventional joins, like the one you did above, scan through each of the specified join columns and find exact matches to establish that rows from the associated datasets can be combined into one row.
But what happens if the join operation sees "Mr.Smith" in one dataset versus "Mr Smith" in the other? It will skip over them and decide the two entries can't be joined, even though the difference is minor and the two values should be matched. This is where we combine what we call "n-grams" with a technique called Jaccard similarity.
Let's break this down: an n-gram is a contiguous sequence of n items. For example, given the sentence "CIS 545 is a great class" and n = 2, the word-level n-grams are: ["CIS 545", "545 is", "is a", "a great", "great class"]
The above example is tokenizing/breaking up the n-grams by spaces. However, we can also do it characterwise. Given “CIS 545 is a great class” with n-gram of 2 and doing characterwise splits, we will have: [“CI”, “IS”, “S5”, “54”, “45”, “5i”, … ]
Jaccard similarity is a way to measure the similarity between two sets. There's plenty of documentation online for it, but mathematically it is defined as J(A, B) = |A ∩ B| / |A ∪ B|: the size of the intersection of the two sets divided by the size of their union.
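To make this concrete, here's a hedged sketch using the py_stringmatching (sm) and py_stringsimjoin (ssj) libraries imported at the top; the qval, threshold, and key/column names are illustrative assumptions, not the graded settings:
qg = sm.QgramTokenizer(qval=2)            # character 2-grams (the TODO below asks for 5)
a = qg.tokenize('Mr.Smith')               # e.g. ['#M', 'Mr', 'r.', '.S', ...]
b = qg.tokenize('Mr Smith')
print(sm.Jaccard().get_sim_score(a, b))   # |A ∩ B| / |A ∪ B|, between 0 and 1
# The join itself applies the same idea across two whole tables, e.g.:
# ssj.jaccard_join(zillow_df, airbnb_df, 'id', 'id',   # key columns are assumptions
#                  'NEIGHBORHOOD', 'neighbourhood', qg, 0.3,
#                  l_out_attrs=['NEIGHBORHOOD'], r_out_attrs=['neighbourhood'])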
We will focus on the NEIGHBORHOOD column in zillow_df and the neighbourhood column in airbnb_df. If you take a look, these two columns share the same type of information but are formatted differently (e.g., UPPER WEST SIDE (59-79) in zillow_df versus Upper West Side in airbnb_df).
We know these columns mean very similar things... but merge/join operations don't. One possibility is to clean the columns so that the formats match. But what if you have a ton of entries and can't possibly account for everything?
Our goal here is to join the neighbourhood columns without having to perform extensive cleaning operations in one of the columns.
TODO: 4.2¶
Take your zillow_df NEIGHBORHOOD column and lowercase every entry except its first character (i.e., capitalize it). We do this to make the neighborhood columns between the two datasets as similar as possible, reducing the cleaning necessary
Define a tokenizer with 5 grams. We have already imported our recommended library to do this at the top. Try looking at this for how to define a tokenizer: http://anhaidgroup.github.io/py_stringmatching/v0.2.x/QgramTokenizer.html
Do a jaccard join:
Again, we have already imported our recommended library for doing this at the top
Look here for more documentation: http://anhaidgroup.github.io/py_stringsimjoin/v0.1.x/api/jaccard_join.html
When calling this function, set the l_out_attrs and r_out_attrs parameters to their associated table’s neighbourhood column so that we keep those columns as we want to see our join results
When selecting the tolerance (threshold), play around with some values between 0.1 and 0.5 to see what the join thinks could be "joinable"
Drop duplicates from your jaccard-joined dataframe based on both neighbourhood columns, as you will have a lot of duplicates. In other words, you might have 100+ entries of [MURRAY HILL, MURRAY HILL]. We only want one entry for this.
Name your answer to this section: answer_df (14 points) -- this is a large test case since it's a super involved question. Good luck!
In [ ]:
# TODO: Preprocess `zillow_df` and create a tokenizer with 5 grams
In [ ]:
# TODO: Create your jaccard join and drop duplicates. Name your answer dataframe answer_df.
answer_df=
Find a row entry in this dataframe where the Airbnb and Zillow neighborhoods aren't the same but are similar (e.g., Clinton vs Clinton Hill, or Mill basin vs Mill Basin). Assign the associated name (you can manually type it out) to the following variables based on the dataset it originally belonged to. When you submit your answer to the autograder, make it a tuple of (result dataframe from the join, airbnb_neighborhood, zillow_neighborhood). (14 points)
It should be possible to earn partial credit for this! But it definitely is a harder problem; good luck!
In [ ]:
# Add the Airbnb and Zillow neighborhood examples
airbnb_neighborhood =
zillow_neighborhood =
answer = (answer_df, airbnb_neighborhood, zillow_neighborhood)
In [ ]:
# [CIS 545 PennGrader Cell] - 14 points
grader.grade(test_case_id = 'jaccard_join', answer = answer)
Part 5: XPath¶
Brought to you by popular demand from students...¶
So far, we've looked at data from Airbnb and Zillow. It turns out Airbnb was a unicorn startup! According to Wikipedia, a unicorn is a "privately held startup company with a current valuation of US$1 billion or more." Let's look into unicorn startup companies some more. Take a look at this Wikipedia page that has a full list of unicorn startup companies: https://en.wikipedia.org/wiki/List_of_unicorn_startup_companies. Note that Uber is on the former unicorns list since it IPO'd, and Airbnb's delayed IPO means it still is a unicorn.
TODO: Part 5¶
Using this data, return a dataframe that has the top 50 unicorn companies by Valuation. Your dataframe should:
Be in descending order by Valuation
Have exactly two columns, Company and Valuation
Hint: The first row should be something like:
Company Valuation
Ant Financial 150
There are many different ways to do this, but we recommend trying it with XPath. That's also the approach we'll recommend in office hours!
Assign the answer of this section to a dataframe named answer_df and submit it in a tuple with the path for xpath (9 points)
In [ ]:
# TODO: Request the data and build the DOM tree (we've done this for you!)
w = requests.get("https://en.wikipedia.org/wiki/List_of_unicorn_startup_companies")
dom_tree = html.fromstring(w.content)
print(dom_tree)
In [ ]:
# TODO: Create `answer_df` and set your final xpath "path"
# `path` should be the value you pass in for dom_tree.xpath()
path=
answer_df=
In [ ]:
# Check that the path prints!
# It should be in the form of '/.../text()' (may vary slightly; only include the actual parameter used in dom_tree.xpath!)
print(path)
In [ ]:
# [CIS 545 PennGrader Cell] - 9 points
grader.grade(test_case_id = 'xPath_test', answer = (answer_df, path))
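If you're stuck on writing a path, here's a hedged sketch of one possible approach. The table position and column indices are assumptions about the page's structure; inspect the live HTML and adjust as needed:
# A hedged sketch, not the official solution. It assumes the current-unicorn
# list is the first "wikitable" on the page, with company names (as links) in
# column 1 and valuations in column 2; it also assumes the two extracted lists
# line up row for row.
path = '(//table[contains(@class, "wikitable")])[1]//tr/td[1]//a/text()'
companies = dom_tree.xpath(path)
valuations = dom_tree.xpath('(//table[contains(@class, "wikitable")])[1]//tr/td[2]/text()')
answer_df = pd.DataFrame({'Company': companies, 'Valuation': valuations})
answer_df['Valuation'] = pd.to_numeric(answer_df['Valuation'].str.strip(), errors='coerce')
answer_df = (answer_df.sort_values('Valuation', ascending=False)
                      .head(50)
                      .reset_index(drop=True))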
You're done! Time to wrap things up and submit HW1.
HW Submission¶
The good news is that you basically already know your score because of the autograder. In fact, since the autograder is integrated into Gradescope for the first time, you should be able to see your autograder score automatically! There is one question we're manually grading (and we'll check for plagiarism), but for the most part things are relatively certain.
Before you submit on Gradescope (you must submit your notebook to receive credit):
Restart and Run-All to make sure there's nothing wrong with your notebook
Double check that you have the correct PennID (all numbers) in the autograder
Make sure you've run all the PennGrader cells
Go to the "File" tab at the top left, click "Download .ipynb", and upload the Python notebook to Gradescope directly!
Let the course staff know ASAP if you have any issues submitting, but otherwise best of luck! Congrats on finishing the first HW.