Introduction to Software Development
Homework 5: Analyze Celebrity Deaths
(Deadline as per Coursera)
This homework deals with the following topics:
● The pandas module
● Loading data
● Joining data
● Querying data
● Summarizing data
● Aggregate functions
● The numpy library
● The matplotlib library
● Data visualization
General Idea of the Assignment
In this assignment, you will analyze data from the file “celebrity_deaths_2016.xlsx”, which contains records of deaths of famous humans and non-humans in 2016. You’ll use functions from the pandas module for loading, inspecting and querying data. You are expected to summarize data, create pivot tables and apply aggregate functions, and to visualize data using histograms and other kinds of plots.
For each question, there are clear instructions in each cell. Follow those instructions and write the code after each block of:
# YOUR CODE HERE
raise NotImplementedError()
We’ll run a Python test script against your program to test whether each function implementation is correct.
Make sure to delete the line raising an error! Please use the exact variable name if it is specified in the comment.
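As an illustration of what a completed cell might look like, here is a minimal sketch. The function name mean_age_by_cause, the question it answers and the toy data are hypothetical and are not taken from the notebook; only the column names age and cause_of_death come from the dataset description. The point is that the line raising NotImplementedError is gone and real code sits in its place.

```python
import pandas as pd


def mean_age_by_cause(df):
    """Hypothetical question: return the mean age for each cause of death."""
    # YOUR CODE HERE
    # The `raise NotImplementedError()` line was deleted and replaced by the
    # implementation below. The variable name `result` is an assumption; use
    # whatever name the cell's comment specifies.
    result = df.groupby("cause_of_death")["age"].mean()
    return result


# Toy data invented for this sketch (not real records from the dataset).
toy = pd.DataFrame(
    {
        "cause_of_death": ["x", "x", "y"],
        "age": [60, 80, 50],
    }
)
summary = mean_age_by_cause(toy)
```

If the placeholder line is left in, the cell raises an exception as soon as the function is called, so the test script cannot check the result.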
About the Data
All of the data is contained within the “celebrity_deaths_2016.xlsx” file, which contains 2 sheets:
● “celeb_death”: contains records of deaths of famous humans and non-humans
    ○ There are 5 columns: date_of_death, name, age, bio, cause_id
● “cause_of_death”: contains the causes of the deaths
    ○ There are 2 columns: cause_id, cause_of_death
During this exercise, you’ll need to merge the “celeb_death” data with the “cause_of_death” data using the “cause_id” column. This will give you the cause for each death.
Other information about the dataset:
● The cause of death was not reported for all individuals
● The dataset might include deaths that took place in other years (you’ll need to ignore these records)
● The dataset might contain duplicate records (you’ll need to remove them)
Submission
To complete the assignment, download celebrity_deaths_2016.ipynb and celebrity_deaths_2016.xlsx.
Evaluation
Two points for each question.
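The loading, cleanup and merge steps described under About the Data could be sketched roughly as follows. This is only one possible approach under stated assumptions: the function names load_sheets and clean_and_merge are hypothetical, the toy rows are invented, date_of_death is assumed to be parsed as a datetime column, and a left join is assumed so that individuals with no reported cause are kept with a missing cause_of_death. The notebook's own instructions take precedence wherever they differ.

```python
import pandas as pd


def load_sheets(path="celebrity_deaths_2016.xlsx"):
    """Read both sheets; passing a list of sheet names returns a dict of DataFrames.

    Reading .xlsx files requires an Excel engine such as openpyxl to be installed.
    """
    sheets = pd.read_excel(path, sheet_name=["celeb_death", "cause_of_death"])
    return sheets["celeb_death"], sheets["cause_of_death"]


def clean_and_merge(deaths, causes, year=2016):
    """Drop duplicate rows, keep only deaths in `year`, then attach the cause."""
    deaths = deaths.drop_duplicates()
    # Assumes date_of_death has a datetime dtype so the .dt accessor works.
    deaths = deaths[deaths["date_of_death"].dt.year == year]
    # Left join (an assumption): rows without a matching cause_id are kept,
    # and their cause_of_death is left as NaN.
    return deaths.merge(causes, on="cause_id", how="left")


# Toy stand-ins for the two sheets; the values are invented for illustration.
celeb_death = pd.DataFrame(
    {
        "date_of_death": pd.to_datetime(
            ["2016-01-10", "2016-01-10", "2015-12-31", "2016-04-21"]
        ),
        "name": ["A", "A", "B", "C"],
        "age": [69, 69, 80, 57],
        "bio": ["singer", "singer", "actor", "musician"],
        "cause_id": [1, 1, 2, None],
    }
)
cause_of_death = pd.DataFrame(
    {"cause_id": [1, 2], "cause_of_death": ["cancer", "stroke"]}
)

merged = clean_and_merge(celeb_death, cause_of_death)
```

With the real file, the two toy frames would be replaced by the result of load_sheets(). Removing duplicates and out-of-year records before merging keeps later counts and averages from being inflated by repeated or irrelevant rows.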