程序代写代做代考 Data Analysis – Celebrity Deaths in 2016¶

Data Analysis – Celebrity Deaths in 2016¶
Source: Wikipedia – Deaths in 2016
Structure of dataset:¶
• File: “celebrity_deaths_2016.xlsx”
• Contains 2 sheets:
▪ “celeb_death”: contains records of deaths of famous humans and non-humans
▪ “cause_of_death”: contains the causes of the deaths (you’ll need to merge it with the “celeb_death” sheet)
Other information about the dataset:¶
• The cause of death was not reported for all individuals
• The dataset might include deaths that took place in other years (you’ll need to ignore these records)
• The dataset might contain duplicate records (you’ll need to remove them)
The goals of the exercise:¶
• Load, merge, and clean the data
• Explore the data and answer some simple questions
• Run some basic analysis
• Visualize your results
In [ ]:
###########################################################
### EXECUTE THIS CELL BEFORE YOU TO TEST YOUR SOLUTIONS ###
###########################################################

import imp, os, sys
import importlib
sol = imp.load_compiled(“sol”, “./sol.py”)
from nose.tools import assert_equal
from pandas.util.testing import assert_frame_equal, assert_series_equal
In [ ]:
“””
We’re providing most of the import statements you need for the entire exercise
“””

import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Load, merge, and clean the data¶
In [ ]:
“”” 1.
1. Load the “celebrity_deaths_2016.xlsx” data file in the name “xl”
2. Print the sheet names
“””

# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(xl.sheet_names, sol.xl.sheet_names)
In [ ]:
“”” 2.
1. Read the “celeb_death” sheet into a dataframe named “df”
2. Take a look at the top 5 rows. Save it in a variable called ‘top5’, then print it
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(top5, sol.top5)
In [ ]:
“”” 3.
1. Take a look at the data types stored in each column in df. Store these in a variable called ‘df_dtypes’
2. Get the shape of df. Store this in a variable called ‘df_shape’
3. Print these
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(df_dtypes, sol.df_dtypes)
assert_equal(df_shape, sol.df_shape)
In [ ]:
“”” 4.
Drop the duplicates (based on all columns) from df
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
assert_frame_equal(df, sol.df2)
In [ ]:
“”” 5.
1. Read the “cause_of_death” sheet into a DataFrame named “cause_of_death”
2. Take a look at the top 5 rows. Store this in a variable named cause_top5, then print it
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(cause_top5, sol.cause_top5)
In [ ]:
“”” 6.
Drop the duplicates (based on the “cause_id” column) from the cause_of_death DataFrame

Hint: There is a single DataFrame method that does this
Use the “subset” argument to specify the “cause_id” column

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(cause_of_death, sol.cause_of_death)
In [ ]:
“”” 7.
1. Merge the cause_of_death DataFrame with the df DataFrame and name the new DataFrame as “df”
2. Take a look at the top 5 rows in df. Save these in a variable called df_top5, then print it

Note: There are records in df (left DataFrame) that do not have a matching record in cause_of_death (right DataFrame)
We want to see ALL records in df despite the missing matches in cause_of_death, so you DON’T want to use an “inner join”
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################
assert_frame_equal(df_top5, sol.df_top5)

Querying data¶
For the following questions, all the operations are on the Dataframe df.
In [ ]:
“”” 8.
We’ll be doing some calculations with the age column, but it was loaded from the data file as dtype “object”
So first, we need to cast DataFrame df to a numeric value
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(df[‘age’], sol.df3[‘age’])
In [ ]:
“”” 9.
What was the average age of death? Store this value in a variable called ‘avg_age’, then print it
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(avg_age, sol.avg_age)
In [ ]:
“”” 10.
How many people died after 70 years old because of?
— Store the result count in a variable named “count” and print it
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(count, sol.count)
In [ ]:
“”” 11.
Who died the youngest and what was the cause of death?
— Store the name in a variable named “youngest_name” and print it
— Store the cause in a variable named “youngest_cause” and print it

Hint: Get the min age and find the record that has that value
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(youngest_name, sol.youngest_name)
assert_series_equal(youngest_cause, sol.youngest_cause)
In [ ]:
“”” 12.
We’ll be running some queries based on the “bio” and “cause of death” columns,
but they were loaded from the data file as objects. So first, cast these two colums to strings
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(df[’cause of death’], sol.df[’cause of death’])
assert_series_equal(df[‘bio’], sol.df[‘bio’])
In [ ]:
“”” 13.
What is total number of deaths caused by cancer?
— Store the result count in a variable named “count_cancer” and print it

Hint: Check if the cause_of_death is any type of (contains) “cancer”
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(count_cancer, sol.count_cancer)
In [ ]:
“”” 14.
How many American celebrities died?
— Store the result count in a variable named “count_american” and print it

Hint: Search the bio for “American”
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(count_american, sol.count_american)
In [ ]:
“”” 15.
What was one cause of death for celebrities who died at 50?
— Name the result as “rand_cause”
— Print the result using “print(“Age 50. Cause of Death:”, rand_cause)”

Hint: import random and randomly select a cause of death for celebrities who died at 50
“””

#############################################################################################
### DO NOT MODIFY THIS! WE NEED TO SEED THE RANDOM VALUE TO ACCURATELY TEST YOUR SOLUTION ###
import random
random.seed(0)
#############################################################################################

# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_equal(rand_cause, sol.rand_cause)
In [ ]:
“”” 16.
What was the mean age for each cause of death?

Hint: import numpy and group by ’cause of death’, then get the mean
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(df, sol.df_grouped_cause)

Count the number of people who died in each month of 2016¶
1. Create new columns that shows which month and year each person died in
2. Group all the entries based on the month they appeared in
In [ ]:
“”” 17.
Make a new column in the DataFrame df with the numeric month of death

Hint:
Use the apply() method to run a function against the “date of death” column,
and return the numeric month (get the value by using “.month”)
“””

def get_month(date):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(df[‘month’], sol.df[‘month’])
In [ ]:
“”” 18.
Make a new column in the DataFrame df with the year of death

Hint: Apply a function to get the year from the “date of death” column
“””

def get_year(date):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(df[‘year’], sol.df[‘year’])
In [ ]:
“”” 19.
Filter out any deaths that didn’t take place in 2016
— Name the new DataFrame as “df_2016”
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(df_2016, sol.df_2016)
In [ ]:
“”” 20.
Using a pivot table, obtain a list that contains the number of people that died in each month
— Name the result as “df_per_month”
“””
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_frame_equal(df_per_month, sol.df_per_month)

Data Visualization¶
In [ ]:
“”” 21.
Visualize the number of deaths per month as a bar chart

Hint: A DataFrame has a simple .plot() method you can use.

The x axis should show the individual number of the month and the y axis should show the death counts
Don’t forget to add a title and labels for the x and y axes
“””

# YOUR CODE HERE
raise NotImplementedError()

###########################
### DO NOT MODIFY THIS! ###
plt.show()
###########################
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()

Make a bar chart that plots the number of deaths per nationality¶
1. Create a new column that identifies the nationality of each celebrity, extracting the first word from the bio
2. Make a bar chart that plots the number of deaths per nationality
In [ ]:
“”” 22.
Create a new column in the DataFrame df that identifies the nationality of each celebrity,
extracting the first word from the bio

Hint:
To get the nationality from the bio, use the method split() on the column “bio”
and use the first element in the split result as the nationality.
“””

def get_nationality(bio):
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()
In [ ]:
“”” 23.
Make a bar chart that plots the number of deaths per nationality
Only include nationalities with more than 50 deaths
— Name the result DataFrame as “unlucky_countries”

Hint: A DataFrame has a simple .plot() method you can use.

The x axis should show the individual nationalities and the y axis should show the death counts
Don’t forget to add a title and labels for the x and y axis
“””
# YOUR CODE HERE
raise NotImplementedError()

###########################
### DO NOT MODIFY THIS! ###
plt.show()
###########################
In [ ]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert_series_equal(unlucky_countries, sol.unlucky_countries)
In [ ]:
# YOUR CODE HERE
raise NotImplementedError()