Final Project
[Full mark: 100; 70% of module grade]
BEE1038: Introduction to Data Science in Economics
In this project, you will demonstrate your understanding and mastery of programming in Python using data science tools, in addition to showing your understanding of the different research methods that use big data.
What you have learnt so far should cover almost everything you will need, so if you are stuck, read through the notebooks again. Some of your tutorials and exercises may also be relevant. For Problem 4, you may need to consult the "Bit by Bit" book1. If you are still unsure, have a look online. Google and Stack Overflow are your friends. You must use Python only; MATLAB and Excel are not acceptable.
The grade of this project will be calculated out of 100, and it will contribute 70% towards your overall grade in the module. The following aspects need to be shown:
● Manipulation of different data structures, including dictionaries and data frames.
● Preparing and preprocessing data.
● Doing a basic plot, and changing plot markers, colours, etc.
● Improving and extending analysis.
● Showing an understanding of different research methods for using big data.
● Showing the ability to critique some scientific approaches.
Your submission will be a compressed file (.zip) containing:
1. A copy of your Python script named solution_code.ipynb (done in Jupyter Notebook).
Your scripts must be sufficient to reproduce your answers to Problems 1-3.
2. The same copy exported as a PDF, named solution_code.pdf 2
3. A PDF file named P4_answers.pdf that contains your answers to Problem 4.
The deadline is Monday 26th April at 3:00 PM (BST).
Collaboration: You are encouraged to think about this project with other students or to ask each other for help. If you do, you should: 1) write your own code (no copying code from others); 2) report the names of all people you worked with in your submission; 3) if you received help from someone, state that explicitly; 4) note that plagiarism of code or write-up will not be tolerated. The University takes poor academic practice and academic misconduct very
1 https://www.bitbybitbook.com/en/1st-ed/preface/
2 See here how to create a PDF from a .ipynb file: https://vle.exeter.ac.uk/mod/forum/discuss.php?d=194496
seriously and expects all students to behave in a manner which upholds the principles of academic honesty. Please make sure you familiarise yourself with the general guidelines and rules from this link3 and this link4.
Problem 1 [25 marks]: IMDB Movie Ratings Data Set
In this problem, you will use a data set called movie_metadata.csv (see folder data), provided by the Internet Movie Database (IMDB) website.5 IMDB allows users to rate every movie out there; it then uses a formula to calculate a score for each movie. The top 100 movies are at this link.6 This data set has information about 5043 movies, including movie director, IMDB score, genre, etc. Once you load the data, use df.columns to see the names of all columns. Column headings are self-explanatory.
Please follow the instructions below for your data analysis.
A. Load the movie_metadata.csv file in your notebook, and print the data set to screen. Print
the column headings.
B. Create a new data set 'limitDF' that contains only the following columns: 'imdb_score',
‘gross’, ‘budget’, ‘duration’, ‘director_facebook_likes’, and ‘num_voted_users’.
C. For this new data set, drop every row that contains at least one missing value.
D. Print the following stats for every column in 'limitDF': number of rows, minimum,
maximum, mean, standard deviation, median, first quartile, and third quartile.
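As a hint, most of these statistics come out of a single pandas call. A minimal sketch, using a small hypothetical frame standing in for limitDF (your real frame will have the columns listed in B):

```python
import pandas as pd

# Hypothetical stand-in for limitDF, just to show the mechanics
limitDF = pd.DataFrame({"imdb_score": [7.9, 6.5, 8.1, 5.0],
                        "duration": [178, 120, 95, 110]})

# describe() reports count, mean, std, min, quartiles (25%/50%/75%), and max
stats = limitDF.describe()
print(stats)

# Individual statistics are also available directly, e.g. the median per column
print(limitDF.median())
```

Note that describe() labels the median as "50%" and the first and third quartiles as "25%" and "75%".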
For the remaining questions, go back to the main data set.
E. Plot the histogram for the IMDB ratings ('imdb_score'), using 20 bins. Add illustrative labels to the x-axis and y-axis, and resize the plot and the axes' label font size to make the plot look visually acceptable.
F. Now let's try something more interesting. We will compare older movies to newer movies; in other words, we will check whether movie ratings have increased or decreased over the years. To answer this, we will plot the average IMDB rating over years ('title_year'). Before you do so, there's a little problem. Some early years had very few movies, so an average rating for those years won't be representative. We want to filter those years out before we start the analysis:
3 http://as.exeter.ac.uk/academic-policy-standards/tqa-manual/aph/managingacademicmisconduct/
4 https://vle.exeter.ac.uk/pluginfile.php/1794/course/section/27399/A%20Guide%20to%20Citing%2C%20Referencing%20and%20Avoiding%20Plagiarism%20V.2.0%202014.pdf
5 https://www.imdb.com/
6 https://www.imdb.com/search/title/?count=100&groups=top_1000&sort=user_rating
a. Count the number of movies per year
b. Filter out years that have strictly fewer than 15 movies
c. Calculate median IMDB rating per year
d. Plot it. It should look something like this (reproduce accurately).
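Steps a-c can be sketched as below; this is one possible pipeline, shown on a tiny hypothetical frame rather than the real data (note the real task uses a threshold of 15 movies, as in the toy threshold here):

```python
import pandas as pd

# Toy stand-in for the movies frame: 3 movies in 2000, 16 in 2001
df = pd.DataFrame({"title_year": [2000] * 3 + [2001] * 16,
                   "imdb_score": [7.0, 6.0, 8.0] + [6.5] * 16})

# a. count the number of movies per year
counts = df["title_year"].value_counts()

# b. keep only years with at least 15 movies
keep_years = counts[counts >= 15].index
filtered = df[df["title_year"].isin(keep_years)]

# c. median IMDB rating per year
medians = filtered.groupby("title_year")["imdb_score"].median()
print(medians)

# d. medians.plot() would then draw the line chart
```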
G. In this part, we will see which plot keywords are associated with the highest IMDB scores. There is a column named 'plot_keywords'. Unfortunately, the shape of the data in this column does not make it easy to visualise what we want. The problem is that each movie has multiple plot keywords, all recorded in the same data frame cell and separated by the '|' character. For example, the plot keywords cell for the first row contains:
avatar|future|marine|native|paraplegic
We need to do some pre-processing. Ideally, we want to create a new data set, keyword_score, that has two columns: keyword, and imdb_score. Each row will have only one keyword and the IMDB rating of the movie that has that keyword. For example, the IMDB rating for the movie in row 0 is 7.9. This means that each of the five keywords listed for this movie will have that score in the new data set. I.e., the first five rows are: (‘avatar’, 7.9), (‘future’, 7.9), (‘marine’, 7.9), (‘native’, 7.9), (‘paraplegic’, 7.9). Here is one way to do it (other ways are acceptable; with some deductions for less efficient methods):
a. Write a function that takes one row and returns a list of 2-tuples. For example, for row 0, it returns:
[('avatar', 7.9), ('future', 7.9), ('marine', 7.9), ('native', 7.9), ('paraplegic', 7.9)]
Test your function on row 0.
b. Use the method apply() on all rows to produce one list containing all 2-tuples
c. Convert the list of tuples into a data frame
d. Filter out keywords that were listed for strictly fewer than 50 movies
e. Calculate the mean IMDB rating per keyword, and sort
f. Plot it like this one:
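As noted above, other pipelines than a-f are acceptable. One compact alternative sketch uses str.split and DataFrame.explode, shown here on a hypothetical two-row frame:

```python
import pandas as pd

# Toy rows mimicking the 'plot_keywords' column layout
df = pd.DataFrame({
    "plot_keywords": ["avatar|future|marine", "future|war"],
    "imdb_score": [7.9, 6.5],
})

# Split each cell on '|' into a list, then explode to one keyword per row;
# the movie's imdb_score is repeated for each of its keywords
keyword_score = (df.assign(keyword=df["plot_keywords"].str.split("|"))
                   .explode("keyword")[["keyword", "imdb_score"]]
                   .reset_index(drop=True))
print(keyword_score)
```

From here, a groupby on "keyword" gives the per-keyword counts (for the filter in d) and means (for e).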
H. Use the data set to perform compelling extra analysis. You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough, but you can do more if they are all tied into one main idea. Please provide 1-2 sentences to explain your interesting analysis. Write it in a separate cell inside Jupyter (using Markdown).
Problem 2 [20 marks]: Premier League 2020-2021 Scores Data Set
The English Premier League (EPL) is the top level of the English football league system. Each year, 20 clubs play each other in what is called a season. Seasons run from August to May with each team playing 38 matches (playing all 19 other teams both home and away); home: the team hosts the match, away: the team plays at another team’s stadium. There is usually a “home advantage”; teams usually perform better at home. A match between two teams can end up in one of two ways:
a) Draw: both teams score the same number of goals. In this case, each team receives 1 point.
b) Win/Lose: one team scores more goals than the other. In this case, the winning team receives 3 points, while the losing team receives 0 points.
At the end of the season, the team with the most points wins the tournament. Other teams are ranked based on their total points. Ties are broken by goal difference (goals scored minus goals received), and then by goals scored.
In this problem, you will use a data set called Football_E0_2021.csv (see folder data), provided by the Football-data website.7 This website contains football data for different leagues. The data set in file Football_E0_2021.csv contains the results of the Premier League (PL) matches for this season (2020-2021), up until 21 March 2021. This data set has information about 290 PL matches, including team names, the number of goals scored by each, the number of shots, cards, etc. (see notes.txt for a description of all columns). Once you load the data, use df.columns.to_list() to see the names of all columns. For this problem, you will write a function that calculates the final Premier League table given the results (in the form of a data frame). Please follow the instructions below.
A. Load the Football_E0_2021.csv file in your notebook, and print the data set to screen. Print the column headings.
B. Create a list of the 20 teams, using the column HomeTeam. Call it PLTeams. Print it.
C. At the point of downloading this file, different teams had played different numbers of games. Using the columns HomeTeam and AwayTeam, calculate the number of
matches played by each team.
Note: you have the freedom to reproduce the table in J by following a pipeline that differs from all or some of the steps in D-J. You have to code up all of it (no cheating). A few points may be deducted for less efficient (computation-wise) pipelines.
D. Write a function calcScore() that takes one row as an input (call it row) and returns two values: the numbers of points for the Home team and the Away team. You will use
7 http://www.football-data.co.uk/
the scores from columns FTHG and FTAG. The rule for calculating points is as follows: if two teams score the same number of goals by the end of the game, each receives 1 point. Otherwise, the team that scores more goals receives 3 points, while the losing team receives 0 points. In addition to the row parameter, let the function input also include three parameters: w, l, and d, corresponding to win, lose, and draw points. The first line of your function will look like:
def calcScore(row, w=3, l=0, d=1):
Test your function on row 0.
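A minimal sketch of one possible implementation, tested on a hypothetical row (the real row 0 comes from your data frame; FTHG and FTAG are the full-time home and away goals, as above):

```python
import pandas as pd

def calcScore(row, w=3, l=0, d=1):
    # Returns (home points, away points) for one match
    if row["FTHG"] == row["FTAG"]:      # draw: both teams get d points
        return d, d
    elif row["FTHG"] > row["FTAG"]:     # home win
        return w, l
    else:                               # away win
        return l, w

# Hypothetical row standing in for row 0 of the data frame
row0 = pd.Series({"FTHG": 0, "FTAG": 3})
print(calcScore(row0))
```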
E. Using the function you wrote, and using the method apply(), create two columns HP
and AP that contain the number of points awarded for Home and Away teams
respectively. [Hint: check zip and lambda]
F. Write a function TeamPoints(), that takes as an input: a data frame, team name
(string), and whether the team is Home or Away (string or boolean). The function should use the two columns you created in E, and it should return the total number of points earned by this team for all Home games, or all Away games. First line should look something like:
def TeamPoints(df, team, H_A):
Test your function using Man City (you should get 38 and 33): print(TeamPoints(df,'Man City','home')) print(TeamPoints(df,'Man City','away'))
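One way to sketch TeamPoints, assuming the H_A argument is a string ('home' or 'away') and using a hypothetical two-match frame (the HP/AP columns are the ones created in E; the opponent name here is made up):

```python
import pandas as pd

# Hypothetical results frame with the point columns from part E
df = pd.DataFrame({
    "HomeTeam": ["Man City", "Leeds"],
    "AwayTeam": ["Leeds", "Man City"],
    "HP": [3, 0],   # home points per match
    "AP": [0, 3],   # away points per match
})

def TeamPoints(df, team, H_A):
    # Sum points from the relevant column depending on the team's role
    if H_A == "home":
        return df.loc[df["HomeTeam"] == team, "HP"].sum()
    return df.loc[df["AwayTeam"] == team, "AP"].sum()

print(TeamPoints(df, "Man City", "home"))
print(TeamPoints(df, "Man City", "away"))
```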
G. Write a function TeamsPointsDict(), that takes as an input: a data frame and a list of teams. The function should return a dictionary that has (key: value) pairs as (team_name: num_points), where team_name is each of the 20 PL teams, and num_points is the total number of points collected. Call your function using the list of teams you created in B.
H. Create a data frame from the dictionary you produced in G.
I. Order rows based on collected points (in a descending order), and print the data
frame. The first 10 rows of your results should look like this:
J. Using steps D-I, write a function GetTable() that takes as an input a data frame, and it returns a new data frame containing the final table of ranked teams and their collected points. Similar to D, your function input should also include three parameters: w, l, and d; corresponding to win, lose, and draw points (as in calcScore). The first line of your function will look like:
def GetTable(df, w=3, l=0, d=1):
K. In the past, teams were awarded 2 points for a win (rather than 3 points). The change to 3 points was made to incentivise teams to win. Let's see whether using 2 points for a win makes any difference to the final ranking. If you wrote your function correctly, you should be able to do this using:
GetTable(df, w=2)
L. Use the data set to perform compelling extra analysis. You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough, but you can do more if they are all tied into one main idea. Please provide 1-2 sentences to explain your interesting analysis. Write it in a separate cell inside Jupyter (using Markdown).
Problem 3 [25 marks]: COVID-19
In this problem, you will use a data set named owid-covid-data.csv (see folder data), from the "Our World in Data" website.8 The data covers COVID-19 cases in all countries on every day during the last year and this year, up until 23 March 2021. The website provides many visualisations related to COVID in different countries and over days.9 The data you will find in the data folder is downloaded from this link, which has more information about the data.10
Please follow the instructions below for your data analysis.
A. Load the owid-covid-data.csv file in your notebook, and print the data set to screen. Print
the column headings.
B. Using the column date, print the number of unique dates in the data.
C. Using the column iso_code, print the number of unique countries in the data.
D. From C, you will notice that there are too many countries. As it turns out, there are 11 values in iso_code that do not correspond to countries; they all start with the string "OWID_". Remove all rows that correspond to those values. Re-run the code from C; the count should be 11 fewer. Make sure you reset the index of your data frame.
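The filter in D can be sketched as below, shown on a hypothetical three-row frame (the real data has many more rows and columns; "OWID_WRL" is one of the aggregate codes):

```python
import pandas as pd

# Toy frame: two real countries plus one OWID aggregate row
df = pd.DataFrame({"iso_code": ["GBR", "OWID_WRL", "FRA"],
                   "total_cases": [100, 5000, 200]})

# Keep only rows whose iso_code does NOT start with "OWID_",
# then reset the index so it runs 0, 1, 2, ... again
df = df[~df["iso_code"].str.startswith("OWID_")].reset_index(drop=True)
print(df["iso_code"].nunique())
```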
E. Using the column 'date', find the earliest date in your data frame.
F. Using the column 'total_cases_per_million', create a new column 'total_cases_per_100'.
8 https://ourworldindata.org
9 See this for example about vaccination progress in different countries: https://ourworldindata.org/covid-vaccinations
10 See this for general info: https://github.com/owid/covid-19-data/tree/master/public/data and this for the codebook: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv
G. Using the column 'date', create a new column 'DaysSince1Jan20'.
H. As you may have noticed already, each country is represented by many rows (each row represents one day for one country). Some columns have a different value every day per country. These are the COVID-related columns (e.g., 'total_deaths_per_million', 'total_cases_per_million'); others have the same value per country repeated every day (e.g., 'diabetes_prevalence', 'female_smokers', 'male_smokers'). We want to create a new data frame 'df_country' that contains data at the level of each country (i.e., each row represents one country). The new data frame should contain one value for each country for the following columns (and only these columns): 'iso_code', 'female_smokers', 'male_smokers'. Make 'iso_code' the index of 'df_country'.
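Since the country-level columns repeat the same value on every date, one way to collapse to one row per country is a groupby followed by first(). A sketch on a hypothetical frame (the smoker values here are made up):

```python
import pandas as pd

# Toy daily rows: country-level columns repeat on every date
df = pd.DataFrame({
    "iso_code": ["GBR", "GBR", "FRA"],
    "date": ["2020-01-01", "2020-01-02", "2020-01-01"],
    "female_smokers": [20.0, 20.0, 30.1],
    "male_smokers": [24.7, 24.7, 35.6],
})

# One row per country: take the first (repeated) value of each column;
# groupby makes iso_code the index automatically
df_country = (df.groupby("iso_code")[["female_smokers", "male_smokers"]]
                .first())
print(df_country)
```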
I. Using 'df_country', reproduce the following plot: you will get marks for reproducing the plot as accurately as possible, taking into consideration the steps undertaken to reach the final figure.
J. Create a new data frame 'df_GBR' that contains data only for the UK (using iso_code == 'GBR'; it should have 418 rows).
K. In 'df_GBR', create a new column 'total_cases_per_million_smoothed', which contains the moving average of the column 'total_cases_per_million' with a window size of 5, i.e., the moving average at row x is calculated from rows x-2, x-1, x, x+1, and x+2. For x=1 and x=2, the moving average is calculated using three and four rows, respectively (and similarly for the last two rows).
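A centred rolling mean with a relaxed minimum window size matches this edge behaviour. A sketch on a hypothetical series standing in for 'total_cases_per_million':

```python
import pandas as pd

# Toy series standing in for 'total_cases_per_million'
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Centred window of 5; min_periods=3 lets the first and last rows
# average over the 3-4 values actually available instead of giving NaN
smoothed = s.rolling(window=5, center=True, min_periods=3).mean()
print(smoothed)
```

For example, the first entry averages the first three values, and the second entry averages the first four, matching the edge rule described in K.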
L. Reproduce the plot shown below, as accurately as possible.
The plot is self-explanatory. The upper panel contains both solid lines and dashed lines. The dashed lines are from the column 'new_cases_smoothed_per_million'. You will get marks for reproducing the plot as accurately as possible, taking into consideration the steps undertaken to reach the final figure.
M. Use the data set to perform compelling extra analysis. You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough, but you can do more if they are all tied into one main idea. Please provide 1-2 sentences to explain your interesting analysis. Write it in a separate cell inside Jupyter (using Markdown).
Problem 4 [30 marks]:
In this problem, you will demonstrate your understanding of the material covered in the module from the "Bit by Bit" book. Your answers should not exceed 1000 words in total. Write your answers in a separate PDF file. Consider the following research question and answer the questions below:
Does playing violent video games make people more violent?
A. Hypothesis: Before collecting any data, and using your intuition and prior knowledge, do you think the answer to the above question is:
a. Yes, it does
b. No, it makes people less violent
c. No, it has no effect
d. It depends
e. Impossible to tell
f. Other [elaborate]
B. Justifications: Lay down some reasons that support your answer to the previous question.
C. Constructs: what are some potential proxies for each of the following:
a. Playing violent video games (How often? How long?)
b. Being violent
D. Methodologically speaking, what is a good research design for answering the above
question? Provide a research plan that uses one or more of the four research designs
(Chapters 2-5) in the book.
E. What are some ethical issues you should be aware of when trying to answer this research question or when following the research design you outlined above? How would you deal with them?
F. You hear about a study that was just published online. The study used data from video game shops and Neighbourhood Watch in different neighbourhoods in London. It found that neighbourhoods where video game shops sold higher numbers of violent video games also experienced higher rates of reported crime incidents. The authors concluded that playing violent video games causes people to become more violent. Discuss the potential problems of this study in terms of validity (Section 4.4.1).