Problem 2 [20 marks]: Premier League 2020-2021 Scores Data Set
The English Premier League (EPL) is the top level of the English football league system. Each year, 20 clubs play each other in what is called a season. Seasons run from August to May, with each team playing 38 matches (every team plays the other 19 teams both home and away). Home means the team hosts the match; away means the team plays at the other team’s stadium. There is usually a “home advantage”: teams tend to perform better at home. A match between two teams can end in one of two ways:
a) Draw: both teams score the same number of goals. In this case, each team receives 1 point.
b) Win/Lose: one team scores more goals than the other. In this case, the winning team receives 3 points, while the losing team receives 0 points.
At the end of the season, the team with the most points wins the tournament. The other teams are ranked by their total points. Ties are broken by goal difference (goals scored minus goals conceded), and then by goals scored.
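The ranking rule above (points first, then goal difference, then goals scored) can be expressed as a sort with a compound key. The following is a minimal sketch in plain Python; the team names and totals are invented for illustration and are not taken from the data set.

```python
# Hypothetical season totals: (team, points, goals_for, goals_against).
# The figures are invented purely to illustrate the tie-break order.
standings = [
    ("Team A", 60, 50, 30),
    ("Team B", 60, 55, 35),
    ("Team C", 62, 40, 38),
    ("Team D", 60, 48, 30),
]

def rank_key(entry):
    team, points, goals_for, goals_against = entry
    goal_difference = goals_for - goals_against
    # Negate each value so that larger totals sort first.
    return (-points, -goal_difference, -goals_for)

ranked = sorted(standings, key=rank_key)
for position, (team, points, gf, ga) in enumerate(ranked, start=1):
    print(position, team, points, gf - ga, gf)
```

Team A and Team B are level on points and goal difference here, so goals scored decides between them.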
In this problem, you will use a data set called Football_E0_2021.csv (see folder data), provided by the Football-data website [7]. This website contains football data for different leagues. The file Football_E0_2021.csv contains the results of the Premier League (PL) matches for this season (2020-2021), up until 21 March 2021. The data set has information about 290 PL matches, including team names, number of goals scored by each team, number of shots, cards, etc. (see notes.txt for a description of all columns). Once you load the data, use df.columns.to_list() to see the names of all columns. For this problem, you will write a function that calculates the final Premier League table given the results (in the form of a data frame). Please follow the instructions below.
A. Load the Football_E0_2021.csv file in your notebook, and print the data set to screen. Print the column headings.
B. Create a list of the 20 teams, using the column HomeTeam. Call it PLTeams. Print it.
C. At the point of downloading this file, different teams had played different numbers of games. Using the columns HomeTeam and AwayTeam, calculate the number of matches played by each team.
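One possible way to approach step C is to count how often each team appears in HomeTeam and in AwayTeam and add the two counts. The sketch below uses a small invented results table rather than the real file, so the team list and numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy results table; the real file has one row per match with the same two columns.
df = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Arsenal", "Leeds"],
    "AwayTeam": ["Chelsea", "Leeds", "Leeds", "Arsenal"],
})

# Count appearances as home side and as away side, then add them.
# fill_value=0 covers a team that has only home or only away games so far.
home_counts = df["HomeTeam"].value_counts()
away_counts = df["AwayTeam"].value_counts()
matches_played = home_counts.add(away_counts, fill_value=0).astype(int).sort_index()
print(matches_played)
```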
Note: you are free to reproduce the table in J using a pipeline that differs from some or all of the steps in D-J. You have to code all of it yourself (no cheating). A few points may be deducted for computationally less efficient pipelines.
[7] http://www.football-data.co.uk/

D. Write a function calcScore() that takes one row as input (call it row) and returns two values: the number of points for the Home team and for the Away team. Use the scores from columns FTHG and FTAG. The rules for calculating points are as follows: if the two teams have scored the same number of goals by the end of the game, each receives 1 point. Otherwise, the team that scores more goals receives 3 points, while the losing team receives 0 points. In addition to the row parameter, the function should take three further parameters, w, l, and d, corresponding to the points for a win, a loss, and a draw. The first line of your function will look like:
def calcScore(row, w=3, l=0, d=1):
Test your function on row 0.
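As a rough illustration of the point rules in step D, here is one way such a function might look. It is a sketch, not the required solution: it assumes FTHG holds the home goals and FTAG the away goals, and it is tested on made-up scorelines passed as plain dicts (a pandas row can be indexed the same way).

```python
def calcScore(row, w=3, l=0, d=1):
    """Return (home_points, away_points) for one match.

    `row` is anything indexable by column name, for example a pandas Series
    or a plain dict with keys 'FTHG' and 'FTAG'.
    """
    home_goals, away_goals = row["FTHG"], row["FTAG"]
    if home_goals == away_goals:
        return d, d
    if home_goals > away_goals:
        return w, l
    return l, w

# Made-up scorelines, not rows from the real file.
print(calcScore({"FTHG": 2, "FTAG": 0}))       # home win
print(calcScore({"FTHG": 1, "FTAG": 1}))       # draw
print(calcScore({"FTHG": 0, "FTAG": 3}, w=2))  # away win with 2 points per win
```

Keeping w, l, and d as keyword parameters means the scoring scheme can be changed at the call site without touching the function body.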
E. Using the function you wrote and the method apply(), create two columns HP and AP that contain the number of points awarded to the Home and Away teams respectively. [Hint: check zip and lambda]
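The hint about zip and lambda can be read as follows: apply() with axis=1 returns one (home, away) tuple per row, and zip(*...) splits those tuples into two sequences. The sketch below shows this on invented data; the compact calcScore is repeated only so the block runs on its own.

```python
import pandas as pd

def calcScore(row, w=3, l=0, d=1):
    # Compact version of the function from step D, repeated so the block runs on its own.
    if row["FTHG"] == row["FTAG"]:
        return d, d
    return (w, l) if row["FTHG"] > row["FTAG"] else (l, w)

# Toy data with invented scores.
df = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Leeds"],
    "AwayTeam": ["Chelsea", "Leeds", "Arsenal"],
    "FTHG": [2, 1, 0],
    "FTAG": [0, 1, 3],
})

# apply(axis=1) yields a Series of (home, away) tuples; zip(*...) transposes it
# into one sequence of home points and one sequence of away points.
df["HP"], df["AP"] = zip(*df.apply(lambda row: calcScore(row), axis=1))
print(df)
```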
F. Write a function TeamPoints() that takes as input a data frame, a team name (string), and whether the team is Home or Away (string or boolean). The function should use the two columns you created in E and return the total number of points earned by this team in all of its Home games or all of its Away games. The first line should look something like:
def TeamPoints(df, team, H_A):
Test your function using Man City (you should get 38 and 33):
print(TeamPoints(df, 'Man City', 'home'))
print(TeamPoints(df, 'Man City', 'away'))
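A minimal sketch of step F, assuming H_A is passed as the string 'home' or 'away' and that the HP and AP columns already exist. The data frame is invented, so the totals printed here are not the 38 and 33 expected from the real file.

```python
import pandas as pd

def TeamPoints(df, team, H_A):
    """Total points for `team` in its home games (H_A='home') or away games (H_A='away')."""
    if H_A == "home":
        return int(df.loc[df["HomeTeam"] == team, "HP"].sum())
    if H_A == "away":
        return int(df.loc[df["AwayTeam"] == team, "AP"].sum())
    raise ValueError("H_A must be 'home' or 'away'")

# Toy results with precomputed points columns (invented values).
df = pd.DataFrame({
    "HomeTeam": ["Man City", "Leeds", "Man City", "Chelsea"],
    "AwayTeam": ["Leeds", "Man City", "Chelsea", "Man City"],
    "HP": [3, 1, 0, 0],
    "AP": [0, 1, 3, 3],
})

print(TeamPoints(df, "Man City", "home"))
print(TeamPoints(df, "Man City", "away"))
```

Raising an error for an unrecognised H_A value is a design choice of this sketch; it makes a typo such as ' Man City' or 'Home' easier to spot than a silent zero.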
G. Write a function TeamsPointsDict(), that takes as an input: a data frame and a list of teams. The function should return a dictionary that has (key: value) pairs as (team_name: num_points), where team_name is each of the 20 PL teams, and num_points is the total number of points collected. Call your function using the list of teams you created in B.
H. Create a data frame from the dictionary you produced in G.
I. Order the rows by collected points (in descending order), and print the data frame. The first 10 rows of your results should look like this:

J. Using steps D-I, write a function GetTable() that takes as an input a data frame, and it returns a new data frame containing the final table of ranked teams and their collected points. Similar to D, your function input should also include three parameters: w, l, and d; corresponding to win, lose, and draw points (as in calcScore). The first line of your function will look like:
def GetTable(df, w=3, l=0, d=1):
K. In the past, teams were awarded 2 points for winning (rather than 3 points). The rule was changed to incentivise teams to win. Let’s see whether using 2 points for a win makes any difference to the final ranking. If you wrote your function correctly, you should be able to do this using:
GetTable(df, w=2)
L. Use the data set to perform compelling extra analysis. You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough, but you can do more if they are all tied into one main idea. Please provide 1-2 sentences to explain your interesting analysis. Write it in a separate cell inside Jupyter (using Markdown).
Problem 3 [25 marks]: COVID-19
In this problem, you will use a data set named owid-covid-data.csv (see folder data) from the “Our World in Data” website [8]. The data cover daily COVID-19 cases in all countries from last year up to 23 March 2021. The website provides many visualisations related to COVID in different countries and over time [9]. The data in the data folder was downloaded from the repository listed in footnote [10], which has more information about the data.
Please follow the instructions below for your data analysis.
A. Load the owid-covid-data.csv file in your notebook, and print the data set to screen. Print the column headings.
B. Using the column date, print the number of unique dates in the data.
C. Using the column iso_code, print the number of unique countries in the data.
D. From C, you will notice that there are too many countries. As it turns out, there are 11 values in iso_code that do not correspond to countries. They all start with the string “OWID_”. Let’s remove all rows that correspond to those values. Re-run the code from C; the count should be 11 fewer. Make sure you reset the index of your data frame.
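The filtering described in step D can be sketched with a string test on iso_code. The rows below are invented, and the two OWID_ codes are used only as examples of the prefix; this is an illustration under those assumptions rather than the required solution.

```python
import pandas as pd

# Toy frame mixing country codes with aggregate codes that start with "OWID_".
df = pd.DataFrame({
    "iso_code": ["GBR", "OWID_WRL", "FRA", "OWID_EUR", "GBR"],
    "total_cases": [10, 100, 20, 50, 12],
})
print(df["iso_code"].nunique())  # count before filtering, aggregates included

# na=False treats a missing iso_code as "not an aggregate" instead of propagating NaN.
is_aggregate = df["iso_code"].str.startswith("OWID_", na=False)

# Keep the country rows and renumber the index from 0.
df = df[~is_aggregate].reset_index(drop=True)
print(df["iso_code"].nunique())  # count after filtering
```

Passing drop=True to reset_index() discards the old index instead of keeping it as an extra column.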
E. Using the column ‘date’, find the earliest date in your data frame.
F. Using the column ‘total_cases_per_million’, create a new column ‘total_cases_per_100’.
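For step F, the conversion is a change of denominator: one million divided by 100 is 10,000, so cases per 100 people equal cases per million divided by 10,000. A short sketch on invented values:

```python
import pandas as pd

# Invented per-million figures, used only to show the unit conversion.
df = pd.DataFrame({"total_cases_per_million": [0.0, 25000.0, 63000.0]})

# 1,000,000 / 100 = 10,000, so dividing by 10,000 rescales "per million" to "per 100".
df["total_cases_per_100"] = df["total_cases_per_million"] / 10_000
print(df)
```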
[8] https://ourworldindata.org
[9] See this example about vaccination progress in different countries: https://ourworldindata.org/covid-vaccinations
[10] For general info: https://github.com/owid/covid-19-data/tree/master/public/data and for the codebook: https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv

G. Using the column ‘date’, create a new column ‘DaysSince1Jan20’.
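One way to read step G is as a date subtraction. The sketch below assumes the ‘date’ column holds strings in YYYY-MM-DD form and that 1 January 2020 itself counts as day 0; both are assumptions, and the dates are chosen for illustration.

```python
import pandas as pd

# Illustrative dates; the format YYYY-MM-DD is an assumption about the file.
df = pd.DataFrame({"date": ["2020-01-01", "2020-01-31", "2020-03-01", "2021-03-23"]})

# Parse the strings, subtract the reference date, and keep whole days.
dates = pd.to_datetime(df["date"])
df["DaysSince1Jan20"] = (dates - pd.Timestamp("2020-01-01")).dt.days
print(df)
```

Because 2020 is a leap year, 1 March 2020 is day 60 rather than day 59 under this convention.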
H. As you may have noticed already, each country is represented by many rows (each row represents one day for one country). Some columns have a different value every day per country. These are the COVID-related columns (e.g., ‘total_deaths_per_million’, ‘total_cases_per_million’). Others have the same value per country repeated every day (e.g., ‘diabetes_prevalence’, ‘female_smokers’, ‘male_smokers’). We want to create a new data frame ‘df_country’ that contains data at the level of each country (i.e., each row represents one country). The new data frame should contain one value per country for the following columns, and only these columns: ‘iso_code’, ‘female_smokers’, ‘male_smokers’. Make ‘iso_code’ the index of ‘df_country’.
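Collapsing the daily rows to one row per country, as step H asks, could be done by grouping on iso_code and taking the first available value of each constant column. The sketch below uses invented smoking figures and is only one of several possible approaches (dropping duplicates and then setting the index would be another).

```python
import pandas as pd

# Toy daily data: two days for each of two countries (all values invented).
df = pd.DataFrame({
    "iso_code": ["GBR", "GBR", "FRA", "FRA"],
    "date": ["2021-03-22", "2021-03-23", "2021-03-22", "2021-03-23"],
    "total_cases_per_million": [100.0, 110.0, 90.0, 95.0],
    "female_smokers": [20.0, 20.0, 30.1, 30.1],
    "male_smokers": [24.7, 24.7, 35.6, 35.6],
})

# groupby makes iso_code the index; first() keeps the first non-missing value per group.
df_country = df.groupby("iso_code")[["female_smokers", "male_smokers"]].first()
print(df_country)
```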
I. Using ‘df_country’, reproduce the following plot. You will get marks for reproducing the plot as accurately as possible, taking into consideration the steps undertaken to reach the final figure.
J. Create a new data frame ‘df_GBR’ that contains data only for the UK (using iso_code = GBR; it should have 418 rows).
K. In ‘df_GBR’, create a new column ‘total_cases_per_million_smoothed’, which contains the moving average of the column ‘total_cases_per_million’ with a window size of 5, i.e., the moving average of row x is calculated from rows x-2, x-1, x, x+1, and x+2. For x=1 and x=2, the moving average is calculated using three and four rows, respectively (the same applies to the last two rows).
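The centred moving average in step K, including the shorter windows at both ends, matches what pandas computes with rolling(window=5, center=True, min_periods=1). The sketch below checks this on a short invented series rather than the real UK data.

```python
import pandas as pd

# Short invented series standing in for the UK column.
df_GBR = pd.DataFrame({"total_cases_per_million": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# center=True uses rows x-2..x+2; min_periods=1 lets the edge rows average
# over the three or four rows that actually exist.
df_GBR["total_cases_per_million_smoothed"] = (
    df_GBR["total_cases_per_million"]
    .rolling(window=5, center=True, min_periods=1)
    .mean()
)
print(df_GBR)
```

In this toy series the first row averages 1, 2, and 3, and the second row averages 1, 2, 3, and 4, which is the edge behaviour the step describes.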
L. Reproduce the plot shown below, as accurately as possible.

The plot is self-explanatory. The upper panel contains both solid lines and dashed lines. The dashed lines are from the column ‘new_cases_smoothed_per_million’. You will get marks for reproducing the plot as accurately as possible, taking into consideration the steps undertaken to reach the final figure.
M. Use the data set to perform compelling extra analysis. You will get marks if you create a compelling and interesting visualisation and/or analysis. One plot (or one piece of analysis) is enough, but you can do more if they are all tied into one main idea. Please provide 1-2 sentences to explain your interesting analysis. Write it in a separate cell inside Jupyter (using Markdown).