Programming Exercises for Descriptive Statistics and Probability
__Do not use any external libraries to solve these questions. Do not use any built-in R functions that calculate probability or entropy directly for this part; R should help only with data frame manipulation. Report every probability to 2 decimal places; this is a requirement, not a request. Follow coding standards: comment your code adequately and avoid duplicated code (write helper functions where needed). The coding standards are those from your previous programming unit. Since a programming unit is a prerequisite for this one, not knowing the coding standards will not be accepted as an excuse. Show all working, including code and presentation, for each question.__
Sports analytics (the application of data science techniques to competitive sports) is a rapidly growing area of data science. In this question we look at some very basic analytics applied to the outcomes of consecutive English Premier League (EPL) games. The file EPL.csv records the outcomes of EPL games played in the seasons from 2006/2007 to 2017/2018. The data is sequential: each row records one match and whether the home team won (H), the away team won (A), or the match was a draw (D).
There are 20 teams per season; each team plays 38 games in total, 19 at home and 19 away.
The file holds the results of 4560 Premier League matches: 380 matches per season over the 12 seasons from 2006/2007 to 2017/2018.
In [ ]:
# Read the data
EPL <- read.csv("EPL.csv")
In [ ]:
# Get the summary
summary(EPL)
             home_team                 away_team       home_goals
 Arsenal          : 228   Arsenal          : 228   Min.   :0.000
 Chelsea          : 228   Chelsea          : 228   1st Qu.:1.000
 Everton          : 228   Everton          : 228   Median :1.000
 Liverpool        : 228   Liverpool        : 228   Mean   :1.543
 Manchester City  : 228   Manchester City  : 228   3rd Qu.:2.000
 Manchester United: 228   Manchester United: 228   Max.   :9.000
 (Other)          :3192   (Other)          :3192
   away_goals     result         season
 Min.   :0.000   A:1288   2006-2007: 380
 1st Qu.:0.000   D:1164   2007-2008: 380
 Median :1.000   H:2108   2008-2009: 380
 Mean   :1.144            2009-2010: 380
 3rd Qu.:2.000            2010-2011: 380
 Max.   :7.000            2011-2012: 380
                          (Other)  :2280
Look at entry number 3: the home team is Everton and the away team is Watford. The result is "H", indicating that Everton won the match; comparing home_goals (2) with away_goals (1) confirms this.
In [ ]:
# Inspect the data
head(EPL)
| home_team | away_team | home_goals | away_goals | result | season |
|---|---|---|---|---|---|
| Sheffield United | Liverpool | 1 | 1 | D | 2006-2007 |
| Arsenal | Aston Villa | 1 | 1 | D | 2006-2007 |
| Everton | Watford | 2 | 1 | H | 2006-2007 |
| Newcastle United | Wigan Athletic | 2 | 1 | H | 2006-2007 |
| Portsmouth | Blackburn Rovers | 3 | 0 | H | 2006-2007 |
| Reading | Middlesbrough | 3 | 2 | H | 2006-2007 |
Part 1: Statistics for the whole dataset
Part 1.1. (2 Marks)
Which season has the highest number of goals?
In [ ]:
highest <- function(dat = EPL){
# Name of the highest season
seasonH <- NULL
## Solution:
# Printing out the max season
cat("Highest goal scoring season is: ", seasonH)
}
highest()
Highest goal scoring season is:
Part 1.2. (2 Marks)
Which season has the lowest number of goals?
In [ ]:
lowest <- function(dat = EPL){
# Name of lowest season
seasonL <- NULL
## Solution
# Printing out the min season
cat("Lowest goal scoring season is: ", seasonL)
}
lowest()
Lowest goal scoring season is:
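The season totals asked for in Parts 1.1 and 1.2 can be sketched with base R data frame operations only. The data frame `toy` below is hypothetical: it only mimics the column names of EPL.csv, and the variable names are illustrative. The same steps would apply to `EPL`.

```r
# Hypothetical toy data with the same column names as EPL.csv
toy <- data.frame(
  home_goals = c(1, 2, 0, 3),
  away_goals = c(1, 0, 0, 2),
  season     = c("2006-2007", "2006-2007", "2007-2008", "2007-2008"),
  stringsAsFactors = FALSE
)

# Goals in a match = home goals + away goals
toy$total_goals <- toy$home_goals + toy$away_goals

# Sum the match totals within each season
season_goals <- aggregate(total_goals ~ season, data = toy, FUN = sum)

# Seasons with the largest and smallest totals (first one wins on ties)
season_high <- season_goals$season[which.max(season_goals$total_goals)]
season_low  <- season_goals$season[which.min(season_goals$total_goals)]

cat("Highest:", as.character(season_high), "\n")
cat("Lowest: ", as.character(season_low), "\n")
```

Note that `which.max` and `which.min` return only the first match, so tied seasons would need an explicit comparison against the maximum or minimum.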
Part 1.3. (2 Marks)
Which team has the highest average number of goals scored per season, $\frac{\text{total goals scored}}{\text{seasons played}}$?
In [ ]:
# Define a function to calculate the goals scored by each team
team.goals <- function(dat = EPL){
return()
}
In [ ]:
team.highest <- function(dat = EPL){
# name of the team:
teamN <- NULL
goals.ratio <- 0
## Solution
# Printing out the team:
cat("The team with the highest goal score is: ", teamN, "the goals ratio is", round(goals.ratio,2))
}
team.highest()
The team with the highest goal score is: the goals ratio is 0
Part 1.4. (2 Marks)
Which team concedes the highest average number of goals per season, $\frac{\text{total goals conceded}}{\text{seasons played}}$?
In [ ]:
team.lowest <- function(dat = EPL){
# name of the team:
teamN <- NULL
goals.ratio <- 0
## Solution
# Printing out the team:
cat("The team that concedes the most goals per season is: ", teamN, "the goals ratio is", round(goals.ratio,2))
}
team.lowest()
The team that concedes the most goals per season is: the goals ratio is 0
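One possible way to get the per-season ratios in Parts 1.3 and 1.4 is to stack each match twice, once from the home side's view and once from the away side's, so that every team has one row per game. The sketch below uses a hypothetical `toy` data frame and assumes "seasons played" means the number of distinct seasons in which a team appears; all names are illustrative.

```r
# Hypothetical toy data with the same column names as EPL.csv
toy <- data.frame(
  home_team  = c("a", "b", "a", "c"),
  away_team  = c("b", "a", "c", "a"),
  home_goals = c(2, 1, 0, 3),
  away_goals = c(1, 1, 0, 2),
  season     = c("2006-2007", "2006-2007", "2007-2008", "2007-2008"),
  stringsAsFactors = FALSE
)

# One row per team per match: home rows first, then away rows
long <- data.frame(
  team     = c(toy$home_team, toy$away_team),
  scored   = c(toy$home_goals, toy$away_goals),
  conceded = c(toy$away_goals, toy$home_goals),
  season   = c(toy$season, toy$season),
  stringsAsFactors = FALSE
)

# Totals per team and number of distinct seasons each team played
scored         <- tapply(long$scored,   long$team, sum)
conceded       <- tapply(long$conceded, long$team, sum)
seasons_played <- tapply(long$season,   long$team,
                         function(s) length(unique(s)))

# Ratios rounded to 2 decimal places
scored_per_season   <- round(scored / seasons_played, 2)
conceded_per_season <- round(conceded / seasons_played, 2)

top_scorer   <- names(which.max(scored_per_season))
top_conceder <- names(which.max(conceded_per_season))
```

Because the three `tapply` calls group by the same `team` column, their results share the same ordering and can be divided element by element.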
Part 2: Statistics for the individual team (Manchester United) (27 Marks)
In [ ]:
# Convert all team names to lower case to simplify the later tasks
EPL$home_team <- tolower(EPL$home_team)
EPL$away_team <- tolower(EPL$away_team)
In [ ]:
head(EPL)
| home_team | away_team | home_goals | away_goals | result | season |
|---|---|---|---|---|---|
| sheffield united | liverpool | 1 | 1 | D | 2006-2007 |
| arsenal | aston villa | 1 | 1 | D | 2006-2007 |
| everton | watford | 2 | 1 | H | 2006-2007 |
| newcastle united | wigan athletic | 2 | 1 | H | 2006-2007 |
| portsmouth | blackburn rovers | 3 | 0 | H | 2006-2007 |
| reading | middlesbrough | 3 | 2 | H | 2006-2007 |
Part 2.1. (2 Marks)
Find out the probabilities P(Manchester United Wins), P(Manchester United Loses), and P(Manchester United Draws). This includes all the results both home and away.
In [ ]:
Task2.1 <- function(team = "manchester united", dat = EPL){
# Set up the variables
P.W <- NULL
P.L <- NULL
P.D <- NULL
## Solution
# Print out the results:
cat("The probability that", team, "wins is: ", P.W, '\n')
cat("The probability that", team, "loses is: ", P.L, '\n')
cat("The probability that", team, "draws is: ", P.D, '\n')
}
Task2.1()
The probability that manchester united wins is:
The probability that manchester united loses is:
The probability that manchester united draws is:
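The probabilities in Part 2.1 are relative frequencies: the number of games with a given outcome divided by the number of games the team played. Because `result` is recorded from the home side's point of view, it first has to be translated into the team's own outcome. A minimal sketch on a hypothetical `toy` data frame (team `"x"` and all variable names are illustrative):

```r
# Hypothetical toy data with the same column names as EPL.csv
toy <- data.frame(
  home_team = c("x", "y", "x", "z", "y"),
  away_team = c("y", "x", "z", "x", "z"),
  result    = c("H", "D", "A", "A", "H"),
  stringsAsFactors = FALSE
)
team <- "x"

# Keep only the matches the team played in, home or away
games <- toy[toy$home_team == team | toy$away_team == team, ]

# Translate the match result into the team's own outcome (W, L or D)
at_home <- games$home_team == team
outcome <- ifelse(games$result == "D", "D",
           ifelse((at_home & games$result == "H") |
                  (!at_home & games$result == "A"), "W", "L"))

# Relative frequencies, rounded to 2 decimal places
n   <- nrow(games)
p_w <- round(sum(outcome == "W") / n, 2)
p_l <- round(sum(outcome == "L") / n, 2)
p_d <- round(sum(outcome == "D") / n, 2)
```

The three probabilities should sum to 1 (up to rounding), which is a quick sanity check on the outcome translation.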
Part 2.2. (2 Marks)
Find out the conditional probabilities:
1. P(Man Utd Wins| Playing at Home)
2. P(Man Utd Wins| Playing away)
3. P(Man Utd Draws| Playing at Home)
4. P(Man Utd Draws| Playing away)
5. P(Man Utd Loses| Playing at Home)
6. P(Man Utd Loses| Playing away)
Compare the results and draw a general conclusion.
In [ ]:
Task2.2 <- function(team = "manchester united", dat = EPL){
# Set up the variables
P.W.H <- NULL
P.W.A <- NULL
P.D.H <- NULL
P.D.A <- NULL
P.L.H <- NULL
P.L.A <- NULL
# Solution
# Print out the results
cat("P(", team, "Wins|Playing at Home) = ", P.W.H, '\n')
cat("P(", team, "Wins|Playing Away) = ", P.W.A, '\n')
cat("P(", team, "Draws|Playing at Home) = ", P.D.H,'\n')
cat("P(", team, "Draws|Playing Away) = ", P.D.A, '\n')
cat("P(", team, "Loses|Playing at Home) = ", P.L.H,'\n')
cat("P(", team, "Loses|Playing Away) = ", P.L.A,'\n')
}
Task2.2()
P( manchester united Wins|Playing at Home) =
P( manchester united Wins|Playing Away) =
P( manchester united Draws|Playing at Home) =
P( manchester united Draws|Playing Away) =
P( manchester united Loses|Playing at Home) =
P( manchester united Loses|Playing Away) =
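Each conditional probability in Part 2.2 restricts the sample to the conditioning event first, for example $P(\text{Win} \mid \text{Home}) = \frac{n(\text{Win and Home})}{n(\text{Home})}$. The sketch below illustrates this on a hypothetical `toy` data frame; the helper `prop` and the other names are illustrative, not part of the assignment frame.

```r
# Hypothetical toy data with the same column names as EPL.csv
toy <- data.frame(
  home_team = c("x", "y", "x", "z", "x"),
  away_team = c("y", "x", "z", "x", "y"),
  result    = c("H", "D", "A", "A", "H"),
  stringsAsFactors = FALSE
)
team <- "x"

# Condition first: split the team's games by venue
home <- toy[toy$home_team == team, ]
away <- toy[toy$away_team == team, ]

# Small helper to avoid repeating the division and rounding
prop <- function(hits, n) round(hits / n, 2)

# At home the team wins on "H"; away it wins on "A"
p_w_home <- prop(sum(home$result == "H"), nrow(home))
p_d_home <- prop(sum(home$result == "D"), nrow(home))
p_l_home <- prop(sum(home$result == "A"), nrow(home))

p_w_away <- prop(sum(away$result == "A"), nrow(away))
p_d_away <- prop(sum(away$result == "D"), nrow(away))
p_l_away <- prop(sum(away$result == "H"), nrow(away))
```

Within each venue the three probabilities should again sum to 1 up to rounding.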
Part 2.3. (3 Marks)
What is the probability that Man Utd will win a game given that they won their previous game?
In [ ]:
# Optional helper (a suggestion, not a requirement): count the joint win/loss outcomes of consecutive games for a team
count.win <- function(team, dat){
# Counters for the four pairs of consecutive outcomes
ww <- 0
wl <- 0
lw <- 0
ll <- 0
## Solution: update the counters here
return(c(ww, wl, lw, ll))
}
In [ ]:
Task2.3 <- function(team = "manchester united", dat = EPL){
# Set up the variable
P.W.W <- NULL
## Solution
# Return the result
return(P.W.W)
}
cat("P(manchester united Wins| winning the previous game) = ", Task2.3(), '\n')
P(manchester united Wins| winning the previous game) =
Part 2.4. (3 Marks)
What is the probability that Man Utd will win a game given that they didn't win their previous game?
In [ ]:
Task2.4 <- function(team = "manchester united", dat = EPL){
# Set up the variable
P.L.W <- NULL
## Solution
# Return the result
return(P.L.W)
}
cat("P(manchester united Wins| not winning the previous game) = ", Task2.4(), '\n')
P(manchester united Wins| not winning the previous game) =
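For Parts 2.3 and 2.4, one approach is to line up each game with the game before it and count pairs. The sketch below starts from a hypothetical vector `outcome` holding one team's results in chronological order (which relies on the rows of the data being sequential). It counts every consecutive pair, including pairs that span two seasons; whether such pairs should count is an assumption to settle for the real data.

```r
# Hypothetical sequence of one team's outcomes in chronological order
outcome <- c("W", "W", "L", "W", "D", "W", "W")

won      <- outcome == "W"
previous <- won[-length(won)]  # games 1 .. n-1
current  <- won[-1]            # games 2 .. n

# P(win | won previous game) = #(win followed by win) / #(previous wins)
p_w_given_w <- round(sum(previous & current) / sum(previous), 2)

# P(win | did not win previous game)
p_w_given_nw <- round(sum(!previous & current) / sum(!previous), 2)
```

The last game has no successor and the first has no predecessor, which is why both shifted vectors have length n - 1.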
Part 2.5. (3 Marks)
Calculate the probability of Man Utd not winning their next two games given that they won their previous game.
In [ ]:
Task2.5 <- function(team = "manchester united", dat = EPL){
# Set up the variable
P.W.L.L <- NULL
## Solution
# Return
return(P.W.L.L)
}
cat("P(not winning their next two games | winning the previous game) = ", Task2.5(), '\n')
P(not winning their next two games | winning the previous game) =
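The same shifting idea extends to runs of three games for Part 2.5: count the triples "win, no win, no win" and divide by the number of wins that have two following games. This is a sketch on a hypothetical `outcome` vector, again assuming chronological order and ignoring season boundaries.

```r
# Hypothetical sequence of one team's outcomes in chronological order
outcome <- c("W", "L", "D", "W", "W", "L", "L", "W")

won <- outcome == "W"
n   <- length(won)

# Three aligned views of the sequence: game i, i+1 and i+2
first  <- won[1:(n - 2)]
second <- won[2:(n - 1)]
third  <- won[3:n]

# P(no win in next two games | won previous game)
p_nn_given_w <- round(sum(first & !second & !third) / sum(first), 2)
```

Only wins with two subsequent games are used as the conditioning set here, so a win in the final two games of the sequence is excluded from the denominator.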
Part 2.6. (3 Marks)
A win is worth three points, a draw one point and a loss zero points. In which season did Man Utd receive their highest points tally?
In [ ]:
# optional support function to create points gained by a team
agg.support <- function(team, dat){
return()
}
In [ ]:
Task2.6 <- function(team = "manchester united", dat = EPL){
# Set up the variable
S.H <- NULL
H.P <- NULL
## Solution
# Return the result
cat("The season that", team, "achieved the highest points is", S.H, " in which they achieved", H.P, "points")
return(S.H)
}
Task2.6()
The season that manchester united achieved the highest points is in which they achieved points
NULL
Part 2.7. (3 Marks)
A win is worth three points, a draw one point and a loss zero points. In which season did Man Utd receive their lowest points tally?
In [ ]:
Task2.7 <- function(team = "manchester united", dat = EPL){
# Set up the variable
S.L <- NULL
L.P <- NULL
## Solution
# Return the result
cat("The season that", team, "achieved the lowest points is", S.L, " in which they achieved", L.P, "points")
return(S.L)
}
Task2.7()
The season that manchester united achieved the lowest points is in which they achieved points
NULL
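The points tallies in Parts 2.6 and 2.7 can be sketched by scoring each of the team's games (3 for a win, 1 for a draw, 0 for a loss) and summing within each season. The `toy` data frame and variable names below are hypothetical.

```r
# Hypothetical toy data with the same column names as EPL.csv
toy <- data.frame(
  home_team = c("x", "y", "x", "z", "x", "y"),
  away_team = c("y", "x", "z", "x", "y", "x"),
  result    = c("H", "D", "A", "A", "D", "A"),
  season    = c("2006-2007", "2006-2007", "2006-2007",
                "2007-2008", "2007-2008", "2007-2008"),
  stringsAsFactors = FALSE
)
team <- "x"

# The team's games and whether each was won or drawn from its own side
games   <- toy[toy$home_team == team | toy$away_team == team, ]
at_home <- games$home_team == team
won     <- (at_home & games$result == "H") | (!at_home & games$result == "A")
drawn   <- games$result == "D"

# 3 points for a win, 1 for a draw, 0 otherwise
games$points <- 3 * won + 1 * drawn

# Points tally per season, then the best and worst seasons
season_points <- tapply(games$points, games$season, sum)
best_season   <- names(which.max(season_points))
worst_season  <- names(which.min(season_points))
```

A helper that returns `season_points` for any team would serve both parts without duplicating code.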
Part 2.8. (3 Marks)
Print the same statistics from Task 1 to Task 5, but only for the two seasons in which they achieved the highest and lowest points tallies (also print the name of each season).
In [ ]:
# Optional support function to print the result based on the task
print.tasks <- function(team, dat){
}
In [ ]:
Task2.8 <-function(team = "manchester united", dat = EPL){
S.H <- NULL
cat('\n')
S.L <- NULL
cat('\n')
## Solution
# For Highest season
cat('\n')
cat("The statistics for highest points season for", team,",", S.H,",are as follows:")
cat('\n')
cat("The statistics for lowest points season for", team,",", S.L,",are as follows:")
}
Task2.8()
The statistics for highest points season for manchester united , ,are as follows:
The statistics for lowest points season for manchester united , ,are as follows:
Part 2.9. (3 Marks)
Write a function that takes a season argument such as "2006-2007" and prints the results from Task 1 to Task 5 as well as the total points received in that season. Make sure that your function can handle errors.
In [ ]:
# Please write your own function here, we will provide you the frame but that is all
Task2.9 <- function(team = "manchester united", dat = EPL, S = "2006-2007"){
}
Task2.9()
NULL
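"Handling errors" for a season argument could mean checking the input before any calculation and failing with a clear message. The sketch below shows one way to do that; `check_season` is a hypothetical helper name, and the exact checks are an assumption about what the task expects.

```r
# Hypothetical toy data: only the season column matters here
toy <- data.frame(
  season = c("2006-2007", "2006-2007", "2007-2008"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop with a clear message if the season is unusable
check_season <- function(S, dat) {
  if (!is.character(S) || length(S) != 1) {
    stop("S must be a single season string such as \"2006-2007\"")
  }
  if (!(S %in% as.character(dat$season))) {
    stop("Season '", S, "' is not in the data")
  }
  invisible(TRUE)
}

# A valid season passes silently
ok <- check_season("2006-2007", toy)

# An unknown season raises an error, caught here to show the message
msg <- tryCatch(check_season("1999-2000", toy),
                error = function(e) conditionMessage(e))
cat(msg, "\n")
```

Calling such a check at the top of the task function keeps the calculation code free of validation logic.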
Part 2.10. (2 Marks)
Extend the above function to take a vector of seasons instead of a single season, and print the results from Task 1 to Task 5 along with the total points tally. Make sure that it handles errors and duplicated seasons.
In [ ]:
# Please write your own function here, we will provide you the frame but that is all
Task2.10 <- function(team = "manchester united", dat = EPL, S = c("2006-2007", "2008-2009", "2012-2013")){
}
Task2.10()
NULL
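For a vector of seasons, duplicates can be dropped with `unique` and unknown seasons filtered out with a warning rather than a hard stop. The helper name `clean_seasons` is hypothetical, and warning instead of stopping is a design assumption, not something the task prescribes.

```r
# Hypothetical toy data: only the season column matters here
toy <- data.frame(
  season = c("2006-2007", "2007-2008", "2008-2009"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remove duplicates, warn about and drop unknown seasons
clean_seasons <- function(S, dat) {
  S     <- unique(as.character(S))
  known <- S %in% as.character(dat$season)
  if (any(!known)) {
    warning("Ignoring unknown season(s): ",
            paste(S[!known], collapse = ", "))
  }
  S[known]
}

# Duplicated and unknown entries are removed; a warning is issued
kept <- suppressWarnings(
  clean_seasons(c("2006-2007", "2006-2007", "1999-2000", "2008-2009"), toy)
)
```

The cleaned vector can then be looped over, calling the single-season function once per season.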
Part 3: Statistics for the individual team - automated (5 Marks)
Write a function that takes two inputs. The first is a team name or vector of team names (compulsory), e.g. "chelsea", "Manchester united", " Arsenal". The second is a season or vector of seasons (optional). Print the statistics from Part 2, Task 1 to Task 7. Make sure that the function can handle errors; team names are case-insensitive, so "chelsea" must be treated the same as "Chelsea".
Print the results in a form that makes the statistics easy to understand.
In [ ]:
# Please write your own function here, we will provide you the frame but that is all
Task3 <- function(Team = c("Chelsea", "Manchester City", "Arsenal"), dat = EPL, S = c("2006-2007", "2008-2009", "2012-2013")){
}
Task3()
NULL
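Since the team names in `EPL` were converted to lower case in Part 2, user input can be normalised the same way before matching. The sketch below also trims surrounding spaces, as in the " Arsenal" example; `normalise_team` and `known_teams` are hypothetical names, and the list of known teams would come from the data in practice.

```r
# Hypothetical helper: lower-case and trim a vector of team names
normalise_team <- function(team) tolower(trimws(team))

# Stand-in for the lower-cased team names found in the data
known_teams <- c("chelsea", "arsenal", "manchester united")

# Mixed-case, padded and unknown inputs
requested <- c("Chelsea", " Arsenal", "Manchester united", "Narnia FC")
cleaned   <- normalise_team(requested)

# Split into teams that can be processed and teams to report as errors
valid   <- cleaned[cleaned %in% known_teams]
invalid <- requested[!(cleaned %in% known_teams)]

if (length(invalid) > 0) {
  cat("Unknown team(s):", paste(invalid, collapse = ", "), "\n")
}
```

Reporting the original spelling of an unknown team, rather than the normalised one, makes the message easier for the caller to act on.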