Project Report
Feature extraction
As features, I use the two teams’ ELO ratings and world ranks at the time of the match. ELO reflects the relative skill levels of the two teams, so the team with the higher ELO is more likely to win. Indeed, a naïve model that simply picks the team with the higher ELO as the winner achieves an accuracy of about 80%. A team’s world rank is also an indicator of its strength, so I include the ranks of the two teams as features as well.
Because ELO and rank have different ranges and scales, I normalize both so that their values lie between 0 and 1.
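The report does not say which normalization is used. One common way to map a feature onto the range 0 to 1 is min-max scaling, sketched below. The function name min_max_scale and the sample ratings and ranks are hypothetical and serve only as an illustration.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical ELO ratings and world ranks, for illustration only.
elo = [1500, 1800, 2100]
rank = [40, 10, 1]

elo_scaled = min_max_scale(elo)    # [0.0, 0.5, 1.0]
rank_scaled = min_max_scale(rank)  # the largest rank number maps to 1.0, rank 1 maps to 0.0
```

With this scaling, a small rank number (a strong team) ends up near 0, while a high ELO (also a strong team) ends up near 1, so the two features point in opposite directions. A linear model can absorb this through the sign of its weights.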
I also extract a home advantage feature. Its value is 1 if the match is played in team1’s home country, -1 if it is played in team2’s home country, and 0 otherwise. The Tournament field contains the country where the match takes place, so I use it to extract this feature. I include it because it is well known that a team has an advantage when it plays at home. To test whether the feature is useful, I compare the performance of the models with and without it.
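The encoding described above can be sketched as a small function. This is only an illustration: it assumes the host country has already been parsed out of the Tournament field and is spelled the same way as the team names. The function name home_advantage and the sample teams are hypothetical.

```python
def home_advantage(team1, team2, host_country):
    """Return 1 if team1 plays at home, -1 if team2 plays at home, 0 otherwise.

    Assumes host_country was already extracted from the Tournament field
    and uses the same spelling as the team names.
    """
    if host_country == team1:
        return 1
    if host_country == team2:
        return -1
    return 0


# Hypothetical matches, for illustration only.
print(home_advantage("Brazil", "Germany", "Brazil"))   # 1
print(home_advantage("Brazil", "Germany", "Germany"))  # -1
print(home_advantage("Brazil", "Germany", "Russia"))   # 0
```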
I cannot use the score as a feature, because it is the outcome of the match. If it were included, I could simply compare the two teams’ scores to determine the winner, which would make the prediction pointless.
Model Selection
Since this is a classification problem, any classification algorithm can be applied to it. I consider four common ones: logistic regression, a naive Bayes classifier, an SVM and a decision tree. I test all of these models and choose the one with the best performance.
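A comparison of this kind could look like the following sketch, which uses scikit-learn. Everything specific in it is an assumption rather than the setup used in this report: the data is synthetic, the column layout and the 75/25 split are made up, GaussianNB stands in for the unspecified naive Bayes variant, and all models use default hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the report's data).
# Columns: elo1, elo2, rank1, rank2 scaled to [0, 1], then home advantage.
rng = np.random.default_rng(0)
n = 400
X = rng.random((n, 5))
X[:, 4] = rng.integers(-1, 2, size=n)  # home advantage in {-1, 0, 1}

# Made-up rule for the label: 1 means team1 wins, 0 means team2 wins.
signal = (X[:, 0] - X[:, 1]) + 0.2 * X[:, 4] + rng.normal(0, 0.1, n)
y = (signal > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def make_models():
    """Create fresh, untrained instances of the four classifiers."""
    return {
        "logistic": LogisticRegression(),
        "bayes": GaussianNB(),
        "svm": SVC(),
        "tree": DecisionTreeClassifier(random_state=0),
    }


def compare(columns):
    """Train every model on the given feature columns and return test accuracies."""
    results = {}
    for name, model in make_models().items():
        model.fit(X_train[:, columns], y_train)
        predictions = model.predict(X_test[:, columns])
        results[name] = accuracy_score(y_test, predictions)
    return results


without_home = compare([0, 1, 2, 3])
with_home = compare([0, 1, 2, 3, 4])
best = max(with_home, key=with_home.get)  # name of the most accurate model
print(without_home, with_home, best)
```

Training each model twice, on column sets that differ only in the home advantage column, isolates the effect of that one feature.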
Experiment and performance
To have a baseline for comparison, I first test the naïve model, which simply picks the team with the higher ELO as the winner. Its accuracy is 0.822430.
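The baseline is simple enough to sketch in a few lines. The match data below is hypothetical, and awarding ties in ELO to team1 is an assumption, since the report does not say how ties are handled.

```python
def predict_higher_elo(elo1, elo2):
    """Return 1 if team1 is predicted to win, otherwise 2.

    Ties are given to team1 here, which is an arbitrary assumption.
    """
    return 1 if elo1 >= elo2 else 2


def accuracy(predictions, actual):
    """Fraction of predictions that match the actual winners."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct / len(actual)


# Hypothetical matches: (ELO of team1, ELO of team2, actual winner).
matches = [
    (1900, 1700, 1),
    (1600, 1750, 2),
    (1800, 1650, 2),  # an upset: the lower-rated team wins
    (2000, 1500, 1),
]

predictions = [predict_higher_elo(e1, e2) for e1, e2, _ in matches]
winners = [winner for _, _, winner in matches]
baseline_accuracy = accuracy(predictions, winners)
print(baseline_accuracy)  # 0.75
```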
Then I test logistic regression, the naive Bayes classifier, the SVM and the decision tree, both without and with the home advantage feature. The results are as follows.
Without home advantage feature:
Logistic_pred 0.878505
Bayes_pred 0.859813
SVC_pred 0.775701
Tree_pred 0.700935
With home advantage feature:
Logistic_pred_withHome 0.897196
Bayes_pred_withHome 0.757009
SVC_pred_withHome 0.775701
Tree_pred_withHome 0.775701
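From the results, we can see that logistic regression has the best performance, and with the home advantage feature its accuracy improves from 0.878505 to 0.897196. The feature does not help every model: the naive Bayes accuracy drops from 0.859813 to 0.757009, the SVM accuracy is unchanged at 0.775701, and the decision tree accuracy improves from 0.700935 to 0.775701.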
I therefore choose the logistic regression model with the ELO, rank and home advantage features. It achieves an accuracy of 0.897196, about 7.5 percentage points above the naïve baseline’s 0.822430.