How Difficult Is It to Predict the English Premier League Table – The Nostradamus Way
Football fanatics are abuzz with excitement because a new season of the English Premier League has started! Like every season, the diehards will weigh the chances of their favourite clubs and try to be a Nostradamus in predicting their fates. Here at Goalden Times, Debojyoti Chakraborty digs deep to come up with a fairly easy and accurate forecasting model. Or is it? Read on to know more.
Knowing the Unknown, Seeing the Unseen
The urge to see the future is human nature. Thus, it’s not a coincidence that every football lover backs, either literally or hypothetically, a team to win a tournament even before it has started. We are not talking about loyalty here. We are actually referring to a fan who chooses the winner based on logical deduction. Apart from the self-satisfaction or social recognition among friends, an accurate prediction can bring in a fortune if translated correctly in the betting market. So, how difficult is it to logically derive one of these so-called forecasting models?
Simple Theories, Complex Route
A few readers might already be aware of the various statistical models adopted by scholars to arrive at some of the most accurate predictions across different spheres of life. Poisson regression (univariate as well as multivariate), trinomial distribution on a stochastically balanced model, Bayesian analysis: these terms will excite many a researcher but will definitely bamboozle most football lovers. However, the basic approach for any such model remains more or less the same: predict the outcome of every match to arrive at the final league table. This in itself is quite an exercise, as it amounts to forecasting the result of 380 matches in a season. Various other factors are also considered, e.g., the expected number of goals to be scored in a match, the home team advantage, the home team’s offensive power, the opponent team’s defensive power, and a random factor. Some models are even more intricate, as they drill down to the level of shots on target and tackles won per match, weather and pitch conditions during a particular fixture, the proximity to the transfer window deadline, individual players’ form, fitness, injury concerns, and so on! In fact, even proven models from other walks of life, like Markowitz portfolio management, which is widely used in stock trading, have been used to predict the outcome of football matches. Needless to say, predicting match outcomes is a complicated exercise indeed.
They Must be Accurate
After putting in so much time, effort, and money, it is, of course, imperative for the results to be accurate. Well, let’s brace ourselves for a shocking truth.
The chart above shows how each game in the 2011–12 English Premier League season was predicted by using a very intensive model. The model was used by a famous betting house to determine the match odds. Naturally it was supposed to be one of the most accurate ones in order to maximize the agency’s profits. Probabilities of all the possible bets are plotted horizontally against the corresponding odds offered by the bookmaker (the vertical axis is a logarithmic one). Since only the feasible bets are included in the figure, each data point is above the diagonal. Green squares represent the winning bets and red ones stand for the losing bets. They seem to be quite evenly spread, don’t they? So the football enthusiast, the odd person passing by, or the bookmaker agency—no one seems to be very accurate in predicting match outcomes.
Start from Scratch
So, how good can we be in delivering one such predictive model with the least amount of data? Let us begin without digging up a lot of data points. Let us start from scratch, i.e., let us set aside all our pre-conceived notions of team strengths, past records, and financial muscle, and conclude that the English Premier League is a perfect example of a socialist society. In other words, let us assume that every team is equal and will finish with exactly the same points at the end of the season. What does that mean? It simply implies that all the 20 teams would end up securing the same position in the league table. That position or rank can be either at the top or bottom of the table (1 or 20), or at the middle of the road (10 or 11). Whatever it is, let us see how much variance we end up with.
The lower the variance, the better. In a perfect predictive model, the variance will be zero. So, it makes much more sense to assume that every team will finish in the same mid-table position. From this point on, whatever model we come up with, its variance should be lower than 25.89. We shall take data from the last five Premier League seasons (2010–11 onwards) and try to get as close as possible to the final results of the 2015–16 season.
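The article does not spell out how its variance is computed. As a minimal sketch, assume it is the mean squared difference between predicted and actual rank over the 20 clubs. The function name `rank_variance` is illustrative, and because the metric is an assumption, its output should not be expected to reproduce the figures quoted in this article. What it does show is why a mid-table guess for everyone beats guessing top or bottom for everyone.

```python
def rank_variance(predicted, actual):
    """Mean squared difference between predicted and actual ranks (assumed metric)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


# Every final table is some ordering of the ranks 1..20.
actual_ranks = list(range(1, 21))

# "Socialist league" baselines: every club is handed the same rank.
for guess in (1, 10, 10.5, 20):
    constant_prediction = [guess] * len(actual_ranks)
    print(guess, rank_variance(constant_prediction, actual_ranks))
```

Under this assumed metric, the error is smallest when the common guess sits in the middle of the table and largest at either extreme.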
One of the most basic parameters that I can think of at this stage is the previous season’s performance. What if we predict that the teams will finish this season in exactly the same place as the last season? It might sound a bit foolish, but think about it. How much difference do we see in the performance of the top-flight teams in two consecutive seasons? Keep in mind that we are talking about all the 20 clubs collectively here. It’s not about a single club anymore.
A look at the graph for the 2012–13 season reveals that this prediction would not have done too badly! Teams promoted from the Championship (Queens Park Rangers, for example) were predicted to take the last three spots in the table, in order of their Championship finish. Even though the independent parameter (last season’s league rank) is quite crude and carries limited information, the model produced a decent output with a variance of 17.61. Not bad to start off with!
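The one-parameter rule above can be sketched in a few lines. This is only an illustration of the idea as described: surviving clubs keep last season's order, and promoted clubs fill the bottom places in order of their Championship finish. The function name and the toy six-club league are hypothetical.

```python
def predict_from_last_season(last_season_table, relegated, promoted):
    """Predict this season's ranks from last season's finishing order.

    last_season_table: clubs in finishing order, champions first.
    relegated: the clubs that went down.
    promoted: the clubs that came up, in order of their Championship finish.
    """
    survivors = [club for club in last_season_table if club not in relegated]
    predicted_order = survivors + list(promoted)
    return {club: rank for rank, club in enumerate(predicted_order, start=1)}


# Hypothetical six-club league with two relegation places, for illustration only.
last_season = ["A", "B", "C", "D", "E", "F"]
prediction = predict_from_last_season(
    last_season, relegated={"E", "F"}, promoted=["P1", "P2"]
)
print(prediction)
```

With a real 20-club table and three promoted clubs, the same function would place the newcomers 18th to 20th.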
What next then? Well, this decade has seen unprecedented financial muscle flexing by football clubs. An insane influx of money, a scarcity of quality strikers, and the simple rules of demand–supply—all have contributed to EPL clubs being the biggest spenders in world football. So, let us also take into account the cash flow of each team, i.e., money spent on transfers as well as money earned from them (seller club anyone?). Let’s note that free transfers are not considered here. Even though these freebies have played a crucial part in the team’s season outcome in some cases, I have followed the demand–supply principle here. If something is given out for free, then it’s not that good!
The results look a bit better, but they are not that different from what we have already achieved. The variance has come down slightly, but there’s still a long way to go.
So, what else can we take into account? Let us analyse the teams’ performance in the league once a number of matches have been played. After a few rounds, squad depths get tested, reserve bench strengths are put to use, and sometimes the league takes a back seat. In the domestic cups, at least until the closing stages, teams generally try out fringe and youth players so as not to hamper their chances in the league. The top teams in the Champions League generally have much bigger and better first-team squads. They are somewhat prepared to face grinding matches twice a week and hence are likely to be the least affected. The actual problem arises for those poor souls who happen to be in the gruelling Europa League. A demanding travel itinerary, a never-ending schedule of matches, and a frequent lack of financial motivation take a toll on their league performances. So, we took a look at the teams that participated in the Europa League and also took into account their progress in the tournament.
The predicted ranks are now starting to look a lot closer to the actual ones. The variance has come down quite a bit to 12.57, a whopping reduction of roughly 51% from the 25.89 we started with. That is as good as it can get at this stage.
Nostradamus Comes Out
So, let me try to put our model into action and predict the final league standing for the ongoing season. As explained above, the model takes three simple datasets as input parameters:
- Last year’s league position
- Net cash spent in the transfer market for the season
- Number of rounds expected to be played in the Europa League this season
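The article does not say how these three inputs are combined, so the following is only one possible sketch: a weighted score per club, with the table obtained by sorting on that score. The weights, the `ClubInputs` class, and the club figures are all hypothetical and are not taken from the model described here.

```python
from dataclasses import dataclass


@dataclass
class ClubInputs:
    name: str
    last_rank: int       # last season's league position
    net_spend_m: float   # transfer fees paid minus fees received, in millions
    europa_rounds: int   # rounds expected to be played in the Europa League


# Hypothetical weights, NOT taken from the article.
SPEND_WEIGHT = -0.05   # each million of net spend is assumed to help slightly
EUROPA_WEIGHT = 0.5    # each Europa League round is assumed to cost half a place


def predict_table(clubs):
    """Order clubs by a weighted score; a lower score means a higher finish."""
    def score(club):
        return (club.last_rank
                + SPEND_WEIGHT * club.net_spend_m
                + EUROPA_WEIGHT * club.europa_rounds)
    return [club.name for club in sorted(clubs, key=score)]


# Made-up clubs and numbers, for illustration only.
clubs = [
    ClubInputs("Club A", last_rank=1, net_spend_m=0.0, europa_rounds=0),
    ClubInputs("Club B", last_rank=2, net_spend_m=40.0, europa_rounds=0),
    ClubInputs("Club C", last_rank=3, net_spend_m=10.0, europa_rounds=4),
]
print(predict_table(clubs))
```

In practice the weights would have to be tuned against past seasons, for instance by choosing the values that give the lowest variance over the five seasons of data.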
As of now, we know that:
- Southampton got knocked out in the qualifying round.
- Liverpool and Tottenham Hotspur are probably exiting in the round of 32 as they really do not give much importance to the competition. They are more focused on finishing in the top four in the EPL and gaining a Champions League spot.
So, the model is run and this is how the EPL 2015–16 table is predicted.
Now, this is far from a finished product and the prediction should be taken with a pinch of salt.
- While the variance has come down substantially (to 12.57) from where it all started (25.89), there is still massive room for improvement.
- The model has been unable to pinpoint the table position for each club. Over the last five seasons, it has predicted the exact final position for only two to three clubs per season, on average.
- Having said that, the model has been quite accurate in predicting each club’s final position within three places of its eventual table rank.
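The two yardsticks used in the points above, exact hits and finishes within three places, are easy to compute once predicted and actual tables are available. A minimal sketch follows; the function name and the six-club tables are hypothetical.

```python
def placement_accuracy(predicted, actual, tolerance=3):
    """Return (exact hits, clubs within `tolerance` places) for two {club: rank} tables."""
    if predicted.keys() != actual.keys():
        raise ValueError("both tables must contain the same clubs")
    exact = sum(1 for club in actual if predicted[club] == actual[club])
    close = sum(1 for club in actual
                if abs(predicted[club] - actual[club]) <= tolerance)
    return exact, close


# Made-up six-club tables, for illustration only.
predicted = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
actual = {"A": 1, "B": 6, "C": 3, "D": 5, "E": 4, "F": 2}

exact_hits, within_three = placement_accuracy(predicted, actual)
print(exact_hits, within_three)
```

Note that the second count includes the exact hits, since a club predicted perfectly is also within three places.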
As we are already a few matches into the season, it looks highly unlikely, based on their current form, that Chelsea will be able to retain their crown. However, this prediction does not take current season form into account; it is based purely on the data available before the first game week of the season.
If one looks at the following figures, the gradual evolution of the model will become clear.
As the model has absorbed more and more parameters, the variance has come down, though not every measure improved in step. With the two-parameter model, fewer clubs finished within three places of their predicted ranks! This was rectified with the latest model. The two-parameter model did, however, score heavily on convergence, i.e., the potential to churn out more accurate predictions in the long run. For the time being, though, our three-parameter model seems to have struck a reasonable middle ground: an acceptable variance and a decent convergence.
As we fit more and more relevant parameters into the model, it should become more and more accurate. However, this will only come at the cost of added complexity. The principle of diminishing returns should dictate our actions here. Should we consider the impact of managerial changes, injuries in the overall squad (or even to key players), or a difficult run of fixtures at the start or end of the season? Can any of these parameters reduce the variance significantly? The search is endless. Considering the modest effort required to build this model and how close it came to the final table in the last five EPL seasons, this looks more than a decent proposition. Disagree? Please leave your comments. You never know, your thoughts may just be incorporated in our next predictive model!