Updated: Oct 13, 2018
Last year, I built Indian football's first Expected Goals (xG) model and started publishing results via my Twitter feed. So, to kick off this year's coverage of the Indian football season at Grey Area Analytics, I thought we'd look back at the first article on this site and at how far the model has come in the past year.
I also regularly get questions from analysts at clubs on how exactly one can calculate and maintain an xG model. This should hopefully clear things up for you.
What is Expected Goals?
Expected Goals is a stat that tells you how likely a player was to score from a given chance. You may also call it chance quality.
Why do we need Expected Goals?
Football is a sport with a high degree of variance. In basketball, the average team takes around 85 shots a game and converts roughly half of them. In football, each team would be lucky to get 10 chances and a goal in a game. With so few goals, the scoreline alone says little about how well a team actually played. In Indian football this is a bigger problem because we don't play enough games, let alone score enough goals.
How does one calculate Expected Goals?
The first thing to understand is that not all shots are equal. A shot from the halfway line is obviously less likely to result in a goal than a shot from inside the six-yard box.
To calculate xG, I capture a set of features that describe each shot and pass them through a logistic regression model, after excluding penalties. In the new model, I assign a default of 0.75 xG to every penalty. For all other types of chances, the model computes the weighted sum z below. Because this is a logistic regression, z is on the log-odds scale, so the final value is Expected Goals = 1 / (1 + e^(-z)):
z = - 0.349971
+ 0.006271 * Minute
+ 0.096542 * Game_state
+ 1.341620 * if the shot was in "prime"
+ 0.340696 * if the shot was in "inside box - right"
- 0.834023 * if the shot was in "outside box - left wing"
- 1.186690 * if the shot was in "outside box - right wing"
+ 0.049426 * if the shot was in "inside box - central"
- 1.366107 * if the shot was in "outside box - central"
+ 0.208562 * Headed Chance
+ 0.458224 * Through Ball
- 0.013551 * Cross
+ 0.424485 * Counter-Attack
- 1.463867 * Open Play Chance
- 2.054052 * Defensive Pressure
- 1.364752 * Free-kick Chance
- 1.653561 * Corner Chance
- 0.242841 * if the assist was in "inside box - left"
+ 1.002087 * if the assist was in "prime"
+ 0.104116 * if the assist was in "inside box - right"
- 1.180954 * if the assist was in "outside box - left wing"
- 0.683533 * if the assist was in "outside box - right wing"
- 0.768507 * if the assist was in "inside box - central"
- 0.138973 * if the assist was in "outside box - central"
- 0.203478 * if the assist was in "Defensive Half"
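As a rough illustration, the equation above can be evaluated in a few lines of Python. This is a minimal sketch, not my production code: the dictionary keys (`minute`, `shot_zone`, `open_play` and so on) are hypothetical names, every yes/no feature is assumed to be encoded as 0 or 1, `game_state` is passed through as whatever number the data holds, and any zone without a coefficient is assumed to be the reference category and contributes nothing. The weighted sum is treated as the log-odds of a logistic regression and converted to a probability with the logistic function.

```python
import math

# Coefficients copied from the equation above.
INTERCEPT = -0.349971
NUMERIC = {"minute": 0.006271, "game_state": 0.096542}
SHOT_ZONE = {
    "prime": 1.341620,
    "inside box - right": 0.340696,
    "outside box - left wing": -0.834023,
    "outside box - right wing": -1.186690,
    "inside box - central": 0.049426,
    "outside box - central": -1.366107,
}
FLAGS = {  # assumed to be 0/1 indicators
    "headed": 0.208562,
    "through_ball": 0.458224,
    "cross": -0.013551,
    "counter_attack": 0.424485,
    "open_play": -1.463867,
    "defensive_pressure": -2.054052,
    "free_kick": -1.364752,
    "corner": -1.653561,
}
ASSIST_ZONE = {
    "inside box - left": -0.242841,
    "prime": 1.002087,
    "inside box - right": 0.104116,
    "outside box - left wing": -1.180954,
    "outside box - right wing": -0.683533,
    "inside box - central": -0.768507,
    "outside box - central": -0.138973,
    "Defensive Half": -0.203478,
}
PENALTY_XG = 0.75  # flat value assigned to every penalty


def expected_goals(shot):
    """Return xG for one shot described by a dict of (hypothetical) feature names."""
    if shot.get("penalty"):
        return PENALTY_XG
    z = INTERCEPT
    for name, coef in NUMERIC.items():
        z += coef * shot.get(name, 0)
    # Zones without a coefficient are treated as the reference category (adds 0).
    z += SHOT_ZONE.get(shot.get("shot_zone"), 0.0)
    z += ASSIST_ZONE.get(shot.get("assist_zone"), 0.0)
    for name, coef in FLAGS.items():
        z += coef * (1 if shot.get(name) else 0)
    # Logistic function: converts log-odds into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))


example = {"minute": 60, "game_state": 0, "shot_zone": "prime",
           "assist_zone": "inside box - right", "open_play": True,
           "through_ball": True}
print(round(expected_goals(example), 3))
```

Summing `expected_goals` over every shot a team takes in a match gives that team's xG for the game.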
OK, but does it make footballing sense?
To demonstrate the underlying football logic, and that the model isn't throwing up random numbers, I generated some graphs to display its accuracy.
It's widely accepted that goals in football follow a Poisson distribution. So, it's not unreasonable to expect Expected Goals to follow the same pattern. As shown in the graph below, the output of the model closely resembles a Poisson distribution.
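For readers who want a reference shape to compare against, here is a minimal sketch of the Poisson probabilities for 0 to 5 goals. The function names and the mean of 1.3 goals are illustrative assumptions, not figures taken from my data.

```python
import math


def poisson_pmf(k, mean):
    """Probability of exactly k goals when goals are Poisson with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def poisson_table(mean, max_goals=5):
    """Probabilities for 0, 1, ..., max_goals goals."""
    return [poisson_pmf(k, mean) for k in range(max_goals + 1)]


# Illustrative mean only; replace with the average goals per team per game.
for goals, prob in enumerate(poisson_table(1.3)):
    print(goals, round(prob, 3))
```

Plotting these probabilities next to the observed share of games with 0, 1, 2, ... goals is one simple way to eyeball the fit.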
One of the fundamental rules of Expected Goals is that the closer you get to the goal, the easier it should be to convert the chance. Once again, my model agrees with the principle.
And finally, the state of the game should also affect the ability to convert a particular chance. The graph below shows that the more comfortable a team's position in the game, the more likely it is to score.
Accuracy and Testing
Garry Gelade published this piece where he compares well-known models that are publicly quoted on the internet. He posts accuracy and error measures that I've reproduced here, with my model added to the mix.
While my model isn't close to being the best, it's definitely in the ballpark. A standard way to measure how well a logistic regression model separates outcomes is to plot an ROC curve and then measure the Area Under the Curve (AUC). The closer the curve gets to the top-left corner, and the further it sits from the diagonal line, the better the model distinguishes goals from non-goals.
Below is the ROC curve. The diagonal line represents random guessing while the curve represents my model. As you can see, my model functions pretty well as a classifier.
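The Area Under the Curve can also be computed without drawing anything: it equals the probability that a randomly chosen goal received a higher xG than a randomly chosen non-goal, with ties counted as half. The sketch below uses that pairwise definition. The function name `roc_auc` and the toy numbers are assumptions for illustration, not my actual results.

```python
def roc_auc(labels, scores):
    """AUC as the share of (goal, non-goal) pairs ranked correctly; ties count half."""
    goals = [s for y, s in zip(labels, scores) if y == 1]
    misses = [s for y, s in zip(labels, scores) if y == 0]
    if not goals or not misses:
        raise ValueError("need at least one goal and one non-goal")
    correct = 0.0
    for g in goals:
        for m in misses:
            if g > m:
                correct += 1.0
            elif g == m:
                correct += 0.5
    return correct / (len(goals) * len(misses))


# Toy data: 1 = goal, 0 = no goal, paired with the model's xG for each shot.
outcomes = [1, 0, 1, 0]
xg_values = [0.9, 0.8, 0.3, 0.1]
print(roc_auc(outcomes, xg_values))
```

A value of 0.5 corresponds to the diagonal line (random guessing) and 1.0 to a perfect ranking. The double loop is quadratic in the number of shots, which is fine for a couple of thousand shots per season.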
The obvious weakness in my model is that I use zones as a categorical input rather than a quantitative one. Another key weakness is the small sample size. In each Indian league we get 90 games and a little over 2,000 shots per season, which is nowhere near enough.
Got feedback? Find me on Twitter @sgtstalnpeppa