
Sports Data Modelling for Beginners: A Simple Elo for the 2021 Euro & Copa América

If you’ve gone to the effort of clicking on this article, whether via my site, my Twitter or otherwise, I can make a fairly well-educated guess that you fall into one of two categories:

  1. You work with data in your day job, but also enjoy applying your well-practiced data skills to your hobby/side-passion of predictive sports modelling (or if you’re super lucky maybe sports modelling is your day job)
  2. Despite not having much previous experience working with data, you see a whole heap of content on Twitter and other socials to do with predictive sports modelling and think, “Damn, this sports data stuff is so far up my alley but I have no idea how to do any of it. Maybe I’ll learn how to do it myself one day.”

If you fall into category 1, congratulations – the good news is you have a cool job and an even cooler passion (at least in my humble opinion). The bad news is you probably won’t learn any new tricks from me here.

If category 2 is more your speed, then it’s only good news – learning data modelling skills is both rewarding and very useful in a career sense. Better yet, this article is just for you.

As someone who is only a few years into a career in data analytics, I like to fool myself and pretend I’m a fully-fledged member of category 1, but at the same time I’m also massively sympathetic towards those in category 2 as it wasn’t that long ago that it was me on the outside of the glass looking in and thinking, “how do I get to do this as my full-time job?”.

It’s not always obvious to know how to get started when learning new skills, so I feel obliged to help those who are looking to have a crack at data analytics however I can. In this case, I wanted to share a tutorial on how to build an Elo model which I intend to enter into Betfair Australia’s upcoming Euro & Copa América Datathon.

It was a competition just like this one a few years back, focused on the Australian Open, that gave me my first proper taste of data modelling in sports. Since then my skills have grown exponentially and I’ve had doors opened in my career that I didn’t previously think were even viable options for me, so I honestly cannot speak highly enough of the opportunities that competitions like Betfair’s Datathon can create.

If you’re interested in taking on my (very ordinary) model, feel free to register for the Datathon yourself!

What is an Elo Model?

First and foremost, I see building an Elo model as a really good starting point for those looking to get into sports data modelling. It will be challenging for those who don’t have much experience in the space, but also simple enough to understand and apply in practice! Let’s get into it.

Elo modelling is a commonly used approach to creating rating systems in sport. Originally devised by a very smart guy called Arpad Elo for ranking chess players, the popularity of Elo modelling has grown massively since its first official use in the 1960s, to the point where it is now widely used to model just about every professional sporting code across the globe. You might have seen people post team ratings for various sports on Twitter or somewhere similar. There’s a solid chance that these are Elo ratings or at the very least inspired by Elo-esque principles.

From a set of Elo ratings we can construct win/loss probabilities for given match-ups between two teams using a simple formula which takes the difference between the two teams’ ratings and outputs the probability of each team winning or losing the match. A very basic yet very effective approach!
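For the curious, that formula can be sketched in a few lines of base R. This is only a minimal sketch: `elo_expected()` is a hypothetical helper name, the ratings are made up, and the 400-point scale is the conventional Elo default rather than something tuned for this tutorial.

```r
# Standard Elo expected score: probability that team A beats team B.
# elo_expected() is a hypothetical helper written for illustration.
elo_expected <- function(rating_a, rating_b, scale = 400) {
  1 / (1 + 10^((rating_b - rating_a) / scale))
}

p_a <- elo_expected(1600, 1500)   # made-up ratings, 100 points apart
p_b <- elo_expected(1500, 1600)   # the same match-up from the other side

round(p_a, 3)   # roughly 0.64
p_a + p_b       # the two probabilities sum to 1
```

Only the rating difference matters here, so 1600 v 1500 gives the same probability as 2000 v 1900.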

To read more about the inner workings of Elo models, click here for a detailed run-down of how the mathematics behind an Elo rating system works in the context of tennis matches.

Elo-based systems lend themselves particularly well to modelling soccer, so much so that publicly available Elo ratings systems (www.eloratings.net for international soccer for example) have been adopted by professional bodies to help seed tournaments and create fairer fixtures.

What Do I Need?

This tutorial aims to serve as a guide to building a basic soccer Elo model, with a particular focus on the 2021 editions of the Euro and Copa América, as they will serve as the subject of Betfair’s Datathon.

To follow along with this walk-through you will need two things:

  1. This code is written in a programming language called R. R is a free-to-use statistical computing and graphics environment which is used by data analysts/scientists/brilliant geniuses like myself all over the globe for all things data analytics. It is one of the more popular programming languages out there and learning it was the single best thing I’ve ever done for my career. I won’t lie, there is a steep learning curve when you get started, but once you know what you’re doing, there’s no limit as to what you can do with it.
    Running R code will require you to have R and RStudio running on your system. Google is your friend here if you’re unsure how to get started – I’ve never had a question relating to programming languages that Google has failed to answer.
  2. You will also need the historical data set provided for Betfair’s 2021 Euro & Copa América Datathon. For full access to the data set including international soccer fixtures from 2014 to 2021 you can register for the Datathon here, or alternatively if you are reading this after the Datathon has concluded, you can download the data required to run this code below.

That’s enough chatter for now, let’s get into writing code and building an Elo model!

Load Packages & Import Data

In the interest of being concise, I’m going to skip any instructions relating to how to use RStudio or the basics of R itself here. There are 1001 tutorials for that on Google already, most of which will be better-written than what I could offer. Instead I’ll jump straight into the model building process itself and leave the R basics tutorial to the wider internet.

The first thing to do is load the packages that will be required to run the Elo model and to read in the historic data from wherever it is stored on your machine. Packages need to be installed before they can be loaded – use the “install.packages()” function first if you don’t have any of the below installed already.
The main package to take notice of here is the “elo” package. As the name suggests, this package is for running Elo models.
The “MLmetrics” package will be used towards the end of the tutorial to back-test the accuracy of our model.
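If you are not sure which of these packages you already have, one way to check is sketched below. The variable names are my own, and the `interactive()` guard is simply an assumption that you are running this from a live session such as RStudio.

```r
# Packages used in this tutorial
required_pkgs <- c("readr", "dplyr", "lubridate", "elo", "MLmetrics")

# Which of them are not installed yet?
missing_pkgs <- setdiff(required_pkgs, rownames(installed.packages()))

# Install only what is missing (guarded so it only runs in an interactive session)
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```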

library(readr)
library(dplyr)
library(lubridate)
library(elo)
library(MLmetrics)

raw_data <- read_csv("datathon_initial_form_data.csv")

That’s all it takes to read the data we need into our R workspace. Remember to set your working directory via the Session menu to wherever you’ve saved the data, or else rejig what’s above to reflect the file path on your computer to where the file is saved.

Data Exploration

When using a new and unfamiliar data set, it is always good practice to first get an idea of what is included in it.

colnames(raw_data)
##  [1] "tournament"                     "timestamp"                      "date_GMT"                       "status"                         "stadium_name"                  
##  [6] "attendance"                     "home_team_name"                 "away_team_name"                 "referee"                        "Game Week"                     
## [11] "home_team_pre_match_ppg"        "away_team_pre_match_ppg"        "home_ppg"                       "away_ppg"                       "home_team_goal_count"          
## [16] "away_team_goal_count"           "total_goal_count"               "total_goals_at_half_time"       "home_team_goal_count_half_time" "away_team_goal_count_half_time"
## [21] "home_team_goal_timings"         "away_team_goal_timings"         "home_team_corner_count"         "away_team_corner_count"         "home_team_yellow_cards"        
## [26] "home_team_red_cards"            "away_team_yellow_cards"         "away_team_red_cards"            "home_team_first_half_cards"     "home_team_second_half_cards"   
## [31] "away_team_first_half_cards"     "away_team_second_half_cards"    "home_team_shots"                "away_team_shots"                "home_team_shots_on_target"     
## [36] "away_team_shots_on_target"      "home_team_shots_off_target"     "away_team_shots_off_target"     "home_team_fouls"                "away_team_fouls"               
## [41] "home_team_possession"           "away_team_possession"           "home_team_xg"                   "away_team_xg"

Plenty of interesting stats in there that can be explored – possession, shots, cards, fouls, and xG (expected goals) can tell us plenty about how a match was played!

Something noticeable is that the data set does not have a feature included to indicate the result of each match. Let’s start by creating two new columns: one showing the result from the home team’s perspective and another from the away team’s perspective.

We’ll also create a feature to show the margin of the match, i.e. the absolute goal difference between the two sides.

raw_data <- raw_data %>% 
  mutate(
    # Result from the home team's perspective: 1 = win, 0.5 = draw, 0 = loss
    home_result = case_when(home_team_goal_count > away_team_goal_count ~ 1,
                            home_team_goal_count < away_team_goal_count ~ 0,
                            home_team_goal_count == away_team_goal_count ~ 0.5),
    # Result from the away team's perspective
    away_result = case_when(home_team_goal_count < away_team_goal_count ~ 1,
                            home_team_goal_count > away_team_goal_count ~ 0,
                            home_team_goal_count == away_team_goal_count ~ 0.5),
    # Absolute goal difference between the two sides
    margin = abs(home_team_goal_count - away_team_goal_count)
  )

If the above code looks daunting or hard to follow, you can read up on some of the functions within the “dplyr” package to figure out what’s going on here.

Let’s also make sure the date_GMT column is formatted correctly in R as a date. We can do that using the “parse_date_time()” function from the “lubridate” package. The data we imported should already have been in chronological order so no changes need to be made there (running the Elo model should be done chronologically).

raw_data <- raw_data %>% 
  mutate(date_GMT = parse_date_time(date_GMT, "mdY_HM"))   # Convert the text timestamps into proper date-times
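Because Elo ratings are updated one match at a time, the order of the rows matters. If you ever want to confirm that a set of dates really is chronological, a quick base R check might look like the sketch below. The dates are made up for illustration and are deliberately out of order.

```r
# Hypothetical match dates, deliberately out of order
match_dates <- as.Date(c("2021-03-28", "2021-03-25", "2021-03-31"))

# TRUE only if the rows are already in chronological order
is_chronological <- identical(order(match_dates), seq_along(match_dates))
is_chronological   # FALSE for this toy example

# Re-order when needed (with dplyr this would be arrange(date_GMT))
sorted_dates <- match_dates[order(match_dates)]
```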

Running the Elo Model

Now comes the fun part – setting the Elo model in action!

There are a number of steps we could have taken before getting to this point to make the model a touch more complex (and possibly more accurate as a result). However, in the interest of keeping this tutorial as accessible and easy to follow as possible, let’s jump straight into running a basic Elo model based on our data set.

To run the Elo model we will need to use the “elo.run()” function from the “elo” package. The “elo.run()” function requires we input a formula as an argument which tells the function which columns list the two teams, which column is our target/outcome variable, as well as any other features we want to include in our model.

We can use the home_result column we’ve created to identify the result from the home team’s perspective.

We can also input a ‘k value’ which essentially dictates the maximum number of Elo ‘points’ that can be won or lost in a single match. k values are an important part of an Elo model – you can read up about how the well-renowned FiveThirtyEight settled on their NBA model’s k-value here.

We could set k to be a constant (such as 30), however to add an extra layer of (very slight) complexity, we can instead choose to use a variable k value which is dictated by the margin of the match using a formula of k = 30 + 30*margin.
This means that for every goal the margin increases by, the k value will increase by 30, starting with a base k value of 30 for draws (margin equals 0). The idea here is simply to help account for and reward/punish teams winning/losing by bigger margins compared to closer matches.
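To make the effect of the variable k value concrete, here is a small worked example using the standard Elo update rule, where the points exchanged equal k multiplied by the gap between the actual and expected result. It is a sketch only: `elo_points_change()` is a hypothetical helper rather than a function from the “elo” package, and the 60% pre-match probability is made up.

```r
# Points exchanged in one match: k * (actual result - expected result),
# with k = 30 + 30 * margin as described above.
# elo_points_change() is a hypothetical helper, not part of the "elo" package.
elo_points_change <- function(p_expected, result, margin) {
  k <- 30 + 30 * margin
  k * (result - p_expected)
}

# Made-up example: the home side was given a 60% chance before kick-off
win_by_two <- elo_points_change(p_expected = 0.6, result = 1,   margin = 2)  # k = 90
draw       <- elo_points_change(p_expected = 0.6, result = 0.5, margin = 0)  # k = 30

win_by_two   # 36 points gained
draw         # 3 points lost (shown as -3)
```

Whatever one side gains the other side loses, so the total number of rating points in the system stays the same.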

elo_model <- elo.run(data = raw_data,
                     formula = home_result ~ home_team_name + away_team_name + k(30 + 30*margin))

Very simple, but that’s all it takes to run the Elo model across our data set!

Let’s take a look at the last few matches in the data set now that we’ve run the model over it to get a better view of how an Elo model functions.

elo_results <- elo_model %>%
  as.data.frame()

elo_results %>% tail(n = 10)
##             team.A          team.B        p.A wins.A   update.A   update.B     elo.A    elo.B
## 3802       Austria         Denmark 0.25783532    0.0 -38.675299  38.675299 1701.6586 1962.672
## 3803       Moldova          Israel 0.09277590    0.0 -11.133109  11.133109 1080.6122 1498.990
## 3804      Scotland   Faroe Islands 0.86914400    1.0  19.628399 -19.628399 1563.9736 1195.798
## 3805       Andorra         Hungary 0.01454793    0.0  -1.745752   1.745752  958.1647 1693.990
## 3806       England          Poland 0.73976753    1.0  15.613948 -15.613948 1981.3904 1768.669
## 3807    San Marino         Albania 0.01842475    0.0  -1.658228   1.658228  813.8640 1507.789
## 3808       Germany North Macedonia 0.89991673    0.0 -53.995004  53.995004 1846.2882 1572.742
## 3809 Liechtenstein         Iceland 0.06629862    0.0  -7.955835   7.955835 1002.6509 1478.044
## 3810        Greece         Georgia 0.68100398    0.5  -5.430119   5.430119 1584.1062 1463.221
## 3811         Spain          Kosovo 0.97265316    1.0   2.461215  -2.461215 2094.2415 1468.899

We can also now check out the latest Elo ratings for each team in the data set! Let’s look at the top 10. Keep in mind that teams start with an Elo rating of 1500, so any team with an Elo rating greater than 1500 can be considered above average, while those with Elo ratings below 1500 are below average.

final_elos <- final.elos(elo_model)
final_elos %>% sort(decreasing = TRUE) %>% head(n = 10)
##   Belgium     Spain    Brazil    France  Portugal   England     Italy   Denmark Argentina   Morocco 
##  2097.725  2094.241  2042.438  2019.180  1999.920  1981.390  1966.257  1962.672  1960.346  1911.098

Accounting for Draws

Something that we need to acknowledge, both generally and in relation to the Euro & Copa América Datathon, is that Elo models are best suited to binary outcomes, i.e. win or lose. In most soccer fixtures, however, we also have the possibility of a draw to account for. We can still ensure that Elo ratings update appropriately after a draw by using 0.5 wins to denote a drawn match, but when it comes to making probabilistic win predictions using the Elo ratings (see the p.A column above, for example) things get a little more confusing.

For the Euro & Copa América Datathon, draws are a possibility during group stage match-ups, so we need a workaround that produces draw probabilities.

One way to do this – and the method that will be adopted for this tutorial – is to find the historic rate at which two teams of a certain Elo prediction split end up drawing their matches. For example, how often do matches in which an Elo model deems Team A an 80% chance of winning and Team B a 20% chance of winning result in a draw? How about a 70%-30% split? We can find these draw rates for a range of probability points between 0 and 1 and use them to redistribute win/loss probabilities accordingly.

For the record, I am in no way saying that the following is the absolute best possible way to incorporate draw probabilities into an Elo model, however I feel it is an adequate solution given the simplicity of this tutorial. Please don’t crucify me for it, seasoned data modellers!

Let’s start the process by finding historic draw rates. We will bucket matches at 5% increments according to the home team’s probability of winning the match according to the model.

draw_rates <- data.frame(win_prob = elo_model$elos[,3],
                        win_loss_draw = elo_model$elos[,4]) %>%
  mutate(prob_bucket = abs(round((win_prob)*20)) / 20) %>%   # Rounds predicted win probabilities to the nearest 0.05
  group_by(prob_bucket) %>%
  summarise(draw_prob = sum(ifelse(win_loss_draw == 0.5, 1, 0)) / n())   # Calculate draw rate for each bucket

draw_rates %>% head(n=20)
## # A tibble: 20 x 2
##    prob_bucket draw_prob
##          <dbl>     <dbl>
##  1        0       0.0385
##  2        0.05    0.0843
##  3        0.1     0.133 
##  4        0.15    0.221 
##  5        0.2     0.297 
##  6        0.25    0.281 
##  7        0.3     0.293 
##  8        0.35    0.258 
##  9        0.4     0.271 
## 10        0.45    0.267 
## 11        0.5     0.255 
## 12        0.55    0.242 
## 13        0.6     0.274 
## 14        0.65    0.256 
## 15        0.7     0.260 
## 16        0.75    0.224 
## 17        0.8     0.195 
## 18        0.85    0.181 
## 19        0.9     0.168 
## 20        0.95    0.0686

We now have data which will help us deem how likely a match-up between two teams is to end up in a draw. While we’re at it, let’s plot it because why not.

plot(x = draw_rates$prob_bucket,
     y = draw_rates$draw_prob,
     ylim = c(0,0.5),
     col = "blue",
     xlab = "Home Team's Win Probability Bucket",
     ylab = "Draw Probability")

The trend isn’t perfectly smooth (possibly because some buckets don’t contain many matches), but there is enough there to see that draws happen more often in 50-50 matchups (or similar) than they do in lopsided matchups (e.g. >90% or <10% win probabilities).
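One way to judge how much to trust each point on that plot is to count how many matches sit in each bucket alongside the draw rate. The sketch below does this in base R on a handful of made-up matches. The `win_prob` and `result` vectors are hypothetical stand-ins for the model’s predictions and the actual outcomes.

```r
# Made-up predictions and outcomes (1 = home win, 0.5 = draw, 0 = home loss)
win_prob <- c(0.02, 0.03, 0.48, 0.51, 0.52, 0.97)
result   <- c(0,    0.5,  0.5,  1,    0.5,  1)

# Round each prediction to the nearest 0.05, as in the tutorial
bucket <- round(win_prob * 20) / 20

# Number of matches and share of draws in each bucket
bucket_summary <- data.frame(
  prob_bucket = sort(unique(bucket)),
  matches     = as.vector(table(bucket)),
  draw_rate   = as.vector(tapply(result == 0.5, bucket, mean))
)
bucket_summary
```

A bucket with only a few matches in it can swing wildly, so its draw rate deserves less trust than one built from hundreds of matches.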

The next step is to merge this data in with our existing data set. We also need to include the win/loss probabilities for each match that we’ve already found using our Elo model so that we may tweak them to account for the possibility of a draw.

data_with_probabilities <- raw_data %>% 
  select(tournament, date_GMT, home_team_name, away_team_name, home_result, away_result) %>%   # Remove some redundant columns
  mutate(home_elo = elo_results$elo.A - elo_results$update.A,    # Add in home team's elo (need to subtract points update to obtain pre-match rating)
         away_elo = elo_results$elo.B - elo_results$update.B,    # Add in away team's elo (need to subtract points update to obtain pre-match rating)
         home_prob = elo_results$p.A,                            # Add in home team's win/loss probability
         away_prob = 1 - home_prob) %>%                          # Add in away team's win/loss probability
  mutate(prob_bucket = round(20*home_prob)/20) %>%               # Bucket the home team's win/loss probability into a rounded increment of 0.05
  left_join(draw_rates, by = "prob_bucket") %>%                  # Join in our historic draw rates using the probability buckets
  relocate(draw_prob, .after = home_prob) %>% 
  select(-prob_bucket)

Having now brought the draw probability for each match into the data frame, we need to redistribute the win and loss probabilities so that Pr(win) + Pr(draw) + Pr(loss) sums to exactly 1. We can do this by simply subtracting the draw probability from each of the win and loss probabilities in a proportional manner. See below:

data_with_probabilities <- data_with_probabilities %>% 
  mutate(home_prob = home_prob - home_prob * draw_prob,          # Redistribute home team's probs proportionally to create win/draw/loss probs
         away_prob = away_prob - away_prob * draw_prob)          # Redistribute away team's probs proportionally to create win/draw/loss probs

data_with_probabilities %>% 
  select(home_team_name, away_team_name, home_prob, draw_prob, away_prob) %>% 
  tail(n=10)
## # A tibble: 10 x 5
##    home_team_name away_team_name  home_prob draw_prob away_prob
##    <chr>          <chr>               <dbl>     <dbl>     <dbl>
##  1 Austria        Denmark            0.185     0.281     0.534 
##  2 Moldova        Israel             0.0805    0.133     0.787 
##  3 Scotland       Faroe Islands      0.712     0.181     0.107 
##  4 Andorra        Hungary            0.0140    0.0385    0.948 
##  5 England        Poland             0.574     0.224     0.202 
##  6 San Marino     Albania            0.0177    0.0385    0.944 
##  7 Germany        North Macedonia    0.749     0.168     0.0833
##  8 Liechtenstein  Iceland            0.0607    0.0843    0.855 
##  9 Greece         Georgia            0.504     0.260     0.236 
## 10 Spain          Kosovo             0.906     0.0686    0.0255

And there you have it! We’ve now come up with win, draw and loss probabilities for each match-up in our data set! It might not be the cleanest way of going about it, but it is effective in overcoming the binary assumption imposed by the traditional Elo prediction formula.
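As a sanity check on the redistribution step, the arithmetic for a single match can be worked through by hand. The sketch below uses rounded values for the Austria v Denmark row above, so the figures are approximate.

```r
# Approximate values for the Austria v Denmark row above
p_home_binary <- 0.2578   # Elo win probability for the home side (p.A)
p_draw        <- 0.281    # historic draw rate for the 0.25 bucket

# Scale both binary probabilities down by the share set aside for the draw
p_home <- p_home_binary * (1 - p_draw)
p_away <- (1 - p_home_binary) * (1 - p_draw)

round(c(home = p_home, draw = p_draw, away = p_away), 3)
p_home + p_draw + p_away   # 1, up to floating-point rounding
```

Because the two binary probabilities sum to 1 before scaling, they sum to 1 minus the draw probability afterwards, which is why the three outcomes always total 1.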

Keep in mind that if we were to be focusing on knockout matches (i.e. where no draws are possible), we could have just skipped the previous few steps as we already had binary win-loss probabilities as direct outputs from our Elo model.

Back Testing

The last step in our modelling process is to back test against a subset of our data to get an idea of our model’s accuracy.

We can use the “MLmetrics” package to run a log loss function on our subset.
Let’s look at how the model performed when we limit the data set to include only the most recent matches from early 2021.

matches_2021 <- data_with_probabilities %>% 
  filter(year(date_GMT) == 2021) %>%                      # Filter down to only 2021 matches
  mutate(home_win = ifelse(home_result == 1, 1, 0),       # Include new columns which show the true outcome of the match
         draw = ifelse(home_result == 0.5, 1, 0),
         away_win = ifelse(away_result == 1, 1, 0)) %>% 
  select(date_GMT, home_team_name, away_team_name, home_prob, draw_prob, away_prob, home_win, draw, away_win)


# Run the multinomial log loss function from MLmetrics to output a log loss score for our sample
MultiLogLoss(
  y_pred = matches_2021[,c("home_prob", "draw_prob", "away_prob")] %>% as.matrix(),
  y_true = matches_2021[,c("home_win", "draw", "away_win")] %>% as.matrix()
)
## [1] 0.8920468

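A pretty good result for such a simple Elo model! For context, blindly giving each of the three outcomes a one-in-three chance would score about 1.099 (lower is better).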
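To get a feel for what a log loss number means, it can help to compute one by hand. The sketch below is a base R version of the calculation applied to three made-up matches and a know-nothing forecast. `multi_log_loss()` is a hypothetical helper of my own, not the MLmetrics function.

```r
# Multinomial log loss: average of -log(probability given to the outcome that happened).
# multi_log_loss() is a hypothetical base R helper, not the MLmetrics function.
multi_log_loss <- function(pred, truth, eps = 1e-15) {
  pred <- pmax(pmin(pred, 1 - eps), eps)   # avoid taking log(0)
  -mean(rowSums(truth * log(pred)))
}

# Three made-up matches: columns are home win, draw, away win
truth <- rbind(c(1, 0, 0),
               c(0, 1, 0),
               c(0, 0, 1))

# A "know-nothing" forecast: one third for every outcome
uniform_pred <- matrix(1 / 3, nrow = 3, ncol = 3)

baseline <- multi_log_loss(uniform_pred, truth)
baseline   # equals log(3), about 1.0986
```

Any model worth entering should score below that baseline, and the further below the better.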

Making Future Predictions

Okay, so now our Elo model is set in place, we have the latest set of Elo ratings for each team and we’ve back-tested our model. We can now apply our model to future matches to obtain probabilities for match-ups that are yet to occur.

Again we can do this using the “elo” package. This time we will use the “elo.prob()” function, which takes two teams and outputs the probability of the first team winning the match-up. Like before, this function only considers win/loss outcomes to be possible, so if we also wanted to generate draw probabilities for a future match-up, we could just go through the exact same process as we did previously (i.e. make a binary win/loss prediction, merge in historic draw rates for various probabilities, redistribute accordingly).
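For anyone who does want draw probabilities for a future match, that process could be wrapped up in a small helper along the lines of the sketch below. The draw rates are copied from the draw_rates table earlier in the tutorial, while `three_way_probs()` and the 0.62 win probability are hypothetical and only there for illustration.

```r
# Historic draw rates copied from the draw_rates table above (buckets 0.00 to 0.95)
draw_rate_by_bucket <- c(0.0385, 0.0843, 0.133, 0.221, 0.297, 0.281, 0.293,
                         0.258, 0.271, 0.267, 0.255, 0.242, 0.274, 0.256,
                         0.260, 0.224, 0.195, 0.181, 0.168, 0.0686)

# three_way_probs() is a hypothetical helper: binary win probability in,
# win/draw/loss probabilities out.
three_way_probs <- function(p_win) {
  bucket_index <- round(p_win * 20) + 1        # bucket 0.00 -> 1, 0.05 -> 2, ...
  p_draw <- draw_rate_by_bucket[bucket_index]  # NA if the bucket is not in the table
  c(win  = p_win * (1 - p_draw),
    draw = p_draw,
    loss = (1 - p_win) * (1 - p_draw))
}

probs <- three_way_probs(0.62)   # made-up binary win probability
round(probs, 3)
sum(probs)                       # 1, up to floating-point rounding
```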

Let’s just keep this simple for now though and focus on win and loss probabilities – no draws. We’ll put together a small dataframe of matches and see what our model thinks – we’ve gone for hypothetical match-ups of Brazil v Argentina, England v France, Spain v Germany and an obligatory 2006 World Cup rematch of Australia v Italy (i.e. the biggest daylight robbery in the history of the known world).

future_matches <- data.frame(
  team_a = c("Brazil", "England", "Spain", "Australia"),
  team_b = c("Argentina", "France", "Germany", "Italy"))  %>% 
  mutate(elo_a = final_elos[team_a],
         elo_b = final_elos[team_b],
         team_a_win_prob = elo.prob(elo.A = elo_a,
                                    elo.B = elo_b)
  )

future_matches
##      team_a    team_b    elo_a    elo_b team_a_win_prob
## 1    Brazil Argentina 2042.438 1960.346       0.6159888
## 2   England    France 1981.390 2019.180       0.4458291
## 3     Spain   Germany 2094.241 1846.288       0.8064855
## 4 Australia     Italy 1804.243 1966.257       0.2823918

Image caption: Lucas Neill doesn’t like the fact that the Elo model only gives Australia a 28% chance of beating Italy in a hypothetical rematch.

That’s it! We now have the means to predict the outcome of soccer matches using the Elo model we’ve created.

I’ll be using the code in this tutorial to create probabilistic predictions for every match of the upcoming Euro and Copa América soccer tournaments and submitting them as an entry into Betfair’s Datathon. Here’s to hoping the model doesn’t completely flop!

Conclusions & Areas for Improvement

Elo modelling can be a surprisingly accurate modelling technique given how simple it is to implement. This tutorial only gives the most basic of frameworks from which you are free to build a more intricate model with more detailed inputs and features.

Some things that you might want to consider adding to this Elo model:

  • Home ground advantage
  • Including key match statistics (e.g. shots on target, possession %, etc.)
  • Whether the match was a dead rubber (teams may take the foot off the gas if they don’t need to win to advance to the next stage of a tournament)
  • Selected team line-ups (were key regular players missing through injury/suspension?)
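As a taste of the first idea on that list, home ground advantage is commonly handled by adding a fixed bonus to the home side’s rating before calculating the win probability. The sketch below shows the idea in base R. `elo_expected_home()` is a hypothetical helper, and the 50-point bonus is an arbitrary placeholder rather than a value fitted to this data.

```r
# Expected score with a home-ground bonus added to the home side's rating.
# home_bonus = 50 is an arbitrary placeholder, not a fitted value.
elo_expected_home <- function(home_rating, away_rating, home_bonus = 50) {
  1 / (1 + 10^((away_rating - (home_rating + home_bonus)) / 400))
}

no_bonus   <- elo_expected_home(1500, 1500, home_bonus = 0)   # evenly matched, neutral venue
with_bonus <- elo_expected_home(1500, 1500)                   # same teams, home bonus applied

no_bonus     # 0.5
with_bonus   # a little above 0.5
```

If I recall correctly, the “elo” package provides an adjust() helper for use inside the elo.run() formula that serves this purpose, but check the package documentation before relying on it.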

There are plenty of ways to make a model like this more complex in search of greater accuracy – it’s simply a matter of how far you need to go in order to obtain the data you need to run the model!

If you clicked on this article with an ambition of getting started in the world of sports data modelling, I hope it gave some indication that it doesn’t take a super-genius to build a predictive model – it only takes some scratchy programming skills and a little bit of elbow grease to build one for yourself!

I’m looking forward to seeing how this model fares in the upcoming Datathon – fingers crossed its simplicity doesn’t mean it bombs!