Further Calculations using R to analyse the performance of jockeys and trainers
By Phill Clarke on Friday, June 1st, 2018Last week we explored some visual ways in which to analyse performance. This included representations such as line charts, scatterplots and regression fits. This week we examine some mathematical and statistical approaches to expand on the simple calculations of strike rate and profit & loss.
A positive profit & loss is obviously critical in making money from a series of bets. A negative profit & loss largely renders everything else irrelevant. However, a simple positive profit & loss calculation still does not expose the full story. Imagine a positive profit of 20 points, from a series of 1000 bets. Is that a good or bad performance? A calculation of Profit on Turnover (POT) or Yield can assist.
Again, we’ll use the combination of Aiden O’Brien and Ryan Moore, in Group Races only, since the beginning of 2016, with a filter of less than or equal to 4.00 starting price, as the example dataset. The code examples again assume data has already been returned from the Smartform database and is contained in a variable called smartform_results
. Also assumed is that part of the initial MySQL query was to limit results to those only since January, 1st, 2016. Full R code will be provided at the end of this article.
# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
race_type_id == 12 |
race_type_id == 15)
# Filter for Group races only
group_races_only <- dplyr::filter(flat_races_only,
group_race == 1 |
group_race == 2 |
group_race == 3 )
# Filter for Aiden O'Brien runners only
obrien_group_races_only <- dplyr::filter(group_races_only,
grepl("A P O'Brien", trainer_name))
# Remove non-runners
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))
# Filter for Ryan Moore rides only
obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only,
grepl("R L Moore", jockey_name))
# Filter for Starting Prices of 4.00 or less
obrien_moore_group_races_only_price_filter <- dplyr::filter(obrien_moore_group_races_only,
starting_price_decimal <= 4.0)
# # Calculate Profit and Loss
obrien_moore_cumulative <- cumsum(
ifelse(obrien_moore_group_races_only_price_filter$finish_position == 1, (obrien_moore_group_races_only_price_filter$starting_price_decimal-1),-1)
)
obrien_moore_group_races_only_price_filter$cumulative <- obrien_moore_cumulative
# Calculate Strike Rate
obrien_moore_winners <- nrow(dplyr::filter(obrien_moore_group_races_only_price_filter,
finish_position == 1))
obrien_moore_runners <- nrow(obrien_moore_group_races_only_price_filter)
strike_rate <- (obrien_moore_winners / obrien_moore_runners) * 100
Detailed explanations of the above calculations have been provided in previous articles. The P&L for the combination examined now stands at 19.67 points, with a Strike Rate of 48.52%.
How good is this in the context of the total amount wagered? Next, calculate Profit on Turnover (POT) or Yield. This is simply the total cumulative profit, divided by the total number of runners and multiplied by 100 to return a percentage.
# Calculate POT
profit_on_turnover <- (tail(obrien_moore_cumulative, n=1) / obrien_moore_runners) * 100
This returns a POT of 14.46%, which is a pretty reasonable figure. A return of almost 15% on an investment would certainly keep many people very happy.
Moving on, there are some other calculations which can assist with providing additional clarity to the overall picture.
The first of these is Impact Value (IV). This measure helps to assertain whether a specific combination of factors returns winners at a higher rate than the rate of winners which did not meet the specific criteria being examined. In our case, we are looking at the performance of Aiden O’Brien trained and Ryan Moore ridden Group runners, with a price filter, versus those Group runners, with the same price filter, who were not trained by Aiden O’Brien and ridden by Ryan Moore. It is worth keeping in mind that IVs are only an indication of the rate of winners, and does not take price into account.
In order to calculate Impact Values, the ratio of the filtered winners to all runners is divided by the ratio of filtered runners to all runners. In our specific case, this is the ratio of O’Brien and Moore winners to all Group winners, starting at a price equal to or less than 4.00, divided by the ratio of O’Brien and Moore runners to all Group Runners, with the same price filter applied.
An Impact Value of greater than 1.0 indicates that the filtered angle in question is outperforming runners in the entire dataset who did meet our filtering criteria.
# Calculate Impact Values
# First filter all Group runners by price, to return those just starting at 4.00 or less
all_group_runners <- nrow(dplyr::filter(group_races_only,
starting_price_decimal <= 4.0))
# Filter all Group winners by the 4.00 price limit
all_group_winners <- nrow(dplyr::filter(group_races_only,
finish_position == 1 &
starting_price_decimal <= 4.0))
# Now calculate the Impact Value
iv <- (obrien_moore_winners / all_group_winners) / (obrien_moore_runners / all_group_runners)
An IV of 1.29 is returned. This is a very healthy result and indicates that the O’Brien and Moore combination generally returns a higher ratio of winners than that of all Group runners, also filtered by the 4.00 or less price.
Finding winners is one thing, making a profit may be something entirely different. Therefore, there is a futher calculation to examine. Now we’ll look at the ratio of Actual vs Expected (A/E) winners, based on probabilties calculated from starting price. This figure will help to inform whether the filtered combination is outperforming market expectations, based on Starting Price probabilities.
A/E is calculated by dividing the number of winners from the filtered dataset, by the sum of all win probabilities. Decimal starting prices first need to be converted to probabilities. Individual starting prices may be converted to probabilities by expressing as a decimal, a fraction with 1 as the numerator and the decimal starting price as the denominator. i.e. 1/Decimal SP or 1/4.0 = 0.25.
# Calculated Actual vs Expected ratio
# Convert all decimal odds to probabilities
total_sp <- sum(1/obrien_moore_group_races_only_price_filter$starting_price_decimal)
# Calculate A/E by dividing the number of all O'Brien and Moore winners, but the sum of all SP probabilities.
obrien_moore_ae <- obrien_moore_winners / total_sp
An A/E figure of 1.07 is returned. Once again, any figure above 1.00 should be viewed positively. If the return is greater than 1.00 it essentially means the filtered dataset is outperforming the market’s expectations. Or, to put it another way, a figure above 1.00 means that the filtered selections win more often than their probabilities (odds) indicate they should. Using a combination like Aiden O’Brien and Ryan Moore it may be somewhat surprising that they perform in excess of market expectations, especially given the regression line seen in last week’s scatterchart, but the A/E figure shows this to be true.
Finally, the last computation included in today’s article is known as Archie. If you are statistically minded, Archie is a variation on the Chi Squared goodness of fit calculation. However, in layman’s terms, Archie provides a figure showing how much the results obtained from a set of data can be attributed to luck. The higher the Archie figure, the less likely the results are down to pure luck. The larger the Archie figure, the better. As a rough guide, an Archie figure of 1.00 still means there is a 32% likelihood the results obtained were due to chance.
I also like to convert the Archie result to a percentage Confidence figure. In this case also, the higher the Confidence figure, the less likely the results are only down to luck.
# Calculate Archie
archie <- (obrien_moore_runners * (obrien_moore_winners - total_sp)^2)/ (total_sp * (obrien_moore_runners - total_sp))
# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100
In the case of Group race runners, starting at a price of 4.00 or less, trained by Aiden O’Brien and ridden by Ryan Moore, the Confidence figure is almost 60%. This is reasonable, but not particularly strong.
It is worth keeping in mind that Archie and Confidence figures can be seen to change rapidly if the sample data size is only small. Our filtered dataset in this case contains only 136 runners, which is not particularly large.
Further details concerning all these calculations – POT/Yield, IV, A/E and Archie/Confidence – can always be found through Internet searches using popular search engines.
Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.
The full R code used in this article is found below.
# Load the RMySQL library package
library("RMySQL")
library("dplyr")
# Connect to the Smartform database. Substitute the placeholder credentials for your own.
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(),
host='127.0.0.1',
user='yourusername',
password='yourpassword',
dbname='smartform')
sql1 <- paste("SELECT historic_races.course,
historic_races.meeting_date,
historic_races.conditions,
historic_races.group_race,
historic_races.race_type_id,
historic_races.race_type,
historic_runners.name,
historic_runners.jockey_name,
historic_runners.trainer_name,
historic_runners.finish_position,
historic_runners.starting_price_decimal
FROM smartform.historic_runners
JOIN smartform.historic_races USING (race_id)
WHERE historic_races.meeting_date >= '2016-01-01'", sep="")
smartform_results <- dbGetQuery(con, sql1)
dbDisconnect(con)
# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
race_type_id == 12 |
race_type_id == 15)
# Filter for Group races only
group_races_only <- dplyr::filter(flat_races_only,
group_race == 1 |
group_race == 2 |
group_race == 3 )
# Filter for Aiden O'Brien runners only
obrien_group_races_only <- dplyr::filter(group_races_only,
grepl("A P O'Brien", trainer_name))
# Remove non-runners
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))
# Filter for Ryan Moore rides only
obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only,
grepl("R L Moore", jockey_name))
# Filter for Starting Prices of 4.00 or less
obrien_moore_group_races_only_price_filter <- dplyr::filter(obrien_moore_group_races_only,
starting_price_decimal <= 4.0)
# # Calculate Profit and Loss
obrien_moore_cumulative <- cumsum(
ifelse(obrien_moore_group_races_only_price_filter$finish_position == 1, (obrien_moore_group_races_only_price_filter$starting_price_decimal-1),-1)
)
obrien_moore_group_races_only_price_filter$cumulative <- obrien_moore_cumulative
# Calculate Strike Rate
obrien_moore_winners <- nrow(dplyr::filter(obrien_moore_group_races_only_price_filter,
finish_position == 1))
obrien_moore_runners <- nrow(obrien_moore_group_races_only_price_filter)
strike_rate <- (obrien_moore_winners / obrien_moore_runners) * 100
# Calculate POT
profit_on_turnover <- (tail(obrien_moore_cumulative, n=1) / obrien_moore_runners) * 100
# Calculate Impact Values
# First filter all Group runners by price, to return those just starting at 4.00 or less
all_group_runners <- nrow(dplyr::filter(group_races_only,
starting_price_decimal <= 4.0))
# Filter all Group winners by the 4.00 price limit
all_group_winners <- nrow(dplyr::filter(group_races_only,
finish_position == 1 &
starting_price_decimal <= 4.0))
# Now calculate the Impact Value
iv <- (obrien_moore_winners / all_group_winners) / (obrien_moore_runners / all_group_runners)
# Calculated Actual vs Expected ratio
# Convert all decimal odds to probabilities
total_sp <- sum(1/obrien_moore_group_races_only_price_filter$starting_price_decimal)
# Calculate A/E by dividing all O'Brien and Moore winners, but the sum of all SP probabilities.
obrien_moore_ae <- obrien_moore_winners / total_sp
# Calculate Archie
archie <- (obrien_moore_runners * (obrien_moore_winners - total_sp)^2)/ (total_sp * (obrien_moore_runners - total_sp))
# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100