Betwise Blog
Betwise news, analysis and automatic betting info

Creating Functions with R – using trainer and jockey combinations

By Phill Clarke on Monday, June 18th, 2018

In the previous article we looked at how to calculate some useful figures such as profit on turnover, impact values, actual vs expected, Archie and a confidence figure regarding how much luck was involved in the returned figures. The code demonstrated works well, but can be cumbersome to easily alter items such as trainer, jockey or price filters.

Fortunately the R language supports creation of user defined functions. A function is essentially a wrapper around a set of code routines, which are executed when the function is called at a later time. This makes it very easy to re-run the same code multiple times, using specific arguments to alter the results. Sounds difficult? It’s not really. Much of the R code we’re already familiar with, such as dplyr::filter are functions themselves. If you ever find yourself writing the same R code snippet more than three times in a larger program or script, think about how to create a function instead.

The goal of the function described in this article is to return a set of results for a specific trainer and jockey combination, with some additional argument options added.

The code examples again assume data has already been returned from the Smartform database and is contained in a variable called smartform_results. Also assumed is that part of the initial MySQL query was to limit results to those only since January, 1st, 2016. Full R code will be provided at the end of this article.

The first step is to define the function name and arguments.

# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

The function is now named tj for Trainer and Jockey. The function has four defined arguments. A race filter, a price filter and arguments for trainer and jockey. If a value is not defined for a function argument, the user must include and define the argument when calling the function. This is the case for trainer and jockey above. However, default values for arguments may be specified. In the above code the race_filter has an empty default value, for all races, and the price_filter is defined as 1000, which is the maximum possible price on Betfair, thus also including all possible prices when applied as a less than or equal to fitler.

Now, we begin the remainder of our function, which is essentially the same code as previously, with some additional changes to account for the function arguments.

# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)

# Add an if/else statement for the race_filter argument
if (race_filter == "group"){

filtered_races <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )
} else {

  filtered_races = flat_races_only
}

The if else statement above is another new concept. It states that if the race_filter equals the word group apply one set of code, otherwise (else) run a different set of code. In the case of the current function, only one race_filter is supported, that is filter by Group races only or return results from all races. Additional race filters, such as class or age perhaps, could also be added to the function through additional else options.

The next block of code should be largely familiar from the previous article.

# Filter by trainer name
trainer_filtered <- dplyr::filter(filtered_races, 
                                         grepl(trainer, trainer_name))

# Remove non-runners
trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

# Filter by jockey name
trainer_jockey_filtered <- dplyr::filter(trainer_filtered, 
                                               grepl(jockey, jockey_name))

# Filter by price
trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                               starting_price_decimal <= price_filter)

The above lines now filter by the values provided in the arguments trainer, jockey and price_filter. If values for trainer and jockey are not provided by the user, and because no defaults were supplied, the function will fail. Also, if an incorrect name, which does not match values in the dataset, is supplied the function will also fail. There is no error checking provided in this example code. The price_filter was provided with a default value of 1000 and therefore if the user does not define it, the function will return all values equal to or less than 1000.

The next blocks of code are once again very similar to that used previously when calculating the statistics we’re interested in.

#  Calculate Profit and Loss
trainer_jockey_cumulative <- cumsum(
  ifelse(trainer_jockey_price_filtered$finish_position == 1, 
         (trainer_jockey_price_filtered$starting_price_decimal-1),
         -1)
)

# Calculate Strike Rate
winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                           finish_position == 1))

runners <- nrow(trainer_jockey_price_filtered)


strike_rate <- (winners / runners) * 100

# Calculate Profit on Turnover or Yield
profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100


# Calculate Impact Values
# First filter all runners by price, to return those just starting at the price_filter or less
all_runners <- nrow(dplyr::filter(filtered_races,
                                        starting_price_decimal <= price_filter))

# Filter all winners by the price filter 
all_winners <- nrow(dplyr::filter(filtered_races,
                                        finish_position == 1 &
                                          starting_price_decimal <= price_filter))

# Now calculate the Impact Value
iv <- (winners / all_winners) / (runners / all_runners)

# Calculate Actual vs Expected ratio
# # Convert all decimal odds to probabilities
total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

# Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
ae <- winners / total_sp

# Calculate Archie
archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100

That covers all the calculations. Now we return the results in a nice dataframe.

# Create an empty variable
trainer_jockey <- NULL

# Add all calculated figures as named objects to the variable, which creates a list
trainer_jockey$runners <- runners
trainer_jockey$winners <- winners
trainer_jockey$sr <- strike_rate
trainer_jockey$pot <- profit_on_turnover
trainer_jockey$iv <- iv
trainer_jockey$ae <- ae
trainer_jockey$conf <- conf

# Convert and return as a dataframe
as.data.frame(trainer_jockey)

# Finally, close the function
}

The last line here is very important and should not be forgotten. The curly bracket was used to start the function at the beginning, and therefore the matching closing curly bracket must be used at the end.

We now have a trainer/jockey function defined. How do we use it? Simply call the function, with defined arguments. Using the previous filters of Aiden O’Brien trained runners, ridden by Ryan Moore, in Group races and starting at a price of 4.0 or less, we do the following:

# Run the function with arguments and store in a results object
results <- tj(race_filter = "group", 
              price_filter = 4.0, 
              trainer = "A P O'Brien", 
              jockey = "R L Moore")

# Show results
results

  runners winners       sr      pot       iv       ae    conf
1     137      66 48.17518 13.62774 1.279968 1.069609 53.9223

This matches the results previously obtained when running through the code manually. Tomorrow in the Group 1 Queen Anne Stakes at Royal Ascot, Rhododendron is trained by Aiden O’Brien, ridden by Ryan Moore and is currently 3/1, thus matching the filters used in this function.

Now the function is defined, it is easy to start looking at alternative filter sets, without having to manually adjust any code. Some examples are outlined below:

# No price filter, which works because a default of 1000 was defined in the function
results_no_price <- tj(race_filter = "group", 
                       trainer = "A P O'Brien", 
                       jockey = "R L Moore")

  runners winners       sr      pot       iv       ae     conf
1     246      78 31.70732 -3.04878 2.825292 1.004775 4.056167

# All races, not just Group, with a price filter of 4.0
results_all_races <- tj(price_filter = 4.0, 
                        trainer = "A P O'Brien", 
                        jockey = "R L Moore")

  runners winners       sr      pot       iv        ae     conf
1     227     100 44.05286 2.718062 1.343485 0.9936154 6.841419

# All races and no price filter for this trainer and jockey combination
results_all_races_no_price <- tj(trainer = "A P O'Brien", 
                                 jockey = "R L Moore")

  runners winners       sr       pot       iv        ae     conf
1     387     119 30.74935 -8.183463 2.919438 0.9519335 48.63522

Also keep in mind these results are only since January 1st, 2016 as this filter was previously defined in the original SQL query. Hopefully, it should be reasonably clear how to add a date argument to extend this function.

Finally, if we wanted to investigate alternative trainer and jockey combinations, this is also quite easy now the function is already defined.

# David Simcock and Oisin Murphy together in Group races
simcock_murphy <- tj(race_filter = "group", 
                    trainer = "D M Simcock", 
                    jockey = "Oisin Murphy")

  runners winners      sr       pot        iv        ae     conf
1      21       2 9.52381 -66.66667 0.8486226 0.9350056 7.989577

David Simcock and Oisin Murphy team up with Lightning Spear, also in the Queen Anne at Ascot.

Good luck at the big meeting tomorrow!

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the RMySQL library package
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2016-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

dbDisconnect(con)

# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)

# Add an if else statement for the race_filter argument
if (race_filter == "group"){

filtered_races <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )
} else {

  filtered_races = flat_races_only
}

# Filter by trainer name
trainer_filtered <- dplyr::filter(filtered_races, 
                                         grepl(trainer, trainer_name))

# Remove non-runners
trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

# Filter by jockey name
trainer_jockey_filtered <- dplyr::filter(trainer_filtered, 
                                               grepl(jockey, jockey_name))

# Filter by price
trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                               starting_price_decimal <= price_filter)

#  Calculate Profit and Loss
trainer_jockey_cumulative <- cumsum(
  ifelse(trainer_jockey_price_filtered$finish_position == 1, 
         (trainer_jockey_price_filtered$starting_price_decimal-1),
         -1)
)

# Calculate Strike Rate
winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                           finish_position == 1))

runners <- nrow(trainer_jockey_price_filtered)

strike_rate <- (winners / runners) * 100

# Calculate Profit on Turnover or Yield
profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

# Calculate Impact Values
# First filter all runners by price, to return those just starting at the price_filter or less
all_runners <- nrow(dplyr::filter(filtered_races,
                                        starting_price_decimal <= price_filter))

# Filter all winners by the price filter 
all_winners <- nrow(dplyr::filter(filtered_races,
                                        finish_position == 1 &
                                          starting_price_decimal <= price_filter))

# Now calculate the Impact Value
iv <- (winners / all_winners) / (runners / all_runners)

# Calculate Actual vs Expected ratio
# # Convert all decimal odds to probabilities
total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

# Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
ae <- winners / total_sp

# Calculate Archie
archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100

# Create an empty variable
trainer_jockey <- NULL

# Add all calculated figures as named objects to the variable, which creates a list
trainer_jockey$runners <- runners
trainer_jockey$winners <- winners
trainer_jockey$sr <- strike_rate
trainer_jockey$pot <- profit_on_turnover
trainer_jockey$iv <- iv
trainer_jockey$ae <- ae
trainer_jockey$conf <- conf

# Convert and return as a dataframe
as.data.frame(trainer_jockey)

# Finally, close the function
}

Leave a comment