Loops With R – Creating a Racecard with Trainer and Jockey Stats

By Phill Clarke on Tuesday, June 19th, 2018

Yesterday we looked at how to create a function in order to easily run the same set of code multiple times, without having to manually edit the code every time. While this is a highly useful concept to understand, we’re still left with manually applying the trainer and jockey combinations for each runner. It remains a time consuming task. Therefore, this article covers creating a basic for loop to iterate over the rows of a dataframe and apply a function to each of them.

There are many ways to loop over the rows of a dataframe, list or matrix in R. Some methods are more efficient than others, while some are perhaps more logical than others. Specifically this article demonstrates how to apply a for loop, which rightly receives some criticism for being slow to execute in certain circumstances. For our purposes, with a limited number of rows, this will not be a problem. However, the reader should investigate R’s apply family of functions and also the map function in the purrr library.

The full R code will be provided within the article, as there have been some useful changes made to code used previously. The complete code will also, as usual, be provided at the end of this article.

Begin by returning historic racing results, this time since 2013 for a full five year dataset and also from tomorrow’s race card:

# Load the library packages
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                 host='127.0.0.1', 
                 user='yourusername', 
                 password='yourpassword', 
                 dbname='smartform')

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

# Select relevant daily results for tomorrow
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.cloth_number,
              daily_runners.name,
              daily_runners.trainer_name,
              daily_runners.jockey_name,
              daily_runners.forecast_price_decimal
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-06-20'", sep="")

smartform_daily_results <- dbGetQuery(con, sql2)

dbDisconnect(con)

Next is the Trainer/Jockey function explained yesterday. However, the code will be broken into a few sections, as there have been some changes incorporated.

The function, as detailed previously:


# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_name))

  # Remove non-runners
  trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

  # Filter by jockey name
  trainer_jockey_filtered <- dplyr::filter(trainer_filtered, 
                                           grepl(jockey, jockey_name))

  # Filter by price
  trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_jockey_cumulative <- cumsum(
    ifelse(trainer_jockey_price_filtered$finish_position == 1, 
           (trainer_jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0

The last line above is new. This line is being used to catch any instances where the profit on turnover figure is of zero length. That means that the calculation has not been successful, usually because there were no runners for the combination of trainer and jockey.

Continuing with the function:


  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer_jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer_jockey$tj_runners <- runners
  trainer_jockey$tj_winners <- winners
  trainer_jockey$tj_sr <- strike_rate
  trainer_jockey$tj_pot <- profit_on_turnover
  trainer_jockey$tj_iv <- iv
  trainer_jockey$tj_ae <- ae
  trainer_jockey$tj_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer_jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

Once again, there are some new lines in the final part of the function above. The results are checked for NaN values, which again occur if a calculation has failed. It is not possible, for example, to calculate strike rate if there are no runners for the trainer and jockey combination. Error checking such as this will take some time to implement, but does save a lot of headaches later.

The results are then transformed, with t from a long to wide dataframe, the rownames are removed and all results rounded to two decimal places.

Now, we move on to the new section for this article. First, filter the current daily races for Group races only and also create an empty placeholder list.

# Filter tomorrow's races for Group races only
group_races_only <- dplyr::filter(smartform_daily_results,
                                  grepl(paste(c("Group 1", "Group 2", "Group 3"), collapse="|"), race_title))

# Create a placeholder list which will be required later
row <- list()

Then start the for loop, which essentially says for every value of name, which is the column containing the horse’s name, apply the code which follows. This code includes extracing the trainer and jockey names, then executing the function defined earlier. Lastly, the data is iteratively added to the emtpy list and converted to a dataframe.


# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {

  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer and jockey names
  trainer = runner_details$trainer_name
  jockey = runner_details$jockey_name

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row[[i]] <- trainer_jockey_combo

  # Create a final dataframe
  stats_final <- as.data.frame(do.call("rbind", row)) 

}

As a final piece of code, we bind the new data and the general racing data from the Smartform database in a new variable called racecard. Viewing this racecard will now display the strike rate, profit on turnover, impact value, actual vs expected and confidence figure for every jockey and trainer combination in Group races on tomorrow’s race card. They are of course all at Royal Ascot.

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final)

This data can now be reviewed for interesting angles. The screenshot below displays the data ordered descending from the highest A/E value. The first thing to notice about the top three entries is the very small sample size, with only three of four runs. However, the fourth entry for Cox and Kirby does have a robust sample size and some very good figures. Shades of Blue in the Queen Mary Stakes tomorrow at Royal Ascot is certainly worth a closer look. As are the Gosden and Dettori trio of Stream of Stars, Cracksman and Purser. This combination already struck three times today with Calyx, Without Parole and Monarch’s Glen.

Good luck!

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the library packages
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                 host='127.0.0.1', 
                 user='yourusername', 
                 password='yourpassword', 
                 dbname='smartform')

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

# Select relevant daily results for tomorrow
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.cloth_number,
              daily_runners.name,
              daily_runners.trainer_name,
              daily_runners.jockey_name,
              daily_runners.forecast_price_decimal
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-06-20'", sep="")

smartform_daily_results <- dbGetQuery(con, sql2)

dbDisconnect(con)

# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_name))

  # Remove non-runners
  trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

  # Filter by jockey name
  trainer_jockey_filtered <- dplyr::filter(trainer_filtered, 
                                           grepl(jockey, jockey_name))

  # Filter by price
  trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_jockey_cumulative <- cumsum(
    ifelse(trainer_jockey_price_filtered$finish_position == 1, 
           (trainer_jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer_jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer_jockey$tj_runners <- runners
  trainer_jockey$tj_winners <- winners
  trainer_jockey$tj_sr <- strike_rate
  trainer_jockey$tj_pot <- profit_on_turnover
  trainer_jockey$tj_iv <- iv
  trainer_jockey$tj_ae <- ae
  trainer_jockey$tj_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer_jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Filter tomorrow's races for Group races only
group_races_only <- dplyr::filter(smartform_daily_results,
                                  grepl(paste(c("Group 1", "Group 2", "Group 3"), collapse="|"), race_title))

# Create a placeholder list which will be required later
row <- list()

# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {


  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer and jockey names
  trainer = runner_details$trainer_name
  jockey = runner_details$jockey_name

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row[[i]] <- trainer_jockey_combo

  # Create a final dataframe
  stats_final <- as.data.frame(do.call("rbind", row)) 

}

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final)

This entry was posted on Tuesday, June 19th, 2018 at 7:46 pm. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Loops With R – Creating a Racecard with Trainer and Jockey Stats

Leave a comment

Recent Posts

Archives