Plotting Trainer, Jockey and Sire Statistics in a Stacked Bar Chart with R

By Phill Clarke on Saturday, June 23rd, 2018

Earlier in the week we looked at how to use a for loop to iterate across rows of a dataframe to calculate statistics in an automated manner. Interesting and useful, but we only looked at one specific set of circumstance; trainer and jockey combinations in Group races. There are many other useful statistics which can be used to examine a race. This article focuses on today’s Diamond Jubilee Stakes at Royal Ascot, extends the one collection of statistics to four and finally plots the outcome in a visual format.

As the code examples for this article now extend to beyond 550 lines, it is not practicle to include all the code in-line with the article text. Therefore, only certain examples will be included in-line with the full R code will be provided at the end of the article.

The initial assumption is that data has been returned from the Smartform database, although some additional field are now returned, specifically trainer_id, jockey_id and sire_name.

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_races.distance_yards,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal,
              historic_runners.trainer_id,
              historic_runners.jockey_id,
              historic_runners.sire_name
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

Previously we created a trainer & jockey function to investigate these specific combinations in Group races. This is now extended to just trainer, just jockey and just sire functions. The trainer function is found below.

# Trainer stats
# Name the function and add some arguments
tr <- function(race_filter = "", price_filter = 1000, trainer){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer id
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_id))


  # Filter by price
  trainer_price_filtered <- dplyr::filter(trainer_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_cumulative <- cumsum(
    ifelse(trainer_price_filtered$finish_position == 1, 
           (trainer_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer$tr_runners <- runners
  trainer$tr_winners <- winners
  trainer$tr_sr <- strike_rate
  trainer$tr_pot <- profit_on_turnover
  trainer$tr_iv <- iv
  trainer$tr_ae <- ae
  trainer$tr_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

Note that in the above code, instead of filtering by trainer_name, we are now filtering by trainer_id. This is due to the fact that sometimes the trainer names in the daily racing data do not exactly match those in the historic data. For example, Sir Michael Stoute hasn’t always been a knight. Therefore, if we were just matching on trainer_name there would be some occassions where this fails and no results are returned. Smartform instead provides a unique identification number for trainers and jockeys, which insures there will always be a match between historic and daily data.

The function above is just one example. In order to produce the charts later in this article, additional jockey and sire functions have been added, bringing the total to four; trainer, jockey, trainer & jockey and sire. The number of statistics could be extended much further to include angles such as trainer & distance, trainer & course, trainer & age (2yo, 3yo, 4yo+ races) and many more.

The for loop also now includes all four of these functions.

# Create placeholder lists which will be required later
row_tr <- list()
row_jc <- list()
row_tj <- list()
row_sr <- list()

# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {


  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer, jockey id and sire names
  trainer <- runner_details$trainer_id
  jockey <- runner_details$jockey_id
  sire <- runner_details$sire_name

  # Apply the Trainer function for Group races only
  trainer_combo <- tr(race_filter = "group", 
                                  trainer = trainer)

  # Add results row by row to the previously defined list
  row_tr[[i]] <- trainer_combo

  # Apply the Jockey function for Group races only
  jockey_combo <- jc(race_filter = "group", 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row_jc[[i]] <- jockey_combo

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, jockey = jockey)

  # Add results row by row to the previously defined list
  row_tj[[i]] <- trainer_jockey_combo

  # Apply the Sire function for Group races only
  sire_combo <- sr(race_filter = "group", 
                             sire = sire)

  # Add results row by row to the previously defined list
  row_sr[[i]] <- sire_combo

  # Create a final dataframe
  stats_final_tr <- as.data.frame(do.call("rbind", row_tr))
  stats_final_jc <- as.data.frame(do.call("rbind", row_jc))
  stats_final_tj <- as.data.frame(do.call("rbind", row_tj))
  stats_final_sr <- as.data.frame(do.call("rbind", row_sr))

}

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final_tr)
racecard <- cbind(racecard,stats_final_jc)
racecard <- cbind(racecard,stats_final_tj)
racecard <- cbind(racecard,stats_final_sr)

Viewing the final racecard now shows forty columns and a wall of data. This isn’t perhaps the easiest way to visualise the overall picture. Instead, we’ll create a stacked barchart showing Impact Values for all four angles. The legend shows tr_iv, jc_iv, tj_iv and sr_iv for the trainer, jockey, trainer & jockey and sire impact values.

# Filter for Diamond Jubilee Only
diamond_jubilee <- dplyr::filter(racecard,
                                 grepl("Diamond Jubilee", 
                                       race_title))

# Filter for just the IV columns which we will plot
racecard_filtered_iv <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv")]

# Convert the racecard from wide to long format
racecard_long_iv <- melt(racecard_filtered_iv, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_iv, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The highest bars indicate the highest cumulative Impact Values. Some bars do not include all four factors, as sometimes there were no results returned. For example, Bound For Nowhere’s sire, The Factor, has only had one Group runner in the UK & Ireland, which did not win. Therefore, there is no data to calculate strike rate, impact value etc.

This means one needs to be careful examining the chart and also take time to ponder the data in the dataframe. Some sample sizes may be very small and a question should be asked if they are statistically relevant. Bound For Nowhere’s Trainer & Jockey Impact Value is the highest in the race, but this is from only six runners. Compared to Merchant Navy with the second highest tj_iv, but from 357 runners.

Nonetheless, a visual method like this can still assist to narrow the field. Harry Angel, the favourite for the race, is certainly not a standout in the chart, with decent sample sizes across all four factors.

The stacked barchart can also be applied to Actual vs Expected figures, strike rates or Confidence figures. The chart below displays stacked A/E for a more value oriented view.

# Filter for just the AE columns which we will plot
racecard_filtered_ae <- diamond_jubilee[,c("name","tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_ae <- melt(racecard_filtered_ae, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_ae, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Once again, keep Bound for Nowhere’s small sample sizes in mind. Harry Angel appears to be a better betting proposition based on this chart.

Another way to look at this data might be as a grouped bar chart, where the IV and A/E figures are plotted for each horse next to each other.

# Filter for just the AE columns which we will plot
racecard_filtered_all <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv", 
                                            "tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_all <- melt(racecard_filtered_all, id.var="name")

# Plot a grouped barchart
ggplot(racecard_long_all, aes(x = name, y = value, fill = variable)) +   
  geom_bar(position = "dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Although Sire IV is missing for Bound for Nowhere, his other figures do all point to a superier trainer and jockey, albeit from small sample size. How to statistically deal with these small sample sizes will be covered in a future article on Bayesian techniques.

Which horses might be included in a shortlist for today’s Diamond Jubilee? Even with small sample sizes, but knowing Wesley Ward’s Ascot success with sprinters, it may be wise to include Bound for Nowhere, who is currently 14.0 on Betfair. Merchant Navy (IV) and Harry Angel (A/E) are both positives, although much shorter priced at the top of the market.

It is important not to just rely on the data. There are many different factors to consider and a good knowledge of general form is also required. Therefore, after all that work, one might still decide just to back the Aussie danger and triple Group 1 sprint winner, Redkirk Warrior.

Good luck!

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the library packages
library("RMySQL")
library("dplyr")
library("reshape2")
library("ggplot2")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                 host='127.0.0.1', 
                 user='yourusername', 
                 password='yourpassword', 
                 dbname='smartform')

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_races.distance_yards,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal,
              historic_runners.trainer_id,
              historic_runners.jockey_id,
              historic_runners.sire_name
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

# Remove non-runners and non-finishers
smartform_results <- dplyr::filter(smartform_results, !is.na(finish_position))

# Select relevant daily results for tomorrow
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_races.distance_yards,
              daily_runners.cloth_number,
              daily_runners.name,
              daily_runners.trainer_name,
              daily_runners.jockey_name,
              daily_runners.sire_name,
              daily_runners.forecast_price_decimal,
              daily_runners.trainer_id,
              daily_runners.jockey_id
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-06-23'", sep="")

smartform_daily_results <- dbGetQuery(con, sql2)

dbDisconnect(con)

# Remove non-runners
smartform_daily_results <- dplyr::filter(smartform_daily_results, !is.na(forecast_price_decimal))

# Trainer stats
# Name the function and add some arguments
tr <- function(race_filter = "", price_filter = 1000, trainer){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_id))


  # Filter by price
  trainer_price_filtered <- dplyr::filter(trainer_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_cumulative <- cumsum(
    ifelse(trainer_price_filtered$finish_position == 1, 
           (trainer_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer$tr_runners <- runners
  trainer$tr_winners <- winners
  trainer$tr_sr <- strike_rate
  trainer$tr_pot <- profit_on_turnover
  trainer$tr_iv <- iv
  trainer$tr_ae <- ae
  trainer$tr_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Jockey stats
# Name the function and add some arguments
jc <- function(race_filter = "", price_filter = 1000, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  jockey_filtered <- dplyr::filter(filtered_races, 
                                    grepl(jockey, jockey_id))


  # Filter by price
  jockey_price_filtered <- dplyr::filter(jockey_filtered,
                                          starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  jockey_cumulative <- cumsum(
    ifelse(jockey_price_filtered$finish_position == 1, 
           (jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  jockey$jc_runners <- runners
  jockey$jc_winners <- winners
  jockey$jc_sr <- strike_rate
  jockey$jc_pot <- profit_on_turnover
  jockey$jc_iv <- iv
  jockey$jc_ae <- ae
  jockey$jc_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Trainer and Jockey stats
# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_id))

  # Remove non-runners
  #  trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

  # Filter by jockey name
  trainer_jockey_filtered <- dplyr::filter(trainer_filtered, 
                                           grepl(jockey, jockey_id))

  # Filter by price
  trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_jockey_cumulative <- cumsum(
    ifelse(trainer_jockey_price_filtered$finish_position == 1, 
           (trainer_jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer_jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer_jockey$tj_runners <- runners
  trainer_jockey$tj_winners <- winners
  trainer_jockey$tj_sr <- strike_rate
  trainer_jockey$tj_pot <- profit_on_turnover
  trainer_jockey$tj_iv <- iv
  trainer_jockey$tj_ae <- ae
  trainer_jockey$tj_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer_jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Sire stats
# Name the function and add some arguments
sr <- function(race_filter = "", price_filter = 1000, sire){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  sire_filtered <- dplyr::filter(filtered_races, 
                                    grepl(sire, sire_name))


  # Filter by price
  sire_price_filtered <- dplyr::filter(sire_filtered,
                                          starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  sire_cumulative <- cumsum(
    ifelse(sire_price_filtered$finish_position == 1, 
           (sire_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(sire_price_filtered,
                                finish_position == 1))

  runners <- nrow(sire_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(sire_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/sire_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  sire <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  sire$sr_runners <- runners
  sire$sr_winners <- winners
  sire$sr_sr <- strike_rate
  sire$sr_pot <- profit_on_turnover
  sire$sr_iv <- iv
  sire$sr_ae <- ae
  sire$sr_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(sire)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Filter tomorrow's races for Group races only
group_races_only <- dplyr::filter(smartform_daily_results,
                                  grepl(paste(c("Group 1", "Group 2", "Group 3"), collapse="|"), race_title))

# Create placeholder lists which will be required later
row_tr <- list()
row_jc <- list()
row_tj <- list()
row_sr <- list()

# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {


  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer and jockey names
  trainer <- runner_details$trainer_id
  jockey <- runner_details$jockey_id
  sire <- runner_details$sire_name

  # Apply the Trainer function for Group races only
  trainer_combo <- tr(race_filter = "group", 
                                  trainer = trainer)

  # Add results row by row to the previously defined list
  row_tr[[i]] <- trainer_combo

  # Apply the Jockey function for Group races only
  jockey_combo <- jc(race_filter = "group", 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row_jc[[i]] <- jockey_combo

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, jockey = jockey)

  # Add results row by row to the previously defined list
  row_tj[[i]] <- trainer_jockey_combo

  # Apply the Sire function for Group races only
  sire_combo <- sr(race_filter = "group", 
                             sire = sire)

  # Add results row by row to the previously defined list
  row_sr[[i]] <- sire_combo

  # Create a final dataframe
  stats_final_tr <- as.data.frame(do.call("rbind", row_tr))
  stats_final_jc <- as.data.frame(do.call("rbind", row_jc))
  stats_final_tj <- as.data.frame(do.call("rbind", row_tj))
  stats_final_sr <- as.data.frame(do.call("rbind", row_sr))

}

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final_tr)
racecard <- cbind(racecard,stats_final_jc)
racecard <- cbind(racecard,stats_final_tj)
racecard <- cbind(racecard,stats_final_sr)

# Filter for Diamond Jubilee Only
diamond_jubilee <- dplyr::filter(racecard,
                                 grepl("Diamond Jubilee", 
                                       race_title))

# Filter for just the IV columns which we will plot
racecard_filtered_iv <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv")]

# Convert the racecard from wide to long format
racecard_long_iv <- melt(racecard_filtered_iv, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_iv, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Filter for just the AE columns which we will plot
racecard_filtered_ae <- diamond_jubilee[,c("name","tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_ae <- melt(racecard_filtered_ae, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_ae, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Filter for just the AE columns which we will plot
racecard_filtered_all <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv", 
                                            "tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_all <- melt(racecard_filtered_all, id.var="name")

# Plot a grouped barchart
ggplot(racecard_long_all, aes(x = name, y = value, fill = variable)) +   
  geom_bar(position = "dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This entry was posted on Saturday, June 23rd, 2018 at 7:22 am. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Plotting Trainer, Jockey and Sire Statistics in a Stacked Bar Chart with R

Leave a comment

Recent Posts

Archives