Calculating Profit and Loss from Historic Data using R

By Phill Clarke on Saturday, May 19th, 2018

Previously we looked at how to calculate the strike rate for jockeys, specifically with relation to those riding in the 1000 Guineas. The result of that race showed how strikes rates are really only part of the overall picture. Sean Levey had the lowest historic strike rate for Group races, of all those peforming in the 1000 Guineas, yet rode a convincing winner. If we had of looked at trainer strike rates, it may have been a different story.

This article will now look at trainer strike rates, but also specifically calculate the profit and loss (P&L) to show what would happen if one had backed every runner for a specific trainer at starting price (SP) in all Group races.

We start very similar to previously, returning historic data from the Smartform database and calculating trainer strike rates.

# Load the RMySQL and dplyr library packages
library("RMySQL")
library("dplyr")

# Execute an SQL command to return some historic data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the historic_races and historic_runners tables, joining by unique race_id and filtering for results only since January 1st, 2006. The SQL query is saved in a variable called sql1.
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2006-01-01'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_results.
smartform_results <- dbGetQuery(con, sql1)

# Close the database connection, which is good practice
dbDisconnect(con)

# Race type IDs 12 and 15 correspond to Flat and All Weather races
flat_races_only <- dplyr::filter(smartform_results,
                                  race_type_id == 12 |
                                   race_type_id == 15)

# Filter using the group_race field for Group 1, 2 and 3 races only
group_races_only <- dplyr::filter(flat_races_only,
                                group_race == 1 |
                                  group_race == 2 |
                                  group_race == 3 )

# For each trainer name, count the number of runs
trainer_group_runs <- group_races_only %>% count(trainer_name)

# Rename the second column in trainer_group_rides to something more logical
names(trainer_group_runs)[2]<-"group_runs"

# Now filter for only winning runs
group_winners_only <- dplyr::filter(group_races_only,
                                    finish_position == 1)

# For each trainer, count the number of winning runs
trainer_group_wins <- group_winners_only %>% count(trainer_name)

# Rename the second column in trainer_group_wins to something more logical
names(trainer_group_wins)[2]<-"group_wins"

# Join the two dataframes, trainer_group_runs and trainer_group_wins together, using trainer_name as a key
trainer_group_data <- dplyr::full_join(trainer_group_runs, trainer_group_wins, by = "trainer_name")


# Rename all the NA fields in the new dataframe to zero
# If a trainer has not had a group winner, the group_wins field will be NA
# If this is not changed to zero, later calculations will fail
trainer_group_data[is.na(trainer_group_data)] <- 0

# Now calculate the Group race strike rate for all trainers
trainer_group_data$strike_rate <- (trainer_group_data$group_wins / trainer_group_data$group_runs) * 100

The variable trainer_group_data now contains the strike rate for every trainer who has had a runner in a flat or all weather Group race since January 1st, 2006.

The Group 1 Lockinge Stakes is run today at Newbury. Therefore, let’s look at this field only for today’s examples.

# Execute an SQL command to return some daily data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the daily_races and daily_runners tables, joining by unique race_id and filtering for results only with today's date. The SQL query is saved in a variable called sql2.  
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.name,
              daily_runners.jockey_name,
              daily_runners.trainer_name
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-05-18'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_daily_results.
smartform_daily_results <- dbGetQuery(con, sql2)

# Close the database connection, which is good practice
dbDisconnect(con)

The variable smartform_daily_results contains information about all races being run today.

Now we filter for just today’s Lockinge Stakes.

# Filter for the Lockinge Stakes only
lockinge_only <- dplyr::filter(smartform_daily_results,
                              grepl("Lockinge", race_title))

The variable lockinge_only now contains just the basic details for today’s Group 1 Lockinge Stakes.

Lastly, dplyr is again used to perform a different type of join, which combines the lockinge_only and trainer_group_data dataframes, using trainer_name as the key.

# Using dplyr, join the lockinge_only and trainer_group_data dataframes
lockinge_only_with_sr <- dplyr::inner_join(lockinge_only, trainer_group_data, by = "trainer_name")

If we now view the dataframe lockinge_only_with_sr we can see that the trainer with the best strike rate in Group races, with runners also in today’s Lockinge Stakes, is Aiden O’Brien.

lockinge_only_with_sr

    course                                race_title meeting_date             name    jockey_name         trainer_name group_rides group_wins strike_rate
1  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19  Lightning Spear   Oisin Murphy          D M Simcock         256         24    9.375000
2  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19          Suedois      D Tudhope            D O'Meara         190         16    8.421053
3  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19           Limato  Harry Bentley              H Candy         103         15   14.563107
4  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19   Librisa Breeze      R Winston            D K Ivory          44          2    4.545455
5  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19       Zonderland        A Kirby              C G Cox         265         30   11.320755
6  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19        Deauville     W M Lordan          A P O'Brien        2347        385   16.403920
7  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19 Accidental Agent Charles Bishop Eve Johnson Houghton          70          1    1.428571
8  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19 Lancaster Bomber  J A Heffernan          A P O'Brien        2347        385   16.403920
9  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19    Beat The Bank    Jim Crowley          A M Balding         367         25    6.811989
10 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19       War Decree      P B Beggy          A P O'Brien        2347        385   16.403920
11 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19     Rhododendron      R L Moore          A P O'Brien        2347        385   16.403920
12 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19 Alexios Komnenos      C D Hayes            J A Stack          19          3   15.789474
13 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19           Lahore     S De Sousa              C G Cox         265         30   11.320755
14 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19          Addeybb    James Doyle           W J Haggas         459         48   10.457516

No real suprises there perhaps. However, what would now happen if we’d placed a one Pound bet on every Aiden O’Brien runner in Group races over the last twelve years. Would we have made a profit?

To begin, we filter all historic Group races, from the data obtained earlier, for just Aiden O’Brien’s horses.

# Filter historic data just for A P O'Brien horses
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Non Runners appears as NA, which need to be removed for a true picture and so calulations do not fail
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

The dataframe obrien_group_races_only now contains the data we need.

The next step is to cumulatively sum all the start prices for winning horses and deduct our one Pound stake where appropriate. The cumsum function is used to achieve this. The ifelse statement below basically says, for all lines in the dataframe where the finishing position is one, use the decimal starting price but also subtract our initial one Pound stake, otherwise (if the finish position isn’t one) simply subtract our one Pound stake from the total.

# Calculate the cumulative total for all O'Brien runners
obrien_cumulative <- cumsum(
  ifelse(obrien_group_races_only$finish_position == 1, (obrien_group_races_only$starting_price_decimal-1),-1)
)

# Add a new column back into obrien_group_races_only, with the cumulative totals
obrien_group_races_only$cumulative <- obrien_cumulative

When viewing the dataframe obrien_group_races_only you will now see a new column included called cumulative. Looking at the last entry for this column in the dataframe, the results is -275.76. Therefore, a single Pound bet on all 2178 Aiden O’Brien Group race runners since 2006 would have resulted in a loss of £275.76. Not really the route to profitable punting.

We can also take this a step further and plot the results for something more visual. The code below uses the gglot2 library again, along with some helper libraries in order to make a prettier chart.

# Load relevant libraries
library("ggplot2")
library("ggthemes")
library("scales")

# Convert the meeting_date column from character format to Date format
obrien_group_races_only$meeting_date <- as.Date(obrien_group_races_only$meeting_date)

# Plot the chart
ggplot(data=obrien_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  scale_y_continuous(breaks = seq(-300, 10, by = 50)) +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

The years 2009 through to the end of 2011 look to be particularly poor years for Aiden O’Brien in Group races, but 2014 onwards has been much more stable.

Today we have shown that just because the trainer may have the best historic strike rate in Group races, backing all of their runners blindly does not necessarily result in a profit. The market obviously knows the A P O’Brien has a very good record in these types of races and his horses are priced accordingly.

An exercise for the reader could be to calculate the P&L for Aiden O’Brien runners when ridden by Ryan Moore. To do this, start by filtering the historic obrien_group_races_only data by R L Moore. Hint: The answer is they are an profitable combination over 375 runs since 2006.

Questions and queries about this article should be posted as a comment below or to the Betwise Q&A board.

The full R code used in this article is found below.

# Load the RMySQL and dplyr library packages
library("RMySQL")
library("dplyr")

# Execute an SQL command to return some historic data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the historic_races and historic_runners tables, joining by unique race_id and filtering for results only since January 1st, 2006. The SQL query is saved in a variable called sql1.
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2006-01-01'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_results.
smartform_results <- dbGetQuery(con, sql1)

# Close the database connection, which is good practice
dbDisconnect(con)

# Race type IDs 12 and 15 correspond to Flat and All Weather races
flat_races_only <- dplyr::filter(smartform_results,
                                  race_type_id == 12 |
                                   race_type_id == 15)

# Filter using the group_race field for Group 1, 2 and 3 races only
group_races_only <- dplyr::filter(flat_races_only,
                                group_race == 1 |
                                  group_race == 2 |
                                  group_race == 3 )

# For each trainer name, count the number of runs
trainer_group_runs % count(trainer_name)

# Rename the second column in trainer_group_rides to something more logical
names(trainer_group_runs)[2]<-"group_runs"

# Now filter for only winning runs
group_winners_only <- dplyr::filter(group_races_only,
                                    finish_position == 1)

# For each trainer, count the number of winning runs
trainer_group_wins % count(trainer_name)

# Rename the second column in trainer_group_wins to something more logical
names(trainer_group_wins)[2]<-"group_wins"

# Join the two dataframes, trainer_group_runs and trainer_group_wins together, using trainer_name as a key
trainer_group_data <- dplyr::full_join(trainer_group_runs, trainer_group_wins, by = "trainer_name")


# Rename all the NA fields in the new dataframe to zero
# If a trainer has not had a group winner, the group_wins field will be NA
# If this is not changed to zero, later calculations will fail
trainer_group_data[is.na(trainer_group_data)] <- 0

# Now calculate the Group race strike rate for all trainers
trainer_group_data$strike_rate <- (trainer_group_data$group_wins / trainer_group_data$group_runs) * 100

# Execute an SQL command to return some daily data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the daily_races and daily_runners tables, joining by unique race_id and filtering for results only with today's date. The SQL query is saved in a variable called sql2.  
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.name,
              daily_runners.jockey_name,
              daily_runners.trainer_name
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-05-18'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_daily_results.
smartform_daily_results <- dbGetQuery(con, sql2)

# Close the database connection, which is good practice
dbDisconnect(con)

# Filter for the Lockinge Stakes only
lockinge_only <- dplyr::filter(smartform_daily_results,
                              grepl("Lockinge", race_title))

# Using dplyr, join the lockinge_only and trainer_group_data dataframes
lockinge_only_with_sr <- dplyr::inner_join(lockinge_only, trainer_group_data, by = "trainer_name")

# Filter historic data just for A P O'Brien horses
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Non Runners appears as NA, which need to be removed for a true picture and so calulations do not fail
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

# Calculate the cumulative total for all O'Brien runners
obrien_cumulative <- cumsum(
  ifelse(obrien_group_races_only$finish_position == 1, (obrien_group_races_only$starting_price_decimal-1),-1)
)

# Add a new column back into obrien_group_races_only, with the cumulative totals
obrien_group_races_only$cumulative <- obrien_cumulative

# Load relevant libraries
library("ggplot2")
library("ggthemes")
library("scales")

# Convert the meeting_date column from character format to Date format
obrien_group_races_only$meeting_date <- as.Date(obrien_group_races_only$meeting_date)

# Plot the chart
ggplot(data=obrien_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  scale_y_continuous(breaks = seq(-300, 10, by = 50)) +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

##################################################################################
# Reader Exercise: Aiden O'Brien and Ryan Moore combined in Group Races
##################################################################################

obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only, 
                                               grepl("R L Moore", jockey_name))

obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only$finish_position == 1, (obrien_moore_group_races_only$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only$cumulative <- obrien_moore_cumulative


obrien_moore_group_races_only$meeting_date <- as.Date(obrien_moore_group_races_only$meeting_date)


ggplot(data=obrien_moore_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

This entry was posted on Saturday, May 19th, 2018 at 9:16 am. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Calculating Profit and Loss from Historic Data using R

Leave a comment

Recent Posts

Archives