Bookmaker vs Betfair Odds Comparison

By Tom Bardrick on Monday, July 8th, 2019

This blog will look into:

How to process the output of a query from the SmartForm database into a dataframe
An example of how to carry out analysis on this data, in particular looking at the relationship between forecast prices and starting prices, and investigating how overround varies for bookmakers’ odds compared to odds offered by Betfair.

Setting up Connection to Database
Connecting to MySQL server (as shown in the previous article below).

import pymysql
connection = pymysql.connect(host='localhost', user='root', passwd = '*****', database = 'smartform')
cursor = connection.cursor()

Querying the Database
Running a query to merge ‘historic_runners’ table with the ‘historic_races’ table. This will be done using an inner join to connect the runners data with the races data…

SELECT the race name and date (from historic_races) and all runner names, the bookies’ forecast price and their starting price (from historic_runners)
Using an INNER JOIN to merge the ‘historic_races’ table with the ‘historic_runners’ table, joining on ‘race_id’
Using a WHERE clause (with CAST AS because of the column's data type) to only search for races in 2018 AND only return flat races

query = ''' SELECT historic_races.race_name, historic_races.meeting_date, historic_runners.name, historic_runners.forecast_price_decimal, historic_runners.starting_price_decimal
            FROM historic_races
            INNER JOIN historic_runners ON historic_races.race_id = historic_runners.race_id
            WHERE (CAST(historic_races.meeting_date AS Datetime) BETWEEN '2018-01-01' AND '2018-12-31')
                   AND
                  (historic_races.race_type = 'Flat')
                  
        '''
cursor.execute(query)
rows = cursor.fetchall()

Converting Query Result to a Dataframe
Converting the query results (a tuple of tuples) into a pandas dataframe and printing the first 10 rows to check that the conversion was carried out as expected. The pandas package needs to be installed in order to do this, which can be done by running ‘pip install pandas’ in the same way that ‘pymysql’ was installed in the previous article below.

import pandas as pd
# for convenience of future queries the following code enables any SELECT query and conversion to dataframe without direct input of column names
start = query.find('SELECT') + 7
end = query.find('\n            FROM', start)
names = query[start:end].split(', ')

df = pd.DataFrame(list(rows), columns=names)
df.head(10)
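As an alternative to slicing the query string, the column names can usually be read from the cursor itself, since the Python DB-API exposes them through `cursor.description` once a query has been executed. The sketch below uses an in-memory SQLite table as a stand-in for Smartform so that it runs without a MySQL server; the table and its single row are invented for illustration only.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL connection (assumption:
# illustration only; with Smartform the existing pymysql cursor would be used).
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE runners (name TEXT, forecast_price_decimal REAL)")
cursor.execute("INSERT INTO runners VALUES ('Example Horse', 5.5)")

cursor.execute("SELECT name, forecast_price_decimal FROM runners")
rows = cursor.fetchall()

# Each entry of cursor.description is a tuple whose first item is the column name
names = [column[0] for column in cursor.description]
print(names)
connection.close()
```

The resulting list could be passed as `columns=names` to `pd.DataFrame`. Names obtained this way may differ from the sliced strings (for example, they may lack the table prefix), so any later column references would need to match.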

If wanting to output the dataframe into excel/csv at any time, this can be done with either of the following commands:

df.to_excel("output.xlsx")

df.to_csv("output.csv")

DataFrame Pre-processing
Checking the dimensions of the dataframe

print('DataFrame Rows: ', df.shape[0], '\nDataFrame Columns: ', df.shape[1])

DataFrame Rows:  50643 
DataFrame Columns:  5

Checking missing values of the dataframe

print('Number of missing values in each column: ', df.isna().sum())

Number of missing values in each column:  race_name            0
race_date            0
runner_name          0
forecast_price    1037
starting_price    4154
dtype: int64

Keeping rows with only non-missing values and checking this has worked

df = df[pd.notnull(df['historic_runners.forecast_price_decimal'])]
df = df[pd.notnull(df['historic_runners.starting_price_decimal'])]
df.isna().sum()

race_name         0
race_date         0
runner_name       0
forecast_price    0
starting_price    0
dtype: int64

Checking the new dimensions of the dataframe

print('DataFrame Rows: ', df.shape[0], '\nDataFrame Columns: ', df.shape[1])

DataFrame Rows:  46374 
DataFrame Columns:  5

Approximately 4,300 rows lost to missing data (50,643 - 46,374 = 4,269)

Producing a Distribution Plot
Seaborn is a statistical data visualisation package and can be installed in the same way ‘pandas’ and ‘pymysql’ were installed. With this package, different types of plots can be produced by simply passing in our data. (matplotlib can also be installed to adjust the plot size, titles, axes etc.)

The following code produces distribution plots:
* The distribution of the forecast price for all runners
* The distribution of the starting price for all runners
This allows us to have a visual understanding of the most common forecasted and starting prices and how these prices are distributed.

import seaborn as sns; sns.set(style="white", color_codes=True)
import matplotlib.pyplot as plt
%matplotlib inline 

# '%matplotlib inline' this line enables plots to be shown inside of the jupyter environment
plt.figure(figsize = (12,6))
sns.distplot(df['historic_runners.forecast_price_decimal'], kde = True) # distribution plot for forecasted prices

plt.figure(figsize = (12,6))
sns.distplot(df['historic_runners.starting_price_decimal'])  # distribution plot for starting prices

From the distribution plots it is very clear that the data is skewed by the very large prices of some outsiders. As the aim is to investigate how the majority of starting prices relate to forecast prices, these outsiders will be removed and the analysis will focus only on horses priced at 16/1 or shorter (17.0 in decimal odds).

Having observed horse racing markets for a while, it appears that many outsiders’ prices are very sensitive to market forces and can change between 66/1 and 200/1 with only a little market pressure. These data points have therefore been removed for this analysis.

# creating new dataframe with prices <= 16/1
df_new = df.loc[((df['historic_runners.forecast_price_decimal'] <= 17.0) & (df['historic_runners.starting_price_decimal'] <= 17.0))]
sns.distplot(df_new['historic_runners.forecast_price_decimal']) # new distribution plot of forecasted prices
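The filter above uses 17.0 because the database columns hold decimal odds, which include the returned stake, so 16/1 becomes 17.0. A minimal sketch of the conversion follows; the helper name `fractional_to_decimal` is made up for this example and is not part of Smartform.

```python
from fractions import Fraction

def fractional_to_decimal(fractional_odds):
    """Convert fractional odds such as '16/1' to decimal odds (stake included)."""
    return float(Fraction(fractional_odds) + 1)

for odds in ["6/4", "4/1", "16/1"]:
    print(odds, "->", fractional_to_decimal(odds))
# 6/4 -> 2.5
# 4/1 -> 5.0
# 16/1 -> 17.0
```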

From the data it can be seen that the prices appear discrete in some places yet continuous in others. This is likely due to the traditional way bookmakers formulate odds, favouring some odds (e.g. 16/1) over others (e.g. 19/1).
Also, the data looks far less skewed after the removal of large outsiders.

Producing a Scatter Plot
In order to have a look at how these variables may relate to one another, a scatter plot is constructed to plot both distributions against one another.

sns.jointplot(x="historic_runners.forecast_price_decimal", y="historic_runners.starting_price_decimal", data=df_new) # plotting forecasted price against starting price

As seen, these variables appear to have a moderate to strong positive linear correlation, with a Pearson correlation coefficient of 0.63.
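Depending on the seaborn version, the joint plot may not print the coefficient itself, in which case it can be computed separately. The sketch below implements the Pearson formula by hand on a handful of invented prices, purely to show what the statistic measures; it does not reproduce the 0.63 figure.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented decimal prices for five runners (illustration only)
forecast = [2.5, 4.0, 6.0, 9.0, 13.0]
starting = [3.0, 3.5, 7.0, 8.0, 15.0]
print(round(pearson_r(forecast, starting), 3))
```

On the real data, calling `.corr()` on the two price columns of `df_new` should give the same statistic.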

Due to the large difference between certain higher prices and many points being plotted on top of one another, it can be difficult to visualise the relationship between the two variables. A density plot can be a helpful tool in this case and can be produced simply by adding the parameter ‘kind="kde"’.

sns.jointplot(x="historic_runners.forecast_price_decimal", y="historic_runners.starting_price_decimal", data=df_new, kind="kde");

As shown by the density plot, there is a high density of prices between 0 and 10/1, with most prices being between 6/4 and 4/1. The correlation appears to weaken somewhat as prices increase; however, this may in part be attributed to the use of more decimal places for lower-priced horses.

Assessing Accuracy of Forecasted Prices

The given forecasted price can be used to assess the accuracy of more sophisticated predictive models. This can be done by comparing the accuracy of the new model to the accuracy of the forecasted prices.

The following code outlines a way of using the scikit-learn package to calculate an R-squared value. R-squared is a measure of how much of the variance of a dependent variable is explained by an independent variable and is a way of assessing how ‘good’ a model is.

import sklearn
from sklearn.metrics import r2_score

print(r2_score(df_new['historic_runners.forecast_price_decimal'], df_new['historic_runners.starting_price_decimal']))

0.3075885002150722

This is a relatively low R-Squared value and it is likely to be improved upon with a more sophisticated model.
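One detail worth keeping in mind is that scikit-learn's `r2_score` takes the observed values first and the predictions second, and the result generally changes if the two are swapped. In the call above the forecast price is in the first position, so to score the forecast as a predictor of the starting price the arguments would be the other way round. The sketch below re-implements the formula by hand on invented numbers to show the asymmetry; it is not the calculation that produced the figure above.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

starting = [3.0, 3.5, 7.0, 8.0, 15.0]   # observed values (invented)
forecast = [2.5, 4.0, 6.0, 9.0, 13.0]   # predictions (invented)

# Same residuals, different baseline variance, so the two scores differ
print(r_squared(starting, forecast))
print(r_squared(forecast, starting))
```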

Betfair Price Analysis from Smartform Database

Another very insightful data source from Smartform is the historic Betfair prices table which can also be merged with the historic_races and historic_runners table. As shown below, there is a great number of variables that have been extracted from the Betfair exchange for pre-race and in-play price movements. (More can be read about this data source here).

query = ''' SELECT race_id, name, starting_price, 
                   historic_betfair_win_prices.bsp, historic_betfair_win_prices.av_price, historic_betfair_win_prices.early_price,
                   historic_betfair_win_prices.ante_maxprice, historic_betfair_win_prices.ante_minprice, historic_betfair_win_prices.inplay_max,
                   historic_betfair_win_prices.inplay_min, historic_betfair_win_prices.early_traded, historic_betfair_win_prices.total_traded,
                   historic_betfair_win_prices.inplay_traded
            FROM historic_races
            JOIN historic_runners USING (race_id) join historic_betfair_win_prices ON race_id=sf_race_id and runner_id = sf_runner_id
            WHERE (CAST(historic_races.meeting_date AS Datetime) BETWEEN '2018-01-01' AND '2018-12-31')
                   AND
                  (historic_races.race_type = 'Flat')
        '''
cursor.execute(query)
rows = cursor.fetchall()

df_bf = pd.DataFrame(list(rows), columns=['race_id','runner_name','SP','bsp', 'av_price', 'early_price', 'ante_maxprice', 'ante_minprice', 'inplay_max',
                                       'inplay_min', 'early_traded', 'total_traded', 'inplay_traded'])
df_bf.head(15)

Similar to before, analysis can be carried out on the early price and the starting price but this time using Betfair prices, to assess if there are still similar results on the Betfair Exchange.

import seaborn as sns
%matplotlib inline 

# 'matplotlib inline' this line enables plots to be shown inside of a jupyter environment
df_bf['early_price'] = df_bf['early_price'].astype('float')
sns.distplot(df_bf['early_price']) # distribution plot for early prices

df_bf['bsp'] = df_bf['bsp'].astype('float')
plt.figure(figsize = (12,6))
sns.distplot(df_bf['bsp'])  # distribution plot for starting prices

df_bf_new = df_bf.query('early_price <= 200 & bsp <= 200') # creating new dataframe with decimal prices <= 200
plt.figure(figsize = (12,6))
sns.distplot(df_bf_new['early_price']) # new distribution plot of early prices

plt.figure(figsize = (12,6))
sns.distplot(df_bf_new['bsp']) # new distribution plot of Betfair starting prices

As seen from the graphs, the plots are much smoother in comparison to the bookies’ prices, suggesting a greater granularity in the prices on offer through Betfair. With regard to the distributions, there appears to be little difference between bookmakers’ and Betfair starting prices from visual inspection.

sns.jointplot(x="early_price", y="bsp", data=df_bf_new) # plotting early prices against starting prices

Again, there appears to be a moderate to strong positive linear correlation, with a Pearson correlation coefficient of 0.71. This increase (from 0.63) is to be somewhat expected given that early prices are being investigated instead of forecast prices. This finding may suggest that early prices are slightly more telling of what the starting price will be (compared to forecast prices).

It should also be noted that there are many very low early prices (below 1), whereas no Betfair starting prices fall within this range. This may be caused by some markets having low liquidity in their early stages and not yet being fully priced at this point. In order to investigate this further, the effect of overround in these early markets has been analysed below.

How does Overround differ in Betfair markets?

Another interesting aspect of these markets is the amount of overround – or ‘juice’ offered to punters. Overround can be defined as a measure of how much of a margin the bookmaker is taking out of the market in order for themselves to make a profit. Effectively, the higher the overround the worse the odds you are likely to get. In other words the odds are likely to be worse value compared to the true probability of an event happening.

From the analysis above, there were signs that early prices may have a greater overround than starting prices. The following data and analysis has been carried out to see if anything can be inferred about this.

Running the following query calculates the overround for every market (all flat races in 2018) by extracting the implied probability from the odds of each runner.

Overround = the sum of all implied probabilities from every runner – 1.
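The same calculation can be written out for a single market as a quick check. The function name and the three-runner market below are hypothetical, chosen only to illustrate the formula.

```python
def overround(decimal_odds):
    """Sum of implied probabilities (1 / decimal odds) across all runners, minus 1."""
    return sum(1 / price for price in decimal_odds) - 1

# Hypothetical three-runner market, decimal odds
market = [2.0, 3.5, 4.0]
print(f"{overround(market):.2%}")  # 3.57%
```

A market priced at exactly fair odds, such as two runners at 2.0 each, gives an overround of zero.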

query = ''' SELECT race_id, SUM(1/historic_betfair_win_prices.early_price)-1 AS 'early_overround', SUM(1/historic_betfair_win_prices.bsp)-1 AS 'SP_overround'
            FROM historic_races
            JOIN historic_runners USING (race_id) join historic_betfair_win_prices ON race_id=sf_race_id and runner_id = sf_runner_id
            WHERE (CAST(historic_races.meeting_date AS Datetime) BETWEEN '2018-01-01' AND '2018-12-31')
                   AND
                  (historic_races.race_type = 'Flat')
            GROUP BY race_id
        '''
cursor.execute(query)
rows = cursor.fetchall()

df_bf_overround = pd.DataFrame(list(rows), columns=['race_id', 'early_price_overround', 'starting_price_overround'])
df_bf_overround.head(15)

This dataframe consists of the overround for early prices and starting prices on the Betfair exchange for every market (flat race in 2018).

df_bf_overround['early_price_overround'] = df_bf_overround['early_price_overround'].astype('float')
df_bf_overround['starting_price_overround'] = df_bf_overround['starting_price_overround'].astype('float')
sns.distplot(df_bf_overround['early_price_overround']) # distribution plot for early prices

plt.figure(figsize = (12,6))
sns.distplot(df_bf_overround['starting_price_overround']) # distribution plot for starting prices

There appear to be some anomalies in the calculation for early markets. These could perhaps be attributed to prices not yet being offered for some horses within a market. To continue with the analysis, it has been assumed that any market with an overround greater than 50% is incomplete, so the analysis focuses only on markets with overrounds at or below this amount.

df_bf_overround_new = df_bf_overround.query('early_price_overround <= 0.5 & starting_price_overround <= 0.5')

overrounds = list(df_bf_overround_new.columns.values)[1:3]
plt.figure(figsize=(12,7))

for overround in overrounds:
    sns.distplot(df_bf_overround_new[overround])
    
plt.title('Distribution of Market Overrounds (Early v Starting Prices)', size = 22)
plt.xlabel('Overround %',  size = 16)
plt.legend(labels = overrounds, loc = 'upper left', prop={'size': 14} )
plt.show()

print('Average (mean) overround for early priced markets: ', df_bf_overround_new['early_price_overround'].mean())

Average (mean) overround for early priced markets:  0.03054834814308512

print('Average (mean) overround for starting price markets: ', df_bf_overround_new['starting_price_overround'].mean())

Average (mean) overround for starting price markets:  0.003256390977443608

from scipy import stats
stats.ttest_ind(df_bf_overround_new['early_price_overround'], df_bf_overround_new['starting_price_overround']) 

Ttest_indResult(statistic=54.96055888606393, pvalue=0.0)

This T-test result (a p-value reported as 0.0, i.e. too small to be represented as a floating-point number) indicates a statistically significant difference between the means of the two samples, i.e. between the average overround of early prices and that of starting prices.

Betfair starting prices appear to have an overround of approximately 0.3% on average, compared to an average of about 3% on early prices.

From this it could be inferred that starting prices have a lower overround on average than early prices – meaning that punters are in effect more likely to get ‘more for their money’ if entering the market directly before post time compared to betting on early prices. This may be because there is a greater amount of liquidity in the markets at this point in time.

Finally, how does starting price overround differ between bookmaker and exchange prices?

In order to retrieve the data to answer this question, two separate queries were run against the database for simplicity and their dataframes were then merged, as shown below:

First, extracting the Betfair starting prices into a dataframe…

query = ''' SELECT race_id, SUM(1/historic_betfair_win_prices.bsp)-1 AS 'SP_overround'
            FROM historic_races
            JOIN historic_runners USING (race_id) join historic_betfair_win_prices ON race_id=sf_race_id and runner_id = sf_runner_id
            WHERE (CAST(historic_races.meeting_date AS Datetime) BETWEEN '2018-01-01' AND '2018-12-31')
                   AND
                  (historic_races.race_type = 'Flat')
            GROUP BY race_id
        '''
cursor.execute(query)
rows = cursor.fetchall()

df_bf_sp_overround = pd.DataFrame(list(rows), columns=['race_id', 'betfair_starting_price_overround'])
df_bf_sp_overround.head(15)

Next, extracting the bookies’ starting prices into a dataframe…

query = ''' SELECT race_id,  SUM(1/starting_price_decimal)-1 AS 'SP_overround'
            FROM historic_races join historic_runners using (race_id) 
            WHERE (CAST(historic_races.meeting_date AS Datetime) BETWEEN '2018-01-01' AND '2018-12-31')
                   AND
                  (historic_races.race_type = 'Flat')
            GROUP BY race_id
        '''
cursor.execute(query)
rows = cursor.fetchall()

df_bookies_sp_overround = pd.DataFrame(list(rows), columns=['race_id', 'bookies_starting_price_overround'])
df_bookies_sp_overround.head(15)

Then, merging both dataframes together, joining them on the variable ‘race_id’…

#Merging the two dataframes on race_id
df_merge_col = pd.merge(df_bookies_sp_overround, df_bf_sp_overround, on='race_id')
print('betfair df size :', df_bf_sp_overround.shape, 'bookies df size :', df_bookies_sp_overround.shape, 'total size :', df_merge_col.shape) 

betfair df size : (4829, 2) bookies df size : (4863, 2) total size : (4829, 3)

(This merge lost 34 rows. It appears that bookmakers priced up more races in this time period than are present in the Betfair data.)
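To see exactly which races drop out of an inner merge, the race_id values from the two sources can be compared as sets. The ids below are invented; in practice they would come from the ‘race_id’ columns of df_bookies_sp_overround and df_bf_sp_overround.

```python
# Hypothetical race_id values standing in for the two dataframes' race_id columns
bookies_race_ids = {101, 102, 103, 104, 105}
betfair_race_ids = {101, 102, 104}

only_in_bookies = sorted(bookies_race_ids - betfair_race_ids)
only_in_betfair = sorted(betfair_race_ids - bookies_race_ids)
in_both = bookies_race_ids & betfair_race_ids

print("Races only in bookmaker data:", only_in_bookies)
print("Races only in Betfair data:", only_in_betfair)
print("Races kept by an inner merge:", len(in_both))
```

A similar check can be done within pandas by merging with `how='outer'` and `indicator=True`, which adds a column recording the source of each row.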

df_merge_col.head()

import matplotlib.pyplot as plt
overrounds = list(df_merge_col.columns.values)[1:3]
plt.figure(figsize=(12,7))

for overround in overrounds:
    sns.distplot(df_merge_col[overround].astype('float'))
    
plt.title('Distribution of Market Overrounds (Bookies v Betfair Starting Prices)', size = 22)
plt.xlabel('Overround %',  size = 16)
plt.legend(labels = overrounds, loc = 'upper left', prop={'size': 14} )
plt.show()

print('Average (mean) overround for bookies starting prices : ', df_merge_col['bookies_starting_price_overround'].mean())

Average (mean) overround for bookies starting prices :  0.17552602236487852

print('Average (mean) overround for Betfair starting prices : ', df_merge_col['betfair_starting_price_overround'].mean())

Average (mean) overround for betfair starting prices :  0.0035572168150755853

from scipy import stats
stats.ttest_ind(df_merge_col['bookies_starting_price_overround'].astype('float'), df_merge_col['betfair_starting_price_overround'].astype('float'))

Ttest_indResult(statistic=152.77780026940124, pvalue=0.0)

This T-test result (a p-value reported as 0.0, i.e. too small to be represented as a floating-point number) indicates a statistically significant difference between the means of the two samples, i.e. between the average overround of bookmaker and exchange starting prices.

As shown above, bookmakers had a much greater overround (for starting prices) of approximately 17.6%, compared to approximately 0.4% for Betfair. This reflects a large difference in value between the two betting mediums, suggesting that you are likely to find much better odds by betting on Betfair than with bookmakers, and by doing so just before post time (in the large majority of cases).

Further analysis could look into whether these findings hold true for all price ranges (and whether the same results are found across different market types, not just a sample of data from 2018 flat races).


Querying From Smartform Using Python

By Tom Bardrick on Friday, April 26th, 2019

Python is now one of the most commonly used programming languages – at the time of writing 4th in popularity according to the TIOBE index. It’s also a popular choice for data manipulation and data science, with plenty of packages such as Pandas for preparing data and Scikit-learn for machine learning meaning that – like R – it can be an ideal environment to use for analysing horseracing data and building prediction models. Here, we discuss first steps to start using Python with Smartform (the MySQL horseracing database from Betwise) in order to connect to the database, run queries, and start using the data within the Python environment.

Installing ‘PyMySQL’
In order to query into a MySQL database directly from Python, the PyMySQL package needs to be installed. More can be read about the package requirements here:

This needs to be done outside of the IPython Shell using the ‘pip’ command. In a command prompt (if using Windows) or bash prompt (if using macOS) use

 $ pip install pymysql

More information on how to install packages for python can be found here:

Importing ‘PyMySQL’
After doing so, the pymysql package then needs to be imported into the IPython Shell, by running the following code:

import pymysql

Now you have the necessary package installed to make contact between Python and the MySQL database.

Establishing a Connection
In order to connect the IPython Shell to the MySQL database you will need to know your MySQL database credentials. These details would have been inputted by the user when creating the MySQL database.

The details required are: host name, user name, password and database name (e.g. ‘smartform’). These details need to be entered into the strings below.

Note: The password has been filled with asterisks for security reasons but do enter your actual password here

# Inputting database credentials
connection = pymysql.connect(host='localhost', user='root', passwd ='********', database = 'smartform')

If all of the credentials are correct, the connection should be established and the code should run without an error message. If for whatever reason this code doesn’t work, make sure you have entered the correct details and you have imported the ‘pymysql’ package correctly.
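Rather than typing the password into the script or notebook, the credentials could be read from environment variables. The sketch below only builds the keyword arguments; the `SMARTFORM_DB_*` variable names and the helper function are hypothetical, chosen for this example.

```python
import os

def smartform_credentials(env=os.environ):
    """Build keyword arguments for pymysql.connect from environment variables.

    The SMARTFORM_DB_* variable names are hypothetical, chosen for this example.
    """
    return {
        "host": env.get("SMARTFORM_DB_HOST", "localhost"),
        "user": env.get("SMARTFORM_DB_USER", "root"),
        "password": env["SMARTFORM_DB_PASSWORD"],  # no default: must be set
        "database": env.get("SMARTFORM_DB_NAME", "smartform"),
    }

# Usage (requires the password variable to be set and pymysql installed):
# connection = pymysql.connect(**smartform_credentials())
```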

Creating a Cursor
To be able to make queries from the MySQL database, a cursor needs to be created. The cursor is effectively a control structure that enables traversal over the records in a database. This can be done by simply running the following code.

cursor = connection.cursor()

Making a Query
Write out your desired query as a string, as you would normally write out a SQL query.

The following query is an example of how to return all of the unique runners (and their associated runner names) with an official rating (OR) above 160, ordered alphabetically by name.

query = '''SELECT DISTINCT runner_id, name
           FROM historic_runners
           WHERE official_rating > 160
           ORDER BY name ASC
        '''

Then pass this query to the ‘cursor.execute’ function (this will return the number of records in the query as an output).

cursor.execute(query)

Then use ‘cursor.fetchall()’ to retrieve all of the data entries corresponding to this query, assigning the result to a variable.

rows = cursor.fetchall()

Use a ‘for loop’ to print and inspect the query output.

for row in rows[:10]:
    print(row)

(514205, 'Afsoun')
(2048799, 'Agrapart')
(1435705, 'Al Ferof')
(547061, 'Albertas Run')
(1692968, 'Alelchi Inois')
(160435, 'Alexander Banquet')
(229855, 'Allegedly Red')
(2037368, 'Altior')
(447515, 'Andreas')
(2104005, 'Anibale Fly')
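If the rating threshold needs to vary, it can be passed as a query parameter rather than edited into the string. The sketch below runs against an in-memory SQLite table with invented rows so that it works without a MySQL server. SQLite uses ‘?’ as its placeholder, whereas PyMySQL uses ‘%s’ in the same position.

```python
import sqlite3

# In-memory SQLite table standing in for historic_runners (invented rows)
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE historic_runners (runner_id INTEGER, name TEXT, official_rating INTEGER)"
)
cursor.executemany(
    "INSERT INTO historic_runners VALUES (?, ?, ?)",
    [(1, "Horse B", 165), (2, "Horse A", 170), (3, "Horse C", 150)],
)

min_rating = 160
# sqlite3 uses '?' placeholders; PyMySQL uses '%s' in the same position
query = """SELECT DISTINCT runner_id, name
           FROM historic_runners
           WHERE official_rating > ?
           ORDER BY name ASC"""
cursor.execute(query, (min_rating,))
rows = cursor.fetchall()
print(rows)  # [(2, 'Horse A'), (1, 'Horse B')]
connection.close()
```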


Plotting Trainer, Jockey and Sire Statistics in a Stacked Bar Chart with R

By Phill Clarke on Saturday, June 23rd, 2018

Earlier in the week we looked at how to use a for loop to iterate across rows of a dataframe to calculate statistics in an automated manner. Interesting and useful, but we only looked at one specific set of circumstances: trainer and jockey combinations in Group races. There are many other useful statistics which can be used to examine a race. This article focuses on today’s Diamond Jubilee Stakes at Royal Ascot, extends the one collection of statistics to four and finally plots the outcome in a visual format.

As the code examples for this article now extend beyond 550 lines, it is not practical to include all the code in-line with the article text. Therefore, only certain examples are included in-line, and the full R code is provided at the end of the article.

The initial assumption is that data has been returned from the Smartform database, although some additional fields are now returned, specifically trainer_id, jockey_id and sire_name.

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_races.distance_yards,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal,
              historic_runners.trainer_id,
              historic_runners.jockey_id,
              historic_runners.sire_name
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

Previously we created a trainer & jockey function to investigate these specific combinations in Group races. This is now extended to just trainer, just jockey and just sire functions. The trainer function is found below.

# Trainer stats
# Name the function and add some arguments
tr <- function(race_filter = "", price_filter = 1000, trainer){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer id
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_id))


  # Filter by price
  trainer_price_filtered <- dplyr::filter(trainer_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_cumulative <- cumsum(
    ifelse(trainer_price_filtered$finish_position == 1, 
           (trainer_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer$tr_runners <- runners
  trainer$tr_winners <- winners
  trainer$tr_sr <- strike_rate
  trainer$tr_pot <- profit_on_turnover
  trainer$tr_iv <- iv
  trainer$tr_ae <- ae
  trainer$tr_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

Note that in the above code, instead of filtering by trainer_name, we are now filtering by trainer_id. This is due to the fact that sometimes the trainer names in the daily racing data do not exactly match those in the historic data. For example, Sir Michael Stoute hasn’t always been a knight. Therefore, if we were just matching on trainer_name there would be some occasions where this fails and no results are returned. Smartform instead provides a unique identification number for trainers and jockeys, which ensures there will always be a match between historic and daily data.

The function above is just one example. In order to produce the charts later in this article, additional jockey and sire functions have been added, bringing the total to four; trainer, jockey, trainer & jockey and sire. The number of statistics could be extended much further to include angles such as trainer & distance, trainer & course, trainer & age (2yo, 3yo, 4yo+ races) and many more.
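The arithmetic behind these figures can be checked on a single set of numbers. The sketch below mirrors the strike rate, impact value, A/E, Archie and confidence calculations from the trainer function, written in Python; the function name and the input figures are invented for illustration, and profit on turnover is left out because it needs the individual race results.

```python
import math

def angle_stats(runners, winners, expected_winners, all_runners, all_winners):
    """Strike rate, impact value, A/E, Archie and confidence for one angle.

    expected_winners is the sum of 1 / starting_price_decimal over the runners.
    """
    strike_rate = winners / runners * 100
    iv = (winners / all_winners) / (runners / all_runners)
    ae = winners / expected_winners
    archie = (runners * (winners - expected_winners) ** 2) / (
        expected_winners * (runners - expected_winners)
    )
    # Chi-squared CDF with 1 degree of freedom, equivalent to R's pchisq(x, df = 1)
    conf = math.erf(math.sqrt(archie / 2)) * 100
    return {"sr": strike_rate, "iv": iv, "ae": ae, "archie": archie, "conf": conf}

# Invented figures: 20 runners, 5 winners, 4.0 winners expected from the odds,
# against a population of 1000 runners and 100 winners.
stats = angle_stats(runners=20, winners=5, expected_winners=4.0,
                    all_runners=1000, all_winners=100)
for key, value in stats.items():
    print(key, round(value, 2))
```

With these inputs the strike rate is 25%, the impact value 2.5 and the A/E 1.25, while the Archie score of roughly 0.31 gives a confidence figure well below 50%, which shows how a small sample can limit confidence even when the ratios look strong.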

The for loop also now includes all four of these functions.

# Create placeholder lists which will be required later
row_tr <- list()
row_jc <- list()
row_tj <- list()
row_sr <- list()

# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {


  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer, jockey id and sire names
  trainer <- runner_details$trainer_id
  jockey <- runner_details$jockey_id
  sire <- runner_details$sire_name

  # Apply the Trainer function for Group races only
  trainer_combo <- tr(race_filter = "group", 
                                  trainer = trainer)

  # Add results row by row to the previously defined list
  row_tr[[i]] <- trainer_combo

  # Apply the Jockey function for Group races only
  jockey_combo <- jc(race_filter = "group", 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row_jc[[i]] <- jockey_combo

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, jockey = jockey)

  # Add results row by row to the previously defined list
  row_tj[[i]] <- trainer_jockey_combo

  # Apply the Sire function for Group races only
  sire_combo <- sr(race_filter = "group", 
                             sire = sire)

  # Add results row by row to the previously defined list
  row_sr[[i]] <- sire_combo

}

# Create the final dataframes once, after the loop has finished
stats_final_tr <- as.data.frame(do.call("rbind", row_tr))
stats_final_jc <- as.data.frame(do.call("rbind", row_jc))
stats_final_tj <- as.data.frame(do.call("rbind", row_tj))
stats_final_sr <- as.data.frame(do.call("rbind", row_sr))

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final_tr)
racecard <- cbind(racecard,stats_final_jc)
racecard <- cbind(racecard,stats_final_tj)
racecard <- cbind(racecard,stats_final_sr)

Viewing the final racecard now shows forty columns and a wall of data. This is perhaps not the easiest way to visualise the overall picture. Instead, we’ll create a stacked barchart showing Impact Values for all four angles. The legend shows tr_iv, jc_iv, tj_iv and sr_iv for the trainer, jockey, trainer & jockey and sire impact values.

# Filter for Diamond Jubilee Only
diamond_jubilee <- dplyr::filter(racecard,
                                 grepl("Diamond Jubilee", 
                                       race_title))

# Filter for just the IV columns which we will plot
racecard_filtered_iv <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv")]

# Convert the racecard from wide to long format
racecard_long_iv <- melt(racecard_filtered_iv, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_iv, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The highest bars indicate the highest cumulative Impact Values. Some bars do not include all four factors, as sometimes there were no results returned. For example, Bound For Nowhere’s sire, The Factor, has only had one Group runner in the UK & Ireland, which did not win. Therefore, there is no data to calculate strike rate, impact value etc.
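
As a reminder of what each bar segment represents, here is a minimal Python sketch of the Impact Value calculation used in the functions of this article: the share of all winners divided by the share of all runners. The counts are invented for illustration, and returning zero when nothing can be calculated is an assumption that mirrors the article's NaN-to-zero replacement.

```python
def impact_value(winners, runners, all_winners, all_runners):
    """Share of all winners divided by share of all runners."""
    if runners == 0 or all_winners == 0 or all_runners == 0:
        # Nothing to measure; mirror the article's NaN-to-zero replacement
        return 0.0
    return (winners / all_winners) / (runners / all_runners)


# Hypothetical counts: 12 winners from 80 runners, against a population
# of 500 winners from 5000 runners.
iv = impact_value(winners=12, runners=80, all_winners=500, all_runners=5000)

# A single runner that did not win gives an Impact Value of zero,
# which is why that segment is absent from the stacked bar.
iv_single = impact_value(winners=0, runners=1, all_winners=500, all_runners=5000)

print(round(iv, 2), iv_single)
```

A value above 1 means the angle wins more often than its share of runners would suggest, and a value of zero simply leaves no visible segment in the chart.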

This means one needs to be careful when examining the chart, and should also take time to study the data in the dataframe. Some sample sizes may be very small, and it is worth asking whether they are statistically relevant. Bound For Nowhere’s Trainer & Jockey Impact Value is the highest in the race, but it comes from only six runners, whereas Merchant Navy has the second highest tj_iv from 357 runners.
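
One rough way to see how much less a six-runner sample tells us than a 357-runner sample is to put an interval around the strike rate. The Python sketch below uses the Wilson score interval, which is a swapped-in technique and not something calculated in this article. The runner counts come from the text above, but the winner counts are invented purely for illustration.

```python
import math


def wilson_interval(winners, runners, z=1.96):
    """Approximate 95% Wilson score interval for a win rate."""
    if runners == 0:
        return (0.0, 1.0)
    p = winners / runners
    denom = 1 + z**2 / runners
    centre = (p + z**2 / (2 * runners)) / denom
    half = z * math.sqrt(p * (1 - p) / runners + z**2 / (4 * runners**2)) / denom
    return (centre - half, centre + half)


# Runner counts from the text; winner counts are hypothetical.
small = wilson_interval(winners=2, runners=6)
large = wilson_interval(winners=60, runners=357)

small_width = small[1] - small[0]
large_width = large[1] - large[0]

print(f"6 runners:   {small[0]:.2f} to {small[1]:.2f}")
print(f"357 runners: {large[0]:.2f} to {large[1]:.2f}")
```

The interval for the small sample is several times wider, which is another way of saying that its headline figure could easily be luck.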

Nonetheless, a visual method like this can still assist to narrow the field. Harry Angel, the favourite for the race, is certainly not a standout in the chart, with decent sample sizes across all four factors.

The stacked barchart can also be applied to Actual vs Expected figures, strike rates or Confidence figures. The chart below displays stacked A/E for a more value oriented view.

# Filter for just the AE columns which we will plot
racecard_filtered_ae <- diamond_jubilee[,c("name","tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_ae <- melt(racecard_filtered_ae, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_ae, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Once again, keep Bound for Nowhere’s small sample sizes in mind. Harry Angel appears to be a better betting proposition based on this chart.
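
For readers who want to check an A/E figure by hand, the calculation in the functions of this article divides the number of winners by the sum of the probabilities implied by each runner's starting price. A minimal Python sketch follows; the prices and finishing positions are invented for illustration.

```python
def actual_vs_expected(results):
    """results: list of (starting_price_decimal, finish_position) pairs."""
    winners = sum(1 for _, position in results if position == 1)
    # Each decimal price implies a win probability of 1 / price
    expected = sum(1 / sp for sp, _ in results)
    if expected == 0:
        return 0.0
    return winners / expected


# Hypothetical runs: two winners from four runners
sample = [(2.0, 1), (4.0, 3), (5.0, 1), (10.0, 7)]
ae = actual_vs_expected(sample)
print(round(ae, 2))
```

Here the market expected 1.05 winners (0.5 + 0.25 + 0.2 + 0.1) and there were two, so the A/E is well above 1, which is the sense in which a high A/E points towards value.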

Another way to look at this data might be as a grouped bar chart, where the IV and A/E figures are plotted for each horse next to each other.

# Filter for both the IV and AE columns which we will plot
racecard_filtered_all <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv", 
                                            "tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_all <- melt(racecard_filtered_all, id.var="name")

# Plot a grouped barchart
ggplot(racecard_long_all, aes(x = name, y = value, fill = variable)) +   
  geom_bar(position = "dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Although Sire IV is missing for Bound for Nowhere, his other figures all point to a superior trainer and jockey, albeit from a small sample size. How to deal statistically with these small sample sizes will be covered in a future article on Bayesian techniques.
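
In the meantime, the Archie score and Confidence figure already calculated in the functions of this article give some protection, because the same ratio of actual to expected winners earns a lower Confidence when it comes from fewer runners. The Python sketch below reproduces those two formulas. The runner, winner and expected-winner counts are invented, and the use of `math.erf` relies on the fact that the chi-square distribution function with one degree of freedom equals erf(sqrt(x / 2)).

```python
import math


def archie(runners, winners, expected_winners):
    """Chi-square style score comparing actual with expected winners."""
    return (runners * (winners - expected_winners) ** 2) / (
        expected_winners * (runners - expected_winners)
    )


def confidence(archie_score):
    """Chi-square CDF with one degree of freedom, as a percentage."""
    return math.erf(math.sqrt(archie_score / 2)) * 100


# Hypothetical figures: the same A/E of 20/15, at two sample sizes
small_conf = confidence(archie(runners=100, winners=20, expected_winners=15))
large_conf = confidence(archie(runners=1000, winners=200, expected_winners=150))

print(round(small_conf, 1), round(large_conf, 1))
```

Both samples have the same A/E, yet the larger one produces a much higher Confidence figure.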

Which horses might be included in a shortlist for today’s Diamond Jubilee? Despite the small sample sizes, and knowing Wesley Ward’s Ascot success with sprinters, it may be wise to include Bound for Nowhere, who is currently 14.0 on Betfair. Merchant Navy (IV) and Harry Angel (A/E) are both positives, although much shorter priced at the top of the market.

It is important not to just rely on the data. There are many different factors to consider and a good knowledge of general form is also required. Therefore, after all that work, one might still decide just to back the Aussie danger and triple Group 1 sprint winner, Redkirk Warrior.

Good luck!

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the library packages
library("RMySQL")
library("dplyr")
library("reshape2")
library("ggplot2")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                 host='127.0.0.1', 
                 user='yourusername', 
                 password='yourpassword', 
                 dbname='smartform')

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_races.distance_yards,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal,
              historic_runners.trainer_id,
              historic_runners.jockey_id,
              historic_runners.sire_name
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

# Remove non-runners and non-finishers
smartform_results <- dplyr::filter(smartform_results, !is.na(finish_position))

# Select relevant daily results for tomorrow
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_races.distance_yards,
              daily_runners.cloth_number,
              daily_runners.name,
              daily_runners.trainer_name,
              daily_runners.jockey_name,
              daily_runners.sire_name,
              daily_runners.forecast_price_decimal,
              daily_runners.trainer_id,
              daily_runners.jockey_id
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-06-23'", sep="")

smartform_daily_results <- dbGetQuery(con, sql2)

dbDisconnect(con)

# Remove non-runners
smartform_daily_results <- dplyr::filter(smartform_daily_results, !is.na(forecast_price_decimal))

# Trainer stats
# Name the function and add some arguments
tr <- function(race_filter = "", price_filter = 1000, trainer){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer id (exact match, so that id 12 cannot also match 123)
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    trainer_id == trainer)


  # Filter by price
  trainer_price_filtered <- dplyr::filter(trainer_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_cumulative <- cumsum(
    ifelse(trainer_price_filtered$finish_position == 1, 
           (trainer_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer$tr_runners <- runners
  trainer$tr_winners <- winners
  trainer$tr_sr <- strike_rate
  trainer$tr_pot <- profit_on_turnover
  trainer$tr_iv <- iv
  trainer$tr_ae <- ae
  trainer$tr_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Jockey stats
# Name the function and add some arguments
jc <- function(race_filter = "", price_filter = 1000, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by jockey id (exact match, so that id 12 cannot also match 123)
  jockey_filtered <- dplyr::filter(filtered_races, 
                                    jockey_id == jockey)


  # Filter by price
  jockey_price_filtered <- dplyr::filter(jockey_filtered,
                                          starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  jockey_cumulative <- cumsum(
    ifelse(jockey_price_filtered$finish_position == 1, 
           (jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  jockey$jc_runners <- runners
  jockey$jc_winners <- winners
  jockey$jc_sr <- strike_rate
  jockey$jc_pot <- profit_on_turnover
  jockey$jc_iv <- iv
  jockey$jc_ae <- ae
  jockey$jc_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Trainer and Jockey stats
# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer id (exact match, so that id 12 cannot also match 123)
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    trainer_id == trainer)

  # Remove non-runners
  #  trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

  # Filter by jockey id (exact match, so that id 12 cannot also match 123)
  trainer_jockey_filtered <- dplyr::filter(trainer_filtered, 
                                           jockey_id == jockey)

  # Filter by price
  trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_jockey_cumulative <- cumsum(
    ifelse(trainer_jockey_price_filtered$finish_position == 1, 
           (trainer_jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer_jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer_jockey$tj_runners <- runners
  trainer_jockey$tj_winners <- winners
  trainer_jockey$tj_sr <- strike_rate
  trainer_jockey$tj_pot <- profit_on_turnover
  trainer_jockey$tj_iv <- iv
  trainer_jockey$tj_ae <- ae
  trainer_jockey$tj_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer_jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Sire stats
# Name the function and add some arguments
sr <- function(race_filter = "", price_filter = 1000, sire){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by sire name
  sire_filtered <- dplyr::filter(filtered_races, 
                                    grepl(sire, sire_name))


  # Filter by price
  sire_price_filtered <- dplyr::filter(sire_filtered,
                                          starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  sire_cumulative <- cumsum(
    ifelse(sire_price_filtered$finish_position == 1, 
           (sire_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(sire_price_filtered,
                                finish_position == 1))

  runners <- nrow(sire_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(sire_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/sire_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  sire <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  sire$sr_runners <- runners
  sire$sr_winners <- winners
  sire$sr_sr <- strike_rate
  sire$sr_pot <- profit_on_turnover
  sire$sr_iv <- iv
  sire$sr_ae <- ae
  sire$sr_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(sire)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Filter tomorrow's races for Group races only
group_races_only <- dplyr::filter(smartform_daily_results,
                                  grepl(paste(c("Group 1", "Group 2", "Group 3"), collapse="|"), race_title))

# Create placeholder lists which will be required later
row_tr <- list()
row_jc <- list()
row_tj <- list()
row_sr <- list()

# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {


  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer id, jockey id and sire name
  trainer <- runner_details$trainer_id
  jockey <- runner_details$jockey_id
  sire <- runner_details$sire_name

  # Apply the Trainer function for Group races only
  trainer_combo <- tr(race_filter = "group", 
                                  trainer = trainer)

  # Add results row by row to the previously defined list
  row_tr[[i]] <- trainer_combo

  # Apply the Jockey function for Group races only
  jockey_combo <- jc(race_filter = "group", 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row_jc[[i]] <- jockey_combo

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, jockey = jockey)

  # Add results row by row to the previously defined list
  row_tj[[i]] <- trainer_jockey_combo

  # Apply the Sire function for Group races only
  sire_combo <- sr(race_filter = "group", 
                             sire = sire)

  # Add results row by row to the previously defined list
  row_sr[[i]] <- sire_combo

}

# Create the final dataframes once, after the loop has finished
stats_final_tr <- as.data.frame(do.call("rbind", row_tr))
stats_final_jc <- as.data.frame(do.call("rbind", row_jc))
stats_final_tj <- as.data.frame(do.call("rbind", row_tj))
stats_final_sr <- as.data.frame(do.call("rbind", row_sr))

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final_tr)
racecard <- cbind(racecard,stats_final_jc)
racecard <- cbind(racecard,stats_final_tj)
racecard <- cbind(racecard,stats_final_sr)

# Filter for Diamond Jubilee Only
diamond_jubilee <- dplyr::filter(racecard,
                                 grepl("Diamond Jubilee", 
                                       race_title))

# Filter for just the IV columns which we will plot
racecard_filtered_iv <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv")]

# Convert the racecard from wide to long format
racecard_long_iv <- melt(racecard_filtered_iv, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_iv, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Filter for just the AE columns which we will plot
racecard_filtered_ae <- diamond_jubilee[,c("name","tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_ae <- melt(racecard_filtered_ae, id.var="name")

# Plot a stacked barchart
ggplot(racecard_long_ae, aes(x = name, y = value, fill = variable)) + 
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Filter for both the IV and AE columns which we will plot
racecard_filtered_all <- diamond_jubilee[,c("name","tr_iv","jc_iv", "tj_iv", "sr_iv", 
                                            "tr_ae","jc_ae", "tj_ae", "sr_ae")]

# Convert the racecard from wide to long format
racecard_long_all <- melt(racecard_filtered_all, id.var="name")

# Plot a grouped barchart
ggplot(racecard_long_all, aes(x = name, y = value, fill = variable)) +   
  geom_bar(position = "dodge", stat="identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


2 year old sire stats for Royal Ascot

By Nick Franks on Thursday, June 21st, 2018

Two year old races at Royal Ascot are some of the most exciting out there, but trying to find a winner based on their racing form alone is a difficult, if not impossible, task.

Why is this? At this stage of the season, most contenders have only had one, two or three runs; most of the form is from diverse courses in varying classes on different ground, with different form lines, and to cap it all the fields are typically large. Needles and haystacks spring to mind. Trainer strike rates can be useful, as we’ve covered in recent posts, as can their records for Royal Ascot in particular, but still don’t tell us much about the horse’s ability itself.

Fortunately there’s more than one way to look at assessing form and future potential, and it’s at times when runners’ form is unexposed that it generally pays to look at other factors in the horse’s profile as indicators of potential ability – especially when there is far more information available than the bare runs – such as the form of the horse’s sire.

This can be a particularly strong pointer at Royal Ascot and gives us a powerful new angle, beyond runner form alone, for establishing the potential of the animal in question.

Today we’re going to look at measuring sire strike rates and ranking them, a method which has been doing quite well so far this Royal Ascot. In the first of the two 2 year old races so far, on Tuesday, the sire with the top strike rate produced the winner of the Coventry Stakes, Calyx, at 2/1.

Here’s a screenshot of the query results for the Coventry:

Yesterday, Wednesday, saw the joint top sire strike rate produce the second in Gossamer Wings at 25/1, and the fourth, So Perfect, at 8/1. Another screenshot of the query follows:

Sod’s law says that today will be the day that the two year old sires system fails, since that happens all the time in betting, and of course there is no such thing as a sure thing. But there is such a thing as gaining “an edge” with a method or a combination of methods. The edge needs to be measured and weighed up against the prices on offer to see if there is value, but that’s not what today’s post is about. It’s about a method to generate a possible edge in the first place.
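
Although measuring value is outside the scope of this post, the basic idea of weighing an edge against the price can be sketched briefly. The Python example below is an illustration only: the win probability and the prices are invented, and in practice the probability estimate is the hard part.

```python
def expected_profit(win_probability, decimal_odds, stake=1.0):
    """Average profit per bet: winnings when right minus stake lost when wrong."""
    return stake * (
        win_probability * (decimal_odds - 1) - (1 - win_probability)
    )


# Hypothetical: we believe the horse wins 30% of the time
at_four = expected_profit(win_probability=0.30, decimal_odds=4.0)
at_three = expected_profit(win_probability=0.30, decimal_odds=3.0)

print(round(at_four, 2), round(at_three, 2))
```

The same estimated chance is a value bet at 4.0 and a losing one at 3.0, which is why an edge only matters in relation to the price on offer.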

We can also pick holes in ranking anything by strike rate. The winner on day 1 included its own previous win in a very small sample, because Kingman’s progeny have not yet had many runs.

But given those warnings, there is usually little influence of the horse itself in this method, particularly when there is a large sample of previous runs. Also, on the question of sample size, it’s possible to overcome the problem of small samples by applying some simple Bayesian priors to augment the winner and runner ratios – but more about that as well on another day.
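
As a taste of what such an adjustment might look like, the Python sketch below shrinks a raw strike rate towards a prior. This is one common approach and an assumption on our part, not necessarily the method a later post will use. The prior strike rate of 10%, the prior strength of 20 runs and the runner counts are all invented for illustration.

```python
def shrunk_strike_rate(winners, runners, prior_rate=0.10, prior_strength=20):
    """Blend the observed record with a prior worth `prior_strength` runs."""
    return (winners + prior_rate * prior_strength) / (runners + prior_strength)


# Hypothetical sires: one with 1 winner from 2 runs, one with 30 from 300
raw_small = 1 / 2
adj_small = shrunk_strike_rate(winners=1, runners=2)
raw_large = 30 / 300
adj_large = shrunk_strike_rate(winners=30, runners=300)

print(round(raw_small, 3), round(adj_small, 3))
print(round(raw_large, 3), round(adj_large, 3))
```

The sire with two runs drops from a raw 50% to under 14%, while the sire with 300 runs barely moves, so the ranking is no longer dominated by tiny samples.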

So without further ado, here is today’s ranking of horses by sire strike rate for the Norfolk Stakes. There are only 10 runners, so it is not such a cavalry charge as on the first two days. Note also the very narrow variance between strike rates, with the top strike rate a low 11%.

And – since it’s better to teach a man to fish, here is the Smartform query that subscribers can run for themselves for the rest of Royal Ascot.


--  Select the upcoming flat turf races for 2yos at Ascot (meeting_date after today), with selected columns from the daily races and runners tables

--  Note: the database lists the course as Royal_Ascot, so we look for all races with Ascot in the course name, hence the use of like with %


DROP TABLE IF EXISTS today_2yoturf_races;
CREATE TABLE today_2yoturf_races AS (

select

race_id, meeting_date, scheduled_time, Course, cloth_number, name, foaling_date, sire_name ,
forecast_price_decimal, Trainer_Name Trainer, Jockey_Name Jockey, Stall_Number Draw
from daily_races
join daily_runners using (race_id)
where meeting_date > curdate()

and race_type = 'flat'
and track_type = 'turf'
and age_range  = '2YO only'
and course like  '%Ascot%'     );


--   Create history for 2yo turf sires
--

DROP TABLE IF EXISTS hist_2yoturf_sires;

CREATE TABLE hist_2yoturf_sires AS (

select z.sire_name   ,
COUNT(*) AS Runners,
SUM(winner) AS Winners,
sum(WinProfit)  as WinProfit,

ROUND(((SUM(CASE WHEN z.finish_position = 1 THEN 1 ELSE 0 END) / COUNT(*)) * 100),2) AS WinPct,
case when SUM(winner) = 0 then NULL else
round((SUM(CASE WHEN z.winner = 1 THEN z.distance_yards ELSE 0 END)/220) / SUM( z.winner  ),1) END AS AveWinDist,
sum(Placer) as Placers,
sum(PLaceProfit) as PlaceProfit,

ROUND(((SUM(CASE WHEN z.Placer = 1 THEN 1 ELSE 0 END) / COUNT(*)) * 100),2) AS PlacePct,

case when SUM(placer) = 0 then NULL else

round((SUM(CASE WHEN z.Placer = 1 THEN z.distance_yards ELSE 0 END)/220) / SUM( z.Placer  ),1) END AS AvePlaceDist

from (

select
hru.sire_name , hra.distance_yards,hru.starting_price_decimal,hru.days_since_ran , hra.class, hru.finish_position ,
case when hru.finish_position = 1 then 1 else 0 end as Winner,
case when num_runners < 8 then case when finish_position in ( 2) then 1 else 0 end

else

case when num_runners < 16 then case when finish_position in ( 2,3) then 1 else 0 end

else

case when handicap = 1 then case when finish_position in (2,3,4) then 1 else 0 end

else

case when finish_position in ( 1,2,3) then 1 else 0 end

end end end as Placer,

round(CASE WHEN finish_position = 1 THEN (starting_price_decimal -1) ELSE -1 END,2) AS WinProfit,

round(case when (

Case when num_runners < 5 then case when finish_position = 1 then 1 else 0 end

else

case when num_runners < 8 then case when finish_position in ( 1,2)  then 1 else 0 end

else

case when num_runners < 16 then case when finish_position in ( 1,2,3)  then 1 else 0 end

else

case when handicap = 1 then case when finish_position in (1,2,3,4) then 1 else 0 end

else

case when finish_position in ( 1,2,3)  then 1 else 0

end end end end end )

= 1

then (starting_price_decimal -1) /

Case when num_runners < 5 then 1

else

case when num_runners < 8 then 4

else

case when num_runners < 12 then 5

else

case when handicap = 1 then 4 else 5

end end end end else -1 end,2)

PlaceProfit

from today_2yoturf_races
join  historic_runners hru using (sire_name)
join historic_races hra on hru.race_id = hra.race_id

where hra.race_type_Id = 12
and hra.max_age = 2

and in_race_comment <> 'Withdrawn'
and starting_price_decimal IS NOT NULL

) z


group by z.sire_name
order by z.sire_name);


--

--   Create current 2yo turf runners with sire stats

select CONCAT(substr(tdr.scheduled_time, 9, 2),'-',substr(tdr.scheduled_time, 6, 2)) as Date,

substr(tdr.scheduled_time, 11, 6) as Time, tdr.course as Course,

tdr.cloth_number as 'No.',  Draw, tdr.name as Name,
case when tdr.forecast_price_decimal is NULL then 'Res' else tdr.forecast_price_decimal - 1 end as FSP,

CONCAT(substr(tdr.foaling_date, 9, 2),'-',substr(tdr.foaling_date, 6, 2)) as DOB,

Trainer,  Jockey,
tds.sire_name Sire, tds.Runners,
tds.Winners, tds.WinProfit, tds.WinPct, IFNULL(tds.AveWinDist,'-') AveWinDist,
tds.Placers, tds.PlaceProfit, tds.PlacePct, IFNULL(tds.AvePlaceDist,'-') AvePlaceDist

from today_2yoturf_races tdr

left join hist_2yoturf_sires tds using (sire_name)
order by tdr.scheduled_time,   tdr.course, tds.WinPct desc;

The notes in the query tell you what’s going on at every stage. Copy and paste this query into your favourite MySQL client – HeidiSQL, MySQL Workbench, or Sequel Pro on Mac – and after a few seconds you’ll have the top contenders for tomorrow’s two-year-old racing, too.
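
The place terms buried in the nested CASE expressions for PlaceProfit can be easier to follow outside SQL. The following is a minimal Python sketch that mirrors that logic as the query writes it, including its use of 16 runners as the threshold for places paid but 12 runners for the odds fraction. The function names are illustrative only and are not part of Smartform.

```python
def places_paid(num_runners, handicap):
    """Number of finishing positions treated as placed, as in the query."""
    if num_runners < 5:
        return 1  # win only
    if num_runners < 8:
        return 2
    if num_runners < 16:
        return 3
    return 4 if handicap else 3


def odds_divisor(num_runners, handicap):
    """Divisor applied to the win odds to get the place odds, as in the query."""
    if num_runners < 5:
        return 1
    if num_runners < 8:
        return 4
    if num_runners < 12:
        return 5
    return 4 if handicap else 5


def place_profit(sp_decimal, finish_position, num_runners, handicap):
    """Profit to a 1 point place stake: a share of the odds if placed, else -1."""
    placed = 1 <= finish_position <= places_paid(num_runners, handicap)
    if placed:
        return round((sp_decimal - 1) / odds_divisor(num_runners, handicap), 2)
    return -1


# Second at 5.0 in a 10 runner non-handicap: (5.0 - 1) / 5
print(place_profit(5.0, 2, 10, False))
```

Writing the two rules as separate functions makes it easy to see where the runner thresholds differ, which is harder to spot in the nested SQL.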


Loops With R – Creating a Racecard with Trainer and Jockey Stats

By Phill Clarke on Tuesday, June 19th, 2018

Yesterday we looked at how to create a function in order to easily run the same set of code multiple times, without having to manually edit the code every time. While this is a highly useful concept to understand, we’re still left with manually applying the trainer and jockey combinations for each runner. It remains a time-consuming task. Therefore, this article covers creating a basic for loop to iterate over the rows of a dataframe and apply a function to each of them.

There are many ways to loop over the rows of a dataframe, list or matrix in R. Some methods are more efficient than others, while some are perhaps more logical than others. Specifically this article demonstrates how to apply a for loop, which rightly receives some criticism for being slow to execute in certain circumstances. For our purposes, with a limited number of rows, this will not be a problem. However, the reader should investigate R’s apply family of functions and also the map function in the purrr library.

The full R code will be provided within the article, as there have been some useful changes made to code used previously. The complete code will also, as usual, be provided at the end of this article.

Begin by returning historic racing results, this time from the start of 2012 onwards (the date used in the query below), and also the runners from tomorrow’s race card:

# Load the library packages
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                 host='127.0.0.1', 
                 user='yourusername', 
                 password='yourpassword', 
                 dbname='smartform')

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

# Select relevant daily results for tomorrow
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.cloth_number,
              daily_runners.name,
              daily_runners.trainer_name,
              daily_runners.jockey_name,
              daily_runners.forecast_price_decimal
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-06-20'", sep="")

smartform_daily_results <- dbGetQuery(con, sql2)

dbDisconnect(con)

Next is the Trainer/Jockey function explained yesterday. However, the code will be broken into a few sections, as there have been some changes incorporated.

The function, as detailed previously:


# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_name))

  # Remove non-runners
  trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

  # Filter by jockey name
  trainer_jockey_filtered <- dplyr::filter(trainer_name_filtered, 
                                           grepl(jockey, jockey_name))

  # Filter by price
  trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_jockey_cumulative <- cumsum(
    ifelse(trainer_jockey_price_filtered$finish_position == 1, 
           (trainer_jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0

The last line above is new. This line is being used to catch any instances where the profit on turnover figure is of zero length. That means that the calculation has not been successful, usually because there were no runners for the combination of trainer and jockey.

Continuing with the function:


  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer_jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer_jockey$tj_runners <- runners
  trainer_jockey$tj_winners <- winners
  trainer_jockey$tj_sr <- strike_rate
  trainer_jockey$tj_pot <- profit_on_turnover
  trainer_jockey$tj_iv <- iv
  trainer_jockey$tj_ae <- ae
  trainer_jockey$tj_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer_jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

Once again, there are some new lines in the final part of the function above. The results are checked for NaN values, which again occur if a calculation has failed. It is not possible, for example, to calculate strike rate if there are no runners for the trainer and jockey combination. Error checking such as this will take some time to implement, but does save a lot of headaches later.
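
The same defensive idea can be sketched in Python, the language used in other articles on this blog. Python raises a ZeroDivisionError rather than returning NaN, so the hypothetical helper safe_ratio below returns zero up front when there are no runners. It is an illustration only and not part of the tj function.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    if denominator == 0:
        return 0.0
    return numerator / denominator


# A trainer and jockey combination with no runners at all
winners, runners = 0, 0
strike_rate = safe_ratio(winners, runners) * 100
print(strike_rate)
```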

The results are then transposed with t, from a long to a wide layout, the row names are removed and all results are rounded to two decimal places.

Now, we move on to the new section for this article. First, filter the current daily races for Group races only and also create an empty placeholder list.

# Filter tomorrow's races for Group races only
group_races_only <- dplyr::filter(smartform_daily_results,
                                  grepl(paste(c("Group 1", "Group 2", "Group 3"), collapse="|"), race_title))

# Create a placeholder list which will be required later
row <- list()

Then start the for loop, which essentially says: for every value of name, the column containing the horse’s name, apply the code which follows. This code includes extracting the trainer and jockey names, then executing the function defined earlier. Lastly, the data is iteratively added to the empty list and converted to a dataframe.


# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {

  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer and jockey names
  trainer = runner_details$trainer_name
  jockey = runner_details$jockey_name

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row[[i]] <- trainer_jockey_combo

  # Create a final dataframe
  stats_final <- as.data.frame(do.call("rbind", row)) 

}

As a final piece of code, we bind the new data and the general racing data from the Smartform database in a new variable called racecard. Viewing this racecard will now display the strike rate, profit on turnover, impact value, actual vs expected and confidence figure for every jockey and trainer combination in Group races on tomorrow’s race card. They are of course all at Royal Ascot.

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final)
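
For readers following along in Python, the loop-and-bind pattern above can be sketched with plain lists and dictionaries. Everything here is illustrative: the two runners are invented sample rows, and tj_stats is a hypothetical stand-in for the tj function that returns fixed placeholder figures rather than real statistics.

```python
def tj_stats(trainer, jockey):
    # Hypothetical stand-in for tj(): a real version would analyse historic results
    return {"tj_runners": 0, "tj_winners": 0, "tj_sr": 0.0}


# Made-up sample rows standing in for the group_races_only dataframe
group_races_only = [
    {"name": "Horse A", "trainer_name": "Trainer X", "jockey_name": "Jockey Y"},
    {"name": "Horse B", "trainer_name": "Trainer Z", "jockey_name": "Jockey W"},
]

racecard = []
for runner in group_races_only:
    # Look up the stats for this runner's trainer and jockey combination
    stats = tj_stats(runner["trainer_name"], runner["jockey_name"])
    # Merge race details and stats into one row, much like cbind() in R
    racecard.append({**runner, **stats})

for row in racecard:
    print(row)
```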

This data can now be reviewed for interesting angles. The screenshot below displays the data ordered descending from the highest A/E value. The first thing to notice about the top three entries is the very small sample size, with only three or four runs. However, the fourth entry for Cox and Kirby does have a robust sample size and some very good figures. Shades of Blue in the Queen Mary Stakes tomorrow at Royal Ascot is certainly worth a closer look. As are the Gosden and Dettori trio of Stream of Stars, Cracksman and Purser. This combination already struck three times today with Calyx, Without Parole and Monarch’s Glen.

Good luck!

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the library packages
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                 host='127.0.0.1', 
                 user='yourusername', 
                 password='yourpassword', 
                 dbname='smartform')

# Select relevant historic results
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2012-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

# Select relevant daily results for tomorrow
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.cloth_number,
              daily_runners.name,
              daily_runners.trainer_name,
              daily_runners.jockey_name,
              daily_runners.forecast_price_decimal
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-06-20'", sep="")

smartform_daily_results <- dbGetQuery(con, sql2)

dbDisconnect(con)

# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

  # Filter for flat races only
  flat_races_only <- dplyr::filter(smartform_results,
                                   race_type_id == 12 |
                                     race_type_id == 15)

  # Add an if else statement for the race_filter argument
  if (race_filter == "group"){

    filtered_races <- dplyr::filter(flat_races_only,
                                    group_race == 1 |
                                      group_race == 2 |
                                      group_race == 3 )
  } else {

    filtered_races = flat_races_only
  }

  # Filter by trainer name
  trainer_filtered <- dplyr::filter(filtered_races, 
                                    grepl(trainer, trainer_name))

  # Remove non-runners
  trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

  # Filter by jockey name
  trainer_jockey_filtered <- dplyr::filter(trainer_name_filtered, 
                                           grepl(jockey, jockey_name))

  # Filter by price
  trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                                 starting_price_decimal <= price_filter)

  #  Calculate Profit and Loss
  trainer_jockey_cumulative <- cumsum(
    ifelse(trainer_jockey_price_filtered$finish_position == 1, 
           (trainer_jockey_price_filtered$starting_price_decimal-1),
           -1)
  )

  # Calculate Strike Rate
  winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                finish_position == 1))

  runners <- nrow(trainer_jockey_price_filtered)

  strike_rate <- (winners / runners) * 100

  # Calculate Profit on Turnover or Yield
  profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

  # Check if POT is zero length to catch later errors
  if (length(profit_on_turnover) == 0) profit_on_turnover <- 0 

  # Calculate Impact Values
  # First filter all runners by price, to return those just starting at the price_filter or less
  all_runners <- nrow(dplyr::filter(filtered_races,
                                    starting_price_decimal <= price_filter))

  # Filter all winners by the price filter 
  all_winners <- nrow(dplyr::filter(filtered_races,
                                    finish_position == 1 &
                                      starting_price_decimal <= price_filter))

  # Now calculate the Impact Value
  iv <- (winners / all_winners) / (runners / all_runners)

  # Calculate Actual vs Expected ratio
  # # Convert all decimal odds to probabilities
  total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

  # Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
  ae <- winners / total_sp

  # Calculate Archie
  archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

  # Calculate the Confidence figure
  conf <- pchisq(archie, df = 1)*100

  # Create an empty variable
  trainer_jockey <- NULL

  # Add all calculated figures as named objects to the variable, which creates a list
  trainer_jockey$tj_runners <- runners
  trainer_jockey$tj_winners <- winners
  trainer_jockey$tj_sr <- strike_rate
  trainer_jockey$tj_pot <- profit_on_turnover
  trainer_jockey$tj_iv <- iv
  trainer_jockey$tj_ae <- ae
  trainer_jockey$tj_conf <- conf

  # Add an error check to convert all NaN values to zero
  final_results <- unlist(trainer_jockey)
  final_results[ is.nan(final_results) ] <- 0

  # Manipulate the layout of returned results to be a nice dataframe
  final_results <- t(as.data.frame(final_results))
  rownames(final_results) <- c()

  # 2 decimal places only
  round(final_results, 2)

  # Finally, close the function
}

# Filter tomorrow's races for Group races only
group_races_only <- dplyr::filter(smartform_daily_results,
                                  grepl(paste(c("Group 1", "Group 2", "Group 3"), collapse="|"), race_title))

# Create a placeholder list which will be required later
row <- list()

# Setup the loop
# For each horse in the group_races_only dataframe
for (i in group_races_only$name) {


  runner_details = group_races_only[group_races_only$name==i,]

  # Extract trainer and jockey names
  trainer = runner_details$trainer_name
  jockey = runner_details$jockey_name

  # Apply the Trainer/Jockey function for Group races only
  trainer_jockey_combo <- tj(race_filter = "group", 
                             trainer = trainer, 
                             jockey = jockey)

  # Add results row by row to the previously defined list
  row[[i]] <- trainer_jockey_combo

  # Create a final dataframe
  stats_final <- as.data.frame(do.call("rbind", row)) 

}

# Create a new variable called racecard. Bind together the generic race details with the newly created stats
racecard <- cbind(group_races_only,stats_final)


Creating Functions with R – using trainer and jockey combinations

By Phill Clarke on Monday, June 18th, 2018

In the previous article we looked at how to calculate some useful figures such as profit on turnover, impact values, actual vs expected, Archie and a confidence figure regarding how much luck was involved in the returned figures. The code demonstrated works well, but it is cumbersome to alter items such as the trainer, jockey or price filters.

Fortunately the R language supports creation of user-defined functions. A function is essentially a wrapper around a set of code routines, which are executed when the function is called at a later time. This makes it very easy to re-run the same code multiple times, using specific arguments to alter the results. Sounds difficult? It’s not really. Many of the R commands we’re already familiar with, such as dplyr::filter, are functions themselves. If you ever find yourself writing the same R code snippet more than three times in a larger program or script, think about how to create a function instead.

The goal of the function described in this article is to return a set of results for a specific trainer and jockey combination, with some additional argument options added.

The code examples again assume data has already been returned from the Smartform database and is contained in a variable called smartform_results. Also assumed is that part of the initial MySQL query was to limit results to those only since January 1st, 2016. Full R code will be provided at the end of this article.

The first step is to define the function name and arguments.

# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

The function is now named tj, for Trainer and Jockey. It has four defined arguments: a race filter, a price filter, and arguments for trainer and jockey. If no default value is defined for a function argument, the user must supply that argument when calling the function. This is the case for trainer and jockey above. However, default values for arguments may be specified. In the above code the race_filter has an empty default value, meaning all races, and the price_filter defaults to 1000, which is the maximum possible price on Betfair, thus also including all possible prices when applied as a less than or equal to filter.
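
The same idea of required and default arguments exists in most languages. As a rough comparison only, here is a Python sketch of a function signature with the same four arguments. The body simply echoes its arguments and is a placeholder, not a port of the R function. Note that Python requires arguments without defaults to be listed first, so the order differs from the R definition.

```python
def tj(trainer, jockey, race_filter="", price_filter=1000):
    """Placeholder body: echo the arguments so the defaults are visible."""
    return {
        "trainer": trainer,
        "jockey": jockey,
        "race_filter": race_filter,
        "price_filter": price_filter,
    }


# Only the required arguments are supplied, so both defaults apply
print(tj(trainer="A P O'Brien", jockey="R L Moore"))

# Leaving out a required argument fails immediately
try:
    tj(race_filter="group")
except TypeError as err:
    print("Missing required argument:", err)
```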

Now, we begin the remainder of our function, which is essentially the same code as previously, with some additional changes to account for the function arguments.

# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)

# Add an if/else statement for the race_filter argument
if (race_filter == "group"){

filtered_races <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )
} else {

  filtered_races = flat_races_only
}

The if else statement above is another new concept. It states that if the race_filter equals the word group, one set of code is applied; otherwise (else) a different set of code is run. In the case of the current function, only one race_filter is supported: either filter by Group races only or return results from all races. Additional race filters, such as class or age perhaps, could also be added to the function through additional else if branches.

The next block of code should be largely familiar from the previous article.

# Filter by trainer name
trainer_filtered <- dplyr::filter(filtered_races, 
                                         grepl(trainer, trainer_name))

# Remove non-runners
trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

# Filter by jockey name
trainer_jockey_filtered <- dplyr::filter(trainer_name_filtered, 
                                               grepl(jockey, jockey_name))

# Filter by price
trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                               starting_price_decimal <= price_filter)

The above lines now filter by the values provided in the arguments trainer, jockey and price_filter. If values for trainer and jockey are not provided by the user, the function will fail, because no defaults were supplied. The function will also fail if a name is supplied that does not match any values in the dataset. There is no error checking provided in this example code. The price_filter was provided with a default value of 1000 and therefore if the user does not define it, the function will return all values equal to or less than 1000.

The next blocks of code are once again very similar to that used previously when calculating the statistics we’re interested in.

#  Calculate Profit and Loss
trainer_jockey_cumulative <- cumsum(
  ifelse(trainer_jockey_price_filtered$finish_position == 1, 
         (trainer_jockey_price_filtered$starting_price_decimal-1),
         -1)
)

# Calculate Strike Rate
winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                           finish_position == 1))

runners <- nrow(trainer_jockey_price_filtered)


strike_rate <- (winners / runners) * 100

# Calculate Profit on Turnover or Yield
profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100


# Calculate Impact Values
# First filter all runners by price, to return those just starting at the price_filter or less
all_runners <- nrow(dplyr::filter(filtered_races,
                                        starting_price_decimal <= price_filter))

# Filter all winners by the price filter 
all_winners <- nrow(dplyr::filter(filtered_races,
                                        finish_position == 1 &
                                          starting_price_decimal <= price_filter))

# Now calculate the Impact Value
iv <- (winners / all_winners) / (runners / all_runners)

# Calculate Actual vs Expected ratio
# # Convert all decimal odds to probabilities
total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

# Calculate A/E by dividing the number of  winners, by the sum of all SP probabilities.
ae <- winners / total_sp

# Calculate Archie
archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100

That covers all the calculations. Now we return the results in a nice dataframe.

# Create an empty variable
trainer_jockey <- NULL

# Add all calculated figures as named objects to the variable, which creates a list
trainer_jockey$runners <- runners
trainer_jockey$winners <- winners
trainer_jockey$sr <- strike_rate
trainer_jockey$pot <- profit_on_turnover
trainer_jockey$iv <- iv
trainer_jockey$ae <- ae
trainer_jockey$conf <- conf

# Convert and return as a dataframe
as.data.frame(trainer_jockey)

# Finally, close the function
}

The last line here is very important and should not be forgotten. The curly bracket was used to start the function at the beginning, and therefore the matching closing curly bracket must be used at the end.

We now have a trainer/jockey function defined. How do we use it? Simply call the function, with defined arguments. Using the previous filters of Aidan O’Brien trained runners, ridden by Ryan Moore, in Group races and starting at a price of 4.0 or less, we do the following:

# Run the function with arguments and store in a results object
results <- tj(race_filter = "group", 
              price_filter = 4.0, 
              trainer = "A P O'Brien", 
              jockey = "R L Moore")

# Show results
results

  runners winners       sr      pot       iv       ae    conf
1     137      66 48.17518 13.62774 1.279968 1.069609 53.9223

This matches the results previously obtained when running through the code manually. Tomorrow in the Group 1 Queen Anne Stakes at Royal Ascot, Rhododendron is trained by Aidan O’Brien, ridden by Ryan Moore and is currently 3/1, thus matching the filters used in this function.
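
As a cross-check on the figures above, the strike rate, Archie and confidence values can be reproduced from the printed output alone. The Python sketch below assumes only the published values (137 runners, 66 winners and an A/E of 1.069609). It uses the identity that the chi-squared distribution function with one degree of freedom equals erf(sqrt(x / 2)), which stands in here for R's pchisq. The variable names are illustrative.

```python
import math

runners = 137
winners = 66
ae = 1.069609  # actual vs expected, taken from the printed results

# A/E is winners divided by expected winners, so expected winners can be recovered
expected = winners / ae

# Strike rate as a percentage
strike_rate = winners / runners * 100

# Archie, using the same formula as the R code
archie = (runners * (winners - expected) ** 2) / (expected * (runners - expected))

# Chi-squared CDF with 1 degree of freedom, equivalent to pchisq(archie, df = 1)
conf = math.erf(math.sqrt(archie / 2)) * 100

print(round(strike_rate, 2), round(archie, 3), round(conf, 1))
```

Run as written, the strike rate and confidence figure should agree with the sr and conf columns of the first results table, to within the rounding of the printed A/E value.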

Now the function is defined, it is easy to start looking at alternative filter sets, without having to manually adjust any code. Some examples are outlined below:

# No price filter, which works because a default of 1000 was defined in the function
results_no_price <- tj(race_filter = "group", 
                       trainer = "A P O'Brien", 
                       jockey = "R L Moore")

  runners winners       sr      pot       iv       ae     conf
1     246      78 31.70732 -3.04878 2.825292 1.004775 4.056167

# All races, not just Group, with a price filter of 4.0
results_all_races <- tj(price_filter = 4.0, 
                        trainer = "A P O'Brien", 
                        jockey = "R L Moore")

  runners winners       sr      pot       iv        ae     conf
1     227     100 44.05286 2.718062 1.343485 0.9936154 6.841419

# All races and no price filter for this trainer and jockey combination
results_all_races_no_price <- tj(trainer = "A P O'Brien", 
                                 jockey = "R L Moore")

  runners winners       sr       pot       iv        ae     conf
1     387     119 30.74935 -8.183463 2.919438 0.9519335 48.63522

Also keep in mind these results are only since January 1st, 2016 as this filter was previously defined in the original SQL query. Hopefully, it should be reasonably clear how to add a date argument to extend this function.

Finally, if we wanted to investigate alternative trainer and jockey combinations, this is also quite easy now the function is already defined.

# David Simcock and Oisin Murphy together in Group races
simcock_murphy <- tj(race_filter = "group", 
                    trainer = "D M Simcock", 
                    jockey = "Oisin Murphy")

  runners winners      sr       pot        iv        ae     conf
1      21       2 9.52381 -66.66667 0.8486226 0.9350056 7.989577

David Simcock and Oisin Murphy team up with Lightning Spear, also in the Queen Anne at Ascot.

Good luck at the big meeting tomorrow!

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the library packages
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2016-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

dbDisconnect(con)

# Name the function and add some arguments
tj <- function(race_filter = "", price_filter = 1000, trainer, jockey){

# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)

# Add an if else statement for the race_filter argument
if (race_filter == "group"){

filtered_races <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )
} else {

  filtered_races = flat_races_only
}

# Filter by trainer name
trainer_filtered <- dplyr::filter(filtered_races, 
                                         grepl(trainer, trainer_name))

# Remove non-runners
trainer_name_filtered <- dplyr::filter(trainer_filtered, !is.na(finish_position))

# Filter by jockey name
trainer_jockey_filtered <- dplyr::filter(trainer_name_filtered, 
                                               grepl(jockey, jockey_name))

# Filter by price
trainer_jockey_price_filtered <- dplyr::filter(trainer_jockey_filtered,
                                               starting_price_decimal <= price_filter)

#  Calculate Profit and Loss
trainer_jockey_cumulative <- cumsum(
  ifelse(trainer_jockey_price_filtered$finish_position == 1, 
         (trainer_jockey_price_filtered$starting_price_decimal-1),
         -1)
)

# Calculate Strike Rate
winners <- nrow(dplyr::filter(trainer_jockey_price_filtered,
                                           finish_position == 1))

runners <- nrow(trainer_jockey_price_filtered)

strike_rate <- (winners / runners) * 100

# Calculate Profit on Turnover or Yield
profit_on_turnover <- (tail(trainer_jockey_cumulative, n=1) / runners) * 100

# Calculate Impact Values
# First filter all runners by price, to return those just starting at the price_filter or less
all_runners <- nrow(dplyr::filter(filtered_races,
                                        starting_price_decimal <= price_filter))

# Filter all winners by the price filter 
all_winners <- nrow(dplyr::filter(filtered_races,
                                        finish_position == 1 &
                                          starting_price_decimal <= price_filter))

# Now calculate the Impact Value
iv <- (winners / all_winners) / (runners / all_runners)

# Calculate Actual vs Expected ratio
# Convert all decimal odds to probabilities
total_sp <- sum(1/trainer_jockey_price_filtered$starting_price_decimal)

# Calculate A/E by dividing the number of winners by the sum of all SP probabilities.
ae <- winners / total_sp

# Calculate Archie
archie <- (runners * (winners  - total_sp)^2)/ (total_sp  * (runners - total_sp))

# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100

# Create an empty variable
trainer_jockey <- NULL

# Add all calculated figures as named objects to the variable, which creates a list
trainer_jockey$runners <- runners
trainer_jockey$winners <- winners
trainer_jockey$sr <- strike_rate
trainer_jockey$pot <- profit_on_turnover
trainer_jockey$iv <- iv
trainer_jockey$ae <- ae
trainer_jockey$conf <- conf

# Convert and return as a dataframe
as.data.frame(trainer_jockey)

# Finally, close the function
}


Further Calculations using R to analyse the performance of jockeys and trainers

By Phill Clarke on Friday, June 1st, 2018

Last week we explored some visual ways in which to analyse performance. This included representations such as line charts, scatterplots and regression fits. This week we examine some mathematical and statistical approaches to expand on the simple calculations of strike rate and profit & loss.

A positive profit & loss is obviously critical in making money from a series of bets. A negative profit & loss largely renders everything else irrelevant. However, a simple positive profit & loss calculation still does not expose the full story. Imagine a positive profit of 20 points, from a series of 1000 bets. Is that a good or bad performance? A calculation of Profit on Turnover (POT) or Yield can assist.

Again, we’ll use the combination of Aidan O’Brien and Ryan Moore, in Group races only, since the beginning of 2016, with a filter of a starting price less than or equal to 4.00, as the example dataset. The code examples again assume data has already been returned from the Smartform database and is contained in a variable called smartform_results. It is also assumed that the initial MySQL query limited results to those since January 1st, 2016. Full R code is provided at the end of this article.

# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)
# Filter for Group races only
group_races_only <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )

# Filter for Aidan O'Brien runners only
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Remove non-runners
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

# Filter for Ryan Moore rides only
obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only, 
                                               grepl("R L Moore", jockey_name))

# Filter for Starting Prices of 4.00 or less
obrien_moore_group_races_only_price_filter <- dplyr::filter(obrien_moore_group_races_only,
                                                            starting_price_decimal <= 4.0)

# Calculate Profit and Loss
obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only_price_filter$finish_position == 1, (obrien_moore_group_races_only_price_filter$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only_price_filter$cumulative <- obrien_moore_cumulative  

# Calculate Strike Rate
obrien_moore_winners <- nrow(dplyr::filter(obrien_moore_group_races_only_price_filter,
                                   finish_position == 1))

obrien_moore_runners <- nrow(obrien_moore_group_races_only_price_filter)


strike_rate <- (obrien_moore_winners / obrien_moore_runners) * 100

Detailed explanations of the above calculations have been provided in previous articles. The P&L for the combination examined now stands at 19.67 points, with a Strike Rate of 48.52%.

How good is this in the context of the total amount wagered? Next, calculate Profit on Turnover (POT) or Yield. This is simply the total cumulative profit, divided by the total number of runners (which, at a level stake of one point per runner, is the total amount wagered) and multiplied by 100 to return a percentage.

# Calculate POT
profit_on_turnover <- (tail(obrien_moore_cumulative, n=1) / obrien_moore_runners) * 100

This returns a POT of 14.46%, which is a pretty reasonable figure. A return of almost 15% on an investment would certainly keep many people very happy.
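To make the arithmetic concrete, here is a minimal sketch using a handful of made-up bets rather than Smartform data. The data frame bets and its values are purely hypothetical; only the column names follow the ones used above, and a level stake of one point per runner is assumed.

```r
# Hypothetical results for five one-point bets (not Smartform data)
bets <- data.frame(
  finish_position        = c(1, 3, 1, 2, 5),
  starting_price_decimal = c(3.0, 2.5, 4.0, 1.8, 3.5)
)

# A winner returns (SP - 1) points; every other runner loses the one-point stake
returns <- ifelse(bets$finish_position == 1,
                  bets$starting_price_decimal - 1,
                  -1)

# Total profit is the last value of the cumulative sum
total_profit <- tail(cumsum(returns), n = 1)

# Profit on Turnover: profit divided by the number of one-point bets, as a percentage
profit_on_turnover <- (total_profit / nrow(bets)) * 100

print(total_profit)
print(profit_on_turnover)
```

Two winners at 3.0 and 4.0 return 2 and 3 points, and the three losers cost 3 points, so the profit is 2 points from 5 points staked.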

Moving on, there are some other calculations which can assist with providing additional clarity to the overall picture.

The first of these is Impact Value (IV). This measure helps to ascertain whether a specific combination of factors returns winners at a higher rate than the rate of winners which did not meet the specific criteria being examined. In our case, we are looking at the performance of Aidan O’Brien trained and Ryan Moore ridden Group runners, with a price filter, versus those Group runners, with the same price filter, who were not trained by Aidan O’Brien and ridden by Ryan Moore. It is worth keeping in mind that IVs are only an indication of the rate of winners, and do not take price into account.

In order to calculate Impact Values, the ratio of the filtered winners to all winners is divided by the ratio of filtered runners to all runners. In our specific case, this is the ratio of O’Brien and Moore winners to all Group winners, starting at a price equal to or less than 4.00, divided by the ratio of O’Brien and Moore runners to all Group runners, with the same price filter applied.

An Impact Value of greater than 1.0 indicates that the filtered angle in question is outperforming the runners in the dataset who did not meet our filtering criteria.

# Calculate Impact Values
# First filter all Group runners by price, to return those just starting at 4.00 or less
all_group_runners <- nrow(dplyr::filter(group_races_only,
                                        starting_price_decimal <= 4.0))

# Filter all Group winners by the 4.00 price limit                          
all_group_winners <- nrow(dplyr::filter(group_races_only,
                                        finish_position == 1 &
                                          starting_price_decimal <= 4.0))

# Now calculate the Impact Value
iv <- (obrien_moore_winners / all_group_winners) / (obrien_moore_runners / all_group_runners)

An IV of 1.29 is returned. This is a very healthy result and indicates that the O’Brien and Moore combination generally returns a higher ratio of winners than that of all Group runners, also filtered by the 4.00 or less price.
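As a quick illustration of the formula, the sketch below works through an Impact Value with round numbers. All four counts are invented for the example and are not taken from the Smartform data.

```r
# Hypothetical counts, chosen only to illustrate the arithmetic
angle_winners <- 30    # winners that match the angle being tested
angle_runners <- 60    # runners that match the angle being tested
all_winners   <- 200   # winners in the whole price-filtered dataset
all_runners   <- 500   # runners in the whole price-filtered dataset

# Share of all winners divided by share of all runners
iv <- (angle_winners / all_winners) / (angle_runners / all_runners)

# Equivalent view: the angle's strike rate divided by the overall strike rate
iv_alt <- (angle_winners / angle_runners) / (all_winners / all_runners)

print(iv)
print(iv_alt)
```

Here the angle supplies 15% of the winners from only 12% of the runners, which is the same as a 50% strike rate against an overall 40%. Both routes give the same Impact Value.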

Finding winners is one thing, making a profit may be something entirely different. Therefore, there is a further calculation to examine. Now we’ll look at the ratio of Actual vs Expected (A/E) winners, based on probabilities calculated from starting price. This figure will help to inform whether the filtered combination is outperforming market expectations, based on Starting Price probabilities.

A/E is calculated by dividing the number of winners from the filtered dataset by the sum of all win probabilities. Decimal starting prices first need to be converted to probabilities. An individual starting price is converted by taking its reciprocal, i.e. 1/Decimal SP, so a decimal SP of 4.0 gives 1/4.0 = 0.25.

# Calculate Actual vs Expected ratio
# Convert all decimal odds to probabilities
total_sp <- sum(1/obrien_moore_group_races_only_price_filter$starting_price_decimal)

# Calculate A/E by dividing the number of all O'Brien and Moore winners by the sum of all SP probabilities.
obrien_moore_ae <- obrien_moore_winners / total_sp

An A/E figure of 1.07 is returned. Once again, any figure above 1.00 should be viewed positively. If the return is greater than 1.00 it essentially means the filtered dataset is outperforming the market’s expectations. Or, to put it another way, a figure above 1.00 means that the filtered selections win more often than their probabilities (odds) indicate they should. Using a combination like Aidan O’Brien and Ryan Moore it may be somewhat surprising that they perform in excess of market expectations, especially given the regression line seen in last week’s scatterchart, but the A/E figure shows this to be true.
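The same calculation can be followed by hand on a very small example. The four starting prices and results below are hypothetical and chosen so the sums are easy to check.

```r
# Hypothetical starting prices and results for four runners
sp  <- c(2.0, 4.0, 2.0, 4.0)
won <- c(TRUE, FALSE, TRUE, FALSE)

# Convert each decimal SP to its implied win probability
implied_probability <- 1 / sp

# Expected winners is the sum of the implied probabilities
expected_winners <- sum(implied_probability)

# Actual winners divided by expected winners
actual_winners <- sum(won)
ae <- actual_winners / expected_winners

print(expected_winners)
print(ae)
```

The market implies 0.5 + 0.25 + 0.5 + 0.25 = 1.5 winners from these four runners. Two actually won, so the A/E is 2 / 1.5, or about 1.33.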

Finally, the last computation included in today’s article is known as Archie. If you are statistically minded, Archie is a variation on the chi-squared goodness of fit calculation. In layman’s terms, Archie provides a figure showing how far the results obtained from a set of data can be attributed to luck: the higher the Archie figure, the less likely the results are down to pure luck. As a rough guide, an Archie figure of 1.00 still means there is a 32% likelihood the results obtained were due to chance.

I also like to convert the Archie result to a percentage Confidence figure. In this case also, the higher the Confidence figure, the less likely the results are only down to luck.

# Calculate Archie
archie <- (obrien_moore_runners * (obrien_moore_winners  - total_sp)^2)/ (total_sp  * (obrien_moore_runners - total_sp))

# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100

In the case of Group race runners, starting at a price of 4.00 or less, trained by Aidan O’Brien and ridden by Ryan Moore, the Confidence figure is almost 60%. This is reasonable, but not particularly strong.

It is worth keeping in mind that Archie and Confidence figures can change rapidly when the sample size is small. Our filtered dataset in this case contains only 136 runners, which is not particularly large.
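The effect of sample size can be seen with a short sketch. The helper archie_score and the runner, winner and expected winner counts below are hypothetical, introduced only for illustration; the formula is the same one used above.

```r
# Hypothetical helper wrapping the Archie formula used in this article
archie_score <- function(runners, winners, expected_winners) {
  (runners * (winners - expected_winners)^2) /
    (expected_winners * (runners - expected_winners))
}

# Likelihood (as a percentage) that an Archie of 1.00 is down to chance: roughly 32%
chance_pct <- (1 - pchisq(1.00, df = 1)) * 100

# Two invented samples with the same A/E of 1.10 but different sizes
archie_small <- archie_score(runners = 100,  winners = 55,  expected_winners = 50)
archie_large <- archie_score(runners = 1000, winners = 550, expected_winners = 500)

# Convert both to Confidence figures
conf_small <- pchisq(archie_small, df = 1) * 100
conf_large <- pchisq(archie_large, df = 1) * 100

print(c(archie_small, archie_large))
print(c(conf_small, conf_large))
```

Both samples beat market expectations by the same 10%, yet the smaller sample gives an Archie of 1 and the larger an Archie of 10, so the Confidence figure is far higher for the larger sample.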

Further details concerning all these calculations – POT/Yield, IV, A/E and Archie/Confidence – can always be found through Internet searches using popular search engines.

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the RMySQL and dplyr library packages
library("RMySQL")
library("dplyr")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2016-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

dbDisconnect(con)

# Filter for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)
# Filter for Group races only
group_races_only <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )

# Filter for Aidan O'Brien runners only
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Remove non-runners
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

# Filter for Ryan Moore rides only
obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only, 
                                               grepl("R L Moore", jockey_name))

# Filter for Starting Prices of 4.00 or less
obrien_moore_group_races_only_price_filter <- dplyr::filter(obrien_moore_group_races_only,
                                                            starting_price_decimal <= 4.0)

# Calculate Profit and Loss
obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only_price_filter$finish_position == 1, (obrien_moore_group_races_only_price_filter$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only_price_filter$cumulative <- obrien_moore_cumulative  

# Calculate Strike Rate
obrien_moore_winners <- nrow(dplyr::filter(obrien_moore_group_races_only_price_filter,
                                   finish_position == 1))

obrien_moore_runners <- nrow(obrien_moore_group_races_only_price_filter)


strike_rate <- (obrien_moore_winners / obrien_moore_runners) * 100

# Calculate POT
profit_on_turnover <- (tail(obrien_moore_cumulative, n=1) / obrien_moore_runners) * 100

# Calculate Impact Values
# First filter all Group runners by price, to return those just starting at 4.00 or less
all_group_runners <- nrow(dplyr::filter(group_races_only,
                                        starting_price_decimal <= 4.0))

# Filter all Group winners by the 4.00 price limit                          
all_group_winners <- nrow(dplyr::filter(group_races_only,
                                        finish_position == 1 &
                                          starting_price_decimal <= 4.0))

# Now calculate the Impact Value
iv <- (obrien_moore_winners / all_group_winners) / (obrien_moore_runners / all_group_runners)

# Calculate Actual vs Expected ratio
# Convert all decimal odds to probabilities
total_sp <- sum(1/obrien_moore_group_races_only_price_filter$starting_price_decimal)

# Calculate A/E by dividing all O'Brien and Moore winners by the sum of all SP probabilities.
obrien_moore_ae <- obrien_moore_winners / total_sp

# Calculate Archie
archie <- (obrien_moore_runners * (obrien_moore_winners  - total_sp)^2)/ (total_sp  * (obrien_moore_runners - total_sp))

# Calculate the Confidence figure
conf <- pchisq(archie, df = 1)*100


O’Brien and Moore in Group Races – Scatterplots with R

By Phill Clarke on Saturday, May 26th, 2018

Last week we looked at the performance of Aidan O’Brien trained horses running in Group races. As an exercise for the reader it was suggested to look at the combination of both Aidan O’Brien and Ryan Moore in Group races. The result was that this combination showed an overall profit since 2007.

The R code was provided at the end of the article and this will be used as the starting point today. Rather than show all code during the article, it is assumed the reader can now connect to the Smartform MySQL database and retrieve basic data. Nonetheless, the full R code for today’s investigations will be provided at the end of the article.

Starting with O’Brien and Moore in all Group races since 2007.

# Variable smartform_results contains data retrieved from the Smartform MySQL database
# Filter SQL results for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)
# Filter for Group races only
group_races_only <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )

# Filter for Aidan O'Brien runners only
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Remove non-runners
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

# Filter for Ryan Moore rides only
obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only, 
                                               grepl("R L Moore", jockey_name))

# Calculate Profit and Loss
obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only$finish_position == 1, (obrien_moore_group_races_only$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only$cumulative <- obrien_moore_cumulative

# Convert the meeting_date column to Date type
obrien_moore_group_races_only$meeting_date <- as.Date(obrien_moore_group_races_only$meeting_date)

# Plot the results
ggplot(data=obrien_moore_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

The chart and profit and loss calculations indicate a small profit of 10.31 points for a one unit stake across all 377 runners in the dataset. Decent enough; however, we should always ask if this is the full story.

One way of investigating further is to create a scatter plot of the Starting Price for all O’Brien and Moore winners. The ggplot library is again used for this. There are many very powerful features of this charting tool and a wide number of tutorials available online.

# Filter for winners only
obrien_moore_group_races_winners_only <- dplyr::filter(obrien_moore_group_races_only,
                                                       finish_position == 1)

# Scatter plot of winning prices for all O'Brien and Moore winners
ggplot(obrien_moore_group_races_winners_only , aes(x=meeting_date, y=starting_price_decimal)) +
  geom_point(colour="blue") +
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80")) +
  geom_smooth(method=loess,
              color="darkred")

This very interesting chart includes a loess regression line with confidence intervals. The line indicates that the price of O’Brien and Moore winners has been reducing over time, although it also appears that the confidence interval narrows (a good thing!) as the number of rides increases. The slight increase in confidence interval in 2018 is due to the fact the season is not yet half completed and the overall number of runners for this combination is low so far this year.

The chart also clearly shows that the overall Profit and Loss is probably skewed by the big priced winners in 2009 and 2012. If these two rides were removed, the combination would show an overall large loss.
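One way to test how much a result leans on a big priced winner is to recompute the profit and loss with that run removed. The sketch below does this on a small invented dataset; the data frame results, the helper pnl and the choice to drop a single winner are all assumptions for illustration, not part of the original analysis.

```r
# Hypothetical runs, including one big priced winner at 26.0
results <- data.frame(
  finish_position        = c(1, 2, 1, 4, 3, 1, 2, 5),
  starting_price_decimal = c(2.0, 3.0, 26.0, 5.0, 2.5, 1.8, 4.0, 6.0)
)

# Hypothetical helper: level-stake profit and loss for a set of runs
pnl <- function(df) {
  sum(ifelse(df$finish_position == 1, df$starting_price_decimal - 1, -1))
}

overall_pnl <- pnl(results)

# Find the rows of the biggest priced winners
n_to_drop   <- 1
winner_rows <- which(results$finish_position == 1)
by_price    <- order(results$starting_price_decimal[winner_rows], decreasing = TRUE)
drop_rows   <- winner_rows[by_price][seq_len(n_to_drop)]

# Recalculate without those runs
trimmed     <- results[setdiff(seq_len(nrow(results)), drop_rows), ]
trimmed_pnl <- pnl(trimmed)

print(overall_pnl)
print(trimmed_pnl)
```

In this made-up example a healthy looking profit turns into a loss once the single 26.0 winner is excluded, which is exactly the kind of dependence worth checking for.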

We could also plot every individual runner for this combination, with a separate regression line for both winners and all other runners. In the chart below, winners are the blue points and all other runners are orange.

# Scatter plot of starting prices for all O'Brien and Moore runners
ggplot(obrien_moore_group_races_only , aes(x=meeting_date, y=starting_price_decimal, color=finish_position == 1)) +
  geom_point() +
  scale_x_date(labels = date_format("%Y-%m-%d"), 
               date_breaks="12 months") +
  scale_y_continuous(breaks= seq(0,35,by=2)) +
  theme_tufte(base_family="serif", 
              base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"),
        legend.position="top") +
  scale_color_manual(values=c("#FF9933","#000CCC")) +
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

This chart shows that there are tight clusters of winners and runners in 2016 and 2017. We can also see that the starting price of winners has been contracting faster than the overall price of all starters for this combination. As bettors, we should always keep in mind the value proposition of a bet.

Next, we could filter the data and examine just runners from the 2016 season onwards, with a starting price of 4.00 or less.

# Filter the data for starters at an SP of 4.0 or less and since 2016 only
obrien_moore_group_races_only_price_filter <- dplyr::filter(obrien_moore_group_races_only,
                                                            starting_price_decimal <= 4.0 &
                                                              meeting_date >= "2016-01-01")

# Calculate profit and loss
obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only_price_filter$finish_position == 1, (obrien_moore_group_races_only_price_filter$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only_price_filter$cumulative <- obrien_moore_cumulative

# Plot the results as a line chart
ggplot(data=obrien_moore_group_races_only_price_filter, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

# Calculate strike rate     
winners_only <- nrow(dplyr::filter(obrien_moore_group_races_only_price_filter,
                                    finish_position == 1))

runners <- nrow(obrien_moore_group_races_only_price_filter)

strike_rate <- (winners_only / runners) * 100

We now have a very healthy 22.67 profit from 133 runners over the last two seasons, with a strike rate of 49.62%. This seems like quite a nice angle into Group races where Aidan O’Brien has a horse ridden by Ryan Moore.

The chart shows that 2016 was a bit slow to begin with, but then picked up nicely from around June onwards. 2017 was a very strong year, and 2018 seems to be off to a decent start.

There are many other angles which could be investigated. For example, we could filter for:

Group 1 races only
Odds on runners only
Odds between 2.0 and 4.0
Odds less than 10.0 only
A distance filter
A UK vs Ireland scatter chart and filter

There are many, many different ways of investigating further.
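A few of these filters can be sketched as follows. The data frame runs is a small hypothetical sample with the same column names as the Smartform results, and base R's subset() is used here instead of dplyr::filter() simply to keep the example free of package dependencies.

```r
# Hypothetical sample of runs (not Smartform data)
runs <- data.frame(
  group_race             = c(1, 2, 1, 3, 1),
  starting_price_decimal = c(1.8, 3.5, 6.0, 2.0, 12.0),
  finish_position        = c(1, 2, 1, 4, 3)
)

# Group 1 races only
group1_only <- subset(runs, group_race == 1)

# Odds on runners only, i.e. a decimal price below 2.0
odds_on_only <- subset(runs, starting_price_decimal < 2.0)

# Odds between 2.0 and 4.0 inclusive
mid_priced <- subset(runs, starting_price_decimal >= 2.0 &
                             starting_price_decimal <= 4.0)

# Odds less than 10.0 only
under_ten <- subset(runs, starting_price_decimal < 10.0)

print(nrow(group1_only))
print(nrow(odds_on_only))
print(nrow(mid_priced))
print(nrow(under_ten))
```

Each filtered data frame can then be passed through the same profit and loss and strike rate calculations shown earlier.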

Today’s racing includes the Irish 2000 Guineas at the Curragh, as well as some other Group races on the same card. Are there any O’Brien and Moore qualifiers?

There are three qualifying horses – US Navy Flag in the Irish 2000 Guineas, Merchant Navy in the Group 2 Greenland Stakes and Hydrangea in the Group 2 Lanwades Stud Stakes, all at prices less than 3/1.
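Since the filter above works on decimal prices while racecourse prices are usually quoted as fractions, a small conversion helps when checking qualifiers by hand. The function fractional_to_decimal below is a hypothetical helper written for this example, assuming prices are supplied as strings such as "3/1".

```r
# Hypothetical helper: convert fractional prices such as "3/1" to decimal prices
fractional_to_decimal <- function(prices) {
  sapply(strsplit(prices, "/", fixed = TRUE),
         function(parts) as.numeric(parts[1]) / as.numeric(parts[2]) + 1)
}

# 3/1 is the fractional equivalent of the 4.00 decimal limit used in the filter
decimal_prices <- fractional_to_decimal(c("3/1", "5/2", "4/5"))

print(decimal_prices)
```

The decimal price is the fractional price plus the returned one-point stake, so 3/1 becomes 4.0, 5/2 becomes 3.5 and 4/5 becomes 1.8.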

Are all three worth a bet? There’s always something else to consider. Why did Ryan Moore’s number of rides and winners increase markedly from 2016 onwards? In March 2016 Joseph O’Brien announced his retirement from race riding. Ryan Moore most likely then picked up a number of high quality rides which would have otherwise gone to Joseph.

In the 2000 Guineas at Newmarket a few weeks ago, another of Aidan O’Brien’s sons, Donnacha, had a winning ride on Saxon Warrior. This may have only been because Moore was otherwise engaged for Ballydoyle at the Kentucky Derby. What are the current internal dynamics at the stable? Has Donnacha’s success elevated him in the pecking order?

Donnacha O’Brien rides Gustav Klimt in today’s Irish 2000 Guineas. How important is this race for the stable? Is it an opportunity for Donnacha to ride another Group 1 winner? Of the three qualifying horses today, is it worth considering not betting on US Navy Flag? Racing is never an easy or straightforward game.

Questions and queries about this article should be posted as a comment below or on the Betwise Q&A board.

The full R code used in this article is found below.

# Load the RMySQL, dplyr and charting library packages
library("RMySQL")
library("dplyr")
library("ggplot2")
library("ggthemes")
library("scales")

# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2006-01-01'", sep="")

smartform_results <- dbGetQuery(con, sql1)

dbDisconnect(con)

# Filter SQL results for flat races only
flat_races_only <- dplyr::filter(smartform_results,
                                 race_type_id == 12 |
                                   race_type_id == 15)
# Filter for Group races only
group_races_only <- dplyr::filter(flat_races_only,
                                  group_race == 1 |
                                    group_race == 2 |
                                    group_race == 3 )

# Filter for Aidan O'Brien runners only
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Remove non-runners
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

# Filter for Ryan Moore rides only
obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only, 
                                               grepl("R L Moore", jockey_name))

# Calculate Profit and Loss
obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only$finish_position == 1, (obrien_moore_group_races_only$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only$cumulative <- obrien_moore_cumulative

# Convert the meeting_date column to Date type
obrien_moore_group_races_only$meeting_date <- as.Date(obrien_moore_group_races_only$meeting_date)

# Plot the results
ggplot(data=obrien_moore_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

# Filter for winners only
obrien_moore_group_races_winners_only <- dplyr::filter(obrien_moore_group_races_only,
                                                       finish_position == 1)

# Scatter plot of winning prices for all O'Brien and Moore winners
ggplot(obrien_moore_group_races_winners_only , aes(x=meeting_date, y=starting_price_decimal)) +
  geom_point(colour="blue") +
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80")) +
  geom_smooth(method=loess,
              color="darkred")

# Scatter plot of starting prices for all O'Brien and Moore runners
ggplot(obrien_moore_group_races_only , aes(x=meeting_date, y=starting_price_decimal, color=finish_position == 1)) +
  geom_point() +
  scale_x_date(labels = date_format("%Y-%m-%d"), 
               date_breaks="12 months") +
  scale_y_continuous(breaks= seq(0,35,by=2)) +
  theme_tufte(base_family="serif", 
              base_size = 14) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"),
        legend.position="top") +
  scale_color_manual(values=c("#FF9933","#000CCC")) +
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

# Filter the data for starters at an SP of 4.0 or less and since 2016 only
obrien_moore_group_races_only_price_filter <- dplyr::filter(obrien_moore_group_races_only,
                                                            starting_price_decimal <= 4.0 &
                                                              meeting_date >= "2016-01-01")

# Calculate profit and loss
obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only_price_filter$finish_position == 1, (obrien_moore_group_races_only_price_filter$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only_price_filter$cumulative <- obrien_moore_cumulative

# Plot the results as a line chart
ggplot(data=obrien_moore_group_races_only_price_filter, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

# Calculate strike rate     
winners_only <- nrow(dplyr::filter(obrien_moore_group_races_only_price_filter,
                                    finish_position == 1))

runners <- nrow(obrien_moore_group_races_only_price_filter)

strike_rate <- (winners_only / runners) * 100


Calculating Profit and Loss from Historic Data using R

By Phill Clarke on Saturday, May 19th, 2018

Previously we looked at how to calculate the strike rate for jockeys, specifically in relation to those riding in the 1000 Guineas. The result of that race showed how strike rates are really only part of the overall picture. Sean Levey had the lowest historic strike rate for Group races of all those performing in the 1000 Guineas, yet rode a convincing winner. If we had looked at trainer strike rates, it may have been a different story.

This article will now look at trainer strike rates, but also specifically calculate the profit and loss (P&L) to show what would happen if one had backed every runner for a specific trainer at starting price (SP) in all Group races.

We start much as before, returning historic data from the Smartform database and calculating trainer strike rates.

# Load the RMySQL and dplyr library packages
library("RMySQL")
library("dplyr")

# Execute an SQL command to return some historic data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the historic_races and historic_runners tables, joining by unique race_id and filtering for results only since January 1st, 2006. The SQL query is saved in a variable called sql1.
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2006-01-01'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_results.
smartform_results <- dbGetQuery(con, sql1)

# Close the database connection, which is good practice
dbDisconnect(con)

# Race type IDs 12 and 15 correspond to Flat and All Weather races
flat_races_only <- dplyr::filter(smartform_results,
                                  race_type_id == 12 |
                                   race_type_id == 15)

# Filter using the group_race field for Group 1, 2 and 3 races only
group_races_only <- dplyr::filter(flat_races_only,
                                group_race == 1 |
                                  group_race == 2 |
                                  group_race == 3 )

# For each trainer name, count the number of runs
trainer_group_runs <- group_races_only %>% count(trainer_name)

# Rename the second column in trainer_group_runs to something more logical
names(trainer_group_runs)[2]<-"group_runs"

# Now filter for only winning runs
group_winners_only <- dplyr::filter(group_races_only,
                                    finish_position == 1)

# For each trainer, count the number of winning runs
trainer_group_wins <- group_winners_only %>% count(trainer_name)

# Rename the second column in trainer_group_wins to something more logical
names(trainer_group_wins)[2]<-"group_wins"

# Join the two dataframes, trainer_group_runs and trainer_group_wins together, using trainer_name as a key
trainer_group_data <- dplyr::full_join(trainer_group_runs, trainer_group_wins, by = "trainer_name")


# Replace all NA values in the new dataframe with zero
# If a trainer has not had a group winner, the group_wins field will be NA
# If this is not changed to zero, later calculations will fail
trainer_group_data[is.na(trainer_group_data)] <- 0

# Now calculate the Group race strike rate for all trainers
trainer_group_data$strike_rate <- (trainer_group_data$group_wins / trainer_group_data$group_runs) * 100

The variable trainer_group_data now contains the strike rate for every trainer who has had a runner in a flat or all weather Group race since January 1st, 2006.
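For readers following the Python posts on this blog, the same counting logic can be sketched with the standard library alone. This is only an illustration under stated assumptions: the sample rows are invented (not Smartform data), and the function name strike_rates is hypothetical. Only the meaning of trainer_name and finish_position is taken from the query above.

```python
from collections import Counter

def strike_rates(rows):
    """Return {trainer: strike rate in %} from (trainer_name, finish_position) rows."""
    runs = Counter(trainer for trainer, _ in rows)
    wins = Counter(trainer for trainer, pos in rows if pos == 1)
    # A Counter returns 0 for trainers with no winners, which plays the
    # same role as replacing NA with zero after the full join
    return {trainer: wins[trainer] / runs[trainer] * 100 for trainer in runs}

# Invented sample rows for illustration only
sample_rows = [
    ("Trainer A", 1), ("Trainer A", 4), ("Trainer A", 2), ("Trainer A", 1),
    ("Trainer B", 3), ("Trainer B", 6),
]
rates = strike_rates(sample_rows)
print(rates)
```

A trainer with runs but no wins still appears in the result with a strike rate of zero, which is the behaviour the full join plus NA replacement achieves in the R code.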

The Group 1 Lockinge Stakes is run today at Newbury. Therefore, let’s look at this field only for today’s examples.

# Execute an SQL command to return some daily data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the daily_races and daily_runners tables, joining by unique race_id and filtering for meetings on or after the date given in the WHERE clause. The SQL query is saved in a variable called sql2.
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.name,
              daily_runners.jockey_name,
              daily_runners.trainer_name
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-05-18'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_daily_results.
smartform_daily_results <- dbGetQuery(con, sql2)

# Close the database connection, which is good practice
dbDisconnect(con)

The variable smartform_daily_results contains information about all races being run today.

Now we filter for just today’s Lockinge Stakes.

# Filter for the Lockinge Stakes only
lockinge_only <- dplyr::filter(smartform_daily_results,
                              grepl("Lockinge", race_title))

The variable lockinge_only now contains just the basic details for today’s Group 1 Lockinge Stakes.

Lastly, dplyr is again used to perform a different type of join, which combines the lockinge_only and trainer_group_data dataframes, using trainer_name as the key.

# Using dplyr, join the lockinge_only and trainer_group_data dataframes
lockinge_only_with_sr <- dplyr::inner_join(lockinge_only, trainer_group_data, by = "trainer_name")

If we now view the dataframe lockinge_only_with_sr, we can see that the trainer with the best Group race strike rate among those with runners in today’s Lockinge Stakes is Aidan O’Brien.

lockinge_only_with_sr

    course                                race_title meeting_date             name    jockey_name         trainer_name  group_runs group_wins strike_rate
1  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19  Lightning Spear   Oisin Murphy          D M Simcock         256         24    9.375000
2  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19          Suedois      D Tudhope            D O'Meara         190         16    8.421053
3  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19           Limato  Harry Bentley              H Candy         103         15   14.563107
4  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19   Librisa Breeze      R Winston            D K Ivory          44          2    4.545455
5  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19       Zonderland        A Kirby              C G Cox         265         30   11.320755
6  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19        Deauville     W M Lordan          A P O'Brien        2347        385   16.403920
7  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19 Accidental Agent Charles Bishop Eve Johnson Houghton          70          1    1.428571
8  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19 Lancaster Bomber  J A Heffernan          A P O'Brien        2347        385   16.403920
9  Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19    Beat The Bank    Jim Crowley          A M Balding         367         25    6.811989
10 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19       War Decree      P B Beggy          A P O'Brien        2347        385   16.403920
11 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19     Rhododendron      R L Moore          A P O'Brien        2347        385   16.403920
12 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19 Alexios Komnenos      C D Hayes            J A Stack          19          3   15.789474
13 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19           Lahore     S De Sousa              C G Cox         265         30   11.320755
14 Newbury Al Shaqab Lockinge Stakes (Group 1) (Str)   2018-05-19          Addeybb    James Doyle           W J Haggas         459         48   10.457516

No real surprises there, perhaps. However, what would have happened if we’d placed a one-pound bet on every Aidan O’Brien runner in Group races over the last twelve years? Would we have made a profit?

To begin, we filter all historic Group races, from the data obtained earlier, for just Aidan O’Brien’s horses.

# Filter historic data just for A P O'Brien horses
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Non-runners appear as NA and need to be removed for a true picture and so that calculations do not fail
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

The dataframe obrien_group_races_only now contains the data we need.

The next step is to cumulatively sum the returns from winning horses and deduct our one-pound stake where appropriate. The cumsum function is used to achieve this. The ifelse statement below says: for every row in the dataframe where the finishing position is one, add the decimal starting price minus our one-pound stake; otherwise (if the finish position isn’t one), subtract the one-pound stake from the total.

# Calculate the cumulative total for all O'Brien runners
obrien_cumulative <- cumsum(
  ifelse(obrien_group_races_only$finish_position == 1, (obrien_group_races_only$starting_price_decimal-1),-1)
)

# Add a new column back into obrien_group_races_only, with the cumulative totals
obrien_group_races_only$cumulative <- obrien_cumulative

When viewing the dataframe obrien_group_races_only you will now see a new column called cumulative. Looking at the last entry in this column, the result is -275.76. Therefore, a one-pound bet on each of the 2178 Aidan O’Brien Group race runners since 2006 would have resulted in a loss of £275.76. Not really the route to profitable punting.
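The level-stakes arithmetic can be checked by hand on a few rows. The sketch below uses plain Python with invented finishing positions and prices (not Smartform data). The function name cumulative_pnl is hypothetical, and the stake parameter is an assumption added for illustration.

```python
from itertools import accumulate

def cumulative_pnl(runs, stake=1.0):
    """Running profit/loss for level stakes at decimal starting price.

    runs: iterable of (finish_position, starting_price_decimal), in date order.
    """
    per_bet = [
        stake * (sp - 1) if pos == 1 else -stake  # winner returns SP minus stake
        for pos, sp in runs
    ]
    return list(accumulate(per_bet))

# Invented sample: loser, winner at 3.5, loser, winner at 2.25
sample_runs = [(3, 5.0), (1, 3.5), (2, 2.0), (1, 2.25)]
totals = cumulative_pnl(sample_runs)
print(totals)  # [-1.0, 1.5, 0.5, 1.75]
```

A running total like this is only meaningful when the rows are in date order. The SQL query above has no ORDER BY clause, so sorting by meeting_date before summing is a sensible precaution.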

We can also take this a step further and plot the results for something more visual. The code below uses the ggplot2 library again, along with some helper libraries to make a prettier chart.

# Load relevant libraries
library("ggplot2")
library("ggthemes")
library("scales")

# Convert the meeting_date column from character format to Date format
obrien_group_races_only$meeting_date <- as.Date(obrien_group_races_only$meeting_date)

# Plot the chart
ggplot(data=obrien_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  scale_y_continuous(breaks = seq(-300, 10, by = 50)) +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

The years 2009 through to the end of 2011 look to be particularly poor years for Aidan O’Brien in Group races, but 2014 onwards has been much more stable.

Today we have shown that even when a trainer has the best historic strike rate in Group races, backing all of their runners blindly does not necessarily result in a profit. The market obviously knows that A P O’Brien has a very good record in these types of races, and his horses are priced accordingly.

An exercise for the reader could be to calculate the P&L for Aidan O’Brien runners when ridden by Ryan Moore. To do this, start by filtering the historic obrien_group_races_only data by R L Moore. Hint: the answer is that they are a profitable combination over 375 runs since 2006.

Questions and queries about this article should be posted as a comment below or to the Betwise Q&A board.

The full R code used in this article is found below.

# Load the RMySQL and dplyr library packages
library("RMySQL")
library("dplyr")

# Execute an SQL command to return some historic data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the historic_races and historic_runners tables, joining by unique race_id and filtering for results only since January 1st, 2006. The SQL query is saved in a variable called sql1.
sql1 <- paste("SELECT historic_races.course,
              historic_races.meeting_date,
              historic_races.conditions,
              historic_races.group_race,
              historic_races.race_type_id,
              historic_races.race_type,
              historic_runners.name,
              historic_runners.jockey_name,
              historic_runners.trainer_name,
              historic_runners.finish_position,
              historic_runners.starting_price_decimal
              FROM smartform.historic_runners
              JOIN smartform.historic_races USING (race_id)
              WHERE historic_races.meeting_date >= '2006-01-01'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_results.
smartform_results <- dbGetQuery(con, sql1)

# Close the database connection, which is good practice
dbDisconnect(con)

# Race type IDs 12 and 15 correspond to Flat and All Weather races
flat_races_only <- dplyr::filter(smartform_results,
                                  race_type_id == 12 |
                                   race_type_id == 15)

# Filter using the group_race field for Group 1, 2 and 3 races only
group_races_only <- dplyr::filter(flat_races_only,
                                group_race == 1 |
                                  group_race == 2 |
                                  group_race == 3 )

# For each trainer name, count the number of runs
trainer_group_runs <- group_races_only %>% count(trainer_name)

# Rename the second column in trainer_group_runs to something more logical
names(trainer_group_runs)[2]<-"group_runs"

# Now filter for only winning runs
group_winners_only <- dplyr::filter(group_races_only,
                                    finish_position == 1)

# For each trainer, count the number of winning runs
trainer_group_wins <- group_winners_only %>% count(trainer_name)

# Rename the second column in trainer_group_wins to something more logical
names(trainer_group_wins)[2]<-"group_wins"

# Join the two dataframes, trainer_group_runs and trainer_group_wins together, using trainer_name as a key
trainer_group_data <- dplyr::full_join(trainer_group_runs, trainer_group_wins, by = "trainer_name")


# Replace all NA values in the new dataframe with zero
# If a trainer has not had a group winner, the group_wins field will be NA
# If this is not changed to zero, later calculations will fail
trainer_group_data[is.na(trainer_group_data)] <- 0

# Now calculate the Group race strike rate for all trainers
trainer_group_data$strike_rate <- (trainer_group_data$group_wins / trainer_group_data$group_runs) * 100

# Execute an SQL command to return some daily data
# Connect to the Smartform database. Substitute the placeholder credentials for your own. 
# The IP address can be substituted for a remote location if appropriate.
con <- dbConnect(MySQL(), 
                        host='127.0.0.1', 
                        user='yourusername', 
                        password='yourpassword', 
                        dbname='smartform')

# This SQL query selects the required columns from the daily_races and daily_runners tables, joining by unique race_id and filtering for meetings on or after the date given in the WHERE clause. The SQL query is saved in a variable called sql2.
sql2 <- paste("SELECT daily_races.course,
              daily_races.race_title,
              daily_races.meeting_date,
              daily_runners.name,
              daily_runners.jockey_name,
              daily_runners.trainer_name
              FROM smartform.daily_races
              JOIN smartform.daily_runners USING (race_id)
              WHERE daily_races.meeting_date >='2018-05-18'", sep="")

# Now execute the SQL query, using the previously established connection. Results are stored in a variable called smartform_daily_results.
smartform_daily_results <- dbGetQuery(con, sql2)

# Close the database connection, which is good practice
dbDisconnect(con)

# Filter for the Lockinge Stakes only
lockinge_only <- dplyr::filter(smartform_daily_results,
                              grepl("Lockinge", race_title))

# Using dplyr, join the lockinge_only and trainer_group_data dataframes
lockinge_only_with_sr <- dplyr::inner_join(lockinge_only, trainer_group_data, by = "trainer_name")

# Filter historic data just for A P O'Brien horses
obrien_group_races_only <- dplyr::filter(group_races_only, 
                                         grepl("A P O'Brien", trainer_name))

# Non-runners appear as NA and need to be removed for a true picture and so that calculations do not fail
obrien_group_races_only <- dplyr::filter(obrien_group_races_only, !is.na(finish_position))

# Calculate the cumulative total for all O'Brien runners
obrien_cumulative <- cumsum(
  ifelse(obrien_group_races_only$finish_position == 1, (obrien_group_races_only$starting_price_decimal-1),-1)
)

# Add a new column back into obrien_group_races_only, with the cumulative totals
obrien_group_races_only$cumulative <- obrien_cumulative

# Load relevant libraries
library("ggplot2")
library("ggthemes")
library("scales")

# Convert the meeting_date column from character format to Date format
obrien_group_races_only$meeting_date <- as.Date(obrien_group_races_only$meeting_date)

# Plot the chart
ggplot(data=obrien_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  scale_y_continuous(breaks = seq(-300, 10, by = 50)) +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))

##################################################################################
# Reader Exercise: Aidan O'Brien and Ryan Moore combined in Group Races
##################################################################################

obrien_moore_group_races_only <- dplyr::filter(obrien_group_races_only, 
                                               grepl("R L Moore", jockey_name))

obrien_moore_cumulative <- cumsum(
  ifelse(obrien_moore_group_races_only$finish_position == 1, (obrien_moore_group_races_only$starting_price_decimal-1),-1)
)

obrien_moore_group_races_only$cumulative <- obrien_moore_cumulative


obrien_moore_group_races_only$meeting_date <- as.Date(obrien_moore_group_races_only$meeting_date)


ggplot(data=obrien_moore_group_races_only, aes(x=meeting_date, y=cumulative, group=1)) +
  geom_line(colour="blue", lwd=0.7) + 
  scale_x_date(labels = date_format("%Y-%m-%d"), date_breaks="6 months") +
  theme_tufte(base_family="serif", base_size = 14) + 
  geom_rangeframe() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        panel.grid.major.x = element_line(color = "grey80"),
        panel.grid.major.y = element_line(color = "grey80"))


Daily trainer strike rates – how to match for all today’s runners

By Nick Franks on Saturday, May 12th, 2018

Today we’re going to build on the last SQL post where we created a query for 14 day strike rates for all trainers who had runners in the past 14 days. It’s all very well knowing how trainers have performed in the last 14 days, but in practice the way we’ll typically want to use that information is to match the strike rates with today’s runners, so we can compare trainer strike rates within a race.

That is what today’s query is going to show. We’ll be converting the query from the last blog post to a new database table, then querying that table with a left outer join combined with data on today’s runners. First, here’s the query that you can run in Smartform yourself, then we’ll explain what is going on.

-- If the temp table exists delete it

DROP TABLE IF EXISTS temp_trainer_stats_table;

-- create the temporary table by using the first query
CREATE TABLE temp_trainer_stats_table AS (
Select a.*,
ROUND(Winners/Runners * 100,0) AS WinPct,
ROUND(Placers/Runners * 100,0) AS PlacePct
from(
SELECT hru.trainer_name as Trainer, hru.trainer_id,
COUNT(*) AS Runners,
SUM(CASE WHEN hru.finish_position = 1 THEN 1 ELSE 0 END) AS Winners,
sum(case when hra.num_runners < 8 then case when hru.finish_position in ( 2) then 1 else 0 end
else
case when hra.num_runners < 16 then case when hru.finish_position in ( 2,3) then 1 else 0 end
else
case when hra.handicap = 1 then case when hru.finish_position in (2,3,4) then 1 else 0 end
else
case when hru.finish_position in ( 2,3) then 1 else 0 end
end
end
end )as Placers,
ROUND(((SUM(CASE WHEN hru.finish_position = 1 THEN (hru.starting_price_decimal -1) ELSE -1 END))),2) AS WinProfit
FROM historic_runners hru
JOIN historic_races hra USING (race_id)
WHERE hra.meeting_date >= ADDDATE(CURDATE(), INTERVAL -14 DAY)
and hra.race_type_id in( 15, 12)
and hru.in_race_comment <> 'Withdrawn'
and hru.starting_price_decimal IS NOT NULL
GROUP BY trainer_name, trainer_id) a);

-- Join the daily runners and daily races to get details of todays card, then do a LEFT OUTER JOIN to get the trainer stats,
-- Where no trainer stats exist NULLS are shown

SELECT substr(dra.scheduled_time, 12,5) as Time, dra.course as Course,
dru.cloth_number as 'No.', dru.name as Horse,
case when dru.stall_number is NULL then "" ELSE dru.stall_number end as Draw, dru.jockey_name as Jockey,
dru.forecast_price as FSP, tmp.*
from daily_races dra
join daily_runners dru using (race_id)
LEFT OUTER JOIN temp_trainer_stats_table tmp using (trainer_id )
where dra.race_type = 'Flat'
and dra.meeting_date = adddate(curdate(), INTERVAL 0 DAY)
-- add a condition here if you want to restrict stats for trainers with less than a certain number of runners
-- eg. [ and runners >=5 ]
order by Time, dru.cloth_number;

Having created a temporary table with the trainer statistics, we now want to join this data to the race card.
The race card data is held in two tables – daily_races and daily_runners – which we will join together using the race_id:

from daily_races dra
join daily_runners dru using (race_id)

To this we also want to join the temp_trainer_stats_table we have just created. For this we use a LEFT OUTER JOIN.

This type of join ensures we have the data for every runner in every race on the card, and the trainer data for trainers for which it exists.
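The behaviour of the LEFT OUTER JOIN can be seen on a toy example. The sketch below uses Python's built-in sqlite3 module rather than MySQL. The table names runners and trainer_stats and their contents are invented for illustration and are not part of Smartform.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runners (name TEXT, trainer_id INTEGER);
    CREATE TABLE trainer_stats (trainer_id INTEGER, WinPct INTEGER);
    -- Horse B's trainer has no row in trainer_stats
    INSERT INTO runners VALUES ('Horse A', 1), ('Horse B', 2);
    INSERT INTO trainer_stats VALUES (1, 25);
""")
rows = conn.execute("""
    SELECT r.name, t.WinPct
    FROM runners r
    LEFT OUTER JOIN trainer_stats t USING (trainer_id)
    ORDER BY r.name
""").fetchall()
conn.close()
print(rows)  # [('Horse A', 25), ('Horse B', None)]
```

Every runner is kept, and the statistics column comes back as NULL (None in Python) where the trainer had no runners in the period.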

For each table I use an alias, e.g. dra for daily_races and tmp for the temp_trainer_stats_table. This enables correct identification of data fields that exist in more than one table.

There is far more data available in the race card data than I am using here, but the basics are:

Time – for this I am using a substring function on the scheduled_time data element. The whole value is, for example, 2018-05-06 13:30:00; substr(dra.scheduled_time, 12,5) takes 5 characters starting at position 12, which gives 13:30.
We could also use this to extract the date. There is also a meeting_date data element. Using either of them you can create a date field using substring and concatenation functions like this:

concat(substr(dra.scheduled_time, 9,2),'-',substr(dra.scheduled_time, 6,2),'-',substr(dra.scheduled_time, 1,4)) as RaceDate
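SQL substring positions are 1-based, which is easy to get wrong. The short Python sketch below mimics the substr calls above on the example value. The helper name sql_substr is hypothetical and exists only to illustrate the position arithmetic.

```python
scheduled_time = "2018-05-06 13:30:00"

def sql_substr(text, start, length):
    """Mimic SQL substr(text, start, length), where start is 1-based."""
    return text[start - 1:start - 1 + length]

race_time = sql_substr(scheduled_time, 12, 5)
race_date = "-".join([
    sql_substr(scheduled_time, 9, 2),  # day
    sql_substr(scheduled_time, 6, 2),  # month
    sql_substr(scheduled_time, 1, 4),  # year
])
print(race_time, race_date)  # 13:30 06-05-2018
```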

Added to the join of the race card data and the trainer data are two conditions to get only flat races for today:

where dra.race_type  = 'Flat' 
and dra.meeting_date = adddate(curdate(), INTERVAL 0 DAY)

Further criteria could be added to limit the selection to trainers with more than a certain number of runners (e.g. more than 4) in the 14-day analysed period:

and  tmp.runners > 4

and with a strike rate greater than or equal to 20%:

and WinPct >=20

Combining win percentage, win profit and number of runners would look like this:

where dra.race_type  = 'Flat'
and dra.meeting_date = adddate(curdate(), INTERVAL 0 DAY)
and tmp.runners > 4
and tmp.WinPct >= 20
and tmp.WinProfit > 0
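One consequence of filtering on the tmp columns is worth seeing on a small example: a comparison against NULL is never true, so runners whose trainers have no statistics drop out of the result even though the join is a LEFT OUTER JOIN. The sketch below again uses Python's sqlite3 module with invented table names and data, as an illustration rather than a reproduction of the Smartform query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE runners (name TEXT, trainer_id INTEGER);
    CREATE TABLE trainer_stats (
        trainer_id INTEGER, Runners INTEGER, WinPct INTEGER, WinProfit REAL
    );
    -- Horse C's trainer has no stats row; Horse B's trainer has too few runners
    INSERT INTO runners VALUES ('Horse A', 1), ('Horse B', 2), ('Horse C', 3);
    INSERT INTO trainer_stats VALUES (1, 10, 30, 4.5), (2, 3, 33, 1.0);
""")
filtered = conn.execute("""
    SELECT r.name
    FROM runners r
    LEFT OUTER JOIN trainer_stats t USING (trainer_id)
    WHERE t.Runners > 4 AND t.WinPct >= 20 AND t.WinProfit > 0
    ORDER BY r.name
""").fetchall()
conn.close()
print(filtered)  # [('Horse A',)]
```

If the full card is still wanted alongside the filtered statistics, one option is to move these conditions into the join's ON clause instead of the WHERE clause.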

Running the first query gives us a result set of 488 rows, neatly ordered by race time for each of today’s races. We’ve attached a CSV of the output for all today’s races, so you can analyse each race according to trainer strike rate and profitability – something you can of course do for yourself every day in Smartform.

CSV Download: 14 day trainer strike rates for every race on Saturday 12th May

In the next post, we’ll look at different options for automating this query on a daily basis.

