The Epsom Classics Through Two Lenses: How Far Can Pedigree Statistics Take You?

By colin on Friday, June 5th, 2026

Yesterday’s dam-table release completed Smartform’s pedigree picture. So we built two models of the Epsom Classics — one using everything, one using breeding alone — and asked how much of the picture pedigree can paint on its own.

With yesterday’s release of daily_dams_insights, the Smartform pedigree canon is now three deep: daily_sires_insights, the new daily_dams_insights, and the sire × damsire cross, which only exists because the native database carries damsire references for every horse. Thus a complete picture, paternal and maternal, is queryable for every UK and Irish runner since 2008: the production form of a runner’s sire, dam, damsire and breeding cross, across aggregate, recent, age-, distance-, course- and condition-specific lenses, in both PRB and strike rate.

The Epsom Classics are the natural place to test it. The Oaks and Derby are the deepest middle-distance Classic tests in the calendar: top-of-pyramid three-year-old Group 1s over twelve furlongs, where any stamina and Classic-distance signal in the breeding should be doing its hardest work. They also come with a tightly bounded sample (18 usable runnings of each race, 2008 to 2025, since our pedigree data starts in 2008), which makes it tractable to ask exactly which signals move the needle.

So we ran a two-model experiment. A full model uses everything we hold — pedigree plus jockey, trainer and the runner’s own prior form. A parallel pedigree-only model sets the form and connections aside and keeps only the pedigree features: sire, dam, damsire and the breeding cross. The question: how much of the full model’s predictive power survives that restriction?

The short answer is most of it. On the Oaks, measured by top-1 and top-3 hit rate, it is all of it, although the full model still ranks the winner slightly higher on average. The pedigree-only model is a genuine second opinion in its own right, not a stripped-down copy of the full one. And whether the two models agree or disagree turns out to be more useful than either on its own.

Why a composite, not a black box

Eighteen runnings is no place for a hundred-feature gradient booster; throw that much at this little data and you model noise. We needed something simpler and more honest. For every candidate stat (138 in all: every variant across sire, dam, damsire and the breeding cross, plus jockey, trainer and the runner’s own prior form for the full model) we take its Spearman rank correlation with finishing position across the cohort. The strongest discriminators become the features; each one’s weight is its signed Spearman value. A runner’s score is the sum of its z-scores on those stats, each multiplied by the stat’s weight, and the scores are then softmaxed within the race so that the field’s shares sum to one.
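To make the recipe concrete, here is a minimal sketch of a Spearman-weighted composite in plain Python. It is an illustration under stated assumptions, not our production code: the stat names (sire_prb, dam_prb), the top_k cut-off, z-scoring within the race, and flipping the sign of the correlation so that a higher score means a better expected finish are all choices made for the example.

```python
import math
from statistics import mean, pstdev

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

def fit_weights(rows, stats, top_k):
    """rows: runner dicts holding each stat plus 'pos' (finishing position).
    Keeps the top_k stats by |rho|. Assumption: the weight is -rho, because
    a helpful stat correlates negatively with finishing position (1 = winner)."""
    pos = [r["pos"] for r in rows]
    rho = {s: spearman([r[s] for r in rows], pos) for s in stats}
    strongest = sorted(stats, key=lambda s: abs(rho[s]), reverse=True)[:top_k]
    return {s: -rho[s] for s in strongest}

def score_race(runners, weights):
    """Weighted sum of within-race z-scores, softmaxed across the field."""
    totals = [0.0] * len(runners)
    for stat, w in weights.items():
        vals = [r[stat] for r in runners]
        mu, sd = mean(vals), pstdev(vals)
        for i, v in enumerate(vals):
            totals[i] += w * ((v - mu) / sd if sd else 0.0)
    top = max(totals)  # subtract the max for numerical stability
    exps = [math.exp(t - top) for t in totals]
    return [e / sum(exps) for e in exps]

# Toy cohort with made-up values, purely for illustration.
cohort = [
    {"sire_prb": 0.62, "dam_prb": 0.55, "pos": 1},
    {"sire_prb": 0.58, "dam_prb": 0.60, "pos": 2},
    {"sire_prb": 0.51, "dam_prb": 0.48, "pos": 3},
    {"sire_prb": 0.47, "dam_prb": 0.50, "pos": 4},
]
weights = fit_weights(cohort, ["sire_prb", "dam_prb"], top_k=1)
shares = score_race(cohort, weights)
```

Because every weight is a single correlation rather than a fitted coefficient, there is very little for the model to overfit with, which is the point of the approach on a sample this small.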

To check it generalises, we ran leave-one-year-out across 2008–2025: hold out each year, refit the weights on the other seventeen, score the held-out race. That gives eighteen genuinely out-of-sample predictions per Classic, per model.
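The hold-out loop itself is short. The sketch below assumes a hypothetical layout (a dict mapping each year to that race's runner dicts, each carrying a "pos" field) and takes the fitting and scoring steps as plug-in functions, so it shows the shape of the procedure rather than our exact pipeline.

```python
def leave_one_year_out(races, fit, score):
    """races: dict of year -> list of runner dicts, each with 'pos'.
    fit(training_rows) -> model; score(model, runners) -> list of scores.
    Returns dict of year -> rank the model gave the actual winner (1 = top pick)."""
    winner_rank = {}
    for held_out, runners in races.items():
        # Refit on every other year; the held-out race is never seen in training.
        train = [r for year, rs in races.items() if year != held_out for r in rs]
        model = fit(train)
        scores = score(model, runners)
        order = sorted(range(len(runners)), key=lambda i: scores[i], reverse=True)
        winner = next(i for i, r in enumerate(runners) if r["pos"] == 1)
        winner_rank[held_out] = order.index(winner) + 1
    return winner_rank

def summarise(winner_rank):
    """Top-1 hits, top-3 hits and mean winner rank across held-out races."""
    ranks = list(winner_rank.values())
    return {
        "top1": sum(r == 1 for r in ranks),
        "top3": sum(r <= 3 for r in ranks),
        "mean_winner_rank": sum(ranks) / len(ranks),
    }

# Toy example with a trivial "model" that scores runners on one made-up stat x.
toy_races = {
    2008: [{"x": 3, "pos": 1}, {"x": 2, "pos": 2}, {"x": 1, "pos": 3}],
    2009: [{"x": 1, "pos": 1}, {"x": 3, "pos": 2}, {"x": 2, "pos": 3}],
}
toy_ranks = leave_one_year_out(
    toy_races,
    fit=lambda rows: None,
    score=lambda model, runners: [r["x"] for r in runners],
)
toy_summary = summarise(toy_ranks)
```

The three numbers returned by summarise are the same three measures reported for each model: top-1 hits, top-3 hits and mean winner rank.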

What the data picks — and what it ignores

We let the data choose from the whole arsenal. What rose to the top was sire and dam. The damsire and the breeding cross, on this sample, didn’t reach the top tier of discriminators at all — which is itself worth knowing: at the very top of the mid-distance pyramid, it is the direct sire and dam lines that carry the signal. Having the full picture is precisely what let us find that out.

Race and model	Top-1 hits	Top-3 hits	Mean winner rank
Oaks — full	3 / 18 (17%)	9 / 18 (50%)	4.4
Oaks — pedigree	3 / 18 (17%)	9 / 18 (50%)	5.0
Derby — full	2 / 18 (11%)	7 / 18 (39%)	5.9
Derby — pedigree	2 / 18 (11%)	5 / 18 (28%)	6.7

Verdict	Count
Both models had the winner in their top-3	8 / 36 (22%)
Only the full model did	8 / 36 (22%)
Only the pedigree model did	6 / 36 (17%)
Neither	14 / 36 (39%)
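The two tables can be cross-checked against each other. The sketch below shows one way to tabulate the agreement verdicts from per-race top-3 flags (the function names are illustrative), then confirms that the published counts are internally consistent: "both" plus "full only" must equal the full model's top-3 hits across the Oaks and Derby, and likewise for the pedigree model.

```python
from collections import Counter

def verdict(full_top3, pedigree_top3):
    """Label one race by which model(s) had the winner in their top three."""
    if full_top3 and pedigree_top3:
        return "both"
    if full_top3:
        return "full only"
    if pedigree_top3:
        return "pedigree only"
    return "neither"

def tabulate(flags):
    """flags: iterable of (full_top3, pedigree_top3) booleans, one per race."""
    return Counter(verdict(f, p) for f, p in flags)

# Counts as published in the tables above.
published = {"both": 8, "full only": 8, "pedigree only": 6, "neither": 14}
full_top3_hits = 9 + 7       # Oaks + Derby, full model
pedigree_top3_hits = 9 + 5   # Oaks + Derby, pedigree model

checks = {
    "races": sum(published.values()) == 36,
    "full": published["both"] + published["full only"] == full_top3_hits,
    "pedigree": published["both"] + published["pedigree only"] == pedigree_top3_hits,
}

# Races where at least one model had the winner in its top three.
at_least_one = sum(published.values()) - published["neither"]
print(checks, at_least_one, f"{at_least_one / 36:.0%}")
```

Read this way, at least one of the two models had the winner in its top three in 22 of the 36 races (61%).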
