What 18 leagues and 979 matches taught us

Twelve international friendlies weren't enough. So ahead of the World Cup we ran the same five-model setup over 18 club leagues across six continents: 1,427 fixtures fetched, 979 with complete predictions, every match graded against the actual result. This is what came out the other end.

Why a club-football detour

We needed a sandbox where we could test bias correction at scale, on data the models could not have seen in training. Club leagues from January through April 2026 fit the bill: real outcomes, real betting markets, and more fixtures in a month than the World Cup will produce in three weeks.

The same five models — GPT-5.4, GPT-5.5, Claude Sonnet 4.6, Gemini 3.1 Pro, Grok 3 — answered the same prompt for every match. Same context, same evidence, same question: what are the home/draw/away probabilities?

The headline

Across the 18 leagues, our bias-corrected ensemble delivered a meaningful improvement over raw model predictions on Brier score, the standard calibration metric. The improvement was not uniform. In five leagues it was large — double-digit. In four leagues it actually made the predictions worse. The other nine were somewhere in between, mostly mild improvements.

What the validation looked like

• 18 leagues — five major European, four secondary European, four Americas, three Asian, plus MLS and A-League
• 979 matches with complete prediction sets across all five models
• Same prompt structure as the one we'll use during the World Cup
• Brier score as the primary metric — penalises both over- and under-confidence
• Held-out matches — none of these were used to fit the bias corrections
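
For readers who haven't met the metric, here is a minimal sketch of how a Brier score can be computed for a three-way home/draw/away forecast, along with an unweighted average of several models' forecasts. It is an illustration only: the function names are hypothetical, the probabilities are made up, and we assume the common "sum of squared errors over the three outcomes" form of the score (0 is perfect, 2 is the worst possible). It is not the bias correction itself.

```python
from statistics import mean

OUTCOMES = ("home", "draw", "away")


def brier_score(probs, actual):
    """Brier score for one match: squared error summed over the three outcomes.

    probs  -- dict mapping each outcome to a forecast probability
    actual -- the outcome that happened ("home", "draw" or "away")
    """
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)


def average_forecast(forecasts):
    """Unweighted mean of several models' probabilities for one match."""
    return {o: mean(f[o] for f in forecasts) for o in OUTCOMES}


def mean_brier(graded_matches):
    """Average Brier score over an iterable of (probs, actual) pairs."""
    return mean(brier_score(p, a) for p, a in graded_matches)


# Made-up example: two models forecast the same match, and the home side wins.
model_a = {"home": 0.6, "draw": 0.3, "away": 0.1}
model_b = {"home": 0.4, "draw": 0.3, "away": 0.3}
combined = average_forecast([model_a, model_b])
print(combined, round(brier_score(combined, "home"), 4))
```

Lower is better. A forecast that spreads its probability evenly across the three outcomes scores about 0.667 whatever happens, which makes a handy reference point: a confident forecast beats it when right and does far worse when wrong.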

Three things we did not expect

One: the size of the league mattered more than the size of the model. We assumed the most-trained models would be most consistent across leagues. They weren't. The variation came from the league, not the model.

Two: bias correction can hurt. In a handful of leagues, the ensemble that explicitly corrects for known biases predicted worse than the raw model average. We have a working hypothesis for why, and it's the most useful finding from the whole exercise.

Three: the Premier League barely moved. The most-covered league in the world produced the smallest correction. That tells us something about where AI training data is concentrated, and where it's thin.

What this week's posts cover

Over the next eight days we'll unpack the league validation in detail: where correction worked, where it failed, why the Premier League is a blind spot, why La Liga gave us our cleanest signal, what the geographic pattern looks like, and how each individual model performed across all 18 leagues. We'll close on what the data says (and doesn't say) about the World Cup.

One thing we won't share: the specific weights or formulas that go into the bias correction. That's the work product we're testing live. Everything else — the questions, the inputs, the per-league results, the limitations — is on the table.

The 18-league test was about earning the right to make pre-tournament claims. We'd rather know now that the methodology has limits than learn it during the tournament.

What's next