Why La Liga is the calibration sweet spot

La Liga is the result that surprised us most. It's a deeply covered league with mountains of training data, so by all our priors it should have looked like the Premier League, where bias correction barely moved the needle. Instead it produced a 13% Brier improvement (13.2% for the naive ensemble), with all four of our models improving. That's our cleanest signal of the whole 18-league test.

The numbers

La Liga, 44 matches — Brier improvement after bias correction

• Claude Sonnet 4.6 — +17.5%
• Gemini 3.1 Pro — +14.2%
• GPT-5.4 — +11.3%
• Grok 3 — +10.9%
• Naive ensemble — +13.2%

Every model improved by at least 10%. That kind of agreement is rare across 18 leagues — usually one or two models lift sharply while others stay flat or drop. La Liga had a uniform direction.
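For readers who want to reproduce this kind of figure, here is a minimal sketch of how a percentage Brier improvement can be computed. It assumes the three-outcome (home, draw, away) form of the Brier score and an unweighted average for the naive ensemble; the function names are illustrative and are not taken from our pipeline.

```python
from typing import List, Sequence


def brier_score(forecasts: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean multi-class Brier score (lower is better).

    forecasts: one (home, draw, away) probability triple per match.
    outcomes:  index of the result that happened (0=home, 1=draw, 2=away).
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need one outcome per forecast, and at least one match")
    total = 0.0
    for probs, outcome in zip(forecasts, outcomes):
        # Squared distance between the forecast and the one-hot actual result.
        total += sum((p - (1.0 if k == outcome else 0.0)) ** 2
                     for k, p in enumerate(probs))
    return total / len(forecasts)


def relative_improvement(before: float, after: float) -> float:
    """Percentage drop in Brier score; positive means the correction helped."""
    return 100.0 * (before - after) / before


def naive_ensemble(model_probs: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted mean of several models' forecasts for a single match."""
    n = len(model_probs)
    return [sum(column) / n for column in zip(*model_probs)]
```

Under these assumptions, a figure like +17.5% would correspond to relative_improvement(raw_brier, corrected_brier) evaluated over the same 44 matches.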

Why La Liga and not the Premier League

Both leagues have abundant training data. The difference is what the data looks like. English-language coverage of La Liga is dominated by Real Madrid and Barcelona, and to a lesser extent by Atlético and the chase for European qualification. The middle and lower reaches of the table get less attention.

That asymmetry inside the league is exactly the situation our calibration is designed for. Models inherit the coverage gradient: they over-estimate the dominant clubs and under-rate everyone else, even when the underlying performance numbers say otherwise. When we correct for that, predictions for mid-table fixtures, which make up the bulk of any season, move closer to the actual match outcomes.
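To make the idea of a correction concrete, here is one generic way a known bias could be removed from a forecast: shift each outcome's log-probability by an offset, then renormalise. This is an illustrative sketch only. The additive log-offset form, the function name and the example offsets are all assumptions, not a description of the exact correction used in this test.

```python
import math
from typing import List, Sequence


def correct_forecast(probs: Sequence[float], log_offsets: Sequence[float]) -> List[float]:
    """Apply a hypothetical per-outcome bias correction to one forecast.

    probs:       raw (home, draw, away) probabilities from a model.
    log_offsets: amount added to each outcome's log-probability; a negative
                 value shrinks an outcome the model is assumed to over-rate.
    """
    eps = 1e-9  # guard against log(0) for forecasts that contain a hard zero
    shifted = [math.log(max(p, eps)) + d for p, d in zip(probs, log_offsets)]
    peak = max(shifted)  # subtract the max before exponentiating, for stability
    weights = [math.exp(s - peak) for s in shifted]
    total = sum(weights)
    return [w / total for w in weights]  # renormalise so the triple sums to 1


# Example: a prestige club at home, with an assumed over-rating of the favourite.
raw = [0.70, 0.18, 0.12]
corrected = correct_forecast(raw, [-0.2, 0.1, 0.1])
```

With all offsets at zero the forecast comes back unchanged, so a league with a flat coverage gradient would see little movement under this kind of scheme.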

The Premier League doesn't have the same coverage gradient. Even Brentford get serious tactical coverage in English. The models' priors on the EPL are flatter, less anchored on a few brand teams, and so there's less for a correction to fix.

The Claude effect

One detail worth flagging: Claude was the biggest beneficiary of correction in La Liga. Claude is also the model with the strongest documented home-advantage over-adjustment in our fingerprinting study. La Liga has a relatively pronounced home advantage in some seasons but not others, so a model that always tilts toward the home side is prone to miss across a 44-match sample. Correcting for that predictable Claude tendency is where most of its 17.5% improvement came from.
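A simple way to check for this kind of home tilt is to compare a model's average home-win probability with how often the home side actually won. The sketch below is an assumed diagnostic for illustration, not the measure used in the fingerprinting study, and the function name is hypothetical.

```python
from typing import Sequence


def home_tilt(home_probs: Sequence[float], home_won: Sequence[bool]) -> float:
    """Mean predicted home-win probability minus the observed home-win rate.

    A positive value means the model expected more home wins than occurred.
    """
    if not home_probs or len(home_probs) != len(home_won):
        raise ValueError("need one result per forecast, and at least one match")
    n = len(home_probs)
    mean_predicted = sum(home_probs) / n
    observed_rate = sum(1 for won in home_won if won) / n
    return mean_predicted - observed_rate
```

On a 44-match sample this estimate is noisy, so a small tilt in either direction should not be over-read.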

The other three models needed less correction along that dimension and improved through other channels — narrative weighting, recency, prestige. What's elegant about the result is that different corrections on different models all converged on the same direction, which is what you'd expect if the underlying ground truth is consistent and the biases are real.

The cleanest experimental result is the one where every model independently lifts in the same direction. La Liga gave us that.

What it implies for Spain at the World Cup

Spain's World Cup squad will be drawn almost entirely from La Liga and a handful of European clubs that the models also know well. Predictions for Spain matches should be well-supported by the type of correction that worked in our La Liga test — particularly when Spain plays a team whose squad sits in less-covered leagues, where the prestige-anchor effect is largest.

Five of our six prediction methods placed Spain among the top three favourites heading into the tournament. The methodology that put them there is the one that just delivered our cleanest 44-match validation.

Tomorrow: the geographic pattern across all 18 leagues, and what it suggests about the matchups our methodology will be most and least useful for in June.
