What 18 leagues tell us about the World Cup

Eighteen club leagues are not the World Cup. Tournament football has different stakes, different timing, different selection pressures, and different small-sample dynamics. We can't copy a finding from Brasileirão to a USA-Iran group-stage fixture and call it the same thing. But the league validation does give us a working map of where the methodology is strong, where it's weak, and what to watch.

Five propositions we are taking into June

1. The methodology should add the most value where the field is asymmetric

Group-stage matches with a heavy mismatch between squads — Argentina vs a non-traditional power, England vs an underdog — are where the prestige-anchor and league-coverage gradients we measured in the league test should most distort raw predictions. Those are exactly the fixtures where our correction earned the biggest league-by-league lifts. We'd expect the World Cup data to follow the same shape.

2. Big-v-big fixtures will be a tighter test

When two well-covered nations meet (Spain vs France, Germany vs Argentina, England vs Brazil), the models start with reasonable priors on both sides. The Premier League result tells us those fixtures may not move much under correction. They're still the most demanding tests, because any improvement we eke out comes on top of an already-calibrated baseline.

3. Knockout football will stress-test our recency and pressure dimensions

Most of our 18-league test was league football — repeated, low-stakes, low-variance compared with tournament knockout rounds. The dimensions we specifically measured around tournament pressure and stakes (PCM03 and PCM11 in our research) didn't get a fair workout in club football. The knockout rounds are where those dimensions matter, and we'll be watching whether the correction we apply on them holds up or has to be re-tuned in real time.

4. We will be honest when correction is silent

Several leagues showed near-zero benefit from correction — Premier League, Belgian Pro, Primeira Liga, Russian Premier. If a World Cup fixture resembles those leagues (closely-matched squads from well-covered leagues), we'll publish a smaller delta between our raw average and our bias-corrected pick — because the data says the correction shouldn't move much there.

5. We expect at least one finding we don't like

The four leagues where calibration hurt — Eredivisie, Scottish Prem, Swiss Super, Serie A — are the leagues that humbled us. Some World Cup fixtures will look like those: high-variance, atypical, models leaning the right way for the wrong reasons. We expect to find at least one stage of the tournament where our methodology underperforms a simple average. If we don't, we should be suspicious of our own measurement.

What we are explicitly not predicting

We are not claiming our Edge prediction will beat the betting market overall. The league data covered a different mix of fixtures from the ones bookmakers pay most attention to, so a like-for-like comparison there isn't clean. The World Cup will be the cleanest test we've had: bookmakers will have sharp lines on every match, and our predictions are public and time-stamped.

We are not claiming a single model will dominate. The 18-league test made it clear that no model wins everywhere. The ensemble exists because the models complement each other. If one model ends up best across all 104 matches, the methodology has failed in an interesting way — either the corrections we're applying aren't actually differentiating, or we got lucky with one model's biases.

We are not claiming bias correction is universally good. The four leagues where it hurt are evidence enough. The interesting question for the World Cup is whether we can predict in advance which fixtures to apply heavy correction to and which to leave alone. The league test is the calibration of that judgement.

The clear falsification line

If, after 104 World Cup matches, the naive average produces a better Brier score than our bias-corrected Edge prediction, the methodology didn't earn its keep on this tournament. We'll publish that result with the same prominence as a positive one. That's the deal we're making by going on record before the matches.
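
For readers who want to run the same check, here is a minimal sketch of the comparison. It assumes the multi-class Brier score over home/draw/away probabilities (one common convention, ranging from 0 to 2, lower is better); the exact scoring variant isn't spelled out above, and the function names and input layout are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcome: int) -> float:
    """Multi-class Brier score for one match.

    probs   -- forecast probabilities, e.g. (home, draw, away)
    outcome -- index of the result that actually happened
    Assumption: squared error summed over all outcomes (range 0 to 2).
    """
    return sum(
        (p - (1.0 if i == outcome else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def mean_brier(forecasts: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Average Brier score across a set of matches."""
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("need one outcome per forecast, and at least one match")
    total = sum(brier_score(f, o) for f, o in zip(forecasts, outcomes))
    return total / len(forecasts)


def naive_average(model_forecasts: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted mean of several models' probabilities for one match."""
    n = len(model_forecasts)
    return [sum(column) / n for column in zip(*model_forecasts)]


def falsified(
    corrected: Sequence[Sequence[float]],
    naive: Sequence[Sequence[float]],
    outcomes: Sequence[int],
) -> bool:
    """True if the naive average scored strictly better than the corrected pick."""
    return mean_brier(naive, outcomes) < mean_brier(corrected, outcomes)
```

In this sketch a tie does not count as falsification, because the line above only triggers when the naive average is strictly better. Feeding it all 104 time-stamped forecasts and results gives a single yes-or-no answer.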

Wrapping the week

Eight days, eight angles. The thing that surprised us most across all of them is that the failures were as informative as the wins. Knowing where our correction makes things worse is a precondition for trusting it where it makes things better. Without the 18-league test we'd have entered June making confident claims and then explaining away the misses; with it, we know where to stand firm and where to hedge.

From here, the predictions are live. Every fixture between now and the final is on record before kickoff. The next results posts will move from calibration to actuals.