Per-model winners and losers across 18 leagues
One of the most consistent findings of the 18-league test: no single model was best everywhere. The model that worked best in La Liga was not the one that worked best in Bundesliga. The model that struggled in Eredivisie did fine in Saudi Pro. That non-uniformity is the whole reason the ensemble exists.
The headline by model
Claude Sonnet 4.6 — high ceiling, low floor
Claude was our biggest individual winner where it won. La Liga (+17.5% Brier improvement after correction) was its showcase result. It also posted strong improvements in MLS, J1, and Saudi Pro. The pattern: Claude is the most willing to lean hard on a position, which means when the correction is right, it moves a lot.
The flip side: Claude was also our biggest loser when it lost. Bundesliga (−5.2%), Eredivisie, and Scottish Premiership all saw Claude slip below the raw baseline. Same trait — a model that commits hard to a view — cuts both ways.
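For readers who want to see how a figure like "+17.5% Brier improvement" can arise, here is a minimal sketch. It assumes a multi-class Brier score over home/draw/away probabilities and a relative reduction against the raw (uncorrected) baseline; the post does not spell out the exact formula used, so the function names and the sign convention (positive means the corrected score is lower, i.e. better) are illustrative assumptions.

```python
from typing import List, Tuple

Probs = Tuple[float, float, float]  # assumed order: (home, draw, away)


def brier_score(probs: List[Probs], outcomes: List[int]) -> float:
    """Mean multi-class Brier score.

    outcomes[i] is the index of the result that happened
    (0 = home win, 1 = draw, 2 = away win). Lower is better.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Squared error against a one-hot vector for the actual result.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2
                     for k, p_k in enumerate(p))
    return total / len(probs)


def brier_improvement_pct(raw: float, corrected: float) -> float:
    """Relative reduction in Brier score, as a percentage of the raw score.

    Positive values mean the corrected predictions scored better.
    """
    return 100.0 * (raw - corrected) / raw
```

Under these assumptions, a league where the raw score is 0.600 and the corrected score is 0.495 would show a 17.5% improvement, and a negative figure means the correction made the score worse than the raw baseline.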
Gemini 3.1 Pro — good in volume, weaker on idiosyncratic leagues
Gemini was the most consistent across the high-coverage leagues. It posted modest but steady gains in La Liga (+14.2%), MLS (+13%), and Brasileirão. Where it struggled was leagues with strong tactical specifics. Serie A and Bundesliga both saw Gemini as the worst-performing of the four after correction. Generalist by name, generalist by behaviour: Gemini does well when nothing weird is happening and less well when it is.
Grok 3 — the steady contrarian
Grok had the smallest range of any model in the test — fewest big wins, fewest big losses. It improved by 10.9% in La Liga (smallest of the four there) and by 1.2% in the Premier League. It also posted the smallest losses in Eredivisie and Bundesliga. Grok's odds-anchored style made it harder to dislodge in either direction. It's the model we'd bet on for high-variance situations where you don't want surprise outliers.
GPT-5.4 — the market baseline, behaviourally
GPT-5.4 sat in the middle on most metrics. Its biggest improvement was in La Liga (+11.3%), its biggest decline in Bundesliga (−4.7%). It's the model whose corrections were most evenly distributed across dimensions — home advantage, prestige, recency, narrative — which is consistent with its "widely-used, moderate-on-everything" profile from our pre-tournament fingerprinting.
GPT-5.5 — the evolved baseline
GPT-5.5 was added to our roster after the league validation began, so it has a smaller dataset in the test. Where the data overlaps it tracks GPT-5.4 closely but with a slightly tempered narrative response — exactly the kind of evolution we'd expect from an iteration on the same family. We'll have a much richer per-league picture for GPT-5.5 once the World Cup runs.
The disagreement signal
One of the more interesting cross-model findings: when models disagreed sharply on a fixture, the bias-corrected ensemble usually beat any single model. When they agreed closely, the ensemble was no better than any individual prediction. That's the case for the methodology in compact form: the ensemble earns its keep precisely on the matches where one of the models is about to be wrong.
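One way to make "disagreed sharply" concrete is a per-fixture spread measure. The sketch below assumes spread is the largest gap between any two models on any single outcome, and uses a threshold of 0.10 to separate consensus fixtures from divergent ones. Both the definition and the threshold are hypothetical choices for illustration, not the ones used in the test, and the probabilities in the usage note are made up.

```python
from typing import Dict, List, Tuple

Probs = Tuple[float, float, float]  # assumed order: (home, draw, away)


def model_spread(preds: Dict[str, Probs]) -> float:
    """Largest gap between any two models on any single outcome."""
    # zip(*...) regroups the per-model tuples into per-outcome columns.
    per_outcome = zip(*preds.values())
    return max(max(col) - min(col) for col in per_outcome)


def split_by_spread(
    fixtures: Dict[str, Dict[str, Probs]],
    threshold: float = 0.10,  # hypothetical cut-off
) -> Tuple[List[str], List[str]]:
    """Return (consensus_ids, divergent_ids) based on model spread."""
    consensus: List[str] = []
    divergent: List[str] = []
    for fixture_id, preds in fixtures.items():
        if model_spread(preds) > threshold:
            divergent.append(fixture_id)
        else:
            consensus.append(fixture_id)
    return consensus, divergent
```

With illustrative home-win probabilities of 0.60, 0.45, 0.48 and 0.50 from four models, the spread is about 0.15 and the fixture lands in the divergent bucket. Scoring the ensemble and each single model separately within each bucket is then enough to check the pattern described above.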
How we're using this for the World Cup
- On consensus matches (low model spread), we lean on simple averaging.
- On high-divergence matches, the bias correction does heavier work, selecting which model is most likely to be tilted by its known weaknesses.
- Per-model behaviour from the league test informs which fixtures we'd expect Claude vs Grok vs Gemini to be most reliable on.
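The routing described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the production method: the spread measure, the 0.10 threshold, and the weights are all placeholders (the real blend ratios are deliberately unpublished), and a plain weighted average stands in for the bias correction.

```python
from typing import Dict, Tuple

Probs = Tuple[float, float, float]  # assumed order: (home, draw, away)


def simple_average(preds: Dict[str, Probs]) -> Probs:
    """Unweighted mean of each outcome probability across models."""
    n = len(preds)
    home, draw, away = (sum(col) / n for col in zip(*preds.values()))
    return (home, draw, away)


def weighted_average(preds: Dict[str, Probs],
                     weights: Dict[str, float]) -> Probs:
    """Weighted mean across models; weights are normalised internally."""
    total = sum(weights[m] for m in preds)
    home, draw, away = (
        sum(weights[m] * p[k] for m, p in preds.items()) / total
        for k in range(3)
    )
    return (home, draw, away)


def blend(preds: Dict[str, Probs],
          weights: Dict[str, float],
          spread_threshold: float = 0.10) -> Probs:
    """Simple average on consensus fixtures, weighted blend otherwise.

    `weights` is a hypothetical stand-in for the context-dependent
    bias correction; lower weight = model judged more likely tilted.
    """
    spread = max(max(col) - min(col) for col in zip(*preds.values()))
    if spread <= spread_threshold:
        return simple_average(preds)
    return weighted_average(preds, weights)
```

In a fuller version the weights would vary by fixture context (league, stage, host status) rather than being a single fixed dictionary, which is where the per-model findings from the league test would enter.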
Why we're not telling you the weights
We're not publishing the specific blend ratios because the methodology is what we're testing in public over the next 104 matches. The weights change with context — host nation matches, knockout-stage matches, high-stakes fixtures all get different treatment. The point of running predictions live and on record is to measure whether the methodology earns its keep on a tournament we did not see during model training. Publishing the weights would let anyone post-hoc fit a story to a result; the test is tighter without that.
Tomorrow: how all of this applies to the World Cup itself, and what the 18-league pattern tells us about which group-stage matchups will be the most informative tests of the methodology.