Tuesday's four Group I and J matches all went to the pre-match favourite, giving every model a clean results sheet — yet the Brier scores reveal a wide spread in how confidently, and correctly, they called it.
DR Congo held Portugal to a draw that no model really wanted to believe in, while England's 4-2 demolition of Croatia papered over some shaky probability estimates. Two findings from a busy Wednesday slate.
Before the opening whistle at the Azteca, all 360 group-stage predictions are logged and timestamped. England edges Argentina as favourite, Gemini backs Portugal, and all five models are cool on Brazil.
The 25-question Tactical Knowledge panel puts GPT-5.5 at the top of the field at 8.67/10, but only after we fixed a 500-token cap that had been silently hiding its responses.
League data isn't World Cup data. But the patterns we found point to which models will be reliable in June, where the corrections will matter most, and where the system might break.
Claude crushes it in La Liga but struggles in the Bundesliga. Grok is steady but unspectacular. GPT-5.4 surprises in MLS. The full breakdown by model and league.
A pattern emerged across 18 leagues: the further a league sits from the AI training-data centre of gravity, the more bias correction helps. The map matters.
A 13% Brier improvement, every model lifting, no model degrading. La Liga gave us our cleanest validation of the methodology — here's what it tells us about Spain at the World Cup.
Of every league we tested, the Premier League moved the least when we corrected for bias. We think we know why — and what it means for English clubs at the World Cup.
Our research reveals a systematic bias toward players from the Big 5 leagues. Given two players with identical stats, models prefer the one from the more prestigious league 58–71% of the time.
An introduction to GPT-5.4, GPT-5.5, Claude, Grok, and Gemini — their personalities, strengths, and blind spots. Understanding why they disagree is key to our methodology.
We tested our prediction methodology on international friendlies from March–April 2026. The ensemble beat market odds, and the models showed distinct behavioural patterns.