ModelBall
April 21, 2026

Pre-tournament calibration: 12 matches tested

calibration, methodology

Ahead of the World Cup, we tested our five-model ensemble on 12 international friendlies played in March and April 2026. These matches took place after every model's training cutoff, so none of the results could have leaked into training data. The results gave us confidence that our methodology works, and they revealed fascinating differences in how each AI approaches football prediction.

Why we ran this test

Building a prediction system is one thing. Knowing whether it actually works is another. We needed to validate our approach before the tournament, but we faced a challenge: any historical match data might have been seen during model training. The solution was to wait for fresh matches that occurred after all five models' knowledge cutoffs.

International friendlies in March and April 2026 gave us exactly that: real international football with genuine uncertainty, completely unseen by any model. We collected historical odds data for 12 matches and asked each model (GPT-5.4, Claude, Grok, Gemini) to predict outcomes. We then scored predictions against actual results using Brier scores, where lower is better.
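For readers unfamiliar with the metric, here is a minimal sketch of how a Brier score can be computed for a three-outcome football forecast. It assumes the multi-class form (sum of squared errors across home win, draw, and away win), which is consistent with scores in the 0.5 to 0.6 range but is not confirmed as the exact formula used here. The function names and the example forecasts are hypothetical.

```python
from typing import List, Sequence

# Outcome order assumed for every probability triple in this sketch.
OUTCOMES = ("home", "draw", "away")


def brier_score(probs: Sequence[float], outcome: str) -> float:
    """Multi-class Brier score for one match (0 is perfect, 2 is worst)."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    # Squared error against a one-hot vector of what actually happened.
    return sum(
        (p - (1.0 if name == outcome else 0.0)) ** 2
        for p, name in zip(probs, OUTCOMES)
    )


def mean_brier(predictions: List[Sequence[float]], results: List[str]) -> float:
    """Average the per-match Brier score over a set of matches."""
    scores = [brier_score(p, r) for p, r in zip(predictions, results)]
    return sum(scores) / len(scores)


# Hypothetical forecasts (home, draw, away), not taken from the test set.
example_predictions = [(0.50, 0.30, 0.20), (0.40, 0.30, 0.30)]
example_results = ["home", "away"]
example_mean = mean_brier(example_predictions, example_results)

# A forecaster with no information (one third on each outcome) is a useful yardstick.
uniform_score = brier_score((1 / 3, 1 / 3, 1 / 3), "draw")
```

Under this form, a forecaster who always puts one third on each outcome scores about 0.667, which gives a rough sense of scale for the numbers reported below.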

Key finding: models predict differently

This was the most important validation. Given identical match data—the same team names, recent form, head-to-head records, and betting odds—the five models produced meaningfully different predictions. On average, models disagreed by 10.3 percentage points on home-win probability. This isn't noise; it's evidence of distinct behavioral fingerprints.

The disagreement matters because it's the foundation of our ensemble approach. If all five models gave identical predictions, combining them would add no value. But when they see the same data differently, we can potentially identify which perspective is more accurate in which contexts.

Example: England vs Japan (March 31)

Model      Probability   Predicted winner
GPT-5.4    68%           England
Claude     52%           England
Grok       60%           England
Gemini     58%           England

Result: Japan won 1-0. Claude's more cautious prediction (52%) was closest to reality. GPT-5.4's 68% confidence in England proved overconfident—a pattern we've seen in our fingerprinting research around prestige bias.
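To make the disagreement in this example concrete, here is a small sketch that measures it two ways: the mean absolute gap across all pairs of models, and the max-minus-min spread. We haven't specified above which definition sits behind the 10.3-point average, so both functions are illustrative assumptions rather than the exact calculation.

```python
from itertools import combinations
from typing import Sequence


def mean_pairwise_gap(probs: Sequence[float]) -> float:
    """Mean absolute difference over every pair of model probabilities."""
    pairs = list(combinations(probs, 2))
    if not pairs:
        raise ValueError("need at least two models to measure disagreement")
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def spread(probs: Sequence[float]) -> float:
    """Gap between the most and least confident model."""
    return max(probs) - min(probs)


# England-win probabilities (in percentage points) from the example above.
england_win = {"GPT-5.4": 68.0, "Claude": 52.0, "Grok": 60.0, "Gemini": 58.0}
values = list(england_win.values())

pairwise = mean_pairwise_gap(values)  # about 8.3 points
full_spread = spread(values)          # 16 points
```

The two measures answer slightly different questions: the pairwise gap describes typical disagreement, while the spread highlights the two most extreme models.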

Ensemble beats the average

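The central question driving Modelball: does combining models improve predictions? Our calibration data points to yes, though narrowly. The ensemble posted a lower Brier score than both market odds and the simple average of individual models, while all three methods picked the correct outcome in 58.3% of matches.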

This matters because the ensemble isn't just averaging—it's weighting based on our fingerprint research. Models with known biases toward certain outcomes get less weight in contexts where those biases might hurt accuracy. The early results suggest this approach has merit.

Method                Brier score   Accuracy   vs Market
Ensemble (5 models)   0.549         58.3%      better
Market odds           0.552         58.3%      baseline
Simple average        0.550         58.3%      marginal
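As a rough illustration of the difference between a weighted ensemble and a simple average, here is a minimal sketch that combines per-model outcome probabilities with per-model weights. The model names, forecasts, and weights are all hypothetical, and the linear weighting scheme is an assumption, not a description of our production pipeline.

```python
from typing import Dict, Tuple

Forecast = Tuple[float, float, float]  # (home win, draw, away win)


def weighted_ensemble(
    forecasts: Dict[str, Forecast], weights: Dict[str, float]
) -> Forecast:
    """Linear combination of model forecasts, renormalized by total weight."""
    total = sum(weights[name] for name in forecasts)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    combined = [
        sum(weights[name] * forecasts[name][i] for name in forecasts) / total
        for i in range(3)
    ]
    return (combined[0], combined[1], combined[2])


# Hypothetical forecasts from three hypothetical models.
forecasts = {
    "model_a": (0.60, 0.25, 0.15),
    "model_b": (0.50, 0.30, 0.20),
    "model_c": (0.40, 0.30, 0.30),
}

# Equal weights reduce the ensemble to a simple average.
equal = weighted_ensemble(forecasts, {name: 1.0 for name in forecasts})

# Hypothetical context-specific weights that trust model_a more.
tilted = weighted_ensemble(
    forecasts, {"model_a": 0.5, "model_b": 0.3, "model_c": 0.2}
)
```

Because each input forecast sums to 1 and the weights are renormalized, the combined forecast also sums to 1, so it can be scored with the same metric as the individual models.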

Individual model performance

Looking at each model individually reveals interesting patterns. Grok performed best on Brier score even though every model in the table picked the same number of correct outcomes (7 of 12). This suggests better calibration: when Grok was uncertain, its probabilities reflected that uncertainty more accurately.

Claude, despite having the best single prediction on England vs Japan, ranked last overall. This isn't necessarily bad news—Claude's fingerprint shows more conservative predictions generally, which can hurt in matches where favorites do win convincingly. The question is whether Claude's caution helps more in upset-prone World Cup matches.

Model      Brier score   Correct   Notes
Grok       0.538         7/12      Best calibrated
Gemini     0.546         7/12      Consistent
GPT-5.4    0.553         7/12      Slightly overconfident
Claude     0.561         7/12      Conservative

Honest limitations

We want to be transparent about what this test does and doesn't prove. These are genuine constraints on our conclusions:

  • Small sample size: 12 matches means wide confidence intervals. The model rankings could flip with a few different outcomes. We need the full 104-match World Cup to draw stronger conclusions.
  • Friendlies differ from competitive matches: Players and managers approach friendlies differently. Rotation is common, motivation varies, and tactical experimentation happens. World Cup knockout matches are different beasts.
  • Potential odds anchoring: We provide betting odds as context to each model. Models may be partly echoing these odds rather than making fully independent assessments. We're exploring odds-free prompts for future research.
  • Single time period: March-April 2026 had its own quirks, such as late-season fatigue and pre-tournament experimentation. Different periods might show different patterns.
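To put a number on the small-sample caveat, here is a sketch that computes a 95% Wilson score interval for an accuracy of 7 correct out of 12. The choice of the Wilson interval is ours for illustration; it is one standard option for small samples, and the helper name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 7 correct picks out of 12 matches, as in the tables above.
low, high = wilson_interval(7, 12)
```

With only 12 matches the interval runs from roughly 32% to 81%, which is why identical 58.3% accuracy figures and Brier gaps of a few thousandths cannot yet separate the methods.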

What this means for the World Cup

This calibration test supports our core methodology. Models genuinely predict differently, and on this sample the ensemble edged out both market odds and naive averaging, although Grok and Gemini each posted a better Brier score on their own. We're treating these results as encouraging signs, not proof of concept.

We're keeping equal weights (25% each) for now. With 104 World Cup matches across group stages and knockouts, we'll have much stronger data to assess which models excel in which contexts. Does Claude's caution help in knockout matches where upsets are more common? Does GPT-5.4's confidence in favorites pay off in lopsided group stage matches?

The real findings will emerge during the tournament itself. Follow along as we learn whether knowing these models' blind spots actually helps predict football better.
