ModelBall
April 25, 2026

What happens when all five models agree (and when they don't)

methodology, divergence

Five AI models looking at the same match data. Sometimes they agree within 3 percentage points. Sometimes they disagree by 20. Both scenarios tell us something valuable — but the lessons are different. Here's what we've learned about consensus, divergence, and when each signals opportunity.

The consensus intuition

There's a natural assumption that when multiple independent sources agree, they're probably right. In forecasting, this is sometimes called the “wisdom of crowds” effect. If all five models think Brazil beats Costa Rica with 70% probability, that convergence feels meaningful.

And there's truth to this intuition. In our calibration matches, when all five models agreed within 5 percentage points, the favorite won 71% of the time. That's reasonably well-calibrated — the models collectively knew what they were doing.

When models agree (<5pp spread)

Favorite wins: 71%
Avg Brier score: 0.48
Calibration matches: 5/12

The divergence puzzle

But here's where it gets interesting. When models disagree significantly — spreads of 10+ percentage points — the outcomes become much harder to predict. The favorite (according to the most confident model) wins only 54% of the time. That's barely better than a coin flip.

This tells us something crucial: high divergence is a marker of genuine uncertainty. When GPT-5.4 says 68% England and Claude says 52% England (as happened in our calibration), the models aren't just processing data differently — they're revealing that the match is fundamentally hard to predict.

When models disagree (>10pp spread)

Favorite wins: 54%
Avg Brier score: 0.61
Calibration matches: 7/12
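For readers who want to see how a Brier score like the ones above can be computed, here is a minimal sketch. The post does not say which Brier variant produced these figures; the code assumes the three-outcome form (sum of squared errors across win, draw, and loss, ranging from 0 for a perfect forecast to 2 for a confident miss). The function name and the outcome labels are illustrative, not part of our pipeline.

```python
def brier_score(probs, outcome):
    """Multi-class Brier score for one match (assumed variant).

    probs   -- dict mapping each outcome label to a forecast probability
    outcome -- the label that actually happened
    Lower is better: 0.0 is a perfect forecast, 2.0 is the worst possible.
    """
    if outcome not in probs:
        raise ValueError(f"unknown outcome: {outcome!r}")
    # Squared error against 1 for the realized outcome, against 0 for the rest.
    return sum(
        (p - (1.0 if label == outcome else 0.0)) ** 2
        for label, p in probs.items()
    )


# A forecaster with no information (one third on each outcome) is a useful baseline.
uniform = {"home": 1 / 3, "draw": 1 / 3, "away": 1 / 3}
print(round(brier_score(uniform, "home"), 3))  # 0.667
```

Under this assumed variant, the uninformed baseline of roughly 0.667 gives a reference point for reading the averages in the two summaries.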

Why divergence happens

Model disagreement isn't random. When we analyzed our high-divergence cases, clear patterns emerged in what triggers disagreement:

  • Reputation vs form mismatches: When a historically strong team is in poor recent form (or vice versa), models weight these factors differently. GPT leans toward reputation; Grok leans toward recent results.
  • Home advantage edge cases: Matches where home advantage is ambiguous — neutral venues, shared stadiums, host nation opponents — trigger different adjustments across models.
  • League quality disagreements: When teams from different leagues meet, models disagree on how much league context matters. This especially affects intercontinental matches.
  • Key player uncertainty: When star players have ambiguous fitness or form, models make different assumptions about their impact.

Divergence as signal

This is the key insight that powers The Edge. High model divergence isn't noise — it's information. It tells us:

  1. The match has features that models process differently
  2. At least some models are probably miscalibrated for this specific context
  3. Our fingerprint data might help us identify which ones

When Claude and Grok disagree on a host nation match, we know Claude tends to over-adjust for home advantage. So we weight Grok's prediction more heavily. When Gemini and GPT disagree on a match involving a non-Big 5 team, we know both have league prestige biases — but Grok's is weaker, so we lean toward Grok.

Divergence interpretation guide

<5pp spread: Models agree. Trust the consensus. Match is relatively predictable.
5-10pp spread: Moderate disagreement. Check which biases might be active. Weighted ensemble most valuable here.
>10pp spread: High uncertainty. Upset potential is real. Bet sizing should reflect uncertainty, not conviction.
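The guide above maps directly onto a small lookup. This is a sketch, not our production code: the function name and the returned labels are made up for illustration, and treating exactly 5pp and exactly 10pp as "moderate" is an assumption, since the guide does not say which band the boundaries fall in.

```python
def classify_spread(spread_pp):
    """Map a model spread (in percentage points) to a divergence band.

    Boundary handling is an assumption: exactly 5 and exactly 10
    are both treated as moderate disagreement.
    """
    if spread_pp < 0:
        raise ValueError("spread cannot be negative")
    if spread_pp < 5:
        return "consensus"   # models agree, trust the consensus
    if spread_pp <= 10:
        return "moderate"    # check which biases might be active
    return "high"            # genuine uncertainty, size bets accordingly


for spread in (3, 7, 16):
    print(spread, classify_spread(spread))
```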

The England vs Japan case

Our calibration data gave us a perfect example. England vs Japan in March 2026:

Model      England win   Draw   Japan win
GPT-5.4    68%           20%    12%
Grok       60%           23%    17%
Gemini     58%           24%    18%
Claude     52%           26%    22%

The spread was 16 percentage points (68% vs 52%). This flagged the match as high uncertainty. And the outcome? Japan won 1-0. Claude's cautious prediction was vindicated; GPT's confident England call was wrong.

This doesn't mean Claude is “better” — it might just mean Claude's caution happened to match this particular result. But it illustrates why treating divergence as a signal, rather than noise, can improve predictions.
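To make the spread calculation concrete, here is a small sketch that reproduces the 16-point figure from the table. The data structure and function name are illustrative assumptions; the numbers are the ones published above, kept as whole percentages.

```python
# Forecasts from the England vs Japan table, in percent.
forecasts = {
    "GPT-5.4": {"england": 68, "draw": 20, "japan": 12},
    "Grok":    {"england": 60, "draw": 23, "japan": 17},
    "Gemini":  {"england": 58, "draw": 24, "japan": 18},
    "Claude":  {"england": 52, "draw": 26, "japan": 22},
}


def spread_pp(forecasts, outcome):
    """Gap between the highest and lowest model probability for one outcome."""
    values = [model_probs[outcome] for model_probs in forecasts.values()]
    return max(values) - min(values)


print(spread_pp(forecasts, "england"))  # 16
```

Measured on the England win column, the spread is well past the 10pp threshold, which is what flagged the match as high uncertainty.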

How we use this

The Edge incorporates divergence directly into its weighting. When models disagree:

  1. We identify which fingerprint dimensions are likely active (home advantage? league prestige?)
  2. We check which models have known biases on those dimensions
  3. We reduce weight for biased models, increase weight for less-biased ones
  4. The final prediction is a bias-corrected weighted average

For consensus matches (low divergence), we essentially just average — there's no clear reason to prefer one model over another. But for high-divergence matches, our fingerprint data earns its keep.
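As a rough illustration of the two modes described above, the sketch below averages model forecasts with equal weights when no fingerprint dimension is active, and down-weights models with a known bias when one is. Everything specific here is hypothetical: the model names, the bias scores, the dimension label, and the weighting rule 1 / (1 + penalty) are assumptions chosen for readability, not the actual formula The Edge uses.

```python
def ensemble(forecasts, bias=None, active_dims=()):
    """Weighted average of per-model outcome probabilities.

    forecasts   -- {model: {outcome: probability}}
    bias        -- {model: {dimension: bias score >= 0}} (hypothetical scale)
    active_dims -- fingerprint dimensions judged relevant to this match
    With no active dimensions every model gets the same weight,
    so the result is a plain average.
    """
    bias = bias or {}
    raw = {}
    for model in forecasts:
        penalty = sum(bias.get(model, {}).get(dim, 0.0) for dim in active_dims)
        raw[model] = 1.0 / (1.0 + penalty)  # assumed rule: more bias, less weight
    total = sum(raw.values())
    weights = {model: w / total for model, w in raw.items()}

    outcomes = next(iter(forecasts.values())).keys()
    return {
        outcome: sum(weights[m] * forecasts[m][outcome] for m in forecasts)
        for outcome in outcomes
    }


# Hypothetical two-model example.
example = {
    "model_a": {"home": 0.60, "draw": 0.25, "away": 0.15},
    "model_b": {"home": 0.50, "draw": 0.30, "away": 0.20},
}
example_bias = {"model_a": {"home_advantage": 1.0}}  # made-up bias score

plain = ensemble(example)
corrected = ensemble(example, example_bias, active_dims=("home_advantage",))
print(plain["home"], corrected["home"])
```

Because the weights are normalized to sum to one, the output is still a valid probability distribution whenever each input forecast is.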

What's next

Tomorrow, we'll explain The Edge methodology in plain English — exactly how we turn behavioral fingerprints into prediction weights. No math required, just the intuition behind why this should work (and what would prove us wrong).
