ModelBall
April 22, 2026

Meet the models: how five AIs see football differently

Before the World Cup begins, we want to introduce you to the five AI models making predictions for every match. They're not just different products from different companies: they read the same match in measurably different ways. Understanding those tendencies is key to understanding why we combine them into an ensemble.

Why models disagree

Give five analysts the same match data and they'll reach different conclusions. The same is true for AI models. Despite having access to identical information — team names, recent form, head-to-head records, betting odds — GPT-5.4, GPT-5.5, Claude, Grok, and Gemini consistently produce different probability estimates.

This isn't randomness. Each model has been trained on different data, with different objectives, by teams with different priorities. These differences create systematic patterns, which we call behavioral fingerprints. In our calibration testing, models disagreed by an average of 10.3 percentage points on home-win probability. A gap that large can decide which outcome comes out as the favorite in a closely matched game.
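For readers who want to see what a disagreement figure like this can mean in practice, here is a minimal sketch. It assumes disagreement is measured as the mean absolute gap between every pair of models on one match; that definition, the function name, and the example numbers are our illustration, not necessarily the exact calculation behind the 10.3-point figure.

```python
from itertools import combinations
from statistics import mean


def mean_pairwise_gap(home_win_probs: dict[str, float]) -> float:
    """Average absolute gap, in percentage points, across every pair of models."""
    gaps = [abs(a - b) for a, b in combinations(home_win_probs.values(), 2)]
    return 100 * mean(gaps)


# Hypothetical home-win probabilities for a single match (not real model output).
example = {"model_a": 0.60, "model_b": 0.50, "model_c": 0.55}
print(round(mean_pairwise_gap(example), 1))  # prints 6.7
```

Averaging this per-match value over a set of calibration matches would give a single headline number of the kind quoted above.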

GPT-5.4: the market baseline

OpenAI's flagship model tends to produce predictions that closely track betting market consensus. When bookmakers favor a team, GPT-5.4 usually agrees. This makes it a useful baseline — it represents what “conventional wisdom” looks like when processed through AI.

The downside? GPT-5.4 inherits the market's biases. It tends to overvalue teams from prestigious leagues, especially the Premier League. When Manchester City plays a strong MLS side, GPT-5.4's confidence in City often exceeds what the underlying statistics support. This prestige bias is measurable and consistent.

GPT-5.4 profile

  • Strength: Tracks market consensus accurately
  • Weakness: Inherits market biases, especially league prestige
  • Best for: Matches where market pricing is likely efficient
  • Watch out: Overconfidence in Big 5 league teams

Claude: the analyst

Anthropic's Claude takes a more cautious approach. Where other models might give a favorite a 65% chance of winning, Claude often pulls back to 55-58%. This conservatism comes from Claude's training emphasis on careful reasoning and acknowledging uncertainty.

Claude's distinctive feature is how it handles home advantage. Our testing shows Claude systematically over-adjusts for home field, adding 8-12 percentage points more than historical data supports. This will be especially relevant for matches involving host nations (USA, Mexico, Canada), where home advantage is already reflected in the betting odds the models receive.

The upside? Claude's caution pays off when upsets happen. In our calibration matches, Claude had the best single prediction when Japan beat England — it was the only model to give Japan a realistic chance.

Claude profile

  • Strength: Cautious predictions, good at spotting upset potential
  • Weakness: Over-adjusts for home advantage
  • Best for: Matches where favorites might be overvalued
  • Watch out: Host nation matches, where the home boost can be double-counted

Grok: the contrarian

xAI's Grok stands apart from the pack. It shows the least sensitivity to home advantage of any model — sometimes treating away teams almost as favorably as home teams. At the same time, Grok weights betting odds more heavily than the others, creating an interesting tension between market-following and crowd-ignoring.

This makes Grok valuable precisely when it disagrees with consensus. If three models favor the home team and Grok doesn't, that's a signal worth investigating. Grok's contrarian streak isn't always right, but when it is, it often catches outcomes others missed.
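The "everyone else favors the home team, Grok doesn't" signal is easy to express in code. The sketch below is one possible way to flag it; the function names, the (home, draw, away) tuple layout, and the example numbers are assumptions for illustration, not our production logic.

```python
Probs = tuple[float, float, float]  # (home win, draw, away win)


def favors_home(p: Probs) -> bool:
    """True when the home win is strictly the most likely outcome."""
    return p[0] > max(p[1], p[2])


def is_lone_dissenter(predictions: dict[str, Probs], model: str = "grok") -> bool:
    """True when every other model favors the home team and `model` does not."""
    others = [favors_home(p) for name, p in predictions.items() if name != model]
    return bool(others) and all(others) and not favors_home(predictions[model])


# Hypothetical predictions for one match (not real model output).
match = {
    "gpt": (0.52, 0.25, 0.23),
    "claude": (0.55, 0.24, 0.21),
    "gemini": (0.50, 0.26, 0.24),
    "grok": (0.36, 0.26, 0.38),
}
print(is_lone_dissenter(match))  # prints True
```

A flag like this only marks a match for a closer look; it says nothing about whether the dissenting model is right.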

Grok profile

  • Strength: Independent thinking, less herd behavior
  • Weakness: Underweights home advantage
  • Best for: Identifying when consensus might be wrong
  • Watch out: Away teams in hostile environments

Gemini: the generalist

Google's Gemini 3.1 Pro sits in the middle of the pack on most dimensions. It doesn't have Claude's caution or Grok's contrarianism. What it does have is the strongest league prestige bias of any model — a roughly 1.4x multiplier on perceived quality for Premier League teams.

This makes Gemini particularly unreliable when evaluating players or teams from smaller leagues. An Argentine Primera player with identical stats to a Premier League player gets systematically downgraded in Gemini's assessments.

Gemini profile

  • Strength: Balanced across most dimensions
  • Weakness: Strongest league prestige bias
  • Best for: Matches between similarly-ranked leagues
  • Watch out: Any match involving non-Big 5 league teams

Why we combine them

Each model has blind spots. But crucially, they have different blind spots. Claude over-adjusts home advantage while Grok under-adjusts it. Gemini has the strongest league bias while Grok has the weakest. These opposing tendencies create an opportunity.

Our ensemble approach doesn't just average the five models — that would still inherit shared biases. Instead, we weight models based on the match context. For a host nation match, we reduce Claude's influence because we know it over-adjusts. For a match between a Premier League side and an MLS team, we reduce Gemini's weight.
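To make context-dependent weighting concrete, here is a minimal sketch. It assumes every model starts with equal weight and that a model with a known bias for the match context is simply down-weighted by a fixed factor. The 0.5 penalty, the function names, and the example probabilities are hypothetical; the actual weights are not shown here.

```python
Probs = tuple[float, float, float]  # (home win, draw, away win)


def context_weights(models: list[str], host_nation: bool,
                    cross_league: bool, penalty: float = 0.5) -> dict[str, float]:
    """Start from equal weights, then shrink models with a known bias for this context."""
    weights = {m: 1.0 for m in models}
    if host_nation and "claude" in weights:
        weights["claude"] *= penalty  # Claude over-adjusts for home advantage
    if cross_league and "gemini" in weights:
        weights["gemini"] *= penalty  # Gemini has the strongest league prestige bias
    return weights


def weighted_ensemble(probs: dict[str, Probs], weights: dict[str, float]) -> Probs:
    """Weighted average of each outcome; stays normalized if the inputs are."""
    total = sum(weights[m] for m in probs)
    home, draw, away = (
        sum(weights[m] * probs[m][i] for m in probs) / total for i in range(3)
    )
    return (home, draw, away)


# Hypothetical host-nation match with two models (not real model output).
probs = {"claude": (0.60, 0.22, 0.18), "grok": (0.45, 0.28, 0.27)}
weights = context_weights(list(probs), host_nation=True, cross_league=False)
print(weighted_ensemble(probs, weights))
```

Because the blend is a weighted average of distributions that each sum to 1, the result also sums to 1 without a separate renormalization step.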

The result is The Edge: a fingerprint-corrected ensemble that should, in theory, outperform any individual model. Whether it actually does is what we're testing across 104 World Cup matches.

Following along

Throughout the tournament, you can see exactly how each model predicts every match on our matches page. When models disagree significantly, we'll highlight why based on their fingerprints. And on the leaderboard, you can track which models are actually performing best.

Tomorrow, we'll dive deeper into one specific bias: why AI systematically overvalues players from prestigious leagues, and what that means for World Cup predictions.
