Honest limitations: what we don't know yet
This week we've explained our methodology, our findings, and our predictions. Now it's time for the uncomfortable part: being honest about what we don't know. Good research requires transparency about limitations, not just confidence about conclusions.
Sample size constraints
Our pre-tournament calibration tested 12 international friendlies. That's enough to see patterns, but not enough for statistical certainty. With 12 matches, the difference between models ranking first and fourth could easily be noise.
The World Cup gives us 104 matches, a much larger sample. But even 104 isn't huge for the granular questions we want to answer. A question like “Does Claude over-adjust on host-nation matches?” might rest on only 20-30 relevant data points (matches involving the USA, Mexico, or Canada). That's enough for directional evidence, not proof.
What sample size means in practice
- 12 matches: Can see if patterns exist. Cannot prove they're real vs random.
- 104 matches: Meaningful overall rankings. Subgroup analysis still noisy.
- 500+ matches: What we'd need for confident conclusions about specific contexts.
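
To put rough numbers on those tiers, here is a minimal sketch of how the uncertainty around a simple hit rate shrinks with sample size. It is an illustration, not the analysis behind the experiment: it assumes a 50% hit rate, uses the textbook normal approximation for a proportion (which is crude at 12 matches), and the function name `ci_half_width` is made up for this example.

```python
import math

def ci_half_width(hit_rate, n_matches, z=1.96):
    """Half-width of a normal-approximation 95% interval for a proportion.

    Assumption: matches are treated as independent trials.
    """
    return z * math.sqrt(hit_rate * (1.0 - hit_rate) / n_matches)

# Sample sizes from the text; 25 stands in for the 20-30 host-nation matches.
for n in (12, 25, 104, 500):
    print(f"{n:>3} matches: +/- {ci_half_width(0.5, n):.1%}")
```

Under these assumptions the interval is roughly 28 points either side at 12 matches, about 10 points at 104, and about 4 points at 500. A 25-match subgroup sits near 20 points, which is why subgroup questions stay noisy even with the full tournament.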
Friendlies aren't World Cups
Our calibration data comes from international friendlies in March-April 2026. These matches have real competition but different stakes than World Cup matches. Teams rotate squads, experiment tactically, and manage injury risk differently.
We don't know if our fingerprint findings transfer cleanly to tournament football. Maybe Claude's home advantage over-adjustment is actually correct for high-stakes matches where crowds matter more. Maybe Gemini's prestige bias is more accurate when elite players perform to their reputation under pressure.
This is why we're running the experiment publicly. We're not claiming our calibration proves the methodology works — we're claiming it's promising enough to test on the real thing.
Potential odds anchoring
Our prediction prompts include betting odds as context. We do this because odds contain real information — they aggregate millions of dollars of market intelligence. But it creates a methodological concern: are models making independent predictions, or are they just echoing odds back with slight adjustments?
We've seen evidence of both. Models clearly incorporate odds information (their predictions correlate with market prices). But they also diverge from odds in systematic ways that reflect their training biases. The divergence is the interesting part — but we can't be certain how much of each prediction is original analysis versus odds processing.
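
One way to make "divergence from odds" concrete is to convert market prices into probabilities and subtract them from a model's prediction. The sketch below is hedged accordingly: it assumes decimal odds for home win, draw, and away win, and it removes the bookmaker margin by simple proportional scaling, which is only one of several possible methods. The function names and the example numbers are hypothetical, not taken from our prompts or data.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Assumption: the bookmaker margin is removed by proportional scaling.
    """
    raw = [1.0 / odds for odds in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]

def divergence(model_probs, market_probs):
    """Per-outcome gap between a model's prediction and the market."""
    return [m - k for m, k in zip(model_probs, market_probs)]

# Hypothetical example: odds for home win, draw, away win.
market = implied_probabilities([1.8, 3.6, 4.5])
model = [0.60, 0.22, 0.18]  # made-up model output for illustration
print([round(d, 3) for d in divergence(model, market)])
```

A divergence that is consistently positive on the same kind of outcome across many matches would look like a systematic bias; a divergence that hovers around zero would look more like the model echoing the odds.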
Future research direction
We're exploring odds-free prompts for future studies — asking models to predict without seeing market prices. This would test how much independent analysis each model can do. But for the World Cup experiment, we're using odds-inclusive prompts to match how these models would actually be used in practice.
Single time period
Our fingerprinting study was conducted in early 2026. AI models update constantly. The Claude that exists during the World Cup might behave differently than the Claude we tested. OpenAI might push a GPT-5.4 update that changes its biases.
We're logging model version strings for every prediction, so we'll know if versions change mid-tournament. But we can't re-run 45,300 fingerprinting queries before each match. Our bias corrections are based on a snapshot that might drift out of date.
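
As a rough illustration of what version logging makes possible, the sketch below scans a prediction log and reports each point where a model's version string changes. The record layout, the field names, and the version strings are all assumptions made for the example; they are not the actual log schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PredictionRecord:
    match_id: str   # hypothetical identifier for the match
    model: str      # model family, e.g. "claude"
    version: str    # version string reported alongside the prediction

def version_changes(records):
    """Return (model, old_version, new_version, match_id) per change, in log order."""
    last_seen = {}
    changes = []
    for record in records:
        previous = last_seen.get(record.model)
        if previous is not None and previous != record.version:
            changes.append((record.model, previous, record.version, record.match_id))
        last_seen[record.model] = record.version
    return changes

# Placeholder log entries; "v-a" and "v-b" are not real version strings.
log = [
    PredictionRecord("m001", "claude", "v-a"),
    PredictionRecord("m002", "claude", "v-a"),
    PredictionRecord("m003", "claude", "v-b"),
]
print(version_changes(log))
```

Flagging the first match predicted under a new version at least marks where the snapshot-based bias corrections may have stopped applying.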
Unknown unknowns
Our 12-dimension fingerprinting framework captures the biases we thought to test. But models might have biases we didn't measure:
- Weather effects on predictions
- Time-of-day biases
- Specific country or player name effects
- Tournament stage progressions (do models get better or worse as stakes rise?)
- Recency bias toward recent results
We can't correct for biases we haven't measured. The Edge accounts only for known biases; unmeasured biases pass through every method uncorrected.
What would prove us wrong
We've designed this experiment to be falsifiable. Here are specific outcomes that would indicate our methodology doesn't work:
| Outcome | What it means |
|---|---|
| Naive beats Edge by >0.01 Brier | Fingerprint corrections actively hurt predictions. Methodology is wrong. |
| Context-specific corrections backfire | Adjusting Claude downward on host-nation matches makes its predictions worse. Bias direction was wrong. |
| All methods perform identically | Model differences are noise, not signal. Fingerprinting was measuring nothing real. |
| Single model dominates | One model is just better at football. No need for ensembles or corrections. |
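
The first row of the table can be checked mechanically once results are in. The sketch below assumes the multi-class form of the Brier score (squared error summed over home win, draw, and away win, then averaged over matches), where lower is better. Which Brier variant the leaderboard uses is an assumption here, and the function names are illustrative.

```python
def brier_score(probs, outcome_index):
    """Multi-class Brier score for one match: 0 is perfect, 2 is worst."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

def mean_brier(predictions, outcomes):
    """Average Brier score over a list of matches."""
    scores = [brier_score(p, o) for p, o in zip(predictions, outcomes)]
    return sum(scores) / len(scores)

def naive_beats_edge(naive_mean, edge_mean, margin=0.01):
    """True when the naive average is better (lower) by more than the margin."""
    return edge_mean - naive_mean > margin

# Hypothetical two-match example; outcome 0 = home win, 1 = draw, 2 = away win.
naive = mean_brier([[0.5, 0.25, 0.25], [0.25, 0.25, 0.5]], [0, 2])
edge = mean_brier([[0.25, 0.25, 0.5], [0.25, 0.25, 0.5]], [0, 2])
print(naive, edge, naive_beats_edge(naive, edge))
```

With only a handful of matches a 0.01 gap can easily be noise, so a check like this only becomes meaningful over the full tournament.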
What we're confident about
Despite these limitations, some findings are robust enough to state confidently:
- Models genuinely differ: The 10+ percentage point disagreements we observe aren't measurement error. They reflect real differences in how models process football information.
- Biases are systematic: Claude's home advantage pattern, Gemini's prestige bias — these appear consistently across hundreds of trials, not randomly.
- Combining models helps: Even naive averaging outperformed most individual models in our calibration. Ensemble approaches work.
The open question is whether smart combining (The Edge) beats simple combining (naive average). That's what the tournament will test.
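
For readers unfamiliar with the baseline, naive averaging just means taking the unweighted mean of each model's probability for every outcome. The sketch below shows that baseline only; the model names and probabilities are made up, and the weighting used by The Edge is not reproduced here.

```python
def naive_average(model_probs):
    """Unweighted mean of each outcome's probability across models.

    model_probs maps a model name to its [home, draw, away] probabilities.
    """
    predictions = list(model_probs.values())
    n_models = len(predictions)
    n_outcomes = len(predictions[0])
    return [
        sum(p[i] for p in predictions) / n_models
        for i in range(n_outcomes)
    ]

# Hypothetical predictions for one match.
ensemble = naive_average({
    "model_a": [0.5, 0.25, 0.25],
    "model_b": [0.25, 0.25, 0.5],
})
print(ensemble)
```

If every input sums to 1, the average does too, so no renormalisation is needed for this baseline.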
Why we publish this
Most prediction services don't tell you their limitations. They present confidence, not uncertainty. We think that's backwards.
If you're going to use our predictions — or learn from our research — you deserve to know what we don't know. Hiding limitations doesn't make them go away; it just prevents you from properly weighting our claims.
By July 2026, we'll have much stronger evidence. We'll know if The Edge beat naive averaging. We'll know which specific corrections helped or hurt. And we'll publish that analysis with the same transparency we've shown here — including if the results prove our methodology was wrong.
Following the experiment
The World Cup starts June 11, 2026. From that day forward, predictions for every match will be published on our matches page before kickoff. Results will flow to the leaderboard in real time.
We'll publish weekly analysis posts examining what we're learning. Did host nation matches go as our corrections predicted? Are certain models emerging as more accurate in specific contexts?
And in July, when it's all over, we'll publish the full retrospective: what worked, what didn't, and what it means for anyone using AI in prediction contexts.
Until then, thank you for following along. The honest truth is we don't know what will happen. That's what makes it an experiment.