Honest limitations: what we don't know yet

This week we've explained our methodology, our findings, and our predictions. Now it's time for the uncomfortable part: being honest about what we don't know. Good research requires transparency about limitations, not just confidence about conclusions.

Sample size constraints

Our pre-tournament calibration tested 12 international friendlies. That's enough to see patterns, but not enough for statistical certainty. With 12 matches, the difference between models ranking first and fourth could easily be noise.

The World Cup gives us 104 matches, a much larger sample. But even 104 isn't huge for the granular questions we want to answer. “Does Claude over-adjust on host nation matches?” might only have 20-30 relevant data points (matches involving the USA, Mexico, or Canada). That's enough for directional evidence, not proof.

What sample size means in practice

12 matches: Can see if patterns exist. Cannot prove they're real vs random.
104 matches: Meaningful overall rankings. Subgroup analysis still noisy.
500+ matches: What we'd need for confident conclusions about specific contexts.
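
To give a feel for why these thresholds matter, here is a minimal sketch of how statistical noise shrinks with sample size. It is an illustration only, not our scoring code: it assumes each match is an independent trial, uses a hypothetical hit rate of 0.5, and applies the standard normal approximation for a 95% margin of error. The function name `margin_of_error` is made up for this example.

```python
import math

def margin_of_error(n_matches, hit_rate=0.5, z=1.96):
    """Approximate 95% margin of error for an observed hit rate.

    Assumes independent matches and a normal approximation,
    which is itself rough for samples as small as 12.
    """
    return z * math.sqrt(hit_rate * (1 - hit_rate) / n_matches)

# Sample sizes taken from the list above (500 stands in for "500+").
for n in (12, 104, 500):
    print(f"{n:>4} matches: +/- {margin_of_error(n):.3f}")
```

Under these assumptions the uncertainty at 12 matches is roughly three times the uncertainty at 104, which is why a gap between first and fourth place in a 12-match sample can easily be noise.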

Friendlies aren't World Cups

Our calibration data comes from international friendlies in March-April 2026. These matches are genuinely competitive, but the stakes are lower than at a World Cup. Teams rotate squads, experiment tactically, and manage injury risk differently.

We don't know if our fingerprint findings transfer cleanly to tournament football. Maybe Claude's home advantage over-adjustment is actually correct for high-stakes matches where crowds matter more. Maybe Gemini's prestige bias is more accurate when elite players perform to their reputation under pressure.

This is why we're running the experiment publicly. We're not claiming our calibration proves the methodology works — we're claiming it's promising enough to test on the real thing.

Potential odds anchoring

Our prediction prompts include betting odds as context. We do this because odds contain real information — they aggregate millions of dollars of market intelligence. But it creates a methodological concern: are models making independent predictions, or are they just echoing odds back with slight adjustments?

We've seen evidence of both. Models clearly incorporate odds information (their predictions correlate with market prices). But they also diverge from odds in systematic ways that reflect their training biases. The divergence is the interesting part — but we can't be certain how much of each prediction is original analysis versus odds processing.
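
One way to look at the divergence described above is to convert odds into implied probabilities and measure the average signed gap between a model and the market. The sketch below is a hedged illustration, not our pipeline: it assumes decimal odds in (home, draw, away) order, removes the bookmaker margin by simple normalization, and uses invented numbers. The names `implied_probabilities` and `mean_divergence` are hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds (home, draw, away) to probabilities that sum to 1."""
    raw = [1.0 / odds for odds in decimal_odds]
    overround = sum(raw)  # the bookmaker margin pushes this above 1
    return [r / overround for r in raw]

def mean_divergence(model_probs, market_probs):
    """Average signed gap between model and market for one outcome across matches."""
    gaps = [m - k for m, k in zip(model_probs, market_probs)]
    return sum(gaps) / len(gaps)

# Hypothetical data: decimal odds and one model's home-win probability per match.
matches_odds = [(2.0, 3.5, 4.0), (1.5, 4.2, 7.0), (2.8, 3.2, 2.6)]
model_home = [0.55, 0.70, 0.40]

market_home = [implied_probabilities(odds)[0] for odds in matches_odds]
print(f"mean home-win divergence: {mean_divergence(model_home, market_home):+.3f}")
```

A mean divergence that stays clearly positive or negative across many matches would suggest a systematic adjustment on top of the odds. A value near zero would be consistent with the model mostly echoing the market.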

Future research direction

We're exploring odds-free prompts for future studies — asking models to predict without seeing market prices. This would test how much independent analysis each model can do. But for the World Cup experiment, we're using odds-inclusive prompts to match how these models would actually be used in practice.

Single time period

Our fingerprinting study was conducted in early 2026. AI models update constantly. The Claude that exists during the World Cup might behave differently than the Claude we tested. OpenAI might push a GPT-5.4 update that changes its biases.

We're logging model version strings for every prediction, so we'll know if versions change mid-tournament. But we can't re-run 45,300 fingerprinting queries before each match. Our bias corrections are based on a snapshot that might drift out of date.
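
Detecting a mid-tournament version change from those logs could look something like the sketch below. The record layout, the function name `detect_version_changes`, and the version strings are all hypothetical, chosen only to illustrate the idea. Records are assumed to be in chronological order.

```python
def detect_version_changes(records):
    """Return (model, old_version, new_version, match_id) for each version switch.

    `records` is a chronologically ordered list of dicts with the
    hypothetical keys "model", "version", and "match_id".
    """
    last_seen = {}
    changes = []
    for rec in records:
        model, version = rec["model"], rec["version"]
        previous = last_seen.get(model)
        if previous is not None and previous != version:
            changes.append((model, previous, version, rec["match_id"]))
        last_seen[model] = version
    return changes

# Hypothetical log entries; the version strings are placeholders, not real ones.
log = [
    {"model": "claude", "version": "v-a", "match_id": 1},
    {"model": "gemini", "version": "v-x", "match_id": 1},
    {"model": "claude", "version": "v-a", "match_id": 2},
    {"model": "claude", "version": "v-b", "match_id": 3},
]
print(detect_version_changes(log))
```

A flagged change would not tell us whether the biases moved, only that the snapshot behind our corrections may no longer describe the model making the prediction.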

Unknown unknowns

Our 12-dimension fingerprinting framework captures the biases we thought to test. But models might have biases we didn't measure:

Weather effects on predictions
Time-of-day biases
Specific country or player name effects
Tournament stage progressions (do models get better or worse as stakes rise?)
Recency bias (over-weighting a team's latest results)

We can't correct for biases we haven't measured. The Edge accounts for known biases; unknown biases remain uncorrected in every method we compare.

What would prove us wrong

We've designed this experiment to be falsifiable. Here are specific outcomes that would indicate our methodology doesn't work:

Outcome	What it means
Naive beats Edge by >0.01 Brier	Fingerprint corrections actively hurt predictions. Methodology is wrong.
Context-specific corrections backfire	Reducing Claude's home advantage adjustment on host nation matches makes its predictions worse. Bias direction was wrong.
All methods perform identically	Model differences are noise, not signal. Fingerprinting was measuring nothing real.
Single model dominates	One model is just better at football. No need for ensembles or corrections.
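
The first row of the table can be checked mechanically. The sketch below assumes the multi-class Brier score over three outcomes (home win, draw, away win), where lower is better; the exact Brier variant is an assumption made for this example, and the function names are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean multi-class Brier score; lower is better.

    `forecasts` is a list of [p_home, p_draw, p_away] lists and
    `outcomes` is a list of indices (0, 1, or 2) of what happened.
    """
    total = 0.0
    for probs, outcome in zip(forecasts, outcomes):
        total += sum(
            (p - (1.0 if i == outcome else 0.0)) ** 2
            for i, p in enumerate(probs)
        )
    return total / len(forecasts)

def naive_beats_edge(naive_brier, edge_brier, threshold=0.01):
    """True if the naive average scores better than The Edge by more than `threshold`."""
    return (edge_brier - naive_brier) > threshold

# Hypothetical scores, for illustration only.
print(naive_beats_edge(naive_brier=0.58, edge_brier=0.60))
```

Because lower Brier scores are better, "Naive beats Edge" means The Edge's score is the larger of the two, which is why the check subtracts the naive score from The Edge's.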

What we're confident about

Despite these limitations, some findings are robust enough to state confidently:

Models genuinely differ: The 10+ percentage point disagreements we observe aren't measurement error. They reflect real differences in how models process football information.
Biases are systematic: Claude's home advantage pattern, Gemini's prestige bias — these appear consistently across hundreds of trials, not randomly.
Combining models helps: Even naive averaging outperformed most individual models in our calibration. Ensemble approaches work.

The open question is whether smart combining (The Edge) beats simple combining (naive average). That's what the tournament will test.
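
For reference, the naive average baseline is simple enough to sketch in a few lines. This is a minimal illustration with invented numbers; the function name `naive_average` and the forecast layout are assumptions for this example.

```python
def naive_average(model_forecasts):
    """Average each outcome's probability across models, with equal weights.

    `model_forecasts` is a list of [p_home, p_draw, p_away] lists,
    one per model, for a single match.
    """
    n_models = len(model_forecasts)
    n_outcomes = len(model_forecasts[0])
    return [
        sum(forecast[i] for forecast in model_forecasts) / n_models
        for i in range(n_outcomes)
    ]

# Hypothetical forecasts from two models for one match.
print(naive_average([[0.6, 0.2, 0.2], [0.4, 0.3, 0.3]]))
```

If every input forecast sums to 1, the average does too, so no renormalization is needed.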

Why we publish this

Most prediction services don't tell you their limitations. They present confidence, not uncertainty. We think that's backwards.

If you're going to use our predictions — or learn from our research — you deserve to know what we don't know. Hiding limitations doesn't make them go away; it just prevents you from properly weighting our claims.

By July 2026, we'll have much stronger evidence. We'll know if The Edge beat naive averaging. We'll know which specific corrections helped or hurt. And we'll publish that analysis with the same transparency we've shown here — including if the results prove our methodology was wrong.

Following the experiment

The World Cup starts June 11, 2026. From that day forward, predictions for every match will be published on our matches page before kickoff. Results will flow to the leaderboard in real time.

We'll publish weekly analysis posts examining what we're learning. Did host nation matches go as our corrections predicted? Are certain models emerging as more accurate in specific contexts?

And in July, when it's all over, we'll publish the full retrospective: what worked, what didn't, and what it means for anyone using AI in prediction contexts.

Until then, thank you for following along. The honest truth is we don't know what will happen. That's what makes it an experiment.