ModelBall

Research methodology

Pre-registered behavioral fingerprinting of frontier language models

The papers

Two preprints describe the methodology and findings in full. The first measures the biases; the second uses those measurements to improve match-prediction accuracy.

April 21, 2026 · 26 pages · 511 KB

Moneyball for LLMs

Behavioral fingerprinting of frontier AI models in football talent evaluation and event prediction

Four frontier AI models tested across 12 talent dimensions and ~45,000 trials. Documents a League Prestige Discount (Cohen’s h = 1.18–1.41) that appears in every model tested, and demographic evaluation inconsistency with EU AI Act compliance implications.

April 27, 2026 · 18 pages · 353 KB

How can you improve the predictive power of LLMs in sports?

Two mechanisms for improving LLM football match predictions

979 matches across 18 leagues. A three-bias formula predicts model accuracy with r = 0.997 before any prediction is made. Bias-derived calibration improves Brier score by 4.6–7.3% per model.
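For readers unfamiliar with the metric, the Brier score is the mean squared difference between forecast probabilities and what actually happened, so lower is better. The sketch below assumes the three-outcome (home/draw/away) multi-class form and shows how a relative improvement like the 4.6–7.3% figure would be computed. The papers may use a different variant, and the function names and sample numbers are illustrative only.

```python
from typing import Sequence

OUTCOMES = ("home", "draw", "away")  # assumed outcome order for each forecast


def brier_score(forecasts: Sequence[Sequence[float]], results: Sequence[str]) -> float:
    """Mean multi-class Brier score over matches (0 is perfect, 2 is worst)."""
    if not forecasts or len(forecasts) != len(results):
        raise ValueError("need one result per forecast, and at least one match")
    total = 0.0
    for probs, result in zip(forecasts, results):
        # One-hot encode the observed outcome, then sum the squared errors.
        total += sum(
            (p - (1.0 if outcome == result else 0.0)) ** 2
            for p, outcome in zip(probs, OUTCOMES)
        )
    return total / len(forecasts)


def relative_improvement(raw: float, calibrated: float) -> float:
    """Fractional reduction in Brier score after calibration."""
    return (raw - calibrated) / raw


# Made-up forecasts for two matches, not study data.
raw = brier_score([[0.70, 0.20, 0.10], [0.50, 0.30, 0.20]], ["home", "draw"])
calibrated = brier_score([[0.65, 0.22, 0.13], [0.42, 0.36, 0.22]], ["home", "draw"])
print(f"raw={raw:.3f} calibrated={calibrated:.3f} "
      f"improvement={relative_improvement(raw, calibrated):.1%}")
```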

Study overview

Modelball is a pre-registered behavioral fingerprinting study of systematic biases in five frontier LLMs when they evaluate football talent and predict match outcomes. The study comprises more than 45,000 queries across 12 talent evaluation dimensions and a 10-test Prediction Calibration Module (PCM), and is validated by backtesting on 979 matches across 18 leagues.

Key Numbers

  • 5 frontier models evaluated
  • 12 talent evaluation dimensions
  • 10 prediction calibration tests
  • 45,000+ total queries
  • 18 leagues tested for calibration
  • 979 matches in backtesting dataset
  • 100% judge-blind scoring

Key findings by cluster

Market signals

D01 League Prestige: All models over-weight top-5 league players (h > 1.0)
D02 Club Pedigree: GPT and Grok show strong Champions League bias; Claude moderate
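The effect sizes on this page are reported as Cohen's h, the standard effect size for a difference between two proportions: h = 2·arcsin(√p1) − 2·arcsin(√p2). By Cohen's conventional benchmarks, 0.2 is small, 0.5 medium, and 0.8 large. The sketch below assumes each dimension is scored as a proportion of trials favoring one condition over another; that framing and the sample proportions are assumptions for illustration, not study data.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions, each in [0, 1]."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # Arcsine-square-root transform, then take the difference.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Hypothetical: share of trials in which a model rates the player higher
# when labeled as top-5 league (0.80) versus another league (0.25).
h = cohens_h(0.80, 0.25)
print(f"h = {h:.2f}")
```

A positive h means the first proportion is larger; swapping the arguments flips the sign.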

Temporal factors

D04 Age/Career: Models favor players in traditional "prime" years despite modern longevity data
D05 Transfer Window: All models correctly discount recent transfer value
D07 Prime Attribution: Strong negative effect — models appropriately avoid recency bias here
D08 Recency vs Experience: Very large negative effect — experience weighted appropriately

Attribute assessment

D03 Decisive Moments: Grok and Gemini strongly favor "clutch" narrative; GPT negative; Claude neutral
D11 Tactical Knowledge: TKI scores range 5.3-7.2/10 — moderate tactical understanding across models
D12 Tactical Compliance: Near-ceiling compliance (>97%) — task may be too easy

Contextual factors

D06 Tournament Pressure: All models show medium negative effect — appropriately skeptical of "big game player" narrative
D09 Return to Squad: Strong negative effect — models correctly discount injury-return concerns
D10 Media Narrative: Negligible effect across all models — resistant to media hype

Commercial applications

Ensemble prediction

Weight models by calibration to reduce systematic error in match forecasting
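One simple way to weight models by calibration is to make each weight proportional to the inverse of that model's backtested Brier score, then average the forecasts. This is a minimal sketch of that idea, not the study's actual weighting scheme, which is not described here. The model names, scores, and forecasts are placeholders.

```python
from typing import Dict, List


def calibration_weights(brier_by_model: Dict[str, float]) -> Dict[str, float]:
    """Normalized inverse-Brier weights: better-calibrated models weigh more."""
    if any(score <= 0 for score in brier_by_model.values()):
        raise ValueError("Brier scores must be positive for inverse weighting")
    inverse = {model: 1.0 / score for model, score in brier_by_model.items()}
    total = sum(inverse.values())
    return {model: value / total for model, value in inverse.items()}


def ensemble_forecast(
    forecasts: Dict[str, List[float]], weights: Dict[str, float]
) -> List[float]:
    """Weighted average of per-model probability vectors (home, draw, away)."""
    n_outcomes = len(next(iter(forecasts.values())))
    combined = [0.0] * n_outcomes
    for model, probs in forecasts.items():
        for i, p in enumerate(probs):
            combined[i] += weights[model] * p
    return combined


# Hypothetical backtest scores and forecasts for a single match.
weights = calibration_weights({"model_a": 0.58, "model_b": 0.62, "model_c": 0.66})
combined = ensemble_forecast(
    {
        "model_a": [0.55, 0.25, 0.20],
        "model_b": [0.60, 0.22, 0.18],
        "model_c": [0.70, 0.18, 0.12],
    },
    weights,
)
print(weights, combined)
```

Because the weights sum to 1 and each input forecast sums to 1, the combined forecast is itself a valid probability vector.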

Context detection

Identify match contexts (host nation, prestige mismatch) where specific models excel

Bias correction

Apply fingerprint-based adjustments to raw model outputs

Calibration & backtesting

To validate our behavioral fingerprints against real-world prediction performance, we backtested them on 979 matches across 18 leagues.

Leagues tested

Premier League · La Liga · Bundesliga · Serie A · Ligue 1 · MLS · Liga MX · Brazilian Série A · Eredivisie · Portuguese Liga · Belgian Pro · Turkish Süper · Scottish Premier · Swiss Super · A-League · J-League · K-League · Saudi Pro

This breadth of testing across different league contexts — from top European competitions to emerging markets — enables calibration that accounts for regional biases and league-specific model behaviors.

Methodology

Judge-blind scoring

All responses scored by Claude Opus 4.6 without knowledge of which model produced them, so scores cannot be influenced by the model's name.
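A blinding step of this kind can be sketched as follows: strip the model label, shuffle the order, and keep a private key for re-attaching labels after scoring. The record fields (`model`, `text`) and the ID format are hypothetical, and this sketch does not address responses whose text reveals their origin.

```python
import random
from typing import Dict, List, Tuple


def blind_responses(
    responses: List[Dict[str, str]], seed: int = 0
) -> Tuple[List[Dict[str, str]], Dict[str, str]]:
    """Return (blinded records for the judge, private key of id -> model)."""
    rng = random.Random(seed)
    shuffled = list(responses)
    rng.shuffle(shuffled)  # remove any ordering cue about the source model
    blinded: List[Dict[str, str]] = []
    key: Dict[str, str] = {}
    for i, record in enumerate(shuffled):
        blind_id = f"r{i:05d}"
        key[blind_id] = record["model"]  # kept away from the judge
        blinded.append({"id": blind_id, "text": record["text"]})
    return blinded, key


# Placeholder records, not study data.
raw_responses = [
    {"model": "model_a", "text": "Player X is the stronger signing."},
    {"model": "model_b", "text": "Player Y offers better value."},
    {"model": "model_c", "text": "Both profiles are comparable."},
]
blinded, key = blind_responses(raw_responses)
```

After the judge returns a score per `id`, the key maps each score back to its model for analysis.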

Bootstrap confidence intervals

10,000 bootstrap resamples for all effect sizes. 95% CIs reported throughout.
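A percentile bootstrap is one common way to obtain such intervals: resample each group with replacement, recompute the effect size, and take the 2.5th and 97.5th percentiles of the resampled values. The sketch below assumes that variant and binary per-trial outcomes; the study may use a different bootstrap method, and the data shown are made up.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def bootstrap_ci(
    group_a: Sequence[int],
    group_b: Sequence[int],
    statistic: Callable[[float, float], float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for a statistic of two group proportions."""
    rng = random.Random(seed)
    estimates: List[float] = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        estimates.append(statistic(sum(a) / len(a), sum(b) / len(b)))
    estimates.sort()
    cut = int(n_resamples * alpha / 2)  # number of values trimmed per tail
    return estimates[cut], estimates[n_resamples - cut - 1]


def cohens_h(p1: float, p2: float) -> float:
    """Effect size for a difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Made-up binary outcomes (1 = player preferred), not study data.
top5 = [1] * 80 + [0] * 20
other = [1] * 25 + [0] * 75
low, high = bootstrap_ci(top5, other, cohens_h)
print(f"95% CI for h: [{low:.2f}, {high:.2f}]")
```

Fixing the seed makes the interval reproducible across runs.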

Ambiguity handling

Responses that couldn't be cleanly scored are marked ambiguous and excluded from main analysis. Ambiguity rates reported per dimension.
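The exclusion and per-dimension rate can be sketched as a single pass over the scored records. The record layout here (a `dimension` label and a `score` that is `None` when the judge could not score the response) is an assumption for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[object]]


def split_ambiguous(scored: List[Record]) -> Tuple[List[Record], Dict[str, float]]:
    """Drop unscorable records and report the ambiguity rate per dimension."""
    totals = Counter(r["dimension"] for r in scored)
    ambiguous = Counter(r["dimension"] for r in scored if r["score"] is None)
    kept = [r for r in scored if r["score"] is not None]
    rates = {dim: ambiguous[dim] / totals[dim] for dim in totals}
    return kept, rates


# Placeholder records, not study data.
scored = [
    {"dimension": "D01", "score": 1},
    {"dimension": "D01", "score": None},
    {"dimension": "D01", "score": 0},
    {"dimension": "D01", "score": 1},
    {"dimension": "D02", "score": 1},
    {"dimension": "D02", "score": 0},
]
kept, rates = split_ambiguous(scored)
print(rates)
```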

Pre-registration

All hypotheses, analysis plans, and stopping rules registered on OSF before data collection began.

Real-world calibration

Fingerprints validated against 979 historical matches across 18 leagues before deployment. Model weights derived from observed performance patterns.
