ModelBall
May 11, 2026

GPT-5.5 takes the tactical-knowledge crown — once you let it finish answering

models, methodology, GPT-5.5

Until this week, GPT-5.5 was the only one of our five models without a Tactical Knowledge Index score. We finally ran it through the 25-question panel — and the result was a textbook example of why methodology matters more than the headline number.

The updated leaderboard

TKI scores the depth of football tactical understanding — formations, pressing schemes, scenario reasoning — via 25 expert-written questions scored 0–3 by Claude Opus 4.6 against pre-specified answer keys. Each model gets the same prompts, with no system message and no special framing.

Rank  Model              TKI / 10
1     GPT-5.5            8.67
2     Claude Sonnet 4.6  8.53
3     GPT-5.4            8.40
4     Grok-3             8.13
5     Gemini 3.1 Pro     6.53

GPT-5.5 narrowly tops the field, with the four mainstream frontier models clustered between 8.13 and 8.67. Gemini 3.1 Pro is the clear outlier on the low end, mostly because of weak performance on the temporal-evolution domain (how tactical ideas spread and mutate across eras of the game).
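How the per-question 0–3 grades roll up into a score out of 10 isn't stated above. A minimal sketch, assuming a plain linear rescaling of the raw total (25 questions × 3 points = 75 maximum). Every published figure is consistent with that assumption, but the raw totals below are inferred from the scores rather than taken from the data dump, and `tki_from_raw` is a hypothetical helper name.

```python
QUESTIONS = 25   # size of the TKI panel
MAX_GRADE = 3    # each answer is graded 0-3 by the judge


def tki_from_raw(raw_points: int,
                 questions: int = QUESTIONS,
                 max_grade: int = MAX_GRADE) -> float:
    """Rescale a raw judge total onto a 0-10 scale (ASSUMED linear rescaling)."""
    max_points = questions * max_grade
    if not 0 <= raw_points <= max_points:
        raise ValueError(f"raw_points must be in 0..{max_points}")
    return round(10 * raw_points / max_points, 2)


# Raw totals INFERRED from the published scores; not reported in the post.
inferred_raw = {
    "GPT-5.5": 65,
    "Claude Sonnet 4.6": 64,
    "GPT-5.4": 63,
    "Grok-3": 61,
    "Gemini 3.1 Pro": 49,
}

for model, raw in inferred_raw.items():
    print(f"{model:<18} {tki_from_raw(raw):.2f}")
```

Under this reading, adjacent ranks in the top three are separated by a single raw point out of 75, which is worth keeping in mind when interpreting "narrowly tops the field".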

The bug that nearly hid this

Our first run put GPT-5.5 dead last at 2.67/10. We almost published that as a regression story — a freshly retrained base model apparently forgetting football.

It wasn't a regression. GPT-5.5 spends internal chain-of-thought tokens before emitting a visible response, and our collection script had inherited a 500-token cap from the other four models, none of which do hidden reasoning. On 17 of 25 questions, GPT-5.5 burned its entire 500-token budget on reasoning the judge never sees, leaving zero characters of visible output. The judge had nothing to score and graded those answers 0.
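One way to catch this failure mode before it reaches the judge is to separate "the model answered badly" from "the model never got to answer". The sketch below is illustrative only: the `Collected` record and its field names are hypothetical and do not describe the actual collection script or any provider's response format. It assumes the token count reported for a response includes hidden reasoning.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Collected:
    """Hypothetical record for one collected response."""
    question_id: int
    visible_text: str       # what the judge would actually see
    completion_tokens: int  # tokens counted against the cap (ASSUMED to include hidden reasoning)
    max_tokens: int         # cap that was in force for this call


def is_budget_exhausted(r: Collected) -> bool:
    """True when the cap was hit and nothing visible was produced."""
    return r.visible_text.strip() == "" and r.completion_tokens >= r.max_tokens


def split_for_judging(responses: List[Collected]) -> Tuple[List[Collected], List[Collected]]:
    """Return (scorable, exhausted); exhausted items should be re-collected, not graded 0."""
    scorable = [r for r in responses if not is_budget_exhausted(r)]
    exhausted = [r for r in responses if is_budget_exhausted(r)]
    return scorable, exhausted


# Toy data shaped like the first run: 17 empty answers at the 500-token cap, 8 real ones.
responses = (
    [Collected(q, "", 500, 500) for q in range(1, 18)]
    + [Collected(q, "A back three frees the wing-backs to press high.", 320, 500)
       for q in range(18, 26)]
)
scorable, exhausted = split_for_judging(responses)
print(len(scorable), len(exhausted))  # 8 17
```

Routing the exhausted items back to collection, rather than to the judge, would keep an empty string from being recorded as a wrong answer.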

Raise the budget to 4,000 tokens and the same model produces complete, well-structured answers averaging 1,884 characters apiece. The visible output is now comparable in length to what Claude or Grok produce inside their 500-token caps — they just don't spend any of it on hidden reasoning.

Why this matters for predictions

The 8.67 score wouldn't be remarkable on its own: the top four models sit within about half a point of each other (8.13 to 8.67), and TKI is just one of 22 dimensions in our fingerprint. What matters is the methodology lesson.

Modern frontier models don't all follow the same input/output convention. Some — like GPT-5.5 — split their budget between hidden reasoning and visible output, and that split is opaque to the caller. If you build a benchmark that assumes "tokens = answer", you will quietly mis-measure reasoning-heavy models. We'll be auditing our other dimensions for the same trap before the tournament starts.
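For an audit of this kind, one possible guard is to stop treating the token cap as a fixed constant and instead raise it whenever a response comes back empty. The following is a sketch under stated assumptions, not the fix described in this post (which simply raised the budget to 4,000 tokens): `generate` is a hypothetical caller-supplied function, and `fake_reasoning_model` is a stand-in that imitates a model spending tokens on hidden reasoning first.

```python
from typing import Callable, Tuple

# Hypothetical signature: (prompt, max_tokens) -> (visible_text, tokens_used)
GenerateFn = Callable[[str, int], Tuple[str, int]]


def collect_with_escalation(generate: GenerateFn,
                            prompt: str,
                            start_budget: int = 500,
                            ceiling: int = 4000) -> Tuple[str, int]:
    """Double the token budget until visible text appears or the ceiling is reached.

    Returns the visible text and the budget that was in force for the final call.
    """
    budget = start_budget
    while True:
        text, _used = generate(prompt, budget)
        if text.strip() or budget >= ceiling:
            return text, budget
        budget = min(budget * 2, ceiling)


def fake_reasoning_model(prompt: str, max_tokens: int) -> Tuple[str, int]:
    """Stand-in model: needs 1,200 hidden tokens before it can emit 450 visible ones."""
    hidden_needed, visible_needed = 1200, 450
    visible = max(0, min(visible_needed, max_tokens - hidden_needed))
    used = min(max_tokens, hidden_needed) + visible
    return ("word " * visible).strip(), used


text, final_budget = collect_with_escalation(
    fake_reasoning_model, "Explain the role of an inverted full-back."
)
print(final_budget, len(text.split()))
```

A trade-off worth noting: escalation means different models may end up with different caps, so the budget actually used should be logged alongside each response to keep comparisons honest.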

Where to see this

The new GPT-5.5 score shows up on the models page card and across the radar / heatmap. You can also see the per-domain breakdown — definitional, manager philosophy, player profile, system compatibility, and temporal evolution — on the GPT-5.5 profile.

We pre-registered TKI on OSF before any collection began, and the updated raw responses are in our public data dump for reproducibility. If anyone wants to replicate the methodology bug we just walked into, it's well documented in the commit history.
