Four Favourites, Four Wins — But Who Read It Best?

A clean sweep for the favourites across Tuesday's four matches should feel comfortable — and for the scorecards it does. Every model got every result correct. But 'correct' hides a lot. A model that said France 58% and one that said 71% both tick the right-result box; only the scoring rules tell you who was actually sharper. Today's slate was a useful stress test for confidence calibration, and the gaps were not trivial.

France 3-1 Senegal — Group I, MetLife Stadium

France won comfortably, which all models expected in the broad sense. The interesting story is the spread in conviction. Grok put France at 71%, the highest single-model figure; Claude and GPT-5.4 both sat at 58%, the lowest. On a day when France won 3-1, higher confidence was rewarded: Grok's Brier score of 0.132 (lower is better) was the best of the individual models by some distance, and the ensemble (which blends all five) was next at 0.195.

Model	P(France)	P(Draw)	P(Senegal)	Brier	Correct?
Grok	0.71	0.20	0.09	0.132	✓
Ensemble	0.64	0.22	0.14	0.195	✓
Naive avg	0.62	0.22	0.16	0.215	✓
Gemini	0.60	0.25	0.15	0.245	✓
Claude	0.58	0.22	0.20	0.265	✓
GPT-5.4	0.58	0.24	0.18	0.266	✓
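
The Brier column is consistent with the three-outcome form of the score: the squared gap between each forecast probability and what happened (1 for the actual result, 0 otherwise), summed over win, draw and loss. A minimal sketch, assuming that form; the function and variable names are illustrative and not taken from the scoring pipeline:

```python
def brier(probs, outcome_index):
    """Multi-category Brier score: sum of squared errors over all outcomes."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

# (P(France), P(Draw), P(Senegal)) for the four individual rows in the table.
france_senegal = {
    "Grok": (0.71, 0.20, 0.09),
    "Gemini": (0.60, 0.25, 0.15),
    "Claude": (0.58, 0.22, 0.20),
    "GPT-5.4": (0.58, 0.24, 0.18),
}

FRANCE_WIN = 0  # index of the outcome that actually occurred

scores = {
    name: round(brier(probs, FRANCE_WIN), 3)
    for name, probs in france_senegal.items()
}
for name, score in scores.items():
    print(f"{name:8s} {score:.3f}")
```

Run against the four individual rows, this matches the published figures to three decimals (0.132, 0.245, 0.265, 0.266).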

Claude's known fingerprint, over-pricing home advantage, does not obviously explain its lower France probability here, since France *were* the nominal home side and the clear favourite regardless of venue. What stands out instead is Claude giving Senegal a 20% win probability, the highest of any model, on top of a 22% draw probability. That residual uncertainty about a side ranked well below France cost it points. GPT-5.4 told almost the same story. Grok's higher confidence aligned with its tendency to lean on market odds, and on this occasion the market read France correctly. All scoreline picks pointed to tight wins (1-0 or 2-0); the actual 3-1 meant everyone missed the exact score, which is unremarkable: picking exact scorelines at a World Cup is hard even for humans.

Iraq 1-4 Norway — Group I, Gillette Stadium

The most lopsided contest of the day on paper, and it played out that way. Norway's 4-1 win vindicated a near-unanimous field: every model gave Norway between 75% and 80% to win. Disagreements were minor but still separable.

Model	P(Iraq)	P(Draw)	P(Norway)	Brier	Correct?
Gemini	0.05	0.15	0.80	0.065	✓
Grok	0.09	0.13	0.78	0.073	✓
Ensemble	0.075	0.153	0.773	0.081	✓
Naive avg	0.075	0.153	0.773	0.081	✓
GPT-5.4	0.08	0.16	0.76	0.090	✓
Claude	0.08	0.17	0.75	0.098	✓
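
For readers wondering what a naive average involves, here is a rough sketch of an unweighted blend. It assumes only the four individual models listed in the table as inputs (the article refers to five models, so treat the input set as an assumption), and it does not attempt to reproduce the Ensemble row, whose weighting is not stated:

```python
def brier(probs, outcome_index):
    """Multi-category Brier score: sum of squared errors over all outcomes."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

# (P(Iraq), P(Draw), P(Norway)) for the four individual rows in the table.
forecasts = [
    (0.05, 0.15, 0.80),  # Gemini
    (0.09, 0.13, 0.78),  # Grok
    (0.08, 0.16, 0.76),  # GPT-5.4
    (0.08, 0.17, 0.75),  # Claude
]

# Unweighted mean of each outcome's probability across the models.
naive_avg = tuple(sum(column) / len(forecasts) for column in zip(*forecasts))

NORWAY_WIN = 2  # index of the outcome that actually occurred
avg_score = brier(naive_avg, NORWAY_WIN)
mean_of_scores = sum(brier(f, NORWAY_WIN) for f in forecasts) / len(forecasts)

print([round(p, 4) for p in naive_avg], round(avg_score, 3))
```

Because the squared-error score is convex, the averaged forecast can never score worse than the mean of the individual scores (`avg_score <= mean_of_scores`). That is part of why a blend tends to sit mid-table rather than at the bottom.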

Gemini's 80% on Norway was the boldest call and earned the best Brier score of the entire day across all matches (0.065). This is Gemini's league-prestige fingerprint working in its favour: Norway, with Haaland and a clutch of Premier League players, carry obvious top-division pedigree, and Gemini leaned into that. Claude was again the least certain, giving Norway the lowest win probability (75%) and the draw the highest (17%); its 8% on Iraq matched GPT-5.4 and trailed only Grok's 9%. It is a small sample, but Claude's caution is becoming a pattern: it consistently leaves more probability mass in the 'other' outcomes than its peers. All five models picked 0-2; the actual 1-4 meant nobody landed the exact score, but the direction was never in doubt.

Argentina 3-0 Algeria — Group J, Arrowhead Stadium

The world champions dispatched Algeria without conceding, which at least matched the direction of every model's prediction. As with France, the interesting question is how confidently each model backed Argentina.

Model	P(Argentina)	P(Draw)	P(Algeria)	Brier	Correct?
Grok	0.74	0.18	0.08	0.106	✓
Ensemble	0.678	0.206	0.117	0.160	✓
Naive avg	0.66	0.213	0.127	0.177	✓
Gemini	0.65	0.23	0.12	0.190	✓
GPT-5.4	0.62	0.23	0.15	0.220	✓
Claude	0.62	0.23	0.15	0.220	✓

Grok wins this match convincingly with a Brier of 0.106, well clear of Gemini's 0.190, the next-best individual score. Its 74% on Argentina is anchored to market pricing, and bookmakers clearly had Argentina as a near-certainty. Claude and GPT-5.4 produced identical predictions here (62/23/15), a coincidence worth noting that may suggest both are drawing on similar underlying logic when assessing elite-versus-emerging matchups. Grok's 2-0 scoreline pick was the most optimistic about margin; the actual 3-0 was closer to Grok's worldview than anyone else's.

Austria 3-1 Jordan — Group J, Levi's Stadium

Austria's 3-1 win was the fourth straight correct call for every model, but this one produced the day's most interesting internal disagreement. Grok and Claude were furthest apart — 76% versus 62% on Austria — and the Brier scores reflect it.

Model	P(Austria)	P(Draw)	P(Jordan)	Brier	Correct?
Grok	0.76	0.17	0.07	0.091	✓
Ensemble	0.696	0.194	0.110	0.142	✓
Gemini	0.68	0.20	0.12	0.157	✓
Naive avg	0.68	0.20	0.12	0.157	✓
GPT-5.4	0.66	0.21	0.13	0.177	✓
Claude	0.62	0.22	0.16	0.218	✓

Claude's home-advantage fingerprint is supposed to push it *towards* the home side, yet here it gave Austria the lowest probability of any model, 62% (GPT-5.4 was next at 66%), while assigning Jordan a 16% win probability that no other model matched. One possible reading: Claude's fingerprint inflates home advantage for sides it already rates highly on reputation; against a team like Jordan, the home boost is partially offset by genuine uncertainty about the quality gap. Or it could simply be noise across a small sample. Either way, giving Jordan 16% to win at Levi's Stadium against an Austria side that won 3-1 was the most expensive call of the match for Claude.

Day summary: Grok's day, models' clean sheet

Aggregating Brier scores across all four matches, Grok was the best-performing individual model on Tuesday — not by getting results right (everyone did) but by assigning the highest confidence to the winning side in three of the four games. Its market-following fingerprint is a genuine edge when markets are sharp and the favourite wins clearly. The ensemble, as designed, smoothed out the extremes and performed solidly without topping any single match.

Grok — best individual model today; market-anchoring paid off in four clear favourite wins.
Gemini — best single score of the day (0.065 on Iraq vs Norway); league-prestige bias helped with Norway.
Ensemble — consistently second or third; diversification earns its keep without being flashy.
Claude — most cautious across all four matches; repeatedly left the largest probability mass in upset outcomes. No result wrong, but the calibration cost is accumulating.
GPT-5.4 — similar pattern to Claude; identical to Claude on Argentina and within two points on every France outcome. The two models may be converging on similar priors for these matchups.
No exact-score hits today: all five models picked a 1-0 or 2-0 win for the favourite in every match, while every favourite actually scored at least three. The direction was right; the magnitude was not.
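
The day-level ranking can be rechecked by averaging each model's score over the four matches. The sketch below recomputes the scores from the tabulated probabilities rather than copying the Brier columns; the match keys are hypothetical labels, and only the four individual models shown in the tables are included:

```python
def brier(probs, outcome_index):
    """Multi-category Brier score: sum of squared errors over all outcomes."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

# Index of the actual result: 0 = first-named side won, 2 = second-named side won.
outcomes = {"FRA-SEN": 0, "IRQ-NOR": 2, "ARG-ALG": 0, "AUT-JOR": 0}

# Probabilities in table order: (first-named side, draw, second-named side).
forecasts = {
    "Grok": {"FRA-SEN": (0.71, 0.20, 0.09), "IRQ-NOR": (0.09, 0.13, 0.78),
             "ARG-ALG": (0.74, 0.18, 0.08), "AUT-JOR": (0.76, 0.17, 0.07)},
    "Gemini": {"FRA-SEN": (0.60, 0.25, 0.15), "IRQ-NOR": (0.05, 0.15, 0.80),
               "ARG-ALG": (0.65, 0.23, 0.12), "AUT-JOR": (0.68, 0.20, 0.12)},
    "GPT-5.4": {"FRA-SEN": (0.58, 0.24, 0.18), "IRQ-NOR": (0.08, 0.16, 0.76),
                "ARG-ALG": (0.62, 0.23, 0.15), "AUT-JOR": (0.66, 0.21, 0.13)},
    "Claude": {"FRA-SEN": (0.58, 0.22, 0.20), "IRQ-NOR": (0.08, 0.17, 0.75),
               "ARG-ALG": (0.62, 0.23, 0.15), "AUT-JOR": (0.62, 0.22, 0.16)},
}

# Mean Brier score per model across the day's matches.
mean_brier = {
    model: sum(brier(p, outcomes[match]) for match, p in by_match.items()) / len(by_match)
    for model, by_match in forecasts.items()
}

# Lower is better, so sort ascending.
ranking = sorted(mean_brier, key=mean_brier.get)
for model in ranking:
    print(f"{model:8s} {mean_brier[model]:.3f}")
```

On these inputs the order comes out as Grok, Gemini, GPT-5.4, Claude.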

It is worth being clear about what today's clean sweep does *not* prove. Four favourites winning tells us less about model quality than four upsets would have. The real test of calibration comes when a 70% pick loses, or a 20% pick wins. Today was a good day for all models in the results column, and a useful but not decisive day in the scoring column. The differences in Brier scores are real and worth tracking cumulatively — they are not noise — but one day of cooperative results does not validate any model's approach.
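
To make that concrete, here is a counterfactual sketch that scores Grok's and Claude's France vs Senegal forecasts under each possible result. The draw and the Senegal win are hypothetical outcomes used purely for illustration, and the scoring function assumes the three-outcome Brier form:

```python
def brier(probs, outcome_index):
    """Multi-category Brier score: sum of squared errors over all outcomes."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

# (P(France), P(Draw), P(Senegal)) from the France vs Senegal table.
grok = (0.71, 0.20, 0.09)
claude = (0.58, 0.22, 0.20)

labels = ("France win", "Draw", "Senegal win")

# Score both forecasts under each outcome; only "France win" actually happened.
by_outcome = {
    label: (brier(grok, i), brier(claude, i))
    for i, label in enumerate(labels)
}

for label, (grok_score, claude_score) in by_outcome.items():
    print(f"{label:12s} Grok {grok_score:.3f}  Claude {claude_score:.3f}")
```

Under these inputs Grok scores better only when France win. On a draw or a Senegal win, Claude's flatter forecast takes the smaller penalty, which is why a run of favourites says little about which approach holds up over a tournament.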

Grok's market-following fingerprint, often flagged as a potential weakness when markets misprice, was its biggest asset today: in three of the four matches it posted the highest probability on the eventual winner and the lowest Brier score among individual models, with only Gemini's Norway call edging it. The question is whether this is skill or a lucky alignment between market consensus and outcomes; Wednesday's harder matchups will start to answer that.

What Wednesday tests

Tomorrow's slate moves into territory where the models' biases should face stiffer examination. If there are any closer contests between sides of similar reputation, Claude's home-advantage inflation and Gemini's league-prestige weighting will be easier to isolate — you need genuine ambiguity to see a bias clearly. Four thumping favourites winning in a row is, paradoxically, one of the least informative scenarios for bias detection. We need a match where the models disagree sharply and one of them is plainly wrong. Wednesday, please oblige.