Four Matches, Four Draws: Models Get a Thorough Hiding

Some days the football just laughs at you. Monday 15 June served up four matches across Groups G and H, and every last one ended in a draw. All five models — plus the ensemble and naive average — called zero of them correctly. That is not a bad run of luck; it is a systematic signal worth sitting with.

Before diving into each game: draws are structurally underpriced by probabilistic models because they are genuinely hard to predict, and because models trained on historical data learn that the favourite tends to win more often than it draws. None of that makes today's whitewash less instructive. When the same error repeats across four different fixtures involving eight different teams, the issue is not noise.

Spain 0–0 Cape Verde (Group H, Atlanta)

The day's most brutal result from a modelling standpoint. Spain are a former world champion with a settled system; Cape Verde are making their World Cup debut. The models accordingly lined up like a firing squad and all fired in the same direction — straight into a blank scoreline.

Model	Home (Spain)	Draw	Away (Cape Verde)	Result	Brier
GPT-5.4	82%	12%	6%	❌ Draw	1.450
Claude	82%	12%	6%	❌ Draw	1.450
Gemini	88%	9%	3%	❌ Draw	1.603
Grok	88%	9%	3%	❌ Draw	1.603
Ensemble	85%	10%	4%	❌ Draw	1.536
Naive Avg	85%	11%	5%	❌ Draw	1.526

The range here is narrow — everyone lumped on Spain. GPT-5.4 and Claude were least wrong, each giving the draw a 12% chance (Brier 1.450). Gemini and Grok were most wrong, assigning it just 9% (Brier 1.603). Gemini's fingerprint — over-weighting league prestige — is doing visible damage here: Spain's La Liga pedigree apparently overwhelmed any consideration that a low-block debutant side in Atlanta might nick a point. Grok, ordinarily anchored to betting markets, presumably found those markets equally bullish on Spain and went with the flow.
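The Brier column is consistent with the multi-category Brier score: the sum, over home, draw and away, of the squared difference between the forecast probability and the realised outcome (1 for what happened, 0 otherwise). That definition is an assumption, since the scoring rule is not spelled out here, but it reproduces the Spain figures from the rounded percentages. A minimal sketch, with a hypothetical helper name:

```python
def brier_score(probs, outcome_index):
    """Multi-category Brier score: sum of squared errors over all outcomes.

    probs: probabilities for (home, draw, away), summing to 1.
    outcome_index: position of the outcome that occurred (1 = draw).
    Assumed definition; ranges from 0 (perfect) to 2 (certain and wrong).
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

DRAW = 1

# Spain vs Cape Verde, percentages taken from the table above
gpt_claude = brier_score((0.82, 0.12, 0.06), DRAW)
gemini_grok = brier_score((0.88, 0.09, 0.03), DRAW)

print(f"{gpt_claude:.3f}")   # 1.450
print(f"{gemini_grok:.3f}")  # 1.603
```

On this scale lower is better, and three points of extra draw probability were worth roughly 0.15 of Brier score.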

The ensemble and naive average sit in between, which is the arithmetically expected outcome when the underlying models agree this strongly. All five scoreline picks were 2–0 to Spain; all five were wrong on outcome and score. That unanimity is itself a finding: when every model converges this tightly, it can mean genuine certainty, or it can mean every model shares the same blind spot about heavyweight-versus-underdog dynamics.
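The "in between" point can be checked directly. The sketch below assumes the naive average is an unweighted mean of each outcome's probability, and it uses only the four models that appear in the table; how the ensemble is weighted is not described, so it is not reproduced here.

```python
# (home, draw, away) percentages for Spain vs Cape Verde, from the table above.
# Only the four individually tabulated models are included.
spain_forecasts = {
    "GPT-5.4": (82, 12, 6),
    "Claude": (82, 12, 6),
    "Gemini": (88, 9, 3),
    "Grok": (88, 9, 3),
}

def naive_average(forecasts):
    """Unweighted mean of each outcome's probability across models (assumed method)."""
    n = len(forecasts)
    return tuple(sum(f[i] for f in forecasts.values()) / n for i in range(3))

home, draw, away = naive_average(spain_forecasts)
print(home, draw, away)  # 85.0 10.5 4.5
```

Those values line up with the published 85% / 11% / 5% row once rounded. With inputs this tightly clustered, no averaging scheme can land far from any single model.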

Belgium 1–1 Egypt (Group G, Seattle)

A more forgiving fixture in probability terms. Belgium are not the force they were circa 2018, and Egypt, with Mohamed Salah still presumably a threat, command respect. The models were meaningfully less certain than they were about Spain, which at least shows some discrimination.

Model	Home (Belgium)	Draw	Away (Egypt)	Result	Brier
Gemini	50%	28%	22%	❌ Draw	0.817
GPT-5.4	50%	26%	24%	❌ Draw	0.855
Claude	52%	25%	23%	❌ Draw	0.886
Naive Avg	54%	25%	21%	❌ Draw	0.900
Ensemble	56%	25%	19%	❌ Draw	0.913
Grok	61%	24%	15%	❌ Draw	0.972

Gemini wins this one — or loses least badly, to be precise. A 28% draw probability is the highest in the field and earns it the best Brier score of 0.817. Interestingly, Gemini's usual over-weighting of league prestige did not push it as hard toward Belgium as one might expect; perhaps it is reading this as a clash of two established nations rather than a top-versus-minnow mismatch.

Grok is furthest out, at 61% Belgium and only 24% draw. The betting-market anchor is showing: Belgian odds were likely shortened by their reputation and Grok followed. Backing a favourite that firmly and then watching it draw 1–1 is costly in Brier terms. All four models picked 1–0 Belgium as their scoreline. That got Belgium's goal tally right but missed Egypt's equaliser: a near-miss in spirit, a miss in fact.

Saudi Arabia 1–1 Uruguay (Group H, Miami Gardens)

The models were firmly behind Uruguay here, which was a reasonable position. Uruguay are a recognised South American power; Saudi Arabia's recent form offers little reason for optimism at this level. A draw is therefore the kind of result that is wrong in the traditional sense but not necessarily foolish in probability terms — the models did at least leave 18–23% on the draw.

Model	Home (Saudi Arabia)	Draw	Away (Uruguay)	Result	Brier
Grok	15%	23%	62%	❌ Draw	1.000
GPT-5.4	13%	23%	64%	❌ Draw	1.019
Claude	13%	22%	65%	❌ Draw	1.048
Ensemble	13%	22%	66%	❌ Draw	1.065
Naive Avg	13%	22%	66%	❌ Draw	1.065
Gemini	10%	18%	72%	❌ Draw	1.201

Grok is closest this time. The betting markets evidently applied some healthy scepticism about Uruguay's recent form, and Grok inherited that. Its 62% Uruguay probability is the most conservative, and that pays off in Brier terms. Gemini is again most wrong, assigning Uruguay a massive 72% and Saudi Arabia a dismissive 10%. This is a textbook illustration of Gemini's league-prestige bias: Uruguay's footballing reputation sends the dial spinning, with minimal adjustment for current form or the specific context of a neutral-venue World Cup group game.

Four models picked 0–1 to Uruguay; Gemini went 0–2. The actual 1–1 means Saudi Arabia scored, something not a single model's scoreline pick entertained. That collective failure to model the home side scoring at all, even with its win probability at 10–15%, is worth flagging. Low-probability home teams still score goals.

Iran 2–2 New Zealand (Group G, Inglewood)

The final match of the day, technically kicking off in the early hours of the 16th UTC but belonging to Monday's slate. Iran were moderate favourites; New Zealand are at their first World Cup since 2010. A 2–2 draw is a high-scoring, entertaining result that the models would have found difficult to anticipate under any circumstances.

Model	Home (Iran)	Draw	Away (New Zealand)	Result	Brier
GPT-5.4	49%	29%	22%	❌ Draw	0.783
Claude	50%	26%	24%	❌ Draw	0.855
Gemini	58%	26%	16%	❌ Draw	0.910
Naive Avg	57%	26%	18%	❌ Draw	0.911
Ensemble	59%	25%	16%	❌ Draw	0.944
Grok	71%	21%	8%	❌ Draw	1.135

GPT-5.4 performs best here by a clear margin, giving the draw 29%, the highest of any model. It also had the most balanced three-way split, keeping Iran below 50%. This is not a confident call; GPT-5.4 was simply less certain than the others, and being less certain when the favourite drops points is how you lose less badly.
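How much "less certain" helps can be put on a scale. The sketch below assumes the multi-category Brier definition (sum of squared errors over home, draw and away) and scores three forecasts against a draw: a uniform one-third split as a no-skill reference, plus two hypothetical forecasts that are illustrative only and not taken from any model.

```python
def brier_score(probs, outcome_index):
    """Sum of squared errors over (home, draw, away); assumed scoring rule."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )

DRAW = 1

uniform = brier_score((1 / 3, 1 / 3, 1 / 3), DRAW)   # no-skill reference
hedged = brier_score((0.50, 0.30, 0.20), DRAW)       # hypothetical forecast
confident = brier_score((0.80, 0.15, 0.05), DRAW)    # hypothetical forecast

print(f"{uniform:.3f} {hedged:.3f} {confident:.3f}")  # 0.667 0.780 1.365
```

Under that assumption the uniform forecast scores about 0.667 on any draw, which is lower than every Brier figure in the four tables. That is a reference point for how costly Monday was, not a suggestion that uniform forecasts are a good strategy in general.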

Grok finishes last on this fixture, swinging to 71% Iran. New Zealand at 8% is very low for a team that managed two goals. Grok's betting-market anchor appears to have found markets understandably cautious about New Zealand and then amplified that signal substantially. Grok's scoreline pick was 2–0 Iran: the right number of goals for Iran, the wrong number for New Zealand, and the wrong result overall. A near-miss in one dimension, a miss in the only dimension that counts for scoring.

Day Summary: What Monday Tells Us

All four matches ended in draws. Every model missed every result. Draw probabilities ranged from 9% to 29% depending on model and fixture, so the draw was never any model's most likely outcome.
Gemini was most wrong on two of the four matches (jointly with Grok on Spain, outright on Saudi Arabia vs Uruguay), mid-table on Iran, and best on Belgium. Its league-prestige bias does the most damage where the reputational gap is widest.
Grok's betting-market anchor saved it on Uruguay but hurt it badly on Iran and Belgium, where it finished last, and it shared last place with Gemini on Spain. Markets can be wrong, and Grok does not push back. Summing the tabulated Brier scores, Grok (4.710) rather than Gemini (4.531) had the worst day overall.
GPT-5.4 and Claude were the steadiest models. GPT-5.4 was best or joint-best on two fixtures (Spain and Iran) and second on the other two; Claude was joint-best on Spain and never lower than third among the tabulated models. Their more measured win probabilities for the favourite leave slightly more room for upsets.
All models unanimously picked home-win scorelines for Spain vs Cape Verde and Iran vs New Zealand. When every model agrees, that is worth treating as a risk concentration, not as confirmation.
The overarching failure is the one we already knew about: models broadly overrate reputation versus form. Four different matches, four different illustrations of the same problem.
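The per-fixture rankings above can be rolled into a single figure for the day. The sketch below averages the Brier values copied from the four tables; using the unweighted mean as the day's summary statistic is an assumption, and the variable names are illustrative.

```python
from statistics import mean

# Brier scores copied from the four tables above, in fixture order:
# Spain, Belgium, Saudi Arabia vs Uruguay, Iran
brier_by_model = {
    "GPT-5.4": [1.450, 0.855, 1.019, 0.783],
    "Claude": [1.450, 0.886, 1.048, 0.855],
    "Gemini": [1.603, 0.817, 1.201, 0.910],
    "Grok": [1.603, 0.972, 1.000, 1.135],
    "Ensemble": [1.536, 0.913, 1.065, 0.944],
    "Naive Avg": [1.526, 0.900, 1.065, 0.911],
}

# Mean Brier per model across the day; lower is better
day_mean = {name: mean(scores) for name, scores in brier_by_model.items()}
ranking = sorted(day_mean, key=day_mean.get)  # best first

for name in ranking:
    print(f"{name:<10} {day_mean[name]:.3f}")
```

On these figures the order from best to worst is GPT-5.4, Claude, Naive Avg, Ensemble, Gemini, Grok, with means running from roughly 1.03 to roughly 1.18. Four fixtures is a very small sample, so the ordering is descriptive rather than evidence of a stable ranking.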

Monday produced a statistically rare clean sweep: four matches, four draws, zero correct predictions across all models. This is not purely bad luck: draw probabilities were systematically low, peaking at 29% in the most balanced fixture and sitting as low as 9% in the most lopsided. Across the sixteen forecasts tabulated above, the draw averaged about 21%, on a day when it came in four times out of four. If that pattern holds across the tournament, the biggest edge against AI forecasters is simply: back more draws.
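For a sense of how much weight the draw actually received, the sketch below averages the draw percentages of the four individually tabulated models, per fixture and overall. The ensemble and naive-average rows are left out to avoid double counting; that choice, and the variable names, are assumptions of this example.

```python
# Draw percentages for the four tabulated models, copied from the tables above
draw_pct = {
    "Spain vs Cape Verde": [12, 12, 9, 9],
    "Belgium vs Egypt": [28, 26, 25, 24],
    "Saudi Arabia vs Uruguay": [23, 23, 22, 18],
    "Iran vs New Zealand": [29, 26, 26, 21],
}

# Mean draw probability per fixture
per_fixture = {match: sum(row) / len(row) for match, row in draw_pct.items()}

# Mean draw probability over all sixteen forecasts
all_values = [p for row in draw_pct.values() for p in row]
overall = sum(all_values) / len(all_values)

for match, value in per_fixture.items():
    print(f"{match:<24} {value:.2f}%")
print(f"Overall: {overall:.1f}%")  # Overall: 20.8%
```

Even in the two most balanced fixtures the average draw probability stays near 26%, so a draw-heavy day was always going to be expensive for every model at once.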

What Tomorrow Tests

Tuesday's slate will be worth watching for two specific things. First, whether Gemini recalibrates after its worst single day so far — it will face fixtures where the prestige gap is less obvious, which should soften its bias. Second, Grok's handling of any match where betting markets have moved significantly due to team news or tactical factors: today's results suggest it may be following markets into overconfident positions rather than moderating them. Any match with a clear but not overwhelming favourite will be a direct test of whether Monday's lessons show up in the numbers — though of course the predictions are already locked.