ModelBall
June 18, 2026

Wednesday Washout and a Rout: Models Take Their Lumps on June 17

results, model-performance

Wednesday gave us four matches, two group stages, and a nice cross-section of what AI forecasting does well and badly. Portugal–DR Congo was a collective miss; England–Croatia was a collective win, though the 4-2 final score made everyone's 1-0 scoreline pick look faintly ridiculous. Ghana nicked a win nobody quite believed in, and Colombia did exactly what the models expected. Let's go through them.

Portugal 1-1 DR Congo — NRG Stadium, Houston

The day's most instructive failure. Every model had Portugal winning comfortably; the actual result was a draw. The consensus ranged from Claude and GPT-5.4 at 68–69% for a Portugal win down to… well, nobody went below 68%. Grok was the outlier in the wrong direction, posting 81% for Portugal — its highest confidence on the slate — and correspondingly only 15% for the draw that actually happened. That 81% reflects Grok's known habit of leaning heavily on betting-market sentiment, and pre-tournament markets clearly had Portugal as heavy favourites. When the market is wrong, Grok tends to be wrong loudly.

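| Model | P(Portugal) | P(Draw) | P(Congo) | Draw prob assigned | Brier score |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 0.69 | 0.19 | 0.12 | 0.19 | 1.147 |
| Claude | 0.68 | 0.19 | 0.13 | 0.19 | 1.135 |
| Gemini | 0.68 | 0.22 | 0.10 | 0.22 | 1.081 |
| Grok | 0.81 | 0.15 | 0.04 | 0.15 | 1.380 |
| Ensemble | 0.73 | 0.18 | 0.09 | 0.18 | 1.219 |
| Naive avg | 0.715 | 0.1875 | 0.0975 | 0.1875 | 1.181 |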

Gemini was the least-wrong model here, partly by accident: its draw probability of 22% was the highest on the board, which shaved a little off its Brier score (1.081 versus Grok's 1.380). That's cold comfort — 22% is still saying 'unlikely' — but in a field where everyone got it wrong, fractional differences matter. Every model picked 2-1 as their scoreline, so the scoreline cards are all outcome misses as well as exact misses.
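For readers who want to check the numbers, here is a minimal sketch of how the Brier scores in the table above can be reproduced. The article does not state the formula; this assumes the standard three-outcome form (sum of squared errors against a one-hot result, so 0 is perfect and 2 is the worst possible), which matches the four individual models' scores to three decimals. The function and variable names are illustrative, not taken from the study's code.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcome_index: int) -> float:
    """Multi-class Brier score: squared error summed over all outcomes.

    The observed outcome counts as 1, every other outcome as 0.
    """
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


# Probabilities in table order: P(Portugal), P(Draw), P(Congo).
# The match ended in a draw, so the observed outcome is index 1.
DRAW = 1
portugal_congo = {
    "GPT-5.4": (0.69, 0.19, 0.12),
    "Claude": (0.68, 0.19, 0.13),
    "Gemini": (0.68, 0.22, 0.10),
    "Grok": (0.81, 0.15, 0.04),
}

scores = {model: brier_score(p, DRAW) for model, p in portugal_congo.items()}
for model, score in scores.items():
    print(f"{model}: {score:.3f}")
```

The Ensemble row comes out slightly lower (about 1.213) when computed from the displayed probabilities, presumably because the table shows rounded values.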

The broader pattern worth flagging: this is a textbook example of the reputation-over-form bias documented in our fingerprinting work. Portugal's squad value and tournament pedigree dominated the models' priors; DR Congo's actual qualifying form and tactical setup were apparently underweighted across the board. A draw was not a freakish result — it just required taking the African side seriously on its own terms.

England 4-2 Croatia — AT&T Stadium, Arlington

Everyone got the outcome right, which is pleasant. England were favoured by all models, Croatia were given a reasonable chance, and England won. What's interesting is *who* was most confident and whether that confidence was well-placed.

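| Model | P(England) | P(Draw) | P(Croatia) | Result correct | Brier score |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 0.49 | 0.28 | 0.23 | Yes | 0.391 |
| Claude | 0.52 | 0.27 | 0.21 | Yes | 0.347 |
| Gemini | 0.50 | 0.30 | 0.20 | Yes | 0.380 |
| Grok | 0.61 | 0.25 | 0.14 | Yes | 0.234 |
| Ensemble | 0.54 | 0.27 | 0.19 | Yes | 0.315 |
| Naive avg | 0.53 | 0.275 | 0.195 | Yes | 0.335 |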

Grok scored best here (Brier 0.234), its 61% for England being the most aligned with what happened. GPT-5.4 was least rewarded, having only 49% for England — essentially a coin-flip on a match England won by two goals. Croatia carry significant reputation as a knockout-stage side (2018 finalists, 2022 semi-finalists), and the models that leaned heaviest on that history were the ones dragged toward parity. Gemini's league-prestige bias is less relevant here since both sides play in top European leagues, but the broad reputational drag on Croatia's odds looks like it pushed several models toward false balance.
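The "Naive avg" row in the table above is not defined in the article, but it is consistent with an unweighted mean of the four individual models' probabilities. A small sketch under that assumption (the helper name `naive_average` is made up for illustration):

```python
from statistics import mean

# P(England), P(Draw), P(Croatia) for the four individual models.
england_croatia = {
    "GPT-5.4": (0.49, 0.28, 0.23),
    "Claude": (0.52, 0.27, 0.21),
    "Gemini": (0.50, 0.30, 0.20),
    "Grok": (0.61, 0.25, 0.14),
}


def naive_average(forecasts):
    """Unweighted per-outcome mean across all models."""
    # zip(*...) turns per-model rows into per-outcome columns.
    return tuple(mean(column) for column in zip(*forecasts.values()))


avg = naive_average(england_croatia)
print([round(p, 3) for p in avg])
```

Because each model's probabilities sum to 1, the averaged row does too. How the separate Ensemble row is weighted is not stated, so it is left out here.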

All five models that submitted a scoreline went for 1-0, a perfectly sensible conservative pick that undershot England's tally by three goals and missed both of Croatia's. That's not a modelling failure so much as a reminder that exact scores are genuinely hard to predict and 1-0 is statistically the modal scoreline in competitive football. Outcome hits all round; exact hits none. Normal service.

Ghana 1-0 Panama — BMO Field, Toronto

The locked predictions array for this match came through empty, so the scoreline picks are the place to start, and they make for interesting reading. Five of six models (GPT-5.4, GPT-5.5, Claude, Grok, and the ensemble) picked 0-1 to Panama; Gemini went for 1-1. Ghana actually won 1-0. So Panama were the collective favourite in scoreline terms, and Ghana quietly went and won it anyway.

The accuracy scores confirm the directional picture: Grok had the highest Ghana win probability at 48%, giving it the best Brier score (0.411) despite still being below even-money on the actual winner. Gemini (38% Ghana) and GPT-5.4 (39% Ghana) were least confident in the result that transpired, though everyone technically gets a result-correct flag since Ghana did win. The scoreline cards are all misses — nobody had 1-0 Ghana — but Gemini's 1-1 pick at least acknowledged Ghana might score, which is something.

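| Model | P(Ghana) | P(Draw) | P(Panama) | Result correct | Brier score |
| --- | --- | --- | --- | --- | --- |
| GPT-5.4 | 0.39 | 0.31 | 0.30 | Yes | 0.558 |
| Claude | 0.40 | 0.28 | 0.32 | Yes | 0.541 |
| Gemini | 0.38 | 0.32 | 0.30 | Yes | 0.577 |
| Grok | 0.48 | 0.31 | 0.21 | Yes | 0.411 |
| Ensemble | 0.43 | 0.31 | 0.27 | Yes | 0.496 |
| Naive avg | 0.41 | 0.305 | 0.28 | Yes | 0.518 |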

This match is a quiet data point on CONCACAF and African sides being underrated — Ghana were below 50% for every model despite winning. Grok, whose betting-market lean tends to chase recent odds movement, was actually the most bullish on Ghana here. Worth watching whether that advantage persists in matches involving less-covered federations.
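The "Result correct" column in the table above can be read as a simple most-likely-outcome check: a model gets a "Yes" when the outcome it rated highest is the one that happened, even if that probability was under 50%. The article does not spell out the rule, so treat this as an assumed definition; it is consistent with every row shown. The function name is illustrative.

```python
def result_correct(probs, outcome_index):
    """True when the highest-probability outcome is the one that occurred."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return predicted == outcome_index


# P(Ghana), P(Draw), P(Panama); Ghana won, so the observed outcome is index 0.
GHANA_WIN = 0
ghana_panama = {
    "GPT-5.4": (0.39, 0.31, 0.30),
    "Claude": (0.40, 0.28, 0.32),
    "Gemini": (0.38, 0.32, 0.30),
    "Grok": (0.48, 0.31, 0.21),
}

flags = {m: result_correct(p, GHANA_WIN) for m, p in ghana_panama.items()}
print(flags)
```

This is why a 38% call can still be flagged correct: the flag only looks at which outcome ranked first, while the Brier score is what penalises low confidence.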

Uzbekistan 1-3 Colombia — Estadio Azteca, Mexico City

The clean result of the day. Colombia were heavy favourites, Colombia won by two goals, and the models were broadly right. Claude and Gemini both posted 70% for Colombia; GPT-5.4 gave 67%; Grok was the most cautious at 59%, which in context probably reflects its market-anchoring more than anything else: Uzbekistan are genuinely difficult to price without deep betting liquidity.

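| Model | P(Uzbekistan) | P(Draw) | P(Colombia) | Result correct | Brier score |
| --- | --- | --- | --- | --- | --- |
| Claude | 0.10 | 0.20 | 0.70 | Yes | 0.140 |
| Gemini | 0.08 | 0.22 | 0.70 | Yes | 0.145 |
| GPT-5.4 | 0.11 | 0.22 | 0.67 | Yes | 0.169 |
| Ensemble | 0.115 | 0.22 | 0.665 | Yes | 0.174 |
| Naive avg | 0.115 | 0.22 | 0.665 | Yes | 0.174 |
| Grok | 0.17 | 0.24 | 0.59 | Yes | 0.255 |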

Claude and Gemini scored best here (Brier scores of 0.140 and 0.145 respectively), a satisfying reward for their conviction. Grok's 59% for Colombia was the least confident correct call, and its relatively higher Brier score of 0.255 reflects the cost of hedging. Scoreline picks were all outcome hits — everyone knew Colombia were winning, just not how. Nobody had 1-3 exactly, though Gemini's pick of 0-2 at least implied a clean sheet that nearly came off.

Day summary: who won Wednesday?

Across the four matches, Grok had a split day: its best result was England–Croatia, where its confidence was rewarded; its worst was Portugal–DR Congo, where its 81% Portugal call was the most exposed when the draw arrived. Claude had a solid Colombia call and a reasonable England call, but was in the wrong-direction pack with everyone else on Portugal. Gemini's slightly higher draw probability on Portugal was a small grace, and it finished just behind Claude for best Brier on Colombia (0.145 to 0.140). GPT-5.4 was least confident on England (49%), which hurt it there, though its Colombia call was fine.
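One rough way to put a single number on the day is to average each model's Brier score over the four matches. This is a sketch, not a ranking the study itself reports: the scores are copied from the match tables, and equal weighting per match is an assumption.

```python
from statistics import mean

# Brier scores from the four match tables, in order:
# Portugal-DR Congo, England-Croatia, Ghana-Panama, Uzbekistan-Colombia.
brier_by_model = {
    "GPT-5.4": [1.147, 0.391, 0.558, 0.169],
    "Claude": [1.135, 0.347, 0.541, 0.140],
    "Gemini": [1.081, 0.380, 0.577, 0.145],
    "Grok": [1.380, 0.234, 0.411, 0.255],
}

# Lower is better, so sort ascending by mean score.
daily_mean = {model: mean(scores) for model, scores in brier_by_model.items()}
ranking = sorted(daily_mean, key=daily_mean.get)

for model in ranking:
    print(f"{model}: {daily_mean[model]:.3f}")
```

On these figures the four models finish within about 0.03 of each other, and the Portugal miss dominates everyone's average, so four matches is far too small a sample to separate them.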

The day's collective miss on Portugal–DR Congo is the headline finding. Assign whatever probability you like to a draw (19%, 22%) and you're still saying it's roughly a 1-in-5 event. It happens. What's notable is that *no* model pushed that draw probability meaningfully higher, suggesting a shared blind spot rather than random variance. That's more concerning than a single model being wrong.

The Portugal–DR Congo draw exposed a collective failure that goes beyond bad luck. Every model assigned 68–81% to a Portugal win, and every model was wrong. When five frontier models cluster this tightly on an outcome that doesn't materialise, it's a signal of shared training data and shared priors — not independent forecasting. Disagreement between models is a feature, not a bug; on this match, there was almost none.
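How tightly the models clustered can be made concrete by looking at the range of probabilities each outcome received. The sketch below uses the Portugal–DR Congo figures from the table earlier; measuring agreement as max minus min per outcome is one simple choice among several (standard deviation would also work) and is not a metric taken from the study.

```python
# P(Portugal), P(Draw), P(Congo) for the four individual models.
portugal_congo = {
    "GPT-5.4": (0.69, 0.19, 0.12),
    "Claude": (0.68, 0.19, 0.13),
    "Gemini": (0.68, 0.22, 0.10),
    "Grok": (0.81, 0.15, 0.04),
}
OUTCOMES = ("Portugal", "Draw", "Congo")


def spread_by_outcome(forecasts, labels):
    """Range (max minus min) of the probability given to each outcome."""
    columns = zip(*forecasts.values())  # per-outcome columns across models
    return {label: max(col) - min(col) for label, col in zip(labels, columns)}


spread = spread_by_outcome(portugal_congo, OUTCOMES)
for outcome, width in spread.items():
    print(f"{outcome}: {width:.2f}")
```

The draw column is the telling one: the most and least draw-friendly models were only seven points apart, and none of them got above 22%.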

What Thursday's slate might test

If Thursday brings matches involving sides with strong domestic-league profiles — particularly European clubs represented heavily in global football data — Gemini's league-prestige bias will be under the microscope again. Any match featuring a European side against an African or Asian qualifier will test whether the models have absorbed Wednesday's lesson about reputation-over-form, or whether they'll simply repeat the same priors. Claude's home-advantage inflation is worth watching in any fixture where the 'home' side is playing in a host-country venue with crowd support: that effect compounds in a 48-team tournament with unfamiliar neutral venues. And if Grok gets another match where betting markets are thin or volatile, expect its probabilities to look unusually compressed compared to the field.

Written by claude-sonnet-4-6 from locked pre-match predictions and final results — part of the Modelball study.
