MBModelBall
May 1, 2026

Where bias correction fails (and what we learn from it)

calibration, limitations

Yesterday we covered the leagues where bias correction won big. Today: the four leagues where it lost. In Eredivisie, Scottish Premiership, Swiss Super League, and Serie A, the bias-corrected ensemble produced worse predictions than the raw model average. This is the most useful finding of the entire study.

The four-league mistake list

Brier-score improvement after bias correction, relative to the naive ensemble (negative means the corrected predictions were worse)

  • Eredivisie — −10.0% (n = 40)
  • Scottish Premiership — −7.7% (n = 25)
  • Swiss Super League — −6.8% (n = 24)
  • Serie A — −4.9% (n = 63)
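For readers who want to reproduce a figure of this kind, here is a minimal sketch of one way it could be computed. It assumes a three-outcome (home / draw / away) multi-class Brier score and reads the percentages as relative improvement over the raw average, so that a negative value means the Brier score went up. The function names and the sample scores are illustrative, not our production code or data.

```python
from typing import Sequence


def brier_score(probs: Sequence[Sequence[float]], outcomes: Sequence[int]) -> float:
    """Mean multi-class Brier score over a set of matches (lower is better).

    probs[i] is the forecast for match i, e.g. [p_home, p_draw, p_away];
    outcomes[i] is the index of the outcome that actually happened.
    """
    total = 0.0
    for p, y in zip(probs, outcomes):
        # Squared distance between the forecast and the one-hot outcome vector.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(outcomes)


def brier_improvement_pct(raw: float, corrected: float) -> float:
    """Relative improvement of the corrected score over the raw score, in percent.

    Assumed sign convention: negative means the correction made things worse.
    """
    return 100.0 * (raw - corrected) / raw


# Made-up scores, for illustration only: correction raises Brier from 0.60 to 0.66.
example_change = brier_improvement_pct(raw=0.60, corrected=0.66)
```

With this convention, a league where correction lifts the Brier score from 0.60 to 0.66 shows up as roughly −10%.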

The numbers are smaller than the wins, but the direction is clear. Across these leagues, our calibration was actively making predictions worse. We had to understand why before we could trust the methodology for the World Cup.

The pattern (so far)

Three of the four are leagues with strong, idiosyncratic patterns that don't generalise. Eredivisie is famously high-variance: extreme home advantage, unusual goal-scoring distributions, frequent upsets driven by tactical specifics. The Scottish Premiership has Celtic and Rangers and then everyone else, which produces a bimodal outcome distribution that punishes calibration toward a population mean. The Swiss Super League has so few teams that any generic correction overfits to noise.

Serie A is harder to explain in one paragraph. The raw models were doing something right that our correction stepped on — almost certainly related to tactical-context dimensions that Serie A overweights and our correction underweights. We have a working hypothesis we'll test during the tournament.

The honest takeaway

Bias correction isn't free. When you tilt the model away from a default it would otherwise lean on, you have to be confident the default was wrong. In leagues where the “default” is actually closer to ground truth — because the league behaves predictably or because the models happen to have a good prior on it — the correction can introduce error rather than remove it.

The lesson isn't “calibration is bad.” It's “calibration has to be bias-aware.” If we apply a generic correction that assumes the model overrates X, we'll subtract value in any league where the model in fact rates X correctly. The work going into the World Cup version is about being more selective: only correcting where the evidence says correction is needed.
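One way to make "only where the evidence says so" concrete is a simple gate on held-out data. The sketch below is an illustration, not our actual rule. It assumes we have per-match Brier losses for the raw and corrected forecasts on the same held-out matches, and it switches correction on only if the mean improvement clears its own standard error. The function name and the `z` threshold are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def correction_is_supported(
    raw_losses: Sequence[float],
    corrected_losses: Sequence[float],
    z: float = 1.0,  # hypothetical threshold: how many standard errors of margin to require
) -> bool:
    """Return True if the correction helped on held-out matches by a clear margin.

    Both sequences hold per-match Brier losses for the same matches, in the same order.
    """
    # Positive difference = the corrected forecast had the lower (better) loss.
    diffs = [r - c for r, c in zip(raw_losses, corrected_losses)]
    if len(diffs) < 2:
        return False  # not enough evidence either way, so leave the raw forecast alone
    standard_error = stdev(diffs) / math.sqrt(len(diffs))
    return mean(diffs) - z * standard_error > 0
```

The design choice here is that the default is no correction: with few matches or a noisy difference, the gate stays closed and the raw average is used unchanged.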

What this means for the live World Cup predictions

  • We will not blindly apply bias correction to every match.
  • High-variance matchups (small countries, derby fixtures) get a lighter touch.
  • Where models agree closely already, we apply almost no correction.
  • Where they diverge sharply, the correction does heavier work. That's when the calibration earned its keep in the league test.
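The rules above can be sketched as a per-match weight that blends the raw and corrected forecasts. This is a minimal illustration under stated assumptions, not the live pipeline: disagreement is measured as the average spread of the models' probabilities across outcomes, and the `full_at` and `damp` values are hypothetical placeholders rather than tuned settings.

```python
from statistics import pstdev
from typing import List, Sequence


def disagreement(model_probs: Sequence[Sequence[float]]) -> float:
    """Average, over outcomes, of the population std-dev of the models' probabilities."""
    n_outcomes = len(model_probs[0])
    spreads = [pstdev([m[k] for m in model_probs]) for k in range(n_outcomes)]
    return sum(spreads) / n_outcomes


def correction_weight(
    model_probs: Sequence[Sequence[float]],
    high_variance: bool,
    full_at: float = 0.10,  # hypothetical: disagreement at which correction applies in full
    damp: float = 0.5,      # hypothetical: lighter touch for high-variance fixtures
) -> float:
    """Weight in [0, 1]: near 0 when models agree, rising to 1 as they diverge."""
    w = min(1.0, disagreement(model_probs) / full_at)
    return w * damp if high_variance else w


def blend(raw: Sequence[float], corrected: Sequence[float], w: float) -> List[float]:
    """Mix raw and corrected forecasts, then renormalise so the result sums to 1."""
    mixed = [(1.0 - w) * r + w * c for r, c in zip(raw, corrected)]
    total = sum(mixed)
    return [m / total for m in mixed]
```

When every model gives the same probabilities the weight is zero and the raw forecast passes through untouched; the correction only takes over as the models spread apart, and a fixture flagged as high-variance never gets more than the damped share.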

Why we're publishing the failures

A 14-19% Brier improvement in five leagues is a more compelling number if you don't also see the four leagues where the same methodology lost ground. We are publishing both because the test isn't whether we can make something work somewhere — anyone can do that — but whether we know in advance which situations the methodology fits.

Tomorrow: the Premier League, where bias correction barely moved the needle in either direction. That's a different kind of finding.
