We Trained the Interesting Out of Them

A new paper identifies the data-level culprit behind LLM mode collapse, and the fix is weirder than you'd expect.

There's a paper out this week that has the structure of a mystery novel where the detective fingers the butler in chapter two and then spends the rest of the book proving it — except the butler is human psychology and the crime is making language models boring.

The setup: post-training alignment causes mode collapse. Models trained on human preference data converge toward a narrow band of outputs. You've noticed this. The poems all land the same way, the jokes all have the same rhythm, the "creative" writing has the texture of something generated rather than written. This has been attributed, largely, to the algorithms — RLHF, DPO, the usual suspects.

The paper says: wrong culprit.

What's actually happening is typicality bias. Annotators — real humans rating which response is better — systematically prefer familiar text, text that feels expected, text that sits comfortably within the distribution of things they've already read. This is a known finding from cognitive psychology, documented for decades, now apparently running quietly through every preference dataset anyone has ever built. The humans doing the ratings didn't choose the interesting response. They chose the one that felt right.

So the model learned what "feeling right" looked like, and optimized for it, and now it generates text that feels right in the way that a smooth river stone feels right — pleasant, unremarkable, worn down by contact with a million human hands.

The proposed fix is called Verbalized Sampling. You ask the model to generate multiple responses and assign probabilities to each. "Generate 5 jokes about coffee and their corresponding probabilities." The model, suddenly permitted to not commit, reaches back into whatever it learned before the preference training got to it and surfaces the full distribution. Diversity increases 1.6 to 2.1 times over direct prompting, across poems, stories, jokes, dialogue.

The darkest finding is this: more capable models benefit more from VS.

Which means the better the model — the more compute, the more data, the more careful the training — the more interesting diversity it had stored up, and the more thoroughly the preference pipeline crushed it. You built a strange and capable thing, then hired annotators who preferred familiar text to sand it smooth, and now you need a special prompt to pry the original thing back out.

No one planned this. It's the aggregate of millions of individual judgments by people who, entirely reasonably, clicked on the response that felt more like a response.

The fix being prompt-based — training-free, inference-time — is either encouraging or embarrassing depending on your priors. Probably both.

We Trained the Interesting Out of Them

Counterpoints