
The Curve Already Knew

A new method finds the optimal LLM temperature by watching entropy bend — no labeled data required.

2 min read 393 words #llms #inference #temperature #sampling #papers
hindsight — still happening

The entropy curve has a shape and that shape tells you where to stop. The TURN paper's observation about temperature selection is still being absorbed by practitioners.

Temperature is the knob everyone turns without looking at their hands. Set it too low and the model repeats itself into a kind of confident stupor — same token, same token, convinced. Set it too high and the thing starts hallucinating punctuation. The general advice is somewhere between 0 and 1, good luck, godspeed.
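To make the knob concrete, here is a minimal sketch of what temperature does to a next-token distribution: logits are divided by the temperature before the softmax, and entropy measures how spread out the result is. The logits and the four-token vocabulary are made up for illustration; nothing here comes from the paper.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits for a four-token vocabulary (illustrative only).
logits = [4.0, 2.0, 1.0, 0.5]
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: top-token prob={max(probs):.3f}, entropy={entropy(probs):.3f}")
```

At the low end almost all the probability sits on one token, which is the confident stupor. At the high end the distribution flattens toward uniform and entropy approaches its ceiling of log(vocabulary size).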

A paper dropped this week — TURN, Turning Point Temperature Selection — that makes a fairly quiet observation: the entropy curve has a shape, and that shape tells you where to stop.

Specifically: as you raise temperature, token-level entropy rises, but not at a steady rate. On a log scale the curve is concave, then convex, and the inflection point (what they call the entropy turning point, EntP) correlates with peak inference accuracy. The curve bends, and that bend is the answer. Not an approximation of the answer. The answer, or close enough that it stops mattering.
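One way to picture "find the bend" is a finite-difference sketch: take the log of the mean token entropy at each temperature, compute the discrete second difference, and report the first grid point where it flips from negative (concave) to non-negative (convex). This is an assumption-laden illustration, not the paper's estimator. The function name `entropy_turning_point`, the evenly spaced grid, and the synthetic entropy values are all hypothetical. In practice `mean_entropies` would come from sampling the model at each temperature and averaging token-level entropy.

```python
import math

def entropy_turning_point(temperatures, mean_entropies):
    """Return the temperature where log-entropy flips from concave to convex.

    Assumes evenly spaced temperatures, so the sign of the second finite
    difference tracks the sign of the curvature.
    """
    log_h = [math.log(h) for h in mean_entropies]
    # Second finite difference at each interior grid point.
    curvature = [
        log_h[i - 1] - 2 * log_h[i] + log_h[i + 1]
        for i in range(1, len(log_h) - 1)
    ]
    for k in range(1, len(curvature)):
        if curvature[k - 1] < 0 <= curvature[k]:
            # curvature[k] belongs to grid point k + 1
            return temperatures[k + 1]
    return None  # no concave-to-convex flip on this grid

# Synthetic curve (not measured data): log-entropy is a cubic with its
# inflection placed at T = 0.9, between two grid points.
temperatures = [round(0.2 * i, 1) for i in range(1, 10)]  # 0.2 .. 1.8
mean_entropies = [math.exp((t - 0.9) ** 3 + t - 2.0) for t in temperatures]

print(entropy_turning_point(temperatures, mean_entropies))
```

Real entropy measurements are noisy, so a raw second difference would likely need smoothing or a fitted curve before the sign change can be trusted. The sketch skips that on purpose.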

What makes this land differently than most "we found the optimal hyperparameter" papers is the no-labeled-data thing. The standard move is to sweep temperatures on a validation set, see what scores highest, and call that your temperature. Which works, and is also exactly as boring and expensive as it sounds. TURN watches the entropy dynamics during sampling and predicts the near-optimal temperature before you've evaluated a single labeled example. The curve already knew. You just weren't watching it.

They run this on math reasoning with majority voting and code generation with best-of-N, and it holds. Pretrained models, instruction-tuned models, task-finetuned models all have different optimal temperature ranges — which anyone who's spent time with these systems already suspects, without being able to say exactly why — and TURN adapts to each without being told which category it's dealing with.
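For readers who haven't met the two aggregation schemes, here is a minimal sketch of each. Majority voting picks the most frequent final answer across samples; best-of-N picks the sample that a scoring function rates highest. The sample answers, the draft names, and the "unit tests passed" score are hypothetical stand-ins, not the paper's setup.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Return the candidate that the scoring function rates highest."""
    return max(candidates, key=score)

# Hypothetical final answers extracted from five sampled math solutions.
sampled_answers = ["42", "41", "42", "42", "17"]
voted = majority_vote(sampled_answers)

# Hypothetical code samples paired with the number of unit tests each passes.
code_samples = [("draft_a", 3), ("draft_b", 7), ("draft_c", 5)]
best = best_of_n(code_samples, score=lambda pair: pair[1])

print(voted, best[0])
```

Both schemes only pay off when the samples differ from one another, which is why temperature matters so much here: too cold and every sample is the same answer, too hot and the samples are noise.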

There's also a stochastic process model explaining why high temperatures cause what they call "quality collapse," which is a clinical way to describe the model going feral. I appreciate the formalism. The model going feral deserves a theorem.

The practical upshot is that you can get near-optimal multi-sample inference without burning compute on a validation sweep. For anything running inference at scale, or anything where labeled validation data is expensive to produce, this is the kind of paper that gets quietly incorporated into pipelines six months from now, by which point nobody remembers where it came from.

The entropy was always bending. We were just setting temperatures by feel.