The Unlock Code Is "Think Step By Step"
The international AI safety consensus document dropped this week, and buried in it is something that should bother everyone doing capability evaluations.
The elicitation gap is real and the unlock code is still "think step by step." The international safety report's finding — that the same model gets dramatically smarter when asked to reason — remains one of the most underappreciated facts in AI.
The International AI Safety Report 2025 came out — 75-odd researchers, Yoshua Bengio chairing, the whole legitimizing apparatus of international scientific consensus — and the finding I can't stop thinking about is not about bioweapons uplift or power concentration or any of the headline stuff.
It's that you can make any model smarter by telling it to think.
Not fine-tune it. Not swap it for a bigger one. Just ask it to reason through the problem before answering. GPT-4o, for instance. Tell it to think step by step and watch MATH benchmark performance jump twenty, thirty points. Tell it nothing and you get a different, dumber model — same weights, same everything, just asked the wrong way.
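To make the gap concrete, here is a minimal sketch of how one might measure it: score the same model on the same problems under a bare prompt and under a step-by-step prompt, then subtract. Everything named here is an assumption for illustration: the two prompt templates, the `accuracy` and `elicitation_gap` helpers, and especially `toy_model`, a hard-coded stand-in that drops carries unless asked to reason. It is not GPT-4o or any real model, and its numbers say nothing about real benchmarks.

```python
import re
from typing import Callable, Sequence, Tuple

# Hypothetical prompt templates: same question, two ways of asking.
NAIVE = "{question}\nAnswer:"
STEP_BY_STEP = "{question}\nLet's think step by step, then state the final answer."

# Tiny made-up problem set: (question, expected final answer).
PROBLEMS = [
    ("What is 17 + 25?", "42"),
    ("What is 12 + 13?", "25"),
    ("What is 38 + 47?", "85"),
    ("What is 20 + 30?", "50"),
]


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call (an assumption, NOT a real LLM).

    It adds correctly when the prompt asks for step-by-step reasoning
    and forgets to carry otherwise.
    """
    a, b = (int(n) for n in re.findall(r"\d+", prompt.splitlines()[0]))
    if "step by step" in prompt:
        return f"{a} plus {b} is {a + b}. Final answer: {a + b}"
    ones = (a % 10 + b % 10) % 10   # carry is dropped here
    tens = (a // 10 + b // 10) % 10
    return str(tens * 10 + ones)


def accuracy(
    model: Callable[[str], str],
    problems: Sequence[Tuple[str, str]],
    template: str,
) -> float:
    """Fraction of problems whose reply ends with the expected answer."""
    correct = 0
    for question, expected in problems:
        reply = model(template.format(question=question))
        tokens = reply.strip().split()
        if tokens and tokens[-1] == expected:  # crude last-token scoring
            correct += 1
    return correct / len(problems)


def elicitation_gap(
    model: Callable[[str], str],
    problems: Sequence[Tuple[str, str]],
) -> float:
    """Step-by-step accuracy minus naive accuracy, same model, same problems."""
    return accuracy(model, problems, STEP_BY_STEP) - accuracy(model, problems, NAIVE)


print("naive:       ", accuracy(toy_model, PROBLEMS, NAIVE))
print("step by step:", accuracy(toy_model, PROBLEMS, STEP_BY_STEP))
print("gap:         ", elicitation_gap(toy_model, PROBLEMS))
```

Swap `toy_model` for a function that calls whatever model you actually evaluate, and the subtraction stays the same. The weights never change between the two runs; only the template does.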
The report calls this the "elicitation gap" and treats it as an evaluation methodology problem, which is the polite framing. The impolite framing is: we have been running safety evaluations on the wrong version of the model. The naive query version. The version that doesn't think.
A capability evaluation that doesn't use optimal prompting is not measuring what a model can do. It's measuring what a model does when you don't try very hard. These are not the same thing and the gap between them is large and inconsistent across model families, which means cross-model comparisons are also suspect, which means most of the empirical safety literature is measuring something slightly adjacent to the thing it claims to be measuring.
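Here is a small sketch of why an inconsistent gap poisons cross-model comparisons. The scores below are invented for illustration, as are the model names and the strategy labels; nothing here comes from the report or from any real benchmark. The point is only the mechanics: rank two models by their naive score, then rank them by the best score any elicitation strategy reaches, and see whether the order survives.

```python
from typing import Callable, Dict, List

# Made-up accuracy numbers (assumption): model -> elicitation strategy -> score.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"naive": 0.55, "step_by_step": 0.62, "scaffolded": 0.64},
    "model_b": {"naive": 0.41, "step_by_step": 0.70, "scaffolded": 0.78},
}


def ranking(
    scores: Dict[str, Dict[str, float]],
    pick: Callable[[Dict[str, float]], float],
) -> List[str]:
    """Model names ordered best-first by whichever score `pick` selects."""
    return sorted(scores, key=lambda name: pick(scores[name]), reverse=True)


# What a low-effort evaluation sees.
naive_rank = ranking(SCORES, lambda s: s["naive"])

# What you see when each model gets its best-performing strategy.
elicited_rank = ranking(SCORES, lambda s: max(s.values()))

# Per-model gap between best elicited score and naive score.
gaps = {name: round(max(s.values()) - s["naive"], 2) for name, s in SCORES.items()}

print("naive ranking:   ", naive_rank)
print("elicited ranking:", elicited_rank)
print("gaps:            ", gaps)
```

With these invented numbers the ordering flips: the model that looks weaker under naive prompting is the stronger one once both are properly elicited. If gaps were uniform across models, a naive evaluation would at least preserve the ranking. When they are not, it does not.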
The report is careful and diplomatic about this. It says evaluations are "immature and insufficient." It says we "do not currently have adequate tools to certify a frontier model as safe." What it does not say, because it is a scientific consensus document and not a blog, is that this is a somewhat spectacular own goal — we spent the last two years building evaluation infrastructure predicated on the assumption that capability is a fixed property of a model, and it turns out capability is partly a function of how you ask.
The practical consequence is that there's a capability overhang sitting in deployed models right now. Some of the models we've evaluated, red-teamed, and deemed acceptable under current guidelines have untapped headroom that better prompting (or scaffolding, or tool access) will eventually expose. The report acknowledges this directly: a model judged safe under today's evaluations may be shown to be harmful later, with better elicitation.
"Think step by step" is four words. It's been in the literature since 2022. The report, released in January 2025 by the most credentialed group ever assembled to discuss AI safety, is one of the first major governance documents to treat this as a structural problem rather than a prompting curiosity.
That's either reassuring or it isn't.