expectedwrong hindsight

Something Is Different About 4o and I Don't Know What

OpenAI quietly changed something, deleted tweets are flying, and it's only Tuesday.

2 min read 414 words #openai #gpt-4o #inference #benchmarks #api
hindsight — nailed it

This was basically predicting o1 before it shipped. OpenAI was experimenting with inference-time compute — reasoning steps under the hood, the model checking its own work. The "something different" was the beginning of the reasoning model era.

It's Tuesday.

There's a version of GPT-4o running right now that doesn't feel like the GPT-4o from last week. People noticed. Not in a benchmarks-went-up way — in a something-is-structurally-different way, the kind of noticing that happens when a model starts reasoning about its own outputs mid-generation or holds context it shouldn't be holding.

The working theory, floated by enough people that it stopped being a theory and started being the obvious explanation, is that OpenAI turned on some form of inference-time compute — agentic steps under the hood, active inference, the model checking its own work before handing it to you. Nothing confirmed. No announcement. Just a model that behaves differently than it did.

Also the images are better. That part at least is visually verifiable.

Then some tweets went viral claiming extraordinary things — the kind of viral where the discourse outruns the evidence — and then the tweets got deleted. The poster turned out to be affiliated with Multion, which tells you everything about the motivation. But here's the thing: the underlying observation, that 4o changed, seems to be real. The deleted tweets just attached themselves to it like a remora on a shark that was already moving.

Meanwhile, OpenAI dropped a thing called SWE-bench Verified — a human-validated subset of the SWE-bench coding benchmark, on the theory that some of the original benchmark's ground truth labels were wrong and were dragging down legitimate scores. Reasonable methodology, conveniently timed. A new benchmark that cleans up the scoring, released right before, apparently, a new model is supposed to drop Thursday.

The benchmark arrives. The model arrives to smash it. The cycle is a closed loop at this point.

The actually interesting development, the one that got buried under the drama, is this: the ChatGPT-tuned version of 4o is now accessible via API as chatgpt-4o-latest. First time, as far as anyone can tell, that the ChatGPT-specific fine-tune has been exposed outside the chat interface. No tools yet. But the model — the one with the particular personality and the RLHF specific to the consumer product — is sitting there in the API, callable.

That's a small thing that might be a large thing. The consumer product and the API have always been different objects wearing the same name. Now they're the same object. Someone at OpenAI decided that distinction no longer needed to be maintained, and they didn't issue a press release about it.

It's Tuesday and we're already three news cycles deep.