26.6% on Humanity's Last Exam

OpenAI shipped Deep Research today, and someone named a benchmark as if they already knew how this ends.

2 min read 401 words #ai #openai #agents #benchmarks #deep-research
hindsight — still happening

Deep Research shipped, and 26.6% on Humanity's Last Exam is still the number being chased. The name of the benchmark is still doing as much work as the score.

OpenAI shipped something today called Deep Research: an agent built on a version of o3, with a toolset that looks close to identical to Magentic. The promise is that you hand it a question and come back thirty minutes later to a full report: many pages, sourced, structured, done. The demo is good. The kind of good that makes you quiet for a second before you keep scrolling.

26.6% on Humanity's Last Exam.

That's the benchmark number, and I keep returning to the name rather than the score. Someone looked at this evaluation suite — expert-level questions across every hard domain, the kind of thing that takes a specialist years to even be able to parse — and concluded that once models clear it, exams have served their purpose. They named it that deliberately. The name is doing the same work as the number.

The knowledge-worker displacement argument gets made so often by people who are selling consulting packages that it's become background noise. This is harder to tune out. An agent that takes five to thirty minutes (which is not a flaw; that's the model actually thinking, actually browsing, actually synthesizing) and returns something that would have taken a junior analyst a week. If it works as advertised, you don't need to do the math on how many cumulative work-hours that collapses across an organization. The math is bad. "Years of hours" is the phrase that keeps coming up, stated flatly, as if that's a normal thing to say about a product launch.

Hugging Face dropped another Open-R1 update the same day, which is either a coincidence or a scheduling choice that says something about where the open-source replication race currently sits. The o1-level reasoning is getting rebuilt in public now, update by update, each one promising, each one closing the gap a little more.

And agi.safe.ai launched too, quietly, same week — in case you wanted the full picture arranged in front of you.

All of this happened in the span of a day and was discussed the same way you'd discuss whether a restaurant takes reservations. It might be live by tonight. Should be fine. The appropriate response to this particular moment is not obvious to me. The inappropriate response is to treat it like a product launch and wait for the pricing tier announcement.

February 2025. The exam has a name. Someone is taking it.