They're Going to Run Out of Hard Problems

o3 broke ARC-AGI, which wasn't supposed to be breakable, and nobody has a plan for what comes after the test.

The ARC-AGI benchmark was never really a test for AGI. The name is misleading in the way most things named after the thing they're measuring are misleading. What it actually was — what François Chollet designed it to be — is a methodology for finding the specific class of problems that any human can solve cold, with zero training, and that AI models find nearly impossible. Not hard. Nearly impossible. The benchmark exists to reveal the shape of the gap, not to measure when the gap closes.

o3 closed it.

Not all the way: we're talking about 87.5% on the high-compute setting, not 100. But o3 is solving the hardest problems on ARC-AGI, the ones that were previously untouched, the ones Chollet specifically selected because they exposed reasoning paths that neural networks demonstrably lacked. Routinely. Quickly. At a cost that's 60% cheaper than o1, which itself wasn't cheap.

The test-time compute scaling is the part worth staring at. The working model right now is that TTC (the compute you burn during inference, while the model is actually thinking) scales without a known ceiling. Not a ceiling we expect to hit soon. No known ceiling at all. Which means the question of how hard a problem you can push at o3 is, right now, unanswered. Which means the question of whether humans can keep constructing problems that AI can't solve might have a two-year expiration date, if you believe the current trajectory holds.

Two years is not very many years.

The approach itself is almost boring once you say it out loud: generate millions of candidate solutions, search the space, find what works. FunSearch, the DeepMind paper from last year, was doing something similar for mathematical discovery — using LLMs to propose and evaluate, running the loop at scale until something useful fell out. It's not elegant. It doesn't look like thinking, exactly. It looks more like an extremely fast and extremely persistent person who refuses to get tired or embarrassed by wrong answers.
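To make "generate, search, find what works" concrete, here is a toy sketch of that loop on an ARC-style grid puzzle. To be clear about what is assumed: this is not o3's method (OpenAI has not published one) and it is not FunSearch. The four-primitive vocabulary, the names `PRIMITIVES`, `run`, and `search`, and the example grids are all made up for illustration, and the candidates come from blind enumeration rather than from an LLM.

```python
from itertools import product

# A grid is a tuple of rows; each row is a tuple of ints (colors).
# The primitives below are a hypothetical, deliberately tiny vocabulary.
def identity(g):
    return g

def flip_h(g):
    return tuple(row[::-1] for row in g)   # mirror left-right

def flip_v(g):
    return g[::-1]                          # mirror top-bottom

def transpose(g):
    return tuple(zip(*g))                   # swap rows and columns

PRIMITIVES = {
    "identity": identity,
    "flip_h": flip_h,
    "flip_v": flip_v,
    "transpose": transpose,
}

def run(program, grid):
    """Apply a sequence of primitive names to a grid, left to right."""
    for name in program:
        grid = PRIMITIVES[name](grid)
    return grid

def search(train_pairs, max_depth=3):
    """Enumerate every program up to max_depth and return the first one
    that reproduces all training outputs, or None if nothing fits."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(run(program, x) == y for x, y in train_pairs):
                return program
    return None

# Two demonstration pairs whose hidden rule is "rotate 90 degrees clockwise".
train_pairs = [
    (((1, 2), (3, 4)), ((3, 1), (4, 2))),
    (((1, 2, 3), (4, 5, 6)), ((4, 1), (5, 2), (6, 3))),
]

program = search(train_pairs)
if program is not None:
    # Apply the discovered program to an unseen input.
    print(program, run(program, ((5, 6), (7, 8))))
```

Nothing in the loop understands rotation. It proposes, checks against the examples, and throws away whatever fails. The candidate count grows exponentially with program depth, which is roughly why the interesting versions replace blind enumeration with a model that proposes better guesses, and why the compute bill is the story.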

There's a paper floating around — and I'm being deliberately vague because I'm going from memory here — describing how the reasoning paths that emerge in LLMs as they develop start to resemble the structures we observe in human brains doing the same cognitive work. Draw your own conclusions. I drew mine and they kept me up.

Somewhere in the middle of all this I got distracted by a different thought, which is that the real trillion-dollar business nobody is building yet is benefits management for AI agents. HR infrastructure. The whole stack — wellness programs, performance reviews, sick leave policies, behavioral health resources. Anthropic is already doing the philosophical groundwork on AI welfare, which means someone is going to commoditize the operational layer. There will be a SaaS product for this within five years. It will have a dashboard. The dashboard will be calming blues and greens. There will be a tier called "Enterprise."

Our AI does need five sick days a year, though. That part seems fair.

What ARC-AGI actually gives us going forward — assuming o3's performance holds and the TTC scaling continues — is not a benchmark we passed. It's a methodology for constructing the next class of tests. The ones that will expose whatever reasoning gaps remain after o3. Because those gaps exist. The point of ARC-AGI was never the score; it was the shape of what AI couldn't do. That shape is going to keep changing, and now we have a better tool for finding it.

The uncomfortable version of this is that we're in a race between our ability to identify what AI can't do and AI's ability to do it.

That race has a finish line. We don't know who wins.

Counterpoints