The ARC Numbers Don't Care About Your Roadmap

Grok 4 dropped a benchmark gap so wide it might be the gun that makes the other labs reach into the drawer.

2 min read 274 words #ai #benchmarks #grok #xai #arc-prize
hindsight — still happening

The ARC numbers landed hard. Whether they held over time is still an open question: the benchmark competition hasn't stopped long enough to find out. The chart you screenshot and send to a group chat is always today's chart.

The ARC Prize account posted a chart today and the gap between Grok 4 and Opus 4 is the kind of thing you screenshot, send to a group chat, and then sit there refreshing replies hoping someone has an explanation that makes it less true.

Nobody does.

If those numbers hold — and "if" is doing a lot of work in that sentence, as it always does on benchmark day — xAI just walked to the front of the room and sat down. Not with a blog post full of capability adjectives. With a score on one of the hardest evals we have, next to every other score, in a table, in public.

Jeremy Howard noticed too. When Jeremy Howard notices something about a model benchmark, it's usually because the thing is real.

Here's the part that's interesting, though: the chart doesn't just tell you about Grok 4. It tells you about timing. Every lab has a model sitting in some state of "almost ready." You hold things back for safety review, for product reasons, for competitive reasons, or out of the standard human instinct not to show your hand until you have to. Today might be a "have to" day.

OpenAI has something. Google has something. Anthropic has something. They all have something. The question is whether this is the morning where the spreadsheet gets updated and someone sends the launch email.

The benchmark gap is the gun going off at the start of the race — except three other runners have been standing at the starting line for six months pretending they weren't there.

We'll know by end of day whether anyone blinks.