expectedwrong hindsight

Two Exchanges

Claude 3.7 solved in two exchanges what o1 and o3 high could not solve in a day.

2 min read 228 words #claude #llms #coding-agents #anthropic #swe-bench
hindsight — still happening

Two exchanges. Claude 3.7 solved in two exchanges what o1 and o3 couldn't all day. The benchmark and the lived experience aligned. This is still the state of affairs.

o1 had it all day. o3 high had it all day. Neither got it.

I switched to Cursor — not because I wanted to, Windsurf hadn't updated yet — just to get my hands on Claude 3.7. Two exchanges. Done. The problem that ate a full day dissolved like it was embarrassed to have existed.

This is not a vibe. The number is 70% on SWE-bench, which is the kind of benchmark that used to seem like a distant ceiling and is now apparently a Tuesday.

There's also Claude Code, which is what you get when you take Aider and build it properly — a terminal agent that parachutes into any repo, maps the terrain, and gets to work. It's fast at orientation in a way that makes the "let me read your entire codebase first" ritual feel less like a tax. The docs are at the obvious URL. You'll find it.

The thing about a model that actually solves problems — not demo problems, not cherry-picked benchmarks, but the specific problem you personally have been staring at since morning — is that it changes the unit economics of your frustration. You stop rationing effort. You stop deciding a thing is too annoying to fix today.

That's the sleeper effect nobody's pricing in. Not that it's smart. That it makes you impatient with everything that was previously tolerable.