Claude 3 Broke My Calibration

I had a model in my head for how good AI coding could get, and now I have to throw it out.

I gave it a fair shot. That's the important part. I wasn't looking for an excuse to switch.

Claude 3 wipes the map with GPT-4 at coding. Not nudges it. Not edges it out in certain benchmarks. Wipes. The. Map.

It output a 600-line file — multiple refactors, multiple modifications — no flaws. Not "pretty good for an LLM." No flaws. The kind of output that used to require three rounds of back-and-forth, two sessions of context-pasting, and a small amount of prayer.

Here's the part that's actually interesting though: it unlocked a different debugging posture. When the model is good enough that you trust the code it writes, you stop treating the AI output as the probable source of error — and you start looking everywhere else. "I know the problem isn't in what you just wrote, so it must be somewhere else." That's a real superpower, and it only exists because the floor got high enough to stand on.

I had a mental model for how capable coding automation could get — and I already thought that ceiling was high. Magic.dev, GPT-5, whatever's next. I was prepared for another incremental ratchet.

This wasn't that. This was the ceiling moving.

The framing has to change now, which is annoying, because I just finished updating the framing.

Claude 3 Broke My Calibration

Counterpoints