Gemini Might Actually Be a Better Coder Than Opus Right Now
A one-test sample size that nonetheless feels like evidence of something.
The coding leaderboard kept shifting. Claude ultimately won the coding crown with Sonnet 3.5 and Claude Code. Gemini stayed competitive, but the 'best coder' title was never stable for more than a few months. That instability was the actual insight.
Reserving final judgment until I've run more tests. But the early numbers are not flattering for the model that's supposed to be the best coder on the planet.
I gave the same recent coding challenge to Opus, Claude Haiku Turbo, and Gemini. One problem, three models, see what happens.
Turbo failed. Horrifically. We don't need to spend more time on that.
Opus failed. Tried to be clever about it — Opus is always trying to learn, always doing something interesting and slightly wrong — but failed.
Gemini missed two import statements. That's it. Ran perfectly otherwise.
Two missing imports is not a victory lap. You can't ship that. But it's a different category of failure — the kind where you scan the error, add the two lines, run it again, and it works. Compare that to Opus, which apparently decided the problem was an opportunity for self-improvement, and Turbo, which did something too bad to describe.
There's a version of this where Gemini just got lucky on a problem that happened to match its training distribution, and next week Opus crushes it on something else. That's probably true. One test is not a benchmark.
But the early reports I'd been seeing since yesterday pointed in this direction, and my test didn't contradict them.
Sitting with that for now.