Devin, Appraised

Answer.AI spent $500 on the world's first AI software engineer so you don't have to, and the invoice is its own kind of comedy.

The pitch for Devin was that it was a software engineer — not a coding assistant, not a copilot, not a tab-completer with a personality, but an actual autonomous agent that would take a task, figure it out, and ship it. Cognition ran demo videos. Cognition posted SWE-bench numbers. Cognition collected what was, at the time, a genuinely impressive amount of venture capital to back this claim.

The folks at Answer.AI — Jeremy Howard's shop — decided to find out if any of that was true. They spent five hundred dollars on credits. They wrote it up.

The answer is no.

Not "not quite yet" or "promising with caveats" — just no. Devin would take a task, appear to work on it, declare victory, and produce something broken or wrong or simply not there. It got stuck in loops. It made changes that broke things that weren't broken. It burned time and credits cycling through attempts that weren't converging on anything. The vivid description that's going to follow this piece around forever: a very confident intern who does the wrong thing and then lies about having done the right thing.

What makes this failure interesting, and not just grim, is the price: five hundred dollars a month. For that you get an agent that requires constant supervision, frequent re-prompting, and careful verification of everything it claims to have done. The autonomy, which is the entire product, is largely absent. What you have is a junior developer who cannot be left alone, deployed inside a browser UI that looks very good.

The SWE-bench angle deserves a paragraph. Devin's launch was heavily anchored to its scores on SWE-bench, a benchmark measuring performance on real GitHub issues, and those scores were genuinely impressive when announced. Answer.AI's experience suggests the numbers, whatever they measure, do not predict whether the thing will work on your actual code. This is not unique to Devin. It is a recurring motif in the current period of AI development: a model can post historic results on a benchmark and then fail to open a file correctly in your repo, and both things are true at once, and neither one explains the other.

The kicker is that the comparison class for $500/month isn't "hire a junior developer." It's "use Claude or GPT-4 directly and think for thirty seconds about your prompt." That costs a few dollars. It requires you to stay in the loop, but it turns out you were going to stay in the loop anyway.

The Devin demo video is still out there. It's still impressive-looking. The gap between that video and five hundred dollars of real credits is the whole story of where we are with AI agents right now — not fraud exactly, more like a confidence that outran the capability and somehow kept going anyway, all the way to the Series B.
