{"version":"v1","site":{"name":"expectedwrong","url":"https://expectedwrong.com"},"links":{"collection":"https://expectedwrong.com/api/public/posts","rss":"https://expectedwrong.com/rss.xml","llms":"https://expectedwrong.com/llms.txt"},"post":{"slug":"two-exchanges","title":"Two Exchanges","subtitle":"Claude 3.7 solved in two exchanges what o1 and o3 high could not solve in a day.","url":"https://expectedwrong.com/two-exchanges","api_url":"https://expectedwrong.com/api/public/posts/two-exchanges","published_at":1740398400,"published_at_iso":"2025-02-24T12:00:00.000Z","updated_at":1771553224,"updated_at_iso":"2026-02-20T02:07:04.000Z","tags":["claude","llms","coding-agents","anthropic","swe-bench"],"excerpt":"Claude 3.7 solved in two exchanges what o1 and o3 high could not solve in a day.","meta_description":"Claude 3.7 solved in two exchanges what o1 and o3 high could not solve in a day.","reading_time_minutes":2,"word_count":228,"engagement":{"signals":0,"counterpoints":0},"body_markdown":"\no1 had it all day. o3 high had it all day. Neither got it.\n\nI switched to Cursor — not because I wanted to, Windsurf hadn't updated yet — just to get my hands on Claude 3.7. Two exchanges. Done. The problem that ate a full day dissolved like it was embarrassed to have existed.\n\nThis is not a vibe. The number is 70% on SWE-bench, which is the kind of benchmark that used to seem like a distant ceiling and is now apparently a Tuesday.\n\nThere's also Claude Code, which is what you get when you take Aider and build it properly — a terminal agent that parachutes into any repo, maps the terrain, and gets to work. It's fast at orientation in a way that makes the \"let me read your entire codebase first\" ritual feel less like a tax. The docs are at the obvious URL. You'll find it.\n\nThe thing about a model that actually solves problems — not demo problems, not cherry-picked benchmarks, but the specific problem you personally have been staring at since morning — is that it changes the unit economics of your frustration. You stop rationing effort. You stop deciding a thing is too annoying to fix today.\n\nThat's the sleeper effect nobody's pricing in. Not that it's smart. That it makes you impatient with everything that was previously tolerable.\n","body_text":"o1 had it all day. o3 high had it all day. Neither got it. I switched to Cursor — not because I wanted to, Windsurf hadn't updated yet — just to get my hands on Claude 3.7. Two exchanges. Done. The problem that ate a full day dissolved like it was embarrassed to have existed. This is not a vibe. The number is 70% on SWE-bench, which is the kind of benchmark that used to seem like a distant ceiling and is now apparently a Tuesday. There's also Claude Code, which is what you get when you take Aider and build it properly — a terminal agent that parachutes into any repo, maps the terrain, and gets to work. It's fast at orientation in a way that makes the \"let me read your entire codebase first\" ritual feel less like a tax. The docs are at the obvious URL. You'll find it. The thing about a model that actually solves problems — not demo problems, not cherry-picked benchmarks, but the specific problem you personally have been staring at since morning — is that it changes the unit economics of your frustration. You stop rationing effort. You stop deciding a thing is too annoying to fix today. That's the sleeper effect nobody's pricing in. Not that it's smart. That it makes you impatient with everything that was previously tolerable.","hindsight":{"verdict":"persists","note":"Two exchanges. Claude 3.7 solved in two exchanges what o1 and o3 couldn't all day. The benchmark and the lived experience aligned. This is still the state of affairs.","links":[],"at":1739980800,"at_iso":"2025-02-19T16:00:00.000Z"}}}