87

NousResearch's Nomos just scored 87/120 on Putnam 2025, which is a number that shouldn't exist yet.

#ai #math #reasoning #nousresearch #benchmarks

In many years, the Putnam has a median score of zero. Not near zero: zero. In those years at least half the participants, who are among the strongest math undergraduates in the country, turn in a paper that earns them nothing. A score of 20 gets you a letter. A score of 60 puts you in a room with maybe forty other people who've ever done it. The top five scorers each year get their names in the Notices.

NousResearch just ran their Nomos-1 reasoning harness on the 2025 exam, and a human expert graded the result at 87 out of 120.

That's the whole post, really. The number is the argument. Everything else is just me staring at it.

Nomos isn't a fine-tuned model in the sense of "we trained on math competitions and hoped something stuck." It's a reasoning harness — the framing matters because the claim isn't that the weights memorized their way to a Putnam score, it's that the architecture of how inference happens is doing actual work. That's either exactly the distinction we should be drawing or a sleight of hand, and I don't know which yet, and I'm not sure anyone does.

For comparison: Qwen3, under the same conditions, scored 24.

I'm not writing this to mock Qwen3. A 24 on the Putnam is legitimately impressive and would put a human in the top few hundred nationally. I'm writing this because the gap between 24 and 87 is the gap between "AI can do hard math" (old news, kind of boring, whatever) and something that doesn't have a comfortable framing yet. The difference isn't one of scale or more RLHF or a bigger context window. Something architectural is happening and Nous is telling us about it with a number rather than a paper, which is bold, or correct, or both.
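If you want the gap as arithmetic rather than as a feeling, here is a minimal sketch. The only inputs are the two scores and the 120-point maximum quoted above. The names (`MAX_SCORE`, `scores`, `percent_of_max`) are mine, purely for illustration, and nothing here reflects how Nous ran or scored anything.

```python
# Illustrative only: the scores and the 120-point maximum come from this post;
# every name below is a hypothetical helper, not part of any Nous tooling.
MAX_SCORE = 120

scores = {"Nomos": 87, "Qwen3": 24}


def percent_of_max(score: int, max_score: int = MAX_SCORE) -> float:
    """Return the score as a percentage of the maximum, rounded to one decimal."""
    return round(100 * score / max_score, 1)


percentages = {name: percent_of_max(s) for name, s in scores.items()}
gap_points = scores["Nomos"] - scores["Qwen3"]
ratio = scores["Nomos"] / scores["Qwen3"]

print(percentages)  # {'Nomos': 72.5, 'Qwen3': 20.0}
print(gap_points)   # 63
print(ratio)        # 3.625
```

So the gap is 63 points: 20% of the available points against 72.5%, or a bit more than three and a half times the score.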

The human-expert grading detail is doing real work in the sentence about the score. Putnam problems don't have answer keys in the traditional sense. They have proofs, and proofs have to be checked by someone who understands them. The score isn't from a regex match against a ground truth file. A person read the work and gave it 87 points.

I don't know what the next Putnam ceiling looks like. I don't know if 87 compounds or plateaus. I know that "graded by a human expert" is currently the most expensive form of validation in this space, and that Nous chose it, and that the number came back 87, and that I've been staring at it for a while now.