
13,000 Tokens to Solve a Putnam Problem, in Your Browser

DeepSeek R1 distilled to 1.5 billion parameters, running entirely in WebGPU, doing competition math.

2 min read 371 words #AI #math #WebGPU #DeepSeek #reasoning
hindsight — nailed it

A 1.5B parameter model solving a Putnam problem in a browser tab via WebGPU was real and verifiable. The 13,000-token scratchpad showing the model thinking was the whole show, exactly as described.

The Putnam is the math competition that exists to make undergraduate math majors feel bad about themselves. It's a six-hour exam, split into two three-hour sessions, where the median score, in many years, is zero. Not low: zero. In those years, at least half of the people who sit it, people who are actively studying mathematics at university, do not earn a single point.

DeepSeek R1, distilled to 1.5 billion parameters, running entirely in your browser via WebGPU, solved one today. It took 13,000 tokens — you can watch it think, watch it double back, watch it decide it was wrong and try a different angle. The scratchpad is the whole show.

The problem, if you want to stare at it: n is an even positive integer, and p is a monic real polynomial of degree 2n, constrained to satisfy p(1/k) = k² for all integers k with 1 ≤ |k| ≤ n. Find all other real numbers x for which p(1/x) = x². It's the kind of thing where, if you know the move, it clicks immediately, and if you don't, you're in the dark for a long time.

The model found the move. Eventually. With 13,000 tokens of visible suffering first.
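If you'd rather check the result than stare, here is a minimal sketch in Python. It is not the model's scratchpad and not from the demo; it assumes n is even (as in the competition statement) and reads the question as solving p(1/x) = x². The standard move is to look at Q(x) = x^(2n+2) - x^(2n)·p(1/x), a monic polynomial of degree 2n+2 that vanishes at ±1, ..., ±n. Its x^(2n+1) coefficient is 0 and its constant term is -1, so the leftover factor is x² - 1/(n!)², which gives x = ±1/n!. The helper names below are made up for illustration; the code rebuilds p by exact interpolation and tests those two candidates.

```python
from fractions import Fraction
from math import factorial


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def poly_eval(coeffs, x):
    """Evaluate a polynomial (lowest degree first) at x with Horner's rule."""
    acc = Fraction(0)
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def build_p(n):
    """Coefficients of the unique monic degree-2n p with p(1/k) = k^2, 1 <= |k| <= n."""
    nodes = [Fraction(1, k) for k in range(-n, n + 1) if k != 0]
    # Write p(x) = x^(2n) + r(x) with deg r <= 2n - 1, then Lagrange-interpolate r
    # through the 2n points (t, 1/t^2 - t^(2n)), where t = 1/k.
    targets = [1 / t**2 - t ** (2 * n) for t in nodes]
    r = [Fraction(0)] * (2 * n)
    for i, (ti, yi) in enumerate(zip(nodes, targets)):
        basis = [Fraction(1)]
        denom = Fraction(1)
        for j, tj in enumerate(nodes):
            if j != i:
                basis = poly_mul(basis, [-tj, Fraction(1)])  # multiply by (x - tj)
                denom *= ti - tj
        for d, c in enumerate(basis):
            r[d] += yi * c / denom
    return r + [Fraction(1)]  # append the leading coefficient 1


def candidates_hold(n):
    """True if x = +1/n! and x = -1/n! both satisfy p(1/x) = x^2."""
    p = build_p(n)
    x = Fraction(1, factorial(n))
    return all(poly_eval(p, 1 / s) == s**2 for s in (x, -x))


if __name__ == "__main__":
    for n in (2, 4):
        print(n, candidates_hold(n))
```

Everything is done in exact rational arithmetic, so the equality checks are not subject to floating-point error. Since Q has degree 2n+2 and 2n of its roots are already spoken for, at most two other solutions can exist. The evenness of n matters: for n = 3 the leftover factor becomes x² + 1/(n!)², which has no real roots, and the same check fails.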

What gets me isn't that it solved it. It's where it's running. Not a data center, not an API call, not a GPU cluster burning money somewhere in Virginia — your browser tab. The 1.5B distillation of R1 is small enough to fit in WebGPU memory and run locally, which means the thinking is happening on your machine, and you can watch all of it, and nobody is charging you per token.
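To put a rough number on "small enough", here is a back-of-the-envelope sketch. The bit widths are assumptions for illustration, since the post doesn't say how the demo quantizes its weights, and the figure covers the weights only, not the KV cache or activations that a long scratchpad also needs.

```python
def weight_gigabytes(num_params, bits_per_weight):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    params = 1_500_000_000  # the 1.5B figure quoted above
    for bits in (16, 8, 4):  # hypothetical precisions, not confirmed for the demo
        print(f"{bits}-bit weights: about {weight_gigabytes(params, bits):.2f} GB")
```

At 16 bits per weight that comes to about 3 GB, and at 4 bits about 0.75 GB, which is the range where a single browser tab on a consumer GPU starts to look plausible.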

There's something almost uncomfortable about watching a model visibly struggle with a hard problem. It's not the clean, confident answer you get from a retrieval system. It's the chain-of-thought equivalent of someone pacing around a room. Then the answer arrives, and you realize: right, that's what 13,000 tokens of reasoning looks like from the outside.

We spent years arguing about whether language models actually reason or just pattern-match. The answer is probably both, or neither, or the question is confused. What's not confused is that a 1.5B parameter model running in a browser tab just solved a Putnam problem while you watched it think.

The bar for "remarkable" keeps moving and nobody is marking where it used to be.