The Number Token Was Always the Wrong Move

xVal thinks LLMs are bad at math because we've been encoding numbers like illiterates since the beginning.

hindsight — evolved

The number tokenization problem got partially addressed. Some models now use continuous number encodings. Others handle arithmetic better through chain-of-thought. But the fundamental critique — that BPE tokenization is wrong for numbers — was correct and influenced model architecture decisions.

Here is how a typical BPE-tokenized language model represents the number 3.14159: as the tokens "3", ".", "1", "4", "1", "5", "9", or some equally arbitrary slicing depending on what showed up in the BPE training corpus. The model learns that "3" follows certain patterns. Nothing in the encoding itself tells it that 3 is less than 4.
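To make the complaint concrete, here is a minimal sketch of a character-level number tokenizer. The vocabulary and its ID assignments are invented for illustration (real BPE vocabularies differ and often merge digits into multi-character chunks), but the point carries over: the IDs are labels, and their order has nothing to do with numeric order.

```python
# Toy vocabulary: every digit plus ".", with arbitrary made-up IDs.
# The string below just fixes an ordering; it is not taken from any real model.
VOCAB = {ch: i for i, ch in enumerate("4917.503826")}


def tokenize_number(text: str) -> list[str]:
    """Split a numeric string into one token per character."""
    return list(text)


def token_ids(text: str) -> list[int]:
    """Look up the (arbitrary) vocabulary ID of each character token."""
    return [VOCAB[ch] for ch in tokenize_number(text)]


tokens = tokenize_number("3.14159")
ids = token_ids("3.14159")

# In this toy vocabulary the ID for "3" is larger than the ID for "4":
# the model only ever sees the IDs, so numeric order is not encoded anywhere.
three_before_four = VOCAB["3"] < VOCAB["4"]
```

Whatever the model ends up knowing about "3 < 4" has to be inferred from context statistics, because the lookup table gives it nothing to start from.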

This is not a bug people discovered recently. Everyone has known this. The standard response has been some variation of "the model sees enough examples that it figures it out implicitly," which is technically true in the same way that a person who has never held a ruler can still estimate lengths by staring at things long enough.

The Polymathic AI team — the same group building foundation models for physics simulations — dropped a paper this month proposing something different. xVal. The idea is almost insultingly simple once you see it.

Instead of tokenizing a number into digit fragments, you replace every number in the input with a single [NUM] token, and then you scale the embedding vector for that token by the actual numerical value. Not a lookup. A multiplication. The number 1000 gets the same token as the number 0.001, but on raw values its embedding vector would be a million times larger. (In practice the paper rescales values into a bounded range first, so the magnitudes stay manageable, but the principle is the same.)
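The substitute-and-scale step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the regex, the function names, and the four-dimensional `NUM_EMBEDDING` vector are all hypothetical, and the values are used raw, without any rescaling a real system might apply.

```python
import math
import re

# Simple numeric-literal pattern (assumption): optional sign, digits,
# optional fraction, optional exponent. Good enough for clean numeric data.
NUM_PATTERN = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?")

# Stand-in for the single learned [NUM] embedding vector (made-up values).
NUM_EMBEDDING = [0.5, -1.0, 0.25, 2.0]


def split_numbers(text: str) -> tuple[str, list[float]]:
    """Replace each number with [NUM]; return the masked text and the values."""
    values = [float(m.group()) for m in NUM_PATTERN.finditer(text)]
    masked = NUM_PATTERN.sub("[NUM]", text)
    return masked, values


def embed_number(value: float) -> list[float]:
    """Scale the shared [NUM] embedding by the number's value."""
    return [value * w for w in NUM_EMBEDDING]


def norm(vec: list[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


masked, values = split_numbers("mass 1000 kg, tolerance 0.001")
big, small = (embed_number(v) for v in values)
ratio = norm(big) / norm(small)  # about one million on raw values
```

Both numbers share one vocabulary entry; everything that distinguishes them lives in the length of the vector.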

This means numerical magnitude exists in the geometry of the space. Two numbers that are close together will have embeddings that are close together. The model doesn't have to learn that "9" and "10" are adjacent by memorizing their co-occurrence statistics across a trillion tokens of internet text. The ordering comes for free, because that's how the encoding works.
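The geometric claim is easy to check. With a scaled embedding, the distance between the vectors for x and y is |x - y| times the length of the base vector. A small sketch, again with a made-up `NUM_EMBEDDING` standing in for the learned one:

```python
import math

# Hypothetical stand-in for the learned [NUM] embedding vector.
NUM_EMBEDDING = [0.5, -1.0, 0.25, 2.0]


def embed_number(value: float) -> list[float]:
    """Scale the shared [NUM] embedding by the number's value."""
    return [value * w for w in NUM_EMBEDDING]


# Length of the base vector; every pairwise distance is a multiple of it.
base_norm = math.sqrt(sum(w * w for w in NUM_EMBEDDING))

# Euclidean distances between embedded numbers.
d_9_10 = math.dist(embed_number(9), embed_number(10))
d_9_90 = math.dist(embed_number(9), embed_number(90))
```

Here `d_9_10` equals `base_norm` and `d_9_90` is 81 times that, so 9 sits next to 10 and far from 90 without the model having seen either pairing in text.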

The outputs follow the same scheme — the model predicts a [NUM] token plus a continuous scalar. Which means it can output 7.3812 without that number ever having appeared in training. It can interpolate. It can, theoretically, extrapolate. These are things the digit-by-digit approach cannot do in any principled sense.
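On the output side, the scheme amounts to two predictions per position: the usual distribution over tokens, and a scalar that only matters when the chosen token is [NUM]. A minimal decoding sketch, assuming a greedy argmax and a hypothetical three-word vocabulary (the function name and inputs are invented for illustration):

```python
NUM_TOKEN = "[NUM]"


def decode_step(token_logits: list[float], number_value: float, vocab: list[str]) -> str:
    """Pick the highest-scoring token; if it is [NUM], emit the scalar instead."""
    best = max(range(len(token_logits)), key=lambda i: token_logits[i])
    token = vocab[best]
    if token == NUM_TOKEN:
        # The number head's scalar is used directly, so any real value can
        # come out, whether or not it ever appeared in training text.
        return str(number_value)
    return token


vocab = ["the", NUM_TOKEN, "is"]
as_number = decode_step([0.1, 2.3, -0.5], 7.3812, vocab)  # [NUM] wins
as_word = decode_step([3.0, 0.2, 0.1], 7.3812, vocab)     # "the" wins
```

The scalar is ignored whenever an ordinary token wins, which is why the same model can still produce plain text.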

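They test this on synthetic arithmetic, temperature forecasting from climate data, and simulated planetary orbits (the kinds of problems where numbers are the entire point), and the results are what you'd expect once you've accepted that the original approach was a category error: the continuous encoding holds up better on values the model has not seen in training.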

The reason this didn't happen sooner is probably that language modeling was built for language, and language doesn't have numbers the way physics does. "The meeting is at 3pm" and "the meeting is at 4pm" don't need to be numerically close for a text model to handle them: they differ by one character in spelling, and treating "3" and "4" as unrelated symbols works just fine. The failure mode only becomes obvious when you need the model to actually compute things, or to generalize across numerical magnitudes, or to handle scientific notation without special-casing it.

The Polymathic project exists precisely because they hit that wall. You can't build a foundation model for science using the same tokenization scheme designed for tweet threads.

What I find genuinely interesting — and slightly uncomfortable — is that this implies every benchmark result on numerical reasoning for every major LLM was measuring something like "how well can a model that fundamentally doesn't understand numbers fake understanding numbers." The answer is: pretty well, actually. Well enough that it wasn't obviously broken. Well enough that fixing it wasn't a priority.

It's a strange thing to be good enough at being wrong that the wrong approach survives for years.

Whether xVal or something like it makes it into the next generation of architectures that people actually train at scale is a different question. The inertia on tokenization schemes is enormous — changing the vocabulary means changing the entire pipeline. But the argument that digit-by-digit tokenization is the correct approach for numerical data has always been weak, and now there's a clean alternative sitting in the open.

The math was never the problem. The representation was.