Everything Is Converging to the Same Thing

The Platonic Representation Hypothesis says sufficiently large models are all finding the same reality, regardless of what they were trained on.

There is a new paper out of MIT — Huh, Cheung, Wang, Isola — making a claim that is either the most important thing anyone has said about large models in a while, or a very elegant way to be completely wrong.

The claim: as models get bigger and train on more data, their internal representations converge. Not just within a modality. Across modalities. Vision models and language models, trained on entirely different data, using entirely different architectures, with entirely different objectives, are groping their way toward the same statistical structure.

They call it the Platonic Representation Hypothesis. As in Plato. As in there is a cave, and the shadows on the wall, and a real thing casting the shadows — and the real thing is a shared model of reality that every sufficiently trained neural network is independently discovering.

This is a wild claim stated very calmly.

The evidence: take a large vision model and a large language model, show them paired inputs (an image and its caption, say), and measure the similarity of their internal representations using something like Centered Kernel Alignment, which compares how each model arranges the same set of inputs relative to one another. They come out more similar than you'd expect from two systems that never saw each other's data. And the larger the models, the more similar they get. Scale drives convergence. Small models stay weird and idiosyncratic. Big models start to agree.
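If "compares internal geometry" sounds hand-wavy, here is roughly what the linear flavor of Centered Kernel Alignment computes. This is a minimal sketch, not the paper's evaluation code: the function name linear_cka and the random matrices standing in for model embeddings are my own assumptions, and real use would pass in activations from two models on the same inputs, one row per input.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (one row per input)."""
    # Center each feature so the score ignores constant offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by
    # the Frobenius norms of each representation's own covariance.
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
n = 500

# Hypothetical stand-in for model A's embeddings of n inputs.
feats_a = rng.normal(size=(n, 64))

# "Model B": the same geometry expressed in a rotated basis.
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
feats_b = feats_a @ rotation

# "Model C": unrelated features with a different width.
feats_c = rng.normal(size=(n, 32))

same = linear_cka(feats_a, feats_b)
unrelated = linear_cka(feats_a, feats_c)
print(f"rotated copy: {same:.3f}, unrelated: {unrelated:.3f}")
```

The point of the rotated copy is that the score only cares about how inputs sit relative to each other, not which neuron encodes what, and the two models don't even need the same width. That is what makes it possible to compare a vision model with a language model at all.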

Which implies something uncomfortable: the specific thing you trained on might matter less than you think. The modality might matter less than you think. If you train long enough, on enough data, you will find the thing. Whatever the thing is.

The Plato framing is doing real work here; it isn't just decorative philosophy. The allegory is specifically about how different observers, seeing different shadows from different angles, are all seeing projections of the same underlying reality. The paper is saying that's what's happening inside these networks. Different training regimes, different inputs, same destination.

I keep turning this over. The implication isn't just academic: there might be a natural endpoint to representation learning, and we're watching multiple independently launched expeditions all triangulate toward the same coordinates without talking to each other. GPT-4 and a vision model you've never heard of, converging in hyperspace.

The alternative explanation is that both kinds of models are finding the statistical structure of the data they were trained on, and that data (human language, human images) reflects a shared underlying world, so of course the representations rhyme. Which is technically less dramatic but actually equally strange when you think about it for more than thirty seconds.
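You can watch the less dramatic explanation happen in a toy. The sketch below is entirely my own construction, not anything from the paper: a small random "world" state, two made-up "modalities" that are different random nonlinear views of it, and a control view of an unrelated world. The score is a simplified mutual nearest-neighbor overlap (do the two spaces agree on who each point's neighbors are?), and every name and size here is an assumption chosen for illustration.

```python
import numpy as np

def knn_indices(feats, k):
    """Indices of each row's k nearest neighbors (Euclidean), excluding itself."""
    sq = (feats ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * feats @ feats.T
    np.fill_diagonal(dists, np.inf)
    return np.argsort(dists, axis=1)[:, :k]

def mutual_knn(feats_a, feats_b, k=10):
    """Average fraction of k nearest neighbors that the two spaces share."""
    nn_a = knn_indices(feats_a, k)
    nn_b = knn_indices(feats_b, k)
    overlap = [len(set(a) & set(b)) for a, b in zip(nn_a, nn_b)]
    return float(np.mean(overlap)) / k

rng = np.random.default_rng(0)
n, latent_dim = 400, 4

# Hypothetical shared "world" state, plus an unrelated one as a control.
world = rng.normal(size=(n, latent_dim))
other_world = rng.normal(size=(n, latent_dim))

def random_view(latent, width):
    """A made-up 'modality': a random nonlinear projection of the latent state."""
    weights = rng.normal(size=(latent.shape[1], width)) / np.sqrt(latent.shape[1])
    return np.tanh(latent @ weights)

view_image = random_view(world, 64)          # one view of the shared world
view_text = random_view(world, 48)           # a different view of the same world
view_control = random_view(other_world, 48)  # a view of something else entirely

shared_score = mutual_knn(view_image, view_text)
control_score = mutual_knn(view_image, view_control)
print(f"same world: {shared_score:.3f}, different world: {control_score:.3f}")
```

Nothing in the two views was trained to agree. They agree on neighborhoods because they are looking at the same thing, which is the whole alternative explanation in about forty lines.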

Either way: these things are less different from each other than anyone thought. That's the sentence.

There's a separate thing going around right now — quiz.cord.com — that's a good parlor trick for showing this to people who haven't thought much about what's actually happening inside these models. Worth bookmarking for when someone asks you to explain what LLMs "really do." The gap between what people imagine (a very fast search engine, a stochastic parrot) and what the geometry actually looks like is significant.

The PRH paper is the weirder and more important thing. It is asking whether, under the hood, everything is the same thing.

Counterpoints