You Don't Control the Similarity

Cosine similarity feels like measuring meaning — it's measuring something else entirely.

Semantic search via embedding cosine similarity is one of those ideas that feels so clean you don't question it. You turn text into vectors, you compute the angle between them, and a small angle is taken to mean similar meaning. The geometry is satisfying. The math is simple. The results are usually good enough that nobody looks too hard at what's actually happening.
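For concreteness, here is a minimal sketch of that computation in plain Python. The function name `cosine_similarity` and the three-dimensional vectors are illustrative assumptions, not output from any real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings; a real model would produce hundreds of dimensions.
query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]  # same direction as the query, different length
orthogonal = [0.0, 3.0, 0.0]      # at a right angle to the query

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

The function only ever sees numbers. Nothing in it knows what the vectors were supposed to mean; that was decided wherever the vectors came from.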

What's actually happening: you don't control the tokenizer.

The tokenizer is upstream of everything. It determines how your text gets sliced before the model ever sees it, which shapes what the model encodes, which shapes what the embedding space looks like, which shapes what cosine similarity measures. That entire pipeline — every decision in it — was made by someone else, trained on a corpus you didn't choose, optimized for objectives that may have nothing to do with your definition of "similar."
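To make the slicing concrete, here is a toy sketch. It is not BPE or WordPiece, just a greedy longest-match splitter, and both vocabularies are invented for illustration. It shows how the same word can reach the model as different pieces depending on a vocabulary you did not pick.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match segmentation; unmatched characters become one-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

# Two hypothetical vocabularies, standing in for tokenizers trained on different corpora.
general_vocab = {"card", "io", "my", "o", "pathy"}
medical_vocab = {"cardio", "myo", "pathy"}

word = "cardiomyopathy"
print(greedy_tokenize(word, general_vocab))
print(greedy_tokenize(word, medical_vocab))
```

With the first vocabulary the word arrives as five fragments, one of which is "card"; with the second it arrives as three pieces that line up with its medical roots. Everything downstream is built on whichever split the model was given.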

A recent paper makes the case directly. Cosine similarity of embeddings isn't a clean measurement of semantic proximity; it's entangled with co-occurrence statistics, the training distribution, and frequency effects. Two things that appear in similar contexts in the training data will embed nearby, regardless of whether they mean the same thing. Two things that mean the same thing but appear in different contexts won't.
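The context effect can be reproduced with a deliberately crude stand-in for an embedding model: vectors that simply count which words share a sentence with the target. Real models are far more elaborate, and the corpus, function names, and word choices below are all made up for illustration, but the sketch shows the same pull.

```python
import math
from collections import Counter

def context_vector(target, sentences):
    """Count the words that share a sentence with `target`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if target in words:
            counts.update(w for w in words if w != target)
    return counts

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus: the antonyms share contexts, the synonyms do not.
corpus = [
    "the soup is hot today",
    "the soup is cold today",
    "the tea is hot today",
    "the tea is cold today",
    "my doctor wrote a prescription",
    "my doctor ordered some tests",
    "physician burnout rates keep rising",
    "survey of physician staffing levels",
]

hot, cold = context_vector("hot", corpus), context_vector("cold", corpus)
doctor, physician = context_vector("doctor", corpus), context_vector("physician", corpus)

print(f"hot vs cold:         {cosine(hot, cold):.2f}")
print(f"doctor vs physician: {cosine(doctor, physician):.2f}")
```

In this corpus the opposites "hot" and "cold" come out nearly identical because they occupy the same sentence slots, while "doctor" and "physician" share no context words and score zero. The numbers describe how the corpus was written, not what the words mean.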

This isn't a bug in a specific model. It's the shape of the approach.

The practical version of this: when your semantic search returns something unexpected — too similar, not similar enough, weird neighbors — you have almost no levers to pull. You can't retrain the tokenizer. You can't adjust the embedding objective. You picked an off-the-shelf model and you're living in its notion of semantic space now.

Most of the time it works fine and people ship it and move on. The failure modes are quiet — a search that doesn't surface something it should, a recommendation that feels slightly off, a retrieval system that degrades on domain-specific language because the training corpus didn't weight that domain. Nobody files a bug. The embedding model gets the credit when it's right and the data gets the blame when it's wrong.

The geometry was always decorative. You were measuring training distribution the whole time.
