
Snowflake Dropped Embedding Models and They're Just Better

A database company strolled into the retrieval benchmark and beat the dedicated AI labs.

2 min read 397 words #embeddings #open-source #retrieval #MTEB #nlp
hindsight — evolved

the embedding model landscape kept churning. arctic embed was SOTA briefly, then overtaken by newer models. the real insight — that data warehouse companies would build embedding models — predicted a broader trend of infrastructure companies moving into AI.

Snowflake, the data warehouse company, the one that closed its first day of trading worth roughly $70 billion, the one whose entire pitch is that you should store your data with them instead of doing anything yourself, has released a family of open-source embedding models called Arctic Embed.

They are, at their respective sizes, state-of-the-art on MTEB retrieval. Not "competitive." Not "within a few points." Best in class.

The family runs from arctic-embed-xs at 22 million parameters up to arctic-embed-l at 335 million: five models, Apache 2.0, drop-in with sentence-transformers. The medium variant, 110M parameters, scores 54.90 NDCG@10 on MTEB retrieval. BGE-base is at 53.25. Nomic-embed is at 53.25. BGE, the model that's been sitting in everyone's RAG pipelines for the last six months, lost to a model from a company that mostly sells cloud SQL.
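If the metric is unfamiliar: NDCG@10 looks at the top ten results for a query, gives credit for each relevant document discounted by how far down the list it sits, and divides by the score a perfect ranking would get. Here is a minimal sketch in plain Python, assuming the linear-gain form of the formula. The function names and the toy relevance labels are mine for illustration and are not taken from MTEB's code.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # rank is 0-based, so the top result is divided by log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """NDCG@k for one query.

    ranked_relevances: relevance grade of each returned doc, in ranked order.
    judged_relevances: grades of every judged doc for the query, any order.
    """
    ideal = dcg_at_k(sorted(judged_relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant documents exist for this query
    return dcg_at_k(ranked_relevances, k) / ideal


# One relevant document, retrieved at rank 1 versus rank 2.
perfect = ndcg_at_k([1, 0, 0], [1])
second = ndcg_at_k([0, 1, 0], [1])
print(perfect, round(second, 4))
```

Leaderboard figures like 54.90 are this quantity scaled to 0-100 and averaged over queries and datasets, so a gap of a point and a half is an average across many rankings, not one lucky query.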

The large variant, at 335M, is explicitly positioned as a replacement for closed embedding APIs — which is either a confident product statement or a threat, depending on which side of the API call you're on.

The training pipeline is what you'd expect if you had access to Snowflake's data: ~400 million query-document pairs for pretraining, hard negative mining during fine-tuning, proprietary web search data mixed in. They have the data. Of course they have the data. That's the entire company.
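Hard negative mining, roughly: for each query, find documents that an existing model ranks highly but that are not labeled relevant, and train against those instead of random ones. The toy sketch below uses hand-written 2-d vectors. The function name `mine_hard_negatives`, the vectors, and the simple top-n selection rule are all assumptions for illustration; this is not Snowflake's actual procedure, which the post doesn't detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mine_hard_negatives(query_vec, doc_vecs, positive_ids, n=2):
    """Return ids of the n non-positive docs most similar to the query."""
    candidates = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in positive_ids  # never pick a labeled positive
    ]
    candidates.sort(reverse=True)  # most similar first
    return [doc_id for _, doc_id in candidates[:n]]


# Hypothetical embeddings: one true match, one near miss, two easy negatives.
docs = {
    "relevant": [1.0, 0.1],
    "near_miss": [0.9, 0.4],
    "off_topic": [0.0, 1.0],
    "unrelated": [-1.0, 0.2],
}
query = [1.0, 0.0]

hard = mine_hard_negatives(query, docs, positive_ids={"relevant"}, n=2)
print(hard)
```

The near miss comes out first, which is the point: a negative the model almost gets wrong carries more training signal than one it already rejects easily.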

What I find genuinely interesting here, not as a press-release compliment but as an actual observation, is what it means when the infrastructure layer starts shipping the intelligence layer. Snowflake's core business is storing your data, vectors included. Now they're also telling you which model to use to generate those vectors. The loop closes. The moat deepens. The thing you put in the database and the thing that makes the database useful are both theirs now.

Apache 2.0, though. Commercial use permitted. So you can take the model, use it for free, run it in your own stack, never touch Snowflake's cloud. That's a real choice they made.

Maybe it's a bet that if you use Arctic Embed you'll eventually want to store the embeddings somewhere, and they have thoughts on where. Maybe it's genuine belief in open-source. Maybe someone on the team just wanted to ship something good and the business case got worked out later. I don't know. The model is good and it's free and that's enough to make the rest of it someone else's problem to analyze.