The Half-Day Window

Microsoft's Phi-2 is a 2.7B model that beats 7B models, and Google had about twelve hours to feel good about Gemini Nano.

Microsoft just dropped Phi-2, a 2.7 billion parameter model that outperforms the 7B class — the class that, until recently, was the thing you pointed to when you wanted to prove small models were getting serious.

It comes close to Llama 2 70B overall and, by Microsoft's own numbers, beats it on coding and math. It crushes Gemini Nano. It is, numerically speaking, a 2.7 billion parameter model trading blows with a 70 billion parameter model, which is either a miracle of data curation or a sign that we've been training models extremely wrong for a very long time. Probably both, in proportions nobody has worked out yet.

Google dropped Gemini this week. There was, by my count, about half a day where Gemini Nano was the impressive small model story. Then Phi-2 landed and the website started buckling under load, and now Gemini Nano is a footnote in someone else's benchmark table. Not because Google did anything wrong — Gemini is a real thing that exists and works — but because the gap between "we shipped" and "already outdone" has compressed to the point where it barely registers as a gap at all. There was so much promise for half a day.

Phi-2 is not open source. This will matter for approximately one to two weeks, adjusted for holiday slowdown, after which someone will have trained something comparable using the same synthetic-data trick and released it under an Apache license with a name that sounds like a Greek letter or a small animal.

Meanwhile, Mistral — six hours ago, as I'm writing this — became the first of the major model builders to remove the competing-use restriction from their ToS. Which is either a sign of confidence or a sign that the restriction was already unenforceable and they decided to get ahead of looking foolish. Probably both. The frontier moves fast enough now that the legal scaffolding is always running about three model generations behind the actual situation.

The interesting thing about Phi-2 isn't the benchmark numbers. It's what the benchmark numbers imply about where the floor lands next year — 1B outperforming 3B, 500M doing what we currently need a laptop GPU for. At some point the question stops being "how big does the model need to be" and starts being "what were we even measuring."

Nobody has a clean answer to that yet. But the website is down, so at least people are paying attention.
