281 Gigabytes and a Dead Architecture
Mistral drops Mixtral 8x22B into a torrent and Google quietly ships a Gemma that isn't a transformer, all in the same afternoon.
MoE proved its worth — GPT-4 uses it, deepseek uses it, everyone uses it. but mixtral specifically was overtaken by llama 3, qwen, and deepseek within months. the 281GB torrent was the democratic moment. the architecture survived. the model didn't stay king.
Some days nothing happens. Today was not one of those days.
Mistral announced Mixtral 8x22B — a thing that exists as a 281-gigabyte torrent, which is a sentence that would have sounded like science fiction three years ago and now just sounds like a Tuesday. The model is, as the name implies, eight experts of 22 billion parameters each — a scaling of the 8x7B architecture that already had no business being as good as it was at a third the active parameters of its competitors.
The timing is not subtle. Cohere shipped Command R+ last week, landed near the top of everything, got praised for being the serious enterprise option. Mistral's answer is to drop something that, if the 8x7B scaling held, should be substantially better — with an Apache license, which is either a genuine ideological commitment or a calculated move to make Command R+ look expensive and restricted by comparison. Possibly both. Probably both.
Nobody knows how good it is yet. It takes a while to download 281 gigabytes.
The second thing that happened today is quieter and more interesting.
Google released RecurrentGemma, which is Gemma except it isn't a transformer. The underlying architecture is Griffin — a thing that mixes gated linear recurrences with local attention windows, described in a February paper that most people filed under "interesting, will revisit." The recurrence is the point. Transformers do quadratic work against sequence length, which is a sentence that sounds technical but just means they get expensive fast and forget things in their own way. Recurrent models are the older idea — they compress history into a state vector, which is either elegant compression or aggressive forgetting depending on how charitable you're feeling.
Nobody planned for recurrent architectures to come back. The transformer ate everything in 2017 and the field moved on. Then Mamba showed up, then RWKV, and now Google is shipping a recurrent Gemma like it's a normal product release and not a quiet acknowledgment that the transformer might not be the final word.
It's 2 billion parameters, so it's not a flagship. It's a signal.
Two major releases on the same afternoon — one that's a better version of a thing we already understand, and one that's a small version of a thing we don't know what to do with yet. The 281-gigabyte one will get all the attention. The 2-billion-parameter one might matter more.
Counterpoints
Push back, extend the argument, or sharpen it. New counterpoints go through review before they show up here.
No approved counterpoints yet.