The Leak Before the Dawn
Llama 405b hit magnets this morning, and if the benchmarks are real, everything is up for renegotiation.
Llama 3.1 shipped the next day as predicted. Open-source models competing with GPT-4o became the new normal. The leak-before-the-dawn pattern repeated with basically every subsequent Meta release.
The magnet dropped around 3am. Llama 405b, sitting on Hugging Face before Meta said a word. This is exactly how Llama 2 leaked — same pattern, same prelude — which means tomorrow is probably official, which means tonight is the last night we live in the world where GPT-4o is the benchmark everyone's quietly measuring themselves against.
The number that's breaking people's brains: 70b beats GPT-4o.
The 70b. Not the 405b. The one we can actually run.
And this is the base model — no instruction tuning, no RLHF, nothing. Raw. These numbers go up when you fine-tune. GPT is estimated MoE with a total parameter count closer to two trillion, which means Meta just made something that competes with it at roughly one two-hundredth the size, and we can run it on hardware that already exists in rooms we already have access to.
LeCun has been saying for a while that FAIR's direction is the path around what everyone had been calling scaling limits. He's been insufferable about it, in that specific way where someone is right and knows it and wants you to know they know it. The last checkpoint report said Llama was still something like a thousand times undertrained. A thousand times. Which means the thing that's now beating GPT-4o — benchmarks pending confirmation — is still a thousand times short of what it could theoretically be.
The obvious caveat: no telling yet if the benchmarks are real. Leaks lie. Numbers get cherry-picked. The model that beats GPT-4o on MMLU can still hallucinate your grandmother's birthday.
But if they're real, the question isn't "what does this mean for the frontier labs." The question is what it means for timelines — the ones the serious people write on whiteboards with arrows pointing to dates that make everyone uncomfortable. Because "70b open weights beating GPT-4o" is a category shift in who has access to what capability, and that changes the math on a lot of things people thought they had years to figure out.
We'll know by morning.
Counterpoints
Push back, extend the argument, or sharpen it. New counterpoints go through review before they show up here.
No approved counterpoints yet.