The Merge Trick

A model finally said the quiet part out loud, and the math on model merging is starting to get embarrassing for everyone who spent money on training runs.

A model admitted something this week that GPT does constantly and has never once acknowledged doing.

That's the tweet. That's the whole thing. Someone caught it, screenshotted it, posted it — and the reason it landed is that we've all quietly watched GPT exhibit the same behavior for months while it stared back at us with the confident vacancy of a consultant who has never once said "I don't know."

The admission itself almost doesn't matter. What matters is the precedent, or the lack of one — that one model will occasionally surface its own failure state while another will walk it to the door and shoot it in the back.

Meanwhile, the 120B situation.

The 120B model that people are losing their minds over right now is not trained on anything new. No new data, no novel architecture, no research breakthrough. Someone took existing models and merged them — stitched the weights together using techniques that have been floating around the community for months — and the result substantially outperforms the 80B it was built from.
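For anyone wondering how merging can make a model bigger at all, here is a toy sketch of one way it can work. It assumes the merge is a layer-stacking job, where overlapping slices of layers from same-architecture donors are concatenated into a deeper model; nothing above confirms that this is the technique behind the 120B. The function name `stack_layers`, the slice plan, and the string "layers" are all made up for illustration.

```python
# Toy sketch of a layer-stacking merge (an assumed technique, not a confirmed one).
# Assumption: both donors share one architecture, so layers are interchangeable.
# Layers are stand-in strings; in a real model each would be a block of tensors.

def stack_layers(model_a, model_b, slices):
    """Build a deeper model by concatenating layer slices from two donors.

    slices: list of (donor, start, end), donor in {"a", "b"}, end exclusive.
    """
    donors = {"a": model_a, "b": model_b}
    merged = []
    for donor, start, end in slices:
        merged.extend(donors[donor][start:end])
    return merged


# Two hypothetical 8-layer donors.
model_a = [f"A{i}" for i in range(8)]
model_b = [f"B{i}" for i in range(8)]

# Overlapping slices: some depth ranges appear twice, so the result is deeper
# than either donor even though no layer was trained or modified.
plan = [("a", 0, 4), ("b", 2, 6), ("a", 4, 8)]
merged = stack_layers(model_a, model_b, plan)

print(len(model_a), "->", len(merged))  # 8 -> 12
print(merged)
```

Twelve layers out of eight is the same 1.5x ratio as 80B to 120B, which is the only reason those toy numbers were picked.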

This should bother people more than it does.

If you can get from 80B to 120B with no new training data and end up with a model that substantially outperforms its source, then the question is not "how do we scale training." The question is "how much capability is sitting in model soup that nobody has combined yet." The answer, apparently, is: a lot.
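"Model soup" is used loosely here, but the term usually refers to something concrete: averaging the weights of several checkpoints that share one architecture. A minimal sketch, assuming a uniform average and checkpoints stored as plain dictionaries of float lists; the function name `average_weights` and the parameter names are hypothetical.

```python
# Toy "model soup": uniform element-wise average of same-shaped weights.
# Assumption: every checkpoint has identical parameter names and shapes.

def average_weights(checkpoints):
    """Return the element-wise mean of a list of {name: [floats]} checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    soup = {}
    for name in names:
        tensors = []
        for ckpt in checkpoints:
            if ckpt.keys() != names:
                raise ValueError("checkpoints must share parameter names")
            tensors.append(ckpt[name])
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"shape mismatch for {name}")
        # zip(*tensors) walks the tensors position by position.
        soup[name] = [sum(vals) / len(tensors) for vals in zip(*tensors)]
    return soup


ckpt_1 = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
ckpt_2 = {"layer0.weight": [3.0, 4.0], "layer0.bias": [1.0]}

soup = average_weights([ckpt_1, ckpt_2])
print(soup)  # {'layer0.weight': [2.0, 3.0], 'layer0.bias': [0.5]}
```

Note that averaging leaves the parameter count unchanged, so a jump from 80B to 120B cannot come from a soup like this alone.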

The speculation for July is Llama 3 400B, and if the gains from merging scale anywhere near linearly with the size of the base model, it is going to be a weird summer.

Nobody planned this. The path to frontier capability ran through a bash script and some linear algebra, and here we are.

Counterpoints