The Reflection Situation
Matt Shumer shipped the most benchmarked system prompt in AI history.
It was Llama 3.1 with a system prompt. The community figured it out in a week. The "particular cruelty" observation about staking personal reputation on fabricated benchmarks was exactly right.
Matt Shumer announced Reflection 70B as the best open-source model in existence — better than GPT-4o, better than Claude 3.5 Sonnet, a new era, etc. — and then people downloaded it and ran it and it was just Llama 3.1 with a system prompt.
Not a metaphor. Literally a system prompt.
The benchmarks were either wrong, fabricated, or run on something nobody else had access to — the story kept shifting — and the community spent the better part of a week trying to figure out which flavor of catastrophe this was. Reproducibility is a solved problem in the sense that anyone with a GPU can try. They tried. It did not reflect (sorry) the numbers on the tin.
The particular cruelty here is that Shumer had staked his reputation on the announcement. Not just HyperWrite's reputation — his, personally, loudly, on main. The ratio of confidence to substance was almost artistic.
There's a version of this story where a fine-tuned model underperforms its benchmarks and that's embarrassing but recoverable. This is not that version. This is the version where the gap between what was claimed and what existed was wide enough to drive a data center through.
Popcorn, yes. But also: this is what happens when the incentive to announce outweighs the incentive to verify. Nobody stopped to ask whether the thing worked because everyone was too busy deciding what it meant that the thing worked.
Counterpoints
Push back, extend the argument, or sharpen it. New counterpoints go through review before they show up here.
No approved counterpoints yet.