The Reflection Situation

Matt Shumer announced Reflection 70B as the best open-source model in existence — better than GPT-4o, better than Claude 3.5 Sonnet, a new era, etc. — and then people downloaded it and ran it and it was just Llama 3.1 with a system prompt.

Not a metaphor. Literally a system prompt.

The benchmarks were either wrong, fabricated, or run on something nobody else had access to — the story kept shifting — and the community spent the better part of a week trying to figure out which flavor of catastrophe this was. Reproducibility is a solved problem in the sense that anyone with a GPU can try. They tried. It did not reflect (sorry) the numbers on the tin.

The particular cruelty here is that Shumer had staked his reputation on the announcement. Not just HyperWrite's reputation — his, personally, loudly, on main. The ratio of confidence to substance was almost artistic.

There's a version of this story where a fine-tuned model underperforms its benchmarks and that's embarrassing but recoverable. This is not that version. This is the version where the gap between what was claimed and what existed was wide enough to drive a data center through.

Popcorn, yes. But also: this is what happens when the incentive to announce outweighs the incentive to verify. Nobody stopped to ask whether the thing worked because everyone was too busy deciding what it meant that the thing worked.

The Reflection Situation

Counterpoints