Huge If True

Matt Shumer dropped Reflection-70B today and the benchmark numbers are, depending on your threshold for believing benchmark numbers, either historically significant or elaborate fiction. No middle ground on this one.

The claim is that a 70B model — one that fits on hardware real people own — beats GPT-4o and Claude 3.5 Sonnet. The model talks to itself in hidden reasoning blocks, catches its own mistakes, corrects them, hands you the fixed answer. Reflection-Tuning, he's calling it.

Shumer is a hyperbolist. This is documented. HyperWrite's entire brand is announcing things at maximum volume. But there's a difference between overstating your writing assistant and claiming you've beaten the frontier on a model the community can actually download and run themselves. The open-source part is what makes this particular claim interesting — there's no place to hide. Either the weights do what he says or they don't, and everyone gets to find out simultaneously.

That's the reputational bet he's making. He has a reputation worth ruining, which is the only reason I'm paying attention instead of filing this under "founder hype." But "beating GPT-4o, locally, at 70B" is not the usual hype turned up a few notches. That's a different category of statement. Either he just did something remarkable or he's done.

Playground is live. Running the 70B myself tonight.


Counterpoints