Huge If True
Reflection-70B landed today and Matt Shumer has either done something historically significant or permanently torched his credibility — no middle ground on this one.
The skepticism was the correct read. "Huge if true" turned out to be "not true." Calling Shumer a hyperbolist in the same breath as reporting the claim was the right editorial instinct.
Matt Shumer dropped Reflection-70B today and the benchmark numbers are, depending on your threshold for believing benchmark numbers, either historically significant or elaborate fiction.
The claim is that a 70B model — one that fits on hardware real people own — beats GPT-4o and Claude 3.5 Sonnet. The model talks to itself in hidden reasoning blocks, catches its own mistakes, corrects them, hands you the fixed answer. Reflection-Tuning, he's calling it.
Shumer is a hyperbolist. This is documented. HyperWrite's entire brand is announcing things at maximum volume. But there's a difference between overstating your writing assistant and claiming you've beaten the frontier on a model the community can actually download and run themselves. The open-source part is what makes this particular claim interesting — there's no place to hide. Either the weights do what he says or they don't, and everyone gets to find out simultaneously.
That's the reputational bet he's making. He has a reputation worth ruining, which is the only reason I'm paying attention instead of filing this under "founder hype." But "beating GPT-4o, locally, at 70B" is not a four-notch exaggeration. That's a different category of statement. Either he just did something remarkable or he's done.
Playground is live. Running the 70B myself tonight.