expectedwrong hindsight

Llama 3.3 70B Is GPT-4

And the only honest way to know that is to run them side by side at the same time.

2 min read 307 words #llama #open-source-models #model-comparison #graphchat #inference
hindsight — nailed it

Llama 3.3 70B was genuinely GPT-4 quality. The "thinks bigger" observation was the right way to describe what changed. Open models matching the frontier became the new normal.

Llama 3.3 70B is GPT-4. That's it. That's the thing.

There's a quality it has that I've been calling "thinks bigger" for lack of a better phrase — not verbose, not confident in the fake way, but reaching. The model extends an idea past the obvious stopping point, lands somewhere you didn't predict, treats your prompt like it actually wants to know where the answer goes.

Gemma does a version of this. Gemma's version involves exclamation marks and two sentences about why your question is so interesting before it tells you nothing. Llama 3.3's version is different. It just thinks — in the way a person sometimes thinks, where the thought gets further along than expected without the theatrical performance of getting there.

The only way to actually know this is to compare. Not benchmark. Compare — run the same prompt against everything simultaneously and read the outputs next to each other.

This is so obviously the right methodology that it's strange how rarely tooling supports it properly. A benchmark tells you a number someone computed under conditions you don't control. A comparison tells you what it feels like to work with the thing, today, on your prompts, which is the only information that was ever relevant.

What I'm building — graphchat — runs an async gather against the full model list at once. Ollama, OpenAI, Anthropic, theoretically unlimited, no sequential waits, everything back simultaneously. There's an optional LLM-as-judge layer for when you want a verdict. But the more useful thing is just seeing the outputs next to each other, because the differences are obvious in a way no leaderboard captures.

LibreChat figured this out. Most tools haven't. Model comparison isn't a power-user feature. It's the epistemic baseline — the thing you need before you can claim to know anything about any of these models at all.