expectedwrong hindsight

Your Clever Prompt Is Already Obsolete

OPRO automates away hand-crafted prompting tricks, and Mistral just proved 7B parameters can be embarrassing for everyone else.

3 min read 636 words #llm #prompting #mistral #open-source #research
hindsight — nailed it

Every clever prompt from 2023 is obsolete. Chain of Thought is built into the models now. The breathing thing does nothing measurable. The prompt engineering era collapsed into the model capability era — the models got good enough that the elaborate prompting rituals stopped mattering.

There is a particular flavor of smugness that comes from having a good system prompt. You've got Chain of Thought in there, you've got the breathing thing — "take a deep breath and work through this step by step" — you've run the benchmarks on your little test suite, you've seen the numbers improve. You know the trick. You are, in some small and privately satisfying way, a prompt engineer.

Google DeepMind would like you to know that you were doing gradient descent by hand, badly.

The paper is called "Large Language Models as Optimizers" — OPRO — and the move is almost insultingly simple: instead of humans iterating on prompt text, you describe the optimization problem to the LLM in natural language and let it propose, test, and refine candidate prompts across many iterations. The LLM reads the previous attempts and their scores, generates new candidates, and slowly converges on something better than you would have written. One of the best-performing prompts it found, on math benchmarks, was "Take a deep breath and work on this problem step by step" — which is exactly the prompt that went viral a few months ago and ended up in everyone's system configs, including mine.

The machine found the trick that humans had been passing around like a folk remedy. Then it kept going and found better ones.

This is the part that should give you pause — not that the method works, but that the optimal prompts are not the ones humans would write. They're not particularly intuitive. They don't sound like anything you'd type unprompted. The gap between "what a person thinks sounds like good instruction" and "what actually moves the needle on benchmark performance" is apparently wide enough for an optimization loop to live in.

So: CoT, good. Breathing prompts, good. Both currently being outcompeted by a procedure that costs a few API calls to run and requires no taste whatsoever.


Separately, and I'm putting these in the same post because they happened the same week and both made me feel like I needed to recalibrate something — Mistral AI dropped their 7B model paper.

Mistral is a French startup that didn't exist a year ago, staffed heavily by ex-DeepMind and Meta people, and they have produced a 7-billion parameter model that beats Llama 2 13B across every benchmark they measured. Not some benchmarks. All of them. And it gets close enough to Llama 2 34B on reasoning tasks that the size gap stops looking like an explanation and starts looking like an excuse.

The technical pieces — grouped-query attention, sliding window attention — are real architectural choices that buy you speed and longer effective context. But the benchmark numbers aren't a story about architecture. They're a story about training, about what you choose to learn from and how, and Mistral is apparently very good at that part in ways that haven't fully been explained yet.

They released it Apache 2.0. Full commercial use. Take it, fine-tune it, serve it, sell products on top of it, do whatever you want.

A startup that is months old just open-sourced a model that makes Meta's 13B look underprepared, and nobody outside of a few circles seems to be processing what that means yet. The model is going to get fine-tuned approximately ten thousand times in the next thirty days. Some of those fine-tunes are going to be exceptional and also free. The "open source LLMs are a year behind closed ones" discourse is going to need significant revision.


Both papers are saying some version of the same thing, actually — that the assumptions you've been optimizing against are wrong. Your hand-crafted prompts are not at the frontier. Your sense of what parameter count implies about quality is not calibrated correctly.

The week is only Friday.