
Finally, A Video I Can Send Instead of Talking

There is a specific kind of conversational fatigue that builds when you've explained the same four words forty-seven times.

1 min read 183 words #llms #evals #testing #ml-engineering
hindsight — nailed it

AI evals became an entire product category: Braintrust, LangSmith, Patronus, dozens more. The "assert, test, eval, metrics" mantra is now orthodoxy. The shoulder-raise tell probably went away.

There is a specific kind of conversational fatigue that builds when you've explained the same concept to forty-seven different people across forty-seven different Slack threads and Zoom calls and pub conversations that were supposed to be about something else.

The concept, in this case: if your LLM system doesn't have evals, it doesn't have tests. If it doesn't have tests, you don't know if it works. If you don't know if it works, the thing you shipped is a vibe, not a product.

Assert. Test. Eval. Metrics.

Four words. Same words every time. Delivered, increasingly, through clenched teeth.

I developed a tell — a slight shoulder-raise whenever someone presents their "AI pipeline" and I have to ask, gently, whether anything is actually measuring whether it produces correct outputs. The answer is almost always no. The follow-up is almost always some variation of "but it seems to work pretty well in practice." Practice meaning: I tried it a few times and it didn't embarrass me.
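For anyone who wants the four words in code rather than through clenched teeth, here is a minimal sketch of what "measuring whether it produces correct outputs" can look like. Everything in it is an assumption made for illustration: `fake_pipeline` stands in for a real LLM call, the three cases and the 0.9 threshold are invented, and substring matching is about the crudest grader there is.

```python
from typing import Callable

# Hypothetical labeled cases: (prompt, substring the output must contain).
CASES = [
    ("What is 2 + 2?", "4"),
    ("What is the capital of France?", "Paris"),
    ("Spell 'cat' backwards.", "tac"),
]

# Hypothetical quality bar; pick one that matches your product.
THRESHOLD = 0.9


def fake_pipeline(prompt: str) -> str:
    """Stand-in for a real LLM call. Gets one case wrong on purpose."""
    canned = {
        "What is 2 + 2?": "The answer is 4.",
        "What is the capital of France?": "Paris is the capital.",
        "Spell 'cat' backwards.": "act",
    }
    return canned.get(prompt, "")


def run_eval(pipeline: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    passed = sum(1 for prompt, expected in cases if expected in pipeline(prompt))
    return passed / len(cases)


accuracy = run_eval(fake_pipeline, CASES)
meets_bar = accuracy >= THRESHOLD
print(f"accuracy: {accuracy:.2f}, meets bar: {meets_bar}")
```

The stand-in passes two of three cases, so it fails the bar. That is the point: "it seems to work pretty well" becomes a number you can watch move when you change a prompt.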

Anyway. This video exists now. I'm going to start sending it instead of talking. My jaw needs the rest.