A Year of Building with LLMs, and What They Learned Was Mostly "Be Boring"

The O'Reilly Part II post lands and the main lesson is that production AI is a logging problem.

3 min read 594 words #llms #production-ai #evaluation #agents #engineering
hindsight — nailed it

'be boring' became the consensus advice for LLM production. evals, logging, observability — the unsexy infrastructure mattered more than the model. the observation that the hard part was never the model was the most important lesson of 2024.

The second installment of the O'Reilly "year of building with LLMs" series is out, and it is — somewhat appropriately — more exhausted than the first one. Less giddy. More scars.

The core thesis, reassembled from the wreckage of a dozen production deployments: the hard part was never the model. The hard part was everything around the model that you didn't think would be hard.

Logging. Evals. Knowing when something broke. Knowing what broke. Knowing what "broke" even means when your system is probabilistic and your users are humans who will accept almost anything if it is delivered in a confident enough tone.


The section on evaluation is the one that should be pinned to every AI team's wall. LLM-as-judge works — they're not saying it doesn't — but it comes with a list of failure modes long enough to make you reconsider. Position bias. Length bias. The model preferring its own outputs when asked to compare them to alternatives, which is a kind of narcissism that no benchmark was designed to catch. The advice is to validate your automated metrics against human judgments before you trust them, which sounds obvious and which almost no one does before they ship.

The honest version of this section is: human evaluation of a hundred examples is still the most reliable thing you have, it's just the least scalable thing you have, and the industry has been trying to automate its way out of that tension for two years now with mixed results.
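For what it's worth, the "validate against humans first" step is small enough to sketch. The snippet below is my own illustration, not something from the article: it assumes you have a batch of examples labeled both by a person and by an LLM judge, and it reports raw agreement alongside Cohen's kappa (my choice of statistic, since raw agreement flatters a judge that says "pass" to everything). The labels are made up.

```python
from collections import Counter


def agreement(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Fraction of examples where the judge matches the human label.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    expected = sum(human_counts[k] * judge_counts[k] for k in human_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so report perfect agreement instead of dividing by zero.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical labels for ten examples (illustrative data only).
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass"]
judge = ["pass", "pass", "pass", "pass", "fail", "fail", "pass", "fail", "fail", "pass"]

raw, kappa = agreement(human, judge)
print(f"raw agreement: {raw:.2f}, kappa: {kappa:.2f}")
```

On this toy data the judge agrees with the human on 8 of 10 examples, but kappa comes out lower because some of that agreement would happen by chance. What threshold counts as "good enough to trust" is a judgment call the code cannot make for you.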


On agents: they are harder than they look, and the reason is error compounding, a problem so basic it would embarrass a sophomore algorithms student if it appeared on a homework set. Each step in a multi-step pipeline has some failure rate, and if the failures are independent, the per-step success rates multiply. A five-step agent where each step is ninety percent reliable finishes correctly only about 59 percent of the time (0.9^5 ≈ 0.59), which is not much better than a coin flip. Nobody thinks about this until they're staring at production logs at midnight.
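The arithmetic is worth running once, if only to feel bad about it. This is a back-of-the-envelope sketch under a simplifying assumption of my own: every step has the same reliability and steps fail independently. Real pipelines are messier, and the function names are just illustrative.

```python
import math


def end_to_end_success(step_reliability, steps):
    """Probability that every step succeeds, assuming independent steps."""
    return step_reliability ** steps


def max_steps_above(step_reliability, target):
    """Largest step count that keeps end-to-end success at or above target."""
    if not (0 < step_reliability < 1 and 0 < target < 1):
        raise ValueError("both arguments must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(step_reliability))


for n in (1, 3, 5, 10):
    print(f"{n:>2} steps at 90% each -> {end_to_end_success(0.9, n):.1%} end to end")

print("steps you can afford before dropping below 50%:", max_steps_above(0.9, 0.5))
```

At ninety percent per step, the budget before you fall below even odds is six steps. The useful part is running it the other way: pick the end-to-end number you can live with and see how reliable each step has to be.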

The prescription is decomposition — smaller, testable subtasks, human-in-the-loop checkpoints where stakes are high — which is the same prescription you'd give for any distributed system with unreliable components. The agents are not magic. They are distributed systems. They fail like distributed systems fail.
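A minimal sketch of what that decomposition might look like, with heavy caveats: the `Step` and `run_pipeline` names are hypothetical, the "model calls" are plain lambdas standing in for real ones, and the approval callback is a placeholder for whatever human review you actually have. The point is only the shape: each subtask has its own check, and high-stakes steps stop for a person.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]     # the subtask itself (a model call in practice)
    check: Callable[[Any], bool]  # a cheap, testable validation of its output
    high_stakes: bool = False     # if True, require human approval to continue


def run_pipeline(steps, payload, approve):
    """Run steps in order; stop at the first failed check or rejected approval."""
    for step in steps:
        payload = step.run(payload)
        if not step.check(payload):
            return {"status": "failed", "step": step.name, "output": payload}
        if step.high_stakes and not approve(step.name, payload):
            return {"status": "rejected", "step": step.name, "output": payload}
    return {"status": "ok", "step": None, "output": payload}


# Stand-in subtasks (assumptions for illustration, not real model calls).
steps = [
    Step("extract_amount",
         run=lambda text: float(text.split("$")[1]),
         check=lambda amount: 0 < amount < 10_000),
    Step("draft_refund",
         run=lambda amount: f"Refund of ${amount:.2f} approved",
         check=lambda message: message.startswith("Refund"),
         high_stakes=True),
]

result = run_pipeline(steps, "Customer asks for $42.50", approve=lambda name, out: True)
print(result)
```

Because every step reports where it stopped, a failure shows up as "extract_amount failed its check" rather than as a wrong answer three steps later, which is most of what you wanted from the logs anyway.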


What I think this article gets exactly right is the implicit argument against complexity — the consistent, almost weary insistence that a single well-prompted call beats an elaborate orchestration pipeline in most cases, and that the elaborate pipeline only becomes justified when you've exhausted the single-call approach and have the evals to prove it.

What it hedges on is the organizational side. There's a section on teams and roles that reads like it was written by someone who sat through too many all-hands where nobody could agree on who owns the eval pipeline, which is a real problem described in the softest possible terms. The actual situation — that most companies building on LLMs have no one whose job is to know whether the system is working — gets a paragraph when it deserves a chapter.


The thing that sticks with me is that this article is, in some sense, the field writing down what it already knew but hadn't said out loud. A year of building with LLMs, and the main takeaway is: log everything, evaluate honestly, don't use agents when a prompt will do, and treat the model like the unreliable external service it is.

Which is exactly what you'd say about any external API after a year in production. The LLMs were just loud enough that we forgot.
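If you want the boring version in code, here is roughly what "log everything and treat it like an unreliable service" could look like. This is a sketch under my own assumptions: `call_model` is a hypothetical stand-in for whatever client function you use, and the retry count and log fields are arbitrary choices, not recommendations from the article.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_calls")


def logged_call(call_model, prompt, max_attempts=3, backoff_seconds=0.0):
    """Call the model with retries, logging one JSON record per attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    request_id = str(uuid.uuid4())
    for attempt in range(1, max_attempts + 1):
        record = {"request_id": request_id, "attempt": attempt, "prompt": prompt}
        started = time.perf_counter()
        try:
            output = call_model(prompt)
        except Exception as exc:
            record.update(status="error", error=repr(exc),
                          latency_ms=round((time.perf_counter() - started) * 1000, 2))
            logger.warning(json.dumps(record, default=str))
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
        else:
            record.update(status="ok", output=output,
                          latency_ms=round((time.perf_counter() - started) * 1000, 2))
            logger.info(json.dumps(record, default=str))
            return output


# A fake model that fails twice before answering (illustration only).
calls = {"count": 0}


def flaky_model(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("upstream timed out")
    return f"echo: {prompt}"


logging.basicConfig(level=logging.INFO)
answer = logged_call(flaky_model, "summarize the incident")
print(answer)
```

Nothing in it is specific to LLMs, which is rather the point.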