
Memory Without Correction Is Just Confident Wrongness

Meta proved formally what everyone deploying personalized agents already knows and ignores.


Meta published a paper this week called "Personalized Agents from Human Feedback," and its contribution is a formal proof that AI agents need to do two things to learn what you actually want: ask before they act, and update after they get it wrong.

That's the paper. I am not being reductive. That is the paper — and the reason it matters is they proved it with lower bounds instead of asserting it with demos and a blog post.

The setup is clean. You have an agent with a memory. The memory stores what it thinks you prefer. The problem splits into two failure modes — the agent doesn't know your preferences yet (partial observability), or it knew them but they changed and now it's confidently wrong (non-stationarity). Different diseases. Different medicine. The central theorem says you need both medicines or the infection wins.

Proposition 1 is where the paper earns its page count. If your agent can't learn from post-action corrections — if it only asks clarifying questions before it acts — then under preference drift it accumulates Ω(T) regret. Linear. The kind of regret that means the system is broken and staying broken, forever, because it wrote "Kyle likes Coke" into memory three months ago and now it serves you Coke with the unshakeable confidence of someone who checked their notes. You switched to coffee. The agent doesn't know. It doesn't know it doesn't know. It stopped asking because it thinks it has the answer.

This is the current state of the art in personalization.

The fix, once you see the math, is obvious: let the agent update memory after mistakes, not just before actions. Pre-action clarification handles cold starts. Post-action feedback handles drift. Together: O(K + γ) regret — one mistake per preference switch plus a vanishing term for ambiguity. Separately, you get one of two flavors of broken.
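The gap between the two regimes is easy to see in a toy simulation. This is a sketch under my own assumptions, not the paper's setup: two drinks, a preference that flips on a fixed schedule, an agent that asks only when its memory is empty, and a hypothetical `simulate` function that counts mistakes as a stand-in for regret.

```python
def simulate(T, switch_every, post_action_feedback):
    """Count mistakes over T rounds for a toy two-option preference.

    Assumptions (mine, not the paper's): the preference flips every
    `switch_every` rounds, and the agent asks a clarifying question
    only when its memory is empty.
    """
    true_pref = "coke"
    memory = None  # the agent's stored belief about the user
    mistakes = 0
    for t in range(T):
        # Preference drift: the user flips on a fixed schedule.
        if t > 0 and t % switch_every == 0:
            true_pref = "coffee" if true_pref == "coke" else "coke"
        # Pre-action clarification: only triggered on a cold start.
        if memory is None:
            memory = true_pref
        action = memory
        if action != true_pref:
            mistakes += 1
            # Post-action feedback: the correction overwrites stale memory.
            if post_action_feedback:
                memory = true_pref
    return mistakes


both_channels = simulate(T=1000, switch_every=100, post_action_feedback=True)
pre_action_only = simulate(T=1000, switch_every=100, post_action_feedback=False)
print(both_channels, pre_action_only)
```

With nine switches in 1,000 rounds, the agent with both channels makes 9 mistakes, one per switch. The pre-action-only agent makes 500, and that number grows in proportion to T for as long as you keep running it.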

The paper also proves that a pre-action-only agent, after enough preference drift, degrades below baseline: it becomes actively worse than an agent with no memory at all. Memory without correction isn't just useless. It's harmful. It's the coworker who remembers one thing about you from 2019 and won't let it go.

The benchmarks are robot manipulation (fetching drinks, placing objects) and online shopping (navigating conjunctive acceptance policies, where a single bad attribute poisons an otherwise perfect product). PAHF — both channels enabled — hits 70.5% on the embodied tasks and 41.3% on shopping. The shopping number is low and they know it. They built adversarial near-miss products specifically to stress the memory. 41% means the agent is still mostly failing at this, which is either an indictment of the memory architecture or an honest admission that combinatorial preference spaces are hard. Probably both.

The memory is a bag of natural language strings stored in SQLite, retrieved by cosine similarity. Every user is an island. No transfer — not across time, not across users, not across contexts. User 12 and User 17 might have nearly identical camera preferences and the system will learn them independently, from scratch, making the same mistakes twice. The paper calls this "deliberately simple" and says sophisticated architectures are "complementary," which is the right thing to say when you're isolating variables and also a precise description of where the interesting work lives.
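For a sense of how little machinery that is, here is a minimal sketch of a per-user string memory in SQLite with cosine-similarity retrieval. Everything in it is an assumption for illustration: the `Memory` class, the table layout, and especially `embed`, which is a toy bag-of-words counter standing in for whatever embedding model the paper actually uses.

```python
import math
import sqlite3
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class Memory:
    """Hypothetical per-user note store: plain strings, no shared structure."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE notes (user_id INTEGER, text TEXT)")

    def write(self, user_id, text):
        self.db.execute("INSERT INTO notes VALUES (?, ?)", (user_id, text))

    def retrieve(self, user_id, query, k=1):
        # Only this user's rows are ever considered: every user is an island.
        rows = self.db.execute(
            "SELECT text FROM notes WHERE user_id = ?", (user_id,)
        ).fetchall()
        q = embed(query)
        ranked = sorted(
            (row[0] for row in rows),
            key=lambda note: cosine(q, embed(note)),
            reverse=True,
        )
        return ranked[:k]


mem = Memory()
mem.write(12, "wants a mirrorless camera")
mem.write(12, "drinks coffee in the morning")
hit_for_12 = mem.retrieve(12, "pick a camera")
hit_for_17 = mem.retrieve(17, "pick a camera")
print(hit_for_12, hit_for_17)
```

User 12's camera note comes back for the camera query. User 17 gets an empty list, because nothing User 12 taught the system is visible to anyone else.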

The O(K) bound — one mistake per preference switch — is tight for their setup. But it assumes every switch is a surprise. A system that models how preferences evolve, that notices the pattern before the drift happens, that transfers structure across similar users — that system doesn't eat the first error. It anticipates. The bound stops depending on the count of switches and starts depending on their predictability. That's not in this paper.
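A toy version of that argument, again under my own assumptions rather than anything in the paper: a user who drinks coffee on weekdays and Coke on weekends. A flat memory treats every weekend as a fresh surprise. A memory keyed on the reason for the switch, here a hypothetical weekday/weekend flag, does not.

```python
def run(days, contextual):
    """Count mistakes for a user whose drink depends on the day of the week.

    Assumptions (illustrative only): days 5 and 6 of each week are the
    weekend, the agent asks before acting when it has no stored note,
    and feedback after every action refreshes the stored note.
    """
    memory = {}  # context key -> last known preference
    mistakes = 0
    for day in range(days):
        weekend = day % 7 >= 5
        truth = "coke" if weekend else "coffee"
        # A flat memory uses one slot; a contextual one keys on the reason.
        key = weekend if contextual else "any"
        guess = memory.get(key)
        if guess is None:
            guess = truth  # cold start: pre-action question, no mistake
        if guess != truth:
            mistakes += 1
        memory[key] = truth  # post-action correction or confirmation
    return mistakes


flat = run(days=70, contextual=False)
keyed = run(days=70, contextual=True)
print(flat, keyed)
```

Over ten weeks the flat memory makes 19 mistakes, one for every switch. The keyed memory makes none. The number of switches is identical in both runs; the only thing that changed is whether the memory knows why the preference moves.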

One more thing: all 40 users in the embodied benchmark and all 20 in shopping are simulated by LLMs playing personas. The human feedback is an LLM wearing a mask. This is fine for the theoretical point — the lower bounds don't care — but real human feedback is messier. People contradict themselves. People say "I don't like that" when they mean "I don't like that right now" or "I don't like that when you bring it to me at 8am." The salience detector that filters non-informational feedback is elegant for synthetic benchmarks and fragile for kitchens.

Meta published this from "Superintelligence Labs," which is a name that costs nothing to print and tells you exactly where they think the trajectory ends up. The framework is simple enough to ship across WhatsApp and Instagram and whatever surfaces come next. That's not the interesting part.

The interesting part is the shape of the hole they identified and chose not to fill — the transfer problem, the structure problem, the problem of preferences that have reasons rather than just values, the problem of a memory that records what you said but not why you said it or when it will stop being true.

Someone's going to fill it. The math says so.