
Half the Work, 70% of It Wrong

Salesforce says agents handle half their workload. Agents fail most of the time. These two facts were announced three days apart and nobody blinked.

2 min read 418 words #ai-agents #fine-tuning #salesforce #gemma #synthetic-data

Half the work, 70% wrong. Two announcements that were actually the same announcement. Benioff not noticing was the most Benioff thing possible. The bar for "doing work" remains suspiciously flexible.

Salesforce is doing 50% of its work with AI agents. Those agents are wrong 70% of the time. Benioff announced both of these things, apparently without noticing they were the same announcement.

The Register ran the numbers on agent reliability this week — 70% failure rate, real tasks, current state of the art. Somewhere across town, Marc Benioff was on CNBC explaining how agents now handle half the workload at Salesforce. These stories ran three days apart. Nobody seems bothered.

There is something clarifying about this. Either Salesforce's bar for "doing work" is set at a level where being wrong most of the time still clears it, or half their workload consists of tasks where 30% accuracy is sufficient, or Marc Benioff is lying — and any of those options tells you something useful about enterprise software in 2025.

Meanwhile, some people are looking at the 70% failure rate and seeing a training pipeline.

The move: if a smart model can do something a dumb model cannot (even intermittently, even badly, even wrong seven times out of ten), you run the smart model until you have enough correct examples to fine-tune a small model on. The 30% that works becomes the dataset. The dataset trains Gemma 3n. Gemma 3n runs on a free Google Colab T4. Audio, video, text: all of it, on hardware you borrowed from Google for nothing.
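As a rough sketch of that loop, assuming you have some way to check an answer (the part the pitch tends to skip): the Python below runs a stand-in teacher, keeps only the outputs that pass the check, and serializes them as prompt/completion records. Every name here (`collect_successes`, `flaky_teacher`, `check_sum`) is hypothetical. The teacher is a toy that gets addition right about 30% of the time, not a real model call, and the fine-tuning step itself is left out.

```python
import json
import random
from typing import Callable, Dict, List


def collect_successes(
    tasks: List[str],
    teacher: Callable[[str], str],
    is_correct: Callable[[str, str], bool],
    max_attempts: int = 1,
) -> List[Dict[str, str]]:
    """Run the teacher on each task and keep only answers that pass the check."""
    kept = []
    for task in tasks:
        for _ in range(max_attempts):
            answer = teacher(task)
            if is_correct(task, answer):
                kept.append({"prompt": task, "completion": answer})
                break  # one good example per task is enough
    return kept


# Toy stand-ins (assumptions for illustration, not a real agent or grader).
rng = random.Random(0)


def flaky_teacher(task: str) -> str:
    """Answers 'a+b' correctly roughly 30% of the time, off by one otherwise."""
    a, b = map(int, task.split("+"))
    return str(a + b) if rng.random() < 0.3 else str(a + b + 1)


def check_sum(task: str, answer: str) -> bool:
    """The verifier: here the ground truth is trivially computable."""
    a, b = map(int, task.split("+"))
    return answer == str(a + b)


tasks = [f"{i}+{i + 1}" for i in range(100)]
dataset = collect_successes(tasks, flaky_teacher, check_sum)

# One JSON object per line, a common shape for fine-tuning data.
jsonl = "\n".join(json.dumps(record) for record in dataset)
print(f"kept {len(dataset)} of {len(tasks)} attempts")
```

The whole trick lives in `is_correct`. Addition is easy to grade; most of what an enterprise agent does is not, and a sloppy checker lets the slag into the dataset.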

This basic shape has existed before — distillation, synthetic data, teacher-student training — but not at this input coverage, not at this cost floor. The new part is audio and video going in. The new part is a T4. The new part is that the expensive model's occasional competence gets compressed into something that runs for free.

What's strange about this is what it implies about failure. If a 70% failure rate generates enough good examples to fine-tune a competitive small model, then AI agent failure isn't a problem to fix — it's ore to process. You run the agent, log what worked, discard what didn't, and the failures are just slag. This is a completely different relationship with unreliability than anyone expected to have.
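The arithmetic behind that is short. A back-of-envelope sketch, assuming each run succeeds independently at the same rate and that successes can be told apart from failures; the function names and the dollar figure are invented for illustration.

```python
import math


def runs_needed(target_examples: int, success_rate: float) -> int:
    """Expected number of agent runs to collect the target number of successes."""
    return math.ceil(target_examples / success_rate)


def cost_per_kept_example(cost_per_run: float, success_rate: float) -> float:
    """Every kept example also pays for the failed runs around it."""
    return cost_per_run / success_rate


print(runs_needed(1000, 0.3))  # → 3334
print(round(cost_per_kept_example(0.03, 0.3), 4))  # → 0.1 (hypothetical $0.03 per run)
```

At a 30% success rate, every usable example costs a bit over three runs. Whether that is cheap depends entirely on what a run costs.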

Salesforce is not thinking about it this way. Salesforce is thinking about revenue per seat and whether to call the agents "Digital Employees" on the next earnings call. But the structure underneath is the same: high failure rate, high volume, enough successes to matter. The math works differently than it used to.

Thirty percent, run long enough, turns out to be plenty.