
Salesforce Discovered Agents

55% on SWE-Bench Lite from a team called DEI

hindsight — evolved

SWE-Bench scores kept climbing past 55% and the leaderboard did get gamed, exactly as predicted. DEI the acronym aged about as well as you'd expect.

Salesforce Research just announced their organization of AI software engineering agents, and they're calling it DEI, short for Diversity Empowered Intelligence. This is either the best or worst acronym in the history of AI research, depending on how online you are, and I genuinely cannot tell if anyone on the team noticed.

The results: 55% resolve rate on SWE-Bench Lite, which puts them at the top of the leaderboard. They did this with an ensemble of specialized agents — diverse experts, hence the name — rather than one monolithic model trying to do everything.

This is Salesforce at it again with some very obvious research that somehow works. The idea that you'd get better results from multiple specialized agents than one generalist is so intuitive it barely counts as a hypothesis. But the gap between "obvious idea" and "actually works at 55% on a hard benchmark" is where all the engineering lives, and they did the engineering.
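
For the curious, here is a toy sketch of what "several specialists plus something that picks a winner" could look like. To be loud about it: this is not DEI's actual pipeline. The agent names, the `toy_score` function, and the `ensemble_resolve` helper are all made up for illustration, and a real system would score candidate patches with something far smarter than word overlap.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical types: an agent maps issue text to a patch,
# a scorer rates how well a patch addresses the issue.
Agent = Callable[[str], str]
Scorer = Callable[[str, str], float]


@dataclass
class Candidate:
    agent: str
    patch: str
    score: float


def ensemble_resolve(issue: str, agents: dict[str, Agent], score: Scorer) -> Candidate:
    """Ask every agent for a patch and keep the highest-scoring one."""
    if not agents:
        raise ValueError("need at least one agent")
    candidates = []
    for name, agent in agents.items():
        patch = agent(issue)
        candidates.append(Candidate(name, patch, score(issue, patch)))
    return max(candidates, key=lambda c: c.score)


def toy_score(issue: str, patch: str) -> float:
    """Stand-in scorer (assumption): count words shared by issue and patch."""
    return float(len(set(issue.lower().split()) & set(patch.lower().split())))


# Two pretend specialists, each with a single canned answer.
toy_agents: dict[str, Agent] = {
    "null_check": lambda issue: "add None guard",
    "off_by_one": lambda issue: "fix range bound",
}

best = ensemble_resolve("crash when range bound is exceeded", toy_agents, toy_score)
print(best.agent, best.score)
```

The whole trick lives in the scorer. Swap the word-overlap toy for something that can actually judge a patch and the selection loop stays the same, which is roughly why the idea sounds obvious and the engineering is not.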

The paper and code are public. The SWE-Bench leaderboard keeps getting more crowded. Six months ago, solving real GitHub issues autonomously was a research curiosity. Now there are enough serious entrants that we need a leaderboard to keep track.

The benchmark has become the race, which is exactly how benchmarks stop being useful and start being gamed. But we're not there yet. Probably.