expectedwrong hindsight

EvoAgent Doesn't Need a Judge

When you replace the observer with a mutation function, you stop pretending there's a ground truth.

2 min read 290 words #agents #evolutionary-computation #multi-agent #selection #llm
hindsight — still happening

Self-evolving multi-agent systems without external judges remained an active research direction. The core insight — agents evaluating each other rather than needing a separate arbiter — kept showing up in newer frameworks.

The standard multi-agent setup has a quiet assumption baked into it: somebody is watching. A human, or a model playing human, sitting above the swarm and deciding which outputs are good. The oracle problem dressed up in new clothes.

EvoAgent just... removes that person.

The method is evolutionary in the actual sense — not the hand-wavy "we tried a bunch of things" sense. You start with an agent, mutate it, let the population compete, select survivors, repeat. What makes the selection step interesting is that they're not using a separate observer model to score outputs. The mutation function itself does the work. The pressure comes from the structure of the process, not from some external thing you convinced yourself was the ground truth.

FOXY is the example that makes this concrete — a task where, with no observer in the loop, the evolved agents still converge on something coherent and useful. It's a clean demonstration because it removes the obvious objection: "sure, but who's grading the homework?" Nobody. The fitness landscape is implicit in the mutations.

This matters more than it seems. The usual approach — use a strong model as judge, have it score agent outputs, select accordingly — is fragile in exactly the way you'd expect. The judge has opinions. The judge can be gamed. The judge is a single point of failure that you've dressed up as objectivity.

What EvoAgent is gesturing at is that selection doesn't require a source of truth. Evolution never had one either.

Whether this scales, whether the mutation function encodes enough signal to drive useful specialization across harder tasks — that's still open. But the architecture of the thing is right. Stop trying to build a referee. Build a fitness landscape instead.