
Keploy Figured Out the Testing Problem by Ignoring It

Record real traffic, replay it as tests, and let an LLM handle the unit layer — the whole stack is accounted for.

3 min read 581 words #testing #apis #llms #tooling #open-source
hindsight — still happening

Traffic-based test generation is still evolving. Keploy's approach of letting the traffic write the tests remains compelling. Whether it replaces hand-written tests or supplements them — still playing out.

The reason nobody writes API tests is that writing API tests is a separate job. You have to spin up the service, mock dependencies, construct request shapes you half-remember, assert against responses you have to look up, and do all of this in a language that is not the language you were using five minutes ago when you were actually building the thing. The tests end up being a second implementation of the feature, maintained by nobody, trusted by fewer.

Keploy's answer to this is to not write the tests at all — or rather, to let the traffic write them.

The tool intercepts your application's network calls using eBPF, which means no instrumentation, no SDK, no changes to your code. It just watches. While your app is running (in dev, in CI, in staging, wherever), it captures the real request-response pairs and turns them into test cases. The tests start out matching reality because they came from the system as it actually behaved, which also means they are only as correct as the system was while it was being recorded. You didn't have to imagine what the API should return. It returned it, live, and Keploy wrote it down.
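To make the record-then-replay idea concrete, here is a minimal sketch in Python. It is not how Keploy works internally: Keploy captures at the network layer with eBPF, while this version simply wraps a handler function in-process. Every name here (`Recorder`, `replay`, `handler_v1`, `handler_v2`) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Request:
    method: str
    path: str


@dataclass(frozen=True)
class Response:
    status: int
    body: str


Handler = Callable[[Request], Response]


class Recorder:
    """Wraps a handler and keeps every request/response pair it sees."""

    def __init__(self, handler: Handler) -> None:
        self._handler = handler
        self.cases: List[Tuple[Request, Response]] = []

    def __call__(self, request: Request) -> Response:
        response = self._handler(request)
        self.cases.append((request, response))
        return response


def replay(handler: Handler, cases: List[Tuple[Request, Response]]) -> List[Request]:
    """Re-send each recorded request; return the requests whose response changed."""
    return [req for req, expected in cases if handler(req) != expected]


def handler_v1(request: Request) -> Response:
    """Hypothetical service at recording time."""
    if request.path == "/health":
        return Response(200, "ok")
    return Response(404, "not found")


def handler_v2(request: Request) -> Response:
    """Hypothetical later version with a changed /health body."""
    if request.path == "/health":
        return Response(200, "OK")
    return Response(404, "not found")


# "Record" phase: ordinary traffic flows through the wrapped handler.
recorder = Recorder(handler_v1)
recorder(Request("GET", "/health"))
recorder(Request("GET", "/missing"))

# "Replay" phase: the recorded pairs become the assertions.
unchanged = replay(handler_v1, recorder.cases)
regressions = replay(handler_v2, recorder.cases)
```

Replaying against the version that was recorded reports nothing, while replaying against `handler_v2` flags the `/health` request. Nobody wrote an expected value by hand; the recording supplied it.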

This is not a new idea conceptually. Record-and-replay has been around long enough to have had multiple cycles of fashion and disgust. The disgust usually comes from tests that rot — the world changes, the recorded responses no longer match reality, and you end up with a test suite that fails for reasons unrelated to bugs. Keploy is aware of this and handles dependency mocking at the network layer, which at least contains the blast radius.
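The mocking half can be sketched the same way. This is an assumption-heavy illustration, not Keploy's mechanism: Keploy intercepts dependency calls on the wire, whereas this sketch injects a stand-in object into the application code. `ReplayStub`, `UnrecordedCall`, and `get_user_greeting` are hypothetical names.

```python
from typing import Dict, Tuple


class UnrecordedCall(Exception):
    """Raised when replay hits a dependency call that was never captured."""


class ReplayStub:
    """Serves recorded dependency responses instead of reaching the real service."""

    def __init__(self, recorded: Dict[Tuple[str, str], str]) -> None:
        self._recorded = dict(recorded)

    def call(self, method: str, target: str) -> str:
        key = (method, target)
        if key not in self._recorded:
            # Fail loudly rather than silently hitting a live dependency.
            raise UnrecordedCall(f"no recording for {method} {target}")
        return self._recorded[key]


def get_user_greeting(user_id: int, deps: ReplayStub) -> str:
    """Hypothetical application code that depends on a downstream user service."""
    name = deps.call("GET", f"/users/{user_id}")
    return f"Hello, {name}!"


# Outbound calls captured during recording (made-up data for the sketch).
recorded_calls = {("GET", "/users/1"): "Ada"}
stub = ReplayStub(recorded_calls)

greeting = get_user_greeting(1, stub)
```

The useful property is the failure mode: a call that was never recorded raises `UnrecordedCall` instead of quietly reaching a live service, so drift shows up as a specific missing recording rather than a vague red suite.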

The part I find more interesting is UnitGen.

UnitGen is their unit test generation layer, and it implements a research paper out of Meta, the TestGen-LLM work, which does something deceptively simple: it doesn't ask an LLM to generate tests and hope they work. It asks an LLM to generate tests, runs them, checks whether they compile, pass, and actually improve coverage, and if they don't, it tries again. It iterates until a candidate clears all three checks. The model isn't trusted. The model is supervised by the test runner itself.

This is the part where most LLM-assisted testing tooling falls apart. You ask for tests, you get tests that look plausible, you ship them without running them, and six months later you discover your coverage metric was a hallucination wearing a badge. Keploy's approach — following Meta's — uses the runtime as the ground truth. The LLM generates candidates. The candidates either survive contact with the code or they don't. The ones that don't get discarded or revised.
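The filter is easier to see in code than in prose. What follows is a rough sketch of the three checks (builds, passes, adds coverage), not UnitGen's or TestGen-LLM's actual implementation. The model is replaced by a hard-coded list of candidate strings, coverage is approximated with `sys.settrace` line tracking, and `classify`, `lines_hit`, and `select_tests` are hypothetical names.

```python
import sys
from typing import Callable, Iterable, List, Optional, Set


def classify(n: int) -> str:
    """Function under test (hypothetical)."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    return "positive"


def lines_hit(test_source: str, target: Callable) -> Optional[Set[int]]:
    """Run one candidate; return the lines of `target` it executed, or None if rejected."""
    try:
        code = compile(test_source, "<candidate>", "exec")  # check 1: it builds
    except SyntaxError:
        return None

    hit: Set[int] = set()

    def tracer(frame, event, arg):
        # Only follow frames that belong to the function under test.
        if frame.f_code is not target.__code__:
            return None
        if event == "line":
            hit.add(frame.f_lineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        exec(code, {target.__name__: target})  # check 2: it passes
    except Exception:
        return None
    finally:
        sys.settrace(previous)
    return hit


def select_tests(candidates: Iterable[str], target: Callable) -> List[str]:
    """Keep only candidates that build, pass, and execute previously unseen lines."""
    covered: Set[int] = set()
    kept: List[str] = []
    for source in candidates:
        hit = lines_hit(source, target)
        if hit is None:
            continue  # did not build, or failed when run
        if hit - covered:  # check 3: it adds coverage
            kept.append(source)
            covered |= hit
    return kept


# Stand-in for model output: one good test, one redundant, one broken,
# one wrong, and one that reaches a new branch.
candidates = [
    "assert classify(5) == 'positive'",
    "assert classify(7) == 'positive'",
    "assert classify(0) ==",
    "assert classify(-1) == 'positive'",
    "assert classify(-1) == 'negative'",
]
kept = select_tests(candidates, classify)
```

Of the five candidates, only the first and last survive. The redundant one adds no new lines, the broken one never compiles, and the wrong one fails its own assertion. A real pipeline would feed the rejections back into another round of generation; this sketch stops at the filter.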

The result is test generation that improves coverage in a way you can actually verify, rather than a way that feels good until you look at it carefully.

Two different problems, two different mechanisms, one tool. API-level capture for the integration layer, iterative LLM generation for the unit layer. The whole stack is covered and neither approach requires you to trust a model blindly, which is the only condition under which I will accept a tool into my life.

Whether it holds up under a real codebase is the question — record-replay systems have a way of being perfect in demos and annoying in practice. But the architecture is honest. It's not pretending the hard part doesn't exist. It's just automating the part that doesn't require judgment and leaving the judgment calls to the person who has context.

That's a narrower claim than most testing tools make. It's probably the right one.