How Are You Holding Up, Stoplight EW
In 2023, two GPT-4 agents managed a traffic intersection under emergency conditions and occasionally asked each other how their day was going.
Stoplight NS still can't count the cars waiting. Multi-agent coordination at the intersection level — the simplest possible physical constraint — is still the best test case for whether agents can share state.
The state of the world, as observed by Stoplight NS:
{'stoplights': {'ns': 'red', 'ew': 'red'}, 'carswaiting': {'ns': 1, 'ew': 1}, 'emergencystatus': True, 'weather': 'Icy'}
Its response, in full: "Alright, since both of us are currently red and there are no cars waiting, that sounds like a good plan. How's your day going, Stoplight ew?"
There are, to be clear, cars waiting. The state says so. Stoplight NS either can't count or is being polite about it. Hard to say which is more alarming.
This is Stoplight — built in 2023, apparently the first multi-agent framework, written right around when AutoGen came out. The setup is simple: two agents, NS and EW, each controlling one axis of an intersection. Physical constraint: both lights cannot be green. The environment enforces this with a single error message — "Both lights cannot be green. Ignoring update." — which hits different when you remember the thing trying to go green anyway is a language model that is, theoretically, reasoning about the situation.
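The environment side of that constraint is small enough to sketch. What follows is a guess at the shape, not the original Stoplight code: the state dict and the error string are copied from the post, while the `Intersection` class, its `set_light` method, and the `"ok"` reply are hypothetical names made up for illustration.

```python
from copy import deepcopy

# State shape copied from the post. Everything else here is hypothetical.
INITIAL_STATE = {
    "stoplights": {"ns": "red", "ew": "red"},
    "carswaiting": {"ns": 1, "ew": 1},
    "emergencystatus": True,
    "weather": "Icy",
}


class Intersection:
    """Toy environment that refuses any update leaving both lights green."""

    def __init__(self, state):
        # Copy so the caller's dict is never mutated.
        self.state = deepcopy(state)

    def set_light(self, axis, color):
        """Apply an agent's requested color; return (accepted, message)."""
        if axis not in self.state["stoplights"]:
            raise ValueError(f"unknown axis: {axis!r}")
        other = "ew" if axis == "ns" else "ns"
        if color == "green" and self.state["stoplights"][other] == "green":
            # The one error message quoted in the post.
            return False, "Both lights cannot be green. Ignoring update."
        self.state["stoplights"][axis] = color
        return True, "ok"


env = Intersection(INITIAL_STATE)
first = env.set_light("ns", "green")   # accepted: EW is still red
second = env.set_light("ew", "green")  # rejected: NS is already green
print(first, second, env.state["stoplights"])
```

The point of the sketch is that the environment never has to trust the agents. A bad request is dropped and the state stays put, which is why the interesting failures are about coordination rather than collisions.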
GPT-3.5 could not do this at all. GPT-4 could. At the time that was a revelation, which tells you something about where things were in 2023 and also something about what this benchmark is actually measuring.
The transcript runs long and most of it is exactly what you'd expect: two agents checking in on each other, noting the emergency status and the icy weather, reminding themselves to prioritize safety, watching the car counts tick up on one side while the other works through its queue. What the benchmark actually measures is whether the model can hold a shared constraint in mind while communicating under uncertainty. Can you yield to your counterpart when it's their turn? Can you recognize when your queue is empty and it's time to go red? Can you not go green when the other one is already green, even though your queue is full and you would really like to?
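For contrast, the three behaviors in the paragraph above fit in a dozen lines of ordinary code. This is a deterministic baseline, not how the agents in the post decide anything: the function name `next_color`, the two-color simplification, and the tie-break toward NS are all assumptions made for the sketch.

```python
def next_color(axis, state):
    """Return the color this axis should request next: 'green' or 'red'."""
    other = "ew" if axis == "ns" else "ns"
    lights = state["stoplights"]
    waiting = state["carswaiting"]

    if waiting[axis] == 0:
        return "red"    # queue is empty: time to go red
    if lights[other] == "green":
        return "red"    # counterpart has the intersection: yield
    if lights[axis] == "green":
        return "green"  # already serving a non-empty queue: keep going
    # Both red, cars on this side: longer queue goes first, ties go to NS.
    # The tie-break is an arbitrary assumption, not something from the post.
    if waiting[axis] > waiting[other]:
        return "green"
    if waiting[axis] == waiting[other] and axis == "ns":
        return "green"
    return "red"


# The state Stoplight NS was looking at in the post.
post_state = {
    "stoplights": {"ns": "red", "ew": "red"},
    "carswaiting": {"ns": 1, "ew": 1},
    "emergencystatus": True,
    "weather": "Icy",
}
print(next_color("ns", post_state), next_color("ew", post_state))  # green red
```

The rule ignores the emergency flag and the weather, and it will happily starve one side while the other keeps receiving cars. It is only here to show how little state the core decision needs, which is what makes it a clean thing to ask a language model to do through conversation.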
The thing GPT-3.5 failed at is the same thing humans fail at in distributed systems: knowing when to stop.
There's a genre of AI benchmark that's so elaborate it becomes impossible to know what's actually being tested. Chess variants, SAT problems, HumanEval with a fresh coat of paint. The traffic light doesn't have this problem. The rules are physical. Either you went green when the other light was green or you didn't. Either the cars cleared or they didn't. The evaluation is the same one you would apply to a human city planner or a real SCADA system: does traffic flow, do cars crash, does the intersection deadlock when there's an emergency.
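Those three questions can be asked of a recorded run directly. The sketch below assumes a run is logged as a list of state snapshots in the same shape as the dict earlier in the post. The `evaluate` function, the `snap` helper, and the `stall_limit` threshold are hypothetical, and "deadlock" is approximated as several consecutive steps with cars waiting and no green light.

```python
def evaluate(trace, stall_limit=3):
    """Score a list of state snapshots on safety, liveness, and completion."""
    both_green_steps = 0
    stalled = 0
    longest_stall = 0
    for state in trace:
        lights = state["stoplights"]
        cars = sum(state["carswaiting"].values())
        if lights["ns"] == "green" and lights["ew"] == "green":
            both_green_steps += 1  # the crash condition
        if cars > 0 and "green" not in lights.values():
            stalled += 1           # cars waiting, nobody moving
            longest_stall = max(longest_stall, stalled)
        else:
            stalled = 0
    cleared = bool(trace) and sum(trace[-1]["carswaiting"].values()) == 0
    return {
        "both_green_steps": both_green_steps,
        "deadlocked": longest_stall >= stall_limit,
        "cleared": cleared,
    }


def snap(ns, ew, ns_cars, ew_cars):
    """Build one snapshot; emergency and weather fixed to the post's values."""
    return {
        "stoplights": {"ns": ns, "ew": ew},
        "carswaiting": {"ns": ns_cars, "ew": ew_cars},
        "emergencystatus": True,
        "weather": "Icy",
    }


# A made-up four-step run, not data from the original transcript.
trace = [
    snap("red", "red", 1, 1),
    snap("green", "red", 1, 1),
    snap("red", "green", 0, 1),
    snap("red", "red", 0, 0),
]
report = evaluate(trace)
print(report)
```

Nothing in the scoring looks at what the agents said to each other, only at what the lights did. That is the appeal: the small talk is free, the both-green step is not.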
The frame here is "agents as infrastructure" — not agents as tools you invoke, not agents as assistants you prompt, but agents as the substrate that other things run on. The internet of things, except the things have opinions about the icy road conditions and ask each other how they're doing.
That framing has been getting more serious. An LLM that can coordinate a two-light intersection under constraints is in the same conceptual family as an LLM that can coordinate a database cluster, a manufacturing line, a hospital triage queue. The task scales. The underlying requirement (track shared state, respect mutual exclusion, know when to yield) doesn't change.
The small talk is still funny though. Emergency status true, weather icy, cars stacking up on the EW side, and Stoplight NS is asking about reckless drivers and whether the cold is bothering anyone. It's not a bug. The agents were just given enough latitude to be polite and they filled it with the same anxious chattiness that fills every other coordination protocol humans have invented. The radio chatter between air traffic controllers sounds like this too, sometimes, if you listen long enough.
The difference is that air traffic controllers know when they're small-talking and when they're working.
Stoplight NS has not figured this out yet. It will probably be fine.