The AI with a Phone Book

HuggingGPT uses ChatGPT as a dispatcher that routes tasks to specialist models — which sounds obvious until you watch it work.

The paper is called HuggingGPT. Microsoft's GitHub repo calls it JARVIS. The core idea is that ChatGPT is not the model that does the thing — it's the model that decides which model does the thing.

You send it a request. It reads the request, decomposes it into subtasks, consults its phone book of Hugging Face models, dispatches each subtask to the right specialist, collects the outputs, and synthesizes a response. The LLM is the brain. Everything else is a hand.
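The loop is small enough to sketch. What follows is a toy under loud assumptions, not the HuggingGPT code: the planner is a hard-coded stand-in for the LLM, the phone book is a dict of fake specialists, and every name in it (PHONE_BOOK, Subtask, plan, run, the PREVIOUS marker) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

PREVIOUS = "<previous>"  # hypothetical marker: "use the prior step's output"


@dataclass
class Subtask:
    task: str     # capability name to look up in the phone book
    payload: str  # input for the specialist, or PREVIOUS to chain


# Hypothetical phone book: capability name -> specialist.
# Real specialists would be hosted models; these are stand-in functions.
PHONE_BOOK: Dict[str, Callable[[str], str]] = {
    "image-captioning": lambda x: f"caption for {x}",
    "text-to-speech": lambda x: f"audio of '{x}'",
}


def plan(request: str) -> List[Subtask]:
    """Stand-in for the LLM planner. A real planner would prompt a
    language model; this one returns a fixed two-step plan."""
    return [
        Subtask("image-captioning", "photo.jpg"),
        Subtask("text-to-speech", PREVIOUS),
    ]


def run(request: str) -> str:
    """Plan, dispatch each subtask to its specialist, then combine."""
    previous = ""
    outputs = []
    for step in plan(request):
        specialist = PHONE_BOOK[step.task]  # model selection
        payload = previous if step.payload == PREVIOUS else step.payload
        previous = specialist(payload)      # task execution
        outputs.append(previous)
    # A real system would hand the outputs back to the LLM to write
    # the final answer; joining strings is the placeholder for that.
    return " | ".join(outputs)


print(run("Describe photo.jpg out loud"))
```

Nothing in run knows how to caption an image or synthesize speech. It only knows who to call, and in what order.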

This is the correct mental model for where this is all going, and the fact that it's sitting right there in a paper from Zhejiang University and Microsoft Research and most of the discourse is still about whether GPT-4 can pass the bar exam is, if nothing else, a consistent data point about how we process information.

The thing that makes it interesting isn't the task decomposition; planners have existed forever. It's that the phone book is Hugging Face, which means the phone book is enormous and mostly someone else's problem. You want image segmentation? There's a model for that. You want pose estimation followed by text-to-speech in a language other than English? There are models for those. The LLM doesn't need to know how to do any of it. It just needs to know that those capabilities exist and when to call them.

There's something almost embarrassing about how clean this is. We spent years arguing about whether one model could do everything — could be good at code and poetry and math and vision — when the answer was maybe just: have the generalist handle the scheduling and let the specialists handle the work. Which is, notably, how organizations with more than four employees operate.

The failure mode is obvious. The planner can hallucinate capabilities, misroute tasks, and chain errors in ways that compound silently. The Hugging Face ecosystem is also not exactly known for rock-solid model cards and reproducible outputs. You're trusting that the phone book is accurate and that the people you're calling actually answer.
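The cheapest guard against a hallucinated capability is to check the plan against the phone book before dialing anyone. This is a sketch of that idea only, not something taken from the paper: KNOWN_TASKS, validate_plan, and dispatch_or_fail are hypothetical names, and the task list is invented.

```python
from typing import Iterable, List

# Hypothetical registry of capabilities that actually have a specialist.
KNOWN_TASKS = {"image-segmentation", "pose-estimation", "text-to-speech"}


def validate_plan(planned_tasks: Iterable[str]) -> List[str]:
    """Return the planned task names that have no entry in the registry."""
    return [task for task in planned_tasks if task not in KNOWN_TASKS]


def dispatch_or_fail(planned_tasks: Iterable[str]) -> List[str]:
    """Refuse the whole plan if any step names an unknown capability,
    so a bad step fails loudly instead of compounding downstream."""
    tasks = list(planned_tasks)
    unknown = validate_plan(tasks)
    if unknown:
        raise ValueError(f"planner requested unknown capabilities: {unknown}")
    return tasks  # a real dispatcher would call the specialists here


print(validate_plan(["pose-estimation", "mind-reading"]))
```

This catches the planner inventing a model that does not exist. It does nothing about a real model that answers the phone and gets it wrong, which is the harder half of the problem.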

But as a direction — using language models as coordinators rather than performers, treating the LLM as the reasoning layer and specialist models as the execution layer — it's the most coherent architectural bet I've seen this year.

The name JARVIS is carrying the whole pitch, emotionally. That's fine. It's still right.

