500,000 Tokens Per Second Is a Silly Number

Etched built a chip that does one thing, and it does that one thing at a speed that makes current benchmarks feel like a joke.

Etched announced a chip today — Sohu — that runs transformer inference at 500,000 tokens per second. That is not a typo. That is not per rack or per cluster. Per chip.

For context: a good GPU setup today will get you a few thousand tokens per second if you're not being stingy about batch sizes. A very good setup, tuned carefully, maybe more. The gap between "very good setup, tuned carefully" and 500,000 is the kind of gap that doesn't invite comparison; it invites a category rethink.

The way they got there is also the interesting part. Sohu isn't a general-purpose accelerator that happens to run transformers efficiently. It is a transformer. The attention mechanism, the matrix multiplications, the whole pipeline — baked into silicon. Hardwired. You cannot run a convolutional net on this chip any more than you can run a spreadsheet on a thermostat. It does one thing.

Which is a bet. A large, expensive, irreversible bet that transformers aren't going anywhere — that the architecture that's eaten AI for the last several years will still be the architecture when these chips are in production, when they've shipped, when the amortization math has to work out. If someone shows up in eighteen months with a genuinely better architecture that isn't transformer-shaped, Sohu becomes a very costly paperweight.

The people at Etched clearly think that bet is obvious. They might be right. The history of technology is full of people who built specialized hardware for the wrong thing at the wrong time, and also full of people who built specialized hardware for the right thing and made a fortune while everyone else was still generalizing.

Meanwhile, on the software side, Lamini published results today on fine-tuning Llama 3 for SQL, and the numbers are the kind that make you realize how much performance is being left on the table by running general models on general tasks. A Llama 3 model, tuned correctly, is not a worse version of GPT-4 on SQL. It is a different object entirely, trained to do one thing well and doing it.

These two announcements don't seem connected. They are completely connected.

What happens when inference is not the bottleneck — when you have 500,000 tokens per second available and the question shifts from "can we afford to run this" to "what do we actually want to run" — is that specialization becomes the obvious move. You stop asking a general model to approximate your task and you train the model for the task. You stop running one model and run fifty. You stop thinking about throughput as a resource you ration and start thinking about what you'd do if it were free.
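A back-of-envelope sketch of the "run fifty" point, assuming the 500,000 tokens per second could be split evenly across fifty specialized models. That even split is an idealization for illustration only; nothing in the announcement says one chip can host that many models, and the names below are made up for the example.

```python
# Back-of-envelope only: split a fixed throughput budget evenly across models.
# The even split is an assumption for illustration, not a claim about Sohu.

TOTAL_TOKENS_PER_SECOND = 500_000  # headline figure quoted in this post
MODEL_COUNT = 50                   # "run fifty" from the paragraph above


def per_model_throughput(total_tps: int, models: int) -> float:
    """Return each model's share of the budget under an even split."""
    if models <= 0:
        raise ValueError("models must be positive")
    return total_tps / models


share = per_model_throughput(TOTAL_TOKENS_PER_SECOND, MODEL_COUNT)
print(f"{share:,.0f} tokens per second per model")  # 10,000 tokens per second per model
```

Even divided fifty ways, each model's share is at or above the "few thousand" tokens per second that a good GPU setup manages today, which is roughly what it would mean for throughput to stop being the thing you ration.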

We are not there yet. Sohu isn't shipping today. The SQL numbers from Lamini are on benchmarks, which are benchmarks. But the direction is legible from here, and the direction is: inference gets cheap enough to be weird, and then people start doing weird things with it.

500,000 tokens per second is a silly number because it doesn't fit in any of the mental models we've built for what inference costs and what that cost implies. The silly numbers are usually where things get interesting.

Counterpoints