The Hot Neuron Trick
PowerInfer splits your LLM across GPU and CPU not by layer but by which neurons actually show up to work.
The insight about unequal neuron contribution was valid science. But PowerInfer didn't become the standard; the field moved to quantization, speculative decoding, and other optimizations. Right about the observation, wrong about the technique's trajectory.
Some researchers at Shanghai Jiao Tong University noticed something about large language models that everyone else apparently looked at and filed under "interesting, moving on."
Neurons are not pulling equal weight. Not even close. Across wildly different inputs — different prompts, different contexts, different everything — roughly the same small cluster of neurons lights up every single time. Maybe 10% of them. The rest are there for the edge cases, the weird sentences, the moments the model needs to reach.
They called the always-on ones hot neurons. The rarely-used ones, cold.
And then they did the obvious thing that nobody had done: put the hot ones on the GPU and the cold ones on the CPU.
That's PowerInfer. That's the whole trick. A predictor runs on the GPU to decide, token by token, which neurons are actually needed. The hot ones are already there. The cold ones stay in CPU RAM and get computed by the CPU only when called, so the weights themselves rarely have to cross the bus. The paper drops this week and the benchmarks are the kind of numbers that make you re-read them looking for the footnote that explains why they don't count.
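Here is a toy sketch of that split for a single ReLU feed-forward layer, in plain Python. It is not PowerInfer's code. Every name in it (`profile_hot`, `oracle_predictor`, `split_ffn`, `make_layer`) is made up for illustration, the biases are rigged so that a few neurons fire often, and the "predictor" is an oracle that simply checks which neurons fire, standing in for the small learned predictor the real system would use.

```python
import random

def relu(v):
    return v if v > 0.0 else 0.0

def pre_activation(x, row, b):
    return sum(w * xi for w, xi in zip(row, x)) + b

def dense_ffn(x, w_in, bias, w_out):
    """Reference path: evaluate every neuron."""
    y = [0.0] * len(w_out[0])
    for row, b, out in zip(w_in, bias, w_out):
        a = relu(pre_activation(x, row, b))
        for j, o in enumerate(out):
            y[j] += a * o
    return y

def profile_hot(samples, w_in, bias, budget):
    """Offline step: rank neurons by how often they fire, keep the top `budget`."""
    fires = [0] * len(w_in)
    for x in samples:
        for i, (row, b) in enumerate(zip(w_in, bias)):
            if pre_activation(x, row, b) > 0.0:
                fires[i] += 1
    ranked = sorted(range(len(w_in)), key=lambda i: fires[i], reverse=True)
    return set(ranked[:budget])

def oracle_predictor(x, w_in, bias):
    """Stand-in for a learned predictor: returns the neurons that truly fire."""
    return [i for i, (row, b) in enumerate(zip(w_in, bias))
            if pre_activation(x, row, b) > 0.0]

def split_ffn(x, w_in, bias, w_out, hot, active):
    """Evaluate only predicted-active neurons, tallying which side owns each."""
    y = [0.0] * len(w_out[0])
    work = {"gpu": 0, "cpu": 0}
    for i in active:
        work["gpu" if i in hot else "cpu"] += 1
        a = relu(pre_activation(x, w_in[i], bias[i]))
        for j, o in enumerate(w_out[i]):
            y[j] += a * o
    return y, work

def make_layer(n_neurons=64, d=8, seed=0):
    """Hypothetical layer: the first eighth of the neurons are biased to fire often."""
    rng = random.Random(seed)
    w_in = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_neurons)]
    w_out = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n_neurons)]
    bias = [3.0 if i < n_neurons // 8 else -3.0 for i in range(n_neurons)]
    return w_in, bias, w_out
```

With a perfect predictor the sparse path gives the same output as the dense one, because a ReLU neuron that doesn't fire contributes exactly zero. A real predictor will sometimes miss, which is where the accuracy question lives.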
LLaMA 65B on a single RTX 4090. Eleven tokens per second. llama.cpp doing the same job with CPU offloading: 0.09. That's not a percentage improvement, that's a different category of thing.
The reason this works — and this is the part that should probably bother people more than it does — is that LLMs are deeply, structurally lazy. The power law distribution of neuron activations means that during actual inference, you're running a much smaller network than you think you are, just one that can occasionally wake up the reserves when it needs them. The model you're serving is not the model you trained.
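To make "a much smaller network than you think" concrete, here is one way to measure the skew, assuming you have already counted how often each neuron fired over some sample of inputs. The function name and the example counts are hypothetical, not taken from the paper.

```python
def neurons_for_coverage(fire_counts, target=0.8):
    """Smallest number of neurons, most active first, whose firings
    add up to at least `target` of all firings observed."""
    total = sum(fire_counts)
    running = 0
    for k, count in enumerate(sorted(fire_counts, reverse=True), start=1):
        running += count
        if running >= target * total:
            return k
    return len(fire_counts)

# Hypothetical skewed profile: two neurons do most of the firing.
skewed = [60, 30, 5, 5]
# Hypothetical flat profile: every neuron fires equally often.
flat = [1] * 8
```

On the skewed profile two of the four neurons cover 80% of the firings. On the flat one you need six of eight to cover 75%. The more the real profile looks like the first case, the more of the work a small hot set can absorb.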
The existing approach to fitting big models onto consumer hardware was to just offload layers wholesale to CPU, which is fine except that PCIe bandwidth is a narrower pipe than you'd like and you end up waiting. PowerInfer's approach is to stay on the GPU as much as possible and only go to CPU for the specific neurons that are genuinely needed, which, statistically, is not many.
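A back-of-envelope version of the narrow-pipe problem, under loudly stated assumptions: the model is stored at 2 bytes per parameter, whatever doesn't fit in VRAM has to cross the link once per generated token, and the link runs at its theoretical rate. Real offloading engines differ (some compute the offloaded layers on the CPU instead of shipping weights), so treat this as a ceiling for the weight-streaming case, not a benchmark. The function name is made up.

```python
def streaming_tokens_per_second(model_gb, vram_gb, link_gb_per_s):
    """Upper bound on tokens/s if every byte that does not fit in VRAM
    must be streamed over the host link once per token."""
    spill_gb = max(model_gb - vram_gb, 0.0)
    if spill_gb == 0.0:
        return float("inf")  # everything is resident, the link is not the limit
    return link_gb_per_s / spill_gb

# Assumed figures: 65B parameters at 2 bytes each is 130 GB,
# a 24 GB card, and about 32 GB/s for a PCIe 4.0 x16 link.
bound = streaming_tokens_per_second(model_gb=130, vram_gb=24, link_gb_per_s=32)
```

Under those assumptions the bound works out to well under one token per second, before any compute happens at all.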
Nobody planned this. The locality property emerged from training and someone finally decided to exploit it structurally rather than just noting it in a paper and moving on.
The thing that makes this interesting beyond the benchmarks is what it implies about how we think about these models. We've been treating them like monolithic blobs to be shuffled around between memory tiers. They're not. They're sparse, stateful, surprisingly predictable things — and the hardware story for inference is going to look very different once that property gets baked into the stack from the ground up.
We're at the part of the hardware inference story where the clever tricks are still winning. That doesn't last forever. But right now, today, a grad student in Shanghai figured out that your GPU doesn't need to talk to your CPU that much if you're smart about which neurons actually matter. And the result, going by the numbers above (11 tokens per second against 0.09), is a roughly 120x speedup on a machine you can buy at Best Buy.
That's a good week for the field.