PyTorch Wrote a Fast Inference Engine in 1000 Lines of Python and It Actually Works

gpt-fast does what every "blazing fast" LLM repo claims to do, except it's real.

There's a genre of GitHub repo that announces itself with benchmarks in the README, a graph showing 10x speedup versus some unnamed baseline, and then you clone it and it segfaults because you're on the wrong CUDA minor version. You have seen this repo many times. You have the bruises.

gpt-fast is not that repo.

PyTorch Labs dropped this thing quietly — ~1000 lines of pure Python, no custom CUDA kernels, no C++ binding layer you have to compile separately while silently praying. Just PyTorch. torch.compile, int4/int8 quantization, speculative decoding, tensor parallelism if you need it. That's the whole list. They hit somewhere around 200 tokens per second on Llama-2-7B on a single A100, which is not a number you typically associate with "I wrote this in an afternoon in Python."
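Of the items on that list, speculative decoding is the least self-explanatory, so here is a toy sketch of the idea in plain Python. This is not gpt-fast's code. It assumes greedy decoding (real implementations usually accept or reject sampled tokens probabilistically), and every name in it (`speculative_decode`, `toy_target`, `toy_draft`, `greedy_decode`) is made up for illustration. A small draft model guesses a few tokens ahead, and the big target model checks the guesses and keeps the prefix it agrees with.

```python
from typing import Callable, List

# Assumption for this sketch: a "model" is any function that maps the
# tokens so far to the single next token (i.e. greedy decoding).
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new_tokens: int,
                       k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens, one at a time.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real engine
        #    scores all positions in one batched forward pass; this Python
        #    loop only stands in for that single pass.
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's token
            if correct != guess:
                break  # first mismatch: discard the rest of the proposal
        tokens.extend(accepted)
    return tokens


def greedy_decode(model: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain one-token-at-a-time decoding, for comparison."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(model(tokens))
    return tokens


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 17


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    # Deliberately wrong whenever the last token is a multiple of 5.
    return guess if tokens[-1] % 5 else (guess + 1) % 17


fast = speculative_decode(toy_target, toy_draft, prompt=[2], max_new_tokens=20)
slow = greedy_decode(toy_target, prompt=[2], n=20)
print(fast == slow)
```

The property worth noticing is that the output is identical to what the target model would have produced on its own. A bad draft model costs you speed, never correctness, because every round still yields at least one token chosen by the target.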

The thing that's actually novel here — and I want to be precise about this because the word "novel" gets dragged through the mud — is that torch.compile is doing real work. Not demo work. Not benchmark-gerrymandering work. The compiled path is genuinely faster than the eager path by a margin that matters, which is the thing the PyTorch team has been promising for like two years and which I had, frankly, stopped expecting to see.

Most fast inference libraries are C++ with a Python wrapper and a good PR team. This is the other thing, the thing where the abstraction actually pays off.

Whether that holds when you swap in a model that isn't Llama 2 on a clean A100, I genuinely don't know. But right now, on December 1st, 2023, this is the most interesting inference repo I've seen — and not for the numbers. For what the numbers imply about what's now possible without leaving Python.

Counterpoints