Titans and the Graveyard of Transformer Killers

Google drops a new architecture with actual ideas in it, and the question is whether anyone can make it run.

Google has a new architecture paper out — Titans — and the idea is interesting enough that I'm not immediately rolling my eyes, which is a higher bar than it sounds.

The pitch: transformers have short-term memory (attention over a context window), and that's it. Titans adds a long-term memory module that updates via gradient descent at test time — not training, inference. The model is literally learning during the forward pass. They call it "learning to memorize at test time," which is either the most obvious thing anyone has ever said or a genuinely weird inversion of how we think about what inference is supposed to do.
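To make "learning at inference" concrete, here is a toy sketch in plain Python. It is not the Titans implementation. It assumes the memory is a single linear map, updated with one plain gradient step per input on a squared recall error. The paper's actual memory module and update rule are more elaborate, and every name below (recall, memorize, lr) is made up for illustration.

```python
# Toy sketch: a linear "memory" matrix M that is updated by gradient descent
# while the model is being used, not during a separate training phase.
# Assumption: linear memory and a plain gradient step. All names are hypothetical.

def recall(M, key):
    """Read from memory: returns the matrix-vector product M @ key."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, key)) for row in M]


def squared_error(M, key, value):
    """Recall loss: ||M @ key - value||^2."""
    return sum((p - v) ** 2 for p, v in zip(recall(M, key), value))


def memorize(M, key, value, lr=0.25):
    """One test-time gradient step on the recall loss; returns the new memory."""
    error = [p - v for p, v in zip(recall(M, key), value)]
    # d(loss)/dM[i][j] = 2 * error[i] * key[j]
    return [
        [m_ij - lr * 2.0 * e_i * k_j for m_ij, k_j in zip(row, key)]
        for row, e_i in zip(M, error)
    ]


dim = 3
M = [[0.0] * dim for _ in range(dim)]  # empty memory
key = [1.0, 0.0, 0.0]                  # unit-norm key
value = [0.2, -0.4, 0.9]               # what the memory should return for this key

loss_before = squared_error(M, key, value)
for _ in range(5):                     # "inference" steps that also update M
    M = memorize(M, key, value)
loss_after = squared_error(M, key, value)

print(loss_before, loss_after)
```

With a unit-norm key and lr=0.25, each step halves the recall error, so the memory ends up storing the association without any separate training phase. That is the inversion in miniature: the weights of the memory change while the model is in use.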

The part that matters for whether this actually goes anywhere: reproductions. A new architecture lives or dies by whether someone outside the lab can get it to work, and the history here is not encouraging. Remember Mamba? S4 before that? The graveyard of "attention is not all you need" papers stretches back years, each one theoretically compelling, each one generating exactly one conference cycle of excitement before the transformers just kept scaling.

Phil Wang (lucidrains) already has an unofficial PyTorch implementation up, which is the fastest possible signal that a paper is worth paying attention to. He doesn't implement things that aren't interesting. He also doesn't implement things that aren't going to work, mostly because he's too busy implementing the next thing.

The honest answer to whether it holds up is that no one knows yet. Getting a new architecture to reproduce cleanly is one thing; getting it to reproduce at scale, with the same hyperparameters and the same compute budget, is something else entirely. That gap is where ideas go to become footnotes.

Which doesn't mean Titans is a footnote. Just that we're in the part of the story where it's still possible it isn't.

Counterpoints