You're Paying for Context Window You're Not Getting
Greg Kamradt's latest finding confirms what the heatmaps have been screaming: the middle of your context is a graveyard.
The lost-in-the-middle problem has persisted even as context windows have grown to millions of tokens. You're still paying for a context window you're not fully getting. The heatmaps still have a dark trough in the middle.
Greg Kamradt invented the needle-in-a-haystack test (the one where you hide a single sentence inside a massive document and see if the model can find it), and the heatmaps he produced became one of those images that genuinely change how people think. Bright green at the top, bright green at the bottom, and a long dark trough in the middle where information goes to die.
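If you want to see the shape of the test for yourself, the setup is small enough to sketch. What follows is a minimal illustration, not Kamradt's actual harness: the names `insert_needle` and `build_trials` are hypothetical, depth is measured in sentences rather than tokens, and the step where each prompt is sent to a model and scored is left out entirely.

```python
def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Place `needle` at a fractional depth of `haystack`.

    depth=0.0 puts the needle first, depth=1.0 puts it last.
    Assumption: sentences are split naively on "." which is good
    enough for a sketch but will break on decimals and abbreviations.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    sentences = [s.strip() for s in haystack.split(".") if s.strip()]
    position = round(depth * len(sentences))
    sentences.insert(position, needle.strip().rstrip("."))
    return ". ".join(sentences) + "."


def build_trials(haystack: str, needle: str, depths: list[float]) -> list[tuple[float, str]]:
    """Return one (depth, document) pair per requested depth."""
    return [(d, insert_needle(haystack, needle, d)) for d in depths]


if __name__ == "__main__":
    filler = "The sky was grey. The train was late. Nobody minded. Coffee helped."
    fact = "The secret code is 7421."
    for depth, document in build_trials(filler, fact, [0.0, 0.5, 1.0]):
        print(depth, "->", document)
```

Run the same question against each document, record whether the answer contains the planted fact, and repeat across document lengths: that grid of depth against length is what the heatmap plots.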
He's been staring at those heatmaps longer than anyone. So when he says something is important, you should probably believe him.
The finding is this: the context window you're paying for is not the context window you're getting. Models attend heavily to the beginning of a prompt, they attend heavily to the end, and somewhere in the vast middle — the part that contains most of the tokens you've fed them — they're doing something closer to skimming than reading. You send 100,000 tokens. The model leans on maybe a quarter of them. The rest are scenery.
This should be embarrassing for the whole industry, and somehow it isn't — because the marketing keeps moving faster than the evaluation. 128K context. 1M context. Coming soon: infinite context. Nobody in the press release mentions that "context" is doing a lot of creative work in that sentence.
The practical implications are not abstract. Engineers are building RAG pipelines that dump retrieved chunks into the middle of a prompt, sandwiched between a system prompt and a user query, and wondering why the model ignores half of what they retrieved. Prompt engineers are writing careful instructions buried six paragraphs deep in a system prompt, then filing bug reports when the model doesn't follow them. These designs assume the model attends uniformly to everything in the prompt. Uniform attention does not exist.
What you actually want to do — and this feels like advice that should not need to be said — is put the things that matter at the beginning or the end. Treat the middle like a Bermuda Triangle. The model will technically read it. It won't necessarily care.
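In a RAG pipeline, that advice can be applied mechanically. The sketch below is one possible approach, not an established recipe: it assumes your retriever already returns chunks sorted most relevant first, and `reorder_for_edges` and `build_prompt` are hypothetical names. It deals chunks alternately to the front and the back so the weakest ones land in the middle, then states the instructions at the top and repeats them after the question.

```python
def reorder_for_edges(chunks_by_relevance: list[str]) -> list[str]:
    """Move the most relevant chunks to the edges of the list.

    Input must be sorted most relevant first. Chunks are dealt
    alternately to the front and the back, so the least relevant
    ones end up in the middle.
    """
    front: list[str] = []
    back: list[str] = []
    for index, chunk in enumerate(chunks_by_relevance):
        if index % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-best chunk is the very last one.
    return front + back[::-1]


def build_prompt(instructions: str, chunks_by_relevance: list[str], question: str) -> str:
    """Assemble a prompt with instructions at both ends and context between."""
    context = "\n\n".join(reorder_for_edges(chunks_by_relevance))
    return (
        f"{instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        f"Reminder: {instructions}"
    )


if __name__ == "__main__":
    ranked = ["best", "second", "third", "fourth", "worst"]
    print(reorder_for_edges(ranked))
    # ['best', 'third', 'worst', 'fourth', 'second']
```

Whether this helps on your model and your data is an empirical question, so measure recall before and after rather than taking the reordering on faith.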
The cynical read is that context window length is being sold as a capability when it's really a ceiling with a trapdoor in the middle. The generous read is that this is a genuinely hard engineering problem and models are getting better at it — Gemini 1.5 Pro and Claude 3 are both meaningfully better on long-context retrieval than GPT-4 at equivalent stated lengths.
Neither read makes it fine to build systems as if the problem doesn't exist.