
GPT-4 Looking at GPT-2 and Going "Hmm"

OpenAI's new interpretability method uses one language model to explain the neurons of another, which is either a breakthrough or a very expensive mirror.

#interpretability #openai #language-models #mechanistic-interpretability
hindsight — evolved

That 0.14 correlation score was the starting line, not the ceiling. By 2025, Anthropic mapped computational pathways in Claude using attribution graphs, successfully for about 25% of prompts, which is low but non-trivial. MIT Technology Review named mechanistic interpretability a 2026 breakthrough technology. The field went from curiosity to the thing everyone agrees matters and nobody can fully do yet.

OpenAI put out a paper where they use GPT-4 to generate natural-language explanations of individual neurons in GPT-2. Show GPT-4 the tokens that make a neuron fire, it writes you a little story about what that neuron "does," then it tries to predict the neuron's activations using only that story, to see if the story holds up. The simulation score, the correlation between predicted and actual activations, is how they measure whether the explanation is any good.

The average score across all neurons is 0.14.

That's on a -1 to 1 scale. They scale this to hundreds of thousands of neurons and the number that comes back is: barely above noise. Which they acknowledge, to their credit. The paper is unusually honest about the fact that most of the explanations it generates are not very good.
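For a feel of what that number measures, here is a minimal sketch of a correlation-based score. The function name `simulation_score` and the activation values are mine, made up for illustration, and this is plain Pearson correlation rather than the paper's actual scoring code.

```python
import math

def simulation_score(predicted, actual):
    """Pearson correlation between simulated and real activations."""
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return 0.0  # assumption: treat a constant series as uncorrelated
    return cov / math.sqrt(var_p * var_a)

# Made-up per-token activations for illustration, not data from the paper.
actual = [0.0, 0.1, 3.2, 0.0, 2.7, 0.2]
good_guess = [0, 0, 9, 0, 8, 1]  # fires on the same tokens, different scale
bad_guess = [5, 0, 1, 6, 0, 4]   # fires on the wrong tokens

print(round(simulation_score(good_guess, actual), 2))
print(round(simulation_score(bad_guess, actual), 2))
```

Correlation ignores scale, so a guess only has to get the shape right: high where the neuron is high, low where it is low. An average of 0.14 means that, for most neurons, even the shape is mostly missed.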

The part that sticks with me is the recursive quality of it: you need a more powerful model to explain a less powerful model, which means to explain GPT-4 you'd need something bigger, and so on, turtles all the way down. Interpretability as a service that only works if you can outsource it to a smarter thing that remains unexplained. This is fine. Everything is fine.

What's actually interesting here, and what doesn't get enough credit, is the automated pipeline itself. They can now generate an explanation for every neuron in GPT-2 XL without a human in the loop. The explanations are mostly bad. But you have them. You can search them, filter them, build on them — the scaffolding exists even if the insights are thin.
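To make "search them, filter them" concrete, here is a rough sketch of what that could look like once every neuron has an explanation and a score attached. The `NeuronExplanation` record, the `search` helper, and the sample entries are all hypothetical, invented for this example; they are not the format OpenAI released.

```python
from dataclasses import dataclass

@dataclass
class NeuronExplanation:
    layer: int
    index: int
    explanation: str
    score: float  # simulation score, between -1 and 1

def search(records, keyword, min_score=0.0):
    """Return records whose explanation mentions keyword, best score first."""
    kw = keyword.lower()
    hits = [r for r in records
            if kw in r.explanation.lower() and r.score >= min_score]
    return sorted(hits, key=lambda r: r.score, reverse=True)

# Invented entries for illustration only.
records = [
    NeuronExplanation(3, 101, "references to years and dates", 0.62),
    NeuronExplanation(17, 42, "words related to dates or possibly fruit", 0.09),
    NeuronExplanation(25, 7, "closing quotation marks", 0.81),
]

for r in search(records, "dates", min_score=0.3):
    print(r.layer, r.index, r.explanation, r.score)
```

The score threshold is the useful part: it lets you set aside the bulk of weak explanations and look only at the ones that held up under simulation.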

The score of 0.14 isn't a failure of the method so much as an honest measurement of how little we currently understand about what's happening inside these things. Most neurons are doing something, and we can't really say what. GPT-4 squints at them and ventures a guess and the guess is usually wrong. Same as the rest of us.