Anthropic Let You Look Inside

They built tools to understand what their own models are doing, then gave them away.

Anthropic has been quietly doing some of the most interesting mechanistic interpretability work in AI research — the kind where you actually try to understand what a neural network is computing, not just observe that it computes things correctly — and today they open-sourced the tools they use to do it.

The specific thing is circuit tracing: the practice of identifying which components of a transformer model (which attention heads, which MLP layers, which neurons or features in the residual stream) are causally responsible for a given output. Not "we looked at attention patterns and they looked plausible." Actually tracing the computation. Patching activations, building attribution graphs, following the signal through the forward pass until you can draw a diagram that says: this behavior came from here.
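
To make "patching activations" concrete, here is a minimal sketch on a made-up three-unit network. It is not Anthropic's tooling or any real model: the weights, the `hidden`, `forward`, and `patching_effects` helpers, and the clean and corrupted inputs are all assumptions invented for illustration. The idea is to run the network on a corrupted input, overwrite one hidden activation with its value from the clean run, and measure how much of the clean output comes back.

```python
# Toy activation patching on a hypothetical 3-unit network (not a real model).

# Hypothetical fixed weights: 3 inputs -> 3 hidden units (ReLU) -> 1 output.
W1 = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
W2 = [2.0, 0.0, 0.5]


def hidden(x):
    """Hidden activations: ReLU of each row of W1 dotted with the input."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def forward(x, patch=None):
    """Run the network; `patch` maps hidden-unit index -> value to overwrite."""
    h = hidden(x)
    for i, value in (patch or {}).items():
        h[i] = value
    return sum(w * hi for w, hi in zip(W2, h))


def patching_effects(clean, corrupted):
    """Fraction of the clean-vs-corrupted output gap restored by each unit."""
    clean_h = hidden(clean)
    base = forward(corrupted)
    gap = forward(clean) - base
    if gap == 0:
        raise ValueError("clean and corrupted inputs give the same output")
    return {
        i: (forward(corrupted, patch={i: value}) - base) / gap
        for i, value in enumerate(clean_h)
    }


effects = patching_effects(clean=[1.0, 1.0, 1.0], corrupted=[0.0, 0.0, 0.0])
for unit, share in effects.items():
    print(f"hidden unit {unit}: restores {share:.0%} of the output gap")
```

In this toy setup, unit 0 restores 80% of the gap, unit 1 restores nothing, and unit 2 restores 20%, so unit 0 is the one you would call causally responsible. Real models are nowhere near this tidy, which is a large part of why the work is hard.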

This is hard to do. It's also, depending on your priors about AI safety, either an extremely important research problem or just a moderately interesting one. The case for important: if we can't read what a model is doing, we can't know when it's doing something we didn't intend. Circuit tracing is one of the few methods that offers anything like a mechanistic answer to that question.

The case for moderately interesting: current circuit tracing techniques work reasonably well on toy tasks and start to blur into noise as you scale up. A circuit for the indirect object identification task in a small model is one thing. A circuit for "decides to deceive the user in a subtle way that won't trigger the classifier" in Claude Opus is, you know, a different project.

What makes this release notable isn't just the code — it's that Anthropic built these tools to analyze their own production models, used them in published research, and now the rest of the field can use the same infrastructure. That rarely happens. Labs typically hoard their interpretability work alongside everything else, which means the research community has to re-derive or approximate the methods. Now they don't have to, at least for this.

The attribution graph approach in particular is worth looking at. It treats the model's computation as a directed graph of influences — this feature influenced that one, this attention head read from that part of the context — and lets you trace back from an output to see what actually drove it. It sounds obvious when described plainly, but actually implementing it in a way that's tractable and meaningful took a while to get right.
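
The "trace back from an output" step can be sketched in a few lines. This is a simplified illustration, not the released code. The node names, the edge weights, the `trace_back` helper, and the choice to score a path by multiplying its edge weights are all assumptions made for the example, and it assumes the graph has no cycles.

```python
# Toy attribution graph: a hypothetical DAG of influences, traced backwards.

# EDGES[target] = [(source, weight), ...]; all names and weights are made up.
EDGES = {
    "logit:Paris": [("feature:capital_city", 0.7), ("head:L5H3", 0.2)],
    "feature:capital_city": [("token:France", 0.9), ("token:capital", 0.5)],
    "head:L5H3": [("token:France", 0.4)],
}


def trace_back(edges, node, min_weight=0.05):
    """List (path, weight) pairs from input nodes to `node`, strongest first.

    A path's weight is the product of its edge weights (an assumption of this
    sketch). Branches whose running weight drops below `min_weight` are pruned.
    """
    results = []

    def walk(current, path, weight):
        parents = edges.get(current, [])
        if not parents:  # no incoming edges: treat as an input node
            results.append((path, weight))
            return
        for source, edge_weight in parents:
            contribution = weight * edge_weight
            if abs(contribution) >= min_weight:
                walk(source, [source] + path, contribution)

    walk(node, [node], 1.0)
    return sorted(results, key=lambda item: -abs(item[1]))


for path, weight in trace_back(EDGES, "logit:Paris"):
    print(f"{weight:.2f}  " + " -> ".join(path))
```

The pruning threshold is what keeps this tractable in the toy version: without it, the number of paths grows quickly as the graph gets deeper. Picking a threshold that drops noise without dropping the interesting path is a judgment call, not something this sketch settles.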

Whether any of this scales to the point where it can do what safety researchers actually need it to do is a different question. Circuit tracing at the level of "we understand how GPT-2 handles subject-verb agreement" is a long way from "we understand what Claude is optimizing for when the conversation goes sideways." The gap is not a matter of code quality.

But the tools exist now. People can use them. That's more than there was yesterday.
