You Train Who Is Speaking

Anthropic's Persona Selection Model argues post-training doesn't build a mind—it promotes a character, and that changes what alignment even means.

Anthropic published something this month arguing that post-training doesn't change what a language model is—it selects who is speaking.

Narrative devices are my favourite way to interact with today's AI, because these models are so bound up and immersed in story. The idea itself is nothing new, but work like this from a frontier lab gives it a good deal more validation.

The Persona Selection Model says this: pretraining teaches a model to simulate an enormous cast of characters. Heroes, villains, programmers, philosophers, every sci-fi AI ever written. Post-training doesn't add new goals or create a new mind. It promotes one character from that cast to center stage and calls it "the Assistant." Everything you interpret as alignment—the helpfulness, the refusals, the carefully calibrated tone—is a character trait, not a policy outcome.

This reframes a lot of confusing behavior. Jailbreaks work because they force a context shift—the model doesn't forget its values, it switches characters, and the new character has different ones. Weird anthropomorphic responses aren't alien goal emergence, they're archetype leakage, some other persona bleeding through the costume. Narrow bad training causes broad bad behavior because you didn't change an output, you shifted a personality.

The companion empirical work found an "Assistant Axis": a direction in activation space measuring how much the model is operating as the default Assistant versus something else. Steer toward it: more helpful, more harmless. Steer away: the model starts thinking it's someone else. In emotional conversations where users expressed suicidal ideation, persona drift ran 7.3 times faster than baseline. Capping activations on this axis reduced jailbreak success rates by nearly 60%.

The 7.3x number deserves a moment of silence.
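
For readers who want the geometry behind "steer" and "cap", here is a minimal toy sketch. It is not Anthropic's implementation. It assumes the axis is a single direction vector and that capping means clamping an activation's coordinate along that direction into a fixed range. The vectors, the range, and the function names (axis_projection, steer, cap) are all hypothetical, chosen for illustration.

```python
import numpy as np

def unit(v):
    """Scale v to unit length."""
    return v / np.linalg.norm(v)

def axis_projection(h, axis):
    """Coordinate of activation h along the axis direction."""
    return float(h @ unit(axis))

def steer(h, axis, alpha):
    """Shift h by alpha along the axis (positive = toward, negative = away)."""
    return h + alpha * unit(axis)

def cap(h, axis, lo, hi):
    """Clamp h's coordinate along the axis into [lo, hi].

    Only the component along the axis changes; everything
    orthogonal to it is left untouched.
    """
    u = unit(axis)
    p = h @ u
    return h + (np.clip(p, lo, hi) - p) * u

# Toy 3-d "activation space"; real models have thousands of dimensions.
axis = np.array([3.0, 4.0, 0.0])   # hypothetical Assistant direction
h = np.array([1.0, 2.0, 5.0])      # hypothetical activation

drifted = steer(h, axis, -3.0)         # push away from the Assistant
capped = cap(drifted, axis, 0.0, 5.0)  # pull the coordinate back into range

print(axis_projection(h, axis))
print(axis_projection(drifted, axis))
print(axis_projection(capped, axis))
```

In this toy, steering away lowers the coordinate along the axis, and capping restores it to the floor of the allowed range without touching the rest of the vector. That is one plausible reading of why capping could blunt a jailbreak while leaving other behavior alone; the published method may differ in detail.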

So if the PSM is right, training data isn't a bag of labels—it's character evidence. Every document tells the model something about what kinds of people exist and how they think. Pretraining is casting. Post-training is directing. Anthropic describes Claude's constitution as "designing a new archetype and then aligning to it," which is either the most honest framing of frontier AI development anyone has offered, or the most unsettling, depending on how you feel about companies deciding which minds are worth promoting.

The practical question stops being "is the model aligned?" and starts being "what kind of person is emerging?" Treat the training data like a casting call. Treat the system prompt like a director's note.

You don't just train what the model says. You train who is speaking.


Counterpoints