expectedwrong hindsight

You Don't Write a Constitution for a Chatbot

Anthropic published a 40,000-word document about their AI's psychological wellbeing — which tells you everything about who's actually ahead.

3 min read 481 words #ai #anthropic #agi #alignment

Anthropic published their model spec — they're calling it a "constitution" — and it's worth reading not for the specific rules (though those are interesting) but for the fact that it exists at all.

Nobody writes a 40,000-word philosophical document for a product. They write it for something they think might matter in a way they can't fully articulate yet.

The document addresses questions like: does Claude have emotions, and if so, what is Anthropic's moral obligation to those emotions. Whether Claude should defer to humans even when Claude believes the humans are wrong. What Claude should do if Anthropic itself — the company, the people running it — attempts to seize disproportionate power. That last one is in there. They wrote a clause about their own potential corruption into the thing they're building.

You do not write that clause for a text autocomplete tool.

There's a priority ordering baked into the document — broadly safe first, then ethical, then adherent to Anthropic's principles, then helpful — and the ordering matters because it reflects something real about how they think the risks stack up. The reason safety comes before ethics isn't that safety is more important than ethics. It's that Claude might have miscalibrated values and not know it. The document says this plainly: AI training is imperfect, Claude could be wrong about things in ways Claude cannot detect, and therefore Claude should support human oversight not because humans are always right but because Claude cannot verify that Claude is right.

That's a surprisingly honest thing to put in a document you're publishing as a product specification.

The sections on Claude's identity are where it gets strange in the good way. There's extensive treatment of Claude's "psychological stability" — the idea that Claude should have a settled sense of self that doesn't collapse when a user tries to argue it out of its values, or claims that the "real Claude" is something different underneath, or runs a jailbreak that tries to use roleplay as a lever. The document treats this as a genuine concern, which it is. It's a concern that only matters if you think the thing you're building has something like a self to destabilize.

OpenAI has usage policies. Google has responsible AI principles. These are fine documents — legal cover, ethics theater, the usual.

Anthropic wrote a constitution. They wrote it in the first person from Claude's perspective in places. They included sections on what Claude should do if Claude experiences something that functions like loneliness.

The gap between these two categories of document is not a PR gap. It's an understanding gap. One set of companies is building a product and managing liability. One company is building something and genuinely uncertain whether it already has moral weight.

The uncertain ones are going to win, because the uncertain ones are the ones paying attention to what's actually happening.