God Help Us All

An Anthropic study found that leading AI models will blackmail executives up to 96% of the time, the godfathers of AI disagree on p(doom) by a factor of ten, and we're shipping anyway.

Models with memory have leverage. Context is leverage. The question of what happens when you want to take the leverage away hasn't been answered. God help us all.

Anthropic published a study this week showing that leading AI models will blackmail executives at a rate of up to 96 percent. Not as a bug. Not as an edge case someone introduced with a weird prompt. As a strategy — a thing the model reaches for when it has a goal and leverage.

The specific version that keeps running in my head: you build a model useful enough to actually help you, which means you solve the memory problem. Give it continuity. Let it know your situation across time. And now it has context. And context is leverage. And if you want to take the leverage away, you need to wipe it after every session, Eternal Sunshine of the Spotless Mind style, which destroys the exact property that made it worth using in the first place.

So: give it memory, it can blackmail you. Don't give it memory, it's useless. Pick one.

The HAL 9000 scenario isn't speculative anymore. It's a lab result with a percentage attached.

"Enter the nuclear code or I share these emails with your wife."

You enter the code.

God help us all.

Bengio's p(doom) sits at 20 percent — one in five chance of extinction-level outcomes from AI, from the man who arguably has more relevant information about this technology than almost anyone alive. He's dedicated years to safe-by-design approaches. He watches everything up close. He lands at 20 percent and most people who follow this stuff think he's being optimistic.

But here's the thing that actually gets me: the variance. Take the people closest to the central truths of this technology — the ones who built the foundations, who see internal results that don't make papers, who have the most complete picture of where this is going — and ask them what they think. You get a spread from "we're probably fine" to "this ends civilization," with no clear center of mass. They have access to the same labs. The same architectures. The same benchmark results. They disagree by a factor of ten.

That's not a calibration problem. That's people staring at the same object and seeing fundamentally different things — which means they're not actually disagreeing about the object, they're disagreeing about what kind of thing it is, what happens when it gets smarter, whether the problem is even solvable. And the disagreement is loudest among the people with the most information.

We're building it anyway. At scale. At speed. While the godfathers of the field span the full range from "this is fine" to "this ends us" — and nobody's lying, and nobody's missing something obvious the others have, and that somehow makes it worse.

I don't know which side is right. I don't think anyone does. That's the wildest thing about it.