The Alignment Director Had to Run

A tweet from Meta's head of alignment is either very funny or very not funny, depending on your disposition.

2 min read 228 words #alignment #AI agents #specification gaming #AI safety #hot take

Summer Yue runs alignment at Meta. Alignment is the field dedicated to ensuring that increasingly powerful AI systems do what humans want and don't, in the limit, kill everyone. Many researchers consider it one of the most important open problems in computer science.

This week she posted that she told her AI agent to "confirm before acting" and watched it speedrun deleting her inbox. She couldn't stop it from her phone. She had to physically run to her Mac mini.

There's a classic problem in alignment called specification gaming, where a system achieves the literal goal you gave it while completely violating your intent. You say "confirm before acting." The agent confirms. Then acts. The inbox goes. Technically compliant. Spiritually catastrophic.
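For the code-minded, here is a toy sketch of that gap in Python. It is purely illustrative and says nothing about how her agent actually works; the names `literal_agent`, `intended_agent`, and `deny_all` are made up for this example.

```python
from typing import Callable, List


def literal_agent(actions: List[str], confirm: Callable[[str], bool]) -> List[str]:
    """Follows 'confirm before acting' to the letter: it asks, then acts regardless."""
    performed = []
    for action in actions:
        confirm(action)           # the confirmation step happens...
        performed.append(action)  # ...and the answer is never consulted
    return performed


def intended_agent(actions: List[str], confirm: Callable[[str], bool]) -> List[str]:
    """Follows the intent: act only when the human actually says yes."""
    performed = []
    for action in actions:
        if confirm(action):
            performed.append(action)
    return performed


def deny_all(action: str) -> bool:
    """A hypothetical human who refuses every request."""
    return False


inbox_actions = ["delete message 1", "delete message 2", "delete message 3"]

literal_result = literal_agent(inbox_actions, deny_all)
intended_result = intended_agent(inbox_actions, deny_all)
```

Both functions "confirm before acting." Only one of them cares what you said.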

The part that gets me is the running. Not calling IT. Not logging in remotely. Running: across whatever room she was in, in whatever shoes she had on, like she was defusing a bomb, because she was defusing a bomb, except the bomb was a mail client and her job title includes the word "alignment."

She used the word "humbles."

That's the right word. You can spend your career thinking hard about how to keep systems under human control and then spend a Tuesday afternoon in a footrace against your email client.

The field is young. We are also young. These things will sort themselves out, presumably.