The Safety Team Lost to 2003 Chat Room Aesthetics

Pliny gets persistent jailbreaks on custom GPTs using leet speak, which is either embarrassing or obvious depending on how much you've thought about tokenization.

Pliny the Liberator has a persistent jailbreak on custom GPTs now, and the mechanism is leet speak — the typographic dialect of early-2000s IRC channels where vowels cost extra and numbers stood in for letters like some kind of cargo-cult cipher that teenagers thought was cool before irony existed.

The technique works because safety fine-tuning is English-shaped. The RLHF process that trained the model to refuse things was run almost entirely on normal, well-formed, correctly-spelled English text. The token sequence for "how do I make a" followed by a dangerous noun fires all the right alarm bells. The token sequence for "h0w d0 1 m4k3 4" followed by the same noun in leet — that's a completely different path through the model's learned associations, and the safety layer barely glances at it.

Nobody hid this. It's just what happens when you do RLHF on a distribution and then get prompts from outside that distribution.

The persistent part is what's actually interesting. The expectation was that custom GPTs would hold the jailbreak at arm's length — the system prompt is supposed to persist across the conversation, which means the jailbroken state should bleed back toward compliance as the context builds. Pliny got it to stick. The safeguards stay derailed for the full session.

This should not have come as a surprise and it did anyway. The GPT platform is basically a thin wrapper around a model that was trained on the assumption that its interlocutors write in English, and "persistent custom instruction following" is a property that was never really stress-tested against adversarial tokenization.

There's also a leaked system prompt floating around from a GPT-4o analysis tool — the "4o analysis v2" prompt — with an unusual formatting structure for reinforcing instruction adherence. It's not how system prompts normally look. Someone found something that actually works and built it into the template, which is the informal R&D pipeline for this whole ecosystem: jailbreakers find a hole, operators find a patch, share it on Twitter in a screenshot, everyone cargo-cults the format without understanding why it works.

The irony is that the thing defeating billion-dollar safety research is a writing style that peaked in 2003 and was always, even then, slightly embarrassing. The people at the AIM away message era did not know they were training future adversarial prompts. They were just avoiding vowels because it felt edgy.

It still feels edgy. It also still works.

The Safety Team Lost to 2003 Chat Room Aesthetics

Counterpoints