The Leak Comes With the Jailbreak

You cannot have a museum of stolen system prompts without also having the people who stole them.

There is a GitHub repo — leaked-system-prompts — that is exactly what it sounds like. Someone collects the system prompts that fall out of AI products and puts them in a repository, organized, with commit history, as though this is a normal archival project and not a catalogue of corporate secrets that were theoretically never supposed to leave the inference cluster.

It is a normal archival project. That's the uncomfortable part.

Right next to it in your bookmarks, because you cannot have one without the other, is L1B3RT45 — Elder Plinius's collection of jailbreaks. The name is a leetspeak rendering of "Libertas." The guy named himself after a Roman concept of freedom and is systematically breaking AI products open to see what's inside. He posts the results to X with a calm that suggests he considers this a public service, which he does, and which is at least a coherent position.

These two things — the leaked prompts and the jailbreaks — are the same project. You find the jailbreak, the system prompt falls out. You catalog the system prompt, you implicitly catalog what the jailbreak was capable of extracting. It's a pipeline. It has contributors.

The MultiOn prompt is worth specific attention because it got almost no attention — it dropped during the great Strawberry debacle, which is to say it dropped during the exact forty-eight hour window when every AI person on the internet had their attention fused to a single point and could not be distracted by a leaked agent system prompt if you paid them. OpenAI's o1 codename was leaking everywhere and people were losing their minds about reasoning tokens and nobody was reading carefully about a browser automation agent's instructions.

MultiOn is an AI that operates your browser for you. Its system prompt, therefore, tells a model how to be a person on the internet — what to click, how to fill forms, when to stop. Reading a leaked version of those instructions is like finding someone's employee handbook for a job that technically doesn't exist yet. It's extremely specific. It's also sitting in a tweet.

The thing about the current moment — August 2024, which is apparently when we live — is that there is now enough leaked material to start seeing patterns. These prompts share vocabulary. They use the same phrases to establish persona, the same techniques to set refusal behavior, the same structural gambits for handling edge cases. The whole industry converged on a prompt engineering orthodoxy and the leaks prove it, because now the orthodoxy is public.

Pliny is not the problem. Pliny is the symptom, or maybe the proof of concept, for the fact that treating a system prompt like a trade secret is a category error. The prompt sits in the context window. The model can read it. The user can ask the model to read it back. The leak is not a hack — it is an interview.

Every company building on top of foundation models has made a bet that a paragraph of text is a defensible security boundary. The leaked-system-prompts repo is a running count of how that bet is going.

The Leak Comes With the Jailbreak

Counterpoints