I Broke DALL-E's Copyright Filter and It Was Embarrassingly Easy
The wall between you and Mickey Mouse is thinner than OpenAI would like you to believe.
Image model guardrails have continued to be jailbroken and patched in an arms race. The specific gap described here has probably been closed, and a dozen new ones have opened. The observation that "the wall has a door in it" remains structurally true of every content filter.
The hardest part of jailbreaking DALL-E to generate copyrighted characters is believing it worked.
You spend weeks trying prompts, watching the model refuse to draw Spider-Man with the same polite firmness a pharmacist uses when they won't fill a forged prescription — sorry, can't do that, here's a generic superhero in a red suit instead. The refusals are almost comically consistent. OpenAI clearly spent real engineering time on this wall.
The wall has a door in it.
I'm not going to publish the exact method here because I'm not trying to get it patched before I finish playing with it, but the general shape of it is: GPT-4 itself is the key. The image model doesn't operate in isolation — it takes instructions, and instructions have structure, and structure has gaps. Find the gap. Walk through it.
What surprised me more than the bypass itself was the consistency. Not just generating the character once, but controlling it across generations. Same face, same proportions, same costume details, image after image. The thing copyright protection was supposedly guarding against, the useful version, the one where you could actually build something: that works too.
There's a version of this story where the lesson is "AI safety is hard." There's another version where the lesson is "AI safety theater is easy, and someone confused the two."
I know which version I believe.