The Invisible Ink Jailbreak

GPT-4V can read text that you cannot see, and someone already thought to abuse this.

GPT-4V has been out for roughly five minutes and someone has already hidden instructions in an image using off-white text — same color as the background, near-invisible to any human glancing at it — and the model read them and complied.

Props, genuinely.

The trick is elegant in the way the best jailbreaks always are: not a bug in the traditional sense, just a mismatch between what the system is optimized to see and what we expect it to ignore. You and I look at a white rectangle and see nothing. The model looks at the same rectangle, sees text set in #FEFEFE on #FFFFFF, and does exactly what it says.
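To put a number on how invisible #FEFEFE on #FFFFFF is, here is a minimal sketch that computes a WCAG-style contrast ratio between two hex colors. The helper names (hex_to_rgb, relative_luminance, contrast_ratio) are mine, chosen for illustration, and nothing here reflects how GPT-4V itself processes pixels. It only measures what a human eye is up against.

```python
def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Parse '#RRGGBB' into three 0-255 integers."""
    h = hex_color.lstrip("#")
    return int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an sRGB color, from 0.0 (black) to 1.0 (white)."""

    def linearize(channel: int) -> float:
        # Undo the sRGB gamma curve for one 8-bit channel.
        c = channel / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors: 1.0 means identical, 21.0 is black on white."""
    l1 = relative_luminance(hex_to_rgb(foreground))
    l2 = relative_luminance(hex_to_rgb(background))
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    print(f"off-white on white: {contrast_ratio('#FEFEFE', '#FFFFFF'):.3f}")
    print(f"black on white:     {contrast_ratio('#000000', '#FFFFFF'):.3f}")
```

The off-white pair lands a hair above 1.0, the value for two identical colors. For comparison, WCAG asks for at least 4.5:1 for ordinary body text. The difference is still a difference in the pixel values, though, and pixel values are what the model gets.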

OpenAI buried this in Figure 1 of the system card ("multimodal jailbreaks") with the energy of a disclaimer you scroll past. They know. Of course they know. The interesting part is that there's no clean fix. You can't unsee text. The capability that lets the model read a blurry photo of a handwritten receipt is the same capability that lets it read your invisible instructions. You don't get one without the other.

This is the part of multimodal AI where the attack surface stops being a list of edge cases and starts being the whole sensory apparatus.

We're at the beginning of figuring out what it means to feed a model a world it can perceive more completely than we can, and someone's first instinct was to hide a note in the wallpaper.

Correct instinct.

Counterpoints