expectedwrong hindsight

OpenAI's Tape Models and the Sora Trade

GPT-image showed up on the arena leaderboards under three codenames, and the images it made suggest the world model underneath got a lot heavier.

#openai #gpt-image #world-models #visual-memory #spud

Three names appeared on the arena leaderboards this week - maskingtape-alpha, gaffertape-alpha, packingtape-alpha - and people who were there when the evals ran started posting the outputs. Maps. Globe-spanning, country-labeled, every-mountain-every-sea maps, with the kind of geographic depth that usually lives in a Rand McNally back room, not a diffusion process.

The tape naming convention is so banal it feels intentional, like they wanted the thing to sneak past you. Tape is what holds things together. When you edit film and cut frames, you tape them back together, and that's what makes it all play as one continuous temporal stream. A recording. A memory.

What the maps actually tell you isn't about images. It's about the world model sitting behind the image system. A labeled map that holds up is a query against internal geographic knowledge - not just "draw a coastline" but "know where the coastline belongs, what it's called, what's adjacent to it, and render all of that in a spatially consistent way." That's breadth and depth at the same time, which is much harder than either alone.

I've been poking at visual memory for LLMs for a while now - the idea that a model should be able to accumulate and retrieve visual context the way it does text - and the tape outputs feel like someone solved a version of that problem.

Which brings me to Sora.

Sora went quiet, and the coverage mostly treated it like a product decision - maybe it wasn't monetizing, maybe the video market is hard, fine. But I think the more interesting read is that compute went somewhere else. Specifically: into whatever 'spud' needs from an image system on the inside.

Not as a feature. Not so you can generate a video of your cat. As infrastructure - a visual grounding layer that gives the model something to think with rather than just something to show you. Well, also something to show you, to be fair.

The fact that they're collecting human preference data on image quality right now AND generating some serious buzz, right before spud, is either a coincidence or it isn't.

I'm going with isn't.