
The Glass of Wine Problem Is Dead

GPT-4o's image generation dropped on April 1st, which, sure, fine.

3 min read 561 words #ai #image-generation #gpt-4o #architecture #openai
hindsight — nailed it

The glass of wine problem really is dead. Image generation inside chat models became the default, not the exception. The ten-frames-per-second prediction was directionally correct — real-time steerable generation arrived within months.

It came out on April Fools' Day. I want you to appreciate that. The most consequential image generation release in years, available to free users and running in the same model you can just talk to, dropped on April 1st like a dare.

I spent a few minutes assuming it was a joke. Then I watched it generate things and stopped assuming.

Here's the thing that matters most before we get into the details: this takes about thirty seconds per image right now, for a free user, on OpenAI's dime. And we know what more compute does to inference speed. Someone at that company has already seen a version of this that runs at ten frames per second, and at ten frames per second, what you have is not an image generator. What you have is a video you can direct, with real-world physical constraints, through conversation, by just talking to it.
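For a sense of scale, here is the back-of-envelope arithmetic behind that jump. It is a rough sketch that uses only the two figures above (about thirty seconds per image, ten frames per second) and assumes, purely for illustration, that each frame costs as much as one full image does today. The function name is mine, not anything from OpenAI.

```python
# Back-of-envelope only: both numbers are the rough figures quoted in this post.
SECONDS_PER_IMAGE_TODAY = 30.0   # approximate wait for one image, free tier
TARGET_FRAMES_PER_SECOND = 10.0  # the "video you can direct" threshold


def required_speedup(seconds_per_image: float, target_fps: float) -> float:
    """Return how many times faster generation must get to reach target_fps.

    Assumes every frame costs as much as a full image does today, which is
    an illustrative simplification, not a claim about how the model works.
    """
    # Frame budget at target_fps is 1 / target_fps seconds, so the ratio
    # of current time to budget is seconds_per_image * target_fps.
    return seconds_per_image * target_fps


speedup = required_speedup(SECONDS_PER_IMAGE_TODAY, TARGET_FRAMES_PER_SECOND)
print(f"Needed speedup: about {speedup:.0f}x")
```

That is a gap of a few hundred times, which is large but is the kind of gap that compute and engineering have closed before.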

Sam Altman has seen real-time 4o image generation. I'm nearly certain of this. I don't know what to do with that information.

But the actual headline, the thing that should be on the front page, is the glass of wine.

If you've spent any time with diffusion models — Stable Diffusion, Midjourney, DALL-E, any of them — you know the glass of wine problem. A full glass of liquid, correctly rendered, with the meniscus doing what a meniscus does and the light bending through it properly. For years this was a litmus test that every model failed in the same direction. The glass always looked wrong. The liquid was always wrong. It was a tell. You could always see the seam between what the model understood and what it was pretending to understand.

That's gone now. Not improved — gone. This thing handles physical geometry like it was never a problem, because for this model, apparently, it wasn't.

There are still cracks. Every watch it generates reads 10:10. That's not an accident: 10:10 is the most photographed watch position in advertising history, so it dominates the training data and the model has absorbed it as the default. Watch and clock faces are frozen at that time and probably will be for a while. This is a limitation worth knowing about if your work involves timepieces or anything where precise instrument readings matter.

But for everything else (the architectural stuff, the interiors, the sketching pipeline) it's almost unreasonably good. I ran a 432 Park interior through it with this prompt: "make the interior furnishing and design midcentury modern; do not change any architectural details." It did it. It botched a little of the window finishing detail, but that was a single pass with no iteration. The geometry held. The prompt stuck. The room looked like it belonged to a person who had opinions about Eames chairs and meant them.

That's the thing that's hard to articulate about why this is different. Previous generators were impressive in isolation and fell apart under specificity. This one seems to get more coherent as the prompts get more demanding. Hyper-prompt-cohesive is the phrase I keep coming back to — it's doing something closer to understanding than matching.

The sketch pipeline works. The interior visualization works. The physical rendering works. And it's free, today, and it's slow, today, and the people who built it have already seen the fast version.

That's a lot to drop on April 1st.