Text-to-Video Is Where Image Gen Was Before It Was Good

There is a four-second clip of a man walking on a beach — generated entirely from text, no source footage — where the man walks normally for about two seconds and then his legs negotiate a separate peace from the rest of his body. The water behind him moves like water usually does. His face is fine. His knees have simply decided to be knees in a different way.

This is Runway Gen-2. This is the state of the art.

I want to be clear that I mean this seriously. What Runway shipped is genuinely remarkable — not in a "considering the circumstances" way, but in a straight, unqualified way. A year ago text-to-image was producing surreal soup and now Midjourney is being used for book covers and magazine spreads. The trajectory from soup to spine-chilling competence took about fourteen months. Text-to-video is in the soup phase right now, which means you can either make peace with the haunted knees or check back in 2024.

The open-source side has ModelScope, which is Alibaba's DAMO Academy doing what Alibaba's DAMO Academy does — shipping a functional thing quietly while everyone is watching the American companies. The outputs are different-flavored rough. Less "man dissolves into the concept of man," more "someone applied a dream filter to stock footage that never existed."

Then there's Text2Video-Zero, which is the most interesting of the three approaches if you care about how things work rather than just what they produce. The idea: Stable Diffusion already understands the world. It has opinions about what fire looks like, what running looks like, what falling looks like, all baked in from billions of images. Text2Video-Zero repurposes that spatial knowledge as temporal knowledge by making frames aware of each other through cross-frame attention, with no video training data at all. A model that was never trained on video is making video. The outputs are appropriately unhinged, but the idea is either very clever or a preview of the obvious correct path, and I can't tell which yet.
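If "cross-frame attention" sounds abstract, here is a toy version in NumPy. It is a sketch of the general idea, not Text2Video-Zero's actual code: the function names are mine, the shapes are made up, and I'm assuming the shared anchor is the first frame. Ordinary self-attention lets each frame look only at itself. The cross-frame variant keeps each frame's own queries but has every frame read keys and values from the anchor frame, so all frames draw their appearance from the same source.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    """Per-frame attention. q, k, v have shape (frames, tokens, dim)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_frame_attention(q, k, v, anchor=0):
    """Each frame keeps its own queries but attends to the anchor frame's
    keys and values (hypothetical helper; anchor=0 is an assumption)."""
    k_anchor = np.broadcast_to(k[anchor], k.shape)
    v_anchor = np.broadcast_to(v[anchor], v.shape)
    return self_attention(q, k_anchor, v_anchor)

# Toy input: 4 frames, 6 tokens per frame, 8 channels per token.
rng = np.random.default_rng(0)
frames, tokens, dim = 4, 6, 8
q = rng.normal(size=(frames, tokens, dim))
k = rng.normal(size=(frames, tokens, dim))
v = rng.normal(size=(frames, tokens, dim))

out = cross_frame_attention(q, k, v)
```

Every row of `out` is a weighted average of the anchor frame's value vectors, which is the whole trick: later frames can rearrange what the first frame contains, but they can't invent a new man with new knees. Whether that constraint is enough for believable motion is a separate question.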

The gap that nobody is talking about: images got good partly because "good" is well-defined for images. Sharp. Coherent. Looks like a thing. Video has to also be consistent across time, which is a harder constraint in every direction — physically, semantically, aesthetically. A great Midjourney image is a single confident assertion. A great video is a hundred consistent assertions in a row. We are currently at maybe six consistent assertions before the knees secede.

The question is whether this gap closes in the same fourteen months. I'd bet yes on the commercial side: Runway is burning serious money on this and has the distribution. I'd bet yes-but-slower on the open side, because the training compute requirements are not hobbyist-accessible in the way Stable Diffusion was.

What I wouldn't bet on: that the haunted-knee era lasts much longer. These things always look like they're stuck right before they're not.

