Text-to-Video Is Where Image Gen Was Before It Was Good
Runway Gen-2 exists, the outputs are haunted, and this is fine.
Called it. The comparison to early image gen was exactly right. By late 2025, Sora 2 produces minute-long clips that are hard to distinguish from footage. Runway Gen-3 handles cinematic camera moves. The man's knees no longer negotiate separate peace treaties. It took about two years, which is roughly how long image gen took too.
There is a four-second clip of a man walking on a beach — generated entirely from text, no source footage — where the man walks normally for about two seconds and then his legs negotiate a separate peace from the rest of his body. The water behind him moves like water usually does. His face is fine. His knees have simply decided to be knees in a different way.
This is Runway Gen-2. This is the state of the art.
I want to be clear that I mean this seriously. What Runway shipped is genuinely remarkable — not in a "considering the circumstances" way, but in a straight, unqualified way. A year ago text-to-image was producing surreal soup and now Midjourney is being used for book covers and magazine spreads. The trajectory from soup to spine-chilling competence took about fourteen months. Text-to-video is in the soup phase right now, which means you can either make peace with the haunted knees or check back in 2024.
The open-source side has ModelScope, which is Alibaba's DAMO Academy doing what Alibaba's DAMO Academy does — shipping a functional thing quietly while everyone is watching the American companies. The outputs are different-flavored rough. Less "man dissolves into the concept of man," more "someone applied a dream filter to stock footage that never existed."
Then there's Text2Video-Zero, which is the most interesting of the three approaches if you care about how things work rather than just what they produce. The idea: Stable Diffusion already understands the world. It has opinions about what fire looks like, what running looks like, what falling looks like, all baked in from billions of images. Text2Video-Zero jailbreaks that spatial knowledge into temporal knowledge by making frames aware of each other through cross-frame attention, where every frame attends to the first frame instead of only to itself, without any video training data at all. It's using a model that was never trained on video to make video. The outputs are appropriately unhinged, but the idea is either very clever or a preview of the obvious correct path, and I can't tell which yet.
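If "cross-frame attention" sounds abstract, here is a toy NumPy sketch of the core trick. This is not the Text2Video-Zero code, and the real pipeline lives inside Stable Diffusion's U-Net and does more than this. The function names, the random projection matrices, and the tensor shapes are all assumptions made for illustration. The only point is the swap: queries still come from each frame, but keys and values come from one anchor frame.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(frames, w_q, w_k, w_v):
    """Ordinary per-frame attention: each frame looks only at its own tokens.

    frames: (num_frames, tokens, channels); w_*: (channels, dim).
    """
    q, k, v = frames @ w_q, frames @ w_k, frames @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cross_frame_attention(frames, w_q, w_k, w_v, anchor=0):
    """Toy cross-frame attention: every frame queries the anchor frame.

    Keys and values are computed once from frames[anchor], so all frames
    draw their content from the same source, which is what ties them together.
    """
    q = frames @ w_q                           # (F, T, D), one set per frame
    k = frames[anchor] @ w_k                   # (T, D), anchor frame only
    v = frames[anchor] @ w_v                   # (T, D), anchor frame only
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (F, T, T)
    return softmax(scores) @ v                 # (F, T, D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(4, 6, 8))        # 4 frames, 6 tokens, 8 channels
    w_q, w_k, w_v = (rng.normal(size=(8, 5)) for _ in range(3))
    out = cross_frame_attention(frames, w_q, w_k, w_v)
    print(out.shape)
```

Notice that no weights change and nothing is trained. The anchor frame gets exactly what it would have gotten from plain self-attention, and every other frame is forced to describe itself in the anchor's vocabulary.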
The gap that nobody is talking about: images got good partly because "good" is well-defined for images. Sharp. Coherent. Looks like a thing. Video has to also be consistent across time, which is a harder constraint in every direction — physically, semantically, aesthetically. A great Midjourney image is a single confident assertion. A great video is a hundred consistent assertions in a row. We are currently at maybe six consistent assertions before the knees secede.
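One rough way to put a number on "consistent assertions in a row" is to count how many frames go by before adjacent frames stop resembling each other. The sketch below is a made-up yardstick, not an established metric: `consistent_run_length`, the 0.9 threshold, and the idea of comparing per-frame feature vectors by cosine similarity are all assumptions, and it presumes you already have some nonzero feature vector per frame from an encoder of your choice.

```python
import numpy as np

def consistent_run_length(features, threshold=0.9):
    """Count leading frames before adjacent-frame similarity drops.

    features: (num_frames, dim) array of nonzero per-frame feature vectors.
    Returns the number of frames in the initial run where each frame's
    cosine similarity to the previous frame stays at or above threshold.
    """
    features = np.asarray(features, dtype=float)
    if len(features) == 0:
        return 0
    # Normalize each frame's vector so the dot product is cosine similarity.
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = np.sum(unit[1:] * unit[:-1], axis=1)
    run = 1  # the first frame is trivially consistent with itself
    for s in sims:
        if s < threshold:
            break
        run += 1
    return run

if __name__ == "__main__":
    # Three frames that drift gently, then one where the knees secede.
    toy = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2], [0.0, 1.0], [1.0, 0.0]]
    print(consistent_run_length(toy))  # prints 3
```

A single still image scores 1 by construction, which is the whole asymmetry: an image only has to be right once.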
The question is whether this gap closes in the same fourteen months. I'd bet yes on the commercial side: Runway is burning serious money on this and has the distribution. I'd bet yes-but-slower on the open side, because the training compute requirements are not trivially hobbyist-accessible in the way Stable Diffusion's were.
What I wouldn't bet on: that the haunted-knee era lasts much longer. These things always look like they're stuck right before they're not.