The First Open Text-to-Video Model Is Here and It Kind of Sucks
Which is exactly what was supposed to happen, and exactly why it matters.
Open text-to-video models got dramatically better — CogVideoX, Mochi, Wan. The first one sucking was the whole point. Availability beat quality, quality caught up later.
An open text-to-video model dropped today and the videos look like someone described a dream to a model that has never seen a dream — or a video, really. Soft around the edges, temporally confused, moving in that specific way where you can feel the diffusion process struggling against physics.
It's not Sora. It's not even close to Sora. Sora is still locked in a vault somewhere while OpenAI figures out how to charge for it without also enabling mass deepfake production, a problem they will not actually solve.
But here's the thing: we can run this one.
That's it. That's the whole story. The quality gap is real and probably large and also completely beside the point right now. The point is that the closed labs have had this category entirely to themselves. Image generation went open when Stable Diffusion hit in 2022, and that broke everything wide open in the best possible way. Video has been locked up. And now it isn't.
Every interesting thing that happened with open image generation — the fine-tunes, the LoRAs, the weird community experiments, the artists who figured out what the models were actually good for instead of what the press release said — all of that starts from someone being able to run the model locally and spend three weeks doing something nobody planned.
The first version of anything is never the point. The point is that the clock started.
The outputs are rough. Run it anyway.