Text-to-Video Is Moving Faster Than It Should

ZeroScope v2 XL is open source, runs at 1024×576, and the results are arriving faster than anyone warned us they would.

There's a specific feeling you get when a capability arrives ahead of schedule. Not ahead of the hype (the hype is always early), but ahead of the actual timeline you'd internalized. ZeroScope v2 XL hit that nerve.

Watch the demo. The resolution is 1024×576, and it runs faster than I expected anything at that resolution to run. That's the part that should probably concern us: not the quality, which is still obviously synthetic, but the speed, which suggests this is cheaper to run than it has any right to be.

And then the other thing: it's open source. On Hugging Face. Right now. No waitlist, no API, no company between you and the weights.

The cadence here is worth sitting with. Text-to-image went from "research curiosity" to "anyone with a GPU" in about eighteen months. Text-to-video appears to be doing the same thing, except the clock started later and everyone's still acting like we have time.

We probably don't have time.

Counterpoints