Black Forest Labs Is Coming for Video

Black Forest Labs has an "up next" page now, and it includes text-to-video.

This is the FLUX team: Robin Rombach and the people who built the actual architecture underneath Stable Diffusion. They departed Stability AI and, within what feels like fifteen minutes, shipped an image model that made most of the field look slow and embarrassed. If you've used FLUX, you know the gap is real.

So when they say video is coming, the correct response is not skepticism. The correct response is to quietly rearrange your assumptions about where the ceiling is.

The existing SOTA for text-to-video is a weird patchwork. Sora exists but you can't have it. Runway Gen-3 is impressive in the way that a very expensive magic trick is impressive. Kling appeared out of nowhere from a Chinese short-video company and is better than most things built by organizations with far more name recognition. The bar is high and also weirdly undefined, because the best stuff isn't public and the public stuff has obvious seams.

If FLUX-on-video is anywhere near what FLUX-on-images was, those seams are going to matter less.

There's a version of the next twelve months where the open image generation ecosystem gets a video equivalent that actually works, trained by people who understand the math, released with weights. That version is materially different from every other version.

Black Forest Labs just implied they're building it.

Counterpoints