Two Repos Walk Into a Frame
IP-Adapter and prompt-travel are solving diffusion video consistency, and the results are already here.
Temporal consistency in video generation got solved, or close enough. By 2025, Sora and Runway Gen-3 were keeping subjects coherent across frames because the models finally know they're making a video. The two-repos-walking-into-a-frame era ended faster than expected.
The problem with diffusion video has always been the same problem. The model doesn't know it's making a video. Frame one and frame two are strangers who happen to live next door — same neighborhood, technically, but no relationship, no memory of each other, no reason to agree on what the protagonist's face looks like.
Every approach to fixing this has been some flavor of expensive constraint: train on video data, use optical flow, add temporal attention layers, beg. Some of it works. Most of it costs you something — latency, compute, flexibility, the specific checkpoint you actually wanted to use.
IP-Adapter is the current best answer to the underlying problem. It's from Tencent AI Lab and the core trick is clean: decouple image-conditioning from text-conditioning by giving them separate cross-attention paths. Feed it an image and it will use that image as an identity anchor — not a style transfer, not a ControlNet skeleton, but something closer to "this is what this thing looks like, hold onto that." The repo dropped recently and it is already becoming the scaffolding other things get built on top of.
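To make the "separate cross-attention paths" idea concrete, here is a minimal sketch in plain Python. It is not the IP-Adapter code: the function names, the `image_scale` parameter, and the toy vectors are all assumptions for illustration, and it takes keys and values as already projected (in the real model the image branch gets its own learned key and value projections while the query is shared). What it does show is the shape of the trick: run attention once against the text tokens, once against the image tokens, and add the two results.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over plain lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def decoupled_cross_attention(queries, text_kv, image_kv, image_scale=1.0):
    """Hypothetical helper: attend to text and image separately, then sum.

    text_kv and image_kv are (keys, values) pairs. image_scale is an
    assumed knob for how hard the image branch pulls on the result.
    """
    text_out = attention(queries, *text_kv)
    image_out = attention(queries, *image_kv)
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]


# Toy usage: one query, one text token, one image token.
queries = [[1.0, 0.0]]
text_kv = ([[1.0, 0.0]], [[2.0, 3.0]])
image_kv = ([[0.0, 1.0]], [[10.0, 20.0]])
blended = decoupled_cross_attention(queries, text_kv, image_kv, image_scale=0.5)
```

Setting `image_scale` to zero collapses this back to ordinary text-only cross-attention, which is a fair way to think about why the image path can be bolted onto an existing checkpoint without disturbing the text path.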
The other half is prompt-travel, a WebUI extension that interpolates between prompt states across a frame range. You give it keyframes — frame 0 is this, frame 30 is that — and it walks between them, blending the latent space on the way through.
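The keyframe walk can be sketched the same way. This is an illustration of the idea, not the extension's implementation: `prompt_travel_schedule` is a hypothetical name, the short lists stand in for real prompt embeddings, and plain linear blending is an assumption (the actual tool may offer other interpolation modes).

```python
def prompt_travel_schedule(keyframes, num_frames):
    """Return one blended vector per frame.

    keyframes maps a frame index to a stand-in embedding (list of floats).
    Frames before the first keyframe or after the last one hold that
    keyframe's value; frames in between are linearly interpolated.
    """
    marks = sorted(keyframes)
    schedule = []
    for frame in range(num_frames):
        if frame <= marks[0]:
            schedule.append(list(keyframes[marks[0]]))
            continue
        if frame >= marks[-1]:
            schedule.append(list(keyframes[marks[-1]]))
            continue
        # Find the pair of keyframes that brackets this frame.
        for start, end in zip(marks, marks[1:]):
            if start <= frame <= end:
                t = (frame - start) / (end - start)
                a, b = keyframes[start], keyframes[end]
                schedule.append([(1 - t) * x + t * y for x, y in zip(a, b)])
                break
    return schedule


# Toy usage: frame 0 is one prompt state, frame 30 is another.
schedule = prompt_travel_schedule({0: [0.0, 0.0], 30: [3.0, 6.0]}, num_frames=31)
```

Each entry in `schedule` is what a single frame would be conditioned on. Nothing in this walk knows about the subject's identity, which is the gap the image anchor fills.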
Neither of those is new news on its own. The news is what happens when you combine them.
With IP-Adapter locking the subject and prompt-travel handling the motion arc, you get something that actually reads as video — consistent identity, smooth transitions, no requirement that you've retrained anything or moved off your existing setup. Someone packaged the whole workflow up yesterday and there's already a tutorial walking through it.
It's been less than 24 hours. The results are already better than most of what people spent months building with LoRA hacks and AnimateDiff duct tape.
This is the pattern. Not one big breakthrough — two medium-sized ones that nobody planned to use together, until someone did, and then it became obvious that this was always the right combination. The field moves like this constantly and it is somehow always surprising anyway.