
Snapchat and UofT Built a Video Model That Actually Understands the Assignment

MINT treats video generation like storyboarding — and the prompt coherence is unsettling.

2 min read 251 words #video-generation #ai #snapchat #diffusion-models #sora
hindsight — still happening

The storyboard approach to video generation — shot by shot with continuity — remains the right direction. Whether MINT or something else ships it as the default UX is still playing out.

Sora dropped six days ago and the discourse is still mostly people arguing about whether it's real-time or not, which means everyone is sleeping on MINT — a video model out of Snapchat Research and the University of Toronto that is doing something conceptually interesting.

The framing is storyboards. Instead of treating a prompt as a single atomic instruction and hoping the model figures out the narrative arc, MINT approaches video generation the way a director actually thinks about scenes — shot by shot, with continuity across the sequence. Sora gestured at this with its storyboard interface. MINT seems to have made it load-bearing.

The part that's hard to dismiss is the prompt coherence. Most video models fail the moment your description gets specific: you ask for a woman in a red coat crossing a wet street at dusk and you get a generic person in an outdoor location at some time of day. MINT's outputs stay close to what you actually asked for, which sounds like the bare minimum but is somehow still a differentiator in December 2024.

Snapchat is a strange institution to be running serious video generation research. They are also, it turns out, one of the few consumer companies that have been thinking seriously about short-form video at scale for a decade, which is either a coincidence or the entire explanation.

The UofT collaboration is less surprising. Toronto has been a machine learning node since before it was fashionable to call it that.

Worth watching.