Snapchat and UofT Built a Video Model That Actually Understands the Assignment

MINT treats video generation like storyboarding — and the prompt coherence is unsettling.

Sora dropped six days ago and the discourse is still mostly people arguing about whether it's real-time or not, which means everyone is sleeping on MINT, a video model out of Snap Research and the University of Toronto that is doing something conceptually interesting.

The framing is storyboards. Instead of treating a prompt as a single atomic instruction and hoping the model figures out the narrative arc, MINT approaches video generation the way a director actually thinks about scenes — shot by shot, with continuity across the sequence. Sora gestured at this with its storyboard interface. MINT seems to have made it load-bearing.

The part that's hard to dismiss is the prompt coherence. Most video models fail the moment your description gets specific: you ask for a woman in a red coat crossing a wet street at dusk and you get a generic person in an outdoor location at some time of day. MINT's outputs stay close to what you actually asked for, which sounds like the bare minimum but is somehow still a differentiator in December 2024.

Snapchat is a strange institution to be running serious video generation research. They are also, it turns out, one of the few consumer companies that have been thinking seriously about short-form video at scale for a decade, which is either a coincidence or the entire explanation.

The UofT collaboration is less surprising. Toronto has been a machine learning node since before it was fashionable to call it that.

Worth watching.
