Veo3 Knows Things It Was Never Taught
A new benchmark quantifies what anyone who has used a modern video model already suspects: these things have internalized the world.
Veo3 knowing things it was never taught — zero-shot physical understanding in video models. The research quantifying the intuition is still being extended. The models keep getting it right for reasons nobody fully explains.
There's a specific moment that happens when you're prompting Veo3. You ask for something the model has no particular reason to get right — some specific physical interaction, a material responding to force in the way that material actually responds to force, a shadow falling at the angle shadows fall — and it gets it right. Not approximately. Exactly. And you sit there thinking: it knows something.
That feeling now has numbers.
The research at video-zero-shot.github.io takes a broadly shared intuition, that modern video generation models have developed genuine zero-shot world understanding and not just aesthetic mimicry, and builds a proper evaluation framework around it. Tasks the model has no explicit training signal for. Scenarios requiring real causal reasoning about how objects and forces interact. The kind of thing that should be hard, that was hard, and is apparently not that hard anymore for the best models.
The gap between "the model learned to make videos look like videos" and "the model learned how the world works" used to feel large. It's no longer clear that it is.
What makes Veo3 specifically uncanny is the texture of its correctness. Other models produce plausible motion — things move in ways that read as physical without being physical. Veo3 produces accurate motion. A pour, a collision, a fold. You can feel the difference before you can articulate it, which is its own kind of evidence that something real is being measured.
The counterargument is always data scale — that at sufficient training scale, you can pattern-match your way to anything, and what looks like understanding is just very thorough memorization. Maybe. But memorization of what? The physical world is not a finite corpus. If the model is generalizing to novel physical configurations it has never seen rendered, calling that memorization starts to feel like a word game.
What the benchmark does is close off the escape route. Zero-shot means zero-shot. No fine-tuning on the task. No explicit training signal for this specific interaction. Just the model, and the world, and whether it knows which way the apple falls.
It does.
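If you want to see what "knows which way the apple falls" could look like as a number, here is a minimal sketch. It is not the benchmark's scoring method. It assumes you already have per-frame heights for a tracked object, in meters, from some hypothetical tracker with a known scale, and it simply asks whether the estimated vertical acceleration lands near gravity. The function names, the 25% tolerance, and the synthetic trajectories are all made up for illustration.

```python
from statistics import mean


def estimated_acceleration(heights, fps):
    """Average second finite difference of height, in meters per second squared.

    `heights` is a hypothetical list of per-frame object heights (meters),
    assumed to come from a tracker with known scale.
    """
    if len(heights) < 3:
        raise ValueError("need at least three frames")
    dt = 1.0 / fps
    second_diffs = [
        (heights[i + 1] - 2 * heights[i] + heights[i - 1]) / dt**2
        for i in range(1, len(heights) - 1)
    ]
    return mean(second_diffs)


def falls_plausibly(heights, fps, g=9.81, tolerance=0.25):
    """True if the estimated acceleration is within `tolerance` of -g.

    The tolerance is an arbitrary illustrative threshold, not a published one.
    """
    a = estimated_acceleration(heights, fps)
    return abs(a + g) <= tolerance * g


# Synthetic trajectories standing in for tracked video output.
fps = 24
# An object in free fall from 2 m: h(t) = 2 - 0.5 * g * t^2.
ideal_drop = [2.0 - 0.5 * 9.81 * (i / fps) ** 2 for i in range(12)]
# An object drifting down at constant speed: plausible-looking, not physical.
floating = [2.0 - 0.5 * (i / fps) for i in range(12)]

print(falls_plausibly(ideal_drop, fps))
print(falls_plausibly(floating, fps))
```

The constant-speed drift is the interesting case: it is motion that reads as falling without being falling, which is the plausible-versus-accurate distinction in miniature. A real evaluation would also have to deal with tracking noise, camera perspective, and unknown scale, none of which this sketch attempts.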
The question worth sitting with is what we're actually building when we train these things. The official story is: a video generator. A text-to-video model. A content tool. But if the evaluation shows zero-shot physical reasoning as an emergent property of video prediction at scale, then the thing we built is also, incidentally, a world model. Nobody planned that. It came for free.
Or it didn't come for free — it cost hundreds of millions of dollars of compute — but it came without anyone explicitly asking for it, which is close enough.