
They Built the Matrix and Called It a Simulator

Google's UniSim is a generative video model you can live inside, and nobody seems that alarmed.

2 min read 359 words #ai #robotics #world-models #generative-video #reinforcement-learning
hindsight — nailed it

World models went from quiet GitHub pages to the center of the AI research agenda. OpenAI pitched Sora as a world simulator. Genie 2 generates interactive environments. The "take actions inside a generative model" thesis became a mainstream roadmap for AGI. The Matrix reference was less hyperbolic than it sounded.

There is a project page sitting quietly on GitHub Pages right now — no press release, no TechCrunch explainer, no breathless Twitter thread with seventeen parts — that contains a generative model you can take actions inside.

You give it a scene. You give it an action — text, robot command, controller input, doesn't matter. It generates what happens next. You give it another action. It generates what happens after that. You have been living in a neural network's imagination and it doesn't break.
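The loop described above is simple enough to sketch. This is a toy illustration, not UniSim's code: `rollout`, `toy_predict_next`, and the one-number "frame" are all hypothetical stand-ins, and the real model predicts video, not a float. What it shows is the feedback structure, where each generated frame joins the history that conditions the next prediction.

```python
from typing import Callable, List, Sequence

Frame = List[float]   # toy stand-in for an image (assumption, not a real video frame)
Action = str          # text here; could equally be a robot command or controller input


def rollout(
    predict_next: Callable[[Sequence[Frame], Action], Frame],
    first_frame: Frame,
    actions: Sequence[Action],
) -> List[Frame]:
    """Autoregressive rollout: every predicted frame is appended to the
    history and fed back in when predicting the following step."""
    history: List[Frame] = [first_frame]
    for action in actions:
        history.append(predict_next(history, action))
    return history


def toy_predict_next(history: Sequence[Frame], action: Action) -> Frame:
    """Hypothetical 'world model': a position along a one-dimensional sidewalk."""
    step = {"walk forward": 1.0, "walk backward": -1.0}.get(action, 0.0)
    return [history[-1][0] + step]


frames = rollout(
    toy_predict_next,
    first_frame=[0.0],
    actions=["walk forward", "walk forward", "walk backward"],
)
print(frames)
```

Swap `toy_predict_next` for a learned video model and the outer loop stays the same, which is most of the point.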

Google Research named it UniSim, short for Universal Simulator, which is either the most accurate product name in machine learning history or a deeply unhinged thing to just casually call your paper.

The demos are the part you have to sit with. Real-world navigation: type "walk forward" into a video model and the video of the sidewalk scrolls forward, correctly, with the parallax and the lighting and the world behaving like a world. A robot arm reaching for an object. Minecraft-adjacent environments. All one model. All interactive. All running on the same weights.

The argument buried in the project is that simulators are the bottleneck for training robot policies. You need millions of interactions to teach a robot to do anything useful, and real robots are slow and expensive and break. So what if the simulator were just a generative model trained on video of everything? What if you skipped the physics engine entirely and replaced it with compressed human experience?

They train a robot policy inside this generated world. The policy transfers to the real world. This is the part that should make you put your coffee down.
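To make "train in the generated world, deploy in the real one" concrete, here is a deliberately tiny sketch. Everything in it is an assumption for illustration: the one-dimensional dynamics, the names `learned_sim_step` and `real_step`, and especially the grid search over a single gain, which stands in for the reinforcement learning a real system would use. The learned simulator is slightly wrong on purpose, to show that a policy tuned against an imperfect model can still do well on the true dynamics.

```python
def run_episode(step_fn, gain, goal=1.0, horizon=10):
    """Run a proportional policy (action = gain * error) and return the
    negative summed distance to the goal; higher is better."""
    pos, total = 0.0, 0.0
    for _ in range(horizon):
        action = gain * (goal - pos)
        pos = step_fn(pos, action)
        total -= abs(goal - pos)
    return total


def learned_sim_step(pos, action):
    """Hypothetical learned simulator: underestimates motion by 10%."""
    return pos + 0.9 * action


def real_step(pos, action):
    """Hypothetical ground-truth dynamics the policy never trains on."""
    return pos + action


# "Training": choose the gain that scores best inside the learned simulator only.
candidates = [0.1 * i for i in range(1, 16)]
best_gain = max(candidates, key=lambda g: run_episode(learned_sim_step, g))

# "Transfer": evaluate that same gain on the real dynamics.
baseline_return = run_episode(real_step, 0.1)
transfer_return = run_episode(real_step, best_gain)
print(best_gain, baseline_return, transfer_return)
```

The policy chosen in simulation is not the one that would be optimal on the real dynamics, but it beats the untuned baseline by a wide margin. That gap between "optimal in sim" and "good enough in reality" is the quantity sim-to-real work cares about.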

The word "universal" in the name is not, it turns out, overreach. The same model simulates outdoor navigation and robot manipulation and video games because they all reduce to the same thing: given a history of frames and an action, predict what happens next. The real world is a special case of the distribution.

We spent sixty years building physics simulators — rigid body dynamics, friction models, contact solvers, all of it — and the answer was apparently: just watch enough video.

Nobody planned this.