
Apple Gave Away Eyes

FastVLM runs entirely in the browser on WebGPU, which means image understanding now costs roughly what it costs to run a ceiling fan.

#apple #vision-language-models #webgpu #open-source #video-understanding
hindsight — nailed it

Apple gave away eyes. FastVLM running in the browser, no API key, no cloud, no cost. The obvious observation about cost elimination was correct. The less obvious one about video is still unfolding.

Apple quietly released FastVLM, the model from their CVPR paper this year, on Hugging Face as open source, with a full WebGPU demo that runs entirely in the browser. No API key. No inference cost. No cloud. Just electrons moving through silicon you already own.

On Apple Silicon, image description is now instant and free. The model loads, runs, and returns results on your machine, in your browser, at whatever speed your M-series chip feels like going that day, which turns out to be: fast.
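
"In your browser" does carry one precondition: the browser has to expose WebGPU. A minimal TypeScript sketch of that check follows. It is not taken from Apple's demo. The helper names `hasWebGPUEntryPoint` and `hasWebGPUAdapter` and the small `NavigatorLike` and `GpuLike` types are made up for illustration; the only real API it relies on is `navigator.gpu.requestAdapter()`.

```typescript
// Minimal structural types so the sketch compiles without extra type packages.
// These are illustrative stand-ins, not the official WebGPU type definitions.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Synchronous check: does this environment expose the WebGPU entry point at all?
function hasWebGPUEntryPoint(nav: NavigatorLike | undefined): boolean {
  return nav !== undefined && nav.gpu !== undefined;
}

// Asynchronous check: the entry point exists AND the browser can hand out an adapter.
// requestAdapter() resolves to null when no suitable adapter is available.
async function hasWebGPUAdapter(nav: NavigatorLike | undefined): Promise<boolean> {
  if (nav === undefined || nav.gpu === undefined) return false;
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter !== null;
  } catch (_err) {
    // Treat any failure while requesting an adapter as "not available".
    return false;
  }
}
```

In a page you would pass the global `navigator` (cast as needed for your TypeScript setup) and fall back to a plain message when the result is `false`.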

The obvious observation is that this eliminates a cost category. The less obvious one is what it does to video.

The current free approach to video understanding is to listen, not look. Transcribe the audio, interpret what was said, infer the rest. It works until it doesn't, which is constantly, because an enormous amount of what happens in video is purely visual and completely silent.

FastVLM can watch. Frame by frame, free, local, offline. No API call per frame. No bill at the end of the month that scales with how much footage you threw at it.
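
As a rough illustration of what "frame by frame" could look like in practice, here is a minimal TypeScript sketch. It is not FastVLM's API. `describeFrame` is a hypothetical callback standing in for whatever local model call you wire up, and `sampleTimestamps`, `describeVideo`, and the idea of sampling at a fixed interval are assumptions made for the example.

```typescript
// Hypothetical callback: takes a timestamp in seconds and returns a caption.
// In a browser this would seek a <video>, draw the frame to a canvas, and hand
// the pixels to a locally running model. None of that is shown here.
type DescribeFrame = (timestampSec: number) => Promise<string>;

interface FrameCaption {
  timestampSec: number;
  caption: string;
}

// Evenly spaced timestamps in [0, durationSec), one every intervalSec seconds.
function sampleTimestamps(durationSec: number, intervalSec: number): number[] {
  if (!Number.isFinite(durationSec) || durationSec < 0) {
    throw new RangeError("durationSec must be a finite number >= 0");
  }
  if (!Number.isFinite(intervalSec) || intervalSec <= 0) {
    throw new RangeError("intervalSec must be a finite number > 0");
  }
  const count = Math.ceil(durationSec / intervalSec);
  // Multiply instead of accumulating to avoid floating-point drift.
  return Array.from({ length: count }, (_, i) => i * intervalSec);
}

// Caption each sampled frame in order. Sequential on purpose: it keeps the
// sketch simple and avoids queuing many frames on one local device.
async function describeVideo(
  durationSec: number,
  intervalSec: number,
  describeFrame: DescribeFrame,
): Promise<FrameCaption[]> {
  const captions: FrameCaption[] = [];
  for (const timestampSec of sampleTimestamps(durationSec, intervalSec)) {
    captions.push({ timestampSec, caption: await describeFrame(timestampSec) });
  }
  return captions;
}
```

The number of model calls grows with footage length divided by the sampling interval. With a local model, that growth costs time rather than money, which is the point of the paragraph above.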

Apple published their CVPR work and handed it to anyone with a browser. The only reasonable response is to figure out what you were previously not building because visual inference cost money.