
ChatGPT Can See You Now

OpenAI ships multimodal to consumers, and the race nobody was pretending wasn't happening is now officially happening.

#openai #chatgpt #multimodal #voice #gpt-4v
hindsight — nailed it

Vision became table stakes. Every major model — Claude, GPT-4o, Gemini — ships multimodal by default now. The loaded gun behind the velvet rope became the standard-issue sidearm. Nobody ships a frontier model without vision anymore.

GPT-4V has existed for a while (researchers have been poking at it, and the capability has been sitting there like a loaded gun behind a velvet rope), and today OpenAI finally let it out of the building. ChatGPT Plus and Enterprise users get to upload images and talk to the model about them. They also get voice. Five voices, apparently, each with a name, each built from recordings of a professional voice actor who was paid to supply the raw material for a system that will, in aggregate, reduce demand for voice actors.

Nobody mention this at the launch party.

The voice capability is the thing that will get the headlines because it's the thing that looks like the future we were promised in 1984 — not the dystopia, the other one, the one where you just talk to the computer and it understands you. You hold down a button, you speak, it speaks back. Whisper on the way in, a new TTS model with five personality options on the way out. The end-to-end latency is reportedly low enough that it doesn't feel like a phone tree. This is the bar we've set.

What I keep thinking about is the image piece, because that's where the actual intelligence lives. Voice is impressive engineering, but vision is the part that's genuinely hard — the part where you show the model a photo of your refrigerator and it tells you what to make for dinner, or you photograph a graph from a paper you're too lazy to read and ask it what's happening, or you take a picture of a rash and immediately feel better or worse about your life expectancy. The model is reasoning about what it sees. That's a different thing.

The framing of the announcement — "see, hear, and speak" — is doing the work of the word "multimodal" without requiring anyone to know what multimodal means. Smart. It also positions this as a fundamentally new product rather than a capability update to an existing one. Whether that distinction matters to anyone outside of pricing discussions is unclear.

What's clear is that the race is on in the exact way everyone knew it was on. Google has been building toward this. Meta has been building toward this. Every lab with a vision encoder and a voice pipeline is watching this rollout and checking their timelines. The interesting question isn't whether competitors ship similar things — they will, some within weeks — but whether any of them have the consumer distribution to make it matter the way ChatGPT's does.

A hundred million users is a test bed that nobody else currently has. OpenAI is about to learn, very quickly, what people actually do with a voice interface to a language model when it's not in a demo. My guess: half of it will be things nobody predicted, a quarter of it will be people asking it to tell them bedtime stories, and a quarter of it will be things that require immediate product intervention.

The velvet rope is down. We'll see what was behind it.