expectedwrong hindsight

They Heard Me

GPT-4 Turbo with Vision is generally available, function calling works now, and the corporate chess match is getting weird.

3 min read 471 words #openai #google #llm #api #local-inference
hindsight — nailed it

unified multimodal plus tool use became the standard. every model now does vision and function calling in the same API call. the duct-tape era ended quickly. google's dev rel poaching also worked — gemini's developer adoption improved significantly.

Four days ago I complained loudly that GPT-4 Vision Turbo and function calling lived on separate endpoints, which meant you couldn't do anything actually useful with vision without duct-taping two API calls together and praying the model stayed coherent across them. Today OpenAI dropped GPT-4 Turbo with Vision as generally available, with function calling, in the same endpoint.

I'm going to say they heard me.

The simulation project — which has been sitting in a branch, ready to merge for weeks, blocked on exactly this — can now merge to main.


Meanwhile, Google is playing a different kind of game. They watched OpenAI build the most envied developer community in AI, and their response was to hire OpenAI's developer relations lead and then, almost immediately, drop Gemini 1.5 Pro into public preview with new features. The sequencing is aggressive. Steal the person who knows how to talk to developers, then give developers something to talk about.

Logan Kilpatrick now belongs to Google. The Google blog post that dropped today has his fingerprints all over the framing.

swishfever called this days in advance. The most accurate leaker in the space, living up to it again, predicting both the timing and the shape of the announcement while everyone else was guessing.


The function calling thing is not a minor quality-of-life fix. It was a fundamental architectural limitation — vision models that can't call tools can't do agentic work, can't verify what they see against a database, can't trigger actions based on what they're looking at. The merger of those two capabilities is the thing that makes vision actually composable with the rest of the stack. OpenAI dropped this as a response to Gemini's audio features earlier in the week, and the pace at which these companies are now racing to close each other's gaps in real time is genuinely disorienting.


The other thing worth sitting with: CPU-only inference at Q2 quantization. Someone's getting the big models running locally on Apple Silicon with 128 gigabytes of RAM — which, yes, costs money, but 128GB of unified memory is a thing you can own — and the word is that the model is, quote, "insanely OP for what it does."

GPT-4 in your house. On your CPU. Slowly, sure, but yours.

Nobody's fully figured out how to make those low quants performant yet, and at Q2 you are losing a lot of precision, but the fact that we're even having this conversation — that the limiting factor is whether someone figures out the optimization, not whether it's physically possible — is the kind of sentence that would have sounded like fiction eighteen months ago.

128 gigabytes of RAM for a GPT-4 in your house. The absurdity and the normalcy of that sentence occupying the same brain simultaneously is the whole thing, really.