The Model Is Already There

Gemini Nano runs in Chrome with no server, no API key, and no model download — because Chrome already did that for you.

Google shipped a language model inside Chrome and most people have no idea it's sitting on their machine right now.

Gemini Nano — the small one, the one that fits on a phone — runs fully in the browser via an experimental Prompt API in Chrome Dev and Canary. No API key. No network call. No model download, because Chrome already downloaded it for you when you weren't looking.

You flip two flags in chrome://flags, navigate to chrome://components to make sure the model component is actually there, and then you have this:

const session = await ai.createTextSession();
const result = await session.prompt("Write me a poem");

That's it. That's the whole thing. The inference runs on your GPU, the data goes nowhere, and the API looks like a toy you'd give to someone learning JavaScript.

There's a streaming mode — promptStreaming() — which is currently broken in an endearingly specific way. It returns cumulative chunks rather than deltas, so you get the first sentence, then the first two sentences, then the first three, like someone slowly revealing a sentence by pressing an old typewriter key. There's a known Chromium bug filed. In the meantime you deduplicate manually, which is fine, this is experimental, nobody promised you anything.

The prompt structure is genuinely strange. Gemini Nano doesn't have an official chat format, so multi-turn conversations get assembled by hand using a <ctrl23> separator token — a raw control token, the kind of thing that usually lives somewhere you're not supposed to look. Few-shot examples go in the same way. It's less a chat API and more a very polite request to a model that happens to understand what you mean.

What's actually interesting here is the distribution model. Chrome is going to be on billions of devices. If this API ships stable, every web app gets access to on-device inference for free — not through a CDN, not through WebAssembly weights you serve yourself, but through the browser vendor who already decided the model was worth the disk space. The developer doesn't negotiate that download. The user never sees it happen.

That is either the most convenient thing anyone has done for client-side ML or a fairly significant decision to make on someone else's behalf, and probably both.

The Model Is Already There

Counterpoints