The Private AI Dream Keeps Collapsing Into "Just Use 4o"

Every path through the local model maze eventually dumps you at the same OpenAI invoice.

The pitch is always the same. Run it locally. Own your data. No per-token fees. Unlimited requests for a flat monthly bill — $180 if you quantize, $800 if you want the full thing running in fp16 like a responsible adult.

Then someone does the math on what it actually costs to run a frontier call for a real task — a 1,700-token analysis, say — and the frontier number comes out lower.
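Here is roughly what that math looks like, as a sketch. The per-token prices and the input/output split are placeholders made up for illustration, not quoted rates; only the 1,700-token call and the $180 and $800 flat bills come from the pitch above.

```python
# Illustrative only: these prices are assumed placeholders, not real quoted rates.
PRICE_PER_M_INPUT = 2.50    # assumed USD per 1M input tokens
PRICE_PER_M_OUTPUT = 10.00  # assumed USD per 1M output tokens


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """USD cost of one frontier call at the assumed prices."""
    return (input_tokens * PRICE_PER_M_INPUT
            + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


def break_even_calls(flat_monthly_usd: float, per_call_usd: float) -> float:
    """Calls per month at which a flat bill and per-call billing cost the same."""
    return flat_monthly_usd / per_call_usd


# Assume the 1,700 tokens split into 1,200 in and 500 out (hypothetical split).
per_call = call_cost(1_200, 500)

for flat in (180, 800):  # the two flat monthly figures from the pitch
    calls = break_even_calls(flat, per_call)
    print(f"${flat}/month breaks even at about {calls:,.0f} calls")
```

At these placeholder prices a call costs under a cent, so the flat bill only wins once monthly volume climbs into the tens of thousands of calls. Swap in real prices and the shape of the argument stays the same; only the break-even point moves.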

We've been here before. We'll be here again.

Llama 4 shipped and it's somehow worse than Llama 3. Not marginally. Worse while everyone else got better. This is the model that was supposed to make the "run it locally, match GPT-4o quality" argument viable for production systems. Instead it handed that argument to DeepSeek, which is Chinese, which creates a different kind of customer conversation — not about intelligence but about whose servers your law firm's privileged communications are passing through.

DeepSeek V3 quantizes efficiently enough that 32B at 4-bit probably outperforms Llama 3 at full precision. Eight B200s run a full instance for around $16/hour with enough concurrency for multiple customers. Drop to three if you're willing to serve fewer simultaneous requests. A customer with a 36GB Mac can run the quantized version on their own machine, which is either a feature or a support ticket waiting to happen, depending on which side of the conversation you're on.
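To put the hourly figure on the same footing as a monthly bill, here is a back-of-envelope sketch. It assumes the instance runs around the clock for a 30-day month, that the price scales linearly with GPU count, and that the cost is split evenly across a hypothetical ten customers. None of those assumptions come from a real price sheet.

```python
HOURS_PER_MONTH = 24 * 30   # assumed: 30-day month, instance never switched off
EIGHT_GPU_HOURLY = 16.0     # USD/hour, the figure quoted in the text


def monthly_cost(hourly_usd: float) -> float:
    """USD per month for an always-on instance."""
    return hourly_usd * HOURS_PER_MONTH


def per_customer(hourly_usd: float, customers: int) -> float:
    """Even split of the monthly cost across customers (hypothetical model)."""
    return monthly_cost(hourly_usd) / customers


# Assumption: three GPUs cost 3/8 of the eight-GPU rate. Real pricing may differ.
three_gpu_hourly = EIGHT_GPU_HOURLY * 3 / 8

print(f"8 GPUs: ${monthly_cost(EIGHT_GPU_HOURLY):,.0f}/month")
print(f"3 GPUs: ${monthly_cost(three_gpu_hourly):,.0f}/month")
print(f"8 GPUs across 10 customers: ${per_customer(EIGHT_GPU_HOURLY, 10):,.0f} each")
```

Under those assumptions the eight-GPU box is $11,520 a month and the three-GPU box is $4,320, which is the number to hold up against whatever the frontier invoice would have been.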

The actually interesting thing buried in all the cost math is the architecture that doesn't get named directly but is right there: a two-tier system where cheap models do the reading and frontier models do the talking.

Local models — or DeepSeek, or whatever is efficient and fast and doesn't need to be impressive — read every email, scan every document, look for alignment and context. They never interact with a user. They never produce displayable output. They're retrieval, not generation. Filters, not writers. And then the expensive model handles the 1,700-token call that a customer actually sees.

This is a sensible architecture. The mistake people make is expecting the cheap layer to do the same work as the expensive layer but for less money. It can't. But it can do a different, humbler job — figure out what matters, throw away what doesn't, pass the right context forward — and do that job very well.
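A minimal sketch of that two-tier split, to make the division of labor concrete. Everything here is hypothetical: `cheap_score` stands in for whatever local model does the reading, `frontier_answer` stands in for the expensive call, and the word-overlap scorer is just a placeholder so the example runs without any model at all.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Document:
    doc_id: str
    text: str


def cascade(
    question: str,
    documents: Sequence[Document],
    cheap_score: Callable[[str, Document], float],
    frontier_answer: Callable[[str, Sequence[Document]], str],
    keep: int = 3,
    threshold: float = 0.0,
) -> str:
    """Tier 1 reads everything; tier 2 sees only what survived the filter."""
    scored = [(cheap_score(question, d), d) for d in documents]
    # Throw away anything the cheap tier considers irrelevant.
    relevant = [(s, d) for s, d in scored if s > threshold]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    context = [d for _, d in relevant[:keep]]
    # Only this call produces text a user ever sees.
    return frontier_answer(question, context)


def overlap_score(question: str, doc: Document) -> float:
    """Placeholder for the cheap model: fraction of question words in the doc."""
    q = set(question.lower().split())
    return len(q & set(doc.text.lower().split())) / max(len(q), 1)


def fake_frontier(question: str, context: Sequence[Document]) -> str:
    """Placeholder for the frontier call: reports which documents it was given."""
    ids = ", ".join(d.doc_id for d in context)
    return f"[answer to {question!r} grounded in: {ids}]"


docs = [
    Document("email-1", "lease renewal deadline is march 1"),
    Document("email-2", "lunch order for friday"),
    Document("memo-7", "tenant missed the lease renewal deadline"),
]
result = cascade("when is the lease renewal deadline", docs,
                 overlap_score, fake_frontier)
print(result)
```

The point of the shape is that the cheap tier's output is never displayed. It only decides what the expensive tier gets to read, so the lunch order never reaches the frontier call and never costs a frontier token.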

"You spend all your time trying to make Llama smart," the note says, "when changing to 4o would just fix it." This is correct. The local model is not a degraded frontier model. It's a different tool with different appropriate uses, and using it as a substitute for the thing it can't replace is how you end up with a hallucination machine in a law firm.

Which, yes. A system with no grounding, no retrieval, no guardrails — ready to ship, data not included — is technically deployable today and technically guaranteed to produce confident nonsense. Probably fine for many contexts. Probably not the one where someone is trying to understand their legal exposure.

The privacy marketing angle is there too, of course. Swap the model, say the scanning and analysis and email reading all runs locally, charge whatever the market will bear. This is a real option. It just requires being honest with yourself about what "private" means when the quantized model is making stuff up about your clients' cases.

The frontier models are where the intelligence is. They're also, for most realistic workloads, not actually that expensive — because the tasks that require real intelligence turn out to be smaller in volume than the tasks that just require processing, and it's the processing that the cheap models can handle.

The cascade architecture is the answer. Not local versus frontier. Local and frontier — doing different jobs — with the costs flowing to where the intelligence actually needs to be.

Everything else is the dream.
