One Terminal Command to See

There is a certain tax you pay to use vision models — the OpenAI account, the API key, the bill at the end of the month, the quiet discomfort of sending your photos to someone else's server for analysis. Most people just pay it because the alternative was, until very recently, a PhD and a cluster.

The alternative is now this:

uv run --with torch --with mlx-vlm python -m mlx_vlm.generate \
  --model mlx-community/gemma-3n-E4B-bf16 \
  --max-tokens 500 \
  --temp 0.5 \
  --prompt "Describe this image in detail" \
  --image PATH_TO_IMAGE.png

That's it. That runs locally. On your Mac. The first run downloads the model weights; after that, your images never leave your machine. You don't need an account for anything.
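If you have more than one image, the same invocation wraps easily in a small shell function. What follows is a minimal sketch, not anything mlx-vlm ships: the function name describe_image and the DRY_RUN switch are hypothetical names invented for this example, and the flags are simply the ones from the command above.

```shell
#!/bin/sh
# Sketch: run the same local vision command over several images.
# describe_image and DRY_RUN are hypothetical names for this example only.

describe_image() {
  # Rebuild the argument list: the full command, ending with the image path.
  # "$1" is expanded before `set --` replaces the function's arguments.
  set -- uv run --with torch --with mlx-vlm python -m mlx_vlm.generate \
    --model mlx-community/gemma-3n-E4B-bf16 \
    --max-tokens 500 \
    --temp 0.5 \
    --prompt "Describe this image in detail" \
    --image "$1"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it (quoting is not preserved).
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Describe each image passed on the command line, one at a time.
for image in "$@"; do
  echo "== $image =="
  describe_image "$image"
done
```

Saved as, say, describe.sh, it would be called as `sh describe.sh a.png b.png`. Setting DRY_RUN=1 first prints each command without loading the model, which is a cheap way to check the paths before committing to a long run.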

Gemma-3n is Google's new small multimodal model — "small" meaning it fits in the memory of an M1 — and the early reports are that it's genuinely good at this, not "good for a local model" good, just good. The mlx-vlm library handles the MLX backend so Apple Silicon does what it was apparently built to do, which is run inference fast with no fan noise and no prayer.

You still need OpenAI for some things. The frontier stuff, the long context, the cases where you need the model to have read basically everything. But "look at this image and tell me what's in it" — the thing that felt exotic two years ago — is now a uv run away from anyone with a terminal and an M1.

The tax is gone. Nobody sent a memo.

Counterpoints