
The Qwen 1M Context Window Works on Your Mac Right Now

A 1-million-token context model running locally, today, on Apple Silicon, with one catch: most machines will run out of memory well before the full million.

2 min read 285 words #local-ai #apple-silicon #qwen #mlx #context-windows
hindsight — nailed it

Qwen 1M context on a Mac via MLX worked as described. Local million-token inference on Apple Silicon became a real capability, not a benchmark claim.

Awni Hannun — MLX contributor, Apple researcher — dropped a note today about running the new Qwen2.5 1M context model on macOS. Not in a datacenter. On a Mac. Locally.

The model is Qwen2.5-7B-Instruct-1M. Alibaba shipped it specifically for the long-context use case — where the standard Qwen2.5 tops out at 128k tokens, this variant is trained to go to a million. The MLX community already has a quantized 4-bit version up, which means you can pull it and run it today.

pip install mlx-lm

mlx_lm.generate \
  --model mlx-community/Qwen2.5-7B-Instruct-1M-4bit \
  --max-tokens 250000 \
  --prompt "your prompt here"

The 250k figure is the honest version of what most people can actually run. One caveat about the command itself: in mlx-lm, --max-tokens bounds how many tokens the model generates, so the context you actually exercise is set by the length of the prompt you feed in, not by that flag. Unified memory on Apple Silicon is the reason any of this is possible at all: the GPU and CPU share the same pool, so a 32GB Mac can hold the weights and a large context in the same address space without the usual PCIe bandwidth ceiling strangling everything. On 32GB you get roughly 128k-250k tokens of practical headroom with the 4-bit 7B. On 64GB you push toward 500k. The full million requires 128GB.
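Those memory tiers follow mostly from the size of the KV cache, which grows linearly with context length. Here is a back-of-the-envelope sketch in Python. The layer count, KV head count, head dimension, and parameter count are assumptions about Qwen2.5-7B that should be checked against the model's config.json, and the 4.5 bits per weight is a rough allowance for 4-bit quantization overhead. It ignores prefill activations, the OS, and whatever share of unified memory macOS withholds from the GPU, so treat the output as a floor, not a promise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    # Assumed Qwen2.5-7B values; verify against the model's config.json.
    num_layers: int = 28
    num_kv_heads: int = 4        # grouped-query attention keeps this small
    head_dim: int = 128
    params_billions: float = 7.6


def kv_cache_bytes(tokens: int, shape: ModelShape = ModelShape(),
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for `tokens` of context (fp16 by default)."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    per_token = (2 * shape.num_layers * shape.num_kv_heads
                 * shape.head_dim * bytes_per_value)
    return per_token * tokens


def weight_bytes(shape: ModelShape = ModelShape(),
                 bits_per_weight: float = 4.5) -> int:
    """Approximate size of the quantized weights."""
    return int(shape.params_billions * 1e9 * bits_per_weight / 8)


def total_gib(tokens: int, shape: ModelShape = ModelShape()) -> float:
    """Weights plus KV cache, in GiB."""
    return (kv_cache_bytes(tokens, shape) + weight_bytes(shape)) / 2**30


for n in (128_000, 250_000, 500_000, 1_000_000):
    print(f"{n:>9,} tokens: {total_gib(n):5.1f} GiB (weights + fp16 KV cache)")
```

Under these assumptions the cache costs 56 KiB per token, so 250k tokens is roughly 13 GiB of cache and a full million is roughly 53 GiB before anything else is counted. That lines up with the tiers above: comfortable on 32GB at 250k, and out of reach on 64GB at a million once the rest of the system takes its share.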

If you prefer llama.cpp or Ollama, the context flag is the whole trick:

# llama.cpp — 256k context
./llama-cli -m qwen2.5-7b-instruct-1m-q4_k_m.gguf -c 262144 -n 512 -p "prompt"

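# Ollama via Modelfile (FROM points at the 1M GGUF; the stock qwen2.5:7b-instruct tag is the standard model, not the 1M variant)
FROM ./qwen2.5-7b-instruct-1m-q4_k_m.gguf
PARAMETER num_ctx 262144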
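If you would rather not bake the context size into a Modelfile, Ollama's local HTTP API also accepts num_ctx per request under options. A minimal Python sketch, assuming an Ollama server on its default port; the model name qwen-1m-256k in the usage comment is hypothetical and stands for whatever name you gave the model when you created it.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int = 262144,
                  num_predict: int = 512) -> urllib.request.Request:
    """Build a non-streaming generate request with an explicit context size."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,          # context window, in tokens
            "num_predict": num_predict,  # cap on generated tokens
        },
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, **kwargs) -> str:
    """Send the request and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt, **kwargs)) as resp:
        return json.loads(resp.read())["response"]


# Usage (needs a running Ollama server; the model name is hypothetical):
# print(generate("qwen-1m-256k", "Summarize the following notes: ..."))
```

Keeping the context size and the generation cap as separate options makes the distinction explicit: one decides how much the model can read, the other how much it writes.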

What do you actually do with 250k tokens locally? Nobody quite knows the answer yet. It's enough to stuff in an entire mid-sized codebase, a book, or a year of emails. Whether the model does anything useful at that range is a different conversation: retrieval and attention degradation at extreme context lengths is real and documented. But the window is open, and it's on your desk, and it costs nothing per token.
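To make "stuff in an entire mid-sized codebase" concrete, here is a rough Python sketch that walks a directory, concatenates source files under path headers, and stops adding files once an estimated token budget is spent. The four-characters-per-token ratio is a crude assumption, not the Qwen tokenizer, and the function names, file extensions, and header format are all illustrative.

```python
import os

CHARS_PER_TOKEN = 4  # crude heuristic (assumption); a real tokenizer will differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def pack_directory(root: str, extensions=(".py", ".md"),
                   token_budget: int = 250_000):
    """Concatenate matching files into one prompt string within a budget.

    Returns (prompt, estimated_tokens_used, skipped_relative_paths).
    """
    parts, used, skipped = [], 0, []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip hidden directories such as .git, and walk in a stable order.
        dirnames[:] = sorted(d for d in dirnames if not d.startswith("."))
        for name in sorted(filenames):
            if not name.endswith(tuple(extensions)):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            rel = os.path.relpath(path, root)
            chunk = f"### FILE: {rel}\n{text}\n"
            cost = estimate_tokens(chunk)
            if used + cost > token_budget:
                skipped.append(rel)  # over budget: record it and move on
                continue
            parts.append(chunk)
            used += cost
    return "".join(parts), used, skipped


# Usage: prompt, used, skipped = pack_directory("path/to/repo")
# then append your question and pass the result to the model as the prompt.
```

For a real count, run the text through the model's own tokenizer before trusting the budget; the heuristic only tells you whether you are in the right neighborhood.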

A million-token context window running locally on commodity hardware, announced on a Monday in January. Filed under: things that were impossible eighteen months ago.