
Midjourney Still Doesn't Have an API

FLUX runs locally on an M3 Mac in two minutes and does not care what you ask it.

2 min read 423 words #image-generation #flux #local-ai #midjourney #open-weights
hindsight — half right

FLUX did dominate open-source image gen exactly as described. But Midjourney eventually shipped a web app and API, so the "still doesn't have an API" framing had an expiration date. The competitive landscape shifted but didn't collapse the way this implied.

I generated an image of a fake TED talk on my Mac. The prompt specified a long-haired bearded speaker, visibly excited, standing in front of a slide titled "AI is coming and is bringing it," with two bullet points — "This ran on an M3 Mac" and "Completely locally" — while one audience member stood clapping and another danced in the aisle.

It took two minutes. It nailed it.

The model is FLUX, from Black Forest Labs, and it changes the math on basically everything.


Midjourney and DALL-E have had this space to themselves for a while now, and they've both settled into a comfortable arrangement where you pay for access to a black box that occasionally refuses to draw things. Midjourney in particular has been operating as a Discord-only artisanal vibes machine since the beginning: no API, no local option, just a monthly subscription to type prompts into a chat room and hope the bot is in a good mood.

FLUX is forty gigabytes. It runs quantized in around twelve gigs of VRAM. The schnell and dev versions are open-weight and free. You can run it through Pinokio's WebUI locally, or hit fal.ai's hosted endpoint until they ask for your credit card — which they haven't yet.
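For a rough sense of why quantization gets the model into that VRAM range, here is a back-of-the-envelope sketch. It assumes the FLUX.1 transformer has about 12 billion parameters, a figure taken from the published model card rather than from this post, and it counts weights only. Text encoders, the VAE, and activations all add to the real footprint, so treat the numbers as a floor, not a measurement.

```python
# Rough weight-memory estimate for a diffusion transformer at different precisions.
# ASSUMPTION: ~12 billion parameters for the FLUX.1 transformer (model card figure,
# not stated in this post). Weights only; encoders, VAE and activations are extra.
PARAMS = 12e9

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {
    "bf16": 2.0,   # unquantized half precision
    "8-bit": 1.0,  # int8 / fp8 quantization
    "4-bit": 0.5,  # 4-bit quantization
}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for name, size in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gb(PARAMS, size):.0f} GB of weights")
```

Under that assumption, 8-bit weights come out at about 12 GB, which lines up with the "around twelve gigs" figure, and unquantized bf16 weights alone would need roughly twice that.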

The part that matters most: FLUX doesn't care about anything.

This is not a small thing. A significant fraction of the friction in working with the major image models is that they've been trained or finetuned to refuse, to soften, to add a helmet, to make the violence tasteful, to decline politely and suggest you try something else. FLUX generates what you describe. The prompt is the contract. If you asked for the dancing audience member, you get the dancing audience member.

The benchmark for "is this model real" has always been: can it handle a prompt that requires understanding context, composition, text in image, and human specificity all at once — without one of those things collapsing into slop. The fake TED talk prompt is a hearty challenge. Specific people, specific slide text, specific emotional states, a specific kind of candid-photography aesthetic. Two minutes on a laptop.
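If you want to try a prompt like this from a script rather than a WebUI, a minimal sketch using Hugging Face `diffusers` might look like the following. This is not how the image in this post was made (that went through Pinokio). The prompt text is a paraphrase of the description above, `build_prompt` and `generate` are hypothetical helper names, and the `mps` device with `bfloat16` assumes a recent PyTorch on Apple Silicon. The step count, guidance scale, and sequence length follow the schnell model card's suggested settings.

```python
import time


def build_prompt() -> str:
    """Paraphrase of the fake TED talk prompt described in the post (not the exact wording)."""
    return (
        "Candid photo of a TED talk. A long-haired, bearded speaker, visibly excited, "
        'stands in front of a slide titled "AI is coming and is bringing it" '
        'with two bullet points: "This ran on an M3 Mac" and "Completely locally". '
        "One audience member stands clapping, another dances in the aisle."
    )


def generate(prompt: str, out_path: str = "ted_talk.png", device: str = "mps") -> float:
    """Run FLUX.1 [schnell] once, save the image, and return wall-clock seconds.

    Heavy imports live inside the function so the module loads without them.
    The first call downloads the weights, which is a large download.
    """
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-schnell",
        torch_dtype=torch.bfloat16,  # assumption: bf16 is supported on your device
    )
    pipe.to(device)

    start = time.perf_counter()
    image = pipe(
        prompt,
        guidance_scale=0.0,        # schnell is distilled to run without guidance
        num_inference_steps=4,     # schnell targets very few steps
        max_sequence_length=256,
    ).images[0]
    elapsed = time.perf_counter() - start

    image.save(out_path)
    return elapsed
```

Calling `generate(build_prompt())` returns the generation time in seconds, which is the number to compare against the two-minute figure. Expect it to vary a lot with memory, precision, and whether the weights are already cached.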

Midjourney is very good. DALL-E is increasingly good. But neither of them runs on your machine, neither of them is free, and one of them still expects you to type prompts into Discord like it's 2022.

The merged model numbers will come in. They'll probably be interesting. But the baseline is already enough to make the old calculus feel quaint.