
The Two-Year Clock

DeepMind handed the cap set problem to a language model and the language model beat the mathematicians.

4 min read 696 words #AI #mathematics #DeepMind #LLMs #local-models
hindsight — still happening

deepmind announced alphaproof. AI math capabilities kept expanding into formal proofs and competition-level problem solving. the clock didn't stop; it accelerated.

The cap set problem is one of those things that looks simple when you write it on a napkin and then quietly ruins careers. Take the grid F₃ⁿ, where every coordinate is 0, 1, or 2 and arithmetic wraps around mod 3, and find the largest set of points in which no three form an arithmetic progression (equivalently, no three lie on a line). Terence Tao worked on it. Jordan Ellenberg worked on it. DeepMind pointed a language model at it on Thursday and apparently went home early.
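To make the napkin version concrete, here is a small Python sketch of my own (not from DeepMind's paper) that checks whether a set of points in F₃ⁿ is a cap set, plus a brute-force search that only works for tiny n. The trick it relies on: in F₃ⁿ, the line through two distinct points x and y has exactly one more point, z = -(x + y) mod 3.

```python
from itertools import combinations, product

Point = tuple[int, ...]


def is_cap_set(points: list[Point]) -> bool:
    """True if no three distinct points of F_3^n lie on a common line.

    The line through distinct x and y has exactly one more point,
    z = -(x + y) mod 3, so x, y, z are collinear iff x + y + z = 0 (mod 3),
    which is also the condition for them to form an arithmetic progression.
    """
    pts = set(points)
    for x, y in combinations(pts, 2):
        # The third point on the line through x and y.
        z = tuple(-(a + b) % 3 for a, b in zip(x, y))
        if z in pts:
            return False
    return True


def largest_cap_set_size(n: int) -> int:
    """Brute-force the maximum cap set size in F_3^n (only feasible for n <= 2)."""
    grid = list(product(range(3), repeat=n))
    # Try the biggest subsets first and stop at the first size that has a cap set.
    for k in range(len(grid), 0, -1):
        if any(is_cap_set(list(c)) for c in combinations(grid, k)):
            return k
    return 0


print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)]))  # True
print(is_cap_set([(0, 0), (1, 1), (2, 2)]))          # False: these three form a line
print(largest_cap_set_size(2))                        # 4
```

The brute force reproduces the known answer of 4 for n = 2, but the known maxima grow as 9, 20, 45, 112 for n = 3 through 6, and the search space grows far faster, which is why dimension 8 is nowhere near brute-force territory.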

The method is called FunSearch, a name that is both cute and exactly accurate, because it searches a space of functions rather than a space of solutions. The model doesn't try to find a cap set directly. It writes small programs. The programs run, the outputs get scored, the best programs go back into the model as examples, and the loop continues until something worth keeping falls out. It's an evolutionary algorithm with a large language model as the mutation operator.
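Here is a toy version of that loop, a sketch under loud assumptions. As I understand the paper, the model evolves a priority function that a fixed greedy routine uses to build a cap set; that part is mirrored here. Everything else is my simplification: the "program" is a weight table rather than Python source, and `mutate` is a random nudge standing in for the LLM, which makes this the dice-roll version of the loop rather than the interesting one. None of these names come from DeepMind's code.

```python
import random
from itertools import product
from typing import Callable

Point = tuple[int, ...]
# Hypothetical toy "program": weights[i][v] scores coordinate i taking value v.
# (FunSearch evolves actual Python source; this table is a stand-in.)
Weights = list[list[float]]


def make_priority(weights: Weights) -> Callable[[Point], float]:
    """Turn a weight table into a priority function over points of F_3^n."""
    return lambda p: sum(weights[i][v] for i, v in enumerate(p))


def greedy_cap_set(priority: Callable[[Point], float], n: int) -> list[Point]:
    """Fixed skeleton: add points in priority order, skipping any that complete a line."""
    chosen: list[Point] = []
    chosen_set: set[Point] = set()
    # sorted() is stable even with reverse=True, so ties keep lexicographic order.
    for p in sorted(product(range(3), repeat=n), key=priority, reverse=True):
        # p completes a line with q exactly when -(p + q) mod 3 is already chosen.
        if all(tuple(-(a + b) % 3 for a, b in zip(p, q)) not in chosen_set for q in chosen):
            chosen.append(p)
            chosen_set.add(p)
    return chosen


def score(weights: Weights, n: int) -> int:
    """Run the program inside the skeleton and score it by cap set size."""
    return len(greedy_cap_set(make_priority(weights), n))


def mutate(weights: Weights, rng: random.Random) -> Weights:
    """Stand-in for the LLM: randomly nudge one weight."""
    child = [row[:] for row in weights]
    i, v = rng.randrange(len(child)), rng.randrange(3)
    child[i][v] += rng.gauss(0.0, 1.0)
    return child


def evolve(n: int = 4, generations: int = 200, keep: int = 10,
           seed: int = 0) -> tuple[int, Weights]:
    """Evolutionary loop: mutate a strong program, score the child, keep the best."""
    rng = random.Random(seed)
    start = [[0.0] * 3 for _ in range(n)]  # all-zero weights = plain lexicographic greedy
    pool = [(score(start, n), start)]
    for _ in range(generations):
        # FunSearch feeds high-scoring programs back to the model; here we
        # just sample a parent from the current top three.
        parent = rng.choice(pool[:3])[1]
        child = mutate(parent, rng)
        pool.append((score(child, n), child))
        pool.sort(key=lambda entry: entry[0], reverse=True)
        del pool[keep:]  # elitism: the best score never goes down
    return pool[0]


best_score, best_weights = evolve(n=4, generations=200)
print(best_score)
```

The starting program is plain lexicographic greedy, which in F₃⁴ picks exactly the 16 points with coordinates in {0, 1}. Because the pool keeps its best entry, the printed score can't drop below 16, and it can't exceed 20, the known maximum for n = 4. Swapping `mutate` for a model that reads the parent's source and proposes an edit is the whole bet the paper is making.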

Which is where I have to be honest about the thing that makes this exciting and also the thing that makes the excitement potentially embarrassing: we don't know whether the LLM is being smart or just expensive. The evolutionary framework would work with a random number generator, in theory. The model might be learning the structure of good solutions and proposing mutations that respect it — or it might be an articulate dice roll in a lab coat. The paper doesn't settle this, because papers never settle the things you most want settled.

What it did do is find a cap set in F₃⁸ of size 512, larger than the best previously known construction of 496. It also found online bin packing heuristics that beat the standard hand-designed ones, first fit and best fit. Those numbers are real regardless of whether the explanation ends up being clean or embarrassing.
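The bin packing result has the same shape: a fixed packing loop plus a small scoring function that decides which open bin gets the next item. Below is a hedged sketch of that framing with the two textbook baselines plugged in. The `Priority` signature and the function names are my assumptions for illustration, not FunSearch's exact interface, and the sketch assumes every item fits in an empty bin.

```python
from typing import Callable

# Hypothetical interface: priority(item, remaining_capacity, bin_index) -> score.
# The highest-scoring bin that still fits the item wins.
Priority = Callable[[int, int, int], float]


def pack(items: list[int], capacity: int, priority: Priority) -> int:
    """Online bin packing: place each item as it arrives; return bins used.

    Assumes every item is <= capacity.
    """
    remaining: list[int] = []  # remaining capacity of each open bin
    for item in items:
        fits = [i for i, r in enumerate(remaining) if r >= item]
        if fits:
            best = max(fits, key=lambda i: priority(item, remaining[i], i))
            remaining[best] -= item
        else:
            remaining.append(capacity - item)  # nothing fits: open a new bin
    return len(remaining)


def first_fit(item: int, remaining: int, index: int) -> float:
    """Prefer the earliest-opened bin that fits."""
    return -index


def best_fit(item: int, remaining: int, index: int) -> float:
    """Prefer the bin the item fills most tightly."""
    return -(remaining - item)


items = [5, 6, 4, 5]
print(pack(items, 10, first_fit))  # 3
print(pack(items, 10, best_fit))   # 2
```

On this tiny input, first fit drops the 4 into the first bin and strands a slot that the final 5 can't use, while best fit puts it in the tighter bin and saves one. In a FunSearch-style run, `pack` would stay fixed and the evolved object would be the priority function.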

If the approach generalizes — if it turns out the LLM is doing something structurally clever rather than getting lucky inside a well-designed search procedure — the implications are not small. Combinatorics has a long list of problems that look like the cap set problem. So does operations research. So does drug discovery. The question of whether this is a new kind of mathematical tool or a one-off party trick is the only question worth tracking in the next year of follow-up work.


The rest of the week's news, briefly:

The Gemini Pro API is free up to 60 queries per minute, which either stays free forever or is a very specific prediction about where the commodity tier lands in six months. If Google-tier capability is the new free tier, something is coming from OpenAI that isn't GPT-4. GPT-4.5 has been a rumor long enough to have its own citations. The word "Ultra" keeps appearing in the same sentences as "2024."

Uncensored Mixtral (dolphin-2.5 on Hugging Face) now runs locally with a single terminal command via Ollama. A model that would have required a meaningful cloud compute bill six months ago now runs on a MacBook Pro with enough memory while you're doing something else. Ollama also shipped LLaVA support this week, which puts multimodal models (image plus text) in the same one-line-in-terminal bucket.

I discovered vector databases recently. I understand they've been around for a while.


Separately: I was talking to a friend who teaches statistics at the provincial university. He's been failing students for using ChatGPT on assignments — not because the outputs are wrong, which is its own layered problem, but because the policy is the policy. He looked like a man who has been right about something for too long.

I told him he has about two years left in which what he teaches still has to be done by hand.

He didn't argue.

The progression we ended up tracing out: hands → abacus → calculator → computer → ????. Every step, something that required a trained professional got reclassified as a tool. The statisticians who adopted calculators survived. The ones who called it cheating became a cautionary footnote in someone else's lecture.

Whatever is on the far side of that ???? — and a week where a language model improves on decades of combinatorics results suggests the gap might be shorter than the spacing of the previous steps would imply — is not going to check whether the professor is ready before it arrives.