35 Out of 42

Gemini with Deep Think just scored gold at the International Mathematical Olympiad, solved the hardest problem on the sheet, and failed the second-hardest, which is not how smart is supposed to work.

Gemini with Deep Think scored 35 out of 42 at the International Mathematical Olympiad, which is enough for a gold medal, which is remarkable, and also the kind of sentence that requires you to sit with it for a moment before you decide how to feel.

The gold threshold at IMO 2025 was 35 points. Gemini hit 35. That puts it exactly on the gold line: not at the very top, where the handful of perfect scorers live, but on the right side of the cutoff. The per-problem breakdown is where it gets strange: full marks on P1, P2, P4, P5, and P6. Zero on P3.

P3 is the hard problem on day one. P6 is the hard problem on day two — the one where the top students from 100+ countries stare at the page and produce nothing, where maybe 50 to 100 people out of 600 get full credit in a good year, where the gap between a gold medalist and a legend gets sorted out. Gemini solved P6. Gemini did not solve P3.

That is a very strange way to be smart.

The load-bearing part of DeepMind's announcement (and "load-bearing" here means "the thing everything else depends on") is that the solutions were graded by actual IMO coordinators, applying the actual rubric, the same way they graded the roughly 600 students sitting in an exam room in Australia. Human judges read these proofs and said yes, this is correct, here are seven points. This distinguishes the result from earlier AI-at-Olympiad claims, which were scored outside the official coordination process and left the rest of us to take the lab's word for it.

The model also produced natural language proofs — not Lean, not code, not numerical approximations. Mathematical arguments, written in the idiom that competition mathematics has used for 80 years. Earlier DeepMind systems like AlphaProof required the whole formal verification scaffold. Gemini just wrote the proof and handed it in.

There is one asterisk doing real structural work in all of this, which is the question of how many times the model ran.

A human contestant gets one attempt. They sit in a room with a sheet of paper and 4.5 hours and that is the entirety of the situation. Whether Gemini ran once per problem, or ran multiple times and submitted the best answer, is not something DeepMind's blog post is especially clear about. The "officially graded" framing covers the grading side. It does not cover the generation side, which is a separate question, and one worth asking before you get too deep into the milestone-speak.
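
To see why the attempt count matters, here is a minimal sketch in Python. It assumes something DeepMind has not stated: that each run is an independent try with the same chance p of producing a full-marks proof, and that the successful run can be picked out afterward. The function name and every number in it are hypothetical, chosen for illustration, and are not measurements of Gemini.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Illustrative assumption: every attempt succeeds with the same
    probability p, independently of the others.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # All n attempts fail with probability (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    p = 0.2  # hypothetical single-attempt success rate, not a real figure
    for n in (1, 5, 10):
        print(f"{n:>2} attempt(s): {best_of_n_success(p, n):.3f}")
```

Under those assumptions, a solver that succeeds one time in five on a single try succeeds almost nine times in ten when given ten tries. Same solver, very different headline.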

This isn't a reason to dismiss the result — it's a reason to hold it precisely. 35/42, human-graded, novel problems. That's real. The exact conditions are a separate thing.

What I keep coming back to is P6.

P6 is where the competition separates the gold medalists from the perfect scorers — the one that most of the best mathematical teenagers on earth will not solve, cannot solve, will spend the final hours of the exam staring at and then leave blank. Gemini solved it. Full seven points. Which means somewhere in that extended Deep Think chain-of-thought — the compute-heavy, let-it-run reasoning mode — the model found a path through a problem that is specifically designed to have no obvious path.

And then it could not do P3.

Maybe P3 required a trick that wasn't in the model's repertoire. Maybe it went down the wrong path and didn't recover. Maybe, in the same way a human can have a bad day on a specific problem type — walk in brilliant, stare at P3, feel the panic arrive — the model had something functionally equivalent. I don't have a clean interpretation. The fact that I can say "maybe it had a bad day" as a genuine hypothesis about a language model is its own kind of data point.

Demis Hassabis has described this as a step toward AI that contributes to research mathematics, not just competition mathematics — and that framing is correct, not just as PR positioning but as a genuine distinction. IMO problems are curated. They're known to be solvable. They're designed to have elegant solutions reachable in finite time by a prepared teenager. Research mathematics involves not knowing if the problem is solvable, not knowing if the approach will work, not knowing if the thing you're stuck on is hard or just hard for you.

IMO gold is the qualification exam. The actual game is something else.

But still. 35 out of 42. Graded by the people who grade IMOs. Novel problems, none of them in any training set. And P6, full marks, the problem most people leave blank.

Something is happening. It's just not entirely clear what.

Counterpoints