OpenAI’s gold medal performance on the International Math Olympiad

So what’s different? We developed new techniques that make LLMs a lot better at hard-to-verify tasks. IMO problems were the perfect challenge for this: proofs are pages long and take experts hours to grade. Compare that to AIME, where each answer is simply an integer from 0 to 999.

Also, this model thinks for a long time. o1 thought for seconds. Deep Research for minutes. This one thinks for hours. Importantly, it’s also more efficient with its thinking. And there’s a lot of room to push the test-time compute and efficiency further.
— Read on simonwillison.net/2025/Jul/19/openai-gold-medal-math-olympiad/

