> the answers are known to the authors of the questions but will remain encrypted for a short time.
Ok. But humans may be able to solve the problems too. What prevents Anthropic or OpenAI from hiring mathematicians, having them write the proofs, and passing them off as LLM-written? I'm not saying that's what they'll do. But shouldn't the paper say something about how they're going to validate that this doesn't happen?
Honest question; not trying to start a flame war. I'm genuinely confused about how this is going to test what it wants to test. Or maybe I'm just plain confused. Can someone help me understand?
Yep. "possible but unlikely" was my take too. As another person commented, this isn't really a benchmark, and as long as that's clear, it seems fair. My only fear is that some submissions may be AI-assisted rather than fully AI-generated, with crucial insights coming from experienced mathematicians. That's still a real achievement even if it's human + AI collaboration. But I fear that the nuance would be lost on news media and they'll publish news about the dawn of fully autonomous math reasoning.
This is what is special about them:
> a set of ten math questions which have arisen naturally in the research process of the authors. The questions had not been shared publicly until now;
I.e. these are problems of some practical interest, not just performative/competitive maths.
And this is what is known about the solutions:
> the answers are known to the authors of the questions but will remain encrypted for a short time.
I.e. a solution is known, but is guaranteed not to be in the training set of any AI.
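For what it's worth, a standard way to get that guarantee is a commit-reveal scheme: publish a salted hash of each answer up front, then release the answer and salt once the window closes, so anyone can verify the answer existed beforehand without it being readable. Here's a minimal Python sketch; this is just an illustrative assumption, since the paper only says the answers are "encrypted" and doesn't specify the mechanism:

```python
import hashlib
import secrets

# Hypothetical commit-reveal sketch -- not necessarily the authors' scheme.

def commit(answer: str) -> tuple[str, str]:
    """Publish the digest now; keep (answer, nonce) private."""
    nonce = secrets.token_hex(16)  # random salt stops guessing short answers
    digest = hashlib.sha256((nonce + answer).encode()).hexdigest()
    return digest, nonce

def reveal_ok(digest: str, answer: str, nonce: str) -> bool:
    """Anyone can later check the revealed answer matches the commitment."""
    return hashlib.sha256((nonce + answer).encode()).hexdigest() == digest

digest, nonce = commit("x = 42")           # digest is published with the questions
assert reveal_ok(digest, "x = 42", nonce)  # answer + nonce revealed afterwards
```

Either way, the point stands: until the reveal, the plaintext answers can't have leaked into any model's training data.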
Not a mathematician, and obviously you guys understand this better than I do. One thing I can't understand is how they're going to judge whether a solution was AI-written or human-written. A human could also solve the problem and pass the result off as AI work. You might ask why a human would want to do that; normal mathematicians might not, but mathematicians hired by Anthropic or OpenAI might, to claim it as an AI achievement.
Of course a math expert could solve the problems themselves and lie by saying that an AI model did it. In the same way, somebody with enough money could secretly film a movie and then claim that it was made by AI. That's outside the scope of what this paper is trying to address.
The point is not to score models based on how many of the problems they can solve. The point is to look at the models' responses and see how good they are at tackling the problem. And that's why the authors say that ideally, people solving these problems with AI would post complete chat transcripts (or the equivalent) so that readers can assess how much of the intellectual contribution actually came from AI.
It seems likely that PhD students in the authors' subfields are capable of solving these problems. What makes them interesting is that they seem to require fairly deep, research-level context to really make progress.
It's a test of whether LLMs can really synthesize results from knowledge that requires several years of postgraduate preparation in a specific research area.