I recruited a team of smart undergraduates to construct a dataset of ChatGPT responses to every open Erdős problem and to verify the outputs.
They found:
- 3 problems with new proofs (though in 2 cases, historical partial results were found that could be extended to solve the same problem)
- 4 problems where 5.2 Pro or Deep Research found an exact solution in the prior literature that hadn't been documented
- 3 problems where 5.2 Pro or Deep Research were able to strengthen a prior result in the literature
- 3 problems where typos were identified in the problem statements
The most common failure case is that 5.2 Pro solves the problem as stated, but professional mathematicians understand there to be an implicit constraint. For example, the problem may say integers when only positive integers are intended.
Happy to answer any questions about the dataset!
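For concreteness, here's a rough sketch of how each entry could be represented. This is illustrative only, not the exact schema; the field names and `Outcome` labels below just mirror the categories listed above.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Outcome(Enum):
    # Illustrative categories mirroring the findings above
    NEW_PROOF = auto()                   # new proof found by the model
    SOLVED_IN_PRIOR_LITERATURE = auto()  # exact solution located in the prior literature
    STRENGTHENED_PRIOR_RESULT = auto()   # prior result in the literature strengthened
    TYPO_IN_STATEMENT = auto()           # typo identified in the problem statement
    SOLVED_AS_STATED_ONLY = auto()       # solves the literal statement, misses an implicit constraint
    UNSOLVED = auto()

@dataclass
class Entry:
    problem_id: str      # identifier of the open Erdős problem
    model: str           # e.g. "5.2 Pro" or "Deep Research"
    response: str        # full model output
    outcome: Outcome     # classification after verification
    verifier_notes: str  # notes from the undergraduate who checked it
```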