Introduction
The First Proof paper (Abouzaid et al., 2026) aims to evaluate AI capabilities through a set of research-level mathematical problems. While the mathematical content of the questions is not in dispute, the experimental design suffers from significant methodological gaps that undermine the authors' primary conclusions. Specifically, the paper conflates binary outcomes with processual states, lacks independent verification protocols, and exhibits an asymmetric approach to transparency. This review examines five core logical inconsistencies (Axioms I–V) in which the study's rigorous mathematical standards appear decoupled from its empirical methodology. (Link: https://arxiv.org/abs/2602.05192)
Axiom I: One-shot experiments produce binary outcomes, not processual states.
The design is strictly non-iterative: the state space is S = {0, 1} (correct/incorrect). The authors conclude that AI systems "struggle," yet "struggle" denotes a processual state (iterative refinement, near-misses, variance across trials), and such a state is fundamentally unobservable in a single trial. The precise descriptor for the data collected is "failed," not "struggled." In a methodology paper, lexical precision is as vital as numerical accuracy.
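The distinction can be made concrete with a minimal simulation. The success probabilities below (0.40 for a system that genuinely struggles, 0.01 for one that is simply incapable) are illustrative assumptions, not measured values; the point is only that a single draw from either distribution can yield the identical observation.

```python
import random

random.seed(0)

def one_shot(p_success: float) -> int:
    """A single non-iterative trial: the observable state space is {0, 1}."""
    return int(random.random() < p_success)

# Two hypothetical systems (probabilities are illustrative assumptions):
# one that "struggles" yet often succeeds, and one that is simply incapable.
struggling, incapable = 0.40, 0.01

# The one-shot protocol yields exactly one bit per system.
print(one_shot(struggling), one_shot(incapable))  # 0 0 with this seed

# Only repeated trials expose the processual difference ("struggle" vs. failure).
n = 200
print(sum(one_shot(struggling) for _ in range(n)) / n)  # approx 0.4
print(sum(one_shot(incapable) for _ in range(n)) / n)   # approx 0.01
```

Under the one-shot protocol both systems can report the same bit; only the repeated-trial estimates separate them, and the paper's design forecloses exactly those estimates.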
Axiom II: Preliminary conclusions that overclaim propagate upward.
The authors acknowledge: "we expect that through such interactions we would be able to coax the systems to produce better answers." By their own admission, the one-shot setting is an artificial constraint. Testing a system in a deliberately sub-optimal configuration and then drawing generalized conclusions about "capability" is a logical non sequitur. This methodological choice directly fueled misleading headlines suggesting that AI is fundamentally incapable of mathematical research.
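The cost of the constraint can be quantified with a standard back-of-the-envelope model. Assuming, purely for illustration, a fixed per-attempt success probability p and independent attempts (real interactive refinement would typically do better than independence), the probability of at least one success in k attempts is 1 - (1 - p)^k.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts
    with per-attempt success probability p (an idealized model)."""
    return 1.0 - (1.0 - p) ** k

# Illustrative per-attempt probability: an assumption, not a measured value.
p = 0.15
for k in (1, 3, 10):
    print(f"pass@{k} = {pass_at_k(p, k):.2f}")
# pass@1 = 0.15, pass@3 = 0.39, pass@10 = 0.80
```

Under these hypothetical numbers, a system scoring 15% in the one-shot setting would be expected to solve roughly 80% of such problems given ten interactive attempts; conclusions about "capability" drawn at k = 1 do not transfer.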
Axiom III: Circular evaluation design and the "Oracle" problem.
The evaluation lacks a double-blind or independent grading protocol. The authors set the questions, hold the "canonical" solutions, and act as the sole arbiters of correctness. While they admit "correct answers are not always unique," a proof that diverges from the authors' anticipated path is judged by the same individuals committed to that path. This design is neither reproducible nor shielded from confirmation bias.
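A minimal remedy would be to route anonymized submissions to graders with no stake in the canonical solutions and to report chance-corrected agreement between the two verdicts. The sketch below computes Cohen's kappa over hypothetical accept/reject grades; the grade vectors are invented for illustration, as no such data appears in the paper.

```python
from collections import Counter

def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-corrected agreement between two independent graders."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa, pb = Counter(a), Counter(b)
    expected = sum(pa[c] * pb[c] for c in pa.keys() | pb.keys()) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts (1 = accepted, 0 = rejected): the problem authors
# versus an external grader who never saw the canonical solutions.
author_grades   = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
external_grades = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]
print(f"kappa = {cohens_kappa(author_grades, external_grades):.2f}")  # kappa = 0.60
```

A published agreement statistic of this kind would let readers gauge how much of a verdict depends on the authors' own reading of their problems.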
Axiom IV: Asymmetric transparency.
The study mandates that participants "share a complete transcript of their interaction with an AI system." However, the authors have not published the full transcripts of their own preliminary tests on GPT-5.2 or Gemini 3.0. Holding participants to a standard of transparency that the experimenters themselves do not meet is a breach of standard peer-review expectations.
Axiom V: The Benchmark Paradox.
The authors explicitly state: "our question list should not be considered a benchmark in its current form." Yet, in the subsequent analysis, global conclusions regarding AI's aptitude for mathematical research are derived from this very dataset. One cannot logically disclaim benchmark status while simultaneously using the data to perform benchmarking functions.
Conclusion
Internal logical consistency and methodological rigor are core professional standards in mathematics, standards to which the paper's eleven authors hold themselves and their work. It is therefore striking to observe those same standards applied so inconsistently in this paper. This review suggests that the community discussion should pivot from "Can AI do math?" to "Does this methodology support its own claims?"
Disclosure: AI was used as a language editor and for retrieving publicly available information about the First Proof paper. The methodological issues identified in this review were independently observed by the author. The irony of using AI to critique an AI benchmark paper is noted — and unlike the paper under review, the author's use of AI is fully disclosed.