I would like to carefully design my response to this article with a downvote
1. Scoring unsolvable puzzle instances as incorrect
2. Not accounting for output token limits
3. Not allowing LLMs to write code as part of the solution
I tend to see Apple’s paper as an excuse for not having competitive products.
This is the difference between someone who has memorized leetcode solutions and someone who can work through a novel problem.
Until they manage to, at which point they'll claim they invented AI.
https://chatgpt.com/share/68504396-e300-800c-a7ff-dde5fe1572...
- Impossible river-crossing claim: Again, in Figure 6 you can see that performance declines before we even reach 5 actors. So while it wasn't necessary to test all the way up to 20, the results still indicate that impossibility alone doesn't explain the decline (a brute-force solvability check is sketched below).
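For anyone who wants to check where the puzzle actually stops being solvable, a small state-space search is enough. This is only a rough sketch in Python; the state encoding and the constraint wording are my reading of the standard actors-and-agents formulation, not taken from either paper:

    from collections import deque
    from itertools import combinations

    def solvable(n_pairs, boat_capacity=3):
        """Breadth-first search over river-crossing states.

        People are ('a', i) for actor i and ('g', i) for agent i.
        Assumed constraint (standard formulation): an actor may never be
        with another actor's agent unless their own agent is also present.
        Returns True if everyone can be moved to the right bank.
        """
        people = frozenset(('a', i) for i in range(n_pairs)) | \
                 frozenset(('g', i) for i in range(n_pairs))

        def safe(group):
            actors = {i for kind, i in group if kind == 'a'}
            agents = {i for kind, i in group if kind == 'g'}
            # each actor needs their own agent present, or no other agents around
            return all(i in agents or not (agents - {i}) for i in actors)

        start = (people, 'L')          # everyone on the left bank, boat on the left
        seen = {start}
        queue = deque([start])
        while queue:
            left, side = queue.popleft()
            if not left:
                return True            # left bank empty: everyone has crossed
            bank = left if side == 'L' else people - left
            for size in range(1, boat_capacity + 1):
                for group in combinations(bank, size):
                    g = frozenset(group)
                    new_left = left - g if side == 'L' else left | g
                    if not (safe(g) and safe(new_left) and safe(people - new_left)):
                        continue
                    state = (new_left, 'R' if side == 'L' else 'L')
                    if state not in seen:
                        seen.add(state)
                        queue.append(state)
        return False

    # Small N only -- the state space is 2^(2N), so this is for locating the
    # cutoff, not for solving N = 20.
    print([n for n in range(2, 7) if solvable(n)])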
ForHackernews•7mo ago
Results: Very high accuracy across tested models (Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5), completing in under 5,000 tokens.
The generated solutions correctly implement the recursive algorithm, demonstrating intact reasoning capabilities when freed from the exhaustive enumeration requirement."
Is there something I'm missing here?
This seems like it demonstrates the exact opposite of what the authors are claiming: Yes, your bot is an effective parrot that can output a correct Lua program that exists somewhere in the training data. No, your bot is not "thinking" and cannot effectively reason through the algorithm itself.
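For scale, the recursive program that test asks for is only a few lines; what blows up with problem size is the move list it emits, not the program itself. A minimal sketch, in Python here for illustration rather than the Lua mentioned above:

    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        """Return the full move list for an n-disk Tower of Hanoi."""
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, src, dst, aux, moves)  # park the top n-1 disks on the spare peg
            moves.append((src, dst))            # move the largest disk to the target peg
            hanoi(n - 1, aux, src, dst, moves)  # stack the n-1 disks back on top of it
        return moves

    # The program stays this size for any n; only its output grows (2^n - 1 moves),
    # which is the gap between emitting the algorithm and enumerating every move.
    print(len(hanoi(15)))  # 32767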
ForHackernews•7mo ago
When given access to Google and prompted to "tell me how to find the length of the hypotenuse of a right triangle", a majority of middle-schoolers produced the correct Pythagorean Theorem, demonstrating intact reasoning capabilities when freed from the exhaustive comprehension requirement.
ForHackernews•7mo ago