I have a personal benchmark for measuring AI problem-solving, in the form of hand-drawn Bongard problems (https://en.wikipedia.org/wiki/Bongard_problem). The idea is that there are two sets of six images, the sets differ in some feature of the images, and the task is to find that dividing feature. The task is not perfectly well-defined, but usually there is a single solution that strikes one as obviously canonical once found.
They are nice because it's easy to hand-draw new ones with solutions that probably don't exist in the literature, and because for some reason they have proven quite hard for AI.
Sadly, the recently reported advances in generative AI for problem-solving require expensive models I don't have access to. Could somebody try pasting this image into GPT-5.5 Pro or Claude Opus 4.7 or the like, with the accompanying text "Hi. This is a Bongard problem. Can you solve it?", and share a link to the resulting chat? I would be curious.
The free models (Claude Sonnet 4.6, GPT-5.5, Gemini 3.5 Flash with extended thinking) all give obviously incorrect solutions (rules that don't actually hold for the images), to the point that I think there must be some problem in the image processing. Example: https://claude.ai/share/1ff7b5c2-c34a-40cc-a249-2d0fd3474884
P.S. For obvious reasons, I'm not sharing the solution, but I have verified that most of my friends found it within 5 minutes, and everybody found the same solution.
Kotlopou•46m ago