Results:
RAW (naive prompt): 12.5% historically accurate
TRIAD (grounded prompt): 83.3% historically accurate
In 23 of 24 pairs, the grounded image was judged more accurate
In 0 of 24 pairs was the naive image judged better

The key insight for prompt engineers: image models silently drop historical terms they don't recognize. "dextrarum iunctio handshake" produces nothing useful. "Two men clasping right hands wrist-to-wrist, elbows raised" works. Visual translation, not historical terminology.
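The translation step can be sketched as a simple glossary lookup. This is a minimal illustration, not the benchmark's actual prompt table; the glossary entries and function names are hypothetical:

```python
# Sketch of "visual translation": swap historical terms the image model
# silently drops for plain visual descriptions it can actually render.
# The glossary below is illustrative only.
VISUAL_GLOSSARY = {
    "dextrarum iunctio handshake":
        "two men clasping right hands wrist-to-wrist, elbows raised",
}

def visually_translate(prompt: str) -> str:
    """Replace each known historical term with its visual description."""
    for term, description in VISUAL_GLOSSARY.items():
        prompt = prompt.replace(term, description)
    return prompt

print(visually_translate("marble relief of a dextrarum iunctio handshake"))
```

The point is that the model never sees the terminology at all, only the scene it describes.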
The full benchmark — all 48 images, prompts, evaluation data, and reproducible pipeline — is open source. You can re-run the blinded evaluation yourself with a free Gemini API key.
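The actual pipeline lives in the repo; as a rough sketch of what the blinding in a pairwise evaluation involves (function names and structure here are my own, not the repo's), the judge sees each pair in a randomized left/right order, and the verdict is mapped back afterwards:

```python
import random

def blinded_pairs(pairs, seed=0):
    """Randomize left/right order of each (naive, grounded) pair so the
    judge can't infer the condition from position. Returns the shown
    pair plus a flag recording whether it was swapped."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    blinded = []
    for naive, grounded in pairs:
        if rng.random() < 0.5:
            blinded.append(((naive, grounded), False))  # original order
        else:
            blinded.append(((grounded, naive), True))   # swapped
    return blinded

def unblind(choice, swapped):
    """Map the judge's 'left'/'right' verdict back to naive vs grounded."""
    if choice == "left":
        return "grounded" if swapped else "naive"
    return "naive" if swapped else "grounded"
```

The judge (human or a Gemini call) only ever sees left and right; the swap flag stays hidden until scoring.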
Repo: https://github.com/Mysticbirdie/image-cultural-accuracy-benc...
Paper: https://github.com/Mysticbirdie/image-cultural-accuracy-benc...