We ran a small visual benchmark [1] comparing GPT, Gemini, Claude, and our new visual agent Orion [2] on a handful of tasks: object detection, segmentation, OCR, image/video generation, and multi-step visual reasoning.
The surprising part: models that ace benchmarks often fail on seemingly trivial visual tasks, while others succeed in unexpected places. We show concrete examples, side-by-side outputs, and how each model breaks when chaining multiple visual steps.
We go into more detail in our technical whitepaper [3]. Play around with Orion for free here [4].
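To make "chaining multiple visual steps" concrete, here is a rough sketch of the kind of task we mean. The names and helper structure below are illustrative only, not the actual Orion or vlm.run API; the point is that each step consumes the previous step's output, so one weak link breaks the whole chain.

    # Hypothetical sketch of a chained visual task (names are illustrative,
    # not the real Orion API). Each step feeds its output into the next.
    from dataclasses import dataclass

    @dataclass
    class Step:
        name: str
        prompt: str

    CHAIN = [
        Step("detect", "Find every receipt in the photo and return bounding boxes."),
        Step("ocr",    "Read the line items and totals inside each detected box."),
        Step("reason", "Do the line items sum to the printed total? Answer yes or no."),
    ]

    def run_chain(model_call, image_bytes: bytes) -> list[str]:
        # model_call(prompt, image, context) -> str is whatever model is under test.
        context: list[str] = []
        for step in CHAIN:
            out = model_call(step.prompt, image_bytes, context)
            context.append(f"{step.name}: {out}")  # feed prior outputs forward
        return context

A model that nails single-shot detection or OCR in isolation can still fall apart here, because an error in an early step compounds through every later one.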
[1] Showdown: https://chat.vlm.run/showdown
[2] Learn about Orion: https://vlm.run/orion
[3] Technical whitepaper: https://vlm.run/orion/whitepaper
[4] Chat with Orion: https://chat.vlm.run/
Happy to answer questions or dig into specific cases in the comments.