I'm a boat captain by trade, but I've spent the last 16 months building Camera Search.
Agents in the physical world need a different set of skills to be useful. We've optimized our harness and architecture to specialize in diagnosing and fixing problems in traditional industries like mining, oil & gas, telecom, construction, and the skilled trades.
Existing benchmarks didn't cover what workers in these industries actually do day-to-day, so today I'm publishing FieldOps-Bench on GitHub and Hugging Face [https://huggingface.co/datasets/CameraSearch/fieldopsbench].
It's a 157-case multimodal benchmark spanning 7 industries, testing visual diagnostics, code/standard citations, and general industrial field knowledge.
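If you want to poke at the data, it loads with the standard Hugging Face datasets library. Here's a minimal sketch (the split and column names shown are illustrative; check the dataset card for the exact schema):

    # Minimal sketch: load FieldOps-Bench via the Hugging Face `datasets` library.
    # Split and column names below are illustrative -- see the dataset card.
    from datasets import load_dataset

    ds = load_dataset("CameraSearch/fieldopsbench", split="test")

    for case in ds.select(range(3)):
        # Each case pairs imagery with a field question plus metadata.
        print(case["industry"], "|", case["category"])
        print(case["question"][:120])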
I ran it against our agent and the frontier models, scoring each case two ways: with a rubric and with pairwise judging. Camera Search beat Claude Opus 4.6 on 87% of cases.
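The pairwise scoring itself is simple once you have judge verdicts. A minimal sketch of the win-rate math (the verdict labels here are illustrative naming, not the harness's actual output):

    # Minimal sketch of pairwise win-rate scoring. Verdict labels are
    # illustrative, not necessarily what the judging harness emits.
    def win_rate(verdicts: list[str], ours: str = "camera_search") -> float:
        """Fraction of cases where the judge preferred our answer."""
        return sum(v == ours for v in verdicts) / len(verdicts)

    # For example, 137 wins out of 157 cases is ~87%.
    print(win_rate(["camera_search"] * 137 + ["opus"] * 20))  # 0.8726...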
I'm not a benchmarks specialist, so criticism is welcome. And yes, it's apples-to-oranges: my agent has tool use that the baseline models don't. I still think it shows what's possible when you tune the system and corpus for a specific vertical instead of relying on a general-purpose model.
Happy to answer any questions, and I'd love to connect with people building agents for the physical world, especially where the stakes are high and the information is incomplete.
-Pete
Happy to walk through specific cases if anyone wants to dig in.