I'm a boat captain by trade, but I've spent the last 16 months building Camera Search.
Agents in the physical world need a different set of skills to be useful. We've optimized our harness and architecture to specialize in diagnosing and fixing problems in traditional industries like mining, oil & gas, telecom, construction, and the skilled trades.
Existing benchmarks didn't cover what workers in these industries actually do day-to-day, so today I'm publishing FieldOps-Bench on GitHub and Hugging Face [https://huggingface.co/datasets/CameraSearch/fieldopsbench].
It's a 157-case multimodal benchmark spanning 7 industries, testing visual diagnostics, code/standard citations, and general industrial field knowledge.
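If you want to poke at the data, it loads with the standard Hugging Face datasets library. Here's a minimal sketch (the split and column names shown are illustrative; check the dataset card for the exact schema):

    # Minimal sketch: load FieldOps-Bench via the Hugging Face `datasets` library.
    # Split and column names below are illustrative -- see the dataset card.
    from datasets import load_dataset

    ds = load_dataset("CameraSearch/fieldopsbench", split="test")

    for case in ds.select(range(3)):
        # Each case pairs imagery with a field question plus metadata.
        print(case["industry"], "|", case["category"])
        print(case["question"][:120])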
I ran it against our agent and the frontier models, scoring each case two ways: with a rubric and with pairwise judging. Camera Search beat Claude Opus 4.6 on 87% of cases.
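The pairwise scoring itself is simple once you have judge verdicts. A minimal sketch of the win-rate math (the verdict labels here are illustrative naming, not the harness's actual output):

    # Minimal sketch of pairwise win-rate scoring. Verdict labels are
    # illustrative, not necessarily what the judging harness emits.
    def win_rate(verdicts: list[str], ours: str = "camera_search") -> float:
        """Fraction of cases where the judge preferred our answer."""
        return sum(v == ours for v in verdicts) / len(verdicts)

    # For example, 137 wins out of 157 cases is ~87%.
    print(win_rate(["camera_search"] * 137 + ["opus"] * 20))  # 0.8726...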
I'm not a benchmarks specialist, so criticism is welcome. And yes, it's apples-to-oranges: my agent has tool use that the baseline models don't. I still think it shows what's possible when you tune the system and corpus for a specific vertical instead of relying on a general-purpose model.
Happy to answer any questions, and I'd love to connect with people building agents for the physical world, especially where the stakes are high and the information is incomplete.
-Pete
Happy to walk through specific cases if anyone wants to dig in.