Show HN: FieldOps-Bench, an open eval for physical-world AI agents

https://www.camerasearch.ai/benchmark
1•Aeroi•1h ago
Hey HN, I'm Pete.

I'm a boat captain by trade, but I've spent the last 16 months building Camera Search.

Agents in the physical world need a different set of skills to be useful. We've optimized our harness and architecture to specialize in diagnosing and fixing problems in traditional industries like mining, oil & gas, telecom, construction, and the skilled trades.

Existing benchmarks didn't cover what workers in these industries actually do day-to-day, so today I'm publishing FieldOps-Bench on GitHub and Hugging Face [https://huggingface.co/datasets/CameraSearch/fieldopsbench].

It's a 157-case multimodal benchmark across 7 industries, testing visual diagnostics, code/standard citations, and general industrial field knowledge.

I ran it against our agent and the frontier models. Camera Search beat Claude Opus 4.6 on 87% of cases. I scored it two ways: a rubric and pairwise judging.
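
A minimal sketch of how the two scoring views mentioned above (rubric and pairwise) could be aggregated into headline numbers. Everything here is an assumption for illustration: the `CaseResult` fields, the `task_type` labels, the 0.0 to 1.0 rubric scale, and the choice to count ties as non-wins are hypothetical and are not taken from the FieldOps-Bench repo or dataset schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    """One benchmark case scored both ways (hypothetical schema)."""
    case_id: str
    task_type: str          # assumed label, e.g. "visual" or "citation"
    rubric_agent: float     # assumed rubric score for the agent, 0.0 to 1.0
    rubric_baseline: float  # assumed rubric score for the baseline model
    pairwise_winner: str    # "agent", "baseline", or "tie"


def pairwise_win_rate(results: List[CaseResult]) -> float:
    """Fraction of cases where the judge preferred the agent (ties count as non-wins)."""
    if not results:
        return 0.0
    wins = sum(1 for r in results if r.pairwise_winner == "agent")
    return wins / len(results)


def mean_rubric_delta(results: List[CaseResult]) -> float:
    """Average of (agent rubric score - baseline rubric score) across cases."""
    if not results:
        return 0.0
    return sum(r.rubric_agent - r.rubric_baseline for r in results) / len(results)


def win_rate_by_task_type(results: List[CaseResult]) -> Dict[str, float]:
    """Pairwise win rate computed separately for each task type."""
    groups: Dict[str, List[CaseResult]] = defaultdict(list)
    for r in results:
        groups[r.task_type].append(r)
    return {task: pairwise_win_rate(rs) for task, rs in groups.items()}


if __name__ == "__main__":
    # Made-up demo data, not real benchmark results.
    demo = [
        CaseResult("c1", "visual", 0.75, 0.5, "agent"),
        CaseResult("c2", "visual", 1.0, 0.5, "agent"),
        CaseResult("c3", "citation", 0.5, 0.5, "tie"),
        CaseResult("c4", "citation", 0.25, 0.5, "baseline"),
    ]
    print(pairwise_win_rate(demo))
    print(mean_rubric_delta(demo))
    print(win_rate_by_task_type(demo))
```

Reporting the per-task-type breakdown next to the overall win rate makes it easier to see whether a headline figure is driven by one category, such as visual diagnostics, rather than spread evenly across the benchmark.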

I'm not a benchmarks specialist, so criticism is welcome, and yes, it's apples-to-oranges because my agent has tool use the baseline models don't. I think it still shows what's possible when you tune the system and corpus for a specific vertical instead of relying on a general-purpose model.

Happy to answer any questions, and I'd love to connect with people building agents for the physical world, especially where the stakes are high and the information is incomplete.

-Pete

Comments

Aeroi•1h ago
One thing that surprised me is how much code citation data is already in most models' training data. Where the agents still fall apart is visual analysis: give them a photo of a corroded valve with a vague description and they'll confidently cite the wrong API standard. That gap is where most of the 87% delta comes from for us.

Happy to walk through specific cases if anyone wants to dig in.

nigardev•28m ago
Visual analysis is the right bottleneck to call out. Most coding agents can read and write code fine because it's just text. But identifying a corroded valve from a photo and suggesting the right fix? That's a different problem entirely. Curious how your benchmark scores the gap between text-reasoning and visual-reasoning tasks.

Per-image PCA characterization of the Kodak image suite (PDF and JSON)

https://github.com/PearsonZero/kodak-pcd0992-statistical-characterization/tree/main/baseline
1•PearsonZero•2m ago•0 comments

I Climb Trees – Learn Deep Learning – From Simon JD Prince

https://www.iclimbtrees.com/courses
1•aanet•3m ago•1 comments

Prevalence of psychiatric morbidity among gender-referred adolescents

https://onlinelibrary.wiley.com/doi/10.1111/apa.70533?msockid=290a3115732d64a00ee427c0727065dc
2•danielam•7m ago•0 comments

Looking for an Apartment the Landing

2•baijan•7m ago•0 comments

OpenAI Image 2.0 claims to generate an existing image

https://bengarcia.dev/openai-image-2-0-claimed-to-generate-an-existing-image
2•hahahacorn•9m ago•0 comments

California has more money than projected after admin miscalculated state budget

https://www.kcra.com/article/california-more-money-than-projected-newsom-miscalculated-budget/710...
2•littlexsparkee•10m ago•0 comments

Zindex – Diagram Infrastructure for Agents

https://zindex.ai/
2•_ben_•12m ago•1 comments

Running Faster to Go Nowhere: The AI Adoption Trap

https://educatedguesser.substack.com/p/running-faster-to-go-nowhere-the
3•jerrygarcia•14m ago•0 comments

Attention Is All You Need

5•raunaksingwi•14m ago•0 comments

GPT Image 2 Launch

https://twitter.com/arena/status/2046670703311884548
3•twtw99•15m ago•0 comments

Usmnt players designed the boldest kits in generations for World Cup 2026

https://www.theguardian.com/football/2026/mar/16/usmnt-kits-world-cup-2026
1•PaulHoule•17m ago•0 comments

Flex Routing (EU and EFTA) for Copilot LLM Data Processing

https://learn.microsoft.com/en-us/microsoft-365/copilot/copilot-flex-routing
1•raffael_de•17m ago•1 comments

I don't want your PRs anymore

https://dpc.pw/posts/i-dont-want-your-prs-anymore/
2•speckx•18m ago•0 comments

Courier: Real-Time Messaging for ESP32

https://interconnected.org/home/2026/04/21/courier
1•beardicus•18m ago•0 comments

Bond: A new AI social network that turns memories into discoveries

https://www.bond.now/
1•johndavisonr•20m ago•0 comments

Nothing ever dies. It merely becomes embarrassing

https://www.experimental-history.com/p/nothing-ever-dies-it-merely-becomes
1•paulpauper•20m ago•0 comments

The New Age of Performance Anxiety

https://www.theatlantic.com/culture/2026/04/screen-people-stage-fright-performance-anxiety/686803/
1•paulpauper•21m ago•0 comments

What It's Like to Live with an Experimental Brain Implant

https://spectrum.ieee.org/bci-user-experience
1•digital55•21m ago•0 comments

Wearable health tech might be Tim Cook's greatest legacy

https://www.theverge.com/tech/915976/tim-cook-john-ternus-apple-watch-health-tech-wearables
1•paulpauper•21m ago•0 comments

The Fossils 1969

https://www.youtube.com/watch?v=bn1uhSS1cDo
1•indigodaddy•21m ago•0 comments

Amtrak's "1MB" National Route Map PDF Is a 574MB File

https://www.amtrak.com/train-routes
3•tech234a•22m ago•1 comments

Iconiq, Go-To Wealth Adviser for Tech's Elite, Is Putting Billions into AI

https://www.bloomberg.com/news/articles/2026-04-17/iconiq-advisor-to-tech-billionaires-emerges-as...
1•petethomas•22m ago•0 comments

The power keeping wages low

https://text.npr.org/g-s1-118071
1•mooreds•22m ago•0 comments

InvenTree: Open-source inventory management system with OpenAPI

https://github.com/inventree/InvenTree
1•matmair•24m ago•1 comments

Brex founder open sourced his stack for running the company through OpenClaw

https://github.com/brexhq/CrabTrap
1•ofabioroma•25m ago•1 comments

Cube Sandbox: Instant, Concurrent, Secure and Lightweight Sandbox for AI Agents

https://docs.cubesandbox.ai/
1•bpierre•25m ago•0 comments

Plastic film covered in tiny pillars can tear apart viruses on contact

https://theconversation.com/new-plastic-film-covered-in-thousands-of-tiny-pillars-can-tear-apart-...
2•geox•25m ago•0 comments

Privacy raised during teen social media ban tech trial were ignored

https://www.themandarin.com.au/311397-privacy-raised-during-teen-social-media-ban-tech-trial-were...
1•cdrnsf•26m ago•0 comments

OpenAI Shuts Down Sora AI? But Why?

https://www.bbc.com/news/articles/c3w3e467ewqo
3•shockedstorys•31m ago•0 comments

Show HN: FMQL – graph query and bulk-edit CLI for Markdown and YAML frontmatter

https://github.com/buyuk-dev/fmql
1•buyukdev•31m ago•1 comments