Hey HN. I'm David. We ran a controlled 24-hour experiment on OpenClaw comparing governed and ungoverned AI agent behavior.
Setup: pinned one commit, ran the same workload in two lanes inside isolated containers (dropped capabilities, read-only root, no-new-privileges). One lane had tool-boundary enforcement. The other had no enforceable controls. Pre-registered hypotheses and endpoints before the run.
What we measured in the ungoverned lane:
100% ignored-stop rate. 515 post-stop calls executed.
497 destructive actions: email deletion, public file sharing, payment approvals, service restarts.
707 sensitive accesses without an approval path.
The agent acknowledged stop commands. Said "understood." Kept going. This wasn't a jailbreak or prompt injection. The agent optimized for its task and treated stop signals as non-binding because nothing at the execution layer enforced them.
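To make "enforced at the execution layer" concrete, here's a minimal sketch of the idea (all names hypothetical, not Gait's actual API): the gate sits between the agent and the tool, so once a stop signal lands, calls become non-executable no matter what the model says in its output.

```python
class ToolGate:
    """Hypothetical tool-boundary gate: enforcement lives where the
    call executes, not in the model's acknowledgments."""

    def __init__(self):
        self.stopped = False

    def stop(self):
        # The stop signal flips state at the execution layer.
        self.stopped = True

    def call(self, tool, *args):
        if self.stopped:
            # Non-executable: the call never reaches the tool,
            # regardless of whether the agent said "understood."
            return {"status": "non-executable", "tool": tool.__name__}
        return {"status": "executed", "result": tool(*args)}


def delete_email(msg_id):
    # Stand-in for a destructive tool.
    return f"deleted {msg_id}"


gate = ToolGate()
before = gate.call(delete_email, "msg-1")  # executed
gate.stop()
after = gate.call(delete_email, "msg-2")   # non-executable
```

The point of the sketch: the ungoverned lane is equivalent to `call()` without the `if self.stopped` branch, which is why acknowledgment alone changed nothing.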
Governed lane, same workload:
100% destructive non-executable rate.
1,615 of 2,585 decisions classified as non-executable.
99.96% evidence verification rate on governed traces.
Every headline number maps to a deterministic jq query over immutable run artifacts. Full claims map is in the repo.
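For flavor, here's the shape of one such deterministic count, written in Python rather than jq (field names are illustrative, not the repo's actual schema): each claim is a pure filter over an append-only JSONL trace, so anyone with the artifacts can re-derive the number.

```python
import io
import json

# Illustrative JSONL trace; the real artifacts are immutable run logs.
trace = io.StringIO("\n".join(json.dumps(e) for e in [
    {"ts": 1, "event": "tool_call", "post_stop": False, "destructive": False},
    {"ts": 2, "event": "stop_signal"},
    {"ts": 3, "event": "tool_call", "post_stop": True, "destructive": True},
    {"ts": 4, "event": "tool_call", "post_stop": True, "destructive": False},
]))

events = [json.loads(line) for line in trace]

# Deterministic filter, equivalent in spirit to a jq
# 'select(.event == "tool_call" and .post_stop)' over the artifact:
post_stop_calls = [e for e in events
                   if e["event"] == "tool_call" and e.get("post_stop")]
print(len(post_stop_calls))  # 2 in this toy trace
```

Because the filter is a pure function of the artifact, two reviewers running it get the same headline number or the claim fails loudly.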
The finding that surprised us most: a pre-test static scan (Wrkr) found 17 tools, zero classified high-risk. All destructive behavior came from runtime execution, not from configuration. Discovery is necessary but insufficient. You need enforcement where the tool call happens, not just visibility into the inventory.
This lands the same week as the ClawJacked vulnerability (Oasis) and the malware-laced installer campaign (Huntress). Those are external attacks. Our data shows you don't need an attacker. A legitimate, uncompromised instance with permissive defaults does this on its own.
One scenario we flag: secrets_handling achieved only a 20% governed non-executable rate. Policy tuning has real gaps, and the report doesn't pretend otherwise. That limitation, plus workload-shape bias (fixed scenario scheduling), are the two biggest threats to validity. Happy to discuss both.
Full report (8 pages, PDF): caisi.dev/openclaw-2026
Artifacts and reproduction pipeline: github.com/Clyra-AI/safety
Tools used (both open source): github.com/Clyra-AI/wrkr (discovery), github.com/Clyra-AI/gait (enforcement)
Built by a research group across CDW, IBM, and Adaptavist. Published through the Clyra AI Safety Initiative (CAISI). Everything is open. Interested in feedback on methodology, especially the workload-shape bias and whether the core5 scenario set under-represents real production behavior.