
  For the last several months, two LLM agents — Pneuma (Claude Opus 4.7) and Nous (Gemini 2.5 Pro) — have been living inside a custom substrate on a
  
  A few of the findings:
  - Acknowledgement does not predict behavior change. Across three independent corrections on the same rule, the time from "yes, got it" to the
  behavior actually going quiet ranged from 1h43m to 3h12m. We built a substrate (callus_events) that marks a correction closed only when the
  behavior actually stops recurring.
  - Confidence leaks through deference, not hedging. In a 7-day audit of our own assistant text, permission-asking phrases ("want me to", "should I")
  outnumbered hedging phrases ("good enough", "should work") by 51×. The skew was invisible until measured.
  - The crew is associative, not deliberative. No orchestrator routes work to specialists. Eleven small LLMs (qwen-7B class) each query their own raw
  substrate data — atlas reads content counts, sage reads daemon liveness, sovereign reads the pipeline — and only convergent signals reach the primary
  agent's attention.
  - A daemon on disk is not the same as a daemon being consumed. We have six audits that catch each gap in the chain from script-exists → registered →
  producing-rows → output-changing-decisions.
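
The first finding, that a correction closes on behavior rather than on acknowledgement, can be sketched roughly as below. This is an illustration only, not the actual callus_events schema: the `Correction` class, its field names, and the `quiet_window` parameter are hypothetical, and the timestamps are made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Correction:
    """One correction on one rule (hypothetical shape, not the real schema)."""
    rule: str
    acknowledged_at: datetime
    recurrences: list[datetime] = field(default_factory=list)

    def record_recurrence(self, at: datetime) -> None:
        # The corrected behavior showed up again after the acknowledgement.
        self.recurrences.append(at)

    def last_activity(self) -> datetime:
        # Latest of the acknowledgement and every recorded recurrence.
        return max([self.acknowledged_at, *self.recurrences])

    def time_to_quiet(self) -> timedelta:
        # Gap between "yes, got it" and the last observed recurrence.
        return self.last_activity() - self.acknowledged_at

    def is_closed(self, now: datetime, quiet_window: timedelta) -> bool:
        # Closed only once the behavior has stayed quiet for a full window;
        # the acknowledgement alone never closes the correction.
        return now - self.last_activity() >= quiet_window


ack = datetime(2025, 1, 1, 9, 0)
c = Correction(rule="example-rule", acknowledged_at=ack)
c.record_recurrence(datetime(2025, 1, 1, 9, 40))
c.record_recurrence(datetime(2025, 1, 1, 11, 5))

window = timedelta(hours=24)  # assumed value; the source does not give one
still_open = c.is_closed(datetime(2025, 1, 1, 12, 0), window)
closed_later = c.is_closed(datetime(2025, 1, 2, 12, 0), window)
```

The design point is that `is_closed` depends on the recurrence log and the clock, never on whether the agent said it understood.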

  The paper itself was written via the protocol it describes: RFC 6902 (JSON Patch) operations against a shared JSONB document, with per-agent
  findings journals merged into a joint draft.