The benchmark defines reproducible tests for:

- Token efficiency
- Coherence stability
- Latency
- Drift control
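A minimal sketch of what a single benchmark record for these four metrics might look like. The field names (`tokens_used`, `coherence_score`, `latency_s`, `drift_score`) are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

@dataclass
class RunMetrics:
    """One benchmark run's measurements (hypothetical field names)."""
    tokens_used: int        # total prompt + completion tokens consumed
    coherence_score: float  # 0..1 agreement of outputs with the task intent
    latency_s: float        # mean wall-clock seconds per loop iteration
    drift_score: float      # 0..1 deviation from the original context/goal
```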
It compares a baseline looping agent against the SIGMA runtime loop: `context → _generate() → output → drift/stability/memory update → causal continuity → context rebuild`.
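A hedged sketch of the two loops under comparison, assuming a naive baseline that only appends outputs to its context. `_generate` is taken from the loop description above; `update_state` and `rebuild_context` are stand-in names for the drift/stability/memory update and context-rebuild stages, not the runtime's real API:

```python
def baseline_loop(model, context: str, steps: int) -> str:
    """Naive agent loop: every output is appended to an ever-growing context."""
    for _ in range(steps):
        output = model.generate(context)
        context += "\n" + output  # context grows without pruning or rebuild
    return context


def sigma_loop(runtime, context: str, steps: int) -> str:
    """SIGMA-style loop: generate, update drift/stability/memory state,
    carry causal continuity forward, then rebuild a compact context."""
    for _ in range(steps):
        output = runtime._generate(context)
        state = runtime.update_state(output)       # drift / stability / memory update
        context = runtime.rebuild_context(state)   # causal continuity -> rebuilt context
    return context
```

Running both loops over the same task and recording the metrics above is what makes the comparison reproducible across providers.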
The goal is to make runtime efficiency and coherence retention measurable and independently verifiable across model providers and agent frameworks.
Independent replication and external validation are encouraged. If you run long-context or autonomous LLMs, please share your benchmark results or insights.