Today's agent frameworks have done serious work on the cognitive layer: tool selection, planning, multi-agent coordination. What they don't provide is an execution layer, the machinery that controls how tool calls run, not just which ones get made.
Two gaps kept biting us:
There's no way to bound what an agent can do. It can call any tool, execute any number of operations, with nothing structurally preventing it. Give it access to delete_file and it can wipe your filesystem before you notice.
There's no process model. And when it does go off the rails, you can't even stop it. No pause, no resume. If something fails on step 39 of 40, you restart from step 1.
Castor routes every tool call through a kernel as a syscall. The agent has no other execution path, so capability limits and approval gates are structural, not advisory.
Within budget, everything auto-executes, even deletes. No popups. Budget runs out, the kernel stops the agent and a human decides. Budget replaces per-call approval.
Every syscall result is logged in an immutable journal. Suspend = unwind the stack. Resume = replay from the top with cached responses, live execution only from the suspension point. So you don't burn another $2.00 on tokens just to see if your fix worked. Capability limits, HITL, crash recovery, and deterministic debugging all fall out of the same mechanism.
The tradeoff is real: all non-determinism has to go through the kernel. If the agent sneaks in a raw API call outside the boundary, the replay diverges. It's a hard constraint.
When we stepped back, we realized we'd reinvented a 50-year-old idea. This is exactly the separation an OS draws between user space and kernel space. Castor is, in that sense, a microkernel for agents: a minimal privileged core that enforces resource limits and mediates every interaction between agent code and the outside world.
One thing we're still not sure about: is routing ALL non-determinism through a kernel boundary too heavy-handed? We considered using a lighter model where only destructive tools go through the check, but then you lose deterministic replay. Anyone found a middle ground or other ideas?
Code: https://github.com/substratum-labs/castor Docs: http://substratumlabs.ai/castor-docs/