The only thing I would mention is that, having built a lot of agents and worked with a lot of plug-ins and MCPs, everything is super situation- and context-dependent. It's hard to spin up a general agent that's useful in a production workflow, because it requires so much configuration beyond a standard template. And if you're not carefully monitoring it, it won't meet your requirements when it's done. When it comes to agents, precision and control are key.
We built toran.sh specifically for this: it lets you watch real API requests from your agents as they happen, without adding SDKs or logging code. Replace the base URL, and you see exactly what the agent sent and what came back.
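As a sketch of what the base-URL swap looks like (the proxy hostname below is a placeholder, not a real toran.sh endpoint, and `endpoint` is a hypothetical helper):

```typescript
// Hypothetical: an agent builds request URLs from whatever base it is
// configured with, so routing traffic through an observing proxy is a
// one-line configuration change. The proxy records each request/response
// pair and forwards it to the upstream unchanged.
const UPSTREAM_BASE = "https://api.openai.com/v1";
const PROXY_BASE = "https://proxy.example.com/v1"; // placeholder proxy, not toran.sh

function endpoint(base: string, path: string): string {
  return `${base}${path}`;
}

// Same agent code, different base URL:
console.log(endpoint(UPSTREAM_BASE, "/chat/completions"));
// https://api.openai.com/v1/chat/completions
console.log(endpoint(PROXY_BASE, "/chat/completions"));
// https://proxy.example.com/v1/chat/completions
```

The point is that no SDK or logging code is added to the agent itself; only the base URL it is handed changes.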
The "precision and control" point is key, though: visibility is step one, but you also need guardrails. We're working on that layer too (keypost.ai, for policy enforcement on MCP pipelines).
Would love to hear what monitoring approaches you've found work well for production agent workflows.
"Run" X,Y,Z...where, exactly, does it run? "Isolated environment": how is isolation achieved? Is it a VM? If so, what is the virtualization stack and what does it contain? Is it Firecracker, or just a Docker image? What are the defaults and network rules in these isolated environments?
Taking a look at the 1400-line test file https://github.com/vm0-ai/vm0/blob/1aaeaf1fed3fd07afaef8668b... it becomes really clear why we shouldn't yet use LLMs for this without detailed reviews.
Obviously, you want your tests to exercise the implementation, not verify that the mocks are working. I didn't read all the code, but a lot of it is not great. Generally, you should treat test code like any other production code: build abstractions and a simple design/architecture that heavily reduces test duplication. Otherwise you end up with huge balls of spaghetti that are impossible to get a clear overview of, hard to change reasonably, and hard to read for what is actually being tested. Like that run.test.ts.
Severian•2w ago
Is this Docker, Kubernetes, KVM/Xen, AWS, Azure, GCP, Fly.io, some other VM tech, or some rando's basement?
Very little detail and I don't trust this at all.
I fully agree: without clear architecture docs, I wouldn't trust an infra service either. We're working on technical documentation now.
Here is a quick summary of our architecture: we use E2B's managed sandboxes (Firecracker microVMs), and we're also building our own Firecracker runner (independent of E2B) with experimental network firewall features.
We use E2B because it's easy to start with and requires no infrastructure of our own, while self-hosting will give developers full control: custom security policies, running on their own infra.
We're at an early stage and planning to release at the end of January. Detailed architecture docs are coming soon. Feedback welcome!