We kept seeing the same problem with AI agents in production: you can observe them (traces, logs, dashboards), but when something goes wrong, your options are basically restart the process and lose all progress, or wait and hope.
We built Restate as a durable execution engine, and it turns out the primitives it provides, journaling every step, giving each execution a stable ID, map really well onto the control problem for agents.
This post walks through concrete scenarios: cancelling hundreds of agents stuck retrying a dead endpoint, pausing agents during DB maintenance instead of letting them burn through retries, and restarting a failed three-hour workflow from the exact step that failed (without redoing the expensive work before it).
Curious what control problems others have hit with agents in production. Happy to answer questions.
verdverm•1h ago
Any framework worth it's muster is already handling this. I use ADK
BlueHotDog2•1h ago
this is cool! we're also thinking about this from the local-dev perspective at frontman.sh! love the animations on the blog
stsffap•1h ago
We kept seeing the same problem with AI agents in production: you can observe them (traces, logs, dashboards), but when something goes wrong, your options are basically restart the process and lose all progress, or wait and hope.
We built Restate as a durable execution engine, and it turns out the primitives it provides, journaling every step, giving each execution a stable ID, map really well onto the control problem for agents.
This post walks through concrete scenarios: cancelling hundreds of agents stuck retrying a dead endpoint, pausing agents during DB maintenance instead of letting them burn through retries, and restarting a failed three-hour workflow from the exact step that failed (without redoing the expensive work before it).
Curious what control problems others have hit with agents in production. Happy to answer questions.
verdverm•1h ago