In real time, the coding agent ran through a suite of system commands, figured out which jobs were causing problems, and even started digging into the individual function calls (Python and Node processes can both be inspected at the function-call level by sideloaded processes) before the entire system finally crashed.
This was extremely cool, and I realized that with a few tweaks I could make it a legitimately useful tool. The basic idea: any time certain system vitals cross a threshold, spin up a coding agent and have it debug what is going on as aggressively as possible, with all logs streamed to a third-party server (in addition to being stored on disk). This basic abstraction would solve two huge problems:
- Most of the time it is very hard to figure out exactly why a machine went down. This tool would effectively act as an airplane black box: a last record of what was going on, focused specifically on debugging the failure as it happened. That is a massive speedup in figuring out system-breaking issues.
- Most of the time there are interventions available that would prevent the system from going down at all, if a human were around when the crash was happening. For example, if I see that I’m about to OOM from vitest, I can just kill the processes that are spiking memory and keep the system from crashing.
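To make the threshold idea concrete, here is a minimal sketch of the watch loop in Python. This is not premortem's actual code: the `Thresholds` defaults, the `breached` and `watch` names, and the `read_vitals` / `on_breach` callbacks are all hypothetical, and the callbacks stand in for whatever really samples the machine and spawns the agent.

```python
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits in percent; not premortem's real defaults.
    cpu: float = 90.0
    memory: float = 85.0
    disk: float = 95.0


def breached(vitals: Dict[str, float], limits: Thresholds) -> List[str]:
    """Return the names of vitals at or above their limit."""
    return [
        name
        for name, limit in asdict(limits).items()
        if vitals.get(name, 0.0) >= limit  # a missing reading counts as healthy
    ]


def watch(
    read_vitals: Callable[[], Dict[str, float]],
    on_breach: Callable[[List[str], Dict[str, float]], None],
    limits: Thresholds,
    interval_s: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll vitals and call on_breach whenever a limit is crossed.

    read_vitals and on_breach are assumptions of this sketch: the first
    samples the machine, the second would spawn the debugging agent and
    start streaming its logs. Returns how many times on_breach fired.
    """
    polls = 0
    triggers = 0
    while max_polls is None or polls < max_polls:
        vitals = read_vitals()
        names = breached(vitals, limits)
        if names:
            on_breach(names, vitals)
            triggers += 1
        polls += 1
        sleep(interval_s)
    return triggers
```

As written, the sketch fires on every poll that is over a limit. A real version would presumably want a cooldown so a sustained spike spawns one agent rather than one per poll.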
We now have premortem running on all of our production machines.
Hope this is useful for other folks!
doormatt•41m ago
>Premortem continuously watches system vitals (CPU, memory, disk, processes) and spawns Claude agents to diagnose problems when thresholds are breached.
Surely you see the irony here...