So I asked a different question: what if the system were designed around the assumption that agents WILL fail, and the infrastructure's job was never to let that failure become a dead end?
openTiger is a "non-human-first" orchestration system that runs multiple AI agents in parallel — planner, workers, testers, judge — each with a dedicated role. The planner decomposes requirements into tasks, the dispatcher fans them out to worker agents concurrently, and the judge evaluates results and feeds back rework decisions. It's not one agent doing everything; it's a pipeline of specialized agents running simultaneously.
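To make the shape concrete, here is a minimal asyncio sketch of that fan-out-and-judge loop. Everything in it (the Task record, the plan/work/judge stubs, the acceptance rule) is a hypothetical stand-in for illustration, not openTiger's actual API:

```python
import asyncio
from dataclasses import dataclass

# Hypothetical task record; the real schema is surely richer.
@dataclass
class Task:
    id: int
    spec: str
    attempts: int = 0
    result: str | None = None

async def plan(requirement: str) -> list[Task]:
    # Planner agent: decompose a requirement into independent tasks.
    return [Task(id=i, spec=s.strip()) for i, s in enumerate(requirement.split(";"))]

async def work(task: Task) -> Task:
    # Worker agent: attempt the task (stubbed).
    task.attempts += 1
    task.result = f"draft for {task.spec!r}"
    return task

async def judge(task: Task) -> bool:
    # Judge agent: accept or send back for rework (stubbed: accept the 2nd attempt).
    return task.attempts >= 2

async def run(requirement: str) -> list[Task]:
    tasks = await plan(requirement)
    pending = tasks
    while pending:
        # Dispatcher: fan all pending tasks out to workers concurrently.
        done = await asyncio.gather(*(work(t) for t in pending))
        # Judge's rejections flow back into the queue as rework.
        pending = [t for t in done if not await judge(t)]
    return tasks

if __name__ == "__main__":
    print(asyncio.run(run("add login; write tests")))
```

The key property of the shape: rejected work re-enters the same dispatch queue, so rework is just another fan-out round rather than a special case.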
The entire architecture is built on one principle: no state is terminal. Every failure is a blocked state with a reason, and every reason has a recovery path. If the same failure repeats, the system escalates to a different strategy instead of retrying the same thing.
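Here is a toy sketch of what "no state is terminal" looks like in code. The reason table and escalation ladder are invented for illustration; none of these strategy names come from openTiger:

```python
from collections import Counter
from enum import Enum, auto

class State(Enum):
    PENDING = auto()
    RUNNING = auto()
    BLOCKED = auto()   # never terminal: always carries a reason
    DONE = auto()

# Hypothetical reason -> recovery strategy table.
RECOVERY = {
    "test_failure": "rerun_with_error_context",
    "merge_conflict": "rebase_and_retry",
    "missing_dependency": "spawn_setup_task",
}

# Hypothetical escalation ladder, climbed when the same reason recurs.
ESCALATION = ["retry_same_strategy", "switch_strategy", "replan_task", "split_task"]

class TaskRecord:
    def __init__(self) -> None:
        self.state = State.PENDING
        self.seen: Counter[str] = Counter()

    def block(self, reason: str) -> str:
        """Record a failure as BLOCKED(reason) and pick the next recovery path.

        A repeated reason climbs the escalation ladder instead of
        retrying the same strategy forever.
        """
        self.state = State.BLOCKED
        self.seen[reason] += 1
        repeats = self.seen[reason] - 1
        if repeats == 0:
            return RECOVERY.get(reason, "retry_same_strategy")
        rung = min(repeats, len(ESCALATION) - 1)
        return ESCALATION[rung]

rec = TaskRecord()
print(rec.block("test_failure"))   # rerun_with_error_context
print(rec.block("test_failure"))   # switch_strategy (escalated)
print(rec.block("test_failure"))   # replan_task
```

Because BLOCKED always resolves to some next strategy, the worst case is climbing the ladder, never a hard stop.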
The interesting philosophical bit: optimizing for recovery turns out to be more effective than optimizing for first-attempt success. When you stop fearing failure, you can let agents be more aggressive.
Early stage, lots to improve. Feedback and contributions welcome.