SWE-Bench: 99.67% (299/300 problems)
HumanEval: 98.78% Pass@1 (162/164)
For context, most single-agent systems hit 30-50%. Best proprietary ones hover around 70-80%.
The difference is architecture. 37 specialized agent types across 6 swarms (engineering, ops, business, data, product, growth). Parallel 3-reviewer code review. Feedback loops that actually learn.
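Roughly, the review gate looks like this (a simplified sketch of the idea, not the actual implementation; in Loki Mode the reviewers are LLM-backed agents rather than plain functions):

    import concurrent.futures

    # Sketch of the parallel 3-reviewer gate: code, business logic, and
    # security reviewers each look at the patch independently, and it
    # only merges if all of them approve.
    def review_gate(patch, reviewers):
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
            verdicts = list(pool.map(lambda review: review(patch), reviewers))
        return all(v["approved"] for v in verdicts), verdicts

    # Stand-ins for the real reviewer agents.
    def code_reviewer(patch):     return {"approved": True, "notes": []}
    def logic_reviewer(patch):    return {"approved": True, "notes": []}
    def security_reviewer(patch): return {"approved": True, "notes": []}

    ok, verdicts = review_gate("diff --git ...", [code_reviewer, logic_reviewer, security_reviewer])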
To stress test it, I pointed it at a blank folder and said "build a ServiceNow replacement." It ran for 19 hours and built FireLater - complete ticket management, workflows, CMDB, knowledge base, self-service portal. I wrote zero lines of code.
New in this version:
- Kanban board to visualize agent actions in real time
- Perpetual improvement via self-healing feedback loops (sketched below)
- Smarter swarm coordination
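The feedback loop is the part that compounds: reviewer findings aren't thrown away, they're fed into the next attempt. A minimal sketch of the shape, where agent() and review() stand in for LLM-backed calls:

    # Self-healing loop sketch: each failed review adds notes that the
    # agent sees on its next attempt, so repeated mistakes get corrected.
    def solve_with_feedback(task, agent, review, max_rounds=5):
        feedback = []
        for _ in range(max_rounds):
            patch = agent(task, feedback)
            approved, notes = review(patch)
            if approved:
                return patch
            feedback.extend(notes)  # next attempt sees exactly what failed
        return None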
Still open source. MIT license. Still not selling anything.
Loki Mode: https://github.com/asklokesh/claudeskill-loki-mode
FireLater (built by Loki Mode): https://github.com/asklokesh/FireLater
Happy to answer questions about the architecture or benchmarks.
slogansand•1d ago
We used the RARV (Retrieve, Analyze, Reason, Validate) pattern with multi-agent collaboration. Each problem is worked on by specialized agents, reviewed by 3 parallel reviewers (code, business logic, security), and only merged after consensus.
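In rough pseudocode, the per-problem flow looks like this (simplified; each stage is a specialized agent, and the reviewers run in parallel in practice):

    # Illustrative shape of the RARV pipeline per problem; the callables
    # here stand in for specialized agents.
    def rarv(problem, retrieve, analyze, reason, validate, reviewers):
        context  = retrieve(problem)           # pull relevant code, tests, issue text
        analysis = analyze(problem, context)   # localize the fault
        patch    = reason(problem, analysis)   # draft a fix
        if not validate(patch):                # run tests / static checks first
            return None
        verdicts = [review(patch) for review in reviewers]  # code, logic, security
        return patch if all(verdicts) else None             # merge only on consensus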
The 99.67% isn't cherry-picked. Full run against the standard SWE-Bench dataset. Happy to share the methodology if anyone wants to reproduce it.
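For anyone starting a reproduction, the dataset is public on Hugging Face. A minimal loading sketch (the 300-problem count matches the size of the SWE-bench Lite test split, so that's what's shown; swap the name if you want the full set):

    from datasets import load_dataset

    # SWE-bench Lite test split: 300 instances. Use "princeton-nlp/SWE-bench"
    # for the full test set instead.
    problems = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

    for p in problems:
        # Each instance gives the repo, base commit, and issue text the agents
        # work from; the gold patch is only used for scoring.
        print(p["instance_id"], p["repo"], p["base_commit"])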