SWE-Bench: 99.67% (299/300 problems)
HumanEval: 98.78% Pass@1 (162/164)
For context, most single-agent systems hit 30-50%. Best proprietary ones hover around 70-80%.
The difference is architecture. 37 specialized agent types across 6 swarms (engineering, ops, business, data, product, growth). Parallel 3-reviewer code review. Feedback loops that actually learn.
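Roughly, the review gate looks like this (a simplified sketch of the idea, not the actual implementation; in Loki Mode the reviewers are LLM-backed agents rather than plain functions):

    import concurrent.futures

    # Sketch of the parallel 3-reviewer gate: code, business logic, and
    # security reviewers each look at the patch independently, and it
    # only merges if all of them approve.
    def review_gate(patch, reviewers):
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
            verdicts = list(pool.map(lambda review: review(patch), reviewers))
        return all(v["approved"] for v in verdicts), verdicts

    # Stand-ins for the real reviewer agents.
    def code_reviewer(patch):     return {"approved": True, "notes": []}
    def logic_reviewer(patch):    return {"approved": True, "notes": []}
    def security_reviewer(patch): return {"approved": True, "notes": []}

    ok, verdicts = review_gate("diff --git ...", [code_reviewer, logic_reviewer, security_reviewer])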
To stress test it, I pointed it at a blank folder and said "build a ServiceNow replacement." It ran for 19 hours and built FireLater - complete ticket management, workflows, CMDB, knowledge base, self-service portal. I wrote zero lines of code.
New in this version:
- Kanban board to visualize agent actions in real time
- Perpetual improvement via self-healing feedback loops (sketched below)
- Smarter swarm coordination
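The feedback loop is the part that compounds: reviewer findings aren't thrown away, they're fed into the next attempt. A minimal sketch of the shape, where agent() and review() stand in for LLM-backed calls:

    # Self-healing loop sketch: each failed review adds notes that the
    # agent sees on its next attempt, so repeated mistakes get corrected.
    def solve_with_feedback(task, agent, review, max_rounds=5):
        feedback = []
        for _ in range(max_rounds):
            patch = agent(task, feedback)
            approved, notes = review(patch)
            if approved:
                return patch
            feedback.extend(notes)  # next attempt sees exactly what failed
        return None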
Still open source. MIT license. Still not selling anything.
Loki Mode: https://github.com/asklokesh/claudeskill-loki-mode
FireLater (built by Loki Mode): https://github.com/asklokesh/FireLater
Happy to answer questions about the architecture or benchmarks.
slogansand•1d ago
We used the RARV (Retrieve, Analyze, Reason, Validate) pattern with multi-agent collaboration. Each problem is worked on by specialized agents, reviewed by 3 parallel reviewers (code, business logic, security), and only merged after consensus.
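In rough pseudocode, the per-problem flow looks like this (simplified; each stage is a specialized agent, and the reviewers run in parallel in practice):

    # Illustrative shape of the RARV pipeline per problem; the callables
    # here stand in for specialized agents.
    def rarv(problem, retrieve, analyze, reason, validate, reviewers):
        context  = retrieve(problem)           # pull relevant code, tests, issue text
        analysis = analyze(problem, context)   # localize the fault
        patch    = reason(problem, analysis)   # draft a fix
        if not validate(patch):                # run tests / static checks first
            return None
        verdicts = [review(patch) for review in reviewers]  # code, logic, security
        return patch if all(verdicts) else None             # merge only on consensus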
The 99.67% isn't cherry-picked. Full run against the standard SWE-Bench dataset. Happy to share the methodology if anyone wants to reproduce it.
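For anyone starting a reproduction, the dataset is public on Hugging Face. A minimal loading sketch (the 300-problem count matches the size of the SWE-bench Lite test split, so that's what's shown; swap the name if you want the full set):

    from datasets import load_dataset

    # SWE-bench Lite test split: 300 instances. Use "princeton-nlp/SWE-bench"
    # for the full test set instead.
    problems = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

    for p in problems:
        # Each instance gives the repo, base commit, and issue text the agents
        # work from; the gold patch is only used for scoring.
        print(p["instance_id"], p["repo"], p["base_commit"])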