The result: three open-weight models (20B, 8B, 12B) on consumer GPUs — 64GB total VRAM, ~$7K of hardware — score 84% on AIME 2025. The same models individually, or combined with naive majority voting, score ~54%. That's frontier-model performance on hardware you can buy at Micro Center.
How it works:
1. Each agent independently proposes a solution.
2. Every agent evaluates every other agent's work.
3. Scores aggregate via quadratic voting (the cost of influence grows quadratically, so no single model can dominate).
4. Repeat: agents see prior results, refine, and re-evaluate.
5. The system converges toward the highest-quality answer through adversarial cross-checking.
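To make the aggregation step concrete, here's a minimal sketch of quadratic score aggregation. It assumes a simple credits-to-votes model (spending c credits buys sqrt(c) votes); the function name, ballot structure, and numbers are illustrative, not the system's actual implementation:

```python
import math

def quadratic_aggregate(evaluations):
    """Aggregate per-voter credit allocations into vote totals.

    evaluations: {voter: {candidate: credits_spent}}
    Spending c credits yields sqrt(c) votes, so buying twice the
    influence costs four times as much.
    """
    totals = {}
    for voter, ballots in evaluations.items():
        for candidate, credits in ballots.items():
            totals[candidate] = totals.get(candidate, 0.0) + math.sqrt(credits)
    return totals

# A single dissenter cannot dominate: one agent dumping 100 credits on
# answer_2 buys only 10 votes, while two agents spending 36 each on
# answer_1 buy 12 votes between them.
evals = {
    "agent_a": {"answer_1": 36},
    "agent_b": {"answer_1": 36},
    "agent_c": {"answer_2": 100},
}
totals = quadratic_aggregate(evals)
```

The quadratic cost is what gives the mechanism its robustness: a model that is confidently wrong pays a steep price to outvote two models that are mildly right.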
It's provider-agnostic — mix Ollama, vLLM, OpenAI, Anthropic, or any OpenAI-compatible endpoint in the same deliberation. Everything streams over NATS JetStream with full persistence: every proposal, evaluation, score, and reasoning trace is logged and streamable via SSE.
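Since the deliberation trace streams as SSE, a consumer only needs a standard `data:`-line parser. This is a minimal sketch of that parsing step; the event field names (`type`, `agent`, `round`, `score`) are illustrative placeholders, not the project's actual schema:

```python
import json

def parse_sse(lines):
    """Yield the JSON payload of each `data:` line in an SSE stream."""
    for line in lines:
        line = line.strip()
        if line.startswith("data:"):
            yield json.loads(line[len("data:"):].strip())

# Example lines as a deliberation log might stream them
# (hypothetical event shapes for illustration):
raw = [
    'data: {"type": "proposal", "agent": "agent_20b", "round": 1}',
    '',
    'data: {"type": "evaluation", "agent": "agent_8b", "score": 0.8}',
]
events = list(parse_sse(raw))
```

In practice you'd feed `parse_sse` the line iterator of an HTTP response held open against the SSE endpoint; because every proposal, evaluation, and score is persisted in JetStream, the same events can also be replayed after the fact.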
Paper: arxiv.org/abs/2601.16863

Happy to answer questions about the architecture, the quadratic voting mechanism, benchmark methodology, or anything else.