In the past few weeks I (mostly Claude) cobbled together a Rust library + CLI to run the same prompt across multiple models through multiple rounds of iterative consensus.
Each model is fed the same initial prompt and produces an answer; then every model independently reviews and scores each of the other models' answers. The original prompt, the previous answers, and the reviews are then fed back to the models for the next round, until either one model "wins" two rounds in a row or a round limit is reached.
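The stopping rule is the interesting bit, so here's a minimal sketch of just the termination logic. Names are hypothetical, not the library's real API, and the whole answer → review → score step is abstracted into a closure:

```rust
// Sketch of the consensus termination rule described above: stop when
// the same model wins two rounds in a row, or when the round limit is
// hit. `round_winner` stands in for the full answer/review/score step.
fn run_consensus<F>(max_rounds: usize, mut round_winner: F) -> Option<String>
where
    F: FnMut(usize) -> String,
{
    let mut last: Option<String> = None;
    for round in 0..max_rounds {
        let winner = round_winner(round);
        if last.as_deref() == Some(winner.as_str()) {
            return Some(winner); // two consecutive wins: consensus
        }
        last = Some(winner);
    }
    None // limit reached without back-to-back wins
}

fn main() {
    // "b" wins rounds 1 and 2, so consensus is reached early.
    let winners = ["a", "b", "b", "c", "c"];
    assert_eq!(
        run_consensus(5, |r| winners[r].to_string()),
        Some("b".to_string())
    );
}
```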
It did quite well on the car wash test (https://github.com/Lightless-Labs/refinery?tab=readme-ov-fil...). Most models answer badly at first, but it only takes one good answer for all of them to quickly converge on better ones. Although, to my initial surprise, adding more models quickly breaks the current voting+threshold selection strategy.
I also recently added a synthesis mode, which does the same thing but adds a synthesis round at the end: each model produces a synthesis of all the answers that scored above the threshold in the last round, followed by one last review round.
The total number of calls quickly blows up with rounds and model count, but it's been fun!
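For anyone curious about the blow-up: assuming one call per generation and one call per (reviewer, answer) review pair, each round costs N + N(N−1) = N² calls, so R rounds is roughly R·N² (synthesis mode adds more on top). Quick sanity check:

```rust
// Per-round cost under the scheme above: N generations plus
// N * (N - 1) cross-reviews (every model reviews every other
// model's answer), i.e. N^2 calls per round.
fn calls_per_round(models: usize) -> usize {
    models + models * (models - 1)
}

fn total_calls(models: usize, rounds: usize) -> usize {
    rounds * calls_per_round(models)
}

fn main() {
    assert_eq!(total_calls(3, 5), 45);  // 3 models, 5 rounds
    assert_eq!(total_calls(6, 5), 180); // doubling models quadruples cost
}
```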
Currently, I'm racking my brain trying to figure out a way to select for both diversity and quality, for a "brainstorm" process. If you have any ideas either on that or other features, let me know!
ad-tech•1d ago
ElFitz•1d ago
> and quickly realized throwing 5 mediocre models at a problem just makes them argue in circles.
What was your selection strategy? My current issue is more that the more models I add, the less likely any specific one is to win two rounds in a row. Which would make perfect sense no matter the model quality, no? Unless there’s a huge gap.
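As a toy model of why (big assumption: each round's winner is an independent uniform draw over N equally matched models, which the real process isn't), the chance of any back-to-back repeat within R rounds is 1 − ((N−1)/N)^(R−1), which drops fast as N grows:

```rust
// Toy model: probability that *some* model wins two rounds in a row
// within `rounds` rounds, if each round's winner is an independent
// uniform pick among `models` equally strong models.
fn p_consensus(models: u32, rounds: u32) -> f64 {
    let n = models as f64;
    1.0 - ((n - 1.0) / n).powi(rounds as i32 - 1)
}

fn main() {
    // ~0.80 with 3 models over 5 rounds, ~0.41 with 8 models.
    assert!((p_consensus(3, 5) - (1.0 - 16.0 / 81.0)).abs() < 1e-12);
    assert!(p_consensus(8, 5) < 0.5);
}
```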
> For brainstorm mode maybe weight models by past accuracy instead of pure voting?
By adding outputs history and a way to track the actual outcomes?