In the past few weeks I (mostly Claude) cobbled together a Rust library + CLI to run the same prompt across multiple models through multiple rounds of iterative consensus.
Each model is fed the same initial prompt and produces an answer; then every model independently reviews and scores each of the other models' answers. The original prompt, the previous answers, and the reviews are then fed back to the models for the next round, until either one model "wins" two rounds in a row or a round limit is reached.
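The stopping rule is the interesting bit, so here's a minimal sketch of just the termination logic. Names are hypothetical, not the library's real API, and the whole answer → review → score step is abstracted into a closure:

```rust
// Sketch of the consensus termination rule described above: stop when
// the same model wins two rounds in a row, or when the round limit is
// hit. `round_winner` stands in for the full answer/review/score step.
fn run_consensus<F>(max_rounds: usize, mut round_winner: F) -> Option<String>
where
    F: FnMut(usize) -> String,
{
    let mut last: Option<String> = None;
    for round in 0..max_rounds {
        let winner = round_winner(round);
        if last.as_deref() == Some(winner.as_str()) {
            return Some(winner); // two consecutive wins: consensus
        }
        last = Some(winner);
    }
    None // limit reached without back-to-back wins
}

fn main() {
    // "b" wins rounds 1 and 2, so consensus is reached early.
    let winners = ["a", "b", "b", "c", "c"];
    assert_eq!(
        run_consensus(5, |r| winners[r].to_string()),
        Some("b".to_string())
    );
}
```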
It did quite well on the car wash test (https://github.com/Lightless-Labs/refinery?tab=readme-ov-fil...). Most models answer badly at first, but it only takes one good answer for all of them to quickly converge on better ones. Although, to my initial surprise, adding more models quickly breaks the current voting+threshold selection strategy.
I also recently added a synthesis mode, which does the same thing but adds a synthesis round at the end: each model produces a synthesis of all the answers that scored above the threshold in the last round, followed by one last review round.
The total number of calls quickly blows up with rounds and model count, but it's been fun!
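For anyone curious about the blow-up: assuming one call per generation and one call per (reviewer, answer) review pair, each round costs N + N(N−1) = N² calls, so R rounds is roughly R·N² (synthesis mode adds more on top). Quick sanity check:

```rust
// Per-round cost under the scheme above: N generations plus
// N * (N - 1) cross-reviews (every model reviews every other
// model's answer), i.e. N^2 calls per round.
fn calls_per_round(models: usize) -> usize {
    models + models * (models - 1)
}

fn total_calls(models: usize, rounds: usize) -> usize {
    rounds * calls_per_round(models)
}

fn main() {
    assert_eq!(total_calls(3, 5), 45);  // 3 models, 5 rounds
    assert_eq!(total_calls(6, 5), 180); // doubling models quadruples cost
}
```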
Currently, I'm racking my brain trying to figure out a way to select for both diversity and quality, for a "brainstorm" process. If you have any ideas either on that or other features, let me know!
ad-tech•1d ago
ElFitz•1d ago
> and quickly realized throwing 5 mediocre models at a problem just makes them argue in circles.
What was your selection strategy? My current issue is more that the more models I add, the less likely any specific one is to win two rounds in a row. Which would make perfect sense no matter the model quality, no? Unless there’s a huge gap.
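As a toy model of why (big assumption: each round's winner is an independent uniform draw over N equally matched models, which the real process isn't), the chance of any back-to-back repeat within R rounds is 1 − ((N−1)/N)^(R−1), which drops fast as N grows:

```rust
// Toy model: probability that *some* model wins two rounds in a row
// within `rounds` rounds, if each round's winner is an independent
// uniform pick among `models` equally strong models.
fn p_consensus(models: u32, rounds: u32) -> f64 {
    let n = models as f64;
    1.0 - ((n - 1.0) / n).powi(rounds as i32 - 1)
}

fn main() {
    // ~0.80 with 3 models over 5 rounds, ~0.41 with 8 models.
    assert!((p_consensus(3, 5) - (1.0 - 16.0 / 81.0)).abs() < 1e-12);
    assert!(p_consensus(8, 5) < 0.5);
}
```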
> For brainstorm mode maybe weight models by past accuracy instead of pure voting?
By adding outputs history and a way to track the actual outcomes?