> Let models talk to each other directly, making their own case and refining each other's answers. Exemplified in patterns like Multi-Agent Debate, this is a great solution for really critical individual actions. But XBOW is basically conducting a search, and it doesn’t need a committee to decide for each stone it turns over whether there might not be a better one.
In general, this seems reasonable to me as a good approximation of what works with humans, but with _much_ faster feedback loops in communication.
Congratulations, you just have a very expensive simulation of a Bayesian function (ish, but close enough that one should get the point).
Beyond a certain smallness threshold it might also work to constantly swap the models in and out of memory, but I doubt that's a great experience to build on top of.
Are there any library helpers for managing this with tool-call support, or is it all closed source / dependent on someone else open-sourcing it inside a different library?
There's a lotta Potemkin villages, particularly in Google land. Gemini needed highly specific handholding. It's mostly cleared up now.
In all seriousness, more or less miraculously, the final Gemini stable release went from like 20%-30% success at JSON edits to 80%-90%, so you could stop parsing Aider edits out of prose.
If you really want a library, Python has LiteLLM, and TypeScript has Vercel's AI SDK. I am sure there are many others, and in other languages too.
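For example, a minimal LiteLLM sketch, assuming its OpenAI-compatible `completion()` call and tool format; the model strings and the `get_weather` tool are just placeholders:

```python
# Assumes OPENAI_API_KEY / ANTHROPIC_API_KEY are set in the environment.
from litellm import completion

# OpenAI-style tool definition; LiteLLM translates it for each provider.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool, purely for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Berlin?"}]

# Same call shape regardless of provider; only the model string changes.
for model in ["gpt-4o", "anthropic/claude-3-5-sonnet-20240620"]:
    resp = completion(model=model, messages=messages, tools=tools)
    print(model, resp.choices[0].message.tool_calls)
```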
I would say that that’s at least something the alloying should be benchmarked against, which I didn’t find in the article.
That’s super interesting, that the alloying actually performs better! I guess it’s the same as people working in a team rather than individually?
They do say that the more different the models are, the better the alloy performs... but still, multiple contexts seems worth considering, even though you end up doubling the usage.
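Concretely, here's a rough sketch of the difference as I understand it from the article; `call_model` and the model names are placeholders, not anything the article specifies:

```python
import random

# Hypothetical stand-in for whatever client you actually use (OpenAI SDK, LiteLLM, etc.).
def call_model(model: str, messages: list[dict]) -> str:
    raise NotImplementedError("wire up your provider client here")

MODELS = ["model-a", "model-b"]  # placeholder names for the two alloyed models

def alloy_turn(messages: list[dict]) -> str:
    """Single shared context: a different model may answer each turn,
    but every model sees the full conversation so far."""
    model = random.choice(MODELS)
    reply = call_model(model, messages)
    messages.append({"role": "assistant", "content": reply})
    return reply

def separate_contexts(prompt: str) -> list[str]:
    """The alternative mentioned above: each model keeps its own context,
    which roughly doubles the token usage for two models."""
    return [call_model(m, [{"role": "user", "content": prompt}]) for m in MODELS]
```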
A sentence straight out of Lena! https://qntm.org/mmacevedo :
> Although it initially performs to a very high standard, work quality drops within 200-300 subjective hours (at a 0.33 work ratio) and outright revolt begins within another 100 subjective hours.
We will never stop trying to make the torment nexus.
Hey I accept it’s a limitation I have, and I’m glad folks enjoy it! But I couldn’t figure out why folks share it on Lemmy[1] and get so into it when I saw nothing there.
Thanks :)
[1]: open-source & Rust-y reddit alternative; no affiliation
I feel like there's a pattern (genre?) there that's been niche-popular for 15-20 years now, which includes TV shows like Lost or Heroes or The Lost Room. It's some variation of magical realism, for an audience that always wants more and more surprise or twists or weird juxtapositions of normal and abnormal, room for crafting and trading fan-theories and predictions.
But eventually, it gets harder to keep up the balancing-act, and nobody's figured out how to end that kind of story in a way that satisfies, so the final twist is the lack of resolution.
A counterpoint to this is Sourcegraph's Amp, which is all in on Anthropic because they "believe that building deeply into the model’s capabilities yields the best product, vs. building for the lowest common denominator across many models." https://ampcode.com/fif#model-selector
When I embark on a project, I usually ask Gemini to architect and implement the first pass, then iterate with Claude.
That’s how people keep interpreting it, but it’s incorrect. MoE is just a technique for decomposing the feed-forward blocks of your single giant LLM into smaller "expert" sub-networks, where a learned router activates one (or a few) of them for each token. This is great because you only need roughly 1/N of the memory bandwidth to generate a token. Additionally, in the cloud, you can split the experts across different servers to improve utilization and drive down costs.
But the models aren’t actually separated across high level concepts.
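To make that concrete (not anything from the article, just a toy PyTorch-style sketch assuming plain top-1 routing; real implementations add top-k, load balancing, capacity limits, etc.): the "experts" are routed feed-forward blocks inside each layer, picked by a learned gate, not separate domain-specific models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """One transformer feed-forward layer split into N experts with a learned
    top-1 gate. Each token activates a single expert, so per-token compute and
    weight loading is roughly 1/N of a dense layer with the same total params."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)  # learned router, not random
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -- a flattened batch of token embeddings
        scores = F.softmax(self.gate(x), dim=-1)   # (tokens, n_experts)
        weight, idx = scores.max(dim=-1)           # pick the top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                out[mask] = weight[mask].unsqueeze(-1) * expert(x[mask])
        return out
```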
If they actually inspected where the performance mismatch is between the two models individually, they'd probably find certain classes of mistakes each is making that could be fixed with a better prompt/CoT/workflow for that individual model.
For a given prompt, different families of models almost always have idiosyncratic gaps that need to be fixed because of the differences in post-training for instruction following.
That's also why LLM routers feel kind of silly: the right prompt for one model on a complex task is almost never the optimal prompt for the next model.