> Let models talk to each other directly, making their own case and refining each others’ answers. Exemplified in patterns like Multi-Agent Debate, this is a great solution for really critical individual actions. But XBOW is basically conducting a search, and it doesn’t need a committee to decide for each stone it turns over whether there might not be a better one.
In general, this seems reasonable to me as a good approximation of what works with humans, but with _much_ faster feedback loops in communication.
Congratulations, you just have a very expensive simulation of a Bayesian function (ish, but close enough that one should get the point).
Below a certain size threshold it might also work to constantly swap the models in and out of memory, but I doubt that's a great experience to build on top of.
If your product is "good enough" with the current generation of models, you could cut OpenAI/Anthropic/Google out of the loop entirely by using open source & low-cost models.
Say that you want to translate a string from English to language X. Models A and B, having fewer parameters to spare, have less knowledge of language X. Model C, a larger model, has better knowledge of language X. No matter how A and B collude, they will not exceed the performance of model C.
The spec requirements aren't exactly the same, though. With an alloy you need double the disk space (not a huge deal on desktop, but it matters on mobile) and significantly higher latency (since you have to swap the models in and out between every turn), and you can only apply it to multi-turn conversations or sufficiently decomposable problems.
Are there any library helpers for managing this with tool-call support, or is it all closed source / dependent on someone else to open-source it inside a different library?
There's a lotta Potemkin villages, particularly in Google land. Gemini needed highly specific handholding. It's mostly cleared up now.
In all seriousness, more or less miraculously, the final Gemini stable release went from something like 20-30% success at JSON edits to 80-90%, so you could stop parsing the Aider-style edits out of prose.
If you really want a library, Python has litellm, and TypeScript has Vercel's AI SDK. I am sure there are many others, and in other languages too.
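For instance, a minimal sketch of the alternating-models idea on top of litellm (the model names here are placeholders; swap in whichever providers you actually use):

```python
# Sketch: two models take turns answering within a single shared conversation.
from litellm import completion

MODELS = ["gpt-4o", "anthropic/claude-3-5-sonnet-20240620"]  # placeholder model names

messages = [{"role": "user", "content": "Review this function for bugs: ..."}]

for turn in range(4):
    model = MODELS[turn % len(MODELS)]              # alternate models over one context
    resp = completion(model=model, messages=messages)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": "Anything you'd add or correct?"})
```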
LM Studio has an API, so it should be possible to hook into that with relatively little code.
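Something along these lines, assuming LM Studio's OpenAI-compatible local server is running on its default port (the model name is just a placeholder for whatever you have loaded):

```python
# Sketch: point the standard OpenAI client at LM Studio's local server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is ignored locally

resp = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier of the model loaded in LM Studio
    messages=[{"role": "user", "content": "Summarize this diff: ..."}],
)
print(resp.choices[0].message.content)
```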
I would say that that’s at least something the alloying should be benchmarked against, which I didn’t find in the article.
That’s super interesting, that the alloying actually performs better! I guess it’s the same as people working in a team rather than individually?
They do say that the more different the models are, the better the alloy performs... but still, multiple contexts seem worth considering, even though you end up doubling the usage.
A sentence straight out of Lena! https://qntm.org/mmacevedo :
> Although it initially performs to a very high standard, work quality drops within 200-300 subjective hours (at a 0.33 work ratio) and outright revolt begins within another 100 subjective hours.
We will never stop trying to make the torment nexus.
Increasing task length doesn't build in an equivalent of human capital; it just pushes out the point at which the models degrade. This approach isn't scalable in general, because there's always going to be a task longer than the SOTA capabilities can handle.
We really need to work on a low-cost human-capital equivalent for models.
I don't babysit them for long periods in one session. I allow them to one-shot the initial solution. Then, I thoroughly review the output, make notes of what should be improved, then either feed that into the original prompt and one-shot it again, or ask the agent to update the solution based on the notes.
For a problem that's rooted in theory, the rate at which you can find a solution very often won't scale with resource investment. The problem will have unknown prerequisites in the form of yet-undiscovered theoretical advancements in other areas of research. Until you identify and solve those other problems, you very often won't be able to arrive at a satisfactory answer to the one you're interested in.
So in many cases the only viable route to solving a particular problem faster is to scale the amount of research that's done in general, since science as a whole is embarrassingly parallel.
We have very strong reasons to believe this is not possible, no matter how many resources you spend on the problem. In fact, the whole of modern cryptography kind of relies on the assumption that the problem is unsolvable.
I agree with your general point, but I don't think it applies to the worm problem. We know hundreds of millions were not spent on that problem.
The worm problem is similar. If our current theories were "good enough" we would be able to simulate the worm. I see no reason to believe (and many to doubt) that throwing more money at the problem would solve it much faster. For that to be true, we would, among other things, need to be able to articulate precisely where the current shortfalls are to begin with.
A counterpoint to this is Sourcegraph's Amp, which is all in on Anthropic because they "believe that building deeply into the model’s capabilities yields the best product, vs. building for the lowest common denominator across many models." https://ampcode.com/fif#model-selector
When I embark on a project, I usually ask Gemini to architect and implement the first pass, then iterate with Claude.
In this way, it's not really a "lowest common denominator," as you get to pick the highest-performing combination (with solo models just being a special case).
I suspect this model-alloy tactic always works; it just only seems impressive when it's done with the top models and achieves otherwise unattainable quality.
One paper among many such recent (and nuanced) wisdom-of-crowds resources:
Cultural diversity and wisdom of crowds are mutually beneficial and evolutionarily stable https://www.nature.com/articles/s41598-021-95914-7
That’s how people keep interpreting it but it’s incorrect. MoE is just a technique to decompose your single giant LLM into smaller models where a random one gets activated for each token. This is great because you need 1/N memory bandwidth to generate a token. Additionally, in the cloud, you split the model parts to different servers to improve utilization and drive down costs.
But the models aren’t actually separated across high level concepts.
It's very interesting to see it deployed in a commercial setting though.
The trouble has been the time spent waiting, particularly for the o3 research. That could be solved by using hooks to automatically kick off review or research on the side.
This is a very interesting finding about how to improve capability.
I don't see reliability expressly addressed here, but my assumption is that these alloys will be less rather than more reliable - stronger, but more brittle, to extend the alloy metaphor.
Unfortunately for many if not most B2B use cases this reliability is the primary constraint! Would love to see similar ideas in the reliability space.
In practice, high variance translates on the downside into failures at basic things that a minimally competent human would basically never get wrong. In agents it's exacerbated by the compounding impact of repeated calls, but even for basic workflows it can be annoying.
That being said, I think variance implicitly improves in this context, because this is the same as the poll averaging Nate Silver does: as long as the models are truly independent, averaging improves the result across the board (i.e. both the mean and the variance). However, if the models start converging on the same datasets and techniques, this will degrade, just as polling does with pollster herding and the other problems that industry creates for itself.
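A quick numeric illustration of that point (purely a toy, treating the "models" as Gaussian estimators with a chosen correlation):

```python
# Averaging two independent estimators roughly halves variance; correlation erodes the benefit.
import numpy as np

rng = np.random.default_rng(0)
n, rho = 100_000, 0.8                                   # samples, correlation between "models"

indep = rng.normal(size=(n, 2))
shared = rng.normal(size=(n, 1))
corr = np.sqrt(rho) * shared + np.sqrt(1 - rho) * rng.normal(size=(n, 2))

print("single model variance:       ", indep[:, 0].var())         # ~1.0
print("average of independent pair: ", indep.mean(axis=1).var())   # ~0.5
print("average of correlated pair:  ", corr.mean(axis=1).var())    # ~(1 + rho) / 2 = 0.9
```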
> ...whichever two (and sometimes three) models we combined, the alloy outperformed the individual models.
and
> ...A model lagging very far behind others can even pull an alloy down.
Longish article for what is nothing but an ensemble of models. Giving it a name like "alloy" does not make it novel.
If they actually inspected where the performance mismatch between the two models lies, they'd probably find certain classes of mistakes each is making that could be fixed with a better prompt/CoT/workflow for the individual model.
For a given prompt, different families of models almost always have idiosyncratic gaps that need to be fixed because of the differences in post-training for instruction following.
That's also why LLM routers feel kind of silly: the right prompt for one model on a complex task is almost never the optimal prompt for the next model.