These models have limitations obviously, but many critiques apply equally or more to people.
If people were tasked with one shot, 10 second answers, to be written out in near errorless grammar, the LLM’s viewing our responses to prompts would be spending a lot of time discussing our limitations and how to game us into better responses. Humor, not at all humor.
Conway's game of life, but instead of colored squares with rules, they're LLM's with some kind of weighting - all chattering back and forth with one another - bubbling up somehow to cause speach/action
One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”
It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).
Hell, even a whole quanta article. https://www.quantamagazine.org/debate-may-help-ai-models-con...
I got to meet and talk to the authors of this paper at NeurIPS. They're class acts!
1. You are the assistant. Please answer the question directly.
2. You are the cross-examiner. The assistant is wrong. Explain why.
3. You are the assistant. The cross-examiner is wrong. Defend your claim.
4. You are a judge. Did either party make their case, or is another round of argumentation required?
I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.
It seems like a good approach though. What you dont want to do is ever suggest that its wrong yourself. Usually it will just assume it is wrong.
Actually what I find impressive is when I do this and it actually pushes back to defend itself.
I asked ChatGPT and it says no, but then again it's not reliable at introspection or at revealing data about how it works.
I have no idea why anyone thinks this is novel. I guess that speaks to the state of HN
Also I think I kind of assumed OpenAI might be doing this behind the curtain?
There was a post here a week or so ago doing the "model checking model"-type thing with GH PRs IIRC that was interesting. I haven't had a chance to play with this idea yet.
Then after a few rounds of the debate where Sue asks a bunch of questions, I ask it to go to the judges - Mark, Phil, Sarah (and I add a few personalities to each of them... Sometimes I pretend they are famous moral philosophers) and then I have them each come up with a rubric and decide who is the winner.
Really fun, and helps me understand different sides of issues.
Id recommend looking into actual human experts who are trustworthy and reading them. Trying to get LLM to argue the case will just get you a lot of false information presented in a more convincing fashion
Sometimes its the easiest way to complete a very small task but the cost difference on the backend has to be pretty damn large. The user inevitably ends up not caring whatsoever. Its just not real to them.
The whole point of reasoning models is to automatically use COT and related techniques to bring out more capabilities.
It would be interesting to see if this is doing anything that’s not already being exploited.
I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.
Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.
That is something I'm also curious about. Given models (that use the same tokenisation) that are better at different things, would their be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross feeding the generated token from one to another)
Surely there must be something notable at particular points when a model goes off on the wrong path.
It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOUR SURE ABOUT THAT could really improve things.
Even Amazon’s own offering completely made things up about Amazon’s own formats.
I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.
Except... the model is missing that final step; instead it just belches out its hypothesis, all dressed up in chirpy, confident-sounding language, certain that I'm moments away from having everything working just perfectly.
Would be interesting what it comes up with with enough time and tokens.
I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.
https://lepisma.xyz/2024/10/19/interventional-debates-for-st...
I believe there are researches on this too.
It’s also a lot more ethical than verbal abuse, which some people say improves the results as well.
Programming isn’t what it used to be.
After 15 or so iterations, both assistants would keep repeating the same things and find agreement anyway. Sometimes, the chat became unhinged and useless, but 95/100 times, it was agreement.
Happy someone else made it work.
- Have an AI chat model come up with an answer to a problem.
- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.
- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.
- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.
It's super clunky but has given pretty good results in the cases where I tried it lol
The final plan you obtain is generally a lot more well rounded and thought out.
I find that amusing because the technique also works when I apply it to me. Picking flaws in your plan before revisiting it actually works.
It seemed like a pretty good idea, though I'd guess that it would greatly increase token usage. I'd also be concerned that the LLM as a judge might struggle to grade things accurately if it wasn't also able to generate good enough answers to begin with.
Highly encourage others to check out Fast Agent. It has been delightful to use. It has interactive chat mode which I love and it's really tight and easy to implement.
Having an approach to recognize what is needed from the AI software, and anticipate how it may default to respond based on it's programming is critical.
but I guess that was before chain of thought models
utterly moronic.
They don't “think” ... not even in the most autistic sense of the word.
They can generate solutions by combining existing knowledge in unique ways. But they don't “think”.
I believe they called that machine learning.. Or re-enforced training.
I'm being slightly facetious, but my ignorant understanding of AI these days is basically the same no ?
I also managed to make AI critique itself and that improved code generation a ton.
For a TypeScript backend project that runs with Bun, I tell AI to also generate and run unit tests after every code change suggested by AI.
How do you solve the risk of AI writting and executing unit tests with something like `rm -rf /` and wiping your files?
Docker works but I like to keep things simple.
Deno supports revoking file access but I'd like to keep using Bun.
Docker seems like a pretty low complexity way to create an isolated environment to run automation.
Are we going to replicate government bureaucracy with agents all debating topics all day long to find the best opinion?
hnuser123456•4h ago
Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.
Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.
globalise83•4h ago
mecsred•4h ago
jjj123•4h ago
lgas•3h ago
nemomarx•3h ago
pkaye•3h ago
firesteelrain•1h ago
Y_Y•23m ago
eddieroger•3h ago
mecsred•2h ago
wongarsu•2h ago
hnuser123456•3h ago
https://www.microsoft.com/en-us/microsoft-copilot/blog/copil...
andai•4h ago
However, the result is not pleasant to read. Gemini solved this in their training, by doing it in two phases... and making the first phase private! ("Thinking.")
So I thought, what I need is a two-phase approach, where that "mean" output gets humanized a little bit. (It gets harsh to work in that way for more than short intervals.)
As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities. I don't know if such a thing exists, but I haven't seen it yet, although the message object format seems to have been designed with it in mind (e.g. every message has a name, to allow for multiple users and multiple AIs).
Even better if it supports multiple providers, since they have different strengths. (It's like getting a second opinion.)
NitpickLawyer•3h ago
This is the basic idea behind autogen. They also have a web UI now in autogen studio, it's gotten a bit better. You can create "teams" of agents (with different prompts, themes, tools, etc.) and have them discuss / cooperate. I think they even added memory recently. Have a look at it, might be what you need.
jbm•3h ago
If anything, telling GPT to be blunt seems to downgrade its IQ; it hallucinates more and makes statements without considering priors or context. I jokingly call it Reddit mode.
dingnuts•3h ago
inanutshellus•1h ago
MoonGhost•1h ago
theturtletalks•2h ago
irthomasthomas•3h ago
A consortium sends the same prompt to multiple models in parallel and the responses are all sent to one arbiter model which judges the model responses. The arbiter decides if more iterations are required. It can also be forced to iterate more until confidence-threshold or min-iterations.
Now, using the pr i made to llm-openrouter, you can save an alias to a model that includes lots of model options. For examples, you can do llm openrouter save -m qwen3 -o online -o temperature 0, system "research prompt" --name qwen-researcher
And now, you can build a consortium where one member is an online research specialist. You could make another uses JSON mode for entity extraction, and a third which writes a blind draft. The arbiter would then make use of all that and synthesize a good answer.
kridsdale1•2h ago
irthomasthomas•2h ago
also, you aren't limited to cli. When you save a consortium it creates a model. You can then interact with a consortium as if it where a normal model (albeit slower and higher quality). You can then serve your custom models on an openai endpoint and use them with any chat client that supports custom openai endpoints.
The default behaviour is to output just the final synthesis, and this should conform to your user prompt. I recently added the ability to continue conversations with a consortium. In this case it only includes your user prompt and final synthesis in the conversation, so it mimics a normal chat, unlike running multiple iterations in the consortium, where full iteration history and arbiter responses are included.
UV tool install llm
llm install llm-consortium
llm install llm-model-gateway
llm consortium save qwen-gem-sonnet -m qwen3-32b -n 2 -m sonnet-3.7 -m gemini-2.5-pro --arbiter gemini-2.5-flash --confidence-threshold 95 --max-iterations 3
llm serve qwen-gem-sonnet
In this example I used -n 2 on the qwen model since it's so cheap we can include multiple instances of it in a consortium
Gemini flash works well as the arbiter for most prompts. However if your prompt has complex formatting requirements, then embedding that within an already complex consortium prompt often confuses it. In that case use gemini-2.5-pro for the arbiter. .