"We are not interested in the fact that the brain has the consistency of cold porridge." — Alan Turing
By simply asking the LLM the same question in two separate contexts but from opposing perspectives, then in a third context asking it to analyze both responses and choose the most neutral and objective take, you wipe out any "(dis)agreeableness" bias and get closer to a deeper, more nuanced synthesis of the topic (a rough sketch of this pattern is below). This paper just takes that idea to the next level.
This isn't really possible with RLHF alone unless you train the LLM to often give two opposing perspectives, which would get tiring.
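For concreteness, a minimal sketch of that two-contexts-plus-judge pattern, assuming the OpenAI Python client (the model name, prompts, and `ask` helper are all illustrative, not from the paper):

```python
# Ask the same question from opposing perspectives in two separate contexts,
# then have a third, fresh context synthesize the most neutral take.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder model

def ask(system: str, user: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return resp.choices[0].message.content

question = "Should cities prioritize bike lanes over car lanes?"

# Contexts 1 and 2: same question, opposing framings, no shared history.
pro = ask("Argue the strongest case in favor.", question)
con = ask("Argue the strongest case against.", question)

# Context 3: a judge that never saw the framings, only the two takes.
synthesis = ask(
    "You are a neutral analyst. Given two takes on the same question, "
    "produce the most objective, nuanced synthesis you can.",
    f"Question: {question}\n\nTake A:\n{pro}\n\nTake B:\n{con}",
)
print(synthesis)
```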
One more reason to be wary of pushing for better capabilities.
1: https://arxiviq.substack.com/p/coming-soon
> ArXivIQ exists to turn that fire-hose into jet-streams of insight: every paper is hand-picked by a human editor, then pushed through a purpose-built multi-agent AI pipeline that dissects methods, experiments, and limitations in minutes instead of hours.
Three motivating points:
- GEPA / evolutionary agents perform a zeroth-order (gradient-free) optimization in a combinatorial space. Their loss curves are VERY noisy and stochastic. If we run such agents multiple times, the performance variance is extremely high -- in some cases it cancels out the gains from a single experiment (see the toy sketch after this list). However, obtaining error bounds is hard because the API costs are pretty restrictive.
- The problem we face with test-time scaling is not that prompt engineering is ineffective, or less effective than fine-tuning. It is that fine-tuning _reliably_ increases a model's performance on any subset of tasks, and the scaling curves for performance per additional data token are well understood.
- Test-time optimization techniques work well on in-distribution problems (e.g. generate and debug this Python code) but fail pretty badly on even slightly out-of-distribution problems (e.g. generate and debug this Julia code). Compare this to gradient search -- it would've been so fascinating and confusing if SGD failed to optimize a CNN image classifier on COCO but worked very well on ImageNet.
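To make the first point concrete, here's a toy illustration (not GEPA itself): a gradient-free mutation search against a noisy evaluator, repeated across seeds, where the run-to-run spread can be as large as the improvement any single run reports. The evaluation function is a stand-in for a small, expensive, LLM-judged benchmark.

```python
# Toy zeroth-order (gradient-free) search with a noisy objective, repeated
# across seeds to show how wide the variance of the "best score" can be.
import random
import statistics

def noisy_eval(candidate, rng):
    # True quality is a simple function of the candidate; the observed score
    # adds noise, like a small eval set scored by an LLM judge.
    true_quality = sum(candidate) / len(candidate)
    return true_quality + rng.gauss(0, 0.15)

def zeroth_order_search(rng, budget=60, dim=8):
    best = [rng.random() for _ in range(dim)]         # initial "prompt" encoding
    best_score = noisy_eval(best, rng)
    for _ in range(budget):
        cand = [x + rng.gauss(0, 0.2) for x in best]  # random mutation, no gradient
        score = noisy_eval(cand, rng)
        if score > best_score:                        # greedy accept on a noisy score
            best, best_score = cand, score
    return best_score

runs = [zeroth_order_search(random.Random(seed)) for seed in range(10)]
print(f"best score across seeds: mean={statistics.mean(runs):.3f}, "
      f"std={statistics.stdev(runs):.3f}")
```

With real API-priced evaluations, each of those seeds is a full optimization run, which is exactly why error bars rarely get reported.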
How do people feel about this? Does this line up with your viewpoints?
So, as with any less informed user reviewing LLM output, what you say definitely sounds plausible and correct.
- Raw accuracy is now a "vanity" metric, so the benchmarks need to get more sophisticated, and I think they're going to have to be far more task-specific than HotpotQA or HoVer; those have become the MNIST of multi-hop.
- In my use of MIPROv2 and SIMBA, I see a fair amount of improvement on multi-hop tasks (I've published some of these results on HN before). I'm going to try GEPA and see how it performs. So I think we're at the start of what I would call "meta-learning": tuning across a huge search surface rather than tweaking one prompt -- hyperparameter search for higher-dimensional spaces. (A rough sketch of this workflow follows after the list.)
- Tokens burned should be a reported result.
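For anyone who hasn't tried it, a rough sketch of that workflow in DSPy with MIPROv2. The model name, dataset, and metric are placeholders, and exact constructor arguments may differ across DSPy versions:

```python
import dspy
from dspy.teleprompt import MIPROv2

# Placeholder model; configure whatever LM you actually use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A minimal program whose instructions/demos the optimizer will tune.
program = dspy.ChainOfThought("question -> answer")

def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Illustrative labeled data; in practice you want dozens to hundreds of examples.
trainset = [
    dspy.Example(question="Who wrote 'On Computable Numbers'?",
                 answer="Alan Turing").with_inputs("question"),
]

optimizer = MIPROv2(metric=exact_match, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)

# Per the last point: report tokens/cost alongside accuracy. The configured LM
# keeps a history of calls that can be tallied, though the exact fields vary by
# version, so treat that accounting as a bookkeeping step you add yourself.
```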
ACCount36•22h ago
"Self-distillation" refers to distilling from a model into a copy of itself. Which is of limited use - unless you can steer the teacher, and want the student to internalize that steering.
The reason for doing self-distillation here is that we have both access to a richer representation (logit stream), and want to capture a richer behavior - not the answers themselves, but better reasoning techniques that are downstream from better prompts.
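A minimal sketch of what logit-level self-distillation can look like, assuming a HuggingFace causal LM: the teacher is the same frozen model conditioned on an optimized prompt prefix, and the student (no prefix) is trained to match the teacher's distributions over the answer tokens. Model name, prefix, and example are placeholders, tokenizer boundary effects are ignored, and this is not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # placeholder; any causal LM
tok = AutoTokenizer.from_pretrained(model_name)
student = AutoModelForCausalLM.from_pretrained(model_name)
teacher = AutoModelForCausalLM.from_pretrained(model_name)  # frozen copy of the same model
teacher.eval()

optimized_prefix = "Reason step by step, then answer concisely.\n"  # assumed optimized prompt
question = "Q: What is 17 * 24?\nA:"
answer = " 408"

# Teacher sees the optimized prefix; student sees only the bare task.
t_ids = tok(optimized_prefix + question + answer, return_tensors="pt").input_ids
s_ids = tok(question + answer, return_tensors="pt").input_ids
n_ans = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids.shape[1]

with torch.no_grad():
    # Logits at the positions that predict the answer tokens.
    t_logits = teacher(t_ids).logits[:, -n_ans - 1:-1]
s_logits = student(s_ids).logits[:, -n_ans - 1:-1]

# Standard distillation loss on the answer positions: match the student's
# distribution to the teacher's, with temperature T.
T = 2.0
loss = F.kl_div(
    F.log_softmax(s_logits / T, dim=-1),
    F.softmax(t_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)
loss.backward()  # one step; in practice, wrap in an optimizer loop over a dataset
```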
imtringued•1h ago