Well, the folks on this website think installing vLLM (pip install vllm...) is hard and that ollama, a far slower and shittier inference engine, is better. Enormous damage has been done to the hobbyist LLM ecosystem by folks not knowing which tools work on which platform.
The one exception is Mac users, for whom llama.cpp is still probably the best implementation, but if you have an NVIDIA GPU and you're not using SGLang or vLLM, you're doing it wrong.
But this is of ENORMOUS use for folks who want to run tiny models at home. Go to bed, wake up to a K=512 majority-vote answer to your problem.
The authors seem to be Chinese and may not be that confident in their English. I suspect we'll be seeing a lot more of this kind of thing as time goes on.
Also, some of the very worst English I've ever read has been technical prose written by born-and-bred native English speakers with very high educational credentials.
Clear communication is important. The best idea on Earth is worthless if it can't be articulated well.
No, it was fully or almost fully LLM generated. See: https://arxiviq.substack.com/p/coming-soon
If otherwise, then it looks like The Singularity has arrived.
It’s a perfectly valid article: an AI-generated summary of a lot of work done by humans.
Not a paper that would be presented for peer review, but rather something to be consumed by a regular mensch (like me).
That’s actually something AI is pretty good at. I use it to summarize stuff for me all the time.
It should probably have a disclaimer somewhere saying what it is, maybe with a link to the raw source, but it’s just another way of communicating.
I’ve been reading human-generated marketing drivel for decades. This is actually a lot better than that stuff.
Careful where you place your anger. You should not be angry at the people writing the paper.
No I think the confusing thing is that the LLM-written blog post doesn't adequately explain the screenshot.
> "Specifically, DeepConf-low uses top η= 10% (corresponding to the 90th percentile) and DeepConf-high uses top η = 90% (corresponding to the 10th percentile) uniformly across all settings. This threshold ensures that during online generation, traces are terminated when their confidence falls below the level that retains the top η% highest-confidence traces from the warmup phase."
I'm not sure I'm parsing it right, but are they using "low" and "high" as descriptors of the number used as the percentage? That is, the "low" 10% keeps only the best 10% of traces and cuts everything else, while the "high" 90% keeps the best 90%, i.e. "high" is less selective than "low"?
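If I'm reading it the same way, here's a minimal sketch of that thresholding (my own code with made-up warmup confidences, not the paper's implementation):

    import numpy as np

    # Confidence scores from the warmup traces (placeholder values).
    warmup_confidences = np.array([0.31, 0.55, 0.62, 0.70, 0.74, 0.81, 0.88, 0.93])

    # DeepConf-low: keep only the top 10% most confident traces,
    # i.e. stop a trace once its confidence falls below the 90th percentile.
    threshold_low = np.percentile(warmup_confidences, 90)

    # DeepConf-high: keep the top 90%, so the threshold is only the 10th percentile.
    threshold_high = np.percentile(warmup_confidences, 10)

    # So "low" (eta = 10%) is the aggressive filter, "high" (eta = 90%) the permissive one.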
I also do manual reviews (https://gonzoml.substack.com/), but there are many more papers I don't have time to review. So I built a multi-agent system to help me, and I'm constantly iterating to improve it. And I like the result. It was also validated by the paper authors a couple of times; they agreed the reviews were correct. So if you see something that's definitely wrong, please let me know.
As for myself, I've become at least 10x more productive at reading papers and understanding what's happening. I hope it will also help some of you.
The previous self-consistency approach and this confidence-pruning approach aren't really novel, but it's nice to see the numbers run. Fundamentally these approaches are about handling contradicting results, not resolving the contradictions or increasing the quality of the reasoning. What if the rare idea is the right answer? You can squeeze the training juice harder, but if you still get the wrong answer when it really, really matters, you're just left with a stress toy in your hand.
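To make "handling contradicting results" concrete: the mechanism is basically voting over sampled answers. A toy sketch of my own (plain majority vote plus a confidence-weighted variant; the numbers are invented):

    from collections import Counter, defaultdict

    def majority_vote(answers):
        # Plain self-consistency: the most frequent final answer wins.
        return Counter(answers).most_common(1)[0][0]

    def weighted_vote(answers, confidences):
        # Confidence-weighted voting: each trace contributes its confidence score.
        scores = defaultdict(float)
        for ans, conf in zip(answers, confidences):
            scores[ans] += conf
        return max(scores, key=scores.get)

    answers     = ["41", "41", "42", "41"]
    confidences = [0.55, 0.60, 0.95, 0.50]
    print(majority_vote(answers))               # "41"
    print(weighted_vote(answers, confidences))  # still "41": 0.95 < 0.55 + 0.60 + 0.50

Which is exactly the failure mode above: if the rare "42" trace happened to be the correct one, neither vote rescues it.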
And we're supposed to find said code at https://jiaweizzhao.github.io/deepconf at some point?
Reducing the cost of reasoning is a huge ongoing challenge in LLMs. We're spending so much energy and compute on reasoning that today's consumption rates would have been unexpected (to me) just a year ago. We're literally burning forests and the atmosphere, and making electricity more expensive for everyone.
DeepSeek V3.1 made a significant leap in this direction recently -- noticeably shorter thinking traces at the same quality. GPT-5's router was also one (important) attempt to reduce reasoning costs and make o3-level quality available in the free tier without breaking the bank. This is also why Claude 4 is winning the coding wars against its reasoning peers -- it delivers great quality without all the added reasoning tokens.
Drawing inspiration from AlphaGo and the MCTS literature -- applying tree weighting, prioritization, and pruning -- feels extremely appropriate (to improve the quality of Deep Think, as offered by Gemini and GPT-5 Pro today).
So, yes, more of this please. Totally the right direction.
It's not remotely practical to select the most probable full path, but you can do a little bit of search a few tokens at a time.
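A rough sketch of what "a little bit of search a few tokens at a time" might look like: a short beam search over token log-probs. Here next_token_logprobs is a hypothetical model call, not any particular library's API:

    import heapq

    def short_beam_search(prefix, next_token_logprobs, beam_width=4, depth=3):
        # Look only `depth` tokens ahead, keeping the `beam_width` best partial paths.
        beams = [(0.0, prefix)]  # (cumulative log-prob, token sequence)
        for _ in range(depth):
            candidates = []
            for logp, seq in beams:
                # next_token_logprobs(seq) -> iterable of (token, log_prob) pairs
                for tok, tok_logp in next_token_logprobs(seq):
                    candidates.append((logp + tok_logp, seq + [tok]))
            beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return beams[0][1]  # best short continuation found

Exhaustively finding the single most probable full sequence is exponential in length, which is why this is only ever worth doing a handful of tokens at a time.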