The core idea: instead of giving every query the same "thinking time," classify queries as HIGH or LOW complexity and allocate thinking tokens accordingly. Complex reasoning gets 70-90% of tokens, simple queries get 20-40%.
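A minimal sketch of that budgeting step (the keyword heuristic and the exact numbers here are stand-ins for illustration, not what the actual implementation does):

    # Illustrative sketch only -- not the actual AutoThink/optillm code.
    # A real system uses a learned classifier; the keyword heuristic is a stand-in.
    MAX_THINKING_TOKENS = 4096          # hypothetical overall reasoning budget

    def classify_complexity(query: str) -> str:
        hard_markers = ("prove", "derive", "optimize", "algorithm", "integral")
        return "HIGH" if any(m in query.lower() for m in hard_markers) else "LOW"

    def thinking_budget(query: str) -> int:
        """Scale the reasoning-token budget by query complexity."""
        fraction = 0.8 if classify_complexity(query) == "HIGH" else 0.3
        return int(MAX_THINKING_TOKENS * fraction)

    print(thinking_budget("What's the capital of France?"))          # 1228 tokens
    print(thinking_budget("Derive the closed form of this series"))  # 3276 tokens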
I also implemented steering vectors derived from Pivotal Token Search (originally from Microsoft's Phi-4 paper) that guide the model's reasoning patterns during generation. These vectors encourage behaviors like numerical accuracy, self-correction, and thorough exploration.
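At generation time the steering amounts to an additive nudge on a middle layer's hidden states. Roughly this pattern in transformers (the random placeholder vector, the layer index, and the strength below are illustrative only, not what PTS actually produces):

    # Sketch of steering-vector injection via a forward hook (not the PTS/optillm code).
    # The random vector is a placeholder for one derived from pivotal tokens,
    # and the layer index / strength are assumptions to tune per model.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    steering_vec = torch.randn(model.config.hidden_size)   # placeholder vector
    TARGET_LAYER, ALPHA = 17, 4.0                           # middle layer, steering strength

    def add_steering(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + ALPHA * steering_vec.to(hidden.dtype).to(hidden.device)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = model.model.layers[TARGET_LAYER].register_forward_hook(add_steering)
    out = model.generate(**tok("What is 17 * 24?", return_tensors="pt"), max_new_tokens=128)
    handle.remove()
    print(tok.decode(out[0], skip_special_tokens=True))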
Results on DeepSeek-R1-Distill-Qwen-1.5B:
- GPQA-Diamond: 31.06% vs 21.72% baseline (+43% relative improvement)
- MMLU-Pro: 26.38% vs 25.58% baseline
- Uses fewer tokens than baseline approaches
Works with any local reasoning model - DeepSeek, Qwen, custom fine-tuned models. No API dependencies.
The technique builds on two things I developed: an adaptive classification framework that can learn new complexity categories without retraining, and an open source implementation of Pivotal Token Search.
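One way to get "learn new categories without retraining" is prototype-style classification over embeddings: adding a category just means storing a few example embeddings, with no gradient updates. A toy sketch of that general idea (not the actual adaptive-classifier internals):

    # Toy illustration of adding categories without retraining, via nearest-centroid
    # over sentence embeddings -- the idea, not the adaptive-classifier library's API.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    examples: dict[str, list[np.ndarray]] = {}

    def add_examples(label: str, texts: list[str]) -> None:
        """Adding or extending a class is just storing embeddings; no training step."""
        examples.setdefault(label, []).extend(encoder.encode(texts))

    def classify(text: str) -> str:
        q = encoder.encode(text)
        def cosine(label: str) -> float:
            c = np.mean(examples[label], axis=0)
            return float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        return max(examples, key=cosine)

    add_examples("HIGH", ["Prove that the sum of two odd numbers is even."])
    add_examples("LOW", ["What's the capital of France?"])
    print(classify("Derive the time complexity of quicksort."))   # likely "HIGH"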
Technical paper: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5253327
Code and examples: https://github.com/codelion/optillm/tree/main/optillm/autoth...
PTS implementation: https://github.com/codelion/pts
I'm curious about your thoughts on adaptive resource allocation for AI reasoning. Have you tried similar approaches with your local models?
codelion•1d ago
The breakthrough was combining two techniques I'd been working on separately: adaptive classification (which can learn new categories without retraining) and an open source implementation of Pivotal Token Search from Microsoft's Phi-4 paper. When I put them together with dynamic token budgeting, the performance gains were much better than expected.
What surprised me most was that the technique actually uses fewer tokens on average while improving performance. The adaptive allocation means simple queries finish faster, offsetting the extra computation on complex ones.
A few technical notes:
- The steering vectors are small (typically <1MB per pattern; see the quick check below) and add minimal memory overhead
- Classification adds about 10ms latency, which is negligible
- Target layer selection matters - I found middle layers (15-20) work best for most models
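For a sense of scale on the memory point, a back-of-the-envelope check (the per-pattern vector count below is an assumption, not a measurement):

    # Rough size check on the "<1MB per pattern" note.
    hidden_size = 1536            # DeepSeek-R1-Distill-Qwen-1.5B hidden dimension
    bytes_per_float = 4           # float32
    vectors_per_pattern = 100     # assumed number of token-level vectors per pattern
    size_mb = hidden_size * bytes_per_float * vectors_per_pattern / 1e6
    print(f"~{size_mb:.2f} MB per pattern")   # ~0.61 MB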
I'd love feedback on:
- Have you tried similar adaptive approaches with your models?
- What other reasoning patterns would be useful to steer toward?
- Ideas for automatically detecting the optimal target layer?
Thanks for checking it out! Happy to answer any questions about the implementation or results.
behnamoh•1d ago
Not anymore. Have you seen Gemini 2.5 Pro? Ask it simple questions and it almost doesn't "think". Ask it a coding question and it'll write a long reasoning article. I think the same goes for o3.
victorbjorklund•1d ago
"how long break distance does a train need if going in 100 km/hour?"
Just need a quick reply and you dont care so much (maybe showerthought)? Or is life and death depending on the answer?
The same question can need different amount of thinking.
normie3000•23h ago
In this situation I suspect you'd still want the answer quickly.
TeMPOraL•12h ago
If you need the answer within a couple of hours, you can probably get it from an expert; if you need an actionable answer within minutes, based on some back-of-the-envelope calculations, then a SOTA LLM is a much safer bet than flagging down whoever seems smartest in the room and asking them for help.
CharlesW•20h ago
Definitely, in my experience. Elsewhere in the thread, OP says that open models/systems don't do this, in which case this seems like important work toward making open alternatives competitive.
olddustytrail•19h ago
You could even put a simpler AI in front to decide if it was effectively the same query.
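Something like this, as a rough sketch (embedding similarity standing in for the "simpler AI"; the 0.92 threshold and the expensive_llm callable are placeholders):

    # Reuse a cached answer when a new query is a near-duplicate of one already seen.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    cache: list[tuple[str, str]] = []          # (query, answer) pairs already computed

    def answer(query: str, expensive_llm) -> str:
        q = encoder.encode(query, convert_to_tensor=True)
        for past_query, past_answer in cache:
            if util.cos_sim(q, encoder.encode(past_query, convert_to_tensor=True)) > 0.92:
                return past_answer             # effectively the same query: skip the big model
        result = expensive_llm(query)
        cache.append((query, result))
        return result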
Abishek_Muthian•1d ago
So far I've taken only a lazy approach to optimising local LLMs: sending small queries to my M4 Mac Mini running MLX models and larger queries to my Nvidia 4090. It's remarkable how efficient the M4 is compared to the Nvidia, and I think Apple is heading in the right direction with MLX.
I'll read up on AutoThink and try to integrate it into my workflow.
waffletower•17h ago