frontpage.

OpenClaw Partners with VirusTotal for Skill Security

https://openclaw.ai/blog/virustotal-partnership
1•zhizhenchi•27s ago•0 comments

Goal: Ship 1M Lines of Code Daily

2•feastingonslop•10m ago•0 comments

Show HN: Codex-mem, 90% fewer tokens for Codex

https://github.com/StartripAI/codex-mem
1•alfredray•13m ago•0 comments

FastLangML: Context-aware lang detector for short conversational text

https://github.com/pnrajan/fastlangml
1•sachuin23•16m ago•1 comment

LineageOS 23.2

https://lineageos.org/Changelog-31/
1•pentagrama•19m ago•0 comments

Crypto Deposit Frauds

2•wwdesouza•20m ago•0 comments

Substack makes money from hosting Nazi newsletters

https://www.theguardian.com/media/2026/feb/07/revealed-how-substack-makes-money-from-hosting-nazi...
2•lostlogin•21m ago•0 comments

Framing an LLM as a safety researcher changes its language, not its judgement

https://lab.fukami.eu/LLMAAJ
1•dogacel•23m ago•0 comments

Is anyone interested in a creator economy startup

1•Nejana•24m ago•0 comments

Show HN: Skill Lab – CLI tool for testing and quality scoring agent skills

https://github.com/8ddieHu0314/Skill-Lab
1•qu4rk5314•25m ago•0 comments

2003: What is Google's Ultimate Goal? [video]

https://www.youtube.com/watch?v=xqdi1xjtys4
1•1659447091•25m ago•0 comments

Roger Ebert Reviews "The Shawshank Redemption"

https://www.rogerebert.com/reviews/great-movie-the-shawshank-redemption-1994
1•monero-xmr•27m ago•0 comments

Busy Months in KDE Linux

https://pointieststick.com/2026/02/06/busy-months-in-kde-linux/
1•todsacerdoti•27m ago•0 comments

Zram as Swap

https://wiki.archlinux.org/title/Zram#Usage_as_swap
1•seansh•40m ago•0 comments

Green’s Dictionary of Slang - Five hundred years of the vulgar tongue

https://greensdictofslang.com/
1•mxfh•42m ago•0 comments

Nvidia CEO Says AI Capital Spending Is Appropriate, Sustainable

https://www.bloomberg.com/news/articles/2026-02-06/nvidia-ceo-says-ai-capital-spending-is-appropr...
1•virgildotcodes•45m ago•2 comments

Show HN: StyloShare – privacy-first anonymous file sharing with zero sign-up

https://www.styloshare.com
1•stylofront•46m ago•0 comments

Part 1 the Persistent Vault Issue: Your Encryption Strategy Has a Shelf Life

1•PhantomKey•50m ago•0 comments

Show HN: Teleop_xr – Modular WebXR solution for bimanual robot teleoperation

https://github.com/qrafty-ai/teleop_xr
1•playercc7•52m ago•1 comment

The Highest Exam: How the Gaokao Shapes China

https://www.lrb.co.uk/the-paper/v48/n02/iza-ding/studying-is-harmful
2•mitchbob•57m ago•1 comment

Open-source framework for tracking prediction accuracy

https://github.com/Creneinc/signal-tracker
1•creneinc•59m ago•0 comments

India's Sarvam AI launches Indic-language focused LLMs

https://x.com/SarvamAI
2•Osiris30•1h ago•0 comments

Show HN: CryptoClaw – open-source AI agent with built-in wallet and DeFi skills

https://github.com/TermiX-official/cryptoclaw
1•cryptoclaw•1h ago•0 comments

Show HN: Make OpenClaw respond in Scarlett Johansson’s AI Voice from the Film Her

https://twitter.com/sathish316/status/2020116849065971815
1•sathish316•1h ago•2 comments

CReact Version 0.3.0 Released

https://github.com/creact-labs/creact
1•_dcoutinho96•1h ago•0 comments

Show HN: CReact – AI Powered AWS Website Generator

https://github.com/creact-labs/ai-powered-aws-website-generator
1•_dcoutinho96•1h ago•0 comments

The rocky 1960s origins of online dating (2025)

https://www.bbc.com/culture/article/20250206-the-rocky-1960s-origins-of-online-dating
1•1659447091•1h ago•0 comments

Show HN: Agent-fetch – Sandboxed HTTP client with SSRF protection for AI agents

https://github.com/Parassharmaa/agent-fetch
1•paraaz•1h ago•0 comments

Why there is no official statement from Substack about the data leak

https://techcrunch.com/2026/02/05/substack-confirms-data-breach-affecting-email-addresses-and-pho...
15•witnessme•1h ago•4 comments

Effects of Zepbound on Stool Quality

https://twitter.com/ScottHickle/status/2020150085296775300
3•aloukissas•1h ago•1 comment

Show HN: Entropy-Guided Loop – How to make small models reason

https://github.com/monostate/weave-logprobs-reasoning-loop
33•andrewmonostate•5mo ago
TLDR: A small, vendor-agnostic inference loop that turns token logprobs/perplexity/entropy into at most one extra refinement pass for LLMs.

- Captures logprobs/top-k during generation, computes perplexity and token-level entropy.

- Triggers at most one refine when simple thresholds fire; passes a compact “uncertainty report” (uncertain tokens + top-k alts + local context) back to the model.

- In our tests on technical Q&A / math / code, a small model recovered much of the quality of “reasoning” models at ~⅓ the cost, while refining only ~⅓ of outputs.
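The signal computation in the first bullet can be sketched as follows. This is a minimal stand-alone version, not the repo's actual code; the token/logprob values are invented for illustration:

```python
import math

def sequence_metrics(token_logprobs):
    """Compute perplexity and max token entropy from captured logprobs.

    token_logprobs: list of (token, logprob, top_k), where top_k maps
    alternative tokens -> their logprobs at that position.
    """
    logps = [lp for _, lp, _ in token_logprobs]
    # Perplexity = exp of the negative mean log-probability of the sequence.
    perplexity = math.exp(-sum(logps) / len(logps))
    # Shannon entropy over the renormalized top-k distribution at each step.
    entropies = []
    for _, _, top_k in token_logprobs:
        probs = [math.exp(lp) for lp in top_k.values()]
        total = sum(probs)
        probs = [p / total for p in probs]
        entropies.append(-sum(p * math.log(p) for p in probs))
    return perplexity, max(entropies)

# Toy example: three tokens with top-2 alternatives each (values made up).
tokens = [
    ("The", -0.05, {"The": -0.05, "A": -3.2}),
    ("answer", -0.4, {"answer": -0.4, "result": -1.3}),
    ("42", -1.6, {"42": -1.6, "41": -1.7}),
]
ppl, max_ent = sequence_metrics(tokens)
```

Here the "42" position is nearly a coin flip between two alternatives, so its entropy approaches ln 2 and dominates the max; that is exactly the kind of token the uncertainty report would surface.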

I kept seeing “reasoning” models behave like expensive black boxes. Meanwhile, standard inference already computes useful signals, both before softmax normalization and after it (logprobs), which we usually throw away. This loop tries the simplest thing you could think of: use those signals to decide when (and where) to think again.

GitHub (notebook + minimal code): https://github.com/monostate/weave-logprobs-reasoning-loop

Paper (short, engineer-written): https://arxiv.org/abs/2509.00079

Blog (more context): https://monostate.ai/blog/entropy-refinement-blog

Requirements: Python and an API that exposes logprobs (tested with OpenAI's non-reasoning GPT-4.1). Set OPENAI_API_KEY, plus WEAVE for observability. Run the notebook; it prints metrics and shows which tokens triggered refinement.

- Python, simple loop (no retraining).

- Uses Responses API logprobs/top-k; metrics: perplexity, max token entropy, low-confidence counts.

- Weave for lightweight logging/observability (optional).

- Passing alternatives (not just “this looks uncertain”) prevents over-correction.

- A simple OR rule (ppl / max-entropy / low-confidence count) catches complementary failure modes.

- Numbers drift across vendors; keeping the method vendor-agnostic is better than chasing fragile pairings.

- Needs APIs that expose logprobs/top-k.

- Results are indicative—not a leaderboard; focus is on within-model gains (single-pass vs +loop).

- Thresholds might need light tuning per domain.

- One pass only; not a chain-of-thought replacement.
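The OR rule mentioned in the bullets could look something like this; the threshold values here are illustrative placeholders, not the repo's defaults:

```python
def should_refine(perplexity, max_entropy, low_conf_count,
                  ppl_thresh=1.5, ent_thresh=0.7, low_conf_thresh=3):
    """Fire a single refinement pass if ANY signal crosses its threshold.

    The three signals catch complementary failure modes:
    - high perplexity: the whole answer was sampled with low confidence
    - high max token entropy: one decision point was nearly a coin flip
    - many low-confidence tokens: uncertainty is spread across the output
    """
    return (perplexity > ppl_thresh
            or max_entropy > ent_thresh
            or low_conf_count > low_conf_thresh)
```

Because the loop triggers at most once, cost stays deterministic: zero or one extra pass, never an open-ended search.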

- Run it on your own models and ideas (e.g., 4o-mini, v3, Llama variants with logprobs) and, if you'd like, share logs in a PR against our README on GitHub. PRs welcome - I’ll credit and link.

Overall, let me know if you find this way of making small models reason useful!

Comments

mountainriver•5mo ago
Deep Entropix vibes
andrewmonostate•5mo ago
Thanks for bringing this up! Good catch on the similarities! Yes, both use entropy/uncertainty to allocate compute intelligently.

From what I understand, Entropix is an entropy-aware decoder - it monitors token entropy during generation and dynamically adjusts sampling or spawns parallel CoT branches at high-uncertainty points. It's a decoding-time intervention.

My approach doesn't touch decoding at all. I:

1. Generate normally (standard sampling)

2. Capture logprobs + top-k alternatives

3. Check if perplexity/entropy/confidence triggers exceed thresholds

4. If yes, do ONE refinement pass with an "uncertainty report" showing the model exactly which tokens were uncertain + their alternatives + context
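Steps 2 and 4 hinge on the uncertainty report. A minimal sketch of how one could be assembled (this is a hypothetical helper, not the repo's code; LOGPROB_FLOOR is an illustrative threshold):

```python
import math

# Tokens below roughly 30% probability count as low-confidence (illustrative).
LOGPROB_FLOOR = -1.2

def build_uncertainty_report(token_logprobs, context_window=2):
    """List each uncertain token with its top-k alternatives and local context.

    token_logprobs: list of (token, logprob, top_k) captured during generation.
    The resulting text is appended to the single refinement prompt, so the
    model sees exactly where it was unsure and what the alternatives were.
    """
    lines = []
    toks = [t for t, _, _ in token_logprobs]
    for i, (tok, lp, top_k) in enumerate(token_logprobs):
        if lp < LOGPROB_FLOOR:
            ctx = " ".join(toks[max(0, i - context_window): i + context_window + 1])
            alts = ", ".join(f"{a} ({math.exp(alp):.0%})" for a, alp in top_k.items())
            lines.append(f"token '{tok}' in '...{ctx}...' -- alternatives: {alts}")
    return "\n".join(lines)

# Toy data (values invented): only the near-coin-flip token makes the report.
tokens = [
    ("The", -0.05, {"The": -0.05, "A": -3.2}),
    ("answer", -0.4, {"answer": -0.4, "result": -1.3}),
    ("42", -1.6, {"42": -1.6, "41": -1.7}),
]
report = build_uncertainty_report(tokens)
```

Passing the concrete alternatives (not just "this token looks uncertain") is what keeps the refinement pass from over-correcting parts of the answer that were fine.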

The key difference: Entropix steers the ship while sailing; my loop reviews the voyage log and decides whether to make one correction pass. No branching, no custom samplers, deterministic cost (0 or 1 extra pass).

They're actually complementary - you could use Entropix entropy-aware sampling for initial generation and still apply a refinement loop afterward. Same underlying signal (entropy), different control points! The result of combining both should be outstanding! I will test it soon.

mountainriver•5mo ago
this is very cool!
andrewmonostate•4mo ago
Thanks, please do try it when you get some time! https://github.com/monostate/weave-logprobs-reasoning-loop or https://colab.research.google.com/github/monostate/weave-log...