Meta Superintelligence Labs Presents: Compute as Teacher

https://twitter.com/DulhanJay/status/1968693170264248532

4•shash42•4mo ago

Comments

shash42•4mo ago

Where do learning signals come from when there is no ground truth in post-training?

New paper shows how to convert inference-time compute into high-quality supervision for RL training.

Up to 30% rel. improvement on a realistic non-verifiable task (HealthBench), with the model's own self-synthesised rubrics!

NitpickLawyer•4mo ago

Paper link: https://arxiv.org/abs/2509.14234

Some interesting tidbits.

- they propose several "judges", each with their own model (weights at different stages) and separate "concerns". The generate part evolves with the model (in RL) while the "gather and reconcile" is fixed at a frozen stage.

- the "gather and reconcile" judge doesn't get the question when analysing the entire rollout set! (I hope I read this correctly "We keep the anchor question-blind to prevent it from acting as just another rollout and to encourage genuine cross-rollout reasoning")

- a 2nd judge "marks" the rubrics self-proposed by the evolved model with a binary yes/no. This could translate into the evolved model having a harder time "hacking the rewards", since they come from basically 3 places: the evolved model via rollouts and proposed rubrics, the reconciliation by the frozen policy, and a 3rd-party judge that only gives binary scores on the rubrics. Very interesting, and actually huge if it works as proposed and scales w/ model size.

- beats maj@x by 14%, which is nice. Interesting that there's 1% (maybe too small to be relevant? no idea) where the final architecture answered correctly even if all the rollouts were wrong. Probably needs more investigation to make sure something didn't leak somewhere.
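To make the reward flow above concrete, here's a rough sketch of how binary rubric marks could be turned into a scalar reward, next to a maj@x-style vote for comparison. This is my reading of the summary, not the paper's code: `rubric_reward`, `majority_vote` and `keyword_judge` are hypothetical names, the judge is a toy keyword matcher standing in for an LLM judge, and averaging the yes/no marks is an assumption about how they get aggregated.

```python
from collections import Counter
from typing import Callable, List


def rubric_reward(
    response: str,
    rubrics: List[str],
    judge: Callable[[str, str], bool],
) -> float:
    """Fraction of rubric criteria the judge marks 'yes' for one response.

    Assumption: marks are aggregated by a plain average.
    """
    if not rubrics:
        return 0.0
    verdicts = [judge(response, criterion) for criterion in rubrics]
    return sum(verdicts) / len(rubrics)


def majority_vote(answers: List[str]) -> str:
    """maj@x-style baseline: the most frequent final answer across rollouts."""
    return Counter(answers).most_common(1)[0][0]


def keyword_judge(response: str, criterion: str) -> bool:
    """Toy stand-in for an LLM judge: 'yes' if the criterion text appears."""
    return criterion.lower() in response.lower()


# Hypothetical rollouts and rubrics, for illustration only.
rollouts = [
    "Check the dosage and see a doctor.",
    "You should see a doctor.",
    "Just rest.",
]
rubrics = ["see a doctor", "dosage"]
rewards = [rubric_reward(r, rubrics, keyword_judge) for r in rollouts]
```

With the toy data, the three rollouts satisfy two, one and zero of the two criteria respectively, so each gets a different reward even though none of them has a checkable "final answer" to vote on.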

Personal thoughts:

- the models used are small (4B, 4B and 8B). We'll see if this scales w/ model size. It should, since GRPO does, but there's still the question of which 3rd-party judge you use. Maybe an "adversarial" one, like in a GAN? Interesting avenues nonetheless.
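For anyone who hasn't looked at GRPO: the core idea is that each rollout's reward is scored relative to the other rollouts for the same prompt, so no separate value model is needed. A minimal sketch, assuming population standard deviation and a small `eps` for stability (both are my choices here, and `group_relative_advantages` is a hypothetical helper, not something from the paper):

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's rollout group.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population std; eps avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric rewards for three rollouts of the same prompt.
advantages = group_relative_advantages([1.0, 0.5, 0.0])
```

Note that if every rollout in a group gets the same reward, all advantages come out as zero, so a judge that can't tell rollouts apart gives no learning signal.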
