Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

38•getnormality•4mo ago

Comments

getnormality•4mo ago

I stumbled across this AI paper just now. It sounds intimidatingly technical, but if you read the abstract and look at Figures 1 and 2 and Equation 6, I think it's got some neat and accessible conceptual ideas.

Supervised learning is a much more mature technology than reinforcement learning, so it seems like a good thing to leverage that.

yorwba•4mo ago

I think you meant to link to

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR https://arxiv.org/abs/2509.02522

not

Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline https://arxiv.org/abs/2507.15855

dang•4mo ago

We've changed the top link to that from https://arxiv.org/abs/2507.15855. Thanks!

getnormality•4mo ago

Ack, thank you.

impossiblefork•4mo ago

That paper is really cool too though. I'm happy that your comment sort of records the old link, because I only saw the right paper.

anfego•4mo ago

Is this DPO?

getnormality•4mo ago

I have no idea. My understanding of this entire field is extremely superficial. I only posted this because I was able to sort of understand the paper despite that.

I can tell you that they cite the DPO paper right before Equation 8.

impossiblefork•4mo ago

59.76% on AIME is really appealing. Without having had time to understand it and determine whether it's useful or not, I see this number as indicating that this could be a stepping stone on something like the o1-to-DeepSeek-R1 progression for thinking, where open source models eventually figured out how o1 worked, only for the less definite 'o1' and instead what Google achieved and OpenAI may have achieved on the 2025 IMO problems.

radarsat1•4mo ago

> By treating the outcome reward as a predictable label, we reformulate the RLVR problem into a supervised learning task over a score function parameterized by the policy model and optimized using cross-entropy loss.

Isn't this how the Decision Transformer works? I don't see it in the references, so I'll be curious to compare the papers in more depth.

https://arxiv.org/abs/2106.01345

> By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return.

Lately it has crossed my mind that I haven't seen DT brought up much lately, it seemed really interesting when it was first published but I haven't read much follow-up work.

lostmsu•3mo ago

Can somebody explain "Base" model in the charts? Are they saying that the original model (e.g. before they applied either their or comparable training methods) has better or similar performance on all benchmarks vs their own result?

Ask HN: How much of your token use is fixing the bugs Claude Code causes?

Show HN: Agents – Sync MCP Configs Across Claude, Cursor, Codex Automatically

Hello

FSD helped save my father's life during a heart attack

Show HN: Writtte – Draft and publish articles without reformatting, anywhere

Portuguese icon (FROM A CAN) makes a simple meal (Canned Fish Files) [video]

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

Transcribe your aunts post cards with Gemini 3 Pro

.72% Variance Lance

ReKindle – web-based operating system designed specifically for E-ink devices

Encrypt It

NextMatch – 5-minute video speed dating to reduce ghosting

Personalizing esketamine treatment in TRD and TRBD

SpaceKit.xyz – a browser‑native VM for decentralized compute

NotebookLM: The AI that only learns from you

Show HN: An open-source starter kit for developing with Postgres and ClickHouse

Game Boy Advance d-pad capacitor measurements

South Korean crypto firm accidentally sends $44B in bitcoins to users

Apache Poison Fountain

Web.whatsapp.com appears to be having issues syncing and sending messages

Google in Your Terminal

Shannon: Claude Code for Pen Testing: #1 on Github today

Anthropic: Latest Claude model finds more than 500 vulnerabilities

Brooklyn cemetery plans human composting option, stirring interest and debate

Why the 'Strivers' Are Right

Brain Dumps as a Literary Form

Agentic Coding and the Problem of Oracles

Malicious packages for dYdX cryptocurrency exchange empties user wallets

Show HN: I built a <400ms latency voice agent that runs on a 4gb vram GTX 1650"

Penisgate erupts at Olympics; scandal exposes risks of bulking your bulge

Ask HN: How much of your token use is fixing the bugs Claude Code causes?

Show HN: Agents – Sync MCP Configs Across Claude, Cursor, Codex Automatically

Hello

FSD helped save my father's life during a heart attack

Show HN: Writtte – Draft and publish articles without reformatting, anywhere

Portuguese icon (FROM A CAN) makes a simple meal (Canned Fish Files) [video]

Brookhaven Lab's RHIC Concludes 25-Year Run with Final Collisions

Transcribe your aunts post cards with Gemini 3 Pro

.72% Variance Lance

ReKindle – web-based operating system designed specifically for E-ink devices

Encrypt It

NextMatch – 5-minute video speed dating to reduce ghosting

Personalizing esketamine treatment in TRD and TRBD

SpaceKit.xyz – a browser‑native VM for decentralized compute

NotebookLM: The AI that only learns from you

Show HN: An open-source starter kit for developing with Postgres and ClickHouse

Game Boy Advance d-pad capacitor measurements

South Korean crypto firm accidentally sends $44B in bitcoins to users

Apache Poison Fountain

Web.whatsapp.com appears to be having issues syncing and sending messages

Google in Your Terminal

Shannon: Claude Code for Pen Testing: #1 on Github today

Anthropic: Latest Claude model finds more than 500 vulnerabilities

Brooklyn cemetery plans human composting option, stirring interest and debate

Why the 'Strivers' Are Right

Brain Dumps as a Literary Form

Agentic Coding and the Problem of Oracles

Malicious packages for dYdX cryptocurrency exchange empties user wallets

Show HN: I built a <400ms latency voice agent that runs on a 4gb vram GTX 1650"

Penisgate erupts at Olympics; scandal exposes risks of bulking your bulge

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Comments