"Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR" (https://arxiv.org/abs/2509.02522), not "Winning Gold at IMO 2025 with a Model-Agnostic Verification-and-Refinement Pipeline" (https://arxiv.org/abs/2507.15855).
I can tell you that they cite the DPO paper right before Equation 8.
Isn't this how the Decision Transformer works? I don't see it in the references, so I'll be curious to compare the papers in more depth.
https://arxiv.org/abs/2106.01345
> By conditioning an autoregressive model on the desired return (reward), past states, and actions, our Decision Transformer model can generate future actions that achieve the desired return.
It recently crossed my mind that I haven't seen DT brought up much lately. It seemed really interesting when it was first published, but I haven't read much follow-up work.
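For anyone who hasn't looked at DT, here is a rough sketch of the return conditioning described in the quoted abstract. It only shows how a trajectory could be turned into the (return-to-go, state, action) sequence that an autoregressive model would be trained on; it is not the authors' code, and the function names `returns_to_go` and `interleave` are made up for illustration.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted sum of rewards from each timestep to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry accumulates all future rewards.
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, Any]]:
    """Build the (return-to-go, state, action) sequence, one triple per timestep.

    Hypothetical helper: a real model would embed each element rather than
    tag it with a string.
    """
    tokens: List[Tuple[str, Any]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens += [("rtg", r), ("state", s), ("action", a)]
    return tokens


# Toy trajectory (made-up values, just to show the shapes).
rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]

rtg = returns_to_go(rewards)
sequence = interleave(rtg, states, actions)
```

Training is then ordinary supervised next-token prediction on the action positions. At inference time you would put the return you want in the first `rtg` slot and subtract each observed reward as the episode goes on.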
getnormality•4mo ago
Supervised learning is a much more mature technology than reinforcement learning, so it seems like a good thing to leverage that.