The key point about RL is that it is a sequential decision making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, then why bother calling it RL?
"Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting."
uh no? SFT is maximizing the RL objective in a dense reward setting. The entire point of RL methods like actor-critic and Q-learning is that they turn the sparse reward into a continuous, dense signal on which a model can be trained with classic gradient descent.
I mean, look at the definition of Q-learning and the Bellman equation it uses. It chooses the current action based on whether it maximizes the predicted return (the Q-value), not the actual reward, which doesn't have to be continuous or produce a gradient. You can build an RL-based maze solver where only the goal gives the model a reward, and it would work, albeit it would train extremely slowly.
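Roughly, a toy sketch of that (tabular Q-learning on a made-up 4x4 grid where only the goal cell returns reward 1; the grid size, epsilon, and episode cap are just illustrative choices, not anything from the post):

    import random

    # Toy 4x4 grid maze: state = (row, col); only the goal cell returns reward 1.
    SIZE, GOAL = 4, (3, 3)
    MOVES = [(-1, 0), (1, 0), (0, 1), (0, -1)]  # up, down, right, left

    def step(state, action):
        r = min(max(state[0] + MOVES[action][0], 0), SIZE - 1)
        c = min(max(state[1] + MOVES[action][1], 0), SIZE - 1)
        nxt = (r, c)
        return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL  # sparse reward

    Q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in range(4)}
    alpha, gamma, eps = 0.1, 0.95, 0.3

    for _ in range(2000):
        s = (0, 0)
        for _ in range(200):  # cap episode length
            a = random.randrange(4) if random.random() < eps else max(range(4), key=lambda x: Q[(s, x)])
            s2, reward, done = step(s, a)
            # Bellman update: the target uses the *predicted* value of the next state,
            # so zero-reward steps still get a learning signal bootstrapped from the goal.
            target = reward + (0.0 if done else gamma * max(Q[(s2, x)] for x in range(4)))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
            if done:
                break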
Meanwhile, supervised fine-tuning always produces a continuous gradient on every single token.
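Something like this, as a sketch (PyTorch-style, with random tensors standing in for a real model and dataset; the shapes and pad_id are made up for illustration):

    import torch
    import torch.nn.functional as F

    # logits: (batch, seq_len, vocab) from the model; targets: (batch, seq_len) token ids.
    # Every position gets a cross-entropy term, so every token produces a gradient.
    def sft_loss(logits, targets, pad_id=0):
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),  # (batch * seq_len, vocab)
            targets.reshape(-1),                  # (batch * seq_len,)
            ignore_index=pad_id,                  # skip padding, nothing else is masked
        )

    logits = torch.randn(2, 8, 100, requires_grad=True)
    targets = torch.randint(1, 100, (2, 8))
    sft_loss(logits, targets).backward()
    print(logits.grad.abs().sum(dim=-1))  # nonzero at every (batch, position)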
We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).
We're still running a few experiments and plan to update the post with additional results in a few days.
Looking forward to trying out importance weighting soon!
Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...
Quick question: you mentioned Unsloth in the blog post. Which of the fine-tuning providers you mentioned is using Unsloth under the hood?
It's a WIP PR that we plan to merge soon: https://github.com/tensorzero/tensorzero/pull/2273
We worked _really_ hard, burned _tons_ of cash, and we're proud of our D- output. No wonder there are more papers published than actual work being done.
https://artofproblemsolving.com/wiki/index.php/AMC_historica...
Meanwhile the model is trained on these specific types of problems, does not have an apparent time or resource limit, and does not have to take the test in a proctored environment.
It's D- work. Compared to a 12-year-old, okay, maybe it's a B+. Is this really the point you wanted to make?
Modest results are worth publishing, as are bad results.
Just imagining Monte Carlo sampling it: the middle expectation will have a bunch of zeros due to the indicator function, while the right expectation won't.
I can make the middle expectation as close to zero as I like by making the success threshold sufficiently high.
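A quick numerical sketch of that point, assuming the bound compares an indicator-weighted expectation E[1{R >= tau} * log p] against a plain E[log p]; that's my reading of the comparison, not the post's exact notation, and the reward/log-prob distributions below are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Stand-ins: reward R(y) and log-prob log p(y) for sampled trajectories y.
    reward = rng.normal(0.0, 1.0, n)
    logp = rng.normal(-5.0, 1.0, n)

    for tau in [0.0, 2.0, 4.0]:
        keep = reward >= tau              # indicator 1{R >= tau}
        middle = np.mean(keep * logp)     # indicator-weighted expectation: mostly zeros
        right = np.mean(logp)             # plain expectation: every sample contributes
        print(f"tau={tau}: kept {keep.mean():.4f} of samples, "
              f"indicator-weighted={middle:.4f}, plain={right:.4f}")

As the threshold tau rises, the fraction of surviving samples shrinks and the indicator-weighted term drifts toward zero while the plain expectation stays put.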