The key point about RL is that it is a sequential decision-making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, why bother calling it RL?
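To make that concrete, here's a minimal sketch of the agent-environment loop that's being described (the toy environment and random policy are hypothetical stand-ins, not anything from the benchmarks discussed here):

```python
# Minimal agent-environment loop: an agent makes a sequence of decisions,
# each of which changes the state it sees next. The environment and policy
# below are toy placeholders for illustration only.
import random

def environment_step(state, action):
    """Toy transition: reward the agent for matching a hidden parity bit."""
    reward = 1.0 if action == state % 2 else 0.0
    next_state = state + 1
    done = next_state >= 10
    return next_state, reward, done

def policy(state):
    """Placeholder policy: act at random."""
    return random.choice([0, 1])

state, total_reward, done = 0, 0.0, False
while not done:                      # multiple decisions over time...
    action = policy(state)           # ...made by an agent...
    state, reward, done = environment_step(state, action)  # ...while interacting with an environment
    total_reward += reward
print(f"episode return: {total_reward}")
```

A single prompt-in, answer-out call has none of this feedback structure, which is the point being made.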
We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).
We're still running a few experiments and plan to update the post with additional results in a few days.
Looking forward to trying out importance weighting soon!
Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...
Quick question: you mentioned Unsloth in the blog post. Which of the fine-tuning providers mentioned uses Unsloth under the hood?
It's a WIP PR that we plan to merge soon: https://github.com/tensorzero/tensorzero/pull/2273
We worked _really_ hard, burned _tons_ of cash, and we're proud of our D- output. No wonder there are more papers published than actual work being done.
https://artofproblemsolving.com/wiki/index.php/AMC_historica...
Meanwhile the model is trained on these specific types of problems, does not have an apparent time or resource limit, and does not have to take the test in a proctored environment.
It's D- work. Compared to a 12 year old, okay, maybe it's B+. Is this really the point you wanted to make?
Modest results are worth publishing, as are bad results.
Just imagining Monte Carlo sampling it: the middle expectation will have a bunch of zeros due to the indicator function, and the right expectation won't.
I can make the middle expectation as close to zero as I like by making the success threshold sufficiently high.
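For intuition, here's a toy Monte Carlo simulation (assumed Gaussian "scores" and made-up thresholds, not the exact expectations under discussion): as the success threshold rises, nearly every sample of the indicator term is zero, so that estimate collapses toward zero, while a plain expectation of the score is unaffected by the threshold.

```python
# Toy Monte Carlo illustration: the indicator-based estimate E[1{score > t}]
# is driven toward zero as the threshold t grows, while E[score] stays put.
# Gaussian scores and the thresholds are assumptions for the sketch.
import random

def simulate(threshold, n=100_000):
    scores = [random.gauss(0.0, 1.0) for _ in range(n)]
    indicator_mean = sum(1.0 for s in scores if s > threshold) / n  # E[1{score > t}]
    plain_mean = sum(scores) / n                                    # E[score], threshold-free
    return indicator_mean, plain_mean

for t in (0.0, 2.0, 4.0):
    ind, plain = simulate(t)
    print(f"threshold={t}: E[1{{score>t}}] ~ {ind:.4f}, E[score] ~ {plain:.4f}")
```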