
Supervised fine tuning on curated data is reinforcement learning

https://arxiv.org/abs/2507.12856
56•GabrielBianconi•14h ago

Comments

mandevil•13h ago
Interesting to see two independent researchers on this. Makes me curious what the backstory is. A side project?
babelfish•13h ago
Especially interesting given they both work for Google DeepMind.
GabrielBianconi•12h ago
Yeah, I hadn't noticed!
jtspringenberg•12h ago
Author here, just to clarify: we are both no longer working for DeepMind. This was purely an independent effort for the sake of research and understanding! Happy to answer any questions.
iandanforth•13h ago
How is this kind of analogy helpful? You can frame any optimization problem as RL if you try hard enough. RL is a method of optimization that calls the optimum "reward maximization", and you can craft the reward function any way you want.

The key point about RL is that it is a sequential decision-making process. If you don't have something (an agent) making multiple decisions over time while interacting with an environment, then why bother calling it RL?

imtringued•1h ago
I personally am quite disappointed by the abstract:

"Building on existing literature, we clarify that SFT can be understood as maximizing a lower bound on the RL objective in a sparse reward setting."

uh no? SFT is maximizing the RL objective in a dense reward setting. The entire point of RL, specifically actor-critic and Q-learning, is that the method turns a sparse reward into a continuous, dense signal on which a model can be trained with classic gradient descent.

I mean, look at the definition of Q-learning and the Bellman equation it uses. It chooses the current action based on the predicted future reward, not the actual reward, which doesn't have to be continuous or produce a gradient. You can build an RL-based maze solver where only the goal gives a reward to the model, and it would work, though it would train extremely slowly.
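
A toy version of that maze point (purely illustrative, not the paper's setup): tabular Q-learning on a short corridor where only the goal state pays out still learns a useful policy, because the Bellman backup bootstraps from predicted future reward at every intermediate step.

```python
import random

# Tabular Q-learning on a 6-state corridor; the only reward is +1 at the
# goal (state 5). Intermediate states still get a learning signal because
# the update bootstraps from the *predicted* value of the next state.
N = 6
ACTIONS = [-1, +1]                      # left, right
Q = {(s, a): 0.0 for s in range(N) for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.5
random.seed(0)

for episode in range(300):
    s = 0
    while s != N - 1:
        if random.random() < eps:
            a = random.choice(ACTIONS)                  # explore
        else:
            a = max(ACTIONS, key=lambda b: Q[(s, b)])   # exploit
        s2 = min(max(s + a, 0), N - 1)
        r = 1.0 if s2 == N - 1 else 0.0                 # sparse reward: goal only
        # Bellman backup: target uses predicted future reward, not actual.
        target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# The greedy policy now heads right (toward the goal) from every state.
policy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N - 1)}
```

As the comment predicts, the early episodes wander for a long time before the first goal hit, after which value propagates backward quickly.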

Meanwhile supervised fine tuning always produces a continuous gradient on every single token.
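
To make that last point concrete, here is the standard cross-entropy gradient at a single token position (a generic identity, nothing specific to the paper): the gradient with respect to the logits is softmax(z) minus the one-hot target, which is nonzero in every coordinate at every position.

```python
import math

def softmax(z):
    m = max(z)                                # subtract max for stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def ce_grad(logits, target):
    # d/dz of -log softmax(z)[target] is softmax(z) - onehot(target):
    # every entry is nonzero, i.e. a dense signal at every token.
    p = softmax(logits)
    return [p[i] - (1.0 if i == target else 0.0) for i in range(len(logits))]

g = ce_grad([2.0, 0.5, -1.0], target=0)
```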

anndvision•12h ago
We recently ran similar experiments and saw that fine-tuning small models on automatically curated high-quality outputs from a large model can beat large-model performance while reducing inference costs by up to 30x and inference time by up to 4x.

We benchmarked closed-source (OpenAI, Google) and open-source (Qwen) models on multi-turn maze navigation (BabyAI), agentic RAG (Multi-Hop), and agentic tool use (τ-bench).

We're still running a few experiments and plan to update the post with additional results in a few days.

Looking forward to trying out importance weighting soon!

Curated Behavior Cloning: Small LLMs Can Beat Large Ones at 5-30x Lower Cost: https://www.tensorzero.com/blog/curated-behavior-cloning-sma...
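
For readers curious what the curation step looks like mechanically, a minimal sketch (the names, rewards, and threshold are made up for illustration; this is not TensorZero's actual pipeline): keep only the large model's successful episodes, then use them as a plain SFT dataset for the small model.

```python
def curate(episodes, threshold=1.0):
    """Keep only high-reward (prompt, completion) pairs for fine-tuning.

    episodes: list of (prompt, completion, reward) from the large model.
    """
    return [(p, c) for (p, c, r) in episodes if r >= threshold]

# Hypothetical episodes in the spirit of the benchmarks above.
episodes = [
    ("navigate maze A", "left, left, up", 1.0),      # solved -> kept
    ("navigate maze B", "up, up, up", 0.0),          # failed -> dropped
    ("answer multi-hop q", "retrieve, then cite", 1.0),
]
sft_dataset = curate(episodes)  # only the reward-1 demonstrations remain
```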

chongliqin•12h ago
Cool! If you are interested, we have open sourced our code: https://github.com/emmyqin/iw_sft
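
For context, one simple form of importance-weighted SFT looks like the sketch below (hedged: the exact weighting in the linked iw_sft code may differ). Each curated sequence's negative log-likelihood is scaled by w = pi_theta(y|x) / mu(y|x), the ratio of the current policy to the data-generating policy.

```python
import math

def iw_sft_loss(examples):
    """examples: list of (logp_theta, logp_mu) per-sequence log-probs."""
    total = 0.0
    for logp_theta, logp_mu in examples:
        w = math.exp(logp_theta - logp_mu)   # importance weight pi/mu
        total += -w * logp_theta             # weighted NLL for this sequence
    return total / len(examples)

loss = iw_sft_loss([(-1.2, -1.0), (-0.4, -0.9)])
```

In a real training loop the weight w would be treated as a constant (no gradient flows through it), so each example's gradient is just its ordinary SFT gradient rescaled.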
anndvision•11h ago
thanks
TheTaytay•7h ago
Thanks for this - I've spent the last hour reading your docs and blog. I like the primitives you've exposed in your API, and I particularly like the decision to separate the structured inputs from the prompt when you record an LLM call, so I can finally perform optimizations and evals on past calls.

Quick question: you mentioned Unsloth in the blog post. Which of the fine-tuning providers you mentioned uses Unsloth under the hood?

GabrielBianconi•7h ago
[I'm his coworker.] We ran Unsloth ourselves on a GPU-by-the-hour server. We have a notebook in the repository showing how to query historical data and use it with Unsloth.

It's a WIP PR that we plan to merge soon: https://github.com/tensorzero/tensorzero/pull/2273

henriquegodoy•12h ago
It's cool to see the perspective that many problems (certain kinds of communication problems, say in law and compliance) can be solved by treating AI less as agents and more as modular components within a larger system. Once we build a working process, monitored through evals, we can then reduce costs by distilling these modules: start with superintelligent models, then distill them down to just a few billion parameters instead of needing hundreds of billions.
stolencode•8h ago
> For example achieving 66.7% on the AIME 2024 dataset.

We worked _really_ hard, burned _tons_ of cash, and we're proud of our D- output. No wonder there are more papers published than actual work being done.

supermdguy•8h ago
That corresponds to a 10/15, which is actually really good (median is around 6)

https://artofproblemsolving.com/wiki/index.php/AMC_historica...

stolencode•4h ago
Isn't the test taken only by students under the age of 12?

Meanwhile the model is trained on these specific types of problems, does not have an apparent time or resource limit, and does not have to take the test in a proctored environment.

It's D- work. Compared to a 12 year old, okay, maybe it's B+. Is this really the point you wanted to make?

jpcompartir•1h ago
This is a nonsense critique.

Modest results are worth publishing, as are bad results.

markisus•7h ago
Something seems off with equation (5).

Just imagining Monte Carlo sampling it, the middle expectation will have a bunch of zeros due to the indicator function and the right expectation won’t.

I can make the middle expectation be as close to zero as I like by making the success threshold sufficiently high.
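
The thresholding point is easy to check numerically (a generic Monte Carlo toy, not the paper's actual equation (5)): the estimate of E[1{R >= tau} * R] collapses toward zero as the success threshold tau grows, while the plain E[R] term is unaffected.

```python
import random

random.seed(0)
# Rewards drawn from a fixed distribution; only the threshold tau changes.
samples = [random.gauss(0.5, 0.2) for _ in range(100_000)]

def indicator_mean(tau):
    # Monte Carlo estimate of E[1{R >= tau} * R]: most terms are zero.
    return sum(r for r in samples if r >= tau) / len(samples)

plain_mean = sum(samples) / len(samples)     # E[R], no indicator
low_tau = indicator_mean(0.5)                # keeps about half the samples
high_tau = indicator_mean(1.2)               # 3.5 sigma out: almost all zeros
```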

M8.7 earthquake in Western Pacific, tsunami warning issued

https://earthquake.usgs.gov/earthquakes/eventpage/us6000qw60/executive
690•jandrewrogers•9h ago•179 comments

Study mode

https://openai.com/index/chatgpt-study-mode/
940•meetpateltech•17h ago•669 comments

RIP Shunsaku Tamiya, the man who made plastic model kits a global obsession

https://JapaneseNostalgicCar.com/rip-shunsaku-tamiya-plastic-model-kits/
288•fidotron•13h ago•60 comments

Launch HN: Hyprnote (YC S25) – An open-source AI meeting notetaker

207•yujonglee•17h ago•115 comments

URL-Driven State in HTMX

https://www.lorenstew.art/blog/bookmarkable-by-design-url-state-htmx/
202•lorenstewart•12h ago•98 comments

A major AI training data set contains millions of examples of personal data

https://www.technologyreview.com/2025/07/18/1120466/a-major-ai-training-data-set-contains-millions-of-examples-of-personal-data/
10•pera•24m ago•2 comments

iPhone 16 cameras vs. traditional digital cameras

https://candid9.com/phone-camera/
315•sergiotapia•20h ago•330 comments

Sleep all comes down to the mitochondria

https://www.science.org/content/blog-post/it-all-comes-down-mitochondria
27•A_D_E_P_T•1h ago•5 comments

Show HN: I built a free backlink exchange marketplace

https://launchigniter.com/link-exchange
3•maulikdhameliya•1h ago•0 comments

Learning basic electronics by building fireflies

http://a64.in/posts/learning-basic-electronics-by-building-fireflies/
269•signa11•17h ago•69 comments

Two Birds with One Tone: I/Q Signals and Fourier Transform

https://wirelesspi.com/two-birds-with-one-tone-i-q-signals-and-fourier-transform-part-1/
73•teleforce•11h ago•16 comments

ACM Transitions to Full Open Access

https://www.acm.org/publications/openaccess
265•pcvarmint•17h ago•24 comments

Show HN: The Aria Programming Language

https://github.com/egranata/aria
5•egranata_aria•3d ago•4 comments

Analoguediehard

http://www.analoguediehard.com/
24•gregsadetsky•3d ago•4 comments

Show HN: Cant, rust nn lib for learning

https://github.com/TuckerBMorgan/can-t
11•TuckerBMorgan•3d ago•0 comments

USB-C for Lightning iPhones

https://obsoless.com/products/iph0n3-usb-c-protection-case
149•colinprince•3d ago•102 comments

How the brain increases blood flow on demand

https://hms.harvard.edu/news/how-brain-increases-blood-flow-demand
124•gmays•15h ago•57 comments

FoundationDB: From idea to Apple acquisition [video]

https://www.youtube.com/watch?v=C1nZzQqcPZw
181•zdw•4d ago•34 comments

Show HN: Terminal-Bench-RL: Training long-horizon terminal agents with RL

https://github.com/Danau5tin/terminal-bench-rl
115•Danau5tin•23h ago•10 comments

Irrelevant facts about cats added to math problems increase LLM errors by 300%

https://www.science.org/content/article/scienceadviser-cats-confuse-ai
413•sxv•19h ago•200 comments

Show HN: I built an AI that turns any book into a text adventure game

https://www.kathaaverse.com/
253•rcrKnight•18h ago•100 comments

A month using XMPP (using Snikket) for every call and chat (2023)

https://neilzone.co.uk/2023/08/a-month-using-xmpp-using-snikket-for-every-call-and-chat/
118•ColinWright•15h ago•74 comments

My 2.5 year old laptop can write Space Invaders in JavaScript now (GLM-4.5 Air)

https://simonwillison.net/2025/Jul/29/space-invaders/
533•simonw•20h ago•356 comments

Elements of System Design

https://github.com/jarulraj/periodic-table
129•qianli_cs•16h ago•34 comments

Structuring large Clojure codebases with Biff

https://biffweb.com/p/structuring-large-codebases/
81•PaulHoule•19h ago•4 comments

Observable Notebooks 2.0 Technology Preview

https://observablehq.com/notebook-kit/
213•mbostock•19h ago•51 comments

Playing with more user-friendly methods for multi-factor authentication

https://tesseral.com/blog/i-designed-some-more-user-friendly-methods-for-multi-factor-authentication
74•noleary•1d ago•52 comments

Microsoft Flight Simulator 2024: WebAssembly SDK

https://docs.flightsimulator.com/msfs2024/html/6_Programming_APIs/WASM/WebAssembly.htm
137•breve•3d ago•84 comments

CodeCrafters (YC S22) is hiring first Marketing Person

https://www.ycombinator.com/companies/codecrafters/jobs/7ATipKJ-1st-marketing-hire
1•sarupbanskota•12h ago