Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

58•mfiguiere•11h ago

Comments

justanotheratom•10h ago

Is Best-of-N Sampling standard practice these days in Inference? Sounds expensive on the face of it. I am surprised because I thought the trend was towards cheaper inference.
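For readers unfamiliar with the term, here is a minimal sketch of what Best-of-N sampling does at inference time, which also shows where the cost comes from: N generations plus N reward-model calls per prompt. The names `best_of_n`, `toy_generate`, and `toy_reward` are hypothetical stand-ins for illustration, not an API from the paper or any library.

```python
import itertools
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int,
) -> Tuple[str, List[float]]:
    """Sample n candidates and return the highest-scoring one with all scores."""
    # n separate generations: this is the part that multiplies inference cost.
    candidates = [generate(prompt) for _ in range(n)]
    # One reward (verifier) call per candidate.
    scores = [reward(prompt, y) for y in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index], scores


# Toy stand-ins (assumptions): a "model" that cycles through fixed outputs
# and a "reward" that simply prefers longer strings.
_toy_outputs = itertools.cycle(["a", "abc", "ab"])


def toy_generate(prompt: str) -> str:
    return next(_toy_outputs)


def toy_reward(prompt: str, y: str) -> float:
    return float(len(y))


best, scores = best_of_n("example prompt", toy_generate, toy_reward, n=3)
```

In a real system `generate` would be a sampled LLM call and `reward` a reward model or verifier; the selection step itself stays this simple.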

diwank•10h ago

For reasoning models, this would actually improve exploration efficiency and hence possibly allow higher performance for the same compute budget. As in, if you want to sample multiple rollouts for the same prompt, it's more efficient if the model is able to produce diverse thought directions and consider them to find the best response, as opposed to going down similar trajectories and wasting compute.

codelion•9h ago

Not standard, but one of several techniques; you can see them in our open-source inference proxy: https://github.com/codelion/optillm

Cerebras has used optillm for optimising inference with techniques like CePO and LongCePO.

peepeepoopoo114•9h ago

Almost all of the efficiency gains have come from shedding bit precision, but the problem is that AI labs are now running out of bits to shed. The move to reduced precision inference has been masking the insane unsustainability of compute scaling as a model improvement paradigm.

karmasimida•5h ago

Isn't the BoN RL formulation similar to DeepSeek's GRPO algorithm? The latter seems to have implicitly captured this already.

Johnyhar•4h ago

Wouldn't RL training, with the goal of aligning the LLM with the reward function R(x, y), result in the outputs of the trained LLM maximizing said reward function? How different are the rewards of the N outputs in BoN sampling, to justify its cost?
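One way to make the question concrete is a toy simulation: if the N rewards barely differ, picking the best one gains nothing, and the gain grows with the spread of rewards across samples. The sketch below assumes rewards are independent Gaussian draws, which is purely an illustrative assumption and not something taken from the paper; `expected_best_of_n_gain` is a hypothetical helper.

```python
import random
import statistics


def expected_best_of_n_gain(
    n: int, reward_std: float, trials: int = 2000, seed: int = 0
) -> float:
    """Estimate the mean of (best of n rewards) minus (a single sample's reward).

    Assumption: rewards for one prompt are i.i.d. Gaussian with the given
    standard deviation. Real reward distributions will differ.
    """
    rng = random.Random(seed)
    gains = []
    for _ in range(trials):
        rewards = [rng.gauss(0.0, reward_std) for _ in range(n)]
        # rewards[0] plays the role of ordinary single-sample decoding.
        gains.append(max(rewards) - rewards[0])
    return statistics.fmean(gains)


# A policy whose samples all score the same gains nothing from BoN;
# a policy with spread in its rewards gains more as n grows.
no_spread_gain = expected_best_of_n_gain(n=8, reward_std=0.0)
small_n_gain = expected_best_of_n_gain(n=2, reward_std=1.0)
large_n_gain = expected_best_of_n_gain(n=8, reward_std=1.0)
```

Under this toy model, a policy that RL has collapsed onto near-identical high-reward outputs leaves little for BoN to select from, which is the tension the question points at.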

padolsey•3h ago

I wish they had some example completions in the paper and not just eval results. It would be really useful to see if there are any emergent linguistic tilts to the newly diverse responses...

vessenes•10m ago

Nice idea. Essentially, adding differentiability to the best-of-N choice lets them encourage models to add some diversity “naturally”. The Gemma 2B results indicate it’s probably worth trying this on larger models.

That said, I’m unclear how much this helps in practice; we don’t usually parse through say 32 responses from our 2B parameter models. I guess if you instrumented parallel reasoning processes in batch this might be helpful. Perhaps that’s what o1-pro is doing in the background, actually.

Anyway, this one seems to me like it might make its way onto the “good idea” list when RL is available in the training pipeline.
