That said, I’m unclear how much this helps in practice; we don’t usually sift through, say, 32 responses from our 2B parameter models. I guess if you instrumented parallel reasoning processes in batch, this might be helpful. Perhaps that’s what o1-pro is doing in the background, actually.
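For concreteness, here’s roughly what I mean by “instrumented in batch”: a minimal best-of-N / self-consistency sketch against an OpenAI-compatible API. The model name, N=32, temperature, and the answer-extraction regex are all placeholders I made up, not anything o1-pro is confirmed to do.

```python
# Minimal best-of-N / self-consistency sketch: sample N completions in one
# batched request, extract each final answer, and majority-vote over them.
# Assumes an OpenAI-compatible endpoint; all concrete names are placeholders.
import os
import re
from collections import Counter

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

N = 32  # number of parallel samples per prompt
prompt = (
    "What is 17 * 24? Think step by step, "
    "then give the final answer as 'Answer: <number>'."
)

# One batched request returns N independent samples (temperature > 0 so they differ).
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": prompt}],
    n=N,
    temperature=0.8,
)

def extract_answer(text: str) -> str | None:
    # Crude heuristic: take the last 'Answer: ...' span in the completion.
    matches = re.findall(r"Answer:\s*([^\n]+)", text)
    return matches[-1].strip() if matches else None

answers = [extract_answer(c.message.content) for c in resp.choices]
votes = Counter(a for a in answers if a is not None)
best, count = votes.most_common(1)[0]
print(f"Majority answer: {best} ({count}/{N} samples agree)")
```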
Anyway, this one seems to me like it might make its way onto the “good idea” list once RL is available in the training pipeline.
codelion•15h ago
Cerebras has used optillm for optimising inference with techniques like CePO and LongCePO.
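If anyone wants to try it, the rough shape of a call through the optillm proxy looks like this. This is only a sketch: it assumes a locally running proxy on port 8000 and the approach-prefix convention from the optillm README; the model name and key handling are placeholders.

```python
# Sketch of routing a request through a local optillm proxy, assuming it is
# running on localhost:8000 and that an approach is selected by prefixing the
# model name (e.g. "cepo-"). Model name, port, and API key are placeholders.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY", "optillm"),  # proxy may not need a real key
    base_url="http://localhost:8000/v1",                  # optillm's OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="cepo-llama-3.3-70b",  # "cepo-" prefix selects the CePO approach (assumed convention)
    messages=[{"role": "user", "content": "Prove that the sum of two even numbers is even."}],
)
print(response.choices[0].message.content)
```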
peepeepoopoo114•15h ago