A recurring issue in on-policy RL for LLMs is GPU under-utilization while actors wait for weight syncs from the learner. PipelineRL uses in-flight weight updates: actors keep sampling while the learner updates weights, which reduces policy lag without stalling the pipeline.
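For intuition, here's a minimal toy sketch of the idea (hypothetical names, not PipelineRL's actual code): a learner thread publishes versioned weight snapshots while actor threads keep sampling and pick up the newest version between generation chunks instead of blocking on a synchronous weight sync.

```python
# Toy sketch of in-flight weight updates; names are made up for illustration.
import threading
import time

class WeightStore:
    """Versioned weight snapshot shared between the learner and the actors."""
    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = weights
        self._version = 0

    def publish(self, weights):
        # Learner pushes a new version without stopping the actors.
        with self._lock:
            self._weights = weights
            self._version += 1

    def latest(self):
        with self._lock:
            return self._weights, self._version

def learner(store, steps=5):
    for step in range(1, steps + 1):
        time.sleep(0.05)                  # stand-in for an optimizer step
        store.publish({"step": step})     # in-flight update: no actor barrier

def actor(store, actor_id, chunks=10):
    for chunk in range(chunks):
        weights, version = store.latest() # grab newest weights between chunks
        time.sleep(0.01)                  # stand-in for sampling a chunk of tokens
        print(f"actor {actor_id} chunk {chunk} used weight version {version}")

if __name__ == "__main__":
    store = WeightStore({"step": 0})
    threads = [threading.Thread(target=learner, args=(store,))]
    threads += [threading.Thread(target=actor, args=(store, i)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The point of the toy is just the scheduling: actors never wait on the learner, so the policy they sample from is at most a version or two stale rather than the whole system stalling for a sync.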
In practice this gives ~2× wall-clock speedups on large models.
A paper on the approach was recently accepted to TMLR and discusses policy-lag bounds in more detail.
muchomuchach0•1h ago