
OPPO: Accelerating PPO-based RLHF via Pipeline Overlap

About

Proximal Policy Optimization (PPO)-based reinforcement learning from human feedback (RLHF) is a widely adopted paradigm for aligning large language models (LLMs) with human preferences. However, its training pipeline suffers from substantial inefficiencies due to sequential multi-model dependencies (e.g., the reward model depends on actor outputs) and long-tail response lengths, where a few long responses stall completion of an entire stage. We present OPPO, a novel, lightweight, and model-agnostic PPO-based RLHF framework that improves training efficiency by overlapping pipeline execution. OPPO introduces two novel techniques: (1) intra-step overlap, which streams upstream model outputs (e.g., from the actor model) in right-sized chunks, enabling the downstream model (e.g., the reward model) to begin prefill while the upstream continues decoding; and (2) inter-step overlap, which adaptively overcommits a few prompts and defers long generations to future steps, mitigating tail latency without discarding partial work. OPPO integrates easily with existing PPO implementations via a lightweight wrapper. Extensive evaluations show that OPPO accelerates PPO-based RLHF training by $1.8\times$--$2.8\times$ and improves GPU utilization by $1.4\times$--$2.1\times$ without compromising training convergence.
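The intra-step overlap idea can be illustrated with a minimal producer/consumer sketch: the actor streams decoded token chunks into a queue, and the reward model starts consuming (prefilling) each chunk as soon as it arrives rather than waiting for the full response. All names and structures below are hypothetical stand-ins, not OPPO's actual implementation.

```python
import queue
import threading

def actor_decode(prompt, chunk_size, out_q):
    """Simulate chunked decoding: emit right-sized token chunks as produced."""
    tokens = [f"{prompt}-tok{i}" for i in range(10)]  # stand-in for decoding
    for i in range(0, len(tokens), chunk_size):
        out_q.put(tokens[i:i + chunk_size])  # stream a chunk downstream
    out_q.put(None)  # end-of-stream marker

def reward_prefill(in_q, prefilled):
    """Begin prefill on each chunk while the actor is still decoding."""
    while (chunk := in_q.get()) is not None:
        prefilled.extend(chunk)  # stand-in for incremental KV-cache prefill

q = queue.Queue()
prefilled = []
consumer = threading.Thread(target=reward_prefill, args=(q, prefilled))
consumer.start()           # reward-model side runs concurrently
actor_decode("p0", chunk_size=4, out_q=q)
consumer.join()
# By the time decoding finishes, the downstream model has already
# processed every streamed chunk instead of starting from scratch.
```

In a real system the queue would carry hidden states or token IDs and the consumer would issue actual prefill kernels; the point of the sketch is only the pipelining pattern that hides downstream prefill latency behind upstream decoding.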

Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai• 2025

Related benchmarks

Task                      Dataset                          Result                         Rank
Question Answering        ARC-Challenge 0-shot (test)      Accuracy: 55.8                 39
Mathematical Reasoning    GSM8K 5-shot (test)              Strict Match Accuracy: 82.79   37
Question Answering        ARC-E 0-shot                     Accuracy: 81.36                33
Commonsense Reasoning     HellaSwag 0-shot (test)          Accuracy (0-shot): 80.79       4
RLHF Training             Standard RLHF Workload (train)   Mean Latency (s): 99.84        4
Question Answering        TruthfulQA MC2 0-shot (test)     Accuracy: 64.03                4
End-to-end Step Latency   Stack-Exchange-Paired            Mean Latency (s): 111.1        2
