Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models
Abstract
Asynchronous generation and learning in Reinforcement Learning from Human Feedback (RLHF) improve training speed and computational efficiency while maintaining performance, particularly with larger policy models.
The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
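The one-step off-policy schedule described in the abstract can be illustrated with a small deterministic simulation. This is only a sketch under stated assumptions, not the authors' implementation: `Batch`, `generate`, `train_step`, and `run_async_schedule` are hypothetical names, generation and training are stand-ins, and the two calls that would run concurrently in a real system are executed one after the other here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    """A batch of samples tagged with the policy version that generated it."""
    policy_version: int
    samples: List[str]


def generate(policy_version: int, n: int = 2) -> Batch:
    # Stand-in for LLM sampling (hypothetical); a real system would call a generation engine.
    return Batch(policy_version, [f"sample-v{policy_version}-{i}" for i in range(n)])


def train_step(policy_version: int, batch: Batch) -> int:
    # Stand-in for a gradient update (hypothetical); returns the new policy version.
    return policy_version + 1


def run_async_schedule(num_steps: int) -> List[int]:
    """Simulate asynchronous RLHF and return the staleness seen at each update.

    Staleness is the number of policy updates between generating a batch
    and training on it (0 means fully on-policy).
    """
    version = 0
    pending = generate(version)  # the first batch comes from the initial policy
    staleness = []
    for _ in range(num_steps):
        # In a real asynchronous setup, these two operations would overlap in time:
        next_batch = generate(version)                      # generate with the current policy
        staleness.append(version - pending.policy_version)  # how old the training batch is
        version = train_step(version, pending)              # train on the older batch
        pending = next_batch
    return staleness


print(run_async_schedule(4))
```

After the first update, every batch is exactly one policy version old when it is trained on, which is the "online but off-policy" regime the abstract refers to.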
Community
Asynchronous RLHF! A faster, more efficient paradigm for language model and RL training.
Standard RLHF is forced to be synchronous: online, on-policy RL. To take advantage of efficient LLM generation libraries (e.g., vLLM), we put generation and training on separate GPUs. This makes training off-policy but allows us to achieve big speedups. These speedups increase with scale, and performance is matched!
paper: https://arxiv.org/abs/2410.18252
code: https://github.com/mnoukhov/async_rlhf
hf collection: https://huggingface.co/collections/mnoukhov/asynchronous-rlhf-6717bee31de7be3bcb0ce800
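As a rough illustration of putting generation and training on separate workers, the sketch below uses two Python threads and a bounded queue in place of separate GPUs. It is an assumption-laden toy, not the code from the linked repository: `run_pipeline`, the `state` dictionary, and the string payloads are all hypothetical, and no model is actually sampled or updated.

```python
import queue
import threading
from typing import List


def run_pipeline(num_batches: int = 8, max_queue: int = 1) -> List[int]:
    """Run a toy generator/trainer pair and return the staleness of each trained batch."""
    batches: queue.Queue = queue.Queue(maxsize=max_queue)  # bounded, so the generator cannot run far ahead
    state = {"version": 0}      # policy version; only the trainer thread writes it
    staleness: List[int] = []

    def generator() -> None:
        for _ in range(num_batches):
            v = state["version"]                      # snapshot of the policy used for sampling
            batches.put((v, f"completions from policy v{v}"))  # blocks while the queue is full
        batches.put(None)                             # sentinel: no more data

    def trainer() -> None:
        while True:
            item = batches.get()
            if item is None:
                break
            gen_version, _ = item
            staleness.append(state["version"] - gen_version)  # how off-policy this batch is
            state["version"] += 1                              # stand-in for a gradient update

    threads = [threading.Thread(target=generator), threading.Thread(target=trainer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return staleness


print(run_pipeline())
```

The exact staleness values depend on thread scheduling, but the bounded queue caps them: a batch can wait behind at most `max_queue` queued batches plus the one currently being trained on, so staleness never exceeds `max_queue + 1`. Shrinking the queue trades generator idle time for fresher data.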
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment (2024)
- Generative Reward Models (2024)
- Policy Filtration in RLHF to Fine-Tune LLM for Code Generation (2024)
- MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions (2024)
- On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization (2024)