Reinforcing Diffusion Models by Direct Group Preference Optimization
Abstract
DGPO, a new online RL algorithm, enhances diffusion models by learning from group-level preferences, enabling the use of efficient deterministic ODE samplers and achieving faster training and superior performance.
While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by employing inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which exploit the relative ordering of samples within each group. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive experiments show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
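To make the high-level idea concrete, the sketch below shows what a group-preference training step with a deterministic ODE sampler might look like. All names (`ode_sample`, `reward_fn`, `denoising_loss`) are hypothetical placeholders rather than the paper's actual interface, and the advantage-weighted objective is one simple way to use group-relative preferences, not necessarily DGPO's exact loss.

```python
import torch

def group_preference_step(model, ode_sample, reward_fn, prompts,
                          group_size=8, beta=1.0):
    """Illustrative (not the paper's) group-preference update.

    `model`, `ode_sample`, and `reward_fn` are assumed placeholders:
    a trainable diffusion model, a deterministic ODE sampler, and a
    scalar reward model, respectively.
    """
    losses = []
    for prompt in prompts:
        # 1. Generate a group of samples with a *deterministic* ODE
        #    sampler; no stochastic policy is required.
        images = [ode_sample(model, prompt) for _ in range(group_size)]

        # 2. Score the group and normalize rewards within it, so only
        #    relative (group-level) preference information remains.
        rewards = torch.tensor([reward_fn(img, prompt) for img in images])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

        # 3. Weight a per-sample training objective (here a placeholder
        #    denoising loss under the current model) by the group-relative
        #    advantage: preferred samples are reinforced, dispreferred
        #    samples are suppressed.
        for img, adv in zip(images, advantages):
            per_sample_loss = model.denoising_loss(img, prompt)  # placeholder
            losses.append(beta * adv * per_sample_loss)

    return torch.stack(losses).mean()
```

Minimizing this advantage-weighted loss lowers the objective on above-average samples and raises it on below-average ones, which is the sense in which learning proceeds from relative information within a group rather than from absolute reward values.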
Community
DGPO is a new online RL algorithm that learns directly from group-level preferences instead of employing the policy-gradient framework.
Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- G²RPO: Granular GRPO for Precise Reward in Flow Models (2025)
- Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models (2025)
- DiffusionNFT: Online Diffusion Reinforcement with Forward Process (2025)
- Towards Better Optimization For Listwise Preference in Diffusion Models (2025)
- Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance (2025)
- Coefficients-Preserving Sampling for Reinforcement Learning with Flow Matching (2025)
- Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning (2025)