Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
Abstract
Entropy-Modulated Policy Gradients (EMPG) addresses learning dynamics issues in LLMs by recalibrating policy gradients based on uncertainty and task outcomes, leading to improved performance in long-horizon tasks.
In long-horizon tasks, recent agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques like inverse reinforcement learning or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of policy gradients is inherently coupled with the policy's entropy, leading to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a bonus term for future clarity that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy-gradient baselines. The project page is at https://empgseed-seed.github.io/
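In rough terms (the notation below is an illustrative sketch, not the paper's exact formulation): if $H_t$ denotes the policy's entropy at step $t$ and $A$ the outcome-based advantage of the trajectory, an EMPG-style modulation rescales the per-step learning signal as

$$\tilde{A}_t = g(H_t)\,A, \qquad g \text{ monotonically decreasing in } H_t,$$

so that confident steps (low $H_t$) receive larger updates, amplifying correct actions when $A>0$ and penalizing errors more strongly when $A<0$, while uncertain steps (high $H_t$) are attenuated. For instance, $g(H_t)=\exp(-H_t/\bar{H})$ with $\bar{H}$ a batch-average entropy would have this shape.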
Community
Excited to share our latest paper: "Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents"
Credit assignment with sparse rewards is a huge challenge in long-horizon tasks. We identify & solve a fundamental issue in policy gradients: the coupling of update magnitude and policy entropy, which leads to inefficient and unstable learning.
We introduce EMPG, a framework that recalibrates the learning signal using the agent's own uncertainty. Compared with GRPO and DAPO, it achieves promising gains on agent benchmarks like WebShop, ALFWorld, & Deep Search!
Project Page: https://empgseed-seed.github.io/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- HAEPO: History-Aggregated Exploratory Policy Optimization (2025)
- GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy (2025)
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR (2025)
- COPO: Consistency-Aware Policy Optimization (2025)
- Agentic Reinforced Policy Optimization (2025)
- Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning (2025)
- Learning to Deliberate: Meta-policy Collaboration for Agentic LLMs with Multi-agent Reinforcement Learning (2025)
Thanks, very interesting
EMPG Update! We've validated our method on the Seed-1.6-Thinking model (a frontier LLM) on tool-use agent tasks, outperforming the GRPO baseline.
On a combined benchmark (GAIA, BrowseComp, HLE, Tau-Bench), EMPG boosted accuracy from 39.2% to 41.0%!
In particular:
Tau-Bench-Airline: 57.0% -> 59.0%
Tau-Bench-Retail: 69.6% -> 75.2%
The integration is incredibly simple: one small function addition in verl's core_algo.py.
Try it out to stabilize your RL fine-tuning!
Details & easy code snippet on the Project Page. [Link: https://empgseed-seed.github.io]
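For illustration, here is a minimal sketch of what such an advantage-modulation hook could look like (the function name, argument names, and hyperparameter are hypothetical; the exact formulation, normalization, and constants follow the paper and the snippet on the project page, not this sketch):

```python
import torch

def modulate_advantages_by_entropy(advantages: torch.Tensor,
                                    step_entropies: torch.Tensor,
                                    alpha: float = 1.0) -> torch.Tensor:
    """Entropy-modulated advantage re-weighting (illustrative sketch).

    advantages:     per-step outcome-based advantages, shape [batch, steps]
    step_entropies: per-step policy entropies, same shape
    alpha:          modulation strength (hypothetical hyperparameter)

    Confident steps (low entropy) receive a larger weight, so correct actions
    are amplified and confident errors are penalized more strongly; uncertain
    steps (high entropy) are attenuated.
    """
    # Normalize entropy by its batch mean so the weighting is self-calibrated.
    mean_entropy = step_entropies.mean().clamp_min(1e-8)
    weights = torch.exp(-alpha * step_entropies / mean_entropy)
    # Rescale so the average update magnitude stays roughly unchanged.
    weights = weights / weights.mean().clamp_min(1e-8)
    return advantages * weights
```

Dropping a function like this into the advantage computation (e.g., after GRPO-style group normalization, multiplying the advantages before the policy loss) is the kind of one-function change described above.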
