Language Models Can Learn from Verbal Feedback Without Scalar Rewards
Abstract
Feedback-conditional policy (FCP) enables LLMs to learn from verbal feedback by treating it as a conditioning signal, improving expressiveness over scalar rewards.
LLMs are often trained with RL from human or AI feedback, yet such methods typically compress nuanced feedback into scalar rewards, discarding much of its richness and inducing scale imbalance. We propose treating verbal feedback as a conditioning signal. Inspired by language priors in text-to-image generation, which enable novel outputs from unseen prompts, we introduce the feedback-conditional policy (FCP). FCP learns directly from response-feedback pairs, approximating the feedback-conditional posterior through maximum likelihood training on offline data. We further develop an online bootstrapping stage where the policy generates under positive conditions and receives fresh feedback to refine itself. This reframes feedback-driven learning as conditional generation rather than reward optimization, offering a more expressive way for LLMs to learn directly from verbal feedback. Our code is available at https://github.com/sail-sg/feedback-conditional-policy.
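For concreteness, below is a minimal sketch of how the offline FCP stage could look in code: maximum-likelihood training on response tokens conditioned on the verbal feedback and the prompt, followed by generation under a positive feedback condition as in the online bootstrapping stage. The model name, prompt template, and toy (feedback, prompt, response) triple are illustrative assumptions rather than the authors' exact setup; see the linked repository for the actual implementation.

```python
# Hypothetical sketch of offline FCP training: maximize log-likelihood of the
# response given (verbal feedback, prompt). Model, template, and data are
# illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # assumption: any causal LM could stand in here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

# One (feedback, prompt, response) triple; in practice these come from an
# offline dataset of responses annotated with verbal feedback.
feedback = "Correct and well explained."
prompt = "What is 17 * 3?"
response = "17 * 3 = 51."

# Condition = feedback + prompt; the policy is trained to reproduce the response
# given this condition, i.e. MLE on the feedback-conditional posterior.
condition = f"[FEEDBACK] {feedback}\n[PROMPT] {prompt}\n[RESPONSE] "
cond_ids = tok(condition, return_tensors="pt").input_ids
resp_ids = tok(response + tok.eos_token, return_tensors="pt",
               add_special_tokens=False).input_ids

input_ids = torch.cat([cond_ids, resp_ids], dim=1)
labels = input_ids.clone()
labels[:, : cond_ids.shape[1]] = -100  # compute loss on response tokens only

loss = model(input_ids=input_ids, labels=labels).loss  # negative log-likelihood
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Online bootstrapping (sketch): condition on a *positive* feedback string,
# sample a response, then collect fresh feedback on it to refine the policy.
positive_condition = f"[FEEDBACK] Correct and well explained.\n[PROMPT] {prompt}\n[RESPONSE] "
pos = tok(positive_condition, return_tensors="pt")
gen = model.generate(**pos, max_new_tokens=32)
print(tok.decode(gen[0][pos.input_ids.shape[1]:], skip_special_tokens=True))
```

The key design choice this illustrates is that feedback enters only as part of the conditioning context: training reduces to ordinary next-token cross-entropy, with no scalar reward or value model involved.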
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Variational Reasoning for Language Models (2025)
- Online Process Reward Learning for Agentic Reinforcement Learning (2025)
- Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models (2025)
- PromptCoT 2.0: Scaling Prompt Synthesis for Large Language Model Reasoning (2025)
- Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision (2025)
- RLBFF: Binary Flexible Feedback to Bridge between Human Feedback & Verifiable Rewards (2025)
- Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization (2025)