GRPO Trainer
Overview
TRL supports the GRPO Trainer for training language models, as described in the paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models by Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, Daya Guo.
The abstract from the paper is the following:
Mathematical reasoning poses a significant challenge for language models due to its complex and structured nature. In this paper, we introduce DeepSeekMath 7B, which continues pre-training DeepSeek-Coder-Base-v1.5 7B with 120B math-related tokens sourced from Common Crawl, together with natural language and code data. DeepSeekMath 7B has achieved an impressive score of 51.7% on the competition-level MATH benchmark without relying on external toolkits and voting techniques, approaching the performance level of Gemini-Ultra and GPT-4. Self-consistency over 64 samples from DeepSeekMath 7B achieves 60.9% on MATH. The mathematical reasoning capability of DeepSeekMath is attributed to two key factors: First, we harness the significant potential of publicly available web data through a meticulously engineered data selection pipeline. Second, we introduce Group Relative Policy Optimization (GRPO), a variant of Proximal Policy Optimization (PPO), that enhances mathematical reasoning abilities while concurrently optimizing the memory usage of PPO.
This post-training method was contributed by Quentin Gallouédec.
Quick start
This example demonstrates how to train a model using the GRPO method. We train a Qwen 0.5B Instruct model with the prompts from the UltraFeedback prompts dataset (trl-lib/ultrafeedback-prompt).
Below is the script to train the model.
# train_grpo.py
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
Execute the script using the following command:
accelerate launch train_grpo.py
Distributed across 8 GPUs, the training takes approximately 1 day.
Looking deeper into the GRPO method
GRPO is an online learning algorithm, meaning it improves iteratively by using the data generated by the trained model itself during training. The intuition behind the GRPO objective is to maximize the advantage of the generated completions while ensuring that the model remains close to the reference policy. GRPO can be broken down into four main steps: generating completions, computing the advantage, estimating the KL divergence, and computing the loss.
Generating completions
At each training step, we sample a batch of prompts and generate a set of $G$ completions for each prompt (denoted as $o_i$).
Computing the advantage
For each of the $G$ sequences, we compute the reward using a reward model or reward function. To align with the comparative nature of reward models (typically trained on datasets of comparisons between outputs for the same question), the advantage is calculated to reflect these relative comparisons. It is normalized as follows:
$$ \hat{A}_{i,t} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})} $$
This approach gives the method its name: Group Relative Policy Optimization (GRPO).
It was shown in the paper Understanding R1-Zero-Like Training: A Critical Perspective that scaling by $\text{std}(\mathbf{r})$ may cause a question-level difficulty bias. You can disable this scaling by setting scale_rewards=False in GRPOConfig.
Tip: [Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO)](https://huggingface.co/papers/2508.08221) showed that calculating the mean at the local (group) level and the standard deviation at the global (batch) level enables more robust reward shaping. You can use this scaling strategy by setting scale_rewards="batch" in GRPOConfig.
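The three scaling strategies described above can be sketched in plain Python. This is only an illustration of the arithmetic, not the trainer's implementation: the function name, the use of the population standard deviation, and the small `eps` added to the denominator are all assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, num_generations, scale_rewards="group", eps=1e-4):
    """Return one advantage per completion.

    `rewards` is a flat list in which each consecutive run of
    `num_generations` entries belongs to the same prompt.
    """
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations")

    groups = [rewards[i:i + num_generations] for i in range(0, len(rewards), num_generations)]
    batch_std = pstdev(rewards)  # only used when scale_rewards == "batch"

    advantages = []
    for group in groups:
        group_mean = mean(group)  # the mean is always taken within the group
        if scale_rewards == "group":
            denom = pstdev(group) + eps
        elif scale_rewards == "batch":
            denom = batch_std + eps
        else:  # "none" or False: center the rewards without scaling them
            denom = 1.0
        advantages.extend((r - group_mean) / denom for r in group)
    return advantages


rewards = [1.0, 0.0, 1.0, 0.0,   # prompt A: mixed outcomes
           1.0, 1.0, 1.0, 0.0]   # prompt B: mostly correct
print(group_relative_advantages(rewards, num_generations=4))
print(group_relative_advantages(rewards, num_generations=4, scale_rewards="batch"))
print(group_relative_advantages(rewards, num_generations=4, scale_rewards="none"))
```

In every mode the advantages within a group sum to (approximately) zero, which is what makes the signal relative to the group rather than absolute.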
Estimating the KL divergence
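KL divergence is estimated using the approximator introduced by Schulman et al. (2020). The approximator is defined as follows:
$$ \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] = \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{\text{ref}}(o_{i,t} \mid q, o_{i,<t})}{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} - 1 $$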
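To make the estimator concrete, the sketch below evaluates it for a single token from two log-probabilities. The function name is hypothetical, and a real implementation would operate on tensors of per-token log-probabilities rather than Python floats.

```python
import math


def k3_kl_estimate(logp, ref_logp):
    """Per-token KL estimate from the log-probabilities of the sampled token.

    logp     : log pi_theta(o_t | q, o_<t)
    ref_logp : log pi_ref(o_t | q, o_<t)
    """
    log_ratio = ref_logp - logp  # log(pi_ref / pi_theta)
    return math.exp(log_ratio) - log_ratio - 1.0


# Identical policies give exactly zero.
print(k3_kl_estimate(-1.2, -1.2))  # 0.0
# Any mismatch gives a strictly positive value, because exp(x) - x - 1 >= 0.
print(k3_kl_estimate(-1.0, -1.5))
print(k3_kl_estimate(-1.5, -1.0))
```

Unlike the naive estimate `logp - ref_logp`, this form can never be negative for an individual token, which keeps the penalty term well behaved.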
Computing the loss
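The objective is to maximize the advantage while ensuring that the model remains close to the reference policy. Consequently, the loss is defined as follows:
$$ \mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,<t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \, \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right] $$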
where the first term represents the scaled advantage and the second term penalizes deviations from the reference policy through KL divergence.
Note that compared to the original formulation in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, we don't scale by $\frac{1}{|o_i|}$ because it was shown in the paper Understanding R1-Zero-Like Training: A Critical Perspective that this introduces a response-level length bias. More details in loss types.
Note that compared to the original formulation in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, we use $\beta = 0.0$ by default, meaning that the KL divergence term is not used. This choice is motivated by several recent studies (e.g., Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model) which have shown that the KL divergence term is not essential for training with GRPO. As a result, it has become common practice to exclude it (e.g., Understanding R1-Zero-Like Training: A Critical Perspective, DAPO: An Open-Source LLM Reinforcement Learning System at Scale). If you wish to include the KL divergence term, you can set beta in GRPOConfig to a non-zero value.
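In the original paper, this formulation is generalized to account for multiple updates after each generation (denoted $\mu$, can be set with num_iterations in GRPOConfig) by leveraging the clipped surrogate objective:
$$ \mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} \left[ \min \left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t}, \, \text{clip}\left( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \right) \hat{A}_{i,t} \right) - \beta \, \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right] \right] $$
where $\text{clip}(\cdot, 1 - \epsilon, 1 + \epsilon)$ ensures that updates do not deviate excessively from the old policy $\pi_{\theta_{\text{old}}}$ by bounding the policy ratio between $1 - \epsilon$ and $1 + \epsilon$. When $\mu = 1$ (default in TRL), the clipped surrogate objective simplifies to the original objective.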
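The effect of clipping is easiest to see on a single token. The following is a minimal sketch under stated assumptions: the function name is hypothetical, the KL penalty is left out (as with the default of a zero beta), and `epsilon=0.2` is just an example value.

```python
import math


def clipped_token_objective(logp, old_logp, advantage, epsilon=0.2):
    """Clipped surrogate term for one token, without the KL penalty."""
    ratio = math.exp(logp - old_logp)                        # pi_theta / pi_theta_old
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)  # clip(ratio, 1 - eps, 1 + eps)
    return min(ratio * advantage, clipped * advantage)


# Single update per generation: the policy equals the old policy, the ratio is 1,
# and the term reduces to the advantage itself.
print(clipped_token_objective(-1.0, -1.0, advantage=0.7))

# The ratio has grown to about 1.5 on a token with positive advantage:
# the gain is capped at (1 + epsilon) * advantage.
print(clipped_token_objective(math.log(1.5), 0.0, advantage=1.0))

# Same ratio with a negative advantage: the unclipped, more pessimistic
# value is kept, so the penalty is not capped.
print(clipped_token_objective(math.log(1.5), 0.0, advantage=-1.0))
```

Taking the minimum of the clipped and unclipped terms means clipping only ever removes incentive to move further from the old policy; it never hides a move that made things worse.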
Loss Types
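Several formulations of the objective have been proposed in the literature. Initially, the objective of GRPO was defined as follows:
$$ \mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} l_{i,t}, $$
where
$$ l_{i,t} = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\left[\pi_\theta(o_{i,t} \mid q, o_{i,<t})\right]_{\text{no grad}}} \hat{A}_{i,t} - \beta \, \mathbb{D}_{\text{KL}}\left[\pi_\theta \| \pi_{\text{ref}}\right]. $$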
The DAPO paper highlights the limitations of the GRPO algorithm's sample-level loss in long-CoT scenarios, where longer responses are under-penalized, leading to poorer quality outputs. The proposed solution is a token-level normalization, which better handles longer sequences by assigning more balanced rewards to individual tokens, regardless of response length:
$$ \mathcal{L}_{\text{DAPO}}(\theta) = -\frac{1}{\sum_{i=1}^G |o_i|} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t} $$
To use this formulation, set loss_type="dapo" in GRPOConfig.
Furthermore, it was demonstrated in the paper Understanding R1-Zero-Like Training: A Critical Perspective that the initial GRPO formulation introduces a response length bias. They show that while the DAPO formulation reduces this bias, it does not eliminate it completely. To fully remove this bias, they propose dividing by a constant $L$ instead of the sequence length, resulting in the following formulation:
$$ \mathcal{L}_{\text{Dr. GRPO}}(\theta) = -\frac{1}{L G} \sum_{i=1}^G \sum_{t=1}^{|o_i|} l_{i,t} $$
This constant is recommended to be the maximum completion length. To use this formulation, set loss_type="dr_grpo" in GRPOConfig.
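The three normalizations differ only in how the per-token terms are averaged, which the following sketch makes explicit. It is an illustration under assumptions: the function name and its list-of-lists input are hypothetical, and the per-token terms are taken as given rather than computed from a model.

```python
def aggregate_loss(token_terms, loss_type="dapo", max_completion_length=None):
    """Combine per-token objective terms into a scalar loss.

    token_terms: one inner list of per-token terms per completion in the group.
    """
    num_sequences = len(token_terms)
    if loss_type == "grpo":
        # Mean over tokens within each sequence, then mean over sequences.
        total = sum(sum(seq) / len(seq) for seq in token_terms) / num_sequences
    elif loss_type == "dapo":
        # One mean over every token in the group.
        num_tokens = sum(len(seq) for seq in token_terms)
        total = sum(sum(seq) for seq in token_terms) / num_tokens
    elif loss_type == "dr_grpo":
        # Divide by a constant (maximum completion length times group size).
        if max_completion_length is None:
            raise ValueError("dr_grpo needs max_completion_length")
        total = sum(sum(seq) for seq in token_terms) / (max_completion_length * num_sequences)
    else:
        raise ValueError(f"unknown loss_type: {loss_type}")
    return -total


# A short completion with positive terms and a long one with negative terms.
terms = [[1.0, 1.0], [-0.5] * 8]
print(aggregate_loss(terms, "grpo"))
print(aggregate_loss(terms, "dapo"))
print(aggregate_loss(terms, "dr_grpo", max_completion_length=8))
```

With sequence-level averaging the eight negative tokens of the long completion count for no more than the two positive tokens of the short one, so the loss comes out negative; with token-level or constant normalization the long completion carries its full weight and the sign flips.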
Logged metrics
While training and evaluating, we record the following reward metrics:
- num_tokens: The total number of tokens processed so far, including both prompts and completions.
- completions/mean_length: The average length of generated completions.
- completions/min_length: The minimum length of generated completions.
- completions/max_length: The maximum length of generated completions.
- completions/mean_terminated_length: The average length of generated completions that terminate with EOS.
- completions/min_terminated_length: The minimum length of generated completions that terminate with EOS.
- completions/max_terminated_length: The maximum length of generated completions that terminate with EOS.
- completions/clipped_ratio: The ratio of truncated (clipped) completions.
- reward/{reward_func_name}/mean: The average reward from a specific reward function.
- reward/{reward_func_name}/std: The standard deviation of the reward from a specific reward function.
- reward: The overall average reward after applying reward weights.
- reward_std: The standard deviation of rewards after applying reward weights.
  - If scale_rewards is "group" or "none", this is the average of the per-group standard deviations.
  - If scale_rewards is "batch", this is the standard deviation computed over all rewards in the batch (ignoring groups).
- frac_reward_zero_std: The fraction of samples in the generation batch with a reward std of zero, implying there is little diversity for that prompt (all answers are correct or incorrect).
- entropy: Average entropy of token predictions across generated completions. (If mask_truncated_completions=True, tokens of masked sequences are excluded.)
- kl: The average KL divergence between the model and the reference model, calculated over generated completions. Logged only if beta is nonzero.
- clip_ratio/region_mean: The ratio of token (or sequence, if importance_sampling_level="sequence") probabilities where the GRPO objective is clipped to stay within the trust region: $$ \text{clip}\left( r_{i,t}(\theta), 1 - \epsilon_\mathrm{low}, 1 + \epsilon_\mathrm{high} \right)\,, \qquad r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,< t})}{\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,< t})}\,. $$ A higher value means more tokens are clipped, which constrains how much the policy can change.
- clip_ratio/low_mean: The average ratio of token (or sequence, if importance_sampling_level="sequence") probabilities that were clipped on the lower bound of the trust region: $r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}$.
- clip_ratio/low_min: The minimum ratio of token (or sequence, if importance_sampling_level="sequence") probabilities that were clipped on the lower bound of the trust region: $r_{i,t}(\theta) < 1 - \epsilon_\mathrm{low}$.
- clip_ratio/high_mean: The average ratio of token (or sequence, if importance_sampling_level="sequence") probabilities that were clipped on the upper bound of the trust region: $r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}$.
- clip_ratio/high_max: The maximum ratio of token (or sequence, if importance_sampling_level="sequence") probabilities that were clipped on the upper bound of the trust region: $r_{i,t}(\theta) > 1 + \epsilon_\mathrm{high}$.
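As a rough guide to reading the reward metrics, the sketch below recomputes `reward`, `reward_std`, and `frac_reward_zero_std` from a flat list of rewards. It is an approximation for intuition only: the helper name is hypothetical, and it uses the population standard deviation and an exact comparison with zero, which may differ from what the trainer does internally.

```python
from statistics import mean, pstdev


def reward_metrics(rewards, num_generations, scale_rewards="group"):
    """Recompute three logged reward metrics from a flat list of rewards.

    Each consecutive run of `num_generations` rewards belongs to one prompt.
    """
    groups = [rewards[i:i + num_generations] for i in range(0, len(rewards), num_generations)]
    group_stds = [pstdev(group) for group in groups]

    if scale_rewards == "batch":
        reward_std = pstdev(rewards)   # one std over the whole batch
    else:                              # "group" or "none"
        reward_std = mean(group_stds)  # average of the per-group stds

    # Groups whose completions all received the same reward carry no learning signal.
    frac_zero = sum(1 for std in group_stds if std == 0.0) / len(group_stds)
    return {"reward": mean(rewards), "reward_std": reward_std, "frac_reward_zero_std": frac_zero}


rewards = [1.0, 0.0, 1.0, 0.0,   # prompt A: mixed outcomes
           1.0, 1.0, 1.0, 1.0]   # prompt B: every completion got the same reward
print(reward_metrics(rewards, num_generations=4))
print(reward_metrics(rewards, num_generations=4, scale_rewards="batch"))
```

A `frac_reward_zero_std` close to 1 suggests that most prompts are either too easy or too hard for the current policy, since their groups yield zero advantages.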
Customization
Speed up training with vLLM-powered generation
Generation is often the main bottleneck when training with online methods. To accelerate generation, you can use vLLM, a high-throughput, low-latency inference engine for LLMs. To enable it, first install the package with
pip install trl[vllm]
We support two ways of using vLLM during training: server mode and colocate mode.
By default, Truncated Importance Sampling is activated for vLLM generation to address the generation-training mismatch that occurs when using different frameworks. This can be turned off by setting vllm_importance_sampling_correction=False. For more information, see Truncated Importance Sampling.
🔌 Option 1: Server mode
In this mode, vLLM runs in a separate process (and using separate GPUs) and communicates with the trainer via HTTP. This is ideal if you have dedicated GPUs for inference.
Start the vLLM server:
trl vllm-serve --model <model_name>
Enable server mode in your training script:
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="server",  # default value, can be omitted
)
Make sure that the server is using different GPUs than the trainer, otherwise you may run into NCCL errors. You can specify the GPUs to use with the CUDA_VISIBLE_DEVICES environment variable.
🧩 Option 2: Colocate mode
In this mode, vLLM runs inside the trainer process and shares GPU memory with the training model. This avoids launching a separate server and can improve GPU utilization, but may lead to memory contention on the training GPUs.
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="colocate",
)
Depending on the model size and the overall GPU memory requirements for training, you may need to adjust the vllm_gpu_memory_utilization parameter in GRPOConfig to avoid underutilization or out-of-memory errors.
We provide a HF Space to help estimate the recommended vllm_gpu_memory_utilization based on your model configuration and experiment settings. If the recommended value does not work in your environment, we suggest adding a small buffer (e.g., +0.05 or +0.1) to the recommended value to ensure stability.
If you still run into out-of-memory errors, set vllm_enable_sleep_mode to True; the vLLM parameters and cache will then be offloaded during the optimization step. For more information, see Reducing Memory Usage with vLLM Sleep Mode.
By default, GRPO uses MASTER_ADDR=localhost and MASTER_PORT=12345 for vLLM, but you can override these values by setting the environment variables accordingly.
For more information, see Speeding up training with vLLM.
GRPO at scale: train a 70B+ Model on multiple nodes
When training large models like Qwen2.5-72B, you need several key optimizations to make the training efficient and scalable across multiple GPUs and nodes. These include:
- DeepSpeed ZeRO Stage 3: ZeRO leverages data parallelism to distribute model states (weights, gradients, optimizer states) across multiple GPUs and CPUs, reducing memory and compute requirements on each device. Since large models cannot fit on a single GPU, using ZeRO Stage 3 is required for training such models. For more details, see DeepSpeed Integration.
- Accelerate: Accelerate is a library that simplifies distributed training across multiple GPUs and nodes. It provides a simple API to launch distributed training and handles the complexities of distributed training, such as data parallelism, gradient accumulation, and distributed data loading. For more details, see Distributing Training.
- vLLM: See the previous section on how to use vLLM to speed up generation.
Below is an example SLURM script to train a 70B model with GRPO on multiple nodes. This script trains a model on 4 nodes and uses the 5th node for vLLM-powered generation.
#!/bin/bash
#SBATCH --nodes=5
#SBATCH --gres=gpu:8
# Get the list of allocated nodes
NODELIST=($(scontrol show hostnames $SLURM_JOB_NODELIST))
# Assign the first 4 nodes for training and the 5th node for vLLM
TRAIN_NODES="${NODELIST[@]:0:4}" # Nodes 0, 1, 2, 3 for training
VLLM_NODE="${NODELIST[4]}" # Node 4 for vLLM
# Run training on the first 4 nodes (Group 1)
srun --nodes=4 --ntasks=4 --nodelist="${NODELIST[@]:0:4}" accelerate launch \
    --config_file examples/accelerate_configs/deepspeed_zero3.yaml \
    --num_processes 32 \
    --num_machines 4 \
    --main_process_ip ${NODELIST[0]} \
    --machine_rank $SLURM_PROCID \
    --rdzv_backend c10d \
    train_grpo.py \
    --vllm_server_host $VLLM_NODE &
# Run vLLM server on the 5th node (Group 2)
srun --nodes=1 --ntasks=1 --nodelist="${NODELIST[4]}" trl vllm-serve --model Qwen/Qwen2.5-72B --tensor_parallel_size 8 &
wait
# train_grpo.py
import argparse

from datasets import load_dataset
from trl import GRPOTrainer, GRPOConfig


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--vllm_server_host", type=str, default="", help="The server IP")
    args = parser.parse_args()

    # Example dataset from TLDR
    dataset = load_dataset("trl-lib/tldr", split="train")

    # Dummy reward function: count the number of unique characters in the completions
    def reward_num_unique_chars(completions, **kwargs):
        return [float(len(set(c))) for c in completions]

    training_args = GRPOConfig(
        output_dir="Qwen2.5-72B-GRPO",
        per_device_train_batch_size=4,
        bf16=True,
        gradient_checkpointing=True,
        use_vllm=True,
        vllm_server_host=args.vllm_server_host.replace("ip-", "").replace("-", "."),  # from ip-X-X-X-X to X.X.X.X
    )

    trainer = GRPOTrainer(model="Qwen/Qwen2.5-72B", args=training_args, reward_funcs=reward_num_unique_chars, train_dataset=dataset)
    trainer.train()


if __name__ == "__main__":
    main()
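The script above turns a node name of the form ip-X-X-X-X into an IP address with two chained `replace` calls, which silently mangles any other naming scheme (for example, a hypothetical `gpu-node-07` would become `gpu.node.07`). A more defensive variant is sketched below; the helper name is an assumption for illustration and is not part of TRL.

```python
import re

# Matches names such as "ip-10-0-1-23" and captures the four octets.
_IP_STYLE_HOSTNAME = re.compile(r"^ip-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})$")


def hostname_to_server_host(hostname: str) -> str:
    """Convert 'ip-X-X-X-X' to 'X.X.X.X'; return any other hostname unchanged."""
    match = _IP_STYLE_HOSTNAME.match(hostname)
    if match is None:
        return hostname
    return ".".join(match.groups())


print(hostname_to_server_host("ip-10-0-1-23"))  # 10.0.1.23
print(hostname_to_server_host("gpu-node-07"))   # gpu-node-07
```

Whether the unchanged hostname is resolvable from the training nodes depends on the cluster's DNS setup, so this only removes one failure mode.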
Using a custom reward function
The GRPOTrainer supports using custom reward functions instead of dense reward models. To ensure compatibility, your reward function must satisfy the following requirements:
Input arguments:
The function must accept the following as keyword arguments:
- prompts (contains the prompts),
- completions (contains the generated completions),
- completions_ids (contains the tokenized completions),
- trainer_state (TrainerState): The current state of the trainer. This can be used to implement dynamic reward functions, such as curriculum learning, where the reward is adjusted based on the training progress.
- All column names (but prompt) that the dataset may have. For example, if the dataset contains a column named ground_truth, the function will be called with ground_truth as a keyword argument.
The easiest way to comply with this requirement is to use **kwargs in the function signature.
Depending on the dataset format, the input will vary:
- For standard format, prompts and completions will be lists of strings.
- For conversational format, prompts and completions will be lists of message dictionaries.
Return value: The function must return a list of floats. Each float represents the reward corresponding to a single completion.
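The requirements above can be combined into one function that works with both dataset formats and uses the trainer state. The sketch below is purely illustrative: the reward itself (a length penalty that grows with training progress) and both function names are hypothetical, and a `SimpleNamespace` stands in for the real TrainerState, of which only the `global_step` and `max_steps` attributes are used.

```python
from types import SimpleNamespace


def completion_to_text(completion):
    """Return the text of a completion in either dataset format."""
    if isinstance(completion, str):   # standard format: plain string
        return completion
    return completion[0]["content"]   # conversational format: list of message dicts


def brevity_reward(completions, trainer_state=None, **kwargs):
    """Hypothetical reward: penalize length, more strongly as training progresses."""
    progress = 0.0
    if trainer_state is not None and trainer_state.max_steps > 0:
        progress = trainer_state.global_step / trainer_state.max_steps
    penalty_per_char = 0.01 * progress
    return [1.0 - penalty_per_char * len(completion_to_text(c)) for c in completions]


# Stand-in for the trainer state, halfway through training.
state = SimpleNamespace(global_step=50, max_steps=100)
standard = [" blue.", " in the sky."]
conversational = [[{"role": "assistant", "content": text}] for text in standard]

# Unused keyword arguments such as prompts are absorbed by **kwargs.
print(brevity_reward(prompts=None, completions=standard, trainer_state=state))
print(brevity_reward(prompts=None, completions=conversational, trainer_state=state))
```

Both calls produce the same rewards, because the helper normalizes the two formats before the reward is computed.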
Example 1: Reward longer completions
Below is an example of a reward function for a standard format that rewards longer completions:
def reward_func(completions_ids, **kwargs):
    """Reward function that assigns higher scores to longer completions (in terms of token count)."""
    return [float(len(ids)) for ids in completions_ids]
You can test it as follows:
>>> prompts = ["The sky is", "The sun is"] # not used in the reward function, but the trainer will pass it
>>> completions = [" blue.", " in the sky."] # not used in the reward function, but the trainer will pass it
>>> completions_ids = [[6303, 13], [304, 279, 12884, 13]]
>>> reward_func(prompts=prompts, completions=completions, completions_ids=completions_ids)
[2.0, 4.0]
Example 1.1: Reward longer completions (based on the number of characters)
Same as the previous example, but this time the reward function is based on the number of characters instead of tokens.
def reward_func(completions, **kwargs):
    """Reward function that assigns higher scores to longer completions (in terms of character count)."""
    return [float(len(completion)) for completion in completions]
You can test it as follows:
>>> prompts = ["The sky is", "The sun is"]
>>> completions = [" blue.", " in the sky."]
>>> completions_ids = [[6303, 13], [304, 279, 12884, 13]] # not used in the reward function, but the trainer will pass it
>>> reward_func(prompts=prompts, completions=completions, completions_ids=completions_ids)
[6.0, 12.0]
Example 2: Reward completions with a specific format
Below is an example of a reward function that checks if the completion has a specific format. This example is inspired by the format reward function used in the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. It is designed for a conversational format, where prompts and completions consist of structured messages.
import re

def format_reward_func(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think><answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
You can test this function as follows:
>>> prompts = [
...     [{"role": "user", "content": "What is the result of (1 + 2) * 4?"}],
...     [{"role": "user", "content": "What is the result of (3 + 1) * 2?"}],
... ]
>>> completions = [
...     [{"role": "assistant", "content": "<think>The sum of 1 and 2 is 3, which we multiply by 4 to get 12.</think><answer>(1 + 2) * 4 = 12</answer>"}],
...     [{"role": "assistant", "content": "The sum of 3 and 1 is 4, which we multiply by 2 to get 8. So (3 + 1) * 2 = 8."}],
... ]
>>> format_reward_func(prompts=prompts, completions=completions)
[1.0, 0.0]
Example 3: Reward completions based on a reference
Below is an example of a reward function that checks if the completion is correct. This example is inspired by the accuracy reward function used in the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
This example is designed for standard format, where the dataset contains a column named ground_truth.
import re

def reward_func(completions, ground_truth, **kwargs):
    # Regular expression to capture content inside \boxed{}
    matches = [re.search(r"\\boxed\{(.*?)\}", completion) for completion in completions]
    contents = [match.group(1) if match else "" for match in matches]
    # Reward 1 if the content is the same as the ground truth, 0 otherwise
    return [1.0 if c == gt else 0.0 for c, gt in zip(contents, ground_truth)]
You can test this function as follows:
>>> prompts = ["Problem: Solve the equation $2x + 3 = 7$. Solution:", "Problem: Solve the equation $3x - 5 = 10$."]
>>> completions = [r" The solution is \boxed{2}.", r" The solution is \boxed{6}."]
>>> ground_truth = ["2", "5"]
>>> reward_func(prompts=prompts, completions=completions, ground_truth=ground_truth)
[1.0, 0.0]
Example 4: Multi-task reward functions
Below is an example of using multiple reward functions in the GRPOTrainer. In this example, we define two task-specific reward functions: math_reward_func and coding_reward_func. The math_reward_func rewards math problems based on their correctness, while the coding_reward_func rewards coding problems based on whether the solution works.
from datasets import Dataset
from trl import GRPOTrainer

# Define a dataset that contains both math and coding problems
dataset = Dataset.from_list(
    [
        {"prompt": "What is 2+2?", "task": "math"},
        {"prompt": "Write a function that returns the sum of two numbers.", "task": "code"},
        {"prompt": "What is 3*4?", "task": "math"},
        {"prompt": "Write a function that returns the product of two numbers.", "task": "code"},
    ]
)

# Note: check_math_solution and test_code_solution are placeholders that you must implement yourself.

# Math-specific reward function
def math_reward_func(prompts, completions, task, **kwargs):
    rewards = []
    for prompt, completion, t in zip(prompts, completions, task):
        if t == "math":
            # Calculate math-specific reward
            correct = check_math_solution(prompt, completion)
            reward = 1.0 if correct else -1.0
            rewards.append(reward)
        else:
            # Return None for non-math tasks
            rewards.append(None)
    return rewards

# Coding-specific reward function
def coding_reward_func(prompts, completions, task, **kwargs):
    rewards = []
    for prompt, completion, t in zip(prompts, completions, task):
        if t == "code":  # must match the value used in the dataset's task column
            # Calculate coding-specific reward
            works = test_code_solution(prompt, completion)
            reward = 1.0 if works else -1.0
            rewards.append(reward)
        else:
            # Return None for non-coding tasks
            rewards.append(None)
    return rewards

# Use both task-specific reward functions
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=[math_reward_func, coding_reward_func],
    train_dataset=dataset,
)
trainer.train()
In this example, math_reward_func and coding_reward_func are designed to work with a mixed dataset that contains both math and coding problems. The task column in the dataset is used to determine which reward function to apply to each problem. If a reward function is not relevant for a sample, it returns None for that sample, and the GRPOTrainer continues with the valid functions and tasks. This allows the GRPOTrainer to handle multiple reward functions with different applicability.
Note that the GRPOTrainer ignores the None rewards returned by the reward functions and only considers the rewards returned by the relevant functions. This ensures that the model is trained on the relevant tasks and ignores the tasks for which there is no relevant reward function.
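How None entries drop out of the total can be sketched as follows. This is a simplified illustration of the behavior described above, not the trainer's code: the function name and the nested-list layout are assumptions, and the optional weights mirror the idea of a weighted sum over reward functions.

```python
def combine_rewards(rewards_per_func, reward_weights=None):
    """Sum the rewards of several functions per sample, skipping None entries.

    rewards_per_func[f][i] is the reward of function f for sample i, or None
    when that function does not apply to the sample.
    """
    num_funcs = len(rewards_per_func)
    num_samples = len(rewards_per_func[0])
    if reward_weights is None:
        reward_weights = [1.0] * num_funcs

    totals = []
    for i in range(num_samples):
        total = 0.0
        for f in range(num_funcs):
            reward = rewards_per_func[f][i]
            if reward is not None:  # a None entry contributes nothing
                total += reward_weights[f] * reward
        totals.append(total)
    return totals


# Outputs in the shape returned by the two task-specific functions above.
math_rewards = [1.0, None, -1.0, None]
code_rewards = [None, 1.0, None, -1.0]
print(combine_rewards([math_rewards, code_rewards]))                             # [1.0, 1.0, -1.0, -1.0]
print(combine_rewards([math_rewards, code_rewards], reward_weights=[1.0, 0.5]))  # [1.0, 0.5, -1.0, -0.5]
```

Each sample ends up scored only by the function that applies to it, so the two tasks can share one training run without interfering with each other's rewards.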
Passing the reward function to the trainer
To use your custom reward function, pass it to the GRPOTrainer as follows:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    reward_funcs=reward_func,
    ...,
)
If you have multiple reward functions, you can pass them as a list:
from trl import GRPOTrainer

trainer = GRPOTrainer(
    reward_funcs=[reward_func1, reward_func2],
    ...,
)
The reward is then computed as the sum of the rewards from each function, or the weighted sum if reward_weights is provided in the config.
Note that GRPOTrainer supports multiple reward functions of different types. See the parameters documentation for more details.
Vision-Language Model (VLM) Training
GRPO supports training Vision-Language Models (VLMs) on multimodal datasets containing both text and images.
Supported Models
Tested with:
- Gemma3, e.g., google/gemma-3-4b-it
- LLaVA-NeXT, e.g., llava-hf/llava-v1.6-mistral-7b-hf
- Qwen2-VL, e.g., Qwen/Qwen2-VL-2B-Instruct
- Qwen2.5-VL, e.g., Qwen/Qwen2.5-VL-3B-Instruct
- SmolVLM2, e.g., HuggingFaceTB/SmolVLM2-2.2B-Instruct
Compatibility with all VLMs is not guaranteed. If you believe a model should be supported, feel free to open an issue on GitHub — or better yet, submit a pull request with the required changes.
Quick Start
Use grpo_vlm.py to fine-tune a VLM. Example command for training on lmms-lab/multimodal-open-r1-8k-verified:
accelerate launch \
    --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/grpo_vlm.py \
    --model_name_or_path Qwen/Qwen2.5-VL-3B-Instruct \
    --output_dir grpo-Qwen2.5-VL-3B-Instruct \
    --learning_rate 1e-5 \
    --gradient_checkpointing \
    --dtype bfloat16 \
    --max_prompt_length 2048 \
    --max_completion_length 1024 \
    --use_vllm \
    --vllm_mode colocate \
    --use_peft \
    --lora_target_modules q_proj v_proj \
    --log_completions
Configuration Tips
VLM training may fail if image tokens are truncated. We highly recommend disabling truncation by setting max_prompt_length to None.
- Use LoRA on vision-language projection layers
- Enable 4-bit quantization to reduce memory usage
- VLMs are memory-intensive, so start with smaller batch sizes
- Most models are compatible with vLLM (server and colocate modes)
Dataset Format
Each training sample should include:
- prompt: Text formatted via the processor's chat template
- image/images: PIL Image or list of PIL Images
The trainer automatically handles image-to-tensor conversion via the model’s image processor.
GRPOTrainer
class trl.GRPOTrainer
(
    model: typing.Union[str, transformers.modeling_utils.PreTrainedModel],
    reward_funcs: typing.Union[str, transformers.modeling_utils.PreTrainedModel, typing.Callable[[list, list], list[float]], list[typing.Union[str, transformers.modeling_utils.PreTrainedModel, typing.Callable[[list, list], list[float]]]]],
    args: typing.Optional[trl.trainer.grpo_config.GRPOConfig] = None,
    train_dataset: typing.Union[datasets.arrow_dataset.Dataset, datasets.iterable_dataset.IterableDataset, NoneType] = None,
    eval_dataset: typing.Union[datasets.arrow_dataset.Dataset, datasets.iterable_dataset.IterableDataset, dict[str, typing.Union[datasets.arrow_dataset.Dataset, datasets.iterable_dataset.IterableDataset]], NoneType] = None,
    processing_class: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, transformers.processing_utils.ProcessorMixin, NoneType] = None,
    reward_processing_classes: typing.Union[transformers.tokenization_utils_base.PreTrainedTokenizerBase, list[transformers.tokenization_utils_base.PreTrainedTokenizerBase], NoneType] = None,
    callbacks: typing.Optional[list[transformers.trainer_callback.TrainerCallback]] = None,
    optimizers: tuple = (None, None),
    peft_config: typing.Optional[ForwardRef('PeftConfig')] = None,
)
Parameters
- model (Union[str, PreTrainedModel]): Model to be trained. Can be either:
  - A string, being the model id of a pretrained model hosted inside a model repo on huggingface.co, or a path to a directory containing model weights saved using save_pretrained, e.g., './my_model_directory/'. The model is loaded using from_pretrained with the keyword arguments in args.model_init_kwargs.
  - A PreTrainedModel object. Only causal language models are supported.
- reward_funcs (Union[RewardFunc, list[RewardFunc]]): Reward functions to be used for computing the rewards. To compute the rewards, we call all the reward functions with the prompts and completions and sum the rewards. Can be either:
  - A single reward function, such as:
    - A string: The model ID of a pretrained model hosted inside a model repo on huggingface.co, or a path to a directory containing model weights saved using save_pretrained, e.g., './my_model_directory/'. The model is loaded using from_pretrained with num_labels=1 and the keyword arguments in args.model_init_kwargs.
    - A PreTrainedModel object: Only sequence classification models are supported.
    - A custom reward function: The function is provided with the prompts and the generated completions, plus any additional columns in the dataset. It should return a list of rewards. Custom reward functions can also return None when the reward is not applicable to those samples. This is useful for multi-task training where different reward functions apply to different types of samples. When a reward function returns None for a sample, that reward function is excluded from the reward calculation for that sample. For more details, see Using a custom reward function. The trainer's state is also passed to the reward function. It is an instance of TrainerState and can be accessed by adding a trainer_state argument to the reward function's signature.
  - A list of reward functions, where each item can independently be any of the above types. Mixing different types within the list (e.g., a string model ID and a custom reward function) is allowed.
- args (GRPOConfig, optional): Configuration for this trainer. If None, a default configuration is used.
- train_dataset (Dataset or IterableDataset): Dataset to use for training. It must include a column "prompt". Any additional columns are ignored by the trainer itself; custom reward functions still receive them as keyword arguments. The format of the samples can be either:
  - Standard: Each sample contains plain text.
  - Conversational: Each sample contains structured messages (e.g., role and content).
- eval_dataset (Dataset, IterableDataset or dict[str, Union[Dataset, IterableDataset]]): Dataset to use for evaluation. It must meet the same requirements as train_dataset.
- processing_class (PreTrainedTokenizerBase, ProcessorMixin, optional): Processing class used to process the data. The padding side must be set to "left". If None, the processing class is loaded from the model's name with from_pretrained. A padding token, tokenizer.pad_token, must be set. If the processing class has not set a padding token, tokenizer.eos_token will be used as the default.
- reward_processing_classes (PreTrainedTokenizerBase or list[PreTrainedTokenizerBase], optional): Processing classes corresponding to the reward functions specified in reward_funcs. Can be either:
  - A single processing class: Used when reward_funcs contains only one reward function.
  - A list of processing classes: Must match the order and length of the reward functions in reward_funcs.
  If set to None, or if an element of the list corresponding to a PreTrainedModel is None, the tokenizer for the model is automatically loaded using from_pretrained. For elements in reward_funcs that are custom reward functions (not PreTrainedModel), the corresponding entries in reward_processing_classes are ignored.
- callbacks (list of TrainerCallback, optional): List of callbacks to customize the training loop. These are added to the list of default callbacks described in the Transformers callbacks documentation. If you want to remove one of the default callbacks used, use the remove_callback method.
- optimizers (tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional, defaults to (None, None)): A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup controlled by args.
- peft_config (PeftConfig, optional): PEFT configuration used to wrap the model. If None, the model is not wrapped.
Trainer for the Group Relative Policy Optimization (GRPO) method. This algorithm was initially proposed in the paper DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
Example:
from datasets import load_dataset
from trl import GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_func(completions, **kwargs):
    # Dummy reward function that rewards completions with more unique letters.
    return [float(len(set(completion))) for completion in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_func,
    train_dataset=dataset,
)
trainer.train()
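Reward functions receive the generated completions and return one float per completion. The sketch below adapts the dummy reward so that it accepts both sample formats listed for train_dataset. It assumes that conversational completions arrive as a list of message dictionaries with a "content" key; the helper name completion_text is hypothetical and not part of TRL.

```python
def completion_text(completion):
    """Return the plain text of a completion in either format (hypothetical helper)."""
    if isinstance(completion, str):
        # Standard format: the completion is already plain text.
        return completion
    # Assumed conversational format: a list of {"role": ..., "content": ...} messages.
    return "".join(message["content"] for message in completion)


def reward_unique_chars(completions, **kwargs):
    """Dummy reward: the number of distinct characters in each completion."""
    return [float(len(set(completion_text(c)))) for c in completions]


standard = reward_unique_chars(["aab", "abc"])
conversational = reward_unique_chars([[{"role": "assistant", "content": "hello"}]])
print(standard, conversational)
```

A function written this way could be passed as reward_funcs regardless of which format the dataset uses, provided the assumption about the conversational structure holds for your data.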
train
< source >( resume_from_checkpoint: typing.Union[str, bool, NoneType] = None trial: typing.Union[ForwardRef('optuna.Trial'), dict[str, typing.Any], NoneType] = None ignore_keys_for_eval: typing.Optional[list[str]] = None **kwargs: typing.Any )
Parameters
- resume_from_checkpoint (
str
orbool
, optional) — If astr
, local path to a saved checkpoint as saved by a previous instance ofTrainer
. If abool
and equalsTrue
, load the last checkpoint in args.output_dir as saved by a previous instance ofTrainer
. If present, training will resume from the model/optimizer/scheduler states loaded here. - trial (
optuna.Trial
ordict[str, Any]
, optional) — The trial run or the hyperparameter dictionary for hyperparameter search. - ignore_keys_for_eval (
list[str]
, optional) — A list of keys in the output of your model (if it is a dictionary) that should be ignored when gathering predictions for evaluation during the training. - kwargs (
dict[str, Any]
, optional) — Additional keyword arguments used to hide deprecated arguments
Main training entry point.
Will save the model, so you can reload it using from_pretrained(). Will only save from the main process.
push_to_hub
< source >( commit_message: typing.Optional[str] = 'End of training' blocking: bool = True token: typing.Optional[str] = None revision: typing.Optional[str] = None **kwargs )
Parameters
- commit_message (
str
, optional, defaults to"End of training"
) — Message to commit while pushing. - blocking (
bool
, optional, defaults toTrue
) — Whether the function should return only when thegit push
has finished. - token (
str
, optional, defaults toNone
) — Token with write permission to overwrite Trainer’s original args. - revision (
str
, optional) — The git revision to commit from. Defaults to the head of the “main” branch. - kwargs (
dict[str, Any]
, optional) — Additional keyword arguments passed along to~Trainer.create_model_card
.
Upload self.model and self.processing_class to the 🤗 model hub on the repo self.args.hub_model_id.
GRPOConfig
class trl.GRPOConfig
< source >( output_dir: typing.Optional[str] = None overwrite_output_dir: bool = False do_train: bool = False do_eval: bool = False do_predict: bool = False eval_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'no' prediction_loss_only: bool = False per_device_train_batch_size: int = 8 per_device_eval_batch_size: int = 8 per_gpu_train_batch_size: typing.Optional[int] = None per_gpu_eval_batch_size: typing.Optional[int] = None gradient_accumulation_steps: int = 1 eval_accumulation_steps: typing.Optional[int] = None eval_delay: float = 0 torch_empty_cache_steps: typing.Optional[int] = None learning_rate: float = 1e-06 weight_decay: float = 0.0 adam_beta1: float = 0.9 adam_beta2: float = 0.999 adam_epsilon: float = 1e-08 max_grad_norm: float = 1.0 num_train_epochs: float = 3.0 max_steps: int = -1 lr_scheduler_type: typing.Union[transformers.trainer_utils.SchedulerType, str] = 'linear' lr_scheduler_kwargs: typing.Union[dict[str, typing.Any], str] = <factory> warmup_ratio: float = 0.0 warmup_steps: int = 0 log_level: str = 'passive' log_level_replica: str = 'warning' log_on_each_node: bool = True logging_dir: typing.Optional[str] = None logging_strategy: typing.Union[transformers.trainer_utils.IntervalStrategy, str] = 'steps' logging_first_step: bool = False logging_steps: float = 10 logging_nan_inf_filter: bool = True save_strategy: typing.Union[transformers.trainer_utils.SaveStrategy, str] = 'steps' save_steps: float = 500 save_total_limit: typing.Optional[int] = None save_safetensors: bool = True save_on_each_node: bool = False save_only_model: bool = False restore_callback_states_from_checkpoint: bool = False no_cuda: bool = False use_cpu: bool = False use_mps_device: bool = False seed: int = 42 data_seed: typing.Optional[int] = None jit_mode_eval: bool = False bf16: typing.Optional[bool] = None fp16: bool = False fp16_opt_level: str = 'O1' half_precision_backend: str = 'auto' bf16_full_eval: bool = False fp16_full_eval: bool = False tf32: 
typing.Optional[bool] = None local_rank: int = -1 ddp_backend: typing.Optional[str] = None tpu_num_cores: typing.Optional[int] = None tpu_metrics_debug: bool = False debug: typing.Union[str, list[transformers.debug_utils.DebugOption]] = '' dataloader_drop_last: bool = False eval_steps: typing.Optional[float] = None dataloader_num_workers: int = 0 dataloader_prefetch_factor: typing.Optional[int] = None past_index: int = -1 run_name: typing.Optional[str] = None disable_tqdm: typing.Optional[bool] = None remove_unused_columns: typing.Optional[bool] = False label_names: typing.Optional[list[str]] = None load_best_model_at_end: bool = False metric_for_best_model: typing.Optional[str] = None greater_is_better: typing.Optional[bool] = None ignore_data_skip: bool = False fsdp: typing.Union[list[transformers.trainer_utils.FSDPOption], str, NoneType] = None fsdp_min_num_params: int = 0 fsdp_config: typing.Union[dict[str, typing.Any], str, NoneType] = None fsdp_transformer_layer_cls_to_wrap: typing.Optional[str] = None accelerator_config: typing.Union[dict, str, NoneType] = None parallelism_config: typing.Optional[accelerate.parallelism_config.ParallelismConfig] = None deepspeed: typing.Union[dict, str, NoneType] = None label_smoothing_factor: float = 0.0 optim: typing.Union[transformers.training_args.OptimizerNames, str] = 'adamw_torch_fused' optim_args: typing.Optional[str] = None adafactor: bool = False group_by_length: bool = False length_column_name: str = 'length' report_to: typing.Union[NoneType, str, list[str]] = None project: str = 'huggingface' trackio_space_id: typing.Optional[str] = 'trackio' ddp_find_unused_parameters: typing.Optional[bool] = None ddp_bucket_cap_mb: typing.Optional[int] = None ddp_broadcast_buffers: typing.Optional[bool] = None dataloader_pin_memory: bool = True dataloader_persistent_workers: bool = False skip_memory_metrics: bool = True use_legacy_prediction_loop: bool = False push_to_hub: bool = False resume_from_checkpoint: 
typing.Optional[str] = None hub_model_id: typing.Optional[str] = None hub_strategy: typing.Union[transformers.trainer_utils.HubStrategy, str] = 'every_save' hub_token: typing.Optional[str] = None hub_private_repo: typing.Optional[bool] = None hub_always_push: bool = False hub_revision: typing.Optional[str] = None gradient_checkpointing: bool = True gradient_checkpointing_kwargs: typing.Union[dict[str, typing.Any], str, NoneType] = None include_inputs_for_metrics: bool = False include_for_metrics: list = <factory> eval_do_concat_batches: bool = True fp16_backend: str = 'auto' push_to_hub_model_id: typing.Optional[str] = None push_to_hub_organization: typing.Optional[str] = None push_to_hub_token: typing.Optional[str] = None mp_parameters: str = '' auto_find_batch_size: bool = False full_determinism: bool = False torchdynamo: typing.Optional[str] = None ray_scope: typing.Optional[str] = 'last' ddp_timeout: int = 1800 torch_compile: bool = False torch_compile_backend: typing.Optional[str] = None torch_compile_mode: typing.Optional[str] = None include_tokens_per_second: bool = False include_num_input_tokens_seen: typing.Union[str, bool] = False neftune_noise_alpha: typing.Optional[float] = None optim_target_modules: typing.Union[NoneType, str, list[str]] = None batch_eval_metrics: bool = False eval_on_start: bool = False use_liger_kernel: bool = False liger_kernel_config: typing.Optional[dict[str, bool]] = None eval_use_gather_object: bool = False average_tokens_across_devices: bool = True model_init_kwargs: typing.Union[dict, str, NoneType] = None disable_dropout: bool = False max_prompt_length: typing.Optional[int] = 512 num_generations: typing.Optional[int] = 8 max_completion_length: typing.Optional[int] = 256 ds3_gather_for_generation: bool = True shuffle_dataset: typing.Optional[bool] = True generation_batch_size: typing.Optional[int] = None steps_per_generation: typing.Optional[int] = None temperature: float = 1.0 top_p: float = 1.0 top_k: typing.Optional[int] = 
None min_p: typing.Optional[float] = None generation_kwargs: typing.Optional[dict] = None repetition_penalty: float = 1.0 use_transformers_paged: bool = False cache_implementation: typing.Optional[str] = None use_vllm: bool = False vllm_mode: str = 'server' vllm_model_impl: str = 'vllm' vllm_enable_sleep_mode: bool = False vllm_guided_decoding_regex: typing.Optional[str] = None vllm_server_base_url: typing.Optional[str] = None vllm_server_host: str = '0.0.0.0' vllm_server_port: int = 8000 vllm_server_timeout: float = 240.0 vllm_gpu_memory_utilization: float = 0.3 vllm_tensor_parallel_size: int = 1 beta: float = 0.0 num_iterations: int = 1 epsilon: float = 0.2 delta: typing.Optional[float] = None epsilon_high: typing.Optional[float] = None importance_sampling_level: str = 'token' reward_weights: typing.Optional[list[float]] = None scale_rewards: str = 'group' loss_type: str = 'dapo' mask_truncated_completions: bool = False sync_ref_model: bool = False ref_model_mixup_alpha: float = 0.6 ref_model_sync_steps: int = 512 top_entropy_quantile: float = 1.0 use_liger_loss: bool = False vllm_importance_sampling_correction: bool = True vllm_importance_sampling_cap: float = 2.0 log_completions: bool = False num_completions_to_print: typing.Optional[int] = None wandb_log_unique_prompts: typing.Optional[bool] = False )
Parameters that control the model and reference model
- model_init_kwargs (
str
,dict[str, Any]
, optional) — Keyword arguments for from_pretrained, used when themodel
argument of the GRPOTrainer is provided as a string. - disable_dropout (
bool
, optional, defaults toFalse
) — Whether to disable dropout in the model. This is useful for training with a reference model, as it prevents the model from generating different logprobs for the same input.
Parameters that control the data preprocessing
- remove_unused_columns (
bool
, optional, defaults toFalse
) — Whether to only keep the column "prompt" in the dataset. If you use a custom reward function that requires any column other than "prompts" and "completions", you should keep this set to False
. - max_prompt_length (
int
orNone
, optional, defaults to512
) — Maximum length of the prompt. If the prompt is longer than this value, it is truncated from the left. - num_generations (
int
orNone
, optional, defaults to8
) — Number of generations per prompt to sample. The effective batch size (num_processes * per_device_batch_size * gradient_accumulation_steps) must be evenly divisible by this value.
- max_completion_length (
int
orNone
, optional, defaults to256
) — Maximum length of the generated completion. - ds3_gather_for_generation (
bool
, optional, defaults toTrue
) — This setting applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation. Disabling this option is not compatible with vLLM generation. - shuffle_dataset (
bool
, optional, defaults toTrue
) — Whether to shuffle the training dataset.
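The divisibility requirement on num_generations can be checked before launching a run. The following is a minimal sketch, assuming the effective batch size is the product num_processes * per_device_batch_size * gradient_accumulation_steps; the function name prompts_per_effective_batch is hypothetical, and the interpretation of the quotient as the number of distinct prompts per effective batch is an assumption rather than something the trainer exposes.

```python
def prompts_per_effective_batch(num_processes, per_device_batch_size,
                                gradient_accumulation_steps, num_generations):
    """Validate num_generations against the effective batch size (illustrative only)."""
    effective = num_processes * per_device_batch_size * gradient_accumulation_steps
    if effective % num_generations != 0:
        raise ValueError(
            f"Effective batch size {effective} is not divisible by "
            f"num_generations={num_generations}."
        )
    # Each prompt contributes num_generations completions to the batch.
    return effective // num_generations


print(prompts_per_effective_batch(2, 4, 1, 8))
```

With a single process, a per-device batch size of 4, and no gradient accumulation, the default num_generations=8 would fail this check, so one of the three factors has to be raised or num_generations lowered.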
Parameters that control generation
- generation_batch_size (
int
, optional) — Batch size to use for generation. If None
, it defaults to the effective training batch size: per_device_train_batch_size * num_processes * steps_per_generation
. In other words, there is one generation batch processed per optimization step. Mutually exclusive with steps_per_generation
. - steps_per_generation (
int
, optional) — Number of steps per generation. If None
, it defaults to gradient_accumulation_steps
. Mutually exclusive with generation_batch_size
. - temperature (
float
, defaults to1.0
) — Temperature for sampling. The higher the temperature, the more random the completions. - top_p (
float
, optional, defaults to1.0
) — Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to1.0
to consider all tokens. - top_k (
int
, optional) — Number of highest probability vocabulary tokens to keep for top-k-filtering. IfNone
, top-k-filtering is disabled and all tokens are considered. - min_p (
float
, optional) — Minimum token probability, which will be scaled by the probability of the most likely token. It must be a value between0.0
and1.0
. Typical values are in the0.01-0.2
range. - repetition_penalty (
float
, optional, defaults to1.0
) — Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values >1.0
encourage the model to use new tokens, while values <1.0
encourage the model to repeat tokens. - use_transformers_paged (
bool
, optional, defaults toFalse
) — Whether to use thetransformers
paged implementation for generation. If set toTrue
, thetransformers
paged implementation will be used for generation instead of the default padded implementation. This parameter is only effective whenuse_vllm
is set toFalse
. - cache_implementation (
str
, optional) — Implementation of the cache method for faster generation whenuse_vllm
is set toFalse
. - generation_kwargs (
dict[str, Any]
, optional) — Additional keyword arguments to pass to GenerationConfig (if using transformers) orSamplingParams
(if using vLLM) when sampling completions. This can be used to further customize the generation behavior, such as settingsuppress_tokens
,num_beams
, etc. If it contains keys that conflict with the other generation parameters (likemin_p
,top_p
, etc.), they will override them.
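The interaction between generation_batch_size and steps_per_generation can be sketched as a small resolution function. This follows only the defaults described above and is not the trainer's actual code; the function name resolve_generation_batch_size is hypothetical.

```python
def resolve_generation_batch_size(per_device_train_batch_size, num_processes,
                                  gradient_accumulation_steps,
                                  generation_batch_size=None,
                                  steps_per_generation=None):
    """Apply the documented defaults for the generation batch size (sketch)."""
    if generation_batch_size is not None and steps_per_generation is not None:
        # The two arguments are documented as mutually exclusive.
        raise ValueError(
            "generation_batch_size and steps_per_generation are mutually exclusive."
        )
    if generation_batch_size is not None:
        return generation_batch_size
    if steps_per_generation is None:
        # Documented default: one generation batch per optimization step.
        steps_per_generation = gradient_accumulation_steps
    return per_device_train_batch_size * num_processes * steps_per_generation


print(resolve_generation_batch_size(8, 2, 4))
```

For example, with per_device_train_batch_size=8, two processes, and gradient_accumulation_steps=4, leaving both arguments unset yields a generation batch of 64 completions.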
Parameters that control generation acceleration powered by vLLM
- use_vllm (
bool
, optional, defaults toFalse
) — Whether to use vLLM for generating completions. If set toTrue
, the trainer will use vLLM for generation instead of the default model.generate(). Requiresvllm
to be installed. - vllm_mode (
str
, optional, defaults to"server"
) — Mode to use for vLLM integration whenuse_vllm
is set toTrue
. Must be one of"server"
or"colocate"
."server"
: The trainer will send generation requests to a separate vLLM server. Make sure a TRL vLLM server is running (start withtrl vllm-serve
)."colocate"
: vLLM will run in the same process and share the training GPUs. This avoids the need for a separate server but may cause resource contention with training.
- vllm_model_impl (
str
, optional, defaults to"vllm"
) — Model implementation to use for vLLM. Must be one of"transformers"
or"vllm"
."transformers"
: Use thetransformers
backend for model implementation."vllm"
: Use thevllm
library for model implementation. - vllm_guided_decoding_regex (
str
, optional) — Regex for vLLM guided decoding. IfNone
(default), guided decoding is disabled.
Parameters that control the vLLM server (only used when `vllm_mode` is `"server"`)
- vllm_server_base_url (
str
, optional) — Base URL for the vLLM server (e.g.,"http://localhost:8000"
). If provided,vllm_server_host
andvllm_server_port
are ignored. - vllm_server_host (
str
, optional, defaults to"0.0.0.0"
) — Host of the vLLM server to connect to. Ignored ifvllm_server_base_url
is provided. - vllm_server_port (
int
, optional, defaults to8000
) — Port of the vLLM server to connect to. Ignored ifvllm_server_base_url
is provided. - vllm_server_timeout (
float
, optional, defaults to240.0
) — Total timeout duration in seconds to wait for the vLLM server to be up. If the server is not up after the timeout, aConnectionError
is raised.
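The precedence between vllm_server_base_url and the host/port pair can be illustrated as follows. This is a sketch of the documented rule only; the function name resolve_vllm_server_url is hypothetical, and the "http://" scheme used when building the URL from host and port is an assumption.

```python
def resolve_vllm_server_url(vllm_server_base_url=None,
                            vllm_server_host="0.0.0.0",
                            vllm_server_port=8000):
    """Pick the server URL following the documented precedence (sketch)."""
    if vllm_server_base_url is not None:
        # When a base URL is provided, host and port are ignored.
        return vllm_server_base_url
    # Assumed scheme: plain HTTP.
    return f"http://{vllm_server_host}:{vllm_server_port}"


print(resolve_vllm_server_url())
print(resolve_vllm_server_url("http://localhost:8000", "10.0.0.5", 9000))
```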
Parameters that control colocated vLLM execution (only used when `vllm_mode` is `"colocate"`)
- vllm_gpu_memory_utilization (
float
, optional, defaults to0.3
) — Control the GPU memory utilization for vLLM. This setting only applies whenvllm_mode
is set to"colocate"
. If you are usingvllm_mode="server"
, this parameter must be passed separately when launching the vLLM server via the--vllm_gpu_memory_utilization
flag. - vllm_tensor_parallel_size (
int
, optional, defaults to1
) — Control the tensor parallel size for vLLM. This setting only applies whenvllm_mode
is set to"colocate"
. If you are usingvllm_mode="server"
, this parameter must be passed separately when launching the vLLM server via the--vllm_tensor_parallel_size
flag. - vllm_enable_sleep_mode (
bool
, optional, defaults toFalse
) — Whether to enable sleep mode for vLLM. IfTrue
, vLLM will sleep during the optimization step and be woken for weight sync and generation.
Parameters that control the training
- beta (
float
, optional, defaults to0.0
) — KL coefficient. If0.0
(default), the reference model is not loaded, reducing memory usage and improving training speed. - num_iterations (
int
, optional, defaults to1
) — Number of iterations per batch (denoted as μ in the algorithm). - epsilon (
float
, optional, defaults to0.2
) — Epsilon value for clipping. - delta (
float
, optional) — Enables the upper clipping bound in two-sided GRPO loss when set to a float. IfNone
(default), standard GRPO clipping is used. Recommended to be greater than1 + ε
when enabled. This method is introduced in the INTELLECT-2 tech report. - epsilon_high (
float
, optional) — Upper-bound epsilon value for clipping. If not specified, it defaults to the same value as the lower bound specified in the epsilon argument
. The DAPO paper recommends 0.28
. - importance_sampling_level (
str
, optional, defaults to"token"
) — Controls whether importance sampling ratios are computed at the"token"
or"sequence"
level."token"
keeps the raw per-token log-probability ratios (one weight per token)."sequence"
averages the log-probability ratios across valid tokens to produce a single ratio per sequence. The GSPO paper shows that sequence-level sampling often yields more stable training and better alignment with sequence-level rewards. - reward_weights (
list[float]
, optional) — Weights for each reward function. Must match the number of reward functions. IfNone
, all rewards are weighted equally with weight1.0
. - scale_rewards (
str
orbool
, optional, defaults to"group"
) — Specifies the scaling strategy for rewards. Supported values are:True
or"group"
(default): rewards are scaled by the standard deviation within each group, ensuring unit variance within a group."batch"
: rewards are scaled by the standard deviation across the entire batch, as recommended in the PPO Lite paper.False
or"none"
: no scaling is applied. The Dr. GRPO paper recommends not scaling rewards, as scaling by the standard deviation introduces a question-level difficulty bias.
- loss_type (
str
, optional, defaults to"dapo"
) — Specifies the loss formulation to use. Supported values are:"grpo"
: Aggregates token-level losses by normalizing over sequence length. Not recommended due to length bias—this approach tends to prefer shorter completions with positive advantages and longer ones with negative advantages."dr_grpo"
: Aggregates token-level losses by normalizing with a global constant. This method was introduced in the Dr. GRPO paper to eliminate length bias. The value of the constant corresponds tomax_completion_length
."dapo"
(default): Aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced in the DAPO paper to eliminate length bias. "bnpo"
: Aggregates token-level losses by normalizing with the number of active tokens in the local batch. Note that normalization is performed over the local batch only, so results may vary slightly depending on the local batch size, despite a constant effective batch size. When using per_device_train_batch_size==1
, the loss is equivalent to the GRPO loss.
- mask_truncated_completions (
bool
, optional, defaults toFalse
) — When enabled, truncated completions are excluded from the loss calculation, preventing them from being incorrectly penalized and introducing noise during training. According to the DAPO paper, this is a good practice for training stability. - sync_ref_model (
bool
, optional, defaults toFalse
) — Whether to synchronize the reference model with the active model everyref_model_sync_steps
steps, using theref_model_mixup_alpha
parameter. This synchronization originates from the TR-DPO paper. - ref_model_mixup_alpha (
float
, optional, defaults to0.6
) — α parameter from the TR-DPO paper, which controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation:π_ref = α * π_θ + (1 - α) * π_ref_prev
. To use this parameter, you must setsync_ref_model=True
. - ref_model_sync_steps (
int
, optional, defaults to512
) — τ parameter from the TR-DPO paper, which determines how frequently the current policy is synchronized with the reference policy. To use this parameter, you must setsync_ref_model=True
. - top_entropy_quantile (
float
, optional, defaults to1.0
) — ρ parameter from Beyond the 80/20 Rule. Keeps in the policy loss term only the top-ρ quantile of tokens by entropy of the probability distribution at each sequence position, improving results. Range:[0.0-1.0]
. A value of0.0
masks all but the highest entropy token;1.0
keeps all tokens. The paper recommends a value of0.2
. If used withmask_truncated_completions=True
, only tokens from non-truncated completions are considered. - use_liger_loss (
bool
, optional, defaults toFalse
) — Whether to use the Liger GRPO loss. - vllm_importance_sampling_correction (
bool
, optional, defaults toTrue
) — Whether to apply Truncated Importance Sampling (TIS) between vLLM completion logprobs and recomputed logprobs. Your Efficient RL Framework Secretly Brings You Off-Policy RL Training highlights that using a separate generation framework (such as vLLM) can introduce off-policy effects due to subtle implementation differences between generation and training backends. TIS is proposed as a remedy for this issue. - vllm_importance_sampling_cap (
float
, optional, defaults to2.0
) — Truncation parameter C for Truncated Importance Sampling (TIS). This sets an upper bound on the importance sampling ratio, improving training stability.
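To make the scale_rewards options concrete, the sketch below computes group-relative advantages for a flat list of rewards in pure Python. It is an illustration under stated assumptions, not the trainer's implementation: rewards are assumed to be ordered so that each consecutive run of num_generations values belongs to the same prompt, the population standard deviation is used, and the small constant eps added to the denominator is a hypothetical choice for numerical stability.

```python
from statistics import fmean, pstdev


def compute_advantages(rewards, num_generations, scale_rewards="group", eps=1e-4):
    """Subtract the per-group mean, then scale per the chosen strategy (sketch)."""
    if len(rewards) % num_generations != 0:
        raise ValueError("len(rewards) must be a multiple of num_generations.")
    # Accept the boolean aliases described for scale_rewards.
    if scale_rewards is True:
        scale_rewards = "group"
    elif scale_rewards is False:
        scale_rewards = "none"
    if scale_rewards not in ("group", "batch", "none"):
        raise ValueError(f"Unknown scale_rewards value: {scale_rewards!r}")

    batch_std = pstdev(rewards)
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = fmean(group)
        if scale_rewards == "group":
            scale = pstdev(group) + eps   # std within this prompt's group
        elif scale_rewards == "batch":
            scale = batch_std + eps       # std across the entire batch
        else:
            scale = 1.0                   # no scaling
        advantages.extend((r - mean) / scale for r in group)
    return advantages


rewards = [1.0, 3.0, 10.0, 10.0]  # two prompts, two completions each
unscaled = compute_advantages(rewards, 2, scale_rewards="none")
group_scaled = compute_advantages(rewards, 2, scale_rewards="group")
batch_scaled = compute_advantages(rewards, 2, scale_rewards="batch")
print(unscaled, group_scaled, batch_scaled)
```

In this toy batch the second prompt has identical rewards, so its advantages are zero under every strategy, while the first prompt's advantages shrink under "batch" scaling because the batch-wide spread is larger than the spread within its group.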
Parameters that control the logging
- log_completions (
bool
, optional, defaults toFalse
) — Whether to log a sample of (prompt, completion) pairs everylogging_steps
steps. Ifrich
is installed, it prints the sample. Ifwandb
logging is enabled, it logs it towandb
. - num_completions_to_print (
int
, optional) — Number of completions to print withrich
. IfNone
, all completions are logged. - wandb_log_unique_prompts (
bool
, optional, defaults toFalse
) — Whether to log unique prompts in wandb. IfTrue
, only unique prompts are logged. IfFalse
, all prompts are logged.
Configuration class for the GRPOTrainer.
This class includes only the parameters that are specific to GRPO training. For a full list of training arguments, please refer to the TrainingArguments documentation. Note that default values in this class may differ from those in TrainingArguments.
Using HfArgumentParser we can turn this class into argparse arguments that can be specified on the command line.