
Efficient Online Training with GRPO and vLLM in TRL

Authored by: Sergio Paniego

Online training methods, such as Group Relative Policy Optimization (GRPO) and Online Direct Preference Optimization (Online DPO), require the model to generate outputs in real time during training. This “online” aspect often becomes a critical bottleneck, as generating completions is both compute- and memory-intensive, especially for large language models (LLMs).

Without optimization, running inference during training can be slow and memory-heavy, limiting both efficiency and scalability. This is particularly noticeable when hardware resources are constrained, such as in Colab with a single GPU.

This notebook demonstrates how to overcome the online generation bottleneck by combining vLLM, a high-throughput, low-latency inference engine built on PagedAttention, with TRL. On a single GPU, TRL and vLLM can share resources efficiently, enabling faster training even with limited hardware. On larger setups, such as multi-GPU or multi-node environments, vLLM can run as a separate process on dedicated GPUs while TRL handles training on others, allowing seamless scaling without impacting generation speed.

Although we focus on GRPO here, this setup is compatible with any online training method in TRL with vLLM support that requires generating completions during training, such as Online DPO. With minimal adjustments, the workflow can be adapted to different online optimization algorithms and hardware configurations while taking full advantage of efficient inference.

By using vLLM alongside TRL, we can directly observe measurable gains in training efficiency, with faster generation, reduced memory usage, and the ability to scale across multiple GPUs or nodes when needed.

The diagram below illustrates the overall training workflow and highlights where vLLM (blue box) and TRL (pink box) fit into the process:

[Diagram: GRPO online training workflow, with vLLM (blue box) handling generation and TRL (pink box) handling training]

1. Install Dependencies

First, let’s install the essential libraries required for fine-tuning. The important highlight here is TRL with vLLM support, which enables high-throughput, low-latency generation during online training, removing the common bottleneck in completion generation.

!pip install -U -q trl[vllm] peft math_verify trackio transformers

# Tested with trl==0.23.1, peft==0.17.1, math_verify==0.8.0, vllm==0.11.0, trackio==0.5.2, transformers==4.57.0

Authenticate with your Hugging Face account to save and share your model directly from this notebook 🗝️.

from huggingface_hub import notebook_login

notebook_login()

2. Load Dataset 📁

Reasoning models excel at tasks that require complex, multi-step reasoning. A prime example is mathematical problem-solving, where step-by-step thinking is essential to arrive at the correct answer.

For this project, we’ll use the AI-MO/NuminaMath-TIR dataset.
This reasoning-focused dataset contains mathematical problems, their final solutions, and, most importantly, detailed reasoning steps that explain how to move from the problem statement to the solution.

from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset, test_dataset = load_dataset(dataset_id, split=['train[:10%]', 'test[:10%]'])

Let’s check the structure of the dataset

>>> print(train_dataset)
Dataset({
    features: ['problem', 'solution', 'messages'],
    num_rows: 7244
})

Let’s check one sample:

>>> print(train_dataset[0])
{'problem': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.',
 'solution': "To determine the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$, we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, $a = \\frac{3}{5}x$, $b = -\\frac{y}{2}$, and $n = 8$.\n\nWe are interested in the term that contains $x^2y^6$. In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get $x^2$, we need $8 - k = 2$, thus $k = 6$.\n\nSubstituting $k = 6$ into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient $\\binom{8}{6}$.\n2. Compute $\\left(\\frac{3}{5}\\right)^2$.\n3. Compute $\\left(-\\frac{y}{2}\\right)^6$.\n4. Combine everything together to get the coefficient of $x^2y^6$.\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$ is $0.1575$. To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]",
 'messages': [{'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'role': 'user'}, {'content': "To determine the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$, we can use the binomial theorem.\n\nThe binomial theorem states:\n\\[\n(a + b)^n = \\sum_{k=0}^{n} \\binom{n}{k} a^{n-k} b^k\n\\]\n\nIn this case, $a = \\frac{3}{5}x$, $b = -\\frac{y}{2}$, and $n = 8$.\n\nWe are interested in the term that contains $x^2y^6$. In the general term of the binomial expansion:\n\\[\n\\binom{8}{k} \\left(\\frac{3}{5}x\\right)^{8-k} \\left(-\\frac{y}{2}\\right)^k\n\\]\n\nTo get $x^2$, we need $8 - k = 2$, thus $k = 6$.\n\nSubstituting $k = 6$ into the expression:\n\\[\n\\binom{8}{6} \\left(\\frac{3}{5}x\\right)^{8-6} \\left(-\\frac{y}{2}\\right)^6 = \\binom{8}{6} \\left(\\frac{3}{5}x\\right)^2 \\left(-\\frac{y}{2}\\right)^6\n\\]\n\nNow, we will compute each part of this expression.\n\n1. Calculate the binomial coefficient $\\binom{8}{6}$.\n2. Compute $\\left(\\frac{3}{5}\\right)^2$.\n3. Compute $\\left(-\\frac{y}{2}\\right)^6$.\n4. Combine everything together to get the coefficient of $x^2y^6$.\n\nLet's compute these in Python.\n```python\nfrom math import comb\n\n# Given values\nn = 8\nk = 6\n\n# Calculate the binomial coefficient\nbinom_coeff = comb(n, k)\n\n# Compute (3/5)^2\na_term = (3/5)**2\n\n# Compute (-1/2)^6\nb_term = (-1/2)**6\n\n# Combine terms to get the coefficient of x^2y^6\ncoefficient = binom_coeff * a_term * b_term\nprint(coefficient)\n```\n```output\n0.1575\n```\nThe coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x - \\frac{y}{2}\\right)^8$ is $0.1575$. To express this as a common fraction, we recognize that:\n\n\\[ 0.1575 = \\frac{1575}{10000} = \\frac{63}{400} \\]\n\nThus, the coefficient can be expressed as:\n\n\\[\n\\boxed{\\frac{63}{400}}\n\\]", 'role': 'assistant'}]}

In the DeepSeek-R1 training procedure (where GRPO was first introduced, as described in the previous notebook), a specific system prompt was used to guide the model in generating both reasoning steps and the final answer in a structured format.

We’ll adopt the same approach here, formatting our dataset so that each example represents a conversation between a User and an Assistant. The Assistant is prompted to first think through the problem before providing the final solution.

The system prompt used is:

A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user
with the answer. The reasoning process and answer are enclosed within <think> </think> and
<answer> </answer> tags, respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: prompt. Assistant:

This conversational structure ensures that the model explicitly demonstrates its reasoning before giving the answer, which is crucial for enhancing multi-step reasoning skills in mathematical problem-solving tasks.

SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "
    "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "
    "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "
    "<think> reasoning process here </think><answer> answer here </answer>"
)

def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

train_dataset = train_dataset.map(make_conversation)
test_dataset = test_dataset.map(make_conversation)

Let’s take a look at an example:

>>> print(train_dataset[0]['prompt'])
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': 'What is the coefficient of $x^2y^6$ in the expansion of $\\left(\\frac{3}{5}x-\\frac{y}{2}\\right)^8$?  Express your answer as a common fraction.', 'role': 'user'}]

We’ll remove the messages and problem columns, as we only need the custom prompt column and solution to verify the generated answer.

>>> train_dataset = train_dataset.remove_columns(['messages', 'problem'])
>>> print(train_dataset)
Dataset({
    features: ['solution', 'prompt'],
    num_rows: 7244
})

3. Post-Training the Base Model Using GRPO + vLLM ⚡

A key challenge in online methods like GRPO is that the model must generate completions during training, which can quickly become a bottleneck. By integrating vLLM, we enable high-throughput, low-latency generation via its PagedAttention mechanism. This not only speeds up the post-training loop but also improves memory efficiency, making large-scale reasoning tasks more practical.

TRL supports online training with vLLM in two different modes:

  • colocate: The trainer process and the vLLM process share the same GPU resources. This is the setup used in this notebook, since Colab provides only a single GPU.
  • server: The trainer and vLLM run on separate GPUs. This mode is ideal for multi-GPU setups, where TRL can use some GPUs for training while vLLM uses others, communicating via HTTP.

These modes provide flexibility to efficiently leverage available hardware while benefiting from vLLM’s fast generation.
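To make the distinction concrete, here is a minimal sketch of what server mode could look like on a machine with two GPUs. It is not run in this notebook, and the model name, GPU assignment, and output directory are illustrative; the GRPOConfig fields used are the vLLM-related options shown later in section 3.4.

# Hypothetical two-GPU setup (not executed in this notebook).
# 1) In a separate terminal, start the vLLM generation server on its own GPU, e.g.:
#      CUDA_VISIBLE_DEVICES=1 trl vllm-serve --model Qwen/Qwen2-0.5B-Instruct
# 2) In the training process (on the remaining GPU), point TRL at that server:
from trl import GRPOConfig

server_training_args = GRPOConfig(
    output_dir="Qwen2-0-5B-GRPO-vllm-server",  # illustrative run name
    use_vllm=True,
    vllm_mode="server",          # the trainer talks to the external vLLM process over HTTP
    vllm_server_host="0.0.0.0",  # defaults, shown here for clarity
    vllm_server_port=8000,
)

In this notebook we stick to colocate mode, since Colab provides a single GPU.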

3.1 Loading the Baseline Model

We’ll start by loading Qwen/Qwen2-0.5B-Instruct as our baseline (the Policy Model in the diagram above).
With just 0.5B parameters, this model is lightweight and fits comfortably within typical GPU memory.

Later in the workflow, vLLM will reuse this same model for generation. Importantly, we don’t need to initialize vLLM here—TRL will handle initialization automatically once the training loop begins, thanks to colocate mode (explained earlier).

We’ll see how this comes into play in the next steps.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

3.2 Configuring LoRA ⚙️

Next, we’ll configure LoRA (Low-Rank Adaptation) for model training.
LoRA allows us to fine-tune the model efficiently by updating a small set of parameters instead of the full model, resulting in faster training and lower GPU memory usage.

>>> from peft import LoraConfig, get_peft_model

>>> lora_config = LoraConfig(
...     task_type="CAUSAL_LM",
...     r=8,
...     lora_alpha=32,
...     lora_dropout=0.1,
...     target_modules=["q_proj", "v_proj"],
... )

>>> model = get_peft_model(model, lora_config)

>>> model.print_trainable_parameters()
trainable params: 540,672 || all params: 494,573,440 || trainable%: 0.1093

3.3 Loading Reward Functions

For the reward component of the system, we can use either pretrained reward models or reward functions defined directly in code. For training, the DeepSeek-R1 authors used an accuracy-based reward that evaluates whether the response is correct, alongside a format-based reward that ensures the model places its reasoning process between <think> </think> tags. You can find more details here. We can simply define and implement these reward functions as generic Python functions.

In this case, we will utilize these reward functions:

  1. Format Enforcement: Ensures that the generation follows a specific format using <think> </think> <answer> </answer> tags for reasoning.
import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    completion_contents = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, content) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]
  2. Solution Accuracy: Verifies whether the solution to the problem is correct.
from math_verify import LatexExtractionConfig, parse, verify

def accuracy_reward(completions, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    solutions = kwargs['solution']
    completion_contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, solution in zip(completion_contents, solutions):
        gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        answer_parsed = parse(content, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
        if len(gold_parsed) != 0:
            try:
                rewards.append(float(verify(answer_parsed, gold_parsed)))
            except Exception:
                rewards.append(0.0)
        else:
            rewards.append(1.0)
    return rewards
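Before training, it can help to sanity-check the two reward functions on a couple of hand-written completions. The dummy completions and solutions below are illustrative examples, not taken from the dataset:

# Quick sanity check of the reward functions on hand-written completions (illustrative only)
dummy_completions = [
    [{"content": "<think>2 + 2 = 4</think><answer>$4$</answer>"}],  # follows the expected format
    [{"content": "The answer is $4$."}],                            # missing <think>/<answer> tags
]
dummy_solutions = ["$4$", "$4$"]

print(format_reward(dummy_completions))                              # [1.0, 0.0]
print(accuracy_reward(dummy_completions, solution=dummy_solutions))  # 1.0 for entries math_verify can match to the gold answer, 0.0 otherwise

During training, the trainer calls these functions with the completions it generates plus the remaining dataset columns (such as solution) passed through **kwargs, just like in this manual call.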

3.4 Configuring GRPO Training Parameters

Next, we’ll configure the training parameters for GRPO. Key parameters to experiment with are max_completion_length, num_generations, and max_prompt_length (see the diagram at the beginning for details on each).

To keep things simple, we’ll train for just one epoch. We’ve doubled max_completion_length so the model can generate longer answers than the GRPOConfig default of 256 tokens. In practice, we recommend setting num_generations to 8 or more, as this has virtually no impact on GPU memory. The same principle applies to the other parameters: careful experimentation is key to finding the most effective configuration for your task. In the next section, we provide a table showing training speeds for different parameter settings.

We’ll also enable vLLM for generation during training. This is done by setting use_vllm=True, which instructs TRL to automatically launch and manage vLLM once the training loop begins.

Since this notebook runs on a single GPU, we configure colocate mode (via the vllm_mode parameter), so both the trainer and vLLM share the same GPU resources. In multi-GPU setups, you can instead run vLLM in a separate process, dedicating specific GPUs to each and letting them communicate via HTTP—unlocking even greater efficiency.

For more advanced configurations, check out the official vLLM integration guide. In multi-GPU environments, you can also launch vLLM with the trl vllm-serve tool to further maximize throughput and performance.

from trl import GRPOConfig

output_dir = "Qwen2-0-5B-GRPO-vllm-trl"

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    output_dir=output_dir,
    learning_rate=1e-5,
    gradient_accumulation_steps=16,
    num_train_epochs=1,

    # Parameters that control the data preprocessing
    max_completion_length=512,  # default: 256
    num_generations=8,  # default: 8
    max_prompt_length=512,  # default: 512

    # Parameters related to reporting and saving
    report_to=["trackio"],
    project=output_dir, # For trackio
    trackio_space_id=f"sergiopaniego/{output_dir}", # For trackio
    push_to_hub=True,
    save_strategy="steps",
    save_steps=10,

    # Configure vLLM
    use_vllm=True,
    vllm_mode="colocate",
    # Some more params you can configure for vLLM with their defaults
    # vllm_model_impl='vllm',
    # vllm_enable_sleep_mode=False,
    # vllm_guided_decoding_regex=None,
    # vllm_server_base_url=None,
    # vllm_server_host='0.0.0.0',
    # vllm_server_port=8000,
    # vllm_server_timeout=240.0,
    # vllm_gpu_memory_utilization=0.3,
    # vllm_tensor_parallel_size=1
    # vllm_importance_sampling_correction=True,
    # vllm_importance_sampling_cap=2.0
)

We’re reporting the training results to trackio. To keep track of metrics and monitor them live during training, we can set up a Hugging Face Space, where the tracking will be continuously updated. We added project and trackio_space_id in GRPOConfig to configure it.

3.5 Training the Model 🏃

Next, we’ll configure the trainer and begin training the model.

For this setup, we pass the two reward functions we defined earlier to the trainer to guide the learning process.

Below is a diagram illustrating the training procedure we’ll be reproducing, adapted from the Open-R1 project.

[Diagram: GRPO training procedure, adapted from the Open-R1 project]

Finally, let’s configure the GRPOTrainer.

If you look closely at the output, you’ll see details about the launch of vLLM. Thanks to TRL, integrating vLLM is straightforward, with minimal friction—allowing you to easily take advantage of high-throughput generation during online training.

For a deeper understanding of the benefits, we recommend comparing this notebook with the previous GRPO recipe without vLLM.

>>> from trl import GRPOTrainer

>>> trainer = GRPOTrainer(
...     model=model,
...     reward_funcs=[format_reward, accuracy_reward],
...     args=training_args,
...     train_dataset=train_dataset
... )
INFO 10-08 08:38:44 [__init__.py:216] Automatically detected platform cuda.

We’ll suppress certain warnings and logs to keep the output clean during training. Since training involves loops, some logs can appear repeatedly and may not be helpful for our example. In a real setting, be careful when suppressing logs, as important information could be hidden. If you want to see the full trace, you can ignore this cell.

import logging
import warnings
from transformers import logging as transformers_logging

logging.basicConfig(level=logging.WARNING) # Set global logging level to WARNING
logging.getLogger("vllm").setLevel(logging.WARNING)  # Silence INFO logs from vLLM
transformers_logging.set_verbosity_warning() # Set Transformers logging to WARNING

# Ignore specific Python warnings
warnings.filterwarnings("ignore", category=UserWarning, module="torch.utils.checkpoint")
warnings.filterwarnings("ignore", category=DeprecationWarning, module="jupyter_client.session")

Time to train the model! 🎉

>>> trainer.train()
* Trackio project initialized: Qwen2-0-5B-GRPO-vllm-trl
* Trackio metrics will be synced to Hugging Face Dataset: sergiopaniego/Qwen2-0-5B-GRPO-vllm-trl-dataset
* Creating new space: https://huggingface.co/spaces/sergiopaniego/Qwen2-0-5B-GRPO-vllm-trl
* View dashboard by going to: https://sergiopaniego-Qwen2-0-5B-GRPO-vllm-trl.hf.space/

Let’s save the results 💾

trainer.save_model(training_args.output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

In the HF Space, you can review the training results tracked by trackio. The metrics look very promising!

The setup shown here runs on a single GPU, yet we can already see how vLLM boosts training efficiency. With vLLM enabled, training reaches 0.07 it/s, whereas disabling it (use_vllm=False) drops throughput to 0.04 it/s. That is 0.07 / 0.04 ≈ 1.75, or roughly 75% more iterations per second, even in this basic configuration.

And this is just the beginning: we haven’t yet explored more optimal setups. For further efficiency gains, you can experiment with training parameters like max_completion_length, num_generations, or max_prompt_length, and scale across multiple GPUs to fully leverage vLLM’s high-throughput generation.

4. Evaluating Different Training Configurations

After training a model efficiently with a single configuration, it’s insightful to explore other possible configurations to understand how training performance changes when using vLLM versus not using it. The table below shows various configurations along with their corresponding it/s (iterations per second), highlighting the performance impact of vLLM.

These results were obtained using a Colab setup, so you can expect significantly higher gains when scaling to more advanced environments with multiple GPUs or distributed nodes.

max_completion_length  num_generations  max_prompt_length  vLLM  it/s
64                     4                128                Yes   0.14
64                     4                128                No    0.12
64                     8                128                Yes   0.14
64                     8                128                No    0.12
128                    8                128                Yes   0.13
128                    8                128                No    0.09
128                    16               128                Yes   0.13
128                    16               128                No    0.09
256                    8                128                Yes   0.10
256                    8                128                No    0.06
256                    16               128                Yes   0.10
256                    16               128                No    0.06
512                    8                128                Yes   0.07
512                    8                128                No    0.04
512                    16               128                Yes   0.07
512                    16               128                No    0.04
1024                   16               128                Yes   0.04
1024                   16               128                No    0.02
1024                   32               128                Yes   0.04
1024                   32               128                No    0.02

From the table above, several observations can be made:

  • As max_completion_length increases, the it/s naturally decreases, which is expected due to the larger computation per iteration.
  • vLLM consistently provides faster training, and the performance gain becomes more significant as we scale to larger max_completion_length values.
  • The num_generations parameter has minimal impact on it/s, showing that parallel generation does not significantly affect throughput in this setup.
  • Although max_prompt_length was kept constant in these experiments, similar trends would apply if it were increased: higher values would reduce it/s depending on the dataset characteristics, just like max_completion_length.
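To put a number on the second observation, the short snippet below computes the speedup from enabling vLLM for a few of the configurations above, using it/s values copied from the table (pairs of rows share a configuration, with and without vLLM):

# Speedup of vLLM over plain generation for selected configurations from the table above
table_results = {  # (max_completion_length, num_generations): (it/s with vLLM, it/s without vLLM)
    (64, 4): (0.14, 0.12),
    (128, 8): (0.13, 0.09),
    (256, 8): (0.10, 0.06),
    (512, 8): (0.07, 0.04),
    (1024, 16): (0.04, 0.02),
}
for (completion_len, n_gen), (with_vllm, without_vllm) in table_results.items():
    print(f"completion={completion_len}, generations={n_gen}: {with_vllm / without_vllm:.2f}x faster with vLLM")

The ratio grows from about 1.17x at short completions to 2x at 1024 tokens, matching the trend described above.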

5. Check the Model Performance

We’ve kept things simple so far, but now let’s check if the model has already learned to reason. We’ll load the saved model and run an evaluation on a test sample.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sergiopaniego/Qwen2-0-5B-GRPO-vllm-trl"
trained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)
trained_tokenizer = AutoTokenizer.from_pretrained(model_id)

Let’s check one sample from the test set!

>>> print(test_dataset['prompt'][0])
[{'content': 'A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., <think> reasoning process here </think><answer> answer here </answer>', 'role': 'system'}, {'content': "In 1988, a person's age was equal to the sum of the digits of their birth year. How old was this person?", 'role': 'user'}]

We’ll create a function to interact with the model. In addition to generating the answer, we’ll measure the inference duration and count the number of generated tokens. This will give us insights into how much the model has reasoned during generation.

import time
import torch

def generate_with_reasoning(model, tokenizer, prompt):
  # Build the prompt from the dataset
  text = tokenizer.apply_chat_template(
      prompt, tokenize=False, add_generation_prompt=True
  )
  # Tokenize and move to the same device as the model
  model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

  # Generate the completion and time it
  start_time = time.time()
  generated_ids = model.generate(
      **model_inputs,
      max_new_tokens=500
  )
  end_time = time.time()
  output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

  # Decode and extract model response
  generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)

  # Get inference time
  inference_duration = end_time - start_time

  # Get number of generated tokens
  num_generated_tokens = len(output_ids)

  return generated_text, inference_duration, num_generated_tokens

Let’s generate the answer for that test sample!

>>> prompt = test_dataset['prompt'][0]

>>> trained_text, trained_duration, trained_num_tokens = generate_with_reasoning(trained_model, trained_tokenizer, prompt)
>>> print('-- Trained model --')
>>> print(trained_text)

>>> base_text, base_duration, base_num_tokens = generate_with_reasoning(model, tokenizer, prompt)
>>> print('-- Base model --')
>>> print(base_text)
-- Trained model --
<think> reasoning process here </think><answer> 46 </answer>
-- Base model --

The person's age was equal to the sum of the digits of their birth year.

The trained model already demonstrates the ability to correctly generate the <think> and <answer> tags, even though the reasoning or final solution may still be incorrect. We can also observe clear differences between the trained and baseline model responses — the base model fails to produce the tags properly.

Considering both the inference time and the number of generated tokens, this approach already looks promising:

>>> print(f"Inference time: {inference_duration:.2f} seconds")
>>> print(f"Generated tokens: {num_generated_tokens}")
Inference time: 0.70 seconds
Generated tokens: 18

We observe that the model shows some reasoning capabilities, although they are quite limited. This is likely due to using a small model and a very basic training setup, designed more for educational purposes than for maximizing performance.

For better results, using a larger model, training on the full dataset, and adjusting the configuration to generate more and longer completions would significantly improve the model’s final performance.
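As a rough illustration of that kind of scaling, here is a sketch of a configuration you might start from. The values and the output directory name are illustrative, not tuned recommendations, and you would pair them with a larger base model and the full dataset when building the trainer:

# Illustrative, untested settings for a larger run (adjust to your hardware and budget)
from trl import GRPOConfig

scaled_args = GRPOConfig(
    output_dir="Qwen2.5-3B-GRPO-vllm-trl",   # hypothetical run name for a larger base model
    num_train_epochs=1,
    max_prompt_length=512,
    max_completion_length=1024,              # allow longer reasoning traces
    num_generations=16,                      # more completions per prompt for GRPO's group statistics
    use_vllm=True,
    vllm_mode="colocate",
    vllm_gpu_memory_utilization=0.3,         # leave headroom for the training process on the same GPU
)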

6. Continuing Your Learning Journey 🧑‍🎓

This notebook is just the beginning of exploring online training methods with TRL, including GRPO and other online trainers, now enhanced with vLLM for faster, more efficient generation.

If you’re eager to dive deeper, check out the resources linked throughout this notebook, such as the official vLLM integration guide in TRL and the previous GRPO recipe without vLLM.

Keep exploring, experimenting, and learning!
