KAIROS-MM-Qwen2.5-VL-7B-RL
KAIROS-MM-Qwen2.5-VL-7B-RL is a multimodal reasoning model based on Qwen2.5-VL, designed to let robots and vision AI agents reason about the real world using physics understanding, prior knowledge, and common sense. The model understands space, time, and fundamental physical principles, making it well suited for planning, decision making, and long-horizon video reasoning in embodied and agentic systems. It serves as a planning and reasoning backbone for embodied agents, allowing them to infer what action to take next from visual observations, temporal context, and physical constraints.
Key Capabilities
- Multimodal Reasoning Across Space and Time: Understands spatial relationships, temporal dynamics, and causal interactions from images and long-horizon videos.
- Physics and Common Sense Understanding: Reasons about motion, forces, object permanence, collisions, and everyday physical interactions in real-world environments.
- Long-Horizon Video Reasoning: Maintains contextual understanding across extended video sequences for activity understanding, forecasting, and planning.
- Embodied Agent Planning: Acts as a high-level planning model, reasoning about the next steps an embodied agent or robot should take in a given environment.
- Vision-Language Grounding: Aligns visual perception with language instructions and goals for reliable reasoning and action guidance.
- Task-Level Decision Making: Supports complex multi-step reasoning for navigation, manipulation, and interaction tasks in dynamic environments.
- Cross-Domain Generalization: Adapts to robotics, physical AI, simulation environments, and real-world visual reasoning scenarios.
Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model with automatic dtype selection and device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/KAIROS-MM-Qwen2.5-VL-7B-RL",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/KAIROS-MM-Qwen2.5-VL-7B-RL"
)

# Build a multimodal chat message: a long-horizon video plus a planning query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "<LONG_HORIZON_VIDEO>"},
            {"type": "text", "text": "What should the robot do next to safely pick up the object?"},
        ],
    }
]

# Render the chat template and extract image/video inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens from each sequence before decoding
generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
```
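The same pipeline also accepts single images or mixed image-and-text content. A minimal sketch of assembling the message payload (the `build_message` helper and the image path are illustrative, not part of the model's API):

```python
def build_message(question, image=None, video=None):
    """Assemble a Qwen2.5-VL chat message mixing visual and text content."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]

messages = build_message(
    "Is it safe for the robot to grasp the mug?",
    image="file:///path/to/frame.png",
)
```

The resulting `messages` list is passed to `processor.apply_chat_template` and `process_vision_info` exactly as in the video example above.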
Training Datasets
The model is trained with reinforcement learning on multimodal reasoning and planning datasets that emphasize physical understanding, temporal reasoning, and decision making:
- SAGE-MM-RL-7k: A reinforcement learning dataset focused on grounded multimodal reasoning and action-level decision making. https://huggingface.co/datasets/allenai/SAGE-MM-RL-7k
- Cosmos-Reason1-RL-Dataset: A reinforcement learning dataset designed for physical AI and robotics reasoning, with an emphasis on physics-grounded and common-sense-driven learning. https://huggingface.co/datasets/nvidia/Cosmos-Reason1-RL-Dataset
Intended Use
- Embodied AI and robotics reasoning
- Long horizon video understanding and planning
- Physical AI and simulation environments
- Vision based decision making and forecasting
- Robot task planning and action sequencing
- Multimodal reinforcement learning research and development
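For task planning and action sequencing, downstream controllers typically need the model's free-form reply broken into discrete steps. A minimal post-processing sketch (the `parse_plan` helper and the sample reply are hypothetical, not part of the model's API):

```python
import re

def parse_plan(text):
    """Extract numbered steps like '1. Move to the table' from model output."""
    steps = re.findall(r"^\s*\d+[.)]\s*(.+)$", text, flags=re.MULTILINE)
    return [s.strip() for s in steps]

# Example reply in the numbered-list style the model can be prompted to produce
reply = """To pick up the object safely:
1. Move the gripper above the object.
2. Open the gripper to match the object's width.
3. Lower slowly until contact, then close the gripper."""

steps = parse_plan(reply)
```

Prompting the model explicitly for a numbered list (e.g. "Answer as numbered steps.") makes this kind of parsing more reliable than free-form text.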
Limitations
- Performance may degrade on extremely long or noisy video inputs
- Highly ambiguous physical scenarios may require external planners or simulators
- Domain specific robotics tasks may benefit from additional task specific training
- Fine grained low level control is outside the scope of this model
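One practical mitigation for very long or noisy videos is to downsample frames at the input stage. The upstream Qwen2.5-VL usage docs describe per-video sampling hints such as `fps` and `max_pixels` that `process_vision_info` honors; the values below are illustrative and the exact behavior may vary by qwen-vl-utils version:

```python
# Illustrative sampling hints for a long video; tune the limits to
# available memory and the installed qwen-vl-utils version.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "<LONG_HORIZON_VIDEO>",
                "fps": 1.0,               # sample roughly one frame per second
                "max_pixels": 360 * 420,  # cap per-frame resolution
            },
            {"type": "text", "text": "Summarize the robot's progress so far."},
        ],
    }
]
```

Lower `fps` and `max_pixels` trade fine-grained temporal and spatial detail for a shorter visual token sequence, which helps keep extended videos within context.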
References
- Qwen2.5-VL Technical Report: https://huggingface.co/papers/2502.13923
- YaRN: Efficient Context Window Extension: https://arxiv.org/pdf/2309.00071
- Qwen2-VL: High-Resolution Perception: https://arxiv.org/pdf/2409.12191
- Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time: https://arxiv.org/abs/2203.05482
- Model Stock: All We Need Is Just a Few Fine-Tuned Models: https://arxiv.org/abs/2403.19522
- SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning: https://github.com/allenai/SAGE
Model tree for prithivMLmods/KAIROS-MM-Qwen2.5-VL-7B-RL
- Base model: Qwen/Qwen2.5-VL-7B-Instruct