Cosmos Predict 2.5 & Transfer 2.5: Evolving the World Foundation Models for Physical AI
The NVIDIA Cosmos family of open world models is redefining how we model and simulate the real world. Built for robotics, autonomous systems, and simulation-driven AI, Cosmos world foundation models (WFMs) enable machines to see, imagine, and reason about physical reality.
With the launch of Cosmos Predict 2.5 and Cosmos Transfer 2.5, world generation has taken another major leap forward. These models extend the Cosmos family into longer horizons, richer viewpoints, and more adaptive domain transformations, laying the groundwork for scalable physical AI.
Cosmos Predict 2.5: World Generation
Cosmos Predict 2.5 merges what were once three separate models (Text2World, Image2World, and Video2World) into a single, unified architecture capable of generating consistent, controllable video worlds from several input modalities. Trained on 200 million high-quality clips and enhanced with a new reinforcement learning (RL) algorithm, it outperforms Cosmos Predict 1 in quality and prompt alignment when generating synthetic video data from single frames.
Key Highlights
- One Powerful Model: Cosmos Predict 2.5 combines the capabilities of Text2World, Image2World, and Video2World in a single model and uses Cosmos Reason 1, a physical AI reasoning vision language model (VLM), as the text encoder. This saves compute cost and the time needed to post-train and build your physical AI workflows.
- Extended Video Horizons: Produces sequences up to 30 seconds long while maintaining spatial-temporal coherence, which is important for simulation, long-horizon prediction, and robotic planning.
- Multi-View Generation: Creates synchronized camera views for realistic multi-camera setups in autonomous vehicle (AV) training or robot vision, with camera control.
- Grounded Prompt Alignment: Integrates Cosmos Reason as a text-scene encoder, tightening semantic grounding and reducing hallucinations.
- Efficiency-Driven Design: Despite its scale and capability, Predict 2.5 improves overall quality, inference speed, and resource efficiency through architectural refinements.
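One way to picture how a single model can cover the three former modes is to key the mode off how many conditioning frames accompany the text prompt. The sketch below is a hypothetical illustration, not the Cosmos Predict 2.5 API: the function name `select_generation_mode`, the mode strings, and the frame-count rule are all assumptions made for this example.

```python
from typing import Optional, Sequence


def select_generation_mode(prompt: str, frames: Optional[Sequence[object]] = None) -> str:
    """Pick a generation mode from the supplied inputs (hypothetical rule).

    Assumption for this sketch: no frames means text-only conditioning,
    one frame means image conditioning, several frames mean video conditioning.
    """
    if not prompt:
        raise ValueError("a text prompt is required in this sketch")
    num_frames = 0 if frames is None else len(frames)
    if num_frames == 0:
        return "text2world"
    if num_frames == 1:
        return "image2world"
    return "video2world"


# Placeholder strings stand in for real image data.
mode_for_text = select_generation_mode("a robot arm stacking boxes")
mode_for_image = select_generation_mode("a robot arm stacking boxes", ["frame_0"])
mode_for_video = select_generation_mode("a robot arm stacking boxes", ["frame_0", "frame_1"])
```

The point of the sketch is only that the caller talks to one entry point, and the amount of visual context decides which former task is being performed.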
Cosmos Transfer 2.5: Spatially Controlled World Transformation
While Predict 2.5 creates worlds, Transfer 2.5 transforms them, enabling high-fidelity, spatially conditioned world-to-world translation. Compared to Cosmos Transfer1-7B, Cosmos Transfer2.5-2B is much smaller, aligns better with prompts and physics, and produces fewer hallucinations and less error accumulation in long video generation.
Key Highlights
- Smaller, faster, and higher quality: 3.5× smaller than its predecessor, yet faster and higher in quality; optimized for deployment in both research and production pipelines.
- Policy training for robots: Robot policy models trained with Cosmos Transfer 2.5-2B augmentation significantly outperform others in generalizing to novel environments.
- Better adherence to control signals for autonomous vehicles: Evaluation of 3D lane and cuboid detection on generated multi-view videos, using real-world scenarios as the control input, shows up to a 60% improvement over the previous model (Transfer1-7B-Sample-AV). Lane detection uses LATR and cuboid detection uses BEVFormer.
- Multi-camera consistency for autonomous vehicles: Cosmos Transfer 2.5 improves on Cosmos Transfer 1 by distributing control blocks more evenly throughout the network for smoother integration of conditioning information.
- Less error accumulation: Transfer 2.5 shows less error accumulation for all four control modalities (edge, blur, depth, and segmentation) compared to Cosmos-Transfer1-7B.
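The 3.5× size figure is consistent with the parameter counts suggested by the model names, assuming the "7B" and "2B" suffixes denote roughly 7 billion and 2 billion parameters. A quick check, with variable names chosen for this example:

```python
# Nominal parameter counts read off the model names
# (assumption: "7B" is about 7e9 parameters, "2B" is about 2e9).
transfer1_params = 7e9
transfer25_params = 2e9

size_ratio = transfer1_params / transfer25_params
print(f"Transfer 2.5 is about {size_ratio:.1f}x smaller")  # prints "Transfer 2.5 is about 3.5x smaller"
```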
Other Updates from the Cosmos Platform
Cosmos Reason 1: Reasoning Vision Language Model
Part of the Cosmos WFM family, NVIDIA Cosmos Reason is an open, customizable, 7-billion-parameter reasoning vision language model (VLM) for physical AI and robotics. The model enables robots and vision AI agents to reason like humans, using prior knowledge, physics understanding, and common sense to understand and act in the real world. Cosmos Reason has topped the Physical Reasoning leaderboard. The model is also available as an NVIDIA NIM, which offers secure, easy-to-use microservices for deploying high-performance generative AI across any environment.
Cosmos Dataset Search: Large-scale Data Search and Retrieval
NVIDIA Cosmos Dataset Search is a vector-based workflow that accelerates model post-training by letting physical AI developers instantly search and retrieve targeted scenarios from massive training datasets. It uses the Cosmos-Embed NIM for highly accurate semantic search and connects to NVIDIA Cosmos Curator to refine datasets and retrieve queried data efficiently and accurately. With the ability to search billions of clips in seconds, Cosmos Dataset Search dramatically shortens post-training, cutting development cycles from years to days.
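To make "vector-based" concrete: semantic search of this kind typically embeds each clip and the query as vectors and ranks clips by similarity. The following is a minimal, generic sketch of cosine-similarity retrieval in plain Python. It is not the Cosmos Dataset Search or Cosmos-Embed API; the clip IDs, the toy 3-dimensional embeddings, and the `top_k` helper are made up for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the k clip IDs most similar to the query, best first (linear scan)."""
    scored = [(clip_id, cosine_similarity(query, emb)) for clip_id, emb in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy index: hypothetical clip IDs mapped to made-up embeddings.
toy_index = {
    "clip_rainy_intersection": [0.9, 0.1, 0.0],
    "clip_warehouse_forklift": [0.0, 0.2, 0.9],
    "clip_night_highway": [0.7, 0.6, 0.1],
}

# Stand-in for the embedding of a text query such as "wet urban road".
query_embedding = [1.0, 0.2, 0.0]
results = top_k(query_embedding, toy_index, k=2)
for clip_id, score in results:
    print(clip_id, round(score, 3))
```

A linear scan like this is only workable for small collections. Searching billions of clips in seconds would require learned embeddings and an approximate nearest-neighbor index, which this sketch does not attempt to show.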
Use Cases and Workflows
Cosmos Cookbook: The Cosmos Cookbook offers developers step-by-step recipes and post-training scripts to quickly build, customize, and deploy NVIDIA's Cosmos world foundation models for robotics and autonomous systems.
Read the latest white paper from the NVIDIA Research team for more details on the capabilities and benchmarks of Cosmos Predict 2.5 and Cosmos Transfer 2.5.
Resources
- World Simulation With Video Foundation Models for Physical AI Whitepaper
- Cosmos Predict 2.5 Research Page
- Cosmos Transfer 2.5 Research Page
- Cosmos Reason 1
- Cosmos Dataset Search
Get Started Today
All Cosmos WFMs are available on Hugging Face, where the model checkpoints are hosted.
- Cosmos Predict 2.5 - Multimodal world foundation model for generating next frames based on input prompts. Explore the GitHub repository for inference and post-training scripts.
- Cosmos Transfer 2.5 - Multicontrol world foundation model for data augmentation from structured video inputs. Explore the GitHub repository for inference and post-training scripts.
Join our community for regular updates, Q&A, livestreams and hands-on tutorials!