---
license: mit
language:
- en
library_name: transformers
tags:
- video-generation
- robotics
- embodied-ai
- physical-reasoning
- causal-reasoning
- inverse-dynamics
- wow
- arxiv:2509.22642
datasets:
- WoW-world-model/WoW-1-Benchmark-Samples
pipeline_tag: video-generation
base_model: wan
---
# WoW-1-Wan-14B-2M

**WoW-1-Wan-14B** is a 14-billion-parameter generative world model trained on **2 million real-world robot interaction trajectories**. It is designed to imagine, reason, and act in physically consistent environments, powered by SOPHIA-guided refinement and a co-trained **Inverse Dynamics Model**.

This model is part of the [WoW (World-Omniscient World Model)](https://github.com/wow-world-model/wow-world-model) project, introduced in the paper:

> **[WoW: Towards a World omniscient World model Through Embodied Interaction](https://arxiv.org/abs/2509.22642)**
> *Chi et al., 2025, arXiv:2509.22642*
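For quick experimentation, a minimal text-to-video sketch is shown below. The card does not specify a loading recipe, so the repo id, pipeline class, resolution, and frame count are assumptions (the checkpoint is Wan-based, so a diffusers-style pipeline is a reasonable guess); adapt them to the released weights and code.

```python
# Minimal usage sketch, not the official recipe.
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Hypothetical repo id; replace with the actual checkpoint location.
pipe = DiffusionPipeline.from_pretrained(
    "WoW-world-model/WoW-1-Wan-14B-2M",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "The Franka robot, grasp the red bottle on the table"
# Frame count and output handling follow common Wan-style video pipelines.
frames = pipe(prompt=prompt, num_frames=81).frames[0]
export_to_video(frames, "wow_rollout.mp4", fps=16)
```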
## Key Features

- **14B parameters** trained on **2M robot interaction samples**
- Learns **causal physical reasoning** from embodied action
- Generates physically consistent video and robotic action plans
- Uses **SOPHIA**, a vision-language critic, to refine outputs
- Paired with an **Inverse Dynamics Model** to close the imagination-to-action loop (see the sketch after this list)
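The last two bullets describe a generate-critique-act loop. The sketch below illustrates that flow only; `world_model`, `sophia`, `idm`, and their methods are hypothetical placeholders, not the project's released interfaces.

```python
# Illustrative only: all objects and methods here are hypothetical placeholders.
def imagine_and_act(world_model, sophia, idm, image, instruction, max_rounds=3):
    """Generate a rollout, let the SOPHIA critic refine it, then map the
    imagined frames to executable actions with the inverse dynamics model."""
    prompt = instruction
    video = world_model.generate(image=image, prompt=prompt)
    for _ in range(max_rounds):
        feedback = sophia.critique(video=video, instruction=instruction)
        if feedback.physically_consistent:
            break
        # The critic rewrites the prompt to correct the flagged inconsistency.
        prompt = feedback.refined_prompt
        video = world_model.generate(image=image, prompt=prompt)
    actions = idm.predict_actions(video)  # imagination-to-action
    return video, actions
```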
## Training Data

<!-- - Dataset: [WoW-1-Benchmark-Samples](https://huggingface.co/datasets/WoW-world-model/WoW-1-Benchmark-Samples) -->
- **2M** real-world robot interaction trajectories
- Multimodal scenes including vision, action, and language
- Diverse **mixture captions** for better generalization
### Mixture Caption Strategy

- **Prompt Lengths**:
  - Short: *"The Franka robot, grasp the red bottle on the table"*
  - Long: *"The scene... open the drawer, take the screwdriver, place it on the table..."*

- **Robot Model Mixing**:
  - Captions reference various robot types
  - Example: *"grasp with the Franka Panda arm"*, *"use end-effector to align"*

- **Action Granularity**:
  - Coarse: *"move to object"*
  - Fine: *"rotate wrist 30° before grasping"*
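As a concrete illustration of the mixing above, a caption for each clip could be sampled roughly as follows; all template strings and robot names are placeholders, not the dataset's actual schema.

```python
import random

# Placeholder caption pools; the real dataset uses far richer templates.
SHORT_TEMPLATES = ["The {robot}, grasp the red bottle on the table."]
LONG_TEMPLATES = [
    "The scene shows a cluttered table. The {robot} opens the drawer, "
    "takes the screwdriver, and places it on the table."
]
ROBOTS = ["Franka Panda arm", "generic single-arm manipulator"]
COARSE_STEPS = ["Move to the object."]
FINE_STEPS = ["Rotate the wrist 30 degrees before grasping."]

def sample_caption(rng: random.Random) -> str:
    """Mix prompt length, robot identity, and action granularity per clip."""
    template = rng.choice(SHORT_TEMPLATES + LONG_TEMPLATES)
    caption = template.format(robot=rng.choice(ROBOTS))
    step = rng.choice(COARSE_STEPS + FINE_STEPS)
    return f"{caption} {step}"

print(sample_caption(random.Random(0)))
```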
## Continuous Updates

The accompanying dataset will be continuously updated with:

- More trajectories
- Richer language annotations
- Finer multimodal annotations
## Applications

- Zero-shot video generation in robotics
- Causal reasoning and physics simulation
- Long-horizon manipulation planning
- Forward and inverse control prediction (sketched below)
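To make the last bullet concrete, the two directions can be framed as below; `world_model.predict_next` and `idm.predict_action` are placeholder method names used only to show the data flow, not the released API.

```python
# Placeholder interfaces: forward = imagine frames from an action plan,
# inverse = recover the actions linking consecutive imagined frames.
def forward_rollout(world_model, start_frame, action_plan):
    frames = [start_frame]
    for action in action_plan:
        frames.append(world_model.predict_next(frames[-1], action))
    return frames

def inverse_actions(idm, frames):
    return [idm.predict_action(prev, nxt) for prev, nxt in zip(frames, frames[1:])]
```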
## Citation

```bibtex
@article{chi2025wow,
  title={WoW: Towards a World omniscient World model Through Embodied Interaction},
  author={Chi, Xiaowei and Jia, Peidong and Fan, Chun-Kai and Ju, Xiaozhu and Mi, Weishi and Qin, Zhiyuan and Zhang, Kevin and Tian, Wanxin and Ge, Kuangzhi and Li, Hao and others},
  journal={arXiv preprint arXiv:2509.22642},
  year={2025}
}
```
## Resources

- Project page: [wow-world-model.github.io](https://wow-world-model.github.io/)
- GitHub repo: [wow-world-model/wow-world-model](https://github.com/wow-world-model/wow-world-model)
- Dataset: [WoW-1 Benchmark Samples](https://huggingface.co/datasets/WoW-world-model/WoW-1-Benchmark-Samples)

---