Mantis

This is the official checkpoint for "Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight".

🔥 Highlights

  • Disentangled Visual Foresight augments action learning without overburdening the backbone.
  • Progressive Training preserves the understanding capabilities of the backbone.
  • Adaptive Temporal Ensemble reduces inference cost while maintaining stable control.
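To make the last highlight more concrete, here is a minimal sketch of plain temporal ensembling in the style popularized by ACT, where overlapping action predictions for the same timestep are blended with exponentially decaying weights. This card does not describe the rule that makes Mantis's ensemble adaptive, so the sketch covers only the basic blending step. The function name `temporal_ensemble` and the `decay` parameter are illustrative assumptions, not part of the Mantis codebase.

```python
import math


def temporal_ensemble(predictions, decay=0.01):
    """Blend overlapping action predictions for a single timestep.

    predictions: list of action vectors (lists of floats), oldest first.
    decay: hypothetical smoothing constant; weights are exp(-decay * i),
           so the oldest prediction (i = 0) gets the largest weight.
    """
    weights = [math.exp(-decay * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    # Weighted average, computed independently for each action dimension.
    return [
        sum(w * p[d] for w, p in zip(weights, predictions)) / total
        for d in range(dim)
    ]
```

With `decay=0.0` this reduces to a simple mean of the predictions. Larger values lean more heavily on older predictions, which smooths control at the price of slower reaction to new observations.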

How to use

This is the base Mantis model. For detailed usage, please refer to our repository.
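As a starting point before following the repository instructions, the checkpoint files can be fetched with the `huggingface_hub` client. This is a sketch under two assumptions: that the weights are hosted on the Hugging Face Hub under the repo id `Yysrc/Mantis-Base`, and that `huggingface_hub` is installed. The helper name `download_checkpoint` is hypothetical. Loading and running the model is not shown here because it depends on the code in the Mantis repository.

```python
# Assumed Hub repo id for the base checkpoint.
REPO_ID = "Yysrc/Mantis-Base"


def download_checkpoint(local_dir=None):
    """Download all files of the checkpoint and return the local folder path.

    local_dir: optional target directory; if None, the default
               Hugging Face cache location is used.
    """
    # Imported lazily so this module can be imported without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example (requires network access):
# path = download_checkpoint("./mantis-base")
```

The card lists the model at 6B parameters, so the download will be large. Make sure there is enough free disk space before running it.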

📝 Citation

If you find our code or models useful in your work, please cite our paper:

@article{yang2025mantis,
  title={Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight},
  author={Yang, Yi and Li, Xueqi and Chen, Yiyang and Song, Jin and Wang, Yihan and Xiao, Zipeng and Su, Jiadi and Qiaoben, You and Liu, Pengfei and Deng, Zhijie},
  journal={arXiv preprint arXiv:2511.16175},
  year={2025}
}

Safetensors · Model size: 6B params · Tensor type: F32, BF16

Collection including Yysrc/Mantis-Base