VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM

If you like our project, please give us a star ⭐ on GitHub to receive the latest updates.

📰 News

🌏 Model Zoo

📑 Citation

If you find VideoRefer Suite useful for your research and applications, please cite using this BibTeX:

@InProceedings{Yuan_2025_CVPR,
    author    = {Yuan, Yuqian and Zhang, Hang and Li, Wentong and Cheng, Zesen and Zhang, Boqiang and Li, Long and Li, Xin and Zhao, Deli and Zhang, Wenqiao and Zhuang, Yueting and Zhu, Jianke and Bing, Lidong},
    title     = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {18970-18980}
}

@article{damonlpsg2025videollama3,
  title={VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding},
  author={Boqiang Zhang and Kehan Li and Zesen Cheng and Zhiqiang Hu and Yuqian Yuan and Guanzheng Chen and Sicong Leng and Yuming Jiang and Hang Zhang and Xin Li and Peng Jin and Wenqi Zhang and Fan Wang and Lidong Bing and Deli Zhao},
  journal={arXiv preprint arXiv:2501.13106},
  year={2025},
  url = {https://arxiv.org/abs/2501.13106}
}

Model size: 2B params (Safetensors, tensor type BF16)
Model tree for DAMO-NLP-SG/VideoRefer-VideoLLaMA3-2B

Base model: Qwen/Qwen2.5-1.5B (this model is fine-tuned from it)
