---
license: apache-2.0
language:
  - en
metrics:
  - accuracy
library_name: transformers
pipeline_tag: video-text-to-text
tags:
  - multimodal large language model
  - large video-language model
base_model:
  - DAMO-NLP-SG/VideoLLaMA3-2B-Image
---

# PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity

arXiv preprint · Dataset · Model · Benchmark · Homepage · Hugging Face

## 📰 News

## 🌏 Model Zoo
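The weights are released for the 🤗 Transformers library (see `library_name: transformers` and the `video-text-to-text` pipeline tag above). The exact inference interface is model-specific and not documented on this card, so the snippet below is only a minimal loading sketch: the repo id `DAMO-NLP-SG/PixelRefer-Lite-2B`, the use of `trust_remote_code`, and the availability of an `AutoProcessor` are assumptions, not the confirmed API.

```python
# Minimal, hypothetical loading sketch for this checkpoint.
# Assumptions (not confirmed by this card): the repo id below, that the
# checkpoint ships custom modeling code (hence trust_remote_code=True),
# and that an AutoProcessor is provided for video/text inputs.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "DAMO-NLP-SG/PixelRefer-Lite-2B"  # assumed repo id

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```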

## 📑 Citation

If you find PixelRefer or the VideoRefer Suite useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{yuan2025pixelrefer,
  title     = {PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity},
  author    = {Yuqian Yuan and Wenqiao Zhang and Xin Li and Shihao Wang and Kehan Li and Wentong Li and Jun Xiao and Lei Zhang and Beng Chin Ooi},
  year      = {2025},
  journal   = {arXiv},
}

@inproceedings{yuan2025videorefer,
  title     = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author    = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and others},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages     = {18970--18980},
  year      = {2025},
}
```