NEXUS-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
Che Liu, Yingji Zhang, Dong Zhang, Weijie Zhang, Chenggong Gong, Yu Lu, Shilin Zhou, Ziliang Gan, Ziao Wang, Haipang Wu, Ji Liu, Andre Freitas, Qifan Wang, Zenglin Xu, Rongjunchen Zhang♠, Yong Dai♠
📖Paper |🤗Model | 🤗Training Data (Coming Soon)
NEXUS-O is an industry-scale omni-modal large language model (LLM) that unifies audio, vision, and language understanding in a single modular framework. Human perception integrates sight, sound, and language; NEXUS-O aims to replicate this ability for intelligent agents in real-world scenarios such as automatic speech recognition (ASR), speech-to-speech (S2S) chat, and multimodal reasoning.
Architecture of NEXUS-O
Training Stages
📢 News
- 🚀 [08/01/2025] Our paper has been accepted to ACM MM 2025.
💡 Highlights
- 🧩 Modular End-to-End Framework. A highly configurable encoder–LLM–decoder architecture supporting flexible modality combinations and rapid iteration for industry applications.
- 💡 Lightweight Alignment Strategy. Efficient audio–language pre-training built upon the state-of-the-art Qwen2.5-VL model — eliminating the need for costly vision pre-training while retaining strong tri-modal performance.
- 🎧 Synthetic Audio Data Pipeline. A scalable audio synthesis system that generates diverse, high-fidelity audio-text pairs from real-world scenes, enabling robust downstream ASR and S2S tasks.
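The modular encoder–LLM–decoder idea in the highlights can be pictured with a small sketch. This is not NEXUS-O's code: the class `ModularOmniPipeline`, the toy encoders and decoders, and `echo_llm` are all hypothetical stand-ins, assumed only to show how modality-specific components might be plugged in and combined around a shared language model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Tokens = List[str]


@dataclass
class ModularOmniPipeline:
    """Toy encoder -> LLM -> decoder pipeline (illustrative only, not NEXUS-O's code)."""

    # Hypothetical registries: modality name -> component callable.
    encoders: Dict[str, Callable[[Any], Tokens]] = field(default_factory=dict)
    decoders: Dict[str, Callable[[Tokens], Any]] = field(default_factory=dict)

    def encode(self, inputs: Dict[str, Any]) -> Tokens:
        """Encode each provided modality and concatenate the token sequences."""
        tokens: Tokens = []
        for modality, value in inputs.items():
            if modality not in self.encoders:
                raise KeyError(f"no encoder registered for modality '{modality}'")
            tokens.extend(self.encoders[modality](value))
        return tokens

    def run(self, inputs: Dict[str, Any], llm: Callable[[Tokens], Tokens],
            output_modality: str = "text") -> Any:
        """Encode the inputs, pass them through the LLM, then decode the result."""
        if output_modality not in self.decoders:
            raise KeyError(f"no decoder registered for modality '{output_modality}'")
        return self.decoders[output_modality](llm(self.encode(inputs)))


# Toy components standing in for real audio/vision encoders and a speech decoder.
pipeline = ModularOmniPipeline(
    encoders={
        "text": lambda s: s.split(),
        "audio": lambda samples: [f"<audio:{len(samples)}>"],
        "image": lambda pixels: [f"<image:{len(pixels)}>"],
    },
    decoders={
        "text": lambda toks: " ".join(toks),
        "speech": lambda toks: [len(t) for t in toks],  # fake "waveform"
    },
)


def echo_llm(tokens: Tokens) -> Tokens:
    """Placeholder LLM that returns its input unchanged."""
    return tokens


text_out = pipeline.run({"audio": [0.1, 0.2, 0.3], "text": "what was said"}, echo_llm)
speech_out = pipeline.run({"text": "hello there"}, echo_llm, output_modality="speech")
```

Because each modality is only a registry entry in this sketch, dropping the image encoder or adding a speech decoder changes configuration rather than pipeline logic, which is one plausible reading of "flexible modality combinations".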
TODO
- Release NEXUS-O full model weights on Hugging Face
- Release Audio Encoder Training Data
- Release Audio Decoder Training Data
✒️Citation
@article{liu2025nexus,
title={Nexus: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision},
author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others},
journal={arXiv preprint arXiv:2503.01879},
year={2025}
}
📄 License
Usage and License Notices: The data and code are intended and licensed for research use only.
License: Attribution-NonCommercial 4.0 International. Usage should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use
💖 Acknowledgement
Model tree for HiThink-Research/NEXUS-O
Base model: Qwen/Qwen2-Audio-7B-Instruct