NEXUS-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Che Liu , Yingji Zhang , Dong Zhang , Weijie Zhang , Chenggong Gong , Yu Lu , Shilin Zhou , Ziliang Gan ,
Ziao Wang, Haipang Wu, Ji Liu, Andre Freitas, Qifan Wang, Zenglin Xu,
Rongjunchen Zhang^♠, Yong Dai^♠

^♠Corresponding author, [email protected], [email protected]

📖Paper |🤗Model | 🤗Training Data (Coming Soon)

NEXUS-O is an industry-scale omni-modal large language model (LLM) that unifies audio, vision, and language understanding into a single modular framework. Human perception integrates sight, sound, and language — NEXUS-O aims to replicate this ability for intelligent agents across real-world scenarios such as ASR, Speech-to-Speech Chat, and Multimodal Reasoning.

Architecture of NEXUS-O

Training Stages

📢 News

🚀 [08/01/2025] Our paper has been accepted for ACM MM 2025.

💡 Highlights

🧩 Modular End-to-End Framework. A highly configurable encoder–LLM–decoder architecture supporting flexible modality combinations and rapid iteration for industry applications.
💡 Lightweight Alignment Strategy. Efficient audio–language pre-training built upon the state-of-the-art Qwen2.5-VL model — eliminating the need for costly vision pre-training while retaining strong tri-modal performance.
🎧 Synthetic Audio Data Pipeline. A scalable audio synthesis system that generates diverse, high-fidelity audio-text pairs from real-world scenes, enabling robust downstream ASR and S2S tasks.
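
As a rough illustration of the modular encoder–LLM–decoder idea in the first highlight, the sketch below wires pluggable per-modality encoders and decoders around a shared core. It is a toy under stated assumptions, not the NEXUS-O implementation: `OmniPipeline`, the `toy_*` functions, and the list-of-floats "embeddings" are all hypothetical stand-ins for the real neural components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of these names come from NEXUS-O.
Embedding = List[float]
Encoder = Callable[[object], Embedding]
Decoder = Callable[[Embedding], object]


@dataclass
class OmniPipeline:
    """Toy encoder -> LLM -> decoder pipeline with pluggable modalities."""

    llm: Callable[[List[Embedding]], Embedding]
    encoders: Dict[str, Encoder] = field(default_factory=dict)
    decoders: Dict[str, Decoder] = field(default_factory=dict)

    def run(self, inputs: Dict[str, object], output_modality: str) -> object:
        # Reject modalities that have no registered encoder or decoder.
        missing = [name for name in inputs if name not in self.encoders]
        if missing:
            raise KeyError(f"no encoder registered for: {missing}")
        if output_modality not in self.decoders:
            raise KeyError(f"no decoder registered for: {output_modality}")
        # Encode each input modality independently, in a stable order.
        encoded = [self.encoders[name](inputs[name]) for name in sorted(inputs)]
        # The shared core fuses all encoded inputs into one hidden state.
        hidden = self.llm(encoded)
        return self.decoders[output_modality](hidden)


# Stand-in components; real ones would be neural networks.
def toy_text_encoder(text: str) -> Embedding:
    return [float(len(text))]


def toy_audio_encoder(samples: List[float]) -> Embedding:
    return [float(sum(samples))]


def toy_llm(embeddings: List[Embedding]) -> Embedding:
    return [sum(e[0] for e in embeddings)]


def toy_text_decoder(hidden: Embedding) -> str:
    return f"score={hidden[0]:.1f}"


pipeline = OmniPipeline(
    llm=toy_llm,
    encoders={"text": toy_text_encoder, "audio": toy_audio_encoder},
    decoders={"text": toy_text_decoder},
)
result = pipeline.run({"text": "hello", "audio": [0.5, 0.5]}, "text")
```

Adding or removing a modality in this sketch only means changing the `encoders` or `decoders` dictionaries, which is the kind of flexibility the highlight describes; how NEXUS-O actually composes its modules is not specified here.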

TODO

Release NEXUS-O full model weights on HuggingFace
Release Audio Encoder Training Data
Release Audio Decoder Training Data

✒️Citation

@article{liu2025nexus,
  title={Nexus: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision},
  author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others},
  journal={arXiv preprint arXiv:2503.01879},
  year={2025}
}

📄 License

Usage and License Notices: The data and code are intended and licensed for research use only. License: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). Use should also abide by the policy of OpenAI: https://openai.com/policies/terms-of-use

💖 Acknowledgement

Model size: 10B params (Safetensors, tensor type BF16)

Model tree for HiThink-Research/NEXUS-O

Base model: Qwen/Qwen2-Audio-7B-Instruct (this model is a fine-tune of it)