---
license: other
library_name: transformers
---
# **OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**
[Paper](https://arxiv.org/abs/2510.15870) | [Code](https://github.com/NVlabs/OmniVinci) | [Model](https://huggingface.co/nvidia/omnivinci) | [Project Page](https://nvlabs.github.io/OmniVinci)
## Introduction
OmniVinci is an NVIDIA research project focused on exploring omni-modal LLMs that can not only see and read but also listen, speak, and reason.
OmniVinci is among the strongest omni-modal understanding models; see the paper and project page for results on popular omni-modal, audio, and vision benchmarks.
## Quickstart

Below are simple examples showing how to use the model with 🤗 Transformers.

### Environment Setup

1. Download the Hugging Face repository and navigate into it:

```bash
huggingface-cli download nvidia/omnivinci --local-dir ./omnivinci --local-dir-use-symlinks False
cd ./omnivinci
```

2. Install the Python environment (based on the NVILA codebase):

```bash
bash ./environment_setup.sh omnivinci
```

### 🤗 Transformers Usage

#### Video (with Audio) Inference Example

```python
from transformers import AutoConfig, AutoModel, AutoProcessor
import torch

# Paths to the local model checkout and to an input video (placeholder).
model_path = "./"
video_path = "xxx.mp4"

# Generation settings.
generation_kwargs = {"max_new_tokens": 1024, "max_length": 99999999}

# Media settings: load the audio track from the video, sample up to 128 video
# frames, and cap the audio chunk length.
load_audio_in_video = True
num_video_frames = 128
audio_length = "max_3600"

# Load config, model, and processor using the model's custom remote code.
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

generation_config = model.default_generation_config
generation_config.update(**generation_kwargs)

# Propagate the media settings to both the model and the processor configs.
model.config.load_audio_in_video = load_audio_in_video
processor.config.load_audio_in_video = load_audio_in_video
if num_video_frames > 0:
    model.config.num_video_frames = num_video_frames
    processor.config.num_video_frames = num_video_frames
if audio_length != -1:
    model.config.audio_chunk_length = audio_length
    processor.config.audio_chunk_length = audio_length

# Build a chat-style request containing the video and a text instruction.
conversation = [{
    "role": "user",
    "content": [
        {"type": "video", "video": video_path},
        {"type": "text", "text": "Assess the video, followed by a detailed description of its video and audio contents."},
    ],
}]

# Apply the chat template, preprocess the media, and generate.
text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor([text])
output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, "media", None),
    media_config=getattr(inputs, "media_config", None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```

- **For audio and image inference examples, please refer to `example_mini_audio.py` and `example_mini_image.py`.** An illustrative audio-only sketch is also included at the end of this card.

## License / Terms of Use

The model is released under the [NVIDIA OneWay Noncommercial License](asset/NVIDIA_OneWay_Noncommercial_License.docx).

## Citation

Please consider citing our paper and this framework if they are helpful in your research.

```bibtex
@article{omnivinci2025,
  title={OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM},
  author={Hanrong Ye and Chao-Han Huck Yang and Arushi Goel and Wei Huang and Ligeng Zhu and Yuanhang Su and Sean Lin and An-Chieh Cheng and Zhen Wan and Jinchuan Tian and Yuming Lou and Dong Yang and Zhijian Liu and Yukang Chen and Ambrish Dantrey and Ehsan Jahangiri and Sreyan Ghosh and Daguang Xu and Ehsan Hosseini-Asl and Danial Mohseni Taheri and Vidya Murali and Sifei Liu and Jason Lu and Oluwatobi Olabiyi and Frank Wang and Rafael Valle and Bryan Catanzaro and Andrew Tao and Song Han and Jan Kautz and Hongxu Yin and Pavlo Molchanov},
  journal={arXiv preprint arXiv:2510.15870},
  year={2025},
}
```
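
## Appendix: Audio-Only Inference Sketch

The snippet below mirrors the video example above for an audio-only input. It is a minimal sketch, not the reference implementation: the `{"type": "audio", "audio": ...}` content entry, the `xxx.wav` placeholder path, and the reuse of `audio_chunk_length` are assumptions carried over from the video example, so prefer `example_mini_audio.py` in the repository wherever they differ.

```python
from transformers import AutoModel, AutoProcessor
import torch

# Placeholder paths; "xxx.wav" is hypothetical, point it at a real audio file.
model_path = "./"
audio_path = "xxx.wav"

# Load model and processor with the custom remote code, as in the video example.
model = AutoModel.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

generation_config = model.default_generation_config
generation_config.update(max_new_tokens=1024)

# Assumption: the audio chunk length setting applies to standalone audio as well.
model.config.audio_chunk_length = "max_3600"
processor.config.audio_chunk_length = "max_3600"

conversation = [{
    "role": "user",
    "content": [
        # Assumption: the processor accepts an "audio" content type analogous to "video".
        {"type": "audio", "audio": audio_path},
        {"type": "text", "text": "Describe the audio content in detail."},
    ],
}]

text = processor.apply_chat_template(conversation, tokenize=False, add_generation_prompt=True)
inputs = processor([text])
output_ids = model.generate(
    input_ids=inputs.input_ids,
    media=getattr(inputs, "media", None),
    media_config=getattr(inputs, "media_config", None),
    generation_config=generation_config,
)
print(processor.tokenizer.batch_decode(output_ids, skip_special_tokens=True))
```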