# VideoNSA: Native Sparse Attention for Video Understanding
## Model Description
VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to 128K vision-text tokens using only 3.6% of the full attention budget while maintaining competitive performance.
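As a rough, back-of-the-envelope illustration of what a 3.6% attention budget means at this context length (an assumption-based sketch, not a figure from the paper):

```python
# Illustrative arithmetic only: at a 128K-token context, full self-attention
# scores ~128K query-key pairs per query; a 3.6% budget corresponds to
# attending to roughly 4.6K keys per query on average.
full_keys_per_query = 128_000
sparse_fraction = 0.036
print(f"~{int(full_keys_per_query * sparse_fraction):,} attended keys per query")  # ~4,608
```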
## Key Features
- Learned Sparsity: Intelligently learns sparsity patterns over video tokens
- Efficient Scaling: Handles massive video contexts with minimal computational overhead
- Hybrid Attention: Combines compression, selection, and sliding window mechanisms
- Hardware-Aware: Optimized for efficient inference on modern GPUs
- Strong Performance: Achieves leading results on video understanding benchmarks
## Model Architecture
VideoNSA employs a hybrid attention strategy with three complementary branches:
- Compression Branch: Averages frame KV blocks to maintain salient visual cues
- Selection Branch: Ranks and retains the most informative video segments
- Sliding Window Branch: Ensures local temporal coverage for fine-grained details
Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks.
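As a minimal sketch of how per-head gating over the three branches could be wired (illustrative PyTorch only, not the released kernels; the compression, selection, and sliding-window attention outputs are assumed to be computed elsewhere):

```python
import torch
import torch.nn as nn

class GatedBranchMixer(nn.Module):
    """Illustrative sketch: mix the outputs of the compression, selection, and
    sliding-window branches with learnable per-head gates."""

    def __init__(self, num_heads: int, num_branches: int = 3):
        super().__init__()
        # One learnable gate logit per head and per branch.
        self.gate_logits = nn.Parameter(torch.zeros(num_heads, num_branches))

    def forward(self, branch_outputs: list[torch.Tensor]) -> torch.Tensor:
        # branch_outputs: list of [batch, heads, seq, head_dim] tensors, one per branch.
        stacked = torch.stack(branch_outputs, dim=-1)          # [B, H, S, D, 3]
        gates = torch.sigmoid(self.gate_logits)                # [H, 3]
        gates = gates.view(1, -1, 1, 1, gates.shape[-1])       # broadcast over B, S, D
        return (stacked * gates).sum(dim=-1)                   # gated sum of branch outputs
```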
## Training Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Dataset: Filtered LLaVA-Video-178K
- Sampling Rate: 4 fps
- Context Limit: 36K tokens during training
- Compute: ~4,600 H100 GPU hours
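For quick reference, the reported setup can be summarized as a plain configuration dictionary (the key names below are ad hoc and do not correspond to an actual config file in the repository):

```python
# Summary of the training setup listed on this card; field names are illustrative.
training_setup = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "dataset": "filtered LLaVA-Video-178K",
    "video_sampling_fps": 4,
    "max_training_context_tokens": "36K",
    "compute": "~4,600 H100 GPU hours",
}
```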
## Usage
For installation, training, and evaluation instructions, please refer to:
- GitHub Repository
- Project Page
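As a minimal, hypothetical loading sketch, assuming the released checkpoint follows the standard Qwen2.5-VL interface on the Hugging Face Hub; the model ID below is a placeholder, and the GitHub repository remains the authoritative reference for installation and inference:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "<org>/VideoNSA"  # placeholder; see the GitHub repository for the actual checkpoint ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```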
## Limitations
- Tuned for video understanding tasks; may be suboptimal for image-only tasks
- Requires sufficient GPU memory for long video processing
- Performance may vary with different video resolutions and frame rates
## Citation
```bibtex
@misc{song2025videonsanativesparseattention,
  title={VideoNSA: Native Sparse Attention Scales Video Understanding},
  author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
  year={2025},
  eprint={2510.02295},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.02295},
}
```
## Resources
- Paper: https://arxiv.org/abs/2510.02295
- Project Page
- GitHub Repository
## License
This model is released under the Apache 2.0 License.