# VideoNSA: Native Sparse Attention for Video Understanding
## Model Description
VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to 128K vision-text tokens using only 3.6% of the full attention budget while maintaining competitive performance.
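As a rough, back-of-the-envelope illustration of what a 3.6% attention budget means at this context length (an assumption-based sketch, not a figure from the paper):

```python
# Illustrative arithmetic only: at a 128K-token context, full self-attention
# scores ~128K query-key pairs per query; a 3.6% budget corresponds to
# attending to roughly 4.6K keys per query on average.
full_keys_per_query = 128_000
sparse_fraction = 0.036
print(f"~{int(full_keys_per_query * sparse_fraction):,} attended keys per query")  # ~4,608
```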
## Key Features
- Learned Sparsity: Intelligently learns sparsity patterns over video tokens
- Efficient Scaling: Handles massive video contexts with minimal computational overhead
- Hybrid Attention: Combines compression, selection, and sliding window mechanisms
- Hardware-Aware: Optimized for efficient inference on modern GPUs
- Strong Performance: Achieves leading results on video understanding benchmarks
## Model Architecture
VideoNSA employs a hybrid attention strategy with three complementary branches:
- Compression Branch: Averages frame KV blocks to maintain salient visual cues
- Selection Branch: Ranks and retains the most informative video segments
- Sliding Window Branch: Ensures local temporal coverage for fine-grained details
Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks.
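As a minimal sketch of how per-head gating over the three branches could be wired (illustrative PyTorch only, not the released kernels; the compression, selection, and sliding-window attention outputs are assumed to be computed elsewhere):

```python
import torch
import torch.nn as nn

class GatedBranchMixer(nn.Module):
    """Illustrative sketch: mix the outputs of the compression, selection, and
    sliding-window branches with learnable per-head gates."""

    def __init__(self, num_heads: int, num_branches: int = 3):
        super().__init__()
        # One learnable gate logit per head and per branch.
        self.gate_logits = nn.Parameter(torch.zeros(num_heads, num_branches))

    def forward(self, branch_outputs: list[torch.Tensor]) -> torch.Tensor:
        # branch_outputs: list of [batch, heads, seq, head_dim] tensors, one per branch.
        stacked = torch.stack(branch_outputs, dim=-1)          # [B, H, S, D, 3]
        gates = torch.sigmoid(self.gate_logits)                # [H, 3]
        gates = gates.view(1, -1, 1, 1, gates.shape[-1])       # broadcast over B, S, D
        return (stacked * gates).sum(dim=-1)                   # gated sum of branch outputs
```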
## Training Details
- Base Model: Qwen2.5-VL-7B-Instruct
- Dataset: Filtered LLaVA-Video-178K
- Sampling Rate: 4 fps
- Context Limit: 36K tokens during training
- Compute: ~4,600 H100 GPU hours
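For quick reference, the reported setup can be summarized as a plain configuration dictionary (the key names below are ad hoc and do not correspond to an actual config file in the repository):

```python
# Summary of the training setup listed on this card; field names are illustrative.
training_setup = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "dataset": "filtered LLaVA-Video-178K",
    "video_sampling_fps": 4,
    "max_training_context_tokens": "36K",
    "compute": "~4,600 H100 GPU hours",
}
```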
## Usage
For installation, training, and evaluation instructions, please refer to:
- GitHub Repository
- Project Page
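As a minimal, hypothetical loading sketch, assuming the released checkpoint follows the standard Qwen2.5-VL interface on the Hugging Face Hub; the model ID below is a placeholder, and the GitHub repository remains the authoritative reference for installation and inference:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "<org>/VideoNSA"  # placeholder; see the GitHub repository for the actual checkpoint ID
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```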
## Limitations
- Tuned for video understanding tasks; may be suboptimal for image-only tasks
- Requires sufficient GPU memory for long video processing
- Performance may vary with different video resolutions and frame rates
## Citation
```bibtex
@misc{song2025videonsanativesparseattention,
  title={VideoNSA: Native Sparse Attention Scales Video Understanding},
  author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
  year={2025},
  eprint={2510.02295},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.02295},
}
```
## Resources
- Paper: https://arxiv.org/abs/2510.02295
- Project Page
- GitHub Repository
## License
This model is released under the Apache 2.0 License.