VideoNSA: Native Sparse Attention for Video Understanding

VideoNSA Overview

Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to 128K vision-text tokens using only 3.6% of the full attention budget while maintaining competitive performance.
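The headline numbers imply a large reduction in attention cost. A back-of-envelope calculation (illustrative arithmetic only; the real budget is allocated per branch and per head):

```python
# Rough attention-cost arithmetic for a 128K-token context at a 3.6% budget.
# Illustrative only: actual per-branch/per-head allocations differ.
context_len = 131_072                   # 128K vision-text tokens
full_pairs = context_len ** 2           # dense attention score entries
sparse_pairs = int(full_pairs * 0.036)  # entries under the 3.6% budget

print(f"dense attention entries:  {full_pairs:,}")
print(f"sparse attention entries: {sparse_pairs:,}")
```

At this scale, dense attention would score roughly 17 billion token pairs, while the sparse budget keeps it in the hundreds of millions.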

Key Features

  • 🎯 Learned Sparsity: Intelligently learns sparsity patterns over video tokens
  • 🚀 Efficient Scaling: Handles massive video contexts with minimal computational overhead
  • 🎬 Hybrid Attention: Combines compression, selection, and sliding window mechanisms
  • 🔧 Hardware-Aware: Optimized for efficient inference on modern GPUs
  • 📊 Strong Performance: Achieves leading results on video understanding benchmarks

Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

  1. Compression Branch: Averages frame KV blocks to maintain salient visual cues
  2. Selection Branch: Ranks and retains the most informative video segments
  3. Sliding Window Branch: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks.
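The per-head gating can be sketched in a few lines. The toy, framework-free illustration below uses hypothetical names and shapes (the actual model computes each branch with sparse attention kernels): each head softmax-normalizes three learnable gate logits and mixes the branch outputs accordingly.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a short list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_mix(branch_outputs, gate_logits):
    """Mix three branch outputs (compression, selection, sliding window)
    for one head using softmax-normalized learnable gates.
    branch_outputs: three equal-length vectors; gate_logits: three scalars.
    """
    gates = softmax(gate_logits)
    dim = len(branch_outputs[0])
    return [sum(g * out[i] for g, out in zip(gates, branch_outputs))
            for i in range(dim)]

# Toy head whose learned gates favor the selection branch.
mixed = gated_mix(
    [[1.0, 0.0],      # compression branch output
     [0.0, 1.0],      # selection branch output
     [0.5, 0.5]],     # sliding-window branch output
    [0.0, 2.0, 0.0],  # gate logits: selection dominates
)
```

With these logits the selection branch receives roughly 79% of the weight, so the mixed output tracks its contribution most closely; during training, gradients through the gate logits let each head learn its own branch allocation per task.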

Training Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Dataset: Filtered LLaVA-Video-178K
  • Sampling Rate: 4 fps
  • Context Limit: 36K tokens during training
  • Compute: ~4,600 H100 GPU hours

Usage

For installation, training, and evaluation instructions, please refer to the project repository.

Limitations

  • Optimized for video understanding tasks; may not be optimal for pure image tasks
  • Requires sufficient GPU memory for long video processing
  • Performance may vary with different video resolutions and frame rates

Citation

```bibtex
@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding},
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295},
}
```

License

This model is released under the Apache 2.0 License.

Model size: 9B parameters · Tensor type: BF16 (Safetensors)