---
language:
  - en
license: apache-2.0
tags:
  - video-understanding
  - sparse-attention
  - vision-language
  - qwen2.5-vl
  - multimodal
pipeline_tag: video-text-to-text
---

VideoNSA: Native Sparse Attention for Video Understanding

VideoNSA Overview

Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to 128K vision-text tokens using only 3.6% of the full attention budget while maintaining competitive performance.
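As a rough, back-of-the-envelope illustration of what the 3.6% budget means at this scale (interpreting it simply as the fraction of query-key pairs attended; not a measurement from the paper):

```python
# Back-of-the-envelope estimate of the attention budget at 128K tokens.
# The 3.6% figure comes from the description above; the rest is simple
# arithmetic for illustration, not a result reproduced from the paper.

seq_len = 128 * 1024                       # vision-text tokens in context
full_pairs = seq_len * (seq_len + 1) // 2  # query-key pairs in full causal attention
sparse_pairs = int(full_pairs * 0.036)     # pairs kept under a 3.6% budget
avg_keys_per_query = sparse_pairs / seq_len

print(f"full attention pairs: {full_pairs:,}")
print(f"3.6% sparse budget:   {sparse_pairs:,}")
print(f"avg keys per query:   {avg_keys_per_query:,.0f}")
```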

Key Features

  • 🎯 Learned Sparsity: Learns sparsity patterns over video tokens
  • 🚀 Efficient Scaling: Handles massive video contexts with minimal computational overhead
  • 🎬 Hybrid Attention: Combines compression, selection, and sliding window mechanisms
  • 🔧 Hardware-Aware: Optimized for efficient inference on modern GPUs
  • 📊 Strong Performance: Achieves leading results on video understanding benchmarks

Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

  1. Compression Branch: Averages frame KV blocks to maintain salient visual cues
  2. Selection Branch: Ranks and retains the most informative video segments
  3. Sliding Window Branch: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks.
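The sketch below shows, in simplified form, how three such branches can be combined with learned gates for a single head. All names, block sizes, and the block-selection heuristic are illustrative assumptions; VideoNSA's actual hardware-aware kernels are considerably more involved.

```python
import torch
import torch.nn.functional as F

def branch_attention(q, k, v):
    # Plain scaled dot-product attention over whatever keys/values a branch kept.
    return F.scaled_dot_product_attention(q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0))[0]

def hybrid_sparse_attention(q, k, v, gates, block=64, top_blocks=4, window=256):
    """Toy three-branch attention for one head.

    q, k, v: (seq, dim) tensors for a single head.
    gates:   three learned scalars (e.g. sigmoid outputs) weighting the branches.
    Shapes and heuristics here are illustrative, not VideoNSA's actual kernels.
    """
    seq, dim = k.shape
    n_blocks = seq // block

    # 1) Compression branch: average keys/values inside fixed-size blocks,
    #    keeping a coarse summary of the whole video context.
    k_cmp = k[: n_blocks * block].reshape(n_blocks, block, dim).mean(dim=1)
    v_cmp = v[: n_blocks * block].reshape(n_blocks, block, dim).mean(dim=1)
    out_cmp = branch_attention(q, k_cmp, v_cmp)

    # 2) Selection branch: score blocks against the queries and keep only the
    #    highest-scoring ones at full resolution.
    scores = q.mean(dim=0) @ k_cmp.T
    top = scores.topk(min(top_blocks, n_blocks)).indices
    idx = (top[:, None] * block + torch.arange(block)).reshape(-1)
    out_sel = branch_attention(q, k[idx], v[idx])

    # 3) Sliding-window branch: attend only to the most recent tokens for
    #    fine-grained local temporal detail.
    out_win = branch_attention(q, k[-window:], v[-window:])

    # Learned per-head gates decide how much each branch contributes.
    g_cmp, g_sel, g_win = gates
    return g_cmp * out_cmp + g_sel * out_sel + g_win * out_win
```

In this framing, the gates let each head shift its budget between coarse global summaries, the most informative segments, and local detail, which is what enables adaptive token allocation across tasks.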

Training Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Dataset: Filtered LLaVA-Video-178K
  • Sampling Rate: 4 fps
  • Context Limit: 36K tokens during training (see the rough calculation after this list)
  • Compute: ~4,600 H100 GPU hours
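To get a feel for how the 4 fps sampling rate interacts with the 36K-token training context, here is a rough calculation; the tokens-per-frame value is an assumption for illustration (it depends on resolution and the visual tokenizer) and is not stated in this card.

```python
# Rough back-of-the-envelope for the training configuration above.
# TOKENS_PER_FRAME is an illustrative assumption, not a number from this card.

FPS = 4                  # frame sampling rate
CONTEXT_LIMIT = 36_000   # token limit during training
TOKENS_PER_FRAME = 100   # assumed; varies with resolution and tokenizer

frames_that_fit = CONTEXT_LIMIT // TOKENS_PER_FRAME
seconds_covered = frames_that_fit / FPS

print(f"frames that fit in context: {frames_that_fit}")
print(f"video covered at {FPS} fps: ~{seconds_covered:.0f} s (~{seconds_covered / 60:.1f} min)")
```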

Usage

For installation, training, and evaluation instructions, please refer to the project's official resources (see Resources below).

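Because the checkpoint builds on Qwen2.5-VL-7B-Instruct, loading it is expected to follow the standard Qwen2.5-VL recipe in transformers. The snippet below is a sketch only: the model id is a placeholder, and the custom sparse-attention kernels may require trust_remote_code or the project's own codebase.

```python
# Sketch only: the model id is a placeholder; follow the official project
# instructions for the exact loading and video preprocessing steps.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/VideoNSA"  # placeholder

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # likely needed for the custom attention
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Video preprocessing (frame sampling, pixel limits) follows the upstream
# Qwen2.5-VL recipe; see the project resources for end-to-end examples.
```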
Limitations

  • Optimized for video understanding tasks; may be less suited to image-only tasks
  • Requires sufficient GPU memory for long video processing
  • Performance may vary with different video resolutions and frame rates

Citation

@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding}, 
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295}, 
}

Resources

  • Paper: https://arxiv.org/abs/2510.02295

License

This model is released under the Apache 2.0 License.