---
language:
  - en
license: apache-2.0
tags:
  - video-understanding
  - sparse-attention
  - vision-language
  - qwen2.5-vl
  - multimodal
pipeline_tag: video-text-to-text
---

VideoNSA: Native Sparse Attention for Video Understanding

VideoNSA Overview

Model Description

VideoNSA is a learnable, hardware-aware sparse-attention framework built on Qwen2.5-VL-7B for efficient long video understanding. It processes up to 128K vision-text tokens using only 3.6% of the full attention budget while maintaining competitive performance.
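As a rough, back-of-the-envelope illustration of what the 3.6% budget means at this scale (interpreting it simply as the fraction of query-key pairs attended; not a measurement from the paper):

```python
# Back-of-the-envelope estimate of the attention budget at 128K tokens.
# The 3.6% figure comes from the description above; the rest is simple
# arithmetic for illustration, not a result reproduced from the paper.

seq_len = 128 * 1024                       # vision-text tokens in context
full_pairs = seq_len * (seq_len + 1) // 2  # query-key pairs in full causal attention
sparse_pairs = int(full_pairs * 0.036)     # pairs kept under a 3.6% budget
avg_keys_per_query = sparse_pairs / seq_len

print(f"full attention pairs: {full_pairs:,}")
print(f"3.6% sparse budget:   {sparse_pairs:,}")
print(f"avg keys per query:   {avg_keys_per_query:,.0f}")
```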

Key Features

  • 🎯 Learned Sparsity: Learns sparsity patterns over video tokens
  • 🚀 Efficient Scaling: Handles massive video contexts with minimal computational overhead
  • 🎬 Hybrid Attention: Combines compression, selection, and sliding window mechanisms
  • 🔧 Hardware-Aware: Optimized for efficient inference on modern GPUs
  • 📊 Strong Performance: Achieves leading results on video understanding benchmarks

Model Architecture

VideoNSA employs a hybrid attention strategy with three complementary branches:

  1. Compression Branch: Averages frame KV blocks to maintain salient visual cues
  2. Selection Branch: Ranks and retains the most informative video segments
  3. Sliding Window Branch: Ensures local temporal coverage for fine-grained details

Each branch is weighted by learnable per-head gates for adaptive token allocation across different tasks.
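The sketch below shows, in simplified form, how three such branches can be combined with learned gates for a single head. All names, block sizes, and the block-selection heuristic are illustrative assumptions; VideoNSA's actual hardware-aware kernels are considerably more involved.

```python
import torch
import torch.nn.functional as F

def branch_attention(q, k, v):
    # Plain scaled dot-product attention over whatever keys/values a branch kept.
    return F.scaled_dot_product_attention(q.unsqueeze(0), k.unsqueeze(0), v.unsqueeze(0))[0]

def hybrid_sparse_attention(q, k, v, gates, block=64, top_blocks=4, window=256):
    """Toy three-branch attention for one head.

    q, k, v: (seq, dim) tensors for a single head.
    gates:   three learned scalars (e.g. sigmoid outputs) weighting the branches.
    Shapes and heuristics here are illustrative, not VideoNSA's actual kernels.
    """
    seq, dim = k.shape
    n_blocks = seq // block

    # 1) Compression branch: average keys/values inside fixed-size blocks,
    #    keeping a coarse summary of the whole video context.
    k_cmp = k[: n_blocks * block].reshape(n_blocks, block, dim).mean(dim=1)
    v_cmp = v[: n_blocks * block].reshape(n_blocks, block, dim).mean(dim=1)
    out_cmp = branch_attention(q, k_cmp, v_cmp)

    # 2) Selection branch: score blocks against the queries and keep only the
    #    highest-scoring ones at full resolution.
    scores = q.mean(dim=0) @ k_cmp.T
    top = scores.topk(min(top_blocks, n_blocks)).indices
    idx = (top[:, None] * block + torch.arange(block)).reshape(-1)
    out_sel = branch_attention(q, k[idx], v[idx])

    # 3) Sliding-window branch: attend only to the most recent tokens for
    #    fine-grained local temporal detail.
    out_win = branch_attention(q, k[-window:], v[-window:])

    # Learned per-head gates decide how much each branch contributes.
    g_cmp, g_sel, g_win = gates
    return g_cmp * out_cmp + g_sel * out_sel + g_win * out_win
```

In this framing, the gates let each head shift its budget between coarse global summaries, the most informative segments, and local detail, which is what enables adaptive token allocation across tasks.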

Training Details

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Dataset: Filtered LLaVA-Video-178K
  • Sampling Rate: 4 fps
  • Context Limit: 36K tokens during training (see the rough calculation after this list)
  • Compute: ~4,600 H100 GPU hours
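To get a feel for how the 4 fps sampling rate interacts with the 36K-token training context, here is a rough calculation; the tokens-per-frame value is an assumption for illustration (it depends on resolution and the visual tokenizer) and is not stated in this card.

```python
# Rough back-of-the-envelope for the training configuration above.
# TOKENS_PER_FRAME is an illustrative assumption, not a number from this card.

FPS = 4                  # frame sampling rate
CONTEXT_LIMIT = 36_000   # token limit during training
TOKENS_PER_FRAME = 100   # assumed; varies with resolution and tokenizer

frames_that_fit = CONTEXT_LIMIT // TOKENS_PER_FRAME
seconds_covered = frames_that_fit / FPS

print(f"frames that fit in context: {frames_that_fit}")
print(f"video covered at {FPS} fps: ~{seconds_covered:.0f} s (~{seconds_covered / 60:.1f} min)")
```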

Usage

For installation, training, and evaluation instructions, please refer to the project's official resources (see Resources below).

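Because the checkpoint builds on Qwen2.5-VL-7B-Instruct, loading it is expected to follow the standard Qwen2.5-VL recipe in transformers. The snippet below is a sketch only: the model id is a placeholder, and the custom sparse-attention kernels may require trust_remote_code or the project's own codebase.

```python
# Sketch only: the model id is a placeholder; follow the official project
# instructions for the exact loading and video preprocessing steps.
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "path/to/VideoNSA"  # placeholder

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,  # likely needed for the custom attention
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Video preprocessing (frame sampling, pixel limits) follows the upstream
# Qwen2.5-VL recipe; see the project resources for end-to-end examples.
```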
Limitations

  • Optimized for video understanding tasks; may be less suited to image-only tasks
  • Requires sufficient GPU memory for long video processing
  • Performance may vary with different video resolutions and frame rates

Citation

@misc{song2025videonsanativesparseattention,
      title={VideoNSA: Native Sparse Attention Scales Video Understanding}, 
      author={Enxin Song and Wenhao Chai and Shusheng Yang and Ethan Armand and Xiaojun Shan and Haiyang Xu and Jianwen Xie and Zhuowen Tu},
      year={2025},
      eprint={2510.02295},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.02295}, 
}

Resources

  • Paper: https://arxiv.org/abs/2510.02295

License

This model is released under the Apache 2.0 License.