UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers
Abstract
UltraViCo addresses video length extrapolation in video diffusion transformers by suppressing attention dispersion, improving quality and reducing repetition beyond the training length.
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dispersion directly causes quality degradation; repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by the harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we substantially outperform a broad set of baselines across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, our method improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, it generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
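The core operation is simple enough to sketch. Below is a minimal, self-contained illustration of the idea as described in the abstract; the function name, the `decay` value, and the choice of applying the factor as a log-space bias on attention logits are assumptions for illustration, not the paper's exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def attention_with_window_decay(q, k, v, train_len, decay=0.5):
    """
    Scaled dot-product attention that suppresses keys beyond the training
    window by a constant decay factor, counteracting attention dispersion.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    train_len: number of tokens the model was trained on; key positions
        >= train_len are treated as "beyond the training window".
    decay: constant factor (< 1, assumed value here) multiplying the
        unnormalized attention weight of out-of-window keys.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale
    seq_len = k.shape[-2]
    if seq_len > train_len:
        # Adding log(decay) to a logit multiplies its softmax weight by
        # `decay` before renormalization, so out-of-window tokens can no
        # longer dilute the attention pattern learned during training.
        bias = logits.new_zeros(seq_len)
        bias[train_len:] = math.log(decay)
        logits = logits + bias  # broadcasts over batch, heads, queries
    attn = F.softmax(logits, dim=-1)
    return torch.einsum("bhqk,bhkd->bhqd", attn, v)

# Example: extrapolating to 4x a 512-token training window.
q = k = v = torch.randn(1, 8, 2048, 64)
out = attention_with_window_decay(q, k, v, train_len=512)
```

Because the change amounts to a single additive bias on attention logits, it can in principle be dropped into an existing attention implementation without retraining, which is consistent with the training-free, plug-and-play claim.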
Community
Achieving 3x to 4x video DiT length extrapolation in a plug-and-play way!
Paper: https://arxiv.org/abs/2511.20123
Project Page: https://thu-ml.github.io/UltraViCo.github.io/
The following similar papers were recommended by the Semantic Scholar API:
- InfVSR: Breaking Length Limits of Generic Video Super-Resolution (2025)
- DyPE: Dynamic Position Extrapolation for Ultra High Resolution Diffusion (2025)
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution (2025)
- BachVid: Training-Free Video Generation with Consistent Background and Character (2025)
- Pack and Force Your Memory: Long-form and Consistent Video Generation (2025)
- Rolling Forcing: Autoregressive Long Video Diffusion in Real Time (2025)
- VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning (2025)