UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Abstract
UniMMVSR is a unified generative video super-resolution framework that incorporates hybrid-modal conditions, including text, images, and videos, within a latent video diffusion model, achieving superior detail and conformity to multi-modal conditions.
Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
Community
TL;DR: We propose UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos, which supports 4K controllable video generation for the first time.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- InfVSR: Breaking Length Limits of Generic Video Super-Resolution (2025)
- VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning (2025)
- UniVideo: Unified Understanding, Generation, and Editing for Videos (2025)
- BindWeave: Subject-Consistent Video Generation via Cross-Modal Integration (2025)
- Lynx: Towards High-Fidelity Personalized Video Generation (2025)
- SSG-Dit: A Spatial Signal Guided Framework for Controllable Video Generation (2025)
- Towards Redundancy Reduction in Diffusion Models for Efficient Video Super-Resolution (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
 You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: 
@librarian-bot
	 recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
 
	 
					 
					
 
						