Composing Concepts from Images and Videos via Concept-prompt Binding
Abstract
Bind & Compose binds visual concepts to prompt tokens with hierarchical binders in Diffusion Transformers and uses a temporal disentanglement strategy to accurately compose complex visual concepts from images and videos.
Visual concept composition aims to integrate different elements from images and videos into a single, coherent visual output; however, existing methods still fall short in accurately extracting complex concepts from visual inputs and in flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens, enabling accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when training with diversified prompts. To enhance the compatibility between image and video concepts, we present a Temporal Disentanglement Strategy that decouples the training of video concepts into two stages with a dual-branch binder structure for temporal modeling. Evaluations demonstrate that our method achieves superior concept consistency, prompt fidelity, and motion quality compared with existing approaches, opening up new possibilities for visual creativity.
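To make the concept-prompt binding idea concrete, below is a minimal, hypothetical PyTorch sketch. The `ConceptBinder` class, the additive per-token binding, and the single absorbent parameter are illustrative assumptions rather than the paper's implementation; the actual method uses a hierarchical binder for cross-attention conditioning in Diffusion Transformers, trains with diversified prompts, and adds a dual-branch binder for temporal modeling of video concepts.

```python
import torch
import torch.nn as nn


class ConceptBinder(nn.Module):
    """Hypothetical binder: ties learnable concept embeddings to prompt tokens.

    Each visual concept is associated with one prompt token; a learnable
    embedding is added at that token's position. An extra absorbent token is
    appended so concept-irrelevant details have somewhere to go during training.
    """

    def __init__(self, num_concepts: int, token_dim: int):
        super().__init__()
        # One learnable offset per concept, plus a single absorbent token.
        self.concept_embeds = nn.Parameter(torch.zeros(num_concepts, token_dim))
        self.absorbent_token = nn.Parameter(torch.zeros(1, 1, token_dim))

    def forward(self, prompt_tokens: torch.Tensor, bind_positions: list[int]) -> torch.Tensor:
        # prompt_tokens: (batch, seq_len, token_dim) from a frozen text encoder.
        bound = prompt_tokens.clone()
        for i, pos in enumerate(bind_positions):
            bound[:, pos] = bound[:, pos] + self.concept_embeds[i]
        # Append the absorbent token to the conditioning sequence.
        absorbent = self.absorbent_token.expand(bound.shape[0], -1, -1)
        return torch.cat([bound, absorbent], dim=1)


# Usage: the bound token sequence replaces the plain prompt tokens as the
# key/value context for the Diffusion Transformer's cross-attention layers.
binder = ConceptBinder(num_concepts=2, token_dim=4096)
prompt_tokens = torch.randn(1, 77, 4096)               # placeholder encoder output
context = binder(prompt_tokens, bind_positions=[3, 9])
print(context.shape)                                    # torch.Size([1, 78, 4096])
```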
Community
We introduce Bind & Compose (BiCo), a one-shot method that enables flexible visual concept composition by binding visual concepts with the corresponding prompt tokens and composing the target prompt with bound tokens from various sources.
- Project page: https://refkxh.github.io/BiCo_Webpage/
- GitHub: https://github.com/refkxh/bico
The following similar papers were recommended by the Semantic Scholar API:
- Plan-X: Instruct Video Generation via Semantic Planning (2025)
- ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation (2025)
- Unified Video Editing with Temporal Reasoner (2025)
- InstanceV: Instance-Level Video Generation (2025)
- HiCoGen: Hierarchical Compositional Text-to-Image Generation in Diffusion Models via Reinforcement Learning (2025)
- ConsistCompose: Unified Multimodal Layout Control for Image Composition (2025)
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls (2025)