MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation
Abstract
MultiBanana is a benchmark dataset for evaluating multi-reference text-to-image generation models under challenging conditions, such as many references, domain and scale mismatches, rare concepts, and multilingual text rendering, providing insights into model strengths and weaknesses.
Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: inheriting the appearance of subjects from multiple reference images and re-rendering them in new contexts. However, existing benchmark datasets typically focus on generation with a single or only a few reference images, which makes it difficult to measure how model performance advances, or to pinpoint model weaknesses, under different multi-reference conditions. In addition, their task definitions remain vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to probe the limits of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals their relative strengths, typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries of multi-reference image generation and establish a standardized basis for fair comparison. Our data and code are available at https://github.com/matsuolab/multibanana.
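For concreteness, the sketch below illustrates how a single benchmark entry covering these five axes might be represented. The schema, field names, and values are illustrative assumptions, not the released format; see the repository above for the actual data layout.

```python
# Hypothetical sketch of a MultiBanana-style benchmark record. The schema,
# field names, and values are illustrative assumptions, not the released
# format (see https://github.com/matsuolab/multibanana for the actual data).
from dataclasses import dataclass, field


@dataclass
class MultiRefSample:
    prompt: str                     # target-scene description
    reference_images: list[str]     # paths/URLs to references (axis 1: count varies)
    reference_domains: list[str]    # e.g., "photo", "anime" (axis 2: domain mismatch)
    scale_ratio: float              # reference-to-target scale (axis 3: scale mismatch)
    rare_concepts: list[str] = field(default_factory=list)        # axis 4: e.g., "red banana"
    text_to_render: dict[str, str] = field(default_factory=dict)  # axis 5: language -> string


sample = MultiRefSample(
    prompt="The two characters share a red banana at a market stall.",
    reference_images=["ref/char_photo.png", "ref/char_anime.png", "ref/banana.png"],
    reference_domains=["photo", "anime", "photo"],
    scale_ratio=0.25,
    rare_concepts=["red banana"],
    text_to_render={"ja": "いらっしゃいませ"},
)
```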
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment (2025)
- UniGenBench++: A Unified Semantic Evaluation Benchmark for Text-to-Image Generation (2025)
- Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing (2025)
- PSR: Scaling Multi-Subject Personalized Image Generation with Pairwise Subject-Consistency Rewards (2025)
- Canvas-to-Image: Compositional Image Generation with Multimodal Controls (2025)
- GIR-Bench: Versatile Benchmark for Generating Images with Reasoning (2025)
- Beyond the Pixels: VLM-based Evaluation of Identity Preservation in Reference-Guided Synthesis (2025)
