Image Segmentation Using Text and Image Prompts
Abstract
A model using CLIP as a backbone and a transformer-based decoder generates image segmentations from arbitrary text or image prompts, supporting various binary segmentation tasks.
Image segmentation is usually addressed by training a model for a fixed set of object classes. Incorporating additional classes or more complex queries later is expensive as it requires re-training the model on a dataset that encompasses these expressions. Here we propose a system that can generate image segmentations based on arbitrary prompts at test time. A prompt can be either a text or an image. This approach enables us to create a unified model (trained once) for three common segmentation tasks, which come with distinct challenges: referring expression segmentation, zero-shot segmentation and one-shot segmentation. We build upon the CLIP model as a backbone which we extend with a transformer-based decoder that enables dense prediction. After training on an extended version of the PhraseCut dataset, our system generates a binary segmentation map for an image based on a free-text prompt or on an additional image expressing the query. We analyze different variants of the latter image-based prompts in detail. This novel hybrid input allows for dynamic adaptation not only to the three segmentation tasks mentioned above, but to any binary segmentation task where a text or image query can be formulated. Finally, we find our system to adapt well to generalized queries involving affordances or properties. Code is available at https://eckerlab.org/code/clipseg.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation (2024)
- Enhancing Vision-Language Model with Unmasked Token Alignment (2024)
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation (2024)
- Cascade-CLIP: Cascaded Vision-Language Embeddings Alignment for Zero-Shot Semantic Segmentation (2024)
- Zero Shot Context-Based Object Segmentation using SLIP (SAM+CLIP) (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
 You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: 
@librarian-bot
	 recommend
Models citing this paper 7
Browse 7 models citing this paperDatasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 120
Collections including this paper 0
No Collection including this paper
