Abstract
MedSAM-3, a text-promptable medical segmentation model fine-tuned on SAM 3 architecture, achieves superior performance across various medical imaging modalities using semantic conceptual labels and multimodal large language models.
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models. We will release our code and model at https://github.com/Joey-S-Liu/MedSAM3.
Community
Medical image segmentation is fundamental for biomedical discovery. Existing methods lack generalizability and demand extensive, time-consuming manual annotation for new clinical application. Here, we propose MedSAM-3, a text promptable medical segmentation model for medical image and video segmentation. By fine-tuning the Segment Anything Model (SAM) 3 architecture on medical images paired with semantic conceptual labels, our MedSAM-3 enables medical Promptable Concept Segmentation (PCS), allowing precise targeting of anatomical structures via open-vocabulary text descriptions rather than solely geometric prompts. We further introduce the MedSAM-3 Agent, a framework that integrates Multimodal Large Language Models (MLLMs) to perform complex reasoning and iterative refinement in an agent-in-the-loop workflow. Comprehensive experiments across diverse medical imaging modalities, including X-ray, MRI, Ultrasound, CT, and video, demonstrate that our approach significantly outperforms existing specialist and foundation models.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Medal S: Spatio-Textual Prompt Model for Medical Segmentation (2025)
- Anatomy-VLM: A Fine-grained Vision-Language Model for Medical Interpretation (2025)
- SpinalSAM-R1: A Vision-Language Multimodal Interactive System for Spine CT Segmentation (2025)
- VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel (2025)
- Fleming-VL: Towards Universal Medical Visual Reasoning with Multimodal LLMs (2025)
- DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis (2025)
- MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
arXiv explained breakdown of this paper ๐ https://arxivexplained.com/papers/medsam3-delving-into-segment-anything-with-medical-concepts
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper