MedMO-8B-Next: Grounding and Understanding Multimodal Large Language Model for Medical Images



MedMO-8B-Next is the latest and most capable iteration of the MedMO family: an open-source multimodal foundation model purpose-built for medical image understanding and grounding. Trained on more than 26M diverse medical samples across 45 datasets, MedMO-8B-Next achieves state-of-the-art average performance on medical imaging benchmarks, outperforming both open-source and closed-source competitors on VQA, Text QA, grounding, and report generation tasks.


🏆 Benchmark Performance

VQA & Text QA Results

MedMO-8B-Next achieves the highest average scores on both the medical VQA and Text QA benchmarks, surpassing strong baselines including Lingshu-7B and Fleming-VL-8B. The baselines still lead on a few individual benchmarks: Fleming-VL-8B on SLAKE and PathVQA, and Lingshu-7B on PubMedQA.

OMIVQA = OmniMedVQA · MedXQA = MedXpertQA · Medbullets reported as op4/op5

Medical VQA Benchmarks

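| Model | MMMU-Med | VQA-RAD (closed/all) | SLAKE (closed/all) | PathVQA | PMC-VQA | OmniMedVQA | MedXpertQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 54.0 | 77.2 / 43.0 | 82.4 / 33.2 | 61.9 | 54.2 | 82.9 | 26.9 | 57.3 |
| Fleming-VL-8B | 63.3 | 78.4 / 56.0 | **86.9** / **80.0** | **62.9** | 64.3 | 88.2 | 21.6 | 66.8 |
| MedMO-8B-Next | **65.3** | **80.4** / **65.0** | 75.5 / 74.7 | 57.3 | **70.3** | **88.8** | **48.9** | **69.6** |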
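
The card does not say how the Avg. column is computed. For the VQA table, the reported values are consistent with an unweighted mean over all nine scores, counting the closed and all splits of VQA-RAD and SLAKE as separate entries. The sketch below checks that reading; the averaging rule is an inference from the numbers, not a documented procedure.

```python
from statistics import mean

# Scores copied from the Medical VQA table. Paired columns (closed/all)
# are entered as two separate values, which is an assumption about how
# the Avg. column was formed.
vqa_scores = {
    "Lingshu-7B": [54.0, 77.2, 43.0, 82.4, 33.2, 61.9, 54.2, 82.9, 26.9],
    "Fleming-VL-8B": [63.3, 78.4, 56.0, 86.9, 80.0, 62.9, 64.3, 88.2, 21.6],
    "MedMO-8B-Next": [65.3, 80.4, 65.0, 75.5, 74.7, 57.3, 70.3, 88.8, 48.9],
}

# Unweighted mean per model, rounded to one decimal like the table.
vqa_averages = {model: round(mean(scores), 1) for model, scores in vqa_scores.items()}

print(vqa_averages)
# {'Lingshu-7B': 57.3, 'Fleming-VL-8B': 66.8, 'MedMO-8B-Next': 69.6}
```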

Medical Text QA Benchmarks

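| Model | MMLU-Med | PubMedQA | MedMCQA | MedQA | Medbullets (op4/op5) | MedXpertQA | SGPQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Lingshu-7B | 69.6 | **75.8** | 56.3 | 63.5 | 62.0 / 53.8 | 16.4 | 27.5 | 51.1 |
| Fleming-VL-8B | 71.8 | 74.0 | 51.8 | 53.7 | 40.5 | 12.1 | 24.9 | 46.9 |
| MedMO-8B-Next | **80.2** | 75.6 | **62.0** | **83.8** | **65.2** / **57.8** | **20.9** | **35.5** | **60.1** |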

Bold = best result. MedMO-8B-Next achieves the highest average on both VQA (69.6) and Text QA (60.1) benchmarks.

  • Benchmarked on an AMD MI210 GPU.

Supported Imaging Modalities

| Domain | Modalities |
|---|---|
| Radiology | X-ray, CT, MRI, Ultrasound |
| Pathology | Whole-slide imaging, Microscopy |
| Ophthalmology | Fundus photography, OCT |
| Dermatology | Clinical skin images |
| Nuclear Medicine | PET, SPECT |

🚀 Quick Start

Installation

# accelerate is needed for device_map="auto" in the example below
pip install transformers torch qwen-vl-utils accelerate

Basic Usage

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "MBZUAI/MedMO-8B-Next",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package; remove this argument if it is not installed
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("MBZUAI/MedMO-8B-Next")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/medical/image.png",
            },
            {"type": "text", "text": "What abnormalities are present in this chest X-ray?"},
        ],
    }
]

# Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])

Example: Disease Localization with Bounding Boxes

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "chest_xray.png"},
            {"type": "text", "text": "Detect and localize all abnormalities in this image."},
        ],
    }
]
# Example output:
# "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
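
The example output above pairs a label with a list of boxes inside `<box>...</box>` tags. A minimal sketch of turning that string into structured data follows. The helper name `parse_boxes` is hypothetical, and the `[x1, y1, x2, y2]` coordinate order is an assumption; the card does not state the order or whether values are pixels or a normalized range, so check this against your own images before drawing boxes.

```python
import json
import re

# Matches "<label> <box>[[...], [...]]</box>" segments in the model output.
BOX_PATTERN = re.compile(r"([^<>]*?)\s*<box>(.*?)</box>", re.DOTALL)


def parse_boxes(output_text):
    """Return a list of (label, [box, ...]) pairs found in output_text.

    Each box is a 4-tuple, assumed here to be (x1, y1, x2, y2).
    """
    results = []
    for label, payload in BOX_PATTERN.findall(output_text):
        boxes = json.loads(payload)  # the payload is a JSON-style nested list
        results.append((label.strip(), [tuple(box) for box in boxes]))
    return results


example_output = "Fractures <box>[[156, 516, 231, 607], [240, 529, 296, 581]]</box>"
parsed = parse_boxes(example_output)
print(parsed)
# [('Fractures', [(156, 516, 231, 607), (240, 529, 296, 581)])]
```

Output that contains no `<box>` tag yields an empty list, so callers can treat that case as "no localized findings" or as a formatting failure, depending on the prompt.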

Example: Radiology Report Generation

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "ct_scan.png"},
            {"type": "text", "text": "Generate a detailed radiology report for this CT scan."},
        ],
    }
]
# MedMO-8B-Next generates comprehensive clinical reports with findings and impressions
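
The localization and report examples define only `messages`; to get output, pass them through the same processing and generation steps shown in Basic Usage. Since the message structure is identical each time, a small helper can build it. `build_messages` below is a hypothetical convenience function, not part of the MedMO or transformers API.

```python
def build_messages(image_path, prompt):
    """Return a single-turn chat message list with one image and one text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Same structure as the report-generation example above.
report_messages = build_messages(
    "ct_scan.png", "Generate a detailed radiology report for this CT scan."
)
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` exactly as `messages` is in Basic Usage.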

📦 Model Family

| Model | Parameters | Best For |
|---|---|---|
| MedMO-8B-Next | 8B | Highest accuracy, all tasks (recommended) |
| MedMO-8B | 8B | Previous generation |
| MedMO-4B | 4B | Resource-constrained environments |

📄 Citation

If you use MedMO in your research, please cite our paper:

@article{deria2026medmo,
  title={MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images},
  author={Deria, Ankan and Kumar, Komal and Dukre, Adinath Madhavrao and Segal, Eran and Khan, Salman and Razzak, Imran},
  journal={arXiv preprint arXiv:2602.06965},
  year={2026}
}

📜 License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
