metadata
language:
  - en
license: cc-by-nc-nd-4.0
tags:
  - vision-language
  - medical
  - biomedical
  - radiology
  - pathology
  - multi-image
  - medical-imaging
  - vlm
  - qwen3
  - siglip
base_model:
  - Qwen/Qwen3-14B
  - google/medsiglip-448
pipeline_tag: image-text-to-text
library_name: transformers

Bio-Medical-ContactDoctorVLLM-14B-V1-102025 🏥🔬


🎯 Model Overview

Bio-Medical-ContactDoctorVLLM-14B-V1-102025 is a specialized vision-language model for biomedical image analysis. It pairs the Qwen3-14B language model with Google's MedSigLIP-448 vision encoder through a custom Perceiver Resampler bridge, and is designed to analyze a range of medical imaging modalities, including X-ray, CT, MRI, ultrasound, histopathology, and clinical photography.

Key Highlights

  • 🏗️ Custom Biomedical Architecture: Features a novel Perceiver Resampler bridge connecting vision and language modalities, optimized for medical imaging tasks
  • 🔬 Domain-Specialized Training: Pre-trained and fine-tuned on extensive biomedical datasets with both single and multi-image scenarios
  • 🗣️ Medical Fluency: Speaks medical terminology naturally with deep understanding of clinical contexts
  • 🖼️ Multi-Image Analysis: Native support for analyzing multiple medical images simultaneously with comparative reasoning
  • ⚡ Efficient Inference: Optimized for real-world deployment with 32 visual tokens per image

🏛️ Architecture

[Figure: model architecture diagram]
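
To make the Perceiver Resampler idea concrete, here is a minimal, dependency-free sketch of its core operation: a small set of latent queries cross-attends over the patch embeddings, so the number of output tokens equals the number of latents no matter how many patches come in. This is an illustration only. It is single-head, toy-sized, and uses random values in place of learned weights; the function names are hypothetical, and the model's actual resampler.py (8 heads, 2 layers, projection into the LLM hidden size) is not reproduced here.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(latents, patches):
    """Single-head cross-attention: each latent query pools over all patches.

    latents: list of M vectors (queries).
    patches: list of N vectors (used as both keys and values).
    Returns M vectors, so the output length does not depend on N.
    """
    dim = len(latents[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for query in latents:
        scores = [scale * sum(q * k for q, k in zip(query, patch)) for patch in patches]
        weights = softmax(scores)
        pooled.append([
            sum(w * patch[d] for w, patch in zip(weights, patches))
            for d in range(dim)
        ])
    return pooled

random.seed(0)
DIM = 8            # toy width; the real encoder output is 1152-D
NUM_PATCHES = 64   # toy count; the real model sees 1024 patches per image
NUM_LATENTS = 4    # toy count; the real model keeps 32 visual tokens

patches = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
latents = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_LATENTS)]

visual_tokens = cross_attend(latents, patches)
print(len(visual_tokens), len(visual_tokens[0]))  # 4 8
```

In a real implementation the same pattern would typically be written with a tensor library, with learned query, key and value projections and several attention heads.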

Technical Specifications

| Component | Details |
|---|---|
| Base LLM | Qwen3-14B (40 layers, 5120 hidden size) |
| Vision Encoder | MedSigLIP-448 (1152D) |
| Resampler | Perceiver architecture with 8 heads, 2 layers |
| Visual Tokens | 32 per image (compressed from 1024) |
| Total Parameters | ~14.xB |
| Training Precision | BFloat16 |
| Context Length | 40,960 tokens |
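
The figures in the specifications fit together arithmetically, as the short sketch below shows. It rests on two assumptions that this card does not state: a 14-pixel patch size (chosen because it yields 1024 patches for a 448×448 input) and each image occupying exactly 32 positions in the LLM context. Text tokens, which a real prompt also consumes, are ignored.

```python
IMAGE_SIZE = 448          # MedSigLIP-448 input resolution
PATCH_SIZE = 14           # assumption: not stated in this card
VISUAL_TOKENS = 32        # per image, after the resampler
CONTEXT_LENGTH = 40_960   # from the specifications

patches_per_side = IMAGE_SIZE // PATCH_SIZE
patches_per_image = patches_per_side ** 2
compression = patches_per_image // VISUAL_TOKENS

def visual_token_cost(num_images: int) -> int:
    """Context positions consumed by the images alone."""
    return num_images * VISUAL_TOKENS

print(patches_per_image)                       # 1024
print(compression)                             # 32
print(visual_token_cost(3))                    # 96
print(CONTEXT_LENGTH - visual_token_cost(3))   # 40864
```

Under these assumptions, even a multi-image study uses only a small fraction of the context window for visual input.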

📊 Training Data

The model was trained on a diverse collection of biomedical imaging datasets:

  • Radiology: X-ray, CT, MRI across multiple anatomical regions
  • Pathology: Histopathology slides, cytology
  • Clinical Imaging: Dermatology, ophthalmology, endoscopy
  • Multi-Modal Cases: Datasets with multiple imaging studies per patient
  • Report Pairs: Image-report pairs for grounded medical reasoning

Training incorporated both:

  • Single-image scenarios: Detailed analysis of individual medical images
  • Multi-image scenarios: Comparative analysis, temporal progression, multi-modal integration

Note: This model was not trained on generic (non-medical) images, so its responses to such inputs may be inaccurate or unreliable.

🚀 Quick Start

Installation

pip install torch transformers pillow

Download Model Files

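Ensure the following custom files from this repository are in your working directory (or otherwise importable), since the examples below import ContactDoctorVLLM from modeling_contactdoctor_vlm: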

  • modeling_contactdoctor_vlm.py
  • configuration_contactdoctor_vlm.py
  • resampler.py
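
A quick way to confirm the files are in place before loading the model is a small check like the one below. It is a convenience sketch, not part of the repository; the helper name is hypothetical, and it assumes the files sit directly in the directory you pass in.

```python
from pathlib import Path

REQUIRED_FILES = (
    "modeling_contactdoctor_vlm.py",
    "configuration_contactdoctor_vlm.py",
    "resampler.py",
)

def missing_custom_files(directory="."):
    """Return the required file names that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]

missing = missing_custom_files(".")
if missing:
    print("Missing custom files:", ", ".join(missing))
else:
    print("All custom files found.")
```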

Single Image Analysis

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor
from modeling_contactdoctor_vlm import ContactDoctorVLLM

MODEL_PATH = "ContactDoctor/Bio-Medical-ContactDoctorVLLM-14B-V1-102025"
IMAGE_TOKEN = "<image>"

# Load model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = ContactDoctorVLLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Set image token
image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)
model.set_image_token_id(image_token_id)

# Prepare image
image = Image.open("chest_xray.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

# Create prompt
system_prompt = """You are an expert medical image analyst.
Analyze the provided medical image and provide a structured report:
1. **Modality & View**: Identify the imaging type and anatomical view
2. **Key Observations**: List significant visual findings
3. **Clinical Interpretation**: Brief assessment based on visible findings
"""

user_prompt = "<image>\nProvide a detailed analysis of this medical image."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

# Tokenize
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        pixel_values=pixel_values.to(model.device),
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
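
Depending on how the custom generate method builds its inputs, the returned sequence may or may not begin with the prompt tokens. If the decoded response turns out to echo the system and user prompts, a small helper such as the following can strip them. The helper is hypothetical and works on plain lists of token IDs, so it makes no claim about what this model actually returns.

```python
def new_token_ids(output_ids, prompt_ids):
    """Drop the echoed prompt from a generated sequence, if it is present."""
    n = len(prompt_ids)
    if list(output_ids[:n]) == list(prompt_ids):
        return list(output_ids[n:])
    return list(output_ids)

# Toy IDs standing in for real tokenizer output.
prompt = [101, 7, 8, 9]
print(new_token_ids([101, 7, 8, 9, 42, 43], prompt))  # [42, 43]
print(new_token_ids([42, 43], prompt))                # [42, 43]
```

With the variables from the example above, this would be applied as tokenizer.decode(new_token_ids(outputs[0].tolist(), inputs["input_ids"][0].tolist()), skip_special_tokens=True).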

Multi-Image Comparative Analysis

# Load multiple images
image_paths = ["ct_scan_1.jpg", "ct_scan_2.jpg", "mri_scan.jpg"]
images = [Image.open(path).convert("RGB") for path in image_paths]
pixel_values = image_processor(images=images, return_tensors="pt")["pixel_values"]

# Multi-image prompt (one <image> token per image)
num_images = len(images)
image_tokens = IMAGE_TOKEN * num_images
user_prompt = f"{image_tokens}\nCompare these {num_images} medical images and provide an integrated analysis."

messages = [
    {"role": "system", "content": "You are an expert medical image analyst. Analyze multiple images and provide comparative assessment."},
    {"role": "user", "content": user_prompt}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)

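# Generate with multi-image input (same settings pattern as the single-image example)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        pixel_values=pixel_values.to(model.device),
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )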

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
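
Because the prompt needs exactly one image token per image, it can help to build the prompt through a helper and verify the count before calling the model. The sketch below uses only string operations; the function names are hypothetical, and how the model reacts to a mismatch is not documented here, so the check simply fails early.

```python
IMAGE_TOKEN = "<image>"

def build_multi_image_prompt(num_images, instruction, image_token=IMAGE_TOKEN):
    """Prefix the instruction with one image token per image."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return f"{image_token * num_images}\n{instruction}"

def check_prompt_matches_images(prompt, num_images, image_token=IMAGE_TOKEN):
    """Raise if the number of image tokens differs from the number of images."""
    found = prompt.count(image_token)
    if found != num_images:
        raise ValueError(f"prompt has {found} image tokens but {num_images} images were given")

prompt = build_multi_image_prompt(3, "Compare these 3 medical images.")
check_prompt_matches_images(prompt, 3)
print(prompt.splitlines()[0])  # <image><image><image>
```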

💡 Use Cases

Clinical Applications

  1. Radiology Report Generation

    • Chest X-ray interpretation
    • CT/MRI findings summarization
    • Comparative analysis of serial imaging
  2. Pathology Analysis

    • Histopathology slide description
    • Cell morphology assessment
    • Tissue abnormality detection
  3. Multi-Modal Integration

    • Cross-modality comparison (e.g., X-ray + CT)
    • Temporal disease progression tracking
    • Pre/post-treatment assessment
  4. Medical Education

    • Training case analysis
    • Image-based clinical reasoning
    • Differential diagnosis support

Research Applications

  • Large-scale medical image annotation
  • Dataset quality assessment
  • Multi-image relationship extraction
  • Clinical study automation

⚙️ Generation Configuration

Recommended Settings

For Detailed Analysis (Single Image):

generation_config = {
    "max_new_tokens": 512,
    "min_new_tokens": 20,
    "temperature": 0.2,      # Lower for factual accuracy
    "top_p": 0.9,
    "top_k": 20,
    "do_sample": True,
    "repetition_penalty": 1.15,
    "length_penalty": 1.0,   # Only used by beam search; no effect with sampling
}

For Comparative Analysis (Multi-Image):

generation_config = {
    "max_new_tokens": 1024,   # Longer for comprehensive comparison
    "min_new_tokens": 50,
    "temperature": 0.3,
    "top_p": 0.9,
    "do_sample": True,
    "repetition_penalty": 1.15,
}

For Quick Screening:

generation_config = {
    "max_new_tokens": 256,
    "temperature": 0.1,       # Ignored under greedy decoding (do_sample=False)
    "do_sample": False,       # Greedy decoding: deterministic output
}
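
The presets above are plain dictionaries, so they can be unpacked into model.generate(...) as keyword arguments. In transformers, temperature, top_p and top_k only take effect when do_sample=True, and passing them alongside greedy decoding can produce a warning. The hypothetical helper below drops those keys when sampling is off; it is a sketch for tidying the presets, not something this repository provides.

```python
SAMPLING_ONLY_KEYS = {"temperature", "top_p", "top_k"}

def clean_generation_kwargs(config):
    """Return a copy of `config` without sampling-only keys when do_sample is off."""
    if config.get("do_sample", False):
        return dict(config)
    return {k: v for k, v in config.items() if k not in SAMPLING_ONLY_KEYS}

quick_screening = {"max_new_tokens": 256, "temperature": 0.1, "do_sample": False}
detailed = {"max_new_tokens": 512, "temperature": 0.2, "top_p": 0.9, "do_sample": True}

print(clean_generation_kwargs(quick_screening))  # {'max_new_tokens': 256, 'do_sample': False}
print(clean_generation_kwargs(detailed) == detailed)  # True
```

The cleaned dictionary can then be passed as model.generate(..., **clean_generation_kwargs(generation_config)) together with the input tensors shown in the Quick Start examples.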

📈 Performance Characteristics

Strengths

  • ✅ Accurate modality and view identification
  • ✅ Detailed anatomical localization
  • ✅ Clinical terminology fluency
  • ✅ Multi-image reasoning and comparison
  • ✅ Temporal progression analysis
  • ✅ Structured report generation

Limitations

  • ⚠️ Not a replacement for professional medical diagnosis
  • ⚠️ May hallucinate fine details in low-quality images
  • ⚠️ Limited to visual information; no access to clinical history
  • ⚠️ Performance varies by imaging modality and quality

🔒 Ethical Considerations & Disclaimer

⚠️ IMPORTANT MEDICAL DISCLAIMER ⚠️

This model is intended for research and educational purposes only. It is NOT approved for clinical use and should NOT be used as a substitute for professional medical advice, diagnosis, or treatment.

Guidelines for Responsible Use

  1. Human Oversight Required: All outputs must be reviewed by qualified healthcare professionals
  2. Not for Diagnosis: Do not use for clinical decision-making without expert validation
  3. Bias Awareness: Model may reflect biases in training data
  4. Privacy: Ensure patient data is de-identified and handled per regulations (HIPAA, GDPR)
  5. Liability: Users assume full responsibility for any application of this model

Known Limitations

  • Model outputs are probabilistic and may contain errors
  • Cannot replace clinical judgment and expertise
  • May not generalize to rare conditions or novel imaging techniques
  • No access to patient history, lab results, or other clinical context

📜 Citation

If you use this model in your research, please cite:

@misc{biomedical-contactdoctor-vlm-2025,
  title={Bio-Medical-ContactDoctorVLLM: A Specialized Vision-Language Model for Biomedical Image Analysis},
  author={ContactDoctor Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/ContactDoctor/Bio-Medical-ContactDoctorVLLM-14B-V1-102025}}
}

🀝 Contributing & Support

  • Issues: Report bugs or request features via GitHub Issues
  • Discussions: Join model discussions on HuggingFace
  • Contact: [[email protected]]

📄 License

This model is released under the CC BY-NC-ND 4.0 license (cc-by-nc-nd-4.0). See the LICENSE file for details.

Component Licenses:

  • Base LLM (Qwen3-14B): [Qwen License]
  • Vision Encoder (MedSigLIP): Apache 2.0
  • Custom Architecture: cc-by-nc-nd-4.0

🙏 Acknowledgments

  • Qwen Team for the powerful Qwen3-14B language model
  • Google Research for MedSigLIP vision encoder
  • Medical Imaging Community for datasets and domain expertise
  • CloudEXE for GPU Sponsorship
  • HuggingFace for model hosting and infrastructure

Built with ❤️ for advancing medical AI research
For research and educational use only • Not for clinical diagnosis