language:
  - en
license: cc-by-nc-nd-4.0
tags:
  - vision-language
  - medical
  - biomedical
  - radiology
  - pathology
  - multi-image
  - medical-imaging
  - vlm
  - qwen3
  - siglip
base_model:
  - Qwen/Qwen3-14B
  - google/medsiglip-448
pipeline_tag: image-text-to-text
library_name: transformers

Bio-Medical-ContactDoctorVLLM-14B-V1-102025 🏥🔬

🎯 Model Overview

Bio-Medical-ContactDoctorVLLM-14B-V1-102025 is a specialized vision-language model for biomedical image analysis. It pairs the Qwen3-14B language model with Google's MedSigLIP-448 vision encoder and is designed to analyze a range of medical imaging modalities, including X-rays, CT scans, MRI, ultrasound, histopathology, and clinical photography.

Key Highlights

🏗️ Custom Biomedical Architecture: Features a novel Perceiver Resampler bridge connecting vision and language modalities, optimized for medical imaging tasks
🔬 Domain-Specialized Training: Pre-trained and fine-tuned on extensive biomedical datasets covering both single-image and multi-image scenarios
🗣️ Medical Fluency: Uses standard medical terminology and is tuned for clinical contexts
🖼️ Multi-Image Analysis: Native support for analyzing multiple medical images simultaneously with comparative reasoning
⚡ Efficient Inference: Optimized for real-world deployment with 32 visual tokens per image

🏛️ Architecture

Technical Specifications

Component	Details
Base LLM	Qwen3-14B (40 layers, 5120 hidden size)
Vision Encoder	MedSigLIP-448 (1152D)
Resampler	Perceiver architecture with 8 heads, 2 layers
Visual Tokens	32 per image (compressed from 1024)
Total Parameters	~14.xB
Training Precision	BFloat16
Context Length	40,960 tokens
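The token figures in the table can be checked with a little arithmetic. The sketch below is illustrative only: the 14-pixel patch size is an assumption inferred from the 448-pixel input and the 1024 patch count (it is not stated in this card), and the function names are hypothetical.

```python
# Back-of-the-envelope check of the visual-token figures listed above.
# ASSUMPTION: PATCH_SIZE = 14 is inferred from 448 / 14 = 32 and 32 * 32 = 1024;
# this card does not state the patch size explicitly.
IMAGE_SIZE = 448
PATCH_SIZE = 14          # assumed, see note above
TOKENS_PER_IMAGE = 32    # from the table
CONTEXT_LENGTH = 40_960  # from the table


def patches_per_image(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of patch embeddings produced for one square image."""
    side = image_size // patch_size
    return side * side


def visual_token_budget(num_images: int) -> dict:
    """Visual tokens used by `num_images` and the context left for prompt text and generation."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    used = num_images * TOKENS_PER_IMAGE
    return {
        "visual_tokens": used,
        "remaining_context": CONTEXT_LENGTH - used,
        "compression_ratio": patches_per_image() // TOKENS_PER_IMAGE,
    }


if __name__ == "__main__":
    print(patches_per_image())     # 1024
    print(visual_token_budget(3))  # {'visual_tokens': 96, 'remaining_context': 40864, 'compression_ratio': 32}
```

Under these assumptions, even a multi-image study uses only a small fraction of the context window for visual tokens.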

📊 Training Data

The model was trained on a diverse collection of biomedical imaging datasets:

Radiology: X-ray, CT, MRI across multiple anatomical regions
Pathology: Histopathology slides, cytology
Clinical Imaging: Dermatology, ophthalmology, endoscopy
Multi-Modal Cases: Datasets with multiple imaging studies per patient
Report Pairs: Image-report pairs for grounded medical reasoning

Training incorporated both:

Single-image scenarios: Detailed analysis of individual medical images
Multi-image scenarios: Comparative analysis, temporal progression, multi-modal integration

Note: This model was not trained on general-purpose (non-medical) images, so its responses to such inputs may be inaccurate or unreliable.

🚀 Quick Start

Installation

pip install torch transformers accelerate pillow

Download Model Files

Ensure you have the following custom files from this repository:

modeling_contactdoctor_vlm.py
configuration_contactdoctor_vlm.py
resampler.py
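Before importing the model class, it can help to confirm that the three files are actually present next to your script. The check below is a minimal sketch using only the standard library; the helper name `missing_custom_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = (
    "modeling_contactdoctor_vlm.py",
    "configuration_contactdoctor_vlm.py",
    "resampler.py",
)


def missing_custom_files(directory: str = ".") -> list[str]:
    """Return the required file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_custom_files(".")
    if missing:
        print("Missing custom files:", ", ".join(missing))
    else:
        print("All custom files found.")
```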

Single Image Analysis

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoImageProcessor
from modeling_contactdoctor_vlm import ContactDoctorVLLM

MODEL_PATH = "ContactDoctor/Bio-Medical-ContactDoctorVLLM-14B-V1-102025"
IMAGE_TOKEN = "<image>"

# Load model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
model = ContactDoctorVLLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Set image token
image_token_id = tokenizer.convert_tokens_to_ids(IMAGE_TOKEN)
model.set_image_token_id(image_token_id)

# Prepare image
image = Image.open("chest_xray.jpg").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

# Create prompt
system_prompt = """You are an expert medical image analyst.
Analyze the provided medical image and provide a structured report:
1. **Modality & View**: Identify the imaging type and anatomical view
2. **Key Observations**: List significant visual findings
3. **Clinical Interpretation**: Brief assessment based on visible findings
"""

user_prompt = "<image>\nProvide a detailed analysis of this medical image."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

# Tokenize
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)

# Generate
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        pixel_values=pixel_values.to(model.device),
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
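Because the system prompt asks for three named sections, the decoded text can be split into a dictionary for downstream use. The parser below is a sketch that assumes the model follows the requested headings; it is not part of this repository, `split_report` is a hypothetical helper, and the sample string is hand-written placeholder text rather than model output.

```python
import re

# Section titles requested by the system prompt in the example above.
SECTION_TITLES = ("Modality & View", "Key Observations", "Clinical Interpretation")

_TITLES = "|".join(re.escape(title) for title in SECTION_TITLES)
# Matches a heading line such as "1. **Modality & View**:" (the number, bold markers,
# and colon are all optional).
_HEADER = re.compile(
    r"^[ \t]*(?:\d+\.[ \t]*)?\*{0,2}(" + _TITLES + r")\*{0,2}[ \t]*:?[ \t]*\*{0,2}[ \t]*",
    re.MULTILINE,
)


def split_report(response: str) -> dict:
    """Map each recognized section title to the text that follows it.

    If a title occurs more than once (for example when the decoded text echoes
    the prompt), the last occurrence wins.
    """
    matches = list(_HEADER.finditer(response))
    sections = {}
    for index, match in enumerate(matches):
        end = matches[index + 1].start() if index + 1 < len(matches) else len(response)
        sections[match.group(1)] = response[match.end():end].strip()
    return sections


if __name__ == "__main__":
    # Placeholder text that only illustrates the expected layout.
    sample = (
        "1. **Modality & View**: text A\n"
        "2. **Key Observations**:\n- item one\n- item two\n"
        "3. **Clinical Interpretation**: text C"
    )
    for title, body in split_report(sample).items():
        print(f"[{title}] {body}")
```

If the model deviates from the requested format, the function returns only the sections it recognizes, so callers should treat missing keys as a normal case.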

Multi-Image Comparative Analysis

# Reuses tokenizer, image_processor, model, and IMAGE_TOKEN from the single-image example
# Load multiple images
image_paths = ["ct_scan_1.jpg", "ct_scan_2.jpg", "mri_scan.jpg"]
images = [Image.open(path).convert("RGB") for path in image_paths]
pixel_values = image_processor(images=images, return_tensors="pt")["pixel_values"]

# Multi-image prompt (one <image> token per image)
num_images = len(images)
image_tokens = IMAGE_TOKEN * num_images
user_prompt = f"{image_tokens}\nCompare these {num_images} medical images and provide an integrated analysis."

messages = [
    {"role": "system", "content": "You are an expert medical image analyst. Analyze multiple images and provide comparative assessment."},
    {"role": "user", "content": user_prompt}
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)

# Generate with multi-image input (no gradients needed at inference time)
with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"].to(model.device),
        attention_mask=inputs["attention_mask"].to(model.device),
        pixel_values=pixel_values.to(model.device),
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
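The multi-image example relies on the prompt containing exactly one `<image>` placeholder per image. A small defensive check can catch mismatches before generation. This is a sketch with hypothetical helper names; how the model itself reacts to a mismatch is not documented in this card.

```python
IMAGE_TOKEN = "<image>"


def build_multi_image_prompt(num_images: int, instruction: str,
                             image_token: str = IMAGE_TOKEN) -> str:
    """Prefix the instruction with one placeholder per image."""
    if num_images < 1:
        raise ValueError("num_images must be at least 1")
    return f"{image_token * num_images}\n{instruction}"


def check_placeholder_count(prompt: str, num_images: int,
                            image_token: str = IMAGE_TOKEN) -> None:
    """Raise ValueError if the placeholder count differs from the image count."""
    found = prompt.count(image_token)
    if found != num_images:
        raise ValueError(
            f"prompt has {found} {image_token} placeholder(s) "
            f"but {num_images} image(s) were provided"
        )


if __name__ == "__main__":
    prompt = build_multi_image_prompt(3, "Compare these 3 medical images.")
    check_placeholder_count(prompt, 3)  # passes silently
    print(prompt)
```

Calling `check_placeholder_count(user_prompt, len(images))` just before tokenization keeps the prompt and `pixel_values` in step when image lists are built dynamically.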

💡 Use Cases

Clinical Applications

Radiology Report Generation
- Chest X-ray interpretation
- CT/MRI findings summarization
- Comparative analysis of serial imaging
Pathology Analysis
- Histopathology slide description
- Cell morphology assessment
- Tissue abnormality detection
Multi-Modal Integration
- Cross-modality comparison (e.g., X-ray + CT)
- Temporal disease progression tracking
- Pre/post-treatment assessment
Medical Education
- Training case analysis
- Image-based clinical reasoning
- Differential diagnosis support

Research Applications

Large-scale medical image annotation
Dataset quality assessment
Multi-image relationship extraction
Clinical study automation

⚙️ Generation Configuration

Recommended Settings

For Detailed Analysis (Single Image):

generation_config = {
    "max_new_tokens": 512,
    "min_new_tokens": 20,
    "temperature": 0.2,      # Lower for factual accuracy
    "top_p": 0.9,
    "top_k": 20,
    "do_sample": True,
    "repetition_penalty": 1.15,
    "length_penalty": 1.0,   # Only used by beam search; no effect with plain sampling
}

For Comparative Analysis (Multi-Image):

generation_config = {
    "max_new_tokens": 1024,   # Longer for comprehensive comparison
    "min_new_tokens": 50,
    "temperature": 0.3,
    "top_p": 0.9,
    "do_sample": True,
    "repetition_penalty": 1.15,
}

For Quick Screening:

generation_config = {
    "max_new_tokens": 256,
    "do_sample": False,       # Greedy decoding; temperature is ignored here, so it is omitted
}
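These dictionaries are meant to be unpacked into `model.generate(...)`. When presets are merged or edited by hand, sampling-only keys can end up next to `do_sample=False`, where `transformers` ignores them and emits a warning. The helper below is a sketch (the name `sanitize_generation_config` is hypothetical) that drops those keys for greedy decoding.

```python
# Keys that only take effect when do_sample=True.
SAMPLING_ONLY_KEYS = ("temperature", "top_p", "top_k")


def sanitize_generation_config(config: dict) -> dict:
    """Return a copy of `config` without sampling-only keys when sampling is off."""
    cleaned = dict(config)
    if not cleaned.get("do_sample", False):
        for key in SAMPLING_ONLY_KEYS:
            cleaned.pop(key, None)
    return cleaned


if __name__ == "__main__":
    preset = {"max_new_tokens": 256, "temperature": 0.1, "do_sample": False}
    print(sanitize_generation_config(preset))
    # {'max_new_tokens': 256, 'do_sample': False}
```

A typical call would then look like `model.generate(input_ids=..., attention_mask=..., pixel_values=..., **sanitize_generation_config(generation_config))`, with the tensor arguments prepared as in the Quick Start examples.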

📈 Performance Characteristics

Strengths

✅ Accurate modality and view identification
✅ Detailed anatomical localization
✅ Clinical terminology fluency
✅ Multi-image reasoning and comparison
✅ Temporal progression analysis
✅ Structured report generation

Limitations

⚠️ Not a replacement for professional medical diagnosis
⚠️ May hallucinate fine details in low-quality images
⚠️ Limited to visual information; no access to clinical history
⚠️ Performance varies by imaging modality and quality

🔒 Ethical Considerations & Disclaimer

⚠️ IMPORTANT MEDICAL DISCLAIMER ⚠️

This model is intended for research and educational purposes only. It is NOT approved for clinical use and should NOT be used as a substitute for professional medical advice, diagnosis, or treatment.

Guidelines for Responsible Use

Human Oversight Required: All outputs must be reviewed by qualified healthcare professionals
Not for Diagnosis: Do not use for clinical decision-making without expert validation
Bias Awareness: Model may reflect biases in training data
Privacy: Ensure patient data is de-identified and handled per regulations (HIPAA, GDPR)
Liability: Users assume full responsibility for any application of this model

Known Limitations

Model outputs are probabilistic and may contain errors
Cannot replace clinical judgment and expertise
May not generalize to rare conditions or novel imaging techniques
No access to patient history, lab results, or other clinical context

📜 Citation

If you use this model in your research, please cite:

@misc{biomedical-contactdoctor-vlm-2025,
  title={Bio-Medical-ContactDoctorVLLM: A Specialized Vision-Language Model for Biomedical Image Analysis},
  author={ContactDoctor Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/ContactDoctor/Bio-Medical-ContactDoctorVLLM-14B-V1-102025}}
}

🤝 Contributing & Support

Issues: Report bugs or request features via GitHub Issues
Discussions: Join model discussions on HuggingFace
Contact: [[email protected]]

📄 License

This model is released under the CC BY-NC-ND 4.0 license (cc-by-nc-nd-4.0). See the LICENSE file for details.

Component Licenses:

Base LLM (Qwen3-14B): [Qwen License]
Vision Encoder (MedSigLIP): Apache 2.0
Custom Architecture: cc-by-nc-nd-4.0

🙏 Acknowledgments

Qwen Team for the powerful Qwen3-14B language model
Google Research for MedSigLIP vision encoder
Medical Imaging Community for datasets and domain expertise
CloudEXE for GPU Sponsorship
HuggingFace for model hosting and infrastructure

Built with ❤️ for advancing medical AI research
For research and educational use only • Not for clinical diagnosis