🔬 Experimental Architecture | Mistral-Based Visual Retrieval
SauerkrautLM-ColMinistral3-3b-v0.1 is an experimental multi-vector retrieval model that pairs mistralai/Ministral-3-3B-Reasoning-2512 with the Pixtral vision encoder, exploring the Mistral architecture for visual document retrieval.
⚠️ Note: This is an experimental release. For production use, we recommend ColQwen3 or ColLFM2 models.
Traditional OCR-based retrieval loses layout, tables, and visual context. Our visual approach embeds each document page directly as an image into multiple 128-dimensional vectors, so that information is preserved at retrieval time.
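Under the hood, retrieval uses ColPali-style late interaction: every query token is compared against every page patch, and the per-token maxima are summed. A minimal sketch, assuming plain `(tokens, dim)` tensors; the function below is illustrative, not the library's API:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance of one page for one query.

    query_emb: (num_query_tokens, 128) multi-vector query embedding
    page_emb:  (num_page_patches, 128) multi-vector page embedding
    """
    sim = query_emb @ page_emb.T        # token-patch similarities, (tokens, patches)
    return sim.max(dim=1).values.sum()  # best patch per query token, summed
```

On the retrieval benchmarks, the model scores: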
| Benchmark | Score | Rank (128-dim) |
|---|---|---|
| ViDoRe v1 | 81.98 | - |
| MTEB v1+v2 | 71.93 | - |
| ViDoRe v3 | 40.50 | #11 |
Comparison with other 128-dim retrievers at similar scale:

| Model | Params | Dim | ViDoRe v1 | MTEB v1+v2 | ViDoRe v3 |
|---|---|---|---|---|---|
| SauerkrautLM-ColQwen3-4b-v0.1 (recommended) | 4.0B | 128 | 90.80 | 81.97 | 56.03 |
| EvoQwen2.5-VL-Retriever-3B-v1 | 3.0B | 128 | 90.67 | 82.76 | - |
| colnomic-embed-multimodal-3b | 3.0B | 128 | 89.86 | 80.09 | 56.40 |
| SauerkrautLM-ColMinistral3-3b-v0.1 | 3.0B | 128 | 81.98 | 71.93 | 40.50 |
Against the ColPali-v1.1 baseline:

| Model | Params | ViDoRe v1 |
|---|---|---|
| ColMinistral3-3b | 3.0B | 81.98 |
| colpali-v1.1 | 2.9B | 81.61 |
On ViDoRe v1, ColMinistral3-3b comes in slightly ahead of the ColPali-v1.1 baseline (81.98 vs. 81.61).
Model details:

| Property | Value |
|---|---|
| Base Model | mistralai/Ministral-3-3B-Reasoning-2512 |
| Vision Encoder | Pixtral |
| Parameters | 3.0B |
| Embedding Dimension | 128 |
| VRAM (bfloat16) | ~6 GB |
| Max Context Length | 262,144 tokens |
| License | Apache 2.0 |
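The ~6 GB figure lines up with a weights-only back-of-the-envelope estimate; activations and image tokens add overhead on top:

```python
# Weights-only VRAM for 3.0B parameters in bfloat16 (2 bytes per parameter).
params = 3.0e9
print(f"{params * 2 / 1e9:.1f} GB")  # -> 6.0 GB
```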
Training configuration:

| Setting | Value |
|---|---|
| GPUs | 4x NVIDIA RTX 6000 Ada (48GB) |
| Effective Batch Size | 256 |
| Precision | bfloat16 |
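The effective batch size is per-device batch × number of GPUs × gradient-accumulation steps. The split below is an assumption for illustration; only the GPU count and the product of 256 come from the table:

```python
# One hypothetical split that yields the effective batch size of 256.
per_device_batch = 16  # assumption, not from the model card
num_gpus = 4           # from the table above
grad_accum_steps = 4   # assumption, not from the model card
assert per_device_batch * num_gpus * grad_accum_steps == 256
```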
Training data:

| Dataset | Type | Description |
|---|---|---|
| vidore/colpali_train_set | Public | ColPali training data |
| openbmb/VisRAG-Ret-Train-In-domain-data | Public | Visual RAG training data |
| llamaindex/vdr-multilingual-train | Public | Multilingual retrieval |
| VAGO Multilingual Dataset 1 | In-house | Proprietary multilingual document-query pairs |
| VAGO Multilingual Dataset 2 | In-house | Proprietary multilingual document-query pairs |
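The public datasets can be pulled directly from the Hugging Face Hub; a minimal loading sketch (the split name is an assumption):

```python
from datasets import load_dataset

# Load one of the public training sets listed above.
ds = load_dataset("vidore/colpali_train_set", split="train")  # split name assumed
print(ds[0].keys())
```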
⚠️ Important: Install our package first (requires transformers 5.0.0+):
pip install "sauerkrautlm-colpali[ministral]"
# Or: pip install git+https://github.com/VAGOsolutions/sauerkrautlm-colpali && pip install transformers>=5.0.0rc0
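An illustrative sanity check (not part of the package) that the installed transformers meets the requirement:

```python
from packaging.version import Version
import transformers

# The model needs transformers 5.0.0+ (release candidates included).
assert Version(transformers.__version__) >= Version("5.0.0rc0"), transformers.__version__
```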
```python
import torch
from PIL import Image

from sauerkrautlm_colpali.models import ColMinistral3, ColMinistral3Processor

model_name = "VAGOsolutions/SauerkrautLM-ColMinistral3-3b-v0.1"

# Load the model and move it to the GPU in bfloat16.
model = ColMinistral3.from_pretrained(model_name)
model = model.to(dtype=torch.bfloat16, device="cuda:0").eval()
processor = ColMinistral3Processor.from_pretrained(model_name)

# Inputs: document page images and text queries.
images = [Image.open("document.png")]
queries = ["What is the main topic?"]

# Preprocess and move the batches to the model's device.
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass: multi-vector embeddings for pages and queries.
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scores.
scores = processor.score(query_embeddings, image_embeddings)
```
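`scores` is a `(num_queries, num_images)` matrix of MaxSim scores; assuming it is a torch tensor, an illustrative way to rank pages per query:

```python
# Indices of pages sorted best-to-worst for each query.
top_idx = scores.argsort(dim=1, descending=True)
print(top_idx[0, :3])  # top-3 page indices for the first query
```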
✅ Consider when:

- You are experimenting with Mistral/Pixtral-based architectures for visual retrieval
- You want a ColPali-class research baseline that fits in ~6 GB of VRAM (bfloat16)

❌ Use ColQwen3 instead when:

- You need production-grade retrieval quality: ColQwen3-4b scores ~9 points higher on ViDoRe v1 and ~15 points higher on ViDoRe v3
This model is primarily an architecture exploration. Key findings:

- At the same 3B scale and 128-dim embeddings, the Ministral/Pixtral stack trails Qwen-based retrievers by roughly 9 points on ViDoRe v1 and 10 points on MTEB v1+v2.
- It still edges out the original ColPali-v1.1 baseline on ViDoRe v1 (81.98 vs. 81.61).
```bibtex
@misc{sauerkrautlm-colpali-2025,
  title={SauerkrautLM-ColPali: Multi-Vector Vision Retrieval Models},
  author={David Golchinfar},
  organization={VAGO Solutions},
  year={2025},
  url={https://github.com/VAGOsolutions/sauerkrautlm-colpali}
}
```