---
license: apache-2.0
language:
- en
tags:
- ColBERT
- passage-retrieval
- knowledge-distillation
pretty_name: Independent Implementation of ColBERTv2.0+ Models - modern_colbert_base_en_v1.
new_version: prithivida/modern_colbert_base_en_v1
---

Trained by Donkey Stereotype



# Independent Implementation of ColBERTv2.0+ Models

> Background:
> As part of this project, we will be releasing a set of models across weight classes: 1.) Models that worked well, 2.) Experimental models, including failed attempts. This work stands on the shoulders of all previous robust research on ColBERT and variants.
>
> What does this independent implementation entail?

As of this writing (2nd July 2025):

1. LightOn AI's ColBERT is the best in the world and can be considered SOTA.
2. **Today we are humbled and thrilled to announce prithivida/modern_colbert_base_en_v1 is the 2nd best ColBERT in the world.** Borrowing Antoine Chaffin's words:

> "This is the 2nd model to outperform ColBERT-small on BEIR. While it is also bigger, it is still a very lightweight model and benefits from the efficiency of ModernBERT!"

# Comparison with Top ColBERTv2.0+ Models

| Dataset / Model | GTE-ModernColBERT (LightOn AI) | modern_colbert_base_en_v1 (Ours) | ColBERT-small (Answer AI, reproduced by LightOn) | ColBERT-small (Answer AI, reported) |
|:-----------------|:-----------------:|:-----------------:|:------------------------:|:------------------------:|
| **Outfit type** | AI Lab with PhDs | Indie Researcher, No PhD, No GPU budgets :-) | AI Lab with PhDs | AI Lab with PhDs |
| **BEIR Average** | **54.89** (🥇) | **54.51 (🥈)** | 53.35 | 53.79 |
| **FiQA2018** | **48.51** | 43.96 | 41.01 | 41.15 |
| **NFCorpus** | **37.93** | 37.23 | 36.86 | 37.3 |
| **TREC-COVID** | 83.59 | 83.4 | 83.14 | **84.59** |
| **Touche2020** | **31.23** | 29.32 | 24.95 | 25.69 |
| **ArguAna** | 48.51 | **52.05** | 46.76 | 50.09 |
| **QuoraRetrieval** | 86.61 | 87.54 | **87.89** | 87.72 |
| **SCIDOCS** | 19.06 | **19.42** | 18.72 | 18.42 |
| **SciFact** | 76.34 | **76.44** | 74.02 | 74.77 |
| **NQ** | **61.8** | 61.68 | 59.42 | 59.1 |
| **ClimateFEVER** | 30.62 | 28.29 | 32.83 | **33.07** |
| **HotpotQA** | **77.32** | 76.667 | 76.88 | 76.11 |
| **DBPedia** | **48.03** | 46.31 | 46.36 | 45.58 |
| **CQADupstack** | 41 | **42.2** | 39.36 | 38.75 |
| **FEVER** | 87.44 | 88.106 | 88.66 | **90.96** |
| **MSMARCO** | **45.32** | 44.993 | 43.44 | 43.5 |

# Comparison with legacy ColBERT models

Both the GTE-ModernColBERT and ColBERT-small model cards include this comparison against older ColBERT models. Please refer to them.

-----

# How to use / Running inference:

- Short term: We are releasing a lib called [`lateness`](https://github.com/PrithivirajDamodaran/lateness).
- Medium to long term: There are really strong storage and retrieval abstractions, such as vector DBs like Qdrant, Weaviate, or Vespa that support multi-vectors, and strong ColBERT training libraries like PyLate, so we feel it is best to work with the authors and integrate.

For now we offer only code to load the model, run inference, and do some lightweight in-memory ranking (no heavy lifting like storing and retrieving using FAISS indexes).

## Using modern_colbert to index and query with vector DBs like Qdrant

> [!TIP]
> ```bash
> pip install lateness            # light CPU retrievals
> # or
> pip install "lateness[index]"   # GPU accelerated indexing into vdbs
> ```

______

> [!NOTE]
> [Want to run Qdrant locally or use it in a production cluster? Try out an end-to-end example here](https://github.com/PrithivirajDamodaran/lateness/tree/main/examples/qdrant)

```python
from lateness import ModernColBERT

colbert = ModernColBERT("prithivida/modern_colbert_base_en_v1", max_query_len=32, max_doc_len=300)

documents = [
    "PyTorch is an open-source machine learning framework that provides tensor computations with GPU acceleration and deep neural networks built on tape-based autograd system.",
    "Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications across clusters of machines.",
    "REST APIs follow representational state transfer architectural style using HTTP methods like GET, POST, PUT, DELETE for stateless client-server communication.",
]

queries = [
    "How to build real-time data pipelines?",
    "What are the benefits of microservices?",
    "How to implement efficient web APIs?"
]

# Encode both sides into multi-vector (one vector per token) representations
query_embeddings = colbert.encode_queries(queries)
doc_embeddings = colbert.encode_documents(documents)

# Late-interaction (MaxSim) score for every query/document pair
scores = ModernColBERT.compute_similarity(query_embeddings, doc_embeddings)
print(scores)
```
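To read the printed `scores`, it helps to turn the matrix into a per-query ranking. The sketch below is only an illustration: it assumes `scores` has one row per query and one column per document (the shape documented for `compute_similarity` in the Transformers code on this card), it uses a made-up matrix instead of real model output, and `top_k_per_query` is a hypothetical helper that is not part of `lateness`.

```python
from typing import List, Sequence, Tuple


def top_k_per_query(scores: Sequence[Sequence[float]], k: int = 2) -> List[List[Tuple[int, float]]]:
    """For each query row, return the k best (doc_index, score) pairs, best first."""
    ranked = []
    for row in scores:
        # Sort document indices by descending score for this query
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, float(row[j])) for j in order[:k]])
    return ranked


# Made-up scores for 3 queries x 3 documents (NOT real model output)
toy_scores = [
    [9.1, 4.2, 6.5],
    [3.3, 8.8, 5.0],
    [2.0, 7.1, 9.9],
]

ranking = top_k_per_query(toy_scores, k=2)
for q_idx, hits in enumerate(ranking):
    print(f"query {q_idx}: {hits}")
```

If the scores come back as a tensor or array, calling `.tolist()` first (available on both PyTorch tensors and NumPy arrays) gives the nested-list form used here.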
## Inference code using Transformers

> [!TIP]
> The first snippet below uses the `ColBERT` class defined in the second snippet. Copy and paste the second snippet first, then run the first one.

```python
model_path = "prithivida/modern_colbert_base_en_v1"

try:
    colbert = ColBERT.load_for_inference(model_path, max_query_len=32, max_doc_len=300)

    # Test data
    queries = [
        "How does deep learning work?",
        "What is machine learning?",
        "What are neural networks?"
    ]

    documents = [
        "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
        "Deep learning uses neural networks with multiple layers to process data.",
        "Neural networks are computing systems inspired by biological neural networks.",
        "Artificial intelligence encompasses machine learning and deep learning.",
    ]

    # Encode and find similarity
    print("\n=== Encode and Calculate similarity ===")
    q_reps = colbert.encode_queries(queries, batch_size=4, to_cpu=True)
    p_reps = colbert.encode_documents(documents, batch_size=4, to_cpu=True)
    scores = colbert.compute_similarity(q_reps, p_reps)
    print(scores)

    # or Test single query ranking
    print("\n=== Single Query Ranking ===")
    query = "How does deep learning work?"
    results = colbert.rank_documents(query, documents, top_k=3)

    print(f"Query: {query}")
    for i, (doc_idx, score, doc_text) in enumerate(results):
        print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")

except Exception as e:
    print(f"Error during testing: {e}")
```

```python
import torch
from torch import nn
from transformers import PreTrainedModel, AutoConfig, AutoModel, AutoTokenizer
from transformers.modeling_outputs import BaseModelOutput
from tqdm import tqdm
from typing import List, Tuple, Union, Optional
import string
import os


class TaggingHead(nn.Module):
    """Bias-free linear projection from the backbone hidden size to the ColBERT dimension."""

    def __init__(self, input_size, num_labels):
        super().__init__()
        self.classifier = nn.Linear(input_size, num_labels, bias=False)
        nn.init.xavier_uniform_(self.classifier.weight)

    def forward(self, x):
        return self.classifier(x)


class ColBERT(PreTrainedModel):
    config_class = AutoConfig
    base_model_prefix = "backbone"

    def __init__(self, config):
        super().__init__(config)
        self.backbone = AutoModel.from_config(config)
        hidden_dim = config.hidden_size
        self.heads = nn.ModuleDict({
            "col_pooling": TaggingHead(hidden_dim, num_labels=128)
        })

        # Inference settings (will be set when loading for inference)
        self.tokenizer = None
        self.max_query_len = 256
        self.max_doc_len = 300
        self.Q_PID = None
        self.D_PID = None

    def _init_weights(self, module):
        if isinstance(module, (nn.Linear, nn.Embedding)):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.data.zero_()

    def forward(self, input_ids, attention_mask=None, position_ids=None, return_dict=False, **kwargs):
        kwargs.pop("token_type_ids", None)
        outputs = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            return_dict=True,
            **kwargs
        )
        reps = outputs.last_hidden_state
        # L2-normalize token vectors, zero out masked positions, then project to 128 dims
        reps = torch.nn.functional.normalize(reps, p=2, dim=2)
        reps *= attention_mask[:, :, None].float()
        logits = self.heads["col_pooling"](reps)

        if return_dict:
            return BaseModelOutput(last_hidden_state=logits)
        return logits

    @classmethod
    def load_for_inference(cls, model_name_or_path: str, max_query_len: int = 256,
                           max_doc_len: int = 300, device: Optional[str] = None):
        """
        Load ColBERT model with tokenizer for inference

        Args:
            model_name_or_path: HuggingFace model path or local directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            device: Device to run inference on (auto-detect if None)
        """
        device = device or ("cuda" if torch.cuda.is_available() else "cpu")

        try:
            # Load model and tokenizer
            if os.path.exists(model_name_or_path):
                print(f"Loading model from local directory: {model_name_or_path}")
                config = AutoConfig.from_pretrained(model_name_or_path)
                model = cls.from_pretrained(model_name_or_path, config=config)
                tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
            else:
                print(f"Downloading model from HuggingFace Hub: {model_name_or_path}")
                config = AutoConfig.from_pretrained(model_name_or_path)
                model = cls.from_pretrained(model_name_or_path, config=config)
                tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

            # Setup inference configuration
            model.tokenizer = tokenizer
            model.max_query_len = max_query_len
            model.max_doc_len = max_doc_len
            model.Q_PID = tokenizer.convert_tokens_to_ids("[unused0]")
            model.D_PID = tokenizer.convert_tokens_to_ids("[unused1]")

            # Setup post-tokenization punctuation masking
            model.skip_ids = {tokenizer.encode(c, add_special_tokens=False)[0]
                              for c in string.punctuation}

            model.to(device)
            model.eval()

            print(f"ColBERT model loaded on {device}")
            print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")
            return model

        except Exception as e:
            print(f"Error loading model: {e}")
            raise

    def _encode_batch(self, ids: torch.Tensor, mask: torch.Tensor, to_cpu: bool = False):
        """Internal encoding function"""
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        ids, mask = ids.to(self.device), mask.to(self.device)
        pos = torch.arange(ids.size(1), device=self.device).unsqueeze(0).expand_as(ids)

        with torch.no_grad():
            rep = self(input_ids=ids, attention_mask=mask, position_ids=pos)

        return rep.cpu() if to_cpu else rep

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None, to_cpu: bool = False):
        """
        Encode queries for ColBERT retrieval

        Args:
            queries: List of query strings
            batch_size: Batch size for processing (None for single batch)
            to_cpu: Whether to move results to CPU

        Returns:
            Query representations tensor
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix
        enc = self.tokenizer(queries, add_special_tokens=True, truncation=False)
        id_lists = [[self.Q_PID] + ids for ids in enc["input_ids"]]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or (self.tokenizer.model_max_length - 1)
        id_lists = [_dynamic_augment(ids, self.tokenizer.mask_token_id, cap) for ids in id_lists]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return torch.cat(reps)

        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False):
        """
        Encode documents for ColBERT retrieval with post-tokenization punctuation masking

        Args:
            documents: List of document strings
            batch_size: Batch size for processing (None for single batch)
            keep_dims: Whether to keep tensor dimensions (True) or return list of variable-length tensors
            to_cpu: Whether to move results to CPU

        Returns:
            Document representations tensor or list
        """
        if self.tokenizer is None:
            raise RuntimeError("Model not loaded for inference. Use ColBERT.load_for_inference()")

        print(f"Encoding {len(documents)} documents...")

        # Tokenize documents WITHOUT removing punctuation (post-tokenization masking)
        enc = self.tokenizer(documents, add_special_tokens=True, truncation=True,
                             max_length=self.max_doc_len - 1)
        id_lists = [[self.D_PID] + ids for ids in enc["input_ids"]]

        # Pad sequences
        padded = self.tokenizer.pad({"input_ids": id_lists}, padding=True, return_tensors="pt")
        ids, mask = padded["input_ids"], padded["attention_mask"]

        # Apply post-tokenization punctuation masking
        mask[torch.isin(ids, torch.tensor(list(self.skip_ids), device=ids.device))] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []

            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    # Convert to list of variable-length tensors
                    m = a.cpu().bool() if to_cpu else a.bool()
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)

            if keep_dims:
                return _stack_3D_tensors(reps)[rev]
            else:
                # Flatten and reorder
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.cpu().bool() if to_cpu else mask.bool()
            rep = [r[m[idx]] for idx, r in enumerate(rep)]

        return rep

    # Declared static because the signature takes no `self`; without the decorator,
    # calls such as `colbert.compute_similarity(q_reps, p_reps)` would pass three arguments.
    @staticmethod
    def compute_similarity(q_reps: torch.Tensor, p_reps: torch.Tensor):
        """
        Compute ColBERT-style max similarity between queries and passages

        Args:
            q_reps: Query representations [num_queries, max_q_len, dim]
            p_reps: Passage representations [num_passages, max_p_len, dim]

        Returns:
            Similarity scores [num_queries, num_passages]
        """
        token_scores = torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        scores, _ = token_scores.max(-1)
        scores = scores.sum(1)
        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """
        End-to-end search: encode queries and documents, compute similarities

        Args:
            queries: List of query strings
            documents: List of document strings
            batch_size: Batch size for encoding
            return_scores: Whether to return similarity scores

        Returns:
            If return_scores=True: (scores, query_reps, doc_reps)
            If return_scores=False: (query_reps, doc_reps)
        """
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)

        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps

        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10):
        """
        Rank documents for a single query

        Args:
            query: Query string
            documents: List of document strings
            top_k: Number of top results to return

        Returns:
            List of (document_index, score, document_text) tuples
        """
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)  # Remove query dimension

        # Get top-k results
        top_indices = torch.topk(scores, min(top_k, len(documents))).indices

        results = []
        for idx in top_indices:
            results.append((idx.item(), scores[idx].item(), documents[idx.item()]))

        return results


# ---------------------------------------------------------------------------
# Helper Functions
# ---------------------------------------------------------------------------

def _split_into_batches(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize]) for i in range(0, ids.size(0), bsize)]


def _sort_by_length(ids: torch.Tensor, mask: torch.Tensor, bsize: int):
    if ids.size(0) <= bsize:
        return ids, mask, torch.arange(ids.size(0))
    lengths = mask.sum(-1)
    order = lengths.sort().indices
    reverse = order.sort().indices
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]

    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_tensors(groups):
    bsize = sum(x.size(0) for x in groups)
    maxlen = max(x.size(1) for x in groups)
    hdim = groups[0].size(2)
    out = torch.zeros(bsize, maxlen, hdim, device=groups[0].device, dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.size(0), :g.size(1)] = g
        ptr += g.size(0)
    return out
```
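The `compute_similarity` method above scores a query against a passage with late interaction (MaxSim): the einsum `"qin,pjn->qipj"` builds every query-token/passage-token dot product, `max(-1)` keeps the best passage token for each query token, and `sum(1)` adds those maxima up. As a rough illustration, the same arithmetic for a single query/passage pair can be written in plain Python. The 2-dimensional vectors are made up, and `dot` and `maxsim` are hypothetical helper names, not part of the model code.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """For each query token take its best-matching document token, then sum over query tokens."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy token embeddings (made up for illustration, NOT real model output)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # last row stands in for a masked token
doc_b = [[0.5, 0.5], [0.0, 0.0]]

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

The all-zero rows mirror what the `forward` method above does to masked positions: hidden states are multiplied by the attention mask before the bias-free projection, so padding and masked punctuation come out as zero vectors and score 0 against every query token.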
## Inference code using ONNX

> [!TIP]
> The first snippet below uses the `ONNXColBERT` class defined in the second snippet. Copy and paste the second snippet first, then run the first one.

```python
# Both paths are read from the local filesystem by the class below
# (e.g. a downloaded copy of the model repository).
model_path = "prithivida/modern_colbert_base_en_v1"
onnx_model_path = "prithivida/modern_colbert_base_en_v1/onnx/model.onnx"

# Load ONNX model for inference using the standalone tokenizer path
onnx_colbert = ONNXColBERT(onnx_model_path, model_path, max_query_len=32, max_doc_len=300)  # Pass model_path as tokenizer_path

# Test inference
queries = [
    "How does deep learning work?",
    "What is machine learning?",
    "What are neural networks?"
]

documents = [
    "Machine learning is the idea of approximating a real world phenomenon using data, the approximation can be mathematical or otherwise.",
    "Deep learning uses neural networks with multiple layers to process data.",
    "Neural networks are computing systems inspired by biological neural networks.",
    "Artificial intelligence encompasses machine learning and deep learning.",
]

# Encode and find similarity
print("\n=== ONNX Encode and Compute similarity ===")
q_reps = onnx_colbert.encode_queries(queries, batch_size=4, to_cpu=True)
p_reps = onnx_colbert.encode_documents(documents, batch_size=4, to_cpu=True)
scores = onnx_colbert.compute_similarity(q_reps, p_reps)

# or Test single query ranking
print("\n=== ONNX Standalone Single Query Ranking ===")
query = "How does deep learning work?"
results = onnx_colbert.rank_documents(query, documents, top_k=3)

print(f"Query: {query}")
for i, (doc_idx, score, doc_text) in enumerate(results):
    print(f"  {i+1}. Score: {score:.4f} | Doc: {doc_text}")
```

```python
import numpy as np
import onnxruntime as ort
from tokenizers import AddedToken, Tokenizer
import json
import string
from pathlib import Path
from typing import List, Optional, Tuple, Union
from tqdm import tqdm


# ---------------------------------------------------------------------------
# ONNX ColBERT Class
# ---------------------------------------------------------------------------

class ONNXColBERT:
    def __init__(self, onnx_model_path: str, tokenizer_path: str,
                 max_query_len: int = 256, max_doc_len: int = 300,
                 providers: Optional[List[str]] = None):
        """
        ONNX ColBERT - identical to PyTorch ColBERT.load_for_inference()

        Args:
            onnx_model_path: Path to the ONNX model file
            tokenizer_path: Path to the tokenizer directory
            max_query_len: Maximum query length
            max_doc_len: Maximum document length
            providers: ONNX Runtime providers
        """
        # Load standalone tokenizer
        self.model_dir = Path(tokenizer_path)
        self.tokenizer = self._get_tokenizer(max_length=512)
        self.max_query_len = max_query_len
        self.max_doc_len = max_doc_len

        # Setup inference configuration
        self.Q_PID = self.tokenizer.token_to_id("[unused0]")
        self.D_PID = self.tokenizer.token_to_id("[unused1]")
        self.mask_token_id = self.tokenizer.token_to_id("[MASK]")

        if None in [self.Q_PID, self.D_PID, self.mask_token_id]:
            raise ValueError("Could not find required special tokens in tokenizer")

        # Setup post-tokenization punctuation masking
        self.skip_ids = set()
        for c in string.punctuation:
            encoded = self.tokenizer.encode(c, add_special_tokens=False)
            if len(encoded.ids) > 0:
                self.skip_ids.add(encoded.ids[0])

        print(f"Identified {len(self.skip_ids)} punctuation token IDs to skip")

        # Initialize ONNX Runtime session
        if providers is None:
            providers = ['CUDAExecutionProvider', 'CPUExecutionProvider']

        self.session = ort.InferenceSession(onnx_model_path, providers=providers)

        print(f"✅ ONNX ColBERT loaded with providers: {self.session.get_providers()}")
        print(f"Query max length: {max_query_len}, Document max length: {max_doc_len}")

    def _get_tokenizer(self, max_length: int = 512) -> Tokenizer:
        """Initialize tokenizer"""
        with open(str(self.model_dir / "config.json")) as config_file:
            config = json.load(config_file)
        with open(str(self.model_dir / "tokenizer_config.json")) as tokenizer_config_file:
            tokenizer_config = json.load(tokenizer_config_file)
        with open(str(self.model_dir / "special_tokens_map.json")) as tokens_map_file:
            tokens_map = json.load(tokens_map_file)

        tokenizer = Tokenizer.from_file(str(self.model_dir / "tokenizer.json"))
        tokenizer.enable_truncation(max_length=min(tokenizer_config["model_max_length"], max_length))
        tokenizer.enable_padding(pad_id=config["pad_token_id"], pad_token=tokenizer_config["pad_token"])

        for token in tokens_map.values():
            if isinstance(token, str):
                tokenizer.add_special_tokens([token])
            elif isinstance(token, dict):
                tokenizer.add_special_tokens([AddedToken(**token)])

        return tokenizer

    def _encode_batch(self, ids: np.ndarray, mask: np.ndarray, to_cpu: bool = False) -> np.ndarray:
        """Internal encoding function (`to_cpu` is kept only for API parity; outputs are NumPy arrays)"""
        # Create position IDs
        pos = np.arange(ids.shape[1])[None, :].repeat(ids.shape[0], axis=0)

        # ONNX inference
        inputs = {
            "input_ids": ids.astype(np.int64),
            "attention_mask": mask.astype(np.int64),
            "position_ids": pos.astype(np.int64)
        }

        outputs = self.session.run(["last_hidden_state"], inputs)
        return outputs[0]

    def encode_queries(self, queries: List[str], batch_size: Optional[int] = None,
                       to_cpu: bool = False) -> np.ndarray:
        """Encode queries - IDENTICAL to PyTorch ColBERT.encode_queries()"""
        print(f"Encoding {len(queries)} queries...")

        # Tokenize with query prefix. Queries are encoded one at a time (as documents are below)
        # so the tokenizer's batch padding cannot insert [PAD] tokens ahead of the [MASK] augmentation.
        encoded_queries = [self.tokenizer.encode(q, add_special_tokens=True) for q in queries]
        id_lists = [[self.Q_PID] + encoded.ids for encoded in encoded_queries]

        # Apply dynamic augmentation with length cap
        cap = self.max_query_len or 511
        id_lists = [_dynamic_augment(ids, self.mask_token_id, cap) for ids in id_lists]

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)

        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)

        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Process in batches if specified
        if batch_size:
            reps = []
            for i, a in tqdm(_split_into_batches(ids, mask, batch_size), desc="Encoding query batches"):
                reps.append(self._encode_batch(i, a, to_cpu))
            return np.concatenate(reps, axis=0)

        return self._encode_batch(ids, mask, to_cpu)

    def encode_documents(self, documents: List[str], batch_size: Optional[int] = None,
                         keep_dims: bool = True, to_cpu: bool = False) -> Union[np.ndarray, List[np.ndarray]]:
        """Encode documents - IDENTICAL to PyTorch ColBERT.encode_documents()"""
        print(f"Encoding {len(documents)} documents...")

        # Encode documents individually to preserve natural lengths
        encoded_docs = []
        for doc in documents:
            encoded = self.tokenizer.encode(doc, add_special_tokens=True)
            encoded_docs.append(encoded)

        id_lists = []
        for encoded in encoded_docs:
            ids = encoded.ids
            # Truncate to max_doc_len - 1
            if len(ids) > self.max_doc_len - 1:
                ids = ids[:self.max_doc_len - 1]
            # Add D_PID prefix
            ids = [self.D_PID] + ids
            id_lists.append(ids)

        # Manual padding
        max_len = max(len(ids) for ids in id_lists)
        batch_size_actual = len(id_lists)

        ids = np.zeros((batch_size_actual, max_len), dtype=np.int64)
        mask = np.zeros((batch_size_actual, max_len), dtype=np.int64)

        for i, id_list in enumerate(id_lists):
            ids[i, :len(id_list)] = id_list
            mask[i, :len(id_list)] = 1

        # Apply post-tokenization punctuation masking
        for skip_id in self.skip_ids:
            mask[ids == skip_id] = 0

        # Process in batches if specified
        if batch_size:
            ids_s, mask_s, rev = _sort_by_length(ids, mask, batch_size)
            reps = []

            for i, a in tqdm(_split_into_batches(ids_s, mask_s, batch_size), desc="Encoding document batches"):
                rep = self._encode_batch(i, a, to_cpu)
                if not keep_dims:
                    m = a.astype(bool)
                    rep = [r[m[idx]] for idx, r in enumerate(rep)]
                reps.append(rep)

            if keep_dims:
                return _stack_3D_arrays(reps)[rev]
            else:
                flat = [d for g in reps for d in g]
                return [flat[i] for i in rev.tolist()]

        # Single batch processing
        rep = self._encode_batch(ids, mask, to_cpu)
        if not keep_dims:
            m = mask.astype(bool)
            rep = [r[m[idx]] for idx, r in enumerate(rep)]

        return rep

    # Declared static because the signature takes no `self`; without the decorator,
    # calls such as `onnx_colbert.compute_similarity(q_reps, p_reps)` would pass three arguments.
    @staticmethod
    def compute_similarity(q_reps: np.ndarray, p_reps: np.ndarray) -> np.ndarray:
        """Compute ColBERT similarity - IDENTICAL to PyTorch version"""
        # Identical to PyTorch: torch.einsum("qin,pjn->qipj", q_reps, p_reps)
        token_scores = np.einsum("qin,pjn->qipj", q_reps, p_reps)
        # Identical to PyTorch: scores, _ = token_scores.max(-1)
        scores = np.max(token_scores, axis=-1)
        # Identical to PyTorch: scores = scores.sum(1)
        scores = np.sum(scores, axis=1)
        return scores

    def search(self, queries: List[str], documents: List[str],
               batch_size: Optional[int] = None, return_scores: bool = True):
        """End-to-end search - IDENTICAL to PyTorch ColBERT.search()"""
        # Encode queries and documents
        q_reps = self.encode_queries(queries, batch_size=batch_size, to_cpu=True)
        p_reps = self.encode_documents(documents, batch_size=batch_size, to_cpu=True)

        if return_scores:
            # Compute similarities
            print("Computing similarities...")
            scores = self.compute_similarity(q_reps, p_reps)
            return scores, q_reps, p_reps

        return q_reps, p_reps

    def rank_documents(self, query: str, documents: List[str], top_k: int = 10) -> List[Tuple]:
        """Rank documents - IDENTICAL to PyTorch ColBERT.rank_documents()"""
        scores, _, _ = self.search([query], documents, return_scores=True)
        scores = scores.squeeze(0)

        # Get top-k results
        top_indices = np.argsort(scores)[::-1][:min(top_k, len(documents))]

        results = []
        for idx in top_indices:
            results.append((int(idx), float(scores[idx]), documents[idx]))

        return results


# ---------------------------------------------------------------------------
# Helper Functions (NumPy versions)
# ---------------------------------------------------------------------------

def _split_into_batches(ids: np.ndarray, mask: np.ndarray, bsize: int):
    return [(ids[i:i + bsize], mask[i:i + bsize]) for i in range(0, ids.shape[0], bsize)]


def _sort_by_length(ids: np.ndarray, mask: np.ndarray, bsize: int):
    if ids.shape[0] <= bsize:
        return ids, mask, np.arange(ids.shape[0])
    lengths = mask.sum(-1)
    order = np.argsort(lengths)
    reverse = np.argsort(order)
    return ids[order], mask[order], reverse


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]

    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)
    if target - q_len < 8:
        target = q_len + 8
    if max_cap is not None:
        target = min(target, max_cap)
    return ids + [mask_id] * (target - q_len)


def _stack_3D_arrays(groups):
    bsize = sum(x.shape[0] for x in groups)
    maxlen = max(x.shape[1] for x in groups)
    hdim = groups[0].shape[2]
    out = np.zeros((bsize, maxlen, hdim), dtype=groups[0].dtype)
    ptr = 0
    for g in groups:
        out[ptr:ptr + g.shape[0], :g.shape[1]] = g
        ptr += g.shape[0]
    return out
```
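Both inference paths pad queries with `[MASK]` tokens through `_dynamic_augment` before encoding, and the resulting length depends on the query length and the `max_query_len` cap. The sketch below repeats that helper's logic and prints the padded length for a few cases. The token IDs are placeholders rather than real vocabulary IDs, and `augmented_len` is a hypothetical convenience function added only for this illustration.

```python
from typing import List, Optional


def _dynamic_augment(ids: List[int], mask_id: int, max_cap: Optional[int] = None) -> List[int]:
    # Same logic as the helper in the inference code on this card
    if max_cap is not None and len(ids) > max_cap:
        return ids[:max_cap]  # over-long queries are truncated, not padded
    q_len = len(ids)
    target = max(32, ((q_len + 31) // 32) * 32)  # round up to a multiple of 32, at least 32
    if target - q_len < 8:
        target = q_len + 8  # always leave room for at least 8 mask tokens
    if max_cap is not None:
        target = min(target, max_cap)  # but never exceed the cap
    return ids + [mask_id] * (target - q_len)


MASK_ID = 0  # placeholder, not the real [MASK] id


def augmented_len(q_len: int, cap: Optional[int]) -> int:
    """Length after augmentation for a dummy query of q_len tokens."""
    return len(_dynamic_augment([1] * q_len, MASK_ID, cap))


for q_len, cap in [(10, 32), (28, 32), (28, None), (40, 32), (40, 256), (60, 256)]:
    print(f"query tokens={q_len:>3} cap={cap!s:>4} -> padded length={augmented_len(q_len, cap)}")
```

With `max_query_len=32`, as in the usage snippets, every query ends up exactly 32 tokens long: shorter ones are filled with mask tokens and longer ones are cut. Larger caps let long queries grow to the next multiple of 32, or by 8 extra tokens when that multiple is less than 8 tokens away.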

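The `encode_documents` methods tokenize documents with punctuation intact and only afterwards set the mask to 0 at punctuation positions ("post-tokenization punctuation masking"). A minimal sketch of that step on a toy example follows. The vocabulary, its IDs, and the `build_mask` helper are hypothetical; a real tokenizer supplies the actual IDs.

```python
import string
from typing import Dict, List, Set

# Hypothetical toy vocabulary; a real tokenizer supplies these IDs
toy_vocab: Dict[str, int] = {"[PAD]": 0, "[D]": 1, "hello": 2, ",": 3, "world": 4, "!": 5}

# IDs of the single-character punctuation tokens present in the toy vocabulary
skip_ids: Set[int] = {toy_vocab[c] for c in string.punctuation if c in toy_vocab}


def build_mask(ids: List[int], pad_id: int, skip: Set[int]) -> List[int]:
    """1 for tokens that should be encoded and scored, 0 for padding and punctuation."""
    return [0 if (tok == pad_id or tok in skip) else 1 for tok in ids]


# "[D] hello , world !" followed by one padding position
doc_ids = [toy_vocab[t] for t in ["[D]", "hello", ",", "world", "!", "[PAD]"]]
mask = build_mask(doc_ids, toy_vocab["[PAD]"], skip_ids)

print(doc_ids)
print(mask)
```

In the inference code the same mask is also passed to the model as `attention_mask`, so punctuation tokens are both ignored by attention and excluded from scoring.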
_____

# Notes on reproducing

We welcome anyone to reproduce our results. Here are some tips and observations:

- Please pay attention to the query length. We tried our best to look at what the original ColBERTv2.0 used and what LightOn AI used, and also spoke to Nils Reimers about how much liberty one can take in the choice of query lengths.
- Note on query length from the ColBERTv2.0 paper:

  > Unless otherwise stated, we keep the default query maximum sequence length for ColBERTv2 and RocketQAv2, which is 32 tokens. For the ArguAna test in BEIR, as the queries are themselves long documents, we set the maximum query length used by ColBERTv2 and RocketQAv2 to 300. For Climate-FEVER, as the queries are relatively long sentence claims, we set the maximum query length used by ColBERTv2 to 64.

- Query lengths used by LightOn AI PyLate (assuming the OSS code is what they used):

  ```python
  query_len = {
      "quora": 32,
      "climate-fever": 64,
      "nq": 32,
      "msmarco": 32,
      "hotpotqa": 32,
      "nfcorpus": 32,
      "scifact": 48,
      "trec-covid": 48,
      "fiqa": 32,
      "arguana": 64,
      "scidocs": 48,
      "dbpedia-entity": 32,
      "webis-touche2020": 32,
      "fever": 32,
      "cqadupstack/android": 32,
      "cqadupstack/english": 32,
      "cqadupstack/gaming": 32,
      "cqadupstack/gis": 32,
      "cqadupstack/mathematica": 32,
      "cqadupstack/physics": 32,
      "cqadupstack/programmers": 32,
      "cqadupstack/stats": 32,
      "cqadupstack/tex": 32,
      "cqadupstack/unix": 32,
      "cqadupstack/webmasters": 32,
      "cqadupstack/wordpress": 32,
  }
  ```

- This is what OG Nils had to say when I asked why there is so much liberty in query length:

  > Comparison is always hard...I think query length doesn't skew too much. Retrieval compute scales linear with the number of query tokens. So if people are comfortable to compare models with largely different parameters, comparing different query token lengths would be fine as well

- We took a balanced view of both choices and borrowed the query length defaults used by LightOn, with the only exception of ArguAna: instead of the original ColBERT's 300 or LightOn's 64, we used 256.
- Nota bene: There *may be* minor differences in the numbers when reproducing. For instance, BGE-M3 reports an nDCG@10 of 59.3 for MIRACL Hindi and we observed only 58.9. But these are not massive differences like those between the reported and reproduced ColBERT-small numbers on some datasets. Here are our numbers for the full Hindi run on BGE-M3:

  ```python
  {'NDCG@1': 0.49714, 'NDCG@3': 0.5115, 'NDCG@5': 0.53908, 'NDCG@10': 0.58936, 'NDCG@100': 0.6457, 'NDCG@1000': 0.65336}
  {'MAP@1': 0.28845, 'MAP@3': 0.42424, 'MAP@5': 0.46455, 'MAP@10': 0.49955, 'MAP@100': 0.51886, 'MAP@1000': 0.51933}
  {'Recall@10': 0.73032, 'Recall@50': 0.8987, 'Recall@100': 0.93974, 'Recall@200': 0.95763, 'Recall@500': 0.97813, 'Recall@1000': 0.9902}
  {'P@1': 0.49714, 'P@3': 0.33048, 'P@5': 0.24629, 'P@10': 0.15543, 'P@100': 0.0202, 'P@1000': 0.00212}
  {'MRR@10': 0.60893, 'MRR@100': 0.615, 'MRR@1000': 0.6151}
  ```

- We made sure all quirks and known BEIR ColBERT issues are taken care of:
  - [ArguAna and Quora (?) self-match issues](https://github.com/beir-cellar/beir/issues/67)
  - Will add more - TBA

# Acknowledgements

- Thanks to Alibaba-NLP for Alibaba-NLP/gte-modernbert-base, which is our base model (as used by LightOn AI).
- Thanks to Nils Reimers for the tips and inputs.
- Thanks to Nandan Thakur for answering questions.
- Thanks to Antoine Chaffin and the entire LightOn team for PyLate.
- Thanks to the NanoBEIR authors, it's a blessing.
- Thanks to Prithivi Da for his generous funding for this work :-)

# Open Questions (still have on ColBERT) / thoughts:

- People who have worked on ColBERT would agree that MarginMSE loss sucks and KLDiv works great for ColBERT in practice. Is there a formal / mathematical study on why MarginMSE sucks so bad? (JaColBERT has done some ablations, but we would love to read why.)
- What does BERT as an encoder architecture bring that makes it the best choice for ColBERT compared to other encoder architectures?
- What were the temperature choices for ColBERT for query and doc scores?
- Alibaba-NLP/gte-modernbert-base's BEIR avg is 55.33 and beats the best ColBERTs in the world (as of 2nd July 2025), so calling single-vec models naive is naive.

# Wishlist

- When I can expend more GPU:
  - would love to try and reproduce LightOn AI's GTE-ModernColBERT BEIR eval numbers.
  - would run eval for prithivida/modern_colbert_base_en_v1 on a long docs benchmark.