--- license: apache-2.0 base_model: answerdotai/ModernBERT-base tags: - sentence-transformers - feature-extraction - sentence-similarity - biomedical - embeddings - life-sciences - scientific-text - SODA-VEC - EMBO datasets: - EMBO/soda-vec-data-full_pmc_title_abstract_paired metrics: - cosine-similarity --- # VICReg Our Contrast Model ## Model Description SODA-VEC embedding model trained with VICReg Our Contrast loss function. This model uses normalized embeddings with covariance, feature, and dot product losses (including off-diagonal terms) to learn rich biomedical text representations. This model is part of the **SODA-VEC** (Scientific Open Domain Adaptation for Vector Embeddings) project, which focuses on creating high-quality embedding models for biomedical and life sciences text. **Key Features:** - Trained on **26.5M biomedical title-abstract pairs** from PubMed Central - Based on **ModernBERT-base** architecture - Optimized for **biomedical text similarity** and **semantic search** - Produces **768-dimensional embeddings** with mean pooling ## Training Details ### Training Data - **Dataset**: [`EMBO/soda-vec-data-full_pmc_title_abstract_paired`](https://huggingface.co/datasets/EMBO/soda-vec-data-full_pmc_title_abstract_paired) - **Size**: 26,473,900 training pairs - **Source**: Complete PubMed Central baseline (July 2024) - **Format**: Paired title-abstract examples optimized for contrastive learning ### Training Procedure **Loss Function**: VICReg Our Contrast: normalized embeddings with covariance loss, feature loss, and dot product loss (diagonal + off-diagonal) We have implemented a series of changes from the original [VICREG in the paper from Meta](https://arxiv.org/pdf/2105.04906). Here we show the main differences: | Feature | Original VICReg | VICReg Our | VICReg Our Contrast | |---------|----------------|------------|---------------------| | Normalization | No | Yes (L2-normalized) | Yes (L2-normalized) | | Invariance (MSE) | Yes | No | No | | Variance (hinge) | Yes | No | No | | Covariance | Yes (unnormalized) | Yes (normalized) | Yes (normalized) | | Feature correlation | No | Yes (cross-view) | Yes (cross-view) | | Sample similarity | No | Yes (diagonal only) | Yes (diagonal + off-diagonal) | **Coefficients**: cov=1.0, feature=1.0, dot=1.0 **Base Model**: `answerdotai/ModernBERT-base` **Training Configuration:** - **GPUs**: 4 - **Batch Size per GPU**: 16 - **Gradient Accumulation**: 4 - **Effective Batch Size**: 256 - **Learning Rate**: 2e-05 - **Warmup Steps**: 100 - **Pooling Strategy**: mean - **Epochs**: 1 (full dataset pass) **Training Command:** ```bash python scripts/soda-vec-train.py --config vicreg_our_contrast --coeff_cov 1 --coeff_feature 1 --coeff_dot 1 --push_to_hub --hub_org EMBO --save_limit 5 ``` ### Model Architecture - **Base Architecture**: ModernBERT-base (12 layers, 768 hidden size) - **Pooling**: Mean pooling over token embeddings - **Output Dimension**: 768 - **Normalization**: L2-normalized embeddings (for VICReg-based models) ## Usage ### Using Sentence-Transformers ```python from sentence_transformers import SentenceTransformer # Load the model model = SentenceTransformer("EMBO/vicreg_our_contrast") # Encode sentences sentences = [ "CRISPR-Cas9 gene editing in human cells", "Genome editing using CRISPR technology" ] embeddings = model.encode(sentences) print(f"Embedding shape: {embeddings.shape}") # Compute similarity from sentence_transformers.util import cos_sim similarity = cos_sim(embeddings[0], embeddings[1]) print(f"Similarity: {similarity.item():.4f}") ``` ### Using Hugging Face Transformers ```python from transformers import AutoTokenizer, AutoModel import torch import torch.nn.functional as F # Load model and tokenizer tokenizer = AutoTokenizer.from_pretrained("EMBO/vicreg_our_contrast") model = AutoModel.from_pretrained("EMBO/vicreg_our_contrast") # Encode sentences sentences = [ "CRISPR-Cas9 gene editing in human cells", "Genome editing using CRISPR technology" ] inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt") with torch.no_grad(): outputs = model(**inputs) # Mean pooling embeddings = outputs.last_hidden_state.mean(dim=1) # Normalize (for VICReg models) embeddings = F.normalize(embeddings, p=2, dim=1) # Compute similarity similarity = F.cosine_similarity(embeddings[0:1], embeddings[1:2]) print(f"Similarity: {similarity.item():.4f}") ``` ## Evaluation The model has been evaluated on comprehensive biomedical benchmarks including: - **Journal-Category Classification**: Matching journals to BioRxiv subject categories - **Title-Abstract Similarity**: Discriminating between related and unrelated paper pairs - **Field-Specific Separability**: Distinguishing between different biological fields - **Semantic Search**: Retrieval quality on biomedical text corpora For detailed evaluation results, see the [SODA-VEC benchmark notebooks](https://github.com/source-data/soda-vec). ## Intended Use This model is designed for: - **Biomedical Semantic Search**: Finding relevant papers, abstracts, or text passages - **Scientific Text Similarity**: Computing similarity between biomedical texts - **Information Retrieval**: Building search systems for scientific literature - **Downstream Tasks**: As a base for fine-tuning on specific biomedical tasks - **Research Applications**: Academic and research use in life sciences ## Limitations - **Domain Specificity**: Optimized for biomedical and life sciences text; may not perform as well on general domain text - **Language**: English only - **Text Length**: Optimized for titles and abstracts; longer documents may require chunking - **Bias**: Inherits biases from the training data (PubMed Central corpus) ## Citation If you use this model, please cite: ```bibtex @software{soda_vec, title = {SODA-VEC: Scientific Open Domain Adaptation for Vector Embeddings}, author = {EMBO}, year = {2024}, url = {https://github.com/EMBO/soda-vec} } ``` ## Model Card Contact For questions or issues, please open an issue on the [SODA-VEC GitHub repository](https://github.com/source-data/soda-vec). --- **Model Card Generated**: 2025-11-10