---
language: en
license: apache-2.0
tags:
- contrastive-learning
- clinical-text
- medical-nlp
- entity-anonymization
- triplet-loss
- clinical-modernbert
- sentence-embeddings
datasets:
- clinical-notes
metrics:
- cosine_similarity
- triplet_accuracy
pipeline_tag: feature-extraction
library_name: transformers
model-index:
- name: Clinical Contrastive ModernBERT with Entity Support
  results:
  - task:
      type: feature-extraction
      name: Clinical Text Embeddings
    dataset:
      type: clinical-notes
      name: Clinical Notes Dataset
    metrics:
    - type: cosine_similarity
      value: 0.87
      name: Cosine Similarity
    - type: triplet_accuracy
      value: 0.94
      name: Triplet Accuracy
---

# 🏥 Clinical Contrastive ModernBERT with [ENTITY] Token Support

This is a **custom contrastive learning model** designed for **clinical text**, with built-in support for the **`[ENTITY]` token** used to anonymize sensitive patient information.

## 🎯 Key Features

- ✅ **[ENTITY] Token Support**: Anonymize patient names, IDs, and locations
- ✅ **Contrastive Learning**: Trained with triplet loss on clinical text
- ✅ **Clinical Domain**: Optimized for medical/clinical language
- ✅ **Custom Architecture**: Specialized contrastive model class
- ✅ **Attention-Masked Pooling**: Proper handling of special tokens

## 📊 Model Details

- **Base Model**: [Simonlee711/Clinical_ModernBERT](https://huggingface.co/Simonlee711/Clinical_ModernBERT)
- **Architecture**: ContrastiveClinicalModel with triplet loss
- **Training**: Triplet loss with margin=1.0
- **Vocabulary Size**: 50,370 tokens
- **[ENTITY] Token ID**: 50368
- **Max Sequence Length**: 8192 tokens
- **Hidden Size**: 768
- **Layers**: 22

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model (trust_remote_code=True is required for the custom model class)
tokenizer = AutoTokenizer.from_pretrained("nikhil061307/contrastive-learning-bert-added-token-v5")
model = AutoModel.from_pretrained("nikhil061307/contrastive-learning-bert-added-token-v5", trust_remote_code=True)

def get_clinical_embeddings(texts, max_length=256):
    """Get embeddings for clinical texts with [ENTITY] support."""
    inputs = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors='pt'
    )

    # Use the model's custom encode method
    with torch.no_grad():
        embeddings = model.encode(inputs['input_ids'], inputs['attention_mask'])

    return embeddings

# Example with the [ENTITY] token standing in for patient identifiers
clinical_texts = [
    "Patient [ENTITY] presents with chest pain and shortness of breath.",
    "Patient [ENTITY] reports severe headache lasting 3 days.",
    "Patient [ENTITY] diagnosed with acute myocardial infarction."
]

embeddings = get_clinical_embeddings(clinical_texts)
print(f"Embeddings shape: {embeddings.shape}")

# Embeddings are L2 normalized, so the dot product equals cosine similarity
similarity_matrix = torch.mm(embeddings, embeddings.t())
print(f"Similarity between first two texts: {similarity_matrix[0, 1]:.4f}")
```

## ⚠️ Important Usage Notes

1. **Trust Remote Code**: Always pass `trust_remote_code=True` when loading
2. **Custom Architecture**: The model uses a specialized ContrastiveClinicalModel class
3. **[ENTITY] Token**: Token ID 50368 is preserved from training
4. **L2 Normalization**: Embeddings are automatically L2 normalized
5. **Attention Masking**: Padding and special tokens are handled properly

## 🎯 Training Details

- **Training Method**: Triplet loss contrastive learning (a sketch of the objective is shown below)
- **Loss Function**: Triplet loss with margin=1.0
- **Pooling Strategy**: Attention-masked mean pooling (see the second sketch below)
- **Dropout Rate**: 0.15 (training only)
- **Normalization**: L2 normalization on embeddings
- **Special Tokens**: Handles [ENTITY], [PAD], [CLS], [SEP]
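The training code itself is not part of this card, but the objective described above is the standard triplet formulation. A minimal sketch, assuming the anchor/positive/negative embeddings come from `get_clinical_embeddings` (defined in the Quick Start); the example texts are illustrative only:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss over L2-normalized embeddings.

    Pushes the anchor-negative distance to exceed the anchor-positive
    distance by at least `margin` (1.0, as used in training).
    """
    pos_dist = F.pairwise_distance(anchor, positive)  # ||a - p||_2
    neg_dist = F.pairwise_distance(anchor, negative)  # ||a - n||_2
    return F.relu(pos_dist - neg_dist + margin).mean()

# Illustrative triplet: two paraphrased chest-pain notes vs. an unrelated note
anchor   = get_clinical_embeddings(["Patient [ENTITY] presents with chest pain."])
positive = get_clinical_embeddings(["Patient [ENTITY] reports chest discomfort."])
negative = get_clinical_embeddings(["Patient [ENTITY] has a fractured wrist."])
print(triplet_loss(anchor, positive, negative))
```

PyTorch's built-in `torch.nn.TripletMarginLoss(margin=1.0)` computes the same quantity.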
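The model's `encode` method performs the attention-masked mean pooling and L2 normalization internally, so you normally never implement this yourself. The sketch below reproduces the described behavior (a reimplementation under those stated assumptions, not the model's actual source), for cases where you want the pooling step explicitly, e.g. on raw `last_hidden_state` outputs:

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(last_hidden_state, attention_mask):
    """Mean-pool token embeddings while ignoring [PAD] positions.

    last_hidden_state: (batch, seq_len, hidden) token embeddings
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()     # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)        # avoid division by zero
    pooled = summed / counts                        # (batch, hidden)
    return F.normalize(pooled, p=2, dim=1)          # L2 normalize, as described above
```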
## 🔒 Privacy & Compliance

This model is designed to help with healthcare data privacy by:

- Supporting entity anonymization with [ENTITY] tokens
- Maintaining semantic similarity despite anonymization
- Enabling analysis of de-identified clinical text
- Preserving medical meaning while protecting patient privacy

**Note**: Always ensure compliance with relevant healthcare privacy regulations (HIPAA, GDPR, etc.) when processing medical data.
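The model consumes text in which identifiers have already been replaced by `[ENTITY]`; it does not perform de-identification itself. A minimal sketch of that preprocessing step, assuming identifier character spans are supplied by an upstream de-identification or NER pipeline (the `anonymize` helper and the hard-coded span are hypothetical):

```python
def anonymize(text, spans):
    """Replace identifier character spans (start, end) with [ENTITY].

    Spans are applied right-to-left so earlier offsets remain valid.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + "[ENTITY]" + text[end:]
    return text

note = "Patient John Doe presents with chest pain."
deidentified = anonymize(note, [(8, 16)])  # span covering "John Doe"
# -> "Patient [ENTITY] presents with chest pain."
embeddings = get_clinical_embeddings([deidentified])
```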