---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- masked-language-model
- dictionary
- encyclopedia
- glossary
- embeddings
- fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
- mjbommar/opengloss-v1.1-dictionary
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask
model-index:
- name: ogbert-v1-mlm
  results:
  - task:
      type: text-classification
      name: Clustering
    dataset:
      type: mjbommar/ogbert-v1-mlm
      name: OGBert v1 MLM Eval
    metrics:
    - type: adjusted_rand_index
      value: 0.7302
      name: ARI
  - task:
      type: retrieval
      name: Definition Retrieval
    dataset:
      type: mjbommar/ogbert-v1-mlm
      name: OGBert v1 MLM Eval
    metrics:
    - type: mrr
      value: 0.9596
      name: MRR
---

# OGBert v1 MLM

**OGBert** (OpenGloss BERT) is a ModernBERT-based masked language model pretrained on the OpenGloss synthetic encyclopedic dictionary. Despite being trained on a relatively small corpus of **~160M words** (435K dictionary entries), the model achieves strong performance on definition understanding and domain-specific terminology.

The training corpus contains definitions across 16 domains (geography, mathematics, science, law, technology, philosophy, etc.) and 11 reading levels (kindergarten through PhD).

## Model Description

- **Model type:** ModernBERT for Masked Language Modeling
- **Language:** English
- **License:** Apache 2.0
- **Parameters:** ~38M
- **Context length:** 1024 tokens
- **Training data:** [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
- **Built with:** Transformers v5.0
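
The figures above can be sanity-checked directly from the published checkpoint. A minimal sketch (the exact total depends on whether the embeddings and MLM head are counted):

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModelForMaskedLM.from_pretrained("mjbommar/ogbert-v1-mlm")

# Total parameter count (expected to be roughly 38M).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")

# Maximum sequence length supported by the position embeddings (1024).
print(f"context length: {config.max_position_embeddings}")
```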

## Architecture

| Parameter | Value |
|-----------|-------|
| Hidden size | 384 |
| Intermediate size | 1536 |
| Number of layers | 10 |
| Attention heads | 6 |
| Max position embeddings | 1024 |
| Vocabulary size | 32,769 |
| Attention pattern | Full + sliding window (128-token local window) |

The model uses ModernBERT's hybrid attention pattern, with full attention every 3 layers and sliding-window attention in between, enabling efficient processing of long sequences.
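
The alternating pattern is recorded in the checkpoint's configuration. A minimal sketch for inspecting it (the attribute names below are those used by ModernBERT's config class, so treat them as an assumption if the config schema changes):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mjbommar/ogbert-v1-mlm")

# ModernBERT configs expose the hybrid attention schedule directly:
# one full-attention layer every N layers, sliding-window attention otherwise.
print(config.global_attn_every_n_layers)  # expected: 3
print(config.local_attention)             # expected: 128 (local window size)
print(config.num_hidden_layers)           # expected: 10
```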

## Intended Uses

### Primary Use Cases

- **Fill-mask tasks:** Predicting masked tokens in dictionary/definition text
- **Feature extraction:** Generating embeddings for downstream tasks
- **Fine-tuning base:** Starting point for domain-specific models

### Domain Strengths

The model shows strong performance on:
- **Geography** (0.44 loss) - Place names and geographic terminology
- **Mathematics** (0.56 loss) - Mathematical and symbolic language
- **Society** (0.60 loss) - Social science terminology
- **Science** (0.63 loss) - Natural science terminology

## How to Use

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mjbommar/ogbert-v1-mlm")
result = fill_mask("A molecule is the smallest <|mask|> of a chemical compound.")
print(result)
```

**Example outputs:**

| Input | Top Predictions |
|-------|-----------------|
| "A triangle is a <\|mask\|> with three sides." | triangle (0.74), polygon (0.11), plane (0.04) |
| "A molecule is the smallest <\|mask\|> of a chemical compound." | unit (0.65), part (0.11), component (0.05) |
| "Democracy is a system of <\|mask\|> in which citizens exercise power." | government (0.39), governance (0.14), democracy (0.07) |
| "Photosynthesis is the process by which plants convert <\|mask\|> into energy." | energy (0.30), nutrients (0.19), light (0.10) |

### Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModel.from_pretrained("mjbommar/ogbert-v1-mlm")

text = "Photosynthesis is the process by which plants convert light into energy."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling

print(embeddings.shape)  # torch.Size([1, 384])
```
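
Mean-pooled embeddings can be compared with cosine similarity, which is the basis of the retrieval and clustering results reported below. A minimal sketch, assuming the same mean pooling as above and using hypothetical query/candidate strings:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModel.from_pretrained("mjbommar/ogbert-v1-mlm")

def embed(texts):
    # Mean-pool the final hidden states, ignoring padding tokens.
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical query and candidate definitions for illustration.
query = "the smallest unit of a chemical compound"
candidates = [
    "A molecule is a group of atoms bonded together.",
    "A peninsula is a piece of land surrounded by water on three sides.",
]
scores = F.cosine_similarity(embed([query]), embed(candidates))
print(scores)  # the chemistry definition should score higher
```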

### Masked Language Modeling

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModelForMaskedLM.from_pretrained("mjbommar/ogbert-v1-mlm")

text = "A molecule is the smallest <|mask|> of a chemical compound."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    predictions = outputs.logits[0, mask_idx].softmax(dim=-1)
    top_tokens = predictions.topk(5)

for score, idx in zip(top_tokens.values[0], top_tokens.indices[0]):
    print(f"{tokenizer.decode(idx)}: {score:.4f}")
# Output:
# unit: 0.6509
# part: 0.1069
# component: 0.0541
# form: 0.0294
# portion: 0.0243
```
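
The snippets above hard-code the `<|mask|>` literal; a small sketch that builds the prompt from `tokenizer.mask_token` instead, so the prompt stays correct without remembering the exact mask string:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")

# Construct the prompt from the tokenizer's own mask token rather than a literal string.
prompt = f"A triangle is a {tokenizer.mask_token} with three sides."
print(prompt)
```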

## Training Details

### Training Data

- **Dataset:** [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
- **Source:** OpenGloss v1.1 Dictionary
- **Domains:** 16 (Geography, Mathematics, Science, Law, Technology, Philosophy, etc.)
- **Reading levels:** 11 (Kindergarten through PhD)

### Training Procedure

| Parameter | Value |
|-----------|-------|
| Training steps | 5,000 |
| Total tokens | 8,402,890,141 (8.40B) |
| Epochs | 34.99 |
| MLM probability | 25% |
| Per-device batch size | 84 |
| Gradient accumulation | 32 |
| Global batch size | 2,688 |
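
For reference, the 25% masking rate corresponds to the standard dynamic-masking collator in Transformers. A minimal sketch (the actual training script is not included in this card, so this is illustrative only):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")

# Dynamic masking: 25% of tokens are selected for the MLM objective each step.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.25,
)
```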

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Peak learning rate | 5e-4 |
| Final learning rate | 0.0 (linear decay) |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| LR schedule | Linear warmup + linear decay |
| Optimizer | AdamW |
| Precision | bf16 |
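
Likewise, the schedule above maps onto ordinary `TrainingArguments`. A hypothetical, illustrative configuration rather than the exact one used for this run:

```python
from transformers import TrainingArguments

# Hypothetical arguments mirroring the hyperparameter table above.
args = TrainingArguments(
    output_dir="ogbert-v1-mlm",
    max_steps=5_000,
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_steps=500,
    weight_decay=0.01,
    per_device_train_batch_size=84,
    gradient_accumulation_steps=32,
    bf16=True,
)
```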

### Training Infrastructure

- **Framework:** Transformers + Accelerate
- **Hardware:** Single GPU

## Final Training Metrics

*From step 5000 (final checkpoint):*

| Metric | Value |
|--------|-------|
| Train loss | 0.6334 |
| Eval loss | 0.6685 |
| Eval perplexity | 1.951 |
| Gradient norm | 0.300 |
| Loss (100-step avg) | 0.655 |
| Loss (1000-step avg) | 0.667 |

### Loss Stability (Final 1000 Steps)

| Metric | Value |
|--------|-------|
| Mean | 0.6673 |
| Std | 0.0257 |
| 5th percentile | 0.6266 |
| 95th percentile | 0.7100 |

## Evaluation Results

### Clustering Performance

| Metric | Value |
|--------|-------|
| Adjusted Rand Index (ARI) | 0.7302 |
| Cluster Accuracy | 0.8000 |
| Silhouette Score | 0.2547 |
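
Clustering metrics of this kind can be computed with scikit-learn from definition embeddings and their domain labels. A minimal sketch with placeholder data (the exact evaluation protocol is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# embeddings: (n_definitions, 384) array from the feature-extraction snippet above.
# labels: integer domain label per definition. Both are placeholders here.
embeddings = np.random.rand(200, 384)
labels = np.random.randint(0, 16, size=200)

pred = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(embeddings)
print("ARI:", adjusted_rand_score(labels, pred))
print("Silhouette:", silhouette_score(embeddings, pred))
```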

### Retrieval Performance

| Metric | Value |
|--------|-------|
| Mean Reciprocal Rank (MRR) | 0.9596 |
| Mean Average Precision (MAP) | 0.8183 |
| Precision@1 | 0.9375 |
| Precision@3 | 0.9083 |
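
MRR here is the usual mean of 1/rank of the first relevant result over all queries. A minimal sketch over cosine-similarity rankings, with the embedding arrays left as hypothetical inputs:

```python
import numpy as np

def mean_reciprocal_rank(query_embs, corpus_embs, relevant_idx):
    """query_embs: (q, d); corpus_embs: (n, d);
    relevant_idx[i] = index of the correct corpus entry for query i."""
    # Cosine similarity via normalized dot products.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = q @ c.T
    # 1-based rank of the relevant document for each query.
    order = np.argsort(-sims, axis=1)
    ranks = [np.where(order[i] == relevant_idx[i])[0][0] + 1 for i in range(len(order))]
    return float(np.mean([1.0 / r for r in ranks]))
```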

### Word Similarity (SimLex-999)

| Metric | Value |
|--------|-------|
| Pearson correlation | 0.2911 |
| Spearman correlation | 0.2829 |
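
These scores are the usual correlations between model cosine similarities and SimLex-999 human ratings. A minimal sketch with placeholder values standing in for the real word pairs:

```python
from scipy.stats import pearsonr, spearmanr

# model_sims[i]: cosine similarity between the embeddings of word pair i.
# human_sims[i]: the corresponding SimLex-999 rating. Placeholder values below.
model_sims = [0.41, 0.12, 0.77, 0.30]
human_sims = [6.3, 1.8, 8.9, 4.1]

print("Pearson:", pearsonr(model_sims, human_sims)[0])
print("Spearman:", spearmanr(model_sims, human_sims)[0])
```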

### Training Convergence

| Metric | Value |
|--------|-------|
| Initial loss | 10.46 |
| Final train loss | 0.6334 |
| Final eval loss | 0.6685 |
| Loss reduction | 94% |
| Final perplexity | 1.95 |

## Limitations

1. **Word similarity:** The model achieves relatively low word similarity scores (SimLex 0.29). MLM pretraining optimizes for categorical boundaries rather than fine-grained pairwise similarity. For tasks requiring such similarity, consider contrastive fine-tuning (see the sketch after this list).

2. **Domain coverage:** Performance varies by domain. Arts and history show higher loss (0.77-0.84) than geography and mathematics (0.44-0.56).

3. **English only:** The model is trained exclusively on English text.
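
A bare-bones sketch of what such contrastive fine-tuning could look like, using in-batch negatives over hypothetical (term, definition) pairs; this is one reasonable recipe, not a procedure published with this model:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModel.from_pretrained("mjbommar/ogbert-v1-mlm")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def embed(texts):
    # Mean-pool final hidden states over non-padding tokens (gradients enabled).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical (term, definition) pairs; in practice these would come from the dictionary.
terms = ["molecule", "triangle", "democracy"]
definitions = [
    "The smallest unit of a chemical compound.",
    "A polygon with three sides.",
    "A system of government in which citizens exercise power.",
]

model.train()
for _ in range(3):  # a few illustrative steps
    anchors, positives = embed(terms), embed(definitions)
    # In-batch negatives: each term should match its own definition, not the others.
    logits = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1) / 0.05
    loss = F.cross_entropy(logits, torch.arange(len(terms)))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```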

## Related Models

- **Base architecture:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- **Training data:** [mjbommar/opengloss-v1.1-dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary)

## Citation

If you use this model, please cite the OpenGloss paper:

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, Michael J., II},
  year={2025},
  eprint={2511.18622},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.18622}
}
```

## License

This model is released under the Apache 2.0 license.