---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- modernbert
- masked-language-model
- dictionary
- encyclopedia
- glossary
- embeddings
- fill-mask
datasets:
- mjbommar/ogbert-v1-mlm
- mjbommar/opengloss-v1.1-dictionary
base_model: answerdotai/ModernBERT-base
pipeline_tag: fill-mask
model-index:
- name: ogbert-v1-mlm
  results:
  - task:
      type: text-classification
      name: Clustering
    dataset:
      type: mjbommar/ogbert-v1-mlm
      name: OGBert v1 MLM Eval
    metrics:
    - type: adjusted_rand_index
      value: 0.7302
      name: ARI
  - task:
      type: retrieval
      name: Definition Retrieval
    dataset:
      type: mjbommar/ogbert-v1-mlm
      name: OGBert v1 MLM Eval
    metrics:
    - type: mrr
      value: 0.9596
      name: MRR
---

# OGBert v1 MLM

**OGBert** (OpenGloss BERT) is a ModernBERT-based masked language model pretrained on the OpenGloss synthetic encyclopedic dictionary. Despite being trained on a relatively small corpus of **~160M words** (435K dictionary entries), the model achieves strong performance on definition understanding and domain-specific terminology.

The training corpus contains definitions across 16 domains (geography, mathematics, science, law, technology, philosophy, etc.) and 11 reading levels (kindergarten through PhD).

## Model Description

- **Model type:** ModernBERT for Masked Language Modeling
- **Language:** English
- **License:** Apache 2.0
- **Parameters:** ~38M
- **Context length:** 1024 tokens
- **Training data:** [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
- **Built with:** Transformers v5.0
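
The figures above can be sanity-checked directly from the published checkpoint. A minimal sketch (the exact total depends on whether the embeddings and MLM head are counted):

```python
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModelForMaskedLM.from_pretrained("mjbommar/ogbert-v1-mlm")

# Total parameter count (expected to be roughly 38M).
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.1f}M")

# Maximum sequence length supported by the position embeddings (1024).
print(f"context length: {config.max_position_embeddings}")
```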

## Architecture

| Parameter | Value |
|-----------|-------|
| Hidden size | 384 |
| Intermediate size | 1536 |
| Number of layers | 10 |
| Attention heads | 6 |
| Max position embeddings | 1024 |
| Vocabulary size | 32,769 |
| Attention pattern | Full + sliding window (128-token local window) |

The model uses ModernBERT's hybrid attention pattern, with full attention every 3 layers and sliding-window attention in between, enabling efficient processing of long sequences.
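
The alternating pattern is recorded in the checkpoint's configuration. A minimal sketch for inspecting it (the attribute names below are those used by ModernBERT's config class, so treat them as an assumption if the config schema changes):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("mjbommar/ogbert-v1-mlm")

# ModernBERT configs expose the hybrid attention schedule directly:
# one full-attention layer every N layers, sliding-window attention otherwise.
print(config.global_attn_every_n_layers)  # expected: 3
print(config.local_attention)             # expected: 128 (local window size)
print(config.num_hidden_layers)           # expected: 10
```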

## Intended Uses

### Primary Use Cases

- **Fill-mask tasks:** Predicting masked tokens in dictionary/definition text
- **Feature extraction:** Generating embeddings for downstream tasks
- **Fine-tuning base:** Starting point for domain-specific models

### Domain Strengths

The model shows strong performance on:
- **Geography** (0.44 loss) - Place names and geographic terminology
- **Mathematics** (0.56 loss) - Mathematical and symbolic language
- **Society** (0.60 loss) - Social science terminology
- **Science** (0.63 loss) - Natural science terminology

## How to Use

### Fill-Mask Pipeline

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mjbommar/ogbert-v1-mlm")
result = fill_mask("A molecule is the smallest <|mask|> of a chemical compound.")
print(result)
```

**Example outputs:**

| Input | Top Predictions |
|-------|-----------------|
| "A triangle is a <\|mask\|> with three sides." | triangle (0.74), polygon (0.11), plane (0.04) |
| "A molecule is the smallest <\|mask\|> of a chemical compound." | unit (0.65), part (0.11), component (0.05) |
| "Democracy is a system of <\|mask\|> in which citizens exercise power." | government (0.39), governance (0.14), democracy (0.07) |
| "Photosynthesis is the process by which plants convert <\|mask\|> into energy." | energy (0.30), nutrients (0.19), light (0.10) |

### Feature Extraction

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModel.from_pretrained("mjbommar/ogbert-v1-mlm")

text = "Photosynthesis is the process by which plants convert light into energy."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling

print(embeddings.shape)  # torch.Size([1, 384])
```
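
Mean-pooled embeddings can be compared with cosine similarity, which is the basis of the retrieval and clustering results reported below. A minimal sketch, assuming the same mean pooling as above and using hypothetical query/candidate strings:

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModel.from_pretrained("mjbommar/ogbert-v1-mlm")

def embed(texts):
    # Mean-pool the final hidden states, ignoring padding tokens.
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical query and candidate definitions for illustration.
query = "the smallest unit of a chemical compound"
candidates = [
    "A molecule is a group of atoms bonded together.",
    "A peninsula is a piece of land surrounded by water on three sides.",
]
scores = F.cosine_similarity(embed([query]), embed(candidates))
print(scores)  # the chemistry definition should score higher
```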

### Masked Language Modeling

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModelForMaskedLM.from_pretrained("mjbommar/ogbert-v1-mlm")

text = "A molecule is the smallest <|mask|> of a chemical compound."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
    predictions = outputs.logits[0, mask_idx].softmax(dim=-1)
    top_tokens = predictions.topk(5)

for score, idx in zip(top_tokens.values[0], top_tokens.indices[0]):
    print(f"{tokenizer.decode(idx)}: {score:.4f}")
# Output:
# unit: 0.6509
# part: 0.1069
# component: 0.0541
# form: 0.0294
# portion: 0.0243
```
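
The snippets above hard-code the `<|mask|>` literal; a small sketch that builds the prompt from `tokenizer.mask_token` instead, so the prompt stays correct without remembering the exact mask string:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")

# Construct the prompt from the tokenizer's own mask token rather than a literal string.
prompt = f"A triangle is a {tokenizer.mask_token} with three sides."
print(prompt)
```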

## Training Details

### Training Data

- **Dataset:** [mjbommar/ogbert-v1-mlm](https://huggingface.co/datasets/mjbommar/ogbert-v1-mlm)
- **Source:** OpenGloss v1.1 Dictionary
- **Domains:** 16 (Geography, Mathematics, Science, Law, Technology, Philosophy, etc.)
- **Reading levels:** 11 (Kindergarten through PhD)

### Training Procedure

| Parameter | Value |
|-----------|-------|
| Training steps | 5,000 |
| Total tokens | 8,402,890,141 (8.40B) |
| Epochs | 34.99 |
| MLM probability | 25% |
| Per-device batch size | 84 |
| Gradient accumulation | 32 |
| Global batch size | 2,688 |
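
For reference, the 25% masking rate corresponds to the standard dynamic-masking collator in Transformers. A minimal sketch (the actual training script is not included in this card, so this is illustrative only):

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")

# Dynamic masking: 25% of tokens are selected for the MLM objective each step.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.25,
)
```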

### Hyperparameters

| Parameter | Value |
|-----------|-------|
| Peak learning rate | 5e-4 |
| Final learning rate | 0.0 (linear decay) |
| Weight decay | 0.01 |
| Warmup steps | 500 |
| LR schedule | Linear warmup + linear decay |
| Optimizer | AdamW |
| Precision | bf16 |
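
Likewise, the schedule above maps onto ordinary `TrainingArguments`. A hypothetical, illustrative configuration rather than the exact one used for this run:

```python
from transformers import TrainingArguments

# Hypothetical arguments mirroring the hyperparameter table above.
args = TrainingArguments(
    output_dir="ogbert-v1-mlm",
    max_steps=5_000,
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_steps=500,
    weight_decay=0.01,
    per_device_train_batch_size=84,
    gradient_accumulation_steps=32,
    bf16=True,
)
```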

### Training Infrastructure

- **Framework:** Transformers + Accelerate
- **Hardware:** Single GPU

## Final Training Metrics

*From step 5000 (final checkpoint):*

| Metric | Value |
|--------|-------|
| Train loss | 0.6334 |
| Eval loss | 0.6685 |
| Eval perplexity | 1.951 |
| Gradient norm | 0.300 |
| Loss (100-step avg) | 0.655 |
| Loss (1000-step avg) | 0.667 |

### Loss Stability (Final 1000 Steps)

| Metric | Value |
|--------|-------|
| Mean | 0.6673 |
| Std | 0.0257 |
| 5th percentile | 0.6266 |
| 95th percentile | 0.7100 |

## Evaluation Results

### Clustering Performance

| Metric | Value |
|--------|-------|
| Adjusted Rand Index (ARI) | 0.7302 |
| Cluster Accuracy | 0.8000 |
| Silhouette Score | 0.2547 |
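
Clustering metrics of this kind can be computed with scikit-learn from definition embeddings and their domain labels. A minimal sketch with placeholder data (the exact evaluation protocol is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# embeddings: (n_definitions, 384) array from the feature-extraction snippet above.
# labels: integer domain label per definition. Both are placeholders here.
embeddings = np.random.rand(200, 384)
labels = np.random.randint(0, 16, size=200)

pred = KMeans(n_clusters=16, n_init=10, random_state=0).fit_predict(embeddings)
print("ARI:", adjusted_rand_score(labels, pred))
print("Silhouette:", silhouette_score(embeddings, pred))
```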

### Retrieval Performance

| Metric | Value |
|--------|-------|
| Mean Reciprocal Rank (MRR) | 0.9596 |
| Mean Average Precision (MAP) | 0.8183 |
| Precision@1 | 0.9375 |
| Precision@3 | 0.9083 |
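
MRR here is the usual mean of 1/rank of the first relevant result over all queries. A minimal sketch over cosine-similarity rankings, with the embedding arrays left as hypothetical inputs:

```python
import numpy as np

def mean_reciprocal_rank(query_embs, corpus_embs, relevant_idx):
    """query_embs: (q, d); corpus_embs: (n, d);
    relevant_idx[i] = index of the correct corpus entry for query i."""
    # Cosine similarity via normalized dot products.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = q @ c.T
    # 1-based rank of the relevant document for each query.
    order = np.argsort(-sims, axis=1)
    ranks = [np.where(order[i] == relevant_idx[i])[0][0] + 1 for i in range(len(order))]
    return float(np.mean([1.0 / r for r in ranks]))
```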

### Word Similarity (SimLex-999)

| Metric | Value |
|--------|-------|
| Pearson correlation | 0.2911 |
| Spearman correlation | 0.2829 |
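
These scores are the usual correlations between model cosine similarities and SimLex-999 human ratings. A minimal sketch with placeholder values standing in for the real word pairs:

```python
from scipy.stats import pearsonr, spearmanr

# model_sims[i]: cosine similarity between the embeddings of word pair i.
# human_sims[i]: the corresponding SimLex-999 rating. Placeholder values below.
model_sims = [0.41, 0.12, 0.77, 0.30]
human_sims = [6.3, 1.8, 8.9, 4.1]

print("Pearson:", pearsonr(model_sims, human_sims)[0])
print("Spearman:", spearmanr(model_sims, human_sims)[0])
```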

### Training Convergence

| Metric | Value |
|--------|-------|
| Initial loss | 10.46 |
| Final train loss | 0.6334 |
| Final eval loss | 0.6685 |
| Loss reduction | 94% |
| Final perplexity | 1.95 |

## Limitations

1. **Word similarity:** The model achieves relatively low word similarity scores (SimLex 0.29). MLM pretraining optimizes for categorical boundaries rather than fine-grained pairwise similarity. For tasks requiring such similarity, consider contrastive fine-tuning (see the sketch after this list).

2. **Domain coverage:** Performance varies by domain. Arts and history show higher loss (0.77-0.84) than geography and mathematics (0.44-0.56).

3. **English only:** The model is trained exclusively on English text.
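
A bare-bones sketch of what such contrastive fine-tuning could look like, using in-batch negatives over hypothetical (term, definition) pairs; this is one reasonable recipe, not a procedure published with this model:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("mjbommar/ogbert-v1-mlm")
model = AutoModel.from_pretrained("mjbommar/ogbert-v1-mlm")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def embed(texts):
    # Mean-pool final hidden states over non-padding tokens (gradients enabled).
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state
    mask = batch.attention_mask.unsqueeze(-1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical (term, definition) pairs; in practice these would come from the dictionary.
terms = ["molecule", "triangle", "democracy"]
definitions = [
    "The smallest unit of a chemical compound.",
    "A polygon with three sides.",
    "A system of government in which citizens exercise power.",
]

model.train()
for _ in range(3):  # a few illustrative steps
    anchors, positives = embed(terms), embed(definitions)
    # In-batch negatives: each term should match its own definition, not the others.
    logits = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1) / 0.05
    loss = F.cross_entropy(logits, torch.arange(len(terms)))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```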

## Related Models

- **Base architecture:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- **Training data:** [mjbommar/opengloss-v1.1-dictionary](https://huggingface.co/datasets/mjbommar/opengloss-v1.1-dictionary)

## Citation

If you use this model, please cite the OpenGloss paper:

```bibtex
@misc{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito, Michael J., II},
  year={2025},
  eprint={2511.18622},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2511.18622}
}
```

## License

This model is released under the Apache 2.0 license.