scandmmBERT: A ModernBERT Specialized for Scandinavian Languages
Model Description
scandmmBERT is a masked language model based on jhu-clsp/mmBERT-base that has undergone continued pre-training on a large corpus of Scandinavian languages (Swedish, Danish, Norwegian, and Icelandic).
The original mmBERT is a powerful multilingual model trained on over 1,800 languages. This version specializes that broad coverage through continued exposure to a large volume of high-quality, in-domain text, making it a strong base model for Scandinavian NLU tasks.
This project was developed as a hands-on exploration of large-scale model training on high-performance computing resources. The full development and troubleshooting process is detailed in the corresponding GitHub repository: https://github.com/joenaess/scandmmBERT.
Intended Uses & Limitations
This model is intended to be used as a base for fine-tuning on specific downstream tasks. It is particularly well-suited for:
- Text Classification (e.g., sentiment analysis, topic classification)
- Named Entity Recognition (NER)
- Question Answering
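As a concrete starting point, here is a minimal sketch of loading the checkpoint for a classification task; the model id placeholder matches the usage example below, and the label count and example sentence are purely illustrative.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Replace YOUR_USERNAME with your actual Hugging Face username.
model_id = "YOUR_USERNAME/scandmmBERT-base-scandinavian"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 is an illustrative choice for a binary sentiment task;
# the classification head is randomly initialized and must be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Filmen var helt fantastisk!", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]) -- untrained head, so the scores are not meaningful yet
```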
Limitations
- This is a masked language model and is not suitable for text generation.
- The model has not been fine-tuned for any specific task and should be adapted to your use case.
- The model inherits potential biases and stereotypes present in the web-crawled training data (HPLT 2.0).
How to Use
You can use this model directly with the fill-mask pipeline for masked word prediction.
```python
from transformers import pipeline

# Replace YOUR_USERNAME with your actual Hugging Face username
model_id = "YOUR_USERNAME/scandmmBERT-base-scandinavian"
unmasker = pipeline('fill-mask', model=model_id)

# Swedish
result_sv = unmasker("Sveriges huvudstad heter <mask>.")
print([r['token_str'] for r in result_sv])

# Danish
result_da = unmasker("Dronningen af Danmark hedder <mask>.")
print([r['token_str'] for r in result_da])
```
Training Procedure
Pre-training Data
The model was trained on a combined and interleaved stream of the following language subsets from the HPLT/HPLT2.0_cleaned dataset:
- Icelandic (`isl_Latn`)
- Norwegian Nynorsk (`nno_Latn`)
- Swedish (`swe_Latn`)
- Danish (`dan_Latn`)
- Norwegian Bokmål (`nob_Latn`)
Due to storage constraints, the smaller Icelandic and Nynorsk datasets were used in their entirety, while the larger Swedish, Danish, and Bokmål datasets were sampled.
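The exact sampling weights are not reproduced here, but the sketch below shows one way such an interleaved streaming mixture can be built with the `datasets` library; the probabilities and seed are placeholders, not the values used for training.

```python
from datasets import load_dataset, interleave_datasets

configs = ["isl_Latn", "nno_Latn", "swe_Latn", "dan_Latn", "nob_Latn"]

# Stream each language subset so nothing has to be fully downloaded up front.
streams = [
    load_dataset("HPLT/HPLT2.0_cleaned", cfg, split="train", streaming=True)
    for cfg in configs
]

# Placeholder probabilities: small languages kept whole, large ones down-sampled.
mixed = interleave_datasets(streams, probabilities=[0.1, 0.1, 0.3, 0.25, 0.25], seed=42)

print(next(iter(mixed))["text"][:200])  # assumes the HPLT schema exposes a `text` field
```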
Hyperparameters
The continued pre-training was performed for 50,000 steps using the following configuration:
| Hyperparameter | Value |
|---|---|
| `learning_rate` | 2e-5 |
| `per_device_train_batch_size` | 2 |
| `gradient_accumulation_steps` | 16 |
| Effective Batch Size | 64 |
| `max_steps` | 50,000 |
| `optimizer` | AdamW |
| `precision` | bf16 |
| `max_seq_length` | 512 |
The training was performed on a server with 2x NVIDIA L4 GPUs (24GB VRAM each) using PyTorch, Hugging Face transformers, and accelerate. The environment was managed with pixi.
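The sketch below shows roughly how these settings map onto a Hugging Face Trainer run, continuing from the `mixed` stream built in the data sketch above; the masking rate, logging/save intervals, and output directory are assumptions, and this is not the exact training script.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")


def tokenize(batch):
    # Truncate to the 512-token context listed in the hyperparameter table.
    return tokenizer(batch["text"], truncation=True, max_length=512)


# `mixed` is the interleaved HPLT stream from the data sketch above.
tokenized = mixed.select_columns(["text"]).map(
    tokenize, batched=True, remove_columns=["text"]
)

# Standard MLM collator; the 15% masking rate is an assumption.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="scandmmBERT-base-scandinavian",  # illustrative
    learning_rate=2e-5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    max_steps=50_000,
    bf16=True,
    logging_steps=500,   # illustrative
    save_steps=5_000,    # illustrative
)

# Trainer uses AdamW by default, matching the table above.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```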
Evaluation
A simple qualitative check using the fill-mask pipeline shows that the model predicts contextually relevant words in Swedish, Danish, and Norwegian.
Swedish 🇸🇪
- Input: `Sveriges huvudstad heter <mask>.`
- Top Prediction: `Stockholm`

Danish 🇩🇰
- Input: `Dronningen af Danmark hedder <mask>.`
- Top Prediction: `Margrethe`

Norwegian 🇳🇴
- Input: `Norges mest berømte maler er Edvard <mask>.`
- Top Prediction: `Munch`
Citation
If you use this model in your work, please also cite the original mmBERT and HPLT sources. This model itself can be cited as:
```bibtex
@misc{scandmmbert2025,
  author = {Jonas Lind},
  title = {scandmmBERT: A ModernBERT Specialized for Scandinavian Languages},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/YOUR_USERNAME/scandmmBERT-base-scandinavian}}
}
```