scandmmBERT: A ModernBERT Specialized for Scandinavian Languages

Model Description

scandmmBERT is a masked language model based on jhu-clsp/mmBERT-base that has undergone continued pre-training on a large corpus of Scandinavian languages (Swedish, Danish, Norwegian, and Icelandic).

The original mmBERT is a powerful multilingual model trained on text in over 1,800 languages. This version specializes that broad knowledge through continued pre-training on a large volume of high-quality, in-domain text, making it a strong expert model for Scandinavian NLU tasks.

This project was developed as a hands-on exploration of large-scale model training on high-performance computing resources. The full development and troubleshooting process is detailed in the corresponding GitHub repository: https://github.com/joenaess/scandmmBERT.

Intended Uses & Limitations

This model is intended to be used as a base for fine-tuning on specific downstream tasks; a minimal fine-tuning sketch follows the list below. It is particularly well-suited for:

  • Text Classification (e.g., sentiment analysis, topic classification)
  • Named Entity Recognition (NER)
  • Question Answering
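As an illustration of the first use case, here is a minimal text-classification fine-tuning sketch. The CSV file, column names, label count, and training settings are placeholders rather than values from this project; adapt them to your own task.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "jonasaise/scandmmBERT-base-scandinavian"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical dataset with "text" and "label" columns; swap in your own data.
dataset = load_dataset("csv", data_files={"train": "train.csv"})["train"]
dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="scandmmBERT-finetuned", num_train_epochs=3),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()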

Limitations

  • This is a masked language model and is not suitable for text generation.
  • The model has not been fine-tuned for any specific task and should be adapted to your use case.
  • The model inherits potential biases and stereotypes present in the web-crawled training data (HPLT 2.0).

How to Use

You can use this model directly with the fill-mask pipeline for masked word prediction.

from transformers import pipeline

model_id = "jonasaise/scandmmBERT-base-scandinavian"
unmasker = pipeline('fill-mask', model=model_id)

# Swedish
result_sv = unmasker("Sveriges huvudstad heter <mask>.")
print([r['token_str'] for r in result_sv])

# Danish
result_da = unmasker("Dronningen af Danmark hedder <mask>.")
print([r['token_str'] for r in result_da])
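The same prediction can also be made below the pipeline level. This is a minimal sketch assuming the repository id above; building the input from tokenizer.mask_token avoids hardcoding the mask string.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "jonasaise/scandmmBERT-base-scandinavian"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build the input with the tokenizer's own mask token.
text = f"Sveriges huvudstad heter {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the five most likely fillers.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos[0]].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))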

Training Procedure

Pre-training Data

The model was trained on a combined and interleaved stream of the following language subsets from the HPLT/HPLT2.0_cleaned dataset:

  • Icelandic (isl_Latn)
  • Norwegian Nynorsk (nno_Latn)
  • Swedish (swe_Latn)
  • Danish (dan_Latn)
  • Norwegian Bokmål (nob_Latn)

Due to storage constraints, the smaller Icelandic and Nynorsk subsets were used in their entirety, while the larger Swedish, Danish, and Bokmål subsets were downsampled.
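For illustration, the interleaved stream could be assembled as in the sketch below. This is a minimal sketch assuming streaming access to the listed configs of HPLT/HPLT2.0_cleaned; the mixture probabilities are hypothetical, not the run's actual sampling ratios.

from datasets import interleave_datasets, load_dataset

langs = ["isl_Latn", "nno_Latn", "swe_Latn", "dan_Latn", "nob_Latn"]
streams = [
    load_dataset("HPLT/HPLT2.0_cleaned", lang, split="train", streaming=True)
    for lang in langs
]
# Hypothetical mixture weights: upweight the small Icelandic and Nynorsk
# subsets relative to their natural share, downsample the large ones.
probs = [0.10, 0.10, 0.30, 0.25, 0.25]
mixed = interleave_datasets(streams, probabilities=probs, seed=42)
print(next(iter(mixed))["text"][:200])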

Hyperparameters

The continued pre-training was performed for 50,000 steps using the following configuration:

Hyperparameter                 Value
learning_rate                  2e-5
per_device_train_batch_size    2
gradient_accumulation_steps    16
Effective Batch Size           64 (2 per device × 2 GPUs × 16 accumulation steps)
max_steps                      50,000
optimizer                      AdamW
precision                      bf16
max_seq_length                 512

The training was performed on a server with 2x NVIDIA L4 GPUs (24GB VRAM each) using PyTorch, Hugging Face transformers, and accelerate. The environment was managed with pixi.
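As a rough sketch, the configuration above maps onto the standard Trainer API as follows. The masking probability, logging, and checkpointing values are assumptions not stated in this card, and mixed refers to the interleaved stream sketched in the previous section.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("jhu-clsp/mmBERT-base")
model = AutoModelForMaskedLM.from_pretrained("jhu-clsp/mmBERT-base")

# Tokenize the interleaved stream (`mixed`) from the data sketch above.
train_stream = mixed.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
).select_columns(["input_ids", "attention_mask"])

args = TrainingArguments(
    output_dir="scandmmBERT-base-scandinavian",
    learning_rate=2e-5,
    per_device_train_batch_size=2,  # x 2 GPUs x 16 accumulation steps = 64 effective
    gradient_accumulation_steps=16,
    max_steps=50_000,
    optim="adamw_torch",
    bf16=True,
    logging_steps=100,  # assumed; not specified above
    save_steps=5_000,   # assumed; not specified above
)

# Standard MLM collator; 0.15 is the transformers default. The run's actual
# masking rate is not stated in this card.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(model=model, args=args, train_dataset=train_stream,
                  data_collator=collator)
trainer.train()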

Evaluation

A simple qualitative evaluation with the fill-mask pipeline shows that the model predicts contextually appropriate words in the languages tested below.

Swedish 🇸🇪

  • Input: Sveriges huvudstad heter <mask>.
  • Top Prediction: Stockholm

Danish 🇩🇰

  • Input: Dronningen af Danmark hedder <mask>.
  • Top Prediction: Margrethe

Norwegian 🇳🇴

  • Input: Norges mest berømte maler er Edvard <mask>.
  • Top Prediction: Munch

Citation

If you use this model in your work, please also cite the original mmBERT and HPLT sources. This model can be cited as:

@misc{scandmmbert2025,
  author    = {Jonas Lind},
  title     = {scandmmBERT: A ModernBERT Specialized for Scandinavian Languages},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/jonasaise/scandmmBERT-base-scandinavian}}
}