---
license: apache-2.0
language:
- bar
datasets:
- bavarian-nlp/barwiki-20250901
- bavarian-nlp/gemini-bavarian-bible
- bavarian-nlp/gemini-bavarian-tagesschau-v0.1
- bavarian-nlp/bavarian-books-ocred-v0.1
- HuggingFaceFW/finepdfs
base_model:
- gerturax/gerturax-3
---

# Baivaria

![Baivaria logo](logo.png)

*Baivaria* is an encoder-only language model for Bavarian that achieves new SOTA results on Named Entity Recognition (NER) and Part-of-Speech (PoS) Tagging.

More detailed information about the model can be found in [this GitHub repo](https://github.com/stefan-it/baivaria).

# 📋 Changelog

* 18.09.2025: Initial version of this model.

# Data Selection

We use the following Bavarian corpora for the pretraining of Baivaria:

* [Bavarian Wikipedia](https://huggingface.co/datasets/bavarian-nlp/barwiki-20250901)
* [Bavarian Bible](https://huggingface.co/datasets/bavarian-nlp/gemini-bavarian-bible)
* [Bavarian Awesome Tagesschau](https://huggingface.co/datasets/bavarian-nlp/gemini-bavarian-tagesschau-v0.1)
* Bavarian Occiglot
* [Bavarian Books](https://huggingface.co/datasets/bavarian-nlp/bavarian-books-ocred-v0.1)
* [Bavarian Finepdfs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs)

The following table shows statistics for all corpora after filtering:

| Corpus Name                 | Quality Measures       | Documents | Sentences | Tokens      | Plaintext Size |
|:--------------------------- |:---------------------- | ---------:| ---------:| -----------:| --------------:|
| Bavarian Wikipedia          | High-quality Wikipedia |    43,627 |   242,245 |   7,001,569 |            21M |
| Bavarian Bible              | Gemini-translated      |     1,189 |    35,156 |   1,346,116 |           3.8M |
| Bavarian Awesome Tagesschau | Gemini-translated      |    10,036 |   335,989 |  10,528,908 |            35M |
| Bavarian Occiglot           | Gemini-translated      |   149,774 | 6,842,935 | 214,697,892 |           834M |
| Bavarian Books              | OCR'ed Books           |     4,361 |    53,656 |   1,147,435 |           3.2M |
| Bavarian Finepdfs           | OCR'ed PDFs            |     1,989 |    73,970 |   2,381,873 |           6.7M |

Overall, the pretraining corpus consists of 210,976 documents, 7,583,951 sentences and 237,103,793 tokens, with a total plaintext size of 903M.

# Pretraining

Pretraining a Bavarian model from scratch would be very inefficient, as there is not enough pretraining data. For Baivaria, we instead follow the main idea of the "[Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964)" paper by Gururangan et al. and perform domain-adaptive pretraining, continuing from a strong German encoder-only model. We use the recently released [GERTuraX-3](https://huggingface.co/gerturax/gerturax-3) as the backbone and continue pretraining on our Bavarian corpus.

Additionally, we perform a small hyper-parameter search and report the micro F1-score on the [BarNER](https://arxiv.org/abs/2403.12749) NER dataset. The best-performing ablation model is used as the final model and is released as *Baivaria* in version 1.

Thanks to the [TRC program](https://sites.research.google/trc/about/), the following ablation models could be pretrained on a v4-32 TPU Pod:

| Hyper-Parameter     | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
| ------------------- | ----------:| ----------:| ----------:| ----------:|
| `decay_steps`       |     26,638 |     26,638 |     26,638 |     26,638 |
| `end_lr`            |        0.0 |        0.0 |        0.0 |        0.0 |
| `init_lr`           |     0.0003 |     0.0003 |     0.0005 |     0.0003 |
| `train_steps`       |     26,638 |     26,638 |     26,638 |     26,638 |
| `global_batch_size` |       1024 |       1024 |       1024 |       1024 |
| `warmup_steps`      |        266 |          0 |       1598 |       2663 |

Ablation 3 uses the hyper-parameters proposed in the "[Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964)" paper.
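To make the schedule behind these hyper-parameters concrete, here is a minimal sketch of one common parameterization: linear warmup from 0 to `init_lr`, followed by a linear decay to `end_lr` over `decay_steps`. This is an illustrative assumption only; the exact schedule implementation in the TensorFlow Model Garden trainer may differ in its details.

```python
def learning_rate(step: int,
                  init_lr: float = 0.0003,
                  end_lr: float = 0.0,
                  warmup_steps: int = 266,
                  decay_steps: int = 26_638) -> float:
    """Sketch of a linear warmup + linear decay schedule.

    The defaults correspond to the Ablation 1 column above.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from 0 to init_lr.
        return init_lr * (step + 1) / warmup_steps
    # Linear decay (polynomial decay with power 1) from init_lr to end_lr.
    progress = min(step, decay_steps) / decay_steps
    return init_lr + progress * (end_lr - init_lr)


# Peak LR right after warmup, halved at mid-training, end_lr at the last step:
for step in (265, 13_319, 26_638):
    print(step, round(learning_rate(step), 6))
```

Under this parameterization, Ablation 2 (`warmup_steps = 0`) starts directly at the peak learning rate, while Ablation 4 spends roughly 10% of the training steps in warmup.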
Now we report the MLM accuracy and loss during training, as well as the downstream task performance on the [BarNER](https://arxiv.org/abs/2403.12749) dataset. For fine-tuning, we use the last checkpoint and the hyper-parameters as specified in the [GERTuraX Fine-Tuner](https://github.com/stefan-it/gerturax-fine-tuner) repo:

| Metric          |   Ablation 1 |       Ablation 2 |   Ablation 3 |   Ablation 4 |
|:--------------- | ------------:| ----------------:| ------------:| ------------:|
| MLM Accuracy    |        72.24 |            72.17 |        72.99 |        71.61 |
| Train Loss      |       2.9175 |           2.9248 |       2.8785 |       2.9689 |
| BarNER F1-Score | 80.21 ± 0.31 | **80.83** ± 0.28 | 80.59 ± 0.35 | 80.06 ± 0.41 |

# Results

Only a few Bavarian datasets exist for evaluation on downstream tasks. We use the following ones:

* [BarNER](https://arxiv.org/abs/2403.12749) NER dataset
* [MaiBaam](https://arxiv.org/abs/2403.10293) Part-of-Speech Tagging dataset

We use the [GERTuraX Fine-Tuner](https://github.com/stefan-it/gerturax-fine-tuner) repo and its hyper-parameters to fine-tune Baivaria for Bavarian NER and PoS Tagging.

## Overall

In this section, we compare the results of Baivaria with the current state-of-the-art results reported in the corresponding papers.

For NER:

| Model                                                       | F1-Score (Final test dataset) |
| ----------------------------------------------------------- | -----------------------------:|
| GBERT Large from [BarNER](https://arxiv.org/abs/2403.12749) |                  72.17 ± 1.75 |
| Baivaria v1                                                 |              **75.70** ± 0.97 |

For PoS Tagging:

| Model                                                        | Accuracy (Final test dataset) | F1-Score (Final test dataset) |
| ------------------------------------------------------------ | -----------------------------:| -----------------------------:|
| GBERT Large from [MaiBaam](https://arxiv.org/abs/2403.10293) |                         80.29 |                         62.45 |
| Baivaria v1                                                  |              **90.28** ± 0.16 |              **73.65** ± 0.91 |

# ❤️ Acknowledgements

Baivaria is the outcome of working with TPUs from the awesome [TRC program](https://sites.research.google/trc/about/) and the [TensorFlow Model Garden](https://github.com/tensorflow/models) library. Many thanks for providing TPUs!

Made in the Bavarian Oberland with ❤️ and 🥨.
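# Usage Example

A minimal quick-start sketch for loading the model: the repo id `bavarian-nlp/baivaria` below is a placeholder assumption (replace it with the actual model id of this card), and the snippet assumes the checkpoint loads as a standard BERT-style masked language model in 🤗 Transformers:

```python
from transformers import pipeline

# Placeholder repo id -- substitute the actual model id of this card.
model_id = "bavarian-nlp/baivaria"

# Baivaria is an encoder-only masked LM, so fill-mask is a quick sanity check.
fill_mask = pipeline("fill-mask", model=model_id)

# Use the tokenizer's own mask token ([MASK] for BERT-style models).
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"I mog {mask} Brezn."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```

For NER or PoS Tagging, the model has to be fine-tuned with a token-classification head, for example via the [GERTuraX Fine-Tuner](https://github.com/stefan-it/gerturax-fine-tuner) repo mentioned above.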