---
license: apache-2.0
language:
- bar
datasets:
- bavarian-nlp/barwiki-20250901
- bavarian-nlp/gemini-bavarian-bible
- bavarian-nlp/gemini-bavarian-tagesschau-v0.1
- bavarian-nlp/bavarian-books-ocred-v0.1
- HuggingFaceFW/finepdfs
base_model:
- gerturax/gerturax-3
---

# Baivaria

![Baivaria logo](logo.png)

*Baivaria* is an encoder-only language model for Bavarian that achieves new SOTA results on Named Entity Recognition (NER) and Part-of-Speech (PoS) Tagging.

More detailed information about the model can be found in [this GitHub repo](https://github.com/stefan-it/baivaria).

# 📋 Changelog

* 18.09.2025: Initial version of this model.

# Data Selection

We use the following Bavarian corpora for the pretraining of Baivaria:

* [Bavarian Wikipedia](https://huggingface.co/datasets/bavarian-nlp/barwiki-20250901)
* [Bavarian Bible](https://huggingface.co/datasets/bavarian-nlp/gemini-bavarian-bible)
* [Bavarian Awesome Tagesschau](https://huggingface.co/datasets/bavarian-nlp/gemini-bavarian-tagesschau-v0.1)
* Bavarian Occiglot
* [Bavarian Books](https://huggingface.co/datasets/bavarian-nlp/bavarian-books-ocred-v0.1)
* [Bavarian Finepdfs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs)

The following table shows statistics for all corpora after filtering:

| Corpus Name                 | Quality Measures       | Documents | Sentences | Tokens      | Plaintext Size |
|:--------------------------- |:---------------------- | ---------:| ---------:| -----------:| --------------:|
| Bavarian Wikipedia          | High-quality Wikipedia |    43,627 |   242,245 |   7,001,569 |            21M |
| Bavarian Bible              | Gemini-translated      |     1,189 |    35,156 |   1,346,116 |           3.8M |
| Bavarian Awesome Tagesschau | Gemini-translated      |    10,036 |   335,989 |  10,528,908 |            35M |
| Bavarian Occiglot           | Gemini-translated      |   149,774 | 6,842,935 | 214,697,892 |           834M |
| Bavarian Books              | OCR'ed Books           |     4,361 |    53,656 |   1,147,435 |           3.2M |
| Bavarian Finepdfs           | OCR'ed PDFs            |     1,989 |    73,970 |   2,381,873 |           6.7M |

Overall, the pretraining corpus consists of 210,976 documents, 7,583,951 sentences and 237,103,793 tokens, with a total plaintext size of 903M.

# Pretraining

Pretraining a Bavarian model from scratch would be very inefficient, as there is not enough pretraining data. For Baivaria, we instead follow the main idea of the "[Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964)" paper by Gururangan et al. and perform domain-adaptive pretraining, continuing from a strong German encoder-only model. We use the recently released [GERTuraX-3](https://huggingface.co/gerturax/gerturax-3) as the backbone and continue pretraining on our Bavarian corpus.

Additionally, we perform a small hyper-parameter search and report the micro F1-score on the [BarNER](https://arxiv.org/abs/2403.12749) NER dataset. The best-performing ablation model is used as the final model and is released as *Baivaria* in version 1.

Thanks to the [TRC program](https://sites.research.google/trc/about/), the following ablation models could be pretrained on a v4-32 TPU Pod:

| Hyper-Parameter     | Ablation 1 | Ablation 2 | Ablation 3 | Ablation 4 |
| ------------------- | ----------:| ----------:| ----------:| ----------:|
| `decay_steps`       |     26,638 |     26,638 |     26,638 |     26,638 |
| `end_lr`            |        0.0 |        0.0 |        0.0 |        0.0 |
| `init_lr`           |     0.0003 |     0.0003 |     0.0005 |     0.0003 |
| `train_steps`       |     26,638 |     26,638 |     26,638 |     26,638 |
| `global_batch_size` |       1024 |       1024 |       1024 |       1024 |
| `warmup_steps`      |        266 |          0 |       1598 |       2663 |

Ablation 3 uses the hyper-parameters proposed in the "[Don't Stop Pretraining: Adapt Language Models to Domains and Tasks](https://arxiv.org/abs/2004.10964)" paper.
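To make the schedule behind these hyper-parameters concrete, here is a minimal sketch of one common parameterization: linear warmup from 0 to `init_lr`, followed by a linear decay to `end_lr` over `decay_steps`. This is an illustrative assumption only; the exact schedule implementation in the TensorFlow Model Garden trainer may differ in its details.

```python
def learning_rate(step: int,
                  init_lr: float = 0.0003,
                  end_lr: float = 0.0,
                  warmup_steps: int = 266,
                  decay_steps: int = 26_638) -> float:
    """Sketch of a linear warmup + linear decay schedule.

    The defaults correspond to the Ablation 1 column above.
    """
    if warmup_steps > 0 and step < warmup_steps:
        # Linear warmup from 0 to init_lr.
        return init_lr * (step + 1) / warmup_steps
    # Linear decay (polynomial decay with power 1) from init_lr to end_lr.
    progress = min(step, decay_steps) / decay_steps
    return init_lr + progress * (end_lr - init_lr)


# Peak LR right after warmup, halved at mid-training, end_lr at the last step:
for step in (265, 13_319, 26_638):
    print(step, round(learning_rate(step), 6))
```

Under this parameterization, Ablation 2 (`warmup_steps = 0`) starts directly at the peak learning rate, while Ablation 4 spends roughly 10% of the training steps in warmup.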
Now we report the MLM accuracy and loss during training, as well as the downstream task performance on the [BarNER](https://arxiv.org/abs/2403.12749) dataset. For fine-tuning, we use the last checkpoint and the hyper-parameters as specified in the [GERTuraX Fine-Tuner](https://github.com/stefan-it/gerturax-fine-tuner) repo:

| Metric          |   Ablation 1 |       Ablation 2 |   Ablation 3 |   Ablation 4 |
|:--------------- | ------------:| ----------------:| ------------:| ------------:|
| MLM Accuracy    |        72.24 |            72.17 |        72.99 |        71.61 |
| Train Loss      |       2.9175 |           2.9248 |       2.8785 |       2.9689 |
| BarNER F1-Score | 80.21 ± 0.31 | **80.83** ± 0.28 | 80.59 ± 0.35 | 80.06 ± 0.41 |

# Results

Only a few Bavarian datasets exist for evaluation on downstream tasks. We use the following ones:

* [BarNER](https://arxiv.org/abs/2403.12749) NER dataset
* [MaiBaam](https://arxiv.org/abs/2403.10293) Part-of-Speech Tagging dataset

We use the [GERTuraX Fine-Tuner](https://github.com/stefan-it/gerturax-fine-tuner) repo and its hyper-parameters to fine-tune Baivaria for Bavarian NER and PoS Tagging.

## Overall

In this section, we compare the results of Baivaria with the current state-of-the-art results reported in the corresponding papers.

For NER:

| Model                                                       | F1-Score (Final test dataset) |
| ----------------------------------------------------------- | -----------------------------:|
| GBERT Large from [BarNER](https://arxiv.org/abs/2403.12749) |                  72.17 ± 1.75 |
| Baivaria v1                                                 |              **75.70** ± 0.97 |

For PoS Tagging:

| Model                                                        | Accuracy (Final test dataset) | F1-Score (Final test dataset) |
| ------------------------------------------------------------ | -----------------------------:| -----------------------------:|
| GBERT Large from [MaiBaam](https://arxiv.org/abs/2403.10293) |                         80.29 |                         62.45 |
| Baivaria v1                                                  |              **90.28** ± 0.16 |              **73.65** ± 0.91 |

# ❤️ Acknowledgements

Baivaria is the outcome of working with TPUs from the awesome [TRC program](https://sites.research.google/trc/about/) and the [TensorFlow Model Garden](https://github.com/tensorflow/models) library. Many thanks for providing TPUs!

Made in the Bavarian Oberland with ❤️ and 🥨.
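# Usage Example

A minimal quick-start sketch for loading the model: the repo id `bavarian-nlp/baivaria` below is a placeholder assumption (replace it with the actual model id of this card), and the snippet assumes the checkpoint loads as a standard BERT-style masked language model in 🤗 Transformers:

```python
from transformers import pipeline

# Placeholder repo id -- substitute the actual model id of this card.
model_id = "bavarian-nlp/baivaria"

# Baivaria is an encoder-only masked LM, so fill-mask is a quick sanity check.
fill_mask = pipeline("fill-mask", model=model_id)

# Use the tokenizer's own mask token ([MASK] for BERT-style models).
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"I mog {mask} Brezn."):
    print(f"{prediction['token_str']:>15}  {prediction['score']:.3f}")
```

For NER or PoS Tagging, the model has to be fine-tuned with a token-classification head, for example via the [GERTuraX Fine-Tuner](https://github.com/stefan-it/gerturax-fine-tuner) repo mentioned above.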