---
license: mit
datasets:
- yiyic/oscar_arb_Arab_train
- yiyic/oscar_arb_Arab_test
- SaiedAlshahrani/Arabic_Wikipedia_20230101_bots
- ClusterlabAi/101_billion_arabic_words_dataset
language:
- ar
metrics:
- f1
- exact_match
base_model:
- answerdotai/ModernBERT-base
tags:
- Embedding
- Arabic
- Sentiment_Analysis
- QA
- NER
---
# Model Card: ModernAraBERT
## Summary
- Arabic-language encoder adapted from `answerdotai/ModernBERT-base` via continued pretraining on ~9.8 GB of Arabic corpora.
- Outperforms AraBERTv1/v2, mBERT, and MARBERT on sentiment analysis and NER (Macro-F1) and on extractive QA (exact match on ARCD); see Evaluation below.
- License: MIT · Paper: LREC 2026 · Hub: gizadatateam/ModernAraBERT
## Intended Uses
- Masked language modeling, feature extraction, and transfer learning for Arabic NLP tasks.
- Downstream: sentiment analysis, NER, extractive QA, and general classification/labeling; a feature-extraction sketch follows this list.
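For feature extraction, the encoder's hidden states can be pooled into sentence embeddings. The mean-pooling recipe below is one common convention, not something prescribed by the paper, and the Arabic inputs are purely illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

# Illustrative inputs: "The weather is beautiful today." / "The service was slow."
sentences = ["الجو جميل اليوم", "كانت الخدمة بطيئة"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1)    # (batch, seq_len, 1)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)                         # (2, hidden_dim)
```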
## How to use
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
name = "gizadatateam/ModernAraBERT"
model = AutoModelForMaskedLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
```
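To sanity-check the masked-LM head, a `fill-mask` pipeline can be layered on the objects loaded above; the prompt is an illustrative example, not from the paper.

```python
from transformers import pipeline

# Reuses `model` and `tokenizer` from the snippet above.
fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Illustrative prompt: "The capital of Egypt is [MASK]."
print(fill(f"عاصمة مصر هي {tokenizer.mask_token}"))
```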
## Training data and recipe (brief)
- Corpora: OSIAN, Arabic Billion Words, Arabic Wikipedia, OSCAR Arabic
- Tokenizer: ModernBERT vocabulary extended with 80K Arabic tokens
- Objective: masked language modeling (MLM), 3 epochs, sequence length ramped from 128 to 512 (minimal sketch after this list)
- Hardware: A100 40GB; framework: PyTorch + Transformers + Accelerate
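As a rough illustration of the MLM objective, the standard Transformers collator can mask tokens dynamically. This is a minimal sketch assuming the conventional 15% masking rate, not the authors' training script; the toy texts stand in for the corpora listed above.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling)

name = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Dynamic masking; 15% is the conventional BERT-style rate (an assumption here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

# Toy corpus: "an example Arabic training text" / "a second sentence from the corpus".
texts = ["مثال لنص عربي للتدريب", "جملة ثانية من المدونة"]
features = [tokenizer(t, truncation=True, max_length=128) for t in texts]
batch = collator(features)

model.train()
loss = model(**batch).loss   # cross-entropy over the masked positions only
loss.backward()
```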
## Evaluation (from paper)
### Sentiment Analysis — Macro-F1 (%)
| Model | LABR | HARD | AJGT |
| ----------------- | --------- | --------- | --------- |
| AraBERTv1 | 45.35 | 72.65 | 58.01 |
| AraBERTv2 | 45.79 | 67.10 | 53.59 |
| mBERT | 44.18 | 71.70 | 61.55 |
| MARBERT | 45.54 | 67.39 | 60.63 |
| **ModernAraBERT** | **56.45** | **89.37** | **70.54** |
### NER — Macro-F1 (%)
| Model | Macro-F1 |
| ----------------- | --------- |
| AraBERTv1 | 13.46 |
| AraBERTv2 | 16.77 |
| mBERT | 12.15 |
| MARBERT | 7.42 |
| **ModernAraBERT** | **28.23** |
### QA (ARCD test) — EM (%)
| Model | EM |
| ----------------- | --------- |
| AraBERT | 25.36 |
| AraBERTv2 | 26.08 |
| mBERT | 25.12 |
| MARBERT | 23.58 |
| **ModernAraBERT** | **27.10** |
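The scores above come from fine-tuning the encoder on each task. As an indication of how a sentiment run is wired up, here is a hypothetical single training step; the toy data, label count, and learning rate are illustrative and not the paper's configuration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "gizadatateam/ModernAraBERT"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

# Toy labeled pair: "excellent service" (positive) / "very bad experience" (negative).
texts = ["الخدمة ممتازة", "تجربة سيئة جدا"]
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```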
## Citation
```bibtex
@inproceedings{<paper_id>,
  title={Efficient Adaptation of English Language Models for Low-Resource and Morphologically Rich Languages: The Case of Arabic},
  author={Maher and Eldamaty and Ashraf and ElShawi and Mostafa},
  booktitle={Proceedings of <conference_name>},
  year={2025},
  organization={<conference_name>}
}
```