# Marian MT fine-tuned on the Multilingual Corpus of World’s Constitutions (MCWC)
This model is a fine-tuned version of Helsinki-NLP/opus-mt-ar-en, adapted using high-quality, sentence-aligned constitutional text from the Multilingual Corpus of World’s Constitutions (MCWC).
📄 MCWC paper (OSACT 2024): https://aclanthology.org/2024.osact-1.7/
This variant handles Arabic → English translation.
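A minimal usage sketch with the `transformers` library (the repository id `drelhaj/marian-finetuned-mcwc-ar-to-en` is taken from this card; loading the weights requires network access on first use, and `max_length=512` is an illustrative choice, not a value from the card):

```python
from transformers import pipeline

# Repository id from this model card.
MODEL_ID = "drelhaj/marian-finetuned-mcwc-ar-to-en"

def translate(text: str, max_length: int = 512) -> str:
    """Translate Arabic constitutional/legal text into English."""
    translator = pipeline("translation", model=MODEL_ID)
    return translator(text, max_length=max_length)[0]["translation_text"]

# Usage (downloads the model weights on first call):
#   translate("لكل فرد الحق في الحرية.")
```

For repeated use, build the pipeline once and reuse it rather than reconstructing it per call.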
## Overview
The MCWC provides a curated multilingual collection of constitutional texts from countries across the world. The corpus emphasises data cleanliness, high-quality sentence alignment, and detailed metadata (including country and continent mappings). It supports research in:
- legal and constitutional NLP
- comparative constitutional studies
- multilingual machine translation
- cross-lingual semantic analysis
This model was fine-tuned on the Arabic-English segment of the MCWC, enabling translation that is more attuned to legal and constitutional language than general-purpose MT systems.
## Intended use
This model is suitable for tasks such as:
- translating constitutional or legal documents
- cross-lingual legal text comparison
- multilingual information extraction
- downstream legal NLP tasks requiring domain-specific MT
It is not intended for casual or conversational translation, as it is optimised for formal and legal text.
## Training data
The model was trained on the MCWC’s Arabic-English aligned sentence pairs.
The MCWC dataset includes:
- cleaned constitutional text
- high-quality sentence segmentation
- pairwise alignments
- country and regional metadata
More details may be found in the accompanying paper:
El-Haj, M. & Ezzini, S. (2024). “The Multilingual Corpus of World’s Constitutions (MCWC).”
OSACT @ LREC-COLING 2024.
https://aclanthology.org/2024.osact-1.7/
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- optimizer: AdamWeightDecay (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, weight_decay_rate = 0.01, amsgrad = False)
- learning rate schedule: PolynomialDecay (initial_learning_rate = 5e-05, decay_steps = 384, end_learning_rate = 0.0, power = 1.0, cycle = False)
- training_precision: mixed_float16
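The serialized optimizer configuration above corresponds to the following Keras/`transformers` setup. This is a reconstruction for readability, not the original training script:

```python
import tensorflow as tf
from transformers import AdamWeightDecay  # TF optimizer shipped with transformers

# Linear (power=1.0) decay from 5e-5 to 0 over 384 steps, per the config above.
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=5e-5,
    decay_steps=384,
    end_learning_rate=0.0,
    power=1.0,
)

optimizer = AdamWeightDecay(
    learning_rate=lr_schedule,
    weight_decay_rate=0.01,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-8,
)

# training_precision: mixed_float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```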
### Training results
| Train Loss | Validation Loss | Epoch |
|---|---|---|
| 1.3918 | 1.1473 | 0 |
| 1.0745 | 1.1021 | 1 |
| 0.9486 | 1.0908 | 2 |
## Framework versions
- Transformers 4.33.3
- TensorFlow 2.13.0
- Datasets 2.14.5
- Tokenizers 0.13.3
## Citation
If you use this model, please cite the MCWC paper:
El-Haj, M. & Ezzini, S. (2024).
The Multilingual Corpus of World’s Constitutions (MCWC).
Proceedings of OSACT @ LREC-COLING 2024.
https://aclanthology.org/2024.osact-1.7/
## Base model

This model (`drelhaj/marian-finetuned-mcwc-ar-to-en`) was fine-tuned from `Helsinki-NLP/opus-mt-ar-en`.