Marian MT fine-tuned on the Multilingual Corpus of World’s Constitutions (MCWC)

This model is a fine-tuned version of Helsinki-NLP/opus-mt-ar-en, adapted using high-quality sentence-aligned constitutional text from the Multilingual Corpus of World’s Constitutions (MCWC).
📄 MCWC paper (OSACT 2024): https://aclanthology.org/2024.osact-1.7/

This variant handles: Arabic → English translation.


Overview

The MCWC provides a curated multilingual collection of constitutional texts from countries across the world. The corpus emphasises data cleanliness, high-quality sentence alignment, and detailed metadata (including country and continent mappings). It supports research in:

  • legal and constitutional NLP
  • comparative constitutional studies
  • multilingual machine translation
  • cross-lingual semantic analysis

This model was fine-tuned on the Arabic-English segment of the MCWC, enabling translation that is more attuned to legal and constitutional language than general-purpose MT systems.


Intended use

This model is suitable for tasks such as:

  • translating constitutional or legal documents
  • cross-lingual legal text comparison
  • multilingual information extraction
  • downstream legal NLP tasks requiring domain-specific MT

It is not intended for casual or conversational translation, as it is optimised for formal and legal text.


Training data

The model was trained on the MCWC’s Arabic-English aligned sentence pairs.
The MCWC dataset includes:

  • cleaned constitutional text
  • high-quality sentence segmentation
  • pairwise alignments
  • country and regional metadata

More details may be found in the accompanying paper:

El-Haj, M. & Ezzini, S. (2024). “The Multilingual Corpus of World’s Constitutions (MCWC).”
OSACT @ LREC-COLING 2024.
https://aclanthology.org/2024.osact-1.7/


Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • optimizer: AdamWeightDecay (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-08, weight_decay_rate = 0.01, amsgrad = False)
  • learning-rate schedule: PolynomialDecay (initial_learning_rate = 5e-05, decay_steps = 384, end_learning_rate = 0.0, power = 1.0, cycle = False)
  • training_precision: mixed_float16
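The schedule above is Keras's PolynomialDecay with power = 1.0, i.e. a linear ramp from 5e-05 down to 0.0 over the 384 training steps. As an illustration of what it computes, here is a minimal pure-Python sketch (the function name is ours, not part of the training code):

```python
# Illustrative sketch of the linear PolynomialDecay schedule used above:
# the learning rate starts at 5e-05 and decays linearly to 0.0 over 384 steps.
# The function name `polynomial_decay` is ours, not from the training code.
def polynomial_decay(step, initial_lr=5e-05, decay_steps=384,
                     end_lr=0.0, power=1.0):
    step = min(step, decay_steps)             # clamp beyond the decay horizon
    frac = (1 - step / decay_steps) ** power  # remaining fraction (linear here)
    return (initial_lr - end_lr) * frac + end_lr

print(polynomial_decay(0))    # 5e-05 at the start of training
print(polynomial_decay(192))  # 2.5e-05 at the halfway point
print(polynomial_decay(384))  # 0.0 at the end of training
```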

Training results

  Epoch   Train Loss   Validation Loss
  0       1.3918       1.1473
  1       1.0745       1.1021
  2       0.9486       1.0908

Framework versions

  • Transformers 4.33.3
  • TensorFlow 2.13.0
  • Datasets 2.14.5
  • Tokenizers 0.13.3

Citation

If you use this model, please cite the MCWC paper:

El-Haj, M. & Ezzini, S. (2024).
The Multilingual Corpus of World’s Constitutions (MCWC).
Proceedings of OSACT @ LREC-COLING 2024.
https://aclanthology.org/2024.osact-1.7/
