---
language:
- en
- uk
metrics:
- chrf
- comet
tags:
- synthetic
- machine-translation
- low-resource
- data-augmentation
- nmt
- multilingual
- dataset
- transformer-base
datasets:
- openlanguagedata/flores_plus
license: cc-by-4.0
---

# Model Name: `Helsinki-NLP/opus-mt-synthetic-en-uk`

## Model Overview

This model is the synthetic baseline (transformer-base) for the English-Ukrainian language pair of our paper ["Scaling Low-Resource MT via Synthetic Data Generation with LLMs"](https://arxiv.org/abs/2505.14423). The training data was generated by forward-translating the English side of Europarl with GPT-4o; the approach aims to improve MT performance for underrepresented languages by supplementing traditional datasets with high-quality, LLM-generated translations (a hedged sketch of this generation step is shown after the usage example below).

The goal of this model is to provide a baseline for MT tasks, demonstrating the potential of synthetic data to enhance translation capabilities for languages with limited existing resources.

For more detailed methodology, see the full paper [here](https://arxiv.org/abs/2505.14423).

## Supported Language Pair

* **English ↔ Ukrainian**

## Evaluation

The quality of the generated synthetic data was evaluated with both automatic metrics (COMET and ChrF) and human evaluation; a minimal scoring sketch using these metrics follows the usage example below. The evaluation shows that the synthetic data generally performs well for low-resource languages, with significant gains observed when the data is used in downstream MT training.

Below are the evaluation results on FLORES+:

| Language Pair             | ChrF Score | COMET Score |
| ------------------------- | ---------- | ----------- |
| English ↔ Basque          | 53.00      | 81.51       |
| English ↔ Scottish Gaelic | 51.10      | 78.04       |
| English ↔ Icelandic       | 49.91      | 80.16       |
| English ↔ Georgian        | 49.49      | 80.72       |
| English ↔ Macedonian      | 57.72      | 82.24       |
| English ↔ Somali          | 45.10      | 78.15       |
| English ↔ Ukrainian       | 51.71      | 78.89       |

The results show that synthetic data provides a strong baseline across all language pairs, with the highest scores for Macedonian, one of the comparatively less low-resource languages in this set.

## Usage

You can use this model to generate translations with the following code:

```python
from transformers import MarianMTModel, MarianTokenizer

# Load the pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-synthetic-en-uk"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = MarianTokenizer.from_pretrained(model_name)

# Example source texts (English)
source_texts = ["How are you?", "Good morning!", "What is your name?"]

# Tokenize the input texts
inputs = tokenizer(source_texts, return_tensors="pt", padding=True, truncation=True)

# Generate translations (passing the attention mask along with the input ids)
translated_ids = model.generate(**inputs)

# Decode the generated tokens to get the translated text
translated_texts = tokenizer.batch_decode(translated_ids, skip_special_tokens=True)

# Print the translations
for src, tgt in zip(source_texts, translated_texts):
    print(f"Source: {src} => Translated: {tgt}")
```

For the given English sentences, the output might look something like this:

```
Source: How are you? => Translated: Як ви?
Source: Good morning! => Translated: Доброго ранку
Source: What is your name? => Translated: Яке ваше ім'я?
```
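As a rough illustration of the forward-translation setup described in the overview, the snippet below sketches how English Europarl sentences could be translated into Ukrainian with GPT-4o via the OpenAI chat completions API. The prompt wording, decoding settings, and the `europarl_sentences` list are illustrative assumptions and are not taken from the paper's actual generation pipeline.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical sample of English Europarl source sentences (illustrative only).
europarl_sentences = [
    "The committee adopted the report without amendments.",
    "The debate will continue tomorrow morning.",
]

synthetic_pairs = []
for sentence in europarl_sentences:
    # Assumed prompt; the paper's exact prompt and generation settings may differ.
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a professional English-to-Ukrainian translator."},
            {"role": "user", "content": f"Translate into Ukrainian, returning only the translation:\n{sentence}"},
        ],
        temperature=0.0,
    )
    translation = response.choices[0].message.content.strip()
    synthetic_pairs.append((sentence, translation))  # (source, synthetic target)

# The resulting pairs would then be mixed with authentic parallel data
# to train the transformer-base MT model.
for src, tgt in synthetic_pairs:
    print(f"{src}\t{tgt}")
```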
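Translations like the ones above can be scored with the same metric families reported in the evaluation table. The following is a minimal sketch using the ChrF implementation from `sacrebleu` and the `unbabel-comet` package; the `Unbabel/wmt22-comet-da` checkpoint and the toy sentences are assumptions, not necessarily the exact configuration used in the paper.

```python
from sacrebleu.metrics import CHRF
from comet import download_model, load_from_checkpoint

# Toy example data (illustrative only).
sources = ["Good morning!"]
hypotheses = ["Доброго ранку!"]   # model outputs
references = ["Доброго ранку!"]   # human reference translations

# ChrF over the corpus (sacrebleu expects a list of reference streams).
chrf = CHRF()
print("ChrF:", chrf.corpus_score(hypotheses, [references]).score)

# COMET; the checkpoint choice here is an assumption.
comet_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_path)
comet_data = [
    {"src": s, "mt": h, "ref": r}
    for s, h, r in zip(sources, hypotheses, references)
]
print("COMET:", comet_model.predict(comet_data, batch_size=8, gpus=0).system_score)
```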
## Citation Information

```bibtex
@article{degibert2025scaling,
  title={Scaling Low-Resource MT via Synthetic Data Generation with LLMs},
  author={de Gibert, Ona and Attieh, Joseph and Vahtola, Teemu and Aulamo, Mikko and Li, Zihao and V{\'a}zquez, Ra{\'u}l and Hu, Tiancheng and Tiedemann, J{\"o}rg},
  journal={arXiv preprint arXiv:2505.14423},
  year={2025}
}
```