---
license: apache-2.0
---
|
|
|
|
|
We release the suite of models trained as part of our work on the scaling laws of decoder-only machine translation systems. This work was published at WMT 2024 and is available [here](https://aclanthology.org/2024.wmt-1.124/).
|
|
|
|
|
These models were trained on a mixture of general and financial sentences covering 11 language directions. They support 8 languages (English, French, German, Italian, Spanish, Dutch, Swedish and Portuguese) and 9 domains (general plus 8 financial subdomains). They operate at the sentence level and are not tailored for document-level translation.
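
Concretely, each request is a single decoder prompt: the source sentence followed by a control token for the target language and, optionally, tokens for the source language and the domain (see the `format_input` helper in the usage section below). For example, a general-domain French-to-English request would look like this (the sentence is illustrative):

```python
# Prompt layout: <eos>{source}</src><lang_{tgt}>[<lang_{src}>][<dom_{domain}>]
prompt = "<eos>Le fonds investit en actions européennes.</src><lang_en><lang_fr><dom_general>"
```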
|
|
|
|
|
A running demo of these models is available on [our dedicated space](https://huggingface.co/spaces/DragonLLM/FinTranslate-Demo). |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The table below details the performance of our models on general-domain translation.
|
|
|
|
|
| Model | BLEU | COMET | COMET-Kiwi |
| ------------------- | --------- | --------- | ---------- |
| FinTranslate-70M | 29.62 | 81.31 | 80.72 |
| FinTranslate-160M | 32.43 | 84.00 | 83.45 |
| FinTranslate-410M | 33.60 | 84.81 | 84.14 |
| FinTranslate-Bronze | 34.08 | 85.10 | 84.35 |
| FinTranslate-Silver | 34.42 | 85.10 | 84.33 |
| FinTranslate-Gold | **36.07** | 85.88 | 84.82 |
| | | | |
| Llama 3.1 8B | 30.43 | 84.82 | 84.47 |
| Mistral 7B | 23.26 | 80.08 | 82.29 |
| Tower 7B | 33.50 | **85.91** | **85.02** |
|
|
|
|
|
|
|
|
The table below details the performance of our models on financial-domain translation.
|
|
|
|
|
| Model | BLEU | COMET | COMET-Kiwi |
| ------------------- | --------- | --------- | ---------- |
| FinTranslate-70M | 44.63 | 86.95 | 80.88 |
| FinTranslate-160M | 49.02 | 88.27 | 81.80 |
| FinTranslate-410M | 50.85 | 88.64 | 81.73 |
| FinTranslate-Bronze | 52.00 | 88.85 | 81.71 |
| FinTranslate-Silver | 53.28 | **89.98** | 81.61 |
| FinTranslate-Gold | **58.34** | 89.62 | 81.35 |
| | | | |
| Llama 3.1 8B | 34.99 | 84.42 | 81.75 |
| Mistral 7B | 38.93 | 76.52 | 76.17 |
| Tower 7B | 38.93 | 86.49 | **82.66** |
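
For reference, scores of this kind can be computed with standard open-source tooling. The sketch below is illustrative rather than our exact evaluation pipeline; it assumes the `sacrebleu` and `unbabel-comet` packages and the public `Unbabel/wmt22-comet-da` and `Unbabel/wmt22-cometkiwi-da` checkpoints, and the evaluation sentences are made up.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Hypothetical evaluation triples (source, system output, reference).
sources = ["Le fonds investit principalement en actions européennes."]
hypotheses = ["The fund invests mainly in European equities."]
references = ["The fund invests primarily in European equities."]

# Corpus-level BLEU via sacrebleu.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET (reference-based); system_score is on a 0-1 scale, reported here as 0-100.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {100 * comet.predict(data, batch_size=8, gpus=0).system_score:.2f}")

# COMET-Kiwi (reference-free quality estimation).
# Note: this checkpoint may require accepting its license on the Hugging Face Hub.
kiwi = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
kiwi_data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
print(f"COMET-Kiwi: {100 * kiwi.predict(kiwi_data, batch_size=8, gpus=0).system_score:.2f}")
```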
|
|
|
|
|
|
|
|
## How to use it |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Supported language codes and domain labels.
# The human-readable keys map to the domain tags expected by the model.
LANGUAGES = ["en", "de", "es", "fr", "it", "nl", "sv", "pt"]
DOMAINS = {
    "Asset management": "am",
    "Annual report": "ar",
    "Corporate action": "corporateAction",
    "Equity research": "equi",
    "Fund fact sheet": "ffs",
    "Kiid": "kiid",
    "Life insurance": "lifeInsurance",
    "Regulatory": "regulatory",
    "General": "general",
}


def language_token(lang):
    return f"<lang_{lang}>"


def domain_token(dom):
    return f"<dom_{dom}>"


def format_input(src, tgt_lang, src_lang, domain):
    assert tgt_lang in LANGUAGES
    tgt_lang_token = language_token(tgt_lang)
    # Please read our paper to understand why we need to prefix the input with <eos>
    base_input = f"<eos>{src}</src>{tgt_lang_token}"
    if src_lang is None:
        return base_input
    else:
        assert src_lang in LANGUAGES
        src_lang_token = language_token(src_lang)
        base_input = f"{base_input}{src_lang_token}"
        if domain is None:
            return base_input
        else:
            # Unknown domain labels fall back to the general domain
            domain = DOMAINS.get(domain, "general")
            dom_token = domain_token(domain)
            base_input = f"{base_input}{dom_token}"
            return base_input


model_id = "DragonLLM/FinTranslate-160M"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

source_sentence = "Dragon LLM est une entreprise française spécialisé dans le domaine de l'IA générative."
formatted_sentence = format_input(source_sentence, "en", "fr", "General")
inputs = tokenizer(formatted_sentence, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, i.e. the translation after the prompt
input_size = inputs["input_ids"].size(1)
translated_sentence = tokenizer.decode(
    outputs[0, input_size:], skip_special_tokens=True
)
print(translated_sentence)
# Dragon LLM is a French company specialized in the field of generative AI.
```
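
The same helper can target any of the supported directions and domains. As a small follow-up sketch (the financial example sentence below is made up, and the objects are reused from the snippet above):

```python
# Reuses `format_input`, `model` and `tokenizer` defined above.
# Translate an illustrative fund fact sheet sentence from English into French.
financial_sentence = "The fund aims to achieve long-term capital growth by investing in European equities."
formatted = format_input(financial_sentence, "fr", "en", "Fund fact sheet")
inputs = tokenizer(formatted, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0, inputs["input_ids"].size(1):], skip_special_tokens=True))
```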
|
|
|
|
|
## Citing this work |
|
|
|
|
|
If you use this model in your work, please cite it as: |
|
|
|
|
|
```
@inproceedings{caillaut-etal-2024-scaling,
    title = "Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task",
    author = {Caillaut, Ga{\"e}tan and
      Nakhl{\'e}, Mariam and
      Qader, Raheel and
      Liu, Jingshu and
      Barth{\'e}lemy, Jean-Gabriel},
    editor = "Haddow, Barry and
      Kocmi, Tom and
      Koehn, Philipp and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.124/",
    doi = "10.18653/v1/2024.wmt-1.124",
    pages = "1318--1331"
}
```