---
license: apache-2.0
---
|
|
|
|
|
We release the suite of models trained as part of our work on the scaling laws of decoder-only machine translation systems. This work was published at WMT 2024 and is available [here](https://aclanthology.org/2024.wmt-1.124/).
|
|
|
|
|
These models were trained on a mixture of general and financial sentences covering 11 language directions. They support 8 languages (English, French, German, Italian, Spanish, Dutch, Swedish and Portuguese) and 9 domains (general plus 8 financial subdomains). They operate at the sentence level and are not tailored for document-level translation.
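
Concretely, each request is a single decoder prompt: the source sentence followed by a control token for the target language and, optionally, tokens for the source language and the domain (see the `format_input` helper in the usage section below). For example, a general-domain French-to-English request would look like this (the sentence is illustrative):

```python
# Prompt layout: <eos>{source}</src><lang_{tgt}>[<lang_{src}>][<dom_{domain}>]
prompt = "<eos>Le fonds investit en actions européennes.</src><lang_en><lang_fr><dom_general>"
```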
|
|
|
|
|
A running demo of these models is available on [our dedicated space](https://huggingface.co/spaces/DragonLLM/FinTranslate-Demo). |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
The table below details the performance of our models on general-domain translation.
|
|
|
|
|
| Model | BLEU | COMET | COMET-Kiwi |
| ------------------- | --------- | --------- | ---------- |
| FinTranslate-70M | 29.62 | 81.31 | 80.72 |
| FinTranslate-160M | 32.43 | 84.00 | 83.45 |
| FinTranslate-410M | 33.60 | 84.81 | 84.14 |
| FinTranslate-Bronze | 34.08 | 85.10 | 84.35 |
| FinTranslate-Silver | 34.42 | 85.10 | 84.33 |
| FinTranslate-Gold | **36.07** | 85.88 | 84.82 |
| | | | |
| Llama 3.1 8B | 30.43 | 84.82 | 84.47 |
| Mistral 7B | 23.26 | 80.08 | 82.29 |
| Tower 7B | 33.50 | **85.91** | **85.02** |
|
|
|
|
|
|
|
|
The table below details the performance of our models on financial-domain translation.
|
|
|
|
|
| Model | BLEU | COMET | COMET-Kiwi |
| ------------------- | --------- | --------- | ---------- |
| FinTranslate-70M | 44.63 | 86.95 | 80.88 |
| FinTranslate-160M | 49.02 | 88.27 | 81.80 |
| FinTranslate-410M | 50.85 | 88.64 | 81.73 |
| FinTranslate-Bronze | 52.00 | 88.85 | 81.71 |
| FinTranslate-Silver | 53.28 | **89.98** | 81.61 |
| FinTranslate-Gold | **58.34** | 89.62 | 81.35 |
| | | | |
| Llama 3.1 8B | 34.99 | 84.42 | 81.75 |
| Mistral 7B | 38.93 | 76.52 | 76.17 |
| Tower 7B | 38.93 | 86.49 | **82.66** |
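
For reference, scores of this kind can be computed with standard open-source tooling. The sketch below is illustrative rather than our exact evaluation pipeline; it assumes the `sacrebleu` and `unbabel-comet` packages and the public `Unbabel/wmt22-comet-da` and `Unbabel/wmt22-cometkiwi-da` checkpoints, and the evaluation sentences are made up.

```python
import sacrebleu
from comet import download_model, load_from_checkpoint

# Hypothetical evaluation triples (source, system output, reference).
sources = ["Le fonds investit principalement en actions européennes."]
hypotheses = ["The fund invests mainly in European equities."]
references = ["The fund invests primarily in European equities."]

# Corpus-level BLEU via sacrebleu.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# COMET (reference-based); system_score is on a 0-1 scale, reported here as 0-100.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r} for s, h, r in zip(sources, hypotheses, references)]
print(f"COMET: {100 * comet.predict(data, batch_size=8, gpus=0).system_score:.2f}")

# COMET-Kiwi (reference-free quality estimation).
# Note: this checkpoint may require accepting its license on the Hugging Face Hub.
kiwi = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))
kiwi_data = [{"src": s, "mt": h} for s, h in zip(sources, hypotheses)]
print(f"COMET-Kiwi: {100 * kiwi.predict(kiwi_data, batch_size=8, gpus=0).system_score:.2f}")
```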
|
|
|
|
|
|
|
|
## How to use it |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Supported language codes and domain labels.
# The human-readable keys map to the domain tags expected by the model.
LANGUAGES = ["en", "de", "es", "fr", "it", "nl", "sv", "pt"]
DOMAINS = {
    "Asset management": "am",
    "Annual report": "ar",
    "Corporate action": "corporateAction",
    "Equity research": "equi",
    "Fund fact sheet": "ffs",
    "Kiid": "kiid",
    "Life insurance": "lifeInsurance",
    "Regulatory": "regulatory",
    "General": "general",
}


def language_token(lang):
    return f"<lang_{lang}>"


def domain_token(dom):
    return f"<dom_{dom}>"


def format_input(src, tgt_lang, src_lang, domain):
    assert tgt_lang in LANGUAGES
    tgt_lang_token = language_token(tgt_lang)
    # Please read our paper to understand why we need to prefix the input with <eos>
    base_input = f"<eos>{src}</src>{tgt_lang_token}"
    if src_lang is None:
        return base_input
    else:
        assert src_lang in LANGUAGES
        src_lang_token = language_token(src_lang)
        base_input = f"{base_input}{src_lang_token}"
        if domain is None:
            return base_input
        else:
            # Unknown domain labels fall back to the general domain
            domain = DOMAINS.get(domain, "general")
            dom_token = domain_token(domain)
            base_input = f"{base_input}{dom_token}"
            return base_input


model_id = "DragonLLM/FinTranslate-160M"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

source_sentence = "Dragon LLM est une entreprise française spécialisé dans le domaine de l'IA générative."
formatted_sentence = format_input(source_sentence, "en", "fr", "General")
inputs = tokenizer(formatted_sentence, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)

# Decode only the newly generated tokens, i.e. the translation after the prompt
input_size = inputs["input_ids"].size(1)
translated_sentence = tokenizer.decode(
    outputs[0, input_size:], skip_special_tokens=True
)
print(translated_sentence)
# Dragon LLM is a French company specialized in the field of generative AI.
```
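
The same helper can target any of the supported directions and domains. As a small follow-up sketch (the financial example sentence below is made up, and the objects are reused from the snippet above):

```python
# Reuses `format_input`, `model` and `tokenizer` defined above.
# Translate an illustrative fund fact sheet sentence from English into French.
financial_sentence = "The fund aims to achieve long-term capital growth by investing in European equities."
formatted = format_input(financial_sentence, "fr", "en", "Fund fact sheet")
inputs = tokenizer(formatted, return_tensors="pt", return_token_type_ids=False)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0, inputs["input_ids"].size(1):], skip_special_tokens=True))
```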
|
|
|
|
|
## Citing this work |
|
|
|
|
|
If you use this model in your work, please cite it as: |
|
|
|
|
|
```
@inproceedings{caillaut-etal-2024-scaling,
    title = "Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task",
    author = {Caillaut, Ga{\"e}tan and
      Nakhl{\'e}, Mariam and
      Qader, Raheel and
      Liu, Jingshu and
      Barth{\'e}lemy, Jean-Gabriel},
    editor = "Haddow, Barry and
      Kocmi, Tom and
      Koehn, Philipp and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.124/",
    doi = "10.18653/v1/2024.wmt-1.124",
    pages = "1318--1331"
}
```