Model Card
Original SONAR weights converted to the Hugging Face format.
This model is part of the Open SONAR project, an open training pipeline for SONAR. The pipeline can be used to fine-tune a SONAR model or to train one from scratch.
Model Sources
- Repository: https://github.com/facebookresearch/SONAR
- Paper: https://arxiv.org/abs/2308.11466
Uses
Examples are available here.
Load the model and tokenizer
```python
sonar = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
tokenizer = NllbTokenizer.from_pretrained("tutur90/SONAR-Text-to-Text")
```
The code for SONARForText2Text is available in Open SONAR - Model, and the code for NllbTokenizer in Open SONAR - Tokenizer.
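The snippets on this page omit imports and device placement. A minimal setup sketch, assuming the classes are importable from the Open SONAR package (the module path below is a guess; adjust it to your install):

```python
# Minimal setup sketch; the import path below is an assumption,
# adjust it to wherever the Open SONAR package exposes these classes.
import torch
from open_sonar import SONARForText2Text, NllbTokenizer  # hypothetical module path

tokenizer = NllbTokenizer.from_pretrained("tutur90/SONAR-Text-to-Text")
sonar = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
sonar = sonar.to("cuda" if torch.cuda.is_available() else "cpu").eval()
```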
Translation
```python
sentence = "Hello, world!"
src_lang, tgt_lang = "eng_Latn", "fra_Latn"  # NLLB-style language codes

inputs = tokenizer(sentence, langs=src_lang, return_tensors="pt")
generated = sonar.generate(
    **inputs,
    target_lang_ids=[tokenizer.convert_tokens_to_ids(tgt_lang)],
    max_length=128,
    num_beams=1,      # greedy decoding
    do_sample=False,
)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print(f"SONAR {src_lang} -> {tgt_lang}: {decoded}")
```
Encode
```python
encoder = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
encoder.set_encoder_only()  # Drops the decoder to save memory; this step is optional

inputs = tokenizer(sentence, langs=src_lang, return_tensors="pt")
embeddings = encoder.encode(**inputs)  # one fixed-size sentence embedding per input
```
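Because SONAR embeds sentences from all languages into a single vector space, embeddings can be compared directly. A minimal sketch, assuming `encode` returns a `(batch, dim)` float tensor (it may instead return a wrapper object in the actual implementation):

```python
import torch.nn.functional as F

# Embed the same sentence in two languages, reusing the encoder
# and tokenizer from the snippet above.
en = encoder.encode(**tokenizer("The weather is nice today.", langs="eng_Latn", return_tensors="pt"))
fr = encoder.encode(**tokenizer("Il fait beau aujourd'hui.", langs="fra_Latn", return_tensors="pt"))

# Translations of the same sentence should score close to 1.
print(F.cosine_similarity(en, fr, dim=-1).item())
```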
Decode
```python
decoder = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")
decoder.set_decoder_only()  # Drops the encoder to save memory; also optional

generated = decoder.decode(
    embeddings,  # the sentence embeddings returned by encoder.encode(...) above
    target_lang_ids=[tokenizer.convert_tokens_to_ids(tgt_lang)],
    max_length=128,
    num_beams=1,
    do_sample=False,
)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```
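Putting the two halves together gives translation through the embedding space; a sketch under the same assumptions as above (that `decode` accepts the tensor returned by `encode`):

```python
# English -> sentence embedding -> French, reusing the encoder,
# decoder, and tokenizer instances created above.
inputs = tokenizer("SONAR embeds whole sentences.", langs="eng_Latn", return_tensors="pt")
embeddings = encoder.encode(**inputs)

tokens = decoder.decode(
    embeddings,
    target_lang_ids=[tokenizer.convert_tokens_to_ids("fra_Latn")],
    max_length=128,
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```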