Model Card

Original SONAR weights converted to the Hugging Face format.

This model is part of the Open SONAR project, an open training pipeline for SONAR. The pipeline can be used to fine-tune a SONAR model or to train one from scratch.

Model Sources

Uses

Examples are available here.

Load the model and tokenizer


sonar = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")

tokenizer = NllbTokenizer.from_pretrained("tutur90/SONAR-Text-to-Text")

The code for SONARForText2Text is available in Open SONAR - Model, and the code for NllbTokenizer in Open SONAR - Tokenizer.

Translation


# Example inputs (illustrative values; language codes are assumed to follow the NLLB convention)
sentence = "Hello, world!"
src_lang = "eng_Latn"
tgt_lang = "fra_Latn"

inputs = tokenizer(sentence, langs=src_lang, return_tensors="pt")

generated = sonar.generate(
    **inputs,
    target_lang_ids=[tokenizer.convert_tokens_to_ids(tgt_lang)],
    max_length=128,
    num_beams=1,
    do_sample=False,
)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(f"SONAR {src_lang} -> {tgt_lang}: {decoded}")

Encode


encoder = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")

encoder.set_encoder_only()  # Drop the decoder to save memory (optional)

inputs = tokenizer(sentence, langs=src_lang, return_tensors="pt")

embeddings = encoder.encode(**inputs)
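
Since SONAR maps sentences from different languages into a single embedding space, the embeddings can be compared directly. Below is a minimal sketch that measures the cosine similarity between an English and a French sentence; it assumes encode returns one fixed-size embedding per input (a tensor of shape (batch_size, embedding_dim)) and uses NLLB-style language codes.

import torch.nn.functional as F

# Assumption: encode() returns a (batch_size, embedding_dim) tensor
inputs_en = tokenizer("The weather is nice today.", langs="eng_Latn", return_tensors="pt")
inputs_fr = tokenizer("Il fait beau aujourd'hui.", langs="fra_Latn", return_tensors="pt")

emb_en = encoder.encode(**inputs_en)
emb_fr = encoder.encode(**inputs_fr)

print(f"Cosine similarity: {F.cosine_similarity(emb_en, emb_fr).item():.3f}")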

Decode


decoder = SONARForText2Text.from_pretrained("tutur90/SONAR-Text-to-Text")

decoder.set_decoder_only()  # Drop the encoder to save memory (optional)

generated = decoder.decode(
    embeddings,  # sentence embeddings produced by the encoder (see Encode above)
    target_lang_ids=[tokenizer.convert_tokens_to_ids(tgt_lang)],
    max_length=128,
    num_beams=1,
    do_sample=False,
)
decoded = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
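
Putting the two halves together gives embedding-based translation: encode a sentence into the SONAR space, then decode the embedding into another language. A minimal sketch reusing the encoder, decoder, and tokenizer created above (NLLB-style language codes assumed):

# Encode an English sentence into the SONAR embedding space...
inputs = tokenizer("Machine translation is fun.", langs="eng_Latn", return_tensors="pt")
embeddings = encoder.encode(**inputs)

# ...then decode the embedding into French
generated = decoder.decode(
    embeddings,
    target_lang_ids=[tokenizer.convert_tokens_to_ids("fra_Latn")],
    max_length=128,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])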
