# MarianMT-en-to-ff (English to Fula)
## Overview
MarianMT-en-to-ff is a fine-tuned machine translation model that translates text from English to Fula (also known as Fulfulde or Pulaar). It is built on Helsinki-NLP's MarianMT framework and was trained on a carefully curated but small parallel corpus, with the aim of serving a low-resource language community.
The model provides a baseline for effective machine translation in a language pair where high-quality resources are scarce.
## Model Architecture
- Base Model: Initialized from a related high-resource language pair (e.g., `opus-mt-en-fr`) and fine-tuned on English–Fula data.
- Architecture: Sequence-to-Sequence Transformer (encoder-decoder) model.
- Tokenizer: A custom SentencePiece tokenizer trained on the combined English and Fula corpus.
- Parameters: Standard MarianMT configuration with 6 encoder and 6 decoder layers.
- Translation Direction: English → Fula (en → ff).
## Intended Use
- Digital Inclusion: Facilitating access to English-language content for Fula speakers.
- Academic Research: A foundational model for further research in low-resource NMT.
- Basic Communication: Providing draft translations for non-critical text.
## Limitations
- Low-Resource Quality: Due to the limited size of the parallel corpus, the translation quality may be inconsistent, especially for domain-specific, complex, or highly idiomatic English phrases.
- Dialect Variation: Fula has several regional dialects. The training data primarily reflects a West African dialect, and translation quality may degrade for texts in other dialects.
- Domain Specificity: The model is trained on general and news domain text. Technical or highly specific vocabulary may not be handled correctly.
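As a rough guard against the domain gap described above, it can help to flag inputs whose words fall mostly outside a reference vocabulary drawn from the general/news text the model was trained on, before trusting the draft output. The following is a minimal illustrative sketch; the vocabulary set and any threshold you apply are placeholders, not something shipped with the model.

```python
import re

def unknown_word_ratio(text, known_vocab):
    """Fraction of lowercase word tokens not present in known_vocab."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in known_vocab)
    return unknown / len(words)

# Toy vocabulary for demonstration only.
vocab = {"the", "community", "needs", "clean", "water", "for", "and", "health"}

print(unknown_word_ratio("The community needs clean water", vocab))            # 0.0
print(unknown_word_ratio("Quantum chromodynamics lattice regularization", vocab))  # 1.0
```

Inputs scoring near 1.0 are far from the training domain, and their translations should be treated with extra caution.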
## Example Code

```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "Your-HF-Username/MarianMT-en-to-ff"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Sample English text
english_text = [
    "The community needs clean water for health and agriculture.",
    "We are going to visit the capital city next week.",
]

# Tokenize and generate translation
encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated_tokens = model.generate(**encoded_input)

# Decode and print
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

print("--- English to Fula Translation ---")
for en, ff in zip(english_text, translated_text):
    print(f"EN: {en}")
    print(f"FF: {ff}\n")

# Note: Fula translations will vary based on training data.
# Example FF output: "Yimɓe ɓee ɗaɓɓi ndiyam laaɓɗam ngam cellal e ndema."
```
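When translating more than a handful of sentences, feeding `model.generate` fixed-size batches keeps memory bounded. A minimal sketch of that pattern follows; the batch size and the `translate_batch` callable are illustrative, and in practice `translate_batch` would wrap the tokenizer + `generate` + `batch_decode` steps shown above.

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def translate_all(sentences, translate_batch, batch_size=8):
    """Translate a list of sentences batch by batch.

    translate_batch is any callable mapping a list of source strings
    to a list of translations.
    """
    results = []
    for chunk in batched(sentences, batch_size):
        results.extend(translate_batch(chunk))
    return results

# Demonstration with a stand-in "translator" that echoes its input.
out = translate_all([f"sentence {i}" for i in range(20)], lambda xs: xs, batch_size=8)
print(len(out))  # 20
```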