# MarianMT-en-to-ff (English to Fula)
## Overview
MarianMT-en-to-ff is a fine-tuned machine translation model that translates text from English to Fula (also known as Fulfulde or Pulaar). It is built on Helsinki-NLP's MarianMT framework and was trained on a carefully curated but small parallel corpus, with the aim of serving a low-resource language community.
The model provides a baseline for effective machine translation in a language pair where high-quality resources are scarce.
## Model Architecture
- Base Model: Initialized from a related high-resource language pair (e.g., `opus-mt-en-fr`) and fine-tuned on English–Fula data.
- Architecture: Sequence-to-Sequence Transformer (encoder-decoder) model.
- Tokenizer: A custom SentencePiece tokenizer trained on the combined English and Fula corpus.
- Parameters: Standard MarianMT configuration with 6 encoder and 6 decoder layers.
- Translation Direction: English → Fula (en → ff).
## Intended Use
- Digital Inclusion: Facilitating access to English-language content for Fula speakers.
- Academic Research: A foundational model for further research in low-resource NMT.
- Basic Communication: Providing draft translations for non-critical text.
## Limitations
- Low-Resource Quality: Due to the limited size of the parallel corpus, the translation quality may be inconsistent, especially for domain-specific, complex, or highly idiomatic English phrases.
- Dialect Variation: Fula has several regional dialects. The training data primarily reflects a West African dialect, and translation quality may degrade for texts in other dialects.
- Domain Specificity: The model is trained on general and news domain text. Technical or highly specific vocabulary may not be handled correctly.
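As a rough guard against the domain gap described above, it can help to flag inputs whose words fall mostly outside a reference vocabulary drawn from the general/news text the model was trained on, before trusting the draft output. The following is a minimal illustrative sketch; the vocabulary set and any threshold you apply are placeholders, not something shipped with the model.

```python
import re

def unknown_word_ratio(text, known_vocab):
    """Fraction of lowercase word tokens not present in known_vocab."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in known_vocab)
    return unknown / len(words)

# Toy vocabulary for demonstration only.
vocab = {"the", "community", "needs", "clean", "water", "for", "and", "health"}

print(unknown_word_ratio("The community needs clean water", vocab))            # 0.0
print(unknown_word_ratio("Quantum chromodynamics lattice regularization", vocab))  # 1.0
```

Inputs scoring near 1.0 are far from the training domain, and their translations should be treated with extra caution.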
## Example Code

```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "Your-HF-Username/MarianMT-en-to-ff"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Sample English text
english_text = [
    "The community needs clean water for health and agriculture.",
    "We are going to visit the capital city next week.",
]

# Tokenize and generate translation
encoded_input = tokenizer(english_text, return_tensors="pt", padding=True, truncation=True)
translated_tokens = model.generate(**encoded_input)

# Decode and print
translated_text = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)

print("--- English to Fula Translation ---")
for en, ff in zip(english_text, translated_text):
    print(f"EN: {en}")
    print(f"FF: {ff}\n")

# Note: Fula translations will vary based on training data.
# Example FF output: "Yimɓe ɓee ɗaɓɓi ndiyam laaɓɗam ngam cellal e ndema."
```
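When translating more than a handful of sentences, feeding `model.generate` fixed-size batches keeps memory bounded. A minimal sketch of that pattern follows; the batch size and the `translate_batch` callable are illustrative, and in practice `translate_batch` would wrap the tokenizer + `generate` + `batch_decode` steps shown above.

```python
def batched(items, batch_size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def translate_all(sentences, translate_batch, batch_size=8):
    """Translate a list of sentences batch by batch.

    translate_batch is any callable mapping a list of source strings
    to a list of translations.
    """
    results = []
    for chunk in batched(sentences, batch_size):
        results.extend(translate_batch(chunk))
    return results

# Demonstration with a stand-in "translator" that echoes its input.
out = translate_all([f"sentence {i}" for i in range(20)], lambda xs: xs, batch_size=8)
print(len(out))  # 20
```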