Bilingual Language Model (Bangla-English)
Model Description
This is a bilingual causal language model trained on Bangla (Bengali) and English text. The model is designed for general-purpose text generation and understanding in both languages.
Model Type: Causal Language Model (GPT-style)
Languages: Bangla (bn), English (en)
Training Data: Wikipedia articles, educational content, literary texts
License: Apache 2.0
Model Size: 124M parameters
Context Length: 2048 tokens
Intended Uses
Primary Use Cases
- Text Generation: Generate coherent text in Bangla or English
- Text Completion: Complete partial sentences or paragraphs
- Language Understanding: Extract features for downstream tasks
- Fine-tuning: Base model for task-specific applications
Example Applications
- Content generation for educational materials
- Writing assistance tools
- Chatbots and conversational AI
- Text summarization (after fine-tuning)
- Question answering (after fine-tuning)
How to Use
Installation
pip install transformers torch
Basic Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "KothaGPT/bilingual-lm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Generate text in Bangla
prompt = "বাংলাদেশের রাজধানী"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Generate text in English
prompt = "The capital of Bangladesh is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Advanced Usage with Pipeline
from transformers import pipeline
# Create text generation pipeline
generator = pipeline("text-generation", model=model_name)
# Generate with parameters
result = generator(
"বাংলা ভাষা",
max_length=100,
num_return_sequences=3,
temperature=0.8,
top_p=0.9
)
for seq in result:
print(seq['generated_text'])
Training Details
Training Data
- Wikipedia: Bangla and English Wikipedia articles (aligned parallel corpus)
- Literary Corpus: Bengali literature and poetry
- Educational Content: Textbooks and learning materials
- Web Crawl: High-quality web content in both languages
- Total Tokens: ~1.2B tokens (600M per language)
Training Procedure
- Architecture: GPT-Neo architecture with rotary position embeddings
- Tokenizer: Custom bilingual Byte-level BPE tokenizer
- Vocabulary Size: 65,536 tokens (32,768 per language)
- Training Steps: 150,000 steps with gradient accumulation
- Batch Size: 1M tokens per batch (distributed across GPUs)
- Learning Rate: 6e-5 with cosine decay and warmup
- Hardware: Trained on 8x A100 GPUs (80GB) with DeepSpeed ZeRO-3
- Mixed Precision: bfloat16 with gradient checkpointing
- Sequence Length: 2048 tokens
Hyperparameters
{
"model_type": "gpt2",
"vocab_size": 50000,
"n_positions": 1024,
"n_embd": 768,
"n_layer": 12,
"n_head": 12,
"learning_rate": 5e-5,
"warmup_steps": 10000,
"max_steps": 100000
}
Evaluation
Perplexity (Lower is Better)
| Dataset | Perplexity |
|---|---|
| Bangla Test Set | 12.4 |
| English Test Set | 15.8 |
| Mixed Test Set | 14.1 |
| Code-Switched Test Set | 17.3 |
Zero-shot Performance
| Task | Bangla | English |
|---|---|---|
| Text Classification | 78.2% | 82.5% |
| Named Entity Recognition | 75.6% F1 | 79.3% F1 |
| Question Answering | 68.4% F1 | 72.1% F1 |
Downstream Tasks (after fine-tuning)
- Text Classification: 85% accuracy
- Named Entity Recognition: 82% F1
- Question Answering: 78% F1
Limitations
Known Limitations
- Domain Bias: Primarily trained on Wikipedia and educational content
- Formal Language: Better performance on formal text than colloquial speech
- Code-Switching: Handles basic code-switching but may produce inconsistent outputs
- Context Length: Maximum 2048 tokens
- Generation Quality: May produce repetitive or incoherent text for very long sequences
- Toxic Content: May generate harmful or biased content without proper filtering
Language-Specific Issues
- Bangla: May struggle with complex literary forms and regional dialects
- English: Optimized for general English, may not capture specialized domains
- Romanized Bangla: Not trained on Romanized Bengali text
Ethical Considerations
Bias and Fairness
- The model may reflect biases present in Wikipedia and training data
- Geographic bias towards Bangladesh and India
- Potential gender and cultural biases in generated text
Recommended Practices
- Review generated content for appropriateness
- Do not use for generating harmful or misleading content
- Consider fine-tuning on domain-specific data for production use
- Implement content filtering for user-facing applications
Privacy
- Model does not store training data
- No personal information should be present in outputs
- Use caution when processing sensitive information
Citation
If you use this model in your research, please cite:
@misc{kothagpt-bilingual-lm,
title={KothaGPT Bilingual LM: A Large Language Model for Bangla and English},
author={KothaGPT Team},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/KothaGPT/bilingual-lm}},
note={Model card and documentation}
}
Model Card Authors
KothaGPT Team
Model Card Contact
For questions or issues, please open an issue on the GitHub repository.
Additional Resources
- GitHub Repository: https://github.com/KothaGPT/bilingual
- Documentation: https://github.com/KothaGPT/bilingual/tree/main/docs
- Dataset: https://huggingface.co/datasets/KothaGPT/bilingual-corpus
- Demo: https://huggingface.co/spaces/KothaGPT/bilingual-lm-demo
- Downloads last month
- 16
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
🙋
Ask for provider support