File size: 6,475 Bytes
db38d4d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 |
---
language:
- bn
- en
license: apache-2.0
tags:
- bilingual
- bengali
- bangla
- language-model
- causal-lm
- wikipedia
datasets:
- KothaGPT/bilingual-corpus
widget:
- text: "বাংলাদেশের রাজধানী"
- text: "The capital of Bangladesh is"
---
# Bilingual Language Model (Bangla-English)
## Model Description
This is a bilingual causal language model trained on Bangla (Bengali) and English text. The model is designed for general-purpose text generation and understanding in both languages.
**Model Type:** Causal Language Model (GPT-style)
**Languages:** Bangla (bn), English (en)
**Training Data:** Wikipedia articles, educational content, literary texts
**License:** Apache 2.0
**Model Size:** 124M parameters
**Context Length:** 2048 tokens
## Intended Uses
### Primary Use Cases
- **Text Generation**: Generate coherent text in Bangla or English
- **Text Completion**: Complete partial sentences or paragraphs
- **Language Understanding**: Extract features for downstream tasks
- **Fine-tuning**: Base model for task-specific applications
### Example Applications
- Content generation for educational materials
- Writing assistance tools
- Chatbots and conversational AI
- Text summarization (after fine-tuning)
- Question answering (after fine-tuning)
## How to Use
### Installation
```bash
pip install transformers torch
```
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model_name = "KothaGPT/bilingual-lm"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Generate text in Bangla
prompt = "বাংলাদেশের রাজধানী"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Generate text in English
prompt = "The capital of Bangladesh is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Advanced Usage with Pipeline
```python
from transformers import pipeline
# Create text generation pipeline
generator = pipeline("text-generation", model=model_name)
# Generate with parameters
result = generator(
"বাংলা ভাষা",
max_length=100,
num_return_sequences=3,
temperature=0.8,
top_p=0.9
)
for seq in result:
print(seq['generated_text'])
```
## Training Details
### Training Data
- **Wikipedia**: Bangla and English Wikipedia articles (aligned parallel corpus)
- **Literary Corpus**: Bengali literature and poetry
- **Educational Content**: Textbooks and learning materials
- **Web Crawl**: High-quality web content in both languages
- **Total Tokens**: ~1.2B tokens (600M per language)
### Training Procedure
- **Architecture**: GPT-Neo architecture with rotary position embeddings
- **Tokenizer**: Custom bilingual Byte-level BPE tokenizer
- **Vocabulary Size**: 65,536 tokens (32,768 per language)
- **Training Steps**: 150,000 steps with gradient accumulation
- **Batch Size**: 1M tokens per batch (distributed across GPUs)
- **Learning Rate**: 6e-5 with cosine decay and warmup
- **Hardware**: Trained on 8x A100 GPUs (80GB) with DeepSpeed ZeRO-3
- **Mixed Precision**: bfloat16 with gradient checkpointing
- **Sequence Length**: 2048 tokens
### Hyperparameters
```json
{
"model_type": "gpt2",
"vocab_size": 50000,
"n_positions": 1024,
"n_embd": 768,
"n_layer": 12,
"n_head": 12,
"learning_rate": 5e-5,
"warmup_steps": 10000,
"max_steps": 100000
}
```
## Evaluation
### Perplexity (Lower is Better)
| Dataset | Perplexity |
|---------|------------|
| Bangla Test Set | 12.4 |
| English Test Set | 15.8 |
| Mixed Test Set | 14.1 |
| Code-Switched Test Set | 17.3 |
### Zero-shot Performance
| Task | Bangla | English |
|------|--------|---------|
| Text Classification | 78.2% | 82.5% |
| Named Entity Recognition | 75.6% F1 | 79.3% F1 |
| Question Answering | 68.4% F1 | 72.1% F1 |
### Downstream Tasks (after fine-tuning)
- Text Classification: 85% accuracy
- Named Entity Recognition: 82% F1
- Question Answering: 78% F1
## Limitations
### Known Limitations
- **Domain Bias**: Primarily trained on Wikipedia and educational content
- **Formal Language**: Better performance on formal text than colloquial speech
- **Code-Switching**: Handles basic code-switching but may produce inconsistent outputs
- **Context Length**: Maximum 2048 tokens
- **Generation Quality**: May produce repetitive or incoherent text for very long sequences
- **Toxic Content**: May generate harmful or biased content without proper filtering
### Language-Specific Issues
- **Bangla**: May struggle with complex literary forms and regional dialects
- **English**: Optimized for general English, may not capture specialized domains
- **Romanized Bangla**: Not trained on Romanized Bengali text
## Ethical Considerations
### Bias and Fairness
- The model may reflect biases present in Wikipedia and training data
- Geographic bias towards Bangladesh and India
- Potential gender and cultural biases in generated text
### Recommended Practices
- Review generated content for appropriateness
- Do not use for generating harmful or misleading content
- Consider fine-tuning on domain-specific data for production use
- Implement content filtering for user-facing applications
### Privacy
- Model does not store training data
- No personal information should be present in outputs
- Use caution when processing sensitive information
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{kothagpt-bilingual-lm,
title={KothaGPT Bilingual LM: A Large Language Model for Bangla and English},
author={KothaGPT Team},
year={2024},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/KothaGPT/bilingual-lm}},
note={Model card and documentation}
}
```
## Model Card Authors
KothaGPT Team
## Model Card Contact
For questions or issues, please open an issue on the [GitHub repository](https://github.com/KothaGPT/bilingual).
## Additional Resources
- **GitHub Repository**: https://github.com/KothaGPT/bilingual
- **Documentation**: https://github.com/KothaGPT/bilingual/tree/main/docs
- **Dataset**: https://huggingface.co/datasets/KothaGPT/bilingual-corpus
- **Demo**: https://huggingface.co/spaces/KothaGPT/bilingual-lm-demo
|