# Hindi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text in the Devanagari script.
## Model Description
- Vocabulary Size: 5,500 tokens
- Compression Ratio: 6.52X average (up to 10.44X on technical text)
- Training Corpus: 575K characters (1.5MB) of diverse Hindi text
- Decoding Accuracy: 100%
- Training Time: ~30 seconds
## Features
- Hindi-Optimized: prioritizes the Devanagari Unicode block (`\u0900-\u097F`)
- High Compression: 6.52X average, up to 10.44X on technical text
- Perfect Decoding: 100% accuracy in text reconstruction
- Simple API: easy encode/decode with compression stats
- Fast Training: train from scratch in ~30 seconds
## Installation
```bash
# Clone the repository
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer
cd hindi-bpe-tokenizer

# Install dependencies
pip install regex numpy

# Or with uv:
uv add regex numpy
```
## Quick Start
### Step 1: Train the Tokenizer
> ⚠️ **Note:** This repository does not include the pre-trained model file (543MB). You need to train it once locally, which takes only ~30 seconds.
```bash
python train_bpe_simple.py
```
This will:

- Load the included Hindi corpus
- Train the BPE tokenizer
- Generate `hindi_bpe_tokenizer.json` (~543MB)
- Test on 8 Hindi samples
- Display performance metrics
### Step 2: Use the Tokenizer
```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Load trained tokenizer
tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')

# Encode Hindi text
text = "भारत एक महान देश है।"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Get compression statistics
stats = tokenizer.get_compression_stats(text)
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
```
## Performance Metrics
| Metric | Value |
|---|---|
| Vocabulary Size | 5,500 tokens |
| Compression Ratio | 6.52X (avg), 10.44X (best) |
| Decoding Accuracy | 100% |
| Training Corpus | 575K chars, 1.5MB |
| Training Time | ~30 seconds |
### Test Results on Different Text Types
| Category | Original Bytes | Compressed Tokens | Compression Ratio |
|---|---|---|---|
| Space Mission | 204 | 31 | 6.58X |
| Cricket News | 146 | 27 | 5.41X |
| Science & Tech | 188 | 18 | 10.44X |
| Language | 123 | 18 | 6.83X |
| Education | 140 | 17 | 8.24X |
| Environment | 132 | 21 | 6.29X |
| Mixed Content | 125 | 34 | 3.68X |
| Long Sentence | 240 | 33 | 7.27X |
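The ratios above are byte-based: original UTF-8 bytes divided by token count (e.g. 204 / 31 ≈ 6.58X). Since every Devanagari character occupies 3 bytes in UTF-8, byte-based ratios run roughly three times higher than character-based ones. A quick check:

```python
# Byte-based compression ratio, as reported in the tables above.
# Each Devanagari character is 3 bytes in UTF-8.
text = "भारत एक महान देश है।"
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes)  # 20 chars -> 52 bytes (16 Devanagari chars * 3 + 4 spaces)

# If the tokenizer emitted, say, 5 tokens for this sentence
# (a hypothetical count), the reported ratio would be:
ratio = n_bytes / 5
print(f"{ratio:.2f}X")  # 10.40X
```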
## Advanced Usage
### Custom Training
```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Create tokenizer with custom vocabulary size
tokenizer = HindiBPETokenizer(vocab_size=8000)

# Load your custom Hindi corpus
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

# Train
tokenizer.train(corpus, verbose=True)

# Save
tokenizer.save('my_custom_tokenizer.json')
```
### Get Detailed Statistics
```python
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट")
print(f"Original characters: {stats['original_chars']}")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Vocabulary size: {stats['vocab_size']:,}")
```
## Repository Structure
```
hindi-bpe-tokenizer/
├── hindi_bpe_tokenizer.py           # Core implementation (8KB)
├── train_bpe_simple.py              # Training script (5KB)
├── create_diverse_hindi_corpus.py   # Corpus generator (17KB)
├── hindi_corpus.txt                 # Training data (1.5MB)
├── training_results.json            # Performance metrics (2KB)
├── pyproject.toml                   # Dependencies
└── README.md                        # This file
```

Note: `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.
## Training Data
The tokenizer was trained on diverse Hindi content including:
- News: Cricket, space missions, current events
- Science & Technology: विज्ञान (science), प्रौद्योगिकी (technology) vocabulary
- Education & Environment: शिक्षा (education), पर्यावरण (environment) topics
- Politics & Governance: राजनीति (politics), संविधान (constitution) terms
- Daily Life: Common phrases, daily vocabulary
- Complete Alphabet: All Devanagari letters, vowels, consonants
- Numbers: Both Arabic (0-9) and Devanagari (०-९) numerals
## Technical Details
### BPE Algorithm

1. Start with the 256-byte base vocabulary
2. Find the most frequent adjacent byte pair in the corpus
3. Merge that pair into a new token
4. Repeat until the target vocabulary size is reached
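The loop above can be sketched in a few lines of Python. This is an illustrative minimal byte-level BPE, not the repository's actual implementation:

```python
from collections import Counter

def train_bpe(text: bytes, num_merges: int):
    """Minimal byte-level BPE: start from the 256 raw bytes and repeatedly
    merge the most frequent adjacent pair into a new token id."""
    ids = list(text)
    merges = {}                              # (left_id, right_id) -> new id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]    # most frequent adjacent pair
        new_id = 256 + i
        merges[pair] = new_id
        out, j = [], 0                       # replace every occurrence
        while j < len(ids):
            if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return ids, merges

data = "भारत एक महान देश है। भारत एक महान देश है।".encode("utf-8")
ids, merges = train_bpe(data, num_merges=20)

# Decoding is lossless: expand each merged token back into raw bytes.
vocab = {i: bytes([i]) for i in range(256)}
for (a, b), idx in merges.items():
    vocab[idx] = vocab[a] + vocab[b]
assert b"".join(vocab[t] for t in ids) == data
print(f"{len(data)} bytes -> {len(ids)} tokens")
```

The final assertion mirrors the 100% decoding accuracy claimed above: BPE merges are invertible by construction, so reconstruction is always exact.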
### Hindi-Specific Optimizations

- Explicit handling of the `\u0900-\u097F` (Devanagari) and `\u0980-\u09FF` (Bengali) Unicode blocks
- Optimized regex pattern for Hindi word boundaries
- JSON-based serialization for easy sharing
### Regex Pattern

```python
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```
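The `\p{L}`/`\p{N}` classes in the full pattern require the third-party `regex` module. A simplified stdlib-only sketch of the Devanagari branches shows how the pattern chunks text into word-like pieces before BPE runs on each piece:

```python
import re

# Simplified pattern: Devanagari runs, ASCII digit runs, other non-space
# runs, and whitespace. (The full pattern above additionally needs the
# `regex` module for the \p{L} and \p{N} Unicode property classes.)
pat = re.compile(r" ?[\u0900-\u097F]+| ?[0-9]+| ?[^\s\u0900-\u097F0-9]+|\s+")
print(pat.findall("भारत एक महान देश है।"))
# ['भारत', ' एक', ' महान', ' देश', ' है।']
```

Note that the leading ` ?` keeps the word-initial space attached to each chunk, so frequent words like ` महान` can become single tokens, space included.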
## Use Cases
- Text compression for Hindi documents
- Tokenization for Hindi NLP models
- Language model preprocessing
- Text analysis and statistics
- Educational purposes (understanding BPE)
## Requirements

- Python 3.13+
- `regex` library (for Unicode property support)
- `numpy` (optional, for numerical operations)
## ⚠️ Limitations
- Trained on specific corpus (may need retraining for domain-specific text)
- Best for Devanagari script (Hindi)
- Compression ratio varies by text type
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X)
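Part of the mixed-text drop is mechanical rather than a tokenizer weakness: ASCII characters are 1 byte in UTF-8 versus 3 for Devanagari, so each token in mixed text covers fewer bytes even at the same token count. A rough illustration:

```python
# Bytes per character: pure Devanagari vs. mixed Hindi-English text.
hindi = "भारत महान देश"
mixed = "भारत is a great country"
for label, s in [("hindi", hindi), ("mixed", mixed)]:
    # Pure Devanagari averages close to 3 bytes/char; mixed text much less.
    print(label, len(s.encode("utf-8")) / len(s))
```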
## Troubleshooting
**Issue:** `ModuleNotFoundError: No module named 'regex'`

```bash
pip install regex
```
**Issue:** `hindi_bpe_tokenizer.json` not found

```bash
# Train the tokenizer first
python train_bpe_simple.py
```
**Issue:** `UnicodeDecodeError`

```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
## Citation
If you use this tokenizer in your research or project, please cite:
```bibtex
@misc{hindi_bpe_tokenizer_2025,
  title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ansul90/hindi-bpe-tokenizer}
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgments
- Inspired by OpenAI's GPT-2 BPE implementation
- Built for the Hindi NLP community
- Based on the paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)
## Contact & Links
- Hugging Face: ansul90/hindi-bpe-tokenizer
- GitHub: Your GitHub link
- Email: Your email
धन्यवाद (Thank you) for using the Hindi BPE Tokenizer!