# Hindi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text in the Devanagari script.
## Model Description
- Vocabulary Size: 5,500 tokens
- Compression Ratio: 6.52X average (up to 10.44X on technical text)
- Training Corpus: 575K characters (1.5MB) of diverse Hindi text
- Decoding Accuracy: 100%
- Training Time: ~30 seconds
## Features
- Hindi-Optimized: prioritizes the Devanagari Unicode block (`\u0900-\u097F`)
- High Compression: 6.52X average, up to 10.44X on technical text
- Perfect Decoding: 100% accuracy in text reconstruction
- Simple API: easy encode/decode with compression stats
- Fast Training: train from scratch in ~30 seconds
## Installation
```bash
# Clone the repository
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer
cd hindi-bpe-tokenizer

# Install dependencies
pip install regex numpy

# Or with uv:
uv add regex numpy
```
## Quick Start
### Step 1: Train the Tokenizer
> ⚠️ **Note:** This repository does not include the pre-trained model file (543MB). You need to train it once locally, which takes only ~30 seconds.
```bash
python train_bpe_simple.py
```
This will:

- Load the included Hindi corpus
- Train the BPE tokenizer
- Generate `hindi_bpe_tokenizer.json` (~543MB)
- Test on 8 Hindi samples
- Display performance metrics
### Step 2: Use the Tokenizer
```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Load trained tokenizer
tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')

# Encode Hindi text
text = "भारत एक महान देश है।"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Get compression statistics
stats = tokenizer.get_compression_stats(text)
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
```
## Performance Metrics
| Metric | Value |
|---|---|
| Vocabulary Size | 5,500 tokens |
| Compression Ratio | 6.52X (avg), 10.44X (best) |
| Decoding Accuracy | 100% |
| Training Corpus | 575K chars, 1.5MB |
| Training Time | ~30 seconds |
### Test Results on Different Text Types
| Category | Original Bytes | Compressed Tokens | Compression Ratio |
|---|---|---|---|
| Space Mission | 204 | 31 | 6.58X |
| Cricket News | 146 | 27 | 5.41X |
| Science & Tech | 188 | 18 | 10.44X |
| Language | 123 | 18 | 6.83X |
| Education | 140 | 17 | 8.24X |
| Environment | 132 | 21 | 6.29X |
| Mixed Content | 125 | 34 | 3.68X |
| Long Sentence | 240 | 33 | 7.27X |
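The ratios above are byte-based: original UTF-8 bytes divided by token count (e.g. 204 / 31 ≈ 6.58X). Since every Devanagari character occupies 3 bytes in UTF-8, byte-based ratios run roughly three times higher than character-based ones. A quick check:

```python
# Byte-based compression ratio, as reported in the tables above.
# Each Devanagari character is 3 bytes in UTF-8.
text = "भारत एक महान देश है।"
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))
print(n_chars, n_bytes)  # 20 chars -> 52 bytes (16 Devanagari chars * 3 + 4 spaces)

# If the tokenizer emitted, say, 5 tokens for this sentence
# (a hypothetical count), the reported ratio would be:
ratio = n_bytes / 5
print(f"{ratio:.2f}X")  # 10.40X
```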
## Advanced Usage
### Custom Training
```python
from hindi_bpe_tokenizer import HindiBPETokenizer

# Create tokenizer with custom vocabulary size
tokenizer = HindiBPETokenizer(vocab_size=8000)

# Load your custom Hindi corpus
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

# Train
tokenizer.train(corpus, verbose=True)

# Save
tokenizer.save('my_custom_tokenizer.json')
```
### Get Detailed Statistics
```python
stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट")
print(f"Original characters: {stats['original_chars']}")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Vocabulary size: {stats['vocab_size']:,}")
```
## Repository Structure
```
hindi-bpe-tokenizer/
├── hindi_bpe_tokenizer.py           # Core implementation (8KB)
├── train_bpe_simple.py              # Training script (5KB)
├── create_diverse_hindi_corpus.py   # Corpus generator (17KB)
├── hindi_corpus.txt                 # Training data (1.5MB)
├── training_results.json            # Performance metrics (2KB)
├── pyproject.toml                   # Dependencies
└── README.md                        # This file
```

Note: `hindi_bpe_tokenizer.json` (543MB) is generated when you run `train_bpe_simple.py`.
## Training Data
The tokenizer was trained on diverse Hindi content including:
- News: Cricket, space missions, current events
- Science & Technology: विज्ञान (science), प्रौद्योगिकी (technology) vocabulary
- Education & Environment: शिक्षा (education), पर्यावरण (environment) topics
- Politics & Governance: राजनीति (politics), संविधान (constitution) terms
- Daily Life: Common phrases, daily vocabulary
- Complete Alphabet: All Devanagari letters, vowels, consonants
- Numbers: Both Arabic (0-9) and Devanagari (०-९) numerals
## Technical Details
### BPE Algorithm

1. Start with the 256-byte base vocabulary
2. Find the most frequent adjacent byte pair in the corpus
3. Merge that pair into a new token
4. Repeat until the target vocabulary size is reached
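The loop above can be sketched in a few lines of Python. This is an illustrative minimal byte-level BPE, not the repository's actual implementation:

```python
from collections import Counter

def train_bpe(text: bytes, num_merges: int):
    """Minimal byte-level BPE: start from the 256 raw bytes and repeatedly
    merge the most frequent adjacent pair into a new token id."""
    ids = list(text)
    merges = {}                              # (left_id, right_id) -> new id
    for i in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]    # most frequent adjacent pair
        new_id = 256 + i
        merges[pair] = new_id
        out, j = [], 0                       # replace every occurrence
        while j < len(ids):
            if j + 1 < len(ids) and (ids[j], ids[j + 1]) == pair:
                out.append(new_id)
                j += 2
            else:
                out.append(ids[j])
                j += 1
        ids = out
    return ids, merges

data = "भारत एक महान देश है। भारत एक महान देश है।".encode("utf-8")
ids, merges = train_bpe(data, num_merges=20)

# Decoding is lossless: expand each merged token back into raw bytes.
vocab = {i: bytes([i]) for i in range(256)}
for (a, b), idx in merges.items():
    vocab[idx] = vocab[a] + vocab[b]
assert b"".join(vocab[t] for t in ids) == data
print(f"{len(data)} bytes -> {len(ids)} tokens")
```

The final assertion mirrors the 100% decoding accuracy claimed above: BPE merges are invertible by construction, so reconstruction is always exact.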
### Hindi-Specific Optimizations

- Explicit handling of the `\u0900-\u097F` (Devanagari) and `\u0980-\u09FF` (Bengali) Unicode blocks
- Optimized regex pattern for Hindi word boundaries
- JSON-based serialization for easy sharing
### Regex Pattern

```python
r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
```
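The `\p{L}`/`\p{N}` classes in the full pattern require the third-party `regex` module. A simplified stdlib-only sketch of the Devanagari branches shows how the pattern chunks text into word-like pieces before BPE runs on each piece:

```python
import re

# Simplified pattern: Devanagari runs, ASCII digit runs, other non-space
# runs, and whitespace. (The full pattern above additionally needs the
# `regex` module for the \p{L} and \p{N} Unicode property classes.)
pat = re.compile(r" ?[\u0900-\u097F]+| ?[0-9]+| ?[^\s\u0900-\u097F0-9]+|\s+")
print(pat.findall("भारत एक महान देश है।"))
# ['भारत', ' एक', ' महान', ' देश', ' है।']
```

Note that the leading ` ?` keeps the word-initial space attached to each chunk, so frequent words like ` महान` can become single tokens, space included.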
## Use Cases
- Text compression for Hindi documents
- Tokenization for Hindi NLP models
- Language model preprocessing
- Text analysis and statistics
- Educational purposes (understanding BPE)
## Requirements

- Python 3.13+
- `regex` library (for Unicode property support)
- `numpy` (optional, for numerical operations)
## ⚠️ Limitations
- Trained on specific corpus (may need retraining for domain-specific text)
- Best for Devanagari script (Hindi)
- Compression ratio varies by text type
- Not optimized for mixed Hindi-English text (compression drops to ~3.68X)
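Part of the mixed-text drop is mechanical rather than a tokenizer weakness: ASCII characters are 1 byte in UTF-8 versus 3 for Devanagari, so each token in mixed text covers fewer bytes even at the same token count. A rough illustration:

```python
# Bytes per character: pure Devanagari vs. mixed Hindi-English text.
hindi = "भारत महान देश"
mixed = "भारत is a great country"
for label, s in [("hindi", hindi), ("mixed", mixed)]:
    # Pure Devanagari averages close to 3 bytes/char; mixed text much less.
    print(label, len(s.encode("utf-8")) / len(s))
```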
## Troubleshooting
**Issue:** `ModuleNotFoundError: No module named 'regex'`

```bash
pip install regex
```
**Issue:** `hindi_bpe_tokenizer.json` not found

```bash
# Train the tokenizer first
python train_bpe_simple.py
```
**Issue:** `UnicodeDecodeError`

```python
# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```
## Citation
If you use this tokenizer in your research or project, please cite:
```bibtex
@misc{hindi_bpe_tokenizer_2025,
  title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ansul90/hindi-bpe-tokenizer}
}
```
## License
MIT License - See LICENSE file for details
## Acknowledgments
- Inspired by OpenAI's GPT-2 BPE implementation
- Built for the Hindi NLP community
- Based on the paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)
## Contact & Links
- Hugging Face: ansul90/hindi-bpe-tokenizer
- GitHub: Your GitHub link
- Email: Your email
धन्यवाद (Thank you) for using the Hindi BPE Tokenizer!