Hindi BPE Tokenizer

A Byte Pair Encoding (BPE) tokenizer optimized for Hindi text using Devanagari script.

🎯 Model Description

  • Vocabulary Size: 5,500 tokens
  • Compression Ratio: 6.52X average (up to 10.44X on technical text)
  • Training Corpus: 575K characters (1.5MB) of diverse Hindi text
  • Decoding Accuracy: 100%
  • Training Time: ~30 seconds

✨ Features

  • Hindi-Optimized: Devanagari Unicode prioritization (\u0900-\u097F)
  • High Compression: 6.52X average, up to 10.44X on technical text
  • Perfect Decoding: 100% accuracy in text reconstruction
  • Simple API: Easy encode/decode with compression stats
  • Fast Training: Train from scratch in ~30 seconds

📦 Installation

# Clone the repository
git clone https://huggingface.co/ansul90/hindi-bpe-tokenizer
cd hindi-bpe-tokenizer

# Install dependencies
pip install regex numpy
# Or with uv:
uv add regex numpy

🚀 Quick Start

Step 1: Train the Tokenizer

โš ๏ธ Note: This repository does not include the pre-trained model file (543MB). You need to train it once locally, which takes only ~30 seconds.

python train_bpe_simple.py

This will:

  • Load the Hindi corpus (included)
  • Train the BPE tokenizer
  • Generate hindi_bpe_tokenizer.json (~543MB)
  • Test on 8 Hindi samples
  • Display performance metrics
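In other words, the script does roughly the following. This is a sketch built only from the API shown elsewhere in this README; the actual train_bpe_simple.py may differ in its details:

from hindi_bpe_tokenizer import HindiBPETokenizer

# Train with the vocabulary size reported above
tokenizer = HindiBPETokenizer(vocab_size=5500)

# hindi_corpus.txt ships with the repository
with open('hindi_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

tokenizer.train(corpus, verbose=True)
tokenizer.save('hindi_bpe_tokenizer.json')

# A round trip should be lossless (the README reports 100% decoding accuracy)
sample = "भारत एक महान देश है।"
assert tokenizer.decode(tokenizer.encode(sample)) == sample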

Step 2: Use the Tokenizer

from hindi_bpe_tokenizer import HindiBPETokenizer

# Load trained tokenizer
tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')

# Encode Hindi text
text = "भारत एक महान देश है।"
tokens = tokenizer.encode(text)
print(f"Tokens: {tokens}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

# Get compression statistics
stats = tokenizer.get_compression_stats(text)
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")

📊 Performance Metrics

Metric               Value
------------------   --------------------------
Vocabulary Size      5,500 tokens
Compression Ratio    6.52X (avg), 10.44X (best)
Decoding Accuracy    100%
Training Corpus      575K chars, 1.5MB
Training Time        ~30 seconds

Test Results on Different Text Types

Category         Original Bytes   Compressed Tokens   Compression Ratio
--------------   --------------   -----------------   -----------------
Space Mission    204              31                   6.58X
Cricket News     146              27                   5.41X
Science & Tech   188              18                  10.44X
Language         123              18                   6.83X
Education        140              17                   8.24X
Environment      132              21                   6.29X
Mixed Content    125              34                   3.68X
Long Sentence    240              33                   7.27X

The compression ratio is original UTF-8 bytes divided by token count, e.g. 204 / 31 ≈ 6.58X for the Space Mission sample.
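These figures come from the 8 samples exercised by train_bpe_simple.py. You can produce the same kind of breakdown for your own texts with get_compression_stats; the sample strings below are illustrative, not the original test set:

from hindi_bpe_tokenizer import HindiBPETokenizer

tokenizer = HindiBPETokenizer()
tokenizer.load('hindi_bpe_tokenizer.json')

# Hypothetical sample texts, one per category of interest
samples = {
    "Language": "हिंदी भारत की राजभाषा है।",
    "Mixed Content": "AI का उपयोग healthcare में बढ़ रहा है।",
}
for name, text in samples.items():
    s = tokenizer.get_compression_stats(text)
    print(f"{name}: {s['original_bytes']} bytes -> "
          f"{s['compressed_tokens']} tokens ({s['compression_ratio']:.2f}X)")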

🔧 Advanced Usage

Custom Training

from hindi_bpe_tokenizer import HindiBPETokenizer

# Create tokenizer with custom vocabulary size
tokenizer = HindiBPETokenizer(vocab_size=8000)

# Load your custom Hindi corpus
with open('my_corpus.txt', 'r', encoding='utf-8') as f:
    corpus = f.read()

# Train
tokenizer.train(corpus, verbose=True)

# Save
tokenizer.save('my_custom_tokenizer.json')

Get Detailed Statistics

stats = tokenizer.get_compression_stats("हिंदी टेक्स्ट")
print(f"Original characters: {stats['original_chars']}")
print(f"Original bytes: {stats['original_bytes']}")
print(f"Compressed tokens: {stats['compressed_tokens']}")
print(f"Compression ratio: {stats['compression_ratio']:.2f}X")
print(f"Vocabulary size: {stats['vocab_size']:,}")

๐Ÿ“ Repository Structure

hindi-bpe-tokenizer/
├── hindi_bpe_tokenizer.py          # Core implementation (8KB)
├── train_bpe_simple.py             # Training script (5KB)
├── create_diverse_hindi_corpus.py  # Corpus generator (17KB)
├── hindi_corpus.txt                # Training data (1.5MB)
├── training_results.json           # Performance metrics (2KB)
├── pyproject.toml                  # Dependencies
└── README.md                       # This file

Note: hindi_bpe_tokenizer.json (543MB) is generated when you run train_bpe_simple.py

🎓 Training Data

The tokenizer was trained on diverse Hindi content including:

  • News: Cricket, space missions, current events
  • Science & Technology: विज्ञान, प्रौद्योगिकी vocabulary
  • Education & Environment: शिक्षा, पर्यावरण topics
  • Politics & Governance: राजनीति, संविधान terms
  • Daily Life: Common phrases, daily vocabulary
  • Complete Alphabet: All Devanagari letters, vowels, consonants
  • Numbers: Both Arabic (0-9) and Devanagari (०-९) numerals

🔬 Technical Details

BPE Algorithm

  1. Start with the 256 single-byte values as the base vocabulary
  2. Find the most frequent adjacent byte pair in the corpus
  3. Merge that pair into a new token
  4. Repeat until the target vocabulary size is reached (see the sketch below)
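For illustration, here is a minimal byte-level BPE training loop implementing the four steps above. It is a generic sketch of the algorithm, not the repository's implementation (which additionally pre-splits text with the regex shown below):

from collections import Counter

def get_pair_counts(ids):
    # Count adjacent token pairs (step 2)
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with `new_id` (step 3)
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    ids = list(text.encode('utf-8'))       # step 1: the 256 raw byte values
    merges = {}
    for new_id in range(256, vocab_size):  # step 4: repeat to target size
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = counts.most_common(1)[0][0]
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges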

Hindi-Specific Optimizations

  • Unicode block coverage: Devanagari (\u0900-\u097F); the regex also matches the adjacent Bengali block (\u0980-\u09FF)
  • Optimized regex pattern for Hindi word boundaries
  • JSON-based serialization for easy sharing

Regex Pattern

r""" ?[\u0900-\u097F]+| ?[\u0980-\u09FF]+| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

๐Ÿค Use Cases

  • Text compression for Hindi documents
  • Tokenization for Hindi NLP models
  • Language model preprocessing
  • Text analysis and statistics
  • Educational purposes (understanding BPE)

๐Ÿ“ Requirements

  • Python 3.13+
  • regex library (for Unicode support)
  • numpy (optional, for numerical operations)

โš ๏ธ Limitations

  • Trained on specific corpus (may need retraining for domain-specific text)
  • Best for Devanagari script (Hindi)
  • Compression ratio varies by text type
  • Not optimized for mixed Hindi-English text (compression drops to ~3.68X)

๐Ÿ› ๏ธ Troubleshooting

Issue: ModuleNotFoundError: No module named 'regex'

pip install regex

Issue: hindi_bpe_tokenizer.json not found

# Train the tokenizer first
python train_bpe_simple.py

Issue: UnicodeDecodeError

# Ensure files are read with UTF-8 encoding
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

📚 Citation

If you use this tokenizer in your research or project, please cite:

@misc{hindi_bpe_tokenizer_2025,
  title={Hindi BPE Tokenizer: Byte Pair Encoding for Devanagari Script},
  author={Your Name},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ansul90/hindi-bpe-tokenizer}
}

📄 License

MIT License - See LICENSE file for details

๐Ÿ™ Acknowledgments

  • Inspired by OpenAI's GPT-2 BPE implementation
  • Built for the Hindi NLP community
  • Based on the paper: "Neural Machine Translation of Rare Words with Subword Units" (Sennrich et al., 2016)

📧 Contact & Links


เคงเคจเฅเคฏเคตเคพเคฆ (Thank you) for using Hindi BPE Tokenizer! ๐Ÿ™
