# B2NL v6.1.2 - Byte-to-Natural Language Tokenizer
## Model Description
B2NL (Byte-to-Natural Language) v6.1.2 is a byte-level tokenizer that achieves high compression ratios purely by learning from raw bytes, without any predefined vocabulary.
## Key Features
- 18.6:1 average compression across 6 languages
- 100% reconstruction accuracy across all 6 languages
- 6 core languages: Korean, English, Chinese, Japanese, Spanish, Arabic
- 64-byte chunks for optimal processing (a minimal chunking sketch follows this list)
- Boundary learning system for intelligent grouping
- No vocabulary needed - pure byte-level processing
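The 64-byte chunking is easy to illustrate. Below is a minimal sketch of splitting a string's UTF-8 bytes into fixed-size chunks; the `chunk_bytes` helper is hypothetical and the released model's actual preprocessing may differ.

```python
def chunk_bytes(text: str, chunk_size: int = 64) -> list[bytes]:
    """Split a string's UTF-8 bytes into fixed-size chunks (64 bytes per the card).

    Boundaries may fall inside a multibyte character; that is acceptable here
    because the model operates on raw bytes, not code points.
    """
    data = text.encode("utf-8")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

print(chunk_bytes("안녕하세요, B2NL!"))  # Korean spans multiple bytes per character
```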
## Performance Metrics
| Language Type | Languages | Compression Ratio | Reconstruction |
|---|---|---|---|
| Isolating | Chinese | 39.0:1 | 100% |
| Agglutinative | Korean, Japanese | 26.5:1 | 100% |
| Fusional | English, Spanish | 5.4:1 | 100% |
| Semitic | Arabic | 12.3:1 | 100% |
| Average | 6 languages | 18.6:1 | 100% |
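As a worked example of what these ratios mean, assuming the ratio is defined as UTF-8 input bytes per compressed unit (my reading of the table, not a definition given in this card):

```python
def compression_ratio(text: str, num_units: int) -> float:
    # Assumed definition: UTF-8 input bytes per compressed unit emitted.
    return len(text.encode("utf-8")) / num_units

sample = "中" * 130                    # 130 CJK chars x 3 UTF-8 bytes = 390 bytes
print(compression_ratio(sample, 10))  # 39.0, matching the Chinese row above
```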
## Model Architecture
- Model Size: 301.7M parameters (lightweight!)
- Encoder: 5-layer transformer with progressive dimensions [768, 896, 1024, 1152, 1280]
- Decoder: 8-layer transformer with 1280d hidden size
- Cross-Attention: 20 heads for relational learning
- Training: 233 epochs on 6 languages
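For readers who want to map these numbers onto code, here is a rough PyTorch skeleton consistent with the figures above (progressive encoder widths, an 8-layer 1280d decoder, 20 attention heads). It is a sketch under stated assumptions, not the released implementation: the encoder head count, feed-forward sizes, and the linear projections between encoder widths are guesses, and the `B2NLSketch` class is hypothetical.

```python
import torch
import torch.nn as nn

class B2NLSketch(nn.Module):
    """Hypothetical skeleton matching the numbers above; not the released code."""

    def __init__(self):
        super().__init__()
        dims = [768, 896, 1024, 1152, 1280]  # progressive encoder widths (from the card)
        self.byte_embed = nn.Embedding(256, dims[0])  # one embedding per possible byte value
        # 5 encoder layers; the 8-head count here is an assumption
        self.encoder_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True) for d in dims
        )
        # linear projections bridge the width changes between encoder layers (assumed)
        self.projections = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(len(dims) - 1)
        )
        # 8-layer decoder at 1280d; 20 heads per the cross-attention figure
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=1280, nhead=20, batch_first=True),
            num_layers=8,
        )

    def forward(self, byte_ids: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
        x = self.byte_embed(byte_ids)              # (batch, seq, 768)
        for i, layer in enumerate(self.encoder_layers):
            x = layer(x)
            if i < len(self.projections):
                x = self.projections[i](x)         # widen for the next layer
        return self.decoder(tgt, x)                # cross-attend over encoder memory

model = B2NLSketch()
out = model(torch.randint(0, 256, (1, 64)), torch.randn(1, 8, 1280))
print(out.shape)  # torch.Size([1, 8, 1280])
```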
## Dataset
- Flores-200: Multilingual machine translation benchmark
- 6 languages in current release (Korean, English, Chinese, Japanese, Spanish, Arabic)
- Support for 204 languages coming soon
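If you want to inspect the kind of data involved, FLORES-200 is available on the Hugging Face Hub. A minimal sketch, assuming the `facebook/flores` dataset and its FLORES-200 language codes; the exact subsets and splits used for training are not specified in this card.

```python
from datasets import load_dataset

# FLORES-200 codes for the six languages in this release
codes = ["kor_Hang", "eng_Latn", "zho_Hans", "jpn_Jpan", "spa_Latn", "arb_Arab"]

# trust_remote_code may be needed on recent datasets versions
flores_ko = load_dataset("facebook/flores", codes[0], split="dev", trust_remote_code=True)
print(flores_ko[0]["sentence"])
```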
## Usage

```python
from huggingface_hub import hf_hub_download
import torch

# Download the checkpoint from the Hub
model_path = hf_hub_download(repo_id="ggunio/B2NL-v6.1.2", filename="pytorch_model.bin")

# Load the raw weights; the model class itself is custom
# (see the demo space for the full implementation)
state_dict = torch.load(model_path, map_location="cpu")
```
## Demo
Try the live demo: B2NL v6.1.2 Demo
## Training Data
The model was trained on text data from 6 languages:
- Korean (한국어)
- English
- Chinese (中文)
- Japanese (日本語)
- Spanish (Español)
- Arabic (العربية)
## Limitations
- Current version trained on 6 languages only
- Compression rates may vary with additional languages
- Requires custom implementation for inference
## Citation

```bibtex
@software{b2nl2025,
  title   = {B2NL: Byte-to-Natural-Language Tokenizer v6.1.2},
  author  = {Jinhyun Woo},
  year    = {2025},
  version = {6.1.2},
  note    = {18.6:1 compression, 100% reconstruction for 6 languages}
}
```
## 📬 Links
- GitHub: Repository
- Demo: Try it live
- Paper: Read on Zenodo | PDF
## License
Apache 2.0