B2NL v6.1.2 - Byte-to-Natural Language Tokenizer

Model Description

B2NL (Byte-to-Natural Language) v6.1.2 is a byte-level tokenizer that learns to group raw bytes into tokens directly from data, reaching high compression ratios without any predefined vocabulary.

Key Features

  • 18.6:1 average compression ratio across 6 languages
  • 100% reconstruction accuracy on all 6 languages
  • 6 core languages: Korean, English, Chinese, Japanese, Spanish, Arabic
  • Fixed 64-byte input chunks (see the sketch after this list)
  • Learned boundary system for grouping bytes into tokens
  • No vocabulary needed: pure byte-level processing
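
A minimal sketch of the byte-level front end these features imply, assuming straightforward UTF-8 chunking; chunk_bytes is an illustrative helper, not the model's actual preprocessing code:

CHUNK_SIZE = 64  # fixed chunk size from the feature list above

def chunk_bytes(text: str, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    # Encode to UTF-8 and split into fixed-size byte chunks; no vocabulary
    # lookup happens at any point.
    data = text.encode("utf-8")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

print([len(c) for c in chunk_bytes("B2NL consumes raw bytes; no vocabulary file is needed.")])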

Performance Metrics

| Language Type | Languages        | Compression Ratio | Reconstruction |
|---------------|------------------|-------------------|----------------|
| Isolating     | Chinese          | 39.0:1            | 100%           |
| Agglutinative | Korean, Japanese | 26.5:1            | 100%           |
| Fusional      | English, Spanish | 5.4:1             | 100%           |
| Semitic       | Arabic           | 12.3:1            | 100%           |
| Average       | 6 languages      | 18.6:1            | 100%           |
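
For context, the compression ratio here is most naturally read as UTF-8 input bytes per emitted token; the helper below encodes that assumed definition (the card does not spell out the measurement protocol):

def compression_ratio(text: str, num_tokens: int) -> float:
    # Assumed definition: UTF-8 bytes in, divided by tokens out.
    return len(text.encode("utf-8")) / num_tokens

print(compression_ratio("压缩比示例", 1))  # 5 CJK chars = 15 bytes -> 15.0 for a single token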

Model Architecture

  • Model Size: 301.7M parameters (a config sketch follows this list)
  • Encoder: 5-layer transformer with progressive dimensions [768, 896, 1024, 1152, 1280]
  • Decoder: 8-layer transformer with 1280d hidden size
  • Cross-Attention: 20 heads for relational learning
  • Training: 233 epochs on 6 languages
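
A plain-Python configuration sketch of the numbers above; the class and field names are hypothetical, only the values come from this card:

from dataclasses import dataclass

@dataclass
class B2NLConfig:  # hypothetical name; values taken from the list above
    encoder_dims: tuple = (768, 896, 1024, 1152, 1280)  # 5 progressive encoder layers
    decoder_layers: int = 8
    decoder_hidden: int = 1280
    cross_attn_heads: int = 20  # 1280 / 20 = 64-dim heads
    chunk_size: int = 64        # bytes per input chunk

print(B2NLConfig())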

Dataset

  • Flores-200: Multilingual machine translation benchmark
  • 6 languages in current release (Korean, English, Chinese, Japanese, Spanish, Arabic)
  • Support for 204 languages is planned (data-loading sketch below)
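
To experiment with the same benchmark, Flores-200 can be pulled from the Hub; the dataset id, language configs, and split below are my assumptions about the setup, since the card does not specify them:

from datasets import load_dataset

# FLORES-200 codes for the six languages (assumed; the card names the
# languages but not the exact configs).
LANGS = ["kor_Hang", "eng_Latn", "zho_Hans", "jpn_Jpan", "spa_Latn", "arb_Arab"]

for lang in LANGS:
    ds = load_dataset("facebook/flores", lang, split="dev", trust_remote_code=True)
    print(lang, ds.num_rows)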

Usage

from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(repo_id="ggunio/B2NL-v6.1.2", filename="pytorch_model.bin")

# Load model (requires custom model class)
# See the demo space for full implementation
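
Until the custom model class ships as a package, the download can still be sanity-checked by loading the file as a plain state dict (assuming pytorch_model.bin is a standard PyTorch state dict):

state_dict = torch.load(model_path, map_location="cpu")
print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))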

Demo

Try the live demo: B2NL v6.1.2 Demo

Training Data

The model was trained on text data from 6 languages:

  • Korean (한국어)
  • English
  • Chinese (中文)
  • Japanese (日本語)
  • Spanish (Español)
  • Arabic (العربية)

Limitations

  • Current version trained on 6 languages only
  • Compression rates may vary with additional languages
  • Requires custom implementation for inference

Citation

@software{b2nl2025,
  title = {B2NL: Byte-to-Natural-Language Tokenizer v6.1.2},
  author = {Jinhyun Woo},
  year = {2025},
  version = {6.1.2},
  note = {18.6:1 compression, 100% reconstruction for 6 languages}
}

License

Apache 2.0
