B2NL v6.1.2 - Byte-to-Natural Language Tokenizer

Model Description

B2NL (Byte-to-Natural Language) v6.1.2 is a byte-level tokenizer that learns to group raw bytes into tokens directly from data, reaching high compression ratios without any predefined vocabulary.

Key Features

  • 18.6:1 average compression ratio across 6 languages
  • 100% reconstruction accuracy on all 6 languages
  • 6 core languages: Korean, English, Chinese, Japanese, Spanish, Arabic
  • Fixed 64-byte input chunks (see the sketch after this list)
  • Learned boundary system for grouping bytes into tokens
  • No vocabulary needed: pure byte-level processing
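
A minimal sketch of the byte-level front end these features imply, assuming straightforward UTF-8 chunking; chunk_bytes is an illustrative helper, not the model's actual preprocessing code:

CHUNK_SIZE = 64  # fixed chunk size from the feature list above

def chunk_bytes(text: str, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    # Encode to UTF-8 and split into fixed-size byte chunks; no vocabulary
    # lookup happens at any point.
    data = text.encode("utf-8")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

print([len(c) for c in chunk_bytes("B2NL consumes raw bytes; no vocabulary file is needed.")])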

Performance Metrics

| Language Type | Languages        | Compression Ratio | Reconstruction |
|---------------|------------------|-------------------|----------------|
| Isolating     | Chinese          | 39.0:1            | 100%           |
| Agglutinative | Korean, Japanese | 26.5:1            | 100%           |
| Fusional      | English, Spanish | 5.4:1             | 100%           |
| Semitic       | Arabic           | 12.3:1            | 100%           |
| Average       | 6 languages      | 18.6:1            | 100%           |
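
For context, the compression ratio here is most naturally read as UTF-8 input bytes per emitted token; the helper below encodes that assumed definition (the card does not spell out the measurement protocol):

def compression_ratio(text: str, num_tokens: int) -> float:
    # Assumed definition: UTF-8 bytes in, divided by tokens out.
    return len(text.encode("utf-8")) / num_tokens

print(compression_ratio("压缩比示例", 1))  # 5 CJK chars = 15 bytes -> 15.0 for a single token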

Model Architecture

  • Model Size: 301.7M parameters (a config sketch follows this list)
  • Encoder: 5-layer transformer with progressive dimensions [768, 896, 1024, 1152, 1280]
  • Decoder: 8-layer transformer with 1280d hidden size
  • Cross-Attention: 20 heads for relational learning
  • Training: 233 epochs on 6 languages
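
A plain-Python configuration sketch of the numbers above; the class and field names are hypothetical, only the values come from this card:

from dataclasses import dataclass

@dataclass
class B2NLConfig:  # hypothetical name; values taken from the list above
    encoder_dims: tuple = (768, 896, 1024, 1152, 1280)  # 5 progressive encoder layers
    decoder_layers: int = 8
    decoder_hidden: int = 1280
    cross_attn_heads: int = 20  # 1280 / 20 = 64-dim heads
    chunk_size: int = 64        # bytes per input chunk

print(B2NLConfig())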

Dataset

  • Flores-200: Multilingual machine translation benchmark
  • 6 languages in current release (Korean, English, Chinese, Japanese, Spanish, Arabic)
  • Support for 204 languages is planned (data-loading sketch below)
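
To experiment with the same benchmark, Flores-200 can be pulled from the Hub; the dataset id, language configs, and split below are my assumptions about the setup, since the card does not specify them:

from datasets import load_dataset

# FLORES-200 codes for the six languages (assumed; the card names the
# languages but not the exact configs).
LANGS = ["kor_Hang", "eng_Latn", "zho_Hans", "jpn_Jpan", "spa_Latn", "arb_Arab"]

for lang in LANGS:
    ds = load_dataset("facebook/flores", lang, split="dev", trust_remote_code=True)
    print(lang, ds.num_rows)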

Usage

from huggingface_hub import hf_hub_download
import torch

# Download model
model_path = hf_hub_download(repo_id="ggunio/B2NL-v6.1.2", filename="pytorch_model.bin")

# Load model (requires custom model class)
# See the demo space for full implementation
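
Until the custom model class ships as a package, the download can still be sanity-checked by loading the file as a plain state dict (assuming pytorch_model.bin is a standard PyTorch state dict):

state_dict = torch.load(model_path, map_location="cpu")
print(f"{len(state_dict)} tensors in checkpoint")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))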

Demo

Try the live demo: B2NL v6.1.2 Demo

Training Data

The model was trained on text data from 6 languages:

  • Korean (한국어)
  • English
  • Chinese (中文)
  • Japanese (日本語)
  • Spanish (Español)
  • Arabic (العربية)

Limitations

  • Current version trained on 6 languages only
  • Compression rates may vary with additional languages
  • Requires custom implementation for inference

Citation

@software{b2nl2025,
  title = {B2NL: Byte-to-Natural-Language Tokenizer v6.1.2},
  author = {Jinhyun Woo},
  year = {2025},
  version = {6.1.2},
  note = {18.6:1 compression, 100% reconstruction for 6 languages}
}

License

Apache 2.0
