Basic Tokenizers Collection
This repository contains three different tokenizers trained and wrapped for experimentation and educational purposes:
Contents
WordPiece Tokenizer
Path: ByteMeHarder-404/tokenizers/wordpiece
Classic subword tokenizer (used in BERT). Splits words into subword units based on frequency, ensuring full coverage with a compact vocab.

Byte-Pair Encoding (BPE) Tokenizer
Path: ByteMeHarder-404/tokenizers/bpe
Uses byte-level BPE, similar to GPT-2 and RoBERTa. Handles any UTF-8 character without unknown tokens by working directly on bytes.

XLNet-Style Tokenizer
Path: ByteMeHarder-404/tokenizers/xlnet
Follows the XLNet tokenization approach, using SentencePiece-style (Unigram) subword segmentation.
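For orientation, the sketch below shows how the three underlying models are typically instantiated with the Hugging Face tokenizers library before training. It is illustrative only and not necessarily the exact configuration used to build the tokenizers in this repo.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, Unigram, WordPiece
from tokenizers.pre_tokenizers import ByteLevel, Whitespace

# WordPiece: greedy longest-match subword segmentation (BERT-style);
# an [UNK] token is needed for sequences outside the learned vocab.
wp = Tokenizer(WordPiece(unk_token="[UNK]"))
wp.pre_tokenizer = Whitespace()

# Byte-level BPE: merges are learned over raw bytes, so every UTF-8 string
# is representable and no unknown token is required (GPT-2/RoBERTa-style).
bpe = Tokenizer(BPE())
bpe.pre_tokenizer = ByteLevel(add_prefix_space=False)

# Unigram: probabilistic segmentation as used by SentencePiece/XLNet.
uni = Tokenizer(Unigram())
```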
Usage
You can load each tokenizer with transformers:
```python
from transformers import PreTrainedTokenizerFast

# The three tokenizers are stored as subfolders of the ByteMeHarder-404/tokenizers repo.
# WordPiece
tok_wordpiece = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers", subfolder="wordpiece")
# BPE
tok_bpe = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers", subfolder="bpe")
# XLNet-style
tok_xlnet = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers", subfolder="xlnet")
```
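Once loaded, all three behave like any other fast tokenizer. A quick sanity check could look like the following; the exact tokens printed depend on each tokenizer's trained vocabulary.

```python
text = "Tokenizers split text into subword units."

for name, tok in [("wordpiece", tok_wordpiece), ("bpe", tok_bpe), ("xlnet", tok_xlnet)]:
    enc = tok(text)  # returns input_ids, attention_mask, ...
    print(name, tok.convert_ids_to_tokens(enc["input_ids"]))
    print(name, tok.decode(enc["input_ids"]))
```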
Notes
- These tokenizers are minimal examples; no pretrained model weights or embeddings ship with them.
- Intended for experimentation, educational purposes, and as a foundation for building custom models.
- You can extend them by training a new vocabulary on your own dataset, as sketched below.
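As a rough starting point, the sketch below trains a fresh byte-level BPE vocabulary with the tokenizers library and wraps it for use with transformers. The corpus file name, vocabulary size, and special tokens are placeholders to adapt to your dataset.

```python
from tokenizers import Tokenizer, decoders
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer
from transformers import PreTrainedTokenizerFast

# Train a fresh byte-level BPE vocabulary on your own corpus.
tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()
trainer = BpeTrainer(vocab_size=8000, special_tokens=["<unk>", "<pad>"])  # placeholder settings
tokenizer.train(files=["my_corpus.txt"], trainer=trainer)  # placeholder corpus file

# Wrap it so it loads with transformers, like the tokenizers above.
fast_tok = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>", pad_token="<pad>")
fast_tok.save_pretrained("my-custom-tokenizer")
```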