Basic Tokenizers Collection

This repository contains three different tokenizers trained and wrapped for experimentation and educational purposes:

πŸ“¦ Contents

  • WordPiece Tokenizer
    Path: ByteMeHarder-404/tokenizers/wordpiece
    Classic subword tokenizer of the kind used in BERT. Splits rare words into smaller subword units learned during training, so a compact vocab still covers most text (anything outside it falls back to the unknown token).

  • Byte-Pair Encoding (BPE) Tokenizer
    Path: ByteMeHarder-404/tokenizers/bpe
    Uses byte-level BPE, similar to GPT-2 and RoBERTa. Handles any UTF-8 character without unknown tokens by working directly on bytes.

  • XLNet-Style Tokenizer
    Path: ByteMeHarder-404/tokenizers/xlnet
    Follows the XLNet tokenization approach, using SentencePiece-style Unigram segmentation.

πŸš€ Usage

You can load each tokenizer with transformers:

from transformers import PreTrainedTokenizerFast

# The three tokenizers live in subfolders of the ByteMeHarder-404/tokenizers repo,
# so each one is loaded with the subfolder argument.

# WordPiece
tok_wordpiece = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers", subfolder="wordpiece")

# BPE
tok_bpe = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers", subfolder="bpe")

# XLNet-style
tok_xlnet = PreTrainedTokenizerFast.from_pretrained("ByteMeHarder-404/tokenizers", subfolder="xlnet")
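
Once loaded, the three tokenizers can be compared on the same sentence. This is a minimal sketch; the example string is arbitrary and the exact pieces depend on each trained vocabulary:

text = "Tokenization splits text into smaller pieces."

for name, tok in [("wordpiece", tok_wordpiece), ("bpe", tok_bpe), ("xlnet", tok_xlnet)]:
    ids = tok.encode(text)                      # token ids (may include special tokens)
    pieces = tok.convert_ids_to_tokens(ids)     # the underlying subword pieces
    print(name, pieces)
    print(name, tok.decode(ids, skip_special_tokens=True))  # round-trip back to text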

πŸ“š Notes

  • These tokenizers are minimal examples; no model weights or embeddings are shipped with them.
  • They are intended for experimentation, teaching, and as a foundation for building custom models.
  • You can extend them by training a new vocabulary on your own dataset; a minimal sketch follows below.
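
For example, a fresh byte-level BPE vocabulary can be trained with the tokenizers library and wrapped for transformers. This is only a sketch under assumed names (corpus.txt as the training file, my_bpe_tokenizer as the output directory) and an arbitrary vocabulary size:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Build a byte-level BPE tokenizer and train it on a plain-text corpus.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["<s>", "</s>", "<unk>"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt: one text per line (assumed)

# Wrap it so it can be used and saved like the tokenizers above.
wrapped = PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>")
wrapped.save_pretrained("my_bpe_tokenizer")

The same pattern applies to the other two styles by swapping in models.WordPiece with trainers.WordPieceTrainer, or models.Unigram with trainers.UnigramTrainer for the XLNet-style setup.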