Custom WordPiece Tokenizer (Trained on WikiText-103 Raw v1)

Model Overview

This repository contains a custom WordPiece-based tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.

Key Features:

  • Custom [CLS] and [SEP] special tokens.
  • WordPiece subword segmentation with ## prefix for subwords.
  • Template-based post-processing for both single and paired sequences.
  • WordPiece decoder that merges ##-prefixed pieces back into whole words when decoding. Because lowercasing is enabled, decoded text is lowercase rather than an exact copy of the input.
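
To illustrate the ## convention, here is a minimal pure-Python sketch of greedy longest-match segmentation and the matching join step. It is not the library implementation; the function names and the toy vocabulary are invented for this example, and the real splits depend on the trained 60,000-entry vocabulary.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def wordpiece_join(pieces):
    """Simplified decoding: glue ## pieces to the previous piece."""
    text = ""
    for piece in pieces:
        if piece.startswith("##"):
            text += piece[2:]
        elif text:
            text += " " + piece
        else:
            text = piece
    return text


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "word", "##piece"}

pieces = wordpiece_split("tokenizers", toy_vocab)
print(pieces)                  # ['token', '##izer', '##s']
print(wordpiece_join(pieces))  # tokenizers
```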

Training Details

Dataset

  • Name: WikiText-103 Raw v1
  • Source: Long-form Wikipedia articles drawn from the set rated Good or Featured.
  • Split Used: train
  • Size: ~103 million tokens
  • Loading Method: Streaming mode for efficient large-scale training without local storage bottlenecks.

Tokenizer Configuration

  • Model Type: WordPiece
  • Vocabulary Size: 60,000 (medium-scale for general-purpose LLMs)
  • Lowercasing: Enabled
  • Special Tokens:
    • [CLS]: Classification token
    • [SEP]: Separator token
    • [UNK]: Unknown token
    • [PAD]: Padding token
    • [MASK]: Masking token (MLM tasks)
  • Post-Processing Template:
    • Single Sequence: [CLS] $A [SEP]
    • Paired Sequences: [CLS] $A [SEP] $B [SEP]
  • Decoder: WordPiece decoder with ## prefix handling.
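
The configuration above could be assembled with the Hugging Face tokenizers library roughly as follows. This is a sketch under assumptions, not the author's training script: the `Whitespace` pre-tokenizer, the helper name `build_and_train`, and the toy corpus are not stated in this card and are chosen only so the example runs quickly.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing
from tokenizers.trainers import WordPieceTrainer

SPECIAL_TOKENS = ["[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]"]


def build_and_train(corpus, vocab_size=60_000):
    """Hypothetical helper: configure a WordPiece tokenizer and train it."""
    tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
    tokenizer.normalizer = normalizers.Lowercase()         # Lowercasing: Enabled
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption, not in the card
    tokenizer.decoder = decoders.WordPiece(prefix="##")

    trainer = WordPieceTrainer(vocab_size=vocab_size, special_tokens=SPECIAL_TOKENS)
    tokenizer.train_from_iterator(corpus, trainer=trainer)

    # The template needs the special-token IDs, which exist only after training.
    tokenizer.post_processor = TemplateProcessing(
        single="[CLS] $A [SEP]",
        pair="[CLS] $A [SEP] $B [SEP]",
        special_tokens=[
            ("[CLS]", tokenizer.token_to_id("[CLS]")),
            ("[SEP]", tokenizer.token_to_id("[SEP]")),
        ],
    )
    return tokenizer


# Toy corpus so the sketch runs in seconds; the real corpus is WikiText-103.
toy_corpus = ["The quick brown fox jumps over the lazy dog."] * 20
tokenizer = build_and_train(toy_corpus, vocab_size=200)

single = tokenizer.encode("The quick fox")
pair = tokenizer.encode("The quick fox", "The lazy dog")
print(single.tokens)  # starts with [CLS], ends with [SEP]
print(pair.tokens)    # [CLS] ... [SEP] ... [SEP]
```

The post-processor is attached after training because `token_to_id` returns `None` for tokens that are not yet in the vocabulary.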

Training Method

  • Corpus Source: Streaming iterator from WikiText-103 Raw v1 (train split)
  • Batch Size: 1000 lines per batch
  • Trainer: WordPieceTrainer from Hugging Face tokenizers library
  • Special Tokens Added: [CLS], [SEP], [UNK], [PAD], [MASK]
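
The batching step can be sketched as a plain generator that groups streamed records into lists of 1000 lines. The function name, the `text` field name, and the choice to skip blank lines are assumptions for illustration; the card does not say how blank lines were handled.

```python
def batch_iterator(records, batch_size=1000, text_key="text"):
    """Yield lists of up to `batch_size` non-empty text lines."""
    batch = []
    for record in records:
        line = record[text_key].strip()
        if not line:
            continue  # assumption: blank lines are skipped
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# Stand-in for a streamed dataset: any iterable of {"text": ...} records works.
fake_stream = ({"text": f"line {i}"} for i in range(2500))
sizes = [len(batch) for batch in batch_iterator(fake_stream)]
print(sizes)  # [1000, 1000, 500]
```

With the real corpus, `records` would presumably come from the datasets library, for example `load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)` (the exact dataset identifier is an assumption), and the generator would be passed to `Tokenizer.train_from_iterator`, which accepts an iterator of strings or of lists of strings.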

Intended Uses & Limitations

Intended Uses

  • Tokenizing text for training Transformer-based language models.
  • Downstream NLP tasks:
    • Language modeling
    • Text classification
    • Question answering
    • Summarization

Limitations

  • Trained exclusively on English Wikipedia text; performance may degrade on informal, domain-specific, or multilingual text.
  • May inherit biases present in Wikipedia data.

License

This tokenizer is released under the MIT License.


Citation

If you use this tokenizer, please cite:

title = Custom WordPiece Tokenizer Trained on WikiText-103 Raw v1
author = yakul259
year = 2025
publisher = Hugging Face
