Custom WordPiece Tokenizer (Trained on WikiText-103 Raw v1)

Model Overview

This repository contains a custom WordPiece-based tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.

Key Features:

Custom [CLS] and [SEP] special tokens.
WordPiece subword segmentation, with a ## prefix marking pieces that continue a word.
Template-based post-processing for both single and paired sequences.
WordPiece decoder that merges ##-prefixed pieces back into whole words when decoding. Because lowercasing is enabled, the original casing is not restored.
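
To illustrate the ## convention, here is a minimal pure-Python sketch of greedy longest-match-first segmentation and the matching decode step. It is not the code used by this tokenizer: the function names and the toy vocabulary are hypothetical, and it assumes the word has already been lowercased and split on whitespace.

```python
def wordpiece_segment(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into WordPiece tokens by greedy longest match."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, shrinking from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def wordpiece_decode(tokens, prefix="##"):
    """Merge prefixed pieces back into whitespace-separated words."""
    words = []
    for token in tokens:
        if token.startswith(prefix) and words:
            words[-1] += token[len(prefix):]
        else:
            words.append(token)
    return " ".join(words)


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"token", "##izer", "##s", "un", "##known"}
pieces = wordpiece_segment("tokenizers", toy_vocab)
print(pieces)                    # ['token', '##izer', '##s']
print(wordpiece_decode(pieces))  # tokenizers
```

A word with no matching pieces, such as "xyz" under this toy vocabulary, maps to the single token [UNK].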

Training Details

Dataset

Name: WikiText-103 Raw v1
Source: Wikipedia articles rated Good or Featured (long-form, editorially reviewed text).
Split Used: train
Size: ~103 million tokens
Loading Method: Streaming mode, so the corpus is read incrementally during training instead of being downloaded in full first.

Tokenizer Configuration

Model Type: WordPiece
Vocabulary Size: 60,000 (medium-scale for general-purpose LLMs)
Lowercasing: Enabled
Special Tokens:
- [CLS] — Classification token
- [SEP] — Separator token
- [UNK] — Unknown token
- [PAD] — Padding token
- [MASK] — Masking token (MLM tasks)
Post-Processing Template:
- Single Sequence: [CLS] $A [SEP]
- Paired Sequences: [CLS] $A [SEP] $B [SEP]
Decoder: WordPiece decoder with ## prefix handling.
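
The configuration above could be assembled with the Hugging Face tokenizers library roughly as follows. This is a sketch under assumptions, not the author's script: the toy vocabulary and its token IDs are hypothetical stand-ins for the 60,000-entry vocabulary, and the whitespace pre-tokenizer is assumed because the card does not name one.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.processors import TemplateProcessing

# Hypothetical toy vocabulary; the released tokenizer has 60,000 entries.
toy_vocab = {
    "[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "[MASK]": 4,
    "custom": 5, "token": 6, "##izer": 7, "##s": 8,
}

tokenizer = Tokenizer(WordPiece(vocab=toy_vocab, unk_token="[UNK]"))
tokenizer.normalizer = normalizers.Lowercase()         # Lowercasing: Enabled
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()  # assumption, not stated in the card

# Register the special tokens so decoding can skip them.
tokenizer.add_special_tokens(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])

# Templates copied from the card; IDs come from the toy vocabulary.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[("[CLS]", toy_vocab["[CLS]"]), ("[SEP]", toy_vocab["[SEP]"])],
)
tokenizer.decoder = decoders.WordPiece(prefix="##")

single = tokenizer.encode("Custom tokenizers")
pair = tokenizer.encode("Custom", "tokenizers")
decoded = tokenizer.decode(single.ids)  # special tokens are skipped by default

print(single.tokens)
print(pair.tokens)
print(decoded)
```

Written as `$B`, the second segment keeps type ID 0. BERT-style setups often write `$B:1 [SEP]:1` to give the second segment its own type ID, and the card does not say which form was used.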

Training Method

Corpus Source: Streaming iterator from WikiText-103 Raw v1 (train split)
Batch Size: 1000 lines per batch
Trainer: WordPieceTrainer from Hugging Face tokenizers library
Special Tokens Added: [CLS], [SEP], [UNK], [PAD], [MASK]
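
The batching step can be sketched as a small generator that groups streamed rows into lists of 1000 lines. The helper name `batch_iterator`, the `text` field name, and the choice to skip blank lines are assumptions for illustration, since the card does not show the original training script.

```python
from itertools import islice


def batch_iterator(rows, batch_size=1000, text_key="text"):
    """Yield lists of up to `batch_size` non-blank text lines from an iterable of rows."""
    # Assumption: blank lines are dropped before batching.
    lines = (row[text_key] for row in rows if row[text_key].strip())
    while True:
        batch = list(islice(lines, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a streamed dataset: any iterable of {"text": ...} rows works.
demo_rows = [{"text": f"line {i}"} for i in range(2500)] + [{"text": "   "}]
batch_sizes = [len(batch) for batch in batch_iterator(demo_rows)]
print(batch_sizes)  # [1000, 1000, 500]
```

In a real run, `rows` would plausibly be the streamed train split, for example from `datasets.load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)`. The generator would then be passed to `Tokenizer.train_from_iterator(batch_iterator(rows), trainer=trainer)`, where `trainer` is a `WordPieceTrainer` configured with the vocabulary size and special tokens listed above.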

Intended Uses & Limitations

Intended Uses

Tokenizing text for training Transformer-based language models.
Downstream NLP tasks:
- Language modeling
- Text classification
- Question answering
- Summarization

Limitations

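Trained exclusively on English Wikipedia text, so informal, domain-specific, or non-English input is likely to be split into more and shorter pieces, or mapped to [UNK].
Lowercasing is enabled, so case distinctions in the input are lost.
May inherit biases present in Wikipedia data.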

License

This tokenizer is released under the MIT License.

Citation

If you use this tokenizer, please cite:

title = Custom WordPiece Tokenizer Trained on WikiText-103 Raw v1
author = yakul259
year = 2025
publisher = Hugging Face

