Custom WordPiece Tokenizer (Trained on WikiText-103 Raw v1)
Model Overview
This repository contains a custom WordPiece-based tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.
Key Features:
- Custom [CLS] and [SEP] special tokens.
- WordPiece subword segmentation with the ## prefix for subwords.
- Template-based post-processing for both single and paired sequences.
- Configured decoding using the WordPiece decoder, which merges ## pieces back into whole words (original casing is not recovered, because lowercasing is enabled).
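To make the ## convention concrete, here is a minimal pure-Python sketch of the greedy longest-match-first segmentation commonly used by WordPiece tokenizers at inference time. It assumes this tokenizer segments words the same way. The function name `wordpiece_split` and the toy vocabulary are hypothetical and are not taken from this repository.

```python
def wordpiece_split(word, vocab, unk_token="[UNK]", prefix="##"):
    """Greedy longest-match-first segmentation of a single word (illustrative)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word maps to unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary; the real vocabulary has 60,000 entries.
toy_vocab = {"token", "##izer", "##s"}
print(wordpiece_split("tokenizers", toy_vocab))  # ['token', '##izer', '##s']
```

A word with no matching pieces, such as "xyz" with this toy vocabulary, comes back as a single [UNK] token.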
Training Details
Dataset
- Name: WikiText-103 Raw v1
- Source: Verified Good and Featured articles from English Wikipedia.
- Split Used: train
- Size: ~103 million tokens
- Loading Method: Streaming mode for efficient large-scale training without local storage bottlenecks.
Tokenizer Configuration
- Model Type: WordPiece
- Vocabulary Size: 60,000 (medium-scale for general-purpose LLMs)
- Lowercasing: Enabled
- Special Tokens:
  - [CLS]: Classification token
  - [SEP]: Separator token
  - [UNK]: Unknown token
  - [PAD]: Padding token
  - [MASK]: Masking token (MLM tasks)
- Post-Processing Template:
  - Single Sequence: [CLS] $A [SEP]
  - Paired Sequences: [CLS] $A [SEP] $B [SEP]
- Decoder: WordPiece decoder with ## prefix handling.
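The post-processing template and the decoder can be illustrated with a small pure-Python sketch. It only mimics the behavior described above and is not the library implementation. The helper names `apply_template` and `decode_wordpiece` are hypothetical, and the set of special tokens skipped during decoding is an assumption.

```python
def apply_template(tokens_a, tokens_b=None):
    """Wrap one or two token sequences as [CLS] $A [SEP] (and $B [SEP] for pairs)."""
    out = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    if tokens_b is not None:
        out += list(tokens_b) + ["[SEP]"]
    return out


def decode_wordpiece(tokens, prefix="##", skip=("[CLS]", "[SEP]", "[PAD]")):
    """Join WordPiece tokens back into text by gluing prefixed pieces to the previous word."""
    words = []
    for tok in tokens:
        if tok in skip:
            continue  # assumed: structural special tokens are dropped when decoding
        if tok.startswith(prefix) and words:
            words[-1] += tok[len(prefix):]  # continuation piece: append without a space
        else:
            words.append(tok)
    return " ".join(words)


single = apply_template(["token", "##izer", "##s", "work"])
pair = apply_template(["hello"], ["world"])
print(single)  # ['[CLS]', 'token', '##izer', '##s', 'work', '[SEP]']
print(pair)    # ['[CLS]', 'hello', '[SEP]', 'world', '[SEP]']
print(decode_wordpiece(single))  # tokenizers work
```

Because lowercasing is applied before segmentation, decoding returns lowercased text rather than the exact input string.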
Training Method
- Corpus Source: Streaming iterator from WikiText-103 Raw v1 (train split)
- Batch Size: 1000 lines per batch
- Trainer: WordPieceTrainer from the Hugging Face tokenizers library
- Special Tokens Added: [CLS], [SEP], [UNK], [PAD], [MASK]
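The batching step described above might look like the following generator, which groups a stream of examples into lists of 1000 lines. This is a sketch under assumptions: the function name `batch_iterator` is hypothetical, and the `text` field name is assumed to match the dataset's column. The demonstration uses a small in-memory list so it runs without downloading anything.

```python
def batch_iterator(dataset, batch_size=1000, text_key="text"):
    """Yield lists of up to `batch_size` text lines from an iterable of examples."""
    batch = []
    for example in dataset:
        batch.append(example[text_key])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


# Stand-in for a streamed dataset: any iterable of {"text": ...} records works.
rows = [{"text": f"line {i}"} for i in range(5)]
batches = list(batch_iterator(rows, batch_size=2))
print(batches)  # [['line 0', 'line 1'], ['line 2', 'line 3'], ['line 4']]
```

With the Hugging Face libraries installed, the same generator could be fed a dataset opened with `load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)` and passed to `Tokenizer.train_from_iterator` together with a `WordPieceTrainer` configured with `vocab_size=60000` and the five special tokens. The dataset identifier shown here is an assumption, since this card does not state the exact Hub ID used.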
Intended Uses & Limitations
Intended Uses
- Tokenizing text for training Transformer-based language models.
- Downstream NLP tasks:
  - Language modeling
  - Text classification
  - Question answering
  - Summarization
Limitations
- Trained exclusively on English Wikipedia text; performance may degrade in informal, domain-specific, or multilingual contexts.
- May inherit biases present in Wikipedia data.
License
This tokenizer is released under the MIT License.
Citation
If you use this tokenizer, please cite:
title = Custom WordPiece Tokenizer Trained on WikiText-103 Raw v1
author = yakul259
year = 2025
publisher = Hugging Face