# Geez Tokenizer (Hailay/geez-tokenizer)
A BPE tokenizer trained specifically for Geez-script languages, including Tigrinya and Amharic. It is trained on monolingual corpora and targets morphologically rich, low-resource languages.
## Motivation
Byte-Pair Encoding (BPE) tokenizers trained on English or Latin-script languages often fail to tokenize Geez-script languages efficiently. This tokenizer aims to:
- Reduce over-segmentation errors
- Respect morpheme boundaries
- Improve language understanding for downstream tasks like Machine Translation and QA
## Training Details
- Tokenizer Type: BPE
- Vocabulary Size: 32,000
- Pre-tokenizer: Whitespace
- Normalizer: NFD → Lowercase → StripAccents
- Special Tokens: `[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`
- Post-processing: templates `[CLS] $A [SEP]` (single sentence) and `[CLS] $A [SEP] $B [SEP]` (sentence pair)
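
A tokenizer with this configuration can be assembled with the Hugging Face `tokenizers` library. The sketch below is illustrative rather than the original training script, and `corpus.txt` is a placeholder for the monolingual training corpus:

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, processors, trainers

# BPE model with [UNK] as the unknown token
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Normalization pipeline from the list above: NFD -> Lowercase -> StripAccents
tokenizer.normalizer = normalizers.Sequence(
    [normalizers.NFD(), normalizers.Lowercase(), normalizers.StripAccents()]
)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on a monolingual corpus ("corpus.txt" is a placeholder path)
trainer = trainers.BpeTrainer(
    vocab_size=32000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Post-processing templates for single sentences and sentence pairs
tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B [SEP]",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ],
)
tokenizer.save("tokenizer.json")
```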
## Files
- `vocab.json`: Vocabulary file
- `merges.txt`: Merge rules for BPE
- `tokenizer.json`: Full tokenizer config
- `tokenizer_config.json`: Hugging Face-compatible configuration
- `special_tokens_map.json`: Maps for special tokens
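
These files can also be loaded directly with the `tokenizers` library, without going through `transformers`; a minimal sketch:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Load the complete pipeline (normalizer, pre-tokenizer, model, post-processor)
tokenizer = Tokenizer.from_file("tokenizer.json")

# Or rebuild only the BPE model from the vocabulary and merge rules
bpe_model = BPE.from_file("vocab.json", "merges.txt", unk_token="[UNK]")
```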
## Usage
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

# Sample Geez-script text (Amharic for "Hello, world"); substitute your own input
text = "ሰላም ለዓለም"

tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", ids)
```
## Intended Use
This tokenizer is best suited for:
- Low-resource NLP pipelines
- Machine Translation
- Question Answering
- Named Entity Recognition
- Morphological analysis
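
As one example of such a pipeline, the tokenizer can be paired with a freshly initialized masked language model for pretraining from scratch. The model class and sizes below are illustrative assumptions, not a published checkpoint:

```python
from transformers import PreTrainedTokenizerFast, RobertaConfig, RobertaForMaskedLM

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

# Small from-scratch config sized to the 32k vocabulary (illustrative numbers)
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    pad_token_id=tokenizer.pad_token_id,
)
model = RobertaForMaskedLM(config)
```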
## Limitations
- It is optimized for Geez-script languages and might not generalize to others.
- Some compound verbs and morphologically fused words may still require linguistic preprocessing.
- It currently covers only monolingual Amharic and Tigrinya text; it does not support multilingual code-switching.
## Evaluation
The tokenizer was evaluated manually on:
- Token coverage of Tigrinya/Amharic corpora
- Morphological preservation
- Reduction of BPE segmentation errors

Quantitative metrics are to be published in an accompanying paper.
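
A rough version of the coverage checks can be scripted. The sketch below computes fertility (tokens per whitespace-separated word) and the `[UNK]` rate; `eval.txt` is a placeholder for an evaluation corpus:

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("Hailay/geez-tokenizer")

total_words = total_tokens = unk_count = 0
with open("eval.txt", encoding="utf-8") as f:
    for line in f:
        total_words += len(line.split())
        tokens = tokenizer.tokenize(line)
        total_tokens += len(tokens)
        unk_count += tokens.count(tokenizer.unk_token)

# Lower fertility means less over-segmentation; a lower [UNK] rate means better coverage
print(f"Fertility (tokens/word): {total_tokens / max(total_words, 1):.2f}")
print(f"[UNK] rate: {unk_count / max(total_tokens, 1):.2%}")
```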
## License
This tokenizer is licensed under the MIT License.
## Citation

```bibtex
@misc{hailay2025geez,
  title={Ge'ez Script Tokenizer: A Morpheme-Aware BPE Tokenizer for Geez Script Languages},
  author={Teklehaymanot, Hailay},
  year={2025},
  howpublished={\url{https://huggingface.co/Hailay/geez-tokenizer}},
}
```