# KM Improved 22K Tokenizer
A general-purpose Khmer tokenizer optimized for both accuracy and speed.
It provides a stable backbone for Khmer NLP applications such as classification,
question answering, translation, and summarization.
## Model Details
### Model Description
- Developer: Sok Meas (@Msok99)
- Model type: SentencePiece Unigram Tokenizer
- Language: Khmer (khm)
- License: MIT
- Finetuned from: None (trained from scratch)
### Model Sources
- Repository: https://huggingface.co/Msok99/km-improved-22k
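To make "SentencePiece Unigram Tokenizer" concrete: a Unigram model scores every possible segmentation of the input by the sum of per-piece log-probabilities and keeps the best one via dynamic programming. The sketch below illustrates that idea with a tiny invented English vocabulary and probabilities (not the real 22K Khmer vocabulary), purely to show the mechanism.

```python
import math

# Toy vocabulary with made-up piece probabilities (illustration only --
# the real model learns these from a Khmer corpus).
vocab = {
    "inter": math.log(0.15),
    "national": math.log(0.12),
    "in": math.log(0.10),
    "ter": math.log(0.05),
    "nation": math.log(0.08),
    "al": math.log(0.10),
}

def segment(text: str, vocab: dict) -> list:
    """Return the highest-probability segmentation of `text` (Viterbi)."""
    n = len(text)
    # best[i] = (best log-prob of text[:i], start index of the last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in vocab and best[start][0] + vocab[piece] > best[end][0]:
                best[end] = (best[start][0] + vocab[piece], start)
    # Walk the backpointers to recover the winning pieces.
    pieces, pos = [], n
    while pos > 0:
        start = best[pos][1]
        pieces.append(text[start:pos])
        pos = start
    return list(reversed(pieces))

print(segment("international", vocab))  # ['inter', 'national']
```

Two long pieces beat many short ones here because each extra piece adds another (negative) log-probability term; this is why a well-trained Unigram vocabulary keeps frequent Khmer words whole instead of splitting them.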
## Uses
### Direct Use
- Tokenizing Khmer text for downstream NLP models
- Preparing training data for transformer-based fine-tuning
- Segmenting sentences for analysis or embedding generation
### Downstream Use
- Integration into Khmer LLMs or chatbots
- Pre- and post-processing for summarization or translation systems
### Out-of-Scope Use
- Not designed for English or heavily mixed Khmer-English content
- Not an inference or generation model itself
## Bias, Risks & Limitations
- Very long or compound words may still split into several sub-tokens
- Limited exposure to informal slang or non-standard Khmer orthography
### Recommendations
For code-switched text (Khmer + English), use the merged model `Msok99/lfm2-khmer-merged-18k`.
## How to Get Started
The original Khmer sample sentence was garbled in transit; the snippet below substitutes a short Khmer greeting as placeholder input.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

text = "សួស្តី"  # "hello" -- replace with your own Khmer text
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip check: encode to token IDs, then decode back to text
print(tokenizer.decode(tokenizer.encode(text)))
```