πŸ‡°πŸ‡­ KM Improved 22K Tokenizer

A general-purpose Khmer tokenizer optimized for both accuracy and speed.
It provides a stable backbone for Khmer NLP applications such as classification,
question answering, translation, and summarization.


🧠 Model Details

Model Description

  • Developer: Sok Meas (@Msok99)
  • Model type: SentencePiece Unigram Tokenizer
  • Language: Khmer (khm)
  • License: MIT
  • Finetuned from: None (trained from scratch)


βš™οΈ Uses

Direct Use

  • Tokenizing Khmer text for downstream NLP models
  • Preparing training data for transformer-based fine-tuning
  • Segmenting sentences for analysis or embedding generation
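
The data-preparation use above can be sketched as follows — a minimal example, assuming the tokenizer is published under `Msok99/km-improved-22k` on the Hugging Face Hub and the example sentences are hypothetical:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Hypothetical Khmer training snippets.
sentences = [
    "αžŸαž½αžŸαŸ’αžŠαžΈ",      # "Hello"
    "αžŸαžΌαž˜αž’αžšαž‚αž»αžŽ",   # "Thank you"
]

# Tokenize the whole batch at once; truncation caps sequence length
# so examples fit a downstream model's context window.
batch = tokenizer(sentences, truncation=True, max_length=32)
print(batch["input_ids"])
```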

Downstream Use

  • Integration into Khmer LLMs or chatbots
  • Pre- and post-processing for summarization or translation systems

Out-of-Scope Use

  • Not designed for English or heavily mixed Khmer–English content
  • Not an inference or generation model itself

βš–οΈ Bias, Risks & Limitations

  • Very long or compound words may still split into several sub-tokens
  • Limited exposure to informal slang or non-standard Khmer orthography

Recommendations

For code-switched text (Khmer + English), use the merged model
Msok99/lfm2-khmer-merged-18k.
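
As a quick sketch (assuming the merged model's tokenizer is likewise loadable via `transformers`):

```python
from transformers import AutoTokenizer

# Tokenizer recommended above for Khmer–English code-switched text.
merged = AutoTokenizer.from_pretrained("Msok99/lfm2-khmer-merged-18k")

mixed = "αžŸαž½αžŸαŸ’αžŠαžΈ, how are you?"
print(merged.tokenize(mixed))
```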


πŸš€ How to Get Started

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# "In 2025, Cambodia will develop new technology."
text = "αž€αŸ’αž“αž»αž„αž†αŸ’αž“αžΆαŸ†αŸ’αŸ αŸ’αŸ₯ αž€αž˜αŸ’αž–αž»αž‡αžΆαž“αžΉαž„αž’αž—αž·αžœαžŒαŸ’αžαž“αŸαž”αž…αŸ’αž…αŸαž€αžœαž·αž‘αŸ’αž™αžΆαžαŸ’αž˜αžΈαŸ”"
tokens = tokenizer.tokenize(text)
print(tokens)

# Round-trip: encode to ids, then decode back to text.
print(tokenizer.decode(tokenizer.encode(text)))
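
You can also inspect the vocabulary and the token–id mapping. A short sketch; note that reading the "22k" in the model name as the vocabulary size is an assumption, not something stated elsewhere on this card:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Msok99/km-improved-22k")

# Vocabulary size (the "22k" in the name presumably refers to this).
print(len(tokenizer))

# Token <-> id round trip, without special tokens.
ids = tokenizer.encode("αžŸαž½αžŸαŸ’αžŠαžΈ", add_special_tokens=False)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))
```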