## Model Details
- Developed by: Ahmad Amirivojdan
- Language(s) (NLP): Persian
- Repository: Shekar - Open Source Persian NLP Toolkit
- Paper: Shekar: A Python Toolkit for Persian Natural Language Processing
- License: MIT
## Model Description
This model is a compact representation model for Persian text, built by fine-tuning the ALBERT base architecture on Naab, a high-quality Persian corpus, with ZWNJ-aware tokenization. It can be used for tasks such as:
- Masked token prediction (fill-mask) in Persian text
- Feature extraction / embedding generation for downstream tasks such as classification, ranking, and clustering (see the embedding sketch under "How to Use")
- Backbone for further fine-tuning on downstream Persian NLP tasks
## How to Use
```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("shekar-ai/albert-base-v2-persian-zwnj-naab-mlm")
model = AutoModelForMaskedLM.from_pretrained("shekar-ai/albert-base-v2-persian-zwnj-naab-mlm")

# Example usage: fill-mask. The sentence reads roughly "I go to school [MASK]."
# Note the ZWNJ in "می‌روم", which this model's tokenizer is trained to handle.
input_text = "من به مدرسه [MASK] می‌روم."
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits

# Locate the [MASK] position and decode the top-5 predicted tokens.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].item()
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```
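Beyond fill-mask, the model can also serve as an encoder for feature extraction. The sketch below is one common recipe rather than an API documented by this card: it loads the encoder with `AutoModel` and mean-pools the last hidden state into fixed-size sentence embeddings; the `embed` helper and the pooling choice are illustrative assumptions.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("shekar-ai/albert-base-v2-persian-zwnj-naab-mlm")
encoder = AutoModel.from_pretrained("shekar-ai/albert-base-v2-persian-zwnj-naab-mlm")

def embed(sentences):
    # Hypothetical helper: mean-pool the last hidden state over non-padding tokens.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq_len, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)     # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# "Hello world", "Persian natural language processing"
vectors = embed(["سلام دنیا", "پردازش زبان طبیعی فارسی"])
print(vectors.shape)  # e.g. torch.Size([2, 768]) for an ALBERT-base encoder
```

Mean pooling is a simple default; CLS pooling or task-specific fine-tuning often works better for classification and ranking.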
## Citation
BibTeX:
```bibtex
@article{Amirivojdan2025Shekar,
  author  = {Amirivojdan, Ahmad},
  doi     = {10.21105/joss.09128},
  journal = {Journal of Open Source Software},
  month   = oct,
  number  = {114},
  pages   = {9128},
  title   = {{Shekar: A Python Toolkit for Persian Natural Language Processing}},
  url     = {https://joss.theoj.org/papers/10.21105/joss.09128},
  volume  = {10},
  year    = {2025}
}
```