afro-xlmr-large-76L_script
AfroXLMR-large was created in two steps. First, the XLM-R-large model was augmented with two scripts it lacked (N'Ko and Tifinagh). The expanded model was then adapted with masked language modeling (MLM) on 76 languages widely spoken in Africa, including 4 high-resource languages.
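Since the script augmentation targets N'Ko and Tifinagh specifically, it can be useful to check whether a given input actually uses one of those scripts. The following is a minimal sketch and not part of the model's tooling: the helper name `scripts_in` is hypothetical, and the ranges are the Unicode blocks for N'Ko (U+07C0 to U+07FF) and Tifinagh (U+2D30 to U+2D7F).

```python
# Unicode block ranges for the two scripts added to the vocabulary.
SCRIPT_BLOCKS = {
    "N'Ko": (0x07C0, 0x07FF),
    "Tifinagh": (0x2D30, 0x2D7F),
}


def scripts_in(text):
    """Return the names of the scripts in SCRIPT_BLOCKS that occur in `text`.

    Hypothetical helper for illustration only.
    """
    found = set()
    for ch in text:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_BLOCKS.items():
            if low <= code_point <= high:
                found.add(name)
    return found
```

Text for which `scripts_in` returns an empty set is written in scripts outside these two blocks (Latin, Arabic, Ethiopic, and so on).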
Pre-training corpus
A mix of mC4, Wikipedia and OPUS data
Languages
There are 75 languages available:
 - English (eng)
 - Amharic (amh)
 - Arabic (ara)
 - Somali (som)
 - Kiswahili (swa)
 - Portuguese (por)
 - Afrikaans (afr)
 - French (fra)
 - isiZulu (zul)
 - Malagasy (mlg)
 - Hausa (hau)
 - chiShona (sna)
 - Egyptian Arabic (arz)
 - Chichewa (nya)
 - Igbo (ibo)
 - isiXhosa (xho)
 - Yorùbá (yor)
 - Sesotho (sot)
 - Kinyarwanda (kin)
 - Tigrinya (tir)
 - Tsonga (tso)
 - Oromo (orm)
 - Rundi (run)
 - Northern Sotho (nso)
 - Ewe (ewe)
 - Lingala (lin)
 - Twi (twi)
 - Nigerian Pidgin (pcm)
 - Ga (gaa)
 - Lozi (loz)
 - Luganda (lug)
 - Gun (guw)
 - Bemba (bem)
 - Efik (efi)
 - Luvale (lue)
 - Luba-Lulua (lua)
 - Tonga (toi)
 - Tshivenḓa (ven)
 - Tumbuka (tum)
 - Tetela (tll)
 - Isoko (iso)
 - Kaonde (kqn)
 - Zande (zne)
 - Umbundu (umb)
 - Mossi (mos)
 - Tiv (tiv)
 - Luba-Katanga (lub)
 - Fula (fuv)
 - San Salvador Kongo (kwy)
 - Baoulé (bci)
 - Ruund (rnd)
 - Luo (luo)
 - Wolaitta (wal)
 - Swazi (ssw)
 - Lunda (lun)
 - Wolof (wol)
 - Nyaneka (nyk)
 - Kwanyama (kua)
 - Kikuyu (kik)
 - Fon (fon)
 - Bambara (bam)
 - Chokwe (cjk)
 - Dinka (dik)
 - Dyula (dyu)
 - Kabyle (kab)
 - Kamba (kam)
 - Kabiyè (kbp)
 - Kanuri (knc)
 - Kimbundu (kmb)
 - Kikongo (kon)
 - Nuer (nus)
 - Sango (sag)
 - Tamasheq (taq)
 - Tamazight (tzm)
 - N'Ko (nqo)
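Because the model was adapted with MLM, one way to try it on any of the languages listed above is the `fill-mask` pipeline from the `transformers` library. This is a sketch under assumptions rather than an official usage recipe: the Hub repository ID `Davlan/afro-xlmr-large-76L_script` is inferred from the model name and is not stated in this card, the mask token is assumed to be `<mask>` as in XLM-R, and `top_predictions` is a hypothetical helper name.

```python
# Assumption: Hub repository ID inferred from the model name; verify before use.
MODEL_ID = "Davlan/afro-xlmr-large-76L_script"


def top_predictions(text, model_id=MODEL_ID, top_k=3):
    """Return (token, score) pairs for the single <mask> token in `text`."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=model_id)
    results = unmasker(text, top_k=top_k)
    return [(r["token_str"], r["score"]) for r in results]


# Example call (downloads the model on first use), Kiswahili input:
#   for token, score in top_predictions("Jina langu ni <mask>."):
#       print(token, round(score, 3))
```

The first call downloads the model weights, so it needs network access and a few gigabytes of disk space.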
 
Acknowledgment
BibTeX entry and citation info.
@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}