# AfroLID

AfroLID is a neural language identification (LID) toolkit for 517 African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems. AfroLID is described in this paper: [**AfroLID: A Neural Language Identification Tool for African Languages**](https://arxiv.org/abs/2210.11744).

## What's New in AfroLID v1.5?

- **Fine-tuned on [SERENGETI](https://huggingface.co/UBC-NLP/serengeti)**, a massively multilingual language model covering 517 African languages and language varieties.
- **Enhanced model performance**, improving macro-F1 from 95.95 to 97.41.
- **Built on Hugging Face Transformers** for seamless integration.
- **Optimized for easy use** with the Hugging Face pipeline.
- **Better efficiency and accuracy**, making African language identification more robust.

## How to use AfroLID v1.5?

```python
from transformers import pipeline

# Load AfroLID v1.5 as a text-classification pipeline
afrolid = pipeline("text-classification", model='UBC-NLP/afrolid_1.5')

input_text = "6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali"

result = afrolid(input_text)

# Extract the label and score from the first result
language = result[0]['label']
score = result[0]['score']

print(f"detected language: {language}\tscore: {round(score*100, 2)}")
```

**Output**:

```
detected language: dip    score: 99.99
```

## Supported languages

Please refer to [**supported-languages**](https://github.com/UBC-NLP/afrolid/blob/main/supported-languages).

## Citation

If you use the AfroLID v1.5 model for your scientific publication, or if you find the resources in this repository useful, please cite our papers as follows:

**AfroLID's Paper**

```
@inproceedings{adebara2022afrolid,
  title = {AfroLID: A Neural Language Identification Tool for African Languages},
  author = {Adebara, Ife and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad and Inciarte, Alcides Alcoba},
  booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = dec,
  year = "2022",
}
```

**Serengeti's Paper**

```
@inproceedings{adebara-etal-2023-serengeti,
  title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
  author = "Adebara, Ife and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad and Alcoba Inciarte, Alcides",
  booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
  month = jul,
  year = "2023",
  address = "Toronto, Canada",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2023.findings-acl.97",
  doi = "10.18653/v1/2023.findings-acl.97",
  pages = "1498--1537",
}
```
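
## Batch usage sketch

The usage example earlier in this README classifies a single string. The sketch below shows one possible way to run several texts at once and to discard low-confidence predictions. It is not part of the AfroLID release: the helper names `load_afrolid` and `identify_batch` and the `MIN_SCORE` cutoff are assumptions introduced here for illustration, and a suitable threshold would need to be chosen on your own data.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical cutoff for illustration; not a value recommended by the AfroLID authors.
MIN_SCORE = 0.90


def load_afrolid():
    """Build the same pipeline as in the usage section (requires `transformers`)."""
    from transformers import pipeline

    return pipeline("text-classification", model="UBC-NLP/afrolid_1.5")


def identify_batch(
    classifier: Callable,
    texts: List[str],
    min_score: float = MIN_SCORE,
) -> List[Tuple[Optional[str], float]]:
    """Return one (label, score) pair per text; label is None below min_score."""
    # A text-classification pipeline called with a list of strings returns
    # one prediction per input, each with "label" and "score" keys.
    predictions = classifier(texts)
    results = []
    for pred in predictions:
        # If the pipeline was configured to return several candidates per
        # text, keep only the first (highest-scoring) one.
        if isinstance(pred, list):
            pred = pred[0]
        label = pred["label"] if pred["score"] >= min_score else None
        results.append((label, pred["score"]))
    return results
```

To use it, call `identify_batch(load_afrolid(), ["first text", "second text"])`. Each returned pair holds the predicted language code (or `None` when the score falls below the cutoff) together with the raw score, so callers can decide how to handle uncertain inputs.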