<p align="center">
<img src="https://raw.githubusercontent.com/UBC-NLP/afrolid/refs/heads/main/images/afrolid_logo.jpg" alt="AfroLID" width="70%" />
</p>

AfroLID is a neural language identification (LID) toolkit for 517 African languages and varieties. It draws on a multi-domain web dataset manually curated from 14 language families that use five orthographic systems. AfroLID is described in the paper
[**AfroLID: A Neural Language Identification Tool for African Languages**](https://arxiv.org/abs/2210.11744).

## What's New in AfroLID v1.5?
- **Fine-tuned on [SERENGETI](https://huggingface.co/UBC-NLP/serengeti)**, a massively multilingual language model covering 517 African languages and language varieties.
- **Enhanced model performance**, improving macro-F1 from 95.95 to 97.41.
- **Built on Hugging Face Transformers** for seamless integration.
- **Optimized for easy use** with the Hugging Face pipeline.
- **Better efficiency and accuracy**, making it more robust for African language identification.


## How to use AfroLID v1.5?

``` python
from transformers import pipeline


afrolid = pipeline("text-classification", model="UBC-NLP/afrolid_1.5")

input_text = "6Acï looi aya në wuöt dït kɔ̈k yiic ku lɔ wuöt tɔ̈u tëmec piny de Manatha ku Eparaim ku Thimion , ku ɣään mec tɔ̈u të lɔ rut cï Naptali"

result = afrolid(input_text)

# Extract the label and score from the first result
language = result[0]['label']
score = result[0]['score']

print(f"detected language: {language}\tscore: {round(score*100, 2)}")
```
**Output**:
```
detected language: dip	score: 99.99
```
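
When classifying many texts, it can help to discard low-confidence predictions. The sketch below is one possible way to do that, not part of AfroLID itself: `identify_languages` and `min_score` are hypothetical names, and `stub_classifier` is a made-up stand-in (with invented scores and an invented `und` label) so the example runs without downloading the model. It assumes the classifier returns one `{"label", "score"}` dict per input when given a list, as the Transformers text-classification pipeline does by default.

``` python
from typing import Callable, List, Optional


def identify_languages(
    classifier: Callable, texts: List[str], min_score: float = 0.9
) -> List[Optional[str]]:
    """Return one label per text, or None if the top score is below min_score."""
    # Given a list of strings, the classifier is expected to return
    # one {"label": ..., "score": ...} dict per input text.
    results = classifier(texts)
    return [r["label"] if r["score"] >= min_score else None for r in results]


def stub_classifier(texts: List[str]) -> List[dict]:
    """Hypothetical stand-in for the real pipeline; labels and scores are invented."""
    return [
        {"label": "dip", "score": 0.9999} if "wuöt" in t else {"label": "und", "score": 0.42}
        for t in texts
    ]


print(identify_languages(stub_classifier, ["Acï looi aya në wuöt dït", "???"]))
# prints ['dip', None]
```

To use the real model, pass `pipeline("text-classification", model="UBC-NLP/afrolid_1.5")` as `classifier` in place of the stub. The 0.9 default threshold is arbitrary and should be tuned on your own data.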

## Supported languages
Please refer to [**supported-languages**](https://github.com/UBC-NLP/afrolid/blob/main/supported-languages).

## Citation
If you use the AfroLID v1.5 model in a scientific publication, or if you find the resources in this repository useful, please cite our papers as follows:

**AfroLID's Paper**
```
@inproceedings{adebara2022afrolid,
  title = {AfroLID: A Neural Language Identification Tool for African Languages},
  author = {Adebara, Ife and Elmadany, AbdelRahim and Abdul-Mageed, Muhammad and Inciarte, Alcides Alcoba},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  month = dec,
  year = {2022},
}
```
**Serengeti's Paper**
```
@inproceedings{adebara-etal-2023-serengeti,
    title = "{SERENGETI}: Massively Multilingual Language Models for {A}frica",
    author = "Adebara, Ife  and
      Elmadany, AbdelRahim  and
      Abdul-Mageed, Muhammad  and
      Alcoba Inciarte, Alcides",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-acl.97",
    doi = "10.18653/v1/2023.findings-acl.97",
    pages = "1498--1537",
}
```