---
license: cc-by-nc-4.0
---


## Model description
SSA-HuBERT-base-60k-V2 is a self-supervised speech model based on the HuBERT Base architecture (~95M parameters) [1].   
It was trained on nearly 60,000 hours of speech segments covering 21 languages and language variants spoken in Sub-Saharan Africa. 
	
### Pretraining data
- Dataset: The training data combines studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech). 

- Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

## ASR fine-tuning
The model is fine-tuned with the SpeechBrain toolkit (Ravanelli et al., 2021).    
Fine-tuning is performed separately for each language on the FLEURS dataset [2].   
The pretrained model (SSA-HuBERT-base-60k) serves as the speech encoder and is fully fine-tuned together with two 1024-unit linear layers and a softmax output layer on top.
 
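The fine-tuning head described above can be sketched in PyTorch as follows. This is a minimal illustration, not the actual SpeechBrain recipe: the class name `CharacterHead`, the LeakyReLU activations, the final projection layer, and the vocabulary size of 40 are all assumptions made for the example. The 768-dimensional input matches the hidden size of a HuBERT Base encoder, and a random tensor stands in for the encoder output.

```python
import torch
from torch import nn


class CharacterHead(nn.Module):
    """Hypothetical head: two 1024-unit linear layers plus a softmax output."""

    def __init__(self, encoder_dim=768, hidden_dim=1024, vocab_size=40):
        super().__init__()
        # Two 1024-unit linear layers; the activation choice is an assumption.
        self.layers = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LeakyReLU(),
        )
        # Projection to the output symbols (assumed), followed by softmax.
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, encoder_states):
        # encoder_states: (batch, frames, encoder_dim)
        logits = self.output(self.layers(encoder_states))
        return torch.softmax(logits, dim=-1)


# Stand-in for encoder output: 2 utterances, 50 frames, 768 features each.
fake_encoder_states = torch.randn(2, 50, 768)
head = CharacterHead()
posteriors = head(fake_encoder_states)  # shape: (2, 50, 40)
```

In full fine-tuning, the encoder parameters would be updated together with this head rather than kept frozen.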
## License
This model is released under the CC BY-NC 4.0 license.

## Results
The following results were obtained with greedy decoding **(no language model rescoring)**.    
Character error rates (CERs) and word error rates (WERs) are reported in the table below for the 20 languages of the Sub-Saharan Africa (SSA) subset of the FLEURS dataset.

| **Language**       | **CER**                         |                         |                         | **WER**                         |                         |                         |
| :----------------- | :------------------------------ | :---------------------- | :---------------------- | :------------------------------ | :---------------------- | :---------------------- |
|                    | **base-V2** | **large** | **XL** | **base-V2** | **large** | **XL** |
| **Afrikaans**     | 19.8 | 13.0 | **12.4** | 59.1 | 42.3 | **39.8** |
| **Amharic**       | 13.3 | **9.9** | 10.3 | 44.3 | **32.9** | 34.3 |
| **Fula**          | 16.8 | **15.4** | 16.4 | 54.2 | **50.9** | 52.7 |
| **Ganda**         | 10.3 | 9.4 | **9.0** | 49.4 | 46.9 | **45.6** |
| **Hausa**         | 8.5 | 6.6 | **5.5** | 28.1 | 21.6 | **19.6** |
| **Igbo**          | 15.8 | 13.2 | **12.8** | 49.7 | 44.2 | **43.3** |
| **Kamba**         | 14.5 | 11.4 | **10.7** | 50.2 | 41.8 | **39.7** |
| **Lingala**       | 6.9 | 4.9 | **4.3** | 20.4 | 14.9 | **13.6** |
| **Luo**           | 7.6 | 6.1 | **5.8** | 33.6 | 28.0 | **27.0** |
| **Northern Sotho** | 10.7 | 8.4 | **8.0** | 35.9 | **28.8** | 33.7 |
| **Nyanja**        | 10.6 | 8.0 | **7.0** | 44.5 | 35.3 | **32.7** |
| **Oromo**         | 19.4 | **18.2** | 18.3 | 73.1 | **66.9** | 67.7 |
| **Shona**         | 7.3 | 5.1 | **4.7** | 34.6 | 24.6 | **23.2** | 
| **Somali**        | 19.1 | 15.5 | **15.3** | 58.6 | 49.8 | **49.2** |
| **Swahili**       | 4.8 | 3.3 | **2.7** | 17.6 | 12.0 | **10.1** |
| **Umbundu**       | 18.3 | 15.1 | **14.6** | 53.7 | **47.7** | 50.6 |
| **Wolof**         | 16.3 | 13.7 | **12.4** | 48.7 | 42.2 | **40.0** |
| **Xhosa**         | 8.9 | 6.7 | **6.3** | 42.2 | 34.6 | **33.5** |
| **Yoruba**        | 21.6 | 19.9 | **19.0** | 62.2 | 57.9 | **55.9** |
| **Zulu**          | 9.1 | 6.7 | **6.2** | 42.1 | 33.3 | **31.0** |
| *Overall average* | 13.0 | 10.5 | 10.1 | 45.1 | 37.8 | 37.2 |

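For readers unfamiliar with how such numbers are produced, the sketch below illustrates greedy decoding and error-rate scoring in plain Python. It rests on several assumptions not stated in this card: a CTC-style output with a blank symbol at index 0, spaces counted as characters in the CER, and no text normalization. The function names `greedy_ctc_decode`, `edit_distance`, and `error_rate` are hypothetical helpers, not part of SpeechBrain.

```python
def greedy_ctc_decode(frame_scores, labels, blank_index=0):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    decoded = []
    previous = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)
        if best != previous and best != blank_index:
            decoded.append(labels[best])
        previous = best
    return "".join(decoded)


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis, unit="char"):
    """Return the CER (unit='char') or WER (unit='word') in percent."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# Toy example: six frames over the labels <blank>, "a", "b".
labels = ["<blank>", "a", "b"]
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
hypothesis = greedy_ctc_decode(frames, labels)  # "aab"
cer = error_rate("aab", hypothesis)             # 0.0
wer = error_rate("the cat sat", "the cat sit", unit="word")
```

Because no language model rescoring is applied, the transcript depends only on the per-frame best labels, so a single wrong character makes the whole word count as a WER error.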

## References
[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.   
[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.