---
license: cc-by-nc-4.0
---

## Model description

This self-supervised speech model (a.k.a. SSA-HuBERT-base-60k-V2) is based on the HuBERT Base architecture (~95M parameters) [1]. It was trained on nearly 60 000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.

### Pretraining data

- Dataset: The training dataset combines studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).
- Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

## ASR fine-tuning

The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model. Fine-tuning is done separately for each language using the FLEURS dataset [2]. The pretrained model (SSA-HuBERT-base-60k) serves as the speech encoder and is fully fine-tuned together with two linear layers of size 1024 and a softmax output layer on top.

## License

This model is released under the CC BY-NC 4.0 license.

## Results

The following results are obtained with greedy decoding **(no language model rescoring)**. Character error rates (CERs) and word error rates (WERs) are given in the table below for the 20 languages of the SSA subset of the FLEURS dataset.

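
For readers unfamiliar with the two metrics, the sketch below shows the standard edit-distance definition of WER and CER as corpus-level percentages. It is a minimal illustration, not the scoring pipeline used to produce the table: the function names `edit_distance` and `error_rate`, the example sentences, and the choice to count spaces as characters are all assumptions, and any text normalization applied before scoring is not specified in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference units.

    unit="word" gives WER; unit="char" gives CER (spaces are counted as
    characters here, which is an assumption of this sketch).
    """
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * edits / total


# Hypothetical reference/hypothesis pair, made up for illustration only.
refs = ["habari ya asubuhi"]
hyps = ["habari za asubuhi"]

wer = error_rate(refs, hyps, unit="word")  # 1 wrong word out of 3
cer = error_rate(refs, hyps, unit="char")  # 1 wrong character out of 17
print(f"WER = {wer:.1f}%, CER = {cer:.1f}%")  # prints "WER = 33.3%, CER = 5.9%"
```

A single wrong character costs a whole word in WER but only one unit in CER, which is why the WER columns are consistently much higher than the CER columns.
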
| **Language**       | **CER (%)** |           |          | **WER (%)** |           |          |
| :----------------- | :---------- | :-------- | :------- | :---------- | :-------- | :------- |
|                    | **base-V2** | **large** | **XL**   | **base-V2** | **large** | **XL**   |
| **Afrikaans**      | 19.8        | 13.0      | **12.4** | 59.1        | 42.3      | **39.8** |
| **Amharic**        | 13.3        | **9.9**   | 10.3     | 44.3        | **32.9**  | 34.3     |
| **Fula**           | 16.8        | **15.4**  | 16.4     | 54.2        | **50.9**  | 52.7     |
| **Ganda**          | 10.3        | 9.4       | **9.0**  | 49.4        | 46.9      | **45.6** |
| **Hausa**          | 8.5         | 6.6       | **5.5**  | 28.1        | 21.6      | **19.6** |
| **Igbo**           | 15.8        | 13.2      | **12.8** | 49.7        | 44.2      | **43.3** |
| **Kamba**          | 14.5        | 11.4      | **10.7** | 50.2        | 41.8      | **39.7** |
| **Lingala**        | 6.9         | 4.9       | **4.3**  | 20.4        | 14.9      | **13.6** |
| **Luo**            | 7.6         | 6.1       | **5.8**  | 33.6        | 28.0      | **27.0** |
| **Northern Sotho** | 10.7        | 8.4       | **8.0**  | 35.9        | **28.8**  | 33.7     |
| **Nyanja**         | 10.6        | 8.0       | **7.0**  | 44.5        | 35.3      | **32.7** |
| **Oromo**          | 19.4        | **18.2**  | 18.3     | 73.1        | **66.9**  | 67.7     |
| **Shona**          | 7.3         | 5.1       | **4.7**  | 34.6        | 24.6      | **23.2** |
| **Somali**         | 19.1        | 15.5      | **15.3** | 58.6        | 49.8      | **49.2** |
| **Swahili**        | 4.8         | 3.3       | **2.7**  | 17.6        | 12.0      | **10.1** |
| **Umbundu**        | 18.3        | 15.1      | **14.6** | 53.7        | **47.7**  | 50.6     |
| **Wolof**          | 16.3        | 13.7      | **12.4** | 48.7        | 42.2      | **40.0** |
| **Xhosa**          | 8.9         | 6.7       | **6.3**  | 42.2        | 34.6      | **33.5** |
| **Yoruba**         | 21.6        | 19.9      | **19.0** | 62.2        | 57.9      | **55.9** |
| **Zulu**           | 9.1         | 6.7       | **6.2**  | 42.1        | 33.3      | **31.0** |
| *Overall average*  | 13.0        | 10.5      | 10.1     | 45.1        | 37.8      | 37.2     |

## References

[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.

[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-Shot Learning Evaluation of Universal Representations of Speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.