Sesame CSM Fine-Tuned on Common Voice 17 Arabic
This model is a fine-tuned version of sesame/csm-1b, trained on the Arabic subset of the Common Voice 17.0 dataset.
Model Description
Sesame Conversational Speech Model (CSM) is a state-of-the-art text-to-speech model that generates natural, conversational speech. This version has been fine-tuned specifically for Arabic speech synthesis.
The model did learn the new language and shows some encouraging signs. However, performance was below average, which was expected given the noise in the Common Voice 17 dataset; the data could use more preprocessing for better results.
Training Details
Training Data
- Dataset: Mozilla Common Voice 17.0 (Arabic subset)
- Language: Arabic (ar)
Training Hyperparameters
After a sweep of 15 runs with different hyperparameters, the following configuration performed best:
- Batch Size: 24
- Learning Rate: 3e-6
- Epochs: 25
- Optimizer: AdamW with exponential LR decay
- Weight Decay: 0.014182
- Max Gradient Norm: 2.923641
- Warmup Steps: 569
- Gradient Accumulation Steps: 1
- Decoder Loss Weight: 0.5
- Mixed Precision: Enabled (AMP)
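The warmup and decay settings above can be pictured with a small schedule function. This is only a minimal sketch, assuming a linear warmup over the 569 warmup steps followed by a smooth exponential decay; the card does not state the decay rate or the total number of steps, so `total_steps` and `final_ratio` below are hypothetical placeholders, and the actual training script may shape the schedule differently.

```python
def learning_rate(step, base_lr=3e-6, warmup_steps=569,
                  total_steps=10_000, final_ratio=0.1):
    """Illustrative schedule: linear warmup, then exponential decay.

    base_lr and warmup_steps come from the model card.
    total_steps and final_ratio are assumptions made for this sketch.
    """
    if step < warmup_steps:
        # Ramp linearly from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Decay geometrically so the rate reaches base_lr * final_ratio at the end.
    return base_lr * final_ratio ** progress


if __name__ == "__main__":
    for s in (0, 284, 568, 569, 5_000, 10_000):
        print(s, learning_rate(s))
```

With these placeholder values, the rate peaks at 3e-6 when warmup ends and falls to one tenth of that by the final step.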
Training Configuration
batch_size: 24
decoder_loss_weight: 0.5
device: "cuda"
gen_every: 2000
gen_speaker: 999
grad_acc_steps: 1
learning_rate: 0.000003
log_every: 10
lr_decay: "exponential"
max_grad_norm: 2.923641
n_epochs: 25
partial_data_loading: false
save_every: 2000
train_from_scratch: false
use_amp: true
val_every: 200
warmup_steps: 569
weight_decay: 0.014182
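A few of these settings interact, which a short sketch can make concrete. The example below mirrors a subset of the configuration in a dataclass and derives the effective batch size and the number of validation passes per checkpoint. The `combined_loss` helper is an assumption about how `decoder_loss_weight` might blend a backbone loss with a decoder loss; the training script is not shown in this card, so the real formula may differ. `TrainConfig` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the training configuration in this card.
    batch_size: int = 24
    grad_acc_steps: int = 1
    decoder_loss_weight: float = 0.5
    val_every: int = 200
    save_every: int = 2000


def effective_batch_size(cfg: TrainConfig) -> int:
    # Samples contributing to each optimizer update.
    return cfg.batch_size * cfg.grad_acc_steps


def validations_per_checkpoint(cfg: TrainConfig) -> int:
    # Number of validation passes between two saved checkpoints.
    return cfg.save_every // cfg.val_every


def combined_loss(backbone_loss: float, decoder_loss: float,
                  cfg: TrainConfig) -> float:
    # ASSUMPTION: a convex combination of the two losses.
    w = cfg.decoder_loss_weight
    return (1 - w) * backbone_loss + w * decoder_loss


if __name__ == "__main__":
    cfg = TrainConfig()
    print(effective_batch_size(cfg), validations_per_checkpoint(cfg))
    print(combined_loss(2.0, 4.0, cfg))
```

Under this assumed formula, a weight of 0.5 gives both loss terms equal influence.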
Generation Sample
The model was tested with the following Arabic text during training:
"في صباحٍ مشرق، تجمّع الأطفال في الساحة يلعبون ويضحكون تحت أشعة الشمس، بينما كانت الطيور تغرّد فوق الأشجار. الأمل يملأ القلوب، والحياة تمضي بخطى هادئة نحو غدٍ أجمل."
Model Architecture
- Backbone: LLaMA-1B based architecture
- Decoder: LLaMA-100M based decoder
- Audio Codebooks: 32
- Audio Vocabulary Size: 2,051
- Text Vocabulary Size: 128,256
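To give a sense of how the codebook count and audio vocabulary size relate, the sketch below maps a (codebook, token) pair to a single flat index. It assumes that all 32 codebooks share one embedding table, with each codebook occupying its own contiguous block of 2,051 rows; this layout is an assumption for illustration and is not confirmed by this card. The function name `flat_audio_index` is hypothetical.

```python
AUDIO_CODEBOOKS = 32      # from the model card
AUDIO_VOCAB_SIZE = 2051   # from the model card

# Total rows if every codebook gets its own block in one shared table.
TOTAL_AUDIO_ROWS = AUDIO_CODEBOOKS * AUDIO_VOCAB_SIZE


def flat_audio_index(codebook: int, token: int) -> int:
    """Map (codebook, token) to a row in the assumed shared table."""
    if not 0 <= codebook < AUDIO_CODEBOOKS:
        raise ValueError(f"codebook out of range: {codebook}")
    if not 0 <= token < AUDIO_VOCAB_SIZE:
        raise ValueError(f"token out of range: {token}")
    # Offset each codebook by a full vocabulary-sized block.
    return codebook * AUDIO_VOCAB_SIZE + token


if __name__ == "__main__":
    print(TOTAL_AUDIO_ROWS)
    print(flat_audio_index(0, 0), flat_audio_index(31, 2050))
```

Under this layout the audio side needs 65,632 rows in total, compared with 128,256 rows for the text vocabulary.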
Usage
Use the following repo to run the model with Gradio: https://github.com/Saganaki22/CSM-WebUI
You need at least 8 GB of VRAM to run the model.
Limitations and Bias
- This model is specifically trained for Arabic speech synthesis
- Performance may vary with different Arabic dialects
- The model inherits any biases present in the Common Voice 17.0 Arabic dataset
Acknowledgments
- Original CSM model by the Sesame team
- Mozilla Foundation for the Common Voice dataset
- Hugging Face for the model hosting platform
- Modal Labs for the compute
Model tree for samehelalfi/Seasmed-Fine-Tuned-on-Common-Voice-17-Arabic
Base model
sesame/csm-1b