Question about Training Details

#3
by GordonChang - opened

Hi, thank you for releasing Irodori-TTS-500M-v2! The emoji-based style control feature is particularly impressive.

I already found the WandB training logs linked in the model card, so I could confirm that training ran from roughly 250k steps (v1) to around 600k steps (v2), which lines up with the 2.5x increase it mentions. That said, I still had a few questions about the broader training details:

  1. Dataset size: Approximately how many hours of Japanese speech data were used for training?
  2. Hardware: What GPU setup was used (e.g., number of GPUs, GPU type), and roughly how long did training take?
  3. Hyperparameters: Are key hyperparameters such as batch size, learning rate, and scheduler documented anywhere, or are there plans to release a training config?
  4. Data filtering: Could you elaborate on the "stricter data filtering" applied in v2? What criteria were used?
  5. Qwen3-Omni Fine-tuning: The model card mentions that emoji annotations were automatically generated using a fine-tuned version of Qwen/Qwen3-Omni-30B-A3B-Instruct. Could you share more details about this fine-tuning process — such as what training data was used, how the annotation labels were initially defined, and whether there are any plans to release the fine-tuned annotation model?

Any additional details would be greatly appreciated, especially for reproducibility purposes. Thanks again for the great work!

Hi, thank you for looking into the details and checking out the WandB logs! Here are the answers to your questions:

  • Dataset size: The model was trained on approximately 50k hours of Japanese speech data.
  • Hardware: I used a setup with 8x H200 GPUs, and the training took about 10 days. That being said, I believe there is still room to speed this up through further code optimization.
  • Hyperparameters: The configuration file available in the GitHub repository is identical to the one used for the actual run, except for the total number of steps.
  • Data filtering: The filtering process was relatively straightforward. It mainly relies on characters-per-second (CPS) thresholds to detect ASR errors, silence-ratio checks, and a few other basic heuristics. For future updates, I am planning to incorporate more advanced filtering using tools like DNSMOS.
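
To illustrate the kind of filtering described above, here is a minimal sketch of CPS- and silence-ratio-based clip filtering. All function names, field choices, and threshold values are illustrative assumptions on my part, not the actual values used for Irodori-TTS-500M-v2:

```python
def chars_per_second(transcript: str, duration_s: float) -> float:
    """Characters per second of the transcript over the audio duration.
    Extreme values often indicate an ASR mismatch (wrong or truncated text)."""
    return len(transcript) / duration_s if duration_s > 0 else float("inf")


def keep_clip(
    transcript: str,
    duration_s: float,
    silence_ratio: float,
    cps_range: tuple[float, float] = (4.0, 20.0),  # hypothetical bounds
    max_silence: float = 0.4,                       # hypothetical threshold
) -> bool:
    """Return True if the clip passes the basic quality checks.

    All thresholds here are placeholders; real values would be tuned
    per-language (e.g. Japanese kana vs. mixed kanji text changes CPS).
    """
    cps = chars_per_second(transcript, duration_s)
    if not (cps_range[0] <= cps <= cps_range[1]):
        return False  # transcript too long/short for the audio: likely ASR error
    if silence_ratio > max_silence:
        return False  # too much of the clip is silence relative to speech
    return True
```

In practice a filter like this runs over dataset metadata (transcript, duration, precomputed silence ratio) before any audio is loaded, so it is cheap to apply at 50k-hour scale.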

I hope this helps! Thanks again for your interest in the model.
