New feature - Save already cloned voices with same params

#3
by MrCasquette - opened

Hi, I’d like to suggest a new feature for Chatterbox-TTS-French: the ability to save cloned voices along with their generation parameters (exaggeration, temperature, cfg_weight) and the processed audio representation of the voice itself.

Currently, generating a new phrase requires re-processing the audio prompt every time. Saving the cloned voice with its parameters would allow users to quickly generate new text with the exact same voice, ensuring consistency and reducing processing time.

Thanks for considering this!

Well, it seems to happen automatically.
This discussion suggested that the conds (the model's internal representation of the speaker's characteristics) remain stored on the model instance once they have been computed.
Try this on your computer:

text = """
If music be the food of love, play on;
"""
import torchaudio as ta
wav = model.generate(text, audio_prompt_path="ref.mp3")
ta.save("test-1.wav", wav, model.sr)

Then, if you do the same thing without audio_prompt_path, you get:

# Second call in the same session, without audio_prompt_path:
text = """
If music be the food of love, play on;
"""
wav = model.generate(text)
ta.save("test-2.wav", wav, model.sr)  # -> same audio as test-1.wav (as long as it's the same ChatterboxTTS instance)

The problem is that there is no gain in latency.

The model's architecture requires the latents to be derived from audio, but the time it takes to compute them is negligible.

So there's nothing for you to do here: almost all of the generation time is spent producing the audio itself, not the latents.
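
If you want to convince yourself, you can time the two steps separately. A rough sketch, assuming the class exposes a prepare_conditionals method (the step generate runs internally when audio_prompt_path is given); the name may differ in your version, so check the source:

import time
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

t0 = time.perf_counter()
model.prepare_conditionals("ref.mp3")  # build the speaker latents from the reference audio
t1 = time.perf_counter()
wav = model.generate("If music be the food of love, play on;")
t2 = time.perf_counter()

print(f"latents: {t1 - t0:.2f}s, audio generation: {t2 - t1:.2f}s")
# The first number is typically a tiny fraction of the second.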
