Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Abstract
Mega-TTS, a zero-shot TTS system, uses attribute-specific models trained on large datasets to generate high-quality speech with superior naturalness, robustness, and speaker similarity in zero-shot tasks.
Scaling text-to-speech to large, wild datasets has proven highly effective for timbre and speech style generalization, particularly in zero-shot TTS. However, previous works usually encode speech into latents using an audio codec and generate them with autoregressive language models or diffusion models, which ignores the intrinsic nature of speech and may lead to inferior or uncontrollable results. We argue that speech can be decomposed into several attributes (e.g., content, timbre, prosody, and phase) and that each of them should be modeled by a module with appropriate inductive biases. From this perspective, we carefully design a novel, large zero-shot TTS system called Mega-TTS, which is trained on large-scale wild data and models different attributes in different ways: 1) Instead of using latents encoded by an audio codec as the intermediate feature, we still choose the spectrogram, as it separates phase from the other attributes very well. Phase can be appropriately constructed by the GAN-based vocoder and does not need to be modeled by the language model. 2) We model timbre using global vectors, since timbre is a global attribute that changes slowly over time. 3) We further use a VQGAN-based acoustic model to generate the spectrogram and a latent code language model to fit the distribution of prosody, since prosody changes quickly over time within a sentence and language models can capture both local and long-range dependencies. We scale Mega-TTS to multi-domain datasets with 20K hours of speech and evaluate its performance on unseen speakers. Experimental results demonstrate that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks, with superior naturalness, robustness, and speaker similarity due to the proper inductive bias of each module. Audio samples are available at https://mega-tts.github.io/demo-page.
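The contrast the abstract draws between a global timbre vector and fast-changing prosody codes can be illustrated with a toy sketch. This is not the Mega-TTS implementation: the real system uses learned encoders and a VQGAN-based acoustic model, whereas here mean pooling stands in for the timbre encoder and nearest-neighbor lookup in a fixed codebook stands in for vector quantization. The function names, frame values, and codebook are all hypothetical.

```python
from typing import List, Sequence

Frame = Sequence[float]


def global_timbre_vector(frames: List[Frame]) -> List[float]:
    """Mean-pool frame features over time into a single fixed-size vector.

    Assumption: averaging is only a stand-in for a learned timbre encoder.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def quantize_prosody(frames: List[Frame], codebook: List[Frame]) -> List[int]:
    """Map every frame to the index of its nearest codebook entry.

    Assumption: a fixed codebook and squared Euclidean distance stand in
    for a learned vector-quantization layer.
    """
    def sq_dist(a: Frame, b: Frame) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in frames
    ]


# Hypothetical 3-frame, 2-dimensional "spectrogram" and 2-entry codebook.
frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
codebook = [[0.0, 0.0], [4.0, 4.0]]

timbre = global_timbre_vector(frames)       # one vector per utterance
codes = quantize_prosody(frames, codebook)  # one discrete code per frame
```

The timbre representation has a fixed size regardless of utterance length, while the prosody representation is a sequence whose length follows the input. A discrete sequence of this kind is what a language model can be trained on.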
Community
"Text-to-speech, an integral part of the automatic speech recognition system, embodies the marvel of transforming written language into audible, human-like speech. This technological wonder transcends mere algorithms; it's a gateway to universal accessibility, bridging gaps and empowering those with visual impairments or language barriers to engage with content effortlessly.
The elegance of text-to-speech lies in its ability to not just convert text into audible output but to infuse it with emotive tones, cadences, and nuances akin to human speech. It stands as a testament to the remarkable advancements in artificial intelligence and natural language processing, where complex algorithms decode text structures, linguistic patterns, and phonetics to produce seamless spoken language.
This transformative technology extends beyond convenience; it embodies inclusivity by providing a voice to the written word, breathing life into literature, aiding in education, enabling accessibility in digital spaces, and enhancing user experiences across various applications.
As we witness the evolution of text-to-speech systems, we glimpse the potential to create more personalized, expressive, and immersive auditory experiences. In its convergence with AI, machine learning, and neural networks, text-to-speech holds the promise of a future where communication knows no bounds, fostering connections and understanding across diverse cultures and languages."
I am testing text-to-speech, which we also know as an automatic speech recognition system
thanks @fujindemi
Hello