Abstract
SALAD, a per-token latent diffusion model using continuous representations, achieves superior intelligibility in zero-shot text-to-speech without compromising speech quality and speaker similarity.
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation and extends it to generate variable-length outputs. Our approach uses semantic tokens to provide contextual information and to determine the stopping condition. We propose three continuous variants of our method, each extending a popular discrete speech synthesis technique. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.
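The per-token idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the backbone, the noise predictor, the noise schedule, and every name and constant below (`toy_backbone`, `toy_noise_predictor`, `EOS_TOKEN`, `NUM_STEPS`, and so on) are hypothetical stand-ins, and a generic DDPM-style ancestral sampler is assumed because the abstract does not specify one. The sketch only shows the control flow: at each position a small diffusion head denoises one continuous latent vector conditioned on context, and generation stops when a predicted semantic token equals an end-of-sequence id, which yields variable-length output.

```python
import math
import random

LATENT_DIM = 4    # assumed size of one continuous latent vector
NUM_STEPS = 10    # assumed number of reverse diffusion steps per token
VOCAB_SIZE = 20   # assumed semantic-token vocabulary size
EOS_TOKEN = 0     # hypothetical semantic token id that signals "stop"

# Linear beta schedule (an illustrative choice, not taken from the paper).
BETAS = [1e-4 + (0.2 - 1e-4) * i / (NUM_STEPS - 1) for i in range(NUM_STEPS)]
ALPHAS = [1.0 - b for b in BETAS]
ALPHA_BARS = []
_running = 1.0
for _a in ALPHAS:
    _running *= _a
    ALPHA_BARS.append(_running)


def toy_noise_predictor(x_t, t, context):
    """Untrained stand-in for a learned head eps_theta(x_t, t, context)."""
    return [math.tanh(0.5 * xi - 0.1 * ci + t / NUM_STEPS)
            for xi, ci in zip(x_t, context)]


def sample_latent(context, rng):
    """Denoise a single latent vector with DDPM-style ancestral sampling."""
    x = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
    for t in reversed(range(NUM_STEPS)):
        eps = toy_noise_predictor(x, t, context)
        coef = BETAS[t] / math.sqrt(1.0 - ALPHA_BARS[t])
        sigma = math.sqrt(BETAS[t]) if t > 0 else 0.0  # no noise on the last step
        x = [(xi - coef * ei) / math.sqrt(ALPHAS[t]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(x, eps)]
    return x


def toy_backbone(latents, rng):
    """Stand-in for an autoregressive model: returns (context, semantic token).

    The context is just the mean of previous latents, and the token is random;
    a real model would predict both from text and the generated prefix.
    """
    if latents:
        context = [sum(col) / len(latents) for col in zip(*latents)]
    else:
        context = [0.0] * LATENT_DIM
    return context, rng.randrange(VOCAB_SIZE)


def generate(max_len=50, seed=0):
    """Generate a variable-length list of latent vectors, stopping at EOS."""
    rng = random.Random(seed)
    latents = []
    for _ in range(max_len):
        context, token = toy_backbone(latents, rng)
        if token == EOS_TOKEN:  # semantic token determines the stopping condition
            break
        latents.append(sample_latent(context, rng))
    return latents
```

Calling `generate(max_len=30, seed=1)` returns a list of at most 30 latent vectors; the length differs between seeds because the stop decision comes from the (here random) semantic token rather than a fixed frame count.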
Community
Sample page is available here:
https://s3.us-south.objectstorage.softlayer.net/zk-wav-data/Webpages/PerTokenLatentDiffusion/index.html
Sounds cool! Weights would be cool too :)
Great work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer (2024)
- Single-stage TTS with Masked Audio Token Modeling and Semantic Knowledge Distillation (2024)
- Sample-Efficient Diffusion for Text-To-Speech Synthesis (2024)
- DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization (2024)
- Joint Semantic Knowledge Distillation and Masked Acoustic Modeling for Full-band Speech Restoration with Improved Intelligibility (2024)
 Avihu Dekel