arxiv:2501.15907

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Published on Jan 27
· Submitted by HarryHe11 on Jan 28

Abstract

The Emilia-Pipe preprocessing pipeline produces the Emilia and Emilia-Large datasets of high-quality, in-the-wild multilingual speech, which outperform audiobook datasets for speech generation.

AI-generated summary

Recent advancements in speech generation have been driven by large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech because they rely on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from valuable yet underexplored in-the-wild data capturing spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. In addition, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, capturing the diverse speaker timbres and speaking styles of real-world human speech more faithfully. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.

Community


Extended version of Emilia, submitted to TASLP. The initial 101k-hour version of Emilia has already been open-sourced on HuggingFace: https://huggingface.co/datasets/amphion/Emilia-Dataset.
We are now releasing an extended version with over 250k hours of speech data, coming soon on HuggingFace!

