---
license: cc-by-4.0
language:
- en
pipeline_tag: text-to-speech
tags:
- voxtream
- text-to-speech
---

# Model Card for VoXtream

VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

### Key features

- **Streaming**: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
- **Speed**: Runs **5x** faster than real-time and achieves **102 ms** first-packet latency on GPU.
- **Quality and efficiency**: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models and of models trained on larger datasets.

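To put the streaming and speed figures above in perspective, here is a minimal arithmetic sketch in Python. It uses only the numbers quoted in the list (80 ms chunks, 5x real-time, 102 ms first-packet latency). The function names are hypothetical and not part of the VoXtream codebase, and the per-chunk compute time assumes the 5x speedup holds uniformly in steady state, which the card does not claim.

```python
# Figures quoted in the feature list; everything derived from them is
# back-of-the-envelope arithmetic, not a measured result.
CHUNK_MS = 80.0          # audio duration of one output chunk
SPEEDUP = 5.0            # "5x faster than real-time"
FIRST_PACKET_MS = 102.0  # reported first-packet latency on GPU


def real_time_factor(speedup: float) -> float:
    """Compute time divided by audio duration; below 1.0 is faster than real-time."""
    return 1.0 / speedup


def compute_ms_per_chunk(chunk_ms: float, speedup: float) -> float:
    """Average compute time per chunk, ASSUMING the speedup is uniform across chunks."""
    return chunk_ms / speedup


def chunks_per_second(chunk_ms: float) -> float:
    """Number of chunks needed to cover one second of audio."""
    return 1000.0 / chunk_ms


if __name__ == "__main__":
    print(f"real-time factor:        {real_time_factor(SPEEDUP):.2f}")
    print(f"compute per 80 ms chunk: {compute_ms_per_chunk(CHUNK_MS, SPEEDUP):.1f} ms (assumed uniform)")
    print(f"chunks per audio second: {chunks_per_second(CHUNK_MS):.1f}")
    print(f"first-packet latency:    {FIRST_PACKET_MS:.0f} ms (reported)")
```

Under these assumptions, each 80 ms chunk would take roughly 16 ms of compute, leaving headroom for the stream to stay ahead of playback once the first packet has arrived.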
### Model Sources

- **Repository:** [repo](https://github.com/herimor/voxtream)
- **Paper:** [paper](https://herimor.github.io/voxtream)
- **Demo:** [demo](https://herimor.github.io/voxtream)

## Get started

Clone our [repo](https://github.com/herimor/voxtream) and follow the instructions in the README file.

### Out-of-Scope Use

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

## Training Data

The model was trained on a 9k-hour subset of the [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) and [HiFiTTS2](https://huggingface.co/datasets/nvidia/hifitts-2) datasets. For more details, see our paper.

## Citation

```
@article{torgashov2025voxtream,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal   = {arXiv},
  year      = {2025}
}
```