license: cc-by-4.0
language:
  - en
pipeline_tag: text-to-speech
tags:
  - voxtream
  - text-to-speech

Model Card for VoXtream

VoXtream is a fully autoregressive, zero-shot streaming text-to-speech system for real-time use that begins speaking from the first word.

Key features

  • Streaming: Supports a full-stream scenario, where the full sentence is not known in advance. The model takes a text stream arriving word by word as input and outputs an audio stream in 80 ms chunks.
  • Speed: Runs 5x faster than real time and achieves 102 ms first-packet latency on GPU.
  • Quality and efficiency: With only 9k hours of training data, it matches or surpasses the quality and intelligibility of larger models or models trained on large datasets.
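To make the streaming and speed figures above concrete, here is a back-of-the-envelope sketch of what they imply per audio chunk. It is not part of the VoXtream codebase: the function names are hypothetical, the 24 kHz sample rate is an assumption that this card does not state, and the 5x figure is treated as a uniform per-chunk speedup, which is a simplification.

```python
# Figures taken from this model card.
CHUNK_MS = 80          # audio is emitted in 80 ms chunks
SPEEDUP = 5.0          # "5x faster than real time"
FIRST_PACKET_MS = 102  # first-packet latency on GPU

# ASSUMPTION: the output sample rate is not stated in this card;
# 24 kHz is used purely for illustration.
SAMPLE_RATE_HZ = 24_000


def compute_budget_ms(chunk_ms: int, speedup: float) -> float:
    """Average compute time per chunk if generation runs `speedup` times
    faster than real time (assumes the speedup is uniform across chunks)."""
    return chunk_ms / speedup


def samples_per_chunk(chunk_ms: int, sample_rate_hz: int) -> int:
    """Number of audio samples in one chunk at the given sample rate."""
    return chunk_ms * sample_rate_hz // 1000


def chunks_for(duration_ms: int, chunk_ms: int) -> int:
    """Number of chunks needed to cover `duration_ms` of audio (rounded up)."""
    return -(-duration_ms // chunk_ms)


if __name__ == "__main__":
    print("compute per chunk (ms):", compute_budget_ms(CHUNK_MS, SPEEDUP))
    print("samples per chunk:", samples_per_chunk(CHUNK_MS, SAMPLE_RATE_HZ))
    print("chunks in 10 s of audio:", chunks_for(10_000, CHUNK_MS))
```

Under these assumptions, each 80 ms chunk takes about 16 ms to generate on average, which leaves headroom for playback to stay ahead of generation once the first packet has arrived.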

Model Sources

Get started

Clone our repo and follow the instructions in its README file.

Out-of-Scope Use

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.

Training Data

The model was trained on a 9k-hour subset of the Emilia and HiFiTTS2 datasets. For more details, please check our paper.

Citation

@article{torgashov2025voxtream,
  author    = {Torgashov, Nikita and Henter, Gustav Eje and Skantze, Gabriel},
  title     = {Vo{X}tream: Full-Stream Text-to-Speech with Extremely Low Latency},
  journal   = {arXiv},
  year      = {2025}
}