---
license: apache-2.0
language:
- en
pipeline_tag: text-to-speech
---
# Dia2-2B
Dia2-2B is a streaming dialogue TTS model built on top of Mimi residual-quantizer (RQ) codes. This bundle contains everything the open-source dia2 runtime needs at inference time.
## Contents
- `config.json` — parsed by `dia2.config.load_config` (includes `runtime.max_context_steps = 1500`).
- `model.safetensors` — decoder/depformer/linear weights (bias-free layout).
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
- `dia2_assets.json` — helper manifest that points Dia2 at the tokenizer and the Mimi codec repo (`kyutai/mimi`).
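
To sanity-check the bundle without installing the runtime, the files can be fetched and inspected directly. A minimal sketch using the standard `huggingface_hub` download API (inside the runtime itself, `dia2.config.load_config` handles the actual parsing):

```python
# Minimal sketch: fetch and inspect the bundle files without the dia2 runtime.
# Uses the standard huggingface_hub API; key paths follow the layout listed above.
import json

from huggingface_hub import hf_hub_download

config_path = hf_hub_download(repo_id="nari-labs/Dia2-2B", filename="config.json")
with open(config_path) as f:
    config = json.load(f)

# runtime.max_context_steps caps generation length (1500 in this bundle).
print(config["runtime"]["max_context_steps"])

assets_path = hf_hub_download(repo_id="nari-labs/Dia2-2B", filename="dia2_assets.json")
with open(assets_path) as f:
    print(json.load(f))  # manifest pointing at the tokenizer and kyutai/mimi
```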
## Quickstart
```bash
# 1) Grab the runtime
git clone https://github.com/nari-labs/dia2.git
cd dia2
uv sync

# 2) Generate audio
uv run -m dia2.cli \
  --hf nari-labs/Dia2-2B \
  --input input.txt \
  --dtype bfloat16 \
  --cfg 6.0 --temperature 0.8 \
  --cuda-graph --verbose \
  output.wav
```
The first invocation downloads this HF bundle plus Mimi automatically. Use `--prefix-speaker-1`/`--prefix-speaker-2` to warm up with reference voices.
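
The `--input` file is a plain-text transcript. A minimal sketch of preparing one, assuming the `[S1]`/`[S2]` speaker tags from the Python example below mark dialogue turns (the exact line format is an assumption, not documented here):

```python
# Hypothetical input.txt: [S1]/[S2] tags assumed to mark speaker turns,
# by analogy with the "[S1] Hello Dia2!" prompt in the Python API example.
from pathlib import Path

transcript = "\n".join(
    [
        "[S1] Hey, have you tried the new streaming TTS runtime?",
        "[S2] Not yet. Does it really generate audio as the text arrives?",
        "[S1] It does, and it handles two-speaker dialogue out of the box.",
    ]
)
Path("input.txt").write_text(transcript)
```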
## Python API
```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")

config = GenerationConfig(
    cfg_scale=6.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)

result = dia.generate(
    "[S1] Hello Dia2!",
    config=config,
    output_wav="hello.wav",
    verbose=True,
)
```
Generation runs until EOS or the config-driven `max_context_steps` (1500 in this bundle).
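
Because `Dia2.from_repo` loads the model once, the same instance can be reused across calls. A small sketch using only the calls shown above (output filenames are illustrative):

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

# Load once, then reuse the same model instance for several utterances.
dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")
config = GenerationConfig(
    cfg_scale=6.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)

for i, text in enumerate(["[S1] First take.", "[S2] Second take."]):
    # Same dia.generate call as above; only the output path changes.
    dia.generate(text, config=config, output_wav=f"take_{i}.wav")
```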
## Training Notes
The architecture follows KyutaiTTS: a text decoder predicts word boundaries and codebook 0, while a depformer generates the remaining 31 Mimi codebooks with compute amortization (1/16). Audio is delayed 16 frames relative to text, with a 2-frame semantic offset. Dia2-2B was trained for 250k steps (batch 512, 120 s segments, 20% unconditional CFG) on ~800k hours of conversational English using TPU v5p-64.
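
As an illustration only, here is one plausible reading of the stream delays described above, assuming the 31 acoustic codebooks trail the semantic codebook 0 by the 2-frame offset (this interpretation is not confirmed by the card):

```python
# Illustrative sketch, not training code: per-stream delays relative to text,
# under one reading of "16-frame audio delay with a 2-frame semantic offset".
AUDIO_DELAY = 16      # audio frames trail the text stream by 16 steps
SEMANTIC_OFFSET = 2   # assumed: acoustic codebooks trail codebook 0 by 2 frames
NUM_CODEBOOKS = 32    # 1 semantic + 31 acoustic Mimi codebooks

def stream_delay(codebook: int) -> int:
    """Delay (in frames) of an audio stream relative to the text stream."""
    if codebook == 0:  # semantic codebook, predicted by the text decoder
        return AUDIO_DELAY
    return AUDIO_DELAY + SEMANTIC_OFFSET  # acoustic codebooks, from the depformer

# At decoding step t, stream k emits the token for audio frame t - stream_delay(k).
for k in (0, 1, NUM_CODEBOOKS - 1):
    print(f"codebook {k}: frame index at step 40 -> {40 - stream_delay(k)}")
```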
## Safety
This model is provided for research and prototyping. Do not impersonate real people, generate deceptive content, or deploy for illegal/malicious purposes. Obtain explicit consent before cloning any real voice. You are responsible for complying with local laws and platform policies.
Authors: Toby Kim, Jay Sung, and the Nari Labs team.