---
license: apache-2.0
language:
  - en
pipeline_tag: text-to-speech
---

# Dia2-2B

Dia2-2B is a streaming dialogue text-to-speech model built on top of Mimi residual-quantizer (RQ) codes. This bundle contains everything the open-source dia2 runtime needs at inference time.

## Contents

- `config.json` — parsed by `dia2.config.load_config` (includes `runtime.max_context_steps = 1500`; see the loading sketch after this list).
- `model.safetensors` — decoder/depformer/linear weights (bias-free layout).
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
- `dia2_assets.json` — helper manifest that points dia2 at the tokenizer and the Mimi codec repo (`kyutai/mimi`).
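
As a quick sanity check you can load the bundled config directly. A minimal sketch, assuming `load_config` accepts a local file path and returns an object exposing the `runtime.max_context_steps` field named above (the exact signature and return type are assumptions):

```python
from huggingface_hub import hf_hub_download

from dia2.config import load_config  # named above; exact signature assumed

# Fetch only the config from this repo and parse it with the dia2 helper.
config_path = hf_hub_download("nari-labs/Dia2-2B", "config.json")
config = load_config(config_path)

# runtime.max_context_steps caps generation length (1500 in this bundle).
print(config.runtime.max_context_steps)  # expected: 1500
```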

## Quickstart

```bash
# 1) Grab the runtime
git clone https://github.com/nari-labs/dia2.git
cd dia2
uv sync

# 2) Generate audio
uv run -m dia2.cli \
  --hf nari-labs/Dia2-2B \
  --input input.txt \
  --dtype bfloat16 \
  --cfg 6.0 --temperature 0.8 \
  --cuda-graph --verbose \
  output.wav
```

The first invocation automatically downloads this Hugging Face bundle plus the Mimi codec. Use `--prefix-speaker-1` / `--prefix-speaker-2` to warm up generation with reference voices, as sketched below.
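
A hedged sketch of voice prompting, assuming the prefix flags take paths to reference audio for speakers `[S1]` and `[S2]` (the argument format is an assumption; check the CLI's `--help`):

```bash
# ref_s1.wav / ref_s2.wav are hypothetical reference clips;
# the flags' argument format is assumed, not confirmed.
uv run -m dia2.cli \
  --hf nari-labs/Dia2-2B \
  --input dialogue.txt \
  --prefix-speaker-1 ref_s1.wav \
  --prefix-speaker-2 ref_s2.wav \
  output.wav
```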

## Python API

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

# Download the bundle (and the Mimi codec) and build the model on GPU.
dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")

config = GenerationConfig(
    cfg_scale=6.0,                                    # classifier-free guidance strength
    audio=SamplingConfig(temperature=0.8, top_k=50),  # sampling for the audio codebooks
    use_cuda_graph=True,                              # capture decode steps in a CUDA graph
)

result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
```

Generation runs until EOS or the config-driven `max_context_steps` (1500 in this bundle); at Mimi's 12.5 Hz frame rate, that cap corresponds to roughly 120 seconds of audio.

## Training Notes

The architecture follows KyutaiTTS: a text decoder predicts word boundaries and codebook 0, while a depformer generates the remaining 31 Mimi codebooks with compute amortization (1/16). Audio is delayed 16 frames relative to text, with a 2-frame semantic offset. Dia2-2B was trained for 250k steps (batch size 512, 120 s segments, 20% unconditional CFG) on roughly 800k hours of conversational English using a TPU v5p-64.
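
To make the stream alignment concrete, here is a minimal sketch (not the dia2 implementation) under one reading of the offsets above: codebook 0 lags the text stream by 16 frames, and the 31 acoustic codebooks lag codebook 0 by a further 2 frames. The exact direction of the semantic offset is an assumption.

```python
# Assumed delays, in decoder steps, relative to the text stream.
TEXT_DELAY = 0
SEMANTIC_DELAY = 16      # codebook 0 vs. text (assumption)
ACOUSTIC_DELAY = 16 + 2  # codebooks 1-31 vs. text (assumption)

def source_frame(step: int, delay: int) -> int | None:
    """Source frame a stream emits at a decoder step (None = still padding)."""
    return step - delay if step >= delay else None

for step in (0, 15, 16, 17, 18):
    print(
        f"step {step:2d}: text={source_frame(step, TEXT_DELAY)}, "
        f"codebook0={source_frame(step, SEMANTIC_DELAY)}, "
        f"codebooks1-31={source_frame(step, ACOUSTIC_DELAY)}"
    )
```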

## Safety

This model is provided for research and prototyping. Do not impersonate real people, generate deceptive content, or deploy for illegal/malicious purposes. Obtain explicit consent before cloning any real voice. You are responsible for complying with local laws and platform policies.

Authors: Toby Kim, Jay Sung, and the Nari Labs team.