NariLabs committed
Commit 362ad2e · verified · 1 Parent(s): 7acb31c

Update README for dia2 runtime

Files changed (1)
  1. README.md +56 -49
README.md CHANGED
@@ -6,58 +6,65 @@ pipeline_tag: text-to-speech
  ---
  # Dia2-2B

- This repo holds the inference assets for the Dia2-2B model.
-
- This is a model for streaming text-to-speech (TTS) specialized for conversations. Following KyutaiTTS, our model starts to output audio as soon as the first few words of text have been given as input.
-
- ## Details
- The architecture is an RQ-transformer that receives tokenized text as input and outputs 32 Mimi audio tokens at 12.5 Hz. The backbone model is responsible for predicting (1) when a new word will start (binary classification), and (2) the first audio codebook.
-
- The depth transformer predicts the remaining 31 Mimi codes. This way, we can apply compute amortization (1/16) in the depth transformer, driving training time and memory usage down while maintaining output quality.
-
- The audio is shifted by 16 steps with respect to the text, and we use an acoustic/semantic delay of 2, following KyutaiTTS.
-
- - `config.json`: minimal runtime config consumed by `new_dia.config.load_config`.
- - `model.safetensors`: FP32 weights in the bias-free linear layout.
- - Tokenizer bundle (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
-
- ## Usage
-
  ```bash
- pip install -U torch transformers safetensors huggingface_hub
- uv run -m new_dia.cli \
-     --config nari-labs/Dia2-2B --weights nari-labs/Dia2-2B \
-     --out output.wav --cfg 2.0 --temperature 0.8 --dtype bfloat16
  ```

- Or via Python:
-
  ```python
- from new_dia.runtime.generator import TextToSpeechGenerator
- runtime = TextToSpeechGenerator.from_paths(
-     config_path="nari-labs/Dia2-2B",
-     weights_path="nari-labs/Dia2-2B",
-     device="cuda",
-     dtype="bfloat16",
  )
  ```
-
- Mimi codec weights are fetched from `kyutai/mimi` at runtime.
-
- ## Training Details
- The model was trained for 250k steps: batch size 512, segment duration of 120 seconds, and a 20% chance of unconditional training for CFG at inference time.
-
- We use approximately 800k hours of English dialogue and monologue data in order to model conversational prosody.
-
- The training took ~5 days on a TPU v5p-64 graciously provided by the TPU Research Cloud ([TRC](https://sites.research.google/trc/about/)).
-
- ## Misuse and Abuse
- This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:
-
- - **Identity Misuse**: Do not produce audio resembling real individuals without permission.
- - **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news).
- - **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.
-
- By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are **not responsible** for any misuse and firmly oppose any unethical usage of this technology.
-
- **Authors**: Toby Kim, Jay Sung, and the Nari Labs team.
 
  ---
  # Dia2-2B

+ Dia2-2B is a streaming dialogue TTS model built on top of Mimi RQ codes. The bundle here contains everything the open-source `dia2` runtime needs at inference time.
+
+ ## Contents
+ - `config.json`: parsed by `dia2.config.load_config` (includes `runtime.max_context_steps = 1500`).
+ - `model.safetensors`: decoder/depformer/linear weights (bias-free layout).
+ - Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
+ - `dia2_assets.json`: helper manifest that points Dia2 at the tokenizer and the Mimi codec repo (`kyutai/mimi`).
+
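+ For orientation, here is a minimal, hypothetical sketch of fetching these files with `huggingface_hub` and peeking at the manifest; the `dia2` runtime does this for you, and the manifest's exact schema is an assumption here, not documented behaviour.
+
+ ```python
+ # Illustrative only: download the bundle files listed above and inspect the
+ # helper manifest. The manifest's contents are assumed, not a documented schema.
+ import json
+ from huggingface_hub import hf_hub_download
+
+ config_path = hf_hub_download("nari-labs/Dia2-2B", "config.json")
+ manifest_path = hf_hub_download("nari-labs/Dia2-2B", "dia2_assets.json")
+
+ with open(manifest_path) as f:
+     manifest = json.load(f)  # expected to point at the tokenizer and kyutai/mimi
+ print(config_path)
+ print(manifest)
+ ```
+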
+ ## Quickstart
  ```bash
+ # 1) Grab the runtime
+ git clone https://github.com/nari-labs/dia2.git
+ cd dia2
+ uv sync
+
+ # 2) Generate audio
+ uv run -m dia2.cli \
+     --hf nari-labs/Dia2-2B \
+     --input input.txt \
+     --dtype bfloat16 \
+     --cfg 6.0 --temperature 0.8 \
+     --cuda-graph --verbose \
+     output.wav
  ```
+ The first invocation downloads this HF bundle plus Mimi automatically. Use `--prefix-speaker-1/2` to warm up with reference voices.
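+
+ As an illustration of the `--input` file, here is a hypothetical two-speaker script; the `[S1]`/`[S2]` tags match the Python example below, but the exact file format is an assumption.
+
+ ```python
+ # Write a hypothetical two-speaker script for --input (file format assumed).
+ with open("input.txt", "w") as f:
+     f.write("[S1] Hey, did you try the new Dia2 runtime?\n")
+     f.write("[S2] I did, the streaming output kicks in almost immediately.\n")
+ ```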
+
+ ## Python API
  ```python
+ from dia2 import Dia2, GenerationConfig, SamplingConfig
+
+ dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")
+ config = GenerationConfig(
+     cfg_scale=6.0,
+     audio=SamplingConfig(temperature=0.8, top_k=50),
+     use_cuda_graph=True,
  )
+ result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
  ```
+ Generation runs until EOS or the config-driven `max_context_steps` (1500 in this bundle).
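+
+ As a quick sanity check on that limit (assuming the 12.5 Hz Mimi frame rate noted for Dia2's audio tokens), 1500 steps corresponds to roughly two minutes of audio:
+
+ ```python
+ # Rough duration budget implied by this bundle's runtime config.
+ MAX_CONTEXT_STEPS = 1500   # runtime.max_context_steps from config.json
+ FRAME_RATE_HZ = 12.5       # Mimi token frame rate (assumed from the model card)
+ print(MAX_CONTEXT_STEPS / FRAME_RATE_HZ)  # -> 120.0 seconds
+ ```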
+
+ ## Training Notes
+ The architecture follows KyutaiTTS: a text decoder predicts word boundaries and codebook 0, while a depformer generates the remaining 31 Mimi codebooks with compute amortization (1/16). Audio is delayed 16 frames relative to the text, with a 2-frame semantic offset. Dia2-2B was trained for 250k steps (batch size 512, 120 s segments, 20% unconditional CFG) on ~800k hours of conversational English using a TPU v5p-64.
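+
+ For intuition, a minimal sketch of the frame alignment described above (the indexing convention is an assumption for illustration, not the actual `dia2` implementation):
+
+ ```python
+ # Illustrative frame offsets: audio lags text by 16 frames, and the acoustic
+ # codebooks (1..31) lag the semantic codebook (0) by a further 2 frames.
+ TEXT_TO_AUDIO_DELAY = 16
+ ACOUSTIC_DELAY = 2
+
+ def codebook0_frame(text_step: int) -> int:
+     """Frame at which codebook 0 for a given text step is generated."""
+     return text_step + TEXT_TO_AUDIO_DELAY
+
+ def acoustic_frame(semantic_frame: int) -> int:
+     """Frame at which codebooks 1..31 for a given semantic frame are generated."""
+     return semantic_frame + ACOUSTIC_DELAY
+
+ print(codebook0_frame(0))    # -> 16
+ print(acoustic_frame(16))    # -> 18
+ ```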
+
+ ## Safety
+ This model is provided for research and prototyping. Do **not** impersonate real people, generate deceptive content, or deploy for illegal/malicious purposes. Obtain explicit consent before cloning any real voice. You are responsible for complying with local laws and platform policies.
+
+ **Authors**: Toby Kim, Jay Sung, and the Nari Labs team.