Update README for dia2 runtime
README.md

---
pipeline_tag: text-to-speech
---

# Dia2-2B

Dia2-2B is a streaming dialogue TTS model built on top of Mimi RQ codes. The bundle here contains everything the open-source `dia2` runtime needs at inference time.

## Contents

- `config.json` — parsed by `dia2.config.load_config` (includes `runtime.max_context_steps = 1500`); see the loading sketch after this list.
- `model.safetensors` — decoder/depformer/linear weights (bias-free layout).
- Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
- `dia2_assets.json` — helper manifest that points Dia2 at the tokenizer and Mimi codec repo (`kyutai/mimi`).
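
If you only want to inspect the bundle without running generation, here is a minimal sketch. It assumes `load_config` accepts a local path to `config.json` and returns an object exposing `runtime.max_context_steps`; this card only names the function, not its signature. `snapshot_download` is the standard `huggingface_hub` call.

```python
# Sketch: download this bundle and inspect the parsed runtime config.
# Assumption: load_config takes a path to config.json (the card only
# names dia2.config.load_config, not its exact signature).
from huggingface_hub import snapshot_download

from dia2.config import load_config

bundle_dir = snapshot_download("nari-labs/Dia2-2B")
config = load_config(f"{bundle_dir}/config.json")
print(config.runtime.max_context_steps)  # expected: 1500 for this bundle
```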

## Quickstart

```bash
# 1) Grab the runtime
git clone https://github.com/nari-labs/dia2.git
cd dia2
uv sync

# 2) Generate audio
uv run -m dia2.cli \
  --hf nari-labs/Dia2-2B \
  --input input.txt \
  --dtype bfloat16 \
  --cfg 6.0 --temperature 0.8 \
  --cuda-graph --verbose \
  output.wav
```
The first invocation downloads this HF bundle plus Mimi automatically. Use `--prefix-speaker-1`/`--prefix-speaker-2` to warm up with reference voices.
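
`input.txt` holds the dialogue transcript. A minimal example is below; the `[S1]` tag appears in the Python snippet later in this card, and the `[S2]` tag is an assumption based on the two `--prefix-speaker` flags:

```
[S1] Hey, have you tried the new dia2 runtime yet?
[S2] Not yet. Does it really stream audio while it generates?
[S1] It does, and it renders the whole dialogue in one pass.
```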

## Python API

```python
from dia2 import Dia2, GenerationConfig, SamplingConfig

dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")
config = GenerationConfig(
    cfg_scale=6.0,
    audio=SamplingConfig(temperature=0.8, top_k=50),
    use_cuda_graph=True,
)
result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
```
Generation runs until EOS or the config-driven `max_context_steps` (1500 in this bundle).
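
As a rough sanity check on that limit: assuming Mimi's usual 12.5 Hz frame rate (which this card does not state), 1500 steps works out to the same 120 s window the model saw in training.

```python
# Context-length math, assuming one step = one Mimi frame at 12.5 Hz
# (the frame rate is an assumption; max_context_steps is from config.json).
max_context_steps = 1500
mimi_frame_rate_hz = 12.5
print(max_context_steps / mimi_frame_rate_hz)  # 120.0 s, matching the training segments
```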

## Training Notes

The architecture follows KyutaiTTS: a text decoder predicts word boundaries and codebook 0, while a depformer generates the remaining 31 Mimi codebooks with compute amortization (1/16; see the sketch below). Audio is delayed 16 frames relative to text, with a 2-frame semantic offset. Dia2-2B was trained for 250k steps (batch size 512, 120 s segments, 20% unconditional samples for CFG) on ~800k hours of conversational English using TPU v5p-64.
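
To make the 1/16 amortization concrete, here is a hedged sketch of one way to pick the depformer's training positions. The uniform-random choice is an assumption; the card states only the ratio.

```python
# Sketch: compute the depformer loss on only 1/16 of the timesteps in a
# segment, amortizing its cost. Uniform-random selection is an assumption.
import torch

num_steps = 1500                                      # frames in a training segment
keep = torch.randperm(num_steps)[: num_steps // 16]   # positions given depformer loss
print(keep.numel())                                   # 93 of 1500 steps
```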

## Safety

This model is provided for research and prototyping. Do **not** impersonate real people, generate deceptive content, or deploy it for illegal or malicious purposes. Obtain explicit consent before cloning any real voice. You are responsible for complying with local laws and platform policies.

**Authors**: Toby Kim, Jay Sung, and the Nari Labs team.