NariLabs committed
Commit 362ad2e · verified · 1 Parent(s): 7acb31c

Update README for dia2 runtime

Files changed (1)
  1. README.md +56 -49
README.md CHANGED
@@ -6,58 +6,65 @@ pipeline_tag: text-to-speech
  ---
  # Dia2-2B

- This repo holds the inference assets for the Dia2-2B model.
-
- This is a model for streaming text-to-speech (TTS) specialized for conversations. Following KyutaiTTS, our model starts to output audio as soon as the first few words of text have been given as input.
-
- ## Details
- The architecture is an RQ-transformer that receives tokenized text as input and outputs 32 Mimi audio tokens at 12.5 Hz. The backbone model is responsible for predicting (1) when a new word will start (binary classification), and (2) the first audio codebook.
-
- The depth transformer predicts the remaining 31 Mimi codes. This way, we can apply compute amortization (1/16) in the depth transformer, driving training time and memory usage down while maintaining output quality.
-
- The audio is shifted by 16 steps with respect to the text, and we use an acoustic/semantic delay of 2, following KyutaiTTS.
-
- - `config.json`: minimal runtime config consumed by `new_dia.config.load_config`.
- - `model.safetensors`: FP32 weights in the bias-free linear layout.
- - Tokenizer bundle (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
-
- ## Usage
-
  ```bash
- pip install -U torch transformers safetensors huggingface_hub
- uv run -m new_dia.cli \
-     --config nari-labs/Dia2-2B --weights nari-labs/Dia2-2B \
-     --out output.wav --cfg 2.0 --temperature 0.8 --dtype bfloat16
  ```

- Or via Python:
-
  ```python
- from new_dia.runtime.generator import TextToSpeechGenerator
- runtime = TextToSpeechGenerator.from_paths(
-     config_path="nari-labs/Dia2-2B",
-     weights_path="nari-labs/Dia2-2B",
-     device="cuda",
-     dtype="bfloat16",
  )
  ```
-
- Mimi codec weights are fetched from `kyutai/mimi` at runtime.
-
- ## Training Details
- The model was trained for 250k steps: batch size 512, segment duration of 120 seconds, and a 20% chance of unconditional training for CFG at inference time.
-
- We use approximately 800k hours of English dialogue and monologue data in order to model conversational prosody.
-
- The training took ~5 days on a TPU v5p-64 graciously provided by the TPU Research Cloud ([TRC](https://sites.research.google/trc/about/)).
-
- ## Misuse and Abuse
- This project offers a high-fidelity speech generation model intended for research and educational use. The following uses are strictly forbidden:
-
- - **Identity Misuse**: Do not produce audio resembling real individuals without permission.
- - **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news).
- - **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.
-
- By using this model, you agree to uphold relevant legal standards and ethical responsibilities. We are **not responsible** for any misuse and firmly oppose any unethical usage of this technology.
-
- **Authors**: Toby Kim, Jay Sung, and the Nari Labs team.
 
  ---
  # Dia2-2B

+ Dia2-2B is a streaming dialogue TTS model built on top of Mimi RQ codes. The bundle here contains everything the open-source `dia2` runtime needs at inference time.
+
+ ## Contents
+ - `config.json`: parsed by `dia2.config.load_config` (includes `runtime.max_context_steps = 1500`).
+ - `model.safetensors`: decoder/depformer/linear weights (bias-free layout).
+ - Tokenizer files (`tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.json`, `merges.txt`, `added_tokens.json`).
+ - `dia2_assets.json`: helper manifest that points Dia2 at the tokenizer and the Mimi codec repo (`kyutai/mimi`).
+
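+ For orientation, here is a minimal, hypothetical sketch of fetching these files with `huggingface_hub` and peeking at the manifest; the `dia2` runtime does this for you, and the manifest's exact schema is an assumption here, not documented behaviour.
+
+ ```python
+ # Illustrative only: download the bundle files listed above and inspect the
+ # helper manifest. The manifest's contents are assumed, not a documented schema.
+ import json
+ from huggingface_hub import hf_hub_download
+
+ config_path = hf_hub_download("nari-labs/Dia2-2B", "config.json")
+ manifest_path = hf_hub_download("nari-labs/Dia2-2B", "dia2_assets.json")
+
+ with open(manifest_path) as f:
+     manifest = json.load(f)  # expected to point at the tokenizer and kyutai/mimi
+ print(config_path)
+ print(manifest)
+ ```
+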
+ ## Quickstart
  ```bash
+ # 1) Grab the runtime
+ git clone https://github.com/nari-labs/dia2.git
+ cd dia2
+ uv sync
+
+ # 2) Generate audio
+ uv run -m dia2.cli \
+     --hf nari-labs/Dia2-2B \
+     --input input.txt \
+     --dtype bfloat16 \
+     --cfg 6.0 --temperature 0.8 \
+     --cuda-graph --verbose \
+     output.wav
  ```
+ The first invocation downloads this HF bundle plus Mimi automatically. Use `--prefix-speaker-1/2` to warm up with reference voices.
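+
+ As an illustration of the `--input` file, here is a hypothetical two-speaker script; the `[S1]`/`[S2]` tags match the Python example below, but the exact file format is an assumption.
+
+ ```python
+ # Write a hypothetical two-speaker script for --input (file format assumed).
+ with open("input.txt", "w") as f:
+     f.write("[S1] Hey, did you try the new Dia2 runtime?\n")
+     f.write("[S2] I did, the streaming output kicks in almost immediately.\n")
+ ```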
+
+ ## Python API
  ```python
+ from dia2 import Dia2, GenerationConfig, SamplingConfig
+
+ dia = Dia2.from_repo("nari-labs/Dia2-2B", device="cuda", dtype="bfloat16")
+ config = GenerationConfig(
+     cfg_scale=6.0,
+     audio=SamplingConfig(temperature=0.8, top_k=50),
+     use_cuda_graph=True,
  )
+ result = dia.generate("[S1] Hello Dia2!", config=config, output_wav="hello.wav", verbose=True)
  ```
+ Generation runs until EOS or the config-driven `max_context_steps` (1500 in this bundle).
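+
+ As a quick sanity check on that limit (assuming the 12.5 Hz Mimi frame rate noted for Dia2's audio tokens), 1500 steps corresponds to roughly two minutes of audio:
+
+ ```python
+ # Rough duration budget implied by this bundle's runtime config.
+ MAX_CONTEXT_STEPS = 1500   # runtime.max_context_steps from config.json
+ FRAME_RATE_HZ = 12.5       # Mimi token frame rate (assumed from the model card)
+ print(MAX_CONTEXT_STEPS / FRAME_RATE_HZ)  # -> 120.0 seconds
+ ```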
+
+ ## Training Notes
+ The architecture follows KyutaiTTS: a text decoder predicts word boundaries and codebook 0, while a depformer generates the remaining 31 Mimi codebooks with compute amortization (1/16). Audio is delayed 16 frames relative to the text, with a 2-frame semantic offset. Dia2-2B was trained for 250k steps (batch size 512, 120 s segments, 20% unconditional CFG) on ~800k hours of conversational English using a TPU v5p-64.
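+
+ For intuition, a minimal sketch of the frame alignment described above (the indexing convention is an assumption for illustration, not the actual `dia2` implementation):
+
+ ```python
+ # Illustrative frame offsets: audio lags text by 16 frames, and the acoustic
+ # codebooks (1..31) lag the semantic codebook (0) by a further 2 frames.
+ TEXT_TO_AUDIO_DELAY = 16
+ ACOUSTIC_DELAY = 2
+
+ def codebook0_frame(text_step: int) -> int:
+     """Frame at which codebook 0 for a given text step is generated."""
+     return text_step + TEXT_TO_AUDIO_DELAY
+
+ def acoustic_frame(semantic_frame: int) -> int:
+     """Frame at which codebooks 1..31 for a given semantic frame are generated."""
+     return semantic_frame + ACOUSTIC_DELAY
+
+ print(codebook0_frame(0))    # -> 16
+ print(acoustic_frame(16))    # -> 18
+ ```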
+
+ ## Safety
+ This model is provided for research and prototyping. Do **not** impersonate real people, generate deceptive content, or deploy for illegal/malicious purposes. Obtain explicit consent before cloning any real voice. You are responsible for complying with local laws and platform policies.
+
+ **Authors**: Toby Kim, Jay Sung, and the Nari Labs team.