╔════════════════════════════════════════════════════════════════════════════════╗
║              DETAILED SOURCE FILE LISTING BY CATEGORY                          ║
╚════════════════════════════════════════════════════════════════════════════════╝

MAIN INFERENCE PIPELINE FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/indextts/infer_v2.py (739 LINES) ⭐⭐⭐ CRITICAL
├─ Purpose: Main TTS inference class (IndexTTS2)
├─ Key Classes:
│  ├─ QwenEmotion (emotion text-to-vector conversion)
│  ├─ IndexTTS2 (main inference class)
│  └─ Helper functions for emotion/audio processing
├─ Key Methods:
│  ├─ __init__() - Initialize all models and codecs
│  ├─ infer() - Single text generation with emotion control
│  ├─ infer_fast() - Parallel segment generation
│  ├─ get_emb() - Extract semantic embeddings
│  ├─ remove_long_silence() - Silence token removal
│  ├─ insert_interval_silence() - Silence insertion
│  └─ Cache management for repeated generation
├─ Models Loaded:
│  ├─ UnifiedVoice (GPT model for mel token generation)
│  ├─ W2V-BERT (semantic feature extraction)
│  ├─ RepCodec (semantic codec)
│  ├─ S2Mel model (semantic-to-mel conversion)
│  ├─ CAMPPlus (speaker embedding)
│  ├─ BigVGAN vocoder
│  ├─ Qwen-based emotion model
│  └─ Emotion/speaker matrices
└─ External Dependencies: torch, transformers, librosa, safetensors
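
The silence handling listed above (remove_long_silence) can be pictured as capping runs of a silence token in the generated code sequence. The following is a minimal sketch under stated assumptions: the token ID and the run limit are hypothetical, and the real method works on tensors of mel codes rather than Python lists.

```python
from typing import List

SILENT_TOKEN = 52      # hypothetical silence token ID, for illustration only
MAX_CONSECUTIVE = 3    # hypothetical cap on consecutive silence tokens


def collapse_long_silence(codes: List[int],
                          silent_token: int = SILENT_TOKEN,
                          max_consecutive: int = MAX_CONSECUTIVE) -> List[int]:
    """Cap every run of `silent_token` at `max_consecutive` entries."""
    out: List[int] = []
    run = 0  # length of the current silence run
    for code in codes:
        if code == silent_token:
            run += 1
            if run > max_consecutive:
                continue  # drop the excess silence token
        else:
            run = 0  # any non-silence token ends the run
        out.append(code)
    return out
```

Shorter runs and all non-silence tokens pass through unchanged, so the function only ever shortens the sequence.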

/home/user/IndexTTS-Rust/webui.py (18KB) ⭐⭐⭐ WEB INTERFACE
├─ Purpose: Gradio-based web UI for IndexTTS
├─ Key Components:
│  ├─ Model initialization (IndexTTS2 instance)
│  ├─ Language selection (Chinese/English)
│  ├─ Emotion control modes (4 modes)
│  ├─ Example case loading from cases.jsonl
│  ├─ Progress bar integration
│  └─ Output management
├─ Features:
│  ├─ Real-time inference
│  ├─ Multiple emotion control methods
│  ├─ Batch processing
│  ├─ Task caching
│  ├─ i18n support
│  └─ Pre-loaded example cases
└─ Web Framework: Gradio 5.34.1

/home/user/IndexTTS-Rust/indextts/cli.py (64 LINES)
├─ Purpose: Command-line interface
├─ Usage: python -m indextts.cli <text> -v <voice.wav> -o <output.wav> [options]
├─ Arguments:
│  ├─ text: Text to synthesize
│  ├─ -v/--voice: Voice reference audio
│  ├─ -o/--output_path: Output file path
│  ├─ -c/--config: Config file path
│  ├─ --model_dir: Model directory
│  ├─ --fp16: Use FP16 precision
│  ├─ -d/--device: Device (cpu/cuda/mps/xpu)
│  └─ -f/--force: Force overwrite
└─ Uses: IndexTTS (v1 model)
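
The argument list above maps naturally onto an argparse parser. The sketch below only mirrors the flags named in this listing. The default values, the help strings, and the choice to make -v required are assumptions for illustration, not the actual contents of cli.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags named in the listing (defaults are hypothetical)."""
    parser = argparse.ArgumentParser(prog="indextts.cli")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("-v", "--voice", required=True,
                        help="Voice reference audio (assumed required here)")
    parser.add_argument("-o", "--output_path", default="gen.wav",
                        help="Output file path (default is an assumption)")
    parser.add_argument("-c", "--config", default="checkpoints/config.yaml",
                        help="Config file path (default is an assumption)")
    parser.add_argument("--model_dir", default="checkpoints",
                        help="Model directory (default is an assumption)")
    parser.add_argument("--fp16", action="store_true", help="Use FP16 precision")
    parser.add_argument("-d", "--device", default=None,
                        help="Device, e.g. cpu, cuda, mps or xpu")
    parser.add_argument("-f", "--force", action="store_true",
                        help="Overwrite the output file if it exists")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```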

TEXT PROCESSING & NORMALIZATION FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/indextts/utils/front.py (700 LINES) ⭐⭐⭐ CRITICAL
├─ Purpose: Text normalization and tokenization
├─ Key Classes:
│  ├─ TextNormalizer (700+ lines)
│  │  ├─ Pattern Definitions:
│  │  │  ├─ PINYIN_TONE_PATTERN (regex for pinyin with tones 1-5)
│  │  │  ├─ NAME_PATTERN (regex for Chinese names)
│  │  │  └─ ENGLISH_CONTRACTION_PATTERN (regex for 's contractions)
│  │  ├─ Methods:
│  │  │  ├─ normalize() - Main normalization
│  │  │  ├─ use_chinese() - Language detection
│  │  │  ├─ save_pinyin_tones() - Extract pinyin with tones
│  │  │  ├─ restore_pinyin_tones() - Restore pinyin
│  │  │  ├─ save_names() - Extract names
│  │  │  ├─ restore_names() - Restore names
│  │  │  ├─ correct_pinyin() - Pinyin correction (u/ü → v after j, q, x)
│  │  │  └─ char_rep_map - Character replacement dictionary
│  │  └─ Normalizers:
│  │     ├─ zh_normalizer (Chinese) - Uses WeTextProcessing/wetext
│  │     └─ en_normalizer (English) - Uses tn library
│  │
│  └─ TextTokenizer (200+ lines)
│     ├─ Methods:
│     │  ├─ encode() - Text to token IDs
│     │  ├─ decode() - Token IDs to text
│     │  ├─ convert_tokens_to_ids()
│     │  ├─ convert_ids_to_tokens()
│     │  └─ Vocab management
│     ├─ Special Tokens:
│     │  ├─ BOS: "<s>" (ID 0)
│     │  ├─ EOS: "</s>" (ID 1)
│     │  └─ UNK: "<unk>"
│     └─ Tokenizer: SentencePiece (BPE-based)
├─ Language Support:
│  ├─ Chinese (simplified & traditional)
│  ├─ English
│  └─ Mixed Chinese-English
└─ Critical Pattern Matching:
   ├─ Pinyin tone detection
   ├─ Name entity detection
   ├─ Email matching
   ├─ Character replacement
   └─ Punctuation handling
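
The save/restore pairs listed above follow a common protect-then-restore pattern: spans that the normalizer must not touch are swapped for placeholders and put back afterwards. A minimal sketch of that idea for pinyin with tone digits is shown here. The regex is a simplified stand-in for PINYIN_TONE_PATTERN, the placeholder format is hypothetical, and the function bodies are illustrative rather than the code in front.py.

```python
import re
from typing import List, Tuple

# Simplified stand-in: a run of Latin letters followed by a tone digit 1-5.
PINYIN_TONE = re.compile(r"(?<![A-Za-z])[A-Za-z]+[1-5](?![0-9A-Za-z])")


def save_pinyin_tones(text: str) -> Tuple[str, List[str]]:
    """Replace each pinyin+tone token with a numbered placeholder."""
    saved: List[str] = []

    def _stash(match: "re.Match[str]") -> str:
        saved.append(match.group(0))
        return f"<pinyin_{len(saved) - 1}>"  # hypothetical placeholder format

    return PINYIN_TONE.sub(_stash, text), saved


def restore_pinyin_tones(text: str, saved: List[str]) -> str:
    """Put the stashed pinyin tokens back in place of their placeholders."""
    for index, token in enumerate(saved):
        text = text.replace(f"<pinyin_{index}>", token)
    return text


def correct_pinyin(token: str) -> str:
    """Rewrite u/ü as v after the initials j, q, x (for example JU4 -> JV4)."""
    def _fix(match: "re.Match[str]") -> str:
        initial = match.group(1)
        return initial + ("V" if initial.isupper() else "v")

    return re.sub(r"^([jqxJQX])[uUüÜ]", _fix, token)
```

Normalization would run between the save and restore calls, so a token such as XUAN4 survives even if the normalizer rewrites digits.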

GPT MODEL ARCHITECTURE FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/indextts/gpt/model_v2.py (747 LINES) ⭐⭐⭐ CRITICAL
├─ Purpose: UnifiedVoice GPT-based TTS model
├─ Key Classes:
│  ├─ UnifiedVoice (700+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Input Embeddings: Text (256 vocab), Mel (8194 vocab)
│  │  │  ├─ Position Embeddings: Learned embeddings for mel/text
│  │  │  ├─ GPT Transformer: Configurable layers/heads
│  │  │  ├─ Conditioning Encoder: Conformer or Perceiver-based
│  │  │  ├─ Emotion Conditioning: Separate conformer + perceiver
│  │  │  └─ Output Heads: Text prediction, Mel prediction
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ layers: 8 (transformer depth)
│  │  │  ├─ model_dim: 512 (embedding dimension)
│  │  │  ├─ heads: 8 (attention heads)
│  │  │  ├─ max_text_tokens: 120
│  │  │  ├─ max_mel_tokens: 250
│  │  │  ├─ number_mel_codes: 8194
│  │  │  ├─ condition_type: "conformer_perceiver" or "conformer_encoder"
│  │  │  └─ Various activation functions
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Forward pass
│  │  │  ├─ post_init_gpt2_config() - Initialize for inference
│  │  │  ├─ generate_mel() - Mel token generation
│  │  │  ├─ forward_with_cond_scale() - With classifier-free guidance
│  │  │  └─ Cache management
│  │  │
│  │  └─ Conditioning System:
│  │     ├─ Speaker conditioning via mel spectrogram
│  │     ├─ Conformer encoder for speaker features
│  │     ├─ Perceiver for attention pooling
│  │     ├─ Emotion conditioning (separate pathway)
│  │     └─ Emotion vector support (8-dimensional)
│  │
│  ├─ ResBlock (40+ lines)
│  │  ├─ Conv1d layers with GroupNorm
│  │  └─ ReLU activation with residual connection
│  │
│  ├─ GPT2InferenceModel (200+ lines)
│  │  ├─ Inference wrapper for GPT2
│  │  ├─ KV cache support
│  │  ├─ Model parallelism support
│  │  └─ Token-by-token generation
│  │
│  ├─ ConditioningEncoder (30 lines)
│  │  ├─ Conv1d initialization
│  │  ├─ Attention blocks
│  │  └─ Optional mean pooling
│  │
│  ├─ MelEncoder (30 lines)
│  │  ├─ Conv1d layers
│  │  ├─ ResBlocks
│  │  └─ 4x reduction
│  │
│  ├─ LearnedPositionEmbeddings (15 lines)
│  │  └─ Learnable positional embeddings
│  │
│  └─ build_hf_gpt_transformer() (20 lines)
│     └─ Builds HuggingFace GPT2 with custom embeddings
│
├─ External Dependencies: torch, transformers, indextts.gpt modules
└─ Critical Inference Parameters:
   ├─ Temperature control for generation
   ├─ Top-k/top-p sampling
   ├─ Classifier-free guidance scale
   └─ Generation length limits
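
The temperature, top-k and top-p controls named above interact in a fixed order during autoregressive decoding. The sketch below shows one common way to combine them for a single decoding step, in pure Python for readability. It is a generic illustration, not the sampling code used by the model, which presumably relies on the Hugging Face generation utilities.

```python
import math
import random
from typing import List, Optional


def sample_token(logits: List[float], temperature: float = 1.0,
                 top_k: int = 0, top_p: float = 1.0,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token ID; `temperature` must be positive."""
    rng = rng or random.Random()

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token IDs from most to least likely, then apply top-k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Nucleus (top-p) filtering: keep the smallest prefix whose mass reaches top_p.
    kept: List[int] = []
    mass = 0.0
    for index in ranked:
        kept.append(index)
        mass += probs[index]
        if mass >= top_p:
            break

    # Draw from the kept tokens; `choices` renormalizes the weights itself.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

Lower temperatures sharpen the distribution before filtering, so the same top_p keeps fewer candidates.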

/home/user/IndexTTS-Rust/indextts/gpt/conformer_encoder.py (520 LINES) ⭐⭐
├─ Purpose: Conformer-based speaker conditioning encoder
├─ Key Classes:
│  ├─ ConformerEncoder (main)
│  │  ├─ Modules:
│  │  │  ├─ Subsampling layer (Conv2d)
│  │  │  ├─ Positional encoding
│  │  │  ├─ Conformer blocks
│  │  │  ├─ Layer normalization
│  │  │  └─ Optional projection layer
│  │  │
│  │  ├─ Configuration Parameters:
│  │  │  ├─ input_size: 1024 (mel spectrogram bins)
│  │  │  ├─ output_size: depends on config
│  │  │  ├─ linear_units: hidden dim for FFN
│  │  │  ├─ attention_heads: 8
│  │  │  ├─ num_blocks: 4
│  │  │  └─ input_layer: "linear" or "conv2d"
│  │  │
│  │  └─ Architecture: Conv → Pos Enc → [Conformer Block] * N → LayerNorm
│  │
│  ├─ ConformerBlock (80+ lines)
│  │  ├─ Residual connections
│  │  ├─ FFN → Attention → Conv → FFN structure
│  │  ├─ Feed-forward network (2-layer with dropout)
│  │  ├─ Multi-head self-attention
│  │  ├─ Convolution module (depthwise)
│  │  └─ Layer normalization
│  │
│  ├─ ConvolutionModule (50 lines)
│  │  ├─ Pointwise Conv 1x1
│  │  ├─ Depthwise Conv with kernel_size (e.g., 15)
│  │  ├─ Batch normalization or layer normalization
│  │  ├─ Activation (ReLU/SiLU)
│  │  └─ Projection
│  │
│  ├─ PositionwiseFeedForward (15 lines)
│  │  ├─ Dense layer (idim → hidden)
│  │  ├─ Activation (ReLU)
│  │  ├─ Dropout
│  │  └─ Dense layer (hidden → idim)
│  │
│  └─ MultiHeadedAttention (custom)
│     ├─ Scaled dot-product attention
│     ├─ Multiple heads
│     └─ Optional relative position bias
│
├─ External Dependencies: torch, custom conformer modules
└─ Use Case: Processing mel spectrogram to extract speaker features
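
The FFN, attention, convolution, FFN ordering with residual connections listed for ConformerBlock can be sketched as plain function composition. The half-step (0.5) scaling on the two feed-forward residuals follows the original Conformer design and is an assumption here, since this listing does not state the factor. Sub-modules are passed in as callables, and the per-module layer norms and dropout are omitted.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def add(x: Vector, y: Vector, scale: float = 1.0) -> Vector:
    """Residual connection: x + scale * y, element-wise."""
    return [a + scale * b for a, b in zip(x, y)]


def conformer_block(x: Vector, ffn1: Layer, attention: Layer, conv: Layer,
                    ffn2: Layer, norm: Layer, ffn_scale: float = 0.5) -> Vector:
    """Apply FFN -> attention -> conv -> FFN, each wrapped in a residual."""
    x = add(x, ffn1(x), ffn_scale)   # first feed-forward, half-step residual
    x = add(x, attention(x))         # multi-head self-attention
    x = add(x, conv(x))              # convolution module
    x = add(x, ffn2(x), ffn_scale)   # second feed-forward, half-step residual
    return norm(x)                   # final layer normalization
```

A real block operates on (batch, time, dim) tensors; flat vectors are used here only to keep the data flow visible.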

/home/user/IndexTTS-Rust/indextts/gpt/perceiver.py (317 LINES) ⭐⭐
├─ Purpose: Perceiver resampler for attention pooling
├─ Key Classes:
│  ├─ PerceiverResampler (250+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Learnable latent queries
│  │  │  ├─ Cross-attention layers
│  │  │  ├─ Feed-forward networks
│  │  │  └─ Layer normalization
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ dim: 512 (embedding dimension)
│  │  │  ├─ dim_context: 512 (context dimension)
│  │  │  ├─ num_latents: 32 (number of latent queries)
│  │  │  ├─ num_latent_channels: 64
│  │  │  ├─ num_layers: 6
│  │  │  ├─ ff_mult: 4 (FFN expansion)
│  │  │  └─ heads: 8
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Attend and pool
│  │  │  └─ _cross_attend_block() - Single cross-attention layer
│  │  │
│  │  └─ Cross-Attention Mechanism:
│  │     ├─ Queries: Learnable latents
│  │     ├─ Keys/Values: Input context
│  │     ├─ Output: Pooled features (num_latents × dim)
│  │     └─ FFN projection for dimension mixing
│  │
│  └─ FeedForward (15 lines)
│     ├─ Dense (dim → hidden)
│     ├─ GELU activation
│     └─ Dense (hidden → dim)
│
├─ External Dependencies: torch, einsum operations
└─ Use Case: Pool conditioning encoder output to fixed-size representation
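
The cross-attention mechanism above is what turns a variable-length context into a fixed-size output: the number of output rows equals the number of latent queries, whatever the context length. The sketch below is a stripped-down, single-head version in pure Python. It leaves out the learned query/key/value projections, multiple heads, layer normalization and the feed-forward layers, so it illustrates the pooling idea only and is not the PerceiverResampler implementation.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents: List[Vector], context: List[Vector]) -> List[Vector]:
    """Single-head cross-attention with a residual on the latents.

    Queries come from `latents`; keys and values are the `context` rows.
    """
    dim = len(latents[0])
    pooled: List[Vector] = []
    for query in latents:
        # Scaled dot-product score of this latent against every context frame.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in context]
        attn = softmax(scores)
        # Attention-weighted average of the context rows.
        mixed = [sum(a * value[d] for a, value in zip(attn, context))
                 for d in range(dim)]
        pooled.append([q + m for q, m in zip(query, mixed)])
    return pooled
```

With 32 latents of width 512, as in the parameter list, a 50-frame and a 500-frame reference clip would both pool to a 32 × 512 result.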

VOCODER & AUDIO SYNTHESIS FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/indextts/BigVGAN/models.py (1000+ LINES) ⭐⭐⭐
├─ Purpose: BigVGAN neural vocoder for mel-to-audio conversion
├─ Key Classes:
│  ├─ BigVGAN (400+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Initial Conv1d (80 mel bins → 192 channels)
│  │  │  ├─ Upsampling layers (transposed conv)
│  │  │  ├─ AMP blocks (anti-aliased multi-period)
│  │  │  ├─ Final Conv1d (channels → 1 waveform)
│  │  │  └─ Tanh activation for output
│  │  │
│  │  ├─ Upsampling: 8x → 8x → 2x → 2x (256x total)
│  │  │  ├─ Maps mel frames to audio samples at 22050 Hz
│  │  │  ├─ Kernel sizes: [16, 16, 4, 4]
│  │  │  └─ Padding: [6, 6, 2, 2]
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ num_mels: 80
│  │  │  ├─ num_freq: 513
│  │  │  ├─ n_fft: 1024
│  │  │  ├─ hop_size: 256
│  │  │  ├─ win_size: 1024
│  │  │  ├─ sampling_rate: 22050
│  │  │  ├─ freq_min: 0
│  │  │  ├─ freq_max: None
│  │  │  └─ use_cuda_kernel: bool
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Mel → audio waveform
│  │  │  ├─ from_pretrained() - Load from HuggingFace
│  │  │  ├─ remove_weight_norm() - Remove weight normalization
│  │  │  └─ eval() - Set to evaluation mode
│  │  │
│  │  └─ Special Features:
│  │     ├─ Weight normalization for training stability
│  │     ├─ Spectral normalization option
│  │     ├─ CUDA kernel support for activation functions
│  │     ├─ Snake/SnakeBeta activation (periodic)
│  │     └─ Anti-aliasing filters for high-quality upsampling
│  │
│  ├─ AMPBlock1 (50 lines)
│  │  ├─ Architecture: Conv1d × 2 with activations
│  │  ├─ Multiple dilation patterns [1, 3, 5]
│  │  ├─ Residual connections
│  │  ├─ Activation1d wrapper for anti-aliasing
│  │  └─ Weight normalization
│  │
│  ├─ AMPBlock2 (40 lines)
│  │  ├─ Similar to AMPBlock1 but simpler
│  │  ├─ Dilation patterns [1, 3]
│  │  └─ Residual connections
│  │
│  ├─ Activation1d (custom, from alias_free_activation/)
│  │  ├─ Applies activation function (Snake/SnakeBeta)
│  │  ├─ Optional anti-aliasing filter
│  │  └─ Optional CUDA kernel for efficiency
│  │
│  ├─ Snake Activation (from activations.py)
│  │  ├─ Formula: x + (1/alpha) * sin²(alpha * x)
│  │  ├─ Periodic nonlinearity
│  │  └─ Learnable alpha parameter
│  │
│  └─ SnakeBeta Activation (from activations.py)
│     ├─ More complex periodic activation
│     └─ Improved harmonic modeling
│
├─ External Dependencies: torch, scipy, librosa
└─ Model Size: ~100 MB (pretrained weights)
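
The Snake formula given above is easy to check numerically, and the 256x total upsampling ties the number of mel frames directly to the number of output samples. The sketch below evaluates both on scalars. In the vocoder itself alpha is a learnable parameter applied to tensors, which is not modeled here, and the helper names are illustrative.

```python
import math


def snake(x: float, alpha: float = 1.0) -> float:
    """Snake activation: x + (1 / alpha) * sin^2(alpha * x)."""
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2


def samples_from_frames(num_frames: int, hop_size: int = 256) -> int:
    """Waveform length produced from `num_frames` mel frames at the given hop size."""
    return num_frames * hop_size


if __name__ == "__main__":
    for value in (-1.0, 0.0, 0.5, 2.0):
        print(value, snake(value, alpha=2.0))
    print(samples_from_frames(100))
```

Because the sin² term is never negative, Snake always returns at least x, while adding a periodic ripple whose frequency is set by alpha.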

/home/user/IndexTTS-Rust/indextts/s2mel/modules/audio.py (83 LINES)
├─ Purpose: Mel-spectrogram computation (DSP)
├─ Key Functions:
│  ├─ load_wav() - Load WAV file with scipy
│  ├─ mel_spectrogram() - Compute mel spectrogram
│  │  ├─ Parameters:
│  │  │  ├─ y: waveform tensor
│  │  │  ├─ n_fft: 1024
│  │  │  ├─ num_mels: 80
│  │  │  ├─ sampling_rate: 22050
│  │  │  ├─ hop_size: 256
│  │  │  ├─ win_size: 1024
│  │  │  ├─ fmin: 0
│  │  │  └─ fmax: None or 8000
│  │  │
│  │  ├─ Process:
│  │  │  1. Pad input with reflect padding
│  │  │  2. Compute STFT (Short-Time Fourier Transform)
│  │  │  3. Convert to magnitude spectrogram
│  │  │  4. Apply mel filterbank (librosa)
│  │  │  5. Apply dynamic range compression (log)
│  │  │  └─ Output: [1, 80, T] tensor
│  │  │
│  │  └─ Caching:
│  │     ├─ Caches mel filterbank matrices
│  │     ├─ Caches Hann windows
│  │     └─ Device-specific caching
│  │
│  ├─ dynamic_range_compression() - Log compression
│  ├─ dynamic_range_decompression() - Inverse
│  └─ spectral_normalize/denormalize()
│
├─ Critical DSP Parameters:
│  ├─ STFT Window: Hann window
│  ├─ FFT Size: 1024
│  ├─ Hop Size: 256 (11.6 ms at 22050 Hz)
│  ├─ Mel Bins: 80 (perceptual scale)
│  ├─ Min Freq: 0 Hz
│  └─ Max Freq: Variable (8000 Hz or Nyquist)
│
└─ External Dependencies: torch, librosa, scipy
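
The framing arithmetic and the log compression step above can be worked through without any audio library. The sketch below assumes the HiFi-GAN-style convention of reflect-padding by (n_fft - hop_size) / 2 on each side with a non-centered STFT, and a clip value of 1e-5 before the logarithm. Both are assumptions, since this listing gives neither the padding amount nor the clip value.

```python
import math


def hop_duration_ms(hop_size: int = 256, sampling_rate: int = 22050) -> float:
    """Time covered by one hop, in milliseconds."""
    return 1000.0 * hop_size / sampling_rate


def num_frames(num_samples: int, n_fft: int = 1024, hop_size: int = 256) -> int:
    """Frame count T for a non-centered STFT after symmetric reflect padding."""
    pad = (n_fft - hop_size) // 2          # assumed padding per side
    padded = num_samples + 2 * pad
    return 1 + (padded - n_fft) // hop_size


def dynamic_range_compression(x: float, C: float = 1.0,
                              clip_val: float = 1e-5) -> float:
    """Log compression with a floor to avoid log(0); clip_val is an assumption."""
    return math.log(max(x, clip_val) * C)


def dynamic_range_decompression(y: float, C: float = 1.0) -> float:
    """Inverse of the compression for values above the clip floor."""
    return math.exp(y) / C
```

Under these assumptions one second of 22050 Hz audio yields 86 frames, so the [1, 80, T] output has T = 86.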

SEMANTIC CODEC & FEATURE EXTRACTION FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/indextts/utils/maskgct_utils.py (250 LINES)
├─ Purpose: Build and manage semantic codecs
├─ Key Functions:
│  ├─ build_semantic_model()
│  │  ├─ Loads: facebook/w2v-bert-2.0 model
│  │  ├─ Extracts: W2V-BERT 2.0 embeddings
│  │  ├─ Returns: model, mean, std (for normalization)
│  │  └─ Output: 1024-dimensional embeddings
│  │
│  ├─ build_semantic_codec()
│  │  ├─ Creates: RepCodec (residual vector quantization)
│  │  ├─ Quantizes: Semantic embeddings
│  │  ├─ Returns: Codec model
│  │  └─ Output: Discrete tokens
│  │
│  ├─ build_s2a_model()
│  │  ├─ Builds: MaskGCT_S2A (semantic-to-acoustic)
│  │  └─ Maps: Semantic codes → acoustic codes
│  │
│  ├─ build_acoustic_codec()
│  │  ├─ Encoder: Encodes acoustic features
│  │  ├─ Decoder: Decodes codes → audio
│  │  └─ Multiple codec variants
│  │
│  └─ Inference_Pipeline (class)
│     ├─ Combines all codecs
│     ├─ Methods:
│     │  ├─ get_emb() - Get semantic embeddings
│     │  ├─ get_scode() - Quantize to semantic codes
│     │  ├─ semantic2acoustic() - Convert codes
│     │  └─ s2a_inference() - Full pipeline
│     └─ Diffusion-based generation options
│
├─ External Dependencies: torch, transformers, huggingface_hub
└─ Pre-trained Models:
   ├─ W2V-BERT-2.0: 614M parameters
   ├─ MaskGCT: From amphion/MaskGCT
   └─ Various codec checkpoints
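
Two steps above lend themselves to a small worked example: normalizing an embedding with the returned mean and std, and turning it into a discrete token. The sketch below uses a single-codebook nearest-neighbour lookup as a generic stand-in for vector quantization. How RepCodec actually quantizes is not described in this listing, so the quantize() function is an illustration of the concept only, and both function names are hypothetical.

```python
from typing import List, Sequence


def normalize(feat: Sequence[float], mean: Sequence[float],
              std: Sequence[float]) -> List[float]:
    """Standardize an embedding: (feat - mean) / std, element-wise."""
    return [(f - m) / s for f, m, s in zip(feat, mean, std)]


def quantize(feat: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `feat`."""
    def squared_distance(entry: Sequence[float]) -> float:
        return sum((f - e) ** 2 for f, e in zip(feat, entry))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


if __name__ == "__main__":
    embedding = normalize([3.0, 4.0], mean=[1.0, 2.0], std=[2.0, 2.0])
    print(quantize(embedding, [[0.0, 0.0], [1.0, 0.9], [5.0, 5.0]]))
```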

CONFIGURATION & UTILITY FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/indextts/utils/checkpoint.py (50 LINES)
├─ Purpose: Load model checkpoints
├─ Key Functions:
│  ├─ load_checkpoint() - Load weights into model
│  └─ Device handling (CPU/GPU/XPU/MPS)
└─ Supported Formats: .pth, .safetensors
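
Supporting both formats usually comes down to dispatching on the file extension. The sketch below shows that dispatch with the heavy imports deferred until they are needed. The helper names are hypothetical and this is not the body of load_checkpoint(); it only uses torch.load and safetensors.torch.load_file, which are the standard loaders for the two formats.

```python
import os
from typing import Any, Dict


def checkpoint_format(path: str) -> str:
    """Classify a checkpoint path by its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".safetensors":
        return "safetensors"
    if ext == ".pth":
        return "torch"
    raise ValueError(f"Unsupported checkpoint format: {ext!r}")


def load_state_dict(path: str, device: str = "cpu") -> Dict[str, Any]:
    """Load a state dict from either format onto `device`."""
    if checkpoint_format(path) == "safetensors":
        from safetensors.torch import load_file  # imported lazily
        return load_file(path, device=device)
    import torch  # imported lazily
    return torch.load(path, map_location=device)
```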

/home/user/IndexTTS-Rust/indextts/utils/arch_util.py
├─ Purpose: Architecture utility modules
├─ Key Classes:
│  └─ AttentionBlock - Generic attention layer
└─ Used in: Conditioning encoder, other modules

/home/user/IndexTTS-Rust/indextts/utils/xtransformers.py (1,600 LINES)
├─ Purpose: Extended transformer utilities
├─ Key Components:
│  ├─ Advanced attention mechanisms
│  ├─ Relative position bias
│  ├─ Cross-attention patterns
│  └─ Various position encoding schemes
└─ Used in: GPT model, encoders

TESTING FILES
═════════════════════════════════════════════════════════════════════════════════

/home/user/IndexTTS-Rust/tests/regression_test.py
├─ Test Cases:
│  ├─ Chinese text with pinyin tones (晕 XUAN4)
│  ├─ English text
│  ├─ Mixed Chinese-English
│  ├─ Long-form text with multiple sentences
│  ├─ Named entities (Joseph Gordon-Levitt)
│  ├─ Chinese names (约瑟夫·高登-莱维特)
│  └─ Extended passages for robustness
├─ Inference Modes:
│  ├─ Single inference (infer)
│  └─ Fast inference (infer_fast)
└─ Output: WAV files in outputs/ directory

/home/user/IndexTTS-Rust/tests/padding_test.py
├─ Test Scenarios:
│  ├─ Variable length inputs
│  ├─ Batch processing
│  ├─ Edge cases
│  └─ Padding handling
└─ Purpose: Ensure robust padding mechanics

═════════════════════════════════════════════════════════════════════════════════

KEY ALGORITHMS SUMMARY:

1. TEXT PROCESSING:
   - Regex-based pattern matching for pinyin/names
   - Character-level CJK tokenization
   - SentencePiece BPE encoding
   - Language detection (Chinese vs English)
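
Two of the text-processing steps above, language detection and character-level CJK splitting, can be sketched with a single Unicode range. The range below covers only the CJK Unified Ideographs block, which is a simplification, and the function bodies are illustrative rather than the logic of use_chinese() or the tokenizer in the repository.

```python
import re
from typing import List

# CJK Unified Ideographs only; extension blocks are ignored in this sketch.
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
CJK_OR_RUN = re.compile(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+")


def use_chinese(text: str) -> bool:
    """Treat the text as Chinese if it contains at least one CJK character."""
    return CJK_CHAR.search(text) is not None


def split_cjk(text: str) -> List[str]:
    """Split CJK characters individually; keep other non-space runs whole."""
    return CJK_OR_RUN.findall(text)
```

Splitting CJK characters before BPE encoding keeps each ideograph as its own unit while Latin words stay intact for subword merging.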

2. FEATURE EXTRACTION:
   - W2V-BERT semantic embeddings (1024-dim)
   - RepCodec quantization
   - Mel-spectrogram (STFT-based, 80-dim)
   - CAMPPlus speaker embeddings (192-dim)

3. SEQUENCE GENERATION:
   - GPT-based autoregressive generation
   - Conformer speaker conditioning
   - Perceiver attention pooling to a fixed-size representation
   - Classifier-free guidance (optional)
   - Temperature/top-k/top-p sampling

4. AUDIO SYNTHESIS:
   - Transposed convolution upsampling (256x)
   - Anti-aliased activation functions
   - Residual connections
   - Weight/spectral normalization

5. EMOTION CONTROL:
   - 8-dimensional emotion vectors
   - Text-based emotion detection (via Qwen)
   - Audio-based emotion extraction
   - Emotion matrix interpolation
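
One plausible reading of "emotion matrix interpolation" is a weighted blend of per-emotion embedding rows, driven by the 8-dimensional emotion vector. The sketch below implements that reading only. The weight normalization, the matrix layout (one row per emotion) and the function name are all assumptions, and the actual scheme in the inference code may differ.

```python
from typing import List


def blend_emotions(weights: List[float],
                   emotion_matrix: List[List[float]]) -> List[float]:
    """Blend the rows of `emotion_matrix` using normalized `weights`."""
    if len(weights) != len(emotion_matrix):
        raise ValueError("need exactly one weight per emotion row")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    dim = len(emotion_matrix[0])
    # Each output dimension is the weight-normalized average of that column.
    return [sum(w / total * row[d] for w, row in zip(weights, emotion_matrix))
            for d in range(dim)]
```

A one-hot vector selects a single emotion row, while two equal weights give the midpoint between the corresponding rows.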

═════════════════════════════════════════════════════════════════════════════════