IndexTTS-Rust/ (Complete Directory Structure)
│
├── indextts/                                    # Main Python package (194 files)
│   │
│   ├── __init__.py                              # Package initialization
│   ├── cli.py                                   # Command-line interface (64 lines)
│   ├── infer.py                                 # Original inference (v1) - 690 lines
│   ├── infer_v2.py                              # Main inference v2 - 739 lines ⭐⭐⭐
│   │
│   ├── gpt/                                     # GPT-based TTS model (9 files, 16,953 lines)
│   │   ├── __init__.py
│   │   ├── model.py                             # Original UnifiedVoice (713L)
│   │   ├── model_v2.py                          # UnifiedVoice v2 ⭐⭐⭐ (747L)
│   │   ├── conformer_encoder.py                 # Conformer encoder ⭐⭐ (520L)
│   │   ├── perceiver.py                         # Perceiver resampler (317L)
│   │   ├── transformers_gpt2.py                 # GPT2 implementation (1,878L)
│   │   ├── transformers_generation_utils.py     # Generation utilities (4,747L)
│   │   ├── transformers_beam_search.py          # Beam search (1,013L)
│   │   └── transformers_modeling_utils.py       # Model utilities (5,525L)
│   │
│   ├── BigVGAN/                                 # Neural Vocoder (6+ files, ~1,000 lines)
│   │   ├── __init__.py
│   │   ├── models.py                            # BigVGAN architecture ⭐⭐⭐
│   │   ├── ECAPA_TDNN.py                        # Speaker encoder
│   │   ├── activations.py                       # Snake, SnakeBeta activations
│   │   ├── utils.py                             # Helper functions
│   │   │
│   │   ├── alias_free_activation/               # CUDA kernel variants
│   │   │   ├── cuda/
│   │   │   │   ├── activation1d.py              # CUDA kernel loader
│   │   │   │   └── load.py
│   │   │   └── torch/
│   │   │       ├── act.py                       # PyTorch activation
│   │   │       ├── filter.py                    # Anti-aliasing filter
│   │   │       └── resample.py                  # Resampling
│   │   │
│   │   ├── alias_free_torch/                    # PyTorch-only fallback
│   │   │   ├── act.py
│   │   │   ├── filter.py
│   │   │   └── resample.py
│   │   │
│   │   └── nnet/                                # Network modules
│   │       ├── linear.py
│   │       ├── normalization.py
│   │       └── CNN.py
│   │
│   ├── s2mel/                                   # Semantic-to-Mel Models (~500+ lines)
│   │   ├── modules/                             # Core modules (10+ files)
│   │   │   ├── audio.py                         # Mel-spectrogram computation ⭐
│   │   │   ├── commons.py                       # Common utilities (21KB)
│   │   │   ├── layers.py                        # NN layers (13KB)
│   │   │   ├── length_regulator.py              # Duration modeling
│   │   │   ├── flow_matching.py                 # Continuous flow matching
│   │   │   ├── diffusion_transformer.py         # Diffusion model
│   │   │   ├── rmvpe.py                         # Pitch extraction (22KB)
│   │   │   ├── quantize.py                      # Quantization
│   │   │   ├── encodec.py                       # EnCodec codec
│   │   │   ├── wavenet.py                       # WaveNet implementation
│   │   │   │
│   │   │   ├── bigvgan/                         # BigVGAN vocoder
│   │   │   │   ├── modules.py
│   │   │   │   ├── config.json
│   │   │   │   ├── bigvgan.py
│   │   │   │   ├── alias_free_activation/       # Variants
│   │   │   │   └── models.py
│   │   │   │
│   │   │   ├── vocos/                           # Vocos codec
│   │   │   ├── hifigan/                         # HiFiGAN vocoder
│   │   │   ├── openvoice/                       # OpenVoice components (11 files)
│   │   │   ├── campplus/                        # CAMPPlus speaker encoder
│   │   │   │   └── DTDNN.py                     # DTDNN architecture
│   │   │   └── gpt_fast/                        # Fast GPT inference
│   │   │
│   │   ├── dac/                                 # DAC codec
│   │   │   ├── model/
│   │   │   ├── nn/
│   │   │   └── utils/
│   │   │
│   │   └── (other s2mel implementations)
│   │
│   ├── utils/                                   # Text & Feature Utils (12+ files, ~500L)
│   │   ├── __init__.py
│   │   ├── front.py                             # TextNormalizer, TextTokenizer ⭐⭐⭐ (700L)
│   │   ├── maskgct_utils.py                     # Semantic codec builders (250L)
│   │   ├── arch_util.py                         # AttentionBlock, utilities
│   │   ├── checkpoint.py                        # Model loading
│   │   ├── xtransformers.py                     # Transformer utils (1,600L)
│   │   ├── feature_extractors.py                # MelSpectrogramFeatures
│   │   ├── common.py                            # Common functions
│   │   ├── text_utils.py                        # Text utilities
│   │   ├── typical_sampling.py                  # TypicalLogitsWarper sampling
│   │   ├── utils.py                             # General utils
│   │   ├── webui_utils.py                       # Web UI helpers
│   │   ├── tagger_cache/                        # Text normalization cache
│   │   │
│   │   └── maskgct/                             # MaskGCT codec (100+ files, 10KB+)
│   │       └── models/
│   │           ├── codec/                       # Multiple codec implementations
│   │           │   ├── amphion_codec/           # Amphion codec
│   │           │   │   ├── codec.py
│   │           │   │   ├── vocos.py
│   │           │   │   └── quantize/            # Quantization
│   │           │   │       ├── vector_quantize.py
│   │           │   │       ├── residual_vq.py
│   │           │   │       ├── factorized_vector_quantize.py
│   │           │   │       └── lookup_free_quantize.py
│   │           │   │
│   │           │   ├── facodec/                 # FACodec variant
│   │           │   │   ├── facodec_inference.py
│   │           │   │   ├── modules/
│   │           │   │   │   ├── commons.py
│   │           │   │   │   ├── attentions.py
│   │           │   │   │   ├── layers.py
│   │           │   │   │   ├── quantize.py
│   │           │   │   │   ├── wavenet.py
│   │           │   │   │   ├── style_encoder.py
│   │           │   │   │   ├── gradient_reversal.py
│   │           │   │   │   └── JDC/             # Pitch detection
│   │           │   │   └── alias_free_torch/    # Anti-aliasing
│   │           │   │
│   │           │   ├── speechtokenizer/         # Speech Tokenizer codec
│   │           │   │   ├── model.py
│   │           │   │   └── modules/
│   │           │   │       ├── seanet.py
│   │           │   │       ├── lstm.py
│   │           │   │       ├── norm.py
│   │           │   │       ├── conv.py
│   │           │   │       └── quantization/
│   │           │   │
│   │           │   ├── ns3_codec/               # NS3 codec variant
│   │           │   ├── vevo/                    # VEVo codec
│   │           │   ├── kmeans/                  # KMeans codec
│   │           │   ├── melvqgan/                # MelVQ-GAN codec
│   │           │   │
│   │           │   ├── codec_inference.py
│   │           │   ├── codec_sampler.py
│   │           │   ├── codec_trainer.py
│   │           │   └── codec_dataset.py
│   │           │
│   │           └── tts/
│   │               └── maskgct/
│   │                   ├── maskgct_s2a.py       # Semantic-to-acoustic
│   │                   └── ckpt/
│   │
│   └── vqvae/                                   # Vector Quantized VAE
│       ├── xtts_dvae.py                         # Discrete VAE (currently disabled)
│       └── (other VAE components)
│
├── examples/                                    # Sample Data & Test Cases
│   ├── cases.jsonl                              # Example test cases
│   ├── voice_*.wav                              # Sample voice prompts (12 files)
│   ├── emo_*.wav                                # Emotion reference samples (2 files)
│   └── sample_prompt.wav                        # Default prompt (implied)
│
├── tests/                                       # Test Suite
│   ├── regression_test.py                       # Main regression tests ⭐
│   └── padding_test.py                          # Padding/batch tests
│
├── tools/                                       # Utility Scripts & i18n
│   ├── download_files.py                        # Model downloading from HF
│   └── i18n/                                    # Internationalization
│       ├── i18n.py                              # Translation system
│       ├── scan_i18n.py                         # i18n scanner
│       └── locale/
│           ├── en_US.json                       # English translations
│           └── zh_CN.json                       # Chinese translations
│
├── archive/                                     # Historical Docs
│   └── README_INDEXTTS_1_5.md                   # IndexTTS 1.5 documentation
│
├── webui.py                                     # Gradio Web UI ⭐⭐⭐ (18KB)
├── cli.py                                       # Command-line interface
├── requirements.txt                             # Python dependencies
├── MANIFEST.in                                  # Package manifest
├── .gitignore                                   # Git ignore rules
├── .gitattributes                               # Git attributes
└── LICENSE                                      # Apache 2.0 License

═══════════════════════════════════════════════════════════════════════════════
KEY FILES BY IMPORTANCE:
═══════════════════════════════════════════════════════════════════════════════

⭐⭐⭐ CRITICAL (Core Logic - MUST Convert First)
  1. indextts/infer_v2.py              - Main inference pipeline (739L)
  2. indextts/gpt/model_v2.py          - UnifiedVoice GPT model (747L)
  3. indextts/utils/front.py           - Text processing (700L)
  4. indextts/BigVGAN/models.py        - Vocoder (1000+L)
  5. indextts/s2mel/modules/audio.py   - Mel-spectrogram (83L, critical DSP)
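
The mel-spectrogram file (audio.py) is flagged as critical DSP because every downstream model consumes its output. As a hedged sketch only: the core of such a file is usually the mel-scale mapping and filterbank edge placement shown below, here using the HTK mel formula and the 80-bin / 22,050 Hz values listed in this document. Function names are illustrative, not the project's actual API.

```python
import math

def hz_to_mel(f):
    """HTK mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(n_mels=80, sr=22050, f_min=0.0, f_max=None):
    """Frequencies (Hz) of the n_mels + 2 edges defining n_mels
    triangular mel filters, spaced evenly on the mel scale."""
    f_max = sr / 2 if f_max is None else f_max
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

edges = mel_filter_edges()   # 80 filters need 82 edge frequencies
```

Note the actual audio.py may use the Slaney variant of the mel scale or different f_min/f_max bounds; verify against the source before porting.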

⭐⭐ HIGH PRIORITY (Major Components)
  1. indextts/gpt/conformer_encoder.py - Conformer blocks (520L)
  2. indextts/gpt/perceiver.py         - Perceiver attention (317L)
  3. indextts/utils/maskgct_utils.py   - Codec builders (250L)
  4. indextts/s2mel/modules/commons.py - Common utilities (21KB)

⭐ MEDIUM PRIORITY (Utilities & Optimization)
  1. indextts/utils/xtransformers.py   - Transformer utils (1,600L)
  2. indextts/BigVGAN/activations.py   - Activation functions
  3. indextts/s2mel/modules/rmvpe.py   - Pitch extraction (22KB)

OPTIONAL (Web UI, Tools)
  1. webui.py                          - Gradio interface
  2. tools/download_files.py           - Model downloading

═══════════════════════════════════════════════════════════════════════════════
TOTAL STATISTICS:
═══════════════════════════════════════════════════════════════════════════════
Total Python Files:        194
Total Lines of Code:       ~25,000
GPT Module:                16,953 lines
MaskGCT Codecs:            ~10,000 lines
S2Mel Models:              ~2,000 lines
BigVGAN:                   ~1,000 lines
Utils:                     ~500 lines
Tests:                     ~100 lines

Models Supported:          6 major HuggingFace models
Languages:                 Chinese (full), English (full), Mixed
Emotion Dimensions:        8-dimensional emotion control
Audio Sample Rate:         22,050 Hz (primary)
Max Text Tokens:           120
Max Mel Tokens:            250
Mel Spectrogram Bins:      80
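
For a conversion effort, the limits above are worth pinning down in one typed constants block early on. A minimal sketch (the class and field names are assumptions for illustration, not the project's actual config schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexTTSConfig:
    """Audio and sequence limits from the statistics above."""
    sample_rate: int = 22050     # primary audio sample rate (Hz)
    n_mels: int = 80             # mel-spectrogram bins
    max_text_tokens: int = 120   # maximum text tokens per utterance
    max_mel_tokens: int = 250    # maximum generated mel tokens
    emotion_dims: int = 8        # size of the emotion control vector

cfg = IndexTTSConfig()
```

Centralizing these values keeps the text frontend, GPT model, and vocoder stages agreeing on the same limits during the port.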