╔════════════════════════════════════════════╗
║  DETAILED SOURCE FILE LISTING BY CATEGORY  ║
╚════════════════════════════════════════════╝
MAIN INFERENCE PIPELINE FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/indextts/infer_v2.py (739 LINES) ★★★ CRITICAL
├─ Purpose: Main TTS inference class (IndexTTS2)
├─ Key Classes:
│  ├─ QwenEmotion (emotion text-to-vector conversion)
│  ├─ IndexTTS2 (main inference class)
│  └─ Helper functions for emotion/audio processing
├─ Key Methods:
│  ├─ __init__() - Initialize all models and codecs
│  ├─ infer() - Single text generation with emotion control
│  ├─ infer_fast() - Parallel segment generation
│  ├─ get_emb() - Extract semantic embeddings
│  ├─ remove_long_silence() - Silence token removal
│  ├─ insert_interval_silence() - Silence insertion
│  └─ Cache management for repeated generation
├─ Models Loaded:
│  ├─ UnifiedVoice (GPT model for mel token generation)
│  ├─ W2V-BERT (semantic feature extraction)
│  ├─ RepCodec (semantic codec)
│  ├─ S2Mel model (semantic-to-mel conversion)
│  ├─ CAMPPlus (speaker embedding)
│  ├─ BigVGAN vocoder
│  ├─ Qwen-based emotion model
│  └─ Emotion/speaker matrices
└─ External Dependencies: torch, transformers, librosa, safetensors
/home/user/IndexTTS-Rust/webui.py (18KB) ★★★ WEB INTERFACE
├─ Purpose: Gradio-based web UI for IndexTTS
├─ Key Components:
│  ├─ Model initialization (IndexTTS2 instance)
│  ├─ Language selection (Chinese/English)
│  ├─ Emotion control modes (4 modes)
│  ├─ Example case loading from cases.jsonl
│  ├─ Progress bar integration
│  └─ Output management
├─ Features:
│  ├─ Real-time inference
│  ├─ Multiple emotion control methods
│  ├─ Batch processing
│  ├─ Task caching
│  ├─ i18n support
│  └─ Pre-loaded example cases
└─ Web Framework: Gradio 5.34.1
/home/user/IndexTTS-Rust/indextts/cli.py (64 LINES)
├─ Purpose: Command-line interface
├─ Usage: python -m indextts.cli <text> -v <voice.wav> -o <output.wav> [options]
├─ Arguments:
│  ├─ text: Text to synthesize
│  ├─ -v/--voice: Voice reference audio
│  ├─ -o/--output_path: Output file path
│  ├─ -c/--config: Config file path
│  ├─ --model_dir: Model directory
│  ├─ --fp16: Use FP16 precision
│  ├─ -d/--device: Device (cpu/cuda/mps/xpu)
│  └─ -f/--force: Force overwrite
└─ Uses: IndexTTS (v1 model)
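The argument list above could be wired up with argparse roughly as follows. This is a minimal sketch and not the contents of cli.py: only the flag names come from the listing, while the defaults, help strings, and the `build_parser` helper are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed for indextts/cli.py."""
    parser = argparse.ArgumentParser(prog="indextts")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("-v", "--voice", required=True, help="Voice reference audio")
    # The default output name is an assumption, not taken from the project.
    parser.add_argument("-o", "--output_path", default="gen.wav", help="Output file path")
    parser.add_argument("-c", "--config", default=None, help="Config file path")
    parser.add_argument("--model_dir", default=None, help="Model directory")
    parser.add_argument("--fp16", action="store_true", help="Use FP16 precision")
    parser.add_argument("-d", "--device", default=None, help="cpu, cuda, mps or xpu")
    parser.add_argument("-f", "--force", action="store_true", help="Force overwrite")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["Hello", "-v", "voice.wav", "-o", "out.wav", "--fp16"])
print(args.text, args.voice, args.output_path, args.fp16, args.force)
```

Passing an explicit list to `parse_args` keeps the example independent of how the script is launched.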
TEXT PROCESSING & NORMALIZATION FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/indextts/utils/front.py (700 LINES) ★★★ CRITICAL
├─ Purpose: Text normalization and tokenization
├─ Key Classes:
│  ├─ TextNormalizer (700+ lines)
│  │  ├─ Pattern Definitions:
│  │  │  ├─ PINYIN_TONE_PATTERN (regex for pinyin with tones 1-5)
│  │  │  ├─ NAME_PATTERN (regex for Chinese names)
│  │  │  └─ ENGLISH_CONTRACTION_PATTERN (regex for 's contractions)
│  │  ├─ Methods:
│  │  │  ├─ normalize() - Main normalization
│  │  │  ├─ use_chinese() - Language detection
│  │  │  ├─ save_pinyin_tones() - Extract pinyin with tones
│  │  │  ├─ restore_pinyin_tones() - Restore pinyin
│  │  │  ├─ save_names() - Extract names
│  │  │  ├─ restore_names() - Restore names
│  │  │  ├─ correct_pinyin() - Pinyin correction (jqx → v, e.g. ju → jv)
│  │  │  └─ char_rep_map - Character replacement dictionary
│  │  └─ Normalizers:
│  │     ├─ zh_normalizer (Chinese) - Uses WeTextProcessing/wetext
│  │     └─ en_normalizer (English) - Uses tn library
│  │
│  └─ TextTokenizer (200+ lines)
│     ├─ Methods:
│     │  ├─ encode() - Text to token IDs
│     │  ├─ decode() - Token IDs to text
│     │  ├─ convert_tokens_to_ids()
│     │  ├─ convert_ids_to_tokens()
│     │  └─ Vocab management
│     ├─ Special Tokens:
│     │  ├─ BOS: "<s>" (ID 0)
│     │  ├─ EOS: "</s>" (ID 1)
│     │  └─ UNK: "<unk>"
│     └─ Tokenizer: SentencePiece (BPE-based)
├─ Language Support:
│  ├─ Chinese (simplified & traditional)
│  ├─ English
│  └─ Mixed Chinese-English
└─ Critical Pattern Matching:
   ├─ Pinyin tone detection
   ├─ Name entity detection
   ├─ Email matching
   ├─ Character replacement
   └─ Punctuation handling
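The save/restore pairs listed above (save_pinyin_tones/restore_pinyin_tones, save_names/restore_names) suggest a mask-normalize-unmask flow. The sketch below illustrates that idea only: the regex, the placeholder format, and the lowercasing stand-in for the real normalizer are all assumptions, not the patterns used in front.py.

```python
import re

# Assumed pattern: an uppercase pinyin syllable followed by a tone digit 1-5.
PINYIN_TONE = re.compile(r"\b[A-Z]+[1-5]\b")


def save_pinyin_tones(text):
    """Swap each tone-marked syllable for a placeholder and return both parts."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"<pinyin_{len(saved) - 1}>"

    return PINYIN_TONE.sub(stash, text), saved


def restore_pinyin_tones(text, saved):
    """Put the original syllables back in place of their placeholders."""
    for index, syllable in enumerate(saved):
        text = text.replace(f"<pinyin_{index}>", syllable)
    return text


def normalize(text):
    """Protect pinyin, run a stand-in normalizer (lowercasing), then restore."""
    masked, saved = save_pinyin_tones(text)
    masked = masked.lower()  # stand-in for the real zh/en normalizer
    return restore_pinyin_tones(masked, saved)


print(normalize("Say XUAN4 and GAN3 Now"))
```

Masking first means the normalizer never sees the tone digits, so it cannot rewrite them as ordinary numbers.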
GPT MODEL ARCHITECTURE FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/indextts/gpt/model_v2.py (747 LINES) ★★★ CRITICAL
├─ Purpose: UnifiedVoice GPT-based TTS model
├─ Key Classes:
│  ├─ UnifiedVoice (700+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Input Embeddings: Text (256 vocab), Mel (8194 vocab)
│  │  │  ├─ Position Embeddings: Learned embeddings for mel/text
│  │  │  ├─ GPT Transformer: Configurable layers/heads
│  │  │  ├─ Conditioning Encoder: Conformer or Perceiver-based
│  │  │  ├─ Emotion Conditioning: Separate conformer + perceiver
│  │  │  └─ Output Heads: Text prediction, Mel prediction
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ layers: 8 (transformer depth)
│  │  │  ├─ model_dim: 512 (embedding dimension)
│  │  │  ├─ heads: 8 (attention heads)
│  │  │  ├─ max_text_tokens: 120
│  │  │  ├─ max_mel_tokens: 250
│  │  │  ├─ number_mel_codes: 8194
│  │  │  ├─ condition_type: "conformer_perceiver" or "conformer_encoder"
│  │  │  └─ Various activation functions
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Forward pass
│  │  │  ├─ post_init_gpt2_config() - Initialize for inference
│  │  │  ├─ generate_mel() - Mel token generation
│  │  │  ├─ forward_with_cond_scale() - With classifier-free guidance
│  │  │  └─ Cache management
│  │  │
│  │  └─ Conditioning System:
│  │     ├─ Speaker conditioning via mel spectrogram
│  │     ├─ Conformer encoder for speaker features
│  │     ├─ Perceiver for attention pooling
│  │     ├─ Emotion conditioning (separate pathway)
│  │     └─ Emotion vector support (8-dimensional)
│  │
│  ├─ ResBlock (40+ lines)
│  │  ├─ Conv1d layers with GroupNorm
│  │  └─ ReLU activation with residual connection
│  │
│  ├─ GPT2InferenceModel (200+ lines)
│  │  ├─ Inference wrapper for GPT2
│  │  ├─ KV cache support
│  │  ├─ Model parallelism support
│  │  └─ Token-by-token generation
│  │
│  ├─ ConditioningEncoder (30 lines)
│  │  ├─ Conv1d initialization
│  │  ├─ Attention blocks
│  │  └─ Optional mean pooling
│  │
│  ├─ MelEncoder (30 lines)
│  │  ├─ Conv1d layers
│  │  ├─ ResBlocks
│  │  └─ 4x reduction
│  │
│  ├─ LearnedPositionEmbeddings (15 lines)
│  │  └─ Learnable positional embeddings
│  │
│  └─ build_hf_gpt_transformer() (20 lines)
│     └─ Builds HuggingFace GPT2 with custom embeddings
│
├─ External Dependencies: torch, transformers, indextts.gpt modules
└─ Critical Inference Parameters:
   ├─ Temperature control for generation
   ├─ Top-k/top-p sampling
   ├─ Classifier-free guidance scale
   └─ Generation length limits
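The temperature, top-k, and top-p controls listed above can be illustrated on a plain list of logits. This is a generic, dependency-free sketch of those sampling filters, not the code path used by UnifiedVoice; the function names `filter_logits` and `sample` are hypothetical.

```python
import math
import random


def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    # Sort candidates from most to least probable.
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    kept, cumulative = [], 0.0
    for prob, token_id in ranked:
        kept.append((prob, token_id))
        cumulative += prob
        if cumulative >= top_p:  # nucleus reached: stop adding tokens
            break
    norm = sum(prob for prob, _ in kept)
    return {token_id: prob / norm for prob, token_id in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = filter_logits(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


print(filter_logits([2.0, 1.0, 0.0, -1.0], top_k=2))
print(sample([2.0, 1.0, 0.0, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens survive the top-p cut.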
/home/user/IndexTTS-Rust/indextts/gpt/conformer_encoder.py (520 LINES) ★★
├─ Purpose: Conformer-based speaker conditioning encoder
├─ Key Classes:
│  ├─ ConformerEncoder (main)
│  │  ├─ Modules:
│  │  │  ├─ Subsampling layer (Conv2d)
│  │  │  ├─ Positional encoding
│  │  │  ├─ Conformer blocks
│  │  │  ├─ Layer normalization
│  │  │  └─ Optional projection layer
│  │  │
│  │  ├─ Configuration Parameters:
│  │  │  ├─ input_size: 1024 (mel spectrogram bins)
│  │  │  ├─ output_size: depends on config
│  │  │  ├─ linear_units: hidden dim for FFN
│  │  │  ├─ attention_heads: 8
│  │  │  ├─ num_blocks: 4
│  │  │  └─ input_layer: "linear" or "conv2d"
│  │  │
│  │  └─ Architecture: Conv → Pos Enc → [Conformer Block] * N → LayerNorm
│  │
│  ├─ ConformerBlock (80+ lines)
│  │  ├─ Residual connections
│  │  ├─ FFN → Attention → Conv → FFN structure
│  │  ├─ Feed-forward network (2-layer with dropout)
│  │  ├─ Multi-head self-attention
│  │  ├─ Convolution module (depthwise)
│  │  └─ Layer normalization
│  │
│  ├─ ConvolutionModule (50 lines)
│  │  ├─ Pointwise Conv 1x1
│  │  ├─ Depthwise Conv with kernel_size (e.g., 15)
│  │  ├─ Batch normalization or layer normalization
│  │  ├─ Activation (ReLU/SiLU)
│  │  └─ Projection
│  │
│  ├─ PositionwiseFeedForward (15 lines)
│  │  ├─ Dense layer (idim → hidden)
│  │  ├─ Activation (ReLU)
│  │  ├─ Dropout
│  │  └─ Dense layer (hidden → idim)
│  │
│  └─ MultiHeadedAttention (custom)
│     ├─ Scaled dot-product attention
│     ├─ Multiple heads
│     └─ Optional relative position bias
│
├─ External Dependencies: torch, custom conformer modules
└─ Use Case: Processing mel spectrogram to extract speaker features
/home/user/IndexTTS-Rust/indextts/gpt/perceiver.py (317 LINES) ★★
├─ Purpose: Perceiver resampler for attention pooling
├─ Key Classes:
│  ├─ PerceiverResampler (250+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Learnable latent queries
│  │  │  ├─ Cross-attention layers
│  │  │  ├─ Feed-forward networks
│  │  │  └─ Layer normalization
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ dim: 512 (embedding dimension)
│  │  │  ├─ dim_context: 512 (context dimension)
│  │  │  ├─ num_latents: 32 (number of latent queries)
│  │  │  ├─ num_latent_channels: 64
│  │  │  ├─ num_layers: 6
│  │  │  ├─ ff_mult: 4 (FFN expansion)
│  │  │  └─ heads: 8
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Attend and pool
│  │  │  └─ _cross_attend_block() - Single cross-attention layer
│  │  │
│  │  └─ Cross-Attention Mechanism:
│  │     ├─ Queries: Learnable latents
│  │     ├─ Keys/Values: Input context
│  │     ├─ Output: Pooled features (num_latents × dim)
│  │     └─ FFN projection for dimension mixing
│  │
│  └─ FeedForward (15 lines)
│     ├─ Dense (dim → hidden)
│     ├─ GELU activation
│     └─ Dense (hidden → dim)
│
├─ External Dependencies: torch, einsum operations
└─ Use Case: Pool conditioning encoder output to fixed-size representation
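The pooling behaviour described above (latent queries attend over a variable-length context and return num_latents × dim features) can be shown without any tensor library. The sketch below is a deliberately reduced version: a single head, no learned Q/K/V projections, no feed-forward layer, and the `cross_attend` name is hypothetical.

```python
import math


def cross_attend(latents, context):
    """Single-head scaled dot-product cross-attention over plain Python lists.

    latents: num_latents vectors of size dim (the queries).
    context: T vectors of size dim (used as both keys and values).
    Returns num_latents vectors of size dim, whatever T is.
    """
    dim = len(latents[0])
    pooled = []
    for query in latents:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in context]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]  # softmax numerator
        norm = sum(weights)
        pooled.append([sum(w / norm * value[j] for w, value in zip(weights, context))
                       for j in range(dim)])
    return pooled


latents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]           # 2 latent queries, dim 3
short_context = [[0.5, 0.5, 0.5]] * 5                   # T = 5
long_context = [[float(t), 1.0, 0.0] for t in range(9)]  # T = 9
print(len(cross_attend(latents, short_context)), len(cross_attend(latents, long_context)))
```

Both calls return two vectors of size three, which is the fixed-size property the conditioning path relies on.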
VOCODER & AUDIO SYNTHESIS FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/indextts/BigVGAN/models.py (1000+ LINES) ★★★
├─ Purpose: BigVGAN neural vocoder for mel-to-audio conversion
├─ Key Classes:
│  ├─ BigVGAN (400+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Initial Conv1d (80 mel bins → 192 channels)
│  │  │  ├─ Upsampling layers (transposed conv)
│  │  │  ├─ AMP blocks (anti-aliased multi-periodicity)
│  │  │  ├─ Final Conv1d (channels → 1 waveform)
│  │  │  └─ Tanh activation for output
│  │  │
│  │  ├─ Upsampling: 256x total, equal to hop_size
│  │  │  ├─ Each mel frame expands to 256 audio samples at 22050 Hz
│  │  │  ├─ Listed stages: 4x → 8x → 8x → 4x (these multiply to 1024x, not 256x; verify against the model config)
│  │  │  ├─ Kernel sizes: [16, 16, 4, 4]
│  │  │  └─ Padding: [6, 6, 2, 2]
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ num_mels: 80
│  │  │  ├─ num_freq: 513
│  │  │  ├─ n_fft: 1024
│  │  │  ├─ hop_size: 256
│  │  │  ├─ win_size: 1024
│  │  │  ├─ sampling_rate: 22050
│  │  │  ├─ freq_min: 0
│  │  │  ├─ freq_max: None
│  │  │  └─ use_cuda_kernel: bool
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Mel → audio waveform
│  │  │  ├─ from_pretrained() - Load from HuggingFace
│  │  │  ├─ remove_weight_norm() - Remove weight normalization
│  │  │  └─ eval() - Set to evaluation mode
│  │  │
│  │  └─ Special Features:
│  │     ├─ Weight normalization for training stability
│  │     ├─ Spectral normalization option
│  │     ├─ CUDA kernel support for activation functions
│  │     ├─ Snake/SnakeBeta activation (periodic)
│  │     └─ Anti-aliasing filters for high-quality upsampling
│  │
│  ├─ AMPBlock1 (50 lines)
│  │  ├─ Architecture: Conv1d × 2 with activations
│  │  ├─ Multiple dilation patterns [1, 3, 5]
│  │  ├─ Residual connections
│  │  ├─ Activation1d wrapper for anti-aliasing
│  │  └─ Weight normalization
│  │
│  ├─ AMPBlock2 (40 lines)
│  │  ├─ Similar to AMPBlock1 but simpler
│  │  ├─ Dilation patterns [1, 3]
│  │  └─ Residual connections
│  │
│  ├─ Activation1d (custom, from alias_free_activation/)
│  │  ├─ Applies activation function (Snake/SnakeBeta)
│  │  ├─ Optional anti-aliasing filter
│  │  └─ Optional CUDA kernel for efficiency
│  │
│  ├─ Snake Activation (from activations.py)
│  │  ├─ Formula: x + (1/alpha) * sin²(alpha * x)
│  │  ├─ Periodic nonlinearity
│  │  └─ Learnable alpha parameter
│  │
│  └─ SnakeBeta Activation (from activations.py)
│     ├─ More complex periodic activation
│     └─ Improved harmonic modeling
│
├─ External Dependencies: torch, scipy, librosa
└─ Model Size: ~100 MB (pretrained weights)
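Two pieces of arithmetic in the vocoder listing are easy to check by hand: the Snake formula, and the requirement that the per-stage upsampling rates multiply to the hop size so that every mel frame becomes exactly hop_size samples. The sketch below works on scalars; the rate list [8, 8, 2, 2] is a hypothetical example that reaches 256 and is not taken from the model config.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation on a scalar: x + (1/alpha) * sin^2(alpha * x)."""
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2


def total_upsampling(rates):
    """Overall upsampling factor of a stack of transposed-conv stages."""
    return math.prod(rates)


HOP_SIZE = 256

print(snake(0.0), snake(math.pi / 2))
print(total_upsampling([8, 8, 2, 2]) == HOP_SIZE)  # hypothetical rates
print(total_upsampling([4, 8, 8, 4]))              # the stages as listed above
```

In the real module alpha is a learnable per-channel parameter; here it is a plain float.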
/home/user/IndexTTS-Rust/indextts/s2mel/modules/audio.py (83 LINES)
├─ Purpose: Mel-spectrogram computation (DSP)
├─ Key Functions:
│  ├─ load_wav() - Load WAV file with scipy
│  ├─ mel_spectrogram() - Compute mel spectrogram
│  │  ├─ Parameters:
│  │  │  ├─ y: waveform tensor
│  │  │  ├─ n_fft: 1024
│  │  │  ├─ num_mels: 80
│  │  │  ├─ sampling_rate: 22050
│  │  │  ├─ hop_size: 256
│  │  │  ├─ win_size: 1024
│  │  │  ├─ fmin: 0
│  │  │  └─ fmax: None or 8000
│  │  │
│  │  ├─ Process:
│  │  │  1. Pad input with reflect padding
│  │  │  2. Compute STFT (Short-Time Fourier Transform)
│  │  │  3. Convert to magnitude spectrogram
│  │  │  4. Apply mel filterbank (librosa)
│  │  │  5. Apply dynamic range compression (log)
│  │  │  └─ Output: [1, 80, T] tensor
│  │  │
│  │  └─ Caching:
│  │     ├─ Caches mel filterbank matrices
│  │     ├─ Caches Hann windows
│  │     └─ Device-specific caching
│  │
│  ├─ dynamic_range_compression() - Log compression
│  ├─ dynamic_range_decompression() - Inverse
│  └─ spectral_normalize/denormalize()
│
├─ Critical DSP Parameters:
│  ├─ STFT Window: Hann window
│  ├─ FFT Size: 1024
│  ├─ Hop Size: 256 (11.6 ms at 22050 Hz)
│  ├─ Mel Bins: 80 (perceptual scale)
│  ├─ Min Freq: 0 Hz
│  └─ Max Freq: Variable (8000 Hz or Nyquist)
│
└─ External Dependencies: torch, librosa, scipy
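The DSP parameters above imply a frame duration and a frame count that can be checked with a few lines of arithmetic, alongside the log compression step. This is a sketch on scalars under stated assumptions: the frame count assumes the input is padded so that each hop yields one frame, and the clipping value 1e-5 and scale 1.0 are assumed defaults, not values read from audio.py.

```python
import math

SAMPLING_RATE = 22050
HOP_SIZE = 256


def hop_ms(hop=HOP_SIZE, sr=SAMPLING_RATE):
    """Duration of one mel frame step in milliseconds."""
    return 1000.0 * hop / sr


def num_frames(num_samples, hop=HOP_SIZE):
    """Frame count T, assuming padding that gives one frame per full hop."""
    return num_samples // hop


def dynamic_range_compression(x, scale=1.0, clip_val=1e-5):
    """Log compression with a floor so that log(0) never occurs (assumed defaults)."""
    return math.log(max(x, clip_val) * scale)


def dynamic_range_decompression(y, scale=1.0):
    """Inverse of the compression above for values over the clipping floor."""
    return math.exp(y) / scale


print(round(hop_ms(), 1))         # frame step in ms
print(num_frames(SAMPLING_RATE))  # frames in one second of audio
print(dynamic_range_decompression(dynamic_range_compression(0.5)))
```

The floor matters for silent frames, where the magnitude can be exactly zero.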
SEMANTIC CODEC & FEATURE EXTRACTION FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/indextts/utils/maskgct_utils.py (250 LINES)
├─ Purpose: Build and manage semantic codecs
├─ Key Functions:
│  ├─ build_semantic_model()
│  │  ├─ Loads: facebook/w2v-bert-2.0 model
│  │  ├─ Extracts: W2V-BERT 2.0 embeddings
│  │  ├─ Returns: model, mean, std (for normalization)
│  │  └─ Output: 1024-dimensional embeddings
│  │
│  ├─ build_semantic_codec()
│  │  ├─ Creates: RepCodec (residual vector quantization)
│  │  ├─ Quantizes: Semantic embeddings
│  │  ├─ Returns: Codec model
│  │  └─ Output: Discrete tokens
│  │
│  ├─ build_s2a_model()
│  │  ├─ Builds: MaskGCT_S2A (semantic-to-acoustic)
│  │  └─ Maps: Semantic codes → acoustic codes
│  │
│  ├─ build_acoustic_codec()
│  │  ├─ Encoder: Encodes acoustic features
│  │  ├─ Decoder: Decodes codes → audio
│  │  └─ Multiple codec variants
│  │
│  └─ Inference_Pipeline (class)
│     ├─ Combines all codecs
│     ├─ Methods:
│     │  ├─ get_emb() - Get semantic embeddings
│     │  ├─ get_scode() - Quantize to semantic codes
│     │  ├─ semantic2acoustic() - Convert codes
│     │  └─ s2a_inference() - Full pipeline
│     └─ Diffusion-based generation options
│
├─ External Dependencies: torch, transformers, huggingface_hub
└─ Pre-trained Models:
   ├─ W2V-BERT-2.0: 614M parameters
   ├─ MaskGCT: From amphion/MaskGCT
   └─ Various codec checkpoints
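The listing above describes two steps between raw semantic features and discrete tokens: normalization with the returned mean and std, then quantization by the codec. The sketch below shows the simplest form of that idea, a single nearest-neighbour codebook lookup on plain lists. It is an illustration under assumptions: RepCodec's actual quantizer, codebook size, and function names are not reproduced here.

```python
def normalize_features(features, mean, std):
    """Standardize one feature vector with per-dimension mean and std."""
    return [(f - m) / s for f, m, s in zip(features, mean, std)]


def quantize(vector, codebook):
    """Return the index of the codebook entry closest in squared distance."""
    def squared_distance(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))

    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))


# Toy 2-dimensional example; real embeddings are 1024-dimensional.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
normalized = normalize_features([3.0, 4.0], mean=[2.0, 3.0], std=[1.0, 1.0])
print(normalized, quantize(normalized, codebook))
```

A residual quantizer would repeat the lookup on the remaining error with further codebooks; that step is omitted here.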
CONFIGURATION & UTILITY FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/indextts/utils/checkpoint.py (50 LINES)
├─ Purpose: Load model checkpoints
├─ Key Functions:
│  ├─ load_checkpoint() - Load weights into model
│  └─ Device handling (CPU/GPU/XPU/MPS)
└─ Supported Formats: .pth, .safetensors
/home/user/IndexTTS-Rust/indextts/utils/arch_util.py
├─ Purpose: Architecture utility modules
├─ Key Classes:
│  └─ AttentionBlock - Generic attention layer
└─ Used in: Conditioning encoder, other modules
/home/user/IndexTTS-Rust/indextts/utils/xtransformers.py (1,600 LINES)
├─ Purpose: Extended transformer utilities
├─ Key Components:
│  ├─ Advanced attention mechanisms
│  ├─ Relative position bias
│  ├─ Cross-attention patterns
│  └─ Various position encoding schemes
└─ Used in: GPT model, encoders
TESTING FILES
────────────────────────────────────────────────────────────────────────────────
/home/user/IndexTTS-Rust/tests/regression_test.py
├─ Test Cases:
│  ├─ Chinese text with pinyin tones (晕 XUAN4)
│  ├─ English text
│  ├─ Mixed Chinese-English
│  ├─ Long-form text with multiple sentences
│  ├─ Named entities (Joseph Gordon-Levitt)
│  ├─ Chinese names (约瑟夫·高登-莱维特)
│  └─ Extended passages for robustness
├─ Inference Modes:
│  ├─ Single inference (infer)
│  └─ Fast inference (infer_fast)
└─ Output: WAV files in outputs/ directory
/home/user/IndexTTS-Rust/tests/padding_test.py
├─ Test Scenarios:
│  ├─ Variable length inputs
│  ├─ Batch processing
│  ├─ Edge cases
│  └─ Padding handling
└─ Purpose: Ensure robust padding mechanics
════════════════════════════════════════════════════════════════════════════════
KEY ALGORITHMS SUMMARY:
1. TEXT PROCESSING:
   - Regex-based pattern matching for pinyin/names
   - Character-level CJK tokenization
   - SentencePiece BPE encoding
   - Language detection (Chinese vs English)
2. FEATURE EXTRACTION:
   - W2V-BERT semantic embeddings (1024-dim)
   - RepCodec quantization
   - Mel-spectrogram (STFT-based, 80-dim)
   - CAMPPlus speaker embeddings (192-dim)
3. SEQUENCE GENERATION:
   - GPT-based autoregressive generation
   - Conformer speaker conditioning
   - Perceiver pooling for attention
   - Classifier-free guidance (optional)
   - Temperature/top-k/top-p sampling
4. AUDIO SYNTHESIS:
   - Transposed convolution upsampling (256x)
   - Anti-aliased activation functions
   - Residual connections
   - Weight/spectral normalization
5. EMOTION CONTROL:
   - 8-dimensional emotion vectors
   - Text-based emotion detection (via Qwen)
   - Audio-based emotion extraction
   - Emotion matrix interpolation
════════════════════════════════════════════════════════════════════════════════
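The emotion control items in the summary (8-dimensional emotion vectors and emotion matrix interpolation) can be read as a weighted blend of per-emotion embeddings. The sketch below shows one plausible form of that blend. It is an assumption-heavy illustration: the normalization of weights, the matrix layout of one row per emotion, and the `blend_emotions` name are hypothetical and not confirmed by the listing.

```python
def blend_emotions(weights, emotion_matrix):
    """Weighted average of per-emotion embedding rows.

    weights: one non-negative weight per emotion (8 in the listing above).
    emotion_matrix: one embedding row per emotion, all of the same size.
    """
    if len(weights) != len(emotion_matrix):
        raise ValueError("need exactly one weight per emotion row")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    dim = len(emotion_matrix[0])
    return [sum(w / total * row[j] for w, row in zip(weights, emotion_matrix))
            for j in range(dim)]


# Toy matrix: 8 emotions, 2-dimensional embeddings (row i is [i, 2i]).
matrix = [[float(i), float(2 * i)] for i in range(8)]
one_hot = [0, 0, 0, 1, 0, 0, 0, 0]
mixed = [1, 0, 1, 0, 0, 0, 0, 0]
print(blend_emotions(one_hot, matrix), blend_emotions(mixed, matrix))
```

A one-hot vector selects a single row, while mixed weights interpolate between rows.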