╔════════════════════════════════════════════════════════════════════════════════╗
║                    DETAILED SOURCE FILE LISTING BY CATEGORY                    ║
╚════════════════════════════════════════════════════════════════════════════════╝
MAIN INFERENCE PIPELINE FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/indextts/infer_v2.py (739 LINES) ⭐⭐⭐ CRITICAL
├─ Purpose: Main TTS inference class (IndexTTS2)
├─ Key Classes:
│  ├─ QwenEmotion (emotion text-to-vector conversion)
│  ├─ IndexTTS2 (main inference class)
│  └─ Helper functions for emotion/audio processing
├─ Key Methods:
│  ├─ __init__() - Initialize all models and codecs
│  ├─ infer() - Single text generation with emotion control
│  ├─ infer_fast() - Parallel segment generation
│  ├─ get_emb() - Extract semantic embeddings
│  ├─ remove_long_silence() - Silence token removal
│  ├─ insert_interval_silence() - Silence insertion
│  └─ Cache management for repeated generation
├─ Models Loaded:
│  ├─ UnifiedVoice (GPT model for mel token generation)
│  ├─ W2V-BERT (semantic feature extraction)
│  ├─ RepCodec (semantic codec)
│  ├─ S2Mel model (semantic-to-mel conversion)
│  ├─ CAMPPlus (speaker embedding)
│  ├─ BigVGAN vocoder
│  ├─ Qwen-based emotion model
│  └─ Emotion/speaker matrices
└─ External Dependencies: torch, transformers, librosa, safetensors
/home/user/IndexTTS-Rust/webui.py (18KB) ⭐⭐⭐ WEB INTERFACE
├─ Purpose: Gradio-based web UI for IndexTTS
├─ Key Components:
│  ├─ Model initialization (IndexTTS2 instance)
│  ├─ Language selection (Chinese/English)
│  ├─ Emotion control modes (4 modes)
│  ├─ Example case loading from cases.jsonl
│  ├─ Progress bar integration
│  └─ Output management
├─ Features:
│  ├─ Real-time inference
│  ├─ Multiple emotion control methods
│  ├─ Batch processing
│  ├─ Task caching
│  ├─ i18n support
│  └─ Pre-loaded example cases
└─ Web Framework: Gradio 5.34.1
/home/user/IndexTTS-Rust/indextts/cli.py (64 LINES)
├─ Purpose: Command-line interface
├─ Usage: python -m indextts.cli <text> -v <voice.wav> -o <output.wav> [options]
├─ Arguments:
│  ├─ text: Text to synthesize
│  ├─ -v/--voice: Voice reference audio
│  ├─ -o/--output_path: Output file path
│  ├─ -c/--config: Config file path
│  ├─ --model_dir: Model directory
│  ├─ --fp16: Use FP16 precision
│  ├─ -d/--device: Device (cpu/cuda/mps/xpu)
│  └─ -f/--force: Force overwrite
└─ Uses: IndexTTS (v1 model)
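
The argument list above maps directly onto argparse. The following is a minimal sketch rather than the contents of cli.py: the flag names come from the listing, while the default values, help strings and the build_parser name are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the arguments listed for indextts/cli.py.

    Defaults below are illustrative assumptions, not the real CLI's values.
    """
    parser = argparse.ArgumentParser(prog="indextts", description="IndexTTS CLI sketch")
    parser.add_argument("text", help="Text to synthesize")
    parser.add_argument("-v", "--voice", required=True, help="Voice reference audio")
    parser.add_argument("-o", "--output_path", default="gen.wav", help="Output file path")
    parser.add_argument("-c", "--config", default="checkpoints/config.yaml", help="Config file path")
    parser.add_argument("--model_dir", default="checkpoints", help="Model directory")
    parser.add_argument("--fp16", action="store_true", help="Use FP16 precision")
    parser.add_argument("-d", "--device", default=None,
                        choices=["cpu", "cuda", "mps", "xpu"], help="Device to run on")
    parser.add_argument("-f", "--force", action="store_true", help="Overwrite existing output")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["hello world", "-v", "voice.wav", "-o", "out.wav", "--fp16"])
print(args.text, args.voice, args.output_path, args.fp16, args.device)
```

Leaving --device at None would let the caller pick a device automatically; whether the real CLI does so is not stated in this listing.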
TEXT PROCESSING & NORMALIZATION FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/indextts/utils/front.py (700 LINES) ⭐⭐⭐ CRITICAL
├─ Purpose: Text normalization and tokenization
├─ Key Classes:
│  ├─ TextNormalizer (700+ lines)
│  │  ├─ Pattern Definitions:
│  │  │  ├─ PINYIN_TONE_PATTERN (regex for pinyin with tones 1-5)
│  │  │  ├─ NAME_PATTERN (regex for Chinese names)
│  │  │  └─ ENGLISH_CONTRACTION_PATTERN (regex for 's contractions)
│  │  ├─ Methods:
│  │  │  ├─ normalize() - Main normalization
│  │  │  ├─ use_chinese() - Language detection
│  │  │  ├─ save_pinyin_tones() - Extract pinyin with tones
│  │  │  ├─ restore_pinyin_tones() - Restore pinyin
│  │  │  ├─ save_names() - Extract names
│  │  │  ├─ restore_names() - Restore names
│  │  │  ├─ correct_pinyin() - Pinyin correction (u/ü after j/q/x → v, e.g. ju → jv)
│  │  │  └─ char_rep_map - Character replacement dictionary
│  │  └─ Normalizers:
│  │     ├─ zh_normalizer (Chinese) - Uses WeTextProcessing/wetext
│  │     └─ en_normalizer (English) - Uses tn library
│  │
│  └─ TextTokenizer (200+ lines)
│     ├─ Methods:
│     │  ├─ encode() - Text to token IDs
│     │  ├─ decode() - Token IDs to text
│     │  ├─ convert_tokens_to_ids()
│     │  ├─ convert_ids_to_tokens()
│     │  └─ Vocab management
│     ├─ Special Tokens:
│     │  ├─ BOS: "<s>" (ID 0)
│     │  ├─ EOS: "</s>" (ID 1)
│     │  └─ UNK: "<unk>"
│     └─ Tokenizer: SentencePiece (BPE-based)
├─ Language Support:
│  ├─ Chinese (simplified & traditional)
│  ├─ English
│  └─ Mixed Chinese-English
└─ Critical Pattern Matching:
   ├─ Pinyin tone detection
   ├─ Name entity detection
   ├─ Email matching
   ├─ Character replacement
   └─ Punctuation handling
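
The save/restore pair protects toned pinyin from the normalizer by swapping it for placeholders and putting it back afterwards. A minimal sketch of that idea follows. The regex, the placeholder format and the function bodies are assumptions for illustration and are not copied from front.py.

```python
import re

# Assumed pattern: 1-6 letters followed by a tone digit 1-5, not glued to other
# ASCII letters or digits. The real PINYIN_TONE_PATTERN may be stricter.
PINYIN_TONE = re.compile(r"(?<![A-Za-z])([A-Za-zü]{1,6}[1-5])(?![A-Za-z0-9])")


def correct_pinyin(pinyin):
    """After j/q/x, write the final u/ü as v (ju2 -> JV2); upper-case the result."""
    if pinyin[0] in "jqxJQX":
        pinyin = re.sub(r"^([jqx])[uü]", r"\1v", pinyin, flags=re.IGNORECASE)
    return pinyin.upper()


def save_pinyin_tones(text):
    """Replace each toned pinyin with a placeholder and return (masked_text, saved)."""
    saved = []

    def stash(match):
        saved.append(match.group(1))
        # Letter-suffixed placeholder (hypothetical format); supports 26 entries.
        return "<pinyin_" + chr(ord("a") + len(saved) - 1) + ">"

    return PINYIN_TONE.sub(stash, text), saved


def restore_pinyin_tones(text, saved):
    """Put the corrected pinyin back in place of the placeholders."""
    for index, pinyin in enumerate(saved):
        text = text.replace("<pinyin_" + chr(ord("a") + index) + ">", correct_pinyin(pinyin))
    return text


masked, saved = save_pinyin_tones("晕 XUAN4 是 ju2")
print(masked)
print(restore_pinyin_tones(masked, saved))
```

In a full pipeline the normalizer would run on the masked text between the two calls. This sketch also matches tokens such as "MP3", which a production pattern would need to exclude.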
GPT MODEL ARCHITECTURE FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/indextts/gpt/model_v2.py (747 LINES) ⭐⭐⭐ CRITICAL
├─ Purpose: UnifiedVoice GPT-based TTS model
├─ Key Classes:
│  ├─ UnifiedVoice (700+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Input Embeddings: Text (256 vocab), Mel (8194 vocab)
│  │  │  ├─ Position Embeddings: Learned embeddings for mel/text
│  │  │  ├─ GPT Transformer: Configurable layers/heads
│  │  │  ├─ Conditioning Encoder: Conformer or Perceiver-based
│  │  │  ├─ Emotion Conditioning: Separate conformer + perceiver
│  │  │  └─ Output Heads: Text prediction, Mel prediction
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ layers: 8 (transformer depth)
│  │  │  ├─ model_dim: 512 (embedding dimension)
│  │  │  ├─ heads: 8 (attention heads)
│  │  │  ├─ max_text_tokens: 120
│  │  │  ├─ max_mel_tokens: 250
│  │  │  ├─ number_mel_codes: 8194
│  │  │  ├─ condition_type: "conformer_perceiver" or "conformer_encoder"
│  │  │  └─ Various activation functions
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Forward pass
│  │  │  ├─ post_init_gpt2_config() - Initialize for inference
│  │  │  ├─ generate_mel() - Mel token generation
│  │  │  ├─ forward_with_cond_scale() - With classifier-free guidance
│  │  │  └─ Cache management
│  │  │
│  │  └─ Conditioning System:
│  │     ├─ Speaker conditioning via mel spectrogram
│  │     ├─ Conformer encoder for speaker features
│  │     ├─ Perceiver for attention pooling
│  │     ├─ Emotion conditioning (separate pathway)
│  │     └─ Emotion vector support (8-dimensional)
│  │
│  ├─ ResBlock (40+ lines)
│  │  ├─ Conv1d layers with GroupNorm
│  │  └─ ReLU activation with residual connection
│  │
│  ├─ GPT2InferenceModel (200+ lines)
│  │  ├─ Inference wrapper for GPT2
│  │  ├─ KV cache support
│  │  ├─ Model parallelism support
│  │  └─ Token-by-token generation
│  │
│  ├─ ConditioningEncoder (30 lines)
│  │  ├─ Conv1d initialization
│  │  ├─ Attention blocks
│  │  └─ Optional mean pooling
│  │
│  ├─ MelEncoder (30 lines)
│  │  ├─ Conv1d layers
│  │  ├─ ResBlocks
│  │  └─ 4x reduction
│  │
│  ├─ LearnedPositionEmbeddings (15 lines)
│  │  └─ Learnable positional embeddings
│  │
│  └─ build_hf_gpt_transformer() (20 lines)
│     └─ Builds HuggingFace GPT2 with custom embeddings
│
├─ External Dependencies: torch, transformers, indextts.gpt modules
└─ Critical Inference Parameters:
   ├─ Temperature control for generation
   ├─ Top-k/top-p sampling
   ├─ Classifier-free guidance scale
   └─ Generation length limits
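
The inference parameters above (temperature, top-k, top-p) shape how each mel token is drawn from the model's output logits. A pure-Python sketch of one sampling step follows. The function name and signature are hypothetical, and the actual model relies on the generation utilities in transformers rather than code like this.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick one token id from raw logits using temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature: values < 1 sharpen the distribution, values > 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, token in ranked:
        kept.append((prob, token))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Draw from the survivors; random.choices renormalizes the weights itself.
    tokens = [token for _, token in kept]
    probs = [prob for prob, _ in kept]
    return rng.choices(tokens, weights=probs, k=1)[0]


print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_k=3, top_p=0.9))
```

Generation would call this once per step and stop at the stop token or the length limit.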
/home/user/IndexTTS-Rust/indextts/gpt/conformer_encoder.py (520 LINES) ⭐⭐
├─ Purpose: Conformer-based speaker conditioning encoder
├─ Key Classes:
│  ├─ ConformerEncoder (main)
│  │  ├─ Modules:
│  │  │  ├─ Subsampling layer (Conv2d)
│  │  │  ├─ Positional encoding
│  │  │  ├─ Conformer blocks
│  │  │  ├─ Layer normalization
│  │  │  └─ Optional projection layer
│  │  │
│  │  ├─ Configuration Parameters:
│  │  │  ├─ input_size: 1024 (mel spectrogram bins)
│  │  │  ├─ output_size: depends on config
│  │  │  ├─ linear_units: hidden dim for FFN
│  │  │  ├─ attention_heads: 8
│  │  │  ├─ num_blocks: 4
│  │  │  └─ input_layer: "linear" or "conv2d"
│  │  │
│  │  └─ Architecture: Conv → Pos Enc → [Conformer Block] * N → LayerNorm
│  │
│  ├─ ConformerBlock (80+ lines)
│  │  ├─ Residual connections
│  │  ├─ FFN → Attention → Conv → FFN structure
│  │  ├─ Feed-forward network (2-layer with dropout)
│  │  ├─ Multi-head self-attention
│  │  ├─ Convolution module (depthwise)
│  │  └─ Layer normalization
│  │
│  ├─ ConvolutionModule (50 lines)
│  │  ├─ Pointwise Conv 1x1
│  │  ├─ Depthwise Conv with kernel_size (e.g., 15)
│  │  ├─ Batch normalization or layer normalization
│  │  ├─ Activation (ReLU/SiLU)
│  │  └─ Projection
│  │
│  ├─ PositionwiseFeedForward (15 lines)
│  │  ├─ Dense layer (idim → hidden)
│  │  ├─ Activation (ReLU)
│  │  ├─ Dropout
│  │  └─ Dense layer (hidden → idim)
│  │
│  └─ MultiHeadedAttention (custom)
│     ├─ Scaled dot-product attention
│     ├─ Multiple heads
│     └─ Optional relative position bias
│
├─ External Dependencies: torch, custom conformer modules
└─ Use Case: Processing mel spectrogram to extract speaker features
/home/user/IndexTTS-Rust/indextts/gpt/perceiver.py (317 LINES) ⭐⭐
├─ Purpose: Perceiver resampler for attention pooling
├─ Key Classes:
│  ├─ PerceiverResampler (250+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Learnable latent queries
│  │  │  ├─ Cross-attention layers
│  │  │  ├─ Feed-forward networks
│  │  │  └─ Layer normalization
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ dim: 512 (embedding dimension)
│  │  │  ├─ dim_context: 512 (context dimension)
│  │  │  ├─ num_latents: 32 (number of latent queries)
│  │  │  ├─ num_latent_channels: 64
│  │  │  ├─ num_layers: 6
│  │  │  ├─ ff_mult: 4 (FFN expansion)
│  │  │  └─ heads: 8
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Attend and pool
│  │  │  └─ _cross_attend_block() - Single cross-attention layer
│  │  │
│  │  └─ Cross-Attention Mechanism:
│  │     ├─ Queries: Learnable latents
│  │     ├─ Keys/Values: Input context
│  │     ├─ Output: Pooled features (num_latents × dim)
│  │     └─ FFN projection for dimension mixing
│  │
│  └─ FeedForward (15 lines)
│     ├─ Dense (dim → hidden)
│     ├─ GELU activation
│     └─ Dense (hidden → dim)
│
├─ External Dependencies: torch, einsum operations
└─ Use Case: Pool conditioning encoder output to fixed-size representation
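
The pooling behaviour described above comes from using a fixed set of latents as the queries, so the output always has num_latents rows however long the input is. A single-head, pure-Python sketch of one cross-attention step follows. Learned projections, multiple heads, the FFN and layer normalization are deliberately left out, so this illustrates the mechanism and is not a port of perceiver.py.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(latents, context):
    """Latents act as queries; context frames act as both keys and values."""
    dim = len(latents[0])
    pooled = []
    for query in latents:
        # Scaled dot-product score of this latent against every context frame.
        scores = [sum(q * k for q, k in zip(query, frame)) / math.sqrt(dim)
                  for frame in context]
        weights = softmax(scores)
        # Attention-weighted average of the context frames.
        pooled.append([sum(w * frame[d] for w, frame in zip(weights, context))
                       for d in range(dim)])
    return pooled


latents = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]          # num_latents=2, dim=3 (toy sizes)
short = [[float(t), 1.0, 0.5] for t in range(5)]      # 5 context frames
long = [[float(t), 1.0, 0.5] for t in range(9)]       # 9 context frames
print(len(cross_attend(latents, short)), len(cross_attend(latents, long)))
```

Both calls return two rows, which is the fixed-size property the conditioning pathway depends on.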
VOCODER & AUDIO SYNTHESIS FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/indextts/BigVGAN/models.py (1000+ LINES) ⭐⭐⭐
├─ Purpose: BigVGAN neural vocoder for mel-to-audio conversion
├─ Key Classes:
│  ├─ BigVGAN (400+ lines)
│  │  ├─ Architecture:
│  │  │  ├─ Initial Conv1d (80 mel bins → 192 channels)
│  │  │  ├─ Upsampling layers (transposed conv)
│  │  │  ├─ AMP blocks (anti-aliased multi-periodicity)
│  │  │  ├─ Final Conv1d (channels → 1 waveform)
│  │  │  └─ Tanh activation for output
│  │  │
│  │  ├─ Upsampling: staged transposed convolutions, 256x total
│  │  │  ├─ Expands each mel frame into hop_size (256) audio samples at 22050 Hz
│  │  │  ├─ Per-stage rates: set in the model config; their product must equal 256
│  │  │  ├─ Kernel sizes: [16, 16, 4, 4]
│  │  │  └─ Padding: (kernel_size - rate) / 2 per stage
│  │  │
│  │  ├─ Parameters:
│  │  │  ├─ num_mels: 80
│  │  │  ├─ num_freq: 513
│  │  │  ├─ n_fft: 1024
│  │  │  ├─ hop_size: 256
│  │  │  ├─ win_size: 1024
│  │  │  ├─ sampling_rate: 22050
│  │  │  ├─ freq_min: 0
│  │  │  ├─ freq_max: None
│  │  │  └─ use_cuda_kernel: bool
│  │  │
│  │  ├─ Key Methods:
│  │  │  ├─ forward() - Mel → audio waveform
│  │  │  ├─ from_pretrained() - Load from HuggingFace
│  │  │  ├─ remove_weight_norm() - Remove weight normalization before inference
│  │  │  └─ eval() - Set to evaluation mode
│  │  │
│  │  └─ Special Features:
│  │     ├─ Weight normalization for training stability
│  │     ├─ Spectral normalization option
│  │     ├─ CUDA kernel support for activation functions
│  │     ├─ Snake/SnakeBeta activation (periodic)
│  │     └─ Anti-aliasing filters for high-quality upsampling
│  │
│  ├─ AMPBlock1 (50 lines)
│  │  ├─ Architecture: Conv1d × 2 with activations
│  │  ├─ Multiple dilation patterns [1, 3, 5]
│  │  ├─ Residual connections
│  │  ├─ Activation1d wrapper for anti-aliasing
│  │  └─ Weight normalization
│  │
│  ├─ AMPBlock2 (40 lines)
│  │  ├─ Similar to AMPBlock1 but simpler
│  │  ├─ Dilation patterns [1, 3]
│  │  └─ Residual connections
│  │
│  ├─ Activation1d (custom, from alias_free_activation/)
│  │  ├─ Applies activation function (Snake/SnakeBeta)
│  │  ├─ Optional anti-aliasing filter
│  │  └─ Optional CUDA kernel for efficiency
│  │
│  ├─ Snake Activation (from activations.py)
│  │  ├─ Formula: x + (1/alpha) * sin²(alpha * x)
│  │  ├─ Periodic nonlinearity
│  │  └─ Learnable alpha parameter
│  │
│  └─ SnakeBeta Activation (from activations.py)
│     ├─ More complex periodic activation
│     └─ Improved harmonic modeling
│
├─ External Dependencies: torch, scipy, librosa
└─ Model Size: ~100 MB (pretrained weights)
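
Two pieces of arithmetic in the vocoder description can be checked by hand: the Snake formula, and the requirement that the upsampling stages multiply out to the hop size, since each mel frame expands to hop_size (256) samples. The sketch below works through both. The per-stage rates [8, 8, 2, 2] are an assumption chosen to be consistent with the listed kernel sizes and a 256x total, not values read from the model config, and the helper names are hypothetical.

```python
import math


def snake(x, alpha=1.0):
    """Snake activation: x + (1/alpha) * sin^2(alpha * x)."""
    return x + (1.0 / alpha) * math.sin(alpha * x) ** 2


def total_upsampling(rates):
    """Overall upsampling factor: the product of the per-stage rates."""
    return math.prod(rates)


def transposed_conv_padding(kernel_size, rate):
    """Padding that makes a stride-`rate` transposed conv scale length by exactly `rate`."""
    return (kernel_size - rate) // 2


def transposed_conv_out_len(length, kernel_size, rate, padding):
    """Output length of a 1-D transposed conv (dilation 1, no output padding)."""
    return (length - 1) * rate - 2 * padding + kernel_size


rates = [8, 8, 2, 2]          # assumed stage rates (hypothetical)
kernels = [16, 16, 4, 4]      # kernel sizes from the listing

length = 100                  # number of mel frames
for kernel, rate in zip(kernels, rates):
    pad = transposed_conv_padding(kernel, rate)
    length = transposed_conv_out_len(length, kernel, rate, pad)

print(total_upsampling(rates), length)
```

With these values, 100 mel frames become 25,600 samples, which is 100 × 256.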
/home/user/IndexTTS-Rust/indextts/s2mel/modules/audio.py (83 LINES)
├─ Purpose: Mel-spectrogram computation (DSP)
├─ Key Functions:
│  ├─ load_wav() - Load WAV file with scipy
│  ├─ mel_spectrogram() - Compute mel spectrogram
│  │  ├─ Parameters:
│  │  │  ├─ y: waveform tensor
│  │  │  ├─ n_fft: 1024
│  │  │  ├─ num_mels: 80
│  │  │  ├─ sampling_rate: 22050
│  │  │  ├─ hop_size: 256
│  │  │  ├─ win_size: 1024
│  │  │  ├─ fmin: 0
│  │  │  └─ fmax: None or 8000
│  │  │
│  │  ├─ Process:
│  │  │  1. Pad input with reflect padding
│  │  │  2. Compute STFT (Short-Time Fourier Transform)
│  │  │  3. Convert to magnitude spectrogram
│  │  │  4. Apply mel filterbank (librosa)
│  │  │  5. Apply dynamic range compression (log)
│  │  │  └─ Output: [1, 80, T] tensor
│  │  │
│  │  └─ Caching:
│  │     ├─ Caches mel filterbank matrices
│  │     ├─ Caches Hann windows
│  │     └─ Device-specific caching
│  │
│  ├─ dynamic_range_compression() - Log compression
│  ├─ dynamic_range_decompression() - Inverse
│  └─ spectral_normalize/denormalize()
│
├─ Critical DSP Parameters:
│  ├─ STFT Window: Hann window
│  ├─ FFT Size: 1024
│  ├─ Hop Size: 256 (11.6 ms at 22050 Hz)
│  ├─ Mel Bins: 80 (perceptual scale)
│  ├─ Min Freq: 0 Hz
│  └─ Max Freq: Variable (8000 Hz or Nyquist)
│
└─ External Dependencies: torch, librosa, scipy
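
The DSP parameters above determine the frame timing and the log compression applied to the mel magnitudes. A scalar sketch of that arithmetic follows. The clip value of 1e-5, the scale argument, and the padding of (n_fft - hop_size) / 2 per side with an uncentered STFT follow the common HiFi-GAN-style recipe and are assumptions here, not details confirmed by this listing.

```python
import math


def dynamic_range_compression(x, scale=1.0, clip_val=1e-5):
    """Log-compress a magnitude, clamping first so log(0) never occurs."""
    return math.log(max(x, clip_val) * scale)


def dynamic_range_decompression(y, scale=1.0):
    """Inverse of the compression for values above the clip threshold."""
    return math.exp(y) / scale


def hop_duration_ms(hop_size=256, sampling_rate=22050):
    """Time advanced per mel frame, in milliseconds."""
    return 1000.0 * hop_size / sampling_rate


def num_frames(num_samples, n_fft=1024, hop_size=256):
    """Frame count after reflect-padding (n_fft - hop_size) / 2 samples per side.

    Assumes the STFT itself adds no further centering pad.
    """
    pad = (n_fft - hop_size) // 2
    padded = num_samples + 2 * pad
    return 1 + (padded - n_fft) // hop_size


print(round(hop_duration_ms(), 2))   # milliseconds per frame
print(num_frames(22050))             # frames in one second of audio
```

Under these assumptions one second of 22050 Hz audio yields 86 frames, so T in the [1, 80, T] output is roughly num_samples / hop_size.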
SEMANTIC CODEC & FEATURE EXTRACTION FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/indextts/utils/maskgct_utils.py (250 LINES)
├─ Purpose: Build and manage semantic codecs
├─ Key Functions:
│  ├─ build_semantic_model()
│  │  ├─ Loads: facebook/w2v-bert-2.0 model
│  │  ├─ Extracts: W2V-BERT 2.0 hidden-state embeddings
│  │  ├─ Returns: model, mean, std (for normalization)
│  │  └─ Output: 1024-dimensional embeddings
│  │
│  ├─ build_semantic_codec()
│  │  ├─ Creates: RepCodec (residual vector quantization)
│  │  ├─ Quantizes: Semantic embeddings
│  │  ├─ Returns: Codec model
│  │  └─ Output: Discrete tokens
│  │
│  ├─ build_s2a_model()
│  │  ├─ Builds: MaskGCT_S2A (semantic-to-acoustic)
│  │  └─ Maps: Semantic codes → acoustic codes
│  │
│  ├─ build_acoustic_codec()
│  │  ├─ Encoder: Encodes acoustic features
│  │  ├─ Decoder: Decodes codes → audio
│  │  └─ Multiple codec variants
│  │
│  └─ Inference_Pipeline (class)
│     ├─ Combines all codecs
│     ├─ Methods:
│     │  ├─ get_emb() - Get semantic embeddings
│     │  ├─ get_scode() - Quantize to semantic codes
│     │  ├─ semantic2acoustic() - Convert codes
│     │  └─ s2a_inference() - Full pipeline
│     └─ Diffusion-based generation options
│
├─ External Dependencies: torch, transformers, huggingface_hub
└─ Pre-trained Models:
   ├─ W2V-BERT-2.0: ~600M parameters
   ├─ MaskGCT: From amphion/MaskGCT
   └─ Various codec checkpoints
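
The path from embeddings to discrete tokens (normalize with the stored mean and std, then quantize) can be illustrated with a toy nearest-neighbour codebook lookup. This is a sketch of the general vector-quantization idea only: the real RepCodec also has learned encoder and decoder layers, and the function names, 2-dimensional frames and 3-entry codebook below are made up for the example.

```python
def normalize(frame, mean, std):
    """Standardize one embedding frame with per-dimension mean and std."""
    return [(x - m) / s for x, m, s in zip(frame, mean, std)]


def quantize(frame, codebook):
    """Index of the codebook entry closest to `frame` (squared L2 distance)."""
    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))

    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def to_semantic_codes(frames, mean, std, codebook):
    """Normalize every frame, then map it to its nearest codebook index."""
    return [quantize(normalize(f, mean, std), codebook) for f in frames]


# Toy data: 2-dim "embeddings" instead of 1024-dim, and a 3-entry codebook.
mean, std = [1.0, 1.0], [2.0, 2.0]
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[1.1, 0.9], [3.0, 3.2], [-1.0, 2.8]]
print(to_semantic_codes(frames, mean, std, codebook))
```

Each frame becomes one integer token, which is the form downstream models consume.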
CONFIGURATION & UTILITY FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/indextts/utils/checkpoint.py (50 LINES)
├─ Purpose: Load model checkpoints
├─ Key Functions:
│  ├─ load_checkpoint() - Load weights into model
│  └─ Device handling (CPU/GPU/XPU/MPS)
└─ Supported Formats: .pth, .safetensors
/home/user/IndexTTS-Rust/indextts/utils/arch_util.py
├─ Purpose: Architecture utility modules
├─ Key Classes:
│  └─ AttentionBlock - Generic attention layer
└─ Used in: Conditioning encoder, other modules
/home/user/IndexTTS-Rust/indextts/utils/xtransformers.py (1,600 LINES)
├─ Purpose: Extended transformer utilities
├─ Key Components:
│  ├─ Advanced attention mechanisms
│  ├─ Relative position bias
│  ├─ Cross-attention patterns
│  └─ Various position encoding schemes
└─ Used in: GPT model, encoders
TESTING FILES
═════════════════════════════════════════════════════════════════════════════════
/home/user/IndexTTS-Rust/tests/regression_test.py
├─ Test Cases:
│  ├─ Chinese text with pinyin tones (晕 XUAN4)
│  ├─ English text
│  ├─ Mixed Chinese-English
│  ├─ Long-form text with multiple sentences
│  ├─ Named entities (Joseph Gordon-Levitt)
│  ├─ Chinese names (约瑟夫·高登-莱维特)
│  └─ Extended passages for robustness
├─ Inference Modes:
│  ├─ Single inference (infer)
│  └─ Fast inference (infer_fast)
└─ Output: WAV files in outputs/ directory
/home/user/IndexTTS-Rust/tests/padding_test.py
├─ Test Scenarios:
│  ├─ Variable length inputs
│  ├─ Batch processing
│  ├─ Edge cases
│  └─ Padding handling
└─ Purpose: Ensure robust padding mechanics
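
Batching variable-length inputs, as exercised by the padding scenarios above, usually means right-padding every sequence to the longest one and tracking which positions are real. The sketch below shows that general technique. The pad_batch helper and the pad id of 0 are hypothetical and are not taken from padding_test.py.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token-id lists to a common length; return (padded, mask).

    mask holds 1 for real tokens and 0 for padding positions.
    """
    if not sequences:
        return [], []
    max_len = max(len(seq) for seq in sequences)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch, mask = pad_batch([[5, 6, 7], [8], []])
print(batch)
print(mask)
```

The empty list and the empty sequence are edge cases a padding test would typically cover.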
═════════════════════════════════════════════════════════════════════════════════
KEY ALGORITHMS SUMMARY:
1. TEXT PROCESSING:
- Regex-based pattern matching for pinyin/names
- Character-level CJK tokenization
- SentencePiece BPE encoding
- Language detection (Chinese vs English)
2. FEATURE EXTRACTION:
- W2V-BERT semantic embeddings (1024-dim)
- RepCodec quantization
- Mel-spectrogram (STFT-based, 80-dim)
- CAMPPlus speaker embeddings (192-dim)
3. SEQUENCE GENERATION:
- GPT-based autoregressive generation
- Conformer speaker conditioning
- Perceiver pooling for attention
- Classifier-free guidance (optional)
- Temperature/top-k/top-p sampling
4. AUDIO SYNTHESIS:
- Transposed convolution upsampling (256x)
- Anti-aliased activation functions
- Residual connections
- Weight/spectral normalization
5. EMOTION CONTROL:
- 8-dimensional emotion vectors
- Text-based emotion detection (via Qwen)
- Audio-based emotion extraction
- Emotion matrix interpolation
═════════════════════════════════════════════════════════════════════════════════
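
The emotion-control items in the summary (8-dimensional vectors, matrix interpolation) can be pictured as a weighted blend of per-emotion embeddings. The sketch below makes several assumptions that this listing does not confirm: the emotion matrix is taken to hold one embedding row per emotion, the weights are capped at a total of 1.0, and blend_emotions is a hypothetical name.

```python
def blend_emotions(weights, emotion_matrix, max_total=1.0):
    """Weighted sum of emotion embeddings, with the total weight capped.

    weights: one non-negative value per emotion (8 in the described system).
    emotion_matrix: one embedding row per emotion (assumed layout).
    """
    if len(weights) != len(emotion_matrix):
        raise ValueError("need exactly one weight per emotion row")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    # Scale down only when the requested intensity exceeds the cap.
    scale = max_total / total if total > max_total else 1.0
    dim = len(emotion_matrix[0])
    return [sum(w * scale * row[d] for w, row in zip(weights, emotion_matrix))
            for d in range(dim)]


# Toy matrix: 8 emotions, 2-dimensional embeddings.
matrix = [[float(i), float(-i)] for i in range(8)]
print(blend_emotions([0, 0, 0.5, 0, 0, 0, 0.5, 0], matrix))
```

A one-hot weight vector returns that emotion's row unchanged, and mixed weights interpolate between rows.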